Reading gzip-compressed and plain web pages in Python

Source: Internet. Published by: 淘宝网臭豆腐4号碗. Editor: 程序博客网. Date: 2024/05/16 08:23

Normally, when we fetch a web page and analyze what comes back, the code looks like this:

#!/usr/bin/python
# coding: utf-8
import urllib2

headers = {"User-Agent": 'Opera/9.25 (Windows NT 5.1; U; en)'}
request = urllib2.Request(url='http://www.baidu.com', headers=headers)
response = urllib2.urlopen(request).read()

Normally, you will see the source of the returned page:

<html><head>
    <meta http-equiv="content-type" content="text/html;charset=utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    <meta content="always" name="referrer">
    <meta name="theme-color" content="#2932e1">
    <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />
    <link rel="search" type="application/opensearchdescription+xml" href="/content-search.xml" title="百度搜索" />
    <link rel="icon" sizes="any" mask href="//www.baidu.com/img/baidu.svg">
    <link rel="dns-prefetch" href="//s1.bdstatic.com"/>
    <link rel="dns-prefetch" href="//t1.baidu.com"/>
    <link rel="dns-prefetch" href="//t2.baidu.com"/>
    <link rel="dns-prefetch" href="//t3.baidu.com"/>
    <link rel="dns-prefetch" href="//t10.baidu.com"/>
    <link rel="dns-prefetch" href="//t11.baidu.com"/>
    <link rel="dns-prefetch" href="//t12.baidu.com"/>
    <link rel="dns-prefetch" href="//b1.bdstatic.com"/>
    <title>百度一下，你就知道</title>
    <style id="css_index" index="index" type="text/css">html,body{height:100%}html{overflow-y:auto}

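Sometimes, though, a page comes back as garbage. OK, your first suspect will be the character encoding (that is not the focus of this post, so I will only touch on it): look up the page's encoding (see my article "python获取网页编码的方法", on getting a web page's encoding in Python) and decode the content with it. That fixes some of the garbled pages. But sometimes you can be sure the encoding is not the problem, and the output is still garbage. Why?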
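The encoding step mentioned above can be sketched as follows. This is a minimal Python 3 sketch, not the method from the referenced article: the helper names `charset_from_content_type` and `decode_text` are hypothetical, and it assumes the server declares the charset in its Content-Type header, falling back to UTF-8 when it does not.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value; failobj is used when absent.
    return msg.get_content_charset(failobj=default)


def decode_text(raw, content_type, default="utf-8"):
    """Decode raw response bytes with the charset declared by the server."""
    charset = charset_from_content_type(content_type, default)
    # errors="replace" substitutes undecodable bytes instead of raising.
    return raw.decode(charset, errors="replace")


# Stand-in for a response body served as GBK.
raw = "百度一下".encode("gbk")
text = decode_text(raw, "text/html; charset=GBK")
```

If the decoded text is still garbage after this step, the encoding is probably not the cause.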

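There is one more case we have not considered, and it is the one most easily overlooked: the format of the returned body.
A response body usually arrives in one of two forms: uncompressed, or gzip-compressed. (Strictly speaking, text/html is a Content-Type and gzip is a Content-Encoding, so a gzip-compressed page is still text/html; what matters here is whether the body is compressed.) An uncompressed body can be used directly after read(). A gzip body cannot; it has to be decompressed with the gzip module. Enough talk, here is the code: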

from StringIO import StringIO
import gzip
import urllib2

headers = {"User-Agent": 'Opera/9.25 (Windows NT 5.1; U; en)'}
request = urllib2.Request(url='URL of a gzip-compressed page', headers=headers)
response = urllib2.urlopen(request)
# Wrap the compressed bytes in a file-like object so GzipFile can read them.
buf = StringIO(response.read())
f = gzip.GzipFile(fileobj=buf)
data = f.read()
# process data ...
f.close()

Note: be sure to close f, i.e. f.close() is required, especially in a crawler that fetches pages from multiple threads.
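One way to make sure the file is always closed is a `with` block. The following is a minimal Python 3 sketch, not the post's original Python 2 code: it assumes `io.BytesIO` in place of `StringIO`, and it uses `gzip.compress` on a made-up page as a stand-in for `response.read()` so it runs without a network connection.

```python
import gzip
import io

original = b"<html><head><title>example</title></head></html>"

# Stand-in for response.read() on a gzip-compressed response.
compressed = gzip.compress(original)

# The with-statement calls f.close() on exit, even if read() raises.
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as f:
    data = f.read()

# After the block, the GzipFile reports itself as closed.
closed_after = f.closed
```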
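Of course this is still not perfect. Some servers change the format of the returned page on their own, and running the gzip module on an uncompressed body does not give you the page either: GzipFile rejects it as "Not a gzipped file". A site might send an uncompressed body during the day and gzip at night, or rotate from day to day (I have seen this myself: one site returned uncompressed text/html after 12 noon and gzip in the morning). So the natural thing to do is to check the body's format after fetching the page (via the Content-Encoding entry in info()) and then handle it accordingly.
Final code: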

# coding: utf-8
from StringIO import StringIO
import gzip
import urllib2

headers = {"User-Agent": 'Opera/9.25 (Windows NT 5.1; U; en)'}
request = urllib2.Request(url='URL', headers=headers)
response = urllib2.urlopen(request)
# Decompress only when the server says the body is gzip-compressed.
if response.info().get('Content-Encoding') == 'gzip':
    buf = StringIO(response.read())
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()
    # process data
    f.close()
else:
    data = response.read()
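For readers on Python 3, where `urllib2` and `StringIO` no longer exist, the same check might look roughly like this. It is a sketch under assumptions, not the author's code: `read_body` and `fetch` are hypothetical names, `urllib.request` and `io.BytesIO` replace the Python 2 modules, and the header comparison is made case-insensitive as an extra precaution.

```python
import gzip
import io
import urllib.request

HEADERS = {"User-Agent": "Opera/9.25 (Windows NT 5.1; U; en)"}


def read_body(raw, content_encoding):
    """Return the page bytes, decompressing only when the header says gzip."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        # The with-statement closes the GzipFile for us.
        with gzip.GzipFile(fileobj=io.BytesIO(raw)) as f:
            return f.read()
    return raw


def fetch(url):
    """Fetch url and return its body, decompressed if necessary."""
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request) as response:
        return read_body(response.read(),
                         response.headers.get("Content-Encoding"))
```

Keeping the decision in `read_body` separates it from the network call, so it can be tried on bytes produced locally with `gzip.compress`.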
