Adding gzip compression and decompression support when fetching web pages with urllib2.urlopen in Python
I had previously implemented fetching web page content in Python; the existing code was:
#------------------------------------------------------------------------------
# get response from url
# note: if you have already used cookiejar, then it will automatically be used
# here while using urllib2.Request
def getUrlResponse(url, postDict={}, headerDict={}):
    # make sure url is str, not unicode, otherwise urllib2.urlopen will error
    url = str(url);
    if (postDict):
        postData = urllib.urlencode(postDict);
        req = urllib2.Request(url, postData);
        req.add_header('Content-Type', "application/x-www-form-urlencoded");
    else:
        req = urllib2.Request(url);
    if (headerDict):
        print "added header:", headerDict;
        for key in headerDict.keys():
            req.add_header(key, headerDict[key]);
    req.add_header('User-Agent', gConst['userAgentIE9']);
    req.add_header('Cache-Control', 'no-cache');
    req.add_header('Accept', '*/*');
    #req.add_header('Accept-Encoding', 'gzip, deflate');
    req.add_header('Connection', 'Keep-Alive');
    resp = urllib2.urlopen(req);
    return resp;

#------------------------------------------------------------------------------
# get response html (== body) from url
def getUrlRespHtml(url, postDict={}, headerDict={}):
    resp = getUrlResponse(url, postDict, headerDict);
    respHtml = resp.read();
    return respHtml;
This code does not support compressed (gzip) html, and I now wanted to add compression and decompression support.
I had already implemented the same feature in C# and understood the underlying mechanism, so the question here was only how to implement it in Python.
[Solution Process]
1. I had skimmed some related posts earlier but had not had time to solve this back then.
Now I know: first add a gzip header to the HTTP request; the Python code is:
req.add_header('Accept-Encoding', 'gzip, deflate');
Then the data obtained by read() from the returned HTTP response is gzip-compressed data.
The next step was to figure out how to decompress it.
2. First I looked up gzip and found that the official Python documentation says:
12.2. gzip — Support for gzip files
This module provides a simple interface to compress and decompress files just like the GNU programs gzip and gunzip would.
The data compression is provided by the zlib module.
That is, the gzip module compresses and decompresses files, while in-memory data is compressed and decompressed with zlib.
So I looked at zlib next:
zlib.decompress(string[, wbits[, bufsize]]) Decompresses the data in string, returning a string containing the uncompressed data. The wbits parameter controls the size of the window buffer, and is discussed further below. If bufsize is given, it is used as the initial size of the output buffer. Raises the error exception if any error occurs.
The absolute value of wbits is the base two logarithm of the size of the history buffer (the "window size") used when compressing data. Its absolute value should be between 8 and 15 for the most recent versions of the zlib library, larger values resulting in better compression at the expense of greater memory usage. When decompressing a stream, wbits must not be smaller than the size originally used to compress the stream; using a too-small value will result in an exception. The default value is therefore the highest value, 15. When wbits is negative, the standard gzip header is suppressed.
bufsize is the initial size of the buffer used to hold decompressed data. If more space is required, the buffer size will be increased as needed, so you don't have to get this value exactly right; tuning it will only save a few calls to malloc(). The default size is 16384.
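To see what the quoted wbits rules mean in practice, here is a minimal sketch; the three streams (zlib-wrapped, raw deflate, gzip-wrapped) are built in memory as stand-ins for real data, and the variable names are my own:

```python
import zlib

data = b"hello hello hello " * 10

# zlib-wrapped stream (RFC 1950): the default wbits handles it
zlibStream = zlib.compress(data)
assert zlib.decompress(zlibStream) == data

# raw deflate stream (RFC 1951): negative wbits suppresses the header
rawObj = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
rawStream = rawObj.compress(data) + rawObj.flush()
assert zlib.decompress(rawStream, -zlib.MAX_WBITS) == data

# gzip-wrapped stream (RFC 1952): wbits = 16 + MAX_WBITS tells zlib to
# expect (and strip) a gzip header -- the case an HTTP body marked
# "Content-Encoding: gzip" falls into
gzObj = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
gzStream = gzObj.compress(data) + gzObj.flush()
assert zlib.decompress(gzStream, 16 + zlib.MAX_WBITS) == data
```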
Then I called zlib.decompress directly in the program and it failed; I later solved that, see:
[Solved] Python zlib.decompress error: error: Error -3 while decompressing data: incorrect header check
After that, the returned html could be decompressed.
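That failure and its fix can be reproduced offline; this sketch builds a gzip body in memory as a stand-in for a real HTTP response body:

```python
import zlib

# build a gzip-wrapped body in memory, standing in for resp.read()
gzObj = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
gzipped = gzObj.compress(b"<html>demo</html>") + gzObj.flush()

# plain zlib.decompress expects a zlib header, not a gzip header:
try:
    zlib.decompress(gzipped)
except zlib.error as e:
    print(e)  # Error -3 while decompressing data: incorrect header check

# telling zlib to expect the gzip wrapper fixes it:
assert zlib.decompress(gzipped, 16 + zlib.MAX_WBITS) == b"<html>demo</html>"
```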
3. From this post:
http://flyash.itcao.com/post_1117.html
I learned that you can first check whether the returned HTTP response contains Content-Encoding: gzip, and only then decide whether to call zlib to decompress.
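That check can be wrapped in a small helper. maybeGunzip is a hypothetical name, respInfo is treated here as a plain dict of response headers, and the \x1f\x8b magic-byte test is an extra safeguard of my own beyond the header check:

```python
import zlib

def maybeGunzip(respHtml, respInfo):
    # decompress only when the body really is gzip data:
    # the header says so AND the body starts with the gzip magic bytes
    isGzip = respInfo.get('Content-Encoding') == 'gzip'
    if isGzip and respHtml[:2] == b'\x1f\x8b':
        return zlib.decompress(respHtml, 16 + zlib.MAX_WBITS)
    return respHtml
```

A plain (uncompressed) body therefore passes through unchanged, which matches the case described below where the server ignores Accept-Encoding.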
4. Finally I implemented the complete code:
#------------------------------------------------------------------------------
# get response from url
# note: if you have already used cookiejar, then it will automatically be used
# here while using urllib2.Request
def getUrlResponse(url, postDict={}, headerDict={}, timeout=0, useGzip=False):
    # make sure url is str, not unicode, otherwise urllib2.urlopen will error
    url = str(url);
    if (postDict):
        postData = urllib.urlencode(postDict);
        req = urllib2.Request(url, postData);
        req.add_header('Content-Type', "application/x-www-form-urlencoded");
    else:
        req = urllib2.Request(url);
    defHeaderDict = {
        'User-Agent'    : gConst['userAgentIE9'],
        'Cache-Control' : 'no-cache',
        'Accept'        : '*/*',
        'Connection'    : 'Keep-Alive',
    };
    # add default headers first
    for eachDefHd in defHeaderDict.keys():
        #print "add default header: %s=%s"%(eachDefHd, defHeaderDict[eachDefHd]);
        req.add_header(eachDefHd, defHeaderDict[eachDefHd]);
    if (useGzip):
        #print "use gzip for", url;
        req.add_header('Accept-Encoding', 'gzip, deflate');
    # add customized headers later -> allow overwriting the default headers
    if (headerDict):
        #print "added header:", headerDict;
        for key in headerDict.keys():
            req.add_header(key, headerDict[key]);
    if (timeout > 0):
        # set timeout value if necessary
        resp = urllib2.urlopen(req, timeout=timeout);
    else:
        resp = urllib2.urlopen(req);
    return resp;

#------------------------------------------------------------------------------
# get response html (== body) from url
def getUrlRespHtml(url, postDict={}, headerDict={}, timeout=0, useGzip=True):
    resp = getUrlResponse(url, postDict, headerDict, timeout, useGzip);
    respHtml = resp.read();
    if (useGzip):
        #print "---before unzip, len(respHtml)=", len(respHtml);
        respInfo = resp.info();
        # Server: nginx/1.0.8
        # Date: Sun, 08 Apr 2012 12:30:35 GMT
        # Content-Type: text/html
        # Transfer-Encoding: chunked
        # Connection: close
        # Vary: Accept-Encoding
        # ...
        # Content-Encoding: gzip
        # sometimes the request accepts gzip,deflate but the returned html is
        # actually not gzipped -> the response info does not include the above
        # "Content-Encoding: gzip"
        # e.g. http://blog.sina.com.cn/s/comment_730793bf010144j7_3.html
        # -> so only decompress when the data really is gzipped
        if (("Content-Encoding" in respInfo) and (respInfo['Content-Encoding'] == "gzip")):
            respHtml = zlib.decompress(respHtml, 16 + zlib.MAX_WBITS);
        #print "+++ after unzip, len(respHtml)=", len(respHtml);
    return respHtml;
[Summary]
To add gzip support to urllib2.urlopen in Python, the main logic is:
1. Add the corresponding gzip header to the request:
req.add_header('Accept-Encoding', 'gzip, deflate');
2. After obtaining the returned html, decompress it with zlib:
respHtml = zlib.decompress(respHtml, 16 + zlib.MAX_WBITS);
Before decompressing, check whether the returned content really is gzipped data, i.e. whether the response contains "Content-Encoding: gzip", because a request may declare that it accepts gzip while the server still returns the original, uncompressed html.
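As a side note, the gzip module can also handle in-memory data through a file-like object. This sketch assumes Python's io module and gzip.compress (Python 3.2+) to build the sample data, and shows the result is equivalent to the zlib call in the summary:

```python
import gzip
import io
import zlib

gzipped = gzip.compress(b"<html>demo</html>")

# the approach used in this article:
assert zlib.decompress(gzipped, 16 + zlib.MAX_WBITS) == b"<html>demo</html>"

# equivalent, treating the bytes as an in-memory gzip "file":
with gzip.GzipFile(fileobj=io.BytesIO(gzipped)) as f:
    assert f.read() == b"<html>demo</html>"
```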