Python 2.6/2.7: Web Page Encoding Problems with Requests

Source: Internet | Published by: 手机网络雷达 | Editor: 程序博客网 | Time: 2024/06/06 05:45

Requests: HTTP for Humans
Python+Requests: An Improved Fix for Garbled Chinese When Scraping
Garbled Chinese Text with the Python requests Library
Python+Requests: An Encoding Detection Bug

Requests is an HTTP library released under the Apache2 license. It is written in Python and is designed to be friendlier and easier to use.

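Requests is built on urllib3 and therefore inherits all of its features. It supports HTTP keep-alive and connection pooling, session persistence with cookies, file uploads, automatic detection of the response content's encoding, internationalized URLs, and automatic encoding of POST data. Modern, international, and human-friendly.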

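While using Requests recently I ran into a problem: when fetching certain Chinese web pages the text came out garbled, and printing encoding showed ISO-8859-1. Why does this happen? Reading the source, I found that the default encoding detection is quite simple: it reads the encoding straight from the Content-Type response header. If a charset is present, the encoding is identified correctly; if there is no charset but the content type contains text, the encoding is assumed to be ISO-8859-1. See utils.py: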

    def get_encoding_from_headers(headers):
        """Returns encodings from given HTTP Header Dict.

        :param headers: dictionary to extract encoding from.
        """
        content_type = headers.get('content-type')

        if not content_type:
            return None

        content_type, params = cgi.parse_header(content_type)

        if 'charset' in params:
            return params['charset'].strip("'\"")

        if 'text' in content_type:
            return 'ISO-8859-1'
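To make the header-only rule concrete, here is a minimal standalone sketch that mimics it using plain string handling. The function name `encoding_from_content_type` is hypothetical and not part of Requests; it only illustrates the three possible outcomes (declared charset, ISO-8859-1 fallback, or None).

```python
def encoding_from_content_type(content_type):
    """Mimic the header-only rule (hypothetical helper, not the Requests API)."""
    if not content_type:
        return None
    # Split "text/html; charset=gb2312" into the media type and its parameters
    parts = [p.strip() for p in content_type.split(';')]
    media_type, params = parts[0].lower(), parts[1:]
    for param in params:
        name, _, value = param.partition('=')
        if name.strip().lower() == 'charset':
            return value.strip().strip('\'"')
    # No charset declared: any text/* type falls back to ISO-8859-1
    if 'text' in media_type:
        return 'ISO-8859-1'
    return None


print(encoding_from_content_type('text/html; charset=gb2312'))  # gb2312
print(encoding_from_content_type('text/html'))                  # ISO-8859-1
print(encoding_from_content_type('application/json'))           # None
```

The second call is the case that causes trouble for Chinese pages: the server sends `text/html` with no charset, so the fallback applies even though the body is not ISO-8859-1.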

Requests does in fact provide a way to get the encoding from the content itself; it just is not used by default. See utils.py:

    def get_encodings_from_content(content):
        """Returns encodings from given content string.

        :param content: bytestring to extract encodings from.
        """
        charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
        pragma_re = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
        xml_re = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')

        return (charset_re.findall(content) +
                pragma_re.findall(content) +
                xml_re.findall(content))
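As a rough illustration of how the first of these patterns behaves, the sketch below applies it to two small hand-written HTML fragments. The helper name `declared_charsets`, the sample markup, and the 4096-byte limit are assumptions made for this example. The bytes are decoded as ISO-8859-1 before matching only because that codec maps every byte to a character, so the search cannot fail on undecodable input.

```python
import re

# Same pattern as charset_re in the excerpt above
CHARSET_RE = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)


def declared_charsets(raw_html):
    """Return charset names declared in <meta> tags (hypothetical helper)."""
    # Assumption: the declaration sits near the top of the page
    head = raw_html[:4096].decode('iso-8859-1')
    return CHARSET_RE.findall(head)


html5_style = u'<head><meta charset="gb2312"><title>\u4e2d\u6587</title></head>'.encode('gb2312')
html4_style = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'

print(declared_charsets(html5_style))
print(declared_charsets(html4_style))
print(declared_charsets(b'<p>no declaration here</p>'))
```

Both the short `<meta charset=...>` form and the older `http-equiv` form are picked up by the same pattern, and a page with no declaration yields an empty list.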

It also provides encoding detection based on chardet. See models.py:

    @property
    def apparent_encoding(self):
        """The apparent encoding, provided by the lovely Charade library
        (Thanks, Ian!)."""
        return chardet.detect(self.content)['encoding']

So how can this be fixed? First, look at an example:

    >>> r = requests.get('http://cn.python-requests.org/en/latest/')
    >>> r.headers['content-type']
    'text/html'
    >>> r.encoding
    'ISO-8859-1'
    >>> r.apparent_encoding
    'utf-8'
    >>> requests.utils.get_encodings_from_content(r.content)
    ['utf-8']

    >>> r = requests.get('http://reader.360duzhe.com/2013_24/index.html')
    >>> r.headers['content-type']
    'text/html'
    >>> r.encoding
    'ISO-8859-1'
    >>> r.apparent_encoding
    'gb2312'
    >>> requests.utils.get_encodings_from_content(r.content)
    ['gb2312']

The rest of this post analyzes the problem and offers a solution.

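Reading the Requests source shows that r.text returns the body decoded to Unicode, while r.content returns the raw bytes. In other words, r.content is cheaper than r.text: content hands back the bytes as received, and text decodes them using r.encoding. Only when r.encoding is None (for example, when the response has no Content-Type header at all) does text fall back to chardet to guess the character set. For a text/* response without a charset, r.encoding is already ISO-8859-1, so chardet is never consulted.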

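Chapter 16 (Internationalization) of "HTTP: The Definitive Guide" notes that if the Content-Type field of an HTTP response does not specify a charset, the page is assumed to be encoded as 'ISO-8859-1'. That is fine for English pages, but Chinese pages end up garbled.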
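The effect of that default can be reproduced without any network access. The following sketch uses only built-in string methods and a made-up four-character sample (written with escapes so the file needs no encoding declaration); it shows that decoding GB2312 bytes as ISO-8859-1 produces the wrong characters, and that the damage is reversible because ISO-8859-1 maps each byte to exactly one character.

```python
original = u'\u4e2d\u6587\u7f51\u9875'      # the Chinese phrase for "Chinese web page"
body = original.encode('gb2312')            # bytes as a GB2312 server would send them

wrong = body.decode('iso-8859-1')           # what the header-only default produces
right = body.decode('gb2312')               # what the page actually contains

print(len(original), len(wrong))            # each character turned into two
print(right == original)

# No information is lost: re-encode as ISO-8859-1 to get the raw bytes back,
# then decode them with the correct codec.
recovered = wrong.encode('iso-8859-1').decode('gb2312')
print(recovered == original)
```

This is why text that was already decoded incorrectly can often still be repaired, as long as the wrong decoding step was ISO-8859-1 and nothing was replaced or dropped along the way.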

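If you already know the site's character set before using text, set r.encoding = 'xxx'; once it is set, Requests uses that encoding when it builds text. apparent_encoding gives the encoding detected by analyzing the content itself, which is usually the real one but is comparatively slow. The encoding can also be extracted from the HTML meta tags, for example: requests.utils.get_encodings_from_content(response.text)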

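    import requests

    # r is a Response returned by requests.get(...)
    if r.encoding == 'ISO-8859-1':
        # ISO-8859-1 here is usually the header fallback, not the real charset
        encodings = requests.utils.get_encodings_from_content(r.content)
        if encodings:
            r.encoding = encodings[0]          # charset declared in the page
        else:
            r.encoding = r.apparent_encoding   # chardet's guess (slower)
        # Re-encode the body so that r.content holds UTF-8 bytes
        r._content = r.content.decode(r.encoding, 'replace').encode('utf8', 'replace')
        r.encoding = 'utf-8'  # keep r.text consistent with the re-encoded bytes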
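The same fallback chain can also be written as a pure function that leaves the response object untouched, which avoids assigning to the private `_content` attribute. This is only a sketch under stated assumptions: `decode_body` is a hypothetical helper, the `fallback` parameter stands in for whatever `r.apparent_encoding` would return, and only the first 4096 bytes are searched for a meta declaration.

```python
import re

META_CHARSET_RE = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)


def decode_body(content, header_encoding, fallback='utf-8'):
    """Return (text, encoding_used) for raw response bytes (hypothetical helper)."""
    encoding = header_encoding
    if encoding is None or encoding.upper() == 'ISO-8859-1':
        # Distrust the header default: prefer a charset declared in the page
        declared = META_CHARSET_RE.findall(content[:4096].decode('iso-8859-1'))
        encoding = declared[0] if declared else fallback
    try:
        return content.decode(encoding, 'replace'), encoding
    except LookupError:
        # The page declared a codec name Python does not know
        return content.decode(fallback, 'replace'), fallback


sample = u'<meta charset="gb2312">\u4e2d\u6587'.encode('gb2312')
text, used = decode_body(sample, 'ISO-8859-1')
print(used)
print(text.endswith(u'\u4e2d\u6587'))
```

With a real response this might be called as `decode_body(r.content, r.encoding, fallback=r.apparent_encoding or 'utf-8')`, so that chardet is consulted only as the last resort.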

0 0