Fixing Garbled Chinese Text in Requests

Source: the Internet | Published by: 汽车传动比知乎 | Editor: 程序博客网 | Date: 2024/06/16 01:09

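Everyone recommends the Requests library over Urllib, yet Chinese text often comes out garbled when you read a web page with it. When Python Requests runs as a proxy crawler node fetching sites with different character sets, it hits problems such as garbled Chinese. If you only crawl Weibo, WeChat, or e-commerce sites, the charset is easy to pin down, and you can even hard-code the encoding. A public-opinion monitoring system, however, crawls sensitive data from hundreds of thousands of different sites every day, so it needs a more reliable way to determine the character encoding and avoid garbled Chinese.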

Let's start with this example. You will notice something interesting.

Python
 
#blog: xiaorui.cc
# Python 2 session; assumes `import requests, chardet` was run beforehand.
 
In [9]: r = requests.get('http://cn.python-requests.org/en/latest/')
 
In [10]: r.encoding
Out[10]: 'ISO-8859-1'
 
In [11]: type(r.text)
Out[11]: unicode
 
In [12]: type(r.content)
Out[12]: str
 
In [13]: r.apparent_encoding
Out[13]: 'utf-8'
 
In [14]: chardet.detect(r.content)
Out[14]: {'confidence': 0.99, 'encoding': 'utf-8'}

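The first question: why does an encoding like ISO-8859-1 show up?

What is ISO-8859-1? It is also known as Latin-1, the "Western European" character set. To me this is a bug in requests, and the issues in the requests GitHub repository show that it is not only Chinese users who have reported it. The maintainers' reply, however, is that the behavior follows the HTTP RFC by design.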
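A minimal sketch of why an ISO-8859-1 fallback garbles Chinese text, using only the standard library. The sample string is an assumption chosen for illustration. The point is that Latin-1 maps every byte to some character, so the wrong decode never raises and the mistake only shows up when the text is displayed.

```python
# Simulate a UTF-8 page whose bytes are wrongly decoded as ISO-8859-1.
original = "中文乱码"
raw = original.encode("utf-8")          # what the server actually sends

# Latin-1 maps each byte to exactly one character, so this never raises.
garbled = raw.decode("iso-8859-1")

# Because the mapping is one-to-one, the damage is reversible:
# re-encode as Latin-1 to get the bytes back, then decode them properly.
recovered = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(recovered)
```

The same round trip, r.text.encode('ISO-8859-1').decode('utf-8'), is a common quick fix when r.encoding was reported as ISO-8859-1 for a UTF-8 page, although setting r.encoding before reading r.text avoids the detour.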

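Let's read the requests source to see how this comes about.

requests takes the character encoding from the Content-Type header of the server's response. Only when Content-Type carries a charset parameter can requests identify the encoding correctly; otherwise, for text content types, it falls back to the default ISO-8859-1. Non-conforming pages often have this problem.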

Python
 
In [52]: r.headers
Out[52]: {'content-length': '16907', 'via': 'BJ-H-NX-116(EXPIRED), http/1.1 BJ-UNI-1-JCS-116 ( [cHs f ])', 'ser': '3.81', 'content-encoding': 'gzip', 'age': '23', 'expires': 'Fri, 19 Feb 2016 07:36:25 GMT', 'vary': 'Accept-Encoding', 'server': 'JDWS', 'last-modified': 'Fri, 19 Feb 2016 07:35:25 GMT', 'connection': 'keep-alive', 'cache-control': 'max-age=60', 'date': 'Fri, 19 Feb 2016 07:35:31 GMT', 'content-type': 'text/html;'}

File: requests/utils.py

Python
 
#blog: xiaorui.cc
import cgi  # note: the cgi module was removed in Python 3.13

def get_encoding_from_headers(headers):
    """Get the encoding from the headers dict."""
 
    content_type = headers.get('content-type')
 
    if not content_type:
        return None
 
    content_type, params = cgi.parse_header(content_type)
 
    if 'charset' in params:
        return params['charset'].strip("'\"")
 
    if 'text' in content_type:
        return 'ISO-8859-1'
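The header logic above can be exercised offline. The following is a simplified, standard-library re-creation for illustration only: guess_encoding_from_content_type is a hypothetical helper, not part of requests, and its parameter parsing is deliberately naive.

```python
def guess_encoding_from_content_type(content_type):
    """Mimic the fallback rule: charset if present, else Latin-1 for text."""
    if not content_type:
        return None

    # Split "text/html; charset=gbk" into the media type and its parameters.
    media_type, _, rest = content_type.partition(";")
    for param in rest.split(";"):
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            return value.strip().strip("'\"")

    # No charset parameter: text types fall back to ISO-8859-1.
    if "text" in media_type.lower():
        return "ISO-8859-1"
    return None


for header in ("text/html; charset=gbk", "text/html;", "image/png", None):
    print(repr(header), "->", guess_encoding_from_content_type(header))
```

The second case is the one from the headers shown earlier: a bare 'text/html;' with no charset is what produces the ISO-8859-1 result.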

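The second question: how do we get the correct encoding?

The response object in requests has an apparent_encoding property, which identifies the text encoding by calling chardet.detect(). Note, however, that this consumes a fair amount of CPU.
To see why, take a look at chardet's source.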

Python
 
#blog: xiaorui.cc
@property
def apparent_encoding(self):
    """Use chardet to work out the encoding."""
    return chardet.detect(self.content)['encoding']
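When the set of plausible encodings is small, a cheaper alternative to statistical detection is to try strict decoding against a short candidate list. This is a different technique from what apparent_encoding does, shown here only as a sketch: decode_with_candidates and its default candidate list are assumptions, and UTF-8 is tried first because its strict decoder rejects most non-UTF-8 input.

```python
def decode_with_candidates(raw, candidates=("utf-8", "gb18030")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: keep what we can and mark the bad bytes.
    return raw.decode("utf-8", errors="replace"), "utf-8"


print(decode_with_candidates("中文".encode("gbk")))
print(decode_with_candidates("中文".encode("utf-8")))
```

GB18030 is a superset of GBK and GB2312, so a single candidate covers all three.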

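The third question: what is the difference between r.text and r.content in requests?

After requests fetches a resource, there are two ways to look at the body: r.text and r.content. How do they differ?

Reading the requests source shows that r.text returns decoded Unicode data, while r.content returns the raw bytes. In other words, r.content saves computation compared with r.text: r.content hands back the bytes as they are, whereas r.text decodes them to Unicode. If r.encoding is None (the headers carry no charset and the Content-Type is not a text type), r.text calls chardet to work out the encoding, which again costs CPU.

The requests code below shows the difference between text and content.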

Python
 
# File: requests/models.py
@property
def apparent_encoding(self):
    """The apparent encoding, provided by the chardet library"""
    return chardet.detect(self.content)['encoding']
 
@property
def content(self):
    """Content of the response, in bytes."""
 
    if self._content is False:
        # Read the contents.
        try:
            if self._content_consumed:
                raise RuntimeError(
                    'The content for this response was already consumed')
 
            if self.status_code == 0:
                self._content = None
            else:
                self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
 
        except AttributeError:
            self._content = None
 
    self._content_consumed = True
    # don't need to release the connection; that's been handled by urllib3
    # since we exhausted the data.
    return self._content
 
@property
def text(self):
    """Content of the response, in unicode.
    If Response.encoding is None, encoding will be guessed using
    ``chardet``.
    The encoding of the response content is determined based solely on HTTP
    headers, following RFC 2616 to the letter. If you can take advantage of
    non-HTTP knowledge to make a better guess at the encoding, you should
    set ``r.encoding`` appropriately before accessing this property.
    """
 
    # Try charset from content-type
    content = None
    encoding = self.encoding
 
    if not self.content:
        return str('')
 
    # When the encoding is None, chardet is used to guess it.
    if self.encoding is None:
        encoding = self.apparent_encoding
 
    # Decode unicode from given encoding.
    try:
        content = str(self.content, encoding, errors='replace')
    except (LookupError, TypeError):
        # A LookupError is raised if the encoding was not found which could
        # indicate a misspelling or similar mistake.
        #
        # A TypeError can be raised if encoding is None
        #
        # So we try blindly encoding.
        content = str(self.content, errors='replace')

    return content

Analysis:
r = requests.get("http://www.baidu.com")
r.text returns Unicode (str) data.
r.content returns bytes.
In other words, use r.text when you want text.
Use r.content when you want an image or a file.

Getting the content of a web page

Method 1: use r.content, which is bytes, then convert it to str

import requests

url = 'http://music.baidu.com'
r = requests.get(url)
html = r.content
html_doc = str(html, 'utf-8')  # or: html_doc = html.decode("utf-8", "ignore")
print(html_doc)

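Method 2: use r.text
Requests automatically decodes content from the server. Most Unicode charsets are decoded seamlessly. After a request is sent, Requests makes an educated guess about the response encoding based on the HTTP headers, and it uses that guess when you access r.text. You can find out which encoding Requests is using, and change it, through the r.encoding property.
However, when the headers declare no charset for a text response, Requests falls back to r.encoding = 'ISO-8859-1'.
You can override this by assigning to r.encoding:

import requests

url = 'http://music.baidu.com'
r = requests.get(url)
r.encoding = 'utf-8'
print(r.text)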

Getting a web page and saving it locally

Method 1: r.content is bytes, so open the file with open(filename, "wb")

import requests

r = requests.get("http://music.baidu.com")  # the URL needs a scheme
html = r.content
with open('test5.html', 'wb') as f:
    f.write(html)

Method 2: r.content is bytes; convert it to str, then save it

import requests

r = requests.get("http://www.baidu.com")
html = r.content
html_doc = str(html, 'utf-8')  # or: html_doc = html.decode("utf-8", "ignore")
# print(html_doc)
with open('test5.html', 'w', encoding="utf-8") as f:
    f.write(html_doc)

Method 3: r.text is str and can be saved directly

import requests

r = requests.get("http://www.baidu.com")
r.encoding = 'utf-8'
html = r.text
with open('test6.html', 'w', encoding="utf-8") as f:
    f.write(html)
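The difference between saving bytes and saving text can be checked without a network connection. In this sketch the sample bytes stand in for r.content and the decoded string for r.text; both, along with the file names, are assumptions made for illustration.

```python
import os
import tempfile

raw = "<title>百度音乐</title>".encode("gbk")   # stands in for r.content
text = raw.decode("gbk")                        # stands in for r.text

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "raw.html")
    utf8_path = os.path.join(tmp, "utf8.html")

    # Binary mode writes the server's bytes untouched (still GBK here).
    with open(raw_path, "wb") as f:
        f.write(raw)

    # Text mode re-encodes the decoded string, here as UTF-8.
    with open(utf8_path, "w", encoding="utf-8") as f:
        f.write(text)

    with open(raw_path, "rb") as f:
        saved_raw = f.read()
    with open(utf8_path, "rb") as f:
        saved_utf8 = f.read()

print(saved_raw == raw, saved_utf8 == text.encode("utf-8"))
```

The two files hold the same text in different encodings, so a page saved through text mode may carry a meta charset declaration that no longer matches its bytes.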

Requests+lxml

# -*- coding: utf8 -*-
import requests
from lxml import etree

url = "http://music.baidu.com"
r = requests.get(url)
r.encoding = "utf-8"
html = r.text
# print(html)
selector = etree.HTML(html)
title = selector.xpath('//title/text()')
print(title[0])

Result: 百度音乐-听到极致
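If lxml is not installed, a similar title extraction can be approximated with the standard library's html.parser. This is a swapped-in technique, not the lxml/XPath approach above; TitleParser, extract_title, and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(raw, encoding):
    """Decode the page bytes with a known encoding, then pull out the title."""
    parser = TitleParser()
    parser.feed(raw.decode(encoding, errors="replace"))
    return parser.title.strip()


sample = "<html><head><title>百度音乐-听到极致</title></head></html>".encode("gbk")
print(extract_title(sample, "gbk"))
```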

The ultimate solution

The methods above avoid garbled text, but the saved page shows only text, not images, and it opens slowly. I found a blog post that proposes an ultimate method, and it works very well.

The solution from the blog post
http://blog.chinaunix.net/uid-13869856-id-5747417.html:

# -*- coding: utf8 -*-
import requests

req = requests.get("http://news.sina.com.cn/")
if req.encoding == 'ISO-8859-1':
    encodings = requests.utils.get_encodings_from_content(req.text)
    if encodings:
        encoding = encodings[0]
    else:
        encoding = req.apparent_encoding
    # encode_content = req.content.decode(encoding, 'replace').encode('utf-8', 'replace')
    # With 'replace', undecodable bytes become U+FFFD instead of raising.
    encode_content = req.content.decode(encoding, 'replace')
else:
    # The headers declared a charset, so the decoded text can be used as is.
    encode_content = req.text
print(encode_content)
with open('test.html', 'w', encoding='utf-8') as f:
    f.write(encode_content)

There are several ways to fix garbled Chinese text in requests.
Method 1:
Since content is the raw byte string of the HTTP response, you can decode it to Unicode using the charset from the headers, provided that charset is not ISO-8859-1.
Python

In [96]: r.encoding
Out[96]: 'gbk'

In [98]: print r.content.decode(r.encoding)[200:300]
="keywords"content="Python数据分析与挖掘实战,,机械工业出版社,9787111521235,,在线购买,折扣,打折"/>

Another, particularly crude, approach is to decode with whatever chardet reports and re-encode the result as UTF-8.
Python
 
#http://xiaorui.cc
 
In [22]: r  = requests.get('http://item.jd.com/1012551875.html')
 
In [23]: print r.content
KeyboardInterrupt
 
In [23]: r.apparent_encoding
Out[23]: 'GB2312'
 
In [24]: r.encoding
Out[24]: 'gbk'
 
In [25]: r.content.decode(r.apparent_encoding).encode('utf-8')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-25-918324cdc053> in <module>()
----> 1 r.content.decode(r.apparent_encoding).encode('utf-8')
 
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 49882-49883: illegal multibyte sequence
 
In [27]: r.content.decode(r.apparent_encoding,'replace').encode('utf-8')
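The UnicodeDecodeError above is typical when chardet reports GB2312 but the page contains characters outside that set. GB2312 is a subset of GBK, which is in turn a subset of GB18030, so decoding with a superset codec loses less than 'replace' does. A small offline sketch follows; the sample character 镕 is an assumption, picked because it exists in GBK but not in GB2312.

```python
sample = "中文镕"
raw = sample.encode("gbk")

# Strict GB2312 decoding fails on the character that only GBK defines.
try:
    raw.decode("gb2312")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# 'replace' avoids the exception but loses the character (U+FFFD appears).
lossy = raw.decode("gb2312", errors="replace")

# A superset codec decodes the same bytes without any loss.
exact = raw.decode("gb18030")

print(strict_ok, repr(lossy), exact)
```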
If you are set on using text and already know the site's character encoding, you can assign r.encoding = 'xxx'. Once you specify the encoding, requests uses it when decoding text.
Python

>>> import requests
>>> r = requests.get('https://up.xiaorui.cc')
>>> r.text
>>> r.encoding
'gbk'
>>> r.encoding = 'utf-8'
Method 2:
In my experience crawling hundreds of thousands of sites, most of them are fairly well-behaved: if the headers carry no charset, extract it from the meta tag in the HTML.
Python
 
In [78]: s
Out[78]: '    <meta http-equiv="Content-Type" content="text/html; charset=gbk"'
 
In [79]: b = re.compile("<meta.*content=.*charset=(?P<charset>[^;\s]+)", flags=re.I)
 
In [80]: b.search(s).group(1)
Out[80]: 'gbk"'
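requests already ships a fairly complete function in utils.py for extracting the meta charset from HTML; in the end it is just a handful of regular expressions. (The hand-written pattern above, by contrast, keeps the trailing quote in its match.)
Python

In [32]: requests.utils.get_encodings_from_content(r.content)
Out[32]: ['gbk']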

File: utils.py
Python
 
def get_encodings_from_content(content):
    charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
    pragma_re = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
    xml_re = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')
 
    return (charset_re.findall(content) +
            pragma_re.findall(content) +
            xml_re.findall(content))
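The patterns above are str patterns, whereas on Python 3 r.content is bytes. As a sketch only, a tighter bytes pattern can sniff the declaration directly from the raw body and stop at the closing quote. sniff_meta_charset is a hypothetical helper, and the 4096-byte scan limit is an assumption.

```python
import re

# Capture only characters that can appear in a charset name, so a
# trailing quote or '>' never ends up in the result.
_META_CHARSET_RE = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([\w.:-]+)', re.I)


def sniff_meta_charset(raw, limit=4096):
    """Return the lowercase charset declared in a <meta> tag, or None."""
    match = _META_CHARSET_RE.search(raw[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


print(sniff_meta_charset(
    b'    <meta http-equiv="Content-Type" content="text/html; charset=gbk"'))
print(sniff_meta_charset(b'<meta charset="UTF-8">'))
```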
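Finally, to sum up the garbled-Chinese problem in requests:
Unify the encoding: either convert everything to UTF-8, or use Unicode as the intermediate form.
Chinese sites generally use utf-8, gbk, or gb2312; when the requests encoding is one of these, you can decode straight to Unicode.
But when the encoding turns out to be ISO-8859-1, you can combine regular expressions with chardet to determine the real encoding. This logic can be packaged as a patch and applied at import time.
Python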
 
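import requests

def monkey_patch():
    prop = requests.models.Response.content

    def content(self):
        _content = prop.fget(self)
        if self.encoding == 'ISO-8859-1':
            # Sniff the <meta> charset on a Latin-1 view of the bytes, which is
            # lossless and lets the str regexes work on Python 2 and 3 alike.
            encodings = requests.utils.get_encodings_from_content(
                _content.decode('ISO-8859-1'))
            if encodings:
                encoding = encodings[0]
            else:
                # Clear the encoding first: apparent_encoding reads self.content,
                # which would otherwise re-enter this branch forever.
                self.encoding = None
                encoding = self.apparent_encoding or 'utf8'
            _content = _content.decode(encoding, 'replace').encode('utf8', 'replace')
            self._content = _content
            # The bytes are UTF-8 now, so r.text must decode them as UTF-8.
            self.encoding = 'utf8'
        return _content

    requests.models.Response.content = property(content)

monkey_patch()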
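A non-patching alternative is to keep the same decision order (trusted header, then meta tag, then a guess) in plain functions. The sketch below is standard-library only: resolve_encoding and decode_body are hypothetical names, and the fallback list is an assumption that stands in for chardet.

```python
import re

_META_RE = re.compile(rb'<meta[^>]*?charset\s*=\s*["\']?\s*([\w.:-]+)', re.I)


def resolve_encoding(header_encoding, raw, fallbacks=("utf-8", "gb18030")):
    """Pick an encoding: trusted header, then <meta>, then trial decoding."""
    # ISO-8859-1 is treated as "unknown" because it is only the fallback.
    if header_encoding and header_encoding.upper() != "ISO-8859-1":
        return header_encoding
    match = _META_RE.search(raw[:4096])
    if match:
        return match.group(1).decode("ascii")
    for candidate in fallbacks:
        try:
            raw.decode(candidate)
        except UnicodeDecodeError:
            continue
        return candidate
    return "utf-8"


def decode_body(header_encoding, raw):
    """Decode raw bytes to str without raising on bad input."""
    encoding = resolve_encoding(header_encoding, raw)
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:  # the page declared a charset Python does not know
        return raw.decode("utf-8", errors="replace")


page = '<meta charset="gbk"><title>中文</title>'.encode("gbk")
print(decode_body("ISO-8859-1", page))
```

With a real response this would be called as decode_body(r.encoding, r.content), leaving the Response class untouched.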
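Python 3.x removes much of this confusion by separating str from bytes, although the ISO-8859-1 fallback shown above still applies; if you are still on Python 2.6 or 2.7, you will need the methods above to fix garbled Chinese.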
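References:
      代码分析Python requests库中文编码问题 (Code analysis: Chinese encoding issues in the Python requests library)
      Python HTTP库requests中文页面乱码解决方案！ (Fixing garbled Chinese pages in the Python HTTP library requests)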
