Handling Chinese Text in Python

Source: Internet  Editor: 程序博客网  Date: 2024/05/16 15:00

1. encode and decode

In Python 2, decode() and encode() perform decoding and encoding, with the unicode type as the intermediate form. That is:
      decode              encode
str ---------> unicode ---------> str

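u = u'中文'    # Unicode object u
gb2312_str = u.encode('gb2312')    # GB2312-encoded byte string
gbk_str = u.encode('gbk')    # GBK-encoded byte string
utf8_str = u.encode('utf-8')    # UTF-8-encoded byte string

# Decoding goes the other way: encoded byte string -> Unicode.
gb2312_u = gb2312_str.decode('gb2312')    # Unicode decoded from the GB2312 string
gbk_u = gbk_str.decode('gbk')    # Unicode decoded from the GBK string
utf8_u = utf8_str.decode('utf-8')    # Unicode decoded from the UTF-8 string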
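
Because Unicode is the intermediate type, converting from one encoding to another is a decode followed by an encode. The following is a minimal sketch of that idea; the helper name transcode is hypothetical and not part of any library, and the code is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-

def transcode(data, src_encoding, dst_encoding):
    """Re-encode a byte string by passing through Unicode (hypothetical helper)."""
    text = data.decode(src_encoding)    # encoded bytes -> Unicode
    return text.encode(dst_encoding)    # Unicode -> encoded bytes

gbk_bytes = u'中文'.encode('gbk')                     # 2 bytes per character in GBK
utf8_bytes = transcode(gbk_bytes, 'gbk', 'utf-8')    # 3 bytes per character in UTF-8

# The byte strings differ, but both decode to the same Unicode text.
assert gbk_bytes != utf8_bytes
assert utf8_bytes.decode('utf-8') == gbk_bytes.decode('gbk')
```

Decoding with the wrong source encoding either raises UnicodeDecodeError or silently yields garbled text, which is why the original encoding has to be known first.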

2. chardet

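In addition, chardet makes it easy to detect the encoding of a string or a file. This matters especially for Chinese text, where some sources use GBK/GB2312 and others use UTF-8. If you need to read or write Chinese, knowing the file's encoding is important.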

>>> import urllib
>>> import chardet
>>> html = urllib.urlopen('http://www.chinaunix.net').read()
>>> chardet.detect(html)
{'confidence': 0.98999999999999999, 'encoding': 'GB2312'}
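The function returns a dictionary with two entries: the confidence of the detection and the detected encoding. The confidence value is a reminder that the detected encoding is a guess and may be wrong.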
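
Since the detection may be wrong, one cautious approach is to trust the detected encoding only above some confidence level and otherwise fall back to a default. The sketch below works on a result dictionary of the shape shown above. The helper name decode_detected, the 0.8 threshold, and the UTF-8 fallback are all assumptions chosen for illustration, not part of chardet.

```python
# -*- coding: utf-8 -*-

def decode_detected(raw, result, min_confidence=0.8, fallback='utf-8'):
    """Decode raw bytes using a chardet-style result dict (hypothetical helper).

    result looks like {'confidence': 0.99, 'encoding': 'GB2312'}.
    The threshold and the fallback encoding are illustrative assumptions.
    """
    encoding = result.get('encoding')
    confidence = result.get('confidence') or 0.0
    # The encoding can be None when nothing could be detected.
    if encoding is None or confidence < min_confidence:
        encoding = fallback
    # 'replace' substitutes undecodable bytes instead of raising an exception.
    return raw.decode(encoding, 'replace')

raw = u'中文'.encode('gb2312')
text = decode_detected(raw, {'confidence': 0.99, 'encoding': 'GB2312'})
assert text == u'中文'
```

In real use the dictionary would come from chardet.detect(raw) rather than being written by hand.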

0 0