Chinese encoding problems under Python 3


I wrote a crawler for NetEase News (网易新闻). Under Python 2.7 the Chinese text in the saved files displays correctly.

Under Python 3.5, however, the Chinese text comes out as byte-string literals, like this:

b'\xe5\x85\xa8\xe7\xab\x99'        b'http://news.163.com/special/0001386F/rank_whole.html'
b'\xe6\x96\xb0\xe9\x97\xbb'        b'http://news.163.com/special/0001386F/rank_news.html'
b'\xe5\xa8\xb1\xe4\xb9\x90'        b'http://news.163.com/special/0001386F/rank_ent.html'
b'\xe4\xbd\x93\xe8\x82\xb2'        b'http://news.163.com/special/0001386F/rank_sports.html'
b'\xe8\xb4\xa2\xe7\xbb\x8f'        b'http://money.163.com/special/002526BH/rank.html'
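This is easy to reproduce outside the crawler: in Python 3, str.encode() returns a bytes object, and formatting bytes with %s inserts their repr, b'...' prefix included. A minimal sketch (the sample string matches the first entry above):

s = "全站"
print("%s" % s)                 # Python 3: 全站
print("%s" % s.encode("utf8"))  # Python 3: b'\xe5\x85\xa8\xe7\xab\x99'
# Under Python 2.7, u"全站".encode("utf8") is a plain str, so %s writes
# the raw UTF-8 bytes and the saved file displays correctly.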

The crawler code:

# -*- coding: utf-8 -*-
import os
import sys
#import urllib
import requests
import re
from lxml import etree
from openpyxl import Workbook

def StringListSave(save_path, filename, slist):
    if not os.path.exists(save_path):
        os.makedirs(save_path)
    path = save_path+"/"+filename+".txt"
    with open(path, "w+") as fp:
        for s in slist:
            # Problem line under Python 3: encode() returns bytes, and
            # %s-formatting a bytes object writes its repr, i.e. b'...'
            fp.write("%s\t\t%s\n" % (s[0].encode("utf8"), s[1].encode("utf8")))

def Page_Info(myPage):
    '''Regex'''
    mypage_Info = re.findall(r'<div class="titleBar" id=".*?"><h2>(.*?)</h2><div class="more"><a href="(.*?)">.*?</a></div></div>', myPage, re.S)
    return mypage_Info

def New_Page_Info(new_page):
    '''Regex(slowly) or Xpath(fast)'''
    # new_page_Info = re.findall(r'<td class=".*?">.*?<a href="(.*?)\.html".*?>(.*?)</a></td>', new_page, re.S)
    # # new_page_Info = re.findall(r'<td class=".*?">.*?<a href="(.*?)">(.*?)</a></td>', new_page, re.S) # bugs
    # results = []
    # for url, item in new_page_Info:
    #     results.append((item, url+".html"))
    # return results
    dom = etree.HTML(new_page)
    new_items = dom.xpath('//tr/td/a/text()')
    new_urls = dom.xpath('//tr/td/a/@href')
    assert(len(new_items) == len(new_urls))
    return zip(new_items, new_urls)

def Spider(url):
    i = 0
    print("downloading ", url)
    myPage = requests.get(url).content.decode("gbk")
    # myPage = urllib2.urlopen(url).read().decode("gbk")
    myPageResults = Page_Info(myPage)
    save_path = u"网易新闻抓取"
    filename = str(i)+"_"+u"新闻排行榜"
    StringListSave(save_path, filename, myPageResults)
    i += 1
    for item, url2 in myPageResults:
        print("Downloading ", url2)
        new_page = requests.get(url2).content.decode("gbk")
        # new_page = urllib2.urlopen(url).read().decode("gbk")
        newPageResults = New_Page_Info(new_page)
        filename = str(i)+"_"+item
        StringListSave(save_path, filename, newPageResults)
        i += 1

if __name__ == '__main__':
    print("start")
    start_url = "http://news.163.com/rank/"
    Spider(start_url)
    print("end")
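Separately from the write-side bug, note that Spider() assumes every page decodes as gbk; a page served as UTF-8 would typically raise UnicodeDecodeError or produce mojibake. A hedged fallback sketch (fetch_text is a hypothetical helper, not part of the original code; apparent_encoding is the encoding requests sniffs from the raw bytes):

import requests

def fetch_text(url):
    # Hypothetical helper: try the site's usual gbk first, then fall back
    # to whatever encoding requests detects.
    resp = requests.get(url)
    try:
        return resp.content.decode("gbk")
    except UnicodeDecodeError:
        return resp.content.decode(resp.apparent_encoding or "utf-8")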

The fix:

with open(path, "w+", encoding='utf-8') as fp:
    for s in slist:
        fp.write("%s\t\t%s\n" % (s[0], s[1]))
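Put back into context, the whole corrected function looks like this (same logic as the original; only open() gains an encoding argument and the manual .encode() calls disappear):

import os

def StringListSave(save_path, filename, slist):
    if not os.path.exists(save_path):
        os.makedirs(save_path)
    path = save_path + "/" + filename + ".txt"
    # open() now encodes for us, so we pass str objects straight through
    with open(path, "w+", encoding='utf-8') as fp:
        for s in slist:
            fp.write("%s\t\t%s\n" % (s[0], s[1]))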

Searching the web, I found this explanation of the fix:

3. The encoding of the target file. To write the network data stream into a new file, we need to specify the new file's encoding. The file-writing code looks like:

f.write(txt)  

Here txt is a string that has already been decoded via decode(). Now comes the key point: the target file's encoding is the real culprit behind the problem in the title. Suppose we open a file like this:
f = open("out.html","w")  

On Windows, a new file's default encoding is gbk, so the Python interpreter will use gbk to handle our network data stream txt. But txt at this point is already decoded Unicode text, so the mismatch means it cannot be handled correctly, producing the problem above. The solution is to change the target file's encoding:
f = open("out.html","w",encoding='utf-8')  
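You can check which default your interpreter would pick: when open() is called without encoding=, Python 3 falls back to locale.getpreferredencoding(False), which on Chinese-locale Windows typically reports 'cp936' (gbk). A small check, assuming nothing beyond the standard library:

import locale

# The default text-file encoding open() falls back to in Python 3;
# usually 'cp936' (gbk) on Chinese-locale Windows, often 'UTF-8' elsewhere.
print(locale.getpreferredencoding(False))

txt = "全站"  # an already-decoded str
with open("out.html", "w", encoding='utf-8') as f:
    f.write(txt)  # explicitly encoded as UTF-8, independent of the locale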
