[Original] linux-python …

Source: Internet · Published by: 陕钢集团网络大学 · Editor: 程序博客网 · Time: 2024/05/21 05:23

Python spider: fetching page titles with correct decoding

Original work. If you repost it, please credit the author and link back to this post. Thanks! http://blog.sina.com.cn/s/blog_83dc494d0101ccof.html

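Dealing with character encodings in Python is a real headache: the pages a crawler brings back come in all sorts of encodings.

I'll just paste the code and walk through how I got there:

1: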

--------------------------------------------------------------------

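#!/usr/bin/python
# coding: utf-8
# Python 2 script: fetch a few pages and print each page's title.
import re
import urllib

urls = ['http://www.baidu.com', 'http://www.hao123.com', 'http://www.qq.com']

# The pattern was lost from the original post (the tags were stripped).
# This is a reconstruction, not the author's original.
regex = r'<title>(.*?)</title>'  # the regex rule
pattern = re.compile(regex)  # compile once and reuse it for every page

for url in urls:
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    print url,
    # print regex, htmltext[:200]
    # findall returns a list holding only what the () group matched
    titles = re.findall(pattern, htmltext)
    for f in titles:
        print f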

------------------

zhangzhipeng@zhangzhipeng-K53SD:~/py/spider_test$ python get-title.py

http://www.baidu.com 百度一下，你就知道

http://www.hao123.com hao123_上网从这里开始

http://www.qq.com ��Ѷ��ҳ

The qq.com source is encoded in gb2312, so its title has to be decoded. The URL list is also hard-coded, so let's extend the script:

--------------------------------------------------------------------

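#!/usr/bin/python
# coding: utf-8
# Python 2 script: print the title of every URL given on the command line.
import re
import urllib
import sys

def get_title(urls):
    # Both patterns were lost from the original post (the tags were stripped).
    # These are reconstructions, not the author's originals; regex_char is
    # chosen so that it reproduces the output shown below, stray quote included.
    regex_title = r'<title>(.*?)</title>'  # matches the title text
    regex_char = r'<meta([^>]*?)charset=("?[\w-]+)'  # second group: declared charset
    # pattern = re.compile(regex)  # commented out for now
    for url in urls:
        htmlfile = urllib.urlopen(url)
        htmltext = htmlfile.read()
        print url,
        # print regex, htmltext[:200]
        titles = re.findall(regex_title, htmltext)
        if len(titles) == 0:  # no title matched (this is rare): nothing to decode
            print
            continue
        title = titles[0]
        charset = re.findall(regex_char, htmltext)
        if len(charset) > 0:  # the source declares an encoding
            if len(charset[0]) == 2:
                charset = charset[0][1]
            elif len(charset[0]) == 1:
                charset = charset[0][0]
            else:
                charset = ''
        else:
            charset = ''
        if charset.lower() == 'gb2312':  # decode if it is gb2312
            title = title.decode('gb2312')
        elif charset.lower() == '':  # nothing declared: fall back to iso-8859-1
            title = title.decode('iso-8859-1')
        print title, charset

if __name__ == '__main__':
    get_title(sys.argv[1:])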

------------------

zhangzhipeng@zhangzhipeng-K53SD:~/py/spider_test$ python get-title-argv.py http://www.baidu.com http://www.qq.com http://www.sina.com.cn http://www.china.com

http://www.baidu.com 百度一下，你就知道 utf-8

http://www.qq.com 腾讯首页 gb2312

http://www.sina.com.cn ÐÂÀËÊ×Ò³

http://www.china.com �л�� - ��ҳ "GB2312

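Sure enough, this is where things go wrong. Baidu's source is utf-8 and reads fine; qq is gb2312 and decodes fine; Sina's source declares no encoding, so the iso-8859-1 fallback produces garbage; china.com declares GB2312 but still comes out garbled, because the captured value is "GB2312 with a stray leading quote, so neither branch matches and the title is printed undecoded.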
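Both failures can be reproduced offline. The sketch below is written for Python 3 (an assumption; the scripts in this post are Python 2) and uses hard-coded byte strings in place of the fetched pages. The helper name `normalize_charset` is made up for illustration. It shows that GBK bytes decoded as iso-8859-1 give exactly the garbled Sina title, and that a charset value captured with a stray quote never compares equal to 'gb2312'.

```python
# Python 3 sketch; the byte strings stand in for what urlopen().read() returns.

def normalize_charset(raw_value):
    """Strip quotes and whitespace from a captured charset value and lowercase it."""
    return raw_value.strip(' \t"\'').lower()

# Failure 1: no charset declared, so the script falls back to iso-8859-1.
sina_bytes = '新浪首页'.encode('gbk')
mojibake = sina_bytes.decode('iso-8859-1')  # never raises, but the text is wrong
print(mojibake)

# Failure 2: the captured value keeps a leading quote, so the comparison fails.
captured = '"GB2312'
print(captured.lower() == 'gb2312')             # False: the title is never decoded
print(normalize_charset(captured) == 'gb2312')  # True once the quote is stripped

# With the right codec the same bytes decode cleanly.
print(sina_bytes.decode('gbk'))
```

iso-8859-1 maps every byte to some character, so that fallback can never fail loudly; it just yields plausible-looking nonsense.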

So what now? Python has a very handy library for this, chardet:

--------------------------------------------------------------------

#!/usr/bin/python
# coding: utf-8
# Python 2 script: let chardet guess the encoding of each title.
import re
import urllib
import chardet

urls = ['http://www.baidu.com', 'http://www.hao123.com', 'http://www.qq.com',
        'http://www.china.com', 'http://www.sina.com.cn']

# The pattern was lost from the original post (the tags were stripped).
# This is a reconstruction, not the author's original.
regex = r'<title>(.*?)</title>'
# pattern = re.compile(regex)

for url in urls:
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    print url,
    # print regex, htmltext[:200]
    titles = re.findall(regex, htmltext)
    for f in titles:
        # print f
        try:
            encoding = chardet.detect(f)['encoding']  # guess the encoding
            if encoding.lower() == 'gb2312':  # if it is gb2312, decode with gbk
                encoding = 'gbk'
            if encoding.lower() != 'utf-8':  # decode everything that is not utf-8
                f = f.decode(encoding)
        except Exception as e:  # catch any failure (including a None guess)
            print e
        else:
            print f

------------------

zhangzhipeng@zhangzhipeng-K53SD:~/py/spider_test$ python get-title01.py

http://www.baidu.com 百度一下，你就知道

http://www.hao123.com hao123_上网从这里开始

http://www.qq.com 腾讯首页

http://www.china.com 中华网 - 首页

http://www.sina.com.cn 新浪首页
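Decoding with gbk whenever chardet reports gb2312 is safe because GBK is a superset of GB2312, and pages labelled gb2312 sometimes contain characters that only GBK covers. A minimal Python 3 sketch of the difference (Python 3 and the sample strings are my choices for illustration):

```python
# Python 3 sketch: GBK decodes everything GB2312 can, plus extra characters.

sample = '中華'                     # traditional '華' is in GBK but not in GB2312
gbk_bytes = sample.encode('gbk')

try:
    gbk_bytes.decode('gb2312')      # the strict gb2312 codec rejects the extra character
    gb2312_ok = True
except UnicodeDecodeError:
    gb2312_ok = False

print(gb2312_ok)                    # False
print(gbk_bytes.decode('gbk'))      # round-trips correctly

# Plain GB2312 text decodes the same way under either codec.
plain = '腾讯首页'.encode('gb2312')
print(plain.decode('gbk') == plain.decode('gb2312'))
```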

And what do you know, it actually works. Now let's port it into the second, more flexible title script:

--------------------------------------------------------------------

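#!/usr/bin/python
# coding: utf-8
# Python 2 script: command-line version with chardet-based decoding.
import re
import urllib
import sys
import chardet

def get_title(urls):
    # The pattern was lost from the original post (the tags were stripped).
    # This is a reconstruction, not the author's original.
    regex_title = r'<title>(.*?)</title>'
    # pattern = re.compile(regex)
    for url in urls:
        htmlfile = urllib.urlopen(url)
        htmltext = htmlfile.read()
        print url,
        titles = re.findall(regex_title, htmltext)
        if len(titles) == 0:  # no title matched: nothing to decode
            print
            continue
        title = titles[0]
        try:
            charset = chardet.detect(title)['encoding']  # guess the encoding
            if charset.lower() == 'gb2312':  # decode gb2312 with gbk
                charset = 'gbk'
            if charset.lower() != 'utf-8':  # decode everything that is not utf-8
                title = title.decode(charset)
        except Exception as e:
            print e
        else:
            print title, charset

if __name__ == '__main__':
    get_title(sys.argv[1:])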

------------------

zhangzhipeng@zhangzhipeng-K53SD:~/py/spider_test$ python get-title-argv01.py http://www.baidu.com http://www.qq.com http://www.sina.com.cn http://www.china.com

http://www.baidu.com 百度一下，你就知道 utf-8

http://www.qq.com 腾讯首页 gbk

http://www.sina.com.cn 新浪首页 gbk

http://www.china.com 中华网 - 首页 gbk

OK, that takes care of all of them.
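The title lookup itself can be made a little more tolerant. The sketch below is a Python 3 variant (my assumption, not part of the scripts above) that searches the raw bytes before any decoding, so the charset guess can still run on the title alone. The names `TITLE_RE` and `extract_title` are hypothetical, and the approach assumes an ASCII-compatible page encoding such as utf-8 or gbk.

```python
import re

# Match on raw bytes, before any decoding. re.I accepts <TITLE>; re.S lets
# the title span several lines; [^>]* tolerates attributes on the tag.
TITLE_RE = re.compile(rb'<title[^>]*>(.*?)</title>', re.I | re.S)

def extract_title(html_bytes):
    """Return the raw bytes of the first title, or None if there is none."""
    match = TITLE_RE.search(html_bytes)
    return match.group(1).strip() if match else None

# A hard-coded page stands in for the bytes read from the network.
page = '<html><head><TITLE>\n 中华网 - 首页 \n</TITLE></head></html>'.encode('gbk')
raw_title = extract_title(page)
print(raw_title.decode('gbk'))
```

Returning None for a page without a title lets the caller skip decoding instead of failing on an empty result.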

Conclusion: in short, the essence of the decoding is in these few lines:

import chardet  # import the chardet module

try:
    charset = chardet.detect(string_text)['encoding']  # guess the encoding
    if charset.lower() == 'gb2312':  # if the guess is gb2312
        charset = 'gbk'  # decode with gbk instead
    if charset.lower() != 'utf-8':  # anything that is not utf-8
        string_text = string_text.decode(charset)  # gets decoded
except Exception as e:
    print e
else:
    print string_text

Note: replace string_text with the text you fetched, or with whatever text whose encoding you want to detect.
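For Python 3, where the fetched page arrives as bytes, the same idea might look like the sketch below. This is only one way to write it: the function name `decode_text`, the `detect` and `fallback` parameters, and the stand-in detector are all hypothetical. Passing the detector in keeps the example runnable without chardet installed; when none is passed it falls back to `chardet.detect`.

```python
# Python 3 sketch. `detect` is any callable shaped like chardet.detect:
# it takes bytes and returns a dict with an 'encoding' key (possibly None).

def decode_text(raw, detect=None, fallback='utf-8'):
    """Decode raw bytes using a detector's guess, treating gb2312 as gbk."""
    if detect is None:
        import chardet  # third-party; only needed when no detector is passed
        detect = chardet.detect
    charset = (detect(raw) or {}).get('encoding') or fallback
    if charset.lower() == 'gb2312':
        charset = 'gbk'  # gbk is a superset of gb2312
    try:
        return raw.decode(charset), charset
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or a wrong guess: decode with replacement characters.
        return raw.decode(fallback, errors='replace'), fallback


# Usage with a stand-in detector instead of the real chardet.
def fake_detect(raw):
    return {'encoding': 'GB2312', 'confidence': 0.99}

text, used = decode_text('腾讯首页'.encode('gbk'), detect=fake_detect)
print(text, used)
```

Unlike the snippet above, this returns the decoded text together with the codec that was used, so the caller can decide what to do with a fallback result.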
