Python Crawler (6): Scraping Qiushibaike

Source: Internet. Published: java取地址符. Editor: 程序博客网. Time: 2024/04/28 20:35

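I have been learning Python for a while now and wanted a project to practice on. There are plenty of examples online that scrape jokes from Qiushibaike (糗事百科), so I decided to try one.

The examples I had seen earlier threw a lot of errors when I downloaded and ran them directly, and needed debugging, but their overall approach was sound. Today I will work through the experiment again from start to finish.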

1. Workflow analysis

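On the Qiushibaike home page, each joke consists of an image and text. For a scraping task that means handling both the text and the images, which is too much trouble, so let's start with something simpler:

fetch only the text and ignore the images.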

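The entry point for our crawl is therefore: http://www.qiushibaike.com/text/

The jokes on this page are text only, which saves us part of the work.

Now that the target is fixed, let's work out what we need to get from this page.

First, we need the content of each joke. Given the content, we also want to know who posted it, i.e. the author. Next, we want to know how many people liked it and how many commented.

Those are the basic requirements. Beyond that, we do not want just one page of content: the output should be continuous, so that after the first page we can read the second. We therefore also need to fetch pages one after another.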

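The overall plan is as follows:

Fetch the home page -> extract the four fields below for each joke -> once the current page is done, move on to the next page.

1. Joke author

2. Joke content

3. Number of likes

4. Number of comments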

With the overall plan in place, let's put it into practice.

2. Fetching the start page

Use the urllib2 library directly to fetch the page content:

    #!/usr/bin/python
    #coding:utf-8
    import urllib2

    def getPages():
        url="http://www.qiushibaike.com/text/"
        requests=urllib2.urlopen(url).read().decode('utf-8')
        print requests

    getPages()

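These few lines should be enough to get the content of the start page, after which we could continue the analysis.

But here is the problem: the script does not run successfully. It fails with the following error: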

    Traceback (most recent call last):
      File "06.qiushibaike_lianxi (复件).py", line 18, in <module>
        getPages()
      File "06.qiushibaike_lianxi (复件).py", line 15, in getPages
        requests=urllib2.urlopen(url).read().decode('utf-8')
      File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
        return _opener.open(url, data, timeout)
      File "/usr/lib/python2.7/urllib2.py", line 404, in open
        response = self._open(req, data)
      File "/usr/lib/python2.7/urllib2.py", line 422, in _open
        '_open', req)
      File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
        result = func(*args)
      File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
        return self.do_open(httplib.HTTPConnection, req)
      File "/usr/lib/python2.7/urllib2.py", line 1187, in do_open
        r = h.getresponse(buffering=True)
      File "/usr/lib/python2.7/httplib.py", line 1089, in getresponse
        response.begin()
      File "/usr/lib/python2.7/httplib.py", line 444, in begin
        version, status, reason = self._read_status()
      File "/usr/lib/python2.7/httplib.py", line 408, in _read_status
        raise BadStatusLine(line)
    httplib.BadStatusLine: ''

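Why?

Because some sites block this kind of access: they do not allow requests that look like they come from a crawler rather than a browser.

Adding a header that disguises the request as coming from a browser is enough. We should also handle exceptions while we are at it, so the code changes as follows: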

    #!/usr/bin/python
    #coding:utf-8
    import urllib2

    def getPages():
        try:
            url="http://www.qiushibaike.com/text/"
            user_agent='Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0'
            headers={'User-Agent':user_agent}
            request=urllib2.Request(url,headers=headers)
            # Open the Request object (not the bare url) so the header is actually sent
            response=urllib2.urlopen(request).read().decode('utf-8')
            print response
            return response
        except urllib2.URLError,e:
            if hasattr(e,"reason"):
                print u"Failed to connect to Qiushibaike, reason:",e.reason
            return None

    getPages()

This gets us the content of the start page.
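For readers on Python 3, where urllib2 no longer exists, the same idea can be sketched with urllib.request. This is only a minimal sketch under assumptions: the helper names build_request and fetch and the timeout value are made up for illustration and are not part of the original program.

```python
import http.client
import urllib.request

URL = "http://www.qiushibaike.com/text/"
USER_AGENT = ("Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) "
              "Gecko/20100101 Firefox/50.0")


def build_request(url=URL, user_agent=USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url=URL, timeout=10):
    """Fetch and decode the page; return None if the request fails.

    The timeout of 10 seconds is an arbitrary, hypothetical choice.
    """
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except (OSError, http.client.HTTPException) as exc:
        # urllib.error.URLError is a subclass of OSError in Python 3.
        print("Failed to connect to Qiushibaike, reason:", exc)
        return None
```

Building the request in its own function makes it possible to check the header without touching the network, for example with build_request().get_header("User-agent").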

3. Extracting the key content

Process the fetched page to get the content we want.

Use regular expressions to extract it:

    #!/usr/bin/python
    #coding:utf-8
    import urllib2
    import re

    def getPages():
        try:
            # Start page URL
            url="http://www.qiushibaike.com/text/"
            # Set a browser User-Agent, otherwise the page content cannot be fetched
            user_agent='Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0'
            headers={'User-Agent':user_agent}
            # Pack the headers into the request
            request=urllib2.Request(url,headers=headers)
            # Fetch the page content and decode it as UTF-8
            html=urllib2.urlopen(request).read().decode('utf-8')
            #print html
            return html
        except urllib2.URLError,e:
            if hasattr(e,"reason"):
                print u"Failed to connect to Qiushibaike, reason:",e.reason
            return None

    def getPageItem(html):
        pageStories=[]
        # Author, content, like count and comment count, in that order.
        # The Chinese words 好笑 and 评论 are part of the site's markup and must stay as they are.
        pattern_author=re.compile(u'<h2>(.*?)</h2>',re.S)
        pattern_content=re.compile(u'<span>(.*?)</span>',re.S)
        pattern_support=re.compile(u'<i class="number">(\d*)</i>\s*好笑',re.S)
        pattern_comment=re.compile(u'<i class="number">(\d*)</i>\s*评论',re.S)
        find_author=re.findall(pattern_author,html)
        find_content=re.findall(pattern_content,html)
        find_support=re.findall(pattern_support,html)
        find_comment=re.findall(pattern_comment,html)
        if find_author:
            for i in xrange(len(find_author)):
                # Turn <br/> tags into real line breaks
                replaceBR=re.compile("<br/>")
                text=re.sub(replaceBR,"\n",find_content[i])
                #support=find_support[i].strip()+" people found this funny"
                #comment=find_comment[i].strip()+" comments"
                # Fall back to "0" when a count is missing
                comment="0"
                if i<len(find_comment):
                    comment=find_comment[i].strip()
                support="0"
                if i<len(find_support):
                    support=find_support[i].strip()
                pageStories.append([str(i+1),find_author[i].strip(),text,support,comment])
                print str(i+1),find_author[i].strip(),text,support,comment
        else:
            print "Unexpected data"
            return None
        return pageStories

    html=getPages()
    # getPages() returns None on failure, so only parse when there is a page
    if html:
        getPageItem(html)

The output now looks like this:

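    1 苍南下山耍流氓，黑衣格哥买红糖 Once when I had a fever I went to the clinic at the gate of our residential compound for an injection. A nurse put a thermometer under my arm, then felt my forehead with concern, first with her left hand, then with her right, and finally cupped my face with both hands... and then shouted to a young nurse in the back: "Juan, come out and warm your hands!" 3961 98
    2 Kiss萝卜 I'm a tomboy who dresses androgynously. A few days after moving into a new residential compound, I heard that a gay couple lived there, always going in and out together and very much in love. Later I found out by chance that they meant me and my husband! 8355 236
    3 妹子不见了 Went bungee jumping today and saw the price was 200 yuan a jump. It seemed a bit expensive, so I asked the ticket seller whether it could be cheaper. Without even looking up she said: "50 off without the rope!" I was tickled to hear it! 3838 156
    4 嘻哈妹纸 During the lunch break I had a piece of toffee in my mouth and dozed off on my desk without noticing... Then the boss came by. Can you picture the right side of my face glued to the desk by the syrup that had run out... embarrassment in capital letters... ahh... 2587 46
    5 许我三日暖 My wife went to get her hair permed at a salon run by a classmate of mine. She came back and told me they had not charged her. I phoned my classmate to ask what was going on. He said: "Today is a holiday (March 8), so let her be happy. Just remember to come and pay for her tomorrow..." 4528 76
    6 <糗犯监狱>～阿木 My son is in second grade now, and today his teacher finally called me in to the school. The teacher slammed my son's exercise book on the desk and said: "You think I can't tell? This is already the third time you have done your son's homework for him! Go! Stand facing the wall!" I looked at the now white-haired teacher who had done so much for me back in the day, and went and stood there without a word! 3064 64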
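To see how the four patterns behave without hitting the site, they can be tried on a small hand-written snippet. The sketch below is written for Python 3, and SAMPLE_HTML is invented markup shaped only to match the patterns; the real page is more complex and may differ. The helper name parse_stories is likewise hypothetical.

```python
import re

# Hypothetical markup, written only so that the article's patterns match it.
SAMPLE_HTML = """
<div class="article">
<h2> Alice </h2>
<span>first line<br/>second line</span>
<i class="number">12</i> 好笑
<i class="number">3</i> 评论
</div>
<div class="article">
<h2>Bob</h2>
<span>only one line</span>
<i class="number">7</i> 好笑
</div>
"""

AUTHOR = re.compile(r'<h2>(.*?)</h2>', re.S)
CONTENT = re.compile(r'<span>(.*?)</span>', re.S)
SUPPORT = re.compile(r'<i class="number">(\d*)</i>\s*好笑', re.S)
COMMENT = re.compile(r'<i class="number">(\d*)</i>\s*评论', re.S)


def parse_stories(html):
    """Return one dict per joke, pairing the four match lists by position."""
    authors = AUTHOR.findall(html)
    contents = CONTENT.findall(html)
    supports = SUPPORT.findall(html)
    comments = COMMENT.findall(html)
    stories = []
    for i, (author, content) in enumerate(zip(authors, contents)):
        stories.append({
            "index": i + 1,
            "author": author.strip(),
            # <br/> tags become real line breaks
            "text": content.replace("<br/>", "\n"),
            # Missing counters fall back to "0", as in the article's code
            "support": supports[i] if i < len(supports) else "0",
            "comment": comments[i] if i < len(comments) else "0",
        })
    return stories
```

Because the four lists are paired purely by position, a joke that lacks a counter in the middle of a page would shift the counts of every joke after it. Matching each joke's block first and then searching inside that block would avoid this, at the cost of one more pattern.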

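4. Controlling the output

Printing everything at once is not pleasant to read. What we want is to print one joke at a time and, after reading it, press Enter to print the next.

Changing the program as follows does that: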

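    # Relies on getPages() and getPageItem() defined above

    # Module-level flag: whether the program should keep running
    enable=True

    def getOneJoke(pageStories):
        # Without "global", the assignment below would only create a local variable
        global enable
        i=0
        for story in pageStories:
            i+=1
            user_input=raw_input()
            if user_input=="Q":
                enable=False
                return
            else:
                print "No.%d\tAuthor: %s\t\n%s\nLikes: %s  Comments: %s\n" % (i,story[1],story[2],story[3],story[4])

    while enable:
        html=getPages()
        # Stop if the page could not be fetched or parsed
        story=getPageItem(html) if html else None
        if story:
            getOneJoke(story)
        else:
            break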

This achieves our goal, but one problem remains: so far we can only get the content of the first page. How do we get the following pages?
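The one-at-a-time control from this section can also be sketched without a global flag. The version below is a Python 3 sketch, not the article's code: the function name show_one_by_one and its read_key and write parameters are assumptions, introduced so that the loop can be exercised without a keyboard.

```python
def show_one_by_one(stories, read_key=input, write=print):
    """Print one story per key press.

    Each story is a list [index, author, text, likes, comments], the same
    shape as the lists built in section 3. Returns False as soon as the
    user types Q, and True when every story has been shown.
    """
    for number, story in enumerate(stories, start=1):
        if read_key() == "Q":
            return False
        write("No.%d\tAuthor: %s\n%s\nLikes: %s  Comments: %s\n"
              % (number, story[1], story[2], story[3], story[4]))
    return True
```

Returning a boolean lets the caller decide whether to keep looping, for example `while show_one_by_one(stories): ...`, instead of having the function modify shared state.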

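5. Fetching consecutive pages

Paging through http://www.qiushibaike.com/text/ reveals a pattern:

each page's address is the base URL followed by page/num, where num is the page number.

That is:

http://www.qiushibaike.com/text/page/2

http://www.qiushibaike.com/text/page/3

So we only need to modify the first function slightly to fetch consecutive pages.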
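The URL rule above is small enough to isolate in a helper. This is a minimal sketch; the name page_url and the check on the page number are assumptions for illustration.

```python
BASE_URL = "http://www.qiushibaike.com/text/"


def page_url(page_index):
    """Build the address of a given page: base URL + "page/" + page number."""
    if page_index < 1:
        raise ValueError("page numbers start at 1")
    return "%spage/%d" % (BASE_URL, page_index)
```

For example, page_url(2) gives http://www.qiushibaike.com/text/page/2, matching the first address listed above.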

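Final version

After the small steps above and a bit more debugging of our own, a flexible little program is born.

I have tested it myself, and it runs on both Windows and Linux.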

    #!/usr/bin/python
    #coding:utf-8
    import urllib2
    import re
    import time
    import sys
    import datetime

    class MyQiuBai:
        # Initialization: define some variables
        def __init__(self):
            self.pageIndex=1
            self.user_agent='Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0'
            # Initialize the headers
            self.headers={'User-Agent':self.user_agent}
            # Holds the jokes; each element is one page of jokes
            self.stories=[]
            # Whether the program should keep running
            self.enable=False
            # Jokes that have been read are saved to this local file
            self.filename='qiubai.txt'
            self.filesymbol=open(self.filename,'wb')

        # Fetch the HTML of the page with the given index
        def getPages(self,pageIndex):
            #print "Turning to page %d" % (pageIndex)
            try:
                # Build the new URL
                url="http://www.qiushibaike.com/text/page/"+str(pageIndex)
                # Build the request
                request=urllib2.Request(url,headers=self.headers)
                # Fetch the page with urlopen
                response=urllib2.urlopen(request)
                # Decode the page as UTF-8
                html=response.read().decode('utf-8')
                return html
            # Catch the exception so the program does not simply die
            except urllib2.URLError,e:
                if hasattr(e,"reason"):
                    print u"Failed to connect to Qiushibaike, reason:",e.reason
                return None

        def getPageItem(self,html):
            # List that holds the extracted content
            pageStories=[]
            # Brute-force regex matching: author, content, like count, comment count.
            # The Chinese words 好笑 and 评论 are part of the site's markup and must stay as they are.
            pattern_author=re.compile(u'<h2>(.*?)</h2>',re.S)
            pattern_content=re.compile(u'<span>(.*?)</span>',re.S)
            pattern_support=re.compile(u'<i class="number">(\d*)</i>\s*好笑',re.S)
            pattern_comment=re.compile(u'<i class="number">(\d*)</i>\s*评论',re.S)
            find_author=re.findall(pattern_author,html)
            find_content=re.findall(pattern_content,html)
            find_support=re.findall(pattern_support,html)
            find_comment=re.findall(pattern_comment,html)
            # Some pages may have no authors, so check first
            if find_author:
                for i in xrange(len(find_author)):
                    # Light clean-up of the joke text: turn <br/> into real line breaks
                    replaceBR=re.compile("<br/>")
                    text=re.sub(replaceBR,"\n",find_content[i])
                    comment="0"
                    if i<len(find_comment):
                        comment=find_comment[i].strip()
                    support="0"
                    if i<len(find_support):
                        support=find_support[i].strip()
                    # Store the result; i+1 is also the joke's position on this page
                    pageStories.append([str(i+1),find_author[i].strip(),text,support,comment])
            else:
                print "Unexpected data"
                return None
            return pageStories

        # Parse a page and add its jokes to the list
        def loadPage(self,pageCode):
            if self.enable:
                # If fewer than 2 pages are buffered, load one more
                if len(self.stories)<2:
                    pageStories=self.getPageItem(pageCode)
                    if pageStories:
                        # Store this page's jokes in the shared list
                        self.stories.append(pageStories)

        # Print one joke each time Enter is pressed
        def getOneJoke(self,pageStories,page):
            for story in pageStories:
                # Current time
                writetime=datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S ')
                content="Page %d, joke %s\tAuthor: %s\t%s\n%s\nLikes: %s  Comments: %s\n" % (page,story[0],story[1],str(writetime),story[2],story[3],story[4])
                # Print one joke
                print content
                # Then write it to the file
                self.filesymbol.write(content)
                self.filesymbol.write('\n')
                user_input=raw_input()
                # On "Q", quit the program and close the file
                if user_input=="Q":
                    self.enable=False
                    self.filesymbol.close()
                    return

        def begin(self):
            print u"Reading Qiushibaike page by page. Press Enter for the next joke, Q to quit"
            self.enable=True
            # Custom start page
            nowPage=1
            user_input=raw_input('Enter the page to start from (default: page 1): ')
            try:
                nowPage=int(user_input)
            except Exception,e:
                print "input what %s" % (user_input)
            if user_input=="Q":
                self.enable=False
                self.filesymbol.close()
                return
            while self.enable:
                # Fetch the current page
                pageCode=self.getPages(nowPage)
                if not pageCode:
                    print("Failed to load the page...")
                    return None
                # Buffer one more page
                self.loadPage(pageCode)
                if len(self.stories)>0:
                    # Take one page from the shared list
                    pageStories=self.stories[0]
                    # Remove it from the list, since it has been taken out
                    del self.stories[0]
                    # Show this page's jokes
                    self.getOneJoke(pageStories,nowPage)
                # Move on even if the page yielded nothing, to avoid refetching it forever
                nowPage+=1

    # Python 2 workaround so that unicode text can be written to the file
    reload(sys)
    sys.setdefaultencoding( "utf-8" )
    qiubai=MyQiuBai()
    qiubai.begin()
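The reload(sys) / sys.setdefaultencoding("utf-8") lines at the end of the program exist only so that unicode text can be written to the file under Python 2. One alternative, sketched here under assumptions, is to open the file with an explicit encoding via io.open, which behaves the same on Python 2.7 and Python 3. The helper names format_story and append_story are hypothetical and not part of the original program.

```python
import datetime
import io


def format_story(page, story, when=None):
    """Format one joke for display and for the log file.

    story is a list [index, author, text, likes, comments].
    """
    when = when or datetime.datetime.now()
    stamp = when.strftime("%Y-%m-%d %H:%M:%S")
    return (u"Page %d, joke %s\tAuthor: %s\t%s\n%s\nLikes: %s  Comments: %s\n"
            % (page, story[0], story[1], stamp, story[2], story[3], story[4]))


def append_story(path, page, story, when=None):
    """Append one formatted joke to a UTF-8 text file."""
    with io.open(path, "a", encoding="utf-8") as handle:
        handle.write(format_story(page, story, when))
        handle.write(u"\n")
```

Opening the file per write is slower than keeping one handle open, but it means nothing is left unclosed if the program exits unexpectedly; for a program that writes one joke per key press, the cost is negligible.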

Finally, the py2exe tool can be used to package the script into a small application, so that it can be used without installing Python.

Reference: http://cuiqingcai.com/990.html

1 0