Python Scraper in Practice: Baidu Tieba — the Mama Bar
Source: Internet  Editor: 程序博客网  Date: 2024/05/29 12:29
Last time we used requests and XPath to scrape Jikexueyuan's course list, but that still wasn't quite satisfying. Today let's scrape the topics in Baidu Tieba's Mama bar and see what the moms there like to talk about!
Before we start, let's set our targets:
1. Grab the topics from the Mama bar on Baidu Tieba.
2. For each topic, grab the poster, post time, title, content, and reply count.
1. Determine the URL
Finding the URL has been covered in earlier posts, so here it is directly: http://tieba.baidu.com/f?kw=%E5%A6%88%E5%A6%88&ie=utf-8&pn=0. This time the trailing number works a little differently: page 2 uses 50, page 3 uses 100, and so on — the pn parameter increases by 50 per page.
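The pn arithmetic above can be captured in a tiny helper (the function name is my own, not from the original script): a 1-based page number n maps to pn = (n - 1) * 50.

```python
# Hypothetical helper: build the listing URL for a 1-based page number.
# The kw parameter is the URL-encoded bar name ("妈妈").
BASE = 'http://tieba.baidu.com/f?kw=%E5%A6%88%E5%A6%88&ie=utf-8&pn='

def page_url(n):
    # page 1 -> pn=0, page 2 -> pn=50, page 3 -> pn=100, ...
    return BASE + str((n - 1) * 50)

print(page_url(1))  # ends with pn=0
print(page_url(3))  # ends with pn=100
```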
2. Download the page with requests
No need to repeat the details here — the code is below; if anything is unfamiliar, look back at the earlier articles:
import requests

url = 'http://tieba.baidu.com/f?kw=%E5%A6%88%E5%A6%88&ie=utf-8&pn=0'
html = requests.get(url)
print(html.text)
3. Parse the page with XPath
First, let's look at the structure of the Mama bar: press F12 to inspect the elements. The markup is quite clean. Following the usual big-to-small principle, each topic lives inside its own <li>...</li> element.
Expanding the markup, we can locate each piece we want — the poster, the creation time, the content, the reply count, and so on.
Now that the structure and the target fields are identified, we extract them one by one with XPath:
import requests
from lxml import etree

url = 'http://tieba.baidu.com/f?kw=%E5%A6%88%E5%A6%88&ie=utf-8&pn=0'
html = requests.get(url)
selector = etree.HTML(html.text)
# each topic row is one <li class=" j_thread_list clearfix"> element
content_field = selector.xpath('//li[@class=" j_thread_list clearfix"]')
for each in content_field:
    reply_num = each.xpath('div/div[@class="col2_left j_threadlist_li_left"]/span/text()')[0]
    list_title = each.xpath('div/div[@class="col2_right j_threadlist_li_right "]/div[@class="threadlist_lz clearfix"]/div[@class="threadlist_title pull_left j_th_tit "]/a/text()')[0]
    author = each.xpath('div/div[@class="col2_right j_threadlist_li_right "]/div[@class="threadlist_lz clearfix"]/div[@class="threadlist_author pull_right"]/span[@class="tb_icon_author "]/a/text()')[0]
    create_time = each.xpath('div/div[@class="col2_right j_threadlist_li_right "]/div[@class="threadlist_lz clearfix"]/div[@class="threadlist_author pull_right"]/span[@class="pull-right is_show_create_time"]/text()')[0]
    content = each.xpath('div/div[@class="col2_right j_threadlist_li_right "]/div[@class="threadlist_detail clearfix"]/div[@class="threadlist_text pull_left"]/div/text()')[0]
    print(reply_num)
    print(list_title)
    print(author)
    print(create_time)
    print(content)
Run it and — hold on, what is this:
list_title = each.xpath('div/div[@class="col2_right j_threadlist_li_right "]/div[@class="threadlist_lz clearfix"]/div[@class="threadlist_title pull_left j_th_tit "]/a/text()')[0]
IndexError: list index out of range

So some rows don't contain every field: the XPath match comes back empty and indexing it with [0] fails. We can simply skip those rows with try...except — when a row raises, we move on to the next one.
import requests
from lxml import etree

url = 'http://tieba.baidu.com/f?kw=%E5%A6%88%E5%A6%88&ie=utf-8&pn=0'
html = requests.get(url)
selector = etree.HTML(html.text)
content_field = selector.xpath('//li[@class=" j_thread_list clearfix"]')
for each in content_field:
    try:
        reply_num = each.xpath('div/div[@class="col2_left j_threadlist_li_left"]/span/text()')[0]
        list_title = each.xpath('div/div[@class="col2_right j_threadlist_li_right "]/div[@class="threadlist_lz clearfix"]/div[@class="threadlist_title pull_left j_th_tit "]/a/text()')[0]
        author = each.xpath('div/div[@class="col2_right j_threadlist_li_right "]/div[@class="threadlist_lz clearfix"]/div[@class="threadlist_author pull_right"]/span[@class="tb_icon_author "]/a/text()')[0]
        create_time = each.xpath('div/div[@class="col2_right j_threadlist_li_right "]/div[@class="threadlist_lz clearfix"]/div[@class="threadlist_author pull_right"]/span[@class="pull-right is_show_create_time"]/text()')[0]
        content = each.xpath('div/div[@class="col2_right j_threadlist_li_right "]/div[@class="threadlist_detail clearfix"]/div[@class="threadlist_text pull_left"]/div/text()')[0]
        print(reply_num)
        print(list_title)
        print(author)
        print(create_time)
        print(content)
    except IndexError:
        # this row is missing a field; skip it and keep going
        continue
And that's it — one page of content scraped with just a handful of lines of code!
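An alternative to the try...except, sketched here with a helper name of my own choosing: return the first XPath match or a default, so a missing field yields an empty string instead of an exception. A narrow helper like this also avoids accidentally swallowing unrelated bugs inside a broad except block.

```python
# Hypothetical "first match or default" helper for XPath results.
def first(matches, default=''):
    # xpath() returns a list; take the first entry if there is one
    return matches[0] if matches else default

# usage sketch inside the loop:
#   reply_num = first(each.xpath('div/div[@class="col2_left j_threadlist_li_left"]/span/text()'))
print(first(['12']))  # 12
print(first([]))      # (empty string)
```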
4. The complete, refactored code
As last time, let's tidy this up into a complete, reusable script — and this time scrape 10 pages:
# _*_ coding:utf-8 _*_
import requests
from lxml import etree

# write one topic's fields to the open output file
def towrite(f, contentdict):
    f.write('Replies: ' + str(contentdict['reply_num']) + '\n')
    f.write('Title: ' + contentdict['topic_title'] + '\n')
    f.write('Content: ' + contentdict['topic_content'] + '\n')
    f.write('Author: ' + contentdict['user_name'] + '\n')
    f.write('Posted: ' + str(contentdict['topic_time']) + '\n\n')

# spider body: download one listing page and extract every topic on it
def spider(url, f):
    html = requests.get(url)
    selector = etree.HTML(html.text)
    content_field = selector.xpath('//li[@class=" j_thread_list clearfix"]')
    for each in content_field:
        item = {}
        try:
            item['reply_num'] = each.xpath('div/div[@class="col2_left j_threadlist_li_left"]/span/text()')[0]
            item['topic_title'] = each.xpath('div/div[@class="col2_right j_threadlist_li_right "]/div[@class="threadlist_lz clearfix"]/div[@class="threadlist_title pull_left j_th_tit "]/a/text()')[0]
            item['user_name'] = each.xpath('div/div[@class="col2_right j_threadlist_li_right "]/div[@class="threadlist_lz clearfix"]/div[@class="threadlist_author pull_right"]/span[@class="tb_icon_author "]/a/text()')[0]
            item['topic_time'] = each.xpath('div/div[@class="col2_right j_threadlist_li_right "]/div[@class="threadlist_lz clearfix"]/div[@class="threadlist_author pull_right"]/span[@class="pull-right is_show_create_time"]/text()')[0]
            # collapse the whitespace padding around the preview text
            content = (each.xpath('div/div[@class="col2_right j_threadlist_li_right "]/div[@class="threadlist_detail clearfix"]/div[@class="threadlist_text pull_left"]/div/text()')[0]).split()
            item['topic_content'] = ''.join(content)
            towrite(f, item)
        except IndexError:
            # this row is missing a field; skip it
            continue

if __name__ == '__main__':
    with open('content.txt', 'a', encoding='utf-8') as f:
        # generate the link for each of the 10 pages (pn steps by 50)
        for x in range(10):
            newpage = 'http://tieba.baidu.com/f?kw=%E5%A6%88%E5%A6%88&ie=utf-8&pn=' + str(x * 50)
            print('Page %d' % (x + 1))
            print(newpage)
            spider(newpage, f)
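One detail worth calling out from the script above: the content field is split and re-joined because Tieba pads the preview text with newlines and spaces. A minimal illustration of that cleanup step (the function name is my own):

```python
# str.split() with no argument splits on any run of whitespace,
# so join() collapses every space, tab, and newline at once.
def clean(text):
    return ''.join(text.split())

print(clean('  hello \n  world  '))  # helloworld
```

Joining with an empty string glues the fragments together with nothing in between, which reads fine for Chinese text; for space-separated languages you might join with ' ' instead.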