Python Crawler Example 1: Scraping Baidu Tieba
Source: Internet · Editor: 程序博客网 · Date: 2024/05/02 04:48
Goal: collect the information for all threads in the 网络爬虫 (web crawler) tieba:
http://tieba.baidu.com/f?ie=utf-8&kw=%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB&fr=search
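The `kw` value in this URL is just the tieba name 网络爬虫, UTF-8 encoded and then percent-encoded. As a minimal sketch of how that URL can be rebuilt, the snippet below uses Python 3's `urllib.parse.urlencode` (the script in this post targets Python 2 and calls `urllib.urlencode` instead). The helper name `build_tieba_url` is made up for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_tieba_url(keyword):
    """Build the thread-list URL for a tieba name (hypothetical helper)."""
    # A list of pairs keeps the parameters in the same order as the URL above.
    params = [('ie', 'utf-8'), ('kw', keyword), ('fr', 'search')]
    # urlencode encodes str values as UTF-8 before percent-encoding them.
    return 'http://tieba.baidu.com/f?' + urlencode(params)


url = build_tieba_url('网络爬虫')
print(url)

# Decoding the query string recovers the original keyword.
decoded_kw = parse_qs(urlsplit(url).query)['kw'][0]
print(decoded_kw)
```

Building the query from a parameter list rather than by hand avoids encoding mistakes when the tieba name changes.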
Approach:
Confirm where the required data lives
Right-click the page and view its source
Use Fiddler to simulate sending the request
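Once a request has been replayed successfully in Fiddler, the same request can be reproduced in code. The sketch below is one possible way to do that with Python 3's `urllib.request`. The `User-Agent` value and the helper names `build_request` and `fetch` are assumptions for illustration; the post does not say which headers were replayed.

```python
import urllib.request

# Hypothetical browser-like header; replace it with whatever Fiddler shows
# for the request that actually worked.
DEFAULT_HEADERS = {'User-Agent': 'Mozilla/5.0 (compatible; tieba-example)'}


def build_request(url, headers=None):
    """Create a GET request carrying the default headers plus any overrides."""
    merged = dict(DEFAULT_HEADERS)
    merged.update(headers or {})
    return urllib.request.Request(url, headers=merged)


def fetch(url, timeout=10):
    """Send the request and return the raw response bytes (network I/O)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network; only fetch() does.
req = build_request(
    'http://tieba.baidu.com/f?ie=utf-8'
    '&kw=%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB&fr=search'
)
print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes it easy to compare the headers in code against the ones captured in Fiddler before any traffic is generated.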
# -*- coding:utf-8 -*-
# Python 2 script: requires the third-party packages chardet and lxml.
import urllib2
import chardet
from lxml import etree
import json
import urllib


def GetTimeByArticle(url):
    # Fetch a thread page and return the text of its second "tail-info" element.
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    resHtml = response.read()
    html = etree.HTML(resHtml)
    return html.xpath('.//*[@class="tail-info"]')[1].text


def main():
    output = open('tieba0628.json', 'w')
    # Build the thread-list URL; urlencode percent-encodes the tieba name.
    queryUrl = {'kw': '网络爬虫'}
    request = urllib2.Request('http://tieba.baidu.com/f?ie=utf-8&' + urllib.urlencode(queryUrl) + '&fr=search')
    response = urllib2.urlopen(request)
    print 'response start'
    resHtml = response.read()
    print 'response read'
    print chardet.detect(resHtml)

    html = etree.HTML(resHtml)
    # Each thread is an <li> element that carries a data-field attribute.
    result = html.xpath('//li[@data-field]')
    print result
    print len(result)

    for site in result:
        #print etree.tostring(site, encoding='utf-8')
        title = site.xpath('.//a[@title]')[0].text
        #title = site.xpath('.//a/@title')[0]
        author = site.xpath('.//*[@class="frs-author-name-wrap"]/a')[0].text
        lastName = site.xpath('.//*[@class="tb_icon_author_rely j_replyer"]/a')[0].text
        reply_date = site.xpath('.//span[@class="threadlist_reply_date pull_right j_reply_data"]')[0].text.strip()
        Article_url = site.xpath('.//*[@class ="j_th_tit "]')[0].attrib['href']
        # Overwrites the list-page value with the one read from the thread page.
        reply_date = GetTimeByArticle('http://tieba.baidu.com/' + Article_url)
        rep_num = site.xpath('.//*[@class="threadlist_rep_num center_text"]')[0].text
        # The data-field attribute holds a JSON string.
        field = json.loads(site.attrib['data-field'])
        print title, author, lastName, reply_date, rep_num, field

        item = {}
        item['title'] = title
        item['author'] = author
        item['lastName'] = lastName
        item['reply_date'] = reply_date
        item['rep_num'] = rep_num
        item['field'] = field
        print item

        # Write one JSON object per line, keeping Chinese text unescaped.
        line = json.dumps(item, ensure_ascii=False)
        print line
        print type(line)
        output.write(line.encode('utf-8') + "\n")
        # Stops after the first thread; remove this line to process every thread on the page.
        break

    output.close()
    print 'end'


if __name__ == '__main__':
    main()
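The core of the script is selecting every `<li>` that has a `data-field` attribute, decoding that attribute as JSON, and writing one JSON object per line. The sketch below shows the same idea in Python 3 on a small hand-written HTML sample. It swaps lxml and XPath for the standard library's `html.parser` so that it runs without extra packages. The sample markup and its field names (`id`, `reply_num`) are invented for illustration and are not Tieba's real page structure.

```python
import json
from html.parser import HTMLParser


class DataFieldParser(HTMLParser):
    """Collect the decoded data-field JSON of every <li data-field=...>."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag != 'li':
            return
        # HTMLParser has already unescaped entities such as &quot; here.
        value = dict(attrs).get('data-field')
        if value is not None:
            self.fields.append(json.loads(value))


# Made-up sample: two thread items and one <li> without data-field.
SAMPLE_HTML = '''
<ul>
  <li data-field='{"id": 1, "reply_num": 3}'><a title="first">first</a></li>
  <li class="ad">no data-field here</li>
  <li data-field="{&quot;id&quot;: 2, &quot;reply_num&quot;: 0}"><a title="second">second</a></li>
</ul>
'''


def extract_fields(html_text):
    """Return the parsed data-field dicts in document order."""
    parser = DataFieldParser()
    parser.feed(html_text)
    parser.close()
    return parser.fields


def to_json_lines(items):
    """Serialize items as JSON Lines, leaving non-ASCII text unescaped."""
    return ''.join(json.dumps(item, ensure_ascii=False) + '\n' for item in items)


fields = extract_fields(SAMPLE_HTML)
lines = to_json_lines(fields)
print(lines, end='')
```

In Python 3 the resulting string can be written directly to a file opened with `encoding='utf-8'`, so the explicit `.encode('utf-8')` from the Python 2 script is not needed.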