Parsing JSON embedded in page markup with a Python crawler and XPath
This article covers three techniques: the map method for crawling pages concurrently (via a thread pool), extracting page content from embedded JSON, and parsing pages with XPath. The example thread is:
http://tieba.baidu.com/p/3522395718?pn=1
Page markup:
<div class="l_post j_l_post l_post_bright " data-field="{"author":{"user_id":503570759,"user_name":"\u9893\u5e9f\u4e86\u8c01\u7684\u6e05\u7eaf","name_u":"%E9%A2%93%E5%BA%9F%E4%BA%86%E8%B0%81%E7%9A%84%E6%B8%85%E7%BA%AF&ie=utf-8","user_sex":2,"portrait":"47e1e9a293e5ba9fe4ba86e8b081e79a84e6b885e7baaf031e","is_like":1,"level_id":14,"level_name":"\u4f20\u5947\u679c\u7c89","cur_score":20947,"bawu":0,"props":null},"content":{"post_id":62866847607,"is_anonym":false,"open_id":"tbclient","open_type":"apple","date":"2015-01-11 16:39","vote_crypt":"","post_no":6,"type":"0","comment_num":123,"ptype":"0","is_saveface":false,"props":null,"post_index":4,"pb_tpoint":null}}">
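Before the full spider, a minimal sketch of how the data-field attribute can be read: lxml's HTML parser decodes the &quot; entities, so the attribute value comes back as plain JSON text that json.loads can parse directly. The snippet uses a trimmed, hypothetical data-field, not the real Tieba markup:

```python
from lxml import etree
import json

# Hypothetical markup with the same shape as the Tieba data-field attribute
html = ('<div class="l_post j_l_post l_post_bright " '
        'data-field="{&quot;author&quot;:{&quot;user_name&quot;:&quot;alice&quot;},'
        '&quot;content&quot;:{&quot;date&quot;:&quot;2015-01-11 16:39&quot;}}">x</div>')

root = etree.HTML(html)
div = root.xpath('//div[contains(@class, "l_post")]')[0]

# lxml has already turned &quot; back into ", so this is plain JSON
field = json.loads(div.get('data-field'))
print(field['author']['user_name'])  # alice
print(field['content']['date'])      # 2015-01-11 16:39
```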
Spider code:
```python
def spider(url):
    html = requests.get(url)
    selector = etree.HTML(html.text)
    content_field = selector.xpath('//div[@class="l_post l_post_bright "]')
    item = {}
    for each in content_field:
        reply_info = json.loads(each.xpath('@data-field')[0].replace('&quot;', ''))
        author = reply_info['author']['user_name']
        content = each.xpath('div[@class="d_post_content_main"]/div/cc/'
                             'div[@class="d_post_content j_d_post_content "]/text()')[0]
        reply_time = reply_info['content']['date']
        #print content
        #print reply_time
        #print author
        item['user_name'] = author
        item['topic_reply_content'] = content
        item['topic_reply_time'] = reply_time
        towrite(item)
```
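The spider mixes absolute and relative XPath: a path starting with `//` searches the whole document, while a path without a leading slash is evaluated relative to the element it is called on. A minimal sketch on a simplified, hypothetical structure:

```python
from lxml import etree

# Hypothetical markup standing in for a post's nested content divs
doc = etree.HTML('<div class="post"><div class="body"><p>hello</p></div></div>')

for post in doc.xpath('//div[@class="post"]'):       # absolute: whole document
    text = post.xpath('div[@class="body"]/p/text()') # relative: under `post` only
    print(text[0])  # hello
```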
For pages whose data is embedded as JSON, we can load and parse it with json.loads, for example:
reply_info = json.loads(each.xpath('@data-field')[0].replace('&quot;', ''))
This page also involves nested dictionaries (dictionaries containing dictionaries); a nested value is read by indexing one level at a time:
author = reply_info['author']['user_name']
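Both steps together, on a hypothetical payload trimmed to the keys the spider actually reads:

```python
import json

# Hypothetical data-field payload, reduced to the fields used above
raw = ('{"author": {"user_name": "alice", "level_id": 14},'
       ' "content": {"date": "2015-01-11 16:39", "comment_num": 123}}')
reply_info = json.loads(raw)  # string -> nested dict

# Index one level at a time to reach nested values
author = reply_info['author']['user_name']
reply_time = reply_info['content']['date']
print(author, reply_time)  # alice 2015-01-11 16:39
```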
The complete script, which crawls each post's user_name, content, and reply time from the Tieba thread, follows. Note that it targets Python 2 (reload(sys) and unicode() do not exist in Python 3):
```python
#-*- coding: utf8 -*-
from lxml import etree
from multiprocessing.dummy import Pool as ThreadPool
import requests
import json
import sys

# Python 2 only: force UTF-8 as the default encoding
reload(sys)
sys.setdefaultencoding('utf-8')

'''Delete content.txt before re-running: the file is opened in
append mode, so repeated runs keep appending output.'''

def towrite(contentdict):
    f.writelines(u'回帖时间:' + str(contentdict['topic_reply_time']) + '\n')
    f.writelines(u'回帖内容:' + unicode(contentdict['topic_reply_content']) + '\n')
    f.writelines(u'回帖人:' + contentdict['user_name'] + '\n\n')

def spider(url):
    html = requests.get(url)
    selector = etree.HTML(html.text)
    content_field = selector.xpath('//div[@class="l_post l_post_bright "]')
    item = {}
    for each in content_field:
        reply_info = json.loads(each.xpath('@data-field')[0].replace('&quot;', ''))
        author = reply_info['author']['user_name']
        content = each.xpath('div[@class="d_post_content_main"]/div/cc/'
                             'div[@class="d_post_content j_d_post_content "]/text()')[0]
        reply_time = reply_info['content']['date']
        item['user_name'] = author
        item['topic_reply_content'] = content
        item['topic_reply_time'] = reply_time
        towrite(item)

if __name__ == '__main__':
    pool = ThreadPool(4)
    f = open('content.txt', 'a')
    page = []
    for i in range(1, 22):
        newpage = 'http://tieba.baidu.com/p/3522395718?pn=' + str(i)
        page.append(newpage)
    results = pool.map(spider, page)
    pool.close()
    pool.join()
    f.close()
```
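The concurrency pattern in `__main__` can be isolated: multiprocessing.dummy exposes the multiprocessing API backed by threads, which suits I/O-bound crawling, and Pool.map applies a function to every URL in the list. A sketch with a stand-in function (fake_spider is hypothetical, replacing real network requests):

```python
from multiprocessing.dummy import Pool as ThreadPool  # threads, not processes

def fake_spider(url):
    # Stand-in for fetching and parsing a page; just measures the URL
    return len(url)

pages = ['http://tieba.baidu.com/p/3522395718?pn=%d' % i for i in range(1, 4)]

pool = ThreadPool(4)
results = pool.map(fake_spider, pages)  # one result per URL, in order
pool.close()
pool.join()
print(results)
```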