Parsing JSON embedded in pages with a Python crawler and XPath


This post covers three techniques: concurrent crawling with a pool's map method, extracting page content embedded as JSON, and parsing pages with XPath. The example thread:

http://tieba.baidu.com/p/3522395718?pn=1

Page markup:

<div class="l_post j_l_post l_post_bright  " data-field="{&quot;author&quot;:{&quot;user_id&quot;:503570759,&quot;user_name&quot;:&quot;\u9893\u5e9f\u4e86\u8c01\u7684\u6e05\u7eaf&quot;,&quot;name_u&quot;:&quot;%E9%A2%93%E5%BA%9F%E4%BA%86%E8%B0%81%E7%9A%84%E6%B8%85%E7%BA%AF&amp;ie=utf-8&quot;,&quot;user_sex&quot;:2,&quot;portrait&quot;:&quot;47e1e9a293e5ba9fe4ba86e8b081e79a84e6b885e7baaf031e&quot;,&quot;is_like&quot;:1,&quot;level_id&quot;:14,&quot;level_name&quot;:&quot;\u4f20\u5947\u679c\u7c89&quot;,&quot;cur_score&quot;:20947,&quot;bawu&quot;:0,&quot;props&quot;:null},&quot;content&quot;:{&quot;post_id&quot;:62866847607,&quot;is_anonym&quot;:false,&quot;open_id&quot;:&quot;tbclient&quot;,&quot;open_type&quot;:&quot;apple&quot;,&quot;date&quot;:&quot;2015-01-11 16:39&quot;,&quot;vote_crypt&quot;:&quot;&quot;,&quot;post_no&quot;:6,&quot;type&quot;:&quot;0&quot;,&quot;comment_num&quot;:123,&quot;ptype&quot;:&quot;0&quot;,&quot;is_saveface&quot;:false,&quot;props&quot;:null,&quot;post_index&quot;:4,&quot;pb_tpoint&quot;:null}}">         
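A quick way to confirm what the crawler will actually see: lxml decodes HTML entities such as &quot; when it reads attributes, so @data-field comes back as plain JSON text. A minimal sketch, using a trimmed, hypothetical version of the markup above:

from lxml import etree
import json

# Trimmed, hypothetical version of the data-field markup shown above.
snippet = ('<div class="l_post j_l_post l_post_bright" '
           'data-field="{&quot;author&quot;:{&quot;user_id&quot;:503570759},'
           '&quot;content&quot;:{&quot;post_no&quot;:6}}"></div>')
div = etree.HTML(snippet).xpath('//div')[0]
data = json.loads(div.get('data-field'))   # entities already decoded by lxml
print(data['author']['user_id'])           # 503570759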


Scraper code:

def spider(url):
    html = requests.get(url)
    selector = etree.HTML(html.text)
    # Each post lives in a div whose data-field attribute holds JSON metadata.
    content_field = selector.xpath('//div[@class="l_post l_post_bright "]')
    item = {}
    for each in content_field:
        # lxml decodes the &quot; entities, leaving plain JSON in the attribute.
        reply_info = json.loads(each.xpath('@data-field')[0].replace('&quot', ''))
        author = reply_info['author']['user_name']
        # Relative XPath: scoped to this post's div, not the whole document.
        content = each.xpath('div[@class="d_post_content_main"]/div/cc/'
                             'div[@class="d_post_content j_d_post_content "]/text()')[0]
        reply_time = reply_info['content']['date']
        item['user_name'] = author
        item['topic_reply_content'] = content
        item['topic_reply_time'] = reply_time
        towrite(item)
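Note that the content lookup uses a relative XPath (no leading //), so it searches only within each post's div rather than the whole page. A minimal sketch of the difference, on a made-up fragment:

from lxml import etree

# Hypothetical fragment: two posts, each with its own body text.
doc = etree.HTML('<div class="post"><p class="body">first</p></div>'
                 '<div class="post"><p class="body">second</p></div>')
for post in doc.xpath('//div[@class="post"]'):
    # Relative path: matches only inside the current post.
    print(post.xpath('p[@class="body"]/text()')[0])
    # With a leading //, every body in the document would match each time.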

For pages that embed their data as JSON, json.loads parses the attribute string into Python objects, e.g.:

reply_info = json.loads(each.xpath('@data-field')[0].replace('&quot',''))

The resulting object contains nested dictionaries (a dict holding other dicts); nested values are read by chaining keys:

 author = reply_info['author']['user_name']
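Put together, a trimmed, hypothetical data-field payload parses like this:

import json

# Hypothetical payload following the structure shown in the markup above.
reply_info = json.loads('{"author": {"user_name": "example_user"},'
                        ' "content": {"date": "2015-01-11 16:39", "post_no": 6}}')
print(reply_info['author']['user_name'])   # example_user
print(reply_info['content']['date'])       # 2015-01-11 16:39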



The complete code for scraping the thread's user_name, post content, and post time follows:

#-*-coding:utf8-*-
from lxml import etree
from multiprocessing.dummy import Pool as ThreadPool
import requests
import json
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

'''Delete content.txt before re-running: the file is opened in append
mode, so output from earlier runs piles up.'''

def towrite(contentdict):
    f.writelines(u'Reply time: ' + str(contentdict['topic_reply_time']) + '\n')
    f.writelines(u'Reply content: ' + unicode(contentdict['topic_reply_content']) + '\n')
    f.writelines(u'Author: ' + contentdict['user_name'] + '\n\n')

def spider(url):
    html = requests.get(url)
    selector = etree.HTML(html.text)
    # Each post lives in a div whose data-field attribute holds JSON metadata.
    content_field = selector.xpath('//div[@class="l_post l_post_bright "]')
    item = {}
    for each in content_field:
        # lxml decodes the &quot; entities, leaving plain JSON in the attribute.
        reply_info = json.loads(each.xpath('@data-field')[0].replace('&quot', ''))
        author = reply_info['author']['user_name']
        # Relative XPath: scoped to this post's div, not the whole document.
        content = each.xpath('div[@class="d_post_content_main"]/div/cc/'
                             'div[@class="d_post_content j_d_post_content "]/text()')[0]
        reply_time = reply_info['content']['date']
        item['user_name'] = author
        item['topic_reply_content'] = content
        item['topic_reply_time'] = reply_time
        towrite(item)

if __name__ == '__main__':
    pool = ThreadPool(4)
    f = open('content.txt', 'a')
    page = []
    for i in range(1, 22):
        newpage = 'http://tieba.baidu.com/p/3522395718?pn=' + str(i)
        page.append(newpage)
    results = pool.map(spider, page)
    pool.close()
    pool.join()
    f.close()
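Despite its name, multiprocessing.dummy exposes the multiprocessing API backed by threads, which suits I/O-bound work such as HTTP requests. A stripped-down sketch of the pool.map pattern used above, where fetch is a hypothetical stand-in for spider:

from multiprocessing.dummy import Pool as ThreadPool

def fetch(url):
    # Stand-in for spider(): pretend to crawl and return something small.
    return len(url)

urls = ['http://tieba.baidu.com/p/3522395718?pn=' + str(i) for i in range(1, 5)]
pool = ThreadPool(4)             # four worker threads
results = pool.map(fetch, urls)  # blocks until every URL has been handled
pool.close()
pool.join()
print(results)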