Scraping Qiushibaike

Source: Internet  Published by: java类方法实例化  Editor: 程序博客网  Time: 2024/06/08 10:56

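Qiushibaike URL: https://www.qiushibaike.com. The goal is to collect, for each post, the user's avatar link, the user name, the joke text, the number of likes, and the number of comments, and to save them in a JSON file.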

Site analysis

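In Firefox, the XPath Checker add-on can be used to check whether the XPath expressions you write are correct. Download and install: https://addons.mozilla.org/zh-cn/firefox/addon/xpath-checker/
1. Post container: //div[contains(@id, "qiushi_tag")]
2. User name (the title attribute of the link): ./div/a/@title
3. Image link: .//div[@class="thumb"]//@src
4. Joke text: .//div[@class="content"]/span
5. Like and comment counts: .//i
Expressions 2 to 5 are relative to each node matched by expression 1.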
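
To see what these expressions select, here is a minimal Python 3 sketch that runs them with lxml against a small hand-written fragment. The markup, the id value, the user name, and the example.com URLs are all made up for illustration and only mimic the structure described above; the real page may differ.

```python
from lxml import etree

# Hypothetical markup that imitates the structure described above.
SAMPLE_HTML = """
<html><body>
<div id="qiushi_tag_1001">
  <div class="author"><a href="/users/1/" title="example_user">example_user</a></div>
  <div class="content"><span>example joke text</span></div>
  <div class="thumb"><a href="/article/1001"><img src="//example.com/pic.jpg"/></a></div>
  <div class="stats"><span><i>12</i> likes</span> <span><i>3</i> comments</span></div>
</div>
<div id="sidebar">not a post</div>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)

# 1. Every div whose id contains "qiushi_tag" is one post.
posts = tree.xpath('//div[contains(@id, "qiushi_tag")]')
post = posts[0]

# 2-5. The remaining expressions are evaluated relative to a post node.
username = post.xpath('./div/a/@title')[0]
images = post.xpath('.//div[@class="thumb"]//@src')
content = post.xpath('.//div[@class="content"]/span')[0].text
counts = [i.text for i in post.xpath('.//i')]

print(username, images, content, counts)
```

Every call to xpath() returns a list, which is why single values are taken with an index.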

Code

Fetching the content

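# Python 2: urllib2 is part of the standard library
import urllib2

url = "http://www.qiushibaike.com/"
# Send a browser-like User-Agent instead of the default Python one
headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
request = urllib2.Request(url, headers=headers)
html = urllib2.urlopen(request).read()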
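
The snippet above uses urllib2, which exists only in Python 2. As a rough Python 3 equivalent, the same request could be built with urllib.request. The helper names build_request and fetch and the 10-second timeout are assumptions made for this sketch, not part of the original code.

```python
import urllib.request

USER_AGENT = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Download the page and return its raw bytes (performs network I/O)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network; only fetch() does.
request = build_request("http://www.qiushibaike.com/")
print(request.get_method(), request.full_url)
```

Calling html = fetch("http://www.qiushibaike.com/") would then return bytes that can be passed to etree.HTML in the same way as before.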

Parsing and storing the content

import json
from lxml import etree

# The response is a string; parse it into an HTML DOM tree
text = etree.HTML(html)
# contains() does a substring match: the first argument is the attribute to test,
# the second is the substring that its value must contain
node_list = text.xpath('//div[contains(@id, "qiushi_tag")]')

for node in node_list:
    # xpath() returns a list; here it holds a single element, so take it by index
    # User name
    username = node.xpath('./div/a/@title')[0]
    # Image links (the [0] index is left off, so this stays a list)
    image = node.xpath('.//div[@class="thumb"]//@src')
    # Joke text
    content = node.xpath('.//div[@class="content"]/span')[0].text
    # Like count
    zan = node.xpath('.//i')[0].text
    # Comment count
    comments = node.xpath('.//i')[1].text

    items = {
        "username": username,
        "image": image,
        "content": content,
        "zan": zan,
        "comments": comments
    }

    # Append one JSON object per line; ensure_ascii=False keeps the Chinese text readable
    with open("qiushi.json", "a") as f:
        f.write(json.dumps(items, ensure_ascii=False).encode("utf-8") + "\n")
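
The write step above depends on Python 2 string handling. In Python 3, json.dumps returns a str, so concatenating its encoded bytes with "\n" raises a TypeError. A minimal sketch of the same one-object-per-line storage for Python 3 follows, with the file opened as UTF-8 text instead. The helper names append_item and load_items, the temporary path, and the sample records are assumptions for illustration only.

```python
import json
import os
import tempfile


def append_item(path, item):
    """Append one record as a single JSON line, keeping non-ASCII text readable."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")


def load_items(path):
    """Read the file back, parsing one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo with made-up records written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "qiushi.json")
sample_items = [
    {"username": "example_user", "image": [], "content": "段子", "zan": "12", "comments": "3"},
    {"username": "another_user", "image": ["//example.com/pic.jpg"], "content": "text", "zan": "0", "comments": "0"},
]
for item in sample_items:
    append_item(demo_path, item)

loaded = load_items(demo_path)
with open(demo_path, encoding="utf-8") as f:
    raw_text = f.read()
```

Because each line is a complete JSON object, the file can be appended to across runs and read back line by line, but the file as a whole is not a single valid JSON document.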

Stored result

[Image: screenshot of the stored result]
