[Open Source] Scraping Amazon Product Reviews with Scrapy

Source: the Internet  Editor: 程序博客网  Time: 2024/05/16 18:16

1. Preface

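       As before, a disclaimer: the Amazon product reviews are scraped here for learning purposes only. If you use this for commercial purposes, you bear the consequences yourself.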

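       The previous post, http://blog.csdn.net/c_son/article/details/43267551, scraped Amazon products. This post builds on it: for each product scraped, we now also scrape the user reviews. The source code is on GitHub: https://github.com/jerry-sc/AmazonIphone6CommentsSpider.git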

2. items.py

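       There is not much to say about this file. It only wraps the data: for each user review we scrape the review text, the review time, and the star rating the user gave.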

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class Amazoniphone6CommentsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    comment_content = scrapy.Field()
    comment_time = scrapy.Field()
    # Star rating
    comment_star = scrapy.Field()
    # # How many people found the review helpful, an Amazon-specific measure.
    # # The format is e.g. 5/7, meaning five out of seven people found it helpful.
    # comment_useful = scrapy.Field()
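
The commented-out comment_useful field would hold a string such as 5/7. As a minimal sketch, assuming that "helpful/total" format holds for every review, a hypothetical helper named parse_useful (not part of the original project) could split the string into two integers before it is stored:

```python
def parse_useful(text):
    """Split a 'helpful/total' string such as '5/7' into (helpful, total).

    Returns None when the text does not match the assumed format.
    """
    parts = text.strip().split("/")
    if len(parts) != 2:
        return None
    try:
        helpful, total = int(parts[0]), int(parts[1])
    except ValueError:
        # One of the two parts was not a whole number.
        return None
    if helpful < 0 or total < helpful:
        # More helpful votes than total votes cannot be right.
        return None
    return helpful, total


print(parse_useful("5/7"))  # (5, 7)
```

Returning None for anything unexpected keeps a malformed value from stopping the crawl; whether to drop or keep such reviews is left to the caller.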

3. Spider

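       The main difference this time is one more level of nested looping: every search results page lists many products, and the reviews for each product in turn span many pages.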

#!/usr/bin/python
# -*- coding: utf-8 -*-

# Written for Python 2 and the older scrapy.spider import path.
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request

from AmazonIphone6Comments.items import Amazoniphone6CommentsItem


class AmazonIphone6CommentsSpider(Spider):
    """Crawl the iPhone 6 search results on amazon.cn and scrape each product's reviews."""

    name = "AmazonIphone6CommentsSpider"
    download_delay = 3
    allowed_domains = ["amazon.cn"]
    # Running total of scraped reviews, printed as a progress indicator
    count = 0
    start_urls = [
        "http://www.amazon.cn/s/ref=nb_sb_noss_1?__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&url=node%3D665002051&field-keywords=iphone6&rh=n%3A664978051%2Cn%3A665002051%2Ck%3Aiphone6"
    ]

    def parse(self, response):
        sel = Selector(response)
        if "product-reviews" in response.url:
            # Review page: emit one item per review block
            comments = sel.xpath("//table[@id='productReviews']//div[@style='margin-left:0.5em;']")
            for comment in comments:
                self.count += 1
                print self.count
                item = Amazoniphone6CommentsItem()
                comment_content = comment.xpath("div[@class='reviewText']/text()").extract()
                comment_time = comment.xpath("div/span/nobr/text()").extract()
                comment_star = comment.xpath("div/span/span/span/text()").extract()
                # comment_useful = comment.xpath("")
                item["comment_content"] = [n.encode('utf-8') for n in comment_content]
                item["comment_time"] = [n.encode('utf-8') for n in comment_time]
                item["comment_star"] = [n.encode('utf-8') for n in comment_star]
                yield item
            # Follow the link to the next page of reviews for the same product
            for next_url in sel.xpath("//table[2]//div[@class='CMpaginate']/span/a[last()]/@href").extract():
                yield Request(next_url, callback=self.parse)
        else:
            # Search results page: follow the review link of every product
            goods = sel.xpath("//li[@class='s-result-item']")
            for good in goods:
                for comment_url in good.xpath("div/div/a[@class='a-size-small a-link-normal a-text-normal']/@href").extract():
                    yield Request(comment_url, callback=self.parse)
            # Follow the link to the next page of search results
            for url in sel.xpath("//a[@id='pagnNextLink']/@href").extract():
                yield Request("http://www.amazon.cn" + url, callback=self.parse)
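
The spider builds the next search page address by concatenating "http://www.amazon.cn" with the href, while the review links are requested as they appear in the page. A possible alternative, sketched here in Python 3 rather than the Python 2 of the spider, is to resolve every href against the URL of the page it came from with urljoin from the standard library, which handles relative and absolute links alike. The helper name absolute_url and the example paths are made up for illustration.

```python
from urllib.parse import urljoin


def absolute_url(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# Hypothetical page address and hrefs, used only to show the behaviour.
PAGE = "http://www.amazon.cn/s/ref=nb_sb_noss_1"

# A site-relative link is completed with the scheme and host of the page.
print(absolute_url(PAGE, "/s/ref=sr_pg_2?page=2"))
# http://www.amazon.cn/s/ref=sr_pg_2?page=2

# A link that is already absolute is returned unchanged.
print(absolute_url(PAGE, "http://www.amazon.cn/product-reviews/EXAMPLE"))
# http://www.amazon.cn/product-reviews/EXAMPLE
```

Inside parse, the page address would be response.url. In Python 2 the same function lives in the urlparse module.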

4. Results

       The crawl took about an hour and collected 12,370 reviews.


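       When I did a rough check of the data, however, I found that some reviews were duplicated, many of them several times over. At first I thought my own code was at fault, but after some debugging I found that reviews on Amazon are attached to the seller rather than to the individual product. In other words, every product from the same shop shows the same reviews, which is why so many of the scraped reviews are identical.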
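
One way to deal with the repeated reviews is to drop them after the crawl. The following is a minimal sketch, not the original project's code: it assumes each scraped item is a dict-like object whose three fields hold lists of strings (as the spider above produces), and it treats two reviews as the same when all three fields match. The names review_key and dedupe_reviews and the sample data are hypothetical.

```python
def review_key(item):
    """Build a hashable key from the three fields the spider collects."""
    return (
        tuple(item.get("comment_content", [])),
        tuple(item.get("comment_time", [])),
        tuple(item.get("comment_star", [])),
    )


def dedupe_reviews(items):
    """Yield each distinct review once, in first-seen order."""
    seen = set()
    for item in items:
        key = review_key(item)
        if key in seen:
            continue  # Same text, time and rating already emitted.
        seen.add(key)
        yield item


# Made-up sample data: the first and third reviews are identical.
sample = [
    {"comment_content": ["Good phone"], "comment_time": ["2015-01-20"], "comment_star": ["5"]},
    {"comment_content": ["Too expensive"], "comment_time": ["2015-01-21"], "comment_star": ["3"]},
    {"comment_content": ["Good phone"], "comment_time": ["2015-01-20"], "comment_star": ["5"]},
]
unique = list(dedupe_reviews(sample))
print(len(unique))  # 2
```

In a Scrapy project the same check could instead run during the crawl, in an item pipeline that raises DropItem for keys it has already seen. Note that two different users posting identical text with the same date and rating would also be merged under this key.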
