[Open Source] Scraping Amazon Product Reviews with Scrapy

Source: Internet. Editor: 程序博客网. Time: 2024/05/16 18:16

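1. Preface

As before, a disclaimer: scraping Amazon product reviews here is for learning only. If you use it for commercial purposes, you bear the consequences.

The previous post, http://blog.csdn.net/c_son/article/details/43267551, covered scraping Amazon products. This time, building on that work, we go on to scrape the user reviews of the products we found. The source code is on GitHub: https://github.com/jerry-sc/AmazonIphone6CommentsSpider.git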

2. items.py

There is not much to say about this file; it simply wraps the data. For each review we scrape the text, the date, and the user's star rating.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class Amazoniphone6CommentsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    comment_content = scrapy.Field()
    comment_time = scrapy.Field()
    # star rating
    comment_star = scrapy.Field()
    # # How many people found the review useful, an Amazon-specific measure;
    # # e.g. 5/7 means five out of seven people found the review useful
    # comment_useful = scrapy.Field()

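3. Spider

The main addition this time is a nested loop: each search-result page lists many products, and each product's reviews in turn span many pages.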
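To picture the nesting without any network access, here is a toy model in plain Python. The data and the names `RESULT_PAGES`, `REVIEW_PAGES`, and `iter_reviews` are invented for illustration. Scrapy schedules requests asynchronously, so a real crawl does not visit pages in this strict order; the sketch only shows how the loops nest.

```python
# Toy stand-in for the site: each result page lists products, and each
# product has several pages of reviews. All names and data are invented.
RESULT_PAGES = [
    {"products": ["A", "B"]},
    {"products": ["C"]},
]
REVIEW_PAGES = {
    "A": [["a1", "a2"], ["a3"]],
    "B": [["b1"]],
    "C": [["c1"], ["c2", "c3"]],
}


def iter_reviews(result_pages, review_pages):
    """Yield (product, review) pairs in nested-loop order."""
    for page in result_pages:                      # outer loop: result pages
        for product in page["products"]:           # products on that page
            for reviews in review_pages[product]:  # review pages of a product
                for review in reviews:             # reviews on one page
                    yield product, review


all_reviews = list(iter_reviews(RESULT_PAGES, REVIEW_PAGES))
print(len(all_reviews))
```

In the real spider, following a link with a new request plays the role of each loop level.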

#!/usr/bin/python
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request

from AmazonIphone6Comments.items import Amazoniphone6CommentsItem


class AmazonIphone6CommentsSpider(scrapy.Spider):
    """Crawl iphone6 search results on amazon.cn and scrape each product's reviews."""

    name = "AmazonIphone6CommentsSpider"
    download_delay = 3
    allowed_domains = ["amazon.cn"]
    count = 0  # running total of scraped reviews
    start_urls = [
        "http://www.amazon.cn/s/ref=nb_sb_noss_1?__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&url=node%3D665002051&field-keywords=iphone6&rh=n%3A664978051%2Cn%3A665002051%2Ck%3Aiphone6"
    ]

    def parse(self, response):
        # The XPath expressions below are the ones from the original post;
        # they depend on the page layout at the time of writing.
        if "product-reviews" in response.url:
            # Review page: emit one item per review block.
            comments = response.xpath("//table[@id='productReviews']//div[@style='margin-left:0.5em;']")
            for comment in comments:
                self.count += 1
                print(self.count)
                item = Amazoniphone6CommentsItem()
                # extract() returns a list of unicode strings, so no explicit
                # UTF-8 encoding is needed on Python 3.
                item["comment_content"] = comment.xpath("div[@class='reviewText']/text()").extract()
                item["comment_time"] = comment.xpath("div/span/nobr/text()").extract()
                item["comment_star"] = comment.xpath("div/span/span/span/text()").extract()
                # comment_useful = comment.xpath("")
                yield item
            # Follow the last link in the pagination bar to the next page of reviews.
            for next_url in response.xpath("//table[2]//div[@class='CMpaginate']/span/a[last()]/@href").extract():
                yield Request(response.urljoin(next_url), callback=self.parse)
        else:
            # Search-result page: queue each product's review link,
            # then the next page of results.
            goods = response.xpath("//li[@class='s-result-item']")
            for good in goods:
                for comment_url in good.xpath("div/div/a[@class='a-size-small a-link-normal a-text-normal']/@href").extract():
                    yield Request(response.urljoin(comment_url), callback=self.parse)
            for url in response.xpath("//a[@id='pagnNextLink']/@href").extract():
                yield Request(response.urljoin(url), callback=self.parse)
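Links pulled from `@href` attributes may be relative or absolute, and a request needs an absolute URL. The following minimal sketch uses the standard library's `urljoin` to show how both cases resolve against the page URL. The page URL is a shortened stand-in and both hrefs are made up for illustration.

```python
from urllib.parse import urljoin

# Shortened stand-in for the search-result page URL (illustrative only).
page_url = "http://www.amazon.cn/s/ref=nb_sb_noss_1?field-keywords=iphone6"

relative_href = "/s/ref=sr_pg_2?page=2"                         # made-up relative link
absolute_href = "http://www.amazon.cn/product-reviews/EXAMPLE"  # made-up absolute link

# A relative href is resolved against the page's scheme and host.
next_page = urljoin(page_url, relative_href)
# An absolute href is returned unchanged.
review_page = urljoin(page_url, absolute_href)

print(next_page)
print(review_page)
```

Scrapy's `response.urljoin(href)` applies the same resolution using the response's own URL as the base, so one call covers both kinds of link.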

4. Results

The crawl took about an hour and collected 12,370 reviews.
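As a rough sanity check, these figures can be compared with the spider's `download_delay = 3`. The calculation below is a back-of-the-envelope sketch. It assumes requests were issued one after another, exactly three seconds apart, and that the run lasted exactly one hour; it ignores download time and the requests spent on search-result pages.

```python
download_delay = 3          # seconds, from the spider's class attribute
elapsed_seconds = 60 * 60   # "about an hour", taken as exactly one hour
reviews_scraped = 12370     # figure reported in the post

# Upper bound on requests if each one waits the full delay.
max_requests = elapsed_seconds // download_delay
# Average number of reviews each request would have to yield.
reviews_per_request = reviews_scraped / max_requests

print(max_requests)
print(round(reviews_per_request, 1))
```

Under those assumptions the crawl made at most about 1,200 requests, or roughly ten reviews per request on average.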

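However, a quick look through the data showed that some reviews were duplicated, and many times over. At first I thought my code was at fault, but after some debugging I found that Amazon's reviews are tied to the seller rather than to the individual product. Every item from the same shop carries the same set of reviews, which is why so many of the scraped reviews are identical.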
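One way to deal with such repeats is to drop items whose fields match a review already seen. The following is a minimal sketch using only the standard library. The helper names `review_fingerprint` and `drop_duplicates` and the sample data are invented for illustration, and it assumes each field holds a list of text fragments, as the spider produces.

```python
import hashlib


def review_fingerprint(item):
    """Hash the review's content, time and rating into a stable key."""
    parts = []
    for field in ("comment_content", "comment_time", "comment_star"):
        # Each field is a list of text fragments; join and trim them.
        parts.append("".join(item.get(field, [])).strip())
    # A separator character keeps adjacent fields from running together.
    return hashlib.sha1("\x1f".join(parts).encode("utf-8")).hexdigest()


def drop_duplicates(items):
    """Yield each distinct review once, keeping first-seen order."""
    seen = set()
    for item in items:
        key = review_fingerprint(item)
        if key not in seen:
            seen.add(key)
            yield item


# Invented sample: the second entry repeats the first, apart from a trailing space.
sample = [
    {"comment_content": ["Great phone"], "comment_time": ["2015-01-20"], "comment_star": ["5.0"]},
    {"comment_content": ["Great phone "], "comment_time": ["2015-01-20"], "comment_star": ["5.0"]},
    {"comment_content": ["Battery is weak"], "comment_time": ["2015-01-21"], "comment_star": ["3.0"]},
]
unique = list(drop_duplicates(sample))
print(len(unique))
```

Within Scrapy, the same check could live in an item pipeline whose `process_item` raises `scrapy.exceptions.DropItem` for a fingerprint it has already seen. Note that two genuinely different reviews with identical text, date, and rating would also be merged by this approach.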

0 0