Crawler in Practice: Scrapy, Douban Movie Reviews and Deep Crawling

Source: the Internet. Published by: 发泥推荐知乎. Editor: 程序博客网. Time: 2024/04/30 06:00

Link Extractors
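Link extractors are objects whose only purpose is to extract, from web pages (scrapy.http.Response objects), the links that will eventually be followed.

Scrapy provides scrapy.linkextractors.LinkExtractor, but you can also create your own custom link extractor to suit your needs by implementing a simple interface.

The only public method every link extractor has is extract_links, which receives a Response object and returns a list of scrapy.link.Link objects. A link extractor is meant to be instantiated once, and its extract_links method is then called many times, once per response, to extract links.

Link extractors are used in the CrawlSpider class (available in Scrapy) through a set of rules, but you can also use them in your own spider even if it does not subclass CrawlSpider, because their purpose is simple: extracting links.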

Built-in link extractor reference
The link extractor classes bundled with Scrapy live in the scrapy.linkextractors module. The default link extractor is LinkExtractor, which is in fact LxmlLinkExtractor:

from scrapy.linkextractors import LinkExtractor

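For example, to extract links from this code:

<a href="javascript:goToPage('../other/page.html'); return false">Link text</a>

you can use the following process_value function:

import re

def process_value(value):
    # Capture the first argument passed to goToPage(...)
    m = re.search(r"javascript:goToPage\('(.*?)'", value)
    if m:
        return m.group(1)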
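To see what that function returns, here is a minimal standalone sketch that runs it on the sample href outside of Scrapy. The second input, "/plain/link", is a made-up value for illustration. Note that, as written, the function returns None for any value that does not match the pattern, and Scrapy's documentation says a None return from process_value makes the extractor ignore that link.

```python
import re

# Same pattern as in the text: capture the first argument of goToPage(...)
GOTO_PAGE = re.compile(r"javascript:goToPage\('(.*?)'")


def process_value(value):
    """Return the path hidden inside a javascript:goToPage(...) href, else None."""
    m = GOTO_PAGE.search(value)
    if m:
        return m.group(1)
    return None


href = "javascript:goToPage('../other/page.html'); return false"
print(process_value(href))           # ../other/page.html
print(process_value("/plain/link"))  # None (hypothetical non-matching value)
```

If ordinary links should survive as well, one option is to return value unchanged in the non-matching case instead of None.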

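In the regular expression:

'.' matches any character except a newline

'*' matches the preceding character zero or more times

'?' matches the preceding character zero or one time; placed right after '*', as in '.*?', it instead makes the match non-greedy (as few characters as possible)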
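The difference between '.*' and '.*?' matters as soon as the text contains more than one closing quote. The short sketch below uses a made-up string with two goToPage calls to show it; only the standard re module is involved.

```python
import re

# Hypothetical input containing two goToPage(...) calls
text = "goToPage('a.html'); goToPage('b.html')"

# Greedy: '.*' runs to the LAST quote it can still match
greedy = re.search(r"goToPage\('(.*)'", text).group(1)

# Non-greedy: '.*?' stops at the FIRST quote
lazy = re.search(r"goToPage\('(.*?)'", text).group(1)

print(greedy)  # a.html'); goToPage('b.html
print(lazy)    # a.html
```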

LxmlLinkExtractor

class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=True, unique=True, process_value=None)

allow (a regular expression (or list of)) - a single regular expression (or list of regular expressions) that the (absolute) URL must match in order to be extracted. If not given (or empty), it will match all links.

First create a project. The project layout is as follows:

$ tree
.
├── douban_Music
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   │   ├── __init__.cpython-36.pyc
│   │   ├── items.cpython-36.pyc
│   │   ├── pipelines.cpython-36.pyc
│   │   └── settings.cpython-36.pyc
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── __pycache__
│       │   ├── __init__.cpython-36.pyc
│       │   └── reviemspider.cpython-36.pyc
│       └── reviemspider.py
├── Movie.txt   # the final generated txt file
├── Music.txt   # music scraping; I later switched to movie reviews
└── scrapy.cfg

4 directories, 16 files

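Wolf Warrior 2 (《战狼2》) has been the hot film lately, so that is the one: crawl its popular reviews at https://movie.douban.com/subject/26363254/

The number of reviews made my head spin, so as a first pass I decided to grab only the reviews rated above three stars.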

$ cat items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy import Item, Field


# Movie review
class MovieReviewItem(Item):
    review_movie = Field()
    review_title = Field()    # review title
    review_content = Field()  # review body
    review_author = Field()   # reviewer ID
    review_useful = Field()   # number of "useful" votes
    review_rating = Field()   # star rating
    review_time = Field()     # review time
    review_url = Field()      # review link

The usual routine; next comes the settings file:

BOT_NAME = 'douban_Music'

SPIDER_MODULES = ['douban_Music.spiders']
NEWSPIDER_MODULE = 'douban_Music.spiders'

DOWNLOAD_DELAY = 3
DEPTH_LIMIT = 4
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'

ITEM_PIPELINES = {
    'douban_Music.pipelines.DoubanMusicPipeline': 300,
}

The settings above include one I had never used before:

DEPTH_LIMIT = 4

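My understanding is that the review pages reached from https://movie.douban.com/subject/26363254/ are links four levels deep, and that the crawl follows links on that basis. That is how I read it, though I am not sure it is right. For reference, Scrapy documents DEPTH_LIMIT as the maximum depth that will be allowed to crawl for any site: the start URL has depth 0, each followed link adds 1, and requests deeper than the limit are dropped.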
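To make the depth counting concrete, here is a small pure-Python model of a depth-limited crawl. It is not Scrapy's implementation: the link graph, the review IDs and the breadth-first order are all assumptions made for illustration, and real depths depend on the path by which Scrapy first reaches each page.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "/subject/26363254/": ["/subject/26363254/reviews"],
    "/subject/26363254/reviews": ["/subject/26363254/reviews?sort=hotest"],
    "/subject/26363254/reviews?sort=hotest": [
        "/review/1/",
        "/subject/26363254/reviews?sort=hotest&start=20",
    ],
    "/subject/26363254/reviews?sort=hotest&start=20": [
        "/review/2/",
        "/subject/26363254/reviews?sort=hotest&start=40",
    ],
    "/subject/26363254/reviews?sort=hotest&start=40": ["/review/3/"],
}


def crawl_depths(start, depth_limit):
    """Breadth-first walk that records each page's depth and drops deeper links."""
    depths = {start: 0}  # the start URL counts as depth 0
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in LINKS.get(page, []):
            depth = depths[page] + 1
            if link in depths or depth > depth_limit:
                continue  # already seen, or beyond the limit
            depths[link] = depth
            queue.append(link)
    return depths


for url, depth in crawl_depths("/subject/26363254/", depth_limit=4).items():
    print(depth, url)
```

In this toy graph the review linked from the third results page would sit at depth 5, so a limit of 4 never reaches it. If the real site behaves similarly, DEPTH_LIMIT = 4 caps how many result pages contribute reviews.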

$ cat reviemspider.py
#!/usr/bin/env python
# coding=utf-8
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from douban_Music.items import MovieReviewItem


class ReviewSpider(CrawlSpider):
    name = 'review'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/subject/26363254/']
    rules = (
        # The page after the start page: the reviews
        Rule(LinkExtractor(allow=r"/subject/\d+/reviews$")),
        # Choose the "most popular" sort option
        Rule(LinkExtractor(allow=r"/subject/\d+/reviews\?sort=hotest$")),
        # Walk through the result pages
        Rule(LinkExtractor(allow=r"/subject/\d+/reviews\?sort=hotest&start=\d+$")),
        # Full-text review page
        Rule(LinkExtractor(allow=r"/review/\d+/$"), callback="parse_review", follow=True),
    )

    def parse_review(self, response):
        try:
            # Some reviews link to the author's other reviews at the bottom, which
            # misleads the crawler, so check the movie name first
            movie_name = response.xpath('//*[@class="main-hd"]/a[2]/text()').extract()
            rating = response.xpath('//*[@property ="v:rating"]/text()').extract()
            name = "战狼2"
            print(movie_name[0])
            if movie_name[0] == name and int(rating[0]) > 3:
                item = MovieReviewItem()
                item['review_movie'] = "".join(response.xpath('//*[@class="main-hd"]/a[2]/text()').extract())
                item['review_title'] = "".join(response.xpath('//*[@property="v:summary"]/text()').extract())
                content = "".join(response.xpath('//*[@id="link-report"]/div[@property="v:description"]/text()').extract()[0])
                item['review_rating'] = "".join(response.xpath('//*[@property ="v:rating"]/text()').extract())
                item['review_content'] = content.lstrip().rstrip().replace("\n", " ")
                item['review_author'] = "".join(response.xpath('//*[@property = "v:reviewer"]/text()').extract())
                useful = "".join(response.xpath('//*[@class="main-panel-useful"]/button[1]/text()').extract())
                item['review_useful'] = useful.lstrip().rstrip().replace("\n", "")
                item['review_time'] = "".join(response.xpath('//*[@property="v:dtreviewed"]/text()').extract())
                item['review_url'] = response.url
                yield item
            else:
                print("Movie: {}\t Rating: {}".format(movie_name[0], rating[0]))
                print("Review reached through a wrong link! Skipping.")
        except Exception as error:
            # The spider's built-in logger replaces the old scrapy.log module
            self.logger.error(error)
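The four allow patterns can be checked without running a crawl. The sketch below assumes that LinkExtractor applies each allow pattern as a regular-expression search against the absolute URL. The rule labels, the review ID 1234567 and the /comments URL are made up for illustration.

```python
import re

# The allow patterns from the spider's rules, keyed by a made-up label
RULE_PATTERNS = {
    "reviews": r"/subject/\d+/reviews$",
    "hottest": r"/subject/\d+/reviews\?sort=hotest$",
    "paging": r"/subject/\d+/reviews\?sort=hotest&start=\d+$",
    "review": r"/review/\d+/$",
}


def matching_rules(url):
    """Return the labels of every rule pattern found in the URL."""
    return [name for name, pattern in RULE_PATTERNS.items() if re.search(pattern, url)]


BASE = "https://movie.douban.com"
SAMPLE_URLS = [
    BASE + "/subject/26363254/reviews",
    BASE + "/subject/26363254/reviews?sort=hotest",
    BASE + "/subject/26363254/reviews?sort=hotest&start=20",
    BASE + "/review/1234567/",          # hypothetical review ID
    BASE + "/subject/26363254/comments",  # should match nothing
]

for url in SAMPLE_URLS:
    print(matching_rules(url), url)
```

Because every pattern ends in '$', each sample URL matches at most one rule, which is what keeps the four crawl stages separate.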

# Some reviews link to the author's other reviews at the bottom, which
# misleads the crawler, so check the movie name first
movie_name = response.xpath('//*[@class="main-hd"]/a[2]/text()').extract()
rating = response.xpath('//*[@property ="v:rating"]/text()').extract()
name = "战狼2"
print(movie_name[0])
if movie_name[0] == name and int(rating[0]) > 3:

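Some reviews end with links to the author's reviews of other films, so this check is there to avoid picking up reviews of other movies. But can the check be made before the page is requested at all? I have not resolved that concern yet.

Some reviews contain images or </pr>, and in those cases the review text is read incorrectly. Rather embarrassing. I will fill in these holes little by little.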
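One likely cause is that the spider keeps only extract()[0], the first text node directly under the description div, so any text after an image or nested tag is lost. A possible workaround, sketched below, is to collect every text fragment and normalise the whitespace. It assumes the fragments come from a descendant-text query such as '//*[@id="link-report"]/div[@property="v:description"]//text()', which I have not verified against Douban's markup. The sample fragments are invented.

```python
def clean_fragments(fragments):
    """Join text fragments and collapse every run of whitespace to one space."""
    return " ".join(" ".join(fragments).split())


# Hypothetical output of .extract() for a review whose text is split by an image
fragments = ["\n  First paragraph. ", "\n", "  Second paragraph,\nafter an image.\n"]
print(clean_fragments(fragments))
# First paragraph. Second paragraph, after an image.
```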

$ cat pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import os


class DoubanMusicPipeline(object):
    def process_item(self, item, spider):
        base_dir = os.getcwd()
        file_name = base_dir + '/Movie.txt'
        # Append each review; utf-8 keeps the Chinese text intact on any platform
        with open(file_name, 'a', encoding='utf-8') as f:
            f.write(item['review_movie'] + '\n')
            f.write(item['review_title'] + '\t')
            f.write(item['review_author'] + '\t')
            f.write(item['review_time'] + '\n')
            f.write(item['review_rating'] + ' stars\t')
            f.write(item['review_useful'] + '\n')
            #f.write(item['review_recommend']+'\n')
            f.write(item['review_content'] + '\n')
            f.write(item['review_url'] + '\n\n')
        return item
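The pipeline above reopens Movie.txt for every item. A possible variant, sketched below, opens the file once using the open_spider and close_spider hooks that Scrapy calls on item pipelines. The class name ReviewFilePipeline and the optional file_name argument are my own additions for illustration, and the record layout simply mirrors the one above.

```python
import os


class ReviewFilePipeline:
    """Hypothetical variant of DoubanMusicPipeline that keeps one file handle open."""

    def __init__(self, file_name=None):
        # Default to Movie.txt in the working directory, as in the original
        self.file_name = file_name or os.path.join(os.getcwd(), "Movie.txt")
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.file_name, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes
        if self.file is not None:
            self.file.close()
            self.file = None

    def process_item(self, item, spider):
        # Write one record in the same layout as the original pipeline
        self.file.write(
            "{review_movie}\n{review_title}\t{review_author}\t{review_time}\n"
            "{review_rating} stars\t{review_useful}\n"
            "{review_content}\n{review_url}\n\n".format(**item)
        )
        return item
```

To try it, the ITEM_PIPELINES entry would need to point at this class instead of DoubanMusicPipeline.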

That is the simple, extremely bare-bones review crawler; I will add the remaining features over time.
