A First Taste of Scrapy


Scrapy is a Python crawler framework. Just setting up the environment on Ubuntu cost me two days >_< The trouble was that 14.04 LTS ships with Python 2.7 built in; switching to 3.4, and then installing pip3 and OpenSSL, ran into quite a few pitfalls. Fortunately I eventually found the answers on Baidu and got everything installed.
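
For anyone attempting the same setup, the dependencies roughly come down to something like the following. This is only a sketch based on Scrapy's Ubuntu install notes, not the exact commands I ran, and package names may differ on your system:

sudo apt-get install python3-dev python3-pip libssl-dev libffi-dev libxml2-dev libxslt1-dev
sudo pip3 install scrapy
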
Once the excitement wore off, it was time to get to work. Following the Scrapy 1.2 documentation, I read through the first two chapters and wrote a simple spider. The target, of course, is panda.tv again, haha~

Now, on to the code ->

Run on the command line (this creates the project directory):

scrapy startproject panda
cd panda
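
startproject generates the standard Scrapy skeleton, roughly like this (the exact files vary a little between Scrapy versions):

panda/
    scrapy.cfg          # deploy configuration
    panda/
        __init__.py
        items.py        # item definitions
        pipelines.py    # item pipelines
        settings.py     # project settings
        spiders/        # spider modules live here
            __init__.py

The spider code below goes into a new file under panda/spiders/, for example panda/spiders/panda_spider.py (the filename is my own choice here; Scrapy only cares about the spider's name attribute):
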
import scrapy

# Define the data type for a scraped live room
class Video(scrapy.Item):
    title = scrapy.Field()
    name = scrapy.Field()
    population = scrapy.Field()
    category = scrapy.Field()

class PandaSpider(scrapy.Spider):
    name = 'panda'    # spider name, must be unique
    allowed_domains = ['panda.tv']
    start_urls = ['http://www.panda.tv/all']

    # def start_requests(self):
    #     yield scrapy.Request('http://www.panda.tv/all', self.parse)

    def parse(self, response):
        video = Video()
        # video = {}
        item = response.xpath('//a[@class="video-list-item-wrap"]')
        for info in item:
            subinfo = info.xpath('.//div[@class="video-info"]')
            video['title'] = info.xpath('.//div[@class="video-title"]/text()').extract()
            video['name'] = subinfo.xpath('.//span[@class="video-nickname"]/text()').extract()
            video['population'] = subinfo.xpath('.//span[@class="video-number"]/text()').extract()
            video['category'] = subinfo.xpath('.//span[@class="video-cate"]/text()').extract()
            yield {
                'title': video['title'],
                'name': video['name'],
                'population': video['population'],
                'category': video['category']
            }
        # Pagination, still needs work
        # next_page = response.css('a.j-page-next::attr(href)').extract()
        # if next_page is not None:
        #     next_page = response.urljoin(next_page)
        #     yield scrapy.Request(next_page, self.parse)
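
Two small follow-ups on the spider above. Since a Video item is already being filled in, it could simply be yielded (yield video) instead of copying the fields into a dict; Scrapy treats Items and plain dicts the same way downstream. And the commented-out pagination stub needs one fix before it can be enabled: extract() returns a list, so extract_first() (or indexing) is needed before the URL is joined. A minimal sketch of how the end of parse() might look, assuming the a.j-page-next selector really does match the site's "next page" link:

        # At the end of parse(), after the for loop:
        next_page = response.css('a.j-page-next::attr(href)').extract_first()
        if next_page is not None:
            # Turn the relative href into an absolute URL and re-enter parse() on the next page
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)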

Run on the command line (the argument is the spider name):

scrapy crawl panda
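
By default the scraped items only show up in the crawl log; to keep them, the same command can write a feed file via the -o flag, for example:

scrapy crawl panda -o videos.json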

The results:
(screenshot of the crawl output)

Compared with the Node.js crawler I wrote earlier, this feels a lot more convenient~ That's the first step done; more learning to come!
