A First Taste of Scrapy


Scrapy is a Python crawler framework. Just setting up the environment on Ubuntu cost me two days >_< The trouble was that 14.04 LTS ships with Python 2.7 built in; switching to 3.4, and then installing pip3 and OpenSSL, ran into quite a few pitfalls. Fortunately I eventually found the answers on Baidu and got everything installed.
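
For anyone attempting the same setup, the dependencies roughly come down to something like the following. This is only a sketch based on Scrapy's Ubuntu install notes, not the exact commands I ran, and package names may differ on your system:

sudo apt-get install python3-dev python3-pip libssl-dev libffi-dev libxml2-dev libxslt1-dev
sudo pip3 install scrapy
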
Once the excitement wore off, it was time to get to work. Following the Scrapy 1.2 documentation, I read through the first two chapters and wrote a simple spider. The target, of course, is panda.tv again, haha~

Now, on to the code ->

Run on the command line (this creates the project directory):

scrapy startproject panda
cd panda
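
startproject generates the standard Scrapy skeleton, roughly like this (the exact files vary a little between Scrapy versions):

panda/
    scrapy.cfg          # deploy configuration
    panda/
        __init__.py
        items.py        # item definitions
        pipelines.py    # item pipelines
        settings.py     # project settings
        spiders/        # spider modules live here
            __init__.py

The spider code below goes into a new file under panda/spiders/, for example panda/spiders/panda_spider.py (the filename is my own choice here; Scrapy only cares about the spider's name attribute):
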
import scrapy

# Define the data type for a scraped live room
class Video(scrapy.Item):
    title = scrapy.Field()
    name = scrapy.Field()
    population = scrapy.Field()
    category = scrapy.Field()

class PandaSpider(scrapy.Spider):
    name = 'panda'    # spider name, must be unique
    allowed_domains = ['panda.tv']
    start_urls = ['http://www.panda.tv/all']

    # def start_requests(self):
    #     yield scrapy.Request('http://www.panda.tv/all', self.parse)

    def parse(self, response):
        video = Video()
        # video = {}
        item = response.xpath('//a[@class="video-list-item-wrap"]')
        for info in item:
            subinfo = info.xpath('.//div[@class="video-info"]')
            video['title'] = info.xpath('.//div[@class="video-title"]/text()').extract()
            video['name'] = subinfo.xpath('.//span[@class="video-nickname"]/text()').extract()
            video['population'] = subinfo.xpath('.//span[@class="video-number"]/text()').extract()
            video['category'] = subinfo.xpath('.//span[@class="video-cate"]/text()').extract()
            yield {
                'title': video['title'],
                'name': video['name'],
                'population': video['population'],
                'category': video['category']
            }
        # Pagination, still needs work
        # next_page = response.css('a.j-page-next::attr(href)').extract()
        # if next_page is not None:
        #     next_page = response.urljoin(next_page)
        #     yield scrapy.Request(next_page, self.parse)
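
Two small follow-ups on the spider above. Since a Video item is already being filled in, it could simply be yielded (yield video) instead of copying the fields into a dict; Scrapy treats Items and plain dicts the same way downstream. And the commented-out pagination stub needs one fix before it can be enabled: extract() returns a list, so extract_first() (or indexing) is needed before the URL is joined. A minimal sketch of how the end of parse() might look, assuming the a.j-page-next selector really does match the site's "next page" link:

        # At the end of parse(), after the for loop:
        next_page = response.css('a.j-page-next::attr(href)').extract_first()
        if next_page is not None:
            # Turn the relative href into an absolute URL and re-enter parse() on the next page
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)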

Run on the command line (the argument is the spider name):

scrapy crawl panda
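
By default the scraped items only show up in the crawl log; to keep them, the same command can write a feed file via the -o flag, for example:

scrapy crawl panda -o videos.json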

The results:
(screenshot of the crawl output)

Compared with the Node.js crawler I wrote earlier, this feels a lot more convenient~ That's the first step done; more learning to come!
