Scrapy Study Notes (1): A First Look at Scrapy


Goal: crawl all novels in the first X pages of a chosen category on Qidian (起点) and turn all of their synopses into a word cloud.

Powered by:

  1. Python 3.6
  2. Scrapy 1.4
  3. pymysql
  4. wordcloud
  5. jieba
  6. macOS 10.12.6

项目地址:https://github.com/Dengqlbq/NovelSpiderAndWordcloud.git


Step 1: Create the project

cd YOURPATH
scrapy startproject QiDian

The default structure of the QiDian project:

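Roughly the following (the exact set of files varies slightly across Scrapy versions):

QiDian/
    scrapy.cfg            # deploy configuration
    QiDian/               # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider code lives here
            __init__.py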


Step 2: Define the item

Scrapy's main division of labor (simplified, and specific to this example):

spider    : crawls pages and parses the content, puts the extracted data into an Item, and pushes new Requests onto the queue
item      : the container that holds the extracted data
pipelines : process each Item, e.g. saving it to a database or writing it to a file

The Item determines what data we store; in this example that is the author name, book title, and synopsis.

# items.py
import scrapy


class QiDianNovelItem(scrapy.Item):
    # define the fields for your item here
    name = scrapy.Field()
    author = scrapy.Field()
    intro = scrapy.Field()
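An Item behaves much like a dict, except that only declared fields may be set; assigning to an undeclared field raises a KeyError. A quick sketch:

item = QiDianNovelItem()
item['name'] = 'Example Title'      # OK: 'name' is a declared field
item['author'] = 'Some Author'
# item['price'] = 9.9               # KeyError: 'price' was never declared
print(dict(item))                   # an Item converts cleanly to a plain dict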

Step 3: Write the spider

Create 'QiDianNovelSpider.py' in the spiders folder.

# QiDianNovelSpider.py
from QiDian.items import QiDianNovelItem
from scrapy.spiders import Spider
from scrapy import Request


class QiDianNovelSpider(Spider):
    name = 'qi_dian_novel_spider'
    header = {
        'User-Agent': ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/53.0.2785.143 Safari/537.36')
    }
    page = 1
    # %d is filled in with the page number when each request is built
    url = ('http://f.qidian.com/all?chanId=21&orderId=&vip=hidden'
           '&style=1&pageSize=20&siteid=1&hiddenField=1&page=%d')

    def start_requests(self):
        yield Request(self.url % self.page, headers=self.header)

    def parse(self, response):
        novels = response.xpath('//ul[@class="all-img-list cf"]'
                                '/li/div[@class="book-mid-info"]')
        for novel in novels:
            # build a fresh Item per novel; reusing one mutable Item
            # for every yield is a common pitfall
            item = QiDianNovelItem()
            item['name'] = novel.xpath('.//h4/a/text()').extract()[0]
            item['author'] = novel.xpath('.//p[@class="author"]/a[1]/text()').extract()[0]
            item['intro'] = novel.xpath('.//p[@class="intro"]/text()').extract()[0]
            yield item
        # queue the next page (once per page, outside the loop) up to page 20
        if self.page < 20:
            self.page += 1
            yield Request(self.url % self.page, headers=self.header)
How the pieces fit together:

  1. When the spider first starts, start_requests() supplies the Request objects; after that, Requests are taken from the queue
  2. The spider fetches the page for each Request and wraps the result in a Response object
  3. By default the spider handles a Response with parse(); a Request can also name its own callback (see the sketch below)
  4. parse() unpacks the Response, extracts structured data into Items, and puts new Requests onto the queue
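A minimal sketch of a custom callback, assuming a hypothetical parse_detail() method that is not part of this project's spider:

# inside QiDianNovelSpider: hand each book's detail page to its own callback
def parse(self, response):
    for href in response.xpath('//h4/a/@href').extract():
        # urljoin resolves relative (and protocol-relative) links
        yield Request(response.urljoin(href), callback=self.parse_detail,
                      headers=self.header)

def parse_detail(self, response):
    # parse_detail is hypothetical; extract per-book fields here
    ...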

Step 4: Write the pipeline

Once the spider and item are written the crawler can already run; the Item data will be printed to the screen,
and it can also be written to a file via command-line options. Here, though, we want to store it in a database.
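For example, Scrapy's built-in feed export can write the items straight to a file with the -o flag (the filename here is arbitrary):

scrapy crawl qi_dian_novel_spider -o novels.json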

# pipelines.py
import pymysql


class QiDianPipeline(object):
    def __init__(self):
        # open one connection when the pipeline is instantiated
        self.connect = pymysql.connect(
            host='127.0.0.1',
            db='Scrapy_test',
            user='Your_user',
            passwd='Your_pass',
            charset='utf8',
            use_unicode=True)
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        # called once per Item yielded by the spider
        sql = 'insert into Scrapy_test.novel(name,author,intro) values (%s,%s,%s)'
        self.cursor.execute(sql, (item['name'], item['author'], item['intro']))
        self.connect.commit()
        return item
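The pipeline above never closes its connection. Scrapy also calls an optional close_spider() hook on pipelines when the crawl finishes, so cleanup could be added like this (a sketch layered onto the class above):

    def close_spider(self, spider):
        # called once when the spider closes; release the DB resources
        self.cursor.close()
        self.connect.close()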

Then register the item pipeline in settings.py:

ITEM_PIPELINES = {
    'QiDian.pipelines.QiDianPipeline': 300,
}
Two notes:

  1. Remember to create the database (and the novel table) first
  2. Database credentials are best kept in settings.py and read back at runtime, e.g. host=settings['MYSQL_HOST']

Both tips are sketched below.
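For the first tip, a one-off setup script could look like this; the column types are assumptions chosen only to fit what the pipeline inserts:

# setup_db.py -- run once before crawling (column types are assumptions)
import pymysql

connect = pymysql.connect(host='127.0.0.1', user='Your_user',
                          passwd='Your_pass', charset='utf8')
cursor = connect.cursor()
cursor.execute('CREATE DATABASE IF NOT EXISTS Scrapy_test CHARACTER SET utf8')
cursor.execute('''CREATE TABLE IF NOT EXISTS Scrapy_test.novel (
                      id INT AUTO_INCREMENT PRIMARY KEY,
                      name VARCHAR(128),
                      author VARCHAR(64),
                      intro TEXT)''')
connect.close()

For the second tip, Scrapy hands settings to a pipeline through the from_crawler() hook; the MYSQL_* key names below are assumptions you would add to settings.py yourself:

# pipelines.py variant: read DB config from settings.py
# (MYSQL_* keys are assumed names; __init__ must be changed to accept them)
@classmethod
def from_crawler(cls, crawler):
    settings = crawler.settings
    return cls(host=settings['MYSQL_HOST'],
               db=settings['MYSQL_DB'],
               user=settings['MYSQL_USER'],
               passwd=settings['MYSQL_PASSWD'])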

Step 5: Crawl the data

The code is written; let's check that the spider works properly:

cd YOURPATH
scrapy crawl qi_dian_novel_spider



Step 6: Build the word cloud

The spider works and the data is in the database; next we build the word cloud.

# mywordcloud.py
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from PIL import Image
import numpy
import jieba
import pymysql

connect = pymysql.connect(
    host='127.0.0.1',
    db='Scrapy_test',
    user='Your_user',
    passwd='Your_pass',
    charset='utf8',
    use_unicode=True)
cursor = connect.cursor()

# pull every synopsis out of the table and join them into one long string
sql = 'select intro from Scrapy_test.novel'
cursor.execute(sql)
result = cursor.fetchall()

txt = ''
for r in result:
    txt += r[0].strip() + '。'

# segment the Chinese text with jieba, then space-join it for wordcloud
wordlist = jieba.cut(txt)
ptxt = ' '.join(wordlist)

image = numpy.array(Image.open('Girl.png'))     # custom mask image
# use a font that supports Chinese: wordcloud's bundled font does not,
# and Chinese characters would come out garbled
wc = WordCloud(background_color='white', max_words=500, max_font_size=60,
               mask=image, font_path='FangSong_GB2312.ttf').generate(ptxt)

plt.imshow(wc)
plt.axis("off")
plt.show()
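If the cloud ends up dominated by filler words, WordCloud also takes a stopwords parameter; its built-in list covers English only, so a Chinese set has to be supplied by hand (the words below are an arbitrary illustration):

# filter common filler words out of the cloud (this set is illustrative)
stop = {'一个', '自己', '没有', '什么', '他们', '我们'}
wc = WordCloud(background_color='white', max_words=500, max_font_size=60,
               mask=image, font_path='FangSong_GB2312.ttf',
               stopwords=stop).generate(ptxt)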

The resulting word clouds: (screenshots omitted)

