Scrapy Study Notes (1)——A First Look at Scrapy
Published on 程序博客网, 2024/05/07 15:52
Goal: crawl all novels in the first X pages of category X on Qidian (起点) and build a word cloud from their blurbs
Powered by:
- Python 3.6
- Scrapy 1.4
- pymysql
- wordcloud
- jieba
- macOS 10.12.6
Project repository: https://github.com/Dengqlbq/NovelSpiderAndWordcloud.git
Step 1——Create the project
```shell
cd YOURPATH
scrapy startproject QiDian
```
The default structure of the QiDian project:
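For Scrapy 1.4 the generated layout looks roughly like this (the inline notes are mine):

```text
QiDian/
├── scrapy.cfg            # deploy configuration
└── QiDian/               # the project's Python module
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # spider/downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/          # your spiders live here
        └── __init__.py
```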
Step 2——Write the item
Scrapy's division of labour (simplified, and specific to this example):

- spider: crawls pages and parses them, fills Items with the extracted content, and puts new requests on the queue
- item: the container that holds the scraped content
- pipelines: process each Item, e.g. save it to a database or write it to a file
The Item defines what data we store; in this example that is the author's name, the book title, and the blurb.
```python
# items.py
import scrapy


class QiDianNovelItem(scrapy.Item):
    # define the fields for your item here
    name = scrapy.Field()
    author = scrapy.Field()
    intro = scrapy.Field()
```
Step 3——Write the spider
Create 'QiDianNovelSpider.py' in the spiders folder.
```python
# QiDianNovelSpider.py
from QiDian.items import QiDianNovelItem
from scrapy.spiders import Spider
from scrapy import Request


class QiDianNovelSpider(Spider):
    name = 'qi_dian_novel_spider'
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/53.0.2785.143 Safari/537.36'}
    page = 1
    url = ('http://f.qidian.com/all?chanId=21&orderId=&page=1&vip=hidden'
           '&style=1&pageSize=20&siteid=1&hiddenField=1&page=%d')

    def start_requests(self):
        yield Request(self.url % self.page, headers=self.header)

    def parse(self, response):
        novels = response.xpath('//ul[@class="all-img-list cf"]/li/div[@class="book-mid-info"]')
        for novel in novels:
            # Create a fresh item per book instead of mutating one shared instance
            item = QiDianNovelItem()
            item['name'] = novel.xpath('.//h4/a/text()').extract()[0]
            item['author'] = novel.xpath('.//p[@class="author"]/a[1]/text()').extract()[0]
            item['intro'] = novel.xpath('.//p[@class="intro"]/text()').extract()[0]
            yield item
        if self.page < 20:
            self.page += 1
            yield Request(self.url % self.page, headers=self.header)
```
When the spider first starts, start_requests() supplies the request objects; after that they come from the queue. The spider fetches the page for each request and wraps the reply in a response object, which is handled by parse() by default. parse() unpacks the response, extracts structured data into items, and puts any new requests on the queue.
Step 4——Write the pipeline
With the spider and item written, the crawler can already run; the scraped items are simply printed to the screen. They can also be written to a file via a command-line option, but here we want to store them in a database.
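For reference, the command-line route uses Scrapy's built-in feed exports; the filename here is just an example:

```shell
# -o appends all scraped items to the given file; the format is
# inferred from the extension (.json, .csv, .xml, ...)
scrapy crawl qi_dian_novel_spider -o novels.json
```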
```python
# pipelines.py
import pymysql


class QiDianPipeline(object):
    def __init__(self):
        self.connect = pymysql.connect(
            host='127.0.0.1',
            db='Scrapy_test',
            user='Your_user',
            passwd='Your_pass',
            charset='utf8',
            use_unicode=True)
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        # Parameterized query: pymysql handles quoting/escaping of the values
        sql = 'insert into Scrapy_test.novel(name,author,intro) values (%s,%s,%s)'
        self.cursor.execute(sql, (item['name'], item['author'], item['intro']))
        self.connect.commit()
        return item
```
Then register the pipeline in settings.py:
```python
ITEM_PIPELINES = {
    'QiDian.pipelines.QiDianPipeline': 300,
}
```
Remember to create the database and table first. The database credentials are best kept in settings.py and read back at runtime, e.g. host = settings['MYSQL_HOST'].
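A minimal sketch of that pattern, assuming we invent our own MYSQL_* keys (they are not built-in Scrapy settings) and use Scrapy's standard from_crawler() hook to reach the settings object:

```python
# settings.py — the MYSQL_* keys are our own names, not built-in Scrapy settings
MYSQL_HOST = '127.0.0.1'
MYSQL_DB = 'Scrapy_test'
MYSQL_USER = 'Your_user'
MYSQL_PASS = 'Your_pass'

# pipelines.py — Scrapy calls from_crawler() (if defined) to build the pipeline,
# which gives us access to crawler.settings
import pymysql


class QiDianPipeline(object):
    def __init__(self, host, db, user, passwd):
        self.connect = pymysql.connect(host=host, db=db, user=user,
                                       passwd=passwd, charset='utf8',
                                       use_unicode=True)
        self.cursor = self.connect.cursor()

    @classmethod
    def from_crawler(cls, crawler):
        s = crawler.settings  # exposes everything defined in settings.py
        return cls(s['MYSQL_HOST'], s['MYSQL_DB'],
                   s['MYSQL_USER'], s['MYSQL_PASS'])
```

This keeps credentials out of the pipeline code, so switching databases only touches settings.py.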
Step 5——Crawl the data
The code is done, so let's check that the spider works properly:
```shell
cd YOURPATH
scrapy crawl qi_dian_novel_spider
```
Step 6——Build the word cloud
The spider works and the data is in the database; now for the word cloud.
```python
# mywordcloud.py
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from PIL import Image
import numpy
import jieba
import pymysql

connect = pymysql.connect(
    host='127.0.0.1',
    db='Scrapy_test',
    user='Your_user',
    passwd='Your_pass',
    charset='utf8',
    use_unicode=True)
cursor = connect.cursor()
sql = 'select intro from Scrapy_test.novel'
cursor.execute(sql)
result = cursor.fetchall()

# Join all blurbs into one string, then segment it with jieba
txt = ''
for r in result:
    txt += r[0].strip() + '。'
wordlist = jieba.cut(txt)
ptxt = ' '.join(wordlist)

image = numpy.array(Image.open('Girl.png'))  # custom mask image
# Use a font that supports Chinese: wordcloud's bundled font cannot
# render Chinese characters, so they would come out garbled
wc = WordCloud(background_color='white',
               max_words=500,
               max_font_size=60,
               mask=image,
               font_path='FangSong_GB2312.ttf').generate(ptxt)

plt.imshow(wc)
plt.axis("off")
plt.show()
```
The result looks like this: