A Concise Scrapy Tutorial (3): Scraping a CSDN Blog Post and Writing It to a File
This post shows how to use Scrapy to scrape a CSDN blog post detail page and write the result to a file, using http://blog.csdn.net/oscer2016/article/details/78007472 as the example:
1. First run the following commands:

```shell
scrapy startproject csdnblog
cd csdnblog/
scrapy genspider -t basic spider_csdnblog csdn.net
```
2. Edit settings.py:

Set the user agent (you can copy it from the request headers in the browser's developer tools) and uncomment ITEM_PIPELINES:

```shell
vim csdnblog/settings.py
```

```python
# change the following two settings
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'

ITEM_PIPELINES = {
    'csdnblog.pipelines.CsdnblogPipeline': 300,
}
```
3. Define the fields to extract (items.py):

```shell
vim csdnblog/items.py
```

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class CsdnblogItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
    releaseTime = scrapy.Field()
    readnum = scrapy.Field()
    article = scrapy.Field()
```
4. Edit pipelines.py:

```shell
vim csdnblog/pipelines.py
```

```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import re
import sys
reload(sys)                          # Python 2: allow mixing str and unicode
sys.setdefaultencoding("utf-8")


class CsdnblogPipeline(object):
    def process_item(self, item, spider):
        data = re.findall("http://blog.csdn.net/(.*?)/article/details/(\d*)",
                          item['url'])
        # build the file name as <author>_<post id>.txt
        filename = data[0][0] + '_' + data[0][1] + '.txt'
        text = "标题: " + item['title'] + "\n博文链接: " + item['url'] \
            + "\n发布时间: " + item['releaseTime'] + "\n\n正文: " + item['article']
        fp = open(filename, 'w')
        fp.write(text)
        fp.close()
        return item
```
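Note that the `reload(sys)` / `setdefaultencoding` workaround is Python 2 only (it no longer exists on Python 3); the filename construction itself is plain `re`. A minimal sketch of just that step, with a hypothetical helper name `build_filename`:

```python
import re

def build_filename(url):
    # pull <author> and <post id> out of a CSDN post URL;
    # findall with two groups returns a list of (author, id) tuples
    data = re.findall(r"http://blog.csdn.net/(.*?)/article/details/(\d*)", url)
    return data[0][0] + '_' + data[0][1] + '.txt'

print(build_filename('http://blog.csdn.net/oscer2016/article/details/78007472'))
# oscer2016_78007472.txt
```

Deriving the filename from the URL this way guarantees one unique file per post, since the author name and the numeric post ID together identify it.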
5. Write the spider:

```shell
vim csdnblog/spiders/spider_csdnblog.py
```

```python
# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy import Request
from csdnblog.items import CsdnblogItem


class SpiderCsdnblogSpider(scrapy.Spider):
    name = 'spider_csdnblog'
    allowed_domains = ['csdn.net']
    start_urls = ['http://blog.csdn.net/oscer2016/article/details/78007472']

    def parse(self, response):
        item = CsdnblogItem()
        # extraction rules for the new CSDN blog theme
        item['url'] = response.url
        item['title'] = response.xpath('//h1[@class="csdn_top"]/text()').extract()[0].encode('utf-8')
        item['releaseTime'] = response.xpath('//span[@class="time"]/text()').extract()[0].encode('utf-8')
        item['readnum'] = response.xpath('//button[@class="btn-noborder"]/span/text()').extract()[0]
        data = response.xpath('//div[@class="markdown_views"]')
        item['article'] = data.xpath('string(.)').extract()[0]
        # hand the item to pipelines.py, which writes it to a file
        yield item
```
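The `string(.)` XPath expression returns the concatenated text of the selected `div` and all of its descendants, which is why a single call captures the whole article body even though the Markdown rendering splits it across many nested tags. A rough stdlib illustration of that behavior (the markup here is hypothetical, not the real CSDN page):

```python
import xml.etree.ElementTree as ET

html = '<div class="markdown_views"><h2>Intro</h2><p>Scrapy is <b>fast</b>.</p></div>'
div = ET.fromstring(html)

# itertext() yields every text fragment under the node in document order,
# mirroring what XPath's string(.) produces for that node
article = "".join(div.itertext())
print(article)
# IntroScrapy is fast.
```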
6. Run the project:

```shell
scrapy crawl spider_csdnblog --nolog
```
At this point the blog post data has been correctly extracted and saved to a file. The next post will cover scraping every post from all CSDN blog experts and storing them in MongoDB.