Scraping blog information with Python and Scrapy

Source: Internet  Editor: 程序博客网  Time: 2024/05/24 03:46

I'm back to crawl my own blog again (facepalm).

This time I'm using Scrapy.

First, create a Scrapy project.

Command:

scrapy startproject myblog

Directory structure:

scrapy.cfg
myblog/
|----__init__.py
|----items.py
|----pipelines.py
|----settings.py
|----spiders/
     |----__init__.py

Settings already enabled in settings.py:

BOT_NAME = 'myblog'
SPIDER_MODULES = ['myblog.spiders']
NEWSPIDER_MODULE = 'myblog.spiders'

ROBOTSTXT_OBEY = True
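With ROBOTSTXT_OBEY = True, Scrapy fetches a site's robots.txt and skips URLs that the file disallows. Scrapy does this through its own middleware; the sketch below only illustrates the idea with the standard library's urllib.robotparser. The rules, the example.com URLs, and the "myblog" user agent are made up for the demonstration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, not the real ones for any site
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse rules from memory, no network access

# can_fetch(user_agent, url) answers "may this agent request this URL?"
allowed = rp.can_fetch("myblog", "http://example.com/posts/1")
blocked = rp.can_fetch("myblog", "http://example.com/private/draft")
print(allowed, blocked)
```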

Write items.py. I only want one field, the title of each blog post:

from scrapy.item import Item, Field


class MyblogItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()

Write pipelines.py, which writes the scraped content to a file:

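import codecs
import json


class MyblogPipeline(object):
    def __init__(self):
        # Open the output file once, as UTF-8 text
        self.file = codecs.open('myblog_title_utf8.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese titles readable instead of \uXXXX escapes
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes
        self.file.close()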

Enable the pipeline in settings.py:

ITEM_PIPELINES = {
    'myblog.pipelines.MyblogPipeline': 300,
}
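The pipeline writes one JSON object per line and needs the Chinese titles to stay readable in the file. One way to get that, sketched here outside of Scrapy, is json.dumps with ensure_ascii=False. The helper name write_titles, the temporary file, and the two sample titles are only for illustration.

```python
import json
import os
import tempfile


def write_titles(path, titles):
    """Write one JSON object per line, keeping non-ASCII text unescaped."""
    with open(path, "w", encoding="utf-8") as f:
        for title in titles:
            f.write(json.dumps({"title": title}, ensure_ascii=False) + "\n")


# Hypothetical sample data and a temporary output path
sample_titles = ["python批量删除文件", "Python 判断质数"]
out_path = os.path.join(tempfile.mkdtemp(), "titles_demo.json")
write_titles(out_path, sample_titles)

with open(out_path, encoding="utf-8") as f:
    written = f.read()

# For comparison: the default (ensure_ascii=True) escapes every non-ASCII character
escaped = json.dumps({"title": sample_titles[0]})
print(written)
print(escaped)
```

With the default setting the file would contain \uXXXX escapes, which is still valid JSON but hard to read by eye.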

In the spiders directory, create myblog_spider.py and write the spider:

from scrapy.spiders import Spider
from scrapy.http import Request
from bs4 import BeautifulSoup

from myblog.items import MyblogItem


class MyBlogSpider(Spider):
    name = 'myblog'
    allowed_domains = ["blog.csdn.net"]
    start_urls = [
        "http://blog.csdn.net/u013055678?viewmode=contents"  # start page URL
    ]

    def parse(self, response):
        soup = BeautifulSoup(response.body, "lxml")
        # Find every blog post link on the start page
        page_div = soup.find_all("span", {"class": "link_title"})
        for url in page_div:
            a_url = "http://blog.csdn.net" + url.find("a").attrs["href"]
            print(">>newsurl: %s" % a_url)
            # Request the post URL and hand the response to the callback
            yield Request(a_url, callback=self.parse_item)

    def parse_item(self, response):
        items = []
        item = MyblogItem()
        soup = BeautifulSoup(response.body, "lxml")
        # Extract the post title
        class_div = soup.find("span", {"class": "link_title"}).text.strip()
        item['title'] = class_div
        items.append(item)
        print(item)
        return items
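The link-collecting step in parse can be tried without running a crawl. The sketch below is not the spider's code: it swaps BeautifulSoup for the standard library's HTMLParser so it runs with no extra packages, and the HTML snippet, the /example/... paths, and the class name LinkTitleParser are invented for the demonstration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkTitleParser(HTMLParser):
    """Collect href values of <a> tags inside <span class="link_title">."""

    def __init__(self):
        super().__init__()
        self.in_title_span = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "link_title" in (attrs.get("class") or "").split():
            self.in_title_span = True
        elif tag == "a" and self.in_title_span and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        # Simplification: any closing </span> ends the title span
        if tag == "span":
            self.in_title_span = False


# Hypothetical markup, loosely modelled on a post list page
SAMPLE_HTML = """
<div><span class="link_title"><a href="/example/post-1">Post one</a></span></div>
<div><span class="other"><a href="/ignored">Not a post</a></span></div>
<div><span class="link_title"><a href="/example/post-2">Post two</a></span></div>
"""

parser = LinkTitleParser()
parser.feed(SAMPLE_HTML)
# urljoin resolves relative hrefs against the site root
post_urls = [urljoin("http://blog.csdn.net/", href) for href in parser.hrefs]
print(post_urls)
```

Inside a Scrapy callback, response.urljoin(href) does the same resolution against the current page, which avoids hard-coding the host name.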

Run command:

scrapy crawl myblog

Open myblog_title_utf8.json; the results are saved there.

Result:

{"title": "python queue和多线程的爬虫 与 JoinableQueue和多进程的爬虫"}
{"title": "21天学通Java学习笔记-Day05"}
{"title": "21天学通Java学习笔记-Day06"}
{"title": "21天学通Java学习笔记-Day04"}
{"title": "21天学通Java学习笔记-Day09(IO流)"}
{"title": "21天学通Java学习笔记-Day07(异常-断言-线程)"}
{"title": "21天学通Java学习笔记-Day11(常用类)"}
{"title": "21天学通Java学习笔记-Day10(网路编程)"}
{"title": "21天学通Java学习笔记-Day13(javascript-ajax)"}
{"title": "用python将SQL格式文件改成自己想要的格式"}
{"title": "21天学通Java学习笔记-Day14(Tomcar-Servlet-JSP)"}
{"title": "21天学通Java学习笔记-Day08(数据结构)"}
{"title": "python解决一些错误换行问题"}
{"title": "python批量删除文件"}
{"title": "21天学通Java学习笔记-Day12(MYsql-JDBC)"}
{"title": "python 网站爬虫 下载在线盗墓笔记小说到本地的脚本"}
{"title": "21天学通Java学习笔记-Day01"}
{"title": "21天学通Java学习笔记-Day02"}
{"title": "21天学通Java学习笔记-Day03"}
{"title": "Python 判断质数"}
{"title": "python-SMTP发邮件"}
{"title": "python为在线漫画网站自制非官方API(未完待续)"}
{"title": "python爬虫：案例一：360指数"}
{"title": "python爬虫：案例二:携程网酒店价格信息"}
{"title": "flask笔记：1：安装"}
{"title": "flask笔记：2：Hello World"}
{"title": "flask笔记：3：模板"}
{"title": "ubuntu下python模拟键盘"}
{"title": "flask笔记：4：web表单"}
{"title": "flask笔记：5：数据库"}
{"title": "flask笔记：6：用户登入登出"}
{"title": "flask笔记：7：用户资料信息页和头像"}
{"title": "flask笔记：8：修复BUG"}
{"title": "flask笔记：后记(附代码)"}
{"title": "python爬虫：案例三：去哪儿酒店价格信息"}
{"title": "python爬虫：案例四：新浪微指数"}
{"title": "django学习笔记1:安装"}
{"title": "django学习笔记2：基本命令"}
{"title": "django学习笔记3:视图与路由"}
{"title": "django学习笔记5:模型"}
{"title": "django学习笔记7：django和celery实现异步"}
{"title": "flask笔记：9：蓝图"}
{"title": "flask笔记：10：多线程模式"}
{"title": "django学习笔记4:模版"}
{"title": "django学习笔记6:表单"}
{"title": "python 商品名称相似度查找(difflib库和结巴分词的运用)"}
{"title": "python xlrd库的简单使用"}
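Since the pipeline writes a newline after each JSON object, the file can be read back one line at a time. A minimal sketch follows; load_titles is a hypothetical helper, and the three sample lines stand in for the real file so the snippet runs on its own.

```python
import json


def load_titles(lines):
    """Parse JSON-lines text: one JSON object per non-empty line."""
    return [json.loads(line)["title"] for line in lines if line.strip()]


# Three lines taken from the result; a real run would instead iterate over
# open("myblog_title_utf8.json", encoding="utf-8")
sample_lines = [
    '{"title": "Python 判断质数"}\n',
    '{"title": "python-SMTP发邮件"}\n',
    '{"title": "python批量删除文件"}\n',
]

titles = load_titles(sample_lines)
# Example query: titles that start with "python", ignoring case
python_titles = [t for t in titles if t.lower().startswith("python")]
print(len(titles), len(python_titles))
```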

0 0