Scraping a Story Website with Python + Scrapy + MySQL

Before reading on, make sure you are already familiar with Python, SQL, and Scrapy.

Environment Setup

  • CentOS 7 (the author is used to CentOS 7; other systems work too)
  • Python 3.6 (mind the version differences if you use another version)
  • pymysql (for connecting to the database)
  • scrapy
  • MariaDB (CentOS 7 ships with MariaDB, which works much the same as MySQL)

Database Setup

CREATE TABLE `story` (
  `id` int AUTO_INCREMENT NOT NULL,
  `title` varchar(255),
  `content` text,
  `type` varchar(60),
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

Create the Scrapy Project

scrapy startproject story 
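This generates the usual Scrapy project skeleton (the exact files vary slightly by Scrapy version), roughly:

story/
    scrapy.cfg
    story/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py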
  • Append the following database configuration to the end of settings.py (make sure the test database exists first)
MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'test'     # database name
MYSQL_USER = 'root'       # username
MYSQL_PWD = 'root'        # password
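To confirm these values are right before wiring them into Scrapy, a quick standalone pymysql check can help (a minimal sketch; it assumes the test database from above already exists):

import pymysql

# Connect with the same values configured in settings.py above
conn = pymysql.connect(host='localhost', user='root', password='root',
                       db='test', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute('SELECT VERSION()')   # any trivial query proves the connection works
    print(cursor.fetchone())
conn.close()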
  • Create story_spider.py in the spiders directory
# !/usr/bin/env python
# -*- coding: utf-8 -*-
import scrapy
from story.items import StoryItem


class StorySpider(scrapy.Spider):
    name = 'story_spider'
    allowed_domains = ['www.xigushi.com']
    # One listing page per story category
    start_urls = [
        'http://www.xigushi.com/ymgs/',
        'http://www.xigushi.com/thgs/',
        'http://www.xigushi.com/aqgs/',
        'http://www.xigushi.com/jcgs/',
        'http://www.xigushi.com/lzgs/',
        'http://www.xigushi.com/zlgs/',
        'http://www.xigushi.com/xygs/',
        'http://www.xigushi.com/rsgs/',
        'http://www.xigushi.com/yygs/',
        'http://www.xigushi.com/mrgs/',
        'http://www.xigushi.com/qqgs/',
        'http://www.xigushi.com/yqgs/',
    ]

    def parse(self, response):
        # Follow every article link on the listing page
        article_links = response.xpath('//div[@class="list"]//dd//li//@href').extract()
        for link in article_links:
            yield response.follow(link, callback=self.parse_article)
        # The next-page link is the sibling right after the current-page marker
        next_page = response.xpath('//div[@class="list"]//div[@class="pages"]//li[@class="thisclass"]//following-sibling::li[@class="sy2"][1]//@href').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    def parse_article(self, response):
        item = StoryItem()
        item['title'] = response.xpath('//div[@class="by"]//h1[1]//text()').extract_first().strip()
        item['type'] = response.xpath('//div[@class="by"]//dt[1]//a[2]/text()').extract_first().strip()
        # Concatenate all paragraph text into a single content string
        item['content'] = ''
        for content in response.xpath('//div[@class="by"]//p/text()').extract():
            item['content'] += content.strip()
        yield item
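The spider imports StoryItem from story/items.py, which the original post never shows; a minimal definition covering the three fields used above would be:

# story/items.py
import scrapy


class StoryItem(scrapy.Item):
    # One field per column the pipeline inserts
    title = scrapy.Field()
    content = scrapy.Field()
    type = scrapy.Field()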
  • Next, modify pipelines.py to insert the scraped data into the database
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql


class StoryPipeline(object):
    def __init__(self, conn):
        self._conn = conn

    @classmethod
    def from_settings(cls, settings):
        # Scrapy calls this hook with the project settings, so the
        # connection is built from the values added to settings.py
        conn = pymysql.connect(host=settings['MYSQL_HOST'],
                               user=settings['MYSQL_USER'],
                               password=settings['MYSQL_PWD'],
                               db=settings['MYSQL_DBNAME'],
                               charset='utf8')
        return cls(conn)

    def process_item(self, item, spider):
        # Parameterized queries avoid quoting bugs and SQL injection
        sql_search = 'select id from story where title=%s'
        sql = 'insert into story(title, content, type) values(%s, %s, %s)'
        try:
            with self._conn.cursor() as cursor:
                cursor.execute(sql_search, (item['title'],))
                # Deduplicate: only insert titles not already in the table
                if cursor.fetchone() is None:
                    cursor.execute(sql, (item['title'], item['content'], item['type']))
                    self._conn.commit()
        except Exception as e:
            print("exception >>> ")
            print(e)
            self._conn.rollback()
        return item

    def close_spider(self, spider):
        self._conn.close()
  • Then enable the pipeline in settings.py
ITEM_PIPELINES = {
    'story.pipelines.StoryPipeline': 300,
}

Everything is ready; start crawling!

scrapy crawl story_spider
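Once the crawl finishes, a quick standalone query shows what landed in the table (a sketch reusing the connection settings above):

import pymysql

conn = pymysql.connect(host='localhost', user='root', password='root',
                       db='test', charset='utf8')
with conn.cursor() as cursor:
    # Count the stored stories per category
    cursor.execute('SELECT type, COUNT(*) FROM story GROUP BY type')
    for row in cursor.fetchall():
        print(row)
conn.close()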

Afterword

This article extracts content with XPath; you can also try other selector styles, such as CSS.
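For example, Scrapy's CSS selectors could replace the title XPath like this (assuming the same page structure as above):

# CSS-selector equivalent of the title extraction in parse_article
item['title'] = response.css('div.by h1::text').extract_first().strip()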
