Quick Start: Writing a Crawler with Scrapy (Part 1)

Source: Internet  Editor: 程序博客网  Time: 2024/04/27 13:26

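Preface

Many people write crawlers in Python, and Python has plenty of crawler frameworks, such as pyspider and Scrapy. I lean toward Scrapy, so this article uses it to build a small crawler demo.
This article is aimed at developers who have some Python experience and a basic understanding of how crawlers work.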

Installing Scrapy

Check the environment. Here the Python version is 3.6.2 and pip is 9.0.1.

    F:\techlee\python>python --version
    Python 3.6.2

    F:\techlee\python>pip --version
    pip 9.0.1 from d:\program files\python\python36-32\lib\site-packages (python 3.6)
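The same check can be done from inside Python, which helps when several interpreters are installed and it is unclear which one `pip` belongs to. This is a minimal sketch: the helper name `meets_minimum` and the (3, 6) threshold are assumptions chosen to match the environment used in this article, not a requirement stated by Scrapy.

```python
import sys


def meets_minimum(version_info, minimum=(3, 6)):
    """Return True if the (major, minor) version is at least `minimum`."""
    # Compare only major and minor; tuples compare element by element.
    return tuple(version_info[:2]) >= tuple(minimum)


# Print the running interpreter's version string, e.g. "3.6.2"
print(sys.version.split()[0])
# Print whether the running interpreter meets the assumed minimum
print(meets_minimum(sys.version_info))
```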

Install the Scrapy framework:

    F:\techlee\python>pip install scrapy
    Collecting scrapy
      Downloading Scrapy-1.4.0-py2.py3-none-any.whl (248kB)
        100% |████████████████████████████████| 256kB 188kB/s
    // a long installation process
    Successfully installed Twisted-17.9.0 scrapy-1.4.0

If the installation fails with this error:

error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

install the Visual C++ 2015 Build Tools:
http://landinghub.visualstudio.com/visual-cpp-build-tools

Installation complete:

    F:\techlee\python>scrapy version
    Scrapy 1.4.0

Creating the project

    F:\techlee\python>scrapy startproject scrapyDemo
    New Scrapy project 'scrapyDemo', using template directory 'd:\\program files\\python\\python36-32\\lib\\site-packages\\scrapy\\templates\\project', created in:
        F:\techlee\python\scrapyDemo

    You can start your first spider with:
        cd scrapyDemo
        scrapy genspider example example.com

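Directory structure

    scrapyDemo/
        scrapy.cfg            # deployment configuration file
        scrapyDemo/           # the project's Python module
            __init__.py
            items.py          # data containers (item definitions)
            pipelines.py      # project pipelines file
            settings.py       # project settings
            spiders/          # Spider classes define how to crawl a site (or a group of sites)
                __init__.py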
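To give a feel for what lives in settings.py, here is a short excerpt of the kind of values a freshly generated project typically contains. Treat it as a sketch rather than a copy of the generated file: the first four names are standard Scrapy settings filled in for this project's name, and `DOWNLOAD_DELAY` is an illustrative addition that the original project does not necessarily set.

```python
# scrapyDemo/settings.py (excerpt, illustrative)

# Name of the bot, shown in the startup log line "(bot: scrapyDemo)"
BOT_NAME = 'scrapyDemo'

# Where Scrapy looks for spiders, and where `scrapy genspider` puts new ones
SPIDER_MODULES = ['scrapyDemo.spiders']
NEWSPIDER_MODULE = 'scrapyDemo.spiders'

# Fetch and respect robots.txt before crawling a site
ROBOTSTXT_OBEY = True

# Assumption for illustration: wait 1 second between requests to be polite
DOWNLOAD_DELAY = 1
```

With `ROBOTSTXT_OBEY` enabled, the first request of a crawl goes to the site's robots.txt.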

Create the spider class ImoocSpider, which does the actual crawling, in scrapyDemo/spiders:

    # -*- coding: utf-8 -*-
    import scrapy
    from urllib import parse as urlparse


    # Spider for imooc.com (慕课网)
    class ImoocSpider(scrapy.Spider):
        # The name is how Scrapy locates (and instantiates) the spider, so it must be unique
        name = "imooc"
        # List of start URLs
        start_urls = ['http://www.imooc.com/course/list']
        # URLs whose domain is not in this list will not be crawled
        allowed_domains = ['www.imooc.com']

        def parse(self, response):
            # Each course on the list page is an <a class="course-card"> element
            learn_nodes = response.css('a.course-card')
            for learn_node in learn_nodes:
                learn_url = learn_node.css("::attr(href)").extract_first()
                # Resolve the href against the page URL and request the course page
                yield scrapy.Request(url=urlparse.urljoin(response.url, learn_url), callback=self.parse_learn)

        def parse_learn(self, response):
            title = response.xpath('//h2[@class="l"]/text()').extract_first()
            # The course description is extracted but not printed in this demo
            content = response.xpath('//div[@class="course-brief"]/p/text()').extract_first()
            url = response.url
            # "标题" means title, "地址" means URL
            print('标题:' + title)
            print('地址:' + url)
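The spider passes every href through `urljoin` because links on a list page are often relative. The sketch below shows how the standard library resolves three kinds of href against the list page URL. The example hrefs are assumptions for illustration; the real page may use only one of these forms.

```python
from urllib import parse as urlparse

base = 'http://www.imooc.com/course/list'

# A root-relative href replaces the whole path of the base URL
root_relative = urlparse.urljoin(base, '/learn/876')

# A path-relative href replaces only the last path segment ("list")
path_relative = urlparse.urljoin(base, 'learn/876')

# An absolute href is returned unchanged
absolute = urlparse.urljoin(base, 'http://www.imooc.com/learn/893')

print(root_relative)   # http://www.imooc.com/learn/876
print(path_relative)   # http://www.imooc.com/course/learn/876
print(absolute)        # http://www.imooc.com/learn/893
```

Scrapy responses also provide `response.urljoin(href)`, which resolves against the response URL in the same way and would save the extra import.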

Starting the crawl

    F:\techlee\python\scrapyDemo>scrapy crawl imooc

If the error below appears, the win32api library is missing. Download the build that matches your Python version from:

https://sourceforge.net/projects/pywin32/files/pywin32/Build%20221/

    import win32api
    ModuleNotFoundError: No module named 'win32api'
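To confirm whether the module is visible to the interpreter that runs Scrapy, without triggering the error, a small standard-library check is enough. This is a sketch; `is_installed` is a hypothetical helper name, not part of Scrapy or pywin32.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    # find_spec returns None when no such top-level module exists
    return importlib.util.find_spec(module_name) is not None


# True once pywin32 is installed for this interpreter, False otherwise
print(is_installed('win32api'))
```

Run it with the same `python` that `pip --version` reported, since a module installed for a different interpreter will still show up as missing.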

All done

If you see output like the following, the crawl succeeded.

    F:\techlee\python\scrapyDemo>scrapy crawl imooc
    2017-10-17 14:28:32 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapyDemo)
    ……
    2017-10-17 14:28:32 [scrapy.core.engine] INFO: Spider opened
    2017-10-17 14:28:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2017-10-17 14:28:32 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2017-10-17 14:28:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/robots.txt> (referer: None)
    2017-10-17 14:28:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/course/list> (referer: None)
    2017-10-17 14:28:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/876> (referer: http://www.imooc.com/course/list)
    标题:集成MultiDex项目实战
    地址:http://www.imooc.com/learn/876
    2017-10-17 14:28:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/893> (referer: http://www.imooc.com/course/list)
    标题:阿里D2前端技术论坛——2016初心
    地址:http://www.imooc.com/learn/893
    2017-10-17 14:28:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/890> (referer: http://www.imooc.com/course/list)
    2017-10-17 14:28:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/888> (referer: http://www.imooc.com/course/list)
    标题:Hadoop进阶
    地址:http://www.imooc.com/learn/890
    标题:Javascript实现二叉树算法
    地址:http://www.imooc.com/learn/888
    2017-10-17 14:28:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/894> (referer: http://www.imooc.com/course/list)
    标题:Fragment应用上
    地址:http://www.imooc.com/learn/894
    2017-10-17 14:28:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/887> (referer: http://www.imooc.com/course/list)
    标题:PHP-面向对象
    地址:http://www.imooc.com/learn/887
    2017-10-17 14:28:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/900> (referer: http://www.imooc.com/course/list)
    2017-10-17 14:28:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/889> (referer: http://www.imooc.com/course/list)
    2017-10-17 14:28:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/901> (referer: http://www.imooc.com/course/list)
    标题:Sketch的基础实例应用
    地址:http://www.imooc.com/learn/900
    标题:ElasticSearch入门
    地址:http://www.imooc.com/learn/889
    标题:使用Google Guice实现依赖注入
    地址:http://www.imooc.com/learn/901
    2017-10-17 14:28:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/867> (referer: http://www.imooc.com/course/list)
    标题:Docker入门
    地址:http://www.imooc.com/learn/867
    2017-10-17 14:28:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/878> (referer: http://www.imooc.com/course/list)
    标题:Android图表绘制之直方图
    地址:http://www.imooc.com/learn/878
    2017-10-17 14:28:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/892> (referer: http://www.imooc.com/course/list)
    标题:UI版式设计
    地址:http://www.imooc.com/learn/892
    2017-10-17 14:28:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/877> (referer: http://www.imooc.com/course/list)
    2017-10-17 14:28:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/886> (referer: http://www.imooc.com/course/list)
    标题:RxJava与RxAndroid基础入门
    地址:http://www.imooc.com/learn/877
    标题:iOS开发之Audio特辑
    地址:http://www.imooc.com/learn/886
    2017-10-17 14:28:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/861> (referer: http://www.imooc.com/course/list)
    标题:基于Websocket的火拼俄罗斯(基础)
    地址:http://www.imooc.com/learn/861
    2017-10-17 14:28:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/895> (referer: http://www.imooc.com/course/list)
    2017-10-17 14:28:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/882> (referer: http://www.imooc.com/course/list)
    标题:2017AWS 技术峰会——大数据技术专场
    地址:http://www.imooc.com/learn/895
    标题:基于websocket的火拼俄罗斯(单机版)
    地址:http://www.imooc.com/learn/882
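One fragile spot in the demo is that `parse_learn` concatenates strings with `+`, so a page where the title selector matches nothing (in which case `extract_first()` returns None) would raise a TypeError. A possible next step is to build a plain dict per course instead of printing. The helper below is a hypothetical sketch, not part of the original spider; the field names `title` and `url` are assumptions.

```python
def build_course_record(title, url):
    """Build a plain dict for one course page, tolerating a missing title."""
    # extract_first() returns None when the selector matches nothing,
    # so fall back to an empty string instead of failing
    clean_title = title.strip() if title else ''
    return {'title': clean_title, 'url': url}


record = build_course_record(' Docker入门 ', 'http://www.imooc.com/learn/867')
print(record)
```

Inside the spider, `parse_learn` could then end with `yield build_course_record(title, response.url)`, since Scrapy accepts plain dicts yielded from callbacks as scraped items.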

Original article: https://www.tech1024.cn/original/2951.html

Saving the data to a MySQL database: https://www.tech1024.cn/original/2959.html
