The Scrapy Crawler Framework
Directory Structure of a Scrapy Project
Creating a project automatically generates a project folder:

```
scrapy startproject firstpjt
```

View the resulting file structure:

```
.
└── firstpjt
    ├── firstpjt            # core directory
    │   ├── __init__.py     # project initialization
    │   ├── items.py        # data container file
    │   ├── middlewares.py
    │   ├── pipelines.py    # further processing of the data held in items
    │   ├── __pycache__
    │   ├── settings.py     # settings file
    │   └── spiders
    │       ├── __init__.py # spider initialization
    │       └── __pycache__
    └── scrapy.cfg          # configuration file

5 directories, 7 files
```
Managing Scrapy Projects

Control options of the scrapy startproject command

Use scrapy startproject -h to view the help:

```
Usage
=====
  scrapy startproject <project_name> [project_dir]

Create new project

Options
=======
--help, -h              show this help message and exit

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure
```
1. --logfile=FILE:
   Specifies the log file for the run; FILE is the path of the log file.

```
scrapy startproject firstpjt --logfile="../logf.text"
```

   This writes the log file into the parent of the current directory.

2. --loglevel=LEVEL, -L LEVEL:
   Controls the logging level; the default is DEBUG.

```
scrapy startproject firstpjt --loglevel=WARNING
```
Common Tool Commands

Global commands

Global commands can be run without a project. scrapy -h lists all global commands, and scrapy <command> -h shows the details of a specific command.
fetch
Displays the crawling process: fetches a given URL with the Scrapy downloader.
runspider
Runs a single spider file directly, without relying on a Scrapy project.
settings
Shows the corresponding Scrapy configuration values.
shell
Starts Scrapy's interactive shell.
startproject
Creates a new project.
version
Shows version information.
view
Downloads a page and opens it in the browser, as Scrapy sees it.
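For instance, runspider can execute a single self-contained spider file without a project. A minimal sketch (the file name mini.py and the target URL are placeholders, not from the original):

```python
# mini.py -- a hypothetical self-contained spider; run it with: scrapy runspider mini.py
import scrapy


class MiniSpider(scrapy.Spider):
    name = 'mini'
    start_urls = ['http://example.com']   # placeholder URL

    def parse(self, response):
        # yield the page title as a plain dict item
        yield {'title': response.xpath('/html/head/title/text()').extract_first()}
```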
Project commands

Project-specific commands can only be used inside a project. Create a project, enter it, and run scrapy -h to see the commands available there:

```
scrapy startproject first   # create a crawler project
cd first                    # enter the project directory
scrapy -h                   # view the commands available inside a project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy
```

bench
Tests the performance of the local hardware.
genspider
Generates a new spider file directly from a spider template.
Next, use the genspider command to create a spider quickly from a template:

```
scrapy genspider -h                             # view help for this command
scrapy genspider -l                             # list the templates currently available

Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

scrapy genspider -t basic xiaoyao sina.com.cn   # create a spider from the basic template
```

Looking at the directory again:

```
.
├── first
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   │   ├── __init__.cpython-35.pyc
│   │   └── settings.cpython-35.pyc
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── __pycache__
│       │   └── __init__.cpython-35.pyc
│       └── xiaoyao.py
└── scrapy.cfg
```
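The generated spiders/xiaoyao.py, based on the basic template, looks roughly like this (the exact boilerplate varies a little between Scrapy versions):

```python
# -*- coding: utf-8 -*-
import scrapy


class XiaoyaoSpider(scrapy.Spider):
    name = 'xiaoyao'
    allowed_domains = ['sina.com.cn']
    start_urls = ['http://sina.com.cn/']

    def parse(self, response):
        pass
```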
check
Runs contract checks on a spider file.
crawl
Starts a specific spider:

```
scrapy crawl xiaoyao
```

list
Lists the spiders currently available.
edit
Opens a spider file directly for editing.
parse
Fetches the given URL and processes/analyzes it with the corresponding spider.
Writing the Item

Open items.py, which looks like this:

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class FirstItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    urlname = scrapy.Field()   # fill in the item's fields here
    urladdr = scrapy.Field()
```
items.py holds the scraped data as named fields with values, much like a database table. XPath extracts the content of tags directly, serving roughly the same purpose as regular expressions.

```python
>>> import scrapy
>>> class person(scrapy.Item):
...     name = scrapy.Field()
...
>>> xiaoyaogege = person(name='lixiaoyao')
>>> print(xiaoyaogege)
{'name': 'lixiaoyao'}
>>> print(xiaoyaogege['name'])
lixiaoyao
>>> print(type(xiaoyaogege))
<class '__main__.person'>
>>>
```
Writing the Spider

Open the spider file spiders/xiaoyao.py. The xiaoyao spider will crawl three Sina pages and return each page's title and address:

```python
# -*- coding: utf-8 -*-
import scrapy
from first.items import FirstItem


class XiaoyaoSpider(scrapy.Spider):
    name = 'xiaoyao'
    allowed_domains = ['sina.com.cn']
    start_urls = ['http://mil.news.sina.com.cn',
                  'http://edu.sina.com.cn/gaokao',
                  'http://tech.sina.com.cn']   # military, education and technology channels

    def parse(self, response):
        item = FirstItem()   # construct an item
        item['urlname'] = response.xpath('/html/head/title/text()')
        print(item['urlname'])
```
Run the spider from the terminal with the crawl command:

```
whc@whc-ThinkPad-E455:~/code/7.27/first$ scrapy crawl xiaoyao --nolog   # --nolog suppresses logging
[<Selector xpath='/html/head/title/text()' data='军事频道_最多军迷首选的军事门户_新浪网'>]
[<Selector xpath='/html/head/title/text()' data='新浪科技_新浪网'>]
[<Selector xpath='/html/head/title/text()' data='2017高考_2017高考政策_高考频道_新浪教育_新浪网'>]
```
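The parse() above only prints raw Selector objects. A hedged variant that extracts the plain title text and yields the item, so the pipelines defined in pipelines.py can receive it, might look like this:

```python
    def parse(self, response):
        item = FirstItem()
        # extract_first() returns the text of the first matching node instead of a Selector
        item['urlname'] = response.xpath('/html/head/title/text()').extract_first()
        item['urladdr'] = response.url
        yield item
```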
By default the URLs a spider crawls are taken from the start_urls list. If we put the URLs we want to crawl into a different list, we must override the start_requests() method; otherwise Scrapy only looks for URLs in the default place.

```python
    urls = ['http://www.baidu.com',
            'http://www.jd.com',
            'http://www.sina.com.cn']

    # override start_requests()
    def start_requests(self):
        for url in self.urls:
            yield self.make_requests_from_url(url)
```
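Note that make_requests_from_url() was deprecated in later Scrapy releases; an equivalent sketch that yields Request objects directly:

```python
    def start_requests(self):
        for url in self.urls:
            # build the request explicitly instead of using the deprecated helper
            yield scrapy.Request(url, callback=self.parse)
```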
XPath
```html
<html>
  <head>
    <title>Crawler</title>
  </head>
  <body>
    <h1>What is a crawler?</h1>
    <p>First, .......</p>
    <p>Second, .......</p>
  </body>
</html>
```

- Use / to select a specific tag, e.g. `/html/head/title/text()`; text() returns the tag's text, so this extracts "Crawler".
- Use // to extract every occurrence of a tag, e.g. `//p` extracts all p tags.
- To get tags whose attribute has a specific value, use `//z[@x="y"]`, which selects every z tag whose x attribute equals y.
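These three patterns can be tried quickly with Scrapy's Selector against the sample HTML above; a minimal sketch (the class="intro" attribute is added here only to illustrate the last pattern):

```python
from scrapy.selector import Selector

html = """
<html>
  <head><title>Crawler</title></head>
  <body>
    <h1>What is a crawler?</h1>
    <p class="intro">First, ...</p>
    <p>Second, ...</p>
  </body>
</html>
"""

sel = Selector(text=html)
print(sel.xpath('/html/head/title/text()').extract_first())    # 'Crawler'
print(sel.xpath('//p/text()').extract())                       # text of every <p> tag
print(sel.xpath('//p[@class="intro"]/text()').extract_first()) # the <p> whose class attribute is "intro"
```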
Passing Parameters to the Spider Class

Parameters can be passed to a Spider with the -a option. First override the constructor __init__() and set a variable in it to receive the parameter; parameter passing can also be used to crawl multiple pages. The code is as follows:

```python
    # override the constructor
    def __init__(self, myurl=None, *args, **kwargs):
        super(XiaoyaoSpider, self).__init__(*args, **kwargs)
        myurllist = myurl.split('|')
        for i in myurllist:
            print("Site to crawl: %s" % i)
        self.urls = myurllist
```

```
whc@whc-ThinkPad-E455:~/code/7.27/first$ scrapy crawl xiaoyao -a myurl="http://www.baidu.com" --nolog
Site to crawl: http://www.baidu.com
```
Using XMLFeedSpider to Parse XML Feeds

(omitted)

Learning to Use CSVFeedSpider

(omitted)

Running Multiple Scrapy Spiders at Once

Create a project, then create three spiders from the basic template inside it:
```
whc@whc-ThinkPad-E455:~/code/7.28/multispd$ scrapy genspider -t basic xiaoyao_1 sina.com.cn
Created spider 'xiaoyao_1' using template 'basic' in module:
  multispd.spiders.xiaoyao_1
whc@whc-ThinkPad-E455:~/code/7.28/multispd$ scrapy genspider -t basic xiaoyao_2 sina.com.cn
Created spider 'xiaoyao_2' using template 'basic' in module:
  multispd.spiders.xiaoyao_2
whc@whc-ThinkPad-E455:~/code/7.28/multispd$ scrapy genspider -t basic xiaoyao_3 sina.com.cn
Created spider 'xiaoyao_3' using template 'basic' in module:
  multispd.spiders.xiaoyao_3
```

Running them all is implemented by modifying the source of the crawl command, which can be found at http://github.com/scrapy/scrapy/blob/master/scrapy/commands/crawl.py:

```python
import os
from scrapy.commands import ScrapyCommand
from scrapy.utils.conf import arglist_to_dict
from scrapy.utils.python import without_none_values
from scrapy.exceptions import UsageError


class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return "[options] <spider>"

    def short_desc(self):
        return "Run a spider"

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
                          help="set spider argument (may be repeated)")
        parser.add_option("-o", "--output", metavar="FILE",
                          help="dump scraped items into FILE (use - for stdout)")
        parser.add_option("-t", "--output-format", metavar="FORMAT",
                          help="format to use for dumping items with -o")

    def process_options(self, args, opts):
        ScrapyCommand.process_options(self, args, opts)
        try:
            opts.spargs = arglist_to_dict(opts.spargs)
        except ValueError:
            raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)
        if opts.output:
            if opts.output == '-':
                self.settings.set('FEED_URI', 'stdout:', priority='cmdline')
            else:
                self.settings.set('FEED_URI', opts.output, priority='cmdline')
            feed_exporters = without_none_values(
                self.settings.getwithbase('FEED_EXPORTERS'))
            valid_output_formats = feed_exporters.keys()
            if not opts.output_format:
                opts.output_format = os.path.splitext(opts.output)[1].replace(".", "")
            if opts.output_format not in valid_output_formats:
                raise UsageError("Unrecognized output format '%s', set one"
                                 " using the '-t' switch or as a file extension"
                                 " from the supported list %s" % (opts.output_format,
                                                                  tuple(valid_output_formats)))
            self.settings.set('FEED_FORMAT', opts.output_format, priority='cmdline')

    # run() decides which spider to start; to run all spiders, this is the method to modify
    def run(self, args, opts):
        if len(args) < 1:
            raise UsageError()
        elif len(args) > 1:
            raise UsageError("running 'scrapy crawl' with more than one spider is no longer supported")
        spname = args[0]

        self.crawler_process.crawl(spname, **opts.spargs)
        self.crawler_process.start()
```
Inside the project package (the inner multispd directory), create a new mycmd directory, and create mycrawl.py and __init__.py inside it:

```
.
├── multispd
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── mycmd
│   │   ├── __init__.py
│   │   ├── mycrawl.py
│   │   └── __pycache__
│   │       ├── __init__.cpython-35.pyc
│   │       └── mycrawl.cpython-35.pyc
│   ├── pipelines.py
│   ├── __pycache__
│   │   ├── __init__.cpython-35.pyc
│   │   └── settings.cpython-35.pyc
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── __pycache__
│       │   ├── __init__.cpython-35.pyc
│       │   ├── xiaoyao_1.cpython-35.pyc
│       │   ├── xiaoyao_2.cpython-35.pyc
│       │   └── xiaoyao_3.cpython-35.pyc
│       ├── xiaoyao_1.py
│       ├── xiaoyao_2.py
│       └── xiaoyao_3.py
└── scrapy.cfg
```

Copy the crawl source code into mycrawl.py and modify it as follows:

```python
    # the main modification is to this part
    def run(self, args, opts):
        # if len(args) < 1:
        #     raise UsageError()
        # elif len(args) > 1:
        #     raise UsageError("running 'scrapy crawl' with more than one spider is no longer supported")
        # spname = args[0]
        #
        # self.crawler_process.crawl(spname, **opts.spargs)
        # self.crawler_process.start()

        spd_loader_list = self.crawler_process.spider_loader.list()
        # iterate over all spiders and schedule each of them
        for spdname in spd_loader_list or args:
            self.crawler_process.crawl(spdname, **opts.spargs)
            print("Starting spider: " + spdname)
        self.crawler_process.start()
```
Modify settings.py to register the new command module with the project:

```python
COMMANDS_MODULE = 'multispd.mycmd'
```
Run scrapy -h in the terminal, find the new mycrawl command among the listed commands, and run it:

```
whc@whc-ThinkPad-E455:~/code/7.28/multispd$ scrapy mycrawl --nolog
Starting spider: xiaoyao_1
Starting spider: xiaoyao_2
Starting spider: xiaoyao_3
```
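As an alternative to a custom command, Scrapy's own CrawlerProcess API can also start several spiders from one script; a minimal sketch (the script name run_all.py is hypothetical; run it from the project root so the project settings are found):

```python
# run_all.py -- a hypothetical helper script placed next to scrapy.cfg
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
for name in process.spider_loader.list():   # xiaoyao_1, xiaoyao_2, xiaoyao_3
    process.crawl(name)                     # schedule each spider by name
process.start()                             # blocks until every spider has finished
```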
Avoiding Getting Banned

Disabling cookies

Some sites identify users through cookie information; disabling local cookies keeps the site from recognizing the session. This only takes a change in settings.py:

```python
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# These two lines control cookies; simply uncomment the second one:
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
```
Setting a download delay

Some sites analyze how frequently we request their pages; crawling too fast gets the client flagged as a crawler. Setting a download delay addresses this. Open settings.py and adjust the relevant setting:

```python
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# Uncomment the last line to set the delay in seconds (the example value is 3 seconds).
```
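For instance, a hedged snippet that waits roughly two seconds between requests; RANDOMIZE_DOWNLOAD_DELAY (on by default) spreads the real delay between 0.5x and 1.5x of this value so the timing looks less mechanical:

```python
DOWNLOAD_DELAY = 2               # wait about 2 seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # default; randomizes each delay between 0.5x and 1.5x
```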
Using an IP pool

Steps:

- Create a downloader middleware in the project.
- Set the IP selection rules in the downloader middleware.
- Configure the downloader middleware and the IP pool in settings.py.

Code:

First, write a small crawler that scrapes proxy IPs:

```python
import re
import urllib.request

pattern = '<td>((\d+\.)+\d+).*?(\d+)</td>'
headers = ('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64) Ap'
           'pleWebKit/537.36 (KHTML, like Gecko) '
           'Chrome/59.0.3071.115 Safari/537.36')
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)


def getip():
    url = 'http://www.xicidaili.com/nn/1'
    html = urllib.request.urlopen(url).read()
    html = str(html)
    try:
        result = re.compile(pattern, re.S).findall(html)
        striplist = []
        for i in result:
            strip = i[0] + ':' + i[2]   # join IP and port as "ip:port"
            striplist.append(strip)
        return striplist
    except Exception as e:
        print(str(e))
```
- Modify settings.py (this assumes the IP-scraping code above was saved as get_ip.py inside the multispd package):

```python
# assumption: the IP-scraping code above lives in multispd/get_ip.py
from multispd.get_ip import getip

IPPOOL = []
iplist = getip()
for i in iplist:
    IPPOOL.append({'ipaddr': i})
```
- Write the middleware file:

```python
import random

from scrapy import signals
from multispd.settings import IPPOOL
from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware


class IPPOOLS(HttpProxyMiddleware):

    def __init__(self, ip=''):
        self.ip = ip

    def process_request(self, request, spider):
        # pick a random IP from the pool and use it as the proxy for this request
        thisip = random.choice(IPPOOL)
        print("Current proxy IP: " + thisip["ipaddr"])
        request.meta['proxy'] = 'http://' + thisip["ipaddr"]
```
- Configure settings.py once more to enable the middleware:

```python
DOWNLOADER_MIDDLEWARES = {
    # 'multispd.middlewares.MyCustomDownloaderMiddleware': 543,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 123,
    'multispd.middlewares.IPPOOLS': 125,
}
```
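For reference, the scrapy.contrib.* paths used above come from older Scrapy releases; in Scrapy 1.x and later the same middleware lives under scrapy.downloadermiddlewares, so an equivalent setting (a hedged sketch, not from the original) would be:

```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 123,
    'multispd.middlewares.IPPOOLS': 125,
}
```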
Using a user-agent pool

(omitted)

Other approaches, such as distributed crawling

(omitted)