Scrapy Crawler: Simulating a Browser and Using Proxies
This post configures a user-agent pool and a proxy pool through files alongside settings.py. References:
http://www.tuicool.com/articles/VRfQR3U
http://jinbitou.net/2016/12/01/2229.html (the one I used)
Background on websites' anti-crawler strategies:
http://www.cnblogs.com/tyomcat/p/5447853.html
1. Create a file named useragent.py in the same directory as settings.py:
# -*- coding: utf-8 -*-
import logging
import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

logger = logging.getLogger(__name__)


class UserAgent(UserAgentMiddleware):

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Pick a random user agent from the pool for this request
        ua = random.choice(self.user_agent_list)
        if ua:
            # Record the user agent currently in use
            logger.debug('Current UserAgent: %s', ua)
            request.headers.setdefault('User-Agent', ua)

    # The pool below contains only Chrome strings; user agent strings for
    # IE, Firefox, Opera, Netscape and others can be found at
    # http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
        "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
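The opening of this post mentions driving these lists from settings.py, while the middleware above hardcodes its pool. As a minimal sketch of the settings-driven alternative, a downloader middleware can receive the crawler through from_crawler and read a custom setting; the setting name USER_AGENT_LIST below is my own choice, not a Scrapy built-in:

import random


class SettingsDrivenUserAgent(object):

    def __init__(self, user_agent_list):
        self.user_agent_list = user_agent_list

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is a hypothetical custom setting in settings.py
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        # Same rotation idea as above, but with the pool kept in settings.py
        if self.user_agent_list:
            request.headers.setdefault(
                'User-Agent', random.choice(self.user_agent_list))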
2. Create a file named proxymiddlewares.py in the same directory as settings.py:
# -*- coding: utf-8 -*-
import random


class ProxyMiddleware(object):

    proxyList = [
        '121.193.143.249:80', '112.126.65.193:80', '122.96.59.104:82',
        '115.29.98.139:9999', '117.131.216.214:80', '116.226.243.166:8118',
        '101.81.22.21:8118', '122.96.59.107:843',
    ]

    def process_request(self, request, spider):
        # Set the location of the proxy: pick one at random from the pool
        pro_adr = random.choice(self.proxyList)
        print("USE PROXY -> " + pro_adr)
        request.meta['proxy'] = "http://" + pro_adr
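The free proxies above are unauthenticated. If a proxy requires HTTP basic authentication, setting request.meta['proxy'] alone is not enough; the request also needs a Proxy-Authorization header. A minimal sketch of that case, with placeholder credentials:

import base64


def use_authed_proxy(request, host_port, user, password):
    # Route the request through the proxy...
    request.meta['proxy'] = "http://" + host_port
    # ...and attach base64-encoded basic credentials for it
    token = base64.b64encode(('%s:%s' % (user, password)).encode()).decode()
    request.headers['Proxy-Authorization'] = 'Basic ' + token

A process_request implementation could call this helper in place of the last two lines above.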
3. Modify settings.py (note DOWNLOADER_MIDDLEWARES: it enables the two custom middlewares with priorities 1 and 100, and disables Scrapy's built-in UserAgentMiddleware by mapping it to None so it cannot overwrite the randomized header):
# -*- coding: utf-8 -*-
BOT_NAME = 'ip_proxy_pool'

SPIDER_MODULES = ['ip_proxy_pool.spiders']
NEWSPIDER_MODULE = 'ip_proxy_pool.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    'ip_proxy_pool.pipelines.IpProxyPoolPipeline': 300,
}

# Delay between requests
DOWNLOAD_DELAY = 1

# Disable cookies
COOKIES_ENABLED = False

# Override the default request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html, application/xhtml+xml, application/xml',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Host': 'ip84.com',
    'Referer': 'http://ip84.com/',
    'X-XHR-Referer': 'http://ip84.com/',
}

# Enable the custom user-agent and proxy middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'ip_proxy_pool.useragent.UserAgent': 1,
    'ip_proxy_pool.proxymiddlewares.ProxyMiddleware': 100,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
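If you prefer the settings-driven variant sketched under step 1, the pool itself can also live here; USER_AGENT_LIST is the hypothetical custom setting that sketch reads, not a Scrapy built-in:

USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    # ... the remaining entries from useragent.py ...
]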
4. Start crawling.
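From the project root, run the project's spider; the spider name below (proxy_spider) is a placeholder for whatever name your spider actually defines:

scrapy crawl proxy_spider

Each request should then go out with a randomly chosen User-Agent (visible in the DEBUG log) and through a randomly chosen proxy (visible in the "USE PROXY ->" output).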