Using a random user-agent when crawling pages with Scrapy
Source: Internet · Editor: 程序博客网 · Date: 2024/05/23 10:51
Randomizing the User-Agent: rotating the User-Agent header helps avoid some 403 and 400 errors, and nearly every crawler does it. Here we can override Scrapy's built-in middleware so that every request picks a random User-Agent, making the crawler less conspicuous.
Add the following to settings.py:
    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'guazi.middlewares.RotateUserAgentMiddleware': 543,
    }

(The original used the old `scrapy.contrib.downloadermiddleware.useragent` path, which was removed in modern Scrapy; the path above is its current location.)
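A quick aside on what these values mean: mapping the built-in `UserAgentMiddleware` to `None` disables it, and `543` is the custom middleware's priority (enabled middlewares run in ascending order of their number). A minimal sketch of how such a dict resolves to an ordered list; the `SomeOtherMiddleware` entry is hypothetical, added only to show the ordering:

```python
# Hedged sketch: how a DOWNLOADER_MIDDLEWARES dict resolves to an ordered
# list of enabled middlewares. Entries set to None are disabled; the rest
# run in ascending order of their priority number.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disabled
    'guazi.middlewares.RotateUserAgentMiddleware': 543,
    'guazi.middlewares.SomeOtherMiddleware': 100,  # hypothetical extra entry
}

enabled = sorted(
    (path for path, prio in DOWNLOADER_MIDDLEWARES.items() if prio is not None),
    key=DOWNLOADER_MIDDLEWARES.get,
)
print(enabled)
```

With the priorities above, `SomeOtherMiddleware` (100) sorts before `RotateUserAgentMiddleware` (543), and the disabled built-in is dropped.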
Add the following to middlewares.py:
    import random
    from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

    class RotateUserAgentMiddleware(UserAgentMiddleware):
        def __init__(self, user_agent=''):
            self.user_agent = user_agent

        def process_request(self, request, spider):
            # Randomly pick a User-Agent for this request
            ua = random.choice(self.user_agent_list)
            if ua:
                request.headers.setdefault('User-Agent', ua)

        # The default user_agent_list covers Chrome, IE, Firefox, Mozilla, Opera and Netscape.
        # More user-agent strings: http://www.useragentstring.com/pages/useragentstring.php
        user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
        ]

(Note: the original was missing a comma between the last two list entries, which would have silently concatenated the two strings into one user-agent.)
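To see the rotation logic in isolation, without running a Scrapy crawl, here is a minimal self-contained sketch; `FakeRequest` and the shortened UA list are stand-ins for Scrapy's `Request` object and the full list above:

```python
import random

# Stand-in for scrapy.http.Request: only the headers mapping matters here.
class FakeRequest:
    def __init__(self):
        self.headers = {}

# Shortened stand-in for the full user_agent_list above.
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]

def process_request(request):
    # Same logic as the middleware: pick one UA at random, but
    # setdefault() will not overwrite a User-Agent already present.
    ua = random.choice(USER_AGENT_LIST)
    if ua:
        request.headers.setdefault('User-Agent', ua)

req = FakeRequest()
process_request(req)
print(req.headers['User-Agent'] in USER_AGENT_LIST)  # True
```

Because the middleware uses `setdefault`, a spider that explicitly sets its own `User-Agent` header on a request keeps it; only requests without one get a random value.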