python3 Web Crawler (5): Using User-Agent in Scrapy
Environment: Python 3.4 (Windows 7 and Ubuntu)
Framework: Scrapy
It has been a while since my last post. Articles like this one already exist online, so I wondered whether writing my own would add anything, but I treat these posts as a record of my own progress, so I still write one now and then. You may have noticed that the development environment has grown from Win7 alone to Ubuntu plus Win7; that is because I have recently been programming on Linux, which feels much the same apart from some differences when installing packages. Today let's go over the two common ways of adding a User-Agent in Scrapy (there are actually three, but I won't go into the third). Let me explain step by step:
(1) Set the User-Agent in the Scrapy project's settings.py file:
settings.py already contains a USER_AGENT option, so we can set the User-Agent there directly:
import random

user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
]

UA = random.choice(user_agent_list)
USER_AGENT = UA
This way, every request carries a User-Agent once the crawler starts. Note that random.choice runs only once, when the settings are loaded, so the entire crawl uses the same single randomly chosen User-Agent.
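To see why this approach yields only one User-Agent per crawl, here is a minimal stand-alone sketch (the short "UA-n" strings are placeholders for real browser identifiers, and the loop stands in for several requests; no Scrapy is involved):

```python
import random

user_agent_list = ["UA-1", "UA-2", "UA-3"]

# settings.py style: random.choice runs once, at settings-load time,
# so every request in the crawl shares this single value
USER_AGENT = random.choice(user_agent_list)

# simulate five requests: each one reuses the same USER_AGENT
chosen = [USER_AGENT for _ in range(5)]
print(len(set(chosen)))  # always 1: the same UA for the whole crawl
```

If you want a different User-Agent on every request, you need the middleware approach below.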
(2) Set the User-Agent in the project's middlewares.py file (i.e., by writing a downloader middleware):
We can also set the User-Agent in middlewares.py. First, write a class in middlewares.py; it can be named UserAgentMiddleware. The code looks roughly like this:
import random

class UserAgentMiddleware(object):
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]

    def process_request(self, request, spider):
        # a fresh choice for every outgoing request
        UA = random.choice(self.user_agent_list)
        request.headers['User-Agent'] = UA
That completes the middleware. Next we have to enable it, which is important: a middleware that is not registered never runs, so writing it would have been pointless. Enable it in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.UserAgentMiddleware': 543,
}

(Here myproject is a placeholder for your project's package name; the path must point at the middlewares.py inside your own project, not inside the scrapy package itself.)
Now a User-Agent is freshly picked from the list each time a request is sent to a site.
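The middleware's per-request behaviour can be checked without running a crawl, using a stub in place of Scrapy's Request object (FakeRequest and the short "UA-n" strings below are stand-ins for illustration only):

```python
import random

# stand-in for scrapy's Request: all the middleware touches is .headers
class FakeRequest:
    def __init__(self):
        self.headers = {}

user_agent_list = ["UA-1", "UA-2", "UA-3"]

class UserAgentMiddleware(object):
    def process_request(self, request, spider):
        # unlike the settings.py approach, random.choice runs per request
        request.headers['User-Agent'] = random.choice(user_agent_list)

mw = UserAgentMiddleware()
req = FakeRequest()
mw.process_request(req, spider=None)
print(req.headers['User-Agent'] in user_agent_list)  # True
```

Because the choice happens inside process_request, successive requests can (and over time will) carry different User-Agent strings.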
(3) Add the User-Agent directly in the spider code itself (not recommended, since it makes the code quite ugly, so it is not covered here).
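For completeness, this third approach amounts to passing a headers argument when building each request inside the spider. A rough stand-alone sketch, where FakeRequest mimics only the headers= keyword of scrapy.Request and the "UA-n" strings are placeholders:

```python
import random

user_agent_list = ["UA-1", "UA-2", "UA-3"]

# stand-in for scrapy.Request; in a real spider you would write
#   yield scrapy.Request(url, headers={'User-Agent': ...})
class FakeRequest:
    def __init__(self, url, headers=None):
        self.url = url
        self.headers = headers or {}

req = FakeRequest("http://example.com",
                  headers={'User-Agent': random.choice(user_agent_list)})
print(req.headers['User-Agent'] in user_agent_list)  # True
```

The drawback is exactly the one the text mentions: every place that builds a request has to repeat this header-setting code, which is why the middleware approach is cleaner.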
That concludes this introduction to setting the User-Agent. Thank you for reading; if you have questions, feel free to message or comment. In the next post I plan to work through a few examples of how to fetch dynamic data.