Two ways to switch the User-Agent in Scrapy

Source: Internet | Editor: 程序博客网 | Date: 2024/05/23 11:28

Method 1: configure the middleware in settings.

There are many versions of this approach online; I think this one is well written and reasonably up to date.

Some older versions still use scrapy.contrib.downloadermiddleware. Newer versions of Scrapy have dropped contrib; you can simply write scrapy.downloadermiddlewares instead.

Note:

To avoid overwriting the project's own middlewares.py file, I think it is better to put the new middleware in a separately named file, e.g. useragent.py (avoid a hyphen such as user-agent.py, since a module name containing a hyphen cannot be imported).

Add the following to settings.py:

DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in User-Agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # Enable your own (replace with your project and module names)
    'yourproject.yourmodule.RotateUserAgentMiddleware': 400,
}
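The RotateUserAgentMiddleware referenced in the settings above is not shown in the original post. A minimal sketch of what such a downloader middleware might look like (the class body and the shortened user-agent list here are my assumptions, not the author's code) could be:

```python
from random import choice


class RotateUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request.

    Illustrative sketch only; extend user_agent_list as needed.
    """

    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    ]

    def process_request(self, request, spider):
        # Pick a random user agent for every outgoing request
        request.headers['User-Agent'] = choice(self.user_agent_list)
        # Returning None lets Scrapy continue processing the request normally
        return None
```

Because the middleware only touches `request.headers`, it works for every request the spider schedules, including those generated by link-following callbacks.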


Method 2:

Override start_requests and add the headers there. By default, start_urls is consumed by the built-in start_requests method, so once you override start_requests yourself you no longer need start_urls.

import scrapy
from random import choice


def start_requests(self):
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
    header = {
        # The commented-out headers below are optional; leaving them out
        # did not seem to make any difference in my tests.
        # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        # 'Accept-Encoding': 'gzip, deflate, sdch',
        # 'Accept-Language': 'fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4',
        # 'Cache-Control': 'max-age=0',
        # 'Connection': 'keep-alive',
        # 'Upgrade-Insecure-Requests': '1',
        'User-Agent': choice(user_agent_list),  # random.choice picks a random element from the list
    }
    url = 'http://www.example.com'
    yield scrapy.Request(url, callback=self.parse, dont_filter=True, headers=header)


I have tried both methods, and both work.

