Python Scrapy: passing parameters and identifiers to parse


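While writing a crawler for foursquare, I needed the parse function to save each page to a file named after the userid. The simplest way to do this is to append id=[userid] as a query parameter when building the initial links: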

start_urls = ['http://foursquare.com/user/%d?id=%d' % (n, n) for n in range(99660, 99665)]
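The foursquare server ignores this extra parameter, so the request still returns the correct page. Because the parameter stays in the link (even after a 301 redirect), parse can recover the userid from response.url: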

def parse(self, response):
    # Everything after "id=" in the final URL is the userid
    ID = response.url.strip().split("id=")[-1]
    with open(ID + ".txt", "w") as fw:
        ...

The program output is as follows:

...
2016-09-21 14:05:22 [scrapy] DEBUG: Redirecting (301) to <GET https://foursquare.com/kluoma?id=99660> from <GET https://foursquare.com/user/99660?id=99660>
2016-09-21 14:05:23 [scrapy] DEBUG: Crawled (200) <GET https://foursquare.com/kluoma?id=99660> (referer: None)
https://foursquare.com/kluoma?id=99660
2016-09-21 14:05:23 [scrapy] DEBUG: Crawled (200) <GET https://foursquare.com/user/99661?id=99661> (referer: None)
https://foursquare.com/user/99661?id=99661
2016-09-21 14:05:24 [scrapy] DEBUG: Redirecting (301) to <GET https://foursquare.com/user/99664?id=99664> from <GET http://foursquare.com/user/99664?id=99664>
2016-09-21 14:05:24 [scrapy] DEBUG: Redirecting (301) to <GET https://foursquare.com/lucasb?id=99663> from <GET https://foursquare.com/user/99663?id=99663>
2016-09-21 14:05:25 [scrapy] DEBUG: Crawled (200) <GET https://foursquare.com/user/99664?id=99664> (referer: None)
https://foursquare.com/user/99664?id=99664
2016-09-21 14:05:26 [scrapy] DEBUG: Crawled (200) <GET https://foursquare.com/user/99662?id=99662> (referer: None)
https://foursquare.com/user/99662?id=99662
2016-09-21 14:05:28 [scrapy] DEBUG: Crawled (200) <GET https://foursquare.com/lucasb?id=99663> (referer: None)
https://foursquare.com/lucasb?id=99663
...
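
The split on "id=" used in parse works for these links, but it would also pick up anything that follows the parameter (for example a second query parameter). As a minimal sketch of a stricter alternative, the userid can be read with the standard library's urllib.parse. The helper name extract_user_id is hypothetical and not part of the original spider; the sample URLs are taken from the log output.

```python
from urllib.parse import urlsplit, parse_qs


def extract_user_id(url):
    """Return the value of the id query parameter, or None if it is absent.

    Hypothetical helper, not part of the original spider.
    """
    query = urlsplit(url).query          # e.g. "id=99660"
    values = parse_qs(query).get("id")   # e.g. ["99660"], or None if missing
    return values[-1] if values else None


# One redirected URL and one direct URL from the log output.
print(extract_user_id("https://foursquare.com/kluoma?id=99660"))      # 99660
print(extract_user_id("https://foursquare.com/user/99661?id=99661"))  # 99661
```

Inside parse this would be called as extract_user_id(response.url). It returns None when the link carries no id parameter, so that case can be handled before opening the file.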

