Trying Out CrawlSpider

Source: Internet | Published by: 小众软件安卓 | Editor: 程序博客网 | Time: 2024/05/21 08:52

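Here is an analysis of the CrawlSpider source code.
1. URLs in start_urls are handled by parse, then _parse_response, then parse_start_url.
2. URLs matched by a Rule that has no callback are requested from _requests_to_follow, and their responses pass through _response_downloaded and _parse_response. _parse_response checks whether a callback is set; when the call comes from parse, the callback is parse_start_url. If the call does not come from parse and the Rule has no callback, the page is downloaded only so that its links can be followed, and nothing is scraped from it.
3. If the Rule does specify a callback, that callback is invoked and the page is scraped.
4. When crawling Kaola, I overrode parse_start_url, set the callback of the nextPage Rule to parse, and put the URL of the first page in start_urls. This way every list page also goes through parse and then parse_start_url, and the products on it are finally scraped by parse_item.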
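The call chain in points 1 to 4 can be traced with a toy model. This is a simplified sketch and not Scrapy's source: pages are plain strings, SITE is a made-up link graph, the seen set stands in for Scrapy's duplicate filter, and "requests" run immediately instead of being scheduled. Only the method names mirror CrawlSpider.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical link graph: page -> links a LinkExtractor would find on it.
SITE = {
    "list?page=1": ["list?page=2"],
    "list?page=2": ["list?page=3"],
    "list?page=3": [],
}


@dataclass
class Rule:
    callback: Optional[str] = None  # name of a spider method
    follow: Optional[bool] = None

    def __post_init__(self):
        # Same default as Scrapy: follow links only when no callback is given.
        if self.follow is None:
            self.follow = self.callback is None


class ToyCrawlSpider:
    rules = (Rule(),)  # a Rule without a callback

    def __init__(self):
        self.seen = set()   # every URL that was requested
        self.scraped = []   # URLs that actually reached a callback

    def crawl(self, start_url):
        self.seen.add(start_url)
        self.parse(start_url)
        return self.scraped

    def parse(self, url):
        # Entry point for start_urls: the callback is always parse_start_url.
        self._parse_response(url, self.parse_start_url, follow=True)

    def parse_start_url(self, url):
        self.scraped.append(url)

    def _requests_to_follow(self, url):
        for rule in self.rules:
            for link in SITE.get(url, []):
                if link not in self.seen:
                    self.seen.add(link)
                    self._response_downloaded(link, rule)

    def _response_downloaded(self, url, rule):
        callback = getattr(self, rule.callback) if rule.callback else None
        self._parse_response(url, callback, rule.follow)

    def _parse_response(self, url, callback, follow):
        if callback:  # no callback means nothing is scraped from this page
            callback(url)
        if follow:
            self._requests_to_follow(url)


class ParseCallbackSpider(ToyCrawlSpider):
    # The trick from point 4: route followed pages back through parse.
    rules = (Rule(callback="parse", follow=True),)
```

In this model, ToyCrawlSpider().crawl("list?page=1") requests all three pages but scrapes only the first, while ParseCallbackSpider().crawl("list?page=1") scrapes all three.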
Here is the spider code:

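from scrapy import Request, Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# `items` is the project's items module that defines KaolaItem; its import
# line was not part of the original post.


class KaolaSpider(CrawlSpider):
    name = "kaola"
    start_urls = [
        'http://www.kaola.com/search.html?key=coach&pageNo=1&type=2&pageSize=60&isStock=false&isSelfProduct=false&isDesc=true&brandId=&proIds=&isSearch=0&isPromote=false&backCategory=&country=&lowerPrice=-1&upperPrice=-1&changeContent=type',
    ]
    rules = (
        # Follow the "next page" link and send each list page back through
        # parse, which calls parse_start_url. This relies on CrawlSpider.parse
        # being the spider's entry point, as in the Scrapy version used here.
        # If your version no longer routes through parse, pointing the
        # callback at 'parse_start_url' should give the same result.
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="nextPage"]')),
             follow=True,
             callback='parse'),
    )

    def parse_start_url(self, response):
        """Handle one search-result page and request each product's detail page."""
        sel = Selector(response)
        goods = sel.xpath('//ul[@id="result"]/li')
        currPage = ''.join(sel.xpath('//span[@class="num"]/i/text()').extract()).strip()
        i = 1
        for good in goods:
            item = items.KaolaItem()
            # Overall rank across pages; 60 matches pageSize in the start URL.
            item['rank'] = i + (int(currPage) - 1) * 60
            i += 1
            item['currentPrice'] = ''.join(good.xpath('.//*/*/p[@class="price"]/span[1]/text()').extract()).strip()
            item['marketPrice'] = ''.join(good.xpath('.//*/*/p[@class="price"]/span[2]/del/text()').extract()).strip()
            tmp = good.xpath('.//div/div[@class="img"]/a/@href').extract()[0]
            # Turn a relative link into an absolute one.
            detailUrl = "http://www.kaola.com"
            if "http://" not in tmp:
                detailUrl = detailUrl + tmp
            else:
                detailUrl = tmp
            item['goodUrl'] = detailUrl
            # Pass the partially filled item on to the detail-page callback.
            r = Request(detailUrl, callback=self.parse_item)
            r.meta['item'] = item
            yield r

    def parse_item(self, response):
        """Fill in the remaining fields from the product detail page."""
        item = response.meta['item']
        sel = Selector(response)
        item['name'] = ''.join(sel.xpath('//dt[@class="product-title"]/text()').extract()).strip()
        item['commentCount'] = ''.join(sel.xpath('//b[@id="commentCounts"]/text()').extract()).strip()
        params = sel.xpath('//ul[@class="goods_parameter"]/li')
        for param in params:
            # On Python 3 the text is already str, so the original
            # .encode("utf-8") (a Python 2 habit) is dropped; keeping it would
            # make the `in` checks below fail on bytes.
            text = ''.join(param.xpath('.//text()').extract()).strip()
            if "商品品牌" in text:    # brand
                item['brand'] = text
            elif "产品类型" in text:  # product type
                item['proType'] = text
            elif "适用人群" in text:  # target group
                item['fitPeople'] = text
        yield item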
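Two small calculations in parse_start_url can be pulled out and checked on their own. The sketch below is an assumption-laden alternative, not the code the spider runs: absolute_url and overall_rank are hypothetical helper names, and the product paths in the usage note are made up. It uses urljoin from the standard library, which also copes with https:// and protocol-relative links that the `"http://" not in tmp` check would mishandle.

```python
from urllib.parse import urljoin

PAGE_SIZE = 60  # matches pageSize=60 in the search URL


def absolute_url(page_url, href):
    """Resolve href against the page it was found on."""
    return urljoin(page_url, href)


def overall_rank(page_no, index_on_page, page_size=PAGE_SIZE):
    """Rank across all result pages; index_on_page starts at 1."""
    return (page_no - 1) * page_size + index_on_page
```

For example, absolute_url("http://www.kaola.com/search.html?key=coach", "/product/123.html") gives "http://www.kaola.com/product/123.html", and overall_rank(2, 1) gives 61. Inside a Scrapy callback, response.urljoin(href) does the same resolution against response.url.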

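5. Another approach is to skip parse and parse_start_url altogether: issue the initial requests from start_requests with the callback set to parse_item, and set the callback for the URLs extracted by the Rule to parse_item as well. That unifies page handling. But how do you issue requests for several initial URLs from start_requests?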

0 0