Scraping titles, links, and descriptions from dmoztools.net

Source: Internet  Editor: 程序博客网  Time: 2024/06/07 19:15

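The original http://www.dmoz.org now redirects to dmoztools.net, and the page HTML has changed as well, so the code in the reference tutorial cannot be copied as-is.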

For installing and using Scrapy, see http://docs.pythontab.com/scrapy/scrapy0.24/intro/tutorial.html.

Source code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from tutorial.items import DmozItem

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoztools.net"]
    start_urls = [
        "http://dmoztools.net/Computers/Programming/Languages/Python/Books/",
        "http://dmoztools.net/Computers/Programming/Languages/Python/Resources/"
    ]
    #http://dmoztools.net/Computers/Programming/Languages/Python/Resources/
    #http://dmoztools.net/Computers/Programming/Languages/Python/Books/

    def parse(self, response):
        """ demo 1
        #filename = response.url.split("/")[-2]
        #open(filename,'wb').write(response.body)
        """
        """ demo 2
        sel = Selector(response)
        sites = sel.xpath('//div[@class="title-and-desc"]')
        for site in sites:
            #title = site.xpath('a/div[@class="site-title"]/text()').extract()
            #link = site.xpath('a/@href').extract()
            desc = site.xpath('div[@class="site-descr "]/text()').extract()
            #print title
            #print link
            print desc
        """
        sel = Selector(response)
        sites = sel.xpath('//div[@class="title-and-desc"]')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('a/div[@class="site-title"]/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('div[@class="site-descr "]/text()').extract()
            items.append(item)
        return items
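To see what the three XPath expressions in parse() select without running a crawl, the sketch below applies equivalent paths to a small hand-written sample. It makes several assumptions: SAMPLE_HTML is a hypothetical, simplified stand-in for the dmoztools.net markup (the real pages contain far more), the extract_sites helper is invented for illustration, and the standard library's xml.etree.ElementTree is swapped in for Scrapy's Selector so the example runs with no extra packages. ElementTree supports only a subset of XPath and needs well-formed markup, so this is not a drop-in replacement for the spider.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample of a listing page (assumption, not the real HTML).
# Note the trailing space in class="site-descr ", matching the spider's XPath.
SAMPLE_HTML = """
<html><body>
<div class="title-and-desc">
  <a href="http://example.com/book-a"><div class="site-title">Book A</div></a>
  <div class="site-descr ">First description.</div>
</div>
<div class="title-and-desc">
  <a href="http://example.com/book-b"><div class="site-title">Book B</div></a>
  <div class="site-descr ">Second description.</div>
</div>
</body></html>
"""


def extract_sites(markup):
    """Return one dict per 'title-and-desc' block (illustrative helper)."""
    root = ET.fromstring(markup.strip())
    results = []
    # Same idea as sel.xpath('//div[@class="title-and-desc"]') in the spider.
    for site in root.iterfind(".//div[@class='title-and-desc']"):
        # Relative paths, evaluated against each block in turn.
        title = site.findtext("a/div[@class='site-title']", default="")
        link_node = site.find("a")
        desc = site.findtext("div[@class='site-descr ']", default="")
        results.append({
            "title": title.strip(),
            "link": link_node.get("href") if link_node is not None else None,
            "desc": desc.strip(),
        })
    return results


if __name__ == "__main__":
    for row in extract_sites(SAMPLE_HTML):
        print(row)
```

Unlike this sketch, Scrapy's .extract() returns a list of strings for every field, so each item in the spider holds lists rather than single values. The attribute test is an exact string match, which is why the trailing space in "site-descr " has to be kept. With the spider itself in place, the referenced tutorial runs it with scrapy crawl dmoz, adding -o items.json to save the scraped items.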
