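Scrapy: scraping the name, price, and link of the 60 best-selling women's clothing items on Tmall, plus the shop name and shop link from each product page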

Source: Internet. Editor: 程序博客网. Date: 2024/04/29 10:14

Target URL: https://list.tmall.com/search_product.htm?spm=a220m.1000858.1000724.4.JnyabL&cat=50025135&sort=d&style=g&from=mallfp..pc_1_searchbutton&tmhkmain=0#J_Filter. The page looks like the screenshot below. From this page we want to scrape each product's price, name, and link.
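To see what the list URL carries, its query string can be split with the standard library. This is only a minimal sketch: the helper name list_params is made up for illustration, and reading cat as the women's clothing category and sort=d as "order by sales" is an assumption based on the title of this post, not on any Tmall documentation.

```python
from urllib.parse import urlsplit, parse_qs

LIST_URL = (
    "https://list.tmall.com/search_product.htm?spm=a220m.1000858.1000724.4.JnyabL&cat=50025135"
    "&sort=d&style=g&from=mallfp..pc_1_searchbutton&tmhkmain=0#J_Filter"
)


def list_params(url):
    """Return the query parameters of a list-page URL as a plain dict."""
    parts = urlsplit(url)
    # parse_qs returns a list per key; every key appears once here,
    # so keeping the first value is enough.
    return {key: values[0] for key, values in parse_qs(parts.query).items()}


params = list_params(LIST_URL)
# Assumed meaning: cat = category id, sort = ordering of the result list.
print(params["cat"], params["sort"])
```

The part after # (J_Filter) is a fragment. It is handled by the browser and is not sent to the server, so it has no effect on what the spider downloads.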



After opening a product link, the information we want is the shop name and the shop link, as shown in the screenshot below.
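The idea of reading the shop name and link out of one container can be tried offline on a small fragment. The markup below is hypothetical and heavily simplified (the real product page is much larger and may be structured differently), and it uses the standard library's XML parser rather than Scrapy's selectors, so treat it as a sketch of the lookup only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; not copied from a real Tmall page.
SAMPLE = """
<html><body>
  <div class="header"><strong>not the shop</strong></div>
  <div class="slogo">
    <a href="//example-shop.tmall.com"><strong>Example Shop</strong></a>
  </div>
</body></html>
"""


def shop_info(markup):
    """Return (shop name, shop href) from the link inside <div class="slogo">."""
    root = ET.fromstring(markup.strip())
    link = root.find(".//div[@class='slogo']/a")
    if link is None:
        return None
    # Search below the link only, so <strong> tags elsewhere are ignored.
    strong = link.find(".//strong")
    name = strong.text if strong is not None else None
    return name, link.get("href")


print(shop_info(SAMPLE))
```

Scoping the search to the link element matters here: the fragment contains an earlier <strong> in the header, and a document-wide search would return that one instead of the shop name.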




The project structure is shown in the figure:


Here is the code:

tm_goods.py

# -*- coding: utf-8 -*-
import scrapy

from topgoods.items import TopgoodsItem


class TmGoodsSpider(scrapy.Spider):
    name = "tm_goods"
    # allowed_domains takes bare domain names, not URLs; "tmall.com" also
    # covers subdomains such as list.tmall.com and detail.tmall.com.
    allowed_domains = ["tmall.com"]
    start_urls = (
        "http://list.tmall.com/search_product.htm?spm=a220m.1000858.1000724.4.JnyabL&cat=50025135"
        "&sort=d&style=g&from=mallfp..pc_1_searchbutton&tmhkmain=0#J_Filter",
    )
    count = 0

    def parse(self, response):
        TmGoodsSpider.count += 1
        # <div class="product-iWrap"> is the element all product blocks share.
        divs = response.xpath('//div[@id="J_ItemList"]/div[@class="product  "]/div')
        if not divs:
            self.log("List Page error __%s" % response.url)
        print("Goods numbers: %d" % len(divs))
        for div in divs:
            # Create the item inside the loop. Do not move this line outside it!
            item = TopgoodsItem()
            item["GOODS_PRICE"] = div.xpath('p[@class="productPrice"]/em/@title')[0].extract()
            item["GOODS_NAME"] = div.xpath('p[@class="productTitle"]/a/@title')[0].extract()
            goods_url = div.xpath('p[@class="productTitle"]/a/@href')[0].extract()
            # Links on the page may be protocol-relative ("//detail.tmall.com/...").
            item["GOODS_URL"] = goods_url if goods_url.startswith("http") else ("http:" + goods_url)
            yield scrapy.Request(url=item["GOODS_URL"], meta={"item": item},
                                 callback=self.parse_detail, dont_filter=True)
            print(item["GOODS_NAME"])

    def parse_detail(self, response):
        # Handles the page opened from the product link.
        item = response.meta["item"]
        divs = response.xpath('//div[@class="slogo"]/a')
        # Relative path (.//), so only a <strong> inside the shop link matches.
        item["SHOP_NAME"] = divs.xpath('.//strong/text()')[0].extract()
        shop_url = divs.xpath('@href')[0].extract()
        item["SHOP_URL"] = shop_url if shop_url.startswith("http") else ("http:" + shop_url)
        print(item["GOODS_PRICE"], item["GOODS_NAME"], item["GOODS_URL"],
              item["SHOP_NAME"], item["SHOP_URL"])
        yield item
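The spider turns protocol-relative links (those starting with //) into full URLs by string concatenation. An alternative, shown here as a sketch rather than as what the author used, is to resolve the link against the page it came from with urljoin from the standard library. The helper name absolute_url and the item id 123 are placeholders invented for this example.

```python
from urllib.parse import urljoin


def absolute_url(page_url, href):
    """Resolve href against the URL of the page it was found on."""
    return urljoin(page_url, href)


page = "http://list.tmall.com/search_product.htm?cat=50025135"

# A protocol-relative link inherits the scheme of the page.
print(absolute_url(page, "//detail.tmall.com/item.htm?id=123"))

# A link that is already absolute is returned unchanged.
print(absolute_url(page, "https://example-shop.tmall.com/"))
```

Inside a Scrapy callback, response.urljoin(href) performs the same resolution using the response's own URL as the base, which avoids having to check for a scheme by hand.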


items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TopgoodsItem(scrapy.Item):
    # define the fields for your item here like:
    GOODS_URL = scrapy.Field()
    GOODS_PRICE = scrapy.Field()
    GOODS_NAME = scrapy.Field()
    SHOP_NAME = scrapy.Field()
    SHOP_URL = scrapy.Field()
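The five fields above map naturally onto the columns of a table. As a rough illustration, the sketch below writes item-like dicts with those field names to CSV text using only the standard library. The function name items_to_csv and the sample row are invented for this example and are not scraped data.

```python
import csv
import io

# Column order is a choice made for this sketch; the names match items.py.
FIELDS = ["GOODS_NAME", "GOODS_PRICE", "GOODS_URL", "SHOP_NAME", "SHOP_URL"]


def items_to_csv(items):
    """Serialize item-like mappings to CSV text, one column per field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for item in items:
        writer.writerow(dict(item))
    return buffer.getvalue()


# Placeholder row with made-up values.
sample = [{
    "GOODS_NAME": "Sample dress",
    "GOODS_PRICE": "99.00",
    "GOODS_URL": "http://detail.tmall.com/item.htm?id=123",
    "SHOP_NAME": "Example Shop",
    "SHOP_URL": "http://example-shop.tmall.com",
}]
print(items_to_csv(sample))
```

In a real run this step is usually not needed by hand: Scrapy's feed exports can write the yielded items directly, for example with scrapy crawl tm_goods -o items.csv.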

The result is shown in the figure:

