Scraping Girl Photos from a Stock-Image Site

Source: Internet  Editor: 程序博客网  Time: 2024/05/17 09:16

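Over the past few days I got the urge to dig into scrapy and test how fast it crawls on Linux, so I picked meizitu to practice on (I had crawled it before). However, the links I collected raised errors when the images were parsed and downloaded, so I switched to a stock-image site instead!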

Without further ado, here is the code:

# -*- coding: utf-8 -*-
"""
Created on Mon Nov 21 23:14:09 2016

@author: alis
"""
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from MeiZi.items import MeiziItem
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import urllib

sys.stdout = open('urls.txt', 'w')  # redirect everything printed into urls.txt
b = '/media/alis/个人文件资料/Spider/MeiZi/photo/'  # directory the images would be saved to
x = 0  # running counter used to number the images


class MeiZiSpider(CrawlSpider):
    name = "meizi"
    allowed_domains = ["tooopen.com"]
    '''start_urls=["http://www.meizitu.com/a/xinggan.html",
                "http://www.meizitu.com/a/sifang.html",
                "http://www.meizitu.com/a/qingchun.html",
                "http://www.meizitu.com/a/meizi.html",
                "http://www.meizitu.com/a/xiaoqingxin.htm",
                "http://www.meizitu.com/a/nvshen.html",
                "http://www.meizitu.com/a/qizhi.html",
                "http://www.meizitu.com/a/mote.html",
                "http://www.meizitu.com/a/bijini.html",
                "http://www.meizitu.com/a/wangluo.html"
                ]'''
    start_urls = ['http://www.tooopen.com/img/88.aspx']
    rules = [
        # Listing pages: followed, no callback
        Rule(SgmlLinkExtractor(allow=(r'http://www.tooopen.com/img/88_(\d+)_(\d+)_(\d+).aspx'))),
        #Rule(SgmlLinkExtractor(allow=(r'http://www.meizitu.com/a/meizi_\d+_\d+.html' ))),
        # Detail pages: passed to parse_item
        Rule(SgmlLinkExtractor(allow=(r'http://www.tooopen.com/view/(\d+).html')), callback="parse_item"),
    ]

    def parse_item(self, response):
        global x
        sel = Selector(response)
        # Item=MeiziItem()
        #print add
        image_urls = sel.xpath('//div[@class="hindendiv"]/a/@data-img').extract()
        for url in image_urls:
            #print add
            print url
            x += 1
            #urllib.urlretrieve(url,b+'%d.jpg'%x)
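Explanation: the crawl starts from the initial page. From there, the listing pages turn out to follow the pattern
http://www.tooopen.com/img/88_(\d+)_(\d+)_(\d+).aspx
and finally, on the pages that hold the images we want (the view/(\d+).html pages), the parse_item callback is called to download the images!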
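To make the two steps above a bit more concrete, here is a minimal sketch of how the two URL patterns separate listing pages from image pages, and of what the commented-out download call might look like. It is written for Python 3 (the listing above is Python 2), the dots in the patterns are escaped, and the names classify, download_images and the fetch parameter are my own assumptions for illustration, not part of the original spider. The sample URLs in the usage note are made up to fit the patterns.

```python
import os
import re
from urllib.request import urlretrieve

# The two patterns from the spider's rules, with the dots escaped.
LISTING_RE = re.compile(r'http://www\.tooopen\.com/img/88_(\d+)_(\d+)_(\d+)\.aspx')
DETAIL_RE = re.compile(r'http://www\.tooopen\.com/view/(\d+)\.html')


def classify(url):
    """Return 'listing', 'detail' or None, mirroring the two rules (hypothetical helper)."""
    if LISTING_RE.search(url):
        return 'listing'  # only followed, no callback
    if DETAIL_RE.search(url):
        return 'detail'   # would be handed to parse_item
    return None


def download_images(image_urls, dest_dir, start=0, fetch=urlretrieve):
    """Save each URL as <n>.jpg under dest_dir, numbering from start + 1.

    `fetch` is an assumed injection point so the function can be tried
    without touching the network; by default it is urllib's urlretrieve.
    """
    saved = []
    for n, url in enumerate(image_urls, start=start + 1):
        path = os.path.join(dest_dir, '%d.jpg' % n)
        fetch(url, path)  # download url into path
        saved.append(path)
    return saved
```

For example, classify('http://www.tooopen.com/img/88_878_1_2.aspx') gives 'listing', while classify('http://www.tooopen.com/view/1234.html') gives 'detail'. Passing a counter through start instead of using a global keeps the numbering explicit, which is one possible alternative to the global x in the spider.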

The next post will cover how to parse and download the images from meizitu.
