Crawling 1024's images with Scrapy (hentai!)


1. Installing Scrapy and getting started

    First, install Scrapy. Installation is straightforward; my environment is Python 2.7 + Ubuntu 14.04, so it's simply:

pip install scrapy
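
If the install succeeded, the scrapy command should now be on your PATH; a quick sanity check:

scrapy version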

    I'd then suggest working through the getting-started tutorial in the official documentation (a Chinese translation is available) once; it gives you a solid first grasp of Scrapy, so I won't reproduce it here.

2. Creating the project

scrapy startproject hentai


This produces the following directory structure:

hentai/
    scrapy.cfg
    hentai/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...


3. Defining the Item

    Edit items.py as follows:

import scrapy

class HentaiItem(scrapy.Item):
    title = scrapy.Field()        # the thread title
    image_urls = scrapy.Field()   # URLs for the images pipeline to download
    images = scrapy.Field()       # download results, filled in by the pipeline
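
As a quick aside, an Item behaves like a dict but only accepts its declared fields; a minimal sketch (the values here are made up):

from hentai.items import HentaiItem

item = HentaiItem()
item['title'] = [u'some thread title']             # hypothetical value
item['image_urls'] = ['http://example.com/1.jpg']  # hypothetical value
print(item['title'])
# item['author'] = 'x' would raise KeyError: that field isn't declared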


4. Writing the Spider

    Create a hentai_spider.py file under hentai/hentai/spiders with the following code:

#!/usr/bin/python
# -*- coding:utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import Selector
from hentai.items import HentaiItem

class Hentai_Spider(Spider):
    name = "hentai"
    # Slow the crawl down to one request per second
    # download_delay = 1
    allowed_domains = ["********.com"]

    # Start URLs: the listing pages (page=0 through page=10)
    content_urls = []
    for i in range(0, 11):
        content_urls.append('********&page=' + str(i))
    start_urls = content_urls

    def parse(self, response):
        sel = Selector(response)
        this_url = str(response.url)
        # Listing page: follow every thread link and parse it recursively
        if 'fid=15&page=' in this_url:
            urls = sel.xpath('//h3/a/@href').extract()
            for url in urls:
                if 'htm_data/' in url:
                    yield Request('********/pw/' + url, callback=self.parse)
        # Thread page: extract the title and all image URLs
        else:
            item = HentaiItem()
            item['title'] = sel.xpath('//h1[@id="subject_tpc"]/text()').extract()
            images = sel.xpath('//div[@id="read_tpc"]/img/@src').extract()
            item['image_urls'] = [n.encode('utf-8') for n in images]
            yield item
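
Before hard-coding XPath expressions into the spider, it's handy to try them interactively with scrapy shell. A sketch, assuming a placeholder URL (the real addresses are redacted above) and the sel shortcut this generation of Scrapy exposes in the shell:

scrapy shell 'http://example.com/pw/htm_data/xxx.html'
>>> sel.xpath('//h1[@id="subject_tpc"]/text()').extract()
>>> sel.xpath('//div[@id="read_tpc"]/img/@src').extract()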


    Here are a few example XPath expressions (a runnable demonstration follows the list):

  • /html/head/title: selects the <title> element inside the document's <head>

  • /html/head/title/text(): selects the text of that <title> element

  • //td: selects all <td> elements

  • //div[@class="mine"]: selects all <div> elements with class="mine"
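
A self-contained sketch exercising these expressions against a toy document (the HTML below is invented purely for illustration):

# -*- coding: utf-8 -*-
from scrapy.selector import Selector

# Made-up document, just to exercise the expressions above
html = """
<html>
 <head><title>Demo</title></head>
 <body>
  <div class="mine">hello</div>
  <table><tr><td>cell</td></tr></table>
 </body>
</html>
"""
sel = Selector(text=html)
print(sel.xpath('/html/head/title').extract())         # [u'<title>Demo</title>']
print(sel.xpath('/html/head/title/text()').extract())  # [u'Demo']
print(sel.xpath('//td').extract())                     # every <td> element
print(sel.xpath('//div[@class="mine"]').extract())     # divs with class="mine"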

5. Writing the pipelines: a custom MyImagesPipeline

    The pipelines.py file looks like this:

# -*- coding: utf-8 -*-
import json
import codecs
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.http.request import Request

# Write each scraped title to hentai.json
class HentaiPipeline(object):
    def __init__(self):
        self.file = codecs.open('hentai.json', mode='wb', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(item['title']) + '\n'
        self.file.write(line.decode("unicode_escape"))
        return item

# Custom image pipeline: store each thread's images in a folder
# named after its title
class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Pass the title along in the request meta so file_path can use it
        for image_url in item['image_urls']:
            yield Request(image_url, meta={'name': item['title'][0]})

    def item_completed(self, results, item, info):
        return super(MyImagesPipeline, self).item_completed(results, item, info)

    def file_path(self, request, response=None, info=None):
        # Default paths look like 'full/<hash>.jpg'; swap 'full' for the title
        f_path = super(MyImagesPipeline, self).file_path(request, response, info)
        f_path = f_path.replace('full', request.meta['name'], 1)
        return f_path
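
To make the file_path rewrite concrete, here is a standalone sketch of the same string manipulation (the hash and title below are invented):

# The stock ImagesPipeline names files 'full/<sha1-of-url>.jpg';
# replacing the leading 'full' groups images by thread title.
default_path = 'full/0a1b2c3d4e5f60718293a4b5c6d7e8f901234567.jpg'  # hypothetical hash
title = u'some thread title'                                        # hypothetical title
print(default_path.replace('full', title, 1))
# -> some thread title/0a1b2c3d4e5f60718293a4b5c6d7e8f901234567.jpg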


    To enable the pipelines and set the default image storage path and expiry, add the following to settings.py:

# Lower numbers run first: MyImagesPipeline (1) downloads the images,
# then HentaiPipeline (2) records the title in hentai.json
ITEM_PIPELINES = {
    'hentai.pipelines.HentaiPipeline': 2,
    'hentai.pipelines.MyImagesPipeline': 1,
}
IMAGES_STORE = './images'   # root folder for downloaded images
IMAGES_EXPIRES = 90         # don't re-download files newer than 90 days


6. Running

    Enter the project directory, run scrapy crawl hentai, and you can start being a gentleman~
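
After a successful crawl, the output should look roughly like this (the title and hash are made up):

hentai.json
images/
    some thread title/
        0a1b2c3d4e5f60718293a4b5c6d7e8f901234567.jpg
        ...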

    If you hit the error "No module named mail.smtp", it's a problem with Scrapy's Twisted dependency; reinstalling Twisted fixes it:

sudo apt-get install python-twisted

Finally, the project source is at: https://github.com/MengZanZan/Hentai-scrapy
