Crawling 1024's images with Scrapy (hentai!)


1. Installing Scrapy and getting started

    First, install Scrapy. Installation is straightforward; my environment is Python 2.7 + Ubuntu 14.04, so it's simply:

pip install scrapy
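
If the install succeeded, the scrapy command should now be on your PATH; a quick sanity check:

scrapy version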

    I'd then suggest working through the getting-started tutorial in the official documentation (a Chinese translation is available) once; it gives you a solid first grasp of Scrapy, so I won't reproduce it here.

2. Creating the project

scrapy startproject hentai


This produces the following directory structure:

hentai/
    scrapy.cfg
    hentai/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...


3. Defining the Item

    Edit items.py as follows:

import scrapy

class HentaiItem(scrapy.Item):
    title = scrapy.Field()        # the thread title
    image_urls = scrapy.Field()   # URLs for the images pipeline to download
    images = scrapy.Field()       # download results, filled in by the pipeline
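
As a quick aside, an Item behaves like a dict but only accepts its declared fields; a minimal sketch (the values here are made up):

from hentai.items import HentaiItem

item = HentaiItem()
item['title'] = [u'some thread title']             # hypothetical value
item['image_urls'] = ['http://example.com/1.jpg']  # hypothetical value
print(item['title'])
# item['author'] = 'x' would raise KeyError: that field isn't declared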


4. Writing the Spider

    Create a hentai_spider.py file under hentai/hentai/spiders with the following code:

#!/usr/bin/python
# -*- coding:utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import Selector
from hentai.items import HentaiItem

class Hentai_Spider(Spider):
    name = "hentai"
    # Slow the crawl down to one request per second
    # download_delay = 1
    allowed_domains = ["********.com"]

    # Start URLs: the listing pages (page=0 through page=10)
    content_urls = []
    for i in range(0, 11):
        content_urls.append('********&page=' + str(i))
    start_urls = content_urls

    def parse(self, response):
        sel = Selector(response)
        this_url = str(response.url)
        # Listing page: follow every thread link and parse it recursively
        if 'fid=15&page=' in this_url:
            urls = sel.xpath('//h3/a/@href').extract()
            for url in urls:
                if 'htm_data/' in url:
                    yield Request('********/pw/' + url, callback=self.parse)
        # Thread page: extract the title and all image URLs
        else:
            item = HentaiItem()
            item['title'] = sel.xpath('//h1[@id="subject_tpc"]/text()').extract()
            images = sel.xpath('//div[@id="read_tpc"]/img/@src').extract()
            item['image_urls'] = [n.encode('utf-8') for n in images]
            yield item
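
Before hard-coding XPath expressions into the spider, it's handy to try them interactively with scrapy shell. A sketch, assuming a placeholder URL (the real addresses are redacted above) and the sel shortcut this generation of Scrapy exposes in the shell:

scrapy shell 'http://example.com/pw/htm_data/xxx.html'
>>> sel.xpath('//h1[@id="subject_tpc"]/text()').extract()
>>> sel.xpath('//div[@id="read_tpc"]/img/@src').extract()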


    Here are a few example XPath expressions (a runnable demonstration follows the list):

  • /html/head/title: selects the <title> element inside the document's <head>

  • /html/head/title/text(): selects the text of that <title> element

  • //td: selects all <td> elements

  • //div[@class="mine"]: selects all <div> elements with class="mine"
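
A self-contained sketch exercising these expressions against a toy document (the HTML below is invented purely for illustration):

# -*- coding: utf-8 -*-
from scrapy.selector import Selector

# Made-up document, just to exercise the expressions above
html = """
<html>
 <head><title>Demo</title></head>
 <body>
  <div class="mine">hello</div>
  <table><tr><td>cell</td></tr></table>
 </body>
</html>
"""
sel = Selector(text=html)
print(sel.xpath('/html/head/title').extract())         # [u'<title>Demo</title>']
print(sel.xpath('/html/head/title/text()').extract())  # [u'Demo']
print(sel.xpath('//td').extract())                     # every <td> element
print(sel.xpath('//div[@class="mine"]').extract())     # divs with class="mine"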

5. Writing the pipelines: a custom MyImagesPipeline

    The pipelines.py file looks like this:

# -*- coding: utf-8 -*-
import json
import codecs
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.http.request import Request

# Write each scraped title to hentai.json
class HentaiPipeline(object):
    def __init__(self):
        self.file = codecs.open('hentai.json', mode='wb', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(item['title']) + '\n'
        self.file.write(line.decode("unicode_escape"))
        return item

# Custom image pipeline: store each thread's images in a folder
# named after its title
class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Pass the title along in the request meta so file_path can use it
        for image_url in item['image_urls']:
            yield Request(image_url, meta={'name': item['title'][0]})

    def item_completed(self, results, item, info):
        return super(MyImagesPipeline, self).item_completed(results, item, info)

    def file_path(self, request, response=None, info=None):
        # Default paths look like 'full/<hash>.jpg'; swap 'full' for the title
        f_path = super(MyImagesPipeline, self).file_path(request, response, info)
        f_path = f_path.replace('full', request.meta['name'], 1)
        return f_path
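
To make the file_path rewrite concrete, here is a standalone sketch of the same string manipulation (the hash and title below are invented):

# The stock ImagesPipeline names files 'full/<sha1-of-url>.jpg';
# replacing the leading 'full' groups images by thread title.
default_path = 'full/0a1b2c3d4e5f60718293a4b5c6d7e8f901234567.jpg'  # hypothetical hash
title = u'some thread title'                                        # hypothetical title
print(default_path.replace('full', title, 1))
# -> some thread title/0a1b2c3d4e5f60718293a4b5c6d7e8f901234567.jpg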


    To enable the pipelines and set the default image storage path and expiry, add the following to settings.py:

# Lower numbers run first: MyImagesPipeline (1) downloads the images,
# then HentaiPipeline (2) records the title in hentai.json
ITEM_PIPELINES = {
    'hentai.pipelines.HentaiPipeline': 2,
    'hentai.pipelines.MyImagesPipeline': 1,
}
IMAGES_STORE = './images'   # root folder for downloaded images
IMAGES_EXPIRES = 90         # don't re-download files newer than 90 days


6. Running

    Enter the project directory, run scrapy crawl hentai, and you can start being a gentleman~
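
After a successful crawl, the output should look roughly like this (the title and hash are made up):

hentai.json
images/
    some thread title/
        0a1b2c3d4e5f60718293a4b5c6d7e8f901234567.jpg
        ...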

    If you hit the error "No module named mail.smtp", it's a problem with Scrapy's Twisted dependency; reinstalling Twisted fixes it:

sudo apt-get install python-twisted

Finally, the project source is at: https://github.com/MengZanZan/Hentai-scrapy
