Scrapy Crawler (Part 2): A Crawler with Custom Items and Proxy Access

Source: Internet. Editor: 程序博客网. Time: 2024/05/07 21:32

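Preface

In Scrapy Crawler (Part 1): Your First Scrapy Crawler, we wrote the simplest possible crawler, but it could not save the content of the pages it visited.
This post uses an item pipeline to save page content and adds proxy access.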

Item

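So far, each call to parse() has returned a list through a yield statement. We can instead define a custom Item class and have parse() return a list of Items.
First, create a new file item.py and define a custom Item in it (a project generated by scrapy startproject already contains an items.py for this purpose):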

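import scrapy


class CollectItem(scrapy.Item):
    # Each field must be declared with scrapy.Field() before it can be assigned.
    news_id = scrapy.Field()
    language = scrapy.Field()
    request_url = scrapy.Field()
    title = scrapy.Field()
    classification = scrapy.Field()
    body = scrapy.Field()
    pub_time = scrapy.Field()  # written by the pipeline's process_item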

The class must inherit from scrapy.Item, and every custom attribute must be set to scrapy.Field() for it to work. Then, in the parse() method, populate the item like this:

item = CollectItem()
item['news_id'] = self.cot
item['language'] = 'ind'
item['classification'] = response.xpath("//a[@class='label label-primary']/text()").extract()
yield item  # hand the populated item over to the item pipeline

Our custom item is then passed to the process_item method in pipelines.py.

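Pipeline

The Item Pipeline processes the items that spiders extract from web pages. Its main tasks are cleaning, validating, and storing data. After a spider parses a page, the resulting items are sent to the item pipeline, where several components process them in a defined order.
The pipeline is defined as follows: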

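class LearnscrapyPipeline(object):
    def process_item(self, item, spider):
        filename = "/home/lhn/Desktop/json/" + str(item['news_id']) + ".json"
        # Text mode with an explicit encoding, so str values can be written on Python 3.
        with open(filename, "w", encoding="utf-8") as f:
            f.write('"news_id":' + str(item['news_id']) + '\n')
            # .get() avoids a KeyError when the spider did not fill in pub_time.
            f.write('"pub_time":' + ''.join(item.get('pub_time', [])) + '\n')
        return item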
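Note that the file written above has a .json extension but contains plain key-value lines. As a minimal sketch of an alternative, the variant below serializes the whole item with the standard json module so that each file is valid JSON. The class name JsonFilePipeline and the out_dir parameter are assumptions made for this illustration and are not part of the original project. The item only needs to be convertible with dict(), which holds for Scrapy Items and plain dicts.

```python
import json
import os


class JsonFilePipeline:
    """Hypothetical pipeline: writes each item to <out_dir>/<news_id>.json."""

    def __init__(self, out_dir="json_output"):
        # out_dir is an assumed parameter; the original post hard-codes a path.
        self.out_dir = out_dir

    def process_item(self, item, spider):
        os.makedirs(self.out_dir, exist_ok=True)
        filename = os.path.join(self.out_dir, str(item["news_id"]) + ".json")
        with open(filename, "w", encoding="utf-8") as f:
            # dict(item) works for both scrapy.Item instances and plain dicts.
            json.dump(dict(item), f, ensure_ascii=False, indent=2)
        # Return the item so that later pipelines can keep processing it.
        return item
```

Because the whole item is serialized, fields such as title or body end up in the file without an extra write call per field.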

Finally, add the following to settings.py:

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'learnScrapy.pipelines.LearnscrapyPipeline': 300,
}
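The number 300 is the pipeline's priority: Scrapy conventionally uses values from 0 to 1000 and runs pipelines with lower numbers first. The following is a simplified, standalone simulation of that ordering, not Scrapy's actual implementation. StripTitlePipeline, RequireTitlePipeline, run_pipelines and the local DropItem class are hypothetical names used only for illustration (in a real project, DropItem comes from scrapy.exceptions).

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem, so the sketch runs without Scrapy."""


class StripTitlePipeline:
    def process_item(self, item, spider):
        # Cleaning step: remove surrounding whitespace from the title.
        item["title"] = item.get("title", "").strip()
        return item


class RequireTitlePipeline:
    def process_item(self, item, spider):
        # Validation step: reject items whose title is empty.
        if not item["title"]:
            raise DropItem("missing title")
        return item


# Same shape as ITEM_PIPELINES, but keyed by class instead of dotted path.
PIPELINES = {RequireTitlePipeline: 400, StripTitlePipeline: 300}


def run_pipelines(item, spider=None):
    """Pass the item through each pipeline, lowest priority number first."""
    for cls, _priority in sorted(PIPELINES.items(), key=lambda kv: kv[1]):
        try:
            item = cls().process_item(item, spider)
        except DropItem:
            return None  # a dropped item is not processed any further
    return item
```

Here the cleaning step (300) runs before the validation step (400), so a title made only of spaces is stripped first and then dropped.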

That is all it takes to download and save page content.

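Proxy Access

Some websites can only be crawled through a proxy server. In Scrapy, proxying is implemented mainly with a custom downloader middleware.
Create a new file named proxymiddleware.py: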

import random


class ProxyMiddleware(object):
    proxyList = [
        '127.0.0.1:393',
    ]

    def process_request(self, request, spider):
        # Set the location of the proxy
        pro_adr = random.choice(self.proxyList)
        print("USE PROXY -> " + pro_adr)
        request.meta['proxy'] = "http://" + pro_adr
        request.headers['User-Agent'] = 'Mozilla/4.0 (compatible; MSIE 5.5; windows NT)'
        # Return None so Scrapy's downloader fetches the page through the proxy.
        # Fetching the URL here with requests and returning an HtmlResponse
        # would skip the downloader, and the proxy setting would go unused.
        return None

We can add any number of addresses to proxyList; the method picks one of them at random for each request.
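Hard-coding addresses in the class means editing code whenever the proxy pool changes. A possible variation, sketched below, reads the list from the project settings through Scrapy's from_crawler hook. The setting name PROXY_LIST and the class name SettingsProxyMiddleware are assumptions made for this example; they are not Scrapy built-ins.

```python
import random


class SettingsProxyMiddleware:
    """Hypothetical middleware: picks a random proxy from a PROXY_LIST setting."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is an assumed custom setting, e.g. ['127.0.0.1:393'].
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        if not self.proxies:
            return None  # no proxies configured: leave the request untouched
        request.meta["proxy"] = "http://" + random.choice(self.proxies)
        # Returning None lets Scrapy continue downloading the request normally.
        return None
```

It would be registered in DOWNLOADER_MIDDLEWARES in the same way as ProxyMiddleware, with a line such as PROXY_LIST = ['127.0.0.1:393'] added to settings.py.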
Finally, add one more piece of configuration to settings.py:

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'learnScrapy.proxymiddleware.ProxyMiddleware': 100,
}

With that, requests are sent through the proxy.
