Combining the Scrapy crawler framework with BeautifulSoup


Reference tutorial: https://github.com/yidao620c/core-scrapy
① Install Scrapy
pip install scrapy
Required system dependencies: python-lxml, python-dev, libffi-dev
Create the project in the target directory (the project name must match the bkgscrapy package imported in the code below):
$ scrapy startproject bkgscrapy
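This command generates a project skeleton. Depending on the Scrapy version, the layout looks roughly like this:

bkgscrapy/
    scrapy.cfg            # deploy configuration
    bkgscrapy/            # the project's Python module
        __init__.py
        items.py          # Item definitions (step ②)
        pipelines.py      # item pipelines (step ④)
        settings.py       # project settings (step ⑤)
        spiders/          # spiders live here (step ③)
            __init__.py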
② Define the Item
The Item holds the attributes to be saved; it is defined in items.py.
An Item is a container for the scraped data. It is used much like a Python dict, but adds a safety net: assigning to a field that was never declared raises an error, so a typo cannot silently create an undefined field.

import scrapy

class BkgscrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
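A quick illustration of that dict-like behavior (a hypothetical snippet, not part of the project files):

from bkgscrapy.items import BkgscrapyItem

item = BkgscrapyItem()
item['name'] = 'example'   # declared field: works like a normal dict entry
print(item['name'])        # -> 'example'
# item['nmae'] = 'oops'    # misspelled field: raises KeyError instead of
#                          # silently creating a new key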

③ Write the spider

import scrapy
from bs4 import BeautifulSoup
from bkgscrapy.items import BkgscrapyItem

class LocalSpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["meizitu.com"]   # domain only, no trailing slash
    start_urls = ['http://www.meizitu.com/']

    def parse(self, response):
        html_doc = response.body
        # BeautifulSoup detects the page encoding itself, so no manual
        # decode('utf-8') is needed here
        soup = BeautifulSoup(html_doc, 'lxml')
        item = BkgscrapyItem()
        # extract the text of the element with id="slider_name"
        item['name'] = soup.find(id='slider_name').get_text()
        return item
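parse() here returns a single item. When a page contains several matching elements, a spider would normally iterate and yield one item per element; a minimal sketch (the div class 'pic' is assumed purely for illustration, it is not taken from the real page):

    def parse(self, response):
        soup = BeautifulSoup(response.body, 'lxml')
        # 'pic' is a hypothetical class name used only for this example
        for tag in soup.find_all('div', class_='pic'):
            item = BkgscrapyItem()
            item['name'] = tag.get_text(strip=True)
            yield item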

④ Configure the pipeline
The pipelines.py file was generated automatically when the project was created; change it as follows:

class BkgscrapyPipeline(object):
    def process_item(self, item, spider):
        # write the extracted name to a text file
        with open('wea.txt', 'w+') as f:
            name = item['name']
            f.write('name:' + str(name) + '\n\n')
        return item
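Note that opening the file with 'w+' inside process_item truncates it on every item, so only the last item survives a crawl. A common alternative (a sketch using Scrapy's open_spider/close_spider hooks) opens the file once per crawl:

class BkgscrapyPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('wea.txt', 'w')

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        self.file.write('name:' + str(item['name']) + '\n\n')
        return item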

⑤ Configure and run
In settings.py, register the pipeline (the number is its run order from 0 to 1000; lower values run first):

ITEM_PIPELINES = {
    'bkgscrapy.pipelines.BkgscrapyPipeline': 1,
}

Run: $ scrapy crawl myspider
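Besides the command line, the crawl can also be started from a plain Python script via Scrapy's CrawlerProcess (a sketch; run it from the project root so the project settings are picked up):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('myspider')   # the spider is looked up by its name attribute
process.start()             # blocks until the crawl finishes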

Hands-on project: