Learning the Scrapy Framework
Installation
sudo pip3 install scrapy
Test whether the installation succeeded
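The original does not show a test command; one quick check is `scrapy version` on the command line, or, equivalently, verifying that the package is importable from Python. A minimal sketch of the latter:

```python
import importlib.util

# Stand-in for running `scrapy version`: check that the scrapy
# package can be found on the current Python path.
installed = importlib.util.find_spec("scrapy") is not None
print("scrapy is installed" if installed else "scrapy is NOT installed")
```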
Create a project
Create a spider
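The original omits the commands for these two steps. They would typically use Scrapy's CLI as sketched below; the project name `myscrapy` is inferred from the import in the spider code, and the spider name `myspider` from its `name` attribute, so treat the exact arguments as assumptions:

```shell
# Assumed commands; project and spider names inferred from the code below
scrapy startproject myscrapy
cd myscrapy
scrapy genspider myspider docs.scrapy.org
```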
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


# class MyscrapyItem(scrapy.Item):
#     # define the fields for your item here like:
#     # name = scrapy.Field()
#     pass


class MyItem(scrapy.Item):
    # h1 heading
    h1 = scrapy.Field()
    # h2 headings
    h2 = scrapy.Field()
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import scrapy

from myscrapy.items import MyItem


class MySpider(scrapy.Spider):
    name = 'myspider'
    # allowed_domains takes bare domains, not URLs
    allowed_domains = ['docs.scrapy.org']
    start_urls = ['https://docs.scrapy.org/en/latest/intro/tutorial.html']

    def parse(self, response):
        # response.body is bytes; response.text is the decoded str
        print('----------\n' + response.text + '----------\n')
        items = []
        # h1: there is only one
        h1 = response.xpath('//h1/text()').extract()[0]
        h1item = MyItem()
        h1item['h1'] = h1
        items.append(h1item)
        # h2: there are several
        h2_list = response.xpath('//div[@class="section"]/h2/text()').extract()
        for h2 in h2_list:
            h2item = MyItem()
            h2item['h2'] = h2
            items.append(h2item)
        return items
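The two XPath queries in `parse()` can be tried offline with the standard library. Scrapy evaluates full XPath via `parsel`; `xml.etree.ElementTree` only supports a subset, so `text()` is replaced by reading the element's `.text` attribute. The sample HTML below is illustrative, not the real tutorial page:

```python
import xml.etree.ElementTree as ET

# Minimal well-formed stand-in for the page the spider scrapes
html = """
<html><body>
  <h1>Scrapy Tutorial</h1>
  <div class="section"><h2>Creating a project</h2></div>
  <div class="section"><h2>Our first Spider</h2></div>
</body></html>
"""
root = ET.fromstring(html)
h1 = root.find(".//h1").text                                   # like //h1/text()
h2_list = [h2.text for h2 in root.findall('.//div[@class="section"]/h2')]
print(h1)        # Scrapy Tutorial
print(h2_list)   # ['Creating a project', 'Our first Spider']
```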
Run the spider and save the data as a JSON file
scrapy crawl myspider -o scrapy.json
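With `-o scrapy.json`, Scrapy's feed exporter serializes the returned items as a JSON array. Since each `MyItem` here has only one of its fields set, every exported object carries a single key. A sketch of the expected shape, with made-up titles for illustration:

```python
import json

# Hypothetical items of the kind MySpider returns: each object
# holds either the single h1 or one h2 (titles invented here).
items = [
    {"h1": "Scrapy Tutorial"},
    {"h2": "Creating a project"},
    {"h2": "Our first Spider"},
]
print(json.dumps(items, indent=2))
```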
If scrapy.json is empty after the run finishes, check the logs; they show a "connection refused" error:
Connection was refused by other side: 111: Connection refused.
Ways to troubleshoot this:
1. Set USER_AGENT in settings.py
2. Set DOWNLOAD_DELAY in settings.py
3. If the two steps above are not enough, run the spider with sudo
sudo scrapy crawl myspider -o scrapy.json
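Steps 1 and 2 correspond to the following settings.py entries. `USER_AGENT` and `DOWNLOAD_DELAY` are real Scrapy settings, but the values below are illustrative assumptions, not taken from the original post:

```python
# settings.py -- values are illustrative, adjust as needed
USER_AGENT = ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/58.0 Safari/537.36')  # browser-like UA
DOWNLOAD_DELAY = 2  # seconds between requests, to avoid hammering the server
```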
The crawl succeeds!