Learning the Scrapy Framework
Installation
sudo pip3 install scrapy
Test whether the installation succeeded
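The original does not show a test command; one quick check is `scrapy version` on the command line, or, equivalently, verifying that the package is importable from Python. A minimal sketch of the latter:

```python
import importlib.util

# Stand-in for running `scrapy version`: check that the scrapy
# package can be found on the current Python path.
installed = importlib.util.find_spec("scrapy") is not None
print("scrapy is installed" if installed else "scrapy is NOT installed")
```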
Create a project
Create a spider
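The original omits the commands for these two steps. They would typically use Scrapy's CLI as sketched below; the project name `myscrapy` is inferred from the import in the spider code, and the spider name `myspider` from its `name` attribute, so treat the exact arguments as assumptions:

```shell
# Assumed commands; project and spider names inferred from the code below
scrapy startproject myscrapy
cd myscrapy
scrapy genspider myspider docs.scrapy.org
```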
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


# class MyscrapyItem(scrapy.Item):
#     # define the fields for your item here like:
#     # name = scrapy.Field()
#     pass


class MyItem(scrapy.Item):
    # h1 heading
    h1 = scrapy.Field()
    # h2 headings
    h2 = scrapy.Field()
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import scrapy

from myscrapy.items import MyItem


class MySpider(scrapy.Spider):
    name = 'myspider'
    # allowed_domains takes bare domains, not URLs
    allowed_domains = ['docs.scrapy.org']
    start_urls = ['https://docs.scrapy.org/en/latest/intro/tutorial.html']

    def parse(self, response):
        # response.body is bytes; response.text is the decoded str
        print('----------\n' + response.text + '----------\n')
        items = []
        # h1: there is only one
        h1 = response.xpath('//h1/text()').extract()[0]
        h1item = MyItem()
        h1item['h1'] = h1
        items.append(h1item)
        # h2: there are several
        h2_list = response.xpath('//div[@class="section"]/h2/text()').extract()
        for h2 in h2_list:
            h2item = MyItem()
            h2item['h2'] = h2
            items.append(h2item)
        return items
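The two XPath queries in `parse()` can be tried offline with the standard library. Scrapy evaluates full XPath via `parsel`; `xml.etree.ElementTree` only supports a subset, so `text()` is replaced by reading the element's `.text` attribute. The sample HTML below is illustrative, not the real tutorial page:

```python
import xml.etree.ElementTree as ET

# Minimal well-formed stand-in for the page the spider scrapes
html = """
<html><body>
  <h1>Scrapy Tutorial</h1>
  <div class="section"><h2>Creating a project</h2></div>
  <div class="section"><h2>Our first Spider</h2></div>
</body></html>
"""
root = ET.fromstring(html)
h1 = root.find(".//h1").text                                   # like //h1/text()
h2_list = [h2.text for h2 in root.findall('.//div[@class="section"]/h2')]
print(h1)        # Scrapy Tutorial
print(h2_list)   # ['Creating a project', 'Our first Spider']
```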
Run the spider and save the data as a JSON file
scrapy crawl myspider -o scrapy.json
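With `-o scrapy.json`, Scrapy's feed exporter serializes the returned items as a JSON array. Since each `MyItem` here has only one of its fields set, every exported object carries a single key. A sketch of the expected shape, with made-up titles for illustration:

```python
import json

# Hypothetical items of the kind MySpider returns: each object
# holds either the single h1 or one h2 (titles invented here).
items = [
    {"h1": "Scrapy Tutorial"},
    {"h2": "Creating a project"},
    {"h2": "Our first Spider"},
]
print(json.dumps(items, indent=2))
```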
If scrapy.json is empty after the run finishes, check the logs; they show a "connection refused" error:
Connection was refused by other side: 111: Connection refused.
Ways to troubleshoot this:
1. Set USER_AGENT in settings.py
2. Set DOWNLOAD_DELAY in settings.py
3. If the two steps above are not enough, run the spider with sudo
sudo scrapy crawl myspider -o scrapy.json
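Steps 1 and 2 correspond to the following settings.py entries. `USER_AGENT` and `DOWNLOAD_DELAY` are real Scrapy settings, but the values below are illustrative assumptions, not taken from the original post:

```python
# settings.py -- values are illustrative, adjust as needed
USER_AGENT = ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/58.0 Safari/537.36')  # browser-like UA
DOWNLOAD_DELAY = 2  # seconds between requests, to avoid hammering the server
```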
The crawl succeeds!