Getting Started with Scrapy (2): Creating a Scrapy Project

Source: Internet. Editor: 程序博客网. Time: 2024/05/20 21:57

Create a Scrapy project
Define the Items to extract
Write a spider that crawls the site and extracts Items
Write an Item Pipeline to store the extracted Items (that is, the data)

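Creating the project

Before you start crawling, you must create a new Scrapy project. Change into the directory where you intend to keep your code and run the following command: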

scrapy startproject tutorial

This command creates a tutorial directory with the following contents:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

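These files are:

scrapy.cfg: the project's configuration file.
tutorial/: the project's Python module. You will add your code here later.
tutorial/items.py: the project's items file.
tutorial/pipelines.py: the project's pipelines file.
tutorial/settings.py: the project's settings file.
tutorial/spiders/: the directory that holds the spider code.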
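
As a quick sanity check on the layout listed above, the following sketch reports which of those entries are missing from a project directory. The helper name missing_entries and the EXPECTED list are illustrative, not part of Scrapy; the list contains only the files named in this article, and a given Scrapy version may generate additional files.

```python
from pathlib import Path

# Paths listed in this article for a project created with
# `scrapy startproject tutorial` (other Scrapy versions may add more files).
EXPECTED = [
    "scrapy.cfg",
    "tutorial/__init__.py",
    "tutorial/items.py",
    "tutorial/pipelines.py",
    "tutorial/settings.py",
    "tutorial/spiders/__init__.py",
]


def missing_entries(project_root):
    """Return the expected relative paths that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    # Assumes the script is run from the directory where startproject was executed.
    missing = missing_entries("tutorial")
    print("layout OK" if not missing else f"missing: {missing}")
```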

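In the spiders directory generated by default, create heartsong_spider.py; this is where our spider goes. Since this is only an introduction, the spider simply downloads a site's home page, so that you can run it and get a feel for Scrapy.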

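import scrapy


class HeartsongSpider(scrapy.spiders.Spider):
    name = "heartsong"  # spider name, used when running the crawl
    allowed_domains = ["heartsong.top"]  # domains allowed to be crawled; pages on other domains are skipped
    start_urls = [
        "http://www.heartsong.top"  # start URL; this example crawls only this one page
    ]

    def parse(self, response):  # the actual crawling logic
        html = response.body  # response is what the site returned; body is raw bytes
        # The following lines save the HTML to a file
        filename = "index.html"
        with open(filename, "wb") as file:  # binary mode, because response.body is bytes
            file.write(html)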
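
One detail worth spelling out: in Scrapy, response.body is a bytes object, so under Python 3 the output file has to be opened in binary mode ("wb"), because a text-mode file rejects bytes. Here is a minimal sketch of the difference. The helper save_body and the byte string standing in for a real response are both made up for illustration.

```python
import os
import tempfile


def save_body(body, filename):
    """Write raw response bytes to filename in binary mode."""
    with open(filename, "wb") as f:
        f.write(body)


if __name__ == "__main__":
    body = b"<html><body>hello</body></html>"  # made-up stand-in for response.body
    path = os.path.join(tempfile.mkdtemp(), "index.html")
    save_body(body, path)

    # Writing the same bytes through a text-mode file raises TypeError on Python 3.
    try:
        with open(path, "w") as f:
            f.write(body)
    except TypeError as exc:
        print("text mode failed:", exc)
```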

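Note that this class cannot be written arbitrarily: name, allowed_domains and start_urls behave like "overridden" values. Scrapy looks these attributes up by name, so they cannot be renamed; more attributes of this kind are introduced later. The parse method overrides the parent class's method, and the main body of the spider usually goes here.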
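Now let's run it.
On the command line, go to the project's top-level directory (the one containing scrapy.cfg, i.e. tutorial if you followed the startproject command above) and run: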

scrapy crawl heartsong

The name heartsong here must match the name attribute of the spider class.
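
To make the role of these fixed attribute names concrete, here is a toy model in plain Python of a framework that selects a spider by its name attribute and then calls parse once per start URL. It is not Scrapy's implementation: MiniSpider, find_spider, run and the fetch argument are all hypothetical names invented for this sketch, and no network access takes place.

```python
class MiniSpider:
    """Stand-in for a framework base class (this is NOT scrapy.Spider)."""
    name = None
    start_urls = []

    def parse(self, response):
        raise NotImplementedError("subclasses override parse()")


class HeartsongSpider(MiniSpider):
    name = "heartsong"  # the value given on the command line must match this
    start_urls = ["http://www.heartsong.top"]

    def parse(self, response):
        # Overrides the parent method; here it just reports the body length.
        return len(response)


def find_spider(spider_classes, wanted_name):
    """Pick the class whose `name` attribute equals wanted_name."""
    for cls in spider_classes:
        if cls.name == wanted_name:
            return cls
    raise KeyError(f"Spider not found: {wanted_name}")


def run(spider_cls, fetch):
    """Call parse() once per start URL; `fetch` is a hypothetical downloader."""
    spider = spider_cls()
    return [spider.parse(fetch(url)) for url in spider.start_urls]


if __name__ == "__main__":
    def fake_fetch(url):
        return b"<html>hello</html>"  # canned body instead of a real download

    spider_cls = find_spider([HeartsongSpider], "heartsong")
    print(run(spider_cls, fake_fetch))
```

If the attribute were renamed (say, to spider_name), find_spider would no longer locate the class, which is the practical meaning of "the names cannot be changed".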

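The simplest way to store the scraped data is to use Feed exports:

scrapy crawl dmoz -o items.json

This command serializes the scraped items as JSON and writes them to items.json. Here dmoz is the spider name used in the official Scrapy tutorial; substitute the name of your own spider. Only items returned or yielded by the spider's callback are exported, so the heartsong spider above, whose parse yields nothing, would produce no items.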
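
The outline at the top of this article also lists an Item Pipeline as a way to store items. A pipeline is an ordinary Python class placed in tutorial/pipelines.py; the sketch below follows the method names Scrapy has long documented for pipelines (open_spider, close_spider, process_item), but check the documentation for your Scrapy version. The class name JsonWriterPipeline, the path attribute and the items.jl file name are choices made for this example, and items are assumed to be dicts or dict-like.

```python
import json


class JsonWriterPipeline:
    """Write each item as one JSON object per line (JSON Lines)."""

    path = "items.jl"  # output file; an arbitrary name chosen for this sketch

    def open_spider(self, spider):
        # Called when the spider starts: open the output file once.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called when the spider finishes: close the file.
        self.file.close()

    def process_item(self, item, spider):
        # Serialize the item and pass it on unchanged to any later pipeline.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

To enable such a pipeline, it is registered in tutorial/settings.py through the ITEM_PIPELINES setting, for example ITEM_PIPELINES = {"tutorial.pipelines.JsonWriterPipeline": 300}.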

0 0