Scrapy Study Notes

Source: Internet  Editor: 程序博客网  Time: 2024/05/22 12:52

Create a Scrapy project
Define the Item to extract
Write a spider that crawls the site and extracts Items
Write an Item Pipeline to store the extracted Items (i.e. the data)
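
For the last step, an item pipeline is an ordinary Python class with a process_item() method. The following is a minimal sketch, not the tutorial's own code: the class name JsonLinesPipeline, the path parameter, and the items.jl file name are assumptions made for illustration.

```python
import json


class JsonLinesPipeline:
    """Write every scraped item to a JSON Lines file, one object per line."""

    def __init__(self, path="items.jl"):
        # `path` is a hypothetical parameter introduced for this sketch.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes: flush and close the file.
        self.file.close()

    def process_item(self, item, spider):
        # Serialize the item as one JSON line, then pass it on unchanged
        # so that any later pipeline stage still receives it.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

In a real project a class like this would be enabled through the ITEM_PIPELINES setting in settings.py.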

1. Model the item according to the data you need to obtain from dmoz.org

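2. What just happened?

Scrapy creates a scrapy.Request object for each URL in the spider's start_urls attribute and assigns the spider's parse method to each Request as its callback.

The Requests are scheduled and executed; each one produces a scrapy.http.Response object, which is passed back to the spider's parse() method.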

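3. Here are some example XPath expressions and what they mean:

/html/head/title: selects the <title> element inside the <head> tag of the HTML document
/html/head/title/text(): selects the text of the <title> element mentioned above
//td: selects all <td> elements
//div[@class="mine"]: selects all div elements that have the attribute class="mine"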

In [1]: response.xpath('//title')    # returns a Selector: the expression plus the matched <title>...</title> element
Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]

In [2]: response.xpath('//title').extract()    # returns the <title>...</title> element as a string
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: response.xpath('//title/text()')    # returns a Selector: the expression plus the title text
Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]

In [4]: response.xpath('//title/text()').extract()    # returns the title text as a string
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

In [5]: response.xpath('//title/text()').re('(\w+):')    # applies a regular expression to the title text
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
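
The result of the last call can be reproduced with the standard re module alone. This is an approximation of what .re() does for this pattern (run the regex over the extracted text and return the captured groups), not a description of Scrapy's internals.

```python
import re

# The title text shown in Out[4] above.
title = "Open Directory - Computers: Programming: Languages: Python: Books"

# (\w+): captures each run of word characters that is followed by a colon.
# "Directory" and "Books" are not followed by a colon, so they are skipped.
words = re.findall(r"(\w+):", title)
print(words)  # ['Computers', 'Programming', 'Languages', 'Python']
```

Writing the pattern as a raw string (r"...") avoids the invalid-escape warning that recent Python versions emit for '\w' in an ordinary string literal.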

0 0