Tornado and Scrapy: A First Look
1. Tornado -> what's this
1.1 How to install
pip install tornado
1.2 Hello, world
import tornado.ioloop
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("Hello, world")

# map the root URL to the handler
application = tornado.web.Application([
    (r"/", MainHandler),
])

if __name__ == "__main__":
    application.listen(8888)  # bind to port 8888
    tornado.ioloop.IOLoop.instance().start()
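Save the script (the filename app.py below is arbitrary) and run it; the server listens on port 8888. A quick check from another terminal:

$ python app.py
$ curl http://localhost:8888/
Hello, world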
1.3 A more complex example
- URL routing
class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("You requested the main page")

class StoryHandler(tornado.web.RequestHandler):
    def get(self, story_id):
        self.write("You requested the story " + story_id)

application = tornado.web.Application([
    (r"/", MainHandler),
    (r"/story/([0-9]+)", StoryHandler),
])
Request URLs are matched against regular expressions and mapped to handler classes; capture groups (such as the story id above) are passed to the handler's method as arguments.
- Templates
template.html (the name must match the render() call below):
<html>
  <head>
    <title>{{ title }}</title>
  </head>
  <body>
    <ul>
      {% for item in items %}
        <li>{{ escape(item) }}</li>
      {% end %}
    </ul>
  </body>
</html>
handler.py
class MainHandler(tornado.web.RequestHandler):
    def get(self):
        items = ["Item 1", "Item 2", "Item 3"]
        self.render("template.html", title="My title", items=items)

Configuration:
import os
import tornado.ioloop
import tornado.web

app = tornado.web.Application(
    [
        (r"/", MainHandler),
    ],
    cookie_secret="__TODO:_GENERATE_YOUR_OWN_RANDOM_VALUE_HERE__",
    login_url="/auth/login",
    template_path=os.path.join(os.path.dirname(__file__), "templates"),
    static_path=os.path.join(os.path.dirname(__file__), "static"),
    xsrf_cookies=True,
)
app.listen(8080)
tornado.ioloop.IOLoop.instance().start()

With template_path set this way, the self.render("template.html", ...) call above looks for the file in the templates/ directory next to the script; cookie_secret is used to sign secure cookies, and xsrf_cookies=True enables cross-site request forgery protection.
More reference examples: https://github.com/facebook/tornado/demo
Reference 1: http://www.tornadoweb.org/en/stable/overview.html
2. Scrapy -> what's this
2.1 How to install
apt-get install scrapy
yum install scrapy
pip install scrapy  # upgrade to the latest version
2.2 How to use it
2.2.1 Once Scrapy is installed, the scrapy command is available. Run:
scrapy startproject <project-name>
Scrapy automatically creates a project directory for you, including the basic required files.
The structure is as follows:
<project-name>/              # project name
    scrapy.cfg
    <project-name>/          # the project's Python module
        __init__.py
        items.py             # module where Items are defined
        pipelines.py         # module where pipeline classes are implemented
        settings.py          # project configuration file
        spiders/             # directory holding the Spider implementations
            __init__.py
            ...
After you have implemented a spider module of your own under the spiders/ directory, change into the top-level <project-name> directory and run:
scrapy crawl <spider-name>
This runs the spider you implemented. Note: <spider-name> is the name attribute set inside the Spider module; for the example spider below it is 'example.com', so the command would be scrapy crawl example.com. Details follow:
2.2.2 Spider
Description: a Spider takes the given URLs as entry points, requests the pages, and then processes the page content.
The content is processed with Selectors:
from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from scrapy.http import Request
from myproject.items import MyItem

class MySpider(BaseSpider):
    name = 'example.com'  # the <spider-name> used with `scrapy crawl`
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        sel = Selector(response)
        # emit one item per <h3> heading on the page
        for h3 in sel.xpath('//h3').extract():
            yield MyItem(title=h3)
        # follow every link and parse it with this same method
        for url in sel.xpath('//a/@href').extract():
            yield Request(url, callback=self.parse)

2.2.3 Selector
Description: sel.css() selects page elements with CSS selectors.
You can run the command scrapy shell <url> to enter console mode and interactively test element selection.
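For example, a session might look like this (the URL is one of the start_urls from the spider above; the output shown is only illustrative, since the real result depends on the page):

$ scrapy shell 'http://www.example.com/1.html'
...
>>> sel.xpath('//h3/text()').extract()   # text of every <h3> on the page
[u'Heading one']
>>> sel.css('a::attr(href)').extract()   # all link targets
[u'/2.html', u'/3.html']

Inside a spider, the same selectors drive parse(); for example: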
def parse(self, response):
    sel = Selector(response)
    url = response.url
    if url.endswith("/"):
        url = url[:-1]                     # strip the trailing slash
    if url not in self.title:
        # self.title is a dict defined elsewhere on this spider
        self.title[url] = sel.css("#maininfo #info h1::text").extract()[0]
    chapters = sel.css(".box_con dd a::attr(href)").extract()
    for chapter in chapters:
        chapter = self.base_url + chapter  # base_url is a spider attribute
        print chapter
        yield Request(chapter, callback=self.chap_callback)
2.2.4 Item
Description: the entity class used to hold scraped data:

from scrapy.item import Item, Field

class Product(Item):
    name = Field()
    price = Field()
    stock = Field()
    last_updated = Field(serializer=str)
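The console snippets below assume an item instance has been created first, e.g. (values as in the Scrapy docs):

>>> product = Product(name='Desktop PC', price=1000)
>>> print product
Product(name='Desktop PC', price=1000)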
Getting field values:
>>> product['name']
Desktop PC
>>> product.get('name')
Desktop PC
>>> product.get('last_updated', 'not set')
not set
>>> 'last_updated' in product.fields # is last_updated a declared field?
True
>>> 'lala' in product.fields # is lala a declared field?
False
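Note the distinction: membership on product.fields tests whether a field is declared, while membership on the item itself tests whether it has been populated (example from the Scrapy docs):

>>> 'name' in product  # is name populated?
True
>>> 'last_updated' in product  # is last_updated populated?
False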
Setting field values:
>>> product['last_updated'] = 'today'
>>> product['last_updated']
today
>>> product['lala'] = 'test' # setting unknown field
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'
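Items also convert cleanly to plain dicts with dict(item), which is exactly what the JsonWriterPipeline below relies on. Continuing the session above (key order may vary):

>>> dict(product)
{'name': 'Desktop PC', 'price': 1000, 'last_updated': 'today'}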
2.2.5 Pipeline
Description: used to process the Items returned by a Spider.
import json

class JsonWriterPipeline(object):
    def __init__(self):
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        # write each item as one JSON object per line (JSON Lines format)
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
Enabling pipelines: in settings.py, declare which pipeline classes should process Items; the number after each class is its priority, and lower values run earlier in the chain.
ITEM_PIPELINES = {
    'myproject.pipeline.PricePipeline': 300,
    'myproject.pipeline.JsonWriterPipeline': 800,
}
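PricePipeline is enabled above but never shown in this post. A minimal sketch of what such a pipeline might look like, following the pattern in the Scrapy docs (the VAT adjustment and its factor are purely illustrative):

from scrapy.exceptions import DropItem

class PricePipeline(object):
    vat_factor = 1.15  # illustrative price adjustment

    def process_item(self, item, spider):
        if item.get('price'):
            item['price'] = item['price'] * self.vat_factor
            return item
        # items without a price are dropped from further processing
        raise DropItem("Missing price in %s" % item)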
Reference 2: http://doc.scrapy.org/en/latest/intro/tutorial.html