Why not use the Splash HTTP API directly?
Source: Internet  Published by: 福州大学网络  Editor: 程序博客网  Date: 2024/05/20 18:01
https://github.com/scrapy-plugins/scrapy-splash#why-not-use-the-splash-http-api-directly
The obvious alternative to scrapy-splash would be to send requests directly to the Splash HTTP API. Take a look at the example below and make sure to read the observations after it:
import json

import scrapy
from scrapy.http.headers import Headers

RENDER_HTML_URL = "http://127.0.0.1:8050/render.html"

class MySpider(scrapy.Spider):
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            body = json.dumps({"url": url, "wait": 0.5}, sort_keys=True)
            headers = Headers({'Content-Type': 'application/json'})
            yield scrapy.Request(RENDER_HTML_URL, self.parse, method="POST",
                                 body=body, headers=headers)

    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        # ...
It works and is easy enough, but there are some issues that you should be aware of:
- There is a bit of boilerplate.
- As seen by Scrapy, we're sending requests to RENDER_HTML_URL instead of the target URLs. This affects concurrency and politeness settings: CONCURRENT_REQUESTS_PER_DOMAIN, DOWNLOAD_DELAY, etc. could behave in unexpected ways, since delays and concurrency limits are no longer per-domain.
- As seen by Scrapy, response.url is the URL of the Splash server. scrapy-splash fixes it to be the URL of the requested page; the "real" URL is still available as response.real_url.
- Some options depend on each other: for example, if you use the timeout Splash option, you may want to set the download_timeout scrapy.Request meta key as well.
- It is easy to get it subtly wrong: for example, if you don't pass the sort_keys=True argument when preparing the JSON body, the binary POST body content can vary even if all keys and values are the same, which means the dupefilter and cache will work incorrectly.
- The default Scrapy duplication filter doesn't take Splash specifics into account. For example, if a URL is sent in a JSON POST request body, Scrapy will compute the request fingerprint without canonicalizing this URL.
- Splash Bad Request (HTTP 400) errors are hard to debug because by default response content is not displayed by Scrapy. SplashMiddleware logs the content of HTTP 400 Splash responses by default (this can be turned off by setting the SPLASH_LOG_400 = False option).
- Cookie handling is tedious to implement, and you can't use the Scrapy built-in Cookie middleware to handle cookies when working with Splash.
- Large Splash arguments which don't change with every request (e.g. lua_source) may take a lot of space when saved to Scrapy disk request queues. scrapy-splash provides a way to store such static parameters only once.
- Splash 2.1+ provides a way to save network traffic by caching large static arguments on the server, but it requires client support: the client should send proper save_args and load_args values and handle HTTP 498 responses.
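The sort_keys pitfall from the list above can be reproduced with the standard library alone. The sketch below is only an illustration: body_fingerprint is a hypothetical helper that hashes the raw body bytes, standing in for any byte-level comparison a dupefilter or cache might perform. It is not Scrapy's actual fingerprinting routine.

```python
import hashlib
import json


def body_fingerprint(body: str) -> str:
    """Hypothetical helper: hash the raw POST body bytes."""
    return hashlib.sha1(body.encode("utf-8")).hexdigest()


# Two dicts with identical keys and values, built in a different order.
args_a = {"url": "http://example.com", "wait": 0.5}
args_b = {"wait": 0.5, "url": "http://example.com"}

# Without sort_keys the serialized text follows insertion order,
# so the two bodies differ byte for byte.
unsorted_a = json.dumps(args_a)
unsorted_b = json.dumps(args_b)

# With sort_keys=True both dicts serialize to the same text.
sorted_a = json.dumps(args_a, sort_keys=True)
sorted_b = json.dumps(args_b, sort_keys=True)

print(body_fingerprint(unsorted_a) == body_fingerprint(unsorted_b))  # False
print(body_fingerprint(sorted_a) == body_fingerprint(sorted_b))      # True
```

Any component that compares requests by their body bytes would treat the two unsorted bodies as different requests, even though they ask Splash for exactly the same thing.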
scrapy-splash utilities handle such edge cases for you and reduce the boilerplate.
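To make one of these edge cases concrete, here is a minimal sketch of why a URL buried in a JSON body defeats naive deduplication. Everything in it is an assumption made for illustration: simple_canonicalize is a simplified stand-in (it only sorts query parameters and drops the fragment), and naive_key and splash_aware_key are hypothetical helpers, not the functions Scrapy or scrapy-splash actually use.

```python
import hashlib
import json
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def simple_canonicalize(url: str) -> str:
    """Simplified stand-in for URL canonicalization (an assumption):
    sort query parameters and drop the fragment."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path or "/", query, ""))


def naive_key(args: dict) -> str:
    """Hash the Splash arguments as-is, without looking inside them."""
    body = json.dumps(args, sort_keys=True)
    return hashlib.sha1(body.encode("utf-8")).hexdigest()


def splash_aware_key(args: dict) -> str:
    """Hash a copy of the arguments whose 'url' value is canonicalized.
    The arguments actually sent to Splash are left untouched."""
    normalized = dict(args)
    normalized["url"] = simple_canonicalize(args["url"])
    return naive_key(normalized)


# The same page, written with a different parameter order and a fragment.
args_1 = {"url": "http://example.com/foo?a=1&b=2", "wait": 0.5}
args_2 = {"url": "http://example.com/foo?b=2&a=1#top", "wait": 0.5}

print(naive_key(args_1) == naive_key(args_2))                # False
print(splash_aware_key(args_1) == splash_aware_key(args_2))  # True
```

The key is computed on a normalized copy rather than on the outgoing body, so deduplication improves without changing what Splash is asked to render.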