Scrapy: Crawling Huaban Images with PhantomJS
Source: Internet  Editor: 程序博客网  Date: 2024/04/29 10:02
Create a new Scrapy project
(python35) ubuntu@ubuntu:~/scrapy_project$ scrapy startproject huaban
Add a spider
(python35) ubuntu@ubuntu:~/scrapy_project/huaban/huaban/spiders$ scrapy genspider huaban_pets huaban.com
The directory structure is as follows:
(python35) ubuntu@ubuntu:~/scrapy_project/huaban$ tree -I *.pyc
.
├── huaban
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   ├── settings.py
│   └── spiders
│       ├── huaban_pets.py
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg
Edit items.py
# -*- coding: utf-8 -*-
import scrapy


class HuabanItem(scrapy.Item):
    # URL of the full-size image
    img_url = scrapy.Field()
Edit huaban_pets.py
# -*- coding: utf-8 -*-
import scrapy

from huaban.items import HuabanItem


class HuabanPetsSpider(scrapy.Spider):
    name = 'huaban_pets'
    allowed_domains = ['huaban.com']
    start_urls = ['http://huaban.com/favorite/pets/']

    def parse(self, response):
        for img_src in response.xpath('//*[@id="waterfall"]/div/a/img/@src').extract():
            item = HuabanItem()
            # Example img_src:
            # //img.hb.aicdn.com/223816b7fee96e892d20932931b15f4c2f8d19b315735-wgi1w2_fw236
            # Dropping the trailing _fw236 (6 characters) gives the original image.
            item['img_url'] = 'http:' + img_src[:-6]
            yield item
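The slice img_src[:-6] in the spider assumes that every thumbnail URL ends in a six-character suffix such as _fw236. As a slightly more defensive sketch, the hypothetical helper below (to_original_url is not part of the original code) strips the suffix only when it matches the _fw<digits> pattern and adds a scheme only to protocol-relative URLs. The suffix format is an assumption based on the single example URL in the spider's comments.

```python
import re

# Assumed thumbnail suffix pattern, e.g. "_fw236"; inferred only from the
# example URL in the spider's comments.
_THUMB_SUFFIX = re.compile(r'_fw\d+$')


def to_original_url(img_src, scheme='http'):
    """Turn a thumbnail src into an absolute URL for the full-size image."""
    # Drop the thumbnail suffix if (and only if) it is present.
    url = _THUMB_SUFFIX.sub('', img_src)
    # Protocol-relative URLs ("//host/path") need an explicit scheme.
    if url.startswith('//'):
        url = scheme + ':' + url
    return url
```

In parse, item['img_url'] = to_original_url(img_src) could then stand in for the manual slice.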
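Write a downloader middleware that uses PhantomJS to fetch the rendered page source
Add the following to middlewares.py:
# -*- coding: utf-8 -*-
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


class JSPageMiddleware(object):
    def process_request(self, request, spider):
        if spider.name == 'huaban_pets':
            # cap[".page.setting.resourceTimeout"] = 180
            # cap["chrome.page.setting.loadImage"] = False
            dcap = dict(DesiredCapabilities.PHANTOMJS)
            # Do not load images; pages are fetched much faster this way.
            dcap["phantomjs.page.settings.loadImages"] = False
            # Adjust executable_path to the location of your PhantomJS binary.
            # Note: webdriver.PhantomJS is only available in Selenium 3.x and earlier.
            browser = webdriver.PhantomJS(
                executable_path=r'/home/ubuntu/scrapy_project/huabanphantomjs',
                desired_capabilities=dcap)
            try:
                browser.get(request.url)
                # Returning a response here makes Scrapy skip its own download.
                return HtmlResponse(url=browser.current_url,
                                    body=browser.page_source,
                                    encoding='utf-8',
                                    request=request)
            except Exception:
                print("get page failed!")
            finally:
                # Always shut the browser down, even if loading failed.
                browser.quit()
        else:
            # Other spiders: let Scrapy download the request as usual.
            return None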
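Add the following to pipelines.py to download the images:
# -*- coding: utf-8 -*-
import urllib.request


class HuabanmeinvPipeline(object):
    def process_item(self, item, spider):
        url = item['img_url']
        # Name the file after the last path segment of the URL.
        # The image directory must already exist, otherwise urlretrieve fails.
        urllib.request.urlretrieve(
            url,
            filename=r'/home/ubuntu/scrapy_project/huaban/image/%s.jpg' % url[url.rfind('/')+1:])
        return item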
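The pipeline writes into a fixed directory, and urlretrieve raises an error if that directory does not exist. A minimal sketch of one way to handle this, assuming a hypothetical helper named build_image_path (not in the original code): it creates the target directory on demand and derives the file name from the last path segment of the URL, following the same naming rule as the pipeline.

```python
import os


def build_image_path(url, image_dir):
    """Return the local .jpg path for an image URL, creating image_dir if needed."""
    # exist_ok=True makes this safe to call once per item.
    os.makedirs(image_dir, exist_ok=True)
    # Same naming rule as the pipeline: everything after the last '/'.
    name = url[url.rfind('/') + 1:]
    return os.path.join(image_dir, name + '.jpg')
```

Inside process_item, the returned path could be passed as the filename argument of urllib.request.urlretrieve.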
In settings.py, enable the middleware and pipeline added above and set the request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN',
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    #'huaban.middlewares.MyCustomDownloaderMiddleware': 543,
    'huaban.middlewares.JSPageMiddleware': 543,
}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'huaban.pipelines.HuabanmeinvPipeline': 300,
}
Start crawling
ubuntu@ubuntu:~/scrapy_project/huaban/huaban/spiders$ scrapy runspider huaban_pets.py
Once the crawl finishes, the downloaded images can be found in the /home/ubuntu/scrapy_project/huaban/image directory.