Scrapy in practice (4): logging in to 51job and continuing the crawl with cookies
Source: Internet · Editor: 程序博客网 · Date: 2024/05/29
The code in this article is for learning purposes only; if it infringes any rights, please contact the author for removal. Thanks.
This article walks through a Scrapy spider to show how to log in to a site and then keep crawling with the post-login cookies. The login username and password are masked as XXX.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http.request import Request
from scrapy.linkextractors import LinkExtractor


def add_cookie(r):
    # Tag every extracted request with the same cookiejar so it
    # reuses the session established at login.
    r.meta.update(cookiejar=1)
    return r.replace(meta=r.meta)


class ExampleSpider(CrawlSpider):
    name = "example1"
    rules = (
        Rule(LinkExtractor(allow='ResumeViewFolder'),
             process_request=add_cookie,
             callback='parse_one_candidate', follow=True),
        Rule(LinkExtractor(allow='ehire.51job.com'),
             process_request=add_cookie, follow=True),
    )

    def start_requests(self):
        yield Request('http://ehire.51job.com/MainLogin.aspx',
                      callback=self.parse_login_page)

    def parse_login_page(self, response):
        # Copy the hidden fields of the login form into the POST body.
        formdata = {}
        hidden_keys = ['hidLangType', 'hidAccessKey', 'hidEhireGuid',
                       'hidRetUrl', 'fksc', '__VIEWSTATE']
        for key in hidden_keys:
            css_value = "#" + key + "::attr(value)"
            try:
                formdata[key] = response.css(css_value).extract()[0]
            except Exception as e:
                print("form value err", css_value, e)
                formdata[key] = ''
        # Credentials are masked as xxxx.
        formdata['txtMemberNameCN'] = 'xxxx'
        formdata['txtUserNameCN'] = 'xxxx'
        formdata['txtPasswordCN'] = 'xxxx'
        formdata['ctmName'] = 'xxxx'
        formdata['userName'] = 'xxxx'
        formdata['password'] = 'xxxx'
        formdata['checkCode'] = ''
        formdata['oldAccessKey'] = formdata['hidAccessKey']
        formdata['langtype'] = formdata['hidLangType']
        formdata['isRememberMe'] = 'false'
        formdata['sc'] = formdata['fksc']
        formdata['ec'] = formdata['hidEhireGuid']
        formdata['returl'] = ''
        formdata['referrurl'] = ''
        return [
            scrapy.FormRequest(
                "https://ehirelogin.51job.com/Member/UserLogin.aspx?",
                formdata=formdata,
                meta={'cookiejar': 1},
                callback=self.login_in)
        ]

    def login_in(self, response):
        self.record2file(response)
        # Feed the post-login page back through the CrawlSpider rules.
        for request in self._requests_to_follow(response):
            yield request

    def record2file(self, response):
        # Save the post-login page so the login result can be inspected.
        with open('./login.html', 'wb') as f:
            f.write(response.body)

    def parse_one_candidate(self, response):
        pass
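The crux of parse_login_page is lifting the hidden form fields (such as __VIEWSTATE) out of the login page and replaying them in the POST body. Outside of Scrapy, the same extraction can be sketched with the standard library's html.parser; the HTML sample and the HiddenInputParser class below are illustrative stand-ins, not the spider's actual code:

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name -> value for <input type="hidden"> fields."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        a = dict(attrs)
        # Only hidden inputs need to be replayed in the login POST.
        if a.get('type') == 'hidden' and 'name' in a:
            self.fields[a['name']] = a.get('value', '')


# A toy login form with two hidden fields and one visible one.
html = '''
<form>
  <input type="hidden" name="__VIEWSTATE" value="abc123"/>
  <input type="hidden" name="hidLangType" value="zh"/>
  <input type="text" name="userName"/>
</form>
'''

parser = HiddenInputParser()
parser.feed(html)
print(parser.fields)  # only the two hidden fields are collected
```

Scrapy's own FormRequest.from_response does this hidden-field copying automatically, which is a simpler route when the form's field names are stable.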