Python crawler example: logging in to Douban and changing the signature
Source: Internet. Editor: 程序博客网. Time: 2024/05/14 19:19
Features
- Log in to Douban
- Change the signature
I. Login flow analysis
- Which URL receives the request
- What data is sent
- Which special header fields are involved
- How to handle the captcha
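1. Capturing the Douban login flow:
Douban login page: https://www.douban.com/accounts/login
Logging in with account xxxxxx and password xxxxxx, the browser's Network panel records the following: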
    General
    Request URL: https://accounts.douban.com/login
    Request Method: POST
    Status Code: 302 Moved Temporarily
    Remote Address: 211.147.4.32:443

    Response Headers
    Cache-Control: must-revalidate, no-cache, private
    Connection: keep-alive
    Content-Length: 65
    Content-Type: text/plain
    Date: Sat, 11 Jun 2016 02:48:18 GMT
    Expires: Sun, 1 Jan 2006 01:00:00 GMT
    Keep-Alive: timeout=30
    Location: https://www.douban.com
    P3P: CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
    Pragma: no-cache
    Server: dae
    Set-Cookie: ue="877646746@qq.com"; domain=.douban.com; expires=Sun, 11-Jun-2017 02:48:18 GMT; httponly
    Set-Cookie: dbcl2="146925119:/crpdV7NiKQ"; path=/; domain=.douban.com; httponly
    Set-Cookie: as="deleted"; max-age=0; domain=.douban.com; expires=Thu, 01-Jan-1970 00:00:00 GMT
    Strict-Transport-Security: max-age=15552000;
    X-Content-Type-Options: nosniff
    X-DAE-App: accounts
    X-DAE-Node: sindar15a
    X-Douban-Mobileapp: 0
    X-Frame-Options: SAMEORIGIN
    X-Xss-Protection: 1; mode=block

    Request Headers
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
    Accept-Encoding: gzip, deflate, br
    Accept-Language: en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4
    Cache-Control: max-age=0
    Connection: keep-alive
    Content-Length: 138
    Content-Type: application/x-www-form-urlencoded
    Cookie: bid=PHjUxRzrHNk; _vwo_uuid_v2=56A954C0557184C73BBB3DF5C8D30C1D|409597a19056d473ebee60708893e9b8; ap=1; ll="118221"; __utmt=1; ps=y; __utma=30149280.2019919087.1465354115.1465606255.1465612975.3; __utmb=30149280.2.10.1465612975; __utmc=30149280; __utmz=30149280.1465612975.3.3.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; dbcl2="146925119:KHEcD+nREDs"; ck=9R18
    Host: accounts.douban.com
    Origin: https://accounts.douban.com
    Referer: https://accounts.douban.com/login
    Upgrade-Insecure-Requests: 1
    User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36

    Form Data
    ck: 9R18
    source: None
    redir: https://www.douban.com
    form_email: 877646746@qq.com
    form_password: xxxxxx
    login: 登录
So to log in, we only need to reproduce the headers listed under Request Headers and the POST parameters listed under Form Data.
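As a rough illustration of that idea, the sketch below builds (but never sends) a POST that mirrors the captured form. It uses only the standard library so that it runs without network access. The function name `build_login_request`, the shortened User-Agent string, and the example address `user@example.com` are assumptions made for this sketch and do not come from the article; the URL, header names, and form field names are taken from the capture.

```python
from urllib.parse import urlencode, parse_qs
from urllib.request import Request

LOGIN_URL = "https://accounts.douban.com/login"  # "Request URL" in the capture


def build_login_request(email, password, ck=None):
    """Build (but do not send) a POST that mirrors the captured login form."""
    form = {
        "source": "None",
        "redir": "https://www.douban.com",
        "form_email": email,
        "form_password": password,
        "login": "登录",
    }
    if ck is not None:
        form["ck"] = ck
    headers = {
        # Shortened User-Agent; the capture shows the full Chrome string.
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)",
        "Referer": "https://accounts.douban.com/login",
        "Origin": "https://accounts.douban.com",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    # urlencode percent-encodes non-ASCII values as UTF-8, so the body is ASCII.
    body = urlencode(form).encode("ascii")
    return Request(LOGIN_URL, data=body, headers=headers, method="POST")


# Hypothetical credentials, used only to show the shape of the request.
req = build_login_request("user@example.com", "xxxxxx", ck="9R18")
print(req.get_method(), req.full_url)
print(parse_qs(req.data.decode("ascii")))
```

Passing `req` to `urllib.request.urlopen` would actually send it; the sketch stops short of that on purpose.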
If the login page shows an image captcha, we have to extract the captcha image and type in its text by hand (a semi-automatic approach).
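The manual step could look roughly like the sketch below: write the downloaded image bytes to a temporary file, ask a person to read it, and return what they typed. The helper name `ask_human_for_captcha`, the `.jpg` suffix, and the injectable `prompt` parameter are assumptions made for illustration; downloading the image itself is left out.

```python
import os
import tempfile


def ask_human_for_captcha(image_bytes, prompt=input):
    """Save the captcha image to a temp file and ask a person to read it."""
    # The .jpg suffix is a guess; the real image format is not shown in the article.
    fd, path = tempfile.mkstemp(suffix=".jpg")
    with os.fdopen(fd, "wb") as f:
        f.write(image_bytes)
    try:
        answer = prompt("Open %s and type the characters you see: " % path)
    finally:
        os.remove(path)  # clean up whether or not the prompt succeeded
    return answer.strip()


# Demonstration with a stand-in prompt so nothing blocks on keyboard input.
solution = ask_human_for_captcha(b"not a real image", prompt=lambda msg: " abcd ")
print(solution)
```

In real use the default `prompt=input` would block until the user has opened the file and typed the answer.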
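A fully automatic approach requires machine learning: crawl a large set of captcha images, train a model on them, and use that model to recognize the text in new captcha images.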
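II. Signature-change flow analysis
- Which URL receives the request
- What data is sent
- Which special header fields are involved
- What the response looks like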
    General
    Request URL: https://www.douban.com/j/people/146925119/edit_signature
    Request Method: POST
    Status Code: 200 OK
    Remote Address: 211.147.4.31:443

    Response Headers
    Cache-Control: must-revalidate, no-cache, private
    Connection: keep-alive
    Content-Length: 47
    Content-Type: application/json; charset=utf-8
    Date: Sat, 11 Jun 2016 06:06:37 GMT
    Expires: Sun, 1 Jan 2006 01:00:00 GMT
    Keep-Alive: timeout=30
    Pragma: no-cache
    Server: dae
    Strict-Transport-Security: max-age=15552000;
    X-DAE-App: sns
    X-DAE-Node: sindar25b
    X-Douban-Mobileapp: 0
    X-Xss-Protection: 1; mode=block

    Request Headers
    Accept: application/json, text/javascript, */*; q=0.01
    Accept-Encoding: gzip, deflate, br
    Accept-Language: en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4
    Connection: keep-alive
    Content-Length: 54
    Content-Type: application/x-www-form-urlencoded
    Cookie: bid=PHjUxRzrHNk; _vwo_uuid_v2=56A954C0557184C73BBB3DF5C8D30C1D|409597a19056d473ebee60708893e9b8; ll="118221"; ps=y; ue="877646746@qq.com"; dbcl2="146925119:/crpdV7NiKQ"; ck=vkO3; ap=1; _pk_ref.100001.8cb4=%5B%22%22%2C%22%22%2C1465624694%2C%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3DjUjRq0ldEsr3DVgcsr-2j6hhjW72VMHrsETjWL2QAee%26wd%3D%26eqid%3Dc07ebf420008142f00000003575b7a83%22%5D; __utmt=1; push_noty_num=0; push_doumail_num=0; _pk_id.100001.8cb4=cbb9346c7bb2e22f.1465354092.4.1465624911.1465613335.; _pk_ses.100001.8cb4=*; __utma=30149280.2019919087.1465354115.1465612975.1465624696.4; __utmb=30149280.4.10.1465624696; __utmc=30149280; __utmz=30149280.1465612975.3.3.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __utmv=30149280.14692
    Host: www.douban.com
    Origin: https://www.douban.com
    Referer: https://www.douban.com/people/146925119/
    User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36
    X-Requested-With: XMLHttpRequest

    Form Data
    ck: vkO3
    signature: 顶顶顶顶
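One way to check that a request has been reconstructed correctly is to encode the same form data and compare its length with the captured `Content-Length`. The sketch below builds (without sending) the `edit_signature` POST using the standard library; the function name `build_signature_request` is an assumption for illustration, while the URL, headers, and field values come from the capture above.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_signature_request(user_id, ck, signature):
    """Build (but do not send) the edit_signature POST seen in the capture."""
    url = "https://www.douban.com/j/people/%s/edit_signature" % user_id
    headers = {
        "Referer": "https://www.douban.com/people/%s/" % user_id,
        "X-Requested-With": "XMLHttpRequest",  # marks the call as an AJAX request
        "Content-Type": "application/x-www-form-urlencoded",
    }
    # Each Chinese character becomes three percent-encoded UTF-8 bytes.
    body = urlencode({"ck": ck, "signature": signature}).encode("ascii")
    return Request(url, data=body, headers=headers, method="POST")


req = build_signature_request("146925119", "vkO3", "顶顶顶顶")
print(req.data.decode("ascii"))
print(len(req.data))  # the captured request reports Content-Length: 54
```

The encoded body is 54 bytes long, which agrees with the `Content-Length: 54` request header in the capture, so the two form fields account for the whole request body.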
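When it is not obvious where a value in the POST data comes from, the place to look is usually the HTML source of the page that performs the action. The value
ck: vkO3
above, for example, has to be located in the HTML of that page and then parsed out.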
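A minimal sketch of that parsing step, assuming the value sits in a hidden `<input>` element: the parser below collects every hidden input on a page into a dictionary. The class name `HiddenInputCollector` and the sample HTML are made up for illustration; the real Douban page markup is not shown in this article.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of all <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and "name" in attrs:
            self.fields[attrs["name"]] = attrs.get("value")


# Hypothetical page fragment, not copied from Douban.
SAMPLE_HTML = """
<form action="/j/people/146925119/edit_signature" method="post">
  <input type="hidden" name="ck" value="vkO3"/>
  <input type="text" name="signature"/>
</form>
"""

collector = HiddenInputCollector()
collector.feed(SAMPLE_HTML)
print(collector.fields.get("ck"))
```

Collecting all hidden fields at once, rather than only `ck`, makes it easy to see which other hidden values a form expects.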
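III. Example
Note:
This example assumes that the login page shows an image captcha. Douban no longer seems to require one at login;
if no captcha is required, simply remove the captcha-related code.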
    # -*- coding: utf-8 -*-
    from html.parser import HTMLParser

    import requests


    def _attr(attrs, attrname):
        # Return the value of the named attribute, or None if it is absent
        for attr in attrs:
            if attr[0] == attrname:
                return attr[1]
        return None


    # Get the captcha information
    def _get_captcha(content):
        class CaptchaParser(HTMLParser):
            def __init__(self):
                HTMLParser.__init__(self)
                self.captcha_id = None
                self.captcha_url = None

            def handle_starttag(self, tag, attrs):
                if tag == 'input' and _attr(attrs, 'type') == 'hidden' and _attr(attrs, 'name') == 'captcha_id':
                    self.captcha_id = _attr(attrs, 'value')
                if tag == 'img' and _attr(attrs, 'id') == 'captcha_image' and _attr(attrs, 'class') == 'captcha_image':
                    self.captcha_url = _attr(attrs, 'src')

        p = CaptchaParser()
        p.feed(content)
        return p.captcha_id, p.captcha_url


    # Get the value of the ck field
    def _get_ck(content):
        class CKParser(HTMLParser):
            def __init__(self):
                HTMLParser.__init__(self)
                self.ck = None

            def handle_starttag(self, tag, attrs):
                if tag == 'input' and _attr(attrs, 'type') == 'hidden' and _attr(attrs, 'name') == 'ck':
                    self.ck = _attr(attrs, 'value')

        p = CKParser()
        p.feed(content)
        return p.ck


    class DoubanClient:
        def __init__(self):
            headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36',
                       'origin': 'https://www.douban.com'}
            # Create a requests session
            self.session = requests.session()
            # Customize the session headers so that every later request
            # carries the values defined in headers above
            self.session.headers.update(headers)

        # Log in to Douban
        def login(self, username, password, source='index_nav',
                  redir='http://www.douban.com/', login='登录'):
            # Access the login page to fetch the captcha image.
            # Login and the signature change must share one session,
            # so use self.session.get(url) rather than requests.get(url).
            r = self.session.get('https://www.douban.com/accounts/login')
            (captcha_id, captcha_url) = _get_captcha(r.text)
            if captcha_id:
                captcha_solution = input('please input solution for [%s]' % captcha_url)
            # Post the login request; field names follow the captured Form Data
            data = {'form_email': username, 'form_password': password, 'source': source,
                    'redir': redir, 'login': login}
            # Add the captcha fields to the post data. The captures above
            # contain no captcha fields, so these two names are assumptions.
            if captcha_id:
                data['captcha_id'] = captcha_id
                data['captcha_solution'] = captcha_solution
            # Post to the Request URL seen in the capture
            url = 'https://accounts.douban.com/login'
            headers = {'referer': 'http://www.douban.com/accounts/login?source=main',
                       'host': 'accounts.douban.com'}
            r = self.session.post(url, data=data, headers=headers)
            print(self.session.cookies.items())

        # Edit the signature
        def edit_signature(self, username, signature):
            # Access the user's homepage
            home_url = 'https://www.douban.com/people/%s/' % username
            r = self.session.get(home_url)
            # Read the ck value needed in the post data from the page's HTML
            ck = _get_ck(r.text)
            # Post the request that changes the signature
            url = 'https://www.douban.com/j/people/%s/edit_signature' % username
            headers = {'referer': home_url, 'host': 'www.douban.com',
                       'x-requested-with': 'XMLHttpRequest'}
            data = {'ck': ck, 'signature': signature}
            r = self.session.post(url, data=data, headers=headers)
            print(r.text)


    if __name__ == '__main__':
        c = DoubanClient()
        c.login('877646746@qq.com', 'xxxxxx')
        c.edit_signature('146925119', 'Hello')
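The example prints the session cookies after logging in, and `edit_signature` is called with the numeric ID `146925119` rather than the login e-mail. In the login capture from section I, the successful response sets a `dbcl2` cookie whose value begins with that same number. The sketch below parses the captured `Set-Cookie` line and reads the number off the front of the value. Treating that prefix as the user ID is an inference from this single capture, not documented behaviour, and the function name `user_id_from_dbcl2` is hypothetical.

```python
from http.cookies import SimpleCookie

# Set-Cookie value copied from the login capture in section I.
CAPTURED_SET_COOKIE = (
    'dbcl2="146925119:/crpdV7NiKQ"; path=/; domain=.douban.com; httponly'
)


def user_id_from_dbcl2(set_cookie_header):
    """Return the text before ':' in the dbcl2 cookie, or None if it is absent.

    Assumption: that prefix is the numeric user ID, as it appears to be
    in the capture. Verify this before relying on it.
    """
    cookie = SimpleCookie()
    cookie.load(set_cookie_header)
    if "dbcl2" not in cookie:
        return None
    return cookie["dbcl2"].value.split(":", 1)[0]


print(user_id_from_dbcl2(CAPTURED_SET_COOKIE))
```

A missing `dbcl2` cookie after the login POST would then be a cheap hint that the login did not succeed, under the same assumption.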
IV. Exercises
- Log in to Zhihu
- Edit the profile bio