Python crawler example: logging in to Douban and changing the signature

Source: Internet  Editor: 程序博客网  Time: 2024/05/14 19:19

Features

  • Log in to Douban
  • Change the signature

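I. Login flow analysis

  • Which URL the request is sent to
  • What data is sent
  • Which special header fields are involved
  • How to deal with the captcha

1. Capturing the Douban login flow:

Account used: xxxxxx, password: xxxxxx. The Network panel captured the following:

Douban login page URL: https://www.douban.com/accounts/login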

General
Request URL: https://accounts.douban.com/login
Request Method: POST
Status Code: 302 Moved Temporarily
Remote Address: 211.147.4.32:443

Response Headers
Cache-Control: must-revalidate, no-cache, private
Connection: keep-alive
Content-Length: 65
Content-Type: text/plain
Date: Sat, 11 Jun 2016 02:48:18 GMT
Expires: Sun, 1 Jan 2006 01:00:00 GMT
Keep-Alive: timeout=30
Location: https://www.douban.com
P3P: CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
Pragma: no-cache
Server: dae
Set-Cookie: ue="877646746@qq.com"; domain=.douban.com; expires=Sun, 11-Jun-2017 02:48:18 GMT; httponly
Set-Cookie: dbcl2="146925119:/crpdV7NiKQ"; path=/; domain=.douban.com; httponly
Set-Cookie: as="deleted"; max-age=0; domain=.douban.com; expires=Thu, 01-Jan-1970 00:00:00 GMT
Strict-Transport-Security: max-age=15552000;
X-Content-Type-Options: nosniff
X-DAE-App: accounts
X-DAE-Node: sindar15a
X-Douban-Mobileapp: 0
X-Frame-Options: SAMEORIGIN
X-Xss-Protection: 1; mode=block

Request Headers
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4
Cache-Control: max-age=0
Connection: keep-alive
Content-Length: 138
Content-Type: application/x-www-form-urlencoded
Cookie: bid=PHjUxRzrHNk; _vwo_uuid_v2=56A954C0557184C73BBB3DF5C8D30C1D|409597a19056d473ebee60708893e9b8; ap=1; ll="118221"; __utmt=1; ps=y; __utma=30149280.2019919087.1465354115.1465606255.1465612975.3; __utmb=30149280.2.10.1465612975; __utmc=30149280; __utmz=30149280.1465612975.3.3.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; dbcl2="146925119:KHEcD+nREDs"; ck=9R18
Host: accounts.douban.com
Origin: https://accounts.douban.com
Referer: https://accounts.douban.com/login
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36

Form Data
ck: 9R18
source: None
redir: https://www.douban.com
form_email: 877646746@qq.com
form_password: song@3345616
login: 登录

In other words, to log in we only need to reproduce the headers listed under Request Headers and the POST parameters listed under Form Data.
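As a rough illustration of that idea, the sketch below builds the login POST from the captured values without sending it. It uses only the standard library so that it runs offline; the function name build_login_request and the sample credentials are made up for this example, and the header set is a subset of the capture rather than a confirmed minimum.

```python
from urllib.parse import urlencode, parse_qs
from urllib.request import Request

# Request URL taken from the capture above
LOGIN_URL = "https://accounts.douban.com/login"


def build_login_request(email, password, ck=None):
    """Build (but do not send) a login POST shaped like the captured one."""
    form = {
        "source": "None",
        "redir": "https://www.douban.com",
        "form_email": email,
        "form_password": password,
        "login": "登录",
    }
    if ck:
        # ck only appears when the browser already holds one
        form["ck"] = ck
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36",
        "Origin": "https://accounts.douban.com",
        "Referer": "https://accounts.douban.com/login",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    # urlencode percent-encodes non-ASCII values as UTF-8
    body = urlencode(form).encode("ascii")
    return Request(LOGIN_URL, data=body, headers=headers, method="POST")


# Hypothetical credentials, for illustration only
req = build_login_request("user@example.com", "secret")
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

Actually sending the request would also require cookie handling, which is why the full example in this article relies on a requests session.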

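If logging in requires the captcha shown in an image, we need to extract the captcha image and then type the answer in by hand (a semi-automated approach).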
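The semi-automated step could look roughly like the sketch below: find the captcha image URL in the login page and ask a person to type what they see. The id captcha_image matches the one used in the example code of this article; the sample HTML, its URL, and the names CaptchaImageParser and ask_for_captcha are hypothetical.

```python
from html.parser import HTMLParser


class CaptchaImageParser(HTMLParser):
    """Remember the src of the first <img> whose id is 'captcha_image'."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img" and attributes.get("id") == "captcha_image" and self.src is None:
            self.src = attributes.get("src")


def ask_for_captcha(html, ask=input):
    """Return the human-typed solution, or None if the page has no captcha."""
    parser = CaptchaImageParser()
    parser.feed(html)
    if parser.src is None:
        return None
    # `ask` defaults to input(); it is a parameter so it can be replaced in tests
    return ask("Open %s and type the characters you see: " % parser.src).strip()


# Hypothetical login-page fragment, for illustration only
SAMPLE_HTML = '<img id="captcha_image" src="https://example.com/captcha.jpg"/>'

# A stub stands in for the human here so the sketch runs unattended
solution = ask_for_captcha(SAMPLE_HTML, ask=lambda prompt: " abcd ")
print(solution)
```

In real use the default ask=input would block until someone types the answer.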

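A fully automated approach, of course, calls for machine learning: collect a large number of captcha images, train a model on them, and use that model to recognize the characters in each captcha image automatically.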

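II. Signature change flow analysis

  • Which URL the request is sent to
  • What data is sent
  • Which special header fields are involved
  • What the return value looks like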

General
Request URL: https://www.douban.com/j/people/146925119/edit_signature
Request Method: POST
Status Code: 200 OK
Remote Address: 211.147.4.31:443

Response Headers
Cache-Control: must-revalidate, no-cache, private
Connection: keep-alive
Content-Length: 47
Content-Type: application/json; charset=utf-8
Date: Sat, 11 Jun 2016 06:06:37 GMT
Expires: Sun, 1 Jan 2006 01:00:00 GMT
Keep-Alive: timeout=30
Pragma: no-cache
Server: dae
Strict-Transport-Security: max-age=15552000;
X-DAE-App: sns
X-DAE-Node: sindar25b
X-Douban-Mobileapp: 0
X-Xss-Protection: 1; mode=block

Request Headers
Accept: application/json, text/javascript, */*; q=0.01
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4
Connection: keep-alive
Content-Length: 54
Content-Type: application/x-www-form-urlencoded
Cookie: bid=PHjUxRzrHNk; _vwo_uuid_v2=56A954C0557184C73BBB3DF5C8D30C1D|409597a19056d473ebee60708893e9b8; ll="118221"; ps=y; ue="877646746@qq.com"; dbcl2="146925119:/crpdV7NiKQ"; ck=vkO3; ap=1; _pk_ref.100001.8cb4=%5B%22%22%2C%22%22%2C1465624694%2C%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3DjUjRq0ldEsr3DVgcsr-2j6hhjW72VMHrsETjWL2QAee%26wd%3D%26eqid%3Dc07ebf420008142f00000003575b7a83%22%5D; __utmt=1; push_noty_num=0; push_doumail_num=0; _pk_id.100001.8cb4=cbb9346c7bb2e22f.1465354092.4.1465624911.1465613335.; _pk_ses.100001.8cb4=*; __utma=30149280.2019919087.1465354115.1465612975.1465624696.4; __utmb=30149280.4.10.1465624696; __utmc=30149280; __utmz=30149280.1465612975.3.3.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __utmv=30149280.14692
Host: www.douban.com
Origin: https://www.douban.com
Referer: https://www.douban.com/people/146925119/
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36
X-Requested-With: XMLHttpRequest

Form Data
ck: vkO3
signature: 顶顶顶顶

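To make the captured request concrete, here is a minimal sketch that assembles the signature POST from the values seen above without sending it. It uses only the standard library so it runs offline; build_signature_request is a name invented for this example, and the sketch assumes the Referer should be the user's homepage, as in the capture.

```python
from urllib.parse import urlencode, parse_qs
from urllib.request import Request


def build_signature_request(user_id, ck, signature):
    """Build (but do not send) the edit_signature POST seen in the capture."""
    homepage = "https://www.douban.com/people/%s/" % user_id
    url = "https://www.douban.com/j/people/%s/edit_signature" % user_id
    # The form carries only two fields: the ck token and the new signature
    body = urlencode({"ck": ck, "signature": signature}).encode("ascii")
    headers = {
        "Origin": "https://www.douban.com",
        "Referer": homepage,
        # Marks the request as an AJAX call, as the browser did
        "X-Requested-With": "XMLHttpRequest",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return Request(url, data=body, headers=headers, method="POST")


req = build_signature_request("146925119", "vkO3", "顶顶顶顶")
print(req.full_url)
print(parse_qs(req.data.decode("ascii")))
```

A real request would also have to carry the cookies set at login, which this sketch leaves out.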

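When you do not know how to obtain one of the values in the POST data, you usually have to look for it in the HTML source of the page where the action is performed. For example, the value

ck:vkO3
above has to be located in the HTML of that page and then parsed out.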
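One way to parse such a value out is sketched below: collect every hidden input of a page into a dict and look up ck in it. The sample HTML is a made-up fragment, not Douban's actual markup, and the names HiddenInputParser and hidden_fields are hypothetical.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of all <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and attributes.get("type") == "hidden" and "name" in attributes:
            self.fields[attributes["name"]] = attributes.get("value")


def hidden_fields(html):
    """Return a dict of the hidden form fields found in `html`."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


# Made-up page fragment, for illustration only
SAMPLE_HTML = """
<form method="post">
  <input type="hidden" name="ck" value="vkO3"/>
  <input type="text" name="signature"/>
</form>
"""

print(hidden_fields(SAMPLE_HTML).get("ck"))  # prints vkO3
```

Returning all hidden fields at once is a design choice for the sketch: the same helper can then serve for any other token a form happens to need.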

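III. Example

Note:

This example assumes that an image captcha is shown at login; Douban no longer seems to require one.

If logging in does not need a captcha, simply drop the captcha-related parts.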

# -*- coding: utf-8 -*-
from html.parser import HTMLParser

import requests


def _attr(attrs, attrname):
    """Return the value of attribute `attrname` from an attrs list, or None."""
    for attr in attrs:
        if attr[0] == attrname:
            return attr[1]
    return None


# Get the captcha information
def _get_captcha(content):
    class CaptchaParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.captcha_id = None
            self.captcha_url = None

        def handle_starttag(self, tag, attrs):
            if tag == 'input' and _attr(attrs, 'type') == 'hidden' and _attr(attrs, 'name') == 'captcha_id':
                self.captcha_id = _attr(attrs, 'value')
            # Images are <img> elements in HTML
            if tag == 'img' and _attr(attrs, 'id') == 'captcha_image' and _attr(attrs, 'class') == 'captcha_image':
                self.captcha_url = _attr(attrs, 'src')

    p = CaptchaParser()
    p.feed(content)
    return p.captcha_id, p.captcha_url


# Get the value of the ck field
def _get_ck(content):
    class CKParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.ck = None

        def handle_starttag(self, tag, attrs):
            if tag == 'input' and _attr(attrs, 'type') == 'hidden' and _attr(attrs, 'name') == 'ck':
                self.ck = _attr(attrs, 'value')

    p = CKParser()
    p.feed(content)
    return p.ck


class DoubanClient(object):
    def __init__(self):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36',
                   'Origin': 'https://www.douban.com'}
        # Create a requests session
        self.session = requests.session()
        # Customize the session headers so that every later request carries them
        self.session.headers.update(headers)

    # Log in to Douban
    def login(self, username, password, source='index_nav',
              redir='http://www.douban.com/', login='登录'):
        # Fetch the login page to get the captcha image shown on it.
        # Login and the signature change must share one session, so use
        # self.session.get(url) rather than a bare requests.get(url).
        r = self.session.get('https://www.douban.com/accounts/login')
        # r.text is the decoded page; HTMLParser.feed() expects str
        (captcha_id, captcha_url) = _get_captcha(r.text)
        if captcha_id:
            captcha_solution = input('please input solution for [%s]' % captcha_url)
        # Post the login request; field names follow the captured Form Data
        data = {'form_email': username, 'form_password': password, 'source': source,
                'redir': redir, 'login': login}
        # Add the captcha information to the POST data.
        # NOTE: both captcha field names are assumptions; check the login form's HTML.
        if captcha_id:
            data['captcha_id'] = captcha_id
            data['captcha_solution'] = captcha_solution
        headers = {'Referer': 'http://www.douban.com/accounts/login?source=main',
                   'Host': 'accounts.douban.com'}
        # The capture shows the form being posted to accounts.douban.com
        r = self.session.post('https://accounts.douban.com/login', data=data, headers=headers)
        print(self.session.cookies.items())

    # Edit the signature
    def edit_signature(self, username, signature):
        # Access the user's homepage
        homepage = 'https://www.douban.com/people/%s/' % username
        r = self.session.get(homepage)
        # Get the value of the ck POST parameter from the HTML of that page
        ck = _get_ck(r.text)
        # Post the request that changes the signature
        url = 'https://www.douban.com/j/people/%s/edit_signature' % username
        # In the capture the Referer is the homepage, not the edit_signature URL
        headers = {'Referer': homepage, 'Host': 'www.douban.com',
                   'X-Requested-With': 'XMLHttpRequest'}
        data = {'ck': ck, 'signature': signature}
        r = self.session.post(url, data=data, headers=headers)
        print(r.text)


if __name__ == '__main__':
    c = DoubanClient()
    c.login('877646746@qq.com', 'song@3345616')
    c.edit_signature('146925119', 'Hello')

IV. Homework

  • Log in to Zhihu
  • Edit the profile bio