Simulated Login: Zhihu

Source: Internet  Editor: 程序博客网  Time: 2024/04/29 02:17

I happened to come across web crawlers, so I looked into them a little.
cookielib:
This module is used to work with cookies.
cookielib.CookieJar()
Stores and manages cookies. urllib2.HTTPCookieProcessor wraps it, so the usual pattern is:

    cookieJar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookieJar))
    response = opener.open(url)

This can also be written as:

    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
    response = opener.open(url)

because urllib2.HTTPCookieProcessor.__init__() automatically creates a CookieJar object to handle cookies when the cookiejar argument is omitted.
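The default-jar behaviour can be checked without any network access. The sketch below uses the Python 3 names (cookielib became http.cookiejar and urllib2 became urllib.request); the variable names are only illustrative.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener

# Pass a jar explicitly: the processor keeps a reference to that same object.
explicit_jar = CookieJar()
explicit_proc = HTTPCookieProcessor(explicit_jar)
uses_given_jar = explicit_proc.cookiejar is explicit_jar

# Pass nothing: the processor creates its own CookieJar.
default_proc = HTTPCookieProcessor()
created_own_jar = isinstance(default_proc.cookiejar, CookieJar)

# Either processor can be handed to build_opener.
opener = build_opener(default_proc)
print(uses_given_jar, created_own_jar)
```

Creating the jar yourself is mostly useful when you want to inspect or save the cookies later; otherwise the one-line form is enough.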

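cookieJar.add_cookie_header(request)
Adds a Cookie header to the given urllib2.Request, built from the cookies stored in cookieJar that match the request.

cookieJar is an iterable object; each item it yields is a cookie, which can be read as a name/value pair:

    for ck in cookieJar:
        print ck.name, '=', ck.value

Output:

    q_c1=47d7833820654cd9aaecd0176195c9be
    ga=GA1.2.1918555708.1437623636
    utma=51854390.1918555708.1437623636.1441865370.1441872012.4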
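To see add_cookie_header and iteration in action offline, here is a small sketch using the Python 3 module names (http.cookiejar, urllib.request). The make_cookie helper, the cookie values and the www.example.com host are made up for the demonstration; no request is actually sent.

```python
from http.cookiejar import Cookie, CookieJar
from urllib.request import Request


def make_cookie(name, value, domain="www.example.com"):
    """Hypothetical helper: build a simple session cookie by hand."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


jar = CookieJar()
jar.set_cookie(make_cookie("q_c1", "abc123"))
jar.set_cookie(make_cookie("n_c", "1"))

# The jar writes a Cookie header onto the request; the request is not modified otherwise.
request = Request("http://www.example.com/")
jar.add_cookie_header(request)
cookie_header = request.get_header("Cookie")

# Iterating the jar yields Cookie objects with .name and .value.
pairs = {ck.name: ck.value for ck in jar}
print(cookie_header)
print(pairs)
```

Each Cookie object also carries attributes such as domain, path and expires, so more than the name and value is available when iterating.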

In my experience, however, cookielib.CookieJar() is not reliable at handling cookies and often ends up missing fields.
Compare the raw Set-Cookie header with what the jar yields:

    response = opener.open(req)
    print response.info().get('Set-Cookie')
    print '========================='
    for ck in cookieJar:
        print ck.name, '=', ck.value

Output:

    q_c1=0c1aa3813d3e4669a2aa6325990a072c|1441967500000|1441967500000; Domain=zhihu.com; expires=Mon, 10 Sep 2018 10:31:40 GMT; Path=/, cap_id="aWQ=|1441967500|4b198f2db1f14b7f16aec0856e84aedc99640017"; Domain=zhihu.com; expires=Sun, 11 Oct 2015 10:31:40 GMT; Path=/, n_c=1; Domain=zhihu.com; Path=/
    =========================
    cap_id = "aWQ=|1441967500|4b198f2db1f14b7f16aec0856e84aedc99640017"
    n_c = 1
    q_c1 = 0c1aa3813d3e4669a2aa6325990a072c|1441967500000|1441967500000

When fetching the captcha during the simulated Zhihu login, this caused the captcha to be rejected every time. I later switched to updating the request header from Set-Cookie myself and sending the next request with the newly built header.
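The manual approach can be sketched as a small merge function. This is one possible way to do it, not the original code: merge_set_cookie is a hypothetical name, it is written for Python 3, and it assumes the Set-Cookie values arrive as a list with one entry per cookie (in Python 3, response.headers.get_all('Set-Cookie') returns such a list). The sample values are the ones printed above.

```python
def merge_set_cookie(headers, set_cookie_values):
    """Return a copy of headers whose Cookie field includes the new cookies."""
    cookies = {}
    # Start from the "name=value" pairs already in the Cookie header, if any.
    for pair in headers.get("Cookie", "").split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep:
            cookies[name] = value
    # In each Set-Cookie value only the first "name=value" is the cookie;
    # Domain, expires and Path are attributes and are not sent back.
    for raw in set_cookie_values:
        name, sep, value = raw.split(";", 1)[0].strip().partition("=")
        if sep:
            cookies[name] = value
    merged = dict(headers)
    merged["Cookie"] = "; ".join(f"{k}={v}" for k, v in cookies.items())
    return merged


set_cookie_values = [
    "q_c1=0c1aa3813d3e4669a2aa6325990a072c|1441967500000|1441967500000; "
    "Domain=zhihu.com; expires=Mon, 10 Sep 2018 10:31:40 GMT; Path=/",
    'cap_id="aWQ=|1441967500|4b198f2db1f14b7f16aec0856e84aedc99640017"; '
    "Domain=zhihu.com; expires=Sun, 11 Oct 2015 10:31:40 GMT; Path=/",
    "n_c=1; Domain=zhihu.com; Path=/",
]
old_headers = {"User-Agent": "Mozilla/5.0", "Cookie": "n_c=0"}
new_headers = merge_set_cookie(old_headers, set_cookie_values)
print(new_headers["Cookie"])
```

Splitting on the first "=" only matters here because a value such as cap_id contains "=" itself.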

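cookielib.FileCookieJar()
cookielib.FileCookieJar() inherits from cookielib.CookieJar() and adds support for keeping cookies in a file; the actual file formats are implemented by subclasses such as MozillaCookieJar.

cookielib.MozillaCookieJar()
cookielib.MozillaCookieJar() inherits from cookielib.FileCookieJar() and reads and writes cookie files in the browser (Mozilla cookies.txt) format.

    import cookielib
    import urllib2

    def save_cookies(url, postdata=None, header=None, filename=None):
        '''
        @summary: save cookies
        @postdata: data to submit via POST
        @header: request headers
        @filename: name of the cookie file (cookies can be read from it and saved to it)
        '''
        # urllib2.Request expects a dict of headers, so fall back to an empty one
        req = urllib2.Request(url, postdata, header or {})
        ckjar = cookielib.MozillaCookieJar(filename)
        ckproc = urllib2.HTTPCookieProcessor(ckjar)
        opener = urllib2.build_opener(ckproc)
        response = opener.open(req)
        html = response.read()
        response.close()
        # save the cookies to the file, including session and expired cookies
        ckjar.save(ignore_discard=True, ignore_expires=True)
        return html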
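A save/load round trip shows what the two ignore_* flags do. The sketch below uses the Python 3 name http.cookiejar.MozillaCookieJar, writes to a temporary directory, and builds one cookie by hand; make_cookie and the file names are illustrative assumptions.

```python
import os
import tempfile
from http.cookiejar import Cookie, MozillaCookieJar


def make_cookie(name, value, domain="www.example.com"):
    """Hypothetical helper: a session cookie (no expiry, discard=True)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


workdir = tempfile.mkdtemp()
cookie_file = os.path.join(workdir, "cookies.txt")

jar = MozillaCookieJar(cookie_file)
jar.set_cookie(make_cookie("q_c1", "abc123"))
# ignore_discard=True keeps session cookies, which save() skips by default.
jar.save(ignore_discard=True, ignore_expires=True)

restored = MozillaCookieJar(cookie_file)
restored.load(ignore_discard=True, ignore_expires=True)
restored_pairs = {ck.name: ck.value for ck in restored}

# Without the flag, the session cookie is not written at all.
strict_file = os.path.join(workdir, "no_session.txt")
jar.save(strict_file)
strict = MozillaCookieJar(strict_file)
strict.load()
print(restored_pairs, len(strict))
```

Login cookies are frequently session cookies, which is why the save call in save_cookies passes ignore_discard=True.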

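urllib2:

urllib2 uses the data argument to decide between GET and POST: without data the request is a GET, with data it is a POST.
GET requests:
1. Pass the URL directly:

    import urllib2

    response = urllib2.urlopen('http://www.baidu.com/')
    content = response.read()
    print content

2. Build a Request object first:

    import urllib2

    req = urllib2.Request('http://www.baidu.com/')
    response = urllib2.urlopen(req)
    content = response.read()
    print content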

POST requests:
1. Pass the data directly:

    import urllib
    import urllib2

    postdata = {'k': 'v'}
    # data submitted via POST must be URL-encoded
    postdata = urllib.urlencode(postdata)
    response = urllib2.urlopen('http://www.baidu.com/', data=postdata)
    content = response.read()
    print content

2. Build a Request object first:

    import urllib
    import urllib2

    postdata = {'k': 'v'}
    postdata = urllib.urlencode(postdata)
    req = urllib2.Request('http://www.baidu.com/', data=postdata)
    response = urllib2.urlopen(req)
    content = response.read()
    print content
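The GET/POST rule can be confirmed without sending anything, since a Request object reports the method it would use. This sketch uses the Python 3 names (urllib.request, urllib.parse), where the POST body must be bytes; www.example.com is a placeholder.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = "http://www.example.com/"

# No data: the request would be sent as GET.
get_req = Request(url)

# With data: the same URL would be sent as POST.
post_body = urlencode({"k": "v"}).encode("ascii")
post_req = Request(url, data=post_body)

methods = (get_req.get_method(), post_req.get_method())
print(methods, post_req.data)
```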

Requests with headers:

    import urllib
    import urllib2

    postdata = {'_xsrf': '', 'account': '', 'password': 'xxx', 'remember_me': 'true'}
    postdata = urllib.urlencode(postdata)
    # headers must be a dict (name: value), and the keyword argument is "headers"
    headers = {'Host': 'www.zhihu.com',
               'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko'}
    req = urllib2.Request('http://www.zhihu.com/login/email', data=postdata, headers=headers)
    response = urllib2.urlopen(req)
    content = response.read()
    print content
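One detail that is easy to trip over when reading headers back from a Request: the object stores header names in capitalized form ("User-agent"), and lookups are case-sensitive. A minimal offline sketch with the Python 3 urllib.request.Request; the header values are shortened placeholders and nothing is sent.

```python
from urllib.request import Request

headers = {"Host": "www.zhihu.com", "User-Agent": "Mozilla/5.0 (demo)"}
req = Request("http://www.zhihu.com/login/email", data=b"k=v", headers=headers)

# Headers can also be added after the Request is created.
req.add_header("Accept-Language", "zh-CN")

# Names are stored via str.capitalize(), so look them up in that form.
user_agent = req.get_header("User-agent")
missed = req.get_header("User-Agent")
has_language = req.has_header("Accept-language")
print(user_agent, missed, has_language)
```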

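Simulated login:
The server mainly checks whether the request comes from a browser and whether the cookie is correct. Both pieces of information are carried in the request header. The main flow of a simulated login is:
1. Build the postdata, then log in with a POST request.
2. Save the cookies from the Set-Cookie field of the response header into the Cookie field of the request header.
3. Send the following requests with the updated header. (Equally, you can log in with a browser, copy the header it sends, and make requests with that header.)
The "User-Agent" header field carries the browser information, and an opener built by urllib2.build_opener with an HTTPCookieProcessor handles cookies automatically, as shown below (I accidentally deleted my own code, but the flow is much the same):

    import urllib2
    import urllib
    import cookielib

    auth_url = 'http://www.nowamagic.net/'
    home_url = 'http://www.nowamagic.net/'
    # login username and password
    data = {"username": "nowamagic", "password": "pass"}
    # URL-encode the form data with urllib
    post_data = urllib.urlencode(data)
    # request headers
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko",
        "Host": "www.nowamagic.net",
        "Referer": "http://www.nowamagic.net"
    }
    # create a CookieJar to handle cookies
    cookieJar = cookielib.CookieJar()
    # build an opener that uses it
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookieJar))
    # log in and collect the cookies
    req = urllib2.Request(auth_url, post_data, headers)
    # header fields can also be added dynamically
    req.add_header('Accept-Encoding', "gzip, deflate")
    result = opener.open(req)
    # Visit the home page; the cookies are sent automatically.
    # Note: cookies obtained this way may be incomplete, so the login can fail.
    # If that happens, read the header with result.info().get('Set-Cookie'),
    # update the Cookie field of the request header by hand, and request again.
    result = opener.open(home_url)
    # show the result
    print result.read()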
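The same flow can be tried end to end without touching a real site. The sketch below, in Python 3, starts a throwaway HTTP server on 127.0.0.1 that stands in for the login site; DemoHandler, the /login and /home paths, and the session cookie value are all invented for the demonstration. It assumes local sockets are available.

```python
import threading
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical site: POST /login sets a cookie, GET /home checks it."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode("ascii"))
        ok = form.get("username") == ["demo"] and form.get("password") == ["pass"]
        self.send_response(200 if ok else 403)
        if ok:
            self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        logged_in = "session=abc123" in self.headers.get("Cookie", "")
        body = b"welcome" if logged_in else b"please log in"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

jar = CookieJar()
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({}),          # bypass any configured proxy
    urllib.request.HTTPCookieProcessor(jar),  # store and resend cookies
)
headers = {"User-Agent": "Mozilla/5.0 (demo)"}

# Before login the home page does not recognise us.
with opener.open(urllib.request.Request(base + "/home", headers=headers)) as resp:
    before = resp.read()

# Step 1: POST the credentials; the opener stores the Set-Cookie it gets back.
post_data = urllib.parse.urlencode({"username": "demo", "password": "pass"}).encode("ascii")
opener.open(urllib.request.Request(base + "/login", post_data, headers)).close()

# Steps 2 and 3: the stored cookie is sent automatically on the next request.
with opener.open(urllib.request.Request(base + "/home", headers=headers)) as resp:
    after = resp.read()

server.shutdown()
server.server_close()
print(before, after, [ck.name for ck in jar])
```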

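Login with a captcha:
Logging in often requires a captcha. The main difficulty is recognizing the captcha, but a "wrong captcha" warning also comes up frequently, because the server ties the captcha to the cookie. The main steps of a captcha login are:
1. Request the captcha.
2. Save the captcha, and save the 'Set-Cookie' returned with the captcha response into 'Cookie' (the server uses the cookie data to decide whether the captcha is correct).
3. Recognize the captcha (you can save the image and read it manually).
4. Submit the captcha together with the account and password via POST.
5. Save the 'Set-Cookie' returned by that request into 'Cookie'.
6. Sometimes submitting the login data triggers a redirect. (In effect the whole process is following the login flow and updating the Cookie step by step, then making requests with the final Cookie.)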
Simulated login to Zhihu
The script contains two versions of the login: the first relies on cookielib, the second handles cookies manually.

    import cookielib, urllib2, urllib, re, gzip, time
    from StringIO import StringIO
    import socket

    socket.setdefaulttimeout(300)

    headers = {'Host': 'www.zhihu.com',
               'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
               'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
               'Accept-Encoding': 'gzip, deflate',
               # assumed value: the original repeated 'gzip, deflate' here
               'Referer': 'http://www.zhihu.com/',
               'connection': 'keep-alive',
               }

    def analy_data(response):
        '''Read the response body, decompressing it if it is gzip-encoded.'''
        if response.info().get('Content-Encoding') == 'gzip':
            buf = StringIO(response.read())
            f = gzip.GzipFile(fileobj=buf)
            data = f.read()
        else:
            data = response.read()
        response.close()
        return data

    def get_xsrf(data):
        '''Extract the _xsrf token from the page.'''
        # [^"]* stops at the closing quote; a greedy .* could run past it
        cer = re.compile('name="_xsrf" value="([^"]*)"', flags=0)
        strlist = cer.findall(data)
        xsrf = strlist[0]
        print xsrf
        return xsrf

    def save_gif(response):
        '''Save the captcha image and ask the user to type it in.'''
        gif = response.read()
        with open("captcha.gif", 'wb') as f:
            f.write(gif)
        response.close()
        captcha = raw_input("Open captcha.gif and enter the captcha: ")
        captcha = captcha.strip()
        print captcha
        return captcha

    def get_postdata(xsrf, captcha=None):
        postdata = {
            '_xsrf': xsrf,
            'account': 'user',
            'password': 'password',
            'remember_me': 'true'
        }
        if captcha is not None:
            postdata['captcha'] = captcha
        postdata = urllib.urlencode(postdata)
        print postdata
        return postdata

    def save_cookie(set_cookies, headers):
        '''Handle cookies manually: merge Set-Cookie values into headers['Cookie'].

        set_cookies is a list of Set-Cookie header values, as returned by
        response.info().getheaders('Set-Cookie').
        '''
        cookies = {}
        # "name=value" pairs already present in the Cookie header
        for pair in headers.get('Cookie', '').split(';'):
            name, sep, value = pair.strip().partition('=')
            if sep:
                cookies[name] = value
        # only the first "name=value" of each Set-Cookie is the cookie itself;
        # Domain, expires and Path are attributes
        for raw in set_cookies:
            name, sep, value = raw.split(';', 1)[0].strip().partition('=')
            if sep:
                cookies[name] = value
        headers['Cookie'] = '; '.join(k + '=' + v for k, v in cookies.items())
        return headers

    def login_with_cookiejar():
        '''cookielib turned out to be unreliable at handling cookies; many fields go missing.'''
        cookieJar = cookielib.CookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookieJar))
        req = urllib2.Request('http://www.zhihu.com/')
        response = opener.open(req)
        data = analy_data(response)
        xsrf = get_xsrf(data)

        req = urllib2.Request('http://www.zhihu.com/captcha.gif')
        response = opener.open(req)
        print response.info()
        print response.info().get('Set-Cookie')
        captcha = save_gif(response)
        for ck in cookieJar:
            print ck.name, '=', ck.value
        print "----------------------------"

        postdata = get_postdata(xsrf, captcha)
        req = urllib2.Request('http://www.zhihu.com/login/email', data=postdata)
        response = opener.open(req)
        data = analy_data(response)
        print data

    def login_with_manual_cookies():
        '''Handle cookies manually.'''
        req_headers = dict(headers)
        req = urllib2.Request('http://www.zhihu.com/', headers=req_headers)
        response = urllib2.urlopen(req)
        # keep the cookies from every response, as in the steps above
        req_headers = save_cookie(response.info().getheaders('Set-Cookie'), req_headers)
        data = analy_data(response)
        xsrf = get_xsrf(data)

        req = urllib2.Request('http://www.zhihu.com/captcha.gif', headers=req_headers)
        response = urllib2.urlopen(req)
        print response.info()
        req_headers = save_cookie(response.info().getheaders('Set-Cookie'), req_headers)
        captcha = save_gif(response)
        print req_headers['Cookie']
        print "----------------------------"

        postdata = get_postdata(xsrf, captcha)
        req = urllib2.Request('http://www.zhihu.com/login/email', data=postdata, headers=req_headers)
        response = urllib2.urlopen(req)
        data = analy_data(response)
        print data

    if __name__ == "__main__":
        login_with_manual_cookies()
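Two of the helpers, decoding a gzip body and pulling out the _xsrf token, can be exercised offline. This is a Python 3 sketch; decode_body is a hypothetical stand-in for analy_data that takes raw bytes instead of a response object, and the sample HTML is invented. It also shows how a greedy "(.*)" pattern can capture more than the token when other attributes follow on the same line.

```python
import gzip
import re


def decode_body(raw, content_encoding):
    """Return the body bytes, decompressing when the server used gzip."""
    if content_encoding == "gzip":
        return gzip.decompress(raw)
    return raw


XSRF_RE = re.compile(r'name="_xsrf" value="([^"]*)"')


def get_xsrf(html):
    """Return the _xsrf token, or None if the page has no such field."""
    match = XSRF_RE.search(html)
    return match.group(1) if match else None


# Invented page: the token field is followed by another input on the same line.
sample_html = (
    '<form><input type="hidden" name="_xsrf" value="token123"/>'
    '<input name="account" value=""/></form>'
)
compressed = gzip.compress(sample_html.encode("utf-8"))

page = decode_body(compressed, "gzip").decode("utf-8")
token = get_xsrf(page)

# A greedy pattern runs on to the last quote in the line.
greedy = re.findall('name="_xsrf" value="(.*)"', page)[0]
print(token)
print(greedy)
```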

