Simulated Login: Zhihu
Source: Internet. Editor: 程序博客网. Time: 2024/04/29 02:17
I came across web crawlers by chance and decided to look into them.
cookielib:
This module is used to work with cookies.
cookielib.CookieJar()
Stores and manages cookies. urllib2.HTTPCookieProcessor wraps it, so
    cookieJar = cookielib.CookieJar()
    response = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookieJar)).open(url)
can also be written as:
    response = urllib2.build_opener(urllib2.HTTPCookieProcessor()).open(url)
because urllib2.HTTPCookieProcessor.__init__() creates a CookieJar object automatically when no cookieJar argument is passed.
cookieJar.add_cookie_header(request)
Adds a Cookie header to the urllib2.Request, built from the cookies stored in cookieJar that match the request.
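To make the direction of add_cookie_header concrete (jar to request, not the other way round), here is a minimal offline sketch. It assumes Python 3, where cookielib is http.cookiejar and urllib2.Request is urllib.request.Request. The make_cookie helper is hypothetical and only exists so that no network access is needed; normally the jar fills itself from Set-Cookie headers.

```python
from http.cookiejar import Cookie, CookieJar
from urllib.request import Request


def make_cookie(name, value, domain):
    """Hypothetical helper: build a session cookie by hand for the demo."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=True,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


jar = CookieJar()
jar.set_cookie(make_cookie("n_c", "1", ".zhihu.com"))

# The request starts out without a Cookie header.
req = Request("http://www.zhihu.com/")
had_cookie_before = req.has_header("Cookie")

# add_cookie_header() copies the matching cookies from the jar into the request.
jar.add_cookie_header(req)
cookie_header = req.get_header("Cookie")

# A request to an unrelated host gets nothing from this jar.
other = Request("http://www.example.com/")
jar.add_cookie_header(other)
other_header = other.get_header("Cookie")

print(had_cookie_before, cookie_header, other_header)
```

An opener built with HTTPCookieProcessor calls this method for every outgoing request, which is why you rarely need to call it yourself.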
cookieJar is iterable; each item it yields is a cookie with a name and a value:
    for ck in cookieJar:
        print ck.name, '=', ck.value
Output:
    q_c1=47d7833820654cd9aaecd0176195c9be
    ga=GA1.2.1918555708.1437623636
    utma=51854390.1918555708.1437623636.1441865370.1441872012.4
However, I found cookielib.CookieJar() unreliable at handling cookies: many fields are often missing.
Comparison:
    response = opener.open(req)
    print response.info().get('Set-Cookie')
    print '========================='
    for ck in cookieJar:
        print ck.name, '=', ck.value
Output:
    q_c1=0c1aa3813d3e4669a2aa6325990a072c|1441967500000|1441967500000; Domain=zhihu.com; expires=Mon, 10 Sep 2018 10:31:40 GMT; Path=/, cap_id="aWQ=|1441967500|4b198f2db1f14b7f16aec0856e84aedc99640017"; Domain=zhihu.com; expires=Sun, 11 Oct 2015 10:31:40 GMT; Path=/, n_c=1; Domain=zhihu.com; Path=/
    =========================
    cap_id = "aWQ=|1441967500|4b198f2db1f14b7f16aec0856e84aedc99640017"
    n_c = 1
    q_c1 = 0c1aa3813d3e4669a2aa6325990a072c|1441967500000|1441967500000
When I simulated the Zhihu login, fetching the captcha this way made the server reject the captcha every time. I later switched to updating the request headers from Set-Cookie myself and sending the next request with the updated headers.
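The "update the header from Set-Cookie" approach can be sketched as a small pure function. This is only an illustration under a few assumptions: it is written for Python 3, merge_set_cookie is a hypothetical name, and it expects one string per Set-Cookie header (in Python 3 something like response.headers.get_all('Set-Cookie') should provide that list). It keeps only the name=value part of each Set-Cookie value and drops attributes such as Domain, expires and Path.

```python
def merge_set_cookie(cookie_header, set_cookie_values):
    """Return a Cookie header string updated with the cookies from Set-Cookie values."""
    cookies = {}
    # Existing Cookie header, in "a=1; b=2" form (may be None or empty).
    for pair in (cookie_header or "").split(";"):
        if "=" in pair:
            name, value = pair.strip().split("=", 1)
            cookies[name] = value
    # Each Set-Cookie value looks like "name=value; attr=...; attr=...".
    # Only the first segment is the cookie itself.
    for header_value in set_cookie_values:
        first = header_value.split(";", 1)[0].strip()
        if "=" in first:
            name, value = first.split("=", 1)
            cookies[name] = value
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


# The three Set-Cookie values from the comparison above, one per header.
set_cookie_values = [
    "q_c1=0c1aa3813d3e4669a2aa6325990a072c|1441967500000|1441967500000; "
    "Domain=zhihu.com; expires=Mon, 10 Sep 2018 10:31:40 GMT; Path=/",
    'cap_id="aWQ=|1441967500|4b198f2db1f14b7f16aec0856e84aedc99640017"; '
    "Domain=zhihu.com; expires=Sun, 11 Oct 2015 10:31:40 GMT; Path=/",
    "n_c=1; Domain=zhihu.com; Path=/",
]
merged = merge_set_cookie("n_c=0", set_cookie_values)
print(merged)
```

Splitting the combined header string on commas would be fragile, because the expires dates contain commas too; working on one header value at a time avoids that.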
cookielib.FileCookieJar()
cookielib.FileCookieJar() inherits from cookielib.CookieJar() and can save cookies to a file.
cookielib.MozillaCookieJar()
cookielib.MozillaCookieJar() inherits from cookielib.FileCookieJar() and reads and writes cookie files in the browser (Mozilla cookies.txt) format.
    import cookielib
    import urllib2

    def save_cookies(url, postdata=None, header=None, filename=None):
        '''
        @summary: save cookies
        @postdata: data submitted via POST
        @header: request headers
        @filename: name of the cookie file (cookies can be read from it and saved to it)
        '''
        # urllib2.Request expects a dict for headers, so fall back to an empty one
        req = urllib2.Request(url, postdata, header or {})
        ckjar = cookielib.MozillaCookieJar(filename)
        ckproc = urllib2.HTTPCookieProcessor(ckjar)
        opener = urllib2.build_opener(ckproc)
        response = opener.open(req)
        html = response.read()
        response.close()
        # save the cookies to the file
        ckjar.save(ignore_discard=True, ignore_expires=True)
        return html
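The save side has a matching load side. The following offline sketch shows a save and load round trip with MozillaCookieJar. It assumes Python 3 (http.cookiejar instead of cookielib), writes to a temporary directory, and uses a hypothetical make_cookie helper in place of a real login response.

```python
import os
import tempfile
from http.cookiejar import Cookie, MozillaCookieJar


def make_cookie(name, value, domain):
    """Hypothetical helper: build a session cookie by hand for the demo."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=True,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), "cookies.txt")

# Save: session cookies are skipped unless ignore_discard=True is passed.
jar = MozillaCookieJar(path)
jar.set_cookie(make_cookie("n_c", "1", ".zhihu.com"))
jar.save(ignore_discard=True, ignore_expires=True)

# Load: a fresh jar reads the same file back.
restored = MozillaCookieJar(path)
restored.load(ignore_discard=True, ignore_expires=True)
loaded = {ck.name: ck.value for ck in restored}
print(loaded)
```

Passing ignore_discard=True on both sides matters here, because a cookie without an expiry date is treated as a session cookie and would otherwise be left out.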
urllib2:
urllib2 uses the data parameter to decide between GET and POST: without data the request is a GET, with data it is a POST.
GET requests:
1.
    import urllib2
    response = urllib2.urlopen('http://www.baidu.com/')
    content = response.read()
    print content
2.
    import urllib2
    req = urllib2.Request('http://www.baidu.com/')
    response = urllib2.urlopen(req)
    content = response.read()
    print content
POST requests:
1.
    import urllib
    import urllib2
    postdata = {'k': 'v'}
    # the POST data must be URL-encoded
    postdata = urllib.urlencode(postdata)
    response = urllib2.urlopen('http://www.baidu.com/', data=postdata)
    content = response.read()
    print content
2.
    import urllib
    import urllib2
    postdata = {'k': 'v'}
    postdata = urllib.urlencode(postdata)
    req = urllib2.Request('http://www.baidu.com/', data=postdata)
    response = urllib2.urlopen(req)
    content = response.read()
    print content
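The rule that the data parameter selects the method can be checked without sending anything. A small sketch, assuming Python 3 (urllib.request and urllib.parse in place of urllib2 and urllib), where the POST body also has to be bytes:

```python
from urllib.parse import urlencode
from urllib.request import Request

url = "http://www.baidu.com/"

# No data: the request will be sent as GET.
get_req = Request(url)

# With data: the request will be sent as POST.
post_req = Request(url, data=urlencode({"k": "v"}).encode("ascii"))

methods = (get_req.get_method(), post_req.get_method())
print(methods)
```

Nothing is sent here; get_method() only reports what urlopen would use.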
Requests with headers:
    import urllib
    import urllib2
    postdata = {'_xsrf': '', 'account': '', 'password': 'xxx', 'remember_me': 'true'}
    postdata = urllib.urlencode(postdata)
    # headers must be a dict of field name -> value
    headers = {
        'Host': 'www.zhihu.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko'
    }
    # the keyword argument is "headers", not "header"
    req = urllib2.Request('http://www.zhihu.com/login/email', data=postdata, headers=headers)
    response = urllib2.urlopen(req)
    content = response.read()
    print content
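One detail that is easy to trip over when inspecting such a request: the Request object stores header names capitalized, so 'User-Agent' is looked up as 'User-agent'. The sketch below assumes Python 3 and does not contact the server; the form values are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

headers = {
    "Host": "www.zhihu.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko",
}
# Placeholder form values, as in the example above.
postdata = urlencode({"_xsrf": "", "account": "", "password": "xxx",
                      "remember_me": "true"}).encode("ascii")

req = Request("http://www.zhihu.com/login/email", data=postdata, headers=headers)
# Headers can also be added after the request object is built.
req.add_header("Accept-Encoding", "gzip, deflate")

# Header names are stored via str.capitalize(), e.g. "User-agent".
stored = dict(req.header_items())
print(sorted(stored))
```

The names sent on the wire are still treated case-insensitively by servers; the capitalization only matters when you read headers back with get_header or has_header.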
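Simulated login: the server mainly checks whether the request comes from a browser and whether the cookie is correct. Both pieces of information are carried in the request headers. The main flow is:
1. Build the postdata and log in with a POST request.
2. Save the cookies from the response's Set-Cookie header into the Cookie field of the request headers.
3. Send subsequent requests with the updated headers. (Equivalently, log in with a browser, copy its request headers, and send requests with those.)
The "User-Agent" header carries the browser information, and an opener built with urllib2.build_opener and an HTTPCookieProcessor handles cookies automatically, as below (I accidentally deleted my own code, but the flow is much the same):
    import urllib2
    import urllib
    import cookielib

    auth_url = 'http://www.nowamagic.net/'
    home_url = 'http://www.nowamagic.net/'
    # login username and password
    data = {"username": "nowamagic", "password": "pass"}
    # URL-encode with urllib
    post_data = urllib.urlencode(data)
    # request headers
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko",
        "Host": "www.nowamagic.net",
        "Referer": "http://www.nowamagic.net"
    }
    # create a CookieJar to handle cookies
    cookieJar = cookielib.CookieJar()
    # create a global opener
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookieJar))
    # log in to obtain the cookies
    req = urllib2.Request(auth_url, post_data, headers)
    # add a header field dynamically
    req.add_header('Accept-Encoding', "gzip, deflate")
    result = opener.open(req)
    # visit the home page; the cookies are sent automatically
    # Note: cookies obtained this way may be incomplete and the login may fail.
    # If that happens, read the header with result.info().get('Set-Cookie'),
    # update the Cookie field of the request headers by hand, and request again.
    result = opener.open(home_url)
    # show the result
    print result.read()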
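Simulated Zhihu login with a captcha: logins often require a captcha. The hard part is recognizing the captcha, but even with the right answer the server often reports a captcha error. The main steps are:
1. Request the captcha.
2. Save the captcha image, and save the 'Set-Cookie' returned by the captcha request into 'Cookie' (the server uses the cookie data to decide whether the captcha is correct).
3. Recognize the captcha (you can save the image and read it yourself).
4. POST the captcha together with the account and password.
5. Save the 'Set-Cookie' returned by the login request into 'Cookie'.
6. Submitting the login data sometimes triggers a redirect. (In practice you follow the login flow step by step, updating Cookie at each step, and then make requests with the final Cookie.)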
    import cookielib, urllib2, urllib, re, gzip
    from StringIO import StringIO
    import socket

    socket.setdefaulttimeout(300)

    headers = {
        'Host': 'www.zhihu.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
        'Accept-Encoding': 'gzip, deflate',
        'Referer': 'http://www.zhihu.com/',
        'connection': 'keep-alive',
    }

    def analy_data(response):
        '''Read the response body, decompressing it if it is gzip-encoded.'''
        if response.info().get('Content-Encoding') == 'gzip':
            buf = StringIO(response.read())
            f = gzip.GzipFile(fileobj=buf)
            data = f.read()
        else:
            data = response.read()
        response.close()
        return data

    def get_xsrf(data):
        '''Extract the _xsrf token from the page HTML.'''
        # non-greedy, so the match stops at the first closing quote
        cer = re.compile('name="_xsrf" value="(.*?)"', flags=0)
        strlist = cer.findall(data)
        xsrf = strlist[0]
        print xsrf
        return xsrf

    def save_gif(response):
        '''Save the captcha image, then ask the user to type what it shows.'''
        gif = response.read()
        with open("captcha.gif", 'wb') as f:
            f.write(gif)
        response.close()
        captcha = raw_input("Open captcha.gif and enter the captcha: ")
        captcha = captcha.strip()
        print captcha
        return captcha

    def get_postdata(xsrf, captcha=None):
        postdata = {
            '_xsrf': xsrf,
            'account': 'user',
            'password': 'password',
            'captcha': captcha,
            'remember_me': 'true'
        }
        postdata = urllib.urlencode(postdata)
        print postdata
        return postdata

    def save_cookie(set_cookies, headers):
        '''Handle cookies manually: merge a list of Set-Cookie values into headers['Cookie'].'''
        cookies = {}
        # existing cookies, in "name=value; name=value" form
        for kv in (headers.get('Cookie') or '').split(';'):
            if '=' in kv:
                name, value = kv.strip().split('=', 1)
                cookies[name] = value
        # only the first segment of each Set-Cookie value is the cookie itself;
        # the rest (Domain, expires, Path) are attributes
        for set_cookie in set_cookies:
            kv = set_cookie.split(';', 1)[0].strip()
            if '=' in kv:
                name, value = kv.split('=', 1)
                cookies[name] = value
        headers['Cookie'] = '; '.join(k + '=' + v for k, v in cookies.items())
        return headers

    def login_with_cookiejar():
        '''I found that cookielib does not handle cookies reliably; many fields go missing.'''
        cookieJar = cookielib.CookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookieJar))
        req = urllib2.Request('http://www.zhihu.com/')
        response = opener.open(req)
        data = analy_data(response)
        xsrf = get_xsrf(data)
        req = urllib2.Request('http://www.zhihu.com/captcha.gif')
        response = opener.open(req)
        print response.info()
        print response.info().get('Set-Cookie')
        captcha = save_gif(response)
        for ck in cookieJar:
            print ck.name, '=', ck.value
        print "----------------------------"
        postdata = get_postdata(xsrf, captcha)
        req = urllib2.Request('http://www.zhihu.com/login/email', data=postdata)
        response = opener.open(req)
        data = analy_data(response)
        print data

    def login_with_manual_cookie():
        '''Handle cookies manually.'''
        req = urllib2.Request('http://www.zhihu.com/')
        response = urllib2.urlopen(req)
        data = analy_data(response)
        xsrf = get_xsrf(data)
        req = urllib2.Request('http://www.zhihu.com/captcha.gif', headers=headers)
        response = urllib2.urlopen(req)
        print response.info()
        # one list entry per Set-Cookie header
        set_cookies = response.info().getheaders('Set-Cookie')
        save_cookie(set_cookies, headers)
        captcha = save_gif(response)
        print headers.get('Cookie')
        print "----------------------------"
        postdata = get_postdata(xsrf, captcha)
        req = urllib2.Request('http://www.zhihu.com/login/email', data=postdata, headers=headers)
        response = urllib2.urlopen(req)
        data = analy_data(response)
        print data

    if __name__ == "__main__":
        login_with_cookiejar()
        login_with_manual_cookie()
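Two of the helpers in the script, decompressing a gzip body and pulling out the _xsrf token, can be tried without touching the network. The sketch below assumes Python 3; decode_body and extract_xsrf are hypothetical names, and sample_html is a made-up fragment rather than real Zhihu markup. It also shows why a greedy (.*) pattern is risky when more than one quoted attribute follows on the same line.

```python
import gzip
import re

# Non-greedy: stop at the first closing quote after value=".
XSRF_RE = re.compile(r'name="_xsrf" value="(.*?)"')
# Greedy variant, kept only for comparison.
GREEDY_RE = re.compile(r'name="_xsrf" value="(.*)"')


def decode_body(raw, content_encoding):
    """Return the body bytes, decompressing them when Content-Encoding is gzip."""
    if content_encoding == "gzip":
        return gzip.decompress(raw)
    return raw


def extract_xsrf(html):
    """Return the _xsrf token found in the HTML, or None if there is none."""
    match = XSRF_RE.search(html)
    return match.group(1) if match else None


# Made-up page fragment with a second quoted attribute on the same line.
sample_html = '<input type="hidden" name="_xsrf" value="abc123"/><input name="x" value="y"/>'
compressed = gzip.compress(sample_html.encode("utf-8"))

page = decode_body(compressed, "gzip").decode("utf-8")
token = extract_xsrf(page)
greedy_token = GREEDY_RE.findall(page)[0]
print(token)
print(greedy_token)
```

Returning None instead of indexing into an empty list makes a missing token easier to handle than an IndexError in the middle of the login flow.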