Emulating Website Login, Python Version (with two versions of complete, runnable code)

Source: Internet   Editor: 程序博客网   Time: 2024/06/07 03:26

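Earlier posts have already covered some networking basics:

[Summary] Logic/flow and notes on fetching web pages, analyzing page content, and emulating website login

and how to implement simple web page fetching in Python:

[Tutorial] Fetching a web page and extracting the required information from it, Python version

This post goes on to show how to implement the basic website login flow in Python.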

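First, the prerequisites for this post.

It assumes that you have read:

[Summary] Logic/flow and notes on fetching web pages, analyzing page content, and emulating website login

and understand the basic networking concepts;

and that you have read:

[Summary] Developer tools in browsers (F12 in IE9 and Ctrl+Shift+I in Chrome): powerful tools for web page analysis

and know how to use tools such as IE9's F12 to analyze what happens when a web page runs.

The example used here is emulating login to the Baidu home page:

http://www.baidu.com/

to show how to emulate website login with Python.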


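1. Before emulating login to a website, work out the internal logic of logging in to that site

Before you try to emulate login to the Baidu home page with a program, here Python code,

you yourself must first understand what happens internally when you log in to that site.

 

For how to use tools to work out the internal logic of logging in to the Baidu home page, see:

[Tutorial] Step by step: using a tool (IE9's F12) to analyze the internal logic of emulating login to a website (Baidu home page)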

 

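2. Then implement the login logic in the language of your choice, here Python

Once you understand the internal login logic of the Baidu home page as analyzed with F12, implementing it in Python is comparatively easy.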

 

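Notes:

(1) If you are not familiar with using cookies in Python, first read:

[Solved] How to get the cookies returned when visiting a web page in Python

[Solved] Implementing an HTTP POST request with cookies in Python

(2) If you are not familiar with regular expressions, see:

Notes on learning regular expressions

(3) If you are not familiar with Python's regular expressions, see:

[Tutorial] Python regular expressions in detail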
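For readers new to Python regular expressions, here is a minimal sketch of the named-group extraction that this post relies on when pulling the token out of the returned page. It is written in Python 3 syntax, whereas the listings in this post are Python 2. The function name extract_login_token and the sample string are made up for illustration, and the pattern assumes the page contains a statement of the form bdPass.api.params.login_token='...'; as in the listings of this post.

```python
import re

# Pattern for a statement such as: bdPass.api.params.login_token='5ab6...';
# The named group "tokenVal" captures the token itself.
TOKEN_PATTERN = re.compile(r"bdPass\.api\.params\.login_token='(?P<tokenVal>\w+)';")


def extract_login_token(html):
    """Return the token found in html, or None if the pattern does not match."""
    found = TOKEN_PATTERN.search(html)
    if found is None:
        return None
    return found.group("tokenVal")


if __name__ == "__main__":
    # Illustrative sample only, not a real response from the server.
    sample = "bdPass.api.params.login_token='5ab690978812b0e7fbbe1bfc267b90b3';"
    print(extract_login_token(sample))
```

Returning None on a failed match lets the caller decide how to report the problem, for example by printing the page that was actually received.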

 

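Here is the flow from that analysis again, for easy comparison with the code:

| Step | URL | Method | Data sent | Value to obtain/extract from the response |
|------|-----|--------|-----------|-------------------------------------------|
| 1 | http://www.baidu.com/ | GET | none | BAIDUID in the returned cookies |
| 2 | https://passport.baidu.com/v2/api/?getapi&class=login&tpl=mn&tangram=true | GET | the BAIDUID cookie | the token value, extracted from the returned HTML |
| 3 | https://passport.baidu.com/v2/api/?login | POST | a set of POST data, in which token is the value extracted earlier | check whether the returned cookies include BDUSS, PTOKEN, STOKEN, SAVEUSERID |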
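The flow above relies on two mechanics: a cookie jar that carries BAIDUID from one request to the next, and a form-encoded POST body for step 3. The following is a rough sketch of both in Python 3 syntax (http.cookiejar and urllib.request are the Python 3 counterparts of cookielib and urllib2). The helper names build_cookie_opener and build_login_request are hypothetical, only a few of the form fields are shown, and the sketch makes no network call.

```python
import http.cookiejar
import urllib.parse
import urllib.request

LOGIN_URL = "https://passport.baidu.com/v2/api/?login"


def build_cookie_opener():
    """Return (cookie_jar, opener); the opener stores and resends cookies."""
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar))
    return cookie_jar, opener


def build_login_request(token, username, password):
    """Build the step 3 POST request; only a subset of the fields is shown."""
    post_fields = {
        "charset": "utf-8",
        "token": token,
        "tpl": "mn",
        "username": username,
        "password": password,
    }
    # urlencode percent-encodes the values; Request needs the body as bytes.
    post_data = urllib.parse.urlencode(post_fields).encode("ascii")
    return urllib.request.Request(LOGIN_URL, data=post_data)


if __name__ == "__main__":
    cookie_jar, opener = build_cookie_opener()
    request = build_login_request("0123abcd", "someUser", "somePassword")
    print(request.get_method())  # a request with a body is sent as POST
    print(request.data)
    # Steps 1 and 2 would be opener.open(url) calls on the first two URLs,
    # and step 3 would be opener.open(request). Nothing is sent here.
```

Keeping the request construction separate from the actual sending makes it possible to inspect the method and body before anything goes over the network.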

 

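With that, we can finally write the Python code that demonstrates emulating login to the Baidu home page.

[Version 1: complete Python code for emulating login to the Baidu home page, compact version]

This is a relatively compact version: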

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Function:   Used to demonstrate how to use Python code to emulate login to the Baidu main page: http://www.baidu.com/
Note:       Before trying to understand the following code, please first read the related articles:
            (1) [Summary] Logic/flow and notes on fetching web pages, analyzing page content, and emulating website login
                http://www.crifan.com/summary_about_flow_process_of_fetch_webpage_simulate_login_website_and_some_notice/
            (2) [Tutorial] Step by step: using a tool (IE9's F12) to analyze the internal logic of emulating login to a website (Baidu home page)
                http://www.crifan.com/use_ie9_f12_to_analysis_the_internal_logical_process_of_login_baidu_main_page_website/
            (3) [Tutorial] Emulating website login, Python version
                http://www.crifan.com/emulate_login_website_using_python

 
Version:    2012-11-06
Author:     Crifan
"""
 
import re
import cookielib
import urllib
import urllib2
import optparse
 
#------------------------------------------------------------------------------
# check whether all cookies named in cookieNameList exist in cookieJar
def checkAllCookiesExist(cookieNameList, cookieJar):
    cookiesDict = {}
    for eachCookieName in cookieNameList:
        cookiesDict[eachCookieName] = False

    allCookieFound = True
    for cookie in cookieJar:
        if cookie.name in cookiesDict:
            cookiesDict[cookie.name] = True

    for eachCookie in cookiesDict.keys():
        if not cookiesDict[eachCookie]:
            allCookieFound = False
            break

    return allCookieFound
 
#------------------------------------------------------------------------------
# just print a delimiter line
def printDelimiter():
    print '-' * 80
 
#------------------------------------------------------------------------------
# main function to emulate login baidu
def emulateLoginBaidu():
    print "Function: Used to demonstrate how to use Python code to emulate login baidu main page: http://www.baidu.com/"
    print "Usage: emulate_login_baidu_python.py -u yourBaiduUsername -p yourBaiduPassword"
    printDelimiter()

    # parse input parameters
    parser = optparse.OptionParser()
    parser.add_option("-u", "--username", action="store", type="string", default='', dest="username", help="Your Baidu Username")
    parser.add_option("-p", "--password", action="store", type="string", default='', dest="password", help="Your Baidu password")
    (options, args) = parser.parse_args()
    # copy the parsed options into local variables, so they can be used below
    username = options.username
    password = options.password
 
    printDelimiter()
    print "[preparation] using cookieJar & HTTPCookieProcessor to automatically handle cookies"
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    urllib2.install_opener(opener)

    printDelimiter()
    print "[step1] to get cookie BAIDUID"
    baiduMainUrl = "http://www.baidu.com/"
    resp = urllib2.urlopen(baiduMainUrl)
    #respInfo = resp.info()
    #print "respInfo=", respInfo
    for index, cookie in enumerate(cj):
        print '[', index, ']', cookie
 
    printDelimiter()
    print "[step2] to get token value"
    getapiUrl = "https://passport.baidu.com/v2/api/?getapi&class=login&tpl=mn&tangram=true"
    getapiResp = urllib2.urlopen(getapiUrl)
    #print "getapiResp=", getapiResp
    getapiRespHtml = getapiResp.read()
    #print "getapiRespHtml=", getapiRespHtml
    #bdPass.api.params.login_token='5ab690978812b0e7fbbe1bfc267b90b3';
    foundTokenVal = re.search(r"bdPass\.api\.params\.login_token='(?P<tokenVal>\w+)';", getapiRespHtml)
    if foundTokenVal:
        tokenVal = foundTokenVal.group("tokenVal")
        print "tokenVal=", tokenVal
 
        printDelimiter()
        print "[step3] emulate login baidu"
        staticpage = "http://www.baidu.com/cache/user/html/jump.html"
        baiduMainLoginUrl = "https://passport.baidu.com/v2/api/?login"
        postDict = {
            #'ppui_logintime': "",
            'charset'       : "utf-8",
            #'codestring'    : "",
            'token'         : tokenVal, #de3dbf1e8596642fa2ddf2921cd6257f
            'isPhone'       : "false",
            'index'         : "0",
            #'u'             : "",
            #'safeflg'       : "0",
            'staticpage'    : staticpage, #http%3A%2F%2Fwww.baidu.com%2Fcache%2Fuser%2Fhtml%2Fjump.html
            'loginType'     : "1",
            'tpl'           : "mn",
            'callback'      : "parent.bdPass.api.login._postCallback",
            'username'      : username,
            'password'      : password,
            #'verifycode'    : "",
            'mem_pass'      : "on",
        }
        postData = urllib.urlencode(postDict)
        # urlencode automatically encodes the parameter values, for example:
        # http://www.baidu.com/cache/user/html/jump.html
        # becomes http%3A%2F%2Fwww.baidu.com%2Fcache%2Fuser%2Fhtml%2Fjump.html
        #print "postData=", postData
        req = urllib2.Request(baiduMainLoginUrl, postData)
        # in most cases the content type of a POST request is application/x-www-form-urlencoded
        req.add_header('Content-Type', "application/x-www-form-urlencoded")
        resp = urllib2.urlopen(req)
        #for index, cookie in enumerate(cj):
        #    print '[', index, ']', cookie
        cookiesToCheck = ['BDUSS', 'PTOKEN', 'STOKEN', 'SAVEUSERID']
        loginBaiduOK = checkAllCookiesExist(cookiesToCheck, cj)
        if loginBaiduOK:
            print "+++ Emulate login baidu is OK, ^_^"
        else:
            print "--- Failed to emulate login baidu !"
    else:
        print "Fail to extract token value from html=", getapiRespHtml

if __name__ == "__main__":
    emulateLoginBaidu()
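The login check in the listing above boils down to asking which of the required cookie names are missing from the cookie jar. As a sketch of the same idea with set operations, in Python 3 syntax: the function name find_missing_cookies is hypothetical, and the stand-in objects in the demo only imitate the .name attribute that cookie objects from a cookie jar provide.

```python
from types import SimpleNamespace


def find_missing_cookies(required_names, cookie_jar):
    """Return the set of required cookie names that are absent from cookie_jar.

    cookie_jar can be any iterable of objects with a .name attribute,
    which is what iterating over a CookieJar yields.
    """
    present_names = {cookie.name for cookie in cookie_jar}
    return set(required_names) - present_names


if __name__ == "__main__":
    # Stand-in objects for illustration; a real run would pass the CookieJar.
    fake_jar = [SimpleNamespace(name="BAIDUID"), SimpleNamespace(name="BDUSS")]
    missing = find_missing_cookies(["BDUSS", "PTOKEN"], fake_jar)
    print(sorted(missing))
```

An empty result means all required cookies are present; a non-empty one also tells you which cookie to look for when the login fails.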

 

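[Version 2: complete Python code for emulating login to the Baidu home page, crifanLib.py version]

This is another version, which uses functions from my own Python library, crifanLib.py: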

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Function:   Used to demonstrate how to use Python code to emulate login to the Baidu main page: http://www.baidu.com/
            Use the functions from crifanLib.py
Note:       Before trying to understand the following code, please first read the related articles:
            (1) [Summary] Logic/flow and notes on fetching web pages, analyzing page content, and emulating website login
                http://www.crifan.com/summary_about_flow_process_of_fetch_webpage_simulate_login_website_and_some_notice/
            (2) [Tutorial] Step by step: using a tool (IE9's F12) to analyze the internal logic of emulating login to a website (Baidu home page)
                http://www.crifan.com/use_ie9_f12_to_analysis_the_internal_logical_process_of_login_baidu_main_page_website/
            (3) [Tutorial] Emulating website login, Python version
                http://www.crifan.com/emulate_login_website_using_python

 
Version:    2012-11-07
Author:     Crifan
Contact:    admin (at) crifan.com
"""
 
import re
import cookielib
import urllib
import urllib2
import optparse
 
#===============================================================================
# following are some functions, extracted from my python library: crifanLib.py
# for the whole crifanLib.py:
# online browser: http://code.google.com/p/crifanlib/source/browse/trunk/python/crifanLib.py
# download      : http://code.google.com/p/crifanlib/downloads/list
#===============================================================================

import zlib

gConst = {
    'constUserAgent' : 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)',
    #'constUserAgent' : "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1",
}
 
################################################################################
# Network: urllib/urllib2/http
################################################################################
 
#------------------------------------------------------------------------------
# get response from url
# note: if you have already installed a cookiejar, it is used automatically here
# while using urllib2.Request
def getUrlResponse(url, postDict={}, headerDict={}, timeout=0, useGzip=False):
    # make sure url is a str, not unicode, otherwise urllib2.urlopen will raise an error
    url = str(url)

    if postDict:
        postData = urllib.urlencode(postDict)
        req = urllib2.Request(url, postData)
        req.add_header('Content-Type', "application/x-www-form-urlencoded")
    else:
        req = urllib2.Request(url)

    defHeaderDict = {
        'User-Agent'    : gConst['constUserAgent'],
        'Cache-Control' : 'no-cache',
        'Accept'        : '*/*',
        'Connection'    : 'Keep-Alive',
    }

    # add default headers first
    for eachDefHd in defHeaderDict.keys():
        #print "add default header: %s=%s" % (eachDefHd, defHeaderDict[eachDefHd])
        req.add_header(eachDefHd, defHeaderDict[eachDefHd])

    if useGzip:
        #print "use gzip for", url
        req.add_header('Accept-Encoding', 'gzip, deflate')

    # add customized headers last -> allows overriding the default headers
    if headerDict:
        #print "added header:", headerDict
        for key in headerDict.keys():
            req.add_header(key, headerDict[key])

    if timeout > 0:
        # set timeout value if necessary
        resp = urllib2.urlopen(req, timeout=timeout)
    else:
        resp = urllib2.urlopen(req)

    return resp
 
#------------------------------------------------------------------------------
# get response html == body from url
#def getUrlRespHtml(url, postDict={}, headerDict={}, timeout=0, useGzip=False):
def getUrlRespHtml(url, postDict={}, headerDict={}, timeout=0, useGzip=True):
    resp = getUrlResponse(url, postDict, headerDict, timeout, useGzip)
    respHtml = resp.read()
    if useGzip:
        #print "---before unzip, len(respHtml)=", len(respHtml)
        respInfo = resp.info()

        # Server: nginx/1.0.8
        # Date: Sun, 08 Apr 2012 12:30:35 GMT
        # Content-Type: text/html
        # Transfer-Encoding: chunked
        # Connection: close
        # Vary: Accept-Encoding
        # ...
        # Content-Encoding: gzip

        # sometimes the request asks for gzip,deflate, but the returned html is not gzipped
        # -> the response info does not include the above "Content-Encoding: gzip"
        # eg: http://blog.sina.com.cn/s/comment_730793bf010144j7_3.html
        # -> so only decompress when the data is indeed gzipped
        if ("Content-Encoding" in respInfo) and (respInfo['Content-Encoding'] == "gzip"):
            respHtml = zlib.decompress(respHtml, 16 + zlib.MAX_WBITS)
            #print "+++ after unzip, len(respHtml)=", len(respHtml)

    return respHtml
 
################################################################################
# Cookies
################################################################################
 
#------------------------------------------------------------------------------
# check whether all cookies named in cookieNameList exist in cookieJar
def checkAllCookiesExist(cookieNameList, cookieJar):
    cookiesDict = {}
    for eachCookieName in cookieNameList:
        cookiesDict[eachCookieName] = False

    allCookieFound = True
    for cookie in cookieJar:
        if cookie.name in cookiesDict:
            cookiesDict[cookie.name] = True

    for eachCookie in cookiesDict.keys():
        if not cookiesDict[eachCookie]:
            allCookieFound = False
            break

    return allCookieFound
 
#===============================================================================
 
#------------------------------------------------------------------------------
# just print a delimiter line
def printDelimiter():
    print '-' * 80
 
#------------------------------------------------------------------------------
# main function to emulate login baidu
def emulateLoginBaidu():
    print "Function: Used to demonstrate how to use Python code to emulate login baidu main page: http://www.baidu.com/"
    print "Usage: emulate_login_baidu_python.py -u yourBaiduUsername -p yourBaiduPassword"
    printDelimiter()

    # parse input parameters
    parser = optparse.OptionParser()
    parser.add_option("-u", "--username", action="store", type="string", default='', dest="username", help="Your Baidu Username")
    parser.add_option("-p", "--password", action="store", type="string", default='', dest="password", help="Your Baidu password")
    (options, args) = parser.parse_args()
    # copy the parsed options into local variables, so they can be used below
    username = options.username
    password = options.password

    printDelimiter()
    print "[preparation] using cookieJar & HTTPCookieProcessor to automatically handle cookies"
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    urllib2.install_opener(opener)
 
    printDelimiter()
    print "[step1] to get cookie BAIDUID"
    baiduMainUrl = "http://www.baidu.com/"
    resp = getUrlResponse(baiduMainUrl)
    # here you should see: BAIDUID
    for index, cookie in enumerate(cj):
        print '[', index, ']', cookie
 
    printDelimiter()
    print "[step2] to get token value"
    getapiUrl = "https://passport.baidu.com/v2/api/?getapi&class=login&tpl=mn&tangram=true"
    getapiRespHtml = getUrlRespHtml(getapiUrl)
    #bdPass.api.params.login_token='5ab690978812b0e7fbbe1bfc267b90b3';
    foundTokenVal = re.search(r"bdPass\.api\.params\.login_token='(?P<tokenVal>\w+)';", getapiRespHtml)
    if foundTokenVal:
        tokenVal = foundTokenVal.group("tokenVal")
        print "tokenVal=", tokenVal
 
        printDelimiter()
        print "[step3] emulate login baidu"
        staticpage = "http://www.baidu.com/cache/user/html/jump.html"
        baiduMainLoginUrl = "https://passport.baidu.com/v2/api/?login"
        postDict = {
            #'ppui_logintime': "",
            'charset'       : "utf-8",
            #'codestring'    : "",
            'token'         : tokenVal, #de3dbf1e8596642fa2ddf2921cd6257f
            'isPhone'       : "false",
            'index'         : "0",
            #'u'             : "",
            #'safeflg'       : "0",
            'staticpage'    : staticpage, #http%3A%2F%2Fwww.baidu.com%2Fcache%2Fuser%2Fhtml%2Fjump.html
            'loginType'     : "1",
            'tpl'           : "mn",
            'callback'      : "parent.bdPass.api.login._postCallback",
            'username'      : username,
            'password'      : password,
            #'verifycode'    : "",
            'mem_pass'      : "on",
        }
        loginRespHtml = getUrlRespHtml(baiduMainLoginUrl, postDict)
        cookiesToCheck = ['BDUSS', 'PTOKEN', 'STOKEN', 'SAVEUSERID']
        loginBaiduOK = checkAllCookiesExist(cookiesToCheck, cj)
        if loginBaiduOK:
            print "+++ Emulate login baidu is OK, ^_^"
        else:
            print "--- Failed to emulate login baidu !"
    else:
        print "Fail to extract token value from html=", getapiRespHtml

if __name__ == "__main__":
    emulateLoginBaidu()

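The purpose of this version is to let later readers use the network-related functions without worrying about the internal details.

The functions can also be reused later.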
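One internal detail that getUrlRespHtml hides is the gzip handling: the body is decompressed only when the response header Content-Encoding is gzip. A minimal sketch of that decision in Python 3 syntax follows. The function name decode_body is hypothetical, and the demo compresses its own sample data instead of fetching a page.

```python
import gzip
import zlib


def decode_body(raw_body, content_encoding):
    """Decompress raw_body only when the server says it is gzip-compressed."""
    if content_encoding == "gzip":
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        return zlib.decompress(raw_body, 16 + zlib.MAX_WBITS)
    # No gzip header in the response: the body is returned unchanged.
    return raw_body


if __name__ == "__main__":
    original = b"<html>hello</html>"
    compressed = gzip.compress(original)
    print(decode_body(compressed, "gzip") == original)
    print(decode_body(original, None) == original)
```

Checking the header first matters because a server may ignore the Accept-Encoding request header and send plain HTML, in which case decompressing would fail.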

Note: about crifanLib.py:

Browse online: crifanLib.py

Download: crifanLib_2012-11-07.7z

 

 

Both versions of the code above produce the same output:

D:\tmp\tmp_dev_root\python\emulate_login_baidu_python>emulate_login_baidu_python.py -u crifan -p xxxxxx
Function: Used to demonstrate how to use Python code to emulate login baidu main page: http://www.baidu.com/
Usage: emulate_login_baidu_python.py -u yourBaiduUsername -p yourBaiduPassword
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
[preparation] using cookieJar & HTTPCookieProcessor to automatically handle cookies
--------------------------------------------------------------------------------
[step1] to get cookie BAIDUID
[ 0 ] <Cookie BAIDUID=8D85C6528FDF7B5F49C746A18524495B:FG=1 for .baidu.com/>
--------------------------------------------------------------------------------
[step2] to get token value
tokenVal= 4d3f004bbe3e6f0cfa435abd38dd9fec
--------------------------------------------------------------------------------
[step3] emulate login baidu
+++ Emulate login baidu is OK, ^_^
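
The command line in the output above passes the account through -u and -p, which the listings parse with optparse. In Python 3 the same two options could be handled with argparse, roughly as sketched below. The function name parse_login_args is hypothetical, and the argument list in the demo simply repeats the values from the output above.

```python
import argparse


def parse_login_args(argv=None):
    """Parse -u/--username and -p/--password; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(
        description="Emulate login to the Baidu main page")
    parser.add_argument("-u", "--username", default="",
                        help="Your Baidu username")
    parser.add_argument("-p", "--password", default="",
                        help="Your Baidu password")
    return parser.parse_args(argv)


if __name__ == "__main__":
    # An explicit argument list is passed so the demo does not read sys.argv.
    args = parse_login_args(["-u", "crifan", "-p", "xxxxxx"])
    print(args.username)
    print(args.password)
```

The parsed values are read as attributes (args.username, args.password), so nothing has to be copied into local variables by hand.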

 

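[Summary]

Overall, working out the internal logic involved in a website's login process is actually much harder than writing the code for it.

And understanding the overall logic of the login process matters far more than the detailed analysis with tools.

When I was struggling through all of this myself, there was no complete tutorial, which is exactly why this series of posts now exists: to explain the whole process from start to finish, from concepts, to logic, to analysis, to implementation.

After reading all of them, you should have a rough understanding of this area.

What remains is hands-on practice, that is, working through it yourself.

I hope all of the concepts, logic, methods, and code above are useful to you.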
