Simulating Website Login with Python (includes two complete, runnable versions of the code)
Source: Internet  Editor: 程序博客网  Time: 2024/06/07 03:26
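Earlier posts have already covered some basic networking background:
【Notes】The logic/flow of fetching web pages, parsing page content, and simulating website login, with points to note
and how simple page scraping is done in Python:
【Tutorial】Fetching a web page and extracting the required information from it: Python version
This post goes on to show how to implement the basic flow of simulating a website login in Python.
First, the prerequisites for this post.
It assumes that you have read:
【Notes】The logic/flow of fetching web pages, parsing page content, and simulating website login, with points to note
and understand the basic networking concepts;
and that you have read:
【Summary】Developer tools in the browser (F12 in IE9 and Ctrl+Shift+I in Chrome): a powerful aid for analyzing web pages
and know how to use tools such as IE9's F12 to analyze what happens when a web page runs.
Logging in to the Baidu home page:
http://www.baidu.com/
is used here as the example to show how to simulate a website login with Python.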
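1. Before simulating a login, work out the internal logic of logging in to the site
Before writing Python code that simulates logging in to the Baidu home page,
you first need to understand for yourself what happens internally when you log in to the site.
For how to use a tool to work out the internal logic of the Baidu home page login, see:
【Tutorial】Step by step: using a tool (F12 in IE9) to analyze the internal logic of simulating a login to a website (the Baidu home page)
2. Then implement the login logic in the language of your choice, here Python
Once you understand the internal login logic of the Baidu home page as analyzed with F12, implementing it in Python is relatively straightforward.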
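Notes:
(1) If you are not familiar with using cookies in Python, first read:
【Solved】How to get the cookies returned when accessing a web page in Python
【Solved】Sending an HTTP POST request with cookies in Python
(2) If you are not familiar with regular expressions, see:
Notes on learning regular expressions
(3) If you are not familiar with regular expressions in Python, see:
【Tutorial】Python regular expressions in detail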
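For readers on Python 3, where `cookielib` and `urllib2` were renamed to `http.cookiejar` and `urllib.request`, the cookie handling referred to in note (1) could be sketched roughly as follows. This is only a minimal sketch: the helper names `build_cookie_opener` and `cookie_names` are hypothetical, and no request is actually sent.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return (opener, jar); the jar collects cookies from every response
    fetched through the opener and replays them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def cookie_names(jar):
    """Return the set of cookie names currently stored in the jar."""
    return {cookie.name for cookie in jar}


opener, jar = build_cookie_opener()
# Nothing has been fetched yet, so the jar holds no cookies at this point.
print(cookie_names(jar))
```

After a real request is made through `opener.open(...)`, `cookie_names(jar)` would list whatever cookies the server set, which is the kind of check used later to look for BAIDUID.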
The flow worked out in that analysis is reproduced here for easy comparison with the code:
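| Order | URL | Request type | Data sent | Value to obtain/extract from the response |
|---|---|---|---|---|
| 1 | http://www.baidu.com/ | GET | none | BAIDUID in the returned cookies |
| 2 | https://passport.baidu.com/v2/api/?getapi&class=login&tpl=mn&tangram=true | GET | includes the BAIDUID cookie | the token value, extracted from the returned HTML |
| 3 | https://passport.baidu.com/v2/api/?login | POST | a set of POST data, in which the token value is the one extracted earlier | check whether the returned cookies include BDUSS, PTOKEN, STOKEN, SAVEUSERID |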
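Step 2 of this flow depends on pulling the token out of the returned HTML. As a small offline illustration, the sketch below applies a regular expression to a sample line in the format quoted in this article's code comments. The helper name `extract_token` is hypothetical, and the live response format may differ from this sample.

```python
import re

# Pattern for the login_token assignment expected in the step-2 response body.
TOKEN_PATTERN = re.compile(r"bdPass\.api\.params\.login_token='(?P<tokenVal>\w+)';")


def extract_token(html):
    """Return the login token found in html, or None if it is absent."""
    match = TOKEN_PATTERN.search(html)
    return match.group("tokenVal") if match else None


# Sample line in the format quoted in the comments of this article's code.
sample = "bdPass.api.params.login_token='5ab690978812b0e7fbbe1bfc267b90b3';"
print(extract_token(sample))           # the 32-character token from the sample
print(extract_token("no token here"))  # None, since the pattern does not match
```

Returning `None` on a miss lets the caller stop before step 3 instead of posting a login without a token.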
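With that, the Python code demonstrating a simulated login to the Baidu home page can be written.
【Version 1: complete Python code for simulating a login to the Baidu home page, minimal version】
This is the relatively minimal version: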
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Function: Used to demonstrate how to use Python code to emulate login baidu main page: http://www.baidu.com/
Note: Before trying to understand the following code, please first read the related articles:
(1) 【Notes】The logic/flow of fetching web pages, parsing page content, and simulating website login, with points to note
    http://www.crifan.com/summary_about_flow_process_of_fetch_webpage_simulate_login_website_and_some_notice/
(2) 【Tutorial】Step by step: using a tool (F12 in IE9) to analyze the internal logic of simulating a login to a website (the Baidu home page)
    http://www.crifan.com/use_ie9_f12_to_analysis_the_internal_logical_process_of_login_baidu_main_page_website/
(3) 【Tutorial】Simulating website login: Python version
    http://www.crifan.com/emulate_login_website_using_python
Version: 2012-11-06
Author: Crifan
"""
# Python 2 modules: cookielib and urllib2 do not exist under these names in Python 3
import re
import cookielib
import urllib
import urllib2
import optparse

#------------------------------------------------------------------------------
# check whether all cookies named in cookieNameList exist in cookieJar
def checkAllCookiesExist(cookieNameList, cookieJar):
    # map each required cookie name to whether it has been found
    cookiesDict = {}
    for eachCookieName in cookieNameList:
        cookiesDict[eachCookieName] = False

    allCookieFound = True
    for cookie in cookieJar:
        if cookie.name in cookiesDict:
            cookiesDict[cookie.name] = True

    for eachCookie in cookiesDict.keys():
        if not cookiesDict[eachCookie]:
            allCookieFound = False
            break

    return allCookieFound

#------------------------------------------------------------------------------
# just for print delimiter
def printDelimiter():
    print '-' * 80

#------------------------------------------------------------------------------
# main function to emulate login baidu
def emulateLoginBaidu():
    print "Function: Used to demonstrate how to use Python code to emulate login baidu main page: http://www.baidu.com/"
    print "Usage: emulate_login_baidu_python.py -u yourBaiduUsername -p yourBaiduPassword"
    printDelimiter()

    # parse input parameters
    parser = optparse.OptionParser()
    parser.add_option("-u", "--username", action="store", type="string", default='', dest="username", help="Your Baidu Username")
    parser.add_option("-p", "--password", action="store", type="string", default='', dest="password", help="Your Baidu password")
    (options, args) = parser.parse_args()
    # copy the parsed options into the local variables used below
    username = options.username
    password = options.password

    printDelimiter()
    print "[preparation] using cookieJar & HTTPCookieProcessor to automatically handle cookies"
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    urllib2.install_opener(opener)

    printDelimiter()
    print "[step1] to get cookie BAIDUID"
    baiduMainUrl = "http://www.baidu.com/"
    resp = urllib2.urlopen(baiduMainUrl)
    #respInfo = resp.info()
    #print "respInfo=",respInfo
    for index, cookie in enumerate(cj):
        print '[', index, ']', cookie

    printDelimiter()
    print "[step2] to get token value"
    getapiUrl = "https://passport.baidu.com/v2/api/?getapi&class=login&tpl=mn&tangram=true"
    getapiResp = urllib2.urlopen(getapiUrl)
    #print "getapiResp=",getapiResp
    getapiRespHtml = getapiResp.read()
    #print "getapiRespHtml=",getapiRespHtml
    #bdPass.api.params.login_token='5ab690978812b0e7fbbe1bfc267b90b3';
    foundTokenVal = re.search(r"bdPass\.api\.params\.login_token='(?P<tokenVal>\w+)';", getapiRespHtml)
    if foundTokenVal:
        tokenVal = foundTokenVal.group("tokenVal")
        print "tokenVal=", tokenVal

        printDelimiter()
        print "[step3] emulate login baidu"
        staticpage = "http://www.baidu.com/cache/user/html/jump.html"
        baiduMainLoginUrl = "https://passport.baidu.com/v2/api/?login"
        postDict = {
            #'ppui_logintime': "",
            'charset'       : "utf-8",
            #'codestring'    : "",
            'token'         : tokenVal, #de3dbf1e8596642fa2ddf2921cd6257f
            'isPhone'       : "false",
            'index'         : "0",
            #'u'             : "",
            #'safeflg'       : "0",
            'staticpage'    : staticpage, #http%3A%2F%2Fwww.baidu.com%2Fcache%2Fuser%2Fhtml%2Fjump.html
            'loginType'     : "1",
            'tpl'           : "mn",
            'callback'      : "parent.bdPass.api.login._postCallback",
            'username'      : username,
            'password'      : password,
            #'verifycode'    : "",
            'mem_pass'      : "on",
        }
        postData = urllib.urlencode(postDict)
        # urlencode automatically encodes the parameter values, for example it
        # encodes http://www.baidu.com/cache/user/html/jump.html
        # into    http%3A%2F%2Fwww.baidu.com%2Fcache%2Fuser%2Fhtml%2Fjump.html
        #print "postData=",postData
        req = urllib2.Request(baiduMainLoginUrl, postData)
        # in most cases the content-type of a POST request is application/x-www-form-urlencoded
        req.add_header('Content-Type', "application/x-www-form-urlencoded")
        resp = urllib2.urlopen(req)
        #for index, cookie in enumerate(cj):
        #    print '[',index, ']',cookie
        cookiesToCheck = ['BDUSS', 'PTOKEN', 'STOKEN', 'SAVEUSERID']
        loginBaiduOK = checkAllCookiesExist(cookiesToCheck, cj)
        if loginBaiduOK:
            print "+++ Emulate login baidu is OK, ^_^"
        else:
            print "--- Failed to emulate login baidu !"
    else:
        print "Fail to extract token value from html=", getapiRespHtml

if __name__ == "__main__":
    emulateLoginBaidu()
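The POST body in step 3 is form-encoded, and the code comments mention that the values are percent-encoded automatically. The sketch below shows that behaviour offline using the Python 3 equivalents (`urllib.parse.urlencode` and `urllib.request.Request`) rather than the Python 2 modules used in the listing. The helper name `build_login_request` is hypothetical, the request is only constructed, never sent, and a single field stands in for the full set of POST data.

```python
import urllib.parse
import urllib.request


def build_login_request(url, fields):
    """Build (without sending) a form-encoded POST request for the given fields."""
    # urlencode percent-encodes each value; Python 3 requires the body as bytes.
    body = urllib.parse.urlencode(fields).encode("ascii")
    request = urllib.request.Request(url, data=body)
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    return request


req = build_login_request(
    "https://passport.baidu.com/v2/api/?login",
    {"staticpage": "http://www.baidu.com/cache/user/html/jump.html"},
)
print(req.get_method())          # a request that carries data is sent as POST
print(req.data.decode("ascii"))  # the encoded body, with ':' and '/' escaped
```

Building the request separately from sending it makes it easy to inspect the encoded body before any traffic reaches the server.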
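【Version 2: complete Python code for simulating a login to the Baidu home page, crifanLib.py version】
This is another version, which uses functions from my own Python library, crifanLib.py: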
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Function: Used to demonstrate how to use Python code to emulate login baidu main page: http://www.baidu.com/
          Use the functions from crifanLib.py
Note: Before trying to understand the following code, please first read the related articles:
(1) 【Notes】The logic/flow of fetching web pages, parsing page content, and simulating website login, with points to note
    http://www.crifan.com/summary_about_flow_process_of_fetch_webpage_simulate_login_website_and_some_notice/
(2) 【Tutorial】Step by step: using a tool (F12 in IE9) to analyze the internal logic of simulating a login to a website (the Baidu home page)
    http://www.crifan.com/use_ie9_f12_to_analysis_the_internal_logical_process_of_login_baidu_main_page_website/
(3) 【Tutorial】Simulating website login: Python version
    http://www.crifan.com/emulate_login_website_using_python
Version: 2012-11-07
Author: Crifan
Contact: admin (at) crifan.com
"""
# Python 2 modules: cookielib and urllib2 do not exist under these names in Python 3
import re
import cookielib
import urllib
import urllib2
import optparse

#===============================================================================
# following are some functions, extracted from my python library: crifanLib.py
# for the whole crifanLib.py:
# online browser: http://code.google.com/p/crifanlib/source/browse/trunk/python/crifanLib.py
# download      : http://code.google.com/p/crifanlib/downloads/list
#===============================================================================
import zlib

gConst = {
    'constUserAgent' : 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)',
    #'constUserAgent' : "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1",
}

################################################################################
# Network: urllib/urllib2/http
################################################################################

#------------------------------------------------------------------------------
# get response from url
# note: if you have already installed a cookiejar, it is used automatically here
# when the request is made with urllib2.Request
def getUrlResponse(url, postDict={}, headerDict={}, timeout=0, useGzip=False):
    # make sure url is a str, not unicode, otherwise urllib2.urlopen will raise an error
    url = str(url)

    if postDict:
        # a non-empty postDict turns the request into a form-encoded POST
        postData = urllib.urlencode(postDict)
        req = urllib2.Request(url, postData)
        req.add_header('Content-Type', "application/x-www-form-urlencoded")
    else:
        req = urllib2.Request(url)

    defHeaderDict = {
        'User-Agent'    : gConst['constUserAgent'],
        'Cache-Control' : 'no-cache',
        'Accept'        : '*/*',
        'Connection'    : 'Keep-Alive',
    }

    # add default headers first
    for eachDefHd in defHeaderDict.keys():
        #print "add default header: %s=%s"%(eachDefHd,defHeaderDict[eachDefHd])
        req.add_header(eachDefHd, defHeaderDict[eachDefHd])

    if useGzip:
        #print "use gzip for",url
        req.add_header('Accept-Encoding', 'gzip, deflate')

    # add customized headers afterwards -> allows them to overwrite the default headers
    if headerDict:
        #print "added header:",headerDict
        for key in headerDict.keys():
            req.add_header(key, headerDict[key])

    if timeout > 0:
        # set timeout value if necessary
        resp = urllib2.urlopen(req, timeout=timeout)
    else:
        resp = urllib2.urlopen(req)

    return resp

#------------------------------------------------------------------------------
# get response html==body from url
#def getUrlRespHtml(url, postDict={}, headerDict={}, timeout=0, useGzip=False) :
def getUrlRespHtml(url, postDict={}, headerDict={}, timeout=0, useGzip=True):
    resp = getUrlResponse(url, postDict, headerDict, timeout, useGzip)
    respHtml = resp.read()
    if useGzip:
        #print "---before unzip, len(respHtml)=",len(respHtml)
        respInfo = resp.info()

        # Server: nginx/1.0.8
        # Date: Sun, 08 Apr 2012 12:30:35 GMT
        # Content-Type: text/html
        # Transfer-Encoding: chunked
        # Connection: close
        # Vary: Accept-Encoding
        # ...
        # Content-Encoding: gzip

        # sometimes the request asks for gzip,deflate but the html actually returned is not gzipped
        # -> the response info then does not include the above "Content-Encoding: gzip"
        # eg: http://blog.sina.com.cn/s/comment_730793bf010144j7_3.html
        # -> so only decompress when the data really is gzipped
        if ("Content-Encoding" in respInfo) and (respInfo['Content-Encoding'] == "gzip"):
            # 16 + zlib.MAX_WBITS makes zlib expect a gzip header and trailer
            respHtml = zlib.decompress(respHtml, 16 + zlib.MAX_WBITS)
            #print "+++ after unzip, len(respHtml)=",len(respHtml)

    return respHtml

################################################################################
# Cookies
################################################################################

#------------------------------------------------------------------------------
# check whether all cookies named in cookieNameList exist in cookieJar
def checkAllCookiesExist(cookieNameList, cookieJar):
    # map each required cookie name to whether it has been found
    cookiesDict = {}
    for eachCookieName in cookieNameList:
        cookiesDict[eachCookieName] = False

    allCookieFound = True
    for cookie in cookieJar:
        if cookie.name in cookiesDict:
            cookiesDict[cookie.name] = True

    for eachCookie in cookiesDict.keys():
        if not cookiesDict[eachCookie]:
            allCookieFound = False
            break

    return allCookieFound

#===============================================================================

#------------------------------------------------------------------------------
# just for print delimiter
def printDelimiter():
    print '-' * 80

#------------------------------------------------------------------------------
# main function to emulate login baidu
def emulateLoginBaidu():
    print "Function: Used to demonstrate how to use Python code to emulate login baidu main page: http://www.baidu.com/"
    print "Usage: emulate_login_baidu_python.py -u yourBaiduUsername -p yourBaiduPassword"
    printDelimiter()

    # parse input parameters
    parser = optparse.OptionParser()
    parser.add_option("-u", "--username", action="store", type="string", default='', dest="username", help="Your Baidu Username")
    parser.add_option("-p", "--password", action="store", type="string", default='', dest="password", help="Your Baidu password")
    (options, args) = parser.parse_args()
    # copy the parsed options into the local variables used below
    username = options.username
    password = options.password

    printDelimiter()
    print "[preparation] using cookieJar & HTTPCookieProcessor to automatically handle cookies"
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    urllib2.install_opener(opener)

    printDelimiter()
    print "[step1] to get cookie BAIDUID"
    baiduMainUrl = "http://www.baidu.com/"
    resp = getUrlResponse(baiduMainUrl)
    # here you should see: BAIDUID
    for index, cookie in enumerate(cj):
        print '[', index, ']', cookie

    printDelimiter()
    print "[step2] to get token value"
    getapiUrl = "https://passport.baidu.com/v2/api/?getapi&class=login&tpl=mn&tangram=true"
    getapiRespHtml = getUrlRespHtml(getapiUrl)
    #bdPass.api.params.login_token='5ab690978812b0e7fbbe1bfc267b90b3';
    foundTokenVal = re.search(r"bdPass\.api\.params\.login_token='(?P<tokenVal>\w+)';", getapiRespHtml)
    if foundTokenVal:
        tokenVal = foundTokenVal.group("tokenVal")
        print "tokenVal=", tokenVal

        printDelimiter()
        print "[step3] emulate login baidu"
        staticpage = "http://www.baidu.com/cache/user/html/jump.html"
        baiduMainLoginUrl = "https://passport.baidu.com/v2/api/?login"
        postDict = {
            #'ppui_logintime': "",
            'charset'       : "utf-8",
            #'codestring'    : "",
            'token'         : tokenVal, #de3dbf1e8596642fa2ddf2921cd6257f
            'isPhone'       : "false",
            'index'         : "0",
            #'u'             : "",
            #'safeflg'       : "0",
            'staticpage'    : staticpage, #http%3A%2F%2Fwww.baidu.com%2Fcache%2Fuser%2Fhtml%2Fjump.html
            'loginType'     : "1",
            'tpl'           : "mn",
            'callback'      : "parent.bdPass.api.login._postCallback",
            'username'      : username,
            'password'      : password,
            #'verifycode'    : "",
            'mem_pass'      : "on",
        }
        # getUrlRespHtml sends a POST request because postDict is not empty
        loginRespHtml = getUrlRespHtml(baiduMainLoginUrl, postDict)
        cookiesToCheck = ['BDUSS', 'PTOKEN', 'STOKEN', 'SAVEUSERID']
        loginBaiduOK = checkAllCookiesExist(cookiesToCheck, cj)
        if loginBaiduOK:
            print "+++ Emulate login baidu is OK, ^_^"
        else:
            print "--- Failed to emulate login baidu !"
    else:
        print "Fail to extract token value from html=", getapiRespHtml

if __name__ == "__main__":
    emulateLoginBaidu()
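The aim of this version is to let later users call the network-related functions without having to care about their internals.
The functions can also be reused elsewhere later.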
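One internal detail that `getUrlRespHtml` hides is the gzip handling: the body is decompressed only when the response headers actually report `Content-Encoding: gzip`. The following offline sketch illustrates that idea in Python 3. The helper name `decode_body` is hypothetical, and `gzip.compress` is used here only to produce test data in place of a real server response.

```python
import gzip
import zlib


def decode_body(raw, content_encoding):
    """Decompress raw only when the response header says it is gzip data."""
    if content_encoding == "gzip":
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        return zlib.decompress(raw, 16 + zlib.MAX_WBITS)
    return raw


original = b"<html>hello</html>"
compressed = gzip.compress(original)

# A gzip-encoded body is restored to the original bytes.
print(decode_body(compressed, "gzip") == original)
# A body without the gzip header value is passed through unchanged.
print(decode_body(original, None) == original)
```

Checking the header first matters because a server may ignore `Accept-Encoding` and return plain HTML, in which case calling `zlib.decompress` on it would raise an error.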
Note on crifanLib.py:
Browse online: crifanLib.py
Download: crifanLib_2012-11-07.7z
Both versions of the code above produce the same output:
D:\tmp\tmp_dev_root\python\emulate_login_baidu_python>emulate_login_baidu_python.py -u crifan -p xxxxxx
Function: Used to demonstrate how to use Python code to emulate login baidu main page: http://www.baidu.com/
Usage: emulate_login_baidu_python.py -u yourBaiduUsername -p yourBaiduPassword
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
[preparation] using cookieJar & HTTPCookieProcessor to automatically handle cookies
--------------------------------------------------------------------------------
[step1] to get cookie BAIDUID
[ 0 ] <Cookie BAIDUID=8D85C6528FDF7B5F49C746A18524495B:FG=1 for .baidu.com/>
--------------------------------------------------------------------------------
[step2] to get token value
tokenVal= 4d3f004bbe3e6f0cfa435abd38dd9fec
--------------------------------------------------------------------------------
[step3] emulate login baidu
+++ Emulate login baidu is OK, ^_^
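【Summary】
Overall, working out the internal logic involved in logging in to a website is actually much harder than writing the code for it.
And understanding the overall logic of the login process matters far more than the details of analyzing it with a particular tool.
When I worked through all of this myself, I suffered from the lack of a complete tutorial, which is why this series of posts now exists, explaining the whole process from start to finish: from concepts, to logic, to analysis, to implementation.
After reading all of them, you should have a reasonable understanding of this topic.
What remains is hands-on practice, that is, working through it yourself.
I hope the concepts, logic, methods, and code above are useful to you.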