[Python Crawler 6] Form Interaction
- Manually submitting the login form with a POST request
- 1 Analyzing the form
- 2 Testing the POST submission by hand
- 3 Complete source for the manual POST login
- Loading Firefox cookies to log in to the site
- 1 Location of the session file
- 2 Firefox cookie contents
- 3 Testing the cookie-based login
- 4 Source for the cookie-based login
- Automating form submission with the high-level Mechanize module
- 1 Using Mechanize to automate form submission and update page content after login
- 2 Updating page content after login with the plain approach
Strictly speaking, this post on form interaction and the next one on CAPTCHA handling are not about web crawling so much as web bots in the broader sense. A bot lets us get past form interaction, one of the hurdles to extracting data.
1. Manually submitting the login form with a POST request
First, register an account by hand on the example site; registration requires a CAPTCHA, and handling CAPTCHAs is covered in the next post.
1.1 Analyzing the form
On the login page http://127.0.0.1:8000/places/default/user/login we get the form below. The login form has several important parts:
- the form tag's action attribute: the address the form data is submitted to; here it is #, i.e. the same URL as the login form;
- the form tag's enctype attribute: the encoding used for the submitted data; here it is application/x-www-form-urlencoded, meaning all non-alphanumeric characters are converted to their hexadecimal ASCII values. For uploading binary files, multipart/form-data is the better encoding type: it does not encode the input (so efficiency is unaffected) but instead sends it as multiple parts using MIME, the same standard used for email. Documentation: http://www.w3.org/TR/html5/forms.html#selecting-a-form-submission-encoding
- the form tag's method attribute: here it is post, meaning the form data is sent to the server in the request body;
- each input tag's name attribute: the field name under which that value is submitted to the server.
<form action="#" enctype="application/x-www-form-urlencoded" method="post">
  <table>
    <tr id="auth_user_email__row">
      <td class="w2p_fl"><label class="" for="auth_user_email" id="auth_user_email__label">E-mail: </label></td>
      <td class="w2p_fw"><input class="string" id="auth_user_email" name="email" type="text" value="" /></td>
      <td class="w2p_fc"></td>
    </tr>
    <tr id="auth_user_password__row">
      <td class="w2p_fl"><label class="" for="auth_user_password" id="auth_user_password__label">Password: </label></td>
      <td class="w2p_fw"><input class="password" id="auth_user_password" name="password" type="password" value="" /></td>
      <td class="w2p_fc"></td>
    </tr>
    <tr id="auth_user_remember_me__row">
      <td class="w2p_fl"><label class="" for="auth_user_remember_me" id="auth_user_remember_me__label">Remember me (for 30 days): </label></td>
      <td class="w2p_fw"><input class="boolean" id="auth_user_remember_me" name="remember_me" type="checkbox" value="on" /></td>
      <td class="w2p_fc"></td>
    </tr>
    <tr id="submit_record__row">
      <td class="w2p_fl"></td>
      <td class="w2p_fw">
        <input type="submit" value="Log In" />
        <button class="btn w2p-form-button" onclick="window.location='/places/default/user/register';return false">Register</button>
      </td>
      <td class="w2p_fc"></td>
    </tr>
  </table>
  <div style="display:none;">
    <input name="_next" type="hidden" value="/places/default/index" />
    <input name="_formkey" type="hidden" value="7b1add4b-fa91-4301-975e-b6fbf7def3ac" />
    <input name="_formname" type="hidden" value="login" />
  </div>
</form>
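To see what application/x-www-form-urlencoded means on the wire, here is a small sketch (shown in Python 3, where the function lives in urllib.parse; the code in this post itself is Python 2, where it is urllib.urlencode; the field values are made up for illustration):

```python
# How a browser serializes form fields under application/x-www-form-urlencoded:
# non-alphanumeric characters become %XX hexadecimal escapes and the
# name=value pairs are joined with '&'.
from urllib.parse import urlencode

fields = {
    'email': 'user@example.com',   # '@' is not alphanumeric -> %40
    'password': 'pass word',       # a space becomes '+' in form encoding
    '_formname': 'login',
}
body = urlencode(fields)
print(body)
```

This percent-encoded string is exactly what ends up in the POST request body when the login form above is submitted.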
1.2 Testing the POST submission by hand
If the login succeeds, the site redirects to the home page; otherwise we land back on the login page. Below is a first version of the automated login attempt. It clearly fails!
>>> import urllib, urllib2
>>> LOGIN_URL = 'http://127.0.0.1:8000/places/default/user/login'
>>> LOGIN_EMAIL = '1040003585@qq.com'
>>> LOGIN_PASSWORD = 'wu.com'
>>> data = {'email': LOGIN_EMAIL, 'password': LOGIN_PASSWORD}
>>> encoded_data = urllib.urlencode(data)
>>> request = urllib2.Request(LOGIN_URL, encoded_data)
>>> response = urllib2.urlopen(request)
>>> response.geturl()
'http://127.0.0.1:8000/places/default/user/login'
That is because the login also needs the hidden _formkey field. This unique ID is used to prevent a form from being submitted more than once: every page load generates a different ID, which the server then uses to determine whether that particular form has already been submitted. Here is how to obtain its value:
>>> import lxml.html
>>> def parse_form(html):
...     tree = lxml.html.fromstring(html)
...     data = {}
...     for e in tree.cssselect('form input'):
...         if e.get('name'):
...             data[e.get('name')] = e.get('value')
...     return data
...
>>> import pprint
>>> html = urllib2.urlopen(LOGIN_URL).read()
>>> form = parse_form(html)
>>> pprint.pprint(form)
{'_formkey': '437e4660-0c44-4187-af8d-36487c62ffce',
 '_formname': 'login',
 '_next': '/places/default/index',
 'email': '',
 'password': '',
 'remember_me': 'on'}
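The parse_form helper above relies on lxml's CSS selectors. As a dependency-free alternative, the same field scraping can be sketched with the standard library's html.parser (Python 3 shown; the class name and sample HTML are mine, not from the original):

```python
# A stdlib-only variant of parse_form: collect the name/value pairs of every
# <input> that sits inside a <form>, like lxml's cssselect('form input').
from html.parser import HTMLParser

class FormInputParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while we are inside a <form> element
        self.data = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'form':
            self.depth += 1
        elif tag == 'input' and self.depth:
            d = dict(attrs)
            if d.get('name'):                 # skip unnamed inputs (e.g. submit)
                self.data[d['name']] = d.get('value')

    def handle_endtag(self, tag):
        if tag == 'form':
            self.depth -= 1

def parse_form(html):
    parser = FormInputParser()
    parser.feed(html)
    return parser.data

sample = '''<form action="#" method="post">
  <input name="email" type="text" value="" />
  <input name="_formkey" type="hidden" value="abc-123" />
  <input type="submit" value="Log In" />
</form>'''
print(parse_form(sample))
```

The unnamed submit button is skipped, just as in the lxml version, since only named fields are sent to the server.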
Below is the new version of the automated login, which submits _formkey and the other hidden fields. It still fails!
>>> html = urllib2.urlopen(LOGIN_URL).read()
>>> data = parse_form(html)
>>> data['email'] = LOGIN_EMAIL
>>> data['password'] = LOGIN_PASSWORD
>>> encoded_data = urllib.urlencode(data)
>>> request = urllib2.Request(LOGIN_URL, encoded_data)
>>> response = urllib2.urlopen(request)
>>> response.geturl()
'http://127.0.0.1:8000/places/default/user/login'
We are still missing one crucial component: the cookie. When a regular user loads the login form, the _formkey value is stored in a cookie, and the server compares it with the _formkey value in the submitted form data. Below is the code after adding cookie support with the urllib2.HTTPCookieProcessor class. The login finally succeeds!
>>> import cookielib
>>> cj = cookielib.CookieJar()
>>> opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
>>> html = opener.open(LOGIN_URL).read()  # use the opener
>>> data = parse_form(html)
>>> data['email'] = LOGIN_EMAIL
>>> data['password'] = LOGIN_PASSWORD
>>> encoded_data = urllib.urlencode(data)
>>> request = urllib2.Request(LOGIN_URL, encoded_data)
>>> response = opener.open(request)  # use the opener
>>> response.geturl()
'http://127.0.0.1:8000/places/default/index'
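To see why the CookieJar is the decisive ingredient, here is a self-contained Python 3 sketch (the post's own code is Python 2): a throwaway local HTTP server stands in for the demo site, handing out a session token in Set-Cookie and accepting the "submission" only when that token is sent back. The server, paths, and token value are invented for illustration.

```python
# Minimal model of the _formkey/cookie round-trip: without a CookieJar the
# token issued on the first request is lost, so the second request fails.
import http.cookiejar
import http.server
import threading
import urllib.request

class TokenHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/form':
            # like the login page: issue a session token in a cookie
            self.send_response(200)
            self.send_header('Set-Cookie', 'session_id=abc123; Path=/')
            self.end_headers()
            self.wfile.write(b'form')
        else:
            # like the login check: accept only if the token comes back
            ok = 'session_id=abc123' in self.headers.get('Cookie', '')
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'ok' if ok else b'denied')

    def log_message(self, *args):
        pass  # silence request logging

server = http.server.HTTPServer(('127.0.0.1', 0), TokenHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d' % server.server_address[1]

# Without a CookieJar, every request starts from scratch -> rejected.
plain = urllib.request.build_opener()
plain.open(base + '/form').read()
denied = plain.open(base + '/submit').read()

# With a CookieJar, the token set by /form is replayed on /submit.
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.open(base + '/form').read()
granted = opener.open(base + '/submit').read()
print(denied, granted)
server.shutdown()
```

This mirrors the failure mode above: the earlier attempts used a fresh urllib2.urlopen for each request, so the _formkey cookie set while fetching the form never accompanied the POST.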
1.3 Complete source for the manual POST login:
# -*- coding: utf-8 -*-
import urllib
import urllib2
import cookielib
import lxml.html

LOGIN_EMAIL = '1040003585@qq.com'
LOGIN_PASSWORD = 'wu.com'
#LOGIN_URL = 'http://example.webscraping.com/user/login'
LOGIN_URL = 'http://127.0.0.1:8000/places/default/user/login'


def login_basic():
    """fails because not using formkey"""
    data = {'email': LOGIN_EMAIL, 'password': LOGIN_PASSWORD}
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = urllib2.urlopen(request)
    print response.geturl()


def login_formkey():
    """fails because not using cookies to match formkey"""
    html = urllib2.urlopen(LOGIN_URL).read()
    data = parse_form(html)
    data['email'] = LOGIN_EMAIL
    data['password'] = LOGIN_PASSWORD
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = urllib2.urlopen(request)
    print response.geturl()


def login_cookies():
    """working login"""
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    html = opener.open(LOGIN_URL).read()
    data = parse_form(html)
    data['email'] = LOGIN_EMAIL
    data['password'] = LOGIN_PASSWORD
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = opener.open(request)
    print response.geturl()
    return opener


def parse_form(html):
    """extract all input properties from the form"""
    tree = lxml.html.fromstring(html)
    data = {}
    for e in tree.cssselect('form input'):
        if e.get('name'):
            data[e.get('name')] = e.get('value')
    return data


def main():
    #login_basic()
    #login_formkey()
    login_cookies()


if __name__ == '__main__':
    main()
2. Loading Firefox cookies to log in to the site
First log in manually in the Firefox browser, then close it; a Python script can then reuse the cookies obtained from that session to log in automatically.
2.1 Location of the session file
Firefox stores cookies in a SQLite database and sessions in a JSON file, and both can be read directly from Python. For the login we only need the session. The session file is stored in a different location on each operating system:
- Linux: ~/.mozilla/firefox/*.default/sessionstore.js
- OS X: ~/Library/Application Support/Firefox/Profiles/*.default/sessionstore.js
- Windows Vista and later: %APPDATA%\Mozilla\Firefox\Profiles\*.default\sessionstore.js (%APPDATA% already resolves to the Roaming directory)
Here is a helper function that returns the path of the session file:
def find_ff_sessions():
    paths = [
        '~/.mozilla/firefox/*.default',
        '~/Library/Application Support/Firefox/Profiles/*.default',
        # %APPDATA% already resolves to ...\AppData\Roaming on Windows
        '%APPDATA%/Mozilla/Firefox/Profiles/*.default'
    ]
    for path in paths:
        filename = os.path.join(path, 'sessionstore.js')
        # expanduser handles '~'; expandvars handles '%APPDATA%' on Windows
        matches = glob.glob(os.path.expandvars(os.path.expanduser(filename)))
        if matches:
            return matches[0]
Note: the glob module returns all files matching the given path pattern.
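A tiny Python 3 illustration of the glob behaviour the helper relies on (the profile directory name below is invented for the demo):

```python
# glob.glob expands the '*' wildcard the way the shell does, which is how
# find_ff_sessions matches the randomly named '*.default' profile directory.
import glob
import os
import tempfile

root = tempfile.mkdtemp()
profile = os.path.join(root, '78n340f7.default')   # fake profile dir
os.makedirs(profile)
open(os.path.join(profile, 'sessionstore.js'), 'w').close()

pattern = os.path.join(root, '*.default', 'sessionstore.js')
matches = glob.glob(pattern)
print(matches)
```

Because the wildcard matches the whole directory name, the helper works without knowing the random prefix Firefox assigns to each profile.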
2.2 Firefox cookie contents
Here is the content of the Firefox session file on a Linux system:
wu_being@ubuntukylin64:~/.mozilla/firefox/78n340f7.default$ ls
addons.json           datareporting       key3.db             prefs.js                  storage
blocklist.xml         extensions          logins.json         revocations.txt           storage.sqlite
bookmarkbackups       extensions.ini      mimeTypes.rdf       saved-telemetry-pings     times.json
cert8.db              extensions.json     minidumps           search.json.mozlz4        webapps
compatibility.ini     features            permissions.sqlite  secmod.db                 webappsstore.sqlite
containers.json       formhistory.sqlite  places.sqlite       sessionCheckpoints.json   xulstore.json
content-prefs.sqlite  gmp                 places.sqlite-shm   sessionstore-backups
cookies.sqlite        gmp-gmpopenh264     places.sqlite-wal   sessionstore.js
crashes               healthreport        pluginreg.dat       SiteSecurityServiceState.txt
wu_being@ubuntukylin64:~/.mozilla/firefox/78n340f7.default$ more sessionstore.js
{"version":["sessionrestore",1],"windows":[{
  ...
  "cookies":[
    {"host":"127.0.0.1",
     "value":"127.0.0.1-aabe0222-d083-44ee-94c8-e9343eefb2e5",
     "path":"/",
     "name":"session_id_welcome",
     "httponly":true,
     "originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}},
    {"host":"127.0.0.1",
     "value":"True",
     "path":"/",
     "name":"session_id_places",
     "httponly":true,
     "originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}},
    {"host":"127.0.0.1",
     "value":"\":oJoAPvH-ODMFDXwk3U...su0Dxr7doAgu9yQiSEmgQiSy98Ga7C6K2tIQoZwzY0_4wBO0qHm-FlcBf-cPRk7GPAhix8yS4roOVIvMqP5I7ZB_uIA==\"",
     "path":"/",
     "name":"session_data_places",
     "originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}}
  ],
  "title":"Example web scraping website",
  "_shouldRestore":true,
  "closedAt":1485228738310}],
 "selectedWindow":0,
 "_closedWindows":[],
 "session":{"lastUpdate":1485228738927,"startTime":1485226675190,"recentCrashes":0},
 "global":{}}
wu_being@ubuntukylin64:~/.mozilla/firefox/78n340f7.default$
Given this session storage structure, the following code parses the session into a CookieJar object.
def load_ff_sessions(session_filename):
    cj = cookielib.CookieJar()
    if os.path.exists(session_filename):
        try:
            json_data = json.loads(open(session_filename, 'rb').read())
        except ValueError as e:
            print 'Error parsing session JSON:', str(e)
        else:
            for window in json_data.get('windows', []):
                for cookie in window.get('cookies', []):
                    import pprint; pprint.pprint(cookie)
                    c = cookielib.Cookie(0,
                        cookie.get('name', ''),
                        cookie.get('value', ''),
                        None, False,
                        cookie.get('host', ''),
                        cookie.get('host', '').startswith('.'),
                        cookie.get('host', '').startswith('.'),
                        cookie.get('path', ''),
                        False, False,
                        str(int(time.time()) + 3600 * 24 * 7),
                        False, None, None, {})
                    cj.set_cookie(c)
    else:
        print 'Session filename does not exist:', session_filename
    return cj
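The same parsing logic can be exercised against an inline miniature of the session file, using Python 3's http.cookiejar (the JSON below is a trimmed, made-up stand-in for a real sessionstore.js; in Python 3 an integer expiry is required, since comparing a string expiry to the current time raises a TypeError):

```python
# Build a CookieJar from the 'windows' -> 'cookies' structure shown above.
import http.cookiejar
import json
import time

session_json = '''{"windows": [{"cookies": [
  {"host": "127.0.0.1", "path": "/", "name": "session_id_places", "value": "True"},
  {"host": "127.0.0.1", "path": "/", "name": "session_data_places", "value": "xyz"}
]}]}'''

cj = http.cookiejar.CookieJar()
for window in json.loads(session_json).get('windows', []):
    for cookie in window.get('cookies', []):
        host = cookie.get('host', '')
        c = http.cookiejar.Cookie(
            0,                                  # version
            cookie.get('name', ''),
            cookie.get('value', ''),
            None, False,                        # port, port_specified
            host,                               # domain
            host.startswith('.'),               # domain_specified
            host.startswith('.'),               # domain_initial_dot
            cookie.get('path', ''), False,      # path, path_specified
            False,                              # secure
            int(time.time()) + 3600 * 24 * 7,   # expire in one week
            False, None, None, {})              # discard, comment, comment_url, rest
        cj.set_cookie(c)

names = sorted(c.name for c in cj)
print(names)
```

The jar built this way can then be handed to an HTTPCookieProcessor opener exactly as in section 1.2.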
2.3 Testing the cookie-based login
session_filename = find_ff_sessions()
cj = load_ff_sessions(session_filename)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
html = opener.open(COUNTRY_URL).read()
tree = lxml.html.fromstring(html)
print tree.cssselect('ul#navbar li a')[0].text_content()
If the result is Log In, the session was not loaded correctly; in that case, check that you are actually still logged in to the example site in Firefox. If instead the output ends with Welcome followed by the user's first name, the login succeeded.
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$ python 2login_firefox.py
{u'host': u'127.0.0.1',
 u'httponly': True,
 u'name': u'session_id_welcome',
 u'originAttributes': {u'addonId': u'',
                       u'appId': 0,
                       u'inIsolatedMozBrowser': False,
                       u'privateBrowsingId': 0,
                       u'signedPkg': u'',
                       u'userContextId': 0},
 u'path': u'/',
 u'value': u'127.0.0.1-406df419-ed33-4de5-bc46-cd2d9f3c431b'}
Log In
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$ python 2login_firefox.py
{u'host': u'127.0.0.1',
 u'httponly': True,
 u'name': u'session_id_welcome',
 u'originAttributes': {u'addonId': u'',
                       u'appId': 0,
                       u'inIsolatedMozBrowser': False,
                       u'privateBrowsingId': 0,
                       u'signedPkg': u'',
                       u'userContextId': 0},
 u'path': u'/',
 u'value': u'127.0.0.1-aabe0222-d083-44ee-94c8-e9343eefb2e5'}
{u'host': u'127.0.0.1',
 u'httponly': True,
 u'name': u'session_id_places',
 u'originAttributes': {u'addonId': u'',
                       u'appId': 0,
                       u'inIsolatedMozBrowser': False,
                       u'privateBrowsingId': 0,
                       u'signedPkg': u'',
                       u'userContextId': 0},
 u'path': u'/',
 u'value': u'True'}
{u'host': u'127.0.0.1',
 u'name': u'session_data_places',
 u'originAttributes': {u'addonId': u'',
                       u'appId': 0,
                       u'inIsolatedMozBrowser': False,
                       u'privateBrowsingId': 0,
                       u'signedPkg': u'',
                       u'userContextId': 0},
 u'path': u'/',
 u'value': u'"ef34329782d4efe136522cb44fc4bd21:oJoAPvH-ODM...QiSy98Ga7C6K2tIQoZwzY0_4wBO0qHm-FlcBf-cPRk7GPAhix8yS4roOVIvMqP5I7ZB_uIA=="'}
Welcome Wu
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$
To load cookies from other browsers, you can use the browsercookie module, installed with the pip install browsercookie command. Documentation: https://pypi.python.org/pypi/browsercookie
2.4 Source for the cookie-based login
# -*- coding: utf-8 -*-
import urllib2
import glob
import os
import cookielib
import json
import time
import lxml.html

COUNTRY_URL = 'http://127.0.0.1:8000/places/default/edit/China-47'


def login_firefox():
    """load cookies from firefox"""
    session_filename = find_ff_sessions()
    cj = load_ff_sessions(session_filename)
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    html = opener.open(COUNTRY_URL).read()
    tree = lxml.html.fromstring(html)
    print tree.cssselect('ul#navbar li a')[0].text_content()
    return opener


def load_ff_sessions(session_filename):
    cj = cookielib.CookieJar()
    if os.path.exists(session_filename):
        try:
            json_data = json.loads(open(session_filename, 'rb').read())
        except ValueError as e:
            print 'Error parsing session JSON:', str(e)
        else:
            for window in json_data.get('windows', []):
                for cookie in window.get('cookies', []):
                    import pprint; pprint.pprint(cookie)
                    c = cookielib.Cookie(0,
                        cookie.get('name', ''),
                        cookie.get('value', ''),
                        None, False,
                        cookie.get('host', ''),
                        cookie.get('host', '').startswith('.'),
                        cookie.get('host', '').startswith('.'),
                        cookie.get('path', ''),
                        False, False,
                        str(int(time.time()) + 3600 * 24 * 7),
                        False, None, None, {})
                    cj.set_cookie(c)
    else:
        print 'Session filename does not exist:', session_filename
    return cj


def find_ff_sessions():
    paths = [
        '~/.mozilla/firefox/*.default',
        '~/Library/Application Support/Firefox/Profiles/*.default',
        # %APPDATA% already resolves to ...\AppData\Roaming on Windows
        '%APPDATA%/Mozilla/Firefox/Profiles/*.default'
    ]
    for path in paths:
        filename = os.path.join(path, 'sessionstore.js')
        # expanduser handles '~'; expandvars handles '%APPDATA%' on Windows
        matches = glob.glob(os.path.expandvars(os.path.expanduser(filename)))
        if matches:
            return matches[0]


def main():
    login_firefox()


if __name__ == '__main__':
    main()
3. Automating form submission with the high-level Mechanize module
The Mechanize module simplifies form submission. Install it first: pip install mechanize
3.1 Using Mechanize to automate form submission and update page content after login
# -*- coding: utf-8 -*-
import mechanize
import login

#COUNTRY_URL = 'http://example.webscraping.com/edit/United-Kingdom-239'
COUNTRY_URL = 'http://127.0.0.1:8000/places/default/edit/China-47'


def mechanize_edit():
    """Use mechanize to increment population"""
    # login
    br = mechanize.Browser()
    br.open(login.LOGIN_URL)
    br.select_form(nr=0)
    print br.form
    br['email'] = login.LOGIN_EMAIL
    br['password'] = login.LOGIN_PASSWORD
    response = br.submit()

    # edit country
    br.open(COUNTRY_URL)
    br.select_form(nr=0)
    print 'Population before:', br['population']
    br['population'] = str(int(br['population']) + 1)
    br.submit()

    # check population increased
    br.open(COUNTRY_URL)
    br.select_form(nr=0)
    print 'Population after:', br['population']


if __name__ == '__main__':
    mechanize_edit()
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$ python 3mechanize_edit.py
<POST http://127.0.0.1:8000/places/default/user/login# application/x-www-form-urlencoded
  <TextControl(email=)>
  <PasswordControl(password=)>
  <CheckboxControl(remember_me=[on])>
  <SubmitControl(<None>=Log In) (readonly)>
  <SubmitButtonControl(<None>=) (readonly)>
  <HiddenControl(_next=/places/default/index) (readonly)>
  <HiddenControl(_formkey=72282515-8f0d-4af1-9500-f7ac6f0526a4) (readonly)>
  <HiddenControl(_formname=login) (readonly)>>
Population before: 1330044000
Population after: 1330044001
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$
Documentation: http://wwwsearch.sourceforge.net/mechanize/
3.2 Updating page content after login with the plain approach
# -*- coding: utf-8 -*-
import urllib
import urllib2
import login

#COUNTRY_URL = 'http://example.webscraping.com/edit/United-Kingdom-239'
COUNTRY_URL = 'http://127.0.0.1:8000/places/default/edit/China-47'


def edit_country():
    opener = login.login_cookies()
    country_html = opener.open(COUNTRY_URL).read()
    data = login.parse_form(country_html)
    import pprint; pprint.pprint(data)
    print 'Population before: ' + data['population']
    data['population'] = int(data['population']) + 1
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(COUNTRY_URL, encoded_data)
    response = opener.open(request)

    country_html = opener.open(COUNTRY_URL).read()
    data = login.parse_form(country_html)
    print 'Population after:', data['population']


if __name__ == '__main__':
    edit_country()
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$ python 3edit_country.py
http://127.0.0.1:8000/places/default/index
{'_formkey': '3773506a-ef5e-4c4a-871d-084cb8451659',
 '_formname': 'places/5087',
 'area': '9596960.00',
 'capital': 'Beijing',
 'continent': 'AS',
 'country': 'China',
 'currency_code': 'CNY',
 'currency_name': 'Yuan Renminbi',
 'id': '5087',
 'iso': 'CN',
 'languages': 'zh-CN,yue,wuu,dta,ug,za',
 'neighbours': 'LA,BT,TJ,KZ,MN,AF,NP,MM,KG,PK,KP,RU,VN,IN',
 'phone': '86',
 'population': '1330044001',
 'postal_code_format': '######',
 'postal_code_regex': '^(\\d{6})$',
 'tld': '.cn'}
Population before: 1330044001
Population after: 1330044002
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$
Wu_Being's blog notice: you are welcome to repost, but please credit the original post and link back. Thanks!
[Python Crawler series] "[Python Crawler 6] Form Interaction" http://blog.csdn.net/u014134180/article/details/55507020
GitHub code files for the Python Crawler series: https://github.com/1040003585/WebScrapingWithPython
If this post helped you and you would like to support it, that will give me all the more motivation to keep writing.