【Python Web Scraping 6】Form Interaction


  • Manually submitting a login form with a POST request
    • 1. Analyzing the form contents
    • 2. Manually testing POST form submission
    • 3. Complete source code for manual POST login
  • Logging in by loading cookies from the Firefox browser
    • 1. Location of the session file
    • 2. Firefox cookie contents
    • 3. Testing login with loaded cookies
    • 4. Source code for cookie-based login
  • Automating form submission with the high-level Mechanize module
    • 1. Using Mechanize to automate form submission and update page content after login
    • 2. Updating page content after login with the plain approach

Strictly speaking, the form interaction covered in this post and the CAPTCHA handling in the next one are not web crawling in the narrow sense, but web robots in the broader sense. Using a web robot removes the form-interaction hurdle that otherwise stands between us and the data we want to extract.

1. Manually submitting a login form with a POST request

First, register an account manually on the example website. Registration requires solving a CAPTCHA; handling CAPTCHAs is covered in the next post.

1.1 Analyzing the form contents

At the login URL http://127.0.0.1:8000/places/default/user/login we get the form shown below. The login form has several important components:
- The form tag's action attribute: sets the URL the form data is submitted to. Here it is #, i.e. the same URL as the login form itself;
- The form tag's enctype attribute: sets the encoding used for the submitted data. Here it is application/x-www-form-urlencoded, meaning all non-alphanumeric characters are converted to their hexadecimal ASCII values (see the encoding sketch after the form below). For uploading binary files, multipart/form-data is the better choice: it does not encode the input (so efficiency is unaffected) but instead sends it as multiple parts using the MIME protocol, the same standard used for email. Documentation: http://www.w3.org/TR/html5/forms.html#selecting-a-form-submission-encoding
- The form tag's method attribute: here post, meaning the form data is sent to the server in the request body;
- Each input tag's name attribute: sets the field name under which that field's value is submitted to the server.

<form action="#" enctype="application/x-www-form-urlencoded" method="post">
    <table>
        <tr id="auth_user_email__row">
            <td class="w2p_fl"><label class="" for="auth_user_email" id="auth_user_email__label">E-mail: </label></td>
            <td class="w2p_fw"><input class="string" id="auth_user_email" name="email" type="text" value="" /></td>
            <td class="w2p_fc"></td>
        </tr>
        <tr id="auth_user_password__row">
            <td class="w2p_fl"><label class="" for="auth_user_password" id="auth_user_password__label">Password: </label></td>
            <td class="w2p_fw"><input class="password" id="auth_user_password" name="password" type="password" value="" /></td>
            <td class="w2p_fc"></td>
        </tr>
        <tr id="auth_user_remember_me__row">
            <td class="w2p_fl"><label class="" for="auth_user_remember_me" id="auth_user_remember_me__label">Remember me (for 30 days): </label></td>
            <td class="w2p_fw"><input class="boolean" id="auth_user_remember_me" name="remember_me" type="checkbox" value="on" /></td>
            <td class="w2p_fc"></td>
        </tr>
        <tr id="submit_record__row">
            <td class="w2p_fl"></td>
            <td class="w2p_fw">
                <input type="submit" value="Log In" />
                <button class="btn w2p-form-button" onclick="window.location=&#x27;/places/default/user/register&#x27;;return false">Register</button>
            </td>
            <td class="w2p_fc"></td>
        </tr>
    </table>
    <div style="display:none;">
        <input name="_next" type="hidden" value="/places/default/index" />
        <input name="_formkey" type="hidden" value="7b1add4b-fa91-4301-975e-b6fbf7def3ac" />
        <input name="_formname" type="hidden" value="login" />
    </div>
</form>
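
To see what application/x-www-form-urlencoded looks like on the wire, here is a minimal interactive sketch. The field names come from the form above, but the values are made-up examples, and the order of the key=value pairs in the output may vary:

>>> import urllib
>>> # Non-alphanumeric characters become %XX hex escapes; spaces become '+'
>>> urllib.urlencode({'email': 'user@example.com', 'password': 'p@ss w0rd!'})
'email=user%40example.com&password=p%40ss+w0rd%21'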

1.2 Manually testing POST form submission

On a successful login the server redirects to the home page; otherwise it returns to the login page. Below is a first attempt at automated login. The login clearly fails!

>>> import urllib,urllib2
>>> LOGIN_URL='http://127.0.0.1:8000/places/default/user/login'
>>> LOGIN_EMAIL='1040003585@qq.com'
>>> LOGIN_PASSWORD='wu.com'
>>> data={'email':LOGIN_EMAIL,'password':LOGIN_PASSWORD}
>>> encoded_data=urllib.urlencode(data)
>>> request=urllib2.Request(LOGIN_URL,encoded_data)
>>> response=urllib2.urlopen(request)
>>> response.geturl()
'http://127.0.0.1:8000/places/default/user/login'
>>> 

That is because the login also requires the hidden _formkey field. This unique ID is used to prevent a form from being submitted more than once: a fresh ID is generated on each page load, and the server checks the submitted ID to decide whether that form has already been submitted. Here is how to obtain the value:

>>> import lxml.html
>>> def parse_form(html):
...     tree=lxml.html.fromstring(html)
...     data={}
...     for e in tree.cssselect('form input'):
...             if e.get('name'):
...                     data[e.get('name')]=e.get('value')
...     return data
... 
>>> import pprint
>>> html=urllib2.urlopen(LOGIN_URL).read()
>>> form=parse_form(html)
>>> pprint.pprint(form)
{'_formkey': '437e4660-0c44-4187-af8d-36487c62ffce',
 '_formname': 'login',
 '_next': '/places/default/index',
 'email': '',
 'password': '',
 'remember_me': 'on'}
>>> 

Below is a new version of the automated login that submits _formkey and the other hidden fields. It still fails!

>>> html=urllib2.urlopen(LOGIN_URL).read()
>>> data=parse_form(html)
>>> data['email']=LOGIN_EMAIL
>>> data['password']=LOGIN_PASSWORD
>>> encoded_data=urllib.urlencode(data)
>>> request=urllib2.Request(LOGIN_URL,encoded_data)
>>> response=urllib2.urlopen(request)
>>> response.geturl()
'http://127.0.0.1:8000/places/default/user/login'
>>> 

This time we are missing one crucial component: cookies. When a regular user loads the login form, the _formkey value is stored in a cookie, and the server compares it against the _formkey value in the submitted form data. Below is the code with cookie support added via the urllib2.HTTPCookieProcessor class. At last, the login succeeds!

>>> import cookielib
>>> cj=cookielib.CookieJar()
>>> opener=urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
>>> 
>>> html=opener.open(LOGIN_URL).read()      #opener
>>> data=parse_form(html)
>>> data['email']=LOGIN_EMAIL
>>> data['password']=LOGIN_PASSWORD
>>> encoded_data=urllib.urlencode(data)
>>> request=urllib2.Request(LOGIN_URL,encoded_data)
>>> response=opener.open(request)       #opener
>>> response.geturl()
'http://127.0.0.1:8000/places/default/index'
>>> 

1.3 Complete source code for manual POST login:

# -*- coding: utf-8 -*-
import urllib
import urllib2
import cookielib
import lxml.html

LOGIN_EMAIL = '1040003585@qq.com'
LOGIN_PASSWORD = 'wu.com'
#LOGIN_URL = 'http://example.webscraping.com/user/login'
LOGIN_URL = 'http://127.0.0.1:8000/places/default/user/login'


def login_basic():
    """fails because not using formkey
    """
    data = {'email': LOGIN_EMAIL, 'password': LOGIN_PASSWORD}
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = urllib2.urlopen(request)
    print response.geturl()


def login_formkey():
    """fails because not using cookies to match formkey
    """
    html = urllib2.urlopen(LOGIN_URL).read()
    data = parse_form(html)
    data['email'] = LOGIN_EMAIL
    data['password'] = LOGIN_PASSWORD
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = urllib2.urlopen(request)
    print response.geturl()


def login_cookies():
    """working login
    """
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    html = opener.open(LOGIN_URL).read()
    data = parse_form(html)
    data['email'] = LOGIN_EMAIL
    data['password'] = LOGIN_PASSWORD
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = opener.open(request)
    print response.geturl()
    return opener


def parse_form(html):
    """extract all input properties from the form
    """
    tree = lxml.html.fromstring(html)
    data = {}
    for e in tree.cssselect('form input'):
        if e.get('name'):
            data[e.get('name')] = e.get('value')
    return data


def main():
    #login_basic()
    #login_formkey()
    login_cookies()


if __name__ == '__main__':
    main()

2. Logging in by loading cookies from the Firefox browser

First we log in manually in the Firefox browser, then close Firefox, and then reuse the saved cookies from a Python script to log in automatically.

2.1 Location of the session file

Firefox stores its cookies in a SQLite database and its sessions in a JSON file, and both can be read directly from Python. For the login use case, we only need the session. The session file lives in a different place on each operating system:
- Linux: ~/.mozilla/firefox/*.default/sessionstore.js
- OS X: ~/Library/Application Support/Firefox/Profiles/*.default/sessionstore.js
- Windows Vista and later: %APPDATA%/Roaming/Mozilla/Firefox/Profiles/*.default/sessionstore.js

Here is a helper function that returns the path of the session file:

def find_ff_sessions():
    paths = [
        '~/.mozilla/firefox/*.default',
        '~/Library/Application Support/Firefox/Profiles/*.default',
        '%APPDATA%/Roaming/Mozilla/Firefox/Profiles/*.default'
    ]
    for path in paths:
        filename = os.path.join(path, 'sessionstore.js')
        matches = glob.glob(os.path.expanduser(filename))
        if matches:
            return matches[0]

Note: the glob module returns all files matching the given path pattern, as the quick check below shows.
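
For instance, a short interactive check of what glob does here. The matched profile directory is the one from the Linux listing in the next section; the result on your machine will differ:

>>> import glob, os
>>> # expanduser resolves '~', then glob matches the wildcard profile name
>>> glob.glob(os.path.expanduser('~/.mozilla/firefox/*.default'))
['/home/wu_being/.mozilla/firefox/78n340f7.default']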

2.2 Firefox cookie contents

Here is the content of the Firefox session file on a Linux system:

wu_being@ubuntukylin64:~/.mozilla/firefox/78n340f7.default$ ls
addons.json           datareporting       key3.db             prefs.js                      storage
blocklist.xml         extensions          logins.json         revocations.txt               storage.sqlite
bookmarkbackups       extensions.ini      mimeTypes.rdf       saved-telemetry-pings         times.json
cert8.db              extensions.json     minidumps           search.json.mozlz4            webapps
compatibility.ini     features            permissions.sqlite  secmod.db                     webappsstore.sqlite
containers.json       formhistory.sqlite  places.sqlite       sessionCheckpoints.json       xulstore.json
content-prefs.sqlite  gmp                 places.sqlite-shm   sessionstore-backups
cookies.sqlite        gmp-gmpopenh264     places.sqlite-wal   sessionstore.js
crashes               healthreport        pluginreg.dat       SiteSecurityServiceState.txt
wu_being@ubuntukylin64:~/.mozilla/firefox/78n340f7.default$ more sessionstore.js 
{"version":["sessionrestore",1],"windows":[{
    ...
    "cookies":[
        {"host":"127.0.0.1",
        "value":"127.0.0.1-aabe0222-d083-44ee-94c8-e9343eefb2e5",
        "path":"/",
        "name":"session_id_welcome",
        "httponly":true,
        "originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}},
        {"host":"127.0.0.1",
        "value":"True",
        "path":"/",
        "name":"session_id_places",
        "httponly":true,
        "originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}},
        {"host":"127.0.0.1",
        "value":"\":oJoAPvH-ODMFDXwk3U...su0Dxr7doAgu9yQiSEmgQiSy98Ga7C6K2tIQoZwzY0_4wBO0qHm-FlcBf-cPRk7GPAhix8yS4roOVIvMqP5I7ZB_uIA==\"",
        "path":"/",
        "name":"session_data_places",
        "originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}}
    ],
    "title":"Example web scraping website",
    "_shouldRestore":true,
    "closedAt":1485228738310
}],"selectedWindow":0,"_closedWindows":[],"session":{"lastUpdate":1485228738927,"startTime":1485226675190,"recentCrashes":0},"global":{}}
wu_being@ubuntukylin64:~/.mozilla/firefox/78n340f7.default$ 

Given this session storage structure, the following code parses the session into a CookieJar object.

def load_ff_sessions(session_filename):
    cj = cookielib.CookieJar()
    if os.path.exists(session_filename):
        try:
            json_data = json.loads(open(session_filename, 'rb').read())
        except ValueError as e:
            print 'Error parsing session JSON:', str(e)
        else:
            for window in json_data.get('windows', []):
                for cookie in window.get('cookies', []):
                    import pprint; pprint.pprint(cookie)
                    c = cookielib.Cookie(0, cookie.get('name', ''), cookie.get('value', ''),
                        None, False,
                        cookie.get('host', ''), cookie.get('host', '').startswith('.'), cookie.get('host', '').startswith('.'),
                        cookie.get('path', ''), False,
                        False, str(int(time.time()) + 3600 * 24 * 7), False,
                        None, None, {})
                    cj.set_cookie(c)
    else:
        print 'Session filename does not exist:', session_filename
    return cj

2.3 Testing login with loaded cookies

session_filename = find_ff_sessions()
cj = load_ff_sessions(session_filename)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
html = opener.open(COUNTRY_URL).read()
tree = lxml.html.fromstring(html)
print tree.cssselect('ul#navbar li a')[0].text_content()

If the result is Log In, as in the first run below, the session was not loaded correctly; in that case, check whether you are actually still logged in to the example website in Firefox. If instead you get Welcome followed by the user's first name, as in the second run, the login succeeded.

wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$ python 2login_firefox.py 
{u'host': u'127.0.0.1',
 u'httponly': True,
 u'name': u'session_id_welcome',
 u'originAttributes': {u'addonId': u'',
                       u'appId': 0,
                       u'inIsolatedMozBrowser': False,
                       u'privateBrowsingId': 0,
                       u'signedPkg': u'',
                       u'userContextId': 0},
 u'path': u'/',
 u'value': u'127.0.0.1-406df419-ed33-4de5-bc46-cd2d9f3c431b'}
Log In
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$ 

wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$ python 2login_firefox.py 
{u'host': u'127.0.0.1',
 u'httponly': True,
 u'name': u'session_id_welcome',
 u'originAttributes': {u'addonId': u'',
                       u'appId': 0,
                       u'inIsolatedMozBrowser': False,
                       u'privateBrowsingId': 0,
                       u'signedPkg': u'',
                       u'userContextId': 0},
 u'path': u'/',
 u'value': u'127.0.0.1-aabe0222-d083-44ee-94c8-e9343eefb2e5'}
{u'host': u'127.0.0.1',
 u'httponly': True,
 u'name': u'session_id_places',
 u'originAttributes': {u'addonId': u'',
                       u'appId': 0,
                       u'inIsolatedMozBrowser': False,
                       u'privateBrowsingId': 0,
                       u'signedPkg': u'',
                       u'userContextId': 0},
 u'path': u'/',
 u'value': u'True'}
{u'host': u'127.0.0.1',
 u'name': u'session_data_places',
 u'originAttributes': {u'addonId': u'',
                       u'appId': 0,
                       u'inIsolatedMozBrowser': False,
                       u'privateBrowsingId': 0,
                       u'signedPkg': u'',
                       u'userContextId': 0},
 u'path': u'/',
 u'value': u'"ef34329782d4efe136522cb44fc4bd21:oJoAPvH-ODM...QiSy98Ga7C6K2tIQoZwzY0_4wBO0qHm-FlcBf-cPRk7GPAhix8yS4roOVIvMqP5I7ZB_uIA=="'}
Welcome Wu
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$ 

If you want to load cookies from other browsers, you can use the browsercookie module, installable with pip install browsercookie. Documentation: https://pypi.python.org/pypi/browsercookie
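
As a minimal sketch of how browsercookie could replace the manual session parsing above, assuming the firefox() loader described in its documentation (which returns a cookielib-compatible CookieJar):

import urllib2
import browsercookie

# Load the cookies Firefox has saved on disk into a cookielib-compatible jar
cj = browsercookie.firefox()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
html = opener.open('http://127.0.0.1:8000/places/default/index').read()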

2.4 Source code for cookie-based login

# -*- coding: utf-8 -*-
import urllib2
import glob
import os
import cookielib
import json
import time
import lxml.html

COUNTRY_URL = 'http://127.0.0.1:8000/places/default/edit/China-47'


def login_firefox():
    """load cookies from firefox
    """
    session_filename = find_ff_sessions()
    cj = load_ff_sessions(session_filename)
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    html = opener.open(COUNTRY_URL).read()
    tree = lxml.html.fromstring(html)
    print tree.cssselect('ul#navbar li a')[0].text_content()
    return opener


def load_ff_sessions(session_filename):
    cj = cookielib.CookieJar()
    if os.path.exists(session_filename):
        try:
            json_data = json.loads(open(session_filename, 'rb').read())
        except ValueError as e:
            print 'Error parsing session JSON:', str(e)
        else:
            for window in json_data.get('windows', []):
                for cookie in window.get('cookies', []):
                    import pprint; pprint.pprint(cookie)
                    c = cookielib.Cookie(0, cookie.get('name', ''), cookie.get('value', ''),
                        None, False,
                        cookie.get('host', ''), cookie.get('host', '').startswith('.'), cookie.get('host', '').startswith('.'),
                        cookie.get('path', ''), False,
                        False, str(int(time.time()) + 3600 * 24 * 7), False,
                        None, None, {})
                    cj.set_cookie(c)
    else:
        print 'Session filename does not exist:', session_filename
    return cj


def find_ff_sessions():
    paths = [
        '~/.mozilla/firefox/*.default',
        '~/Library/Application Support/Firefox/Profiles/*.default',
        '%APPDATA%/Roaming/Mozilla/Firefox/Profiles/*.default'
    ]
    for path in paths:
        filename = os.path.join(path, 'sessionstore.js')
        matches = glob.glob(os.path.expanduser(filename))
        if matches:
            return matches[0]


def main():
    login_firefox()


if __name__ == '__main__':
    main()

3. Automating form submission with the high-level Mechanize module

The Mechanize module simplifies form submission. Install it first with: pip install mechanize

3.1 Using Mechanize to automate form submission and update page content after login

# -*- coding: utf-8 -*-
import mechanize
import login

#COUNTRY_URL = 'http://example.webscraping.com/edit/United-Kingdom-239'
COUNTRY_URL = 'http://127.0.0.1:8000/places/default/edit/China-47'


def mechanize_edit():
    """Use mechanize to increment population
    """
    # login
    br = mechanize.Browser()
    br.open(login.LOGIN_URL)
    br.select_form(nr=0)
    print br.form
    br['email'] = login.LOGIN_EMAIL
    br['password'] = login.LOGIN_PASSWORD
    response = br.submit()

    # edit country
    br.open(COUNTRY_URL)
    br.select_form(nr=0)
    print 'Population before:', br['population']
    br['population'] = str(int(br['population']) + 1)
    br.submit()

    # check population increased
    br.open(COUNTRY_URL)
    br.select_form(nr=0)
    print 'Population after:', br['population']


if __name__ == '__main__':
    mechanize_edit()
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$ python 3mechanize_edit.py 
<POST http://127.0.0.1:8000/places/default/user/login# application/x-www-form-urlencoded
  <TextControl(email=)>
  <PasswordControl(password=)>
  <CheckboxControl(remember_me=[on])>
  <SubmitControl(<None>=Log In) (readonly)>
  <SubmitButtonControl(<None>=) (readonly)>
  <HiddenControl(_next=/places/default/index) (readonly)>
  <HiddenControl(_formkey=72282515-8f0d-4af1-9500-f7ac6f0526a4) (readonly)>
  <HiddenControl(_formname=login) (readonly)>>
Population before: 1330044000
Population after: 1330044001
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$ 

Documentation: http://wwwsearch.sourceforge.net/mechanize/

3.2 Updating page content after login with the plain approach

# -*- coding: utf-8 -*-
import urllib
import urllib2
import login

#COUNTRY_URL = 'http://example.webscraping.com/edit/United-Kingdom-239'
COUNTRY_URL = 'http://127.0.0.1:8000/places/default/edit/China-47'


def edit_country():
    opener = login.login_cookies()
    country_html = opener.open(COUNTRY_URL).read()
    data = login.parse_form(country_html)
    import pprint; pprint.pprint(data)
    print 'Population before: ' + data['population']
    data['population'] = int(data['population']) + 1
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(COUNTRY_URL, encoded_data)
    response = opener.open(request)

    country_html = opener.open(COUNTRY_URL).read()
    data = login.parse_form(country_html)
    print 'Population after:', data['population']


if __name__ == '__main__':
    edit_country()
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$ python 3edit_country.py 
http://127.0.0.1:8000/places/default/index
{'_formkey': '3773506a-ef5e-4c4a-871d-084cb8451659',
 '_formname': 'places/5087',
 'area': '9596960.00',
 'capital': 'Beijing',
 'continent': 'AS',
 'country': 'China',
 'currency_code': 'CNY',
 'currency_name': 'Yuan Renminbi',
 'id': '5087',
 'iso': 'CN',
 'languages': 'zh-CN,yue,wuu,dta,ug,za',
 'neighbours': 'LA,BT,TJ,KZ,MN,AF,NP,MM,KG,PK,KP,RU,VN,IN',
 'phone': '86',
 'population': '1330044001',
 'postal_code_format': '######',
 'postal_code_regex': '^(\\d{6})$',
 'tld': '.cn'}
Population before: 1330044001
Population after: 1330044002
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$ 

Wu_Being blog notice: you are welcome to repost my articles; please credit the original post and link. Thank you!
Python Web Scraping series: 《【Python爬虫6】表单交互》 http://blog.csdn.net/u014134180/article/details/55507020
GitHub code for the Python Web Scraping series: https://github.com/1040003585/WebScrapingWithPython


If you've read this post, found it helpful, and feel like sponsoring my writing, that will give me more motivation to keep going.
