Common Python Standard Libraries --- urllib and urllib2
Source: Internet  Editor: 程序博客网  Time: 2024/06/03 14:57
Reposted from: http://lizhenliang.blog.51cto.com/7876557/1872538
>>> import urllib, urllib2
>>> response = urllib.urlopen("http://www.baidu.com")   # fetch the page source of the site
>>> response.readline()
'<!DOCTYPE html>\n'
>>> response.getcode()
200
>>> response.geturl()
'http://www.baidu.com'
2) Posing as the Chrome browser
>>> user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
>>> header = {"User-Agent": user_agent}
>>> request = urllib2.Request("http://www.baidu.com", headers=header)
>>> # The header can also be added with request.add_header('User-Agent', 'Mozilla...')
>>> response = urllib2.urlopen(request)
>>> response.geturl()
'https://www.baidu.com/'
>>> print response.info()   # show the headers returned by the server
Server: bfe/1.0.8.18
Date: Sat, 12 Nov 2016 06:34:54 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Set-Cookie: BAIDUID=5979A74F742651531360C08F3BE06754:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BIDUPSID=5979A74F742651531360C08F3BE06754; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1478932494; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BDSVRTM=0; path=/
Set-Cookie: BD_HOME=0; path=/
Set-Cookie: H_PS_PSSID=1426_18240_17945_21118_17001_21454_21408_21394_21377_21525_21192; path=/; domain=.baidu.com
P3P: CP=" OTI DSP COR IVA OUR IND COM "
Cache-Control: private
Cxy_all: baidu+a24af77d41154f5fc0d314a73fd4c48f
Expires: Sat, 12 Nov 2016 06:34:17 GMT
X-Powered-By: HPHP
X-UA-Compatible: IE=Edge,chrome=1
Strict-Transport-Security: max-age=604800
BDPAGETYPE: 1
BDQID: 0xf51e0c970000d938
BDUSERID: 0
Set-Cookie: __bsi=12824513216883597638_00_24_N_N_3_0303_C02F_N_N_N_0; expires=Sat, 12-Nov-16 06:34:59 GMT; domain=www.baidu.com; path=/
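The snippets in this article target Python 2. For readers on Python 3, where `urllib2` was merged into `urllib.request`, the User-Agent setup above can be sketched and checked offline, without sending a request. The `example.com` URL and the helper name `build_request` are placeholders introduced here, not part of the original post.

```python
from urllib.request import Request

USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36")


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    request = Request(url, headers={"User-Agent": user_agent})
    # Equivalent alternative: request.add_header("User-Agent", user_agent)
    return request


request = build_request("http://www.example.com/")
# Request normalizes header names with str.capitalize(), hence "User-agent".
print(request.get_header("User-agent"))
print(request.header_items())
```

Inspecting the request object this way is a quick check that the header is set before any traffic goes out.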
>>> post_data = {"loginform-username": "test", "loginform-password": "123456"}
>>> response = urllib2.urlopen("http://home.51cto.com/index", data=urllib.urlencode(post_data))
>>> response.read()   # page content after logging in
>>> urllib.urlencode(post_data)
'loginform-password=123456&loginform-username=test'
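A minimal Python 3 sketch of the same form encoding, assuming the Python 3 module layout (`urllib.parse.urlencode` and `urllib.request.Request`). It only builds the request and does not contact the server. The point it illustrates is that Python 3 expects the body as bytes, and that supplying `data` turns the request into a POST.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request

post_data = {"loginform-username": "test", "loginform-password": "123456"}

# urlencode() returns str; Python 3 requires the request body to be bytes.
encoded = urlencode(post_data).encode("ascii")

# Supplying data switches the request method from GET to POST.
request = Request("http://home.51cto.com/index", data=encoded)
print(request.get_method())

# Round-trip check: decoding the body gives back the original fields.
decoded = parse_qs(encoded.decode("ascii"))
print(decoded)
```

The field order in the Python 2 output above differs from the order in the dict literal because Python 2 dicts iterate in arbitrary order. Form fields are looked up by name, so this normally makes no difference.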
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib, urllib2
import cookielib

# Instantiate a CookieJar object to hold the cookies
cookie = cookielib.CookieJar()
# Create the cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# Build an opener from the handler
opener = urllib2.build_opener(handler)
response = opener.open("http://www.baidu.com")
for item in cookie:
    print item.name, item.value

# python test.py
BAIDUID EB4BF619C95630EFD619B99C596744B0:FG=1
BIDUPSID EB4BF619C95630EFD619B99C596744B0
H_PS_PSSID 1437_20795_21099_21455_21408_21395_21377_21526_21190_21306
PSTM 1478936429
BDSVRTM 0
BD_HOME 0
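To make the cookie handler's two jobs visible without network access, the sketch below (Python 3, where `cookielib` is named `http.cookiejar`) calls the jar methods that `HTTPCookieProcessor` invokes: `extract_cookies()` on each response and `add_cookie_header()` on each outgoing request. `FakeResponse`, the `example.com`/`example.org` hosts, and the cookie values are hypothetical stand-ins, not taken from the original.

```python
from email.message import Message
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, Request, build_opener


class FakeResponse:
    """Hypothetical stand-in for an HTTP response; the jar only needs info()."""

    def __init__(self, set_cookie_values):
        self._headers = Message()
        for value in set_cookie_values:
            # Assigning the same header name again appends another header.
            self._headers["Set-Cookie"] = value

    def info(self):
        return self._headers


jar = CookieJar()
opener = build_opener(HTTPCookieProcessor(jar))  # how the jar is normally wired in

# Step 1: collect the Set-Cookie headers of a response into the jar.
first = Request("http://www.example.com/")
jar.extract_cookies(FakeResponse(["sid=abc123; Path=/", "theme=dark; Path=/"]), first)

# Step 2: attach matching cookies to later requests.
same_site = Request("http://www.example.com/page")
jar.add_cookie_header(same_site)
other_site = Request("http://other.example.org/")
jar.add_cookie_header(other_site)

for item in jar:
    print(item.name, item.value)
print(same_site.get_header("Cookie"))   # cookies stored for www.example.com
print(other_site.get_header("Cookie"))  # no cookie applies to the other host
```

With a real opener these two steps happen automatically on every `opener.open()` call, which is why the loop over `cookie` in the script above finds the cookies Baidu set.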
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib, urllib2
import cookielib

cookie_file = 'cookie.txt'
# Save cookies to a file
cookie = cookielib.MozillaCookieJar(cookie_file)
# Create the cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# Build an opener from the handler
opener = urllib2.build_opener(handler)
response = opener.open("http://www.baidu.com")
# Save. ignore_discard defaults to False, which skips cookies marked to be discarded
# (session cookies); ignore_expires defaults to False, which skips expired cookies.
cookie.save(ignore_discard=True, ignore_expires=True)

# python test.py
# cat cookie.txt
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file! Do not edit.
.baidu.com TRUE / FALSE 3626420835 BAIDUID 687544519EA906BD0DE5AE02FB25A5B3:FG=1
.baidu.com TRUE / FALSE 3626420835 BIDUPSID 687544519EA906BD0DE5AE02FB25A5B3
.baidu.com TRUE / FALSE  H_PS_PSSID 1420_21450_21097_18560_21455_21408_21395_21377_21526_21192_20927
.baidu.com TRUE / FALSE 3626420835 PSTM 1478937189
www.baidu.com FALSE / FALSE  BDSVRTM 0
www.baidu.com FALSE / FALSE  BD_HOME 0
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2
import cookielib

# Instantiate the object
cookie = cookielib.MozillaCookieJar()
# Read the cookies from the file
cookie.load("cookie.txt", ignore_discard=True, ignore_expires=True)
# Create the cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# Build an opener from the handler
opener = urllib2.build_opener(handler)
# request = urllib2.Request("http://www.baidu.com")
response = opener.open("http://www.baidu.com")
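The effect of `ignore_discard` on `save()` is easiest to see with a round trip. The following Python 3 sketch (using `http.cookiejar`, the Python 3 name of `cookielib`) builds two cookies by hand, one persistent and one session cookie, saves them to a temporary file, and reloads it. The helper names `make_cookie` and `saved_names`, the `.example.com` domain, and the cookie values are illustrative assumptions.

```python
import os
import tempfile
import time
from http.cookiejar import Cookie, MozillaCookieJar


def make_cookie(name, value, expires):
    """Build a cookie for .example.com by hand (illustrative values only)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=".example.com", domain_specified=True, domain_initial_dot=True,
        path="/", path_specified=True,
        secure=False, expires=expires,
        discard=expires is None,  # no expiry time means a session cookie
        comment=None, comment_url=None, rest={},
    )


def saved_names(ignore_discard):
    """Save two cookies, reload the file, and return the names that survived."""
    jar = MozillaCookieJar()
    jar.set_cookie(make_cookie("persistent", "1", int(time.time()) + 3600))
    jar.set_cookie(make_cookie("session", "2", None))
    path = os.path.join(tempfile.mkdtemp(), "cookie.txt")
    jar.save(path, ignore_discard=ignore_discard)

    loaded = MozillaCookieJar()
    # Load everything the file contains, so only save() decides the outcome.
    loaded.load(path, ignore_discard=True, ignore_expires=True)
    return sorted(cookie.name for cookie in loaded)


print(saved_names(ignore_discard=False))  # session cookie is left out of the file
print(saved_names(ignore_discard=True))   # both cookies are written
```

Session cookies are written with an empty expiry field, which matches the `H_PS_PSSID`, `BDSVRTM`, and `BD_HOME` lines in the `cookie.txt` listing above.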
7) Accessing a URL through a proxy server
import urllib2

proxy_address = {"http": "http://218.17.252.34:3128"}
handler = urllib2.ProxyHandler(proxy_address)
opener = urllib2.build_opener(handler)
response = opener.open("http://www.baidu.com")
print response.read()
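A hedged Python 3 sketch of the same proxy setup, using `urllib.request`. It builds the opener and inspects it, but does not send a request. The helper `make_proxy_opener` and the `127.0.0.1:3128` address are placeholders introduced for illustration.

```python
from urllib.request import ProxyHandler, build_opener


def make_proxy_opener(proxy_url):
    """Return an opener that routes plain-HTTP requests through proxy_url."""
    handler = ProxyHandler({"http": proxy_url})
    return build_opener(handler)


# Placeholder address; substitute a proxy you are allowed to use.
opener = make_proxy_opener("http://127.0.0.1:3128")

# build_opener() keeps the supplied handler in place of its default ProxyHandler.
proxy_handlers = [h for h in opener.handlers if isinstance(h, ProxyHandler)]
print(len(proxy_handlers), proxy_handlers[0].proxies)

# An empty mapping disables proxies, including ones taken from environment variables.
direct_opener = build_opener(ProxyHandler({}))
```

Only the schemes listed in the mapping are proxied. With the mapping above, `https://` URLs would still be fetched directly unless an `"https"` entry is added.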
8) URL access authentication
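import urllib2

# A realm of None is treated as a catch-all only by HTTPPasswordMgrWithDefaultRealm
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
# (realm, uri, user, passwd)
password_mgr.add_password(None, 'http://www.example.com', 'user', '123456')
auth = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(auth)
response = opener.open('http://www.example.com/test.html')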
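One detail is worth checking when passing `None` as the realm: `HTTPBasicAuthHandler()` without arguments uses a plain `HTTPPasswordMgr`, which stores the `None` entry but only returns it for a lookup with realm `None`. `HTTPPasswordMgrWithDefaultRealm` falls back to the `None` entry for any realm. The Python 3 sketch below compares the two offline; the realm string `"Restricted Area"` is a hypothetical value a server might announce.

```python
from urllib.request import (
    HTTPBasicAuthHandler,
    HTTPPasswordMgr,
    HTTPPasswordMgrWithDefaultRealm,
    build_opener,
)

URI = "http://www.example.com"
PAGE = "http://www.example.com/test.html"
REALM = "Restricted Area"  # hypothetical realm a server might announce

plain = HTTPPasswordMgr()
plain.add_password(None, URI, "user", "123456")

catch_all = HTTPPasswordMgrWithDefaultRealm()
catch_all.add_password(None, URI, "user", "123456")

# The auth handler looks credentials up by the realm named in the 401 response.
print(plain.find_user_password(REALM, PAGE))      # no match for a named realm
print(catch_all.find_user_password(REALM, PAGE))  # falls back to the None entry

# Wire the catch-all manager into an opener through HTTPBasicAuthHandler.
opener = build_opener(HTTPBasicAuthHandler(catch_all))
```

If the server's realm is known in advance, passing it explicitly to `add_password()` works with either manager.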