(2) Fetching a Web Page's Source Code: Python

Source: Internet | Editor: 程序博客网 | Time: 2024/06/01 10:03

Python version: extremely short

#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2

response = urllib2.urlopen("http://www.baidu.com")
print response.read()

POST method:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib
import urllib2

values = {"username": "1016903103@qq.com", "password": "XXXX"}
data = urllib.urlencode(values)
url = "https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn"
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
print response.read()
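For readers on Python 3, where the functionality of urllib2 lives in urllib.request and urlencode lives in urllib.parse, the POST request above could be sketched roughly as follows. This is a minimal sketch rather than the original author's code: the helper name build_post_request and the example.com URL are hypothetical, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post_request(url, values):
    """Build (but do not send) a form-encoded POST request."""
    # In Python 3 the request body must be bytes, so encode the query string.
    data = urlencode(values).encode("ascii")
    return Request(url, data=data)


request = build_post_request(
    "https://example.com/login",  # hypothetical URL
    {"username": "user", "password": "XXXX"},
)
print(request.get_method())  # a Request that carries data defaults to POST
print(request.data)
```

Passing the resulting object to urllib.request.urlopen would send it, just as urllib2.urlopen(request) does in the Python 2 listing.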

GET method:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib
import urllib2

values = {}
values['username'] = "1016903103@qq.com"
values['password'] = "XXXX"
data = urllib.urlencode(values)
url = "http://passport.csdn.net/account/login"
geturl = url + "?" + data
#print geturl
request = urllib2.Request(geturl)
response = urllib2.urlopen(request)
print response.read()
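To make the GET URL construction easier to inspect, here is a small sketch that builds the query string and parses it back. It assumes Python 3 (urllib.parse and urllib.request instead of urllib and urllib2); the helper name build_get_url, the example.com URL, and the sample values are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request


def build_get_url(base_url, params):
    """Append URL-encoded params to base_url as a query string."""
    return base_url + "?" + urlencode(params)


get_url = build_get_url(
    "http://example.com/login",  # hypothetical URL
    {"username": "user@example.com", "password": "XXXX"},
)
request = Request(get_url)  # no data argument, so the method is GET

print(get_url)
print(request.get_method())
# Parsing the query string back recovers the original values.
print(parse_qs(urlsplit(get_url).query))
```

Note that urlencode percent-escapes characters such as "@", so the values survive being embedded in the URL.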

Improved Python version: reporting errors and setting headers and a proxy

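The urlopen function: urlopen(url, data, timeout)

The first parameter, url, is the URL to open. The second, data, is the data to send when accessing the URL. The third, timeout, sets the timeout.

Only url is required; data and timeout are optional. data defaults to None, and timeout defaults to socket._GLOBAL_DEFAULT_TIMEOUT. If you omit data but want to set a timeout, pass timeout as a keyword argument. If data is passed positionally, timeout can follow it positionally without being named. For example: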

response = urllib2.urlopen('http://www.baidu.com', timeout=10)

response = urllib2.urlopen('http://www.baidu.com',data, 10)
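The parameter order and defaults described above can be checked without making a request. The sketch below assumes Python 3, where urlopen lives in urllib.request and keeps the same three leading parameters; it only inspects the function signature.

```python
import inspect
import socket
from urllib.request import urlopen

params = inspect.signature(urlopen).parameters

# The first three parameters are url, data, and timeout, in that order.
print(list(params)[:3])
# data defaults to None ...
print(params["data"].default is None)
# ... and timeout defaults to the global socket timeout sentinel.
print(params["timeout"].default is socket._GLOBAL_DEFAULT_TIMEOUT)
```

Because data sits between url and timeout, a bare second positional argument is always interpreted as data, which is why timeout must be named when data is omitted.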

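Setting headers:

Pass a headers dictionary when you construct the Request, and the headers are sent along with the request. Some servers respond only when they recognize the request as coming from a browser.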

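Pay particular attention to the following header fields:

1. User-Agent: some servers or proxies use this value to decide whether the request was sent by a browser.

2. Content-Type: when you call a REST interface, the server checks this value to decide how to parse the content of the HTTP body. Items 3 to 5 are common values for it.

3. application/xml: used for XML RPC calls, such as RESTful/SOAP calls.

4. application/json: used for JSON RPC calls.

5. application/x-www-form-urlencoded: used when a browser submits a web form.

When you use a RESTful or SOAP service provided by a server, a wrong Content-Type setting causes the server to refuse the request.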
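As an illustration of matching Content-Type to the body, the sketch below builds the same payload once as a form submission and once as JSON. It assumes Python 3 (urllib.request, urllib.parse); the endpoint URL and the payload are hypothetical, and the requests are built but never sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

payload = {"name": "test", "count": 1}  # hypothetical payload
url = "http://example.com/api"  # hypothetical endpoint

# Form submission: the body is a URL-encoded query string.
form_request = Request(
    url,
    data=urlencode(payload).encode("ascii"),
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

# JSON call: same data, but serialized as JSON and labeled accordingly.
json_request = Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Request normalizes header names with str.capitalize(), hence "Content-type".
print(form_request.get_header("Content-type"), form_request.data)
print(json_request.get_header("Content-type"), json_request.data)
```

The two bodies differ byte for byte, so a server that trusts a mismatched Content-Type would try to parse the body with the wrong parser.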

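Dealing with anti-hotlinking:

Some servers check whether the Referer in the headers points to their own site and do not respond if it does not. To handle this, you can also add a Referer to the headers: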

headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' ,'Referer':'http://www.zhihu.com/articles' }
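To see what such a headers dictionary looks like once it is attached to a request, here is a minimal sketch assuming Python 3's urllib.request. The target URL is hypothetical, and the request is only built, not sent.

```python
from urllib.request import Request

headers = {
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)",
    "Referer": "http://www.zhihu.com/articles",
}
# Hypothetical target URL; the request is built but not sent.
request = Request("http://www.zhihu.com/", headers=headers)

# header_items() lists what would be sent. Names are stored capitalized
# ("User-agent"), which is fine because HTTP header names are case-insensitive.
for name, value in request.header_items():
    print(name + ": " + value)
```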

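Setting a proxy:

By default, urllib2 uses the http_proxy environment variable to set the HTTP proxy. Some websites count how many requests an IP address makes within a period of time and block that address if it makes too many. You can route your requests through proxy servers and switch to a different proxy at intervals, which makes a block less likely.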
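One way to switch proxies at intervals is a simple round-robin over a list, sketched below. This assumes Python 3's urllib.request; the proxy addresses and the helper name next_opener are hypothetical, and no request is actually made.

```python
from itertools import cycle
from urllib.request import OpenerDirector, ProxyHandler, build_opener

# Hypothetical proxy addresses; replace with proxies you are allowed to use.
PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
]
_proxy_cycle = cycle(PROXIES)


def next_opener():
    """Return (proxy_url, opener) using the next proxy in round-robin order."""
    proxy_url = next(_proxy_cycle)
    opener = build_opener(ProxyHandler({"http": proxy_url}))
    return proxy_url, opener


# Each call switches to the next proxy; opener.open(url) would then use it.
used = [next_opener()[0] for _ in range(3)]
print(used)
```

Returning a fresh opener per call avoids changing global state, unlike install_opener, which replaces the opener used by every later urlopen call.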

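Fixing garbled text:

If the original page is encoded in gb2312 or gbk and appears garbled because it is displayed as utf-8, you can convert the encoding in code:

html = response.read()
html = html.decode('gbk', 'ignore')  # decode the GBK bytes into a unicode string
html = html.encode('utf-8', 'ignore')  # encode the unicode string as UTF-8 bytes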
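The same decode-then-encode step can be tried offline. The sketch below assumes Python 3, where response.read() returns bytes and decode returns str; the helper name to_utf8 and the sample text are hypothetical stand-ins for a real page.

```python
def to_utf8(raw, source_encoding="gbk"):
    """Convert bytes in source_encoding to UTF-8 bytes, dropping bad bytes."""
    text = raw.decode(source_encoding, "ignore")  # bytes -> str
    return text.encode("utf-8")  # str -> UTF-8 bytes


# Stand-in for response.read() on a GBK-encoded page.
raw = "中文网页".encode("gbk")
converted = to_utf8(raw)
print(converted.decode("utf-8"))
```

The "ignore" error handler silently drops bytes that are invalid in the source encoding, so it hides a wrong guess about the page's encoding rather than reporting it.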

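#!/usr/bin/python
# -*- coding: utf-8 -*-
# The first line is the shebang; the second declares the encoding of this source file,
# which Python 2 needs when the file contains non-ASCII characters.
import urllib  # provides urlencode
import urllib2
import cookielib  # imported in the original listing; not used below

try:
    url = 'http://www.*.com/login'

    # Set up the proxy
    enable_proxy = True
    proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})
    null_proxy_handler = urllib2.ProxyHandler({})
    if enable_proxy:
        opener = urllib2.build_opener(proxy_handler)
    else:
        opener = urllib2.build_opener(null_proxy_handler)
    urllib2.install_opener(opener)

    # Set the headers
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'  # browser-like User-Agent
    headers = {'User-Agent': user_agent}

    # POST method
    values = {'username': 'cqc', 'password': 'XXXX'}  # POST payload
    data = urllib.urlencode(values)

    # Fetch the page source
    request = urllib2.Request(url, data, headers)  # request
    response = urllib2.urlopen(request)  # response
    connect = response.read()  # page content

    # If the page is encoded in gbk:
    # connect = connect.decode('gbk', 'ignore')  # GBK bytes -> unicode
    # connect = connect.encode('utf-8', 'ignore')  # unicode -> UTF-8 bytes
except urllib2.URLError as e:
    # Assumed handler (the listing has no except clause): report the HTTP
    # status code and/or the failure reason when they are available.
    if hasattr(e, 'code'):
        print e.code
    if hasattr(e, 'reason'):
        print e.reason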
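The error-reporting part of the script can be sketched separately. The following assumes Python 3, where the exceptions live in urllib.error; the helper names describe_error and fetch are hypothetical, and the demonstration builds the exceptions by hand so that it does not touch the network.

```python
from email.message import Message
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def describe_error(exc):
    """Turn a urllib error into a short message."""
    # HTTPError is a subclass of URLError, so check it first.
    if isinstance(exc, HTTPError):
        return "HTTP error %d: %s" % (exc.code, exc.reason)
    return "URL error: %s" % exc.reason


def fetch(url, timeout=10):
    """Return (body, None) on success or (None, message) on failure."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read(), None
    except URLError as exc:
        return None, describe_error(exc)


# Exercise the error path without touching the network.
http_message = describe_error(
    HTTPError("http://example.com/", 404, "Not Found", Message(), None)
)
url_message = describe_error(URLError("connection refused"))
print(http_message)
print(url_message)
```

An HTTPError means the server answered with an error status, while a plain URLError usually means the server could not be reached at all, so reporting them differently helps when diagnosing a failed fetch.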