Python Crawlers, Part 1: urllib and urllib2

Source: Internet | Editor: 程序博客网 | Time: 2024/06/05 01:53

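Preface: compared with other languages, Python's real strength is its huge collection of libraries, which can be imported elegantly and concisely. A few short lines read like verse and carry a lot of meaning. There is no need to memorize any of it; glance at the docs when you need something and you will mostly know how to use it.
Environment: Windows 7, 32-bit; Python 2.7
Reference: the English urllib documentation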

I. Analyzing the target site:

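Our goal is to scrape information from websites, and to do that we first have to understand anti-scraping measures. It is like boxing: before you can knock out an opponent, you have to learn to take a punch. Most popular sites today are built on one of three server-side stacks: PHP, ASP, or Java. To scrape a page it helps to know at least one of them, so let's pick "the best language in the world", PHP: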

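1. Login verification: mainly storing and reading cookies and handling CAPTCHAs; quite challenging.
2. Cookie gating: the site sets a cookie for the page; if the cookie validates, the resource is sent, otherwise a 403 is returned.
3. User-Agent detection: used to decide whether the client is a crawler. If the request headers lack a User-Agent, the client is treated as a crawler. In PHP the header can be read from this superglobal: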

  <?php    $_SERVER['HTTP_USER_AGENT']  ?>

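4. Too many or too frequent requests from one IP: probably only large companies and technically strong teams bother with this. RESTful, resource-style APIs will almost certainly check it; the PHP framework Yii 2.0, for example, can rate-limit access to resources.
5. Other restrictions: fully encrypted AJAX, slider CAPTCHAs, click CAPTCHAs…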
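Seen from the client side, points 1 to 3 roughly come down to keeping cookies between requests and sending a User-Agent header. The following is a minimal sketch under stated assumptions: the Python 3 import fallback, the helper names build_cookie_opener and build_browser_like_request, the User-Agent string and the example.com URL are all illustrative and not part of the original article. Nothing is sent over the network here.

```python
# -*- coding:utf-8 -*-
try:                                   # Python 2.7, as used in this article
    import urllib2
    import cookielib
except ImportError:                    # Python 3 fallback (assumption)
    import urllib.request as urllib2
    import http.cookiejar as cookielib

# Hypothetical browser-like User-Agent string, chosen only for illustration
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36'


def build_cookie_opener():
    """Return (opener, jar); the jar stores cookies that servers set."""
    jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    return opener, jar


def build_browser_like_request(url):
    """Return a Request that carries an explicit User-Agent header."""
    return urllib2.Request(url, headers={'User-Agent': USER_AGENT})


opener, jar = build_cookie_opener()
request = build_browser_like_request('http://example.com/')
# Header names are stored capitalized, so the lookup key is 'User-agent'
print(request.get_header('User-agent'))
# opener.open(request) would send the request and collect cookies in jar
```

Reusing the same opener for the login request and for later page requests is what lets a stored session cookie travel along with each call.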

II. Fetching and reading resources

Encode GET and POST parameters: urllib.urlencode(query, doseq=0)
Build a request: urllib2.Request(url, data, headers)
Open the resource: resource = urllib2.urlopen(url or a customized Request[, data][, timeout])
Read the resource: resource.read()

Combined with regular expressions, the calls above are enough to build a simple crawler.
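To make the "fetch, read, then apply a regex" flow concrete, here is a small sketch. The fetch and extract_links helpers, the link pattern and the sample HTML are assumptions for illustration; the pattern is deliberately naive and will miss unusual markup. fetch is defined but not called, so the example runs without network access.

```python
# -*- coding:utf-8 -*-
import re

try:
    import urllib2                     # Python 2.7
except ImportError:
    import urllib.request as urllib2   # Python 3 fallback (assumption)

# Deliberately simple pattern: href="..." or href='...' inside an <a> tag
LINK_RE = re.compile(r'<a\s[^>]*href=["\']([^"\']+)["\']', re.IGNORECASE)


def fetch(url, timeout=10):
    """Download a page and return its raw body (not called in this sketch)."""
    resource = urllib2.urlopen(url, timeout=timeout)
    try:
        return resource.read()
    finally:
        resource.close()


def extract_links(html):
    """Return every href value found in the given HTML text."""
    return LINK_RE.findall(html)


sample = '<p><a href="/947.html">one</a> <A class="x" HREF=\'/948.html\'>two</A></p>'
print(extract_links(sample))
```

On a real page you would pass the result of fetch(url) to extract_links; on Python 3 the body is bytes and needs decoding first.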

1>source = urllib2.urlopen(url | request[, data][, timeout])

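source is a response object.
Its methods include:
- source.getcode(): returns the HTTP status code
- source.geturl(): returns the URL that was actually retrieved, which tells you whether a redirect happened
- source.info(): returns the response headers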
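The three methods above are easiest to understand on a response that was actually redirected. The sketch below starts a throwaway HTTP server on 127.0.0.1 so that nothing leaves the machine; the DemoHandler class, the /old and /new paths and the Python 3 import fallback are assumptions made for this illustration only.

```python
# -*- coding:utf-8 -*-
import threading

try:                                    # Python 2.7
    import urllib2
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
except ImportError:                     # Python 3 fallback (assumption)
    import urllib.request as urllib2
    from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Tiny local server: /old redirects to /new, anything else answers 200."""

    def do_GET(self):
        if self.path == '/old':
            self.send_response(302)
            self.send_header('Location', '/new')
            self.send_header('Content-Length', '0')
            self.end_headers()
        else:
            body = b'hello'
            self.send_response(200)
            self.send_header('Content-Type', 'text/plain')
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):       # keep the demo output quiet
        pass


server = HTTPServer(('127.0.0.1', 0), DemoHandler)   # port 0: any free port
thread = threading.Thread(target=server.serve_forever)
thread.daemon = True
thread.start()

requested = 'http://127.0.0.1:%d/old' % server.server_address[1]
# An empty ProxyHandler bypasses any system proxy for this local demo
opener = urllib2.build_opener(urllib2.ProxyHandler({}))
source = opener.open(requested, timeout=5)
status = source.getcode()
final_url = source.geturl()
content_type = source.info().get('Content-Type')
source.close()
server.shutdown()
server.server_close()

print(status)
print(final_url != requested)   # True here, because /old redirected to /new
print(content_type)
```

Comparing geturl() with the URL you asked for is a cheap way to notice that a site has bounced you to a login or error page.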

2>urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])

Request is a class; each instance describes a single request.
Its methods include:
- Request.add_data(data)
- Request.get_method()
- Request.has_data()
- Request.get_data()
- Request.add_header(key, val)
- Request.has_header(header)
- Request.get_full_url()
- Request.get_type()
- Request.get_header(header_name, default=None)
……
See the reference manual for the rest.
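A short sketch of a few of these methods in use. The Referer value and the X-Missing header name are illustrative assumptions, and so is the Python 3 import fallback. The sketch sticks to get_method, add_header, has_header, get_header and get_full_url; add_data, has_data, get_data and get_type belong to the Python 2 urllib2 API and are not exercised here.

```python
# -*- coding:utf-8 -*-
try:
    import urllib2                     # Python 2.7
except ImportError:
    import urllib.request as urllib2   # Python 3 fallback (assumption)

request = urllib2.Request('http://cuiqingcai.com/947.html')
request.add_header('Referer', 'http://cuiqingcai.com/')

print(request.get_method())                      # no data attached, so GET
print(request.get_full_url())
print(request.has_header('Referer'))
print(request.get_header('Referer'))
print(request.get_header('X-Missing', 'n/a'))    # default when header is absent
```

None of these calls touches the network; the request is only sent once it is passed to urllib2.urlopen.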

III. Constructing requests

1>GET (append the parameters to the URL):

    # -*- coding:utf-8 -*-
    import urllib
    import urllib2

    data = {'country': 'China', 'age': 32}
    # Encode the dict as a query string and append it to the URL
    formatdata = urllib.urlencode(data)
    url = "http://cuiqingcai.com/947.html" + '?' + formatdata
    response = urllib2.urlopen(url)
    print response.geturl()

Output:

    http://cuiqingcai.com/947.html?country=China&age=32
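The doseq argument of urlencode matters as soon as one parameter has several values. A minimal sketch, assuming a hypothetical tag parameter that is not in the original example; a list of pairs is used instead of a dict so that the parameter order is fixed.

```python
# -*- coding:utf-8 -*-
try:
    from urllib import urlencode          # Python 2.7
except ImportError:
    from urllib.parse import urlencode    # Python 3 fallback (assumption)

# A list of pairs keeps the parameter order fixed; a dict may not on Python 2
params = [('country', 'China'), ('tag', ['python', 'crawler'])]

flat = urlencode(params)                # doseq=0: the whole list becomes one value
expanded = urlencode(params, doseq=1)   # doseq=1: one tag=... pair per list item

url = 'http://cuiqingcai.com/947.html' + '?' + expanded
print(flat)
print(url)
```

With doseq left at 0 the list is converted with str() and percent-encoded as a single value, which is rarely what the server expects.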

2>POST (put the parameters in data):

    # -*- coding:utf-8 -*-
    import urllib
    import urllib2

    data = {'country': 'China', 'age': 32}
    formatdata = urllib.urlencode(data)
    url = "http://cuiqingcai.com/947.html"
    # Passing data as the second argument makes this a POST request
    response = urllib2.urlopen(url, formatdata)
    print response.info()

Output:

    Server: nginx/1.10.1
    Date: Tue, 20 Jun 2017 03:48:03 GMT
    Content-Type: text/html; charset=UTF-8
    Transfer-Encoding: chunked
    Connection: close
    Vary: Cookie
    X-Pingback: http://cuiqingcai.com/xmlrpc.php
    Link: <http://cuiqingcai.com/wp-json/>; rel="https://api.w.org/"
    Link: <http://cuiqingcai.com/?p=947>; rel=shortlink
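What turns the request into a POST is simply the presence of a body. The sketch below checks this on two Request objects without sending anything; the variable names and the Python 3 import fallback are assumptions for illustration.

```python
# -*- coding:utf-8 -*-
try:                                      # Python 2.7
    import urllib2
    from urllib import urlencode
except ImportError:                       # Python 3 fallback (assumption)
    import urllib.request as urllib2
    from urllib.parse import urlencode

url = 'http://cuiqingcai.com/947.html'
# Python 3 needs a bytes body; on Python 2 encode() still yields a plain str
body = urlencode([('country', 'China'), ('age', 32)]).encode('ascii')

get_request = urllib2.Request(url)
post_request = urllib2.Request(url, data=body)

print(get_request.get_method())    # GET: no body attached
print(post_request.get_method())   # POST: a body is attached
```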

3>Other methods (PUT, DELETE):

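    # -*- coding:utf-8 -*-
    import urllib
    import urllib2

    url = "http://cuiqingcai.com/947.html"
    data = urllib.urlencode({'country': 'China', 'age': 32})
    request = urllib2.Request(url, data=data)
    # Override get_method to force the verb; use 'DELETE' in the same way
    request.get_method = lambda: 'PUT'
    response = urllib2.urlopen(request)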
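If the verb has to be overridden in many places, a small subclass can be tidier than patching get_method on each instance. This is only a sketch: MethodRequest is a hypothetical helper, not part of urllib2, and the Python 3 import fallback is likewise an assumption.

```python
# -*- coding:utf-8 -*-
try:
    import urllib2                     # Python 2.7
except ImportError:
    import urllib.request as urllib2   # Python 3 fallback (assumption)


class MethodRequest(urllib2.Request):
    """Request whose HTTP verb is chosen explicitly (hypothetical helper)."""

    def __init__(self, url, method, data=None, headers=None):
        urllib2.Request.__init__(self, url, data, headers or {})
        self._verb = method.upper()

    def get_method(self):
        # urllib2 asks the request object for its verb when sending it
        return self._verb


put_request = MethodRequest('http://cuiqingcai.com/947.html', 'put', data=b'age=32')
delete_request = MethodRequest('http://cuiqingcai.com/947.html', 'delete')
print(put_request.get_method())
print(delete_request.get_method())
```

Either object can then be passed to urllib2.urlopen exactly like a plain Request.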

IV. Debug settings

    # -*- coding:utf-8 -*-
    import urllib2

    # debuglevel=1 prints the HTTP request and response headers to stdout
    httphandler = urllib2.HTTPHandler(debuglevel=1)
    httpshandler = urllib2.HTTPSHandler(debuglevel=1)
    opener = urllib2.build_opener(httphandler, httpshandler)
    # Install globally so that urllib2.urlopen uses this opener
    urllib2.install_opener(opener)
    urllib2.urlopen('http://www.news.com/')

V. Setting an HTTP proxy

    # -*- coding:utf-8 -*-
    import urllib2

    proxy_handler = urllib2.ProxyHandler({'http': '....'})
    opener = urllib2.build_opener(proxy_handler)
    # Use the proxy for this opener only
    resource = opener.open('url')
    print(resource.read())

    # Install globally: urllib2.urlopen now goes through the proxy as well
    urllib2.install_opener(opener)
    resource = urllib2.urlopen('url')
    print(resource.read())
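The same idea can be wrapped in a helper so that switching proxies is a one-line change. A minimal sketch, assuming a hypothetical proxy at http://127.0.0.1:8080 and a helper named make_proxy_opener; neither comes from the original article, and nothing is sent over the network.

```python
# -*- coding:utf-8 -*-
try:
    import urllib2                     # Python 2.7
except ImportError:
    import urllib.request as urllib2   # Python 3 fallback (assumption)

# Hypothetical local proxy address, used only as a placeholder
PROXY = 'http://127.0.0.1:8080'


def make_proxy_opener(proxy_url):
    """Build an opener that routes http and https traffic through proxy_url."""
    handler = urllib2.ProxyHandler({'http': proxy_url, 'https': proxy_url})
    return urllib2.build_opener(handler), handler


opener, handler = make_proxy_opener(PROXY)
print(handler.proxies)
# opener.open(url, timeout=10) would now go through the proxy;
# urllib2.install_opener(opener) would make urllib2.urlopen use it as well
```

Keeping the opener local instead of installing it globally makes it easier to rotate between several proxies within one script.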
