Python Crawlers: Getting Started (3): urllib2

Source: Internet  Editor: 程序博客网  Time: 2024/05/19 19:43

Development environment: Ubuntu 14.04.1 with its bundled Python 2.7.6.

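On the differences between urllib and urllib2:
urllib and urllib2 are both modules for making URL requests, but they offer different features. The two most notable differences (found via a Baidu search) are:
urllib2 can accept a Request instance, which lets you set the headers of a URL request; urllib only accepts a URL. With urllib alone you therefore cannot spoof your User-Agent string and the like.
urllib provides the urlencode function for building GET query strings, which urllib2 does not have. This is why urllib is so often used together with urllib2.
In other words, urllib2 lets you put more into the request.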
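The division of labour described above can be sketched as follows. This is only an illustration under stated assumptions: the URL and the query parameter are made-up placeholders, and the try/except import is there only so the snippet also runs on Python 3, where these functions live in urllib.parse and urllib.request. No request is actually sent.

```python
# Sketch: urllib builds the query string, urllib2 carries the headers.
try:                                   # Python 2, as used in this article
    from urllib import urlencode
    from urllib2 import Request
except ImportError:                    # Python 3 equivalents
    from urllib.parse import urlencode
    from urllib.request import Request

# Hypothetical query parameter, for illustration only.
query = urlencode({"page": 1})
url = "http://www.example.com/search?" + query

# A Request object can carry custom headers; a bare URL string cannot.
request = Request(url, headers={"User-Agent": "Mozilla/5.0"})

print(request.get_full_url())
print(request.get_header("User-agent"))  # header names are stored capitalized
```

Passing this Request object to urlopen would send the query string and the custom header together.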

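Here is a practical example: scraping the jokes from a Qiushibaike (糗事百科) page.
Fetching the page directly, using the method from last time: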

# -*- coding:utf-8 -*-
import urllib2

url = 'http://www.qiushibaike.com/hot/1/'
try:
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    print response.read()
except urllib2.URLError, e:
    if hasattr(e,"code"):
        print e.code
    if hasattr(e,"reason"):
        print e.reason

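This fails immediately with an error; the page cannot be accessed normally.
The reason:
Some sites do not want to be accessed by programs, or they serve different content to different browser types. A server usually identifies the client through the User-Agent header, so the program has to modify its HTTP headers to pass itself off as a browser, and changing User-Agent alone is usually enough. To do so, pass a dictionary of headers to the Request object.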
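To see why a plain request is easy for a site to recognize, the sketch below prints the User-Agent that an opener adds when none is supplied, and then the value carried by a Request that sets its own. It is a minimal illustration, not the original author's code: the URL and the browser string are placeholders, the import fallback is only there so it also runs on Python 3, and nothing is sent over the network.

```python
try:
    import urllib2                      # Python 2, as used in this article
except ImportError:
    import urllib.request as urllib2    # Python 3 equivalent

# Headers that an opener adds to every request unless the request overrides them.
default_headers = dict(urllib2.build_opener().addheaders)
print(default_headers["User-agent"])    # a "Python-urllib/<version>" string

# A Request built with its own User-Agent carries that value instead.
req = urllib2.Request("http://www.example.com/",
                      headers={"User-Agent": "Mozilla/5.0"})
print(req.get_header("User-agent"))
```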
Adding headers
Sample code:

import urllib2

url="http://www.example.com/"
# pose as a browser by sending a browser-style User-Agent
headers={"User-Agent":"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"}
req=urllib2.Request(url,headers=headers)
response=urllib2.urlopen(req)

So we change our code to:

# -*- coding:utf-8 -*-
import urllib2

url = 'http://www.qiushibaike.com/hot/1/'
headers={"User-Agent":"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"}
try:
    request = urllib2.Request(url,headers=headers)
    response = urllib2.urlopen(request)
    print response.read()
except urllib2.URLError, e:
    if hasattr(e,"code"):
        print e.code
    if hasattr(e,"reason"):
        print e.reason

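This time the normal HTML is returned.
Supplement: exception handling
Original: http://my.oschina.net/duhaizhang/blog/69834
1. URLError
URLError is usually raised when there is no network connection (no route to the target server) or when the target server does not exist. In that case the exception object has a reason attribute, typically a tuple of (error code, error message).
2. HTTPError
Every HTTP response returned by a server carries a status code. Some status codes mean the server could not fulfil the request. The default handlers deal with some of these for us (for a redirect, for example, urllib2 automatically fetches the content from the redirected page). For status codes that urllib2 cannot handle, urlopen raises HTTPError; typical examples are 404 and 401.
An HTTPError instance has an integer code attribute holding the error status code returned by the server.
urllib2's default handlers take care of redirects (status codes in the 300 range), and status codes in the 100-299 range indicate success. The status codes that can raise HTTPError are therefore those in the 400-599 range.
When an error occurs, the server returns an HTTP error code together with an error page. The HTTPError instance can be used as the returned page: besides the code attribute, it also has the read, geturl and info methods.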
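The point that an HTTPError doubles as a response object can be tried without touching the network. In the sketch below the exception is built by hand; the URL, status text and page body are made-up values, and the import fallback is only there so the snippet also runs on Python 3.

```python
import io

try:
    from urllib2 import HTTPError, URLError       # Python 2, as in this article
except ImportError:
    from urllib.error import HTTPError, URLError  # Python 3 equivalents

# Build an HTTPError by hand (hypothetical URL and body) instead of
# contacting a real server.
body = io.BytesIO(b"<html>page not found</html>")
err = HTTPError("http://www.example.com/missing", 404, "Not Found", {}, body)

print(err.code)        # integer status code
print(err.geturl())    # URL the error belongs to
page = err.read()      # the error page that came with the response
print(page)
```

In real code the same attributes and methods are available on the exception caught in an except urllib2.HTTPError clause.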

There are two ways to handle both URLError and HTTPError in your code:

#! /usr/bin/env python
#coding=utf-8
import urllib2

url="xxxxxx"  # the URL to fetch
try:
    response=urllib2.urlopen(url)
except urllib2.HTTPError,e:    # HTTPError must come before URLError
    print "The server couldn't fulfill the request"
    print "Error code:",e.code
    print "Return content:",e.read()
except urllib2.URLError,e:
    print "Failed to reach the server"
    print "The reason:",e.reason
else:
    # something you should do; this branch runs only when no exception was raised
    pass

#! /usr/bin/env python
#coding=utf-8
import urllib2

url="http://xxx"  # the URL to fetch
try:
    response=urllib2.urlopen(url)
except urllib2.URLError,e:
    # Test "code" first: only HTTPError has it, while an HTTPError
    # may also expose "reason", which would hide this branch.
    if hasattr(e,"code"):
        print "The server couldn't fulfill the request"
        print "Error code:",e.code
        print "Return content:",e.read()
    elif hasattr(e,"reason"):
        print "Failed to reach the server"
        print "The reason:",e.reason
else:
    pass  # runs only when no exception was raised

Of the two, the second approach is the better one.
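Whichever approach is used, the order of the tests matters, because HTTPError is a subclass of URLError and the more specific case has to be checked first. The sketch below makes that explicit with isinstance. The helper name describe_error and the hand-made exceptions are introduced purely for illustration, and the import fallback is only there so the snippet also runs on Python 3.

```python
try:
    from urllib2 import HTTPError, URLError       # Python 2, as in this article
except ImportError:
    from urllib.error import HTTPError, URLError  # Python 3 equivalents


def describe_error(e):
    """Return a short description, testing the more specific class first."""
    if isinstance(e, HTTPError):
        return "The server couldn't fulfill the request (code %d)" % e.code
    if isinstance(e, URLError):
        return "Failed to reach the server (%s)" % e.reason
    return "Unexpected error: %r" % (e,)


# Hand-made exceptions stand in for real network failures.
http_error = HTTPError("http://www.example.com/", 404, "Not Found", {}, None)
url_error = URLError("no route to host")

print(describe_error(http_error))
print(describe_error(url_error))
```

If the two isinstance tests were swapped, every HTTPError would be reported as a failure to reach the server.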

What we actually want is the jokes inside the page:
Extracting them from the page with a regular expression

# -*- coding:utf-8 -*-
import urllib2
import re

url = 'http://www.qiushibaike.com/textnew/'
headers={"User-Agent":"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"}
html = ''  # stays empty if the request fails
try:
    request = urllib2.Request(url,headers=headers)
    response = urllib2.urlopen(request)
    html=response.read()
except urllib2.URLError, e:
    if hasattr(e,"code"):
        print e.code
    if hasattr(e,"reason"):
        print e.reason

def getImg(html):
    # match the text that follows each <div class="content"> tag
    reg = r'<div class="content">\s\s.+\s<!'
    imgre = re.compile(reg)
    imglist = re.findall(imgre,html)
    for i in range(len(imglist)):
        print imglist[i]

getImg(html)

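(One point about regex matching deserves special mention: "." does not match a line break, so every line break in the page must be matched explicitly with \s, one \s per line break. You can of course use \s+ instead. Keep this in mind.)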
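The effect of \s on line breaks can be checked on a small string. The fragment below is a made-up sample shaped like the markup the pattern targets, not real page content.

```python
import re

# Hypothetical fragment shaped like the markup the pattern above targets.
html = '<div class="content">\n\nfirst joke\n<!--1-->\n</div>'

# "." stops at a line break, so this pattern cannot get past the tag line.
no_ws = re.findall(r'<div class="content">.+<!', html)

# One \s per line break, as in the pattern used in this article.
fixed_ws = re.findall(r'<div class="content">\s\s.+\s<!', html)

# \s+ accepts any number of line breaks or spaces.
flexible_ws = re.findall(r'<div class="content">\s+.+\s+<!', html)

print(no_ws)
print(fixed_ws)
print(flexible_ws)
```

The \s+ form is the more forgiving choice when the number of blank lines in the page may vary.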

The final, improved version:

# -*- coding:utf-8 -*-
import urllib2
import re

def uel(url):
    #url = 'http://www.qiushibaike.com/textnew/'
    headers = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"}
    try:
        request = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(request)
        return response.read()
    except urllib2.URLError, e:
        if hasattr(e, "code"):
            print e.code
        if hasattr(e, "reason"):
            print e.reason

f=open('xaiohua.txt','a')

def getImg(html):
    reg = r'<div class="content">\s\s.+'
    imgre = re.compile(reg)
    imglist = re.findall(imgre,html)
    for i in range(len(imglist)):
        # drop the 21-character tag and the two line breaks after it
        f.write(imglist[i][23:]+'\n')

for i in range(35):
    url='http://www.qiushibaike.com/text/page/%s/?s=4866809'%str(i)
    http=uel(url)
    if http:  # uel returns None when the request fails
        getImg(http)
f.close()
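As a possible variation on the slicing with [23:], a capturing group lets findall return only the joke text, and a with block closes the output file automatically. This is a sketch under stated assumptions rather than the author's method: the names JOKE_RE, extract_jokes and save_jokes are hypothetical, the sample markup is made up, and the page is assumed to have already been decoded to text.

```python
import io
import re

# Capturing group: findall returns only the text inside the parentheses.
JOKE_RE = re.compile(r'<div class="content">\s+(.+)')


def extract_jokes(html):
    """Return the text that follows each content div."""
    return [text.strip() for text in JOKE_RE.findall(html)]


def save_jokes(jokes, path):
    """Append one joke per line; the file is closed even if a write fails."""
    with io.open(path, "a", encoding="utf-8") as f:
        for joke in jokes:
            f.write(joke + u"\n")


# Hypothetical markup standing in for a downloaded, decoded page.
sample = (u'<div class="content">\n\nfirst joke\n<!--1-->\n'
          u'<div class="content">\n\nsecond joke\n')
jokes = extract_jokes(sample)
print(jokes)
```

Calling save_jokes(jokes, 'xaiohua.txt') would then append the extracted lines to the same file used above.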


OK, all the jokes in the text section have now been downloaded.

0 0