Python Extensions: Web Crawler Basics


URL Manager
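The source leaves this section empty. As a minimal sketch of what a URL manager does (the class and method names below are illustrative, not from the source): it keeps two sets, one of URLs waiting to be crawled and one of URLs already crawled, so no page is fetched twice.

```
# Illustrative minimal URL manager; the names here are assumptions, not from the source.
class UrlManager(object):
    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):
        # skip URLs that are already queued or already crawled
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # hand out one pending URL and remember it as crawled
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```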

Web Page Downloader

Ways to download a page with urllib2
1. The simple way

```
import urllib2

response = urllib2.urlopen('http://www.baidu.com')  # send the request directly; the URL scheme is required
print response.getcode()  # HTTP status code; 200 means success
cont = response.read()    # read the downloaded content
```
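In practice the request can fail or hang, so a hedged variant of the same call adds a timeout and error handling (the 5-second timeout is an arbitrary illustrative choice):

```
import urllib2

try:
    # the timeout argument (in seconds) is supported by urlopen since Python 2.6
    response = urllib2.urlopen('http://www.baidu.com', timeout=5)
    if response.getcode() == 200:
        cont = response.read()
except urllib2.URLError as e:
    # catches network failures; HTTPError (non-2xx responses) subclasses URLError
    print 'download failed:', e
```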

2. Adding data and an HTTP header

```
import urllib, urllib2

request = urllib2.Request(url)                   # build a Request object
request.add_data(urllib.urlencode({'a': '1'}))   # attach form data; add_data takes one encoded string, not a key/value pair
request.add_header('User-Agent', 'Mozilla/5.0')  # add an HTTP header to masquerade as a Mozilla browser
response = urllib2.urlopen(request)              # send the request and get the result
```

3. Adding handlers for special scenarios

HTTPCookieProcessor   # sites that require login, handled via cookies
ProxyHandler          # sites that require a proxy
HTTPSHandler          # HTTPS-encrypted access
HTTPRedirectHandler   # pages that redirect between URLs
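For example, a proxy-aware opener is built the same way as the cookie example below; the proxy address here is only a placeholder:

```
import urllib2

# ProxyHandler maps a URL scheme to a proxy address (placeholder, not a real proxy)
proxy_handler = urllib2.ProxyHandler({'http': 'http://proxy.example.com:8080'})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)  # subsequent urlopen calls go through the proxy
response = urllib2.urlopen('http://www.baidu.com')
```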

Example:

```
import urllib2, cookielib

cj = cookielib.CookieJar()                                      # create a cookie container
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))  # create an opener
urllib2.install_opener(opener)                                  # install the opener so urllib2 gains cookie handling
urllib2.urlopen('http://www.baidu.com')                         # fetch the page with cookie support
```

4. Full example
```
# coding: utf8
import urllib2
import cookielib

url = 'http://www.baidu.com'

print 'test1'
response1 = urllib2.urlopen(url)
print response1.getcode()
print len(response1.read())

print 'test2'
request = urllib2.Request(url)
request.add_header('User-Agent', 'Mozilla/5.0')
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())

print 'test3'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print response3.getcode()
print cj
print response3.read()
```
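As a side note, urllib2 exists only in Python 2; in Python 3 the same functionality moved to urllib.request. A rough Python 3 equivalent of test1 (my translation, not part of the original notes):

```
# Python 3 equivalent of the simple download above
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.getcode())    # status code, 200 on success
print(len(response.read()))  # the body is bytes in Python 3
```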

Web Page Parser

  1. Regular expressions (fuzzy string matching; see the sketch after this list)
  2. Bundled with Python: html.parser (structured parsing into a DOM tree)
  3. Beautiful Soup (structured parsing)

    1. Installation
    2. How it works
      1. Create a BeautifulSoup object, which automatically parses the HTML into a DOM tree
      2. Search for nodes with find_all or find (which returns only the first match), by tag name, attributes, or text
      3. Access each node's name, attributes, and text
    3. Example:
      ```
      from bs4 import BeautifulSoup

      # create a BeautifulSoup object from an HTML string
      soup = BeautifulSoup(
          html_doc,              # the HTML string
          'html.parser',         # which HTML parser to use
          from_encoding='utf8')  # encoding of the HTML document

      # search interface: find_all(name, attrs, string)

      node.name        # the node's tag name
      node['href']     # access an attribute
      node.get_text()  # get the text content
      ```

    4. Full example:
```
# coding: utf8

from bs4 import BeautifulSoup
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')

print 'get a'
links = soup.find_all('a')
for link in links:
    print link.name, link['href'], link.get_text()

print 'get lacie'
link_node = soup.find('a', href='http://example.com/lacie')
print link_node.name, link_node['href'], link_node.get_text()

print 'get regex'
link_node = soup.find('a', href=re.compile(r"ill"))  # regex match; the r prefix keeps backslashes literal
print link_node.name, link_node['href'], link_node.get_text()

print 'by class'
link_node = soup.find('p', class_='title')  # class_ needs the underscore because class is a Python keyword
print link_node.name, link_node.get_text()
```
  4. lxml (structured parsing)
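To illustrate option 1 above: a regular expression treats the page as a flat string, so the links in html_doc can be pulled out with a pattern like the one below. This is brittle compared to a real parser, and the pattern is illustrative only:

```
import re

# naive pattern: capture the href value of every <a ...> tag
link_re = re.compile(r'<a[^>]*href="([^"]+)"')
for href in link_re.findall(html_doc):
    print href
```

And for option 4, lxml exposes the same document through XPath (assuming the lxml package is installed):

```
from lxml import html

tree = html.fromstring(html_doc)
print tree.xpath('//a/@href')  # list of every link target
```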

Further Topics

Not covered here: sites that require login, CAPTCHAs, AJAX-loaded content, server-side anti-crawler measures, multithreading, and distributed crawling.

A simple crawler example: https://github.com/su526664687/Simple-Spider.git
