Python Extensions: Web Crawler Basics


URL Manager
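The source leaves this section empty. As a minimal sketch of what a URL manager does (the class and method names below are illustrative, not from the source): it keeps two sets, one of URLs waiting to be crawled and one of URLs already crawled, so no page is fetched twice.

```
# Illustrative minimal URL manager; the names here are assumptions, not from the source.
class UrlManager(object):
    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):
        # skip URLs that are already queued or already crawled
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # hand out one pending URL and remember it as crawled
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```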

Web Page Downloader

Ways to download a page with urllib2
1. The simple way

```
import urllib2

response = urllib2.urlopen('http://www.baidu.com')  # send the request directly; the URL scheme is required
print response.getcode()  # HTTP status code; 200 means success
cont = response.read()    # read the downloaded content
```
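In practice the request can fail or hang, so a hedged variant of the same call adds a timeout and error handling (the 5-second timeout is an arbitrary illustrative choice):

```
import urllib2

try:
    # the timeout argument (in seconds) is supported by urlopen since Python 2.6
    response = urllib2.urlopen('http://www.baidu.com', timeout=5)
    if response.getcode() == 200:
        cont = response.read()
except urllib2.URLError as e:
    # catches network failures; HTTPError (non-2xx responses) subclasses URLError
    print 'download failed:', e
```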

2. Adding data and an HTTP header

```
import urllib, urllib2

request = urllib2.Request(url)                   # build a Request object
request.add_data(urllib.urlencode({'a': '1'}))   # attach form data; add_data takes one encoded string, not a key/value pair
request.add_header('User-Agent', 'Mozilla/5.0')  # add an HTTP header to masquerade as a Mozilla browser
response = urllib2.urlopen(request)              # send the request and get the result
```

3. Adding handlers for special scenarios

HTTPCookieProcessor   # sites that require login, handled via cookies
ProxyHandler          # sites that require a proxy
HTTPSHandler          # HTTPS-encrypted access
HTTPRedirectHandler   # pages that redirect between URLs
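For example, a proxy-aware opener is built the same way as the cookie example below; the proxy address here is only a placeholder:

```
import urllib2

# ProxyHandler maps a URL scheme to a proxy address (placeholder, not a real proxy)
proxy_handler = urllib2.ProxyHandler({'http': 'http://proxy.example.com:8080'})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)  # subsequent urlopen calls go through the proxy
response = urllib2.urlopen('http://www.baidu.com')
```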

Example:

```
import urllib2, cookielib

cj = cookielib.CookieJar()                                      # create a cookie container
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))  # create an opener
urllib2.install_opener(opener)                                  # install the opener so urllib2 gains cookie handling
urllib2.urlopen('http://www.baidu.com')                         # fetch the page with cookie support
```

4. Full example
```
# coding: utf8
import urllib2
import cookielib

url = 'http://www.baidu.com'

print 'test1'
response1 = urllib2.urlopen(url)
print response1.getcode()
print len(response1.read())

print 'test2'
request = urllib2.Request(url)
request.add_header('User-Agent', 'Mozilla/5.0')
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())

print 'test3'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print response3.getcode()
print cj
print response3.read()
```
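As a side note, urllib2 exists only in Python 2; in Python 3 the same functionality moved to urllib.request. A rough Python 3 equivalent of test1 (my translation, not part of the original notes):

```
# Python 3 equivalent of the simple download above
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.getcode())    # status code, 200 on success
print(len(response.read()))  # the body is bytes in Python 3
```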

Web Page Parser

  1. Regular expressions (fuzzy string matching; see the sketch after this list)
  2. Bundled with Python: html.parser (structured parsing into a DOM tree)
  3. Beautiful Soup (structured parsing)

    1. Installation
    2. How it works
      1. Create a BeautifulSoup object, which automatically parses the HTML into a DOM tree
      2. Search for nodes with find_all or find (which returns only the first match), by tag name, attributes, or text
      3. Access each node's name, attributes, and text
    3. Example:
      ```
      from bs4 import BeautifulSoup

      # create a BeautifulSoup object from an HTML string
      soup = BeautifulSoup(
          html_doc,              # the HTML string
          'html.parser',         # which HTML parser to use
          from_encoding='utf8')  # encoding of the HTML document

      # search interface: find_all(name, attrs, string)

      node.name        # the node's tag name
      node['href']     # access an attribute
      node.get_text()  # get the text content
      ```

    4. Full example:
```
# coding: utf8

from bs4 import BeautifulSoup
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')

print 'get a'
links = soup.find_all('a')
for link in links:
    print link.name, link['href'], link.get_text()

print 'get lacie'
link_node = soup.find('a', href='http://example.com/lacie')
print link_node.name, link_node['href'], link_node.get_text()

print 'get regex'
link_node = soup.find('a', href=re.compile(r"ill"))  # regex match; the r prefix keeps backslashes literal
print link_node.name, link_node['href'], link_node.get_text()

print 'by class'
link_node = soup.find('p', class_='title')  # class_ needs the underscore because class is a Python keyword
print link_node.name, link_node.get_text()
```
  4. lxml (structured parsing)
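To illustrate option 1 above: a regular expression treats the page as a flat string, so the links in html_doc can be pulled out with a pattern like the one below. This is brittle compared to a real parser, and the pattern is illustrative only:

```
import re

# naive pattern: capture the href value of every <a ...> tag
link_re = re.compile(r'<a[^>]*href="([^"]+)"')
for href in link_re.findall(html_doc):
    print href
```

And for option 4, lxml exposes the same document through XPath (assuming the lxml package is installed):

```
from lxml import html

tree = html.fromstring(html_doc)
print tree.xpath('//a/@href')  # list of every link target
```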

Further Topics

Not covered here: sites that require login, CAPTCHAs, AJAX-loaded content, server-side anti-crawler measures, multithreading, and distributed crawling.

A simple crawler example: https://github.com/su526664687/Simple-Spider.git
