[Python]

Source: Internet | Published by: 超级玛丽安卓源码 | Editor: 程序博客网 | Time: 2024/06/17 22:42

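Introduction to Beautiful Soup

Beautiful Soup is a Python library for extracting data from HTML or XML files; its most common use is scraping data from web pages.

The official description reads as follows:

Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to extract, and because it is simple, a complete application takes very little code.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it, in which case you only have to specify the original encoding.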
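As a rough illustration of the encoding behaviour described above, the sketch below feeds Beautiful Soup some bytes with no declared encoding and names the encoding explicitly through the constructor's from_encoding argument. The sample markup is made up for this example, and the exact spelling reported by original_encoding may vary between versions, so it is printed without an expected value.

```python
from bs4 import BeautifulSoup

# Hypothetical input: Hebrew text encoded as ISO-8859-8, with no encoding
# declaration anywhere in the markup.
expected_text = "\u05e9\u05dc\u05d5\u05dd"
markup = ("<h1>" + expected_text + "</h1>").encode("iso-8859-8")

# from_encoding states the original encoding instead of relying on detection.
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")

# Inside the tree, text is ordinary Unicode.
print(soup.h1.string == expected_text)  # True

# The encoding Beautiful Soup ended up using to decode the input.
print(soup.original_encoding)

# Encoding the output produces UTF-8 bytes.
utf8_bytes = soup.h1.encode("utf-8")
print(utf8_bytes)
```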

Beautiful Soup sits on top of popular Python parsers such as lxml and html5lib, so you can try out different parsing strategies or trade flexibility for speed.

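Installing Beautiful Soup

Installing from the command line

You can install it with either pip or easy_install; both of the commands below work (easy_install is deprecated in current setuptools, so pip is usually the better choice)

easy_install beautifulsoup4
# or
pip3 install beautifulsoup4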

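Installing from the source package

After the download finishes, unpack the archive: Beautiful Soup 4.3.2

Then run the following command to complete the installation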

sudo python setup.py install

Installing a parser

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml

$ easy_install lxml
# or
$ pip install lxml
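Since lxml is an optional third-party package, a script may want to use it when available and otherwise fall back to the standard library parser. The following is only a sketch of one way to do that; the variable name parser and the sample markup are chosen for this example.

```python
from bs4 import BeautifulSoup

# Prefer lxml when it is installed; otherwise use the standard library parser.
try:
    import lxml  # noqa: F401  (imported only to check that it is available)
    parser = "lxml"
except ImportError:
    parser = "html.parser"

soup = BeautifulSoup("<p>Hello, <b>world</b></p>", parser)
print(parser)
print(soup.b.string)  # world
```

Different parsers can build slightly different trees from the same markup (lxml, for instance, wraps fragments in html and body tags), so it is worth naming the parser explicitly rather than relying on the default.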

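Basic usage

The official documentation is extensive but not very well organized, so a selection of commonly used features is demonstrated here
Official documentation

The BeautifulSoup object

Pass a document to the BeautifulSoup constructor to get a document object. The document can be a string or an open file handle

from bs4 import BeautifulSoup

# Open an HTML file; naming the parser explicitly avoids the "no parser specified" warning
with open('index.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')

# Parse a string of HTML
soup = BeautifulSoup('<html>data</html>', 'html.parser')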

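Kinds of objects

Beautiful Soup converts a complex HTML document into a complex tree structure in which every node is a Python object. All of these objects fall into 4 kinds: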

Tag
NavigableString
BeautifulSoup
Comment
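To make the four kinds concrete, the sketch below parses a tiny made-up document and prints the class of each node it contains. The markup is an assumption for this example only.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical markup containing a tag, a text node, and a comment.
soup = BeautifulSoup("<p>Hello<!-- a note --></p>", "html.parser")

p = soup.p
text, comment = p.contents  # the two children of <p>

print(type(soup).__name__)     # BeautifulSoup
print(type(p).__name__)        # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment

# Comment is a special kind of NavigableString.
print(isinstance(comment, NavigableString))  # True
```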

Tag

What is a Tag? Put simply, it is one of the tags in an HTML document
A Tag object corresponds to a tag in the original XML or HTML document

>>> soup = BeautifulSoup('<b class="boldest">Extremely</b>', "lxml")
>>> tag = soup.b
>>> tag
<b class="boldest">Extremely</b>
>>> type(tag)
<class 'bs4.element.Tag'>

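The most important attributes of a tag:

Name: every tag has a name, available as .name
Attributes: the tag's attributes, which are manipulated the same way as a dictionary

Getting and modifying the `name` attribute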

# Get the tag's name
>>> tag.name
'b'
# Change the tag's name
# Renaming a tag is reflected in any HTML markup generated from the current Beautiful Soup object
>>> tag.name = "blockquote"
>>> tag
<blockquote class="boldest">Extremely</blockquote>

Working with `Attributes`

A tag can have any number of attributes. The tag <b class="boldest"> has a "class" attribute whose value is "boldest"

A tag's attributes can be added, removed, or modified, and they are accessed exactly like a dictionary

# Get the tag's class attribute; the value is returned as a list
>>> tag['class']
['boldest']
>>> tag['class'][0]
'boldest'
>>> tag.attrs
{'class': ['boldest']}
# Modify the tag's class and id attributes
>>> tag['class'] = 'mazy'
>>> tag['id'] = 1
>>> tag
<blockquote class="mazy" id="1">Extremely</blockquote>
# Delete the tag's class attribute
>>> del tag['class']
>>> tag
<blockquote id="1">Extremely</blockquote>
# Delete the tag's id attribute
>>> del tag['id']
>>> tag
<blockquote>Extremely</blockquote>
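Because attribute access behaves like a dictionary, looking up a missing attribute with square brackets raises KeyError. The sketch below, on made-up markup, shows the safer get() and has_attr() calls alongside adding, changing, and deleting attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical tag used only for this example.
markup = '<a class="sister" href="http://example.com/elsie">Elsie</a>'
tag = BeautifulSoup(markup, "html.parser").a

print(tag["href"])            # http://example.com/elsie
print(tag.get("id"))          # None (tag["id"] would raise KeyError)
print(tag.has_attr("class"))  # True

tag["id"] = "link1"                  # add a new attribute
tag["class"] = ["sister", "first"]   # replace an existing one
del tag["href"]                      # remove an attribute

print(tag)  # <a class="sister first" id="link1">Elsie</a>
```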

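Multi-valued attributes

HTML defines a number of attributes that can hold more than one value. The most common is class (a tag can have several CSS classes). In Beautiful Soup the value of a multi-valued attribute is returned as a list

>>> css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')
>>> css_soup.p['class']
['body', 'strikeout']

If an attribute looks like it holds several values but is not defined as multi-valued in any version of the HTML standard, Beautiful Soup returns it as a single string

>>> id_soup = BeautifulSoup('<p id="my id"></p>', 'lxml')
>>> id_soup.p['id']
'my id'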
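When code has to treat every attribute uniformly, the list-or-string distinction can be inconvenient. The sketch below shows two options that exist in recent Beautiful Soup 4 releases, get_attribute_list() and the multi_valued_attributes constructor argument; it assumes a reasonably current version is installed, and the markup is invented for the example.

```python
from bs4 import BeautifulSoup

markup = '<p class="body strikeout" id="my id"></p>'

# Default behaviour: class is split into a list, id stays a single string.
p = BeautifulSoup(markup, "html.parser").p
print(p["class"])  # ['body', 'strikeout']
print(p["id"])     # my id

# get_attribute_list() always returns a list, whichever attribute is asked for.
print(p.get_attribute_list("id"))  # ['my id']

# multi_valued_attributes=None keeps every attribute value as a plain string.
p2 = BeautifulSoup(markup, "html.parser", multi_valued_attributes=None).p
print(p2["class"])  # body strikeout
```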

NavigableString

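Strings usually live inside tags. Beautiful Soup uses the NavigableString class to wrap the string contained in a tag

The string inside a tag cannot be edited in place, but it can be swapped for another string with the replace_with() method

>>> tag = BeautifulSoup('<b class="boldest">Extremely</b>', 'lxml').b
>>> tag.string
'Extremely'
>>> type(tag.string)
<class 'bs4.element.NavigableString'>
# replace_with() returns the string that was replaced
>>> tag.string.replace_with('No longer bold')
'Extremely'
>>> tag
<b class="boldest">No longer bold</b>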
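A NavigableString behaves like a Python string but also keeps a reference to the tree it came from. The following sketch, using made-up markup, converts one to a plain str and then replaces it; the variable names are chosen for this example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Extremely</b>", "html.parser")
text = soup.b.string

# str() gives an ordinary Python string with no link back to the parse tree.
plain = str(text)
print(type(plain).__name__)  # str

# replace_with() swaps the node in the tree and returns the old string.
old = text.replace_with("No longer bold")
print(old)     # Extremely
print(soup.b)  # <b>No longer bold</b>
```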

Navigating the document tree

Sample document used in the examples:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

Creating the `BeautifulSoup` document object

>>> soup = BeautifulSoup(html_doc, 'html.parser')
# The soup object
>>> soup
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

Child nodes

A Tag may contain several strings or other Tags; all of these are the Tag's child nodes

Navigating by `Tag` name

# Get the head tag
>>> soup.head
<head><title>The Dormouse's story</title></head>
# Get the title tag
>>> soup.title
<title>The Dormouse's story</title>
# This trick can be chained to step down through the tree. The code below gets the first <b> tag inside <body>
>>> soup.body.b
<b>The Dormouse's story</b>
# Get the first a tag
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# To get all the <a> tags, or anything more than the first tag with a given name, use find_all()
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The `.contents` and `.children` attributes of a `Tag`

A Tag's .contents attribute returns its child nodes as a list
Strings have no .contents attribute, because a string has no children

# Get the contents of the head tag
>>> soup.head.contents
[<title>The Dormouse's story</title>]
# Get the first element of the head tag's contents
>>> soup.head.contents[0]
<title>The Dormouse's story</title>
# Get the contents of that first element
>>> soup.head.contents[0].contents
["The Dormouse's story"]

A Tag's .children generator lets you loop over its child nodes

for child in soup.body.p.children:
    print(child)
# <b>The Dormouse's story</b>
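Child nodes can be either tags or strings, so a loop over .children often needs to tell them apart. A minimal sketch on invented markup, using isinstance checks against the Tag and NavigableString classes:

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical paragraph mixing text and inline tags.
soup = BeautifulSoup("<p>One <b>two</b> three <i>four</i></p>", "html.parser")
p = soup.p

# .children yields the same nodes that .contents holds, one at a time.
same_nodes = list(p.children) == p.contents
print(same_nodes)  # True

# Separate the direct children into tags and text nodes.
child_tags = [child.name for child in p.children if isinstance(child, Tag)]
child_text = [str(child) for child in p.children if isinstance(child, NavigableString)]
print(child_tags)  # ['b', 'i']
print(child_text)  # ['One ', ' three ']
```

Only direct children are visited here: the strings "two" and "four" belong to the b and i tags, not to the paragraph itself.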

Searching the document tree

Beautiful Soup defines many search methods; this article concentrates on 2 of them:

find()
find_all()

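Methods such as find_all() let you locate the parts of the document you are looking for

String search

The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for tags whose name matches that string exactly

The following example finds all <b> tags in the document

>>> soup.find_all('b')
[<b>The Dormouse's story</b>]

Regular expression search

import re

# Find every tag whose name starts with "b"
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b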
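A compiled regular expression can filter attribute values as well as tag names. The sketch below uses made-up markup; it relies on find_all() accepting a regex as the name filter and as the value of an attribute keyword such as href.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup for this example.
html = (
    '<div><a href="http://example.com/elsie">Elsie</a>'
    '<a href="/local/page">Local</a><table></table></div>'
)
soup = BeautifulSoup(html, "html.parser")

# Against tag names the pattern is applied with search(), so any name
# containing "t" matches; here that is only <table>.
names = [tag.name for tag in soup.find_all(re.compile("t"))]
print(names)  # ['table']

# Against an attribute value: links whose href starts with "http://".
absolute_links = soup.find_all(href=re.compile(r"^http://"))
print(absolute_links)  # [<a href="http://example.com/elsie">Elsie</a>]
```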

List search

If you pass a list, Beautiful Soup returns everything that matches any item in the list

The following code finds all <a> tags and <b> tags in the document:

>>> soup.find_all(['a', 'b'])
[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Method / function search

If none of the other filters fit, you can define a function that takes a single element as its argument. It should return True when the element matches and False otherwise

The function below checks the current element and returns True if it has both a class attribute and an id attribute

def has_class_and_id(tag):
    return tag.has_attr('class') and tag.has_attr('id')

# Passing this function to find_all() returns all the <a> tags:
result = soup.find_all(has_class_and_id)
print(result)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
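The same idea works for any predicate over a tag. As a further sketch on invented markup, the function below (a hypothetical helper named for this example) matches tags that have a class but no id, and a lambda handles a short one-off test.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for this example.
html = '<p class="title">A</p><p class="story" id="s1">B</p><a id="link1" href="#">C</a>'
soup = BeautifulSoup(html, "html.parser")

def has_class_but_no_id(tag):
    """Match tags that define a class attribute but no id attribute."""
    return tag.has_attr("class") and not tag.has_attr("id")

matches = soup.find_all(has_class_but_no_id)
print(matches)  # [<p class="title">A</p>]

# A lambda is enough for a short predicate: tags with exactly two attributes.
two_attrs = [tag.name for tag in soup.find_all(lambda tag: len(tag.attrs) == 2)]
print(two_attrs)  # ['p', 'a']
```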

Using find()

find( name , attrs , recursive , string , **kwargs )

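Calling find_all() with limit=1 is less direct than simply calling find().

The following two calls are nearly equivalent:

>>> soup.find_all('title', limit=1)
[<title>The Dormouse's story</title>]
# nearly equivalent to
>>> soup.find('title')
<title>The Dormouse's story</title>

Differences between `find_all()` and `find()`:

In the example above, find_all() returns a list containing a single element, while find() returns the element itself
When nothing matches, find_all() returns an empty list, whereas find() returns None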
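The difference in return values matters most when nothing is found, because calling a method on None raises AttributeError. A small sketch on made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>first</p><p>second</p></div>", "html.parser")

print(soup.find_all("p"))   # [<p>first</p>, <p>second</p>]
print(soup.find("p"))       # <p>first</p>
print(soup.find_all("h1"))  # []
print(soup.find("h1"))      # None

# find() can return None, so check the result before using it.
heading = soup.find("h1")
heading_text = heading.get_text() if heading is not None else "(no heading)"
print(heading_text)  # (no heading)
```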

CSS selectors

Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax

>>> soup.select('title')
[<title>The Dormouse's story</title>]

Searching level by level through `tag` names

>>> soup.select('body a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.select('html head title')
[<title>The Dormouse's story</title>]

Finding the direct children of a `tag`

>>> soup.select('head > title')
[<title>The Dormouse's story</title>]
>>> soup.select('p > a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.select('p > #link1')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
>>> soup.select('body > a')
[]

Searching by `CSS` class name

>>> soup.select('.sister')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Searching by a `tag`'s `id`

>>> soup.select('#link1')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
>>> soup.select('a#link1')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Querying with several `CSS` selectors at once

>>> soup.select('#link1, #link2')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Returning only the first matching element

>>> soup.select_one('.sister')
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
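In practice the selectors above are usually combined and followed by reading an attribute or the text of each match. The sketch below does this on a made-up navigation menu; the markup and variable names are assumptions for this example, and select() relies on the soupsieve package that current Beautiful Soup releases install as a dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for this example.
html = """
<ul id="menu">
  <li class="item"><a href="/home">Home</a></li>
  <li class="item active"><a href="/docs">Docs</a></li>
  <li class="item"><a href="https://example.com/blog">Blog</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# id, class, and direct-child selectors combined; then read href from each match.
hrefs = [a["href"] for a in soup.select("#menu > li.item > a")]
print(hrefs)  # ['/home', '/docs', 'https://example.com/blog']

# Attribute selector: links whose href starts with "https://".
external = [a.get_text() for a in soup.select('a[href^="https://"]')]
print(external)  # ['Blog']

# select_one() returns a single element, or None when nothing matches.
active = soup.select_one("li.active > a")
print(active.get_text() if active else None)  # Docs
print(soup.select_one("li.missing"))          # None
```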
