How do you use BeautifulSoup, the parsing library for Python 3?

Source: Internet | Published by: 医学网络教育 | Editor: 程序博客网 | Date: 2024/05/19 02:21

Beautiful Soup is a Python library for extracting data from HTML and XML files. This article collects the basics of Beautiful Soup and how to use it; hopefully it helps with your Python learning.

　　Version: 4.4.0

　　Installing Beautiful Soup

　　Once Python 3 is installed, a single command is all it takes.

　　 pip install beautifulsoup4

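　　Note: on a Mac you may need to use pip3 install beautifulsoup4 instead.

　　With BeautifulSoup installed, we also need an HTML parser. Python's built-in html.parser works out of the box; a third-party parser such as lxml can be installed as well: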

　　 pip install lxml
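　　Since lxml is an optional extra, a script can fall back to the built-in parser when lxml is missing. The following is only a minimal sketch: make_soup is a hypothetical helper name, not part of Beautiful Soup, while FeatureNotFound is the exception bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

MARKUP = '<b class="boldest">Extremely bold</b>'


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(MARKUP)
print(soup.b.string)
```

　　Naming the parser explicitly also keeps results consistent between machines, because different parsers can build slightly different trees from the same markup.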

　　Everything is now in place!

　　Quick start

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')

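　　Kinds of objects

　　BeautifulSoup turns a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

　　 Tag

　　A Tag object corresponds to a tag in the original XML or HTML document. For example:

>>> tag = soup.b
>>> type(tag)
<class 'bs4.element.Tag'>

　　The two most important features of a tag are its name and its attributes, introduced below. Tag has many more attributes and methods, covered in detail under "Navigating the tree" and "Searching the tree".

　　 Name

　　Use .name to read or change a tag's name:

>>> tag.name
'b'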

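　　 Attributes

　　A tag may have any number of attributes. The tag above, for example, has a class attribute.

>>> tag['class']
['boldest']

　　To get all of the attributes at once:

>>> tag.attrs
{'class': ['boldest']}

　　A tag's attributes can also be added, removed, and modified, using the same operations as a dictionary.

　　Note: some attributes are multi-valued, meaning a single attribute can hold several values at once (class is the most common case). Beautiful Soup returns the value of such an attribute as a list.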
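　　The dictionary-style operations described above can be sketched as follows. The markup, the id values, and the title text are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<b class="boldest strong" id="intro">Extremely bold</b>', "html.parser"
)
tag = soup.b

tag["id"] = "headline"       # modify an existing attribute
tag["title"] = "A bold tag"  # add a new attribute
del tag["title"]             # delete it again

print(tag["class"])          # multi-valued attribute: a list of values
print(tag.get("missing"))    # .get() returns None instead of raising KeyError
print(tag.attrs)             # all remaining attributes
```

　　Reading a missing attribute with tag["missing"] raises KeyError, just as with a dictionary, so .get() is the safer choice when an attribute may be absent.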

　　 NavigableString

　　Strings are usually contained inside tags. Beautiful Soup wraps the string inside a tag in the NavigableString class:

>>> tag.string
'Extremely bold'
>>> type(tag.string)
<class 'bs4.element.NavigableString'>

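　　 BeautifulSoup

　　The BeautifulSoup object represents the document as a whole rather than a real HTML or XML tag, so it has no real name or attributes. Because it is sometimes useful to look at its .name anyway, it carries a special .name whose value is "[document]".

>>> soup.name
'[document]'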

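　　 Comment

　　The three kinds above cover almost everything in an HTML or XML document, but a few special objects remain. The one you are most likely to meet is the comment:

>>> markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> comment = soup.b.string
>>> type(comment)
<class 'bs4.element.Comment'>
>>> comment
'Hey, buddy. Want to buy a used parser?'

　　A Comment object is just a special type of NavigableString.

　　When it is printed as part of a document, a Comment is shown with special formatting:

>>> print(soup.b.prettify())
<b>
 <!--Hey, buddy. Want to buy a used parser?-->
</b>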
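　　Because Comment is a subclass of NavigableString, a type check is the usual way to tell comments apart from ordinary text. A small sketch, with markup invented for illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup(
    "<p>Visible text<!--Hey, buddy. Want to buy a used parser?--></p>",
    "html.parser",
)

# Collect every comment in the document by filtering strings on their type.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment = comments[0]

print(isinstance(comment, NavigableString))  # a Comment is also a NavigableString
print(comment)
```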

　　Navigating the tree

　　An example:

>>> html_doc = """
... <html><head><title>The Dormouse's story</title></head>
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
...
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a class="sister" id="link1">Elsie</a>,
... <a class="sister" id="link2">Lacie</a> and
... <a class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p>
...
... <p class="story">...</p>
... """
>>>
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, 'html.parser')

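　　Child nodes

　　A Tag may contain strings and other Tags; these are the Tag's children. Beautiful Soup provides many attributes for navigating and iterating over a tag's children.

　　 Navigating by tag name

　　The simplest way to navigate the tree is to name the tag you want: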

>>> soup.head

<head><title>The Dormouse's story</title></head>

>>> soup.title

<title>The Dormouse's story</title>

>>> soup.body.b

<b>The Dormouse's story</b>

　　 .contents and .children

　　A tag's .contents attribute returns its children as a list:

　　>>> head_tag = soup.head

>>> head_tag

<head><title>The Dormouse's story</title></head>

>>> head_tag.contents

[<title>The Dormouse's story</title>]

# .contents returns a list

>>> title_tag = head_tag.contents[0]

>>> title_tag

<title>The Dormouse's story</title>

>>> title_tag.contents

["The Dormouse's story"]

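　　Note: a string has no children, so strings have no .contents attribute.

　　Instead of taking the children as a list, you can iterate over them with the .children generator:

>>> for child in title_tag.children:
...     print(child)
...
The Dormouse's story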

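　　 .descendants

　　The .contents and .children attributes only cover a tag's direct children. The .descendants attribute iterates recursively over all of a tag's descendants: its children, its children's children, and so on.

>>> for child in head_tag.descendants:
...     print(child)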

...

<title>The Dormouse's story</title>

The Dormouse's story
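　　The difference between direct children and all descendants shows up clearly when the two are counted. A minimal sketch using the same head fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)
head_tag = soup.head

direct = list(head_tag.children)          # only the <title> tag
everything = list(head_tag.descendants)   # the <title> tag and the string inside it

print(len(direct), len(everything))
```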

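　　 .string

　　If a tag has only one child and that child is a NavigableString, the child is available as .string:

>>> title_tag.string
"The Dormouse's story"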
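　　One detail worth knowing is what .string does when a tag holds more than one thing. The sketch below uses made-up markup to show the two cases.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p><b>one</b></p><div><b>one</b> and <i>two</i></div>", "html.parser"
)

# <p> has a single child tag with a single string, so .string passes it through.
print(soup.p.string)

# <div> has several children, so it is unclear what .string should refer to.
print(soup.div.string)

# get_text() concatenates all the strings instead.
print(soup.div.get_text())
```

　　When .string returns None, get_text() or the .strings generator is the way to reach the text.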

　　 .strings and .stripped_strings

　　· If a tag contains more than one string, use the .strings generator to loop over them.

for string in soup.strings:
    print(repr(string))
# Output along these lines (the exact whitespace-only strings depend on the parser):
# "The Dormouse's story"
# '\n\n'
# "The Dormouse's story"
# '\n\n'
# 'Once upon a time there were three little sisters; and their names were\n'
# 'Elsie'
# ',\n'
# 'Lacie'
# ' and\n'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n\n'
# '...'
# '\n'

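　　· These strings tend to contain a lot of extra spaces and blank lines. Use .stripped_strings to remove them: strings made up entirely of whitespace are skipped, and leading and trailing whitespace is stripped from the rest.

for string in soup.stripped_strings:
    print(repr(string))
# "The Dormouse's story"
# "The Dormouse's story"
# 'Once upon a time there were three little sisters; and their names were'
# 'Elsie'
# ','
# 'Lacie'
# 'and'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '...'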
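　　A common use of .stripped_strings is to collapse a fragment into one clean line of text. The sketch below assumes a shortened version of the story paragraph.

```python
from bs4 import BeautifulSoup

html = "<p>Once upon a time\n<a>Elsie</a>,\n<a>Lacie</a> and\n<a>Tillie</a>;\n</p>"
soup = BeautifulSoup(html, "html.parser")

pieces = list(soup.stripped_strings)  # whitespace-only strings are skipped
joined = " ".join(pieces)

print(pieces)
print(joined)
```

　　Calling soup.get_text(" ", strip=True) gives the same joined result in a single call.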

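　　Parent nodes

　　 .parent

　　The .parent attribute gives an element's parent.

　　Strings have a parent too: the tag that contains them.

　　The parent of the top-level <html> tag is the BeautifulSoup object itself.

　　The parent of the BeautifulSoup object is None.

　　 .parents

　　The .parents attribute iterates over all of an element's ancestors, from its direct parent up to the BeautifulSoup object.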
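　　The rules for .parent and .parents can be checked on a tiny document. The markup is invented for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><b>text</b></p></body></html>", "html.parser")
b_tag = soup.b

print(b_tag.parent.name)            # the tag that directly contains <b>
print(b_tag.string.parent.name)     # a string's parent is its enclosing tag

# Walk upwards through every ancestor, ending at the BeautifulSoup object.
ancestors = [parent.name for parent in b_tag.parents]
print(ancestors)

print(soup.html.parent is soup)     # the top-level tag belongs to the soup itself
print(soup.parent)                  # nothing sits above the BeautifulSoup object
```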

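　　Sibling nodes

　　Use the .next_sibling and .previous_sibling attributes to move between siblings, that is, nodes on the same level of the tree that share a parent.

　　The .next_siblings and .previous_siblings attributes iterate over the siblings that come after or before the current node.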
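　　A short sketch of sibling navigation, using made-up markup with no whitespace between the tags:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<a><b>text1</b><c>text2</c><d>text3</d></a>", "html.parser"
)

print(soup.b.next_sibling)          # the <c> tag
print(soup.c.previous_sibling)      # the <b> tag
print(soup.b.previous_sibling)      # None: <b> is the first child

after_b = [tag.name for tag in soup.b.next_siblings]
before_d = [tag.name for tag in soup.d.previous_siblings]
print(after_b, before_d)
```

　　In real documents the next sibling of a tag is often a string of whitespace or punctuation rather than another tag, so check the type of what comes back before treating it as a Tag.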

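　　Searching the tree

　　Beautiful Soup defines many methods for searching the tree, for example find() and find_all().

　　Filters

　　The common kinds of filter are as follows:

　　A string

　　The simplest filter. For example, find_all('b') finds every <b> tag.

　　A regular expression

　　Matches tags whose names match the regular expression.

　　A list

　　Matches anything that matches any one of the items in the list.

　　 True

　　Matches everything it can, which means every tag in the document.

　　A function

　　You can define a function that takes an element as its only argument and returns a Boolean: True if the element matches, False otherwise.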
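　　The five kinds of filter can be compared side by side. This is a sketch on a small made-up document; has_class_but_no_id is an illustrative function name.

```python
import re

from bs4 import BeautifulSoup

html = (
    "<html><head><title>The Dormouse's story</title></head><body>"
    '<p class="title"><b>Bold</b></p>'
    '<a class="sister" id="link1">Elsie</a>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    """Function filter: True for tags with a class attribute and no id."""
    return tag.has_attr("class") and not tag.has_attr("id")


by_string = [t.name for t in soup.find_all("b")]               # exact tag name
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]   # names starting with b
by_list = [t.name for t in soup.find_all(["a", "b"])]          # any name in the list
by_true = [t.name for t in soup.find_all(True)]                # every tag
by_function = [t.name for t in soup.find_all(has_class_but_no_id)]

print(by_string, by_regex, by_list, by_true, by_function)
```

　　Results always come back in document order, whichever filter is used.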

　　 find_all

　　find_all(name, attrs, recursive, string, limit, **kwargs)

　　find_all() looks through the descendants of the current tag and returns every one that matches the filter.

　　name

　　Pass a value for name and find_all() only considers tags with that name.

soup.find_all("title")

# [<title>The Dormouse's story</title>]

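　　Keyword arguments

　　Any keyword argument that is not one of the built-in argument names is treated as a filter on the tag attribute of that name.

# id
>>> soup.find_all(id="link2")
[<a class="sister" id="link2">Lacie</a>]

# href (needs "import re"; this assumes the <a> tags carry an href attribute such as href="http://example.com/elsie")
>>> soup.find_all(href=re.compile('elsie'))
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# attrs (for attribute names such as data-foo that cannot be written as Python keyword arguments)
>>> data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
>>> data_soup.find_all(attrs={"data-foo": "value"})
[<div data-foo="value">foo!</div>]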
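　　The attribute filters can be tried out together in one script. The sketch below builds its own small document; the example.com URLs are assumptions made for illustration.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '<div data-foo="value">foo!</div>'
)
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="link2")                    # exact attribute value
by_href = soup.find_all(href=re.compile("elsie"))    # regular expression on a value
with_id = soup.find_all(id=True)                     # any tag that has an id at all
by_data = soup.find_all(attrs={"data-foo": "value"}) # name not usable as a keyword
combined = soup.find_all(href=re.compile("lacie"), id="link2")  # filters are ANDed

print(by_id, by_href, with_id, by_data, combined, sep="\n")
```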

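　　Searching by class

　　You can search by CSS class name, but because class is a reserved word in Python, the keyword argument is spelled class_ instead.

>>> soup.find_all("a", class_="sister")

[<a class="sister" id="link1">Elsie</a>, <a class="sister" id="link2">Lacie</a>, <a class="sister" id="link3">Tillie</a>]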
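　　Because class is a multi-valued attribute, class_ matches against each individual class as well as against the attribute's exact string value. A small sketch with invented class names:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="body strikeout">first</p><p class="body">second</p>',
    "html.parser",
)

one_class = soup.find_all("p", class_="strikeout")        # matches a single class
shared = soup.find_all("p", class_="body")                # both paragraphs have it
exact = soup.find_all("p", class_="body strikeout")       # exact attribute string
reordered = soup.find_all("p", class_="strikeout body")   # different order: no match

print(len(one_class), len(shared), len(exact), len(reordered))
```

　　To match several classes regardless of their order, a CSS selector such as soup.select("p.strikeout.body") is the more robust choice.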

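　　The string argument

　　The string argument searches for strings in the document instead of tags. It accepts the same kinds of value as name: a string, a regular expression, a list, a function, or True.

>>> soup.find_all(string="Elsie")
['Elsie']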
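　　The other value types for string work the same way as they do for name. A sketch on a shortened version of the story markup:

```python
import re

from bs4 import BeautifulSoup

html = (
    "<p><b>The Dormouse's story</b></p>"
    '<a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a>'
)
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Elsie")
any_of = soup.find_all(string=["Tillie", "Elsie", "Lacie"])   # document order
pattern = soup.find_all(string=re.compile("Dormouse"))

# Combined with a tag name, string selects tags whose .string matches.
tags = soup.find_all("a", string="Elsie")

print(exact, any_of, pattern, tags, sep="\n")
```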

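　　The limit argument

　　Use limit to cap the number of results returned; the search stops as soon as that many matches have been found.

>>> soup.find_all("a", limit=2)

[<a class="sister" id="link1">Elsie</a>, <a class="sister" id="link2">Lacie</a>]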

　　The recursive argument

　　With recursive set to False, find_all() only considers the direct children of the tag.
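　　The effect of recursive is easiest to see on a tag that is nested two levels deep. A minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)

# <title> is a grandchild of <html>, so only the recursive search finds it.
deep = soup.html.find_all("title")
shallow = soup.html.find_all("title", recursive=False)

# It is a direct child of <head>, so the shallow search succeeds there.
from_head = soup.head.find_all("title", recursive=False)

print(len(deep), len(shallow), len(from_head))
```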

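　　 find

　　find() differs from find_all() in what it returns: find() returns the first match directly (or None if nothing matches), while find_all() returns a list of all matches.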
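　　The difference in return values matters most when nothing matches. A short sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a id="link1">Elsie</a><a id="link2">Lacie</a>', "html.parser"
)

first = soup.find("a")                     # a single Tag
same = soup.find_all("a", limit=1)[0]      # the equivalent find_all() call
missing = soup.find("table")               # None when nothing matches
empty = soup.find_all("table")             # an empty list when nothing matches

print(first, missing, empty)
```

　　Since find() can return None, check the result before reading attributes from it; otherwise a missing tag surfaces later as an AttributeError or TypeError.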

　　 CSS selectors

　　Pass a CSS selector as a string to the select() method of a BeautifulSoup object (or of a Tag) to search the tree with CSS selectors.
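　　A few common selector forms, sketched on a cut-down version of the story document:

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>The Dormouse's story</title></head><body>"
    '<p class="story"><a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a></p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

titles = soup.select("title")            # by tag name
children = soup.select("p.story > a")    # direct <a> children of <p class="story">
by_id = soup.select("a#link2")           # by id
first = soup.select_one("a.sister")      # first match only, or None

print(titles, children, by_id, first, sep="\n")
```

　　select() always returns a list, while select_one() returns a single tag or None, mirroring the relationship between find_all() and find().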

Source: 紫电清霜
