BeautifulSoup

Source: Internet | Posted in: Common Big Data Processing Algorithms | Editor: 程序博客网 | Time: 2024/06/12 07:17

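If you are not comfortable writing regular expressions, there is a more convenient tool called Beautiful Soup. With it, we can easily extract the content of HTML or XML tags.

1. Introduction to Beautiful Soup

Beautiful Soup is a Python library whose main purpose is extracting data from web pages.

Creating a Beautiful Soup object

First, create a string: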

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

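Create the BeautifulSoup object:
from bs4 import BeautifulSoup
# 'lxml' requires the lxml package; 'html.parser' is the built-in alternative
soup = BeautifulSoup(html, 'lxml')
You can also create the object from a local HTML file:
soup = BeautifulSoup(open('index.html'), 'lxml')

Print the contents of the soup object as formatted output:
print(soup.prettify())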

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

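Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.
1. Tag
A Tag is simply one of the tags in the HTML.
An HTML tag together with the content it encloses is a Tag.

How do we conveniently get Tags with Beautiful Soup?

print(soup.title)
# <title>The Dormouse's story</title>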

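Writing soup followed by a tag name is an easy way to get that tag and its content.
Note that this finds only the first matching tag in the whole document.
Querying all matching tags is covered later.

A Tag has two important attributes: name and attrs.

name

print(soup.name)
# [document]
print(soup.head.name)
# head

The soup object itself is a special case: its name is [document]. For any other tag, the value is the name of the tag itself.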

attrs

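print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}

This prints all attributes of the p tag; the result is a dictionary.

To get a single attribute:

print(soup.p['class'])
# ['title']
print(soup.p.get('class'))
# ['title']

These attributes and the tag's content can also be modified.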
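As a minimal sketch of such modifications, the snippet below changes, adds, and deletes attributes on a p tag and then replaces the text of its b child. The shortened HTML string, the new attribute values, and the use of the built-in html.parser are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample markup (assumption: only the first <p> of the example document)
html = "<p class=\"title\" name=\"dromouse\"><b>The Dormouse's story</b></p>"
soup = BeautifulSoup(html, "html.parser")

p = soup.p
p["class"] = "newClass"   # overwrite an existing attribute
p["id"] = "intro"         # add a new attribute
del p["name"]             # remove an attribute

p.b.string = "A new story"  # replace the text inside <b>

print(p.attrs)
print(p)
```

Attribute access on a Tag behaves like a dictionary, so assignment and del work as they would on a dict.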

NavigableString (a navigable string)
We have already obtained the tag itself.

How do we get the text inside the tag? Use .string:

print(soup.p.string)
# The Dormouse's story

BeautifulSoup
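The BeautifulSoup object represents the entire content of a document.

Let's look at the type of its name, the name itself, and its attributes to get a feel for it:

print(type(soup.name))
# <class 'str'>
print(soup.name)
# [document]
print(soup.attrs)
# {} (an empty dictionary)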

Comment

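A Comment object is a special type of NavigableString. When printed, its content still does not include the comment markers.

print(soup.a)
print(soup.a.string)
print(type(soup.a.string))

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie
<class 'bs4.element.Comment'>

The content of the a tag is actually a comment, yet .string returns it with the comment markers removed.

It is therefore best to check the type before using it:

import bs4
if isinstance(soup.a.string, bs4.element.Comment):
    print(soup.a.string)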
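One way to apply this check in practice is to wrap it in a small helper that treats comments as "no visible text". The helper name visible_string and the two-link sample markup are hypothetical; the sketch uses the built-in html.parser.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Two links from the example document: the first contains only a comment
html = (
    '<a href="http://example.com/elsie" id="link1"><!-- Elsie --></a>'
    '<a href="http://example.com/lacie" id="link2">Lacie</a>'
)
soup = BeautifulSoup(html, "html.parser")


def visible_string(tag):
    """Return tag.string as str, or None if it is missing or a comment.

    Hypothetical helper, not part of Beautiful Soup.
    """
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


results = [visible_string(a) for a in soup.find_all("a")]
print(results)
```

Because Comment is a subclass of NavigableString, isinstance against Comment is the check that distinguishes the two.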

Traversing the document tree

(1) Direct children

.contents
Returns the tag's children as a list.

print(soup.head.contents)
# [<title>The Dormouse's story</title>]

The result is a list, so we can use a list index to get one of its elements:

print(soup.head.contents[0])
# <title>The Dormouse's story</title>

.children
It does not return a list, but we can iterate over it to get all the children.
It is an iterator (a list iterator in the version used here; the exact repr may vary):

print(soup.head.children)
# <list_iterator object at 0x7f71457f5710>

How do we get its contents? Iterate over it:

for child in soup.body.children:
    print(child)

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>

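(2) All descendants

The .contents and .children attributes include only a tag's direct children.
The .descendants attribute iterates recursively over all of a tag's descendants. As with .children, we need to loop over it to get the contents.

for child in soup.descendants:
    print(child)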
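To make the difference between direct children and descendants concrete, the sketch below counts both for a head tag. The shortened markup and the html.parser choice are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample markup (assumption: only the <head> of the example document)
html = "<head><title>The Dormouse's story</title></head>"
soup = BeautifulSoup(html, "html.parser")

head = soup.head
children = list(head.children)        # direct children only
descendants = list(head.descendants)  # children, grandchildren, and so on

print(len(children))     # the <title> tag
print(len(descendants))  # the <title> tag plus the string inside it
```

The string inside title counts as a descendant of head but not as a child, which is why the two counts differ.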

(3) Node content

.string attribute

If a tag contains no other tags, .string returns the text inside it. If a tag contains exactly one child tag, .string returns the innermost text as well.

print(soup.head.string)
# The Dormouse's story
print(soup.title.string)
# The Dormouse's story

If a tag contains multiple children, it cannot tell which child's content .string should refer to, so .string is None.
print(soup.html.string)
# None
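The following sketch shows both cases side by side on a smaller, made-up paragraph (an assumption, not part of the example document), and uses get_text() to collect the text when .string is None.

```python
from bs4 import BeautifulSoup

# Made-up markup: <p> has three children (a tag, a string, and a tag)
html = "<p><b>Bold</b> and <i>italic</i></p>"
soup = BeautifulSoup(html, "html.parser")

single = soup.b.string       # <b> has exactly one child, a string
multiple = soup.p.string     # <p> has several children, so this is None
combined = soup.p.get_text() # concatenates all text beneath <p>

print(single)
print(multiple)
print(combined)
```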

(4) Multiple strings

.strings

Returns multiple strings, which must be retrieved by iterating:

for string in soup.strings:
    print(repr(string))
    # "The Dormouse's story"
    # '\n\n'
    # "The Dormouse's story"
    # '\n\n'
    # 'Once upon a time there were three little sisters; and their names were\n'
    # 'Elsie'
    # ',\n'
    # 'Lacie'
    # ' and\n'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '\n\n'
    # '...'
    # '\n'

.stripped_strings

The output strings may contain many spaces or blank lines. Using .stripped_strings removes the extra whitespace:

for string in soup.stripped_strings:
    print(repr(string))
    # "The Dormouse's story"
    # "The Dormouse's story"
    # 'Once upon a time there were three little sisters; and their names were'
    # 'Elsie'
    # ','
    # 'Lacie'
    # 'and'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '...'

(5) Parent node
.parent attribute

The .parent attribute returns an element's parent node.
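A short sketch of .parent at several levels of the tree, assuming a shortened version of the example markup and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Shortened sample markup (assumption)
html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

title = soup.title
tag_parent = title.parent.name            # parent of the <title> tag
string_parent = title.string.parent.name  # parent of the string inside it
top_parent = soup.html.parent.name        # parent of the top-level tag
root_parent = soup.parent                 # the BeautifulSoup object has no parent

print(tag_parent)
print(string_parent)
print(top_parent)
print(root_parent)
```

Strings have parents too: the parent of the title text is the title tag that contains it.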

(6) All ancestors

.parents attribute

The .parents attribute iterates over all of an element's ancestors.

content = soup.head.title.string
for parent in content.parents:
    print(parent.name)

title
head
html
[document]

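(7) Sibling nodes

Sibling nodes are nodes at the same level as the current node.

.next_sibling attribute

Returns the node's next sibling.

.previous_sibling

Returns the previous sibling instead.
If no such node exists, the result is None.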
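The sketch below illustrates a common surprise: the next sibling of a tag is often the text between tags rather than the next tag. The two-link markup is a simplified assumption; find_next_sibling() is used to skip over the text node.

```python
from bs4 import BeautifulSoup

# Simplified markup (assumption): two links separated by ", "
html = "<p><a id='link1'>Elsie</a>, <a id='link2'>Lacie</a></p>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a", id="link1")

text_sibling = first.next_sibling           # the string ", " between the tags
tag_sibling = first.find_next_sibling("a")  # the next <a> tag, skipping text
no_sibling = first.previous_sibling         # nothing precedes the first link

print(repr(text_sibling))
print(tag_sibling)
print(no_sibling)
```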

(8) All siblings

.next_siblings

.previous_siblings

These iterate over the current node's following or preceding siblings.

for sibling in soup.a.next_siblings:
    print(repr(sibling))
    # ',\n'
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    # ' and\n'
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    # ';\nand they lived at the bottom of a well.'

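.next_element and .previous_element attributes

These are not restricted to sibling nodes. They move through all nodes in the order they were parsed, regardless of level, so the result can be a child or a parent.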
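To contrast the two kinds of navigation, the sketch below compares .next_sibling with .next_element on the same tag. The simplified markup is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Simplified markup (assumption): two links separated by ", "
html = "<p><a id='link1'>Elsie</a>, <a id='link2'>Lacie</a></p>"
soup = BeautifulSoup(html, "html.parser")

first = soup.a

sibling = first.next_sibling          # same level: the ", " after the tag
element = first.next_element          # parse order: the string inside the tag
previous = first.previous_element     # parse order: the enclosing <p> tag

print(repr(sibling))
print(repr(element))
print(previous.name)
```

Here .next_element descends into the tag because the inner string was parsed immediately after the opening a tag.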

7. Searching the document tree

(1) find_all( name , attrs , recursive , text , **kwargs )

find_all() searches all tag descendants of the current tag and checks whether each one matches the filter.
Parameters
1) The name parameter
Finds all tags whose name is name; string objects are ignored automatically.
A. Passing a string

print(soup.find_all('b'))
# [<b>The Dormouse's story</b>]

print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

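B. Passing a regular expression
Beautiful Soup filters tag names with the regular expression's search() method.
The example below finds all tags whose names start with b, so both <body> and <b> are found.

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b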

C. Passing a list
Beautiful Soup returns everything that matches any element of the list. The code below finds all <a> tags and <b> tags in the document.

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

D. Passing True

True matches any value, so passing it finds every tag in the document but returns no string nodes.
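A minimal sketch of passing True, using a shortened document (an assumption) so the full list of tag names stays readable:

```python
from bs4 import BeautifulSoup

# Shortened sample markup (assumption)
html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p><b>Hi</b></p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# True matches every tag; strings such as "Hi" are not included
names = [tag.name for tag in soup.find_all(True)]
print(names)
```

The tags come back in document order, starting with html.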

E. Passing a function
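When none of the other filters fit, find_all() also accepts a function that takes a tag and returns True or False. The sketch below is one possible example; the function name has_class_but_no_id and the shortened markup are assumptions chosen for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample markup (assumption): two paragraphs and one link
html = (
    "<p class=\"title\"><b>The Dormouse's story</b></p>"
    "<a class=\"sister\" id=\"link1\">Elsie</a>"
    "<p class=\"story\">...</p>"
)
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    """Match tags that define a class attribute but no id attribute."""
    return tag.has_attr("class") and not tag.has_attr("id")


matches = soup.find_all(has_class_but_no_id)
for tag in matches:
    print(tag.name, tag["class"])
```

The link is skipped because it has an id, and the b tag is skipped because it has no class.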

2) Keyword arguments
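Any keyword argument that is not one of find_all()'s named parameters is treated as a filter on the tag attribute of the same name. The sketch below filters by id, by href with a regular expression, and by class. Because class is a reserved word in Python, the keyword is spelled class_. The three-link markup is taken from the example document; the html.parser choice is an assumption.

```python
import re

from bs4 import BeautifulSoup

# The three links from the example document
html = (
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'
)
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="link2")                  # exact attribute value
by_href = soup.find_all(href=re.compile("elsie"))  # regular expression on href
by_class = soup.find_all("a", class_="sister")     # class needs a trailing underscore

print(by_id)
print(by_href)
print(len(by_class))
```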

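8. CSS selectors

When writing CSS, tag names are written bare, class names are prefixed with a dot, and id names are prefixed with #.

Use soup.select(); it returns a list.

(1) Select by tag name

print(soup.select('title'))
# [<title>The Dormouse's story</title>]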

(2) Select by class name

print(soup.select('.sister'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(3) Select by id

print(soup.select('#link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

(4) Combined selectors

To find the element with id link1 inside a p tag, separate the two parts with a space:

print(soup.select('p #link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

Selecting direct children:

print(soup.select("head > title"))
# [<title>The Dormouse's story</title>]

(5) Select by attribute

A selector can also include an attribute, which must be enclosed in square brackets. The attribute and the tag name refer to the same node, so there must be no space between them; otherwise nothing will match.

print(soup.select('a[class="sister"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.select('a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

Parts that refer to different nodes are separated by a space; parts that refer to the same node are written without one.

print(soup.select('p a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

The select method always returns its results as a list, so you can loop over them and call get_text() on each element to get its content.

soup = BeautifulSoup(html, 'lxml')
print(type(soup.select('title')))
print(soup.select('title')[0].get_text())
for title in soup.select('title'):
    print(title.get_text())
