[Web Scraping Study Notes] BeautifulSoup Usage Analysis (Part 1)

Source: Internet | Published by: 工作两年程序员 | Editor: 程序博客网 | Date: 2024/06/05 14:27

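Once you have picked up some basic Python syntax, you can start thinking about which direction to specialize in, and web scraping is a very good one.

When you learn to write single-threaded crawlers you will inevitably run into BeautifulSoup. Without a solid grasp of how to use it, it is hard to get much further, so this post introduces it.

These notes are split into two posts:

BeautifulSoup Usage Analysis (Part 1) - this post

BeautifulSoup Usage Analysis (Part 2)

BeautifulSoup is a third-party library, like requests (urllib2, by contrast, ships with the Python 2 standard library); each has its own usage and purpose. For brevity, the rest of this post writes BS for BeautifulSoup. Put simply, BS parses a web page and extracts data from it. The parsing itself is done by a parser, and there are several to choose from: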

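a. html.parser: Python's built-in parser; speed is moderate

b. lxml: third-party; very fast

c. html5lib: third-party; slow, but the most tolerant of malformed markup

d. lxml (for parsing XML pages): third-party; fast, and reportedly the only parser that supports XML

Third-party parsers have to be installed first, with pip install lxml or pip install html5lib. BS itself is also third-party and needs installing too: pip install bs4. The name bs4 refers to version 4 of BS (there have been several major versions); the package is published on PyPI as beautifulsoup4, and bs4 is a thin wrapper that pulls it in. The official description is quoted here as well: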

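Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to scrape; because it is simple, a complete application takes very little code.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not have to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it, in which case you only need to state the original encoding.
Beautiful Soup has become a Python parsing library as capable as lxml and html5lib, letting you choose between different parsing strategies or raw speed.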
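As a small illustration of choosing among the parsers listed above, here is a minimal sketch in Python 3 syntax. The helper name make_soup and the preference order are assumptions made for this example, not part of BS; it tries each parser name in turn and falls back to the built-in html.parser when a third-party parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper for illustration; returns (soup, parser_name).
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # BS raises FeatureNotFound when the requested parser is missing.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>Hello, <b>BS</b></p>")
print(used)           # whichever parser was found first
print(soup.b.string)  # the text inside <b>
```

Because html.parser is always available, the fallback keeps a script working on machines where lxml or html5lib has not been installed, at the cost of some speed or leniency.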

1. Import the module

from bs4 import BeautifulSoup

2. Create an HTML string to simulate a web page

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

3. Create the BS object

hello = BeautifulSoup(html,'html.parser')

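hello is a variable name of our own choosing. It holds a BS object, and the following steps extract data through that object's methods.

4. Pretty-print the HTML

print hello.prettify()

The output runs to many lines, so it is not pasted here; run the code to see it.

Note: pretty-printing lays the document out as indented HTML, with each tag and string on its own line.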

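5. The four object types in detail

After parsing, the result is a Python object structured like the web page itself. It is built from four kinds of node objects, and you need to understand all four to follow the rest of this post.

The four are listed below, each with a short description:

Tag: tag object

NavigableString: string object

BeautifulSoup: the BS object

Comment: comment object

First, Tag. In the page, <title>The Dormouse's story</title> is a tag object. Likewise, everything from something as large as <head> down to something as small as <a> is a tag object (including whatever the tag contains). Tags are extracted like this: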

print hello.head

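Output:

<head><title>The Dormouse's story</title></head>

print hello.title

Output:
<title>The Dormouse's story</title>

Note: a tag object contains the content of all its child and descendant tags. If the HTML contains several tags with the same name, this attribute-style access returns only the first one that matches.
In Python, an extracted tag object has the type bs4.element.Tag, which you can check with: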

print type(hello.title)

Output:
<class 'bs4.element.Tag'>
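The first-match behaviour described above can be checked with a short sketch. It is written in Python 3 syntax (the article's own snippets are Python 2), and the two-link markup is a made-up sample for this example only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample: two tags with the same name.
doc = '<div><a id="first">one</a><a id="second">two</a></div>'
soup = BeautifulSoup(doc, "html.parser")

first = soup.a       # attribute-style access stops at the first <a>
missing = soup.span  # no <span> in the document, so this is None

print(first)
print(isinstance(first, Tag))
print(missing)
```

Since a missing tag comes back as None rather than raising an error, it is worth checking the result before chaining further attribute access onto it.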

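A Tag can be worked with further. It has two attributes, name and attrs. For example:

print hello.name
print hello.title.name
print hello.p.attrs

Output:

[document]
title
{u'class': [u'title'], u'name': u'dromouse'}

Note: tag.name is the name of the tag itself. When the object stands for the whole HTML document, its name is the special value [document]. tag.attrs holds the attributes written in the tag and is returned as a dict. Note that hello.p.attrs only returns the attributes of the first p tag.
The unicode strings in the dict can be converted to UTF-8 like this (this should not be an issue in py3):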

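dic = hello.p.attrs

def unicode_to_utf8(attrs):
    # Build a new dict whose keys and values are UTF-8 byte strings (Python 2)
    result = {}
    for key, value in attrs.items():
        if isinstance(value, list):
            # Multi-valued attributes such as class come back as a list; keep the first value
            result[key.encode('utf8')] = value[0].encode('utf8')
        else:
            result[key.encode('utf8')] = value.encode('utf8')
    return result

print unicode_to_utf8(dic)

Output:
{'class': 'title', 'name': 'dromouse'}

Because each attribute name is unique within a tag, attribute values can be read from a tag object like this: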

print hello.p.get('class')
print hello.p['class']
print hello.p['name']
print hello.a['href']

Output:
[u'title']
[u'title']
dromouse
http://example.com/elsie
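The two access styles above differ when an attribute is absent. The following sketch (Python 3 syntax, with a one-tag sample invented for this example) shows that .get() returns None or a default, while square brackets raise KeyError, just as with a plain dict.

```python
from bs4 import BeautifulSoup

# Hypothetical sample tag with a multi-valued class and no id attribute.
soup = BeautifulSoup('<p class="title big" name="dromouse">text</p>', "html.parser")
p = soup.p

classes = p["class"]             # class is multi-valued, so this is a list
name = p["name"]                 # ordinary attributes are plain strings
missing = p.get("id")            # absent attribute: None
fallback = p.get("id", "no-id")  # absent attribute with a default

try:
    p["id"]
    raised = False
except KeyError:
    raised = True                # square brackets raise for a missing attribute

print(classes, name, missing, fallback, raised)
```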

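NavigableString
We can already read the attribute values of a tag, and reading the text inside a tag is no harder. Think of a NavigableString simply as the text content of a tag. For example:

print hello.p.string
print hello.a.string

Output: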
The Dormouse's story
Elsie

If the printed result has surrounding whitespace, append .strip().

Now run:

print hello.body.string

Output:

None

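Earlier we said that a tag contains its own content plus that of its children and descendants, so why is this None?

Because .string can only return a single string. The body tag here has several children and therefore several strings, so .string gives up and returns None. This is where .strings comes in. Note that printing .strings directly does not show the text, because it is a generator; iterate over it with for to print every string:

for string in hello.body.stripped_strings:
    print string

Output: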

The Dormouse's story
Once upon a time there were three little sisters; and their names were
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...
Note: stripped_strings is used here to skip whitespace-only strings and strip the rest, which makes the output easier to read. You can also try .strings.
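To make the difference between .string, .strings and .stripped_strings concrete, here is a minimal sketch in Python 3 syntax on a small made-up fragment. The get_text() call at the end is an extra that this article does not cover; it is included only as a common shortcut for joining the stripped strings.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a <div> with two <p> children separated by a newline.
soup = BeautifulSoup("<div><p>  one </p>\n<p>two</p></div>", "html.parser")
div = soup.div

single = div.string                     # several children, so no single string
only_child = soup.p.string              # exactly one string child
all_strings = list(div.strings)         # every string, whitespace included
clean = list(div.stripped_strings)      # whitespace-only strings dropped, rest stripped
joined = div.get_text(" ", strip=True)  # stripped strings joined with a space

print(single)
print(repr(only_child))
print(all_strings)
print(clean)
print(joined)
```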

print type(hello.head.string)

Output:

<class 'bs4.element.NavigableString'>

Likewise, printing the type shows that the string is a NavigableString object from BS.

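BeautifulSoup: the BS object

The BS object is really just one big Tag object, a special tag that also has name and attrs. We usually leave it alone. For example:

print hello.name
print hello.attrs

Output:

[document]
{}
Note: the name of this special tag is fixed, as mentioned earlier, and its attrs is always an empty dict.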

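Comment: comment object

If you know front-end development you will recognize HTML comment markers. In some tags there is a Comment object in addition to (or instead of) a text (NavigableString) object. When we print the text of such a tag, the comment's content is printed too, because what is inside the comment markers is also text. A Comment is therefore a special kind of NavigableString. For example:

print hello.a
print hello.a.string
print type(hello.a.string)

Output: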

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
Elsie
<class 'bs4.element.Comment'>

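As you can see, the a tag contains no real text object, only the text of a comment, yet .string still returns it. The type shows that this text is actually a Comment object. So when we want plain NavigableString objects only, we need a check that excludes Comment objects:

import bs4

if type(hello.p.string) == bs4.element.NavigableString:
    print 'string'
if type(hello.a.string) == bs4.element.Comment:
    print 'comment'

Output: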

string
comment
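The same check can be written with isinstance. One detail to keep in mind is that Comment is a subclass of NavigableString, so isinstance(x, NavigableString) is also true for comments and the Comment test has to come first. The sketch below uses Python 3 syntax; the helper name visible_text and the two-link markup are assumptions made for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical sample: one link holding only a comment, one holding real text.
soup = BeautifulSoup("<a><!-- Elsie --></a><a>Lacie</a>", "html.parser")


def visible_text(tag):
    """Return the tag's single string, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


comment_node = soup.a.string
results = [visible_text(a) for a in soup.find_all("a")]

print(isinstance(comment_node, NavigableString))  # Comment is a NavigableString subclass
print(results)
```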

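That covers what BS can do with a single tag. Next come the relationships between tags and the attributes for moving between them.

6. Traversing the DOM tree
Do not get hung up on the heading; go straight to the content. First up:

(1) .contents (direct children):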

print hello.body.contents

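Output:

[u'\n', <p class="title" name="dromouse"><b>The Dormouse's story</b></p>, u'\n', <p class="story">Once upon a time there were three little sisters; ...]

There is a lot of output, so not all of it is shown. The result is a list, and each item is a direct child (a tag with its contents, or a string). In a py2 environment the strings are unicode by default, so here we convert the encoding:

tag_list = []
for tag in hello.body.contents:
    if tag != u'\n':
        tag_list.append(tag.encode('utf8'))
print tag_list

The loop also filters out the newline strings along the way. So .contents returns the children of the current Tag together with their contents. Next:

(2) .children (similar to .contents). For example:

for child in hello.body.children:
    print child

Output: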

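<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

...... (output truncated)

Note: .children also returns the child nodes and their contents, but as an iterator rather than a list, so you need a for loop to print them. One obvious drawback here is that many blank lines are printed, which then have to be filtered and converted anyway, so it is often simpler to use .contents directly.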
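The blank lines mentioned above are newline strings sitting between the tags. A minimal sketch of filtering them out while iterating, in Python 3 syntax on a made-up body fragment:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical fragment with a line break around each <p>.
soup = BeautifulSoup("<body>\n<p>one</p>\n<p>two</p>\n</body>", "html.parser")
body = soup.body

# Keep only Tag children, dropping every string node.
tags_only = [child for child in body.children if isinstance(child, Tag)]

# Alternatively, keep strings too, but drop the whitespace-only ones.
non_blank = [
    child for child in body.children
    if not (isinstance(child, NavigableString) and not child.strip())
]

print(len(body.contents))           # tags plus the newline strings around them
print([t.name for t in tags_only])
print(len(non_blank))
```

Which filter fits depends on whether text directly inside the parent matters; the second variant keeps such text and only discards the layout whitespace.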

(3) .descendants (descendant nodes):

for child in hello.descendants:
    print child

Output:

<html><head><title>The Dormouse's story</title></head>
<body>
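<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

...... (output truncated)

Note: this attribute yields the children and their contents, then the children's children and their contents, and so on. Put plainly, every node below the current Tag, at any depth, is returned.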
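The difference between direct children and all descendants is easy to see on a small nested fragment. This is a sketch in Python 3 syntax; the markup is invented for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment: <div> has one child <p>, which in turn has a <b> and some text.
soup = BeautifulSoup("<div><p><b>bold</b> text</p></div>", "html.parser")
div = soup.div

children = [c.name for c in div.children]

# Walk every node below <div> in document order: tag names for tags, text for strings.
descendants = [d.name if isinstance(d, Tag) else str(d) for d in div.descendants]

print(children)
print(descendants)
```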
(4) .parent (parent tag):

print hello.p.parent.name
print hello.title.parent.name
c = hello.p.string.parent.name
print c

Output:

body
head
b
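Note: once children and descendants make sense, the parent is easy to understand: it is simply the tag one level above the current node.
(5) .parents (all ancestor tags):

c = hello.p.string.parents
for c1 in c:
    print c1.name

Output: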

b
p
body
html
[document]

Note: .parents yields every ancestor of the node, from the nearest outward. The result is a generator, so again it has to be iterated with for.
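A short sketch of walking up the tree with .parent and .parents, in Python 3 syntax on a made-up document:

```python
from bs4 import BeautifulSoup

# Hypothetical document with a single nested string.
soup = BeautifulSoup("<html><body><p><b>text</b></p></body></html>", "html.parser")

text_node = soup.b.string
direct_parent = text_node.parent.name

# .parents is a generator, so collect it into a list to inspect the whole chain.
chain = [parent.name for parent in text_node.parents]

print(direct_parent)
print(chain)
```

The last entry in the chain is the BS object itself, whose name is [document].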
(6) .next_sibling and .previous_sibling (adjacent siblings):

print hello.p.previous_sibling
print hello.p.next_sibling

Both print only a blank line.

print hello.a.previous_sibling
print hello.a.next_sibling

Output:

Once upon a time there were three little sisters; and their names were

,
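Note: it may help here to print hello.body.prettify(), which shows the HTML as an indented tree, and to read the sibling relationships off that tree:

<body>
 <p class="title" name="dromouse">
  <b>
   The Dormouse's story
  </b>
 </p>
 <p class="story">
  Once upon a time there were three little sisters; and their names were
  <a class="sister" href="http://example.com/elsie" id="link1">
   <!-- Elsie -->
  </a>
  ,
  <a class="sister" href="http://example.com/lacie" id="link2">
   Lacie
  </a>
  and
  <a class="sister" href="http://example.com/tillie" id="link3">
   Tillie
  </a>
  ;
and they lived at the bottom of a well.
 </p>
 <p class="story">
  ...
 </p>
</body>

Note: previous_sibling and next_sibling are the sibling node immediately before and immediately after the current one. That p has "nothing" before it seems reasonable, since no tag at the same level precedes it. But why does its next sibling also print as empty, when the tree clearly shows another p right after it? The point is that siblings are nodes, not just tags. The line break between two tags at the same level in the source is itself a string node (u'\n'), and that is what both calls return here; printing it shows only a blank line. The same position can hold other text, such as the comma and the "and" between the a tags, and those strings count as siblings too. In short, whenever plain text sits directly before or after a tag in the tree, that string object is the tag's sibling.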
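The following sketch shows that the "empty" sibling comes from the line break in the source, and one way to jump straight to the next tag. It uses Python 3 syntax and made-up markup; find_next_sibling() belongs to the search methods rather than the attributes covered in this post, and is used here only to skip over string nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same two tags, with and without a line break between them.
spaced = BeautifulSoup('<body><p id="a">A</p>\n<p id="b">B</p></body>', "html.parser")
tight = BeautifulSoup('<body><p id="a">A</p><p id="b">B</p></body>', "html.parser")

first = spaced.p
gap = first.next_sibling                 # the newline string between the tags
next_tag = first.find_next_sibling("p")  # skips strings, returns the next <p> tag
before = first.previous_sibling          # nothing precedes the first child here

print(repr(gap))
print(next_tag["id"])
print(tight.p.next_sibling["id"])        # without the line break, the sibling is the tag itself
print(before)
```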

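(7) .next_siblings and .previous_siblings (all siblings)

for a in hello.p.next_siblings:
    print a

Output:

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
Note: as you can see, this attribute also has to be iterated. It yields the two p tags that follow, with their contents (plus the newline strings between them). .previous_siblings works the same way in the other direction.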

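(8) .next_element and .previous_element (adjacent nodes in document order)

print hello.p.previous_element

Output:

(a blank line)

print hello.p.next_element

Output:

<b>The Dormouse's story</b>

Note: .previous_element and .next_element return the node that was parsed immediately before or after the current tag, whatever its relationship to that tag (parent, child, sibling and so on). Here the previous element of p is a newline string, for the same reason as with siblings: the line break after <body> in the source is itself a string node. The next element is the <b> tag, the first child of p, so these attributes step into and out of tags rather than staying on one level.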
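To contrast "next sibling" with "next element", here is a minimal sketch in Python 3 syntax. The markup is invented for the example and has no line breaks, so no whitespace strings get in the way.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <p> contains <b>, and is followed by a <span>.
soup = BeautifulSoup("<div><p><b>bold</b></p><span>after</span></div>", "html.parser")
p = soup.p

sibling = p.next_sibling.name       # same level: the <span> after <p>
element = p.next_element.name       # document order: steps into <p>, reaching <b>
previous = p.previous_element.name  # document order: the <div> opened just before <p>
after_text = soup.b.string.next_element.name  # leaving <b> and <p>, the next node is <span>

print(sibling, element, previous, after_text)
```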

(9) .next_elements and .previous_elements (all nodes before or after, in document order)

Here is an example:

for a in hello.p.previous_elements:
    print repr(a)

Output:

u'\n'
<body>\n<p class="title" name="dromouse"><b>The Dormouse's story</b></p>\n<p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>\n<p class="story">...</p>\n</body>
u'\n'
u"The Dormouse's story"
<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>
<html><head><title>The Dormouse's story</title></head>\n<body>\n<p class="title" name="dromouse"><b>The Dormouse's story</b></p>\n<p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>\n<p class="story">...</p>\n</body></html>
u'\n'

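Note: this output is easy to read. It is every node that comes before the first p tag, together with its contents, walking backwards through the document. If it is not clear, compare it against the HTML tree. .next_elements works the same way in the forward direction.

This post is already fairly long. It has only covered the most basic (and rarely used) attributes of BS objects. The next post covers the commonly used methods, which you will certainly need, so stay tuned...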

