The find(), find_all(), and select() functions in BeautifulSoup

Source: Internet. Published by: 逃生剧情解析知乎. Editor: 程序博客网. Time: 2024/05/16 11:00

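find(): returns the first matching object. When at least one match exists, this is the same as find_all()[0]; when nothing matches, find() returns None.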
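The difference between find() and find_all()[0] only shows up when nothing matches. The following is a minimal sketch, assuming the bs4 package and a made-up two-paragraph document:

```python
from bs4 import BeautifulSoup

# Hypothetical document, used only for illustration.
soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

first = soup.find("p")          # first matching Tag
same = soup.find_all("p")[0]    # equivalent when a match exists
missing = soup.find("table")    # no match: find() returns None

# find_all() returns an empty list when nothing matches,
# so indexing it with [0] would raise IndexError.
no_tables = soup.find_all("table")

print(first, missing, len(no_tables))
```

Checking the result of find() against None is therefore safer than indexing find_all() directly.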
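find_all(): (the following is adapted from the official documentation, which uses the older Beautiful Soup 3 spelling findAll; in bs4 the method is named find_all)

findAll(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

Returns a list. These arguments appear again and again throughout the documentation. The most important are the name argument and the keyword arguments (translator's note: that is, **kwargs).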
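The limit and recursive arguments from the signature are not demonstrated elsewhere in this article. Here is a small sketch of both, assuming bs4 and a sample document reconstructed from the outputs quoted in this article (the HTML constant is an assumption, not the original source document):

```python
from bs4 import BeautifulSoup

# Assumed sample document, reconstructed from the outputs shown in this article.
HTML = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(HTML, "html.parser")

# limit stops the search after the given number of matches.
first_b_only = soup.find_all("b", limit=1)

# recursive=False looks only at direct children of the tag being searched.
direct_titles = soup.html.find_all("title", recursive=False)  # title sits under head
all_titles = soup.html.find_all("title")                      # default: all descendants

print(len(first_b_only), len(direct_titles), len(all_titles))
```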
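The name argument restricts the result set to tags with matching names. There are several ways to match on name; the simplest is to pass a single tag name.
1. The following code finds all b tags in the document: soup.findAll('b')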
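2. You can pass a regular expression. The following code finds all tags whose names start with b:

import re
tagsStartingWithB = soup.findAll(re.compile('^b'))

To list the names of the matched tags: [tag.name for tag in tagsStartingWithB]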
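For readers on bs4, the same regular-expression search can be sketched as follows. The sample document is an assumption reconstructed from the outputs quoted in this article:

```python
import re
from bs4 import BeautifulSoup

# Assumed sample document, reconstructed from the outputs shown in this article.
HTML = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(HTML, "html.parser")

# The pattern is searched against each tag name; '^b' keeps names starting with b.
tags_starting_with_b = soup.find_all(re.compile("^b"))
names = [tag.name for tag in tags_starting_with_b]
print(names)
```

With this document the matched names are body, b, and b, in document order.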
3. You can pass a list or a dictionary. The two calls below both find all title and p tags and return the same result; the Beautiful Soup 3 documentation notes that the second form is somewhat faster:

soup.findAll(['title', 'p'])

Output:

[<title>Page title</title>, <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>, <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]

soup.findAll({'title' : True, 'p' : True})

Output:

[<title>Page title</title>, <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>, <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]
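A minimal bs4 sketch of the list form, again assuming a sample document reconstructed from the outputs above. Only the list form is shown here, since that is the one the bs4 documentation describes:

```python
from bs4 import BeautifulSoup

# Assumed sample document, reconstructed from the outputs shown in this article.
HTML = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(HTML, "html.parser")

# A list matches any tag whose name equals one of its items.
title_and_p = soup.find_all(["title", "p"])
matched_names = [tag.name for tag in title_and_p]
print(matched_names)
```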

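4. You can pass the special value True, which matches every tag name, that is, every tag.

allTags = soup.findAll(True)
[tag.name for tag in allTags]

Output:

[u'html', u'head', u'title', u'body', u'p', u'b', u'p', u'b']

This does not look very useful on its own, but True becomes useful when you also restrict attribute values.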
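To illustrate why True is useful together with an attribute restriction, here is a hedged bs4 sketch. The sample document is an assumption reconstructed from the outputs in this article:

```python
from bs4 import BeautifulSoup

# Assumed sample document, reconstructed from the outputs shown in this article.
HTML = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(HTML, "html.parser")

# True as the name matches every tag.
all_names = [tag.name for tag in soup.find_all(True)]

# Any tag name, but only tags that carry an id attribute.
names_with_id = [tag.name for tag in soup.find_all(True, id=True)]

print(all_names)
print(names_with_id)
```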
5. You can pass a callable: an object that takes a Tag as its only argument and returns a boolean. findAll passes each Tag it encounters to the callable, and the tag matches if the call returns True.
6. The following finds tags that have exactly two attributes:

soup.findAll(lambda tag: len(tag.attrs) == 2)

Output:

[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>, <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]

7. The following finds tags that have a one-character name and no attributes:

soup.findAll(lambda tag: len(tag.name) == 1 and not tag.attrs)

Output:

[<b>one</b>, <b>two</b>]
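The two callable filters above can be sketched in bs4 as follows. The helper name has_two_attrs and the sample document are illustrative assumptions:

```python
from bs4 import BeautifulSoup

# Assumed sample document, reconstructed from the outputs shown in this article.
HTML = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(HTML, "html.parser")


def has_two_attrs(tag):
    """Hypothetical helper: True for tags with exactly two attributes."""
    return len(tag.attrs) == 2


two_attr_ids = [tag["id"] for tag in soup.find_all(has_two_attrs)]

# One-character tag name and no attributes at all.
bare_single = soup.find_all(lambda tag: len(tag.name) == 1 and not tag.attrs)
bare_texts = [tag.get_text() for tag in bare_single]

print(two_attr_ids, bare_texts)
```

A named function is often easier to read and reuse than a lambda once the condition grows beyond one line.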

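8. Keyword arguments filter on tag attributes. This example finds all tags that have an align attribute with the value center:

soup.findAll(align="center")

Output:

[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>]

As with the name argument, you can pass different kinds of objects as keyword argument values to specify attribute matching more flexibly (but you cannot use Python reserved words such as class as keyword names).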
9. You can pass a string to match an attribute value, as above. You can also pass a regular expression, a list, a dictionary, the special values True or None, or a callable that takes the attribute value as its argument (note: this value may be None). Some examples:

soup.findAll(id=re.compile("para$"))

Output:

[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>, <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]

soup.findAll(align=["center", "blah"])

Output:

[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>, <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]

soup.findAll(align=lambda value: value and len(value) < 5)

Output:

[<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]
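The three attribute-value filters above (regular expression, list, callable) can be sketched in bs4 like this. The sample document is an assumption reconstructed from the quoted outputs:

```python
import re
from bs4 import BeautifulSoup

# Assumed sample document, reconstructed from the outputs shown in this article.
HTML = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(HTML, "html.parser")

# Regular expression: id values ending in "para".
by_regex = [t["id"] for t in soup.find_all(id=re.compile("para$"))]

# List: align equal to any of the listed values.
by_list = [t["id"] for t in soup.find_all(align=["center", "blah"])]

# Callable: guard against None, then keep values shorter than 5 characters.
by_callable = [
    t["id"]
    for t in soup.find_all(align=lambda value: value is not None and len(value) < 5)
]

print(by_regex, by_list, by_callable)
```

The explicit None check in the callable covers tags that have no align attribute at all.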

10. The special values True and None are more interesting. True matches tags that have the given attribute with any value; None matches tags that do not have the given attribute set. Some examples:

soup.findAll(align=True)

Output:

[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>, <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]

[tag.name for tag in soup.findAll(align=None)]

Output:

[u'html', u'head', u'title', u'body', u'b', u'b']
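A hedged bs4 sketch of the True and None attribute values, using a sample document reconstructed from the outputs in this article (an assumption, since the original document is not shown):

```python
from bs4 import BeautifulSoup

# Assumed sample document, reconstructed from the outputs shown in this article.
HTML = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(HTML, "html.parser")

# True: the attribute is present, whatever its value.
with_align = [tag.name for tag in soup.find_all(align=True)]

# None: the attribute is not set on the tag.
without_align = [tag.name for tag in soup.find_all(align=None)]

print(with_align)
print(without_align)
```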
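If you need more complex or interlocking restrictions on a tag's attributes, pass a callable as the name argument, as above, and work with the Tag object directly.

You may have noticed a problem here. What if your document has a tag that defines an attribute called name? You cannot use name as a keyword argument, because Beautiful Soup's search methods already define a name argument. Nor can you use a Python reserved word such as for as a keyword argument. For these cases Beautiful Soup provides a special argument, attrs. It is a dictionary that works just like the keyword arguments:

soup.findAll(id=re.compile("para$"))

Output:

[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>, <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]

soup.findAll(attrs={'id' : re.compile("para$")})

Output:

[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>, <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]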

You can use attrs to match attributes whose names are Python reserved words, such as class, for, and import, or attributes whose names are not usable as keyword arguments because they clash with the argument names of the Beautiful Soup search methods: name, recursive, limit, text, and attrs itself.

from BeautifulSoup import BeautifulStoneSoup
xml = '<person name="Bob"><parent rel="mother" name="Alice">'
xmlSoup = BeautifulStoneSoup(xml)
xmlSoup.findAll(name="Alice")

Output:

[]

xmlSoup.findAll(attrs={"name" : "Alice"})

Output:

[<parent rel="mother" name="Alice"></parent>]
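The example above uses the Beautiful Soup 3 import. A rough bs4 equivalent might look like the sketch below; it assumes the html.parser backend and closes the tags explicitly, which the original snippet left to the parser:

```python
from bs4 import BeautifulSoup

# Same data as the example above, with explicit closing tags (an assumption
# made so that the result does not depend on parser-specific tag repair).
xml = '<person name="Bob"><parent rel="mother" name="Alice"></parent></person>'
xml_soup = BeautifulSoup(xml, "html.parser")

# name="Alice" is interpreted as "tags named Alice", so nothing is found.
by_keyword = xml_soup.find_all(name="Alice")

# attrs lets us match an attribute that happens to be called "name".
by_attrs = xml_soup.find_all(attrs={"name": "Alice"})

print(len(by_keyword), [tag.name for tag in by_attrs])
```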

Searching by CSS class
The attrs argument is especially convenient for CSS classes, because class is both the attribute that carries CSS classes and a Python reserved word. You can search by CSS class with soup.find("tagName", { "class" : "cssClass" }), but because this operation is so common, you can also pass just a string as attrs. The string is treated as the value of the CSS class to match.

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("""Bob's <b>Bold</b> Barbeque Sauce now available in <b class="hickory">Hickory</b> and <b class="lime">Lime</b>""")
soup.find("b", { "class" : "lime" })

Output:

<b class="lime">Lime</b>

soup.find("b", "hickory")

Output:

<b class="hickory">Hickory</b>
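In bs4 there is a third spelling, the class_ keyword argument, which sidesteps the reserved word. The sketch below compares the three forms, assuming bs4 with the html.parser backend:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "Bob's <b>Bold</b> Barbeque Sauce now available in "
    '<b class="hickory">Hickory</b> and <b class="lime">Lime</b>',
    "html.parser",
)

lime_dict = soup.find("b", {"class": "lime"})   # attrs dictionary
hickory_str = soup.find("b", "hickory")         # bare string is treated as a CSS class
lime_kw = soup.find("b", class_="lime")         # bs4 keyword with trailing underscore

print(lime_dict, hickory_str, lime_kw)
```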

select() returns the elements that match a given CSS selector.
For example:

from bs4 import BeautifulSoup

html_sample = """
<html>
<head>
    <meta charset="UTF-8">
    <title></title>
</head>
<body>
<h1 id="title">This is a test!</h1>
<a href="#" class="link">This is link1!</a>
<a href="#" class="link">This is link2!</a>
<a href="#" class="link">This is link3!</a>
<a href="#" class="link">This is link4!</a>
hello world!<br>hello python!
</body>
</html>
"""
soup = BeautifulSoup(html_sample, 'html.parser')

1. Use select() to find all elements whose id is title (prefix the id with #):

alink = soup.select('#title')
print(alink)

Output:

[<h1 id="title">This is a test!</h1>]

2. Use select() to find all elements whose class is link (prefix the class with .):

alink = soup.select('.link')
print(alink)

Output:

[<a class="link" href="#">This is link1!</a>, <a class="link" href="#">This is link2!</a>, <a class="link" href="#">This is link3!</a>, <a class="link" href="#">This is link4!</a>]
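Since select() returns a list of Tag objects, the usual next step is to pull text or attributes out of them. The following sketch assumes bs4 (with its soupsieve dependency installed) and a shortened version of the sample page; select_one() is the single-result counterpart of select():

```python
from bs4 import BeautifulSoup

# Shortened version of the sample page above, for illustration only.
html_sample = """
<h1 id="title">This is a test!</h1>
<a href="#" class="link">This is link1!</a>
<a href="#" class="link">This is link2!</a>
"""
soup = BeautifulSoup(html_sample, "html.parser")

links = soup.select("a.link")                  # tag name and class combined
link_texts = [a.get_text() for a in links]     # visible text of each link
link_hrefs = [a["href"] for a in links]        # attribute access by key

title = soup.select_one("h1#title")            # first match or None
title_text = title.get_text()

print(link_texts, link_hrefs, title_text)
```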
