BeautifulSoup Study Notes 3

Source: Internet  Editor: 程序博客网  Time: 2024/06/08 02:01

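The previous note covered the find_all() method and the kinds of filters it accepts.
This note was meant to cover find_all() in detail, along with find().

There turned out to be too much to record about find_all(), so this note keeps only find_all().
The other search methods take similar filters and arguments to find_all(); the next note will cover them together.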

The examples again use the Alice document:

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

1 find_all()

Signature: find_all(name, attrs, recursive, string, limit, **kwargs)

**kwargs collects any additional keyword arguments.

find_all() looks through all tag descendants of the current tag and returns the ones that match the filter.

1.1 name

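The name argument finds every tag whose name matches name. Note that it searches for tags; text strings are ignored.
name can be any kind of filter (a string, a regular expression, a list, True, or a function). The previous note introduced these filters, and every example there used find_all() (link: http://blog.csdn.net/sinat_36651044/article/details/74936398).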

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, "html.parser")
>>> soup.find_all('title')
[<title>The Dormouse's story</title>]
>>>
>>> soup.find_all(['title', 'b'])
[<title>The Dormouse's story</title>, <b>The Dormouse's story</b>]
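The session above only exercises the string and list filters. As a minimal sketch of the remaining three kinds (regular expression, True, and function), assuming the same Alice document; the helper name has_class_but_no_id is only an illustrative choice:

```python
import re
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Regular expression: keep tags whose name matches the pattern (here: starts with "b").
b_tags = [tag.name for tag in soup.find_all(re.compile("^b"))]

# True: keep every tag in the document, but none of the text strings.
all_tags = [tag.name for tag in soup.find_all(True)]

# Function: called with each Tag; return True to keep it.
def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")

p_tags = soup.find_all(has_class_but_no_id)

print(b_tags)
print(all_tags)
print([tag["class"] for tag in p_tags])
```

The function filter is the most flexible of the three, since it can inspect anything on the tag rather than just its name.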

1.2 keyword

Any argument that’s not recognized will be turned into a filter on one of a tag’s attributes.
In other words, tags are searched by their attributes.

>>> tag1 = soup.a  # looking up by tag name returns only the first <a> tag; tag1 has several attributes
>>> tag1.attrs
{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

The value of a keyword argument can be any of these filters: a string, a regular expression, a function, or True.

>>> soup.find_all(id="link2")
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
>>>
>>> soup.find_all(id=True)
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> import re
>>> soup.find_all(href=re.compile("lacie"))
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
>>>
>>> soup.find_all(href=re.compile("lacie"), id=True)
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

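A dictionary passed as the attrs argument can also be used to search for tags with particular attributes.
class is a Python keyword, so using class as a keyword argument causes a syntax error; passing a dictionary through attrs gets the same search done. This overlaps somewhat with the next subsection on searching by CSS class.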

>>> soup.find_all(attrs={"id": "link2"})
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
>>>
>>> soup.find_all(class='sister')
SyntaxError: invalid syntax
>>> soup.find_all(attrs={"class": "sister"})
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
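class is not the only attribute that needs the attrs dictionary. A rough sketch of two other cases follows: attribute names that are not valid Python identifiers (such as data-foo), and the HTML name attribute, which collides with find_all()'s own name parameter. The markup here is a made-up snippet, not part of the Alice document.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
data_soup = BeautifulSoup(
    '<div data-foo="value">foo!</div><input name="email"/>', "html.parser"
)

# find_all(data-foo="value") would not even parse as Python,
# so the attribute has to go through the attrs dictionary.
by_data = data_soup.find_all(attrs={"data-foo": "value"})

# name= is find_all()'s tag-name parameter, so this looks for <email> tags.
by_name_kw = data_soup.find_all(name="email")

# The attrs dictionary filters on the HTML name attribute instead.
by_name_attr = data_soup.find_all(attrs={"name": "email"})

print(by_data)
print(by_name_kw)
print(by_name_attr)
```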

1.3 Searching by CSS class

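As the previous subsection noted, class is a Python keyword, so using class as a keyword argument causes a syntax error.
Since Beautiful Soup 4.1.2, the class_ keyword argument can be used to search for tags with a given CSS class.
class_ accepts the same kinds of filters: a string, a regular expression, a function, or True: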

>>> soup.find_all(class_='sister')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.find_all(class_=re.compile('tle'))
[<p class="title"><b>The Dormouse's story</b></p>]
>>>
>>> soup.find_all(class_=True)
[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, <p class="story">...</p>]
>>>
>>> def has_six_characters(css_class):
...     return css_class is not None and len(css_class) == 6
...
>>> soup.find_all(class_=has_six_characters)
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

A tag's class attribute can hold multiple values.

>>> CSS_SOUP = BeautifulSoup('<p class="body strikeout">YUAN</p>', "html.parser")
>>>
>>> CSS_SOUP.find_all("p", class_="body")  # p tags whose classes include body
[<p class="body strikeout">YUAN</p>]
>>>
>>> CSS_SOUP.find_all("p", class_="strikeout")
[<p class="body strikeout">YUAN</p>]
>>>
>>> CSS_SOUP.find_all("p", class_="strikeout body")  # matched against the exact class string; the order differs, so nothing is found
[]
>>>
>>> CSS_SOUP.find_all("p", class_="body strikeout")
[<p class="body strikeout">YUAN</p>]
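When a tag has to carry two or more classes regardless of their order, one option is a CSS selector through select() rather than find_all(). This is a sketch using the same one-line document; in recent Beautiful Soup releases select() relies on the soupsieve package, which is normally installed alongside bs4.

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout">YUAN</p>', "html.parser")

# find_all() compares a multi-word value against the exact class string,
# so the reversed order finds nothing.
exact = css_soup.find_all("p", class_="strikeout body")

# A CSS selector requires both classes but does not care about their order.
either_order = css_soup.select("p.strikeout.body")

print(exact)
print(either_order)
```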

1.4 recursive

When find_all() is called on a tag, Beautiful Soup examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False:

>>> soup.prettify  # no parentheses, so the REPL shows the bound method together with the document it belongs to
<bound method Tag.prettify of <html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>>
>>>
>>> soup.find_all("title")
[<title>The Dormouse's story</title>]
>>>
>>> soup.find_all("title", recursive=False)  # <title> is not a direct child of the document
[]
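The empty result comes from where the search starts: title sits under html > head, so it is only a direct child of head. A small sketch with a trimmed-down document (an assumption made for brevity) shows the same search from three starting points:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the Alice document.
soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>text</p></body></html>",
    "html.parser",
)

# Direct children of the document object: only <html>.
from_soup = soup.find_all("title", recursive=False)

# Direct children of <html>: <head> and <body>, still no <title>.
from_html = soup.html.find_all("title", recursive=False)

# Direct children of <head>: this is where <title> lives.
from_head = soup.head.find_all("title", recursive=False)

# True with recursive=False lists the direct child tags of <html>.
html_children = [tag.name for tag in soup.html.find_all(True, recursive=False)]

print(from_soup, from_html, from_head, html_children)
```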

1.5 The string argument

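find_all(name, attrs, recursive, string, limit, **kwargs)

The string argument was called text before Beautiful Soup 4.4.0; the session below uses both spellings.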

With string you can search for strings instead of tags. (Note: this searches for strings.)
As with name and the keyword arguments, you can pass in a string, a regular expression, a list, a function, or the value True.

>>> soup.find_all(text="Elsie")
['Elsie']
>>> soup.find_all(string="Elsie")
['Elsie']
>>>
>>> soup.find_all(text=re.compile("Dormouse"))
["The Dormouse's story", "The Dormouse's story"]
>>>
>>> # to search for tags instead, also pass the tag name to find_all()
>>> soup.find_all('b', text=re.compile("Dormouse"))
[<b>The Dormouse's story</b>]
>>>
>>> soup.find_all("a", text="Elsie")
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
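The session above covers the plain string and regular expression filters. The list and function forms might look like the following sketch, which uses a shortened paragraph (an assumption for brevity) and the string spelling of the argument, so it needs Beautiful Soup 4.4.0 or later:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the "story" paragraph.
soup = BeautifulSoup(
    '<p class="story"><a id="link1">Elsie</a>, <a id="link2">Lacie</a>'
    ' and <a id="link3">Tillie</a></p>',
    "html.parser",
)

# List: a string is kept if it equals any item; results come back in document order.
names = soup.find_all(string=["Tillie", "Elsie", "Lacie"])

# Function: called with each string; keep it only when it is
# the sole content of its parent tag.
def is_the_only_string_within_a_tag(s):
    return s == s.parent.string

only_strings = soup.find_all(string=is_the_only_string_within_a_tag)

print(names)
print(only_strings)
```

The separators ", " and " and " are dropped by the function filter because their parent p tag has several children, so its .string is None.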

1.6 The limit argument

Passing limit to find_all() restricts the number of results it returns.
With limit=1, find_all() is equivalent to find(), except that find_all() returns a list while find() returns the result itself.

>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.find_all('a', limit=1)
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
>>> soup.find('a')
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
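The difference in return type matters most when nothing matches: find_all() gives back an empty list, while find() gives back None. A short sketch with a two-link document (assumed here for brevity):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a><a id="link2">Lacie</a></p>', "html.parser"
)

# Same first match, wrapped in a list by find_all() and bare from find().
first_in_list = soup.find_all("a", limit=1)
first = soup.find("a")

# No <table> in the document: empty list versus None.
missing_list = soup.find_all("table")
missing = soup.find("table")

print(first_in_list, first)
print(missing_list, missing)
```

Code that chains on the result of find(), such as soup.find("table").find("tr"), therefore needs a None check first, whereas looping over the result of find_all() is safe even when it is empty.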