Python Web Scraping Study Notes (12): BeautifulSoup Key Points, Part 4
Source: Internet  Editor: 程序博客网  Time: 2024/06/06 18:06
5)find_all()
Signature: find_all(name, attrs, recursive, string, limit, **kwargs)
>>> soup.find_all('title')
[<title>The Dormouse's story</title>]
>>> soup.find_all('p', 'title')
[<p class="title"><b>The Dormouse's story</b></p>]
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.find_all(id='link2')
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
>>> soup.find(string=re.compile('sisters'))
'Once upon a time there were three little sisters; and their names were\n'
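The session above relies on a soup object and the re module that this post never sets up. A minimal setup sketch follows, assuming the examples use the "three sisters" sample document from the Beautiful Soup documentation (the outputs above are consistent with it). The names html_doc, titles, links and first_sisters are chosen here for illustration, and the built-in 'html.parser' is used so that no extra parser needs to be installed.

```python
import re
from bs4 import BeautifulSoup

# Assumed input: the "three sisters" sample document from the Beautiful Soup docs.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# 'html.parser' ships with Python; 'lxml' also works if it is installed.
soup = BeautifulSoup(html_doc, "html.parser")

titles = soup.find_all("title")                       # list of matching tags
links = soup.find_all("a")                            # all three <a> tags
first_sisters = soup.find(string=re.compile("sisters"))  # first matching string

print(titles)
print(len(links))
print(repr(first_sisters))
```

Note that find_all() always returns a list, while find() returns a single result.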
5.1)The [name] argument
Pass in a value for name and you’ll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names don’t match.
This is the simplest usage:
soup.find_all('title')
# [<title>The Dormouse's story</title>]
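Besides a plain string, name accepts the same kinds of filter as the other arguments: a list, a regular expression, a function, or True. The sketch below illustrates each on a small made-up document; the markup and the helper has_href are hypothetical and only serve the example.

```python
import re
from bs4 import BeautifulSoup

# Small made-up document, just for this illustration.
doc = ("<html><head><title>T</title></head>"
       "<body><p><b>bold</b> and <a href='#'>link</a></p></body></html>")
soup = BeautifulSoup(doc, "html.parser")

# A list matches a tag whose name is any item in the list.
a_or_b = [t.name for t in soup.find_all(["a", "b"])]

# A regular expression is searched against each tag name.
starts_with_b = [t.name for t in soup.find_all(re.compile("^b"))]

# True matches every tag, but no text strings.
all_tags = [t.name for t in soup.find_all(True)]

# A function receives each tag and returns True to keep it.
def has_href(tag):
    return tag.has_attr("href")

with_href = [t.name for t in soup.find_all(has_href)]

print(a_or_b)         # ['b', 'a']
print(starts_with_b)  # ['body', 'b']
print(all_tags)       # ['html', 'head', 'title', 'body', 'p', 'b', 'a']
print(with_href)      # ['a']
```

Results come back in document order, which is why b precedes a in the first list.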
5.2)The [keyword] arguments
Any argument that’s not recognized will be turned into a filter on one of a tag’s attributes. If you pass in a value for an argument called id, Beautiful Soup will filter against each tag’s ‘id’ attribute:
soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
If you pass in a value for href, Beautiful Soup will filter against each tag’s ‘href’ attribute:
soup.find_all(href=re.compile('elsie'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
You can filter an attribute based on a string, a regular expression, a list, a function, or the value True.
This code finds all tags whose id attribute has a value, regardless of what the value is:
>>> soup.find_all(id=True)
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
You can filter multiple attributes at once by passing in more than one keyword argument:
>>> soup.find_all(href=re.compile('elsie'), id='link1')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
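Some attribute names cannot be written as Python keyword arguments, for example HTML5 data-* attributes (the hyphen is not valid in an identifier) or an attribute literally called name (which would collide with find_all's own name parameter). In those cases the filter can be passed through the attrs dictionary instead. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up markup with attributes that cannot be used as keyword arguments.
data_soup = BeautifulSoup(
    '<div data-foo="value">foo!</div><input name="email"/>',
    "html.parser",
)

# data_soup.find_all(data-foo="value") would be a SyntaxError,
# so the attribute filter goes into the attrs dictionary.
by_data = data_soup.find_all(attrs={"data-foo": "value"})

# name="email" as a keyword would be taken as the tag name to search for,
# so the HTML 'name' attribute is filtered through attrs as well.
by_name_attr = data_soup.find_all(attrs={"name": "email"})

print(by_data)
print(by_name_attr)
```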
5.3)Searching by CSS class
It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:
>>> soup.find_all('a', class_='sister')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
As with any keyword argument, you can pass class_ a string, a regular expression, a function, or True:
>>> soup.find_all(class_=re.compile('itl'))
[<p class="title"><b>The Dormouse's story</b></p>]
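A single tag can carry several CSS classes. When searching with class_, a string matches if it equals any one of the tag's classes, or the exact string value of the whole class attribute; the same classes in a different order do not match. The sketch below, on a made-up one-tag document, shows the three cases.

```python
from bs4 import BeautifulSoup

# Made-up tag with two CSS classes.
css_soup = BeautifulSoup('<p class="body strikeout">old text</p>', "html.parser")

# Matching against either individual class works.
one_class = css_soup.find_all("p", class_="strikeout")
other_class = css_soup.find_all("p", class_="body")

# The exact string value of the class attribute also matches...
exact = css_soup.find_all("p", class_="body strikeout")

# ...but the same classes in a different order do not.
reordered = css_soup.find_all("p", class_="strikeout body")

print(len(one_class), len(other_class), len(exact), len(reordered))
```

When the order of the classes should not matter, a CSS selector such as soup.select("p.strikeout.body") is the usual alternative.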
5.4)The [string] argument
With string you can search for strings instead of tags. As with name and the keyword arguments, you can pass in a string, a regular expression, a list, a function, or the value True. Here are some examples:
>>> soup.find_all(string='Elsie')
['Elsie']
>>> soup.find_all(string=['Tillie', 'Elsie', 'Lacie'])
['Elsie', 'Lacie', 'Tillie']
>>> soup.find_all(string=re.compile('Dormouse'))
["The Dormouse's story", "The Dormouse's story"]
>>> def is_the_only_string_within_a_tag(s):
...     """Return True if this string is the only child of its parent tag."""
...     return (s == s.parent.string)
...
>>> soup.find_all(string=is_the_only_string_within_a_tag)
["The Dormouse's story", "The Dormouse's story", 'Elsie', 'Lacie', 'Tillie', '...']
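Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup then returns the tags whose .string matches your value for string. This code finds the <a> tags whose .string is 'Elsie':
>>> soup.find_all('a', string='Elsie')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
The string argument is new in Beautiful Soup 4.4.0. In earlier versions it was called text, and the old name still appears in older code:
>>> soup.find_all('a', text='Elsie')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]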
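The difference between the two forms is what comes back: string on its own returns the matching text nodes, while string combined with a tag name returns the tags. The text nodes still know which tag contains them, so either form can lead to the attributes. A minimal sketch on a made-up fragment (the variable names are illustrative):

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment with two links.
doc = ('<p>Sisters: <a href="http://example.com/elsie">Elsie</a>, '
       '<a href="http://example.com/lacie">Lacie</a></p>')
soup = BeautifulSoup(doc, "html.parser")

# string= alone returns the matching text nodes (NavigableString objects).
matches = soup.find_all(string=re.compile("^L"))

# Each text node keeps a reference to the tag that contains it.
parent_hrefs = [s.parent["href"] for s in matches]

# Combined with a tag name, the result is the tags themselves.
tags = soup.find_all("a", string=re.compile("^L"))

print(matches)       # ['Lacie']
print(parent_hrefs)  # ['http://example.com/lacie']
print(tags)
```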
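5.5)The [limit] argument
find_all() returns every tag and string that matches your filters, which can take a while on a large document. If you don’t need all the results, pass in a number for limit and Beautiful Soup stops searching once it has found that many. The sample document has three links; with limit=2 only the first two are returned: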
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.find_all('a', limit=2)
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
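A related point is how limit=1 compares with find(): find_all(..., limit=1) still returns a list, whereas find() returns the first match directly, or None when nothing matches. The sketch below uses a made-up list document to show the difference.

```python
from bs4 import BeautifulSoup

# Made-up document with three list items.
soup = BeautifulSoup("<ul><li>one</li><li>two</li><li>three</li></ul>", "html.parser")

first_two = soup.find_all("li", limit=2)   # list with two tags
only_one = soup.find_all("li", limit=1)    # still a list, with one tag
first = soup.find("li")                    # the tag itself, not a list

missing_all = soup.find_all("table")       # empty list when nothing matches
missing_one = soup.find("table")           # None when nothing matches

print([li.get_text() for li in first_two])  # ['one', 'two']
print(only_one[0] == first)                 # True
print(missing_all, missing_one)             # [] None
```

Because find() can return None, it is worth checking the result before accessing attributes on it.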
6)get_text()
If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:
>>> markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
>>> soup = BeautifulSoup(markup, 'lxml')
>>> soup.get_text()
'\nI linked to example.com\n'
>>> soup.i.get_text()
'example.com'
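get_text() also accepts a separator string and a strip flag, which help when the raw text contains stray newlines like the ones above. A short sketch using the same markup (parsed here with the built-in 'html.parser' rather than 'lxml', which does not change the text):

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# Default: all text nodes are concatenated exactly as they are.
raw = soup.get_text()

# A separator is placed between the individual text nodes.
joined = soup.get_text("|")

# strip=True trims each text node and drops the ones that end up empty.
clean = soup.get_text("|", strip=True)

# stripped_strings yields the same trimmed pieces one by one.
pieces = list(soup.stripped_strings)

print(repr(raw))     # '\nI linked to example.com\n'
print(repr(joined))  # '\nI linked to |example.com|\n'
print(repr(clean))   # 'I linked to|example.com'
print(pieces)        # ['I linked to', 'example.com']
```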