Python爬虫学习纪要（十二）：BeautifulSoup相关知识点4

来源：互联网发布：国家大力发展人工智能编辑：程序博客网时间：2024/06/06 18:06

5）find_all()
Signature: find_all(name, attrs, recursive, string, limit, **kwargs)
>>> soup.find_all('title')
[<title>The Dormouse's story</title>]
>>> soup.find_all('p', 'title')
[The Dormouse's story]
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.find_all(id='link2')
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
>>> soup.find(string = re.compile('sisters'))
'Once upon a time there were three little sisters; and their names were\n'

5.1)The [name] argument
Pass in a value for name and you’ll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names that don’t match.

This is the simplest usage:

soup.find_all('title')
# [<title>The Dormouse's story</title>]

5.2)The [keyword] arguments
Any argument that’s not recognized will be turned into a filter on one of a tag’s attributes. If you pass in a value for an argument called id, Beautiful Soup will filter against each tag’s ‘id’ attribute:

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass in a value for href, Beautiful Soup will filter against each tag’s ‘href’ attribute:

soup.find_all(href=re.compile('elsie'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

You can filter an attribute based on a string, a regular expression, a list, a function, or the value True.

This code finds all tags whose id attribute has a value, regardless of what the value is:'
>>> soup.find_all(id=True)
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

You can filter multiple attributes at once by passing in more than one keyword argument:
>>> soup.find_all(href=re.compile('elsie'), id = 'link1')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

5.3)Searching by CSS class
It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:
>>> soup.find_all('a', class_='sister')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

As with any keyword argument, you can pass class_ a string, a regular expression, a function, or True:
>>> soup.find_all(class_=re.compile('itl'))
[The Dormouse's story]

5.4)The [string] argument
With string you can search for strings instead of tags. As with name and the keyword arguments, you can pass in a string, a regular expression, a list, a function, or the value True. Here are some examples:
>>> soup.find_all(string='Elsie')
['Elsie']
>>> soup.find_all(string=['Tillie', 'Elsie', 'Lacie'])
['Elsie', 'Lacie', 'Tillie']
>>> soup.find_all(string=re.compile('Dormouse'))
["The Dormouse's story", "The Dormouse's story"]

>>> def is_the_only_string_within_a_tag(s):
"""Return True if this string is the only child of its parent tag."""
return (s == s.parent.string)

>>> soup.find_all(string=is_the_only_string_within_a_tag)
["The Dormouse's story", "The Dormouse's story", 'Elsie', 'Lacie', 'Tillie', '...']

>>> soup.find_all('a', string='Elsie')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
>>> soup.find_all('a', text='Elsie')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

5.5)The [limit] argument
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.find_all('a', limit=2)
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

6)get_text()
If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:
>>> markup = '<a href="http://example.com/">\nI linked to example.com\n</a>'
>>> soup = BeautifulSoup(markup, 'lxml')
>>> soup.get_text()
'\nI linked to example.com\n'
>>> soup.i.get_text()
'example.com'

阅读全文

0 0