Python Web Scraping Study Notes (11): BeautifulSoup Key Points, Part 3

Source: Internet  Editor: 程序博客网  Time: 2024/06/04 19:19
I. Family tree
1) .parent
In the example “three sisters” document, the <head> tag is the parent of the <title> tag:
>>> title_tag = soup.title
>>> title_tag
<title>The Dormouse's story</title>
>>> title_tag.parent
<head><title>The Dormouse's story</title></head>

The title string itself has a parent: the <title> tag that contains it:
>>> title_tag.string.parent
<title>The Dormouse's story</title>

The parent of a top-level tag like <html> is the BeautifulSoup object itself:
>>> html_tag = soup.html
>>> type(html_tag.parent)
<class 'bs4.BeautifulSoup'>

And the .parent of a BeautifulSoup object is defined as None:
>>> print(soup.parent)
None

2) .parents
You can iterate over all of an element’s parents with .parents. This example uses .parents to travel from an <a> tag buried deep within the document, to the very top of the document:
>>> link = soup.a
>>> link
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> for parent in link.parents:
...     if parent is None:
...         print(parent)
...     else:
...         print(parent.name)

========================================
p
body
html
[document]
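
As a minimal sketch of the same idea, the ancestor names can be collected into a list. The short HTML string and the variable names below are assumptions made for illustration (they are not the full "three sisters" document), and the built-in 'html.parser' is used so that no extra parser is needed:

```python
from bs4 import BeautifulSoup

# A small stand-in document (assumed for this sketch).
html_doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class='story'>Once upon a time "
    "<a class='sister' href='http://example.com/elsie' id='link1'>Elsie</a>"
    "</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

link = soup.a

# .parents walks upward from the tag to the BeautifulSoup object,
# whose .name is the special string '[document]'.
ancestor_names = [parent.name for parent in link.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']
```

The last entry is the BeautifulSoup object itself, which is why the loop above ends with [document].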

II. Going sideways
Consider a simple document like this:
>>> sibling_soup = BeautifulSoup('<a><b>text1</b><c>text2</c></a>', 'lxml')
>>> print(sibling_soup.prettify())
<html>
 <body>
  <a>
   <b>
    text1
   </b>
   <c>
    text2
   </c>
  </a>
 </body>
</html>

The <b> tag and the <c> tag are at the same level: they’re both direct children of the same tag. We call them siblings. When a document is pretty-printed, siblings show up at the same indentation level. You can also use this relationship in the code you write.
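
One way to use the sibling relationship in code is through the .next_sibling and .previous_sibling attributes. The sketch below is an illustration under assumptions: it parses the same small document with the built-in 'html.parser' (so there is no <html>/<body> wrapper), and the variable names are chosen for this example only:

```python
from bs4 import BeautifulSoup

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")

b_tag = sibling_soup.b
c_tag = sibling_soup.c

# <b> and <c> are both direct children of <a>, so they are siblings.
after_b = b_tag.next_sibling        # the <c> tag
before_c = c_tag.previous_sibling   # the <b> tag

# <b> is the first child and <c> is the last, so there is nothing
# before <b> or after <c>.
before_b = b_tag.previous_sibling   # None
after_c = c_tag.next_sibling        # None

print(after_b)   # <c>text2</c>
print(before_c)  # <b>text1</b>
```

Note that "text1" and "text2" are not siblings of each other, because they have different parents.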

III. Searching the tree
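Beautiful Soup defines many methods for searching the parse tree; the two most commonly used are find() and find_all(). Each of the filters below can be passed to either method.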
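
The difference between the two methods can be sketched as follows. The HTML snippet and variable names are assumptions made for this illustration:

```python
from bs4 import BeautifulSoup

# A small stand-in document (assumed for this sketch).
html_doc = """
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

all_links = soup.find_all("a")      # every matching tag, as a list
first_link = soup.find("a")         # only the first matching tag

no_tables = soup.find_all("table")  # no match: an empty list
missing = soup.find("table")        # no match: None

print(len(all_links))    # 2
print(first_link["id"])  # link1
print(missing)           # None
```

Because find() returns None when nothing matches, it is worth checking the result before accessing attributes on it.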

1) A string
The simplest filter is a string. Pass a string to a search method and Beautiful Soup will perform a match against that exact string. This code finds all the <b> tags in the document:
>>> soup.find_all('b')
[<b>The Dormouse's story</b>]

2) A regular expression
If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its search() method. This code finds all the tags whose names start with the letter “b”; in this case, the <body> tag and the <b> tag:
>>> import re
>>> for tag in soup.find_all(re.compile('^b')):
...     print(tag.name)

===============================================
body
b
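
Because the pattern is applied with search(), it can match anywhere in a tag name unless it is anchored. The sketch below contrasts an anchored pattern with an unanchored one; the short document and the variable names are assumptions for illustration:

```python
import re
from bs4 import BeautifulSoup

# A small stand-in document (assumed for this sketch).
html_doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p><b>The Dormouse's story</b></p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# '^b' is anchored: only names that START with "b" match.
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

# 't' is unanchored: any name that CONTAINS "t" matches.
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

print(starts_with_b)  # ['body', 'b']
print(contains_t)     # ['html', 'title']
```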

3) A list
If you pass in a list, Beautiful Soup will allow a string match against any item in that list. This code finds all the <a> tags and all the <b> tags:
>>> soup.find_all(['a', 'b'])
[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

4) A function
If none of the other matches work for you, define a function that takes an element as its only argument. The function should return True if the argument matches, and False otherwise.

Here’s a function that returns True if a tag defines the “class” attribute but doesn’t define the “id” attribute:
>>> def has_class_but_no_id(tag):
...     return tag.has_attr('class') and not tag.has_attr('id')

Pass this function into find_all() and you’ll pick up all the <p> tags:

>>> soup.find_all(has_class_but_no_id)
[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were...</p>,
 <p class="story">...</p>]

If you pass in a function to filter on a specific attribute such as href, the argument passed into the function will be the attribute value, not the whole tag. Here’s a function that finds all <a> tags whose href attribute does not match a regular expression:
>>> def not_la(href):
...     return href and not re.compile('lacie').search(href)
...
>>> soup.find_all(href=not_la)
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
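
A variation on the same technique is sketched below. The function name not_lacie and the short document are assumptions for this illustration. The function receives only the attribute value, so it guards against a missing value before testing the string:

```python
from bs4 import BeautifulSoup

# A small stand-in document (assumed for this sketch).
html_doc = """
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def not_lacie(href):
    # href is the attribute VALUE (a string), or a false value when the
    # tag has no href attribute; the whole tag is not passed in.
    return bool(href) and "lacie" not in href


matching_ids = [a["id"] for a in soup.find_all(href=not_lacie)]
print(matching_ids)  # ['link1', 'link3']
```

The <p> tag is excluded because it has no href attribute, so the function returns False for it.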

The function can be as complicated as you need it to be. Here’s a function that returns True if a tag is surrounded by string objects:
>>> from bs4 import NavigableString
>>> def surrounded_by_string(tag):
...     return (isinstance(tag.next_element, NavigableString)
...             and isinstance(tag.previous_element, NavigableString))
...
>>> for tag in soup.find_all(surrounded_by_string):
...     print(tag.name)

body
p
a
a
a
p
