Python Crawler Study Notes (11): BeautifulSoup Knowledge Points, Part 3
Source: Internet · Editor: 程序博客网 · Posted: 2024/06/04 19:19
1) .parent
In the example “three sisters” document, the <head> tag is the parent of the <title> tag:
>>> title_tag = soup.title
>>> title_tag
<title>The Dormouse's story</title>
>>> title_tag.parent
<head><title>The Dormouse's story</title></head>
The title string itself has a parent: the <title> tag that contains it:
>>> title_tag.string.parent
<title>The Dormouse's story</title>
The parent of a top-level tag like <html> is the BeautifulSoup object itself:
>>> html_tag = soup.html
>>> type(html_tag.parent)
<class 'bs4.BeautifulSoup'>
And the .parent of a BeautifulSoup object is defined as None:
>>> print(soup.parent)
None
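The snippets above assume a `soup` built from the "three sisters" document used throughout this series. A minimal self-contained sketch (using a trimmed version of that document and the stdlib `html.parser` backend) ties the four `.parent` cases together:

```python
from bs4 import BeautifulSoup

# A trimmed version of the "three sisters" document from earlier notes.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p></body></html>"""

soup = BeautifulSoup(html_doc, 'html.parser')

title_tag = soup.title
print(title_tag.parent.name)            # head: the tag that contains <title>
print(title_tag.string.parent.name)     # title: a string's parent is its tag
print(type(soup.html.parent).__name__)  # BeautifulSoup: top-level tag's parent
print(soup.parent)                      # None: the soup object has no parent
```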
2) .parents
You can iterate over all of an element’s parents with .parents. This example uses .parents to travel from an <a> tag buried deep within the document, to the very top of the document:
>>> link = soup.a
>>> link
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> for parent in link.parents:
...     if parent is None:
...         print(parent)
...     else:
...         print(parent.name)
p
body
html
[document]
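A common use of `.parents` is to record an element's ancestry as a list, innermost ancestor first. A small sketch (the document here is a hypothetical one-link page, not the full "three sisters" document):

```python
from bs4 import BeautifulSoup

html_doc = ('<html><body><p class="story">'
            '<a href="http://example.com/elsie" id="link1">Elsie</a>'
            '</p></body></html>')
soup = BeautifulSoup(html_doc, 'html.parser')

link = soup.a
# Collect the name of every ancestor; the BeautifulSoup object
# itself reports the name '[document]'.
ancestry = [parent.name for parent in link.parents]
print(ancestry)  # ['p', 'body', 'html', '[document]']
```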
2. Going sideways
Consider a simple document like this:
>>> sibling_soup = BeautifulSoup('<a><b>text1</b><c>text2</c></a>', 'lxml')
>>> print(sibling_soup.prettify())
<html>
<body>
<a>
<b>
text1
</b>
<c>
text2
</c>
</a>
</body>
</html>
The <b> tag and the <c> tag are at the same level: they’re both direct children of the same tag. We call them siblings. When a document is pretty-printed, siblings show up at the same indentation level. You can also use this relationship in the code you write.
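The attributes for moving between siblings are `.next_sibling` and `.previous_sibling`. A minimal sketch on the same document (using the stdlib `html.parser` backend instead of lxml, so no extra install is needed):

```python
from bs4 import BeautifulSoup

sibling_soup = BeautifulSoup('<a><b>text1</b><c>text2</c></a>', 'html.parser')

b_tag = sibling_soup.b
print(b_tag.next_sibling)       # the <c> tag immediately after <b>
print(b_tag.next_sibling.name)  # c

c_tag = sibling_soup.c
print(c_tag.previous_sibling.name)  # b
print(c_tag.next_sibling)           # None: <c> is the last child of <a>
```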
3. Searching the tree
The main search methods are find() and find_all(); both accept the same kinds of filters, described below.
1) A string
The simplest filter is a string. Pass a string to a search method and Beautiful Soup will perform a match against that exact string. This code finds all the <b> tags in the document:
>>> soup.find_all('b')
[<b>The Dormouse's story</b>]
2) A regular expression
If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its search() method. This code finds all the tags whose names start with the letter “b”; in this case, the <body> tag and the <b> tag:
>>> import re
>>> for tag in soup.find_all(re.compile('^b')):
...     print(tag.name)
body
b
3) A list
If you pass in a list, Beautiful Soup will allow a string match against any item in that list. This code finds all the <a> tags and all the <b> tags:
>>> soup.find_all(['a', 'b'])
[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
4) A function
If none of the other matches work for you, define a function that takes an element as its only argument. The function should return True if the argument matches, and False otherwise.
Here’s a function that returns True if a tag defines the “class” attribute but doesn’t define the “id” attribute:
>>> def has_class_but_no_id(tag):
...     return tag.has_attr('class') and not tag.has_attr('id')
Pass this function into find_all() and you’ll pick up all the <p> tags:
soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
# <p class="story">Once upon a time there were...</p>,
# <p class="story">...</p>]
If you pass in a function to filter on a specific attribute like "href", the argument passed into the function will be the attribute value, not the whole tag. Here’s a function that finds all <a> tags whose href attribute does not match a regular expression:
>>> def not_la(href):
...     return href and not re.compile('lacie').search(href)
>>> soup.find_all(href=not_la)
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
The function can be as complicated as you need it to be. Here’s a function that returns True if a tag is surrounded by string objects:
>>> from bs4 import NavigableString
>>> def surrounded_by_string(tag):
...     return (isinstance(tag.next_element, NavigableString)
...             and isinstance(tag.previous_element, NavigableString))
>>> for tag in soup.find_all(surrounded_by_string):
...     print(tag.name)
body
p
a
a
a
p
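The four filter kinds above can be compared side by side on one small page. A self-contained sketch (the document here is a hypothetical two-element page, and `html.parser` stands in for lxml):

```python
import re
from bs4 import BeautifulSoup

html_doc = ('<html><body>'
            '<p class="title"><b>Bold</b></p>'
            '<a href="http://example.com/elsie" id="link1">Elsie</a>'
            '</body></html>')
soup = BeautifulSoup(html_doc, 'html.parser')

# 1) A string: exact tag-name match.
print([t.name for t in soup.find_all('b')])                      # ['b']

# 2) A regular expression: names starting with 'b'.
print(sorted(t.name for t in soup.find_all(re.compile('^b'))))   # ['b', 'body']

# 3) A list: any of several names, returned in document order.
print([t.name for t in soup.find_all(['a', 'b'])])               # ['b', 'a']

# 4) A function: arbitrary logic over the tag.
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print([t.name for t in soup.find_all(has_class_but_no_id)])      # ['p']
```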