A Detailed Guide to the Python Beautiful Soup Library

Source: Internet  Editor: 程序博客网  Date: 2024/05/29 09:14

A BeautifulSoup object represents the entire content of an HTML/XML document.

from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>asd</p>", "html.parser")
print(soup.prettify())

Output:

<p>
 asd
</p>

Beautiful Soup parsers

Basic elements of the BeautifulSoup class

Example:

URL used: https://python123.io/ws/demo.html

Content:

<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>

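Practice with the basic elements:

from bs4 import BeautifulSoup
import requests

r = requests.get("https://python123.io/ws/demo.html")
soup = BeautifulSoup(r.text, "html.parser")
# Print the first <title> tag
print(soup.title)
# Print the first <a> tag
print(soup.a)
# print(soup.prettify())  # Show the HTML page in a normalized, indented layout
# Name of the <a> tag
print(soup.a.name)
# Name of the tag that directly contains <a>
print(soup.a.parent.name)
# Attributes of the tag, as key-value pairs
print(soup.a.attrs)
# Text inside the tag
print(soup.a.string)

Output: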

<title>This is a python demo page</title>
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
a
p
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
Basic Python
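
To make the element kinds above more concrete, here is a minimal sketch that checks the Python type behind each one. It assumes a shortened inline copy of the demo markup instead of fetching the page, and the names html, link and element_types are only illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Assumption: a shortened inline stand-in for the demo page, so no network is needed.
html = (
    '<p class="course">'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">'
    'Basic Python</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
link = soup.a

# Each basic element maps to an ordinary Python type.
element_types = {
    "tag": isinstance(link, Tag),                        # soup.a is a Tag
    "name": isinstance(link.name, str),                  # .name is a plain string
    "attrs": isinstance(link.attrs, dict),               # .attrs is a dictionary
    "string": isinstance(link.string, NavigableString),  # .string is a NavigableString
}
print(element_types)

# A Tag also supports dictionary-style attribute access.
href = link["href"]          # raises KeyError if the attribute is missing
missing = link.get("title")  # returns None if the attribute is missing
print(href, missing)
```

Note that class is a multi-valued attribute, which is why it appears as a list (['py1']) in the attrs output above.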

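Basic structure of HTML:

Downward traversal of the tag tree:

.contents: a list of child nodes; all direct children of <tag> are stored in the list
.children: an iterator over child nodes; similar to .contents, used to loop over direct children
.descendants: an iterator over all descendant nodes, used to loop over them

from bs4 import BeautifulSoup
import requests

r = requests.get("https://python123.io/ws/demo.html")
soup = BeautifulSoup(r.text, "html.parser")
print(soup.head)
print(soup.head.contents)
# Child nodes of body
print(soup.body.contents)
# 5 child nodes
print(len(soup.body.contents))

Output: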

<head><title>This is a python demo page</title></head>
[<title>This is a python demo page</title>]
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
5

Iterating over child nodes:

for child in soup.body.children:
    print(child)

Output (the blank lines produced by the '\n' text nodes are omitted):

<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
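
Because the whitespace between tags counts as a child node, a loop over .children often needs to skip text nodes. The following is a small sketch of one way to do that, assuming a shortened inline stand-in for the demo page; the names all_children and tag_children are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumption: a shortened stand-in for the demo page body, newlines included.
html = (
    '<body>\n'
    '<p class="title"><b>Intro</b></p>\n'
    '<p class="course">Courses</p>\n'
    '</body>'
)
soup = BeautifulSoup(html, "html.parser")

# .children yields every direct child, including the '\n' text nodes.
all_children = list(soup.body.children)

# Keep only the children that are real tags.
tag_children = [child for child in soup.body.children if isinstance(child, Tag)]

print(len(all_children), len(tag_children))
for child in tag_children:
    print(child.name, child["class"])
```

Filtering with isinstance(child, Tag) is one option; checking child.name is not None is a common alternative, since text nodes have no tag name.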

Iterating over descendant nodes:

for child in soup.body.descendants:
    print(child)

<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.

<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Advanced Python
.
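
The descendant output mixes tags with the strings they contain, in document order. As a rough sketch, the two kinds can be separated like this; the inline markup is a shortened stand-in for the demo page and the variable names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Assumption: a compact stand-in for the demo page body, without newlines.
html = (
    '<body><p class="title"><b>Intro</b></p>'
    '<p class="course">See <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p></body>'
)
soup = BeautifulSoup(html, "html.parser")

# .descendants walks the whole subtree depth-first, in document order.
tag_names = [node.name for node in soup.body.descendants if isinstance(node, Tag)]
strings = [str(node) for node in soup.body.descendants
           if isinstance(node, NavigableString)]

print(tag_names)
print(strings)
```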

Upward traversal of the tag tree:

.parent: the parent tag of a node
.parents: an iterator over a node's ancestor tags, used to loop over the ancestors

print(soup.title.parent)
<head><title>This is a python demo page</title></head>

for parent in soup.a.parents:
    print(parent.name)

p
body
html
[document]
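
One use for .parents is to build the path from a tag back to the document root. The sketch below assumes a shortened inline stand-in for the demo page; the names names and path are illustrative.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened stand-in for the demo page.
html = '<html><body><p class="course"><a id="link1">Basic Python</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Start with the tag itself, then append each ancestor from nearest to farthest.
names = [soup.a.name] + [parent.name for parent in soup.a.parents]

# Reverse the list so the path reads from the root down to the tag.
path = " > ".join(reversed(names))
print(path)
```

The last ancestor is the BeautifulSoup object itself, whose name is [document], which is why that entry closes the list in the output above.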

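Sibling traversal of tags: sibling traversal only takes place among child nodes that share the same parent.

Text between tag pairs also counts as a node, so a sibling is not necessarily a tag.

.next_sibling: returns the next sibling node in HTML text order
.previous_sibling: returns the previous sibling node in HTML text order
.next_siblings: an iterator over all following sibling nodes in HTML text order
.previous_siblings: an iterator over all preceding sibling nodes in HTML text order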
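
The four sibling attributes can be tried out with a short sketch like the one below. It assumes a shortened inline stand-in for the second paragraph of the demo page, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumption: a shortened stand-in for the "course" paragraph of the demo page.
html = (
    '<p class="course">See <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
first_link = soup.a

# The immediate neighbours of the first <a> are text nodes, not tags.
after = first_link.next_sibling
before = first_link.previous_sibling
print(repr(str(before)), repr(str(after)))

# .next_siblings yields every later sibling: text, tag, text.
later = list(first_link.next_siblings)

# Filter on Tag to skip the text nodes between tags.
later_tags = [node for node in first_link.next_siblings if isinstance(node, Tag)]
print([tag["id"] for tag in later_tags])
```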


The content search function find_all
<>.find_all(name, attrs, recursive, string, **kwargs) returns a list-type result that stores the matches
name: the search string for tag names
from bs4 import BeautifulSoup
import requests

r = requests.get("https://python123.io/ws/demo.html")
soup = BeautifulSoup(r.text, "html.parser")
print(soup.find_all('a'))
Output: [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

print(soup.find_all(['a','b']))

Output: [<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
Print all tags:

for tag in soup.find_all(True):
    print(tag.name)
html
head
title
body
p
b
p
a
a

Print all tags whose names contain the letter a:

# Use the re regular-expression library
import re

for tag in soup.find_all(re.compile('a')):
    print(tag.name)
head
a
a
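
A compiled pattern is searched for anywhere in the tag name, which is why head shows up above. To match only names that start with a given letter, the pattern can be anchored. The sketch below assumes a shortened inline stand-in for the demo page; the variable names are illustrative.

```python
import re

from bs4 import BeautifulSoup

# Assumption: a shortened stand-in for the demo page.
html = (
    '<html><head><title>Demo</title></head>'
    '<body><p><b>Intro</b></p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Unanchored: matches any tag name containing the letter t.
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

# Anchored with ^: matches only tag names that begin with b.
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

print(contains_t)
print(starts_with_b)
```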

attrs: the search string for tag attribute values; an attribute can also be named explicitly as a keyword argument
print(soup.find_all("p", "course"))
print(soup.find_all(id="link1"))
print(soup.find_all(id="link"))
print(soup.find_all(id=re.compile("link")))

[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
[]
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
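
Attribute filters can be written in a few equivalent ways. The sketch below shows the class_ keyword, an attrs dictionary, and a regular expression on href; it assumes a shortened inline stand-in for the demo page, and the variable names are illustrative.

```python
import re

from bs4 import BeautifulSoup

# Assumption: a shortened stand-in for the two links on the demo page.
html = (
    '<p class="course">'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

# class is a Python keyword, so the keyword argument is spelled class_.
by_class = soup.find_all("a", class_="py1")

# An attrs dictionary works for any attribute name.
by_attrs = soup.find_all(attrs={"id": "link2"})

# A compiled pattern matches part of an attribute value.
by_href = soup.find_all(href=re.compile("BIT-1001870001"))

print([tag["id"] for tag in by_class])
print([tag["id"] for tag in by_attrs])
print([tag["id"] for tag in by_href])
```

A plain string such as id="link" must equal the whole attribute value, which explains the empty list in the output above; the compiled pattern matches any value that contains it.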

recursive: whether to search all descendants; defaults to True

print(soup.find_all('a'))
print(soup.find_all('a', recursive=False))

[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
[]
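
The empty list appears because recursive=False looks only at direct children, and the only direct child tag of soup is html. A minimal sketch of the same effect at different levels of the tree, assuming a shortened inline stand-in for the demo page:

```python
from bs4 import BeautifulSoup

# Assumption: a shortened stand-in for the demo page.
html = (
    '<html><body><p class="title">Intro</p>'
    '<p class="course">See <a id="link1">Basic Python</a></p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# The only direct child tag of the document is <html>.
top_level = [tag.name for tag in soup.find_all(True, recursive=False)]

# Both <p> tags are direct children of <body>, so they are found.
direct_paragraphs = soup.body.find_all("p", recursive=False)

# <a> is a grandchild of <body>, so a non-recursive search misses it.
direct_links = soup.body.find_all("a", recursive=False)

print(top_level, len(direct_paragraphs), direct_links)
```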

string: the search string for the text region inside <>…</>

print(soup.find_all(string="Basic Python"))
print(soup.find_all(string=re.compile("python")))

['Basic Python']
['This is a python demo page', 'The demo python introduces several python courses.']
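
A plain string has to equal the whole text node, and a compiled pattern is case-sensitive unless told otherwise, which is why the paragraph beginning with a capitalized "Python" is absent from the output above. The sketch below illustrates both points on a shortened inline stand-in for the demo page; the variable names are illustrative.

```python
import re

from bs4 import BeautifulSoup

# Assumption: a shortened stand-in for the demo page.
html = (
    '<html><head><title>This is a python demo page</title></head>'
    '<body><p class="course">Python is a wonderful language.</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the entire text node, so this finds nothing.
exact = soup.find_all(string="python")

# A compiled pattern matches part of the text, case-sensitively by default.
case_sensitive = soup.find_all(string=re.compile("python"))

# re.IGNORECASE also picks up the capitalized "Python".
case_insensitive = soup.find_all(string=re.compile("python", re.IGNORECASE))

# Each result is a text node; .parent gives the tag that contains it.
owners = [text.parent.name for text in case_insensitive]
print(exact, case_sensitive, owners)
```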

Note: <tag>(..) is equivalent to <tag>.find_all(..), and soup(..) is equivalent to soup.find_all(..)
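
The call shortcut can be checked with a short sketch like this one, which assumes a shortened inline stand-in for the demo page:

```python
from bs4 import BeautifulSoup

# Assumption: a shortened stand-in for the demo page.
html = (
    '<html><body><p class="course">See <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Calling the soup object directly behaves like soup.find_all(...).
via_call = soup("a")
via_method = soup.find_all("a")

# The same shortcut works on any tag, with the same arguments.
paragraph_links = soup.p("a", id="link2")

print(via_call == via_method)
print([tag["id"] for tag in paragraph_links])
```

The shortcut is compact, though spelling out find_all tends to be easier to read in longer scripts.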

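Extended methods:
<>.find(): searches and returns only one result; same parameters as .find_all()
<>.find_parents(): searches among ancestor nodes and returns a list; same parameters as .find_all()
<>.find_parent(): returns one result from the ancestor nodes; same parameters as .find()
<>.find_next_siblings(): searches among following sibling nodes and returns a list; same parameters as .find_all()
<>.find_next_sibling(): returns one result from the following sibling nodes; same parameters as .find()
<>.find_previous_siblings(): searches among preceding sibling nodes and returns a list; same parameters as .find_all()
<>.find_previous_sibling(): returns one result from the preceding sibling nodes; same parameters as .find()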
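
The extended methods listed above can be exercised together in one short sketch. It assumes a shortened inline stand-in for the demo page, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened stand-in for the demo page.
html = (
    '<html><body><p class="title"><b>Intro</b></p>'
    '<p class="course">See <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, or None when nothing matches.
first_link = soup.find("a")
no_table = soup.find("table")

# Search upward through the ancestors.
enclosing_p = first_link.find_parent("p")
ancestors = [tag.name for tag in first_link.find_parents(["p", "body"])]

# Search forward and backward among siblings, skipping text nodes.
next_link = first_link.find_next_sibling("a")
later_links = first_link.find_next_siblings("a")
second_link = soup.find(id="link2")
previous_link = second_link.find_previous_sibling("a")
earlier_links = second_link.find_previous_siblings("a")

print(enclosing_p["class"], ancestors)
print(next_link["id"], previous_link["id"])
```

Since the single-result methods return None when there is no match, it is safer to check the result before indexing into it.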
