A Detailed Look at the Python Beautiful Soup Library

Source: Internet | Editor: 程序博客网 | Time: 2024/05/29 09:14


A BeautifulSoup object corresponds to the entire content of an HTML/XML document.


from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>asd</p>", "html.parser")
print(soup.prettify())

Output:

<p>
 asd
</p>


Beautiful Soup parsers




Basic elements of the Beautiful Soup class



Example:

URL used: https://python123.io/ws/demo.html

Content:

<html><head><title>This is a python demo page</title></head><body><p class="title"><b>The demo python introduces several python courses.</b></p><p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p></body></html>



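Practice with the basic elements

from bs4 import BeautifulSoup
import requests

r = requests.get("https://python123.io/ws/demo.html")
soup = BeautifulSoup(r.text, "html.parser")
# Print the first <title> tag
print(soup.title)
# Print the first <a> tag
print(soup.a)
# print(soup.prettify())  # pretty-print the HTML page
# Name of the <a> tag
print(soup.a.name)
# Name of the tag that directly encloses <a>
print(soup.a.parent.name)
# The tag's attributes, as key-value pairs
print(soup.a.attrs)
# The string inside the tag
print(soup.a.string)

Output: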

<title>This is a python demo page</title>
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
a
p
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
Basic Python
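
The same attributes can be tried without a network request. The following is a minimal sketch that assumes a trimmed inline copy of the demo page (the `html` string is made up for illustration); it also shows dict-style access to a single attribute, which the example above does not use.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened inline snippet modeled on the demo page.
html = (
    '<p class="course">Courses: '
    '<a href="http://www.icourse163.org/course/BIT-268001" '
    'class="py1" id="link1">Basic Python</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

tag = soup.a                    # first <a> tag in the document
name = tag.name                 # tag name as a string
parent_name = tag.parent.name   # name of the enclosing tag
attrs = tag.attrs               # all attributes as a dict
href = tag["href"]              # one attribute, dict-style access
classes = tag.get("class")      # class is multi-valued, so it is a list
text = tag.string               # the string inside the tag

print(name, parent_name, href, classes, text)
```

Note that `tag["href"]` raises `KeyError` when the attribute is absent, while `tag.get("href")` returns `None`.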


Basic structure of HTML:



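Downward traversal of the tag tree:

.contents: list of child nodes; stores all direct children of <tag> in a list
.children: iterator over child nodes; similar to .contents, used to loop over the direct children
.descendants: iterator over descendant nodes; covers every descendant, used for looping

from bs4 import BeautifulSoup
import requests

r = requests.get("https://python123.io/ws/demo.html")
soup = BeautifulSoup(r.text, "html.parser")
print(soup.head)
print(soup.head.contents)
# Child nodes of <body>
print(soup.body.contents)
# 5 child nodes
print(len(soup.body.contents))

Output: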

<head><title>This is a python demo page</title></head>
[<title>This is a python demo page</title>]
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
5

Iterating over the child nodes:

for child in soup.body.children:
    print(child)

Output:
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

Iterating over the descendant nodes:

for child in soup.body.descendants:
    print(child)

Output:
<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.

<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:


<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Advanced Python
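
As the output shows, both iterators yield strings (including whitespace-only ones) as well as tags. A small sketch of how one might keep only the tags, assuming a made-up inline snippet rather than the demo page:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumption: a small hypothetical document with newlines between the tags.
html = "<body>\n<p><b>Bold text</b></p>\n<p>Plain <a>link</a></p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

# .contents counts the newline strings between the <p> tags as children.
child_count = len(soup.body.contents)

# .children yields direct children only; keep the Tag objects.
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]

# .descendants walks the whole subtree in document order.
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(child_count, child_tags, descendant_tags)
```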

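Upward traversal of the tag tree:

.parent: the node's parent tag
.parents: iterator over the node's ancestor tags, used to loop over the ancestors

print(soup.title.parent)

Output:

<head><title>This is a python demo page</title></head>

for parent in soup.a.parents:
    print(parent.name)

Output: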

p
body
html
[document]
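Sibling traversal: sibling traversal only happens between child nodes that share the same parent.
Text between one tag pair and the next also counts as a node.
.next_sibling: returns the next sibling node in HTML document order
.previous_sibling: returns the previous sibling node in HTML document order
.next_siblings: iterator over all following sibling nodes in HTML document order
.previous_siblings: iterator over all preceding sibling nodes in HTML document order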
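
Because text counts as a node, the sibling of a tag is often a string rather than another tag. A minimal sketch, assuming a made-up one-paragraph snippet modeled on the demo page:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumption: a hypothetical paragraph with two links separated by text.
html = '<p>Start <a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

nxt = first.next_sibling            # the text node between the links
prev = first.previous_sibling       # the text node before the first link
after = list(first.next_siblings)   # every following sibling, in order

# Keep only the tags among the following siblings.
after_tags = [s for s in first.next_siblings if isinstance(s, Tag)]

print(repr(nxt), repr(prev), len(after), after_tags)
```

When a node has no sibling in the requested direction, `.next_sibling` and `.previous_sibling` are `None`, so it is worth checking before using the result.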
The find_all search function
<>.find_all(name, attrs, recursive, string, **kwargs)
Returns a list that holds the search results.
name: search string for tag names
from bs4 import BeautifulSoup
import requests

r = requests.get("https://python123.io/ws/demo.html")
soup = BeautifulSoup(r.text, "html.parser")
print(soup.find_all('a'))
Output: [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

print(soup.find_all(['a','b']))

Output: [<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
Print all tags:
for tag in soup.find_all(True):
    print(tag.name)
Output:
html
head
title
body
p
b
p
a
a
Print all tags whose name contains "a":
# Uses the re regular-expression library
import re
for tag in soup.find_all(re.compile('a')):
    print(tag.name)
Output:
head
a
a

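attrs: search string for tag attribute values; a specific attribute can be named
print(soup.find_all("p", "course"))
print(soup.find_all(id="link1"))
# A plain string must match the whole attribute value, so "link" finds nothing
print(soup.find_all(id="link"))
print(soup.find_all(id=re.compile("link")))

Output: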

[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
[]
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
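
The attribute filters can also be written with the `class_` keyword or an `attrs` dict. The sketch below assumes a shortened inline snippet (not the real demo page) so that it runs without network access:

```python
import re
from bs4 import BeautifulSoup

# Assumption: a trimmed hypothetical snippet modeled on the demo page.
html = (
    '<p class="title">Intro</p>'
    '<p class="course">'
    '<a class="py1" id="link1">Basic Python</a>'
    '<a class="py2" id="link2">Advanced Python</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all("p", class_="course")    # same as find_all("p", "course")
by_dict = soup.find_all(attrs={"id": "link1"})    # attribute filter as a dict
by_regex = soup.find_all(id=re.compile("link"))   # partial match needs a regex
exact_only = soup.find_all(id="link")             # plain strings match exactly

print(by_class, by_dict, by_regex, exact_only)
```

The keyword is spelled `class_` because `class` is a reserved word in Python.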


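recursive: whether to search all descendants; defaults to True

print(soup.find_all('a'))
# With recursive=False only the direct children of soup are searched
print(soup.find_all('a', recursive=False))

Output: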

[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
[]
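
The empty list appears because the only direct child of `soup` is `<html>`. A brief sketch, assuming a made-up two-paragraph document, of where `recursive=False` does return results:

```python
from bs4 import BeautifulSoup

# Assumption: a small hypothetical document with links nested inside paragraphs.
html = "<html><body><p>One <a>first</a></p><p>Two <a>second</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")                          # searches every descendant
top_links = soup.find_all("a", recursive=False)         # direct children of soup only
body_paras = soup.body.find_all("p", recursive=False)   # <p> tags directly under <body>
body_links = soup.body.find_all("a", recursive=False)   # the links sit one level deeper

print(len(all_links), len(top_links), len(body_paras), len(body_links))
```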


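string: search string for the text between <> and </>

print(soup.find_all(string="Basic Python"))
# The regex is case-sensitive, so strings that only contain "Python" are not matched
print(soup.find_all(string=re.compile("python")))
Output: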
['Basic Python']
['This is a python demo page', 'The demo python introduces several python courses.']


Note: <tag>(..) is equivalent to <tag>.find_all(..), and soup(..) is equivalent to soup.find_all(..)
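
A short sketch of this shorthand, assuming a made-up inline snippet:

```python
from bs4 import BeautifulSoup

# Assumption: a hypothetical paragraph with two links.
html = '<p><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a></p>'
soup = BeautifulSoup(html, "html.parser")

called = soup("a")             # calling the object directly...
searched = soup.find_all("a")  # ...gives the same result as find_all
in_tag = soup.p(id="link2")    # works on any tag, with the same arguments

print(called, searched, in_tag)
```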

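Extended methods:
<>.find(): searches and returns only one result; same parameters as .find_all()
<>.find_parents(): searches the ancestor nodes and returns a list; same parameters as .find_all()
<>.find_parent(): returns one result from the ancestor nodes; same parameters as .find()
<>.find_next_siblings(): searches the following sibling nodes and returns a list; same parameters as .find_all()
<>.find_next_sibling(): returns one result from the following sibling nodes; same parameters as .find()
<>.find_previous_siblings(): searches the preceding sibling nodes and returns a list; same parameters as .find_all()
<>.find_previous_sibling(): returns one result from the preceding sibling nodes; same parameters as .find()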
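
The methods listed above can be sketched as follows. The `html` string is a made-up snippet modeled on the demo page, not the page itself:

```python
from bs4 import BeautifulSoup

# Assumption: a trimmed hypothetical snippet with two sibling links.
html = (
    '<body><p class="course">'
    '<a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>'
    '</p></body>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                               # first match only
missing = soup.find("table")                         # no match gives None
parent = first.find_parent("p")                      # nearest enclosing <p>
ancestors = [t.name for t in first.find_parents()]   # innermost ancestor first
nxt = first.find_next_sibling("a")                   # skips the text node in between
prev = nxt.find_previous_sibling("a")                # back to the first link

print(first, missing, parent.name, ancestors, nxt, prev)
```

Unlike `.next_sibling`, the `find_next_sibling("a")` call filters by tag name, so the intervening text node does not get in the way.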
