A Detailed Guide to the Python Beautiful Soup Library
Source: Internet  Editor: 程序博客网  Date: 2024/05/29 09:14
A BeautifulSoup object represents the entire content of an HTML/XML document.

from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>asd</p>", "html.parser")
print(soup.prettify())

Output:
<p>
 asd
</p>
Beautiful Soup parsers
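The second argument to BeautifulSoup selects the parser. As a minimal sketch, assuming the parser names documented by bs4 ("html.parser" from the standard library, plus "lxml", "xml" and "html5lib", which need the third-party packages lxml or html5lib installed), the hypothetical helper parse_with_fallback below tries a preferred parser and falls back to the built-in one when it is missing:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_fallback(markup, parsers=("lxml", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first usable parser."""
    for parser in parsers:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is not installed.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, used_parser = parse_with_fallback("<p>asd</p>")
print(used_parser)    # "lxml" if it is installed, otherwise "html.parser"
print(soup.p.string)  # asd
```

The rest of this article uses "html.parser" throughout, so no extra package is needed.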
Basic elements of the Beautiful Soup classes
Example:
URL used: https://python123.io/ws/demo.html
Content:
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>
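Basic element exercise:

from bs4 import BeautifulSoup
import requests

r = requests.get("https://python123.io/ws/demo.html")
soup = BeautifulSoup(r.text, "html.parser")
# Print the first <title> tag
print(soup.title)
# Print the first <a> tag
print(soup.a)
# print(soup.prettify())  # show the HTML page in a normalized, indented format
# Name of the <a> tag
print(soup.a.name)
# Name of the tag that directly contains <a>
print(soup.a.parent.name)
# Attributes of the tag, as key-value pairs (a dict)
print(soup.a.attrs)
# Text inside the tag
print(soup.a.string)

Output: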
<title>This is a python demo page</title>
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
a
p
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
Basic Python
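The exercise above touches Tag, name, attrs and string. As a small offline sketch, the snippet below uses a made-up two-tag HTML string (an assumption, not the demo page) to show the Python types behind those elements, including Comment, which is what .string returns when a tag contains only an HTML comment:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up markup for illustration: one tag holds a comment, the other plain text.
html = '<b><!--This is a comment--></b><p class="title">This is not a comment</p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p
print(isinstance(tag, Tag))                 # True
print(tag.name)                             # p
print(tag.attrs)                            # {'class': ['title']}
print(type(tag.string) is NavigableString)  # True
print(isinstance(soup.b.string, Comment))   # True
print(soup.b.string)                        # This is a comment
```

Because a Comment prints like ordinary text, checking the type is the way to tell the two apart when that distinction matters.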
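Basic structure of HTML:
Downward traversal of the tag tree:
.contents  list of child nodes; stores all direct children of <tag> in a list
.children  iterator over child nodes; similar to .contents, used to loop over the direct children
.descendants  iterator over descendant nodes; covers every descendant, used for looping

from bs4 import BeautifulSoup
import requests

r = requests.get("https://python123.io/ws/demo.html")
soup = BeautifulSoup(r.text, "html.parser")
print(soup.head)
print(soup.head.contents)
# Child nodes of <body>
print(soup.body.contents)
# <body> has 5 child nodes
print(len(soup.body.contents))

Output: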
<head><title>This is a python demo page</title></head>
[<title>This is a python demo page</title>]
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
5
Iterating over the child nodes:

for child in soup.body.children:
    print(child)

Output (the blank lines printed for the '\n' text nodes are omitted):
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
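Iterating over the descendant nodes:

for child in soup.body.descendants:
    print(child)

Output (the blank lines printed for the '\n' text nodes are omitted):
<p class="title"><b>The demo python introduces several python courses.</b></p>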
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Advanced Python
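Both .children and .descendants yield text nodes (including the '\n' strings between tags) as well as tags. A minimal sketch, using a shortened made-up copy of the demo body so it runs without network access, of keeping only the Tag objects:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, made-up version of the demo page body.
html = """<body>
<p class="title"><b>Intro</b></p>
<p class="course">Text <a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>
</body>"""
soup = BeautifulSoup(html, "html.parser")

# Direct children only: the two <p> tags (the '\n' strings are filtered out).
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]
# All descendants, in document order.
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(len(soup.body.contents))  # 5
print(child_tags)               # ['p', 'p']
print(descendant_tags)          # ['p', 'b', 'p', 'a', 'a']
```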
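Upward traversal of the tag tree:
.parent  the parent tag of a node
.parents  iterator over a node's ancestor tags, used to loop over the ancestors

print(soup.title.parent)

Output:
<head><title>This is a python demo page</title></head>

for parent in soup.a.parents:
    print(parent.name)

Output: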
p
body
html
[document]
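The last name in that list, [document], is the BeautifulSoup object itself, which has no parent. A short sketch on a made-up one-link document showing this, and how the ancestor names could be joined into a simple path (the path format is only an illustration):

```python
from bs4 import BeautifulSoup

# Made-up minimal document containing a single link.
html = '<html><body><p class="course"><a id="link1">Basic Python</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

ancestors = [parent.name for parent in soup.a.parents]
print(ancestors)    # ['p', 'body', 'html', '[document]']

# Drop the '[document]' entry and reverse to read from the root downwards.
path = "/".join(reversed(ancestors[:-1]))
print(path)         # html/body/p

# The BeautifulSoup object is the top of the tree, so its parent is None.
print(soup.parent)  # None
```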
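Sibling traversal of tags: sibling traversal only happens between child nodes that share the same parent node.
Text between tag pairs also counts as a node, so a sibling is not necessarily a tag.
.next_sibling  the next sibling node in HTML text order
.previous_sibling  the previous sibling node in HTML text order
.next_siblings  iterator over all following sibling nodes in HTML text order
.previous_siblings  iterator over all preceding sibling nodes in HTML text order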
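A minimal sketch of the four sibling attributes, run on a made-up paragraph modelled on the demo page. It also shows that the nearest sibling of a tag is often a text node, so a type check is needed when only tags are wanted:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up paragraph: text, a link, text, a link, text.
html = ('<p class="course">Python courses: <a id="link1">Basic Python</a>'
        ' and <a id="link2">Advanced Python</a>.</p>')
soup = BeautifulSoup(html, "html.parser")
first_link = soup.a

print(repr(first_link.next_sibling))      # ' and '
print(repr(first_link.previous_sibling))  # 'Python courses: '

# Keep only the tags among the following siblings.
following_tags = [s for s in first_link.next_siblings if isinstance(s, Tag)]
print([t["id"] for t in following_tags])  # ['link2']

# Everything before the first link, nearest first.
print(list(first_link.previous_siblings)) # ['Python courses: ']

# <p> is the only top-level node here, so it has no sibling at all.
print(soup.p.next_sibling)                # None
```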
Content search function find_all
<>.find_all(name, attrs, recursive, string, **kwargs)
Returns a list that holds the search results.
name: search string matched against tag names

from bs4 import BeautifulSoup
import requests

r = requests.get("https://python123.io/ws/demo.html")
soup = BeautifulSoup(r.text, "html.parser")
print(soup.find_all('a'))

Output:
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

print(soup.find_all(['a','b']))

Output:
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

Print all tags:

for tag in soup.find_all(True):
    print(tag.name)

Output:
html
head
title
body
p
b
p
a
a
Print all tags whose name contains "a":

# Use the regular expression library re
import re

for tag in soup.find_all(re.compile('a')):
    print(tag.name)

Output:
head
a
a
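head appears in that output because a compiled pattern is applied with its search() method, so it matches anywhere in the tag name. A short sketch, on a made-up skeleton document, contrasting "contains", "starts with" and "exactly" patterns:

```python
import re

from bs4 import BeautifulSoup

# Made-up skeleton document with seven tags.
html = "<html><head><title>t</title></head><body><p><b>x</b><a>y</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

contains_a = [t.name for t in soup.find_all(re.compile("a"))]
starts_with_b = [t.name for t in soup.find_all(re.compile("^b"))]
exactly_a = [t.name for t in soup.find_all(re.compile("^a$"))]

print(contains_a)     # ['head', 'a']
print(starts_with_b)  # ['body', 'b']
print(exactly_a)      # ['a']
```

For an exact name, passing the plain string 'a' is simpler than anchoring a pattern.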
attrs: search string matched against tag attribute values; a specific attribute can be named

print(soup.find_all("p","course"))
print(soup.find_all(id="link1"))
print(soup.find_all(id="link"))
print(soup.find_all(id=re.compile("link")))

Output:
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
[]
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
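In find_all("p", "course") the second positional argument is treated as a CSS class. As a sketch on a made-up fragment, the same filter can also be written with the class_ keyword or an attrs dictionary, and a plain string for id must match the whole value while a compiled pattern may match part of it:

```python
import re

from bs4 import BeautifulSoup

# Made-up fragment modelled on the demo page.
html = ('<p class="title">T</p>'
        '<p class="course"><a class="py1" id="link1">Basic</a>'
        '<a class="py2" id="link2">Advanced</a></p>')
soup = BeautifulSoup(html, "html.parser")

by_position = soup.find_all("p", "course")               # positional: CSS class
by_keyword = soup.find_all("p", class_="course")         # class_ avoids the reserved word
by_dict = soup.find_all("p", attrs={"class": "course"})  # explicit attribute dict
print(by_position == by_keyword == by_dict)              # True

print(soup.find_all(id="link"))                          # [] (no id equals "link")
link_ids = [a["id"] for a in soup.find_all(id=re.compile("link"))]
print(link_ids)                                          # ['link1', 'link2']
```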
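recursive: whether to search all descendants; defaults to True. With recursive=False only the direct children are searched.

print(soup.find_all('a'))
print(soup.find_all('a',recursive=False))

Output: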
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
[]
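The empty list appears because the only direct child of soup is <html>, so no <a> is found at that level. A small sketch on a made-up document in which recursive=False does return something, because the search starts from <body>:

```python
from bs4 import BeautifulSoup

# Made-up document: one <p> directly under <body>, one nested inside a <div>.
html = "<html><body><p>one <a>x</a></p><div><p>two</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

all_p = soup.body.find_all("p")                      # every descendant <p>
direct_p = soup.body.find_all("p", recursive=False)  # direct children only

print(len(all_p))                            # 2
print(len(direct_p))                         # 1
print(direct_p[0].get_text())                # one x
print(soup.find_all("a", recursive=False))   # []
```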
string: search string matched against the text between <>…</>

print(soup.find_all(string="Basic Python"))
print(soup.find_all(string=re.compile("python")))

Output:
['Basic Python']
['This is a python demo page', 'The demo python introduces several python courses.']
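A search by string returns the matching text nodes rather than the tags around them, and the pattern is case-sensitive unless told otherwise. A minimal sketch on a made-up fragment, using .parent to get back to the enclosing tag and re.I for a case-insensitive match:

```python
import re

from bs4 import BeautifulSoup

# Made-up fragment with one lowercase and one capitalised "python".
html = '<title>This is a python demo page</title><p><a id="link1">Basic Python</a></p>'
soup = BeautifulSoup(html, "html.parser")

lowercase_only = soup.find_all(string=re.compile("python"))
any_case = soup.find_all(string=re.compile("python", re.I))

print(lowercase_only)                     # ['This is a python demo page']
print(any_case)                           # ['This is a python demo page', 'Basic Python']
# Each result is a text node; .parent is the tag that contains it.
print([s.parent.name for s in any_case])  # ['title', 'a']
```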
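Note: <tag>(..) is equivalent to <tag>.find_all(..), and soup(..) is equivalent to soup.find_all(..)
Extension methods:
<>.find()  searches and returns only one result; same parameters as .find_all()
<>.find_parents()  searches the ancestor nodes and returns a list; same parameters as .find_all()
<>.find_parent()  returns one result from the ancestor nodes; same parameters as .find()
<>.find_next_siblings()  searches the following sibling nodes and returns a list; same parameters as .find_all()
<>.find_next_sibling()  returns one result from the following sibling nodes; same parameters as .find()
<>.find_previous_siblings()  searches the preceding sibling nodes and returns a list; same parameters as .find_all()
<>.find_previous_sibling()  returns one result from the preceding sibling nodes; same parameters as .find()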
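A brief sketch of these methods on a made-up document modelled on the demo page. The single-result variants return None when nothing matches, so the result should be checked before it is used:

```python
from bs4 import BeautifulSoup

# Made-up document with two sibling links inside one paragraph.
html = ('<html><body><p class="course">Python courses: '
        '<a id="link1">Basic Python</a> and '
        '<a id="link2">Advanced Python</a>.</p></body></html>')
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                      # first match only
print(first["id"])                          # link1
print(soup.find("table"))                   # None (no match)
print(soup("a") == soup.find_all("a"))      # True (call shortcut)

print(first.find_parent("p")["class"])      # ['course']
parent_names = [t.name for t in first.find_parents(["p", "body"])]
print(parent_names)                         # ['p', 'body'] (nearest first)

second = first.find_next_sibling("a")
print(second["id"])                         # link2
print(len(first.find_next_siblings("a")))   # 1
print(second.find_previous_sibling("a")["id"])         # link1
print(len(second.find_previous_siblings("a")))         # 1
print(first.find_previous_sibling("a"))     # None
```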