BeautifulSoup Learning Notes 6
The previous notes covered BeautifulSoup's document-searching features.
The find_all() method, combined with a suitable filter, satisfies almost every need a web scraper has.
This article finishes the search topic with CSS selectors, then briefly introduces modifying the document tree and outputting the results.
The next article will cover bs4's encoding and output formats in detail.
That will complete this walkthrough of the bs4 documentation.
1 CSS selectors
Beautiful Soup supports the most commonly-used CSS selectors. Just pass a string into the .select() method of a Tag object or the BeautifulSoup object itself.
For reference, see the CSS selectors reference: http://www.w3school.com.cn/cssref/css_selectors.asp
All of the selector syntax used below is documented there.
We continue with the Alice document as the example:
>>> html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p>"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, "html.parser")
1.1 Find tags
>>> soup.select('p')
[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>, <p class="story">...</p>]
1.2 Find tags by CSS class:
>>> soup.select(".sister")
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1.3 Find tags by ID:
>>> soup.select("#link2")
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
>>> soup.select("a#link2")
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
1.4 Find tags by attribute value
>>> soup.select("a[href$='lacie']")  # <a> tags whose href value ends with "lacie"
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
>>> soup.select("a[href^='http']")  # <a> tags whose href value starts with "http"
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.select("a[href*='example']")  # <a> tags whose href value contains "example"
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1.5 Find tags beneath other tags
>>> soup.select('body a')  # matches any descendant; <a> need not be a direct child of <body>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1.6 Find tags directly beneath other tags:
>>> soup.select("p > a")  # <a> tags that are direct children of a <p> element
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.select("p > a:nth-of-type(2)")  # the second <a> child of its parent <p>
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
>>> soup.select("p > #link1")  # elements with id="link1" that are direct children of a <p>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
1.7 Find the siblings of tags:
>>> soup.select("#link1 + .sister")  # the sibling immediately after #link1
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
>>> soup.select("#link1 ~ .sister")  # all later siblings of #link1
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
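select() always returns a list. When only the first match is needed, Beautiful Soup (4.4 and later) also provides select_one(), which returns a single tag or None. A minimal sketch against a small fragment:

```python
from bs4 import BeautifulSoup

html = ('<p class="story">'
        '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,'
        '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
        '</p>')
soup = BeautifulSoup(html, "html.parser")

first_sister = soup.select_one(".sister")  # first match only, or None if nothing matches
all_sisters = soup.select(".sister")       # always a list, possibly empty
```

This avoids the `results[0]` pattern (and the IndexError it raises on an empty result) when only one element is wanted.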
2 unwrap() and extract()
BeautifulSoup can also modify the document tree: you can change a tag's name, its attributes, and its string contents.
Among the tree-modification methods, unwrap() removes a tag but keeps its contents in place, while extract() removes a tag together with everything inside it:
>>> from bs4 import BeautifulSoup
>>> html_doc = """<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;"""
>>> soup = BeautifulSoup(html_doc, "html.parser")
>>> soup.a.unwrap()
<a class="sister" href="http://example.com/elsie" id="link1"></a>
>>> soup
Elsie,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
>>> soup.a.extract()
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
>>> soup
Elsie, and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
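The other tree modifications mentioned earlier (renaming a tag, changing its attributes, replacing its string) can be sketched like this, adapted from the style of the bs4 documentation:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

tag.name = "blockquote"    # rename the tag
tag["class"] = "verybold"  # change an existing attribute
tag["id"] = 1              # add a new attribute
tag.string = "New text"    # replace the string contents
print(soup)
# -> <blockquote class="verybold" id="1">New text</blockquote>
```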
3 Output
3.1 prettify()
The prettify() method formats the Beautiful Soup document tree and returns it as Unicode text, with each XML/HTML tag on its own line.
Both the BeautifulSoup object and its Tag nodes can call prettify().
Wrapped in print(), the output is much easier to read:
>>> soup.prettify()
'<html>\n <head>\n  <title>\n   The Dormouse\'s story\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The Dormouse\'s story\n   </b>\n  </p>\n  <p class="story">\n   Once upon a time there were three little sisters; and their names were\n   <a class="sister" href="http://example.com/elsie" id="link1">\n    Elsie\n   </a>\n   ,\n   <a class="sister" href="http://example.com/lacie" id="link2">\n    Lacie\n   </a>\n   and\n   <a class="sister" href="http://example.com/tillie" id="link3">\n    Tillie\n   </a>\n   ;\nand they lived at the bottom of a well.\n  </p>\n  <p class="story">\n   ...\n  </p>\n </body>\n</html>'
>>> print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
3.2 Non-pretty printing
In Python 2, the str() function returns a string encoded in UTF-8; in Python 3 it returns a Unicode string.
You can also call encode() to get a bytestring, and decode() to get Unicode.
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> str(soup.a)
'<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
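The bytestring/Unicode round trip can be sketched with Tag.encode() and Tag.decode():

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="link1">Elsie</a>', "html.parser")

as_bytes = soup.a.encode("utf-8")  # a bytestring (bytes)
as_text = soup.a.decode()          # a Unicode string (str)
```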
The next notes will cover encoding and output formats in detail, which will wrap up the bs4 documentation.
3.3 get_text()
>>> print(soup.get_text())
The Dormouse's storyThe Dormouse's storyOnce upon a time there were three little sisters; and their names wereElsie,Lacie andTillie;and they lived at the bottom of a well....
>>> print(soup.get_text('***'))  # join the text fragments with *** as the separator
***The Dormouse's story*********The Dormouse's story******Once upon a time there were three little sisters; and their names were***Elsie***,***Lacie*** and***Tillie***;and they lived at the bottom of a well.******...***
>>> print(soup.get_text('***', strip=True))  # strip whitespace from each fragment first
The Dormouse's story***The Dormouse's story***Once upon a time there were three little sisters; and their names were***Elsie***,***Lacie***and***Tillie***;and they lived at the bottom of a well.***...
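get_text() also works on an individual tag, not just the whole soup, and the related .stripped_strings generator yields each text fragment with surrounding whitespace removed. A small sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Once upon a time <a>Elsie</a> lived.</p>', "html.parser")

link_text = soup.a.get_text()            # text of a single tag
fragments = list(soup.stripped_strings)  # each fragment, whitespace-stripped
```

Iterating .stripped_strings is often more convenient than get_text(strip=True) when the fragments need to be processed individually.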