BeautifulSoup Learning Notes 6


The previous few notes covered BeautifulSoup's document-searching features. The find_all() method, combined with a suitable filter, covers almost everything a web scraper needs.

This post finishes the search API with CSS selectors, then briefly introduces modifying the parse tree and outputting the result.
The next post will cover bs4 encodings and output formats in detail,
which will wrap up the bs4 documentation.

1 CSS selectors

Beautiful Soup supports the most commonly-used CSS selectors. Just pass a string into the .select() method of a Tag object or the BeautifulSoup object itself.
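As a quick illustration (a minimal sketch using a made-up two-paragraph document): .select() always returns a list of every match, while its companion .select_one() returns only the first match, or None.

```python
from bs4 import BeautifulSoup

html_doc = '<p class="title"><b>Hello</b></p><p class="story">World</p>'
soup = BeautifulSoup(html_doc, "html.parser")

# select() always returns a list of every matching tag (possibly empty)
matches = soup.select("p")
print(len(matches))                # 2

# select_one() returns only the first match, or None if nothing matches
print(soup.select_one("p.story"))  # <p class="story">World</p>
print(soup.select_one("p.nope"))   # None
```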

For reference, the W3School CSS selector guide covers all of the syntax used below: http://www.w3school.com.cn/cssref/css_selectors.asp

Continuing with the "Alice" document as the example:

>>> html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p>""">>> from bs4 import BeautifulSoup>>> soup = BeautifulSoup(html_doc,"html.parser")

1.1 Find tags

>>> soup.select('p')
[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>, <p class="story">...</p>]

1.2 Find tags by CSS class:

>>> soup.select(".sister")
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

1.3 Find tags by ID:

>>> soup.select("#link2")
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
>>> soup.select("a#link2")
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

1.4 Find tags by attribute value

>>> soup.select("a[href$='lacie']")  # <a> tags whose href value ends with "lacie"
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
>>> soup.select("a[href^='http']")  # <a> tags whose href value starts with "http"
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.select("a[href*='example']")  # <a> tags whose href value contains "example"
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

1.5 Find tags beneath other tags

>>> soup.select('body a')  # <a> need not be a direct child of <body>; any descendant matches
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

1.6 Find tags directly beneath other tags:

>>> soup.select("p > a")  # all <a> tags that are direct children of a <p>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.select("p > a:nth-of-type(2)")  # the second <a> within its parent <p>
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
>>> soup.select("p > #link1")  # elements with id="link1" that are direct children of a <p>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

1.7 Find the siblings of tags:

>>> soup.select("#link1 + .sister")  # the sibling immediately after #link1
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
>>> soup.select("#link1 ~ .sister")  # all siblings after #link1
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

2 unwrap(), extract()

BeautifulSoup can also modify the parse tree: you can change a tag's name, its attributes, and its string contents.

Among the tree-modification methods, unwrap() removes a tag while keeping its contents in the tree, and extract() removes a tag together with its contents:
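Each of those changes is a plain assignment on the Tag object. A minimal sketch (the <b> tag and the new values here are made-up examples):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
tag.name = "blockquote"    # change the tag's name
tag["class"] = "verybold"  # change an existing attribute
tag["id"] = "quote1"       # add a new attribute
tag.string = "New text"    # replace the tag's string contents
print(tag)
```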

>>> from bs4 import BeautifulSoup
>>> html_doc = """<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;"""
>>> soup = BeautifulSoup(html_doc, "html.parser")
>>> soup.a.unwrap()
<a class="sister" href="http://example.com/elsie" id="link1"></a>
>>> soup
Elsie,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
>>> soup.a.extract()
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
>>> soup
Elsie, and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
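The difference is easy to miss in the transcript above: unwrap() drops only the tag itself and leaves its contents in the tree, while extract() removes the tag together with its contents and returns the removed element. A minimal side-by-side sketch (with a made-up one-line document):

```python
from bs4 import BeautifulSoup

# unwrap(): the <b> tag disappears, but its text stays in the tree
soup1 = BeautifulSoup("<p><b>bold</b> text</p>", "html.parser")
soup1.b.unwrap()
print(soup1)    # <p>bold text</p>

# extract(): the <b> tag and its text are both removed, and returned
soup2 = BeautifulSoup("<p><b>bold</b> text</p>", "html.parser")
removed = soup2.b.extract()
print(soup2)    # <p> text</p>
print(removed)  # <b>bold</b>
```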

3 Output

3.1 prettify()

The prettify() method formats the Beautiful Soup parse tree and returns it as a Unicode string, with each XML/HTML tag on its own line.
Both the BeautifulSoup object and its Tag nodes support prettify().
Wrapped in print(), the output is much easier to read:

>>> soup.prettify()
'<html>\n <head>\n  <title>\n   The Dormouse\'s story\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The Dormouse\'s story\n   </b>\n  </p>\n  <p class="story">\n   Once upon a time there were three little sisters; and their names were\n   <a class="sister" href="http://example.com/elsie" id="link1">\n    Elsie\n   </a>\n   ,\n   <a class="sister" href="http://example.com/lacie" id="link2">\n    Lacie\n   </a>\n   and\n   <a class="sister" href="http://example.com/tillie" id="link3">\n    Tillie\n   </a>\n   ;\nand they lived at the bottom of a well.\n  </p>\n  <p class="story">\n   ...\n  </p>\n </body>\n</html>'
>>> print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

3.2 Non-pretty printing

In Python 3, the str() function returns a Unicode string (in Python 2 it returned a string encoded in UTF-8).
You can also call encode() to get a bytestring, and decode() to get Unicode explicitly.

>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> str(soup.a)
'<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
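encode() and decode() work on the soup and on individual tags alike; a minimal sketch (with a made-up document containing a non-ASCII character, to make the difference visible):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>caf\u00e9</b>", "html.parser")

print(str(soup.b))      # <b>café</b>          -- a Python 3 (Unicode) str
print(soup.b.encode())  # b'<b>caf\xc3\xa9</b>' -- a bytestring, UTF-8 by default
print(soup.b.decode())  # <b>café</b>          -- an explicit Unicode string
```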

The next post will cover encodings and output formats in detail, and that will wrap up the bs4 documentation.

3.3 get_text()

>>> print(soup.get_text())
The Dormouse's storyThe Dormouse's storyOnce upon a time there were three little sisters; and their names wereElsie,Lacie andTillie;and they lived at the bottom of a well....
>>> print(soup.get_text('***'))  # join the text pieces with *** as the separator
***The Dormouse's story*********The Dormouse's story******Once upon a time there were three little sisters; and their names were***Elsie***,***Lacie*** and***Tillie***;and they lived at the bottom of a well.******...***
>>> print(soup.get_text('***', strip=True))  # also strip leading/trailing whitespace from each piece
The Dormouse's story***The Dormouse's story***Once upon a time there were three little sisters; and their names were***Elsie***,***Lacie***and***Tillie***;and they lived at the bottom of a well.***...