BeautifulSoup库的用法详解

来源：互联网发布：seo外链发到什么平台编辑：程序博客网时间：2024/06/05 19:50

BeautifulSoup库是灵活又方便的网页解析库，处理高效，支持多种解析器。利用它不用编写正则表达式即可方便地实现网页信息的提取。
BeautifulSoup库的安装，可参见博客：http://blog.csdn.net/qq_29186489/article/details/78581249
常用的解析库如下：
这里写图片描述
基本使用如下所示：

#_*_coding: utf-8_*_from bs4 import BeautifulSoupimport requestsr=requests.get("http://python123.io/ws/demo.html")soup=BeautifulSoup(r.text,"lxml")print(soup.prettify())print(soup.title.string)

标签选择器
选择元素

from bs4 import BeautifulSoupimport requestsr=requests.get("http://python123.io/ws/demo.html")soup=BeautifulSoup(r.text,"lxml")print(soup.title)print(type(soup.title))print(soup.head)print(soup.p)

获取名称

#_*_coding: utf-8_*_from bs4 import BeautifulSoupimport requestsr=requests.get("http://python123.io/ws/demo.html")soup=BeautifulSoup(r.text,"lxml")print(soup.title.name)

获取属性

#_*_coding: utf-8_*_from bs4 import BeautifulSoupimport requestsr=requests.get("http://python123.io/ws/demo.html")soup=BeautifulSoup(r.text,"lxml")print(soup.a.attrs['href'])print(soup.a['href'])

获取内容

#_*_coding: utf-8_*_from bs4 import BeautifulSoupimport requestsr=requests.get("http://python123.io/ws/demo.html")soup=BeautifulSoup(r.text,"lxml")print(soup.a.string)

嵌套选择

#_*_coding: utf-8_*_from bs4 import BeautifulSoupimport requestsr=requests.get("http://python123.io/ws/demo.html")soup=BeautifulSoup(r.text,"lxml")print(soup.head.title.string)

子节点和子孙节点

#_*_coding: utf-8_*_from bs4 import BeautifulSoupimport requestsr=requests.get("http://python123.io/ws/demo.html")soup=BeautifulSoup(r.text,"lxml")#以列表的形式返回子节点print(soup.p.contents)#以迭代器的形式返回子节点print(soup.p.children)for i,child in enumerate(soup.p.children):    print(i,child)#以迭代器的形式返回子孙节点print(soup.p.descendants)for i,child in enumerate(soup.p.descendants):    print(i,child)

父节点和祖先节点

#_*_coding: utf-8_*_from bs4 import BeautifulSoupimport requestsr=requests.get("http://python123.io/ws/demo.html")soup=BeautifulSoup(r.text,"lxml")#获取a标签的父节点print(soup.a.parent)#获取a标签的父节点和祖先节点print(list(enumerate(soup.a.parents)))

获取兄弟节点

#_*_coding: utf-8_*_from bs4 import BeautifulSoupimport requestsr=requests.get("http://python123.io/ws/demo.html")soup=BeautifulSoup(r.text,"lxml")print(soup.p.previous_siblings)print(soup.p.next_siblings)

标准选择器
根据name进行查找

#_*_coding: utf-8_*_from bs4 import BeautifulSoupimport requestsr=requests.get("http://python123.io/ws/demo.html")soup=BeautifulSoup(r.text,"lxml")print(soup.find_all("p"))for item in soup.find_all("p"):    print(item.find_all("a"))

根据属性进行查找

#_*_coding: utf-8_*_from bs4 import BeautifulSoupimport requestsr=requests.get("http://python123.io/ws/demo.html")soup=BeautifulSoup(r.text,"lxml")print(soup.find_all(attrs={"id":"link2"}))print(soup.find_all(attrs={"class":"title"}))print(soup.find_all(id="link2"))print(soup.find_all(class_="title"))

如果使用find方法，返回单个元素
find_parents()返回所有祖先节点
find_parent()返回直接父节点
find_next_siblings()返回后面所有兄弟节点
find_next_sibling()返回后面第一个兄弟节点
find_previous_siblings()返回前面所有的兄弟节点
find_previous_sibling()返回前面第一个的兄弟节点
find_all_next()返回节点后所有符合条件的节点
find_next()返回节点后第一个符合条件的节点
find_all_previous()返回节点后所有符合条件的节点
find_previous()返回第一个符合条件的节点
CSS选择器
通过select()直接传入CSS选择器即可完成选择

#_*_coding: utf-8_*_from bs4 import BeautifulSoupimport requestsr=requests.get("http://python123.io/ws/demo.html")soup=BeautifulSoup(r.text,"lxml")#CSS选择器print(soup.select(".title b"))print(soup.select("#link2"))#获取属性值print(soup.select("#link2")[0]["class"])print(soup.select("#link2")[0].attrs["class"])#获取标签里面的文本内容print(soup.select("#link2")[0].get_text())

总结：
1：推荐使用LXML解析库，必要时使用html.parse
2：标签悬着器筛选功能弱，但是速度快
3：建议使用find(),find_all()查询匹配单个结果或者多个结果
4：如果对CSS熟悉的话，建议使用select()
5：熟悉常用的获取属性和文本值的方法

阅读全文

2 0