Writing a Web Crawler with Beautiful Soup

Source: Internet   Editor: 程序博客网   Date: 2024/06/03 16:30

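1. Concept: Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document.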

2. Installation: three options

1) easy_install beautifulsoup4

2) pip install beautifulsoup4

3) Download the source package, unpack it, and run

sudo python setup.py install

Installing the lxml parser:

easy_install lxml

pip install lxml

Another parser you can choose: html5lib

easy_install html5lib

pip install html5lib
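
Since lxml and html5lib are optional installs, a script may want to fall back gracefully when one is missing. The following is a minimal sketch of that idea, not part of the original post: make_soup is a hypothetical helper name, and the preference order is an assumption. It relies on Beautiful Soup raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse markup with the first installed parser.

    Returns a (soup, parser_name) pair.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one in the list.
            continue
    raise RuntimeError("no usable parser found")


soup, parser_used = make_soup("<p>hi</p>")
print(parser_used)    # whichever parser was found first
print(soup.p.string)
```

html.parser ships with Python itself, so it serves as the last resort here.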

3. Once everything is installed, you can start writing the crawler.

Official documentation (Chinese translation): http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

1) Import bs4 first

from bs4 import BeautifulSoup
2) Pass in a string or a file handle
soup = BeautifulSoup(open("index.html"))

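The document is first converted to Unicode, and HTML entities are converted to the corresponding Unicode characters.
When no parser is named, Beautiful Soup picks the most suitable installed parser to parse the document.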
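
To make the two input styles and the entity conversion concrete, here is a small sketch. The HTML snippet is made up for illustration, io.StringIO stands in for a real file handle such as open("index.html"), and the parser is named explicitly so the result does not depend on which optional parsers happen to be installed.

```python
import io

from bs4 import BeautifulSoup

# Illustrative markup (not from the original post) containing two HTML entities.
html = u"<html><head><title>Demo &amp; test</title></head><body><p>caf&eacute;</p></body></html>"

# 1) Parse from a string.
soup_from_string = BeautifulSoup(html, "html.parser")

# 2) Parse from a file-like object; StringIO stands in for open("index.html").
soup_from_handle = BeautifulSoup(io.StringIO(html), "html.parser")

# Entities arrive as ordinary Unicode characters.
print(soup_from_string.title.string)
print(soup_from_handle.p.string)
```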

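4. Kinds of objects: Beautiful Soup turns a complex HTML document into a tree structure in which every node is a Python object.
All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.
1) Tag: corresponds to a tag in the original XML or HTML document
tag = soup.p
Reading a tag's name and attributes: tag.name gives the tag name, and tag['xxx'] gives the value of the attribute xxx
e.g. tag.name; tag['class']
2) NavigableString: a navigable string that holds the text inside a tag. In Python 2 it can be converted to a plain Unicode string with unicode() (on Python 3, use str() instead):
unicode_string = unicode(tag.string)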

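3) BeautifulSoup: this object represents the document as a whole, and most of the time it can be treated like a Tag object.
It does not correspond to a real HTML or XML tag, but it still has a .name attribute:
print soup.name
Output: [document]
4) Comment: a special kind of NavigableString.
When it appears as part of the document, it is printed with special formatting:
print(soup.b.prettify())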
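
The four kinds of objects can be seen side by side in a short sketch. The markup below is invented for illustration, and the example uses print() so it runs on Python 3 as well.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Illustrative markup: one paragraph with a class, one <b> holding only a comment.
html = '<p class="intro">Hello</p><b><!--a comment--></b>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                     # Tag
text = tag.string                # NavigableString
comment = soup.b.string          # Comment (a NavigableString subclass)

print(type(tag).__name__, tag.name, tag["class"])
print(type(text).__name__, text)
print(type(comment).__name__, comment)
print(soup.name)                 # the BeautifulSoup object's special name
print(soup.b.prettify())         # the comment is shown with its <!-- --> markers
```

Note that tag["class"] comes back as a list, because class is a multi-valued attribute in HTML.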
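5. A simple crawler:
# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup

# Fetch the page and parse it with Python's built-in HTML parser
soup = BeautifulSoup(urllib2.urlopen("http://blog.csdn.net/u014032819"), 'html.parser')
print soup.title     # the <title> tag
print type(soup.p)   # type of the first <p> tag
print soup.name      # [document]
Note: because this fetches a web page rather than reading a local file or a string variable, the urllib2 package has to be imported (this is Python 2; on Python 3 the equivalent module is urllib.request).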

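6. Navigating the tree:
1) Tag names:
As used in the code above: soup.head, soup.title, soup.body. Each returns the first tag with that name.
To get every tag with a given name: soup.find_all('a')
A tag's .contents and .children attributes:
.contents: returns the tag's children as a list, including only its direct children (.children yields the same nodes as an iterator):
Building on the previous example:
# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen("http://blog.csdn.net/u014032819"), 'html.parser')
print soup.title
head_tag = soup.head
print head_tag.contents
meta_tag = head_tag.contents[1]   # second child of <head>
print meta_tag
print meta_tag.contents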

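This prints the contents of head, takes the second child node of head, and prints the children of that node. Because it is a meta tag, the result is empty.
Note: a NavigableString has no .contents attribute, so asking for .contents on the first child (a string) raises an error.
.descendants iterates recursively over all of a tag's descendants:
# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen("http://blog.csdn.net/u014032819"), 'html.parser')
head_tag = soup.head
for child in head_tag.descendants:
    print(child)
print len(list(soup.children))
print len(list(soup.descendants))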

Running it shows 6 child nodes and 912 descendant nodes for this page.
.string: if a tag has exactly one child and that child is a NavigableString, .string returns that child.
# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen("http://blog.csdn.net/u014032819"), 'html.parser')
print soup.title
head_tag = soup.head
print head_tag.contents
meta_tag = head_tag.contents[2]   # third child of <head>
print meta_tag
print meta_tag.string
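.strings and .stripped_strings: for a tag that contains more than one string.
.strings also yields whitespace-only strings such as blank lines; .stripped_strings skips them and strips the surrounding whitespace.
for string in soup.strings:
    print(repr(string))

for string in soup.stripped_strings:
    print(repr(string))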
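
The attributes covered in this part can be checked without a network connection. The sketch below uses a small made-up snippet instead of the blog page, so the counts are easy to verify by hand.

```python
from bs4 import BeautifulSoup, Tag

# Illustrative markup: a div with two direct children, one of them nested.
html = "<div><h1>Title</h1><p>First <b>bold</b> text</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# .contents is a list of direct children; .children iterates over the same nodes.
direct_names = [child.name for child in div.contents]
same_nodes = list(div.children) == div.contents

# .descendants walks the whole subtree, tags and strings alike.
descendant_count = len(list(div.descendants))
descendant_tags = [n.name for n in div.descendants if isinstance(n, Tag)]

# .string works only when a tag has a single string child.
h1_string = soup.h1.string
p_string = soup.p.string          # None: <p> has three children

# .strings keeps surrounding whitespace; .stripped_strings removes it.
raw = list(soup.p.strings)
clean = list(soup.p.stripped_strings)

print(direct_names, descendant_count, descendant_tags)
print(repr(h1_string), repr(p_string))
print(raw, clean)
```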

2) Parent nodes: .parent is the direct parent; .parents iterates over all ancestors
3) Sibling nodes: .next_sibling, .previous_sibling
4) Moving forward and back in parse order: .next_elements, .previous_elements
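
A short sketch of upward, sideways, and parse-order navigation follows. The list markup is invented for illustration and is written without whitespace between tags, so that the siblings are the li tags themselves rather than newline strings.

```python
from bs4 import BeautifulSoup

# Illustrative markup with no whitespace between the <li> elements.
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")
second = soup.find_all("li")[1]

# Upward: direct parent, then every ancestor up to the document object.
parent_name = second.parent.name
ancestor_names = [p.name for p in second.parents]

# Sideways: nodes that share the same parent.
prev_text = second.previous_sibling.string
next_text = second.next_sibling.string

# Parse order: whatever was parsed immediately before and after this tag.
before = second.previous_element   # the string inside the first <li>
after = list(second.next_elements) # "two", then <li>three</li>, then "three"

print(parent_name, ancestor_names)
print(prev_text, next_text)
print(repr(before), after)
```

The difference between siblings and elements shows up in the last two lines: the previous sibling is the whole first li tag, while the previous element is the string inside it.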

7. Searching the tree (filters)
1) find() and find_all()
find() returns a single result; find_all() returns all of them
# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen("http://blog.csdn.net/u014032819"), 'html.parser')
print soup.find('title')
print soup.find_all('title')
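
The difference in return types matters most when nothing matches. A minimal offline sketch, using made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>a</p><p>b</p>", "html.parser")

first_p = soup.find("p")          # a single Tag: the first match
all_p = soup.find_all("p")        # a list-like result with every match

missing_one = soup.find("table")      # None when nothing matches
missing_all = soup.find_all("table")  # empty result when nothing matches

print(first_p, len(all_p))
print(missing_one, len(missing_all))
```

Because find() can return None, chained calls such as soup.find("table").find("tr") should be guarded when the tag may be absent.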

2) find_all(name, attrs, recursive, string, **kwargs):
a) Pass a name: find all b tags
soup.find_all('b')
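b) Pass a regular expression: Beautiful Soup matches tag names against the expression. The example below finds all tags whose names start with b.
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)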
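c) Pass a list: find the tags that match any item in the list.
soup.find_all(["a", "b"])
(By contrast, soup.find_all("p", "title") finds p tags whose CSS class is title.)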
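d) Pass True: returns every tag, but no string nodes.
You can also define a function and pass it in; find_all returns the tags for which the function returns True.
e) Pass keyword arguments: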
soup.find_all(id='link2')
f) Search by CSS class: look tags up by class name
soup.find_all("a", class_="sister")
CSS selectors:
soup.select("p:nth-of-type(3)")
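g) Search the string content of the document with string.
soup.find_all(string="Apple")
h) The limit parameter: caps the number of results returned.
soup.find_all("a", limit=2)
i) The recursive parameter: with recursive=False, only direct children are searched and deeper descendants are ignored.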
soup.html.find_all("title", recursive=False)
