Extracting Data in a Python Crawler: Basic Usage of BeautifulSoup4

Source: Internet  Editor: 程序博客网  Time: 2024/06/17 20:52

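How It Works

Beautiful Soup walks through the HTML string and turns the document tree into a tree of Python objects, with an object -> attribute -> object structure.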
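The object -> attribute -> object structure can be sketched as below. The sample HTML is made up for illustration, and "html.parser" (the standard-library parser) is used here as an assumption so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = (
    "<html><head><title>Demo</title></head>"
    "<body><p class='intro'><b>Hello</b> world</p></body></html>"
)

# Assumption: "html.parser" instead of the article's "lxml".
soup = BeautifulSoup(html, "html.parser")

# Every attribute access returns another object, so lookups can be chained.
bold = soup.body.p.b
print(type(bold).__name__)   # Tag
print(bold.name)             # b
print(bold.parent.name)      # p
print(bold.parent["class"])  # ['intro']
```

Each step in the chain returns the first matching tag below the previous object, which is why the chain reads like a path through the tree.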

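Basic Approach

Create the object by converting the HTML
Use the object to get the data you want
Object attributes
- Get the matching tag and its content
- Advantage: simple
- Disadvantage: cannot search by attribute value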
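A minimal sketch of the trade-off described above, assuming a made-up two-paragraph document and the standard-library "html.parser": attribute access is short, but it only ever returns the first matching tag, so filtering by an attribute value needs a method such as find().

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = """
<p class="title">First paragraph</p>
<p class="story">Second paragraph</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute access: simple, but always the first <p> in the document.
first_p = soup.p
print(first_p["class"])    # ['title']
print(first_p.get_text())  # First paragraph

# A tag that does not exist yields None rather than raising an error.
missing = soup.table
print(missing)             # None

# Selecting by attribute value requires a method call instead.
story = soup.find("p", class_="story")
print(story.get_text())    # Second paragraph
```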

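Object Methods

find_all() and select() do similar jobs. Of the two, find_all() is the more powerful: it has the text= parameter, and select() cannot search with regular expressions.
find_all()
- Searches by tag, attribute, or regular expression; it can also search by string, and these conditions can be combined
select()
- Searches by tag, class name, id, or attribute, and these conditions can be combined
- Uses part of the CSS selector syntax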
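The comparison above might look like this in practice. This is a sketch under a few assumptions: the HTML fragment is invented, "html.parser" is used so no extra parser is needed, and the string search uses the keyword string=, which is the name of the text= parameter in Beautiful Soup 4.4.0 and later.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample fragment, used only for illustration.
html = """
<p class="story">
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
  <b>bold text</b>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() with a regular expression on the tag name.
by_regex = soup.find_all(re.compile("^b"))
print([tag.name for tag in by_regex])        # ['b']

# find_all() searching the text nodes themselves.
by_string = soup.find_all(string=re.compile("^El"))
print(by_string)                             # ['Elsie']

# find_all() combining a tag name with an attribute filter.
by_attr = soup.find_all("a", id="link2")
print([a.get_text() for a in by_attr])       # ['Lacie']

# select() expresses the same kind of lookup as a CSS selector.
by_css = soup.select("a#link2")
print([a.get_text() for a in by_css])        # ['Lacie']

# select() combining tag, class, and the direct-child combinator.
by_class = soup.select("p.story > a.sister")
print(len(by_class))                         # 2
```

Both methods return a list, so an empty result is an empty list rather than None.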

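Basic Usage

    from bs4 import BeautifulSoup

    html = ''    # the HTML string
    # Create the Beautiful Soup object
    soup = BeautifulSoup(html, "lxml")  # 'lxml' is the parser; omitting it triggers a warning
    # Alternatively, create the object from a local HTML file
    # soup = BeautifulSoup(open('index.html'), "lxml")
    print(soup)  # printing the object outputs the document as a string
    # Pretty-print the contents of the soup object
    print(soup.prettify())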

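Four Kinds of Objects

Beautiful Soup converts a complex HTML document into a complex tree structure in which every node is a Python object. All of these objects fall into four kinds:

1. Tag
    Put simply, a Tag is one of the tags in the HTML, i.e. a tag object.
    For example:
        soup.p
        soup.a
        print(type(soup.p))  # <class 'bs4.element.Tag'>
    Note that this finds only the first matching tag in the whole document.
    A Tag has two important attributes:
    name
        soup itself is not an ordinary Tag, but it also has a name attribute; the soup object is a special case and its name is [document]
        soup.p.name is p
    attrs
        soup.p.attrs                  # the attributes of the p tag, as a dict
        soup.p['color']               # get a single attribute
        soup.p['class'] = "newClass"  # modify an attribute
        del soup.p['class']           # delete an attribute

2. NavigableString
    Gets the text inside a tag
        print(soup.p.string)        # The Dormouse's story
        print(type(soup.p.string))  # <class 'bs4.element.NavigableString'>

3. BeautifulSoup
    The BeautifulSoup object represents the whole document. It behaves like a Tag object, a special Tag, and we can get its type, name, and attributes:
        print(type(soup.name))  # <class 'str'>
        print(soup.name)        # [document]
        print(soup.attrs)       # {} (the document itself has no attributes)

4. Comment
    A Comment object holds the content of an HTML comment inside an element.
    It is a special type of NavigableString, and its output does not include the comment markers.
        print(type(soup.a.string))  # <class 'bs4.element.Comment'> when the a tag contains only a comment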
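The four kinds of objects can be told apart with isinstance(). The snippet below is a sketch on an invented fragment whose a tag contains only a comment, parsed with "html.parser" as an assumption. Because a Comment prints without its markers, checking the type is the way to avoid mistaking it for ordinary text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical sample fragment: the <a> tag holds nothing but a comment.
html = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<a href="#"><!-- Elsie --></a>'
)
soup = BeautifulSoup(html, "html.parser")

# Tag: a single HTML tag.
print(isinstance(soup.p, Tag))                 # True

# NavigableString: the text inside a tag.
text = soup.p.string
print(isinstance(text, NavigableString))       # True

# BeautifulSoup: the whole document, with the special name "[document]".
print(soup.name)                               # [document]

# Comment: a NavigableString subclass, printed without <!-- and -->.
comment = soup.a.string
print(isinstance(comment, Comment))            # True
print(isinstance(comment, NavigableString))    # True
print(repr(str(comment)))                      # ' Elsie '
```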

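Traversing the Document Tree

Object attributes:
    contents
        print(soup.head.contents)  # the direct children of the head tag, returned as a list
    children
        for i in soup.head.children:  # an iterable
            print(i)
    descendants: all descendant nodes
        print(list(soup.head.descendants))  # descendants is a generator, so wrap it in list() to see the nodes
    string
        soup.string

Searching the Document Tree

find_all returns a list of objects
    find_all(name, attrs, recursive, text, **kwargs)
    name parameter
        Finds all tags whose name is name
        Accepted types:
            A. A string
                soup.find_all('b')  # returns a list
            B. A regular expression
                soup.find_all(re.compile("^b"))  # requires import re
            C. A list
                soup.find_all(["a", "b"])
    Keyword parameters
        soup.find_all(id='link2')
    text parameter
        The text parameter searches the string content of the document. Like the name parameter, it accepts a string, a regular expression, or a list. (In Beautiful Soup 4.4.0 and later this parameter is named string; text is the older name.)
        soup.find_all(text="Elsie")
        soup.find_all(text=["Tillie", "Elsie", "Lacie"])
        soup.find_all(text=re.compile("Dormouse"))

CSS Selectors

The select() method
    soup.select() returns a list of objects
    (1) Search by tag name
        print(soup.select('title'))
    (2) Search by class name
        print(soup.select('.sister'))
    (3) Search by id
        print(soup.select('#link1'))
    (4) Combined search
        A combined search uses several conditions at once and is written like a CSS selector; tag names, class names, and ids combine in the same way. For example, to find the element with id link1 inside a p tag, separate the two with a space:
        print(soup.select('p #link1'))
        To search direct children only, separate the two with >
        print(soup.select("head > title"))
    (5) Attribute search
        A search can also include attributes, written in square brackets. The attribute and the tag belong to the same node, so there must be no space between them, otherwise nothing will match.
        print(soup.select('a[class="sister"]'))
        Attributes can likewise be combined with the searches above: conditions on different nodes are separated by a space, conditions on the same node are not.
        print(soup.select('p a[href="http://example.com/elsie"]'))

Getting Content

    The get_text() method returns the content of a tag, all of it, including the text of its child tags
        soup = BeautifulSoup(html, 'lxml')
        print(type(soup.select('title')))
        print(soup.select('title')[0].get_text())
        for title in soup.select('title'):
            print(title.get_text())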
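To make the difference between contents, children, and descendants concrete, here is a small sketch on an invented div, again assuming "html.parser" as the parser. It also shows get_text() collecting the text of nested tags.

```python
from bs4 import BeautifulSoup

# Hypothetical sample fragment, used only for illustration.
html = '<div id="box"><p>Once <b>upon</b> a time</p><p>The end</p></div>'
soup = BeautifulSoup(html, "html.parser")
box = soup.div

# .contents: direct children only, as a list.
direct = [child.name for child in box.contents]
print(direct)                      # ['p', 'p']

# .children: the same direct children, as an iterator.
same_nodes = list(box.children) == box.contents
print(same_nodes)                  # True

# .descendants: every nested node, tags and text alike.
all_nodes = list(box.descendants)
print(len(all_nodes))              # 7

# get_text(): the text of a tag, including the text of its child tags.
first_p = soup.select("p")[0]
print(first_p.get_text())          # Once upon a time
```

The seven descendants are the two p tags, the b tag, and the four text nodes, which is why descendants is usually much longer than contents.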
