Installing and Using BeautifulSoup4

Source: Internet | Editor: 程序博客网 | Date: 2024/04/30 00:43

I. Installing BeautifulSoup4

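  Method 1: open cmd and run: pip install beautifulsoup4 (on older setups: easy_install beautifulsoup4). Note that the package name BeautifulSoup, without the 4, installs the old 3.x series.
   Method 2: download the source from http://www.crummy.com/software/BeautifulSoup/bs4/download/,
open cmd, change into the downloaded directory, and run: python setup.py install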
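
A quick way to confirm that either method worked is to import the package and print its version. This is only a minimal sketch; it assumes the install went into the same Python interpreter you run it with. The package installs under the import name bs4, not beautifulsoup4.

```python
# Check that the bs4 package is importable and report its version.
import bs4

installed_version = bs4.__version__
print("bs4 version:", installed_version)
```

If the import fails with ModuleNotFoundError, the install most likely went to a different interpreter.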

II. Using BeautifulSoup4
1. Import
    from bs4 import BeautifulSoup
    Note: if your BeautifulSoup version is 3.x, the import is instead: from BeautifulSoup import BeautifulSoup
2. Example
    HTML document:
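    html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""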

Code:
from bs4 import BeautifulSoup
# Name the parser explicitly; "html.parser" ships with Python
soup = BeautifulSoup(html_doc, "html.parser")

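You can now start using the various features.

   soup.X (X is any tag name; returns the whole tag, including its attributes and contents)

For example: soup.title
    # <title>The Dormouse's story</title>

    soup.p
    # <p class="title"><b>The Dormouse's story</b></p>

   soup.a (note: returns only the first match)
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>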
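
To make the dotted access concrete, here is a small self-contained sketch. The markup is a hypothetical cut-down sample rather than the full html_doc, and the variable names are made up for illustration. It shows that soup.X returns the first matching tag, that the access can be chained, and that a tag name which does not occur gives None instead of raising.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, trimmed down for this sketch.
sample_html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p>
<a id="link1">Elsie</a><a id="link2">Lacie</a></body></html>"""
soup = BeautifulSoup(sample_html, "html.parser")

first_link = soup.a                  # only the first <a> is returned
title_text = soup.title.string       # text inside <title>
nested_text = soup.body.p.b.string   # attribute access can be chained
absent = soup.table                  # no <table> in the document

print(first_link["id"])   # link1
print(title_text)         # The Dormouse's story
print(absent)             # None
```

Because a missing tag yields None, chaining through a tag that might not exist (for example soup.table.tr) is worth guarding with a None check.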

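   soup.find_all('a') (find_all returns every match)
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

   find can also search by attribute
   soup.find(id="link3")
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>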
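
The attribute search extends beyond id. The sketch below, using a hypothetical three-link sample, filters by id, by CSS class (the keyword is class_ because class is reserved in Python), and by an exact href value, and shows that find returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for this sketch.
sample_html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(sample_html, "html.parser")

by_id = soup.find(id="link3")                                # first tag with this id
by_class = soup.find_all("a", class_="sister")               # all <a> with this class
by_href = soup.find("a", href="http://example.com/lacie")    # exact attribute value
missing = soup.find(id="no-such-id")                         # nothing matches

print(by_id.get_text())   # Tillie
print(len(by_class))      # 3
print(by_href["id"])      # link2
print(missing)            # None
```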

   To read a particular attribute of a tag, combine find_all with get
    for link in soup.find_all('a'):
        print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie
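
One detail worth seeing in code is what happens when an attribute is absent. In the hypothetical sample below the second link has no href: get returns None for it, while passing href=True to find_all skips such tags entirely. The sketch also shows that class comes back as a list, since it can hold several values.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second link deliberately has no href.
sample_html = (
    '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'
    '<a id="link2">Lacie</a>'
)
soup = BeautifulSoup(sample_html, "html.parser")

# .get() returns None instead of raising when the attribute is missing.
hrefs = [link.get("href") for link in soup.find_all("a")]

# href=True keeps only the tags that actually carry an href attribute.
only_with_href = [link["href"] for link in soup.find_all("a", href=True)]

# class is multi-valued, so it is returned as a list of strings.
classes = soup.a["class"]

print(hrefs)            # ['http://example.com/elsie', None]
print(only_with_href)   # ['http://example.com/elsie']
print(classes)          # ['sister', 'big']
```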

   To extract all of the text in the HTML document, use get_text()
   print(soup.get_text())
    # The Dormouse's story
    # The Dormouse's story
    # Once upon a time there were three little sisters; and their names were
    # Elsie,
    # Lacie and
    # Tillie;
    # and they lived at the bottom of a well.
    # ...
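
The raw get_text() output keeps whatever whitespace the markup contained. As a rough sketch on a hypothetical one-paragraph sample, the separator argument and strip=True give a tidier string, and stripped_strings yields the individual text pieces.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with stray spaces around the text nodes.
sample_html = '<p> Elsie, <a href="#">Lacie</a> and <b>Tillie</b> </p>'
soup = BeautifulSoup(sample_html, "html.parser")

# Default: text nodes are concatenated exactly as they appear.
plain = soup.get_text()

# Strip each text node, drop empty ones, and join the rest with one space.
clean = soup.get_text(" ", strip=True)

# The same stripped pieces, as a list.
pieces = list(soup.stripped_strings)

print(repr(plain))   # ' Elsie, Lacie and Tillie '
print(clean)         # Elsie, Lacie and Tillie
print(pieces)        # ['Elsie,', 'Lacie', 'and', 'Tillie']
```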

   To parse an HTML file on disk, use:
    soup = BeautifulSoup(open("index.html"), "html.parser")
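
A slightly more careful variant opens the file in a with block so that it is closed afterwards. The sketch below first writes a throwaway index.html into a temporary directory so it can run anywhere; the file content and the UTF-8 encoding are assumptions made for the example.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Create a throwaway index.html so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "index.html")
with open(path, "w", encoding="utf-8") as f:
    f.write("<html><head><title>Demo page</title></head><body></body></html>")

# Parse the file; the with block closes the handle once parsing is done.
with open(path, encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

page_title = soup.title.string
print(page_title)   # Demo page
```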
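   Objects in BeautifulSoup
Tag (corresponds to a tag in the HTML)
    tag.attrs (returns all of the tag's attributes as a dictionary)
  A tag's attributes can be added, removed, and modified directly, just like dictionary entries (in this example tag is a <blockquote> element)

    tag['class'] = 'verybold'
    tag['id'] = 1
    tag
    # <blockquote class="verybold" id="1">Extremely bold</blockquote>

    del tag['class']
    del tag['id']
    tag
    # <blockquote>Extremely bold</blockquote>

    tag['class']
    # KeyError: 'class'

    print(tag.get('class'))
    # None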
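
The snippet above assumes a tag object already exists. A runnable version might look like the following sketch, where the one-line blockquote document and the helper variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document, just for this sketch.
soup = BeautifulSoup(
    '<blockquote class="boldest">Extremely bold</blockquote>', "html.parser"
)
tag = soup.blockquote

# Add or modify attributes exactly as with a dict.
tag["class"] = "verybold"
tag["id"] = "1"
attrs_after_set = dict(tag.attrs)
print(tag)   # <blockquote class="verybold" id="1">Extremely bold</blockquote>

# Delete them again.
del tag["class"]
del tag["id"]
attrs_after_delete = dict(tag.attrs)
print(tag)   # <blockquote>Extremely bold</blockquote>

# Indexing a missing attribute raises KeyError; .get() returns None.
try:
    tag["class"]
    raised_key_error = False
except KeyError:
    raised_key_error = True
missing_class = tag.get("class")
print(raised_key_error, missing_class)   # True None
```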

    X.contents (X is a tag; returns the tag's children as a list)

    e.g.
    head_tag = soup.head
    head_tag
    # <head><title>The Dormouse's story</title></head>

    head_tag.contents
    # [<title>The Dormouse's story</title>]

    title_tag = head_tag.contents[0]
    title_tag
    # <title>The Dormouse's story</title>

    title_tag.contents
    # ["The Dormouse's story"]
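
As a self-contained sketch of the same walk (the tiny head-only document is an assumption for the example), .contents gives a list, while .children iterates over the same child nodes without building one.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document containing only a <head>.
soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)

head_tag = soup.head
title_tag = head_tag.contents[0]          # first (and only) child of <head>

child_names = [child.name for child in head_tag.children]
title_children = title_tag.contents       # the single text node inside <title>

print(len(head_tag.contents))   # 1
print(child_names)              # ['title']
print(title_children)           # ["The Dormouse's story"]
```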

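   Fixing garbled text when parsing a web page:
    # Python 3 / bs4 version. BeautifulSoup 3 on Python 2 used urllib2,
    # fromEncoding and originalEncoding instead.
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    page = urlopen('http://www.leeon.me')
    # Tell BeautifulSoup which encoding the page bytes are in
    soup = BeautifulSoup(page, 'html.parser', from_encoding='gb18030')

    print(soup.original_encoding)
    print(soup.prettify())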
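
The effect can be tried without a network connection. In this sketch the "page" is a hypothetical byte string that we encode as gb18030 ourselves; passing from_encoding tells BeautifulSoup how to decode it, so the Chinese title comes back intact.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, encoded to gb18030 bytes to imitate a download.
raw_bytes = "<html><head><title>中文标题</title></head></html>".encode("gb18030")

# from_encoding only applies when the markup is bytes, as it is here.
soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="gb18030")

decoded_title = soup.title.string
detected_encoding = soup.original_encoding

print(decoded_title)       # 中文标题
print(detected_encoding)   # the encoding BeautifulSoup ended up using
```

If the encoding is unknown, leaving out from_encoding lets BeautifulSoup guess, which usually works but can misfire on short or mixed-encoding pages.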

0 0