Python: Parsing Data with Beautiful Soup

Source: Internet. Published by: h3c snmp 网管软件. Editor: 程序博客网. Time: 2024/06/06 00:04

Installing Beautiful Soup

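The steps below describe how to install Beautiful Soup on Windows:
1. Download the package from http://www.crummy.com/software/BeautifulSoup/. At the time of writing, the latest version was 4.1.3.
2. Unpack the downloaded archive, for example into D:/python.
3. Open cmd and change to the D:/python/beautifulsoup4-4.1.3/ directory (adjust the path to match your own extraction directory and the version you downloaded):
cd /d D:/python/beautifulsoup4-4.1.3
4. Run the following commands:
python setup.py build
python setup.py install
5. In your IDE, run from bs4 import BeautifulSoup. If no error is raised, the installation succeeded.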
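Step 5 can also be checked from a script. The following is a minimal sketch, assuming Python 3.4 or later (for importlib.util.find_spec); the helper name is_installed is hypothetical and not part of Beautiful Soup.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    # find_spec returns None when the module cannot be located.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("bs4"):
        import bs4
        # bs4 exposes its version string as bs4.__version__.
        print("Beautiful Soup version:", bs4.__version__)
    else:
        print("bs4 not found; repeat the installation steps")
```

Unlike a bare import, this check reports a missing package without raising ImportError, which can be convenient in a setup script.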

Using Beautiful Soup

#!/usr/bin/python
# coding: utf-8
# print(x) with a single argument behaves the same on Python 2 and Python 3.
from bs4 import BeautifulSoup

html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Naming the parser explicitly avoids the warning that newer bs4 releases emit.
soup = BeautifulSoup(html, "html.parser")

print("-------pretty-printed soup------")
print(soup.prettify())

print("Accessing a tag by name returns the first matching tag in the document")
print(soup.title)
print(soup.head)
print(soup.a)
print(soup.p)

print("-----A tag has two important attributes: name and attrs----")
print(soup.name)
print(soup.head.name)
print(soup.title.name)
print(soup.p.attrs)
print(soup.p.string)

print("--select by tag name--")
print(soup.select('title'))

print("--select by class name--")
print(soup.select('.sister'))

print("--select by id--")
print(soup.select('#link1'))

print("--combined selector: the element with id link1 inside a p tag--")
print(soup.select('p #link1'))

print("--direct child selector--")
print(soup.select("head > title"))

print("--select by attribute--")
print(soup.select('a[class="sister"]'))
print(soup.select('a[href="http://example.com/elsie"]'))
print(soup.select('p a[href="http://example.com/elsie"]'))
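The script above prints whole tags. In practice one usually wants the attribute values or the text inside them. The following is a minimal sketch, written for Python 3 and using a shortened copy of the sample document; the variable names links and names are illustrative only.

```python
from bs4 import BeautifulSoup

# A shortened copy of the sample document from the script above.
html = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list of Tag objects; a Tag can be indexed
# like a dictionary to read one of its attributes.
links = [(a["id"], a["href"]) for a in soup.select("a.sister")]
for link_id, href in links:
    print(link_id, href)

# get_text() returns the text contained in a tag.
names = [a.get_text() for a in soup.select("a.sister")]
print(names)
```

Indexing with a["href"] raises KeyError when the attribute is missing; a.get("href") returns None instead, which may be preferable on less regular pages.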

The output is as follows:

E:\python\python_jdk\python.exe E:/python/py_pro/safly/Python_Demo.py
-------pretty-printed soup------
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
Accessing a tag by name returns the first matching tag in the document
<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
-----A tag has two important attributes: name and attrs----
[document]
head
title
{u'class': [u'title'], u'name': u'dromouse'}
The Dormouse's story
--select by tag name--
[<title>The Dormouse's story</title>]
--select by class name--
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
--select by id--
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
--combined selector: the element with id link1 inside a p tag--
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
--direct child selector--
[<title>The Dormouse's story</title>]
--select by attribute--
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
Process finished with exit code 0
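Two details in this output are easy to miss: the class attribute comes back as a list rather than a plain string, and the first link contains only an HTML comment. The following is a small sketch, written for Python 3, that checks both on a reduced copy of the sample markup; the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Reduced copy of the sample markup: one paragraph and the first link.
html = (
    '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
    '<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>'
)
soup = BeautifulSoup(html, "html.parser")

# class can hold several values, so bs4 returns it as a list;
# an ordinary attribute such as name comes back as a single string.
css_classes = soup.p["class"]
name_attr = soup.p["name"]
print(css_classes, name_attr)

# The only child of the first link is a comment, so .string
# yields a Comment object rather than ordinary text.
first_link_string = soup.a.string
is_comment = isinstance(first_link_string, Comment)
print(is_comment)
```

Checking isinstance(..., Comment) before using .string is one way to avoid treating comment text as visible page content.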
