A Brief Introduction to Parsing XML with Python


1. Save the following as feed.xml

<?xml version='1.0' encoding='utf-8'?>  
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>  
  <title>dive into mark</title>  
  <subtitle>currently between addictions</subtitle>  
  <id>tag:diveintomark.org,2001-07-29:/</id>  
  <updated>2009-03-27T21:56:07Z</updated>  
  <link rel='alternate' type='text/html' href='http://diveintomark.org/'/>  
  <entry>  
    <author>  
      <name>Mark</name>  
      <uri>http://diveintomark.org/</uri>  
    </author>  
    <title>Dive into history, 2009 edition</title>  
    <link rel='alternate' type='text/html'  
      href='http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'/>  
    <id>tag:diveintomark.org,2009-03-27:/archives/20090327172042</id>  
    <updated>2009-03-27T21:56:07Z</updated>  
    <published>2009-03-27T17:20:42Z</published>  
    <category scheme='http://diveintomark.org' term='diveintopython'/>  
    <category scheme='http://diveintomark.org' term='docbook'/>  
    <category scheme='http://diveintomark.org' term='html'/>  
    <summary type='html'>Putting an entire chapter on one page sounds
      bloated, but consider this &amp;mdash; my longest chapter so far
      would be 75 printed pages, and it loads in under 5 seconds&amp;hellip;
      On dialup.</summary>
  </entry>  
  <entry>  
    <author>  
      <name>Mark</name>  
      <uri>http://diveintomark.org/</uri>  
    </author>  
    <title>Accessibility is a harsh mistress</title>  
    <link rel='alternate' type='text/html'  
      href='http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress'/>  
    <id>tag:diveintomark.org,2009-03-21:/archives/20090321200928</id>  
    <updated>2009-03-22T01:05:37Z</updated>  
    <published>2009-03-21T20:09:28Z</published>  
    <category scheme='http://diveintomark.org' term='accessibility'/>  
    <summary type='html'>The accessibility orthodoxy does not permit people to  
      question the value of features that are rarely useful and rarely used.</summary>  
  </entry>  
  <entry>  
    <author>  
      <name>Mark</name>  
    </author>  
    <title>A gentle introduction to video encoding, part 1: container formats</title>  
    <link rel='alternate' type='text/html'  
      href='http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats'/>  
    <id>tag:diveintomark.org,2008-12-18:/archives/20081218155422</id>  
    <updated>2009-01-11T19:39:22Z</updated>  
    <published>2008-12-18T15:54:22Z</published>  
    <category scheme='http://diveintomark.org' term='asf'/>  
    <category scheme='http://diveintomark.org' term='avi'/>  
    <category scheme='http://diveintomark.org' term='encoding'/>  
    <category scheme='http://diveintomark.org' term='flv'/>  
    <category scheme='http://diveintomark.org' term='GIVE'/>  
    <category scheme='http://diveintomark.org' term='mp4'/>  
    <category scheme='http://diveintomark.org' term='ogg'/>  
    <category scheme='http://diveintomark.org' term='video'/>  
    <summary type='html'>These notes will eventually become part of a  
      tech talk on video encoding.</summary>  
  </entry>  
</feed>  

-------------------------------------------------------------------------------------------------------------------------------------------

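2. Parsing XML with Python

Python can parse XML documents in several different ways. It ships with DOM and SAX parsers; this article uses the ElementTree library, which is part of the Python standard library.

2.1 Parsing the XML file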

>>> import xml.etree.ElementTree as etree      # ElementTree is part of the standard library, at xml.etree.ElementTree

>>> tree = etree.parse('examples/feed.xml')    # a relative path, as you might write on Linux
>>> tree = etree.parse('C:\\feed.xml')         # a Windows path; note the escaped backslash
    # parse() reads the XML file; its argument can be a filename or a file object

>>> root = tree.getroot()                      # get the root element
>>> root                                       # displayed as follows (the exact repr and address vary by Python version)
<Element {http://www.w3.org/2005/Atom}feed at cd1eb0>
#  http://www.w3.org/2005/Atom is the namespace and feed is the tag name, so the root element is shown as {http://www.w3.org/2005/Atom}feed
#  ElementTree represents XML elements as {namespace}localname
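
As a small sketch of the same steps, the snippet below passes a file object to parse() instead of a filename and then splits the {namespace}localname tag by hand. SAMPLE is a shortened stand-in for feed.xml and split_tag is a helper made up for this illustration; neither is part of ElementTree.

```python
import io
import xml.etree.ElementTree as etree

# SAMPLE is a shortened, hypothetical stand-in for feed.xml.
SAMPLE = b"""<?xml version='1.0' encoding='utf-8'?>
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
  <title>dive into mark</title>
</feed>"""


def split_tag(tag):
    """Split '{namespace}localname' into (namespace, localname).

    Hypothetical helper for this example, not an ElementTree API.
    """
    if tag.startswith('{'):
        namespace, _, localname = tag[1:].partition('}')
        return namespace, localname
    return '', tag


# parse() accepts a file object as well as a filename.
tree = etree.parse(io.BytesIO(SAMPLE))
root = tree.getroot()

namespace, localname = split_tag(root.tag)
print(namespace)
print(localname)
```

Splitting the tag this way is only for display; when searching, ElementTree still expects the full {namespace}localname form.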

2.2 Enumerating the children of an element

In the ElementTree API, an element behaves like a list. The items in the list are the element's children.

>>> root                                   # displayed as follows; this is the element object itself
<Element {http://www.w3.org/2005/Atom}feed at cd1eb0>

>>> root.tag                               # the element's tag, a plain string; note the difference from the above
'{http://www.w3.org/2005/Atom}feed'

>>> len(root)                              # number of child elements (the element behaves like a list)
8

>>> root[4]                                # like a list, it can also be indexed
<Element {http://www.w3.org/2005/Atom}link at e181b0>

>>> for child in root:                     # loop over the child elements and print them
...   print(child)
...
<Element {http://www.w3.org/2005/Atom}title at e2b5d0>
<Element {http://www.w3.org/2005/Atom}subtitle at e2b4e0>
<Element {http://www.w3.org/2005/Atom}id at e2b6c0>
<Element {http://www.w3.org/2005/Atom}updated at e2b6f0>
<Element {http://www.w3.org/2005/Atom}link at e2b4b0>
<Element {http://www.w3.org/2005/Atom}entry at e2b720>
<Element {http://www.w3.org/2005/Atom}entry at e2b510>
<Element {http://www.w3.org/2005/Atom}entry at e2b750>
# The output shows that the root element has 8 children: the feed-level metadata (title, subtitle, id, updated, link) followed by 3 entry elements

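An element behaves like a list of its children.

An XML document is a tree, so the code above already gives us everything needed to enumerate every element in the document: iterate over an element's children, then do the same for each child.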
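
A minimal sketch of that idea follows: a recursive generator that visits an element and then each of its children in turn. SAMPLE is a cut-down feed and walk is a name chosen for this example; the built-in Element.iter() performs a similar traversal without reporting the depth.

```python
import xml.etree.ElementTree as etree

# SAMPLE is a cut-down, hypothetical version of the feed.
SAMPLE = b"""<feed xmlns='http://www.w3.org/2005/Atom'>
  <title>dive into mark</title>
  <entry>
    <author>
      <name>Mark</name>
    </author>
    <title>Dive into history, 2009 edition</title>
  </entry>
</feed>"""


def walk(element, depth=0):
    """Yield (depth, tag) for element and all of its descendants."""
    yield depth, element.tag
    for child in element:          # an element iterates like a list of children
        yield from walk(child, depth + 1)


root = etree.fromstring(SAMPLE)
visited = list(walk(root))

for depth, tag in visited:
    local = tag.rpartition('}')[2]  # drop the {namespace} part for display
    print('  ' * depth + local)
```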

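2.3 Getting an element's attributes

XML is not just a collection of elements; each element also has its own set of attributes. Once you have a reference to an element, you can read its attributes as easily as you would a Python dictionary.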

>>> root.attrib                          # in the XML file: <feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
{'{http://www.w3.org/XML/1998/namespace}lang': 'en'}
# The attribute name on the root element includes its namespace. Compare it with the attributes below.

>>> root[4]                              # child element root[4], the link element
<Element {http://www.w3.org/2005/Atom}link at e181b0>
>>> root[4].attrib                       # root[4] has 3 attributes; note how they are displayed
{'href': 'http://diveintomark.org/',
 'type': 'text/html',
 'rel': 'alternate'}
>>> root[3]                              # child element root[3], the updated element
<Element {http://www.w3.org/2005/Atom}updated at e2b4e0>
>>> root[3].attrib                       # root[3] has no attributes
{}

The attrib property is a dictionary.
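
Because attrib is a dictionary, the usual dictionary access patterns apply. The sketch below, on a shortened hypothetical SAMPLE, reads one attribute by key, reads the namespaced xml:lang attribute, and uses Element.get() with a default for an attribute that is absent. The constant name XML_NS is chosen for this example.

```python
import xml.etree.ElementTree as etree

# SAMPLE is a shortened, hypothetical version of the feed.
SAMPLE = b"""<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
  <link rel='alternate' type='text/html' href='http://diveintomark.org/'/>
  <updated>2009-03-27T21:56:07Z</updated>
</feed>"""

# The xml: prefix is always bound to this namespace.
XML_NS = 'http://www.w3.org/XML/1998/namespace'

root = etree.fromstring(SAMPLE)
link = root[0]
updated = root[1]

href = link.attrib['href']               # plain dictionary lookup
lang = root.get('{%s}lang' % XML_NS)     # namespaced attribute name
missing = updated.get('href', 'n/a')     # default instead of a KeyError

print(href, lang, missing)
```

Note that xmlns itself does not show up in attrib: ElementTree folds the namespace declaration into the {namespace}localname tags.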

2.4 Finding nodes (arbitrary elements) in an XML document

2.4.1 The findall method of an element

>>> import xml.etree.ElementTree as etree
>>> tree = etree.parse('examples/feed.xml')
>>> root = tree.getroot()
>>> root.findall('{http://www.w3.org/2005/Atom}entry')    # findall() returns the child elements matching the query; note the argument format
[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
 <Element {http://www.w3.org/2005/Atom}entry at e2b510>,
 <Element {http://www.w3.org/2005/Atom}entry at e2b540>]
# Found the 3 entry children of the root element

>>> root.tag
'{http://www.w3.org/2005/Atom}feed'
>>> root.findall('{http://www.w3.org/2005/Atom}feed')     # the root element has no feed child (it is the feed element itself)
[]
>>> root.findall('{http://www.w3.org/2005/Atom}author')   # the root element has no direct author child; author is nested inside entry
[]

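In other words, findall() with a plain tag name searches only the direct children of the element it is called on. Now look at the following code: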

>>> tree.findall('{http://www.w3.org/2005/Atom}entry')    # note: this is called on tree, the object returned by etree.parse()
[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
 <Element {http://www.w3.org/2005/Atom}entry at e2b510>,
 <Element {http://www.w3.org/2005/Atom}entry at e2b540>]
# tree.findall(...) can be understood as shorthand for tree.getroot().findall(...)
# The root element does have 3 entry children

>>> tree.findall('{http://www.w3.org/2005/Atom}author')   # the root element has no direct author child
[]
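
Spelling out the full namespace in every query gets tedious. As a hedged sketch, findall() also accepts a second argument that maps prefixes to namespace URIs, so the same queries can be written in prefix:tag form. The prefix atom and the SAMPLE document are choices made for this example; any prefix works as long as it maps to the right URI.

```python
import xml.etree.ElementTree as etree

# SAMPLE is a shortened, hypothetical version of the feed.
SAMPLE = b"""<feed xmlns='http://www.w3.org/2005/Atom'>
  <title>dive into mark</title>
  <entry><title>first</title></entry>
  <entry><title>second</title></entry>
</feed>"""

# The 'atom' prefix is arbitrary; only the URI has to match the document.
NS = {'atom': 'http://www.w3.org/2005/Atom'}

root = etree.fromstring(SAMPLE)

long_form = root.findall('{http://www.w3.org/2005/Atom}entry')
short_form = root.findall('atom:entry', NS)      # same query, shorter spelling
direct_titles = root.findall('atom:title', NS)   # direct children only
authors = root.findall('atom:author', NS)        # no such direct child

print(len(short_form), len(direct_titles), len(authors))
```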

2.4.2 The find method of an element: stop at the first match

The find() method returns the first matching element.

>>> entries = tree.findall('{http://www.w3.org/2005/Atom}entry')           # list of entry elements (there are 3 entry children)
>>> len(entries)
3
>>> title_element = entries[0].find('{http://www.w3.org/2005/Atom}title')  # find the title child of entries[0]
>>> title_element.text                                                     # .text is the text between <title> and </title>
'Dive into history, 2009 edition'

>>> foo_element = entries[0].find('{http://www.w3.org/2005/Atom}foo')      # entries[0] has no foo child
>>> foo_element                                                            # find() returned None, so the interpreter prints nothing
>>> type(foo_element)                                                      # None has the type NoneType
<class 'NoneType'>

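As the code above shows, there are two different situations to keep apart. If element.find('...') returns None, no matching child element was found. If it returns an element that evaluates as False, a match was found, but the matched element has no children of its own.

In a boolean context, an ElementTree element object is considered False if it contains no child elements (that is, if len(element) is 0). This means that if element.find('...') does not test whether find() found a match; it tests whether the matched element has any child elements! To test whether find() returned an element, use if element.find('...') is not None. (I don't fully understand this yet!)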
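
A small sketch may make the distinction concrete. It uses a hypothetical single-entry SAMPLE and deliberately relies on len() and is None rather than if element:, which newer Python versions discourage with a warning. The variable names are chosen for this example.

```python
import xml.etree.ElementTree as etree

# SAMPLE is a hypothetical single entry taken from the feed.
SAMPLE = b"""<entry xmlns='http://www.w3.org/2005/Atom'>
  <author><name>Mark</name></author>
  <title>Dive into history, 2009 edition</title>
</entry>"""

ATOM = '{http://www.w3.org/2005/Atom}'
entry = etree.fromstring(SAMPLE)

title = entry.find(ATOM + 'title')    # found, but it has no child elements
author = entry.find(ATOM + 'author')  # found, and it has one child element
foo = entry.find(ATOM + 'foo')        # no match, so find() returns None

# len() counts child elements; it says nothing about whether find() matched.
print(title is not None, len(title))
print(author is not None, len(author))
print(foo is None)

# The reliable test for "did find() match?" is a comparison with None.
title_text = title.text if title is not None else None
foo_text = foo.text if foo is not None else None
```

Here title is a successful match even though it has zero children, which is exactly the case where a plain truth test would give the wrong answer.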

2.4.3 Finding an element at any depth (without nested lookups)

>>> all_links = tree.findall('.//{http://www.w3.org/2005/Atom}link')   # note the leading './/' (a bare leading '//' triggers a FutureWarning in current Python)
>>> all_links                                                          # the two slashes tell findall(): "don't look only at direct
                                                                       # children; search at any nesting level."
[<Element {http://www.w3.org/2005/Atom}link at e181b0>,  
 <Element {http://www.w3.org/2005/Atom}link at e2b570>,  
 <Element {http://www.w3.org/2005/Atom}link at e2b480>,  
 <Element {http://www.w3.org/2005/Atom}link at e2b5a0>]  
>>> all_links[0].attrib                                                
{'href': 'http://diveintomark.org/',  
 'type': 'text/html',  
 'rel': 'alternate'}  
>>> all_links[1].attrib                                                
{'href': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',  
 'type': 'text/html',  
 'rel': 'alternate'}  
>>> all_links[2].attrib  
{'href': 'http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress',  
 'type': 'text/html',  
 'rel': 'alternate'}  
>>> all_links[3].attrib  
{'href': 'http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats',  
 'type': 'text/html',  
 'rel': 'alternate'}  
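
To compare the two kinds of search side by side, here is a short sketch on a hypothetical SAMPLE with one feed-level link and one nested link (the example.org URLs are placeholders). It also shows Element.iter(), which walks all descendants and can serve as an alternative to a './/' path.

```python
import xml.etree.ElementTree as etree

# SAMPLE and its example.org URLs are placeholders for illustration.
SAMPLE = b"""<feed xmlns='http://www.w3.org/2005/Atom'>
  <link rel='alternate' href='http://example.org/'/>
  <entry>
    <link rel='alternate' href='http://example.org/first'/>
  </entry>
</feed>"""

LINK = '{http://www.w3.org/2005/Atom}link'
root = etree.fromstring(SAMPLE)

direct = root.findall(LINK)            # direct children only
nested = root.findall('.//' + LINK)    # any nesting level below root
via_iter = list(root.iter(LINK))       # same elements, found by iteration

hrefs = [link.get('href') for link in nested]
print(len(direct), len(nested))
print(hrefs)
```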

Question: navigating from the top down is easy, but how do you navigate from the bottom up, for example to find an element's parent?
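
One possible answer, offered as a sketch rather than the only approach: ElementTree elements do not keep a reference to their parent, so a common workaround is to build a child-to-parent dictionary once and look parents up in it. The name parent_of and the SAMPLE document are made up for this example. The path syntax also supports '..' to step up to a parent within a query.

```python
import xml.etree.ElementTree as etree

# SAMPLE is a cut-down, hypothetical version of the feed.
SAMPLE = b"""<feed xmlns='http://www.w3.org/2005/Atom'>
  <entry>
    <author><name>Mark</name></author>
  </entry>
</feed>"""

ATOM = '{http://www.w3.org/2005/Atom}'
root = etree.fromstring(SAMPLE)

# Map every element to its parent; the root has no entry in the map.
parent_of = {child: parent for parent in root.iter() for child in parent}

name = root.find('.//' + ATOM + 'name')
author = parent_of[name]      # one level up
entry = parent_of[author]     # two levels up

# Alternative: '..' in a path selects the parent of the matched element.
via_path = root.findall('.//' + ATOM + 'name/..')

print(author.tag)
print(entry.tag)
```

The map has to be rebuilt if the tree is modified afterwards, since it is only a snapshot of the structure at the time it was created.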

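3. Summary

Overall, ElementTree's findall() method is a very powerful feature, but its query language can be a little surprising. It is officially described as "limited XPath support." XPath is a W3C standard for querying XML documents. For basic queries, ElementTree's syntax is similar enough to XPath, but if you already know XPath, the differences may annoy you. The source chapter goes on to look at a third-party XML library that extends the ElementTree API with full XPath support.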

Reposted from: http://woodpecker.org.cn/diveintopython3/xml.html