"Beginning Python" (8): XML

Source: Internet. Editor: 程序博客网. Time: 2024/06/05 22:32

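This post works through the third practice project in "Beginning Python", "Project 3: XML for All Occasions". The project shows how XML can be used in Python programming. As the title suggests, XML is versatile and fits many situations; here, a single XML file is used to generate an entire website, including its web pages and directories.

To get the most out of this project, we need to answer three questions:

1) What is XML?

2) What is XML good for? (What are its advantages?)

3) How do we work with XML in Python?

I will follow these three questions, using the project as the running example, to study Python and XML.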

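I. XML

If you know HTML, XML will not feel foreign: both belong to the markup language family. HTML stands for "HyperText Markup Language". It is a public markup language with a fixed, standardized set of tags, and the standard has been published and maintained by the W3C (World Wide Web Consortium). For a quick introduction to the HTML format, see the W3C site: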

http://www.w3.org/MarkUp/Guide/

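Compared with HTML, XML can be described as a markup language whose tags you define yourself. The XML syntax itself is standardized, but it does not prescribe a fixed set of tags, which gives it a great deal of flexibility. An introductory XML tutorial is available at W3School (a tutorial site, not the W3C itself): http://www.w3school.com.cn/x.asp

A common way to put it: HTML was designed to display data, while XML was designed to transport and store data.

Because HTML follows one common standard, an HTML file can be viewed on any operating system and any device that has a browser, and the browser shows only the data while hiding the markup. For example, I can visit www.w3.org on a Windows PC, on an iPhone or iPad, or even on a Kindle. HTML really is convenient for displaying data.

An XML file is in effect a structured data file: it stores the data and also preserves the structural relationships between the data items. The project studied here keeps the data for an entire website in one XML file, including which pages belong to which directories. Many programs we use every day rely on XML as well. For example, the .ui files produced by Qt Designer are XML, and so are Visual Studio's .vcxproj project files (the classic .sln solution file, by contrast, is a plain-text format rather than XML).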

So what does an XML file look like?

Put simply, it is a data file marked up with tags of your own choosing, for example:

    <note>
      <to>George</to>
      <from>John</from>
      <heading>Reminder</heading>
      <body>Don't forget the meeting!</body>
    </note>

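Next question: what are the advantages of storing data in an XML file?

Back to Project 3: it keeps the data for the whole website in a single XML file, and the whole website can be generated from that file. Seen this way, XML has the following advantages:

1) One file can hold a lot of information.

2) The information is rich: the file stores not only the data but also the structure that relates the data items.

3) It is easy to maintain.

To change the website, we only need to edit this one XML file and then regenerate the whole site.

4) It is convenient for data analysis.

We can write different parsing programs, each extracting a different part of the information in the XML file. For example:

Here is the project's XML file:

    <website>
      <page name="index" title="Home Page">
        <h1>Welcome to My Home Page</h1>
        <p>Hi, there. My name is Mr. Gumby, and this is my home page. Here
        are some of my interests:</p>
        <ul>
          <li><a href="interests/shouting.html">Shouting</a></li>
          <li><a href="interests/sleeping.html">Sleeping</a></li>
          <li><a href="interests/eating.html">Eating</a></li>
        </ul>
      </page>
      <directory name="interests">
        <page name="shouting" title="Shouting">
          <h1>Mr. Gumby's Shouting Page</h1>
          <p>...</p>
        </page>
        <page name="sleeping" title="Sleeping">
          <h1>Mr. Gumby's Sleeping Page</h1>
          <p>...</p>
        </page>
        <page name="eating" title="Eating">
          <h1>Mr. Gumby's Eating Page</h1>
          <p>...</p>
        </page>
      </directory>
    </website>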

If all we want to know is which tags the file contains, a short program is enough to extract them.
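The listing that originally accompanied this step is not preserved here, so the following is only a minimal sketch of one way to do it. `TagCollector`, `SAMPLE`, and the shortened XML are my own illustrative names and data, not the book's code; the sketch uses the standard `xml.sax` module and records each tag name the first time it appears.

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler

# Shortened stand-in for website.xml (assumption, for illustration only).
SAMPLE = b"""<website>
  <page name="index" title="Home Page">
    <h1>Welcome to My Home Page</h1>
    <p>Hi, there.</p>
  </page>
  <directory name="interests">
    <page name="shouting" title="Shouting">
      <h1>Mr. Gumby's Shouting Page</h1>
      <p>...</p>
    </page>
  </directory>
</website>"""


class TagCollector(ContentHandler):
    """Hypothetical handler: records each distinct tag name once."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        # Called by the parser for every opening tag.
        if name not in self.tags:
            self.tags.append(name)


collector = TagCollector()
parseString(SAMPLE, collector)
print(collector.tags)
```

Only `startElement()` is overridden because tag names are all we need; text and closing tags are ignored by the inherited do-nothing handlers.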

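In the same way, we can write a program that lists the web pages in this XML file by collecting their <h1> headlines:

    ......
    headlines = []
    parse('website.xml', HeadlineHandler(headlines))
    print('The following <h1> elements were found:')
    for h in headlines:
        print(h)

Output (the four <h1> headlines in website.xml):

    The following <h1> elements were found:
    Welcome to My Home Page
    Mr. Gumby's Shouting Page
    Mr. Gumby's Sleeping Page
    Mr. Gumby's Eating Page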
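The definition of `HeadlineHandler` is elided in the snippet above. A handler along the following lines would produce that kind of result; treat it as a sketch under my own assumptions (the attribute names `in_headline` and `data` and the shortened `SAMPLE` input are illustrative), not as the exact code from the book.

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler

# Shortened stand-in for website.xml (assumption, for illustration only).
SAMPLE = b"""<website>
  <page name="index" title="Home Page">
    <h1>Welcome to My Home Page</h1>
    <p>Hi, there.</p>
  </page>
  <directory name="interests">
    <page name="shouting" title="Shouting">
      <h1>Mr. Gumby's Shouting Page</h1>
    </page>
  </directory>
</website>"""


class HeadlineHandler(ContentHandler):
    """Sketch: collects the text of every <h1> element."""

    def __init__(self, headlines):
        super().__init__()
        self.headlines = headlines   # list supplied by the caller
        self.in_headline = False     # are we currently inside an <h1>?
        self.data = []               # text chunks of the current <h1>

    def startElement(self, name, attrs):
        if name == 'h1':
            self.in_headline = True

    def endElement(self, name):
        if name == 'h1':
            self.headlines.append(''.join(self.data))
            self.data = []
            self.in_headline = False

    def characters(self, content):
        # The parser may deliver one text node in several chunks.
        if self.in_headline:
            self.data.append(content)


headlines = []
parseString(SAMPLE, HeadlineHandler(headlines))
print(headlines)
```

The text is accumulated in a list and joined at the closing tag because `characters()` is not guaranteed to be called exactly once per text node.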

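II. SAX

Next question: how do we work with XML files in Python?

Broadly, there are two approaches: DOM and SAX.

In brief, the DOM approach reads in the whole XML file, builds a tree data structure from the relationships between the tags, and then walks that tree to reach and manipulate the data of interest. The obvious cost is memory: the whole document is held in memory at once, so usage grows with the size of the file.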
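To make the tree idea concrete, here is a small sketch using `xml.dom.minidom` from the standard library. The book's project does not use DOM, so this is only an illustration; `SAMPLE`, `page_names`, and `parent_name` are names I made up.

```python
from xml.dom.minidom import parseString

# Reduced version of the site structure (assumption, for illustration only).
SAMPLE = """<website>
  <page name="index" title="Home Page"/>
  <directory name="interests">
    <page name="shouting" title="Shouting"/>
    <page name="sleeping" title="Sleeping"/>
  </directory>
</website>"""

dom = parseString(SAMPLE)   # the whole document is now a tree in memory

# Random access: ask for every <page>, however deeply it is nested.
pages = dom.getElementsByTagName('page')
page_names = [p.getAttribute('name') for p in pages]

# Walk upward from a node to its parent, which a pure stream cannot do.
shouting = pages[1]
parent_name = shouting.parentNode.getAttribute('name')

print(page_names, parent_name)
```

Being able to move freely between parents and children is what the extra memory buys.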

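SAX stands for "Simple API for XML". It works on a stream and never reads the whole XML file in at once. As the name suggests, SAX is a simple, fast way of handling XML. More about SAX can be found on its official site: http://www.saxproject.org/

"Beginning Python" briefly introduces how SAX works in Python: it follows the event-driven model in which events fire and handlers respond. SAX reads (or writes) the XML file as a stream; whenever it meets a tag it raises a tag event and calls the corresponding handler to process it. In short, it is a "Tag -- Event -- Handler" mechanism.

Note that the "ContentHandler" class (an interface class) already defines a handler method for each basic event, so we only need to override the handler methods we care about. The three most basic kinds of event are: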

“the beginning of an element (the occurrence of an opening tag),

the end of an element (the occurrence of a closing tag),

and plain text (characters)”
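The three events map onto three `ContentHandler` methods: `startElement()`, `endElement()`, and `characters()`. The sketch below overrides all three and simply records what the parser reports; `EventLogger` and the one-line input document are hypothetical and exist only for this illustration.

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler


class EventLogger(ContentHandler):
    """Hypothetical handler: records each basic event in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Opening tag: name plus its attributes.
        self.events.append(('start', name, dict(attrs.items())))

    def endElement(self, name):
        # Closing tag.
        self.events.append(('end', name))

    def characters(self, content):
        # Plain text; whitespace-only chunks are skipped here.
        if content.strip():
            self.events.append(('chars', content))


logger = EventLogger()
parseString(b'<page name="index"><h1>Hello</h1></page>', logger)
for event in logger.events:
    print(event)
```

Running it prints the events in document order, which is all the information a SAX handler ever receives; anything else, such as the current nesting level, has to be tracked by the handler itself.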

III. Implementing the Project

I will show the source code first and then go through a few key points:

    # website.py
    from xml.sax.handler import ContentHandler
    from xml.sax import parse
    import os


    class Dispatcher:

        def dispatch(self, prefix, name, attrs=None):
            # e.g. prefix 'start' + tag 'page' -> method name 'startPage'
            mname = prefix + name.capitalize()
            dname = 'default' + prefix.capitalize()
            method = getattr(self, mname, None)
            if callable(method):
                args = ()
            else:
                # No specific handler: fall back to defaultStart/defaultEnd
                method = getattr(self, dname, None)
                args = name,
            if prefix == 'start':
                args += attrs,
            if callable(method):
                method(*args)

        def startElement(self, name, attrs):
            self.dispatch('start', name, attrs)

        def endElement(self, name):
            self.dispatch('end', name)


    class WebsiteConstructor(Dispatcher, ContentHandler):

        passthrough = False

        def __init__(self, directory):
            self.directory = [directory]
            self.ensureDirectory()

        def ensureDirectory(self):
            path = os.path.join(*self.directory)
            os.makedirs(path, exist_ok=True)

        def characters(self, chars):
            if self.passthrough:
                self.out.write(chars)

        def defaultStart(self, name, attrs):
            # Inside a page, copy unknown opening tags through unchanged
            if self.passthrough:
                self.out.write('<' + name)
                for key, val in attrs.items():
                    self.out.write(' {}="{}"'.format(key, val))
                self.out.write('>')

        def defaultEnd(self, name):
            if self.passthrough:
                self.out.write('</{}>'.format(name))

        def startDirectory(self, attrs):
            self.directory.append(attrs['name'])
            self.ensureDirectory()

        def endDirectory(self):
            self.directory.pop()

        def startPage(self, attrs):
            filename = os.path.join(*self.directory + [attrs['name'] + '.html'])
            self.out = open(filename, 'w')
            self.writeHeader(attrs['title'])
            self.passthrough = True

        def endPage(self):
            self.passthrough = False
            self.writeFooter()
            self.out.close()

        def writeHeader(self, title):
            self.out.write('<html>\n  <head>\n    <title>')
            self.out.write(title)
            self.out.write('</title>\n  </head>\n  <body>\n')

        def writeFooter(self):
            self.out.write('\n  </body>\n</html>\n')


    parse('website.xml', WebsiteConstructor('public_html'))

1. The mix-in pattern (mix-in superclasses)

The "WebsiteConstructor" class inherits from both "Dispatcher" and "ContentHandler". This kind of multiple inheritance is common in Python program design; see Chapter 7 of "Beginning Python":

“Inheritance: One class may be the subclass of one or more other classes. The subclass then inherits all the methods of the superclasses. You can use more than one superclass, and this feature can be used to compose orthogonal (independent and unrelated) pieces of functionality. A common way of implementing this is using a core superclass along with one or more mix-in superclasses.”

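In other words, multiple inheritance lets a subclass combine "orthogonal" pieces of functionality from different base classes. A common way to do this is to treat one base class as the core and the others as supplements, which is the mix-in pattern.

In this example, "ContentHandler" is the interface class and serves as the core base class, while "Dispatcher" is the mix-in superclass whose job is to dispatch events to handler functions.

Both "ContentHandler" and "Dispatcher" define the methods startElement() and endElement(). So which base class's version is actually called? Python decides by searching the MRO (method resolution order), which follows the order in which the base classes are listed. Because "Dispatcher" comes before "ContentHandler" in the class statement, the versions in "Dispatcher" are found first. See the blog post 《小心掉进python多继承的坑》 ("Beware the pitfalls of Python multiple inheritance"): http://www.jianshu.com/p/71c14e73c9d9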
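The lookup order can be checked directly through the `__mro__` attribute. In the sketch below, `Dispatcher` is a stripped-down stand-in for the real class, and `Reversed` is a hypothetical class added only to show what happens when the base classes are listed the other way round.

```python
from xml.sax.handler import ContentHandler


class Dispatcher:
    # Stand-in: returns a marker so we can see which version ran.
    def startElement(self, name, attrs):
        return 'Dispatcher.startElement'


class WebsiteConstructor(Dispatcher, ContentHandler):
    pass


class Reversed(ContentHandler, Dispatcher):
    pass


mro_names = [cls.__name__ for cls in WebsiteConstructor.__mro__]
print(mro_names)

# Dispatcher is listed first, so its startElement() wins.
first = WebsiteConstructor().startElement('page', {})
# With the bases reversed, ContentHandler's do-nothing version wins.
second = Reversed().startElement('page', {})
print(first, second)
```

This is why the order of the base classes in website.py matters: with the order reversed, the parser would call the empty methods of `ContentHandler` and nothing would be dispatched.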

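2. Introspection (a form of metaprogramming)

In the dispatch() method of the "Dispatcher" class we meet Python's introspection again: the method builds a name such as "startPage" from the tag name and then looks it up on the object with getattr(). I covered introspection in an earlier post: http://blog.csdn.net/sagittarius_warrior/article/details/74518283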
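The core of that lookup fits in a few lines. The `Handler` class and the tag name 'video' below are hypothetical and serve only to show the built-ins `getattr()` and `callable()` at work.

```python
class Handler:
    def startPage(self, attrs):
        return 'page: ' + attrs['name']


handler = Handler()

# Build the method name from data, then look it up on the object.
method_name = 'start' + 'page'.capitalize()
method = getattr(handler, method_name, None)
found = method({'name': 'index'}) if callable(method) else None

# With a default of None, an unknown tag does not raise AttributeError.
missing = getattr(handler, 'start' + 'video'.capitalize(), None)

print(method_name, found, missing)
```

Passing `None` as the third argument to `getattr()` is what lets dispatch() fall back to a default handler instead of failing on tags it has no method for.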

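3. The dispatcher

The dispatcher's job is to route each event to a custom event-handling function; it is an "event dispatcher".

The SAX parse() function itself only knows about the basic events, namely: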

“the beginning of an element (the occurrence of an opening tag),

the end of an element (the occurrence of a closing tag),

and plain text (characters)”

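In this project, however, an opening tag for "page" must be treated differently from one for "directory". The straightforward approach is to test the tag name with if statements inside startElement(), but the more kinds of tag there are to distinguish, the more if statements pile up.

A better approach is to write one dedicated handler for each kind of opening tag and let the dispatcher route each event to the right handler according to the tag name.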
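To see the routing in action, the sketch below reuses the `Dispatcher` logic from website.py and pairs it with a hypothetical `Recorder` subclass that only notes which handler was called; the short input document is made up for the illustration.

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler


class Dispatcher:
    # Same logic as the Dispatcher in website.py above.
    def dispatch(self, prefix, name, attrs=None):
        mname = prefix + name.capitalize()
        dname = 'default' + prefix.capitalize()
        method = getattr(self, mname, None)
        if callable(method):
            args = ()
        else:
            method = getattr(self, dname, None)
            args = name,
        if prefix == 'start':
            args += attrs,
        if callable(method):
            method(*args)

    def startElement(self, name, attrs):
        self.dispatch('start', name, attrs)

    def endElement(self, name):
        self.dispatch('end', name)


class Recorder(Dispatcher, ContentHandler):
    """Hypothetical subclass that only records which handler ran."""

    def __init__(self):
        super().__init__()
        self.calls = []

    def startPage(self, attrs):
        self.calls.append('startPage ' + attrs['name'])

    def startDirectory(self, attrs):
        self.calls.append('startDirectory ' + attrs['name'])

    def defaultStart(self, name, attrs):
        self.calls.append('defaultStart ' + name)


recorder = Recorder()
parseString(
    b'<directory name="interests"><page name="eating">'
    b'<h1>Hi</h1></page></directory>',
    recorder,
)
print(recorder.calls)
```

Supporting a new tag now means adding one method with the right name; the dispatching code itself never changes. Closing tags are silently ignored here because `Recorder` defines neither specific end handlers nor `defaultEnd()`.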

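4. State tracking

If we view the data in an XML file as a tree, it is organized into nested levels. SAX, however, processes a stream and does not record the current nesting level (DOM does). We therefore have to define our own state variables to record whichever aspects of the current nesting level we care about.

In this project we only need to tell "inside a page" from "outside a page", so a single variable, passthrough, is enough.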

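5. The path list

When handling directories, the program keeps the directory path as a list, which makes it very easy to track the current path on the way in and out of directories: entering a subdirectory is an append, and returning to the parent directory is a pop.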
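The following sketch isolates that idea. `PathTracker` is a hypothetical handler that only computes where each page would be written and creates no files; for brevity it tests tag names with if/elif instead of using the Dispatcher, and the input document is made up.

```python
import os
from xml.sax import parseString
from xml.sax.handler import ContentHandler


class PathTracker(ContentHandler):
    """Hypothetical handler: computes output paths, writes nothing."""

    def __init__(self, root):
        super().__init__()
        self.directory = [root]    # current path, one list item per level
        self.files = []

    def startElement(self, name, attrs):
        if name == 'directory':
            self.directory.append(attrs['name'])      # go one level down
        elif name == 'page':
            parts = self.directory + [attrs['name'] + '.html']
            self.files.append(os.path.join(*parts))

    def endElement(self, name):
        if name == 'directory':
            self.directory.pop()                      # back up one level


tracker = PathTracker('public_html')
parseString(
    b'<website><page name="index"/>'
    b'<directory name="interests"><page name="eating"/></directory>'
    b'<page name="about"/></website>',
    tracker,
)
print(tracker.files)
```

The page that follows the closing directory tag lands back in the root directory, which shows the pop restoring the earlier path. Joining the list with `os.path.join()` only at the moment a path is needed also keeps the code independent of the platform's path separator.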
