“Beginning Python” (4): “Instant Markup 1”


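This article walks through “Instant Markup”, one of the ten application projects at the end of “Beginning Python”. The project converts “plain text” into “markup text” such as HTML, XML, or LaTeX. Although the project only demonstrates “plain to html”, it is easy to extend to other kinds of markup text.

Note: for an introduction to HTML, see http://www.w3.org/MarkUp/Guide/

I. Problem and Goals

The problem is clear: convert “plain text” into “html text”. Specifically:

1) Distinguish the different kinds of text blocks: headings and paragraphs.

2) Handle special text: list items and in-line text such as emphasized text and URLs.

3) Stay extensible, so that other markup formats can be handled as well.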
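
To make goal 2 concrete, here is a minimal sketch of in-line handling with regular expressions. The two patterns and the name mark_inline are assumptions made for illustration; they are not taken from the project's code, and the URL pattern is deliberately simplistic.

```python
import re

# Hypothetical patterns for illustration; the project's real filters may differ.
EMPHASIS = re.compile(r'\*(.+?)\*')          # text between two asterisks
URL = re.compile(r'(http://[\.a-zA-Z/]+)')   # a very rough http URL matcher


def mark_inline(text):
    """Wrap *emphasis* in <em> tags and bare URLs in <a> tags."""
    text = EMPHASIS.sub(r'<em>\1</em>', text)
    text = URL.sub(r'<a href="\1">\1</a>', text)
    return text


print(mark_inline('Reach us in *many* ways, e.g. (http://wwspam.fu/feedback).'))
```

Because the URL character class excludes the closing parenthesis, the link ends before it. A URL directly followed by a period would still absorb that period, which is one reason to treat this only as a sketch.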

The test text, “test_input.txt”, is as follows:

Welcome to World Wide Spam, Inc.


These are the corporate web pages of *World Wide Spam*, Inc. We hope
you find your stay enjoyable, and that you will sample many of our
products.

A short history of the company

World Wide Spam was started in the summer of 2000. The business
concept was to ride the dot-com wave and to make money both through
bulk email and by selling canned meat online.

After receiving several complaints from customers who weren't
satisfied by their bulk email, World Wide Spam altered their profile,
and focused 100% on canned goods. Today, they rank as the world's
13,892nd online supplier of SPAM.

Destinations

From this page you may visit several of our interesting web pages:

  - What is SPAM? (http://wwspam.fu/whatisspam)

  - How do they make it? (http://wwspam.fu/howtomakeit)

  - Why should I eat it? (http://wwspam.fu/whyeatit)

How to get in touch with us

You can get in touch with us in *many* ways: By phone (555-1234), by
email (wwspam@wwspam.fu) or by visiting our customer feedback page
(http://wwspam.fu/feedback).

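II. Technical Breakdown

1. Where the required knowledge is covered

1) Reading and writing files: Chapter 11, fileinput

2) Iterating line by line: same as above

3) String handling: Chapter 3

4) Generators: Chapter 9

5) Regular expressions (re): Chapter 10

2. Subtasks

1) Splitting the text

Because the HTML output must distinguish level-1 headings, level-2 headings, and paragraphs, the first subtask is to use the features of the input (plain text) to split it into heading lines and paragraph blocks.

Looking at “test_input.txt”, it is clear that blocks are separated by one or more blank lines.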

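# util.py

def lines(file):
    # Yield every line of the file, then one extra blank line so that
    # blocks() always sees a terminator after the last block.
    for line in file:
        yield line
    yield '\n'

def blocks(file):
    # Group consecutive non-blank lines into one stripped string per block.
    block = []
    for line in lines(file):
        if line.strip():
            block.append(line)
        elif block:
            yield ''.join(block).strip()
            block = []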

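As shown above, util.py contains two generators: lines and blocks. Note that they are generator functions rather than ordinary functions: calling one returns a generator, and its values are produced lazily as it is iterated.

lines turns the input file (stream) into lines and yields them one at a time; stepping through it in a debugger (for example, Visual Studio) is a good way to watch it work. The input stream, file, already provides a line iterator, so lines simply relays those lines and then yields one extra '\n', which guarantees that blocks also emits the final block when the file does not end with a blank line.

Note: for “File Iterators”, see Chapter 11. Python file objects and sys.stdin can both be iterated directly with for.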
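
As a small illustration of line iteration, the sketch below loops over an in-memory text stream. io.StringIO is used here only as a stand-in so the example runs without a file on disk; an open file or sys.stdin would be iterated the same way.

```python
import io

# An in-memory text stream stands in for an open file or sys.stdin.
stream = io.StringIO("first line\nsecond line\n\nfourth line\n")

collected = []
for line in stream:          # the stream yields one line per iteration
    collected.append(line)   # each line keeps its trailing '\n'

print(collected)
```

Note that the blank line arrives as '\n' rather than as an empty string, which is why util.py tests line.strip() to detect it.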

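blocks is built on a list and is easy to follow: it collects consecutive non-blank lines (each ending in a newline) into a list, and when it reaches a blank line it joins them into a single string and yields it. Called without arguments, str.strip() removes leading and trailing whitespace (not only spaces), so a line containing nothing but whitespace counts as blank. blocks also skips blank lines, so a run of several blank lines never produces an empty block.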
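
The behaviour described above can be checked with a short run. The sketch repeats the two generators from util.py so that it is self-contained; the sample text and the use of io.StringIO as the input stream are assumptions made for the demonstration.

```python
import io


def lines(file):
    for line in file:
        yield line
    yield '\n'   # extra blank line so the last block is always flushed


def blocks(file):
    block = []
    for line in lines(file):
        if line.strip():
            block.append(line)            # still inside a block
        elif block:
            yield ''.join(block).strip()  # blank line ends the block
            block = []


# Made-up sample: two blank lines after the title, no newline at the very end.
sample = io.StringIO("Title\n\n\nfirst line\nsecond line\n\nlast block")
result = list(blocks(sample))
print(result)
```

The sample has two consecutive blank lines and no trailing newline, yet no empty block appears and the last block is not lost. The second point depends on the extra '\n' that lines yields at the end.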

2) Adding markup

A markup file generally has three parts:

a. Header

b. Body

c. Footer

For an HTML reference, see: http://www.w3.org/MarkUp/Guide/
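
The three parts can be sketched as follows. This is not the project's code: to_html and its title parameter are hypothetical, a regex split stands in for the blocks generator from util.py, and treating the first block as the h1 heading is an assumption made only to keep the example short.

```python
import re
from html import escape


def split_blocks(text):
    """Split text on one or more blank lines (stand-in for util.blocks)."""
    return [b.strip() for b in re.split(r'\n\s*\n', text) if b.strip()]


def to_html(text, title='Untitled'):
    # a. header
    out = ['<html><head><title>%s</title></head><body>' % escape(title)]
    # b. body: assume the first block is the heading, the rest are paragraphs
    for index, block in enumerate(split_blocks(text)):
        tag = 'h1' if index == 0 else 'p'
        out.append('<%s>%s</%s>' % (tag, escape(block, quote=False), tag))
    # c. footer
    out.append('</body></html>')
    return '\n'.join(out)


html_out = to_html("Title\n\n\nBody text", title="Demo")
print(html_out)
```

Escaping with html.escape keeps characters such as < and & in the plain text from being read as markup. Any in-line substitutions would have to be applied with that in mind.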

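III. Code Analysis

1. Modularization

To make the program easy to extend and maintain, we design it as object-oriented modules. It breaks down roughly into the following modules:

1) A Parser

This class ties the others together: its main jobs are reading the file and managing the other classes. It naturally serves as the program's entry-point object.

2) Rules

Each rule corresponds to one kind of text block.

3) Filters

Filters wrap regular expressions and process in-line text (deal with in-line elements). Note that they operate within a line, not on whole text blocks.

4) Handlers

Handlers generate the output text; each handler corresponds to one kind of output. They are the foundation of the program's extensibility: defining a different handler produces a different kind of markup text.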
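
One way the four modules could fit together is sketched below. It is only a sketch under stated assumptions: the class names HTMLHandler, HeadingRule, and ParagraphRule, the methods condition and parse, and the heading heuristic are all hypothetical and are not claimed to match the book's implementation.

```python
import re


class HTMLHandler:
    """Handler: produces the output text for one markup flavour (HTML)."""

    def start(self, name):
        return '<%s>' % name

    def end(self, name):
        return '</%s>' % name

    def sub_emphasis(self, match):
        return '<em>%s</em>' % match.group(1)


class HeadingRule:
    """Rule (assumed heuristic): a short one-line block not ending in ':'."""
    tag = 'h2'

    def condition(self, block):
        return '\n' not in block and len(block) <= 70 and not block.endswith(':')


class ParagraphRule:
    """Rule: fallback that accepts every block."""
    tag = 'p'

    def condition(self, block):
        return True


class Parser:
    """Parser: owns the rules and filters and drives the handler."""

    def __init__(self, handler):
        self.handler = handler
        self.rules = [HeadingRule(), ParagraphRule()]   # first match wins
        # Filter: a compiled regex paired with the handler method that rewrites it.
        self.filters = [(re.compile(r'\*(.+?)\*'), handler.sub_emphasis)]

    def parse(self, blocks):
        out = []
        for block in blocks:
            # Block-level decision on the raw text; the fallback always matches.
            rule = next(r for r in self.rules if r.condition(block))
            # In-line filters work inside the block's text.
            for pattern, replace in self.filters:
                block = pattern.sub(replace, block)
            out.append(self.handler.start(rule.tag) + block
                       + self.handler.end(rule.tag))
        return out


parser = Parser(HTMLHandler())
result = parser.parse(['Destinations',
                       'You can reach us in *many* ways:\nby phone or email.'])
print(result)
```

In this arrangement only the handler knows about HTML, so producing another markup format would mean passing a different handler object while the parser, rules, and filters stay unchanged.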

The class relationships are as follows:

[class diagram]

For background on UML, see: http://design-patterns.readthedocs.io/zh_CN/latest/read_uml.html
