Python Notes (8): The re Module and Regular Expressions

Source: Internet | Published by: 知乎计算机 | Editor: 程序博客网 | Time: 2024/05/20 15:40

Regular Expressions (re)

Some people, when confronted with a problem, think, "I know, I'll use regular expressions." Now they have two problems.

----Jamie Zawinski

The re module provides support for regular expressions. Andrew Kuchling's "Regular Expression HOWTO" (http://amk.ca/python/howto/regex/) is an excellent resource on regular expressions in Python.

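What Is a Regular Expression?

A regular expression is a pattern that matches strings. The simplest regular expression is just a plain string: 'python' matches the string 'python'. You can use a regular expression to search a text for what you are looking for, to replace it, or to split the text into different pieces.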

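Wildcards

A pattern can contain special characters that let it match more than one string. For example, the period '.' matches any single character except a newline. Because it matches anything, the period is called the wildcard.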
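A minimal sketch of the wildcard in action. The pattern '.ython' and the variable names are illustrative choices, not part of the original notes.

```python
import re

# '.' consumes exactly one character, which may be anything except a newline.
pattern = '.ython'

matches_python = re.match(pattern, 'python') is not None     # 'p' fills the wildcard
matches_space = re.match(pattern, ' ython') is not None      # a space counts as a character
matches_short = re.match(pattern, 'ython') is not None       # nothing left for the wildcard
matches_newline = re.match(pattern, '\nython') is not None   # newline is excluded by default

print(matches_python, matches_space, matches_short, matches_newline)
```

The wildcard always stands for exactly one character; it never matches the empty string.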

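Escaping Special Characters

When a pattern must contain a special character literally, you have to escape it with a backslash (\) so that re treats it as an ordinary character. In an ordinary Python string, however, the backslash itself must also be escaped, so to match 'python.org' you would write 'python\\.org'. There are two levels of escaping here: the first is consumed by the Python interpreter, the second by the re module. If you prefer to write a single backslash, use a raw string: r'python\.org'.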
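The two spellings can be compared directly. This is a small sketch; the test string 'pythonzorg' is made up for illustration.

```python
import re

unescaped = 'python.org'         # '.' is still a wildcard here
double_escaped = 'python\\.org'  # ordinary string: the backslash is escaped for Python
raw = r'python\.org'             # raw string: one backslash is enough

# Both spellings produce the same pattern text.
same_pattern = (double_escaped == raw)

# The unescaped dot also matches 'z'; the escaped dot matches only a literal '.'.
wildcard_hit = re.search(unescaped, 'pythonzorg') is not None
literal_miss = re.search(raw, 'pythonzorg') is None
literal_hit = re.search(raw, 'python.org') is not None

print(same_pattern, wildcard_hit, literal_miss, literal_hit)
```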

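Character Sets

Sometimes you need finer control over which characters are allowed. Enclose the allowed characters in square brackets to form a character set, which matches any one of the characters it contains. For example:

[pj]ython matches both 'python' and 'jython'

[a-z] matches any lowercase letter

[a-zA-Z0-9] matches any uppercase or lowercase letter or digit

To exclude characters instead, put a '^' at the start of the set. For example:

[^abc] matches any character except a, b, and c.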
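The sets listed above can be tried with re.findall. The sample strings here are invented for the sketch.

```python
import re

# A set matches exactly one character out of those listed.
snakes = re.findall('[pj]ython', 'python, jython, cython')

# A leading '^' inverts the set.
not_abc = re.findall('[^abc]', 'aabbccxyz')

# Ranges can be combined; '+' repeats the set one or more times.
alnum_runs = re.findall('[a-zA-Z0-9]+', 'Py3k, re-2!')

print(snakes)      # ['python', 'jython']
print(not_abc)     # ['x', 'y', 'z']
print(alnum_runs)  # ['Py3k', 're', '2']
```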

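Alternatives and Subpatterns

Sometimes you do not want to match arbitrary characters but only a few specific strings, such as 'python' and 'perl'. Separate the alternatives with the pipe character '|':

'python|perl' matches either 'python' or 'perl'

Sometimes the alternation should apply only to part of the pattern rather than to the whole of it. In that case, enclose that part, the subpattern, in parentheses:

'p(ython|erl)'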
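A short sketch showing that the two patterns accept the same words. The word list is an assumption made for the example, and re.fullmatch requires Python 3.4 or later.

```python
import re

whole = re.compile('python|perl')      # alternation over the whole pattern
partial = re.compile('p(ython|erl)')   # alternation limited to a subpattern

words = ['python', 'perl', 'pascal']
whole_hits = [w for w in words if whole.fullmatch(w)]
partial_hits = [w for w in words if partial.fullmatch(w)]

# The parentheses also capture what the subpattern matched.
inner = partial.match('perl').group(1)

print(whole_hits, partial_hits, inner)
```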

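Optional and Repeated Subpatterns

The question mark (?) marks a subpattern as optional: it may appear once or not at all.

r'(http://)?(www\.)?python\.org' matches all of the following:

'http://www.python.org'

'http://python.org'

'www.python.org'

'python.org'

Besides the question mark, there are other operators that apply to subpatterns:

(pattern)*: pattern is repeated zero or more times.

(pattern)+: pattern is repeated one or more times.

(pattern){m,n}: pattern is repeated from m to n times.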
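The operators above can be checked as follows. This is a sketch; the subpattern (ab) and the variable names are illustrative, and re.fullmatch assumes Python 3.4 or later.

```python
import re

url_pat = re.compile(r'(http://)?(www\.)?python\.org')
urls = ['http://www.python.org', 'http://python.org',
        'www.python.org', 'python.org']
all_match = all(url_pat.fullmatch(u) for u in urls)

# '*' accepts zero repetitions, '+' needs at least one.
star_accepts_empty = re.fullmatch(r'(ab)*', '') is not None
plus_rejects_empty = re.fullmatch(r'(ab)+', '') is None

# {2,3} accepts two or three repetitions, nothing else.
candidates = ['ab', 'abab', 'ababab', 'abababab']
bounded = [s for s in candidates if re.fullmatch(r'(ab){2,3}', s)]

print(all_match, star_accepts_empty, plus_rejects_empty, bounded)
```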

The Beginning and End of a String

'^' anchors the match at the beginning of the string, and '$' anchors it at the end.

'^ht+p' matches 'http://python.org' (and 'httttttp://python.org'), but not 'www.http.org'.
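The anchor example can be verified with re.search, which would otherwise find 'http' anywhere in the string. The '$' pattern at the end is an extra illustration that is not in the original notes.

```python
import re

pat = '^ht+p'

at_start = re.search(pat, 'http://python.org') is not None
many_t = re.search(pat, 'httttttp://python.org') is not None
in_middle = re.search(pat, 'www.http.org') is not None   # 'http' is present, but not at the start

# '$' ties the match to the end of the string.
ends_with_org = re.search(r'\.org$', 'www.http.org') is not None

print(at_start, many_t, in_middle, ends_with_org)
```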

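Some Functions in the re Module

compile(pattern[, flags])

Compiles a pattern string into a pattern object. Matching with a compiled pattern object is more efficient, so if you are going to call search or match with the same regular expression repeatedly, it is best to compile it first. A pattern object has search and match methods of its own: if pat is a compiled pattern object, pat.search(string) is equivalent to re.search(pat, string). Pattern objects can also be passed to the module-level re functions in place of a pattern string.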
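A brief sketch of the three equivalent call styles. The pattern r'\d+' and the sample text are assumptions chosen for the example.

```python
import re

pat = re.compile(r'\d+')   # compile once, reuse many times
text = 'order 42 shipped'

via_method = pat.search(text).group()        # method on the pattern object
via_function = re.search(pat, text).group()  # pattern object passed to the module function
via_string = re.search(r'\d+', text).group() # plain pattern string

print(via_method, via_function, via_string)
```

All three calls find the same match; compiling mainly pays off when the same pattern is used many times.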

search(pattern, string[, flag])

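search scans string for the first substring that matches pattern. If one is found, it returns a MatchObject (which is true in a Boolean context); otherwise it returns None (which is false). So you can write:

if re.search(pat, string):
    print('Found it!')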

match(pattern, string[, flag])

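match tries to match the pattern at the beginning of the string only. The match does not have to cover the whole string; to require that, add a '$' to the end of the pattern.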
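The difference between match and search, and the effect of the trailing '$', can be sketched like this. The sample text is made up; re.fullmatch (Python 3.4 and later) is mentioned as an alternative that is not covered in the original notes.

```python
import re

text = 'python.org'

prefix_match = re.match('python', text)        # matches: the pattern only has to fit the start
match_not_at_start = re.match('org', text)     # None: 'org' is not at the beginning
search_anywhere = re.search('org', text)       # search looks at every position

with_dollar = re.match('python$', text)        # None: '$' demands the string end after 'python'
whole_string = re.match(r'python\.org$', text) # matches the entire string

# Assumption: Python 3.4+, where fullmatch anchors both ends without '$'.
full = re.fullmatch(r'python\.org', text)

print(bool(prefix_match), match_not_at_start, bool(search_anywhere),
      with_dollar, bool(whole_string), bool(full))
```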

split(pattern, string[, maxsplit=0])

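split cuts the string at every occurrence of pattern. It works like the string method split, except that the separator is a regular expression rather than a fixed string. Here is an example:

>>> some_text = 'alpha, beta,,,,gamma delta'
>>> re.split('[, ]+', some_text)
['alpha', 'beta', 'gamma', 'delta']

If the pattern contains parentheses, the text matched by the parenthesized groups is kept in the result, between the pieces.

The maxsplit argument sets the maximum number of splits allowed.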
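The two remarks about parentheses and maxsplit can be sketched as follows. The 'foobar' string is an illustrative choice.

```python
import re

some_text = 'alpha, beta,,,,gamma delta'

plain = re.split('[, ]+', some_text)

# A group in the pattern keeps the captured text between the pieces.
with_group = re.split('o(o)', 'foobar')

# maxsplit=2 stops after two cuts; the remainder is returned unsplit.
limited = re.split('[, ]+', some_text, maxsplit=2)

print(plain)       # ['alpha', 'beta', 'gamma', 'delta']
print(with_group)  # ['f', 'o', 'bar']
print(limited)     # ['alpha', 'beta', 'gamma delta']
```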

findall(pattern, string)

findall returns a list of all the non-overlapping substrings of string that match pattern.
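A small sketch of findall. The sample sentences are invented; the second call illustrates a detail the notes above do not mention, namely that a pattern with several groups yields tuples of groups instead of whole matches.

```python
import re

text = '"Hm... Err -- are you sure?" he said, sounding insecure.'

# Without groups, each list item is the full text of one match.
words = re.findall('[a-zA-Z]+', text)

# With two groups, each list item is a tuple of the captured groups.
pairs = re.findall(r'(\w+)=(\d+)', 'a=1, b=22')

print(words)
print(pairs)
```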

sub(pat, repl, string[, count=0])

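sub replaces the leftmost non-overlapping matches of pat in string with repl. If count is given and nonzero, at most count matches are replaced; with the default count=0, all of them are.

>>> pat = '{name}'
>>> text = 'Dear {name}...'
>>> re.sub(pat, 'Mr. Gumby', text)
'Dear Mr. Gumby...'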
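The effect of count can be sketched as follows. The text with two placeholders is an assumption for the example, and the braces are escaped here as a precaution, since '{' starts a repetition operator in other contexts.

```python
import re

text = 'Dear {name}, hello {name}!'
pat = r'\{name\}'   # braces escaped so they are always read literally

all_replaced = re.sub(pat, 'Mr. Gumby', text)           # count defaults to 0: replace everything
first_only = re.sub(pat, 'Mr. Gumby', text, count=1)    # only the leftmost match

print(all_replaced)
print(first_only)
```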

escape(string)

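escape returns a copy of string in which the characters that could be interpreted as regular expression operators are escaped. An example makes this clear:

>>> re.escape('www.python.org')
'www\\.python\\.org'
>>> re.escape('But where is the ambiguity?')
'But\\ where\\ is\\ the\\ ambiguity\\?'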
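One typical use of escape is turning arbitrary text into a pattern that matches only that text. A minimal sketch; the string 'wwwXpythonXorg' is made up to show the difference, and re.fullmatch assumes Python 3.4 or later.

```python
import re

literal = 'www.python.org'
safe_pat = re.escape(literal)

# Used directly as a pattern, each '.' would match any character.
loose_hit = re.fullmatch(literal, 'wwwXpythonXorg') is not None

# After escaping, only the literal text matches.
strict_hit = re.fullmatch(safe_pat, 'wwwXpythonXorg') is not None
exact_hit = re.fullmatch(safe_pat, literal) is not None

print(loose_hit, strict_hit, exact_hit)
```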

flag

The "Module Contents" section of the Python Library Reference (http://python.org/doc/lib/module-re.html) describes how to use the flags parameter.
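As a taste of what the reference describes, here is a sketch using three commonly used flags, re.IGNORECASE, re.MULTILINE and re.DOTALL. The sample strings are assumptions for the example.

```python
import re

# IGNORECASE: letters match regardless of case.
case_insensitive = re.search('python', 'I like PYTHON', re.IGNORECASE).group()

# MULTILINE: '^' also matches right after each newline.
line_starts = re.findall(r'^\w+', 'one two\nthree four', re.MULTILINE)

# DOTALL: '.' matches a newline as well.
dot_default = re.search('a.b', 'a\nb') is not None
dot_with_flag = re.search('a.b', 'a\nb', re.DOTALL) is not None

print(case_insensitive, line_starts, dot_default, dot_with_flag)
```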

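Match Objects and Groups

Whenever a match succeeds, the re function returns a MatchObject. It holds information about the substring that matched the pattern, and also about which parts of the pattern matched which parts of the substring. Those parts are called groups. A group is simply a subpattern enclosed in parentheses; groups are numbered by their opening parentheses, counting from the left, and group 0 is the entire match. Consider the following pattern:

'There (was a (wee) (cooper)) who (lived in Fyfe)'

It contains the following groups: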

0 There was a wee cooper who lived in Fyfe

1 was a wee cooper

2 wee

3 cooper

4 lived in Fyfe
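The group numbering can be checked on a match object with group, groups, start, end and span. This is a sketch; the second pattern and its input are illustrative additions.

```python
import re

pat = r'There (was a (wee) (cooper)) who (lived in Fyfe)'
m = re.match(pat, 'There was a wee cooper who lived in Fyfe')

whole = m.group(0)      # the entire match
numbered = m.groups()   # groups 1..4 as a tuple

# start/end/span give the position of a group inside the string.
m2 = re.match(r'www\.(.*)\..{3}', 'www.python.org')
name = m2.group(1)
position = (m2.start(1), m2.end(1))

print(whole)
print(numbered)
print(name, position, m2.span(1))
```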

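Group Numbers and Functions in Substitutions

Groups make substitutions far more powerful, and a simple example shows how: using group numbers in the replacement string. Any sequence of the form '\\n' in the replacement string is replaced by the text matched by group n of the pattern. The following example replaces '*something*' with '<em>something</em>':

>>> emphasis_pattern = r'\*([^\*]+)\*'
>>> re.sub(emphasis_pattern, r'<em>\1</em>', 'Hello, *world*!')
'Hello, <em>world</em>!'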
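Group references can also reorder the captured text. A small sketch with an invented date string:

```python
import re

# Three groups: year, month, day.
date_pat = r'(\d{4})-(\d{2})-(\d{2})'

# \3, \2 and \1 refer back to the groups, here in reverse order.
reordered = re.sub(date_pat, r'\3/\2/\1', 'Fooville, 2008-04-24')

print(reordered)
```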

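Greedy and Nongreedy Patterns

The repetition operators are greedy by default: they match as much as they possibly can. For example:

>>> emphasis_pattern = r'\*(.+)\*'
>>> re.sub(emphasis_pattern, r'<em>\1</em>', '*This* is *it*!')
'<em>This* is *it</em>!'

The pattern matched everything from the first asterisk to the last one. If that is not what you want, add a question mark (?) after the repetition operator to get its nongreedy version, which matches as little as possible. This works for every repetition operator. The following example uses a double asterisk to mark emphasis:

>>> emphasis_pattern = r'\*\*(.+?)\*\*'
>>> re.sub(emphasis_pattern, r'<em>\1</em>', '**This** is **it**!')
'<em>This</em> is <em>it</em>!'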
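For a direct comparison, the greedy and nongreedy operators can be run on the same single-asterisk input. A minimal sketch:

```python
import re

text = '*This* is *it*!'

# Greedy: '.+' runs to the last asterisk it can reach.
greedy = re.sub(r'\*(.+)\*', r'<em>\1</em>', text)

# Nongreedy: '.+?' stops at the first asterisk that lets the match succeed.
lazy = re.sub(r'\*(.+?)\*', r'<em>\1</em>', text)

print(greedy)
print(lazy)
```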

A Simple Template System

Python's string formatting is already a simple template system, but re lets you build a more powerful one, as in the following code:

# templates.py
import fileinput, re

# Matches fields enclosed in square brackets:
field_pat = re.compile(r'\[(.+?)\]')

# We'll collect variables in this:
scope = {}

# This is used in re.sub:
def replacement(match):
    code = match.group(1)
    try:
        # If the field can be evaluated, return it:
        return str(eval(code, scope))
    except SyntaxError:
        # Otherwise, execute the assignment in the same scope...
        exec(code, scope)
        # ... and return an empty string:
        return ''

# Get all the text as a single string:
# (There are other ways of doing this; see Chapter 11)
lines = []
for line in fileinput.input():
    lines.append(line)
text = ''.join(lines)

# Substitute all the occurrences of the field pattern:
print(field_pat.sub(replacement, text))

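The script does the following:

Defines a pattern for matching fields
Creates a dictionary to act as the scope for the template
Defines a replacement function that does the following:
- Grabs group 1 from the match and stores it in code.
- Tries to evaluate code as an expression in scope and returns the result as a string, or, if a SyntaxError is raised,
- executes code as a statement in scope and returns an empty string.
Reads all the lines of the input files into a list and joins them into one big string
Replaces every match of the field pattern with the return value of the replacement function, using the sub method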
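The steps above can also be tried without any files. The following sketch wraps the same logic in a function; render and the sample template are hypothetical names introduced here, not part of the original script. Since it relies on eval and exec, it should only ever be fed templates you trust.

```python
import re

# Matches fields enclosed in square brackets, nongreedily.
field_pat = re.compile(r'\[(.+?)\]')

def render(text, scope=None):
    """Hypothetical helper: fill in every [field] found in text."""
    scope = {} if scope is None else scope

    def replacement(match):
        code = match.group(1)
        try:
            # Expressions are evaluated and their value inserted.
            return str(eval(code, scope))
        except SyntaxError:
            # Statements (such as assignments) are executed and leave no output.
            exec(code, scope)
            return ''

    return field_pat.sub(replacement, text)

template = "[name = 'Magnus'][x = 2]Dear [name], x squared is [x * x]."
result = render(template)
print(result)
```

Keeping the scope dictionary as a parameter makes it possible to share definitions between several templates, which is what the script achieves by reading the definitions file first.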

Demonstration

Create the file magnus.txt:

[name = 'Magnus Lie Hetland']
[email = 'magnus@foo.bar']
[language = 'python']

Create the file templates.txt:

[import time]
Dear [name],
I would like to learn how to program. I hear you use
the [language] language a lot -- is it something I
should consider?
And, by the way, is [email] your correct email address?
Fooville, [time.asctime()]
Oscar Frozzbozz

Then run the following in a terminal:

$ python templates.py magnus.txt templates.txt

The output is:

Dear Magnus Lie Hetland,

I would like to learn how to program. I hear you use

the python language a lot -- is it something I

should consider?

And, by the way, is magnus@foo.bar your correct email address?

Fooville, Wed Apr 24 20:34:29 2008

Oscar Frozzbozz