Understanding the Python re Module and Regular Expression Calls (2)

Source: Internet. Editor: 程序博客网. Time: 2024/05/01 08:05

Continued from "Understanding the Python re Module and Regular Expression Calls".

\number

Matches the contents of the group of the same number. Groups are numbered starting from 1, up to 99.

For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group).

>>> print re.search(r'(.+) \1', 'the the').group()
the the
>>> print re.search(r'(.+) \1', 'the the').group(1)  # only one group; the space separates it from \1
the
>>> print re.search(r'(.+) \1', 'the the').group(2)
Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    print re.search(r'(.+) \1','the the').group(2)
IndexError: no such group
>>> print re.findall(r'(.+) \1', 'the the')  # findall returns the text of the single group
['the']
>>> print re.findall(r'(.+)\1', 'the the')  # the pattern has no space but the string does: no match
[]
>>> print re.findall(r'(.+) \1', 'thethe')  # the string has no space: no match
[]
>>> print re.findall(r'(.+)\1', 'thethe')  # no space on either side; 'the' repeats
['the']
>>> print re.findall(r'(.+) \1', '55 55')
['55']
>>> print re.findall(r'(.+) \1', '678 55')  # no text repeats the group, so nothing matches
[]
>>> print re.findall(r'(.+) \1', '67 55')
[]
>>> print re.findall(r'(.+) \1', '67 55 43 56 32 67')  # no match: \1 must follow the group and one space directly
[]
>>> print re.search(r'(.+) \1', '67 55 43 56 32 67')
None
>>> print re.search(r'(.+) \1', '67 55 67')
None
>>> print re.findall(r'(\d7) .* \1', '67 55 67')  # \1 is positional: it stands for the text of (\d7), and the text in between must still be consumed (here by .*)
['67']

If the first digit of number is 0, or number is 3 octal digits long, it is not interpreted as a group match but as the character with that octal value. Inside the [ and ] of a character class, all numeric escapes are treated as characters.
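A common use of a numbered backreference is spotting accidentally doubled words. The following is a minimal sketch, assuming Python 3; the names DOUBLED and doubled_words are made up for this illustration.

```python
import re

# (\w+) captures one word; \1 must repeat that exact text after a single space.
# The \b anchors keep the match from starting or ending in the middle of a word.
DOUBLED = re.compile(r'\b(\w+) \1\b')

def doubled_words(text):
    """Return each word that is immediately followed by a copy of itself."""
    return DOUBLED.findall(text)

print(doubled_words('the the cat sat on on the mat'))
print(doubled_words('67 55 67'))  # the two 67s are not adjacent, so nothing is found
```

As in the session above, '67 55 67' yields nothing because \1 has to appear right after the space that follows the group.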

\A matches only at the start of the string.

>>> print re.findall(r'\A67', '67 55 67')
['67']
>>> print re.findall(r'67', '67 55 67')
['67', '67']
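The difference between \A and ^ only shows up in MULTILINE mode. A small sketch, assuming Python 3 (the sample text and variable names are illustrative):

```python
import re

text = 'first line\nsecond line'

# With MULTILINE, ^ matches at the start of every line...
caret = re.findall(r'^\w+', text, re.MULTILINE)
# ...while \A still matches only at the very start of the string.
anchor_a = re.findall(r'\A\w+', text, re.MULTILINE)

print(caret)
print(anchor_a)
```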

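\b matches the empty string, but only at the beginning or end of a word. A word is a sequence of letters, digits and underscores, so the end of a word is marked by whitespace or by any other character that is not a letter, digit or underscore.

Formally, \b is defined as the boundary between a \w and a \W character, or between \w and the beginning or end of the string, so the exact set of characters treated as alphanumeric depends on the UNICODE and LOCALE flags.

For example, r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'. Inside a character range, \b represents the backspace character, for compatibility with Python's string literals.

>>> print re.findall(r'\bfoo\b', 'foo')
['foo']
>>> print re.findall(r'\bfoo\b', 'foo.')
['foo']
>>> print re.findall(r'\bfoo\b', '(foo)')
['foo']
>>> print re.findall(r'\bfoo\b', 'bar foo bar')
['foo']
>>> print re.findall(r'\bfoo\b', 'foobar')
[]
>>> print re.findall(r'\bfoo\b', 'foo3')
[]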

\B matches the empty string, but only when it is not at the beginning or end of a word.

This means that r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'. \B is just the opposite of \b, so it is also subject to the settings of LOCALE and UNICODE.

>>> print re.findall(r'py\B', 'python')
['py']
>>> print re.findall(r'py\B', 'py3')
['py']
>>> print re.findall(r'py\B', 'py')
[]
>>> print re.findall(r'py\B', 'py.')
[]
>>> print re.findall(r'py\B', 'py!')
[]
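To see \b and \B side by side, the sketch below records where each pattern matches in one string. It assumes Python 3; the sample text is invented for the illustration.

```python
import re

text = 'py python py3 py. happy'

# \bpy\b: 'py' as a whole word (a boundary on both sides).
whole = [m.start() for m in re.finditer(r'\bpy\b', text)]
# py\B: 'py' followed by another word character (no boundary after it).
inside = [m.start() for m in re.finditer(r'py\B', text)]

print(whole)   # offsets of the standalone 'py' and the 'py' in 'py.'
print(inside)  # offsets of the 'py' in 'python' and in 'py3'
```

The 'py' at the end of 'happy' appears in neither list: it has no boundary before it, and the end of the string is a boundary after it.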

\d matches any decimal digit; this is equivalent to [0-9].

With UNICODE, it will match whatever is classified as a decimal digit in the Unicode character properties database.

\D matches any non-digit character; this is equivalent to [^0-9].

With UNICODE, it will match anything other than characters marked as digits in the Unicode character properties database.

\s matches any whitespace character; this is equivalent to [ \t\n\r\f\v].

The LOCALE flag has no extra effect on matching of the space. If UNICODE is set, this will match the characters [ \t\n\r\f\v] plus whatever is classified as space in the Unicode character properties database.

\S matches any non-whitespace character; this is equivalent to [^ \t\n\r\f\v].

The LOCALE flag has no extra effect on non-whitespace match. If UNICODE is set, then any character not marked as space in the Unicode character properties database is matched.

\w matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_].

With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.

\W matches any character not in the set defined by \w; this is equivalent to the set [^a-zA-Z0-9_].

With LOCALE, it will match any character not in the set [0-9_], and not defined as alphanumeric for the current locale. If UNICODE is set, this will match anything other than [0-9_] plus characters classified as not alphanumeric in the Unicode character properties database.
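The effect of Unicode awareness on \d and \w can be sketched as follows. This example assumes Python 3, where str patterns behave as if UNICODE were set and the re.ASCII flag (not part of the Python 2 module described here) restores the [0-9] and [a-zA-Z0-9_] behavior. The sample text is made up.

```python
import re

# 'café', then the Arabic-Indic digits for 42, then ASCII '42'.
text = 'caf\u00e9 \u0664\u0662 42'

unicode_digits = re.findall(r'\d+', text)          # any Unicode decimal digit
ascii_digits = re.findall(r'\d+', text, re.ASCII)  # only [0-9]

unicode_words = re.findall(r'\w+', text)           # Unicode alphanumerics
ascii_words = re.findall(r'\w+', text, re.ASCII)   # only [a-zA-Z0-9_]

print(unicode_digits, ascii_digits)
print(unicode_words, ascii_words)
```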

\Z matches only at the end of the string.

Other

If both the LOCALE and UNICODE flags are included for a particular sequence, the LOCALE flag takes effect first, followed by UNICODE.

Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser:

\a      \b      \f      \n
\r      \t      \v      \x
\\

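7.2.2 The re Module

re.compile(pattern, flags=0)

Compiles a regular expression pattern string into a regular expression object, which can then be used for matching through its match() and search() methods.

The behavior of the expression can be modified by specifying a flags value. The value can be any of the variables listed below, combined with bitwise OR (the | operator).

The sequence

prog = re.compile(pattern)
result = prog.match(string)

is equivalent to

result = re.match(pattern, string)

but calling re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression is used several times in a single program.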
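A brief sketch of compiling once and reusing the object, assuming Python 3; WORD, lines and first_words are names chosen for this example.

```python
import re

# Compile once, then reuse the same object for every input line.
WORD = re.compile(r'[a-z]+', re.IGNORECASE)

lines = ['Hello world', '123', 'Spam']
first_words = []
for line in lines:
    m = WORD.match(line)  # match() only looks at the start of the line
    first_words.append(m.group() if m else None)

# The module-level function gives the same result for a single call.
same = re.match(r'[a-z]+', 'Hello world', re.IGNORECASE).group()

print(first_words, same)
```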

re.DEBUG: Display debug information about the compiled expression.

re.I
re.IGNORECASE: Perform case-insensitive matching. This is not affected by the current locale.

re.L
re.LOCALE: Make \w, \W, \b, \B, \s and \S dependent on the current locale.

re.M
re.MULTILINE

Makes ^ and $ match at the beginning and end of every line, in addition to the beginning and end of the whole string.

re.S
re.DOTALL: Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

re.U
re.UNICODE: Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database.

re.X
re.VERBOSE

Whitespace and # comments in the pattern string are ignored, unless they are escaped with a backslash or appear inside a character class.

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")

The two patterns above are equivalent.
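Flags are combined with the | operator. The sketch below, assuming Python 3 and an invented sample text, combines IGNORECASE with MULTILINE and then shows the effect of DOTALL on the dot.

```python
import re

text = 'Name: Alice\nname: bob\n'

# IGNORECASE lets 'name' match 'Name'; MULTILINE lets ^ and $ work per line.
names = re.findall(r'^name:\s*(\w+)$', text, re.IGNORECASE | re.MULTILINE)

# Without DOTALL the dot cannot cross the newline between the two lines.
dot_default = re.search(r'Alice.name', text)
dot_all = re.search(r'Alice.name', text, re.DOTALL)

print(names)
print(dot_default, repr(dot_all.group()))
```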

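re.search(pattern, string, flags=0)

Scans through string looking for the first location where pattern produces a match. Returns a MatchObject on success, or None if no position in the string matches. Note that this is different from finding a zero-length match.

re.match(pattern, string, flags=0)

Checks whether zero or more characters at the beginning of string match pattern. Returns a MatchObject on success, or None if the string does not match. Note that this is different from a zero-length match.

Even in MULTILINE mode, re.match() only matches at the beginning of the string, not at the beginning of each line.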
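The following sketch, assuming Python 3 and a made-up two-line string, contrasts match() with search() in MULTILINE mode and shows that a zero-length match is still a match object rather than None.

```python
import re

text = 'alpha\nbeta'

# match() is tied to the start of the string, even with MULTILINE.
m_match = re.match(r'beta', text, re.MULTILINE)
# search() with ^ can find the start of the second line.
m_search = re.search(r'^beta', text, re.MULTILINE)

# x* matches zero characters at position 0: a match object, not None.
m_empty = re.match(r'x*', text)

print(m_match)
print(m_search.start())
print(repr(m_empty.group()))
```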

re.split(pattern, string, maxsplit=0, flags=0)

Splits string by the occurrences of pattern and returns a list of strings. If capturing parentheses are used in pattern, the text of all groups in the pattern is also returned as part of the list.

If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.

>>> re.split(r'\W+', 'Words, words, words.')  # split on runs of non-alphanumeric characters: comma, space, period
['Words', 'words', 'words', '']
>>> re.split(r'\W+', 'Words, wer455, 3434')
['Words', 'wer455', '3434']
>>> re.split(r'(\W+)', 'Words, words, words.')  # with capturing parentheses, the separators are returned as well
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'(\W+)', 'Words, words, words., 1')
['Words', ', ', 'words', ', ', 'words', '., ', '1']
>>> re.split(r'\W+', 'Words, words, words., 1')
['Words', 'words', 'words', '1']
>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
['0', '3', '9']

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:

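>>> re.split(r'(\W+)', '...words, words...')  # the list gains an empty string at both ends
['', '...', 'words', ', ', 'words', '...', '']

If capturing parentheses are used in the RE, their values are also returned as part of the list.

In the Python 2 module described here, split never splits a string on an empty pattern match, so when the pattern can only match the empty string (or does not match at all) the original string is returned as a single-element list. Python 3.7 and later do split on empty matches.

>>> re.split('x*', 'foo')
['foo']
>>> re.split("(?m)^$", "foo\n\nbar\n")
['foo\n\nbar\n']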
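The maxsplit argument is not exercised in the sessions above. A small sketch, assuming Python 3 and reusing the sample sentence from the earlier examples:

```python
import re

sentence = 'Words, words, words.'

# Only one split is made; the rest of the string is the last element.
limited = re.split(r'\W+', sentence, maxsplit=1)

# With a capturing group the separator of that one split is kept too.
limited_captured = re.split(r'(\W+)', sentence, maxsplit=1)

print(limited)
print(limited_captured)
```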

re.findall(pattern, string, flags=0)

Returns all non-overlapping matches of pattern in string, as a list of strings. The string is scanned from left to right, and matches are returned in the order found. If the pattern contains a group, a list of the text matched by that group is returned. If the pattern contains more than one group, each item in the list is a tuple holding the text of every group. Empty matches are included in the result unless they touch the beginning of another match.

>>> p = re.compile(r'\d+')
>>> print p.findall('one1two2three3four4')
['1', '2', '3', '4']

>>> p = re.compile(r'(\D+)\d(\D+)\d(\D+)\d(\D+)\d')  # several groups: a tuple is returned
>>> print p.findall('one1two2three3four4')
[('one', 'two', 'three', 'four')]

>>> p = re.compile(r'((\D+)\d)')  # nested groups
>>> print p.findall('one1two2three3four4')
[('one1', 'one'), ('two2', 'two'), ('three3', 'three'), ('four4', 'four')]

re.finditer(pattern, string, flags=0)

Works like findall(), but returns an iterator. The iterator yields MatchObject instances.

>>> p = re.compile(r'\d+')
>>> print p.finditer('one1two2three3four4')
<callable-iterator object at 0x01FFCBF0>
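Printing the iterator itself shows only its repr; the matches appear once it is consumed. A sketch assuming Python 3, with hits as an illustrative name:

```python
import re

p = re.compile(r'\d+')

# Each item yielded by finditer() is a match object, so both the matched
# text and its position in the string are available.
hits = [(m.group(), m.span()) for m in p.finditer('one1two2three3four4')]

for text, span in hits:
    print(text, span)
```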

re.sub(pattern, repl, string, count=0, flags=0)

Returns the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string with the replacement repl. If the pattern is not found, string is returned unchanged.

repl can be a string or a function. If it is a string, backslash escapes in it are processed: \n is converted to a newline, \r to a carriage return, and so on. Unknown escapes such as \j are left alone. Backreferences such as \6 are replaced with the substring matched by group 6 in the pattern.

>>> import re
>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',  # the text around the name is replaced; myfunc is kept through \1
...        r'static PyObject*\npy_\1(void)\n{',
...        'def myfunc():')
'static PyObject*\npy_myfunc(void)\n{'

>>> re.sub(r'\d+', r'hello', ' I am 30 years old, height 127')  # the digits are replaced
' I am hello years old, height hello'
>>> re.sub(r'we\d+', r'hello', ' I am we30 years old, height we127')  # 'we' plus the digits is replaced
' I am hello years old, height hello'
>>> re.sub(r'we(\d+)', r'hello\1', ' I am we30 years old, height we127')  # the backreference keeps the grouped digits
' I am hello30 years old, height hello127'

If repl is a function, it is called with a MatchObject for each match and returns the replacement string.

>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-':
...         return ' '
...     else:
...         return '-'
...
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')  # a single '-' becomes a space; '--' becomes '-'
'pro--gram files'
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)  # 'And' is replaced
'Baked Beans & Spam'
>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-':
...         return 'HHH'
...     else:
...         return '-'
...
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')  # a single '-' becomes 'HHH'
'pro--gramHHHfiles'

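pattern may be a string or an RE object.

count is the maximum number of pattern occurrences to be replaced; it must be a non-negative integer. If it is omitted or zero, all occurrences are replaced.

Empty matches for the pattern are replaced only when not adjacent to a previous match.

>>> re.sub('x*', '-', 'abc')  # x* matches the empty string at every position
'-a-b-c-'
>>> re.sub('t*', '-', 'abc')
'-a-b-c-'
>>> re.sub('tyyy*', '-', 'abc')  # 'tyyy*' needs at least 'tyy', so it cannot match emptily and nothing is replaced
'abc'
>>> re.sub('yyy*', '-', 'abc')  # needs at least 'yy'
'abc'
>>> re.sub('yy*', '-', 'abc')  # needs at least 'y'
'abc'
>>> re.sub('y*', '-', 'abc')  # y* can match the empty string again
'-a-b-c-'

In a string-type repl argument there is one more notation: \g<name> uses the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but is not ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not as a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes the entire substring matched by the RE.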
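The three forms of \g<...> can be sketched as follows, assuming Python 3. The date pattern and the sample strings are invented for the illustration.

```python
import re

date = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})')

# \g<name> refers to a named group.
by_name = date.sub(r'\g<month>/\g<year>', '2024-05')

# \g<1>0 means group 1 followed by a literal '0';
# written as \10 it would be read as a reference to group 10.
padded = re.sub(r'(\d)', r'\g<1>0', 'a1b2')

# \g<0> stands for the whole match.
bracketed = re.sub(r'\d+', r'[\g<0>]', 'a1b22')

print(by_name, padded, bracketed)
```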

re.subn(pattern, repl, string, count=0, flags=0): Performs the same operation as sub(), but returns a tuple (new_string, number_of_subs_made).
>>> re.subn(r'\w+', 'abc', 'sfd34fdfd93r49df34')
('abc', 1)
>>> re.subn(r'\d+', 'abc', 'sfd34fdfd93r49df34')
('sfdabcfdfdabcrabcdfabc', 4)
>>> re.sub(r'\d+', 'abc', 'sfd34fdfd93r49df34')
'sfdabcfdfdabcrabcdfabc'
>>> re.sub('fd', 'FD', 'sfd34fdfd93r49df34')
'sFD34FDFD93r49df34'
>>> re.subn('fd', 'FD', 'sfd34fdfd93r49df34')
('sFD34FDFD93r49df34', 3)

re.escape(string): Returns string with all non-alphanumeric characters backslashed.
>>> re.escape('reer3434-=,.+_op')
'reer3434\\-\\=\\,\\.\\+\\_op'
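re.escape() is mostly useful when arbitrary text has to be matched literally. A minimal sketch, assuming Python 3; the literal and haystack strings are made up. The exact set of characters that get a backslash differs between Python versions, so the example checks only the matching behavior.

```python
import re

literal = '1+1=2 (really?)'
haystack = 'proof: 1+1=2 (really?) yes'

# Used as a raw pattern, +, ( ) and ? are regex operators, so the
# pattern means something else and does not find the literal text.
unescaped = re.search(literal, haystack)

# After escaping, every special character is matched literally.
escaped = re.search(re.escape(literal), haystack)

print(unescaped)
print(escaped.group())
```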
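re.purge()
Clears the regular expression cache.
exception re.error
Raised when a string passed to one of these functions is not a valid regular expression (for example, it contains unmatched parentheses), or when some other error occurs during compilation or matching. It is never an error if a string contains no match for a pattern.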
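The distinction between an invalid pattern and a pattern that simply finds nothing can be sketched like this, assuming Python 3; try_compile is a hypothetical helper written for the example.

```python
import re

def try_compile(pattern):
    """Return a compiled pattern, or None if the pattern is invalid."""
    try:
        return re.compile(pattern)
    except re.error:
        return None

bad = try_compile(r'(unclosed')   # unmatched parenthesis: re.error is raised
good = try_compile(r'(closed)')

# A valid pattern that finds nothing returns None; no exception is raised.
no_match = good.search('nothing here')

print(bad, good.pattern, no_match)
```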

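7.2.3 Regular Expression Objects

class re.RegexObject

Compiled regular expression objects have the following methods and attributes:

search(string[, pos[, endpos]])

Scans through string looking for a match and returns the corresponding MatchObject instance if one is found; otherwise returns None. Note that this is different from finding a zero-length match.

The pos parameter gives the index in string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string: the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search starts.

The endpos parameter limits how far the string is searched; it is as if the string were endpos characters long, so only the characters from pos to endpos - 1 are searched for a match. If endpos is less than pos, no match is found; otherwise, if rx is a compiled regular expression object, rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).

>>> import re
>>> pattern = re.compile("d")
>>> pattern.search("dog")  # Match at index 0
<_sre.SRE_Match object at 0x015E9218>
>>> pattern.search("dog", 1)  # No match; search doesn't include the "d"
>>> pattern.search("dogoerjeoertredffgfdgfgfg", 1)
<_sre.SRE_Match object at 0x01F49AA0>
>>> pattern.search("dogoerjeoertredffgfdgfgfg", 1, 20)
<_sre.SRE_Match object at 0x015E9218>
>>> pattern.search("dogoerjeoertredffgfdgfgfg", 1, 6)  # No match; the range 1..5 contains no "d"
>>> pattern.search("dogoerjeoertredffgfdgfgfg", 0, 6)
<_sre.SRE_Match object at 0x01F49AA0>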
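The point that pos is not the same as slicing only becomes visible with an anchored pattern. A sketch assuming Python 3, with illustrative names:

```python
import re

starts_with_b = re.compile(r'^b')
text = 'ab'

# With pos=1 the search starts at 'b', but ^ still refers to the real
# beginning of the string, so there is no match.
with_pos = starts_with_b.search(text, 1)
# Slicing creates a new string whose real beginning is 'b'.
with_slice = starts_with_b.search(text[1:])

# endpos, by contrast, does act like the end of the string: $ matches there.
ends_with_o = re.compile(r'o$').search('dog', 0, 2)

print(with_pos, with_slice.group(), ends_with_o.span())
```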

match(string[, pos[, endpos]])

Checks whether zero or more characters at the beginning of string match the pattern, and returns the corresponding MatchObject instance if so; otherwise returns None. Note that this is different from a zero-length match.

pos and endpos have the same meaning as for the search() method.

>>> re.compile("o").match('dog')  # No match as "o" is not at the start of "dog".
>>> re.compile("o").match('dog', 1)  # Match as "o" is the 2nd character of "dog".
<_sre.SRE_Match object at 0x01F28E58>

If you want to locate a match anywhere in string, use search() instead.

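split(string, maxsplit=0)

Identical to the split() function, using the compiled pattern.

findall(string[, pos[, endpos]])

Similar to the findall() function, using the compiled pattern, but also accepts pos and endpos to set the start and end of the search region, as for match().

finditer(string[, pos[, endpos]])

Similar to the finditer() function, using the compiled pattern, but also accepts pos and endpos to set the start and end of the search region, as for match().

sub(repl, string, count=0)

Identical to the sub() function, using the compiled pattern.

subn(repl, string, count=0)

Identical to the subn() function, using the compiled pattern.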

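flags

The regex matching flags. This is a combination of the flags given to compile() and any (?...) inline flags in the pattern; it is 0 if no flags were specified.

groups

The number of capturing groups in the pattern.

groupindex

A dictionary mapping any symbolic group names defined by (?P<id>) to group numbers. The dictionary is empty if no symbolic groups were used in the pattern.

pattern

The pattern string from which the RE object was compiled.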
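These attributes can be inspected directly on a compiled object. The sketch below assumes Python 3, where flags also includes the implicit UNICODE bit for str patterns, so it tests a single flag rather than comparing the whole value. The pattern is invented for the example.

```python
import re

rx = re.compile(r'(?P<user>\w+)@(?P<host>\w+)\.(com|org)', re.IGNORECASE)

group_count = rx.groups               # two named groups plus one unnamed
name_to_number = dict(rx.groupindex)  # only named groups appear here
source = rx.pattern                   # the original pattern string
ignores_case = bool(rx.flags & re.IGNORECASE)

print(group_count, name_to_number, ignores_case)
print(source)
```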

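7.2.4 Match Objects

class re.MatchObject

Match objects always have a boolean value of True. Since match() and search() return None when there is no match, you can test whether there was a match with a simple if statement:

match = re.search(pattern, string)
if match:
    process(match)

Match objects support the following methods and attributes: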