Python Regular Expressions (Basics)

Source: Internet | Published by: 韩国实力知乎 | Editor: 程序博客网 | Time: 2024/06/07 20:02

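I. A regular expression is a string made up of literal text and special characters; it describes a pattern that can match a whole set of strings.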

Regular expression    Strings matched
foo                   foo
python                python
abc123                abc123

II. Special symbols and characters

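Notation      Description                                                   Example
literal       Matches the literal text of the string                        foo
re1|re2       Matches regular expression re1 or re2                         foo|bar
.             Matches any single character (except \n)                      b.b
^             Matches the start of the string                               ^Dear
$             Matches the end of the string                                 /bin/sh$
*             Matches 0 or more occurrences of the preceding expression     [A-Za-z0-9]*
+             Matches 1 or more occurrences of the preceding expression     [a-z]+\.com
?             Matches 0 or 1 occurrence of the preceding expression         goo?
{N}           Matches exactly N occurrences of the preceding expression     [0-9]{3}
{M,N}         Matches from M to N occurrences of the preceding expression   [0-9]{5,9}
[...]         Matches any single character from the character set           [abrds]
[..x-y..]     Matches any single character in the range x to y              [0-9], [A-Za-z]
[^...]        Matches any single character not in the set or its ranges     [^abrds], [^A-Za-z]
(*|+|?|{})?   Non-greedy versions of the repetition symbols *, +, ?, {}     .*?[a-z]
(...)         Matches the enclosed expression and saves it as a subgroup    ([0-9]{3})?, f(oo|u)bar
\d            Matches any decimal digit, like [0-9] (\D: any non-digit)     data\d+.txt
\w            Matches any word character, like [A-Za-z0-9_] (\W: inverse)   [A-Za-z]\w+
\s            Matches any whitespace, like [ \n\t\r\v\f] (\S: inverse)      of\sthe
\b            Matches any word boundary (\B: inverse)                       \bThe\b
\N            Matches saved subgroup N                                      price:\16
\c            Matches the special character c literally                     \., \\, \*
\A (\Z)       Matches the start (end) of the string (see also ^ and $)      \ADear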
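A few rows of the table are easier to see in action. The following is a minimal sketch, not part of the original article; the sample strings ('call 555-1212', 'it is is fine') are made up for illustration. It shows the "zero or one" meaning of ?, the difference between greedy and non-greedy repetition, and a \N backreference.

```python
import re

# "?" means zero or one: "goo?" accepts "go" and "goo" but not "gooo".
optional = [bool(re.fullmatch(r'goo?', s)) for s in ('go', 'goo', 'gooo')]
print(optional)  # [True, True, False]

# Greedy ".+" takes as much as it can and leaves a single digit for group 1.
greedy = re.match(r'.+(\d+)-(\d+)', 'call 555-1212')
# Non-greedy ".+?" takes as little as it can, so group 1 gets all three digits.
lazy = re.match(r'.+?(\d+)-(\d+)', 'call 555-1212')
print(greedy.group(1), lazy.group(1))  # 5 555

# "\1" refers back to whatever subgroup 1 matched (here, a doubled word).
doubled = re.search(r'\b(\w+) \1\b', 'it is is fine')
print(doubled.group())  # is is
```

Note that every pattern is written as a raw string (r'...') so that Python passes the backslashes through to the re module unchanged.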

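III. Matching more than one pattern with the alternation symbol

The pipe symbol (|) denotes alternation. For example, at|home matches the string at or the string home.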
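As a small illustration of alternation, here is a sketch that checks a few made-up candidate strings against at|home, and then shows how parentheses limit the scope of the pipe. The variable names are hypothetical; re.fullmatch is used so that only whole-string matches count.

```python
import re

pattern = re.compile(r'at|home')

# fullmatch succeeds only if the entire string is one of the alternatives.
results = {text: bool(pattern.fullmatch(text))
           for text in ('at', 'home', 'athome', 'work')}
print(results)

# Without parentheses the pipe splits the whole pattern in two:
# 'foo|ubar' means "foo" or "ubar", while 'f(oo|u)bar' means "foobar" or "fubar".
scoped = bool(re.fullmatch(r'f(oo|u)bar', 'fubar'))
unscoped = bool(re.fullmatch(r'foo|ubar', 'fubar'))
print(scoped, unscoped)  # True False
```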

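IV. Regular expressions and Python

1. Python supports regular expressions through the re module, which implements the more powerful and more general Perl 5 style of regular expressions. The module lets multiple threads share the same compiled regular expression object, and it also supports named subgroups.

2. The re module: core functions and methods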

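re module functions
compile(pattern, flags=0)             Compiles the pattern with any optional flags and returns a regular expression object
match(pattern, string, flags=0)       Tries to match the pattern, with optional flags, at the start of string; returns a match object on success, None on failure
search(pattern, string, flags=0)      Searches string for the first occurrence of the pattern, with optional flags; returns a match object on success, None on failure
findall(pattern, string[, flags])     Finds all non-overlapping occurrences of the pattern in string and returns them as a list
finditer(pattern, string[, flags])    Same as findall, but returns an iterator that yields a match object for each match
split(pattern, string, maxsplit=0)    Splits string into a list, using the pattern as the delimiter, and returns the list; performs at most maxsplit splits (by default, splits at every occurrence)
sub(pattern, repl, string, count=0)   Replaces occurrences of the pattern in string with repl; all occurrences are replaced unless count is given
purge()                               Clears the cache of implicitly compiled regular expressions

Match object methods
group(num=0)                          Returns the entire match, or the specific subgroup numbered num
groups(default=None)                  Returns a tuple containing all matched subgroups (an empty tuple if the pattern has no subgroups)
groupdict(default=None)               Returns a dict of all matched named subgroups, keyed by subgroup name (an empty dict if the pattern has no named subgroups)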
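To tie several of these entries together, here is a minimal sketch that compiles a pattern with a flag, uses named subgroups, and calls split() and sub() with their limiting arguments. The pattern, the sample name and the variable names are invented for illustration.

```python
import re

# Compile once and reuse; re.IGNORECASE is one of the optional flags.
name_re = re.compile(r'(?P<first>[a-z]+) (?P<last>[a-z]+)', re.IGNORECASE)

m = name_re.match('Ada Lovelace')
print(m.group())         # the entire match
print(m.group('last'))   # a named subgroup, looked up by name
print(m.groups())        # all subgroups as a tuple
print(m.groupdict())     # named subgroups as a dict

# maxsplit limits the number of splits; count limits the number of replacements.
parts = re.split(r':', 'a:b:c:d', maxsplit=2)
masked = re.sub(r'\d', '#', 'a1b2c3', count=2)
print(parts, masked)     # ['a', 'b', 'c:d'] a#b#c3
```

Passing maxsplit and count as keyword arguments keeps the calls readable and avoids confusing them with the flags argument.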

Matching strings with match()

Success:

>>> import re
>>> m = re.match('foo', 'foo')
>>> if m is not None:
...     m.group()
...
'foo'

Failure:

>>> m = re.match('foo', 'bar')
>>> if m is not None: m.group()
...
>>>

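search() compared with match()

>>> m = re.match('foo', 'seafoo')     # match fails
>>> if m is not None: m.group()
...
>>>
>>> m = re.search('foo', 'seafoo')    # search succeeds where match failed
>>> if m is not None: m.group()
...
'foo'
>>>

Here match() fails because it only tries to match the pattern at the start of the string: the 'f' in the pattern is compared with the first character of the string, 's', so the match is bound to fail. search(), by contrast, scans the string strictly from left to right and finds the first position at which the pattern occurs.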
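The difference can be condensed into a small helper. This is only a sketch: describe() is a hypothetical function written for this comparison, not part of the re module.

```python
import re

def describe(pattern, text):
    """Return what match() and search() each find, as strings or None."""
    m = re.match(pattern, text)
    s = re.search(pattern, text)
    return (m.group() if m else None, s.group() if s else None)

at_start = describe('foo', 'food')
in_middle = describe('foo', 'seafoo')
print(at_start)    # ('foo', 'foo')
print(in_middle)   # (None, 'foo')

# The match object also records where the pattern was found.
where = re.search('foo', 'seafoo').span()
print(where)       # (3, 6)
```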

V. Matching more than one string

>>> bt = 'bat|bet|bit'                  # regular expression pattern: bat, bet or bit
>>> m = re.match(bt, 'bat')             # 'bat' is a match
>>> if m is not None: m.group()
...
'bat'
>>>
>>> m = re.match(bt, 'he bit me !')     # no match at the start of the string
>>> if m is not None: m.group()
...
>>>
>>> m = re.search(bt, 'he bit me !')    # search finds 'bit'
>>> if m is not None: m.group()
...
'bit'
>>>

Matching any single character

>>> anyend = '.end'
>>> m = re.match(anyend, 'bend')          # the dot matches 'b'
>>> if m is not None: m.group()
...
'bend'
>>>
>>> m = re.match(anyend, '\nbend')        # the dot matches any character except \n
>>> if m is not None: m.group()
...
>>>
>>> m = re.search(anyend, 'The end !')    # in the search, the dot matches ' '
>>> if m is not None: m.group()
...
' end'
>>>
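Two related points are worth a quick sketch: a dot that is meant literally has to be escaped, and the re.DOTALL flag makes the dot match a newline as well. The strings '3014' and '\nend' are made-up test inputs.

```python
import re

# An unescaped dot matches any character, so '3.14' also accepts '3014'.
loose_hit = bool(re.match(r'3.14', '3014'))
# Escaping the dot restricts the match to a literal decimal point.
strict_miss = bool(re.match(r'3\.14', '3014'))
strict_hit = bool(re.match(r'3\.14', '3.14'))
print(loose_hit, strict_miss, strict_hit)  # True False True

# By default the dot does not match a newline; re.DOTALL lifts that limit.
default_dot = re.match(r'.end', '\nend')
dotall_dot = re.match(r'.end', '\nend', re.DOTALL)
print(default_dot)                 # None
print(repr(dotall_dot.group()))    # '\nend'
```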

VI. Creating character sets with []

>>> pat = '[cr][23][dp][o2]'
>>> m = re.match(pat, 'c3po')    # matches 'c3po'
>>> if m is not None: m.group()
...
'c3po'
>>>
>>> m = re.match(pat, 'c2do')    # matches 'c2do'
>>> if m is not None: m.group()
...
'c2do'
>>>
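A character-set pattern accepts every combination of its positions, which is often more than intended. The sketch below contrasts it with an explicit alternation; the candidate list is invented for illustration.

```python
import re

loose = re.compile(r'[cr][23][dp][o2]')   # any combination of the four sets
strict = re.compile(r'r2d2|c3po')         # exactly these two strings

candidates = ['c3po', 'r2d2', 'c2do', 'r3p2', 'c3p0']
loose_hits = [s for s in candidates if loose.fullmatch(s)]
strict_hits = [s for s in candidates if strict.fullmatch(s)]

print(loose_hits)    # ['c3po', 'r2d2', 'c2do', 'r3p2']
print(strict_hits)   # ['c3po', 'r2d2']
```

'c3p0' is rejected by both patterns because the digit 0 is not in the last set [o2].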

VII. Repetition, special characters and grouping

>>> patt = r'\w+@(\w+\.)?\w+\.com'
>>> re.match(patt, 'nobody@xxx.com').group()
'nobody@xxx.com'
>>> re.match(patt, 'nobody@www.xxx.com').group()
'nobody@www.xxx.com'
>>> patt = r'\w+@(\w+\.)*\w+\.com'
>>> re.match(patt, 'nobody@www.xxx.yyy.zzz.com').group()
'nobody@www.xxx.yyy.zzz.com'
>>> m = re.match(r'\w\w\w-\d\d\d', 'abc-123')
>>> if m is not None: m.group()
...
'abc-123'
>>>
>>> m = re.match(r'\w\w\w-\d\d\d', 'abc-xyz')
>>> if m is not None: m.group()
...
>>>
>>> m = re.match(r'(\w\w\w)-(\d\d\d)', 'abc-123')
>>> if m is not None: m.group()    # entire match
...
'abc-123'
>>> m.group(1)                     # subgroup 1
'abc'
>>> m.group(2)                     # subgroup 2
'123'
>>> m.groups()
('abc', '123')
>>>
>>> m = re.match('(a(b))', 'ab')   # two subgroups
>>> m.group()                      # entire match
'ab'
>>> m.group(1)                     # subgroup 1
'ab'
>>> m.group(2)                     # subgroup 2
'b'
>>> m.groups()
('ab', 'b')
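One detail the examples above do not show is what happens to a subgroup that takes no part in the match. The following sketch adds extra parentheses to the e-mail pattern (a variation made up for this illustration) to make that visible: subgroups are numbered by their opening parenthesis, and an unused optional subgroup is reported as None unless a default is supplied.

```python
import re

# Groups are numbered left to right by opening parenthesis:
# 1 = user, 2 = optional "host." part, 3 = host inside group 2, 4 = domain.
patt = re.compile(r'(\w+)@((\w+)\.)?(\w+)\.com')

short = patt.match('nobody@xxx.com')
full = patt.match('nobody@www.xxx.com')

print(short.groups())     # ('nobody', None, None, 'xxx')
print(short.groups(''))   # ('nobody', '', '', 'xxx')
print(full.groups())      # ('nobody', 'www.', 'www', 'xxx')
```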

VIII. Matching the start and end of a string, and word boundaries

>>> m = re.search('^The', 'The end.')           # matches
>>> if m is not None: m.group()
...
'The'
>>>
>>> m = re.search('^The', 'end. The')           # not at the start, so no match
>>> if m is not None: m.group()
...
>>> m = re.search(r'\bthe', 'bite the dog')     # at a word boundary
>>> if m is not None: m.group()
...
'the'
>>>
>>> m = re.search(r'\Bthe', 'bitethe dog')      # \B: not at a word boundary
>>> if m is not None: m.group()
...
'the'
>>>
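The r prefix in the \b examples matters. As a short sketch of why, the code below runs the same search with and without a raw string, and then contrasts ^ with \A under the re.MULTILINE flag. The two-line sample string is made up.

```python
import re

text = 'bite the dog'
# Without the r prefix, Python turns '\b' into a backspace character,
# so the regex engine never sees a word-boundary assertion.
plain = re.search('\bthe', text)
raw = re.search(r'\bthe', text)
print(plain)          # None
print(raw.group())    # the

lines = 'one\ntwo'
# With re.MULTILINE, ^ matches at the start of every line ...
caret_hits = re.findall(r'^\w+', lines, re.MULTILINE)
# ... while \A still matches only at the very start of the string.
anchor_hits = re.findall(r'\A\w+', lines, re.MULTILINE)
print(caret_hits, anchor_hits)  # ['one', 'two'] ['one']
```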

IX. Finding every occurrence with findall() and finditer()

>>> re.findall('car', 'car')
['car']
>>> re.findall('car', 'scary')
['car']
>>> re.findall('car', 'car carry scary to the car')
['car', 'car', 'car', 'car']
>>> s = 'This and that'
>>> re.findall(r'((Th\w+) and (th\w+))', s, re.I)    # re.I ignores case
[('This and that', 'This', 'that')]
>>> next(re.finditer(r'((Th\w+) and (th\w+))', s, re.I)).groups()
('This and that', 'This', 'that')
>>> next(re.finditer(r'((Th\w+) and (th\w+))', s, re.I)).group(1)
'This and that'
>>> next(re.finditer(r'((Th\w+) and (th\w+))', s, re.I)).group(2)
'This'
>>> next(re.finditer(r'((Th\w+) and (th\w+))', s, re.I)).group(3)
'that'
>>> [g.groups() for g in re.finditer(r'((Th\w+) and (th\w+))', s, re.I)]
[('This and that', 'This', 'that')]
>>>
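Because finditer() yields match objects rather than strings, it can also report where each occurrence was found. A minimal sketch, reusing the 'car' sentence from above; the variable names are arbitrary.

```python
import re

text = 'car carry scary to the car'

# Each match object knows its own start and end offsets in the string.
positions = [(m.start(), m.end()) for m in re.finditer('car', text)]
print(positions)   # [(0, 3), (4, 7), (11, 14), (23, 26)]

# next() with a default avoids StopIteration when there is no match at all.
first = next(re.finditer('bus', text), None)
print(first)       # None
```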

X. Searching and replacing with sub() and subn()

>>> re.sub('X', 'Mr.Smith', 'attn: X\n\n Dear X,\n ')    # replace every X in the string with Mr.Smith
'attn: Mr.Smith\n\n Dear Mr.Smith,\n '
>>> re.subn('X', 'Mr.Smith', 'attn: X\n\n Dear X,\n ')
('attn: Mr.Smith\n\n Dear Mr.Smith,\n ', 2)
>>> print(re.sub('X', 'Mr.Smith', 'attn: X\n\n Dear X,\n '))
attn: Mr.Smith

 Dear Mr.Smith,

>>> re.subn(r'[ae]', 'X', 'abcdef')    # replace every a or e with X; the 2 is the number of substitutions
('XbcdXf', 2)
>>> re.sub(r'[ae]', 'X', 'abcdef')
'XbcdXf'
>>> re.sub(r'(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})', r'\2/\1/\3', '2/20/1991')
'20/2/1991'
>>>

Besides extracting matched groups with the match object's group() method, you can refer to a group inside the replacement string as \N, where N is the group number.
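The date example has a subtlety worth sketching: in (\d{2}|\d{4}) the two-digit alternative is tried first, so the third group captures only '19' and the trailing '91' is simply left in place by sub(). The code below shows this, then a variant with the alternatives reversed, named groups, and a replacement function. date_re and to_iso are hypothetical names introduced for this illustration.

```python
import re

# As written in the example, group 3 stops after two digits.
third = re.match(r'(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})', '2/20/1991').group(3)
print(third)  # 19

# Trying \d{4} first captures the full year; named groups aid readability.
date_re = re.compile(r'(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d{4}|\d{2})')

# \g<name> refers to a named group inside the replacement string.
swapped = date_re.sub(r'\g<day>/\g<month>/\g<year>', '2/20/1991')
print(swapped)  # 20/2/1991

def to_iso(m):
    """Build a year-month-day string from one match object."""
    return '{}-{:02d}-{:02d}'.format(
        m.group('year'), int(m.group('month')), int(m.group('day')))

# repl may also be a function that receives each match object.
iso = date_re.sub(to_iso, 'due 2/20/1991')
print(iso)  # due 1991-02-20
```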

XI. Splitting strings with split() on a restrictive pattern

>>> re.split(':', 'str1:str2:str3')
['str1', 'str2', 'str3']
>>> import re
>>> DATA = (
...     'Mountain View, CA 94040',
...     'Sunnyvale, CA',
...     'Los Altos, 94023',
...     'Cupertino 95014',
...     'China CN ')
>>> for datum in DATA:
...     print(re.split(r', |(?= (?:\d{5}|[A-Z]{2})) ', datum))
...
['Mountain View', 'CA', '94040']
['Sunnyvale', 'CA']
['Los Altos', '94023']
['Cupertino', '95014']
['China', 'CN ']
>>>
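Two further behaviors of split() are sketched below, using made-up input strings: a pattern can absorb optional whitespace around the delimiter, and wrapping the pattern in a capturing group keeps the delimiters in the result list.

```python
import re

# Split on a comma or semicolon, swallowing any surrounding whitespace.
fields = re.split(r'\s*[,;]\s*', 'a , b;c,  d')
print(fields)      # ['a', 'b', 'c', 'd']

# With a capturing group, the matched delimiters are returned as well.
with_seps = re.split(r'(\s*[,;]\s*)', 'a, b;c')
print(with_seps)   # ['a', ', ', 'b', ';', 'c']
```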
