Regular Expressions

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 14:58

References:

(1) https://docs.python.org/2/library/re.html 
(2) http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html

Note: 'ex' means example.

I: Getting started

  • import re 
    re.findall(r'\d{5}',"i love data 10000 year") 
    ['10000']

  • pattern=re.compile(r'\d{11}') 
    pattern.findall("my mobile number is 15967110036") 
    ['15967110036']
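
The point of compiling a pattern is that it can be reused. Below is a minimal sketch, written for Python 3 (the snippets in this post target Python 2), that applies one compiled pattern to several strings. The sample `lines` are made up for illustration and are not from the original post.

```python
import re

# Hypothetical sample inputs, invented for this illustration.
lines = [
    "my mobile number is 15967110036",
    "no digits here",
    "call 13800000000 or 13900000000",
]

# Compile once; the raw string keeps the backslash in \d intact.
mobile = re.compile(r'\d{11}')

# Reuse the compiled pattern on every line.
found = [mobile.findall(line) for line in lines]
print(found)  # [['15967110036'], [], ['13800000000', '13900000000']]
```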

II: Character classes

"." matches any character except a newline by default; in DOTALL mode it matches a newline as well 
ex: 
re.findall('.','\n') 
[] 
re.findall('.','\n',re.DOTALL) 
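['\n']

"\d" digit: [0-9] 
"\D" non-digit: [^0-9] 
"\s" whitespace: [ \t\n\r\f\v] (remember: \r carriage return, \f form feed, \v vertical tab) 
"\S" non-whitespace: [^ \t\n\r\f\v] 
"\w" word character: [a-zA-Z0-9_] 
"\W" non-word character: [^a-zA-Z0-9_] 
"[]" custom character set, e.g. [1-4abc]; members are listed without separators, so a comma inside the brackets matches a literal comma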

ex: 
re.findall(r'\w{1,5}',"i love data 10000 years") 
['i', 'love', 'data', '10000', 'years']
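
A few more cases may make the classes above easier to remember. This is a small sketch for Python 3 with invented sample strings; it shows that `\w` includes the underscore and that a comma inside `[]` is an ordinary member of the set.

```python
import re

text = "user_name-42, id:7"  # hypothetical sample string

# \w covers letters, digits and the underscore, so "user_name" stays whole.
words = re.findall(r'\w+', text)

# \D+ grabs runs of non-digit characters.
non_digits = re.findall(r'\D+', text)

# Inside [] a comma is a literal character, not a separator.
custom = re.findall(r'[1-4,a]', "a,b,5,3")

print(words)       # ['user_name', '42', 'id', '7']
print(non_digits)  # ['user_name-', ', id:']
print(custom)      # ['a', ',', ',', ',', '3']
```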

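III: Quantifiers

"*" 0 or more repetitions 
"+" 1 or more repetitions 
"?" 0 or 1 repetition 
{m} exactly m repetitions 
{m,n} from m to n repetitions

By default these quantifiers are greedy and match as much as possible. Appending "?" makes them non-greedy, so they match as little as possible: 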
"*?" 
"+?" 
"??"

ex: 
re.findall(r'&.*&',"&i& love data 10000 &years&") 
['&i& love data 10000 &years&'] 
re.findall(r'&.*?&',"&i& love data 10000 &years&") 
['&i&', '&years&']
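
The counted forms `{m}` and `{m,n}` have no example above, so here is a short sketch for Python 3. The sample strings are invented; `\b` is used only to keep each run of letters separate.

```python
import re

text = "a aa aaa aaaa"  # hypothetical sample string

# {2}: exactly two 'a' characters forming a whole word.
exactly_two = re.findall(r'\ba{2}\b', text)

# {2,3}: two or three 'a' characters forming a whole word.
two_to_three = re.findall(r'\ba{2,3}\b', text)

# ?: the 'u' is optional.
optional = re.findall(r'colou?r', "color colour")

# Greedy versus non-greedy on the same input.
greedy = re.findall(r'<.+>', "<b>bold</b>")
lazy = re.findall(r'<.+?>', "<b>bold</b>")

print(exactly_two)   # ['aa']
print(two_to_three)  # ['aa', 'aaa']
print(optional)      # ['color', 'colour']
print(greedy)        # ['<b>bold</b>']
print(lazy)          # ['<b>', '</b>']
```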

IV: Anchors and boundaries

"^" matches the start of the string, and in MULTILINE mode also matches immediately after each newline. 
"\A" matches only at the start of the string. 
"$" matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. 
"\Z" matches only at the end of the string. 
"\b" matches a word boundary (the empty position at the start or end of a word). 
"\B" matches a position that is not a word boundary.

Difference between \b and \A: 
ex: 
re.findall(r'\bdata','.datailovc') 
['data'] 
re.findall(r'\Adata','.datailovc') 
[] 
re.findall(r'data\b','.ilovcdata') 
['data']

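Difference between '^' and '\A', and between '$' and '\Z'

Outside MULTILINE mode, '^' and '\A' behave identically. '$' and '\Z' differ only in that '$' also matches just before a newline at the end of the string. 
ex: 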
re.findall(r'\Aa','abc') 
['a'] 
re.findall(r'^a','abc') 
['a']

re.findall(r'^a','abc\nabc',re.M) 
['a', 'a'] 
re.findall(r'\Aa','abc\nabc',re.M) 
['a']
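
The examples above cover '^' and '\A'. The same comparison for '$' and '\Z' can be sketched as follows (Python 3, invented sample strings). The case with a trailing newline is where the two differ even without MULTILINE.

```python
import re

s = "abc\n"  # hypothetical sample with a trailing newline

# '$' matches at the very end and also just before a trailing newline.
dollar = re.findall(r'c$', s)

# '\Z' matches only at the very end, so the trailing newline blocks it.
z_only = re.findall(r'c\Z', s)

# In MULTILINE mode '$' also matches before every newline; '\Z' does not.
dollar_multi = re.findall(r'c$', "abc\nabc", re.M)
z_multi = re.findall(r'c\Z', "abc\nabc", re.M)

print(dollar)        # ['c']
print(z_only)        # []
print(dollar_multi)  # ['c', 'c']
print(z_multi)       # ['c']
```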

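V: Groups and extension notation

(...) capturing group 
(?iLmsux) inline flags 
(?:...) non-capturing group 
(?P<name>...) named group 
(?P=name) backreference to a named group 
(?#...) comment 
(?=...) positive lookahead 
(?!...) negative lookahead 
(?<=...) positive lookbehind 
(?<!...) negative lookbehind

Difference between () and (?:): 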
ex: 
re.search(r'(my) mobile number is (\d{11})','my mobile number is 15967110036').group(0) 
'my mobile number is 15967110036' 
re.search(r'(my) mobile number is (\d{11})','my mobile number is 15967110036').group(1) 
'my' 
re.search(r'(my) mobile number is (\d{11})','my mobile number is 15967110036').group(2) 
'15967110036'

re.search(r'(?:my) mobile number is (\d{11})','my mobile number is 15967110036').group(1) 
'15967110036'

(?#...): a comment; its contents are ignored 
re.search(r'my(?#this is comment) mobile number is (\d{11})','my mobile number is 15967110036').group(0) 
'my mobile number is 15967110036'

(?iLmsux): inline flags; (?i) below turns on case-insensitive matching 
re.search(r"(?i)L{3}123","lll123").group(0) 
'lll123'

(?P<name>...): named group 
re.search(r'(?P<wode>my) mobile number is (\d{11})','my mobile number is 15967110036').group(1) 
'my' 
re.search(r'(?P<wode>my) mobile number is (\d{11})','my mobile number is 15967110036').group("wode") 
'my'
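
When a pattern has several groups, it is often handier to fetch them all at once. A minimal sketch for Python 3 follows; the group names `owner` and `number` are chosen here for illustration and do not appear in the original post.

```python
import re

m = re.search(
    r'(?P<owner>my) mobile number is (?P<number>\d{11})',
    'my mobile number is 15967110036',
)

# groups() returns every captured group as a tuple, in order.
all_groups = m.groups()

# groupdict() returns only the named groups, keyed by name.
named = m.groupdict()

print(all_groups)         # ('my', '15967110036')
print(named['number'])    # 15967110036
print(m.group('owner'))   # my
```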

(?P=name): matches whatever the named group matched earlier 
ex: 
re.search(r"(?P<fuhao>\d{3})daf(?P=fuhao)","123daf123").group(0) 
'123daf123'

(?!...) 
ex: 
re.findall(r'1(?!2)\d{10}',"my accountMobile is 15967110036") 
['15967110036']

(?<=...) 
ex 
re.findall(r'(?<=\s)\d{11}',"my accountMobile is 15967110036") 
['15967110036']

(?<!...) 
ex 
re.findall(r'(?<!\s)\d{11}',"my accountMobile is 15967110036") 
[]
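
The list at the top of this section also mentions (?=...), which has no example yet. Below is a sketch for Python 3 that puts all four lookaround forms side by side; the sample strings `prices` and `order` are invented for illustration.

```python
import re

prices = "100USD, 200EUR, 300USD"  # hypothetical sample
order = "cost $15, qty 3"          # hypothetical sample

# (?=USD): the digits must be followed by "USD", which is not consumed.
usd = re.findall(r'\d+(?=USD)', prices)

# (?!\d|USD): \b anchors at the start of a number, and the \d alternative
# stops the engine from backtracking to a shorter run of digits.
not_usd = re.findall(r'\b\d+(?!\d|USD)', prices)

# (?<=\$): the digits must be preceded by a dollar sign.
after_dollar = re.findall(r'(?<=\$)\d+', order)

# (?<!\$): a whole number that is not preceded by a dollar sign.
no_dollar = re.findall(r'(?<!\$)\b\d+', order)

print(usd)           # ['100', '300']
print(not_usd)       # ['200']
print(after_dollar)  # ['15']
print(no_dollar)     # ['3']
```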

(?(id/name)yes-pattern|no-pattern) 
ex: 
print(re.search(r'(\d{2})abc(?(1)\d|abc)',"12abc3").group(0)) 
12abc3
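
In the example above group 1 always matches, so only the yes-pattern is ever used. The conditional is more interesting when the group is optional. The sketch below (Python 3) is modelled on the e-mail example in the Python `re` documentation, with `^` and `$` added here as an assumption to force a whole-string match; the helper name `is_wrapped_ok` is made up.

```python
import re

# Group 1 optionally matches '<'. If it matched, a closing '>' is required;
# if it did not, nothing extra is required (the no-pattern is omitted).
pattern = re.compile(r'^(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)$')

def is_wrapped_ok(s):
    """Return True when the angle brackets are balanced or absent."""
    return pattern.match(s) is not None

candidates = ['<user@host.com>', 'user@host.com', '<user@host.com', 'user@host.com>']
verdicts = [is_wrapped_ok(c) for c in candidates]
print(verdicts)  # [True, True, False, False]
```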

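VI: Flags

re.IGNORECASE: ignore case 
re.LOCALE: make \w, \W, \b, \B, \s and \S dependent on the current locale (see /usr/share/i18n/locales) 
re.MULTILINE: multi-line mode; '^' and '$' also match at the start and end of each line 
re.DOTALL: make '.' match any character, including a newline 
re.UNICODE: make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database (Unicode is a character encoding standard, maintained by an international consortium, that covers the world's scripts and symbols) 
re.VERBOSE: allow whitespace and comments inside the pattern; the two patterns below are equivalent 
a = re.compile(r"""\d + # the integral part 
\. # the decimal point 
\d * # some fractional digits""", re.X) 
b = re.compile(r"\d+\.\d*")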
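
To check that a verbose pattern and its compact form really agree, and to see flags combined with `|`, here is a small sketch for Python 3. The escaped dot `\.` stands for a literal decimal point; the names `verbose`, `compact` and the sample inputs are chosen for this illustration.

```python
import re

verbose = re.compile(r"""\d +   # the integral part
                         \.     # the decimal point
                         \d *   # some fractional digits""", re.X)
compact = re.compile(r"\d+\.\d*")

samples = ["3.14", "42.", "abc", "7"]  # hypothetical inputs
agreement = [(bool(verbose.match(s)), bool(compact.match(s))) for s in samples]
print(agreement)  # [(True, True), (True, True), (False, False), (False, False)]

# Flags can be combined with |: case-insensitive and multi-line together.
hits = re.findall(r'^data.*$', "Data one\nDATA two\nother", re.I | re.M)
print(hits)  # ['Data one', 'DATA two']
```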

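VII: Functions and examples

1: Difference between search and match 
match only checks for a match at the beginning of the string; search scans the whole string and can match at any position. 
For example: 
if(re.match('b','abc')):print(0) 
... 
if(re.search('b','abc')):print(0) 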
... 
0
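
The same comparison can be made by looking at the returned objects directly. A minimal sketch for Python 3, with invented sample strings, follows. It also shows that MULTILINE does not change where match is allowed to start.

```python
import re

# match is anchored at position 0, so 'b' is not found.
m = re.match('b', 'abc')

# search scans forward and reports where the match begins.
s = re.search('b', 'abc')

# Even in MULTILINE mode, match only tries the beginning of the string.
m_multi = re.match('b', 'a\nb', re.M)
s_multi = re.search('^b', 'a\nb', re.M)

print(m)                # None
print(s.span())         # (1, 2)
print(m_multi)          # None
print(s_multi.start())  # 2
```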

2: Using re.split() 
When the pattern contains a capturing group, the matched separators are returned as well. 
re.split(r'\W+','Words,words,words') 
['Words', 'words', 'words'] 
re.split(r'(\W+)','Words,words,words') 
['Words', ',', 'words', ',', 'words'] 
re.split(r'(\W+)','Words,words,words',maxsplit=1) 
['Words', ',', 'words,words']
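
One thing worth knowing is that a separator at the very start or end of the string produces empty strings in the result. The sketch below (Python 3, sample strings invented) shows this and one common way to drop them.

```python
import re

text = "...words, words..."  # hypothetical sample

# Separators at both ends yield empty strings at both ends of the list.
parts = re.split(r'\W+', text)

# Filter the empty strings out if they are not wanted.
cleaned = [p for p in parts if p]

# maxsplit limits the number of splits; the remainder is kept as one piece.
limited = re.split(r'\W+', "a, b, c, d", maxsplit=2)

print(parts)    # ['', 'words', 'words', '']
print(cleaned)  # ['words', 'words']
print(limited)  # ['a', 'b', 'c, d']
```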

3: Using re.sub and re.subn

re.sub(r"abc","123","abcabc") 
'123123' 
re.sub(r"a","123","abcabc") 
'123bc123bc'

re.subn(r"a","123","abcabc") 
('123bc123bc', 2) 
re.subn(r"a","123","aacabc") 
('123123c123bc', 3)
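
re.sub can do more than substitute a fixed string. The following sketch for Python 3 shows a backreference in the replacement, a function used as the replacement, and the `count` argument. The helper `double` and the sample strings are made up for illustration.

```python
import re

# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', "hello world")

def double(match):
    """Replacement function: receives the match object, returns a string."""
    return str(int(match.group(0)) * 2)

# A callable replacement is invoked once per match.
doubled = re.sub(r'\d+', double, "3 apples and 10 pears")

# count limits how many occurrences are replaced.
first_only = re.sub(r'a', '123', 'abcabc', count=1)

print(swapped)     # world hello
print(doubled)     # 6 apples and 20 pears
print(first_only)  # 123bcabc
```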
