Regular Expressions

Source: Internet. Published by: 美国大选数据分析. Editor: 程序博客网. Time: 2024/06/02 00:02

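Some matching concepts

Wildcard:

'.ython' matches any string made of one arbitrary character followed by "ython", such as 'python', '+ython', and ' ython'.

Here . is the wildcard. It matches exactly one character (any character except a newline); it does not match zero characters, or two or more.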
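The wildcard behaviour can be checked with a small sketch. The candidate list and the variable names are made up for illustration, and fullmatch() is used here (an assumption of Python 3.4 or later) so that the whole string has to fit the pattern.

```python
import re

# '.' stands for exactly one character (any character except a newline).
wildcard = re.compile(r'.ython')

candidates = ['python', '+ython', ' ython', 'ython', 'xxython']

# fullmatch() succeeds only if the entire string fits the pattern.
wildcard_hits = [s for s in candidates if wildcard.fullmatch(s)]
print(wildcard_hits)  # ['python', '+ython', ' ython']
```

'ython' is rejected because the dot needs one character to consume, and 'xxython' is rejected because the dot consumes only one.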

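Escape characters:

To match 'python.org' literally, use 'python\\.org'.

Why \\ rather than \ ? There are two levels of escaping: the Python interpreter processes the string literal first, and then the re module processes the pattern. In an ordinary string literal, \\ becomes a single backslash, which re then reads as an escaped dot.

In practice it is simpler to use a raw string: r'python\.org'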
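A short sketch of the two spellings, and of what goes wrong when the dot is left unescaped. The test string 'pythonXorg' is only an illustrative value.

```python
import re

# Both literals produce the same pattern text: python\.org
escaped = 'python\\.org'
raw = r'python\.org'
same_pattern = (escaped == raw)

# An unescaped dot is still a wildcard, so it also accepts 'pythonXorg'.
loose = re.match('python.org', 'pythonXorg') is not None
# The escaped dot only accepts a literal '.'.
strict = re.match(raw, 'pythonXorg') is not None

print(same_pattern, loose, strict)  # True True False
```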

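Character sets:

[a-z]: matches any single character from a to z

[a-zA-Z0-9]: matches any single uppercase letter, lowercase letter, or digit

[^abc]: matches any single character other than a, b, or c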
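The three character sets above can be tried out with re.findall(), which returns every non-overlapping match. The sample strings are arbitrary.

```python
import re

# Each character set matches one character at a time.
lower = re.findall(r'[a-z]', 'aB3_z')
alnum = re.findall(r'[a-zA-Z0-9]', 'aB3_z')
not_abc = re.findall(r'[^abc]', 'abcd1')

print(lower)    # ['a', 'z']
print(alnum)    # ['a', 'B', '3', 'z']
print(not_abc)  # ['d', '1']
```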

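Alternation and subpatterns:

Python|Perl: | is the pipe symbol. It matches either alternative, so the two sides are in an "or" relationship

P(ython|erl): this works like factoring in arithmetic, as in 1*(1+1). The | inside is still the pipe, and wrapping the alternatives in () makes them a subpattern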
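A sketch of why the parentheses matter, using made-up test words. Without them, | splits the whole pattern rather than just the part after P.

```python
import re

grouped = re.compile(r'P(ython|erl)')
words = ['Python', 'Perl', 'Pearl', 'P']
matched = [w for w in words if grouped.fullmatch(w)]
print(matched)  # ['Python', 'Perl']

# Without the subpattern, the alternatives are 'Python' and 'erl',
# so the bare string 'erl' is accepted.
ungrouped_accepts_erl = re.fullmatch(r'Python|erl', 'erl') is not None
print(ungrouped_accepts_erl)  # True
```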

Repeating a subpattern:

(pattern)*: zero or more times

(pattern)+: one or more times

(pattern){m,n}: from m to n times
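The three repetition operators can be compared on a small subpattern. The subpattern (ab) and the bounds 2 and 4 are arbitrary choices for this sketch.

```python
import re

# * accepts zero repetitions, + needs at least one.
star_on_empty = re.fullmatch(r'(ab)*', '') is not None
plus_on_empty = re.fullmatch(r'(ab)+', '') is not None
plus_on_two = re.fullmatch(r'(ab)+', 'abab') is not None
print(star_on_empty, plus_on_empty, plus_on_two)  # True False True

# {2,4} accepts between 2 and 4 repetitions, inclusive.
bounded = [n for n in range(6) if re.fullmatch(r'(ab){2,4}', 'ab' * n)]
print(bounded)  # [2, 3, 4]
```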

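Optional items:

r'(http://)?(www\.)?python\.org': whatever precedes a ? is optional, meaning that subpattern may appear 0 or 1 times.

Why is \\ not needed here? Note the leading r, which marks a raw string, so the backslash reaches the re module unchanged.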
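A quick check of which URLs the optional subpatterns accept. The sample list, including the ftp:// entry, is invented for illustration.

```python
import re

url = re.compile(r'(http://)?(www\.)?python\.org')

samples = [
    'http://www.python.org',
    'http://python.org',
    'www.python.org',
    'python.org',
    'ftp://python.org',  # the prefix is not one of the optional parts
]

# Each optional subpattern may be present or absent independently.
accepted = [s for s in samples if url.fullmatch(s)]
print(accepted)
```

The first four samples are accepted; the ftp:// one is not, because only http:// is listed as an optional prefix.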

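String methods that are often useful for simple matching in Python:

str.startswith(prefix[, start[, end]])

str.endswith(suffix[, start[, end]])

str.find(sub[, start[, end]])

When calling endswith() on a line read from a file, remember the trailing '\n' newline marker. If you call rstrip() on each line first, you no longer need to append '\n' to the keyword.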

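def findKey(fname):
    # First pass: lines keep their newline, so the suffix must include '\n'.
    with open(fname) as file:
        for line in file:
            if line.startswith('xx') \
                    or line.endswith('xx\n'):  # or line[:-1].endswith('xx')
                print(line)
    # Second pass: strip trailing whitespace first, then compare.
    with open(fname) as file:
        for line in file:
            line = line.rstrip()
            if line.startswith('xx') or line.endswith('xx'):
                print(line)

findKey('test.txt')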

Match a string that starts with a letter or an underscore

# a = ''
a = '_value'
# a must not be empty, so test it first to avoid an IndexError on a[0]
boolean = bool(a) and (a[0] == '_' or 'a' <= a[0] <= 'z' or 'A' <= a[0] <= 'Z')
print(boolean)

Regular expressions in Python require the re module

import re

text = 'X\nxx python'
# re.I ignores case. A raw string (r'...') is the safer habit for patterns;
# here 'x\n' and r'x\n' both end up matching a newline.
# regular = re.compile('x\n', re.I)
regular = re.compile(r'x\n', re.I)  # build a reusable pattern object
# print(regular, type(regular))
result = regular.match(text)  # returns a match object, or None on failure
if result is None:
    print('Nothing found')
else:
    print(result.span())  # (start, end) indexes of the match

# one-off use, without compiling first
result = re.match(r'x', text)  # text is the target string

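In the code above, match() only tries to match at position 0. If the pattern does not match there, the match fails and result is None, which you can test for.

search(), as the name suggests, scans the string and stops at the first match it finds.

match() does not scan: it tests the pattern at the start of the string only, and fails if that does not match.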
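The difference between the two calls shows up as soon as the target is not at the start of the string. The sample text is arbitrary.

```python
import re

text = 'say python'

at_start = re.match(r'python', text)   # only tries position 0
anywhere = re.search(r'python', text)  # scans forward for the first hit

print(at_start)         # None
print(anywhere.span())  # (4, 10)
```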

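Writing out the regular expression for every match is tedious; re.compile() turns it into a pattern object that is more convenient to reuse.

For example, suppose a pattern object has already been created: pat = re.compile(r'[a-zA-Z0-9]')

Then pat.search(xstring) is equivalent to re.search(r'[a-zA-Z0-9]', xstring)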
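The equivalence can be confirmed directly. The value given to xstring here is a hypothetical sample, since the text does not define one.

```python
import re

pat = re.compile(r'[a-zA-Z0-9]')
xstring = '--> x42'  # hypothetical sample input

via_object = pat.search(xstring)                 # method on the compiled object
via_module = re.search(r'[a-zA-Z0-9]', xstring)  # module-level function

# Both calls find the same first alphanumeric character.
print(via_object.group(), via_object.span() == via_module.span())  # x True
```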

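Of course, for a one-off match you can skip re.compile(). Suppose I want to drop the extra separators and keep only the elements.

re.split(pattern, text) shows how:

import re

text = 'alpha,,,beta,, gama delta'
# split on runs of commas and/or spaces so that only the elements remain
print(re.split('[, ]+', text))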

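Introductory regular expression syntax

Matching a single character

1. Match a three-character string that starts with a and ends with z

result = re.match(r'a.z', 'a8z')
print(result.span())

. (dot): matches any single character except a newline (\n). Exactly one character, no more and no fewer.

So both of the following matches fail:

result = re.match(r'a.z', 'a\nz')
result = re.match(r'a.z', 'axxz')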

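2. Match a string that starts with a letter or a digit

result = re.match(r'[a-zA-Z0-9]', 'hello')
print(result.group())

[] (square brackets): match any single character listed inside the brackets

A second solution to the same problem:

result = re.match(r'[\w]', 'hello')
print(result.group())

\w: matches any single letter, digit, or underscore, that is, any one of A~Z, a~z, 0~9, _ (so unlike the first solution, it also accepts a leading underscore)

3. Match a string that starts with a single letter, digit, or underscore enclosed in square brackets []

result = re.match(r'\[[\w]\]', '[0]891')  # to match a literal [ or ], escape it with \
print(result.group())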

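Matching multiple characters

1. Match a string that starts with any number of letters, digits, or underscores

result = re.match(r'[\w]*', 'dsuio28A$$$$')
print(result.group())

* (asterisk): matches the preceding rule 0~n times

2. Match a valid Python variable name (it must start with an underscore or a letter, so that part has to appear at least once, which is what + expresses)

result = re.match(r'[_a-zA-Z]+[_\w]*', '_python')
print(result.group())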

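3. Match a number from 0 to 99

A first attempt might be:

result = re.match(r'[0-99]', '99')
print(result.group())

This matches only the first digit. [0-99] is still a character set (the range 0-9 plus a redundant 9), so it matches a single character, whereas a number from 0 to 99 may be one or two characters long:

result = re.match(r'[0-9]?[0-9]', '55')
print(result.group())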
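One caveat worth sketching: re.match() only anchors at the start, so the two-digit pattern still "matches" the first two digits of a longer number. The helper is_0_to_99 below is a hypothetical name, and it relies on fullmatch(), which assumes Python 3.4 or later.

```python
import re

two_digits = re.compile(r'[0-9]?[0-9]')

# match() is satisfied by a prefix, so '100' yields '10'.
prefix = two_digits.match('100').group()
print(prefix)  # 10

def is_0_to_99(s):
    """Return True if the whole string is one or two digits."""
    return two_digits.fullmatch(s) is not None

checks = [is_0_to_99(s) for s in ['0', '7', '55', '99', '100', '']]
print(checks)  # [True, True, True, True, False, False]
```

This version also accepts a leading zero such as '07'; whether that counts as valid depends on the use case.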

Some useful examples:

1. Replace *something* with <em>something</em>

pattern = r''
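The pattern on the line above is empty in the source. As a rough sketch only, one possible pattern for this task is shown below; the name emphasis, the pattern itself, and the sample sentence are assumptions, not the original author's code.

```python
import re

# Assumed pattern: a literal *, then one or more non-* characters
# captured as group 1, then a closing literal *.
emphasis = re.compile(r'\*([^*]+)\*')

# \1 in the replacement refers back to the captured text.
html = emphasis.sub(r'<em>\1</em>', 'Hello, *world*!')
print(html)  # Hello, <em>world</em>!
```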

0 0