Python Cookbook: Chapter 2, Strings and Text

Source: Internet. Editor: 程序博客网. Time: 2024/05/17 06:03

2.1 Splitting Strings on Any of Multiple Delimiters

Use re.split(), for example:

re.split(r'[;,\s]\s*', text)

If the pattern contains a capture group, the matched delimiters are also included in the result list. For example:

re.split(r'(;|,|\s)\s*', text)
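As a minimal sketch of both patterns (the sample string and variable names are made up for illustration), the following compares the result with and without a capture group, and adds a non-capturing group for the case where grouping is needed but the delimiters are not wanted:

```python
import re

# Hypothetical sample input with inconsistent delimiters.
line = 'asdf fjdk; afed, fjek,asdf, foo'

# Split on a semicolon, comma, or whitespace, followed by optional whitespace.
fields = re.split(r'[;,\s]\s*', line)
print(fields)

# With a capture group, the matched delimiters also appear in the result.
fields_with_delims = re.split(r'(;|,|\s)\s*', line)
print(fields_with_delims)

# A non-capturing group (?:...) groups the alternatives but drops the delimiters.
fields_noncapturing = re.split(r'(?:;|,|\s)\s*', line)
print(fields_noncapturing)
```

The values sit at the even indexes of the captured result and the delimiters at the odd indexes, which is useful when the string has to be reassembled later.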

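2.2 Matching Text at the Start or End of a String

Use str.startswith() and str.endswith(). Both methods also accept several choices at once, for example:

url.startswith(('http://', 'https://'))

Note that the choices must be passed as a tuple.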
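A short sketch of the tuple requirement, assuming made-up values for filename and url: passing a list raises TypeError, so a list of choices has to be converted with tuple() first.

```python
# Hypothetical sample values.
filename = 'spam.txt'
url = 'https://www.python.org'

ends_txt = filename.endswith('.txt')
is_web = url.startswith(('http://', 'https://'))
print(ends_txt, is_web)

# A list is rejected; only a str or a tuple of str is accepted.
choices = ['http://', 'ftp://']
error = None
try:
    url.startswith(choices)
except TypeError as exc:
    error = str(exc)
print('list rejected:', error)

# Converting the list to a tuple works.
matches_choice = url.startswith(tuple(choices))
print(matches_choice)
```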

2.3 Matching Strings Using Shell Wildcard Patterns

Use fnmatch.fnmatch() and fnmatch.fnmatchcase().

fnmatch('aaa.txt', '*.txt')
fnmatch('aaa.txt', 'aa?.txt')
fnmatch('aaa1.txt', 'aaa[0-9].txt')

fnmatchcase() matches exactly according to the letter case given in the pattern.
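The calls above can be run as follows. This is a small sketch with made-up file names; note that fnmatch() follows the case-sensitivity rules of the operating system, while fnmatchcase() always compares case exactly.

```python
from fnmatch import fnmatch, fnmatchcase

star = fnmatch('aaa.txt', '*.txt')            # * matches any run of characters
single = fnmatch('aaa.txt', 'aa?.txt')        # ? matches exactly one character
digit = fnmatch('aaa1.txt', 'aaa[0-9].txt')   # [0-9] matches one digit
print(star, single, digit)

# fnmatchcase() never folds case, on any platform.
case_sensitive = fnmatchcase('aaa.txt', '*.TXT')
print(case_sensitive)

# Filtering a list of (hypothetical) names with a wildcard pattern.
names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']
csv_files = [name for name in names if fnmatchcase(name, 'Dat*.csv')]
print(csv_files)
```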

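2.4 Matching and Searching for Text Patterns

The match() method of a compiled pattern object only tries to match at the beginning of the string. findall() returns every match in the string.

Capture groups can also be used:

import re
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
m = datepat.match('11/11/2011')
# group(0) is the whole match; group(1) to group(3) are the capture groups
for i in [0, 1, 2, 3]:
    print(m.group(i))

Output: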

11/11/2011
11
11
2011

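Note: patterns are best written as raw strings, so that backslashes are not interpreted by Python before they reach the regular expression engine.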
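To make the difference between match() and findall() concrete, here is a minimal sketch on a made-up sentence. When the pattern contains capture groups, findall() returns a list of tuples, one tuple per match.

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

# Hypothetical sample text; the dates are not at the start of the string.
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# match() only looks at the beginning of the string, so it finds nothing here.
at_start = datepat.match(text)
print(at_start)

# findall() scans the whole string and returns the captured groups per match.
all_dates = datepat.findall(text)
print(all_dates)

# The tuples can be unpacked and rearranged.
iso_dates = ['{}-{}-{}'.format(year, month, day) for month, day, year in all_dates]
print(iso_dates)
```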

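2.5 Searching and Replacing Text

For simple literal patterns, use str.replace() directly:

text.replace('str1', 'str2')

For more complicated patterns, use re.sub() with capture groups. In the replacement string, \1, \2 and \3 refer to the groups of the pattern:

re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)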
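A runnable sketch of both approaches on made-up text. It also uses re.subn(), which works like re.sub() but additionally returns the number of substitutions made.

```python
import re

# Simple literal replacement.
simple = 'yeah, but no, but yeah'.replace('yeah', 'yep')
print(simple)

# Pattern-based replacement: rewrite month/day/year as year-month-day.
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
iso = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)
print(iso)

# subn() returns a (new_string, count) pair.
new_text, count = re.subn(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)
print(count)
```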

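2.6 Searching and Replacing Case-Insensitive Text

Use the re module and pass the re.IGNORECASE flag to each operation.

For more complicated cases, such as making the replacement follow the case of the matched text, a support function can be used.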
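One way to write such a support function is sketched below. The name matchcase and the sample text are illustrative assumptions; the only library feature relied on is that re.sub() accepts a callable as the replacement and calls it with each match object.

```python
import re

# Hypothetical sample text with three different casings.
text = 'UPPER PYTHON, lower python, Mixed Python'

# Plain case-insensitive search and replace.
found = re.findall('python', text, flags=re.IGNORECASE)
replaced = re.sub('python', 'snake', text, flags=re.IGNORECASE)
print(found)
print(replaced)

def matchcase(word):
    """Return a callback that adapts `word` to the case of each match."""
    def replace(m):
        matched = m.group()
        if matched.isupper():
            return word.upper()
        if matched.islower():
            return word.lower()
        if matched[0].isupper():
            return word.capitalize()
        return word
    return replace

# re.sub() calls the callback once per match.
case_aware = re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)
print(case_aware)
```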

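2.7 Specifying a Regular Expression for the Shortest Match

Longest (greedy) match:

str_pat = re.compile(r'\"(.*)\"')

Shortest (non-greedy) match:

str_pat = re.compile(r'\"(.*?)\"')

The only change is the ? added after the *.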
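The effect is easiest to see on a string that contains two quoted parts. This is a small sketch with a made-up sentence:

```python
import re

# Hypothetical sample text with two quoted strings.
text = 'Computer says "no." Phone says "yes."'

# Greedy: .* runs to the last quote, so both quoted parts merge into one match.
greedy = re.findall(r'"(.*)"', text)
print(greedy)

# Non-greedy: .*? stops at the first closing quote it can.
lazy = re.findall(r'"(.*?)"', text)
print(lazy)
```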

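2.8 Writing a Regular Expression for Multiline Patterns

By default the dot (.) does not match a newline, so support for newlines has to be added to the pattern explicitly.

str_pat = re.compile(r'/\*((?:.|\n)*?)\*/')

(?:.|\n) specifies a non-capturing group.

Alternatively, pass the re.DOTALL flag, which makes the dot match every character, including newlines.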
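A minimal sketch comparing the three variants on a made-up C-style comment that spans two lines:

```python
import re

# Hypothetical sample: a comment containing a newline.
text = '/* this is a\nmultiline comment */'

# Without newline support the dot stops at '\n' and nothing matches.
plain = re.compile(r'/\*(.*?)\*/')
plain_result = plain.findall(text)
print(plain_result)

# Variant 1: allow either any character or a newline, in a non-capturing group.
noncapturing = re.compile(r'/\*((?:.|\n)*?)\*/')
noncapturing_result = noncapturing.findall(text)
print(noncapturing_result)

# Variant 2: keep the simple pattern and let re.DOTALL change what '.' means.
dotall = re.compile(r'/\*(.*?)\*/', re.DOTALL)
dotall_result = dotall.findall(text)
print(dotall_result)
```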

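2.9 Normalizing Unicode Text to a Standard Form

Use unicodedata.normalize().

NFC: characters are fully composed (a single code point where possible):

unicodedata.normalize('NFC', text)

NFD: characters are fully decomposed (a base character followed by combining characters):

unicodedata.normalize('NFD', text)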
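The sketch below shows why normalization matters: two strings that render identically can compare unequal until both are brought to the same form. The sample strings are made up; unicodedata.combining() is used at the end to drop the combining marks.

```python
import unicodedata

# The same visible text, encoded two ways.
s1 = 'Spicy Jalape\u00f1o'     # precomposed n-with-tilde (one code point)
s2 = 'Spicy Jalapen\u0303o'    # 'n' followed by a combining tilde (two code points)
print(s1 == s2, len(s1), len(s2))

# NFC composes, NFD decomposes; after normalizing, the strings compare equal.
nfc1 = unicodedata.normalize('NFC', s1)
nfc2 = unicodedata.normalize('NFC', s2)
nfd1 = unicodedata.normalize('NFD', s1)
nfd2 = unicodedata.normalize('NFD', s2)
print(nfc1 == nfc2, nfd1 == nfd2)

# Decomposing first makes it possible to strip the diacritical marks.
stripped = ''.join(c for c in nfd1 if not unicodedata.combining(c))
print(stripped)
```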

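2.10 Working with Unicode Characters in Regular Expressions

I did not really understand this one; I will come back to it when I have time.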

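2.11 Stripping Unwanted Characters from Strings

strip() removes characters from both the beginning and the end of a string; lstrip() removes them from the left only, and rstrip() from the right only. By default they strip whitespace, and none of them touches text in the middle of the string.

To change text in the middle, use replace() instead.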
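A short sketch with made-up strings. The last line uses re.sub() to collapse runs of inner whitespace, which goes slightly beyond the replace() approach described above and is included only as an option.

```python
import re

s = '  hello world \n'
both = s.strip()      # whitespace removed from both ends
left = s.lstrip()     # only the left side
right = s.rstrip()    # only the right side
print(repr(both), repr(left), repr(right))

# An argument lists the characters to strip instead of whitespace.
t = '-----hello====='
no_dashes = t.lstrip('-')
bare = t.strip('-=')
print(no_dashes, bare)

# Inner whitespace is untouched by strip(); use replace() or re.sub() for that.
inner = ' hello     world '.strip()
removed = inner.replace(' ', '')
collapsed = re.sub(r'\s+', ' ', inner)
print(repr(inner), removed, collapsed)
```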

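2.12 Sanitizing and Cleaning Up Text

For simple problems, use str.upper(), str.lower(), str.replace(), and re.sub().

For more complicated ones, use str.translate(): build a small translation table first, then pass it to the method.

A supplement first: Python provides the built-in functions ord() and chr() for converting between a character and its integer code (the ASCII code for ASCII characters, and more generally the Unicode code point).

>>> print(ord('a'))
97
>>> print(chr(97))
a

Then an example:

# Map each whitespace control character (by code point) to a plain space
remap = {
    ord('\t'): ' ',
    ord('\r'): ' ',
    ord('\n'): ' ',
}
a = s.translate(remap)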
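The remap example can be run as below on a made-up input string. As an alternative, str.maketrans() builds an equivalent table from two strings of equal length, and mapping a code point to None deletes that character.

```python
# Hypothetical input containing tab, carriage return, and newline characters.
s = 'python\tis\tawesome\r\n'

# Translation table keyed by code point.
remap = {
    ord('\t'): ' ',
    ord('\r'): ' ',
    ord('\n'): ' ',
}
cleaned = s.translate(remap)
print(repr(cleaned))

# The same table built with str.maketrans().
table = str.maketrans('\t\r\n', '   ')
via_table = s.translate(table)
print(repr(via_table))

# Mapping a code point to None removes the character entirely.
no_cr = s.translate({ord('\r'): None})
print(repr(no_cr))
```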

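2.13 Aligning Text Strings

str.ljust(), str.rjust(), and str.center() left-align, right-align, and center a string within a given width. Each accepts an optional fill character.

format() can also be used for alignment, with the <, >, and ^ characters (left, right, and center) followed by the width.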
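A quick sketch of both approaches, using a made-up string and a width of 20. Unlike the string methods, format() also works on numbers.

```python
text = 'Hello World'

# String methods: width first, optional fill character second.
left = text.ljust(20)
right = text.rjust(20, '=')
centered = text.center(20, '*')
print(repr(left), right, centered)

# format(): optional fill character, then <, > or ^, then the width.
fmt_right = format(text, '>20')
fmt_center = format(text, '*^20')
print(repr(fmt_right), fmt_center)

# The same alignment codes work in str.format() and on non-string values.
row = '{:>10s} {:>10s}'.format('Hello', 'World')
number = format(1.2345, '>10.2f')
print(repr(row), repr(number))
```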

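2.14 Combining and Concatenating Strings

For general string concatenation, use the join() method. When there are many fragments, + is not recommended: each + creates a new string and copies the data, which is inefficient.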
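A minimal sketch with made-up fragments. join() is called on the separator, and non-string items have to be converted first, which a generator expression does without building an intermediate list.

```python
# Hypothetical fragments.
parts = ['Is', 'Chicago', 'Not', 'Chicago?']

spaced = ' '.join(parts)
compact = ''.join(parts)
print(spaced)
print(compact)

# join() needs strings; convert other values on the fly.
data = ['ACME', 50, 91.1]
csv_line = ','.join(str(item) for item in data)
print(csv_line)
```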

2.15 Interpolating Variables in Strings

A few examples:

s = '{name} has {n} messages.'
s = s.format(name='zcy', n=10)
print(s)

name = 'zcy'
n = 10
print(vars())
s = '{name} has {n} messages.'
s = s.format_map(vars())
print(s)
# This code raised an error on my machine, probably a Python version issue:
# str.format_map() requires Python 3.2 or later.

sys._getframe(1) returns the caller's stack frame. Its local variables can be read through f_locals, but they should be treated as read-only.

# Class for performing safe substitutions
class safesub(dict):
    def __missing__(self, key):
        return '{%s}' % key

s = '{name} has {n} messages.'

# (a) Simple substitution
name = 'Guido'
n = 37
print(s.format_map(vars()))

# (b) Safe substitution with missing values
del n
print(s.format_map(safesub(vars())))

# (c) Safe substitution + frame hack
n = 37
import sys

def sub(text):
    return text.format_map(safesub(sys._getframe(1).f_locals))

print(sub('Hello {name}'))
print(sub('{name} has {n} messages'))
print(sub('Your favorite color is {color}'))

2.16 Reformatting Text to a Fixed Number of Columns

import textwrap
print(textwrap.fill(text, 45))
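A slightly fuller sketch, assuming a made-up paragraph: textwrap.fill() returns one string with newlines inserted so that no line exceeds the given width, and initial_indent and subsequent_indent control indentation.

```python
import textwrap

# Hypothetical long paragraph, written without hyphens or double spaces.
text = ("Look into my eyes, look into my eyes, the eyes, the eyes, "
        "the eyes, not around the eyes, don't look around the eyes, "
        "look into my eyes, you're under.")

# Wrap to at most 45 columns.
wrapped = textwrap.fill(text, 45)
print(wrapped)

# Indent only the first line.
first_indented = textwrap.fill(text, 40, initial_indent='    ')
print(first_indented)

# Indent every line except the first.
hanging = textwrap.fill(text, 40, subsequent_indent='    ')
print(hanging)
```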

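2.17 Handling HTML and XML Entities in Text

Use html.escape() to replace special characters with entities.

Use html.unescape() to turn entities back into the characters they represent. The older HTMLParser.unescape() method was deprecated and has been removed in Python 3.9.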
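A small round-trip sketch using only the standard html module (the sample strings are made up). By default html.escape() also escapes quotes; passing quote=False leaves them alone.

```python
import html

# Hypothetical text containing characters that are special in HTML.
s = 'Elements are written as "<tag>text</tag>".'

escaped = html.escape(s)
print(escaped)

# quote=False escapes only &, < and >.
escaped_no_quotes = html.escape(s, quote=False)
print(escaped_no_quotes)

# unescape() handles both named and numeric entities.
restored = html.unescape(escaped)
decoded = html.unescape('Spicy &quot;Jalape&#241;o&quot;.')
print(restored)
print(decoded)
```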

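2.18 Tokenizing Text

Turn the text into a sequence of (type, value) pairs. This is implemented with named capture groups in a regular expression.

import re
from collections import namedtuple

# One named group per token type; the group name becomes the token type
NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
TIMES = r'(?P<TIMES>\*)'
EQ = r'(?P<EQ>=)'
WS = r'(?P<WS>\s+)'

master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))

Token = namedtuple('Token', ['type', 'value'])

def generate_tokens(pat, text):
    # Each call to scanner.match() continues where the previous match ended
    scanner = pat.scanner(text)
    for m in iter(scanner.match, None):
        yield Token(m.lastgroup, m.group())

for tok in generate_tokens(master_pat, 'foo = 42'):
    print(tok)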

2.19 Writing a Simple Recursive Descent Parser

Too long; I did not read this one.

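2.20 Performing Text Operations on Byte Strings

Most operations work the same way, but regular expressions must themselves be given as byte strings.

For the differences between byte strings and text strings, see pp. 81-82.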
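A minimal sketch with a made-up byte string. It shows the familiar operations working on bytes, the requirement for a bytes pattern in re, and one of the differences from text strings: indexing a byte string yields an integer rather than a one-character string.

```python
import re

# Hypothetical byte string.
data = b'FOO:BAR,SPAM'

# Slicing, prefix tests, and splitting work as they do for str.
head = data[0:3]
has_prefix = data.startswith(b'FOO')
pieces = data.split(b':')
print(head, has_prefix, pieces)

# The pattern must be bytes too (note the rb prefix).
fields = re.split(rb'[:,]', data)
print(fields)

# A str pattern applied to bytes raises TypeError.
mixed_error = None
try:
    re.split(r'[:,]', data)
except TypeError as exc:
    mixed_error = str(exc)
print('str pattern rejected:', mixed_error)

# Indexing gives an int for bytes, a character for str.
first_byte = data[0]
first_char = data.decode('ascii')[0]
print(first_byte, first_char)
```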

0 0

python cookbook：第二章 字符串和文本

python cookbook：第二章字符串和文本