Ruby for Rails Best Practices XII

Source: Internet. Published by: 如何使用万德数据库. Editor: 程序博客网. Time: 2024/05/21 05:39

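Chapter 12: Regular Expressions and Regexp-Based String Operations

I. What Is a Regular Expression

Regular expressions can be used to scan a string for multiple occurrences of a pattern, to perform string substitution, and to split a string into substrings based on a matching delimiter.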
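
As a quick preview of these three uses, here is a minimal sketch built on the standard String methods scan, sub, and split. The sample string and the constant names are made up for illustration.

```ruby
# Hypothetical sample data, for illustration only.
DATES = "2024-05-21,2023-11-02"

# 1. Scan for every occurrence of a pattern (runs of four digits).
YEARS = DATES.scan(/\d{4}/)

# 2. Substitute: replace the first hyphen with a slash.
FIRST_SLASHED = DATES.sub(/-/, "/")

# 3. Split on a delimiter pattern (a comma or a hyphen).
DATE_PARTS = DATES.split(/[,-]/)

p YEARS          # ["2024", "2023"]
p FIRST_SLASHED  # "2024/05-21,2023-11-02"
p DATE_PARTS     # ["2024", "05", "21", "2023", "11", "02"]
```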

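II. Writing Regular Expressions

1. The literal constructor for regular expressions

(1) The literal constructor is a pair of forward slashes: //

(2) match can be used from either direction: both regular expression objects and string objects respond to the match method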

puts "Match!" if /abc/.match("The alphabet starts with abc.")

puts "Match!" if "The alphabet starts with abc.".match(/abc/)

Ruby also provides the pattern-matching operator =~ (an equals sign followed by a tilde)

puts "Match!" if /abc/ =~ "The alphabet starts with abc."

puts "Match!" if "The alphabet starts with abc." =~ /abc/

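(3) The main difference between match and =~ is what they return when there is a match:

=~ returns the numeric index in the string of the character where the match starts, whereas match returns an instance of the MatchData class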

>> "The alphabet starts with abc" =~ /abc/

=> 25

>> /abc/.match("The alphabet starts with abc.")

=> #<MatchData:0x1b0d88>
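
The sketch below puts the two return values side by side and also shows the failure case, where both forms return nil. The constant names are illustrative only.

```ruby
SENTENCE = "The alphabet starts with abc."

HIT_INDEX  = SENTENCE =~ /abc/        # Integer: index where the match starts
HIT_DATA   = SENTENCE.match(/abc/)    # MatchData instance
MISS_INDEX = SENTENCE =~ /xyz/        # nil: no match
MISS_DATA  = SENTENCE.match(/xyz/)    # nil: no match

p HIT_INDEX        # 25
p HIT_DATA.class   # MatchData
p HIT_DATA[0]      # "abc"
p MISS_INDEX       # nil
p MISS_DATA        # nil
```

Because nil is falsy and both an Integer and a MatchData object are truthy, either form works in an if test.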

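2. Building a pattern

(1) The components you can use to build a regular expression include the following:

■ Literal characters, meaning "match this character"

■ The dot wildcard (.), meaning "match any single character"

■ Character classes, meaning "match one of these characters"

(2) Special characters

To match a character that has special meaning as itself, escape it with a backslash (\): ^, $, ?, ., /, \, [, ], {, }, (, ), +, *

(3) The wildcard character . (dot): matches any single character except a newline

(4) Character classes: /[dr]ejected/ matches d or r, followed by ejected

A character class can contain a range of characters: /[a-z]/

To match a hexadecimal digit, use several ranges inside one character class: /[A-Fa-f0-9]/

To match any character that is not a hexadecimal digit, negate the class: /[^A-Fa-f0-9]/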

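(5) Special escape sequences for common character classes

\d matches any digit

\w matches any digit, letter, or underscore

\s matches any whitespace character (space, tab, newline)

Each of these predefined character classes has a negated form

\D matches any character that is not a digit

\W matches any character that is not a digit, letter, or underscore

\S matches any character that is not whitespace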
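
A small sketch of character classes and the predefined escape sequences in use. The helper hex_string? and the sample strings are hypothetical, introduced only for illustration.

```ruby
NON_HEX_DIGIT = /[^A-Fa-f0-9]/

# Hypothetical helper: a string counts as hexadecimal if it is non-empty
# and contains no character outside the hex digit class.
def hex_string?(text)
  !text.empty? && text !~ NON_HEX_DIGIT
end

p hex_string?("1aF9")   # true
p hex_string?("12G4")   # false

DIGITS     = "room 42, floor 7".scan(/\d/)
WORD_CHARS = "a_b c!".scan(/\w/)
NON_SPACES = "a b\tc".scan(/\S/)

p DIGITS      # ["4", "2", "7"]
p WORD_CHARS  # ["a", "_", "b", "c"]
p NON_SPACES  # ["a", "b", "c"]
```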

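III. More on Matching and MatchData

1. One of the most important techniques in building regular expressions is using parentheses to specify captures. So that you can work on part of a string, capture notation isolates and saves the substring that matched a particular subpattern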

str = "Peel,Emma,Mrs.,talented amateur"

/([A-Za-z]+),[A-Za-z]+,(Mrs?\.)/.match(str)

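puts $1 # Output: Peel

puts $2 # Output: Mrs.

Ruby fills in these values automatically as global variables whose names are numbers: $1, $2, and so on.

2. Match success and failure

(1) Matching a phone number and querying the resulting MatchData object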

string = "My phone number is (123) 555-1234."

phone_re = /\((\d{3})\)\s+(\d{3})-(\d{4})/

m = phone_re.match(string)

unless m

  puts "There was no match--sorry."

  exit

end

print "The whole string we started with: "

puts m.string

print "The entire part of the string that matched: "

puts m[0]

puts "The three captures: "

3.times do |index|

  puts "Capture ##{index + 1}: #{m.captures[index]}"

end

puts "Here's another way to get at the first capture:"

print "Capture #1: "

puts m[1]

The code produces the following output

The whole string we started with: My phone number is (123) 555-1234.

The entire part of the string that matched: (123) 555-1234

The three captures:

Capture #1: 123

Capture #2: 555

Capture #3: 1234

Here's another way to get at the first capture:

Capture #1: 123

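(2) Two ways of getting the submatches

One way to get the submatches from a MatchData object is to index the object directly, as you would an array

The entire match: m[0]

The first capture: m[1]

The second capture: m[2]

The indexes from 1 upward correspond to n in the global variables $n that hold the captured substrings.

The MatchData object also has a method named captures, and the following equalities hold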

m[1] == m.captures[0]

m[2] == m.captures[1]
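
The relationship between indexing, captures, and the $n variables can be checked with a short sketch. The method phone_parts and the hash keys are hypothetical names chosen for this illustration; the pattern is the phone number pattern from the example above.

```ruby
PHONE_PATTERN = /\((\d{3})\)\s+(\d{3})-(\d{4})/

# Hypothetical helper: returns nil when there is no match, otherwise a hash
# that exposes the same data through the different access paths.
def phone_parts(text)
  m = PHONE_PATTERN.match(text)
  return nil unless m

  {
    whole: m[0],            # the entire matched substring
    captures: m.captures,   # only the parenthesized groups
    first_by_index: m[1],   # same string as m.captures[0]
    first_global: $1        # same string again, via the global variable
  }
end

p phone_parts("My phone number is (123) 555-1234.")
p phone_parts("no number here")   # nil
```

Note that m[n] is offset by one from m.captures[n - 1] because index 0 of the MatchData object is reserved for the whole match.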

(3) Other information in MatchData (additional code for the phone number match)

print "The part of the string before the part that matched was: "

puts m.pre_match

print "The part of the string after the part that matched was: "

puts m.post_match

print "The second capture began at character "

puts m.begin(2)

print "The third capture ended at character "

puts m.end(3)

This code produces the following output

The part of the string before the part that matched was: My phone number is

The part of the string after the part that matched was: .

The second capture began at character 25

The third capture ended at character 33

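IV. More Regular Expression Techniques

1. Quantifiers and greediness

(1) The special character meaning "zero or one": the question mark (?)

/Mrs?\.?/ matches "Mr", "Mrs", "Mr.", and "Mrs."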

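(2) The special character meaning "zero or more": the asterisk (*)

A closing </p> tag in an HTML document may contain optional whitespace, including a line break before the final >, as in these variants:

</p>

< /p>

</ p>

</p
>

A pattern that matches all of these forms of the closing tag: /<\s*\/\s*p\s*>/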
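
A quick check of that pattern against several spellings of the closing tag. The constant names are illustrative, and match? requires Ruby 2.4 or later.

```ruby
CLOSE_P = /<\s*\/\s*p\s*>/

# Variants with optional whitespace, including a line break before ">".
TAG_VARIANTS = ["</p>", "< /p>", "</ p>", "</p\n>"]

TAG_VARIANTS.each do |tag|
  puts "#{tag.inspect} matches: #{CLOSE_P.match?(tag)}"
end

# \s* allows only whitespace, so a different tag name is rejected.
p CLOSE_P.match?("</pre>")   # false
```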

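(3) The special character meaning "one or more": the plus sign (+)

/\d+/ matches any sequence of one or more consecutive digits

(4) Greedy and non-greedy quantifiers

The quantifiers * (zero or more) and + (one or more) are greedy: they match as many characters as possible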

string = "abc!def!ghi!"

match = /.+!/.match(string)

puts match[0] # Output: abc!def!ghi!

You can make them non-greedy by placing a question mark after them

string = "abc!def!ghi!"

match = /.+?!/.match(string)

puts match[0] # Output: abc!

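(5) Specific numbers of repetitions

To do this, put a number inside curly braces {}

/\d{3}-\d{4}/ matches 555-1212

You can also specify a range, for example any run of 1 to 10 consecutive digits

/\d{1,10}/

To match "three or more digits" you can use

/\d{3,}/

To match five consecutive occurrences of the pattern "an uppercase letter followed by a digit"

/([A-Z]\d){5}/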

/([A-Z]\d){5}/
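
The repetition forms above can be tried out with String#[], which returns the matched substring (or nil). The sample strings and constant names are made up for illustration.

```ruby
EXACT      = "call 555-1212 now"[/\d{3}-\d{4}/]
UP_TO_TEN  = "id 1234567890123"[/\d{1,10}/]
THREE_PLUS = "12 and 3456"[/\d{3,}/]

p EXACT       # "555-1212"
p UP_TO_TEN   # "1234567890" (greedy, but capped at 10 digits)
p THREE_PLUS  # "3456" ("12" is too short to qualify)

# When a quantifier is applied to a group, the group's capture keeps only
# the text from its last repetition.
PAIRS = /([A-Z]\d){5}/.match("A1B2C3D4E5")

p PAIRS[0]    # "A1B2C3D4E5"
p PAIRS[1]    # "E5"
```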

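2. Anchors and lookahead assertions

(1) The most common anchors are beginning of line (^) and end of line ($)

To determine which lines are comment lines, you can use the following regular expression: /^\s*#/

(2) Regular expression anchors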

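Notation | Description | Example | Sample matching string
---------|-------------|---------|-----------------------
^  | Beginning of line | /^\s*#/ | " # A Ruby comment line"
$  | End of line | /\.$/ | "one\ntwo\nthree.\nfour"
\A | Beginning of string | /\AFour score/ | "Four score"
\z | End of string | /from the earth.\z/ | "from the earth."
\Z | End of string (ignoring a final newline) | /from the earth.\Z/ | "from the earth.\n"
\b | Word boundary | /\b\w+\b/ | "!!!word***" (matches word)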
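
The line anchors and string anchors are easy to confuse, so here is a small sketch contrasting them. The multi-line sample text and the constant names are assumptions made for this illustration.

```ruby
SCRIPT = "first line\n# a comment\nlast line\n"

# ^ matches at the start of any line; \A only at the start of the string.
LINE_START   = SCRIPT =~ /^\s*#/    # 11: the comment starts the second line
STRING_START = SCRIPT =~ /\A\s*#/   # nil: the string itself starts with "first"

ENDING = "from the earth.\n"

# \z demands the very end of the string; \Z tolerates one trailing newline.
STRICT_END  = ENDING =~ /earth\.\z/   # nil
LENIENT_END = ENDING =~ /earth\.\Z/   # 9

p LINE_START, STRING_START, STRICT_END, LENIENT_END
```

When validating a whole string, \A and \z are usually the safer choice, because ^ and $ can be satisfied by any single line inside multi-line input.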

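(3) Lookahead assertions

Suppose you want to match a sequence of digits that ends with a dot, but you do not want the dot itself to be part of the match. A lookahead assertion does this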

str = "123 456. 789"

m = /\d+(?=\.)/.match(str)

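Now m[0] is "456": the digit sequence in the string that is immediately followed by a dot.

Here is an explanation of some of the terms involved:

■ Zero-width: the assertion does not consume any characters of the string. If the pattern continued, it could still match the dot.

■ Positive: the assertion requires that the dot be present, written (?=...). A negative lookahead assertion is written (?!...).

■ Lookahead assertion: you want to know what comes next, but without matching it.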
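
A sketch of these three terms in action, reusing the string from the example above. The constant names are illustrative; the \b in the negative example is an addition made here so that backtracking cannot shorten "456" to "45" in order to satisfy the assertion.

```ruby
NUMBERS = "123 456. 789"

# Positive lookahead: digits that are immediately followed by a dot.
BEFORE_DOT = NUMBERS[/\d+(?=\.)/]

# Zero-width: the dot was not consumed, so the pattern can go on to match it.
WITH_DOT = NUMBERS[/\d+(?=\.)\./]

# Negative lookahead: whole digit runs that are NOT followed by a dot.
NOT_BEFORE_DOT = NUMBERS.scan(/\d+\b(?!\.)/)

p BEFORE_DOT       # "456"
p WITH_DOT         # "456."
p NOT_BEFORE_DOT   # ["123", "789"]
```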

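3. Modifiers: a letter appended to the end of a regular expression, as in /abc/i

The i modifier makes matching with the regular expression case-insensitive.

The m modifier lets the dot wildcard match any character, including a newline.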

str = "This (including\nwhat's in parens\n) takes up three lines."

m = /\(.*?\)/m.match(str)

The match result is

(including

what's in parens

)
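
To see what each modifier changes, the sketch below runs the same patterns with and without it. The constant names are illustrative, and match? requires Ruby 2.4 or later.

```ruby
# i: case-insensitive matching.
p(/abc/.match?("ABC"))    # false
p(/abc/i.match?("ABC"))   # true

PARENS_TEXT = "This (including\nwhat's in parens\n) takes up three lines."

# Without m, the dot cannot cross the newlines, so no closing paren is reached.
WITHOUT_M = PARENS_TEXT[/\(.*?\)/]

# With m, the dot also matches newlines and the whole group is found.
WITH_M = PARENS_TEXT[/\(.*?\)/m]

p WITHOUT_M   # nil
p WITH_M      # "(including\nwhat's in parens\n)"
```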

4. Converting between strings and regular expressions

>> str = "def"

=> "def"

>> /abc#{str}/

=> /abcdef/

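Before putting a string into a regular expression, you can escape the special characters in the string

>> Regexp.escape("a.c")

=> "a\\.c"

>> Regexp.escape("^abc")

=> "\\^abc"

Converting a regular expression to a string: str = /abc/.inspect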
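
Interpolation and escaping are often combined when a pattern has to match some text literally. The helper literal_pattern below is a hypothetical name used only for this sketch; the last three lines compare the standard Regexp methods that produce strings.

```ruby
# Hypothetical helper: build a regexp that matches the given text literally.
def literal_pattern(text)
  /#{Regexp.escape(text)}/
end

p literal_pattern("a.c")                  # /a\.c/
p literal_pattern("a.c").match?("abc")    # false: the dot is now literal
p(/a.c/.match?("abc"))                    # true: an unescaped dot is a wildcard

# Different string forms of the same regexp:
p(/abc/.inspect)   # "/abc/"
p(/abc/.source)    # "abc"
p(/abc/.to_s)      # "(?-mix:abc)"
```

inspect gives the literal as you would type it, source gives only the pattern text, and to_s gives a form that carries the modifier flags with it.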

V. Common Methods That Use Regular Expressions

1. To find every string in an array of strings that has at least 10 characters and contains at least one digit:

array.find_all {|e| e.size >= 10 and /\d/.match(e) }

2. String#scan

The scan method scans a string, testing repeatedly for matches of the given pattern, and returns the results in an array

>> "testing 1 2 3 testing 4 5 6".scan(/\d/)

=> ["1", "2", "3", "4", "5", "6"]

If the regular expression passed to scan uses parenthetical grouping, scan returns an array of arrays

>> str = "Leopold Auer was the teacher of Jascha Heifetz."

=> "Leopold Auer was the teacher of Jascha Heifetz."

>> violinists = str.scan(/([A-Z]\w+)\s+([A-Z]\w+)/)

=> [["Leopold", "Auer"], ["Jascha", "Heifetz"]]

3. String#split

The split method splits a string into several substrings at each match of the given delimiter pattern

line = "first_name=david;last_name=black;country=usa"

record = line.split(/=|;/)
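
The result of this split is a flat array that alternates keys and values. As one possible follow-up step (not part of the original example), the pairs can be grouped into a hash with each_slice. The constant names are illustrative.

```ruby
RECORD_LINE = "first_name=david;last_name=black;country=usa"

# Split on either delimiter: "=" or ";".
RECORD_FIELDS = RECORD_LINE.split(/=|;/)

# Group the flat array into [key, value] pairs and build a hash from them.
RECORD_HASH = RECORD_FIELDS.each_slice(2).to_h

p RECORD_FIELDS
# ["first_name", "david", "last_name", "black", "country", "usa"]
p RECORD_HASH
# {"first_name"=>"david", "last_name"=>"black", "country"=>"usa"}
```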

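4. sub/sub! and gsub/gsub!

gsub (global substitution): goes through the whole string and makes every change;

sub: makes at most one substitution;

(1) sub takes two arguments: a regular expression (or a string) and a replacement string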

>> "typigraphical error".sub(/i/,"o")

=> "typographical error"

You can use a code block instead of the replacement-string argument

>> "capitalize the first vowel".sub(/[aeiou]/) {|s| s.upcase }

=> "cApitalize the first vowel"

(2) gsub is like sub, except that it keeps substituting as long as the rest of the string still matches the pattern

>> "capitalize every word".gsub(/\b\w/) {|s| s.upcase }

=> "Capitalize Every Word"

(3) Using submatches in the replacement string

To fix a mistake in which a lowercase letter is followed by an uppercase letter, you can write:

>> "aDvid".sub(/([a-z])([A-Z])/, '\2\1')

=> "David"

To repeat every word in a string, you can write

>> "double every word".gsub(/\b(\w+)/, '\1 \1')

=> "double double every every word word"
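
A few details of sub and gsub are easy to trip over, so here is a short sketch: the difference between the two, the return value of the bang versions when nothing changes, and three equivalent ways to refer to a capture in the replacement. The constant names are illustrative.

```ruby
TYPO = "typigraphical error"

FIRST_ONLY = TYPO.sub(/i/, "o")    # replaces only the first "i"
EVERY_ONE  = TYPO.gsub(/i/, "o")   # replaces every "i"

p FIRST_ONLY   # "typographical error"
p EVERY_ONE    # "typographocal error"

# The bang versions modify the receiver and return nil if nothing changed.
GREETING  = "hello".dup
NO_CHANGE = GREETING.sub!(/z/, "y")
p NO_CHANGE    # nil

# Backreferences: '\1' in single quotes, "\\1" in double quotes,
# or $1 inside a block.
WORDS = "double every word"
DOUBLED_SINGLE = WORDS.gsub(/\b(\w+)/, '\1 \1')
DOUBLED_DOUBLE = WORDS.gsub(/\b(\w+)/, "\\1 \\1")
DOUBLED_BLOCK  = WORDS.gsub(/\b(\w+)/) { "#{$1} #{$1}" }

p DOUBLED_SINGLE   # "double double every every word word"
```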

5. grep: selects elements directly based on a regular expression argument

>> ["USA", "UK", "France", "Germany"].grep(/[a-z]/)

=> ["France", "Germany"]

The same thing can be done with select

["USA", "UK", "France", "Germany"].select {|c| /[a-z]/.match(c) }

Select some countries, convert them to uppercase, and return them in an array

>> ["USA", "UK", "France", "Germany"].grep(/[a-z]/) {|c| c.upcase }

=> ["FRANCE", "GERMANY"]
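
grep tests each element with the === operator of its argument, which for a Regexp means "does this string match". The sketch below spells out the equivalent select and map calls; the constant names are illustrative.

```ruby
COUNTRIES = ["USA", "UK", "France", "Germany"]

# grep with a pattern is equivalent to select with Regexp#===.
WITH_LOWERCASE = COUNTRIES.grep(/[a-z]/)
VIA_SELECT     = COUNTRIES.select { |c| /[a-z]/ === c }

# grep with a block is equivalent to select followed by map.
UPCASED = COUNTRIES.grep(/[a-z]/) { |c| c.upcase }
VIA_MAP = COUNTRIES.select { |c| c =~ /[a-z]/ }.map(&:upcase)

p WITH_LOWERCASE   # ["France", "Germany"]
p UPCASED          # ["FRANCE", "GERMANY"]
```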