Java Regular Expressions

Source: Internet | Posted: java退格符处理 | Editor: 程序博客网 | Time: 2024/06/06 01:41

Reposted from: http://alan-hjkl.iteye.com/blog/1543548

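7. Quantifiers

In Java regular expressions, quantifiers specify how many times a pattern element must occur for a match. The Java API provides three kinds of quantifiers: greedy, reluctant, and possessive. The three look similar at first glance, but they behave differently.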

Greedy      Reluctant   Possessive   Meaning
X?          X??         X?+          X, zero or one time
X*          X*?         X*+          X, zero or more times
X+          X+?         X++          X, one or more times
X{n}        X{n}?       X{n}+        X, exactly n times
X{n,}       X{n,}?      X{n,}+       X, at least n times
X{n,m}      X{n,m}?     X{n,m}+      X, at least n but not more than m times

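Let's start with the greedy quantifiers by building three regular expressions: the letter a followed by ?, *, or +. Here is what happens when each expression is tested against an empty input string: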

Enter your regex: a?  
Enter input string to search:   
I found the text "" starting at index 0 and ending at index 0.  
  
Enter your regex: a*  
Enter input string to search:   
I found the text "" starting at index 0 and ending at index 0.  
  
Enter your regex: a+  
Enter input string to search:   
No match found.  
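The transcripts in this section come from an interactive test harness that is not reproduced here. The following is only a minimal sketch of how such a harness could be written with java.util.regex; the class name RegexHarnessSketch and the helper search are hypothetical, and the regex and input are hard-coded instead of being read from the console.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexHarnessSketch {

    // Hypothetical helper (not from the original article): returns one
    // report line per match, worded like the transcripts in this section.
    static List<String> search(String regex, String input) {
        List<String> report = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        // find() moves on to the next match each time it is called.
        while (matcher.find()) {
            report.add(String.format(
                    "I found the text \"%s\" starting at index %d and ending at index %d.",
                    matcher.group(), matcher.start(), matcher.end()));
        }
        if (report.isEmpty()) {
            report.add("No match found.");
        }
        return report;
    }

    public static void main(String[] args) {
        // Test a?, a* and a+ against the empty input string.
        for (String regex : new String[] {"a?", "a*", "a+"}) {
            System.out.println("Enter your regex: " + regex);
            System.out.println("Enter input string to search: ");
            search(regex, "").forEach(System.out::println);
            System.out.println();
        }
    }
}
```

Calling search with other regex and input pairs should reproduce the remaining transcripts in the same way.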

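7.1 Zero-Length Matches

In the example above, the first two matches succeed because the expressions a? and a* both allow the letter a to occur zero times. Unlike the matches seen so far, the start and end indexes are both 0. The empty input string has no length, so the test simply matches nothing at index 0. Matches of this kind are called zero-length matches. A zero-length match can occur in several situations: in an empty input string, at the beginning of an input string, after the last character of an input string, or between any two characters of an input string. Zero-length matches are easy to spot because they always start and end at the same index.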
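To make the four situations concrete, here is a small sketch that records every position where a match starts and ends at the same index. The pattern x* and the helper name zeroLengthIndexes are chosen for illustration only; because neither input contains an x, every match it reports is zero-length.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroLengthMatchSketch {

    // Hypothetical helper: collects the index of every match whose
    // start and end coincide, i.e. every zero-length match.
    static List<Integer> zeroLengthIndexes(String regex, String input) {
        List<Integer> indexes = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            if (matcher.start() == matcher.end()) {
                indexes.add(matcher.start());
            }
        }
        return indexes;
    }

    public static void main(String[] args) {
        // Empty input: a single zero-length match at index 0.
        System.out.println(zeroLengthIndexes("x*", ""));
        // "ab": at the start, between a and b, and after b.
        System.out.println(zeroLengthIndexes("x*", "ab"));
    }
}
```

The first call prints [0] and the second prints [0, 1, 2], one index for each position where "nothing" can be matched.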

Let's look at a few more examples of zero-length matches. Change the input string to the single letter "a" and you will notice something interesting:

Enter your regex: a?  
Enter input string to search: a  
I found the text "a" starting at index 0 and ending at index 1.  
I found the text "" starting at index 1 and ending at index 1.  
  
Enter your regex: a*  
Enter input string to search: a  
I found the text "a" starting at index 0 and ending at index 1.  
I found the text "" starting at index 1 and ending at index 1.  
  
Enter your regex: a+  
Enter input string to search: a  
I found the text "a" starting at index 0 and ending at index 1.  

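All three quantifiers found the letter "a", but the first two also found a zero-length match at index 1, that is, after the last character of the input string. Recall that the matcher sees the character "a" as sitting in the cell between index 0 and index 1, and that the test harness keeps looping until it can find no more matches. Depending on the quantifier used, the presence of "nothing" at the index after the last character may or may not trigger a match.

Now change the input string to the letter "a" five times in a row, and you get the following: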

Enter your regex: a?  
Enter input string to search: aaaaa  
I found the text "a" starting at index 0 and ending at index 1.  
I found the text "a" starting at index 1 and ending at index 2.  
I found the text "a" starting at index 2 and ending at index 3.  
I found the text "a" starting at index 3 and ending at index 4.  
I found the text "a" starting at index 4 and ending at index 5.  
I found the text "" starting at index 5 and ending at index 5.  
  
Enter your regex: a*  
Enter input string to search: aaaaa  
I found the text "aaaaa" starting at index 0 and ending at index 5.  
I found the text "" starting at index 5 and ending at index 5.  
  
Enter your regex: a+  
Enter input string to search: aaaaa  
I found the text "aaaaa" starting at index 0 and ending at index 5.  

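The expression a? finds a separate match for each character, because it matches "a" zero or one time.

The expression a* finds two separate matches: the first covers all of the a's, and the second is the zero-length match after the last character, at index 5.

Finally, a+ matches all occurrences of the letter "a" and ignores the presence of "nothing" at the last index.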
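The difference among the three results is easiest to see as a match count. The sketch below counts how often find() succeeds for each expression on the input "aaaaa"; countMatches is a hypothetical helper introduced for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchCountSketch {

    // Hypothetical helper: counts how many times find() succeeds,
    // zero-length matches included.
    static int countMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "aaaaa";
        // Five single letters plus one zero-length match at index 5.
        System.out.println("a? -> " + countMatches("a?", input));
        // One match for "aaaaa" plus one zero-length match at index 5.
        System.out.println("a* -> " + countMatches("a*", input));
        // One match for "aaaaa" only.
        System.out.println("a+ -> " + countMatches("a+", input));
    }
}
```

The counts are 6, 2, and 1, which agrees with the number of report lines in the transcript above.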

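At this point, you might be wondering what the first two quantifiers do when they encounter a letter other than "a". For example, what happens when they encounter the letter "b", as in "ababaaaab"?

Let's find out: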

Enter your regex: a?  
Enter input string to search: ababaaaab  
I found the text "a" starting at index 0 and ending at index 1.  
I found the text "" starting at index 1 and ending at index 1.  
I found the text "a" starting at index 2 and ending at index 3.  
I found the text "" starting at index 3 and ending at index 3.  
I found the text "a" starting at index 4 and ending at index 5.  
I found the text "a" starting at index 5 and ending at index 6.  
I found the text "a" starting at index 6 and ending at index 7.  
I found the text "a" starting at index 7 and ending at index 8.  
I found the text "" starting at index 8 and ending at index 8.  
I found the text "" starting at index 9 and ending at index 9.  
  
Enter your regex: a*  
Enter input string to search: ababaaaab  
I found the text "a" starting at index 0 and ending at index 1.  
I found the text "" starting at index 1 and ending at index 1.  
I found the text "a" starting at index 2 and ending at index 3.  
I found the text "" starting at index 3 and ending at index 3.  
I found the text "aaaa" starting at index 4 and ending at index 8.  
I found the text "" starting at index 8 and ending at index 8.  
I found the text "" starting at index 9 and ending at index 9.  
  
Enter your regex: a+  
Enter input string to search: ababaaaab  
I found the text "a" starting at index 0 and ending at index 1.  
I found the text "a" starting at index 2 and ending at index 3.  
I found the text "aaaa" starting at index 4 and ending at index 8.  

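Even though the letter "b" appears in cells 1, 3, and 8, the output reports a zero-length match at those locations. The regular expression a? is not specifically looking for the letter "b"; it is merely looking for the presence (or absence) of the letter "a". If the quantifier allows "a" to match zero times, anything in the input string that is not an "a" shows up as a zero-length match. The remaining a's are matched according to the rules discussed in the previous examples.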
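As a rough illustration of this point, the sketch below lists each zero-length match of a? in "ababaaaab" together with the character that sits at that index. The helper emptyMatchPositions and its "index:character" output format are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchPositionSketch {

    // Hypothetical helper: for every zero-length match, records its index
    // and the character found there ("end" when the index is past the input).
    static List<String> emptyMatchPositions(String regex, String input) {
        List<String> positions = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            if (matcher.start() == matcher.end()) {
                int index = matcher.start();
                String next = index < input.length()
                        ? String.valueOf(input.charAt(index))
                        : "end";
                positions.add(index + ":" + next);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // Every zero-length match sits on a "b" or at the end of the input.
        System.out.println(emptyMatchPositions("a?", "ababaaaab"));
    }
}
```

This prints [1:b, 3:b, 8:b, 9:end]: each zero-length match is reported where the matcher looked for an "a" and found something else, or nothing at all.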

To match a pattern exactly n times, simply specify the number inside a pair of curly braces:

Enter your regex: a{3}  
Enter input string to search: aa  
No match found.  
  
Enter your regex: a{3}  
Enter input string to search: aaa  
I found the text "aaa" starting at index 0 and ending at index 3.  
  
Enter your regex: a{3}  
Enter input string to search: aaaa  
I found the text "aaa" starting at index 0 and ending at index 3.  

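Here, the regular expression a{3} searches for three occurrences of the letter "a" in a row. The first test fails because the input string does not contain enough a's to match. The second test contains exactly three a's in the input string, which triggers a match. The third test also triggers a match, because there are exactly three a's at the beginning of the input string. Whatever follows is irrelevant to that first match. If the pattern appears again after that point, it triggers subsequent matches: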

Enter your regex: a{3}  
Enter input string to search: aaaaaaaaa  
I found the text "aaa" starting at index 0 and ending at index 3.  
I found the text "aaa" starting at index 3 and ending at index 6.  
I found the text "aaa" starting at index 6 and ending at index 9.  
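The a{3} transcripts above can be checked with a short sketch like the following, which collects the text of every match. The helper name findAll is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExactCountSketch {

    // Hypothetical helper: returns the text of every match, in order.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Two a's are too few, so the list is empty.
        System.out.println(findAll("a{3}", "aa"));
        // The fourth a is left over and cannot form another match.
        System.out.println(findAll("a{3}", "aaaa"));
        // Nine a's split into three consecutive matches of three.
        System.out.println(findAll("a{3}", "aaaaaaaaa"));
    }
}
```

The three calls print [], [aaa], and [aaa, aaa, aaa] respectively.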

To require a pattern to appear at least n times, add a comma after the number:

Enter your regex: a{3,}  
Enter input string to search: aaaaaaaaa  
I found the text "aaaaaaaaa" starting at index 0 and ending at index 9.  

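With the same input string, this test finds only one match, because the nine a's in a row satisfy the requirement of "at least" three a's.

Finally, to specify an upper limit on the number of occurrences, add a second number inside the braces: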

Enter your regex: a{3,6}    // find at least 3 (but no more than 6) a's in a row
Enter input string to search: aaaaaaaaa  
I found the text "aaaaaa" starting at index 0 and ending at index 6.  
I found the text "aaa" starting at index 6 and ending at index 9.  

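Here the first match is forced to stop at the upper limit of 6 characters. The second match includes whatever is left over, which happens to be three a's, the minimum number of characters allowed for this match. If the input string were one character shorter, there would be no second match, since only two a's would remain.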
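The effect of the lower and upper bounds can be sketched by recording the length of each match. The helper matchLengths is a hypothetical name used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundedCountSketch {

    // Hypothetical helper: returns the length of every match, in order.
    static List<Integer> matchLengths(String regex, String input) {
        List<Integer> lengths = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            lengths.add(matcher.end() - matcher.start());
        }
        return lengths;
    }

    public static void main(String[] args) {
        // No upper bound: all nine a's form a single match.
        System.out.println(matchLengths("a{3,}", "aaaaaaaaa"));
        // Upper bound of 6: a match of six, then a match of the remaining three.
        System.out.println(matchLengths("a{3,6}", "aaaaaaaaa"));
        // Eight a's: after the match of six, only two remain, which is too few.
        System.out.println(matchLengths("a{3,6}", "aaaaaaaa"));
    }
}
```

The output is [9], [6, 3], and [6], matching the behavior described above for the nine-letter input and for an input that is one letter shorter.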

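7.2 Quantifiers with Capturing Groups and Character Classes

Until now, we have only tested quantifiers attached to a single character. In fact, a quantifier attaches only to the one character immediately before it, so the regular expression abc+ means "a, followed by b, followed by c one or more times". It does not mean "abc" one or more times. However, quantifiers can also attach to character classes and capturing groups: [abc]+ means a or b or c, one or more times, and (abc)+ means the group "abc", one or more times.

Let's illustrate by requiring the group (dog) three times in a row.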

Enter your regex: (dog){3}  
Enter input string to search: dogdogdogdogdogdog  
I found the text "dogdogdog" starting at index 0 and ending at index 9.  
I found the text "dogdogdog" starting at index 9 and ending at index 18.  
  
Enter your regex: dog{3}  
Enter input string to search: dogdogdogdogdogdog  
No match found.  

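The first example above finds two matches, since the quantifier applies to the entire capturing group. Remove the parentheses, however, and the quantifier {3} applies only to the letter "g", so the match fails.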
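One way to see what dog{3} actually describes is to test both patterns against whole strings with Matcher.matches(). This is a small sketch; the class and constant names are made up for the example, and the input "doggg" does not appear in the original article.

```java
import java.util.regex.Pattern;

class GroupQuantifierSketch {

    // {3} applies to the whole capturing group (dog).
    static final Pattern GROUP_THREE_TIMES = Pattern.compile("(dog){3}");

    // Without parentheses, {3} applies only to the letter g.
    static final Pattern LAST_LETTER_THREE_TIMES = Pattern.compile("dog{3}");

    public static void main(String[] args) {
        // matches() succeeds only if the entire input fits the pattern.
        System.out.println(GROUP_THREE_TIMES.matcher("dogdogdog").matches());
        System.out.println(LAST_LETTER_THREE_TIMES.matcher("dogdogdog").matches());
        System.out.println(LAST_LETTER_THREE_TIMES.matcher("doggg").matches());
    }
}
```

This prints true, false, and true: dog{3} is satisfied by "doggg", not by three repetitions of "dog".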

Similarly, a quantifier can be applied to an entire character class:

Enter your regex: [abc]{3}  
Enter input string to search: abccabaaaccbbbc  
I found the text "abc" starting at index 0 and ending at index 3.  
I found the text "cab" starting at index 3 and ending at index 6.  
I found the text "aaa" starting at index 6 and ending at index 9.  
I found the text "ccb" starting at index 9 and ending at index 12.  
I found the text "bbc" starting at index 12 and ending at index 15.  
  
Enter your regex: abc{3}  
Enter input string to search: abccabaaaccbbbc  
No match found.  

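In the first example above, the quantifier {3} applies to the entire character class, but in the second example it applies only to the letter "c".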
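The same contrast can be reproduced in code. The sketch below runs both expressions against the input from the transcript and adds one extra input, "abccc", which is an assumption of this example and shows the kind of text that abc{3} does match. The helper name matchedTexts is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharacterClassQuantifierSketch {

    // Hypothetical helper: returns the text of every match, in order.
    static List<String> matchedTexts(String regex, String input) {
        List<String> texts = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            texts.add(matcher.group());
        }
        return texts;
    }

    public static void main(String[] args) {
        String input = "abccabaaaccbbbc";
        // Any three characters drawn from a, b, c: five chunks of three.
        System.out.println(matchedTexts("[abc]{3}", input));
        // "ab" followed by exactly three c's never occurs in this input.
        System.out.println(matchedTexts("abc{3}", input));
        // It does occur here.
        System.out.println(matchedTexts("abc{3}", "abccc"));
    }
}
```

The output is [abc, cab, aaa, ccb, bbc], then [], then [abccc].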

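7.3 Differences Among Greedy, Reluctant, and Possessive Quantifiers

There are subtle differences among greedy, reluctant, and possessive quantifiers.

Greedy quantifiers are called "greedy" because they force the matcher to read in, or eat, the entire input string before attempting the first match. If the first match attempt (on the entire input string) fails, the matcher backs off the input string by one character and tries again, repeating the process until a match is found or there are no more characters left to back off from. Depending on the quantifier used in the expression, the last thing it tries to match against is 1 or 0 characters.

Reluctant quantifiers, however, take the opposite approach: they start at the beginning of the input string and reluctantly eat one character at a time while looking for a match. The last thing they try is the entire input string.

Finally, possessive quantifiers always eat the entire input string, trying once (and only once) for a match. Unlike greedy quantifiers, possessive quantifiers never back off, even if doing so would allow the overall match to succeed.

To illustrate, consider the input string xfooxxxxxxfoo.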

Enter your regex: .*foo     // greedy quantifier
Enter input string to search: xfooxxxxxxfoo  
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.  
  
Enter your regex: .*?foo        // reluctant quantifier
Enter input string to search: xfooxxxxxxfoo  
I found the text "xfoo" starting at index 0 and ending at index 4.  
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.  
  
Enter your regex: .*+foo        // possessive quantifier
Enter input string to search: xfooxxxxxxfoo  
No match found.  

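The first example uses the greedy quantifier .* to find "anything", zero or more times, followed by the letters "f" "o" "o". Because the quantifier is greedy, the .* portion of the expression first eats the entire input string. At this point, the overall expression cannot succeed, because the last three letters ("f" "o" "o") have already been consumed. So the matcher slowly backs off one letter at a time until the rightmost occurrence of "foo" has been given back, at which point the match succeeds and the search ends.

The second example, however, is reluctant, so it starts by consuming "nothing". Because "foo" does not appear at the beginning of the string, it is forced to swallow the first letter (an "x"), which triggers the first match at 0 and 4. The test harness continues the process until the input string is exhausted, and it finds another match at 4 and 13.

The third example fails to find a match because the quantifier is possessive. In this case, the entire input string is consumed by .*+, leaving nothing left over to satisfy the "foo" at the end of the expression.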
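The three results can be compared side by side with a sketch along these lines; the helper name matchedTexts is hypothetical, and the input is the same xfooxxxxxxfoo string used in the transcript.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierKindSketch {

    // Hypothetical helper: returns the text of every match, in order.
    static List<String> matchedTexts(String regex, String input) {
        List<String> texts = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            texts.add(matcher.group());
        }
        return texts;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: eats everything, then backs off to the last "foo".
        System.out.println("greedy     " + matchedTexts(".*foo", input));
        // Reluctant: grows one character at a time, stops at each "foo".
        System.out.println("reluctant  " + matchedTexts(".*?foo", input));
        // Possessive: eats everything and never gives "foo" back.
        System.out.println("possessive " + matchedTexts(".*+foo", input));
    }
}
```

The greedy expression yields [xfooxxxxxxfoo], the reluctant one yields [xfoo, xxxxxxfoo], and the possessive one yields an empty list.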

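Use a possessive quantifier when you want to seize all of something without ever backing off. In cases where the match is not immediately found, it will outperform the equivalent greedy quantifier.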
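A possessive quantifier does not always make a match fail. As a rough sketch (the patterns x*+foo and x*foo, the input "xxxfoo", and the helper firstMatch are assumptions made for this example, and no timing is measured), the following shows a case where the possessive and greedy forms agree because the text after the quantifier cannot overlap with what the quantifier consumed:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveUseSketch {

    // Hypothetical helper: returns the first match, if there is one.
    static Optional<String> firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String input = "xxxfoo";
        // x*+ seizes every leading x and never gives one back. That is
        // harmless here, because the following "foo" cannot start with x.
        System.out.println(firstMatch("x*+foo", input));
        // The greedy form finds the same text.
        System.out.println(firstMatch("x*foo", input));
        // .*+ also seizes the trailing "foo", so nothing is left to match it.
        System.out.println(firstMatch(".*+foo", input));
    }
}
```

The first two calls print Optional[xxxfoo] and the third prints Optional.empty. The possessive form is a safe replacement for the greedy one only when giving characters back could never help the match.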

0 0