Regular Expression Study Guide (16): Possessive Quantifiers
Source: Internet  Editor: 程序博客网  Time: 2024/05/08 06:42
Possessive Quantifiers
When discussing the repetition operators or quantifiers, I explained the difference between greedy and lazy repetition. Greediness and laziness determine the order in which the regex engine tries the possible permutations of the regex pattern. A greedy quantifier will first try to repeat the token as many times as possible, and gradually give up matches as the engine backtracks to find an overall match. A lazy quantifier will first repeat the token as few times as required, and gradually expand the match as the engine backtracks through the regex to find an overall match.
Because greediness and laziness change the order in which permutations are tried, they can change the overall regex match. However, they do not change the fact that the regex engine will backtrack to try all possible permutations of the regular expression in case no match can be found.
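To make the difference concrete, here is a minimal sketch using Python's re module. The patterns and subject strings are illustrative assumptions, not taken from the text above. It shows that greedy and lazy repetition can pick different matches, but both still fail once every permutation has been tried.

```python
import re

# Illustrative subject with two quoted strings (an assumption for this sketch).
subject = '"a" and "b"'

# Greedy: the dot-star grabs as much as possible, then backtracks to the last quote.
greedy_match = re.search(r'".*"', subject)
# Lazy: the dot-star expands one character at a time until a quote follows.
lazy_match = re.search(r'".*?"', subject)

print(greedy_match.group())  # "a" and "b"
print(lazy_match.group())    # "a"

# Without a closing quote, both variants exhaust all permutations and fail.
no_closing_quote = '"a and b'
greedy_fail = re.search(r'".*"', no_closing_quote)
lazy_fail = re.search(r'".*?"', no_closing_quote)
print(greedy_fail, lazy_fail)  # None None
```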
Possessive quantifiers are a way to prevent the regex engine from trying all permutations. This is primarily useful for performance reasons. You can also use possessive quantifiers to eliminate certain matches.
How Possessive Quantifiers Work
Several modern regular expression flavors, including JGsoft, Java and PCRE, have a third kind of quantifier: the possessive quantifier. Like a greedy quantifier, a possessive quantifier will repeat the token as many times as possible. Unlike a greedy quantifier, it will not give up matches as the engine backtracks. With a possessive quantifier, the deal is all or nothing. You can make a quantifier possessive by placing an extra + after it. E.g. * is greedy, *? is lazy, and *+ is possessive. ++, ?+ and {n,m}+ are all possessive as well.
Let's see what happens if we try to match "[^"]*+" against "abc". The " matches the ". [^"] matches a, b and c as it is repeated by the star. The final " then matches the final " and we found an overall match. In this case, the end result is the same, whether we use a greedy or possessive quantifier. There is a slight performance increase though, because the possessive quantifier doesn't have to remember any backtracking positions.
The performance increase can be significant in situations where the regex fails. If the subject is "abc (no closing quote), the above matching process will happen in the same way, except that the second " fails. When using a possessive quantifier, there are no steps to backtrack to. The regular expression does not have any alternation or non-possessive quantifiers that can give up part of their match to try a different permutation of the regular expression. So the match attempt fails immediately when the second " fails.
Had we used a greedy quantifier instead, the engine would have backtracked. After the " failed at the end of the string, the [^"]* would give up one match, leaving it with ab. The " would then fail to match c. [^"]* backtracks to just a, and " fails to match b. Finally, [^"]* backtracks to match zero characters, and " fails to match a. Only at this point have all backtracking positions been exhausted, and does the engine give up the match attempt. Essentially, this regex performs as many needless steps as there are characters following the unmatched opening quote.
When Possessive Quantifiers Matter
The main practical benefit of possessive quantifiers is to speed up your regular expression. In particular, possessive quantifiers allow your regex to fail faster. In the above example, when the closing quote fails to match, we know the regular expression couldn't possibly have skipped over a quote. So there's no need to backtrack and check for the quote. We make the regex engine aware of this by making the quantifier possessive. In fact, some engines, including the JGsoft engine, detect that [^"]* and " are mutually exclusive when compiling your regular expression, and automatically make the star possessive.
Now, linear backtracking, as done by a regex with a single quantifier, is pretty fast. It's unlikely you'll notice the speed difference. However, when you're nesting quantifiers, a possessive quantifier may save your day. Nesting quantifiers means that you have one or more repeated tokens inside a group, and the group is also repeated. That's when catastrophic backtracking often rears its ugly head. In such cases, you'll depend on possessive quantifiers and/or atomic grouping to save the day.
Possessive Quantifiers Can Change The Match Result
Using possessive quantifiers can change the result of a match attempt. Since no backtracking is done, matches that would require a greedy quantifier to backtrack will not be found with a possessive quantifier. E.g. ".*" will match "abc" in "abc"x, but ".*+" will not match this string at all.
In both regular expressions, the first " will match the first " in the string. The repeated dot then matches the remainder of the string abc"x. The second " then fails to match at the end of the string.
Now, the paths of the two regular expressions diverge. The possessive dot-star wants it all. No backtracking is done. Since the " failed, there are no permutations left to try, and the overall match attempt fails. The greedy dot-star, while initially grabbing everything, is willing to give back. It will backtrack one character at a time. Backtracking to abc", " fails to match x. Backtracking to abc, " matches ". An overall match "abc" was found.
Essentially, the lesson here is that when using possessive quantifiers, you need to make sure that whatever you're applying the possessive quantifier to should not be able to match what should follow it. The problem in the above example is that the dot also matches the closing quote. This prevents us from using a possessive quantifier. The negated character class in the previous section cannot match the closing quote, so we can make it possessive.
Using Atomic Grouping Instead of Possessive Quantifiers
Technically, possessive quantifiers are a notational convenience to place an atomic group around a single quantifier. All regex flavors that support possessive quantifiers also support atomic grouping. But not all regex flavors that support atomic grouping support possessive quantifiers. With those flavors, you can achieve the exact same results using an atomic group.
Basically, instead of X*+, write (?>X*). It is important to notice that both the quantified token X and the quantifier are inside the atomic group. Even if X is a group, you still need to put an extra atomic group around it to achieve the same effect. (?:a|b)*+ is equivalent to (?>(?:a|b)*) but not to (?>a|b)*. The latter is a valid regular expression, but it won't have the same effect when used as part of a larger regular expression.
E.g. (?:a|b)*+b and (?>(?:a|b)*)b both fail to match b. a|b will match the b. The star is satisfied, and the fact that it's possessive or the atomic group will cause the star to forget all its backtracking positions. The second b in the regex has nothing left to match, and the overall match attempt fails.
In the regex (?>a|b)*b, the atomic group forces the alternation to give up its backtracking positions. I.e. if an a is matched, it won't come back to try b if the rest of the regex fails. Since the star is outside of the group, it is a normal, greedy star. When the second b fails, the greedy star will backtrack to zero iterations. Then, the second b matches the b in the subject string.
This distinction is particularly important when converting a regular expression written by somebody else using possessive quantifiers to a regex flavor that doesn't have possessive quantifiers. You could, of course, let a tool like RegexBuddy do the job for you.