Regex在.Net中的平衡组-----Part 2 of 2

来源：互联网发布：网络钓鱼演示编辑：程序博客网时间：2024/05/14 15:06

Introduction

Visit here

This is the second article in a short series where I go in depth with the .NET RegEx engine and Regex class. The first part treated nested RegEx constructions in depth. In this part, I'll study the balancing group and the .NET Regex class and related objects - again using nested constructions as my main focus.

If you are an experienced RegEx developer, please feel free to fast forward to the part "Manipulating nested constructions."

RegEx Anatomy

The Capture class is an essential part of the Regex class. In fact both the Group and the Match class inherit from the Capture class. The class is a specific .NET invention and even though many developers won't ever need to call this class explicitly, it does have some cool features, for example with nested constructions.

According to the .NET documentation, an instance of the Capture class contains a result from a single sub expression capture.

This is easier to grasp with an example.

a+

This is a very simple RegEx that matches 1 or more successive a's. Given the input string aa the RegEx will match the whole string "in one capture". The Match object created will contain one Group object (because a Match object always contains one ordinal 0 Group object), and this Group object will contain one Capture. But we can change this behaviour by manipulating the RegEx a bit:

(a)+

Now the RegEx contains 2 Group objects:

Groups[0]: aa //i.e. the whole matchGroups[1]: a  //i.e. the last capture in the parenthesis

Remember that when putting something in a parenthesis, what you actually do is that you tell the RegEx engine to treat what's inside the parenthesis as a Group. It is important that the Group class consists of zero or more Capture objects and always returns the latest capture in the group.

What about the rest of the captures? You guessed right. They are still stored in the Captures collection in the Group object:

Groups[0].Value: aa             //i.e. the whole matchGroups[1].Value: a              //i.e. the last capture in the groupGroups[1].Captures[0].Value: a  //i.e. the first capture in the groupGroups[1].Captures[1].Value: a  //i.e. the second capture in the group                                    //     - this is equal to Groups[1]

At first glance this might just seem awkward - why would you need all of those Capture objects? But they do have some cool features.

Manipulating Nested Constructions

Remember the nested constructions from Part I in this article series? Here is what they looked like:

"(?>    "/b (?<DEPTH> )    |    /b" (?<-DEPTH> )    |    [^"]*)*(?(DEPTH)(?!))"

With this RegEx we'll use this input string:

he said "The Danish capital "Copenhagen" is a nice place" and laughed

Now, applying the RegEx to this input string returns:

match.Value:                          "The Danish ... nice place"match.Groups["DEPTH"].Value:          ""match.Groups["DEPTH"].Captures.Count: 0

This looks a bit strange. The RegEx engine creates a Group, but it doesn't contain anything... The reason is that we emptied the group from within the RegEx. Remember how (?<DEPTH>) pushes a capture on the stack and (?<-DEPTH>) pops the stack? Because the double-quotes are nested, the RegEx continues pushing/popping the stack until it ends up empty. This is what the code (?(DEPTH)(?!)) is actually testing! Therefore the DEPTH Group is created but has no captures.

What if we want to get information about the captures afterwards? We've just deleted them to test correct nesting! - And we've observed that the DEPTH Group is empty... The answer to this challenge is the balancing group. The .NET documentation is almost unreadable on this topic. Thus I won't quote it. Instead, I'll try demonstrating the idea.

We already know that a named capture creates a stack and pushes each capture on the stack. This is done with the code (?<STACKNAME>).

We also know that we can pop the stack with the code (?<-STACKNAME>).

Finally we can test if the stack exists with the code (?(STACKNAME) then | else).

As you might already notice, the balanced group looks like a cross between (?<STACKNAME>) and (?<-STACKNAME>). And actually, that's a good way to look at it.

Let's walk through an example:

(?# line 1) ((?# line 2)    (?<OPEN>)"/b(?# line 3)   |(?# line 4)   /b"  (?<QUOTE-OPEN>)(?# line 5)   |(?# line 6)   [^"]*(?# line 7) )*(?# line 8) (?(OPEN)(?!))

This matches either an opening quote (followed by a word boundary), a closed quote (following a word boundary) or any character which is not a double quote.

This example does the same as we saw in the examples with nested parenthesis in the last article. It pushes an empty element on the OPEN stack (line 2) when an opening quote is matched, and it pops the OPEN stack when a closing element is matched (line 4). Finally it tests whether the stack is empty (line 8).

But the pop command looks a bit different: (?<QUOTE-OPEN>). This command can be divided into two parts. If we begin backwards, the last part is identical to (?<-OPEN>) which pops the OPEN stack. The first part on the other hand is pushing an element on a new stack - the QUOTE stack. I've illustrated what happens in the figure below.

Hence the QUOTE stack contains two captures which can be addressed at runtime like this.

//return "Copenhagen":Match.Groups["QUOTE"].Captures[0];//return "The Danish capital "Copenhagen" is a nice place":Match.Groups["QUOTE"].Captures[1];

In other words. The balancing group pushes and pops two different stacks at the same time. Let's call the stacks PUSH and POP respectively. First it looks at the top of the POP stack. Then it addresses everything that is matched "from" the position of the capture in the top of the POP stack and "up until" the current position. It then pushes this on the PUSH stack and pops the POP stack. Go through the figure again one step at a time - it's really worth understanding the balanced group. Below I've tried to generalize this concept.

In fact, the pop command that we've used before: (?<-DEPTH>), is a balanced group! It omits the PUSH stack and only pops a stack. The case is that in the balanced group syntax (?<NAME1-NAME2>) the first part (NAME1) is optional. If we leave it out, the balanced group just pops NAME2.

Peeking the Stack: Nesting with Multiple Parenthetic Symbols

Unfortunately it is not possible to peek a stack and test the result directly. For example you might want to match constructions nested with both parenthesises and square brackets such as ([]).

On puzzleware.net the algorithm below is posted for this purpose (rewritten a bit):

1. Do ONE of the following (a) to (e)    a. Match '(' and push it on the stack LEVEL    b. If the top of the stack LEVEL is '(' then try to match ')'       and pop the stack    c. Match '[' and push it on the stack LEVEL    d. If the top of the stack LEVEL is '[' then try to match ']'       and pop the stack    e. Otherwise match any character (or nothing) except a (, ), [ and ]2. Repeat (1) zero or more times.3. Finally test if the stack LEVEL is empty

The basic idea is to keep the latest kind of opening parenthesis matched on top of the stack. With this knowledge we only allow the correct closing parenthesis to match the opening parenthesis.

As already described, the problem is how to peek the stack. This is not directly possible which is also stated in the mentioned blog post here, and therefore this is only an algorithm of how it "could" be done.

But, actually, if we use balanced grouping we can get around this problem. It won't be pretty, but it works.

We want to make sure that we only capture the correct closing bracket. So, take a look at a short example:

Paul a [les table repeint(es)]

This sentence is from Chomsky's 'The Minimalist Program'. First, we'll rewrite the peeking algorithm a bit:

The previous 1b looked like this:1b. If the top of the stack LEVEL is '(' then try to match ')'     and pop the stackThe new version looks like this:1b.   i.   Lookahead to make sure the next symbol is ')'    ii.  If the symbol before the last capture on the stack was a '('        then try to match ')'   iii. Pop the stack

This makes a difference because now we can use a balanced group. Here's the RegEx:

(?# line 01) (?>(?# line 02)       /( (?<LEVEL>)(?<CURRENT>)(?# line 03)     | (?# line 04)       (?=/))(?# line 05)       (?<LAST-CURRENT>)(?# line 06)       (?(?<=/(/k<LAST>)(?# line 07)         (?<-LEVEL> /))(?# line 08)       )(?# line 09)     |(?# line 10)       /[ (?<LEVEL>)(?<CURRENT>)(?# line 11)     |(?# line 12)       (?=/])(?# line 13)       (?<LAST-CURRENT>)(?# line 14)       (?(?<=/[/k<LAST>)(?# line 15)         (?<-LEVEL> /] )(?# line 16)       )(?# line 17)     |(?# line 18)       [^()/[/]]*(?# line 19)   )+(?# line 20) (?(LEVEL)(?!))

I don't blame you if you can't grasp this expression immediately, but I will encourage you to take the time and let me walk you through it - it's quite a rewarding example.

First, we break it down in two parts. Lines 2 - 8 match ( and ) while lines 10 - 15 match [ and ]. These are the main parts. Additionally line 18 matches any character that is not (, ), [ and ], and line 20 tests if the LEVEL stack is empty.

We'll focus on the first of the two main parts, i.e. lines 2 - 8. When we understand this part, we'll understand the whole expression.

First (line 2) we try to match an opening parenthesis (. If this succeeds, we push two different stacks with empty elements: LEVEL and CURRENT. The two stacks have different purposes. The LEVEL stack makes sure that the number of opening and closing parenthesis matched are equal. We'll use the CURRENT stack to "peek" the LEVEL stack. Hereby we make sure that we match the correct kind of opening parenthesis with the correct kind of closing parenthesis. I've put the word peek in double quotes indicating that we're actually cheating a bit, but we'll get back to this later.

Now, to match the opening ( with a closing ) we have set some restrictions. First line 4 states that the following symbol must be a closing parenthesis. This is not very surprising. But line 5 is quite interesting. Here we use balanced grouping to push a new stack (LAST). What happens is that we take everything which is matched since the last CURRENT stack until the current position and push this whole capture on the LAST stack. Remember that the CURRENT stack is always pushed just after a ( or [ (line 2 and 10). On the LAST stack we then push everything that is matched since the last ( or [ up until the current position.

In the figure below, I've illustrated the process.

I have left out the LEVEL stack because the only job for this stack is to test that the number of nestings are correct.

The relation between the OPEN stack and the LAST stack is more fun. Here the CURRENT stack is used to determine if the closing parenthesis should be ) or ]. The trick is a positive lookbehind (in lines 6 and 14). The RegEx engine looks behind to test if the latest match is equal to the top of the LAST stack. This will always be the case! Therefore it is possible to prefix this lookbehind statement with either a /( or a /[. If the symbol before the top of the LAST stack is ( then a ) is matched and vice versa.

This lookbehind is a bit expensive though! It would be much easier if it was possible to catch all of the parenthesises in one stack and solely test which kind of parenthesis resides on the top of the stack. But - as far as I know - this is not possible. At least not yet, so if you need to do a peek, the pattern described here is possibly the only way.

The expression has one major advantage though. As you can see in the figure, all of the parenthesis matched are stored in the LAST stack which is never popped. This enables us to request the parenthesis one at a time:

foreach (System.Text.RegularExpressions.Capture capture                        in match.Groups["LAST"].Captures){    Console.WriteLine(capture.Value);}

In our example this will return:

esles table repeint(es)

Points of Interest

The balancing group is hard to understand. And it is not very well documented. I hope this article does part of the job. The balancing group also turns out to be very useful in various cases, first of all when we need to address each of the captures in a nested pattern. Secondly we are able to use the balancing group to mimic a peek on the stack and match nested constructions with multiple parenthetic symbols.