How a Regex Engine Works Internally


First Look at How a Regex Engine Works Internally

Knowing how the regex engine works will enable you to craft better regexes more easily. It will help you understand quickly why a particular regex does not do what you initially expected. This will save you lots of guesswork and head scratching when you need to write more complex regexes.

There are two kinds of regular expression engines: text-directed engines, and regex-directed engines. Jeffrey Friedl calls them DFA and NFA engines, respectively. All the regex flavors treated in this tutorial are based on regex-directed engines. This is because certain very useful features, such as lazy quantifiers and backreferences, can only be implemented in regex-directed engines. No surprise that this kind of engine is more popular.

Notable tools that use text-directed engines are awk, egrep, flex, lex, MySQL and Procmail. For awk and egrep, there are a few versions of these tools that use a regex-directed engine.

You can easily find out whether the regex flavor you intend to use has a text-directed or regex-directed engine. If backreferences and/or lazy quantifiers are available, you can be certain the engine is regex-directed. You can also test it by applying the regex "regex|regex not" to the string "regex not". If the resulting match is only "regex", the engine is regex-directed. If the result is "regex not", then it is text-directed. The reason is that the regex-directed engine is "eager": it reports a match as soon as the first alternative succeeds, without checking whether the second alternative would give a longer match.
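The test described above can be tried directly. The following is a minimal sketch in Python, whose standard re module is a backtracking (regex-directed) engine; the variable names and the two extra sample strings are illustrative assumptions, not part of the original tutorial.

```python
import re

# The alternation test from the text: a regex-directed engine stops at the
# first alternative that succeeds, even though the second one is longer.
result = re.search(r"regex|regex not", "regex not").group()
print(result)  # regex

# Backreferences and lazy quantifiers are available too, which is the
# other sign of a regex-directed engine.
doubled = re.search(r"(\w)\1", "hello").group()  # backreference to group 1
lazy = re.search(r"<.+?>", "<a><b>").group()     # lazy quantifier
print(doubled, lazy)  # ll <a>
```

A text-directed engine would be expected to return "regex not" for the first search, since it reports the longest match at the leftmost position.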

In this tutorial, after introducing a new regex token, I will explain step by step how the regex engine actually processes that token. This inside look may seem a bit long-winded at certain times. But understanding how the regex engine works will enable you to use its full power and help you avoid common mistakes.

The Regex-Directed Engine Always Returns the Leftmost Match

This is a very important point to understand: a regex-directed engine will always return the leftmost match, even if a "better" match could be found later. When applying a regex to a string, the engine will start at the first character of the string. It will try all possible permutations of the regular expression at the first character. Only if all possibilities have been tried and found to fail, will the engine continue with the second character in the text. Again, it will try all possible permutations of the regex, in exactly the same order. The result is that the regex-directed engine will return the leftmost match.
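To illustrate the leftmost-match rule, here is a small sketch in Python. The sample text and variable names are made up for this example, and "better" is taken to mean "longer", which is only one possible reading.

```python
import re

text = "id 7, then 12345"

# The engine reports the first position where the regex can match,
# even though a longer run of digits appears later in the text.
first = re.search(r"\d+", text)
print(first.group(), first.start())  # 7 3

# Later matches are only found by resuming the search after the
# previous match, still working from left to right.
all_matches = [m.group() for m in re.finditer(r"\d+", text)]
print(all_matches)  # ['7', '12345']
```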

When applying the regex "cat" to the string "He captured a catfish for his cat.", the engine first tries to match the first token in the regex, "c", to the first character in the string, "H". This fails. There are no other possible permutations of this regex, because it merely consists of a sequence of literal characters. So the regex engine tries to match the "c" with the "e". This fails too, as does matching the "c" with the space. Arriving at the 4th character in the string, "c" matches "c". The engine will then try to match the second token, "a", to the 5th character, "a". This succeeds too. But then "t" fails to match "p". At that point, the engine knows the regex cannot be matched starting at the 4th character in the string. So it continues with the 5th character, "a". Again, "c" fails to match here and the engine carries on. At the 15th character in the string, "c" again matches "c". The engine then attempts to match the remainder of the regex from character 15 onward and finds that "a" matches "a" and "t" matches "t".

The entire regular expression could be matched starting at character 15. The engine is "eager" to report a match. It will therefore report the first three letters of "catfish" as a valid match. The engine never proceeds beyond this point to see if there are any "better" matches. The first match is considered good enough.
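The walk through the string can be sketched as a short loop. This is a deliberately simplified illustration that handles only literal characters; find_literal is a hypothetical helper written for this example, not how any real regex engine is implemented.

```python
import re

def find_literal(pattern, text):
    """Return the 0-based start of the leftmost match of a literal pattern, or -1."""
    for start in range(len(text) - len(pattern) + 1):
        for offset, token in enumerate(pattern):
            if text[start + offset] != token:
                break  # this start position fails; try the next character
        else:
            return start  # every token matched: report at once ("eager")
    return -1

subject = "He captured a catfish for his cat."
start = find_literal("cat", subject)
print(start + 1)                 # 15 (1-based, as counted in the text)
print(subject[start:start + 3])  # cat

# Python's own engine agrees on the position of the first match.
print(re.search("cat", subject).start() == start)  # True
```

The loop never looks at the final "cat" in the sentence, because it returns as soon as the match inside "catfish" succeeds.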

In this first example of the engine's internals, our regex engine simply appears to work like a regular text search routine. A text-directed engine would have returned the same result too. However, it is important that you can follow the steps the engine takes in your mind. In the examples that follow, the way the engine works will have a profound impact on the matches it finds. Some of the results may be surprising. But they are always logical and predetermined, once you know how the engine works.
