How a Regex Engine Works Internally
Source: Internet  Published by: 模拟植物生长算法  Editor: 程序博客网  Time: 2024/05/24 07:22
First Look at How a Regex Engine Works Internally
Knowing how the regex engine works will enable you to craft better regexes more easily. It will help you understand quickly why a particular regex does not do what you initially expected. This will save you lots of guesswork and head scratching when you need to write more complex regexes.
There are two kinds of regular expression engines: text-directed engines, and regex-directed engines. Jeffrey Friedl calls them DFA and NFA engines, respectively. All the regex flavors treated in this tutorial are based on regex-directed engines. This is because certain very useful features, such as lazy quantifiers and backreferences, can only be implemented in regex-directed engines. No surprise that this kind of engine is more popular.
Notable tools that use text-directed engines are awk, egrep, flex, lex, MySQL and Procmail. For awk and egrep, there are a few versions of these tools that use a regex-directed engine.
You can easily find out whether the regex flavor you intend to use has a text-directed or regex-directed engine. If backreferences and/or lazy quantifiers are available, you can be certain the engine is regex-directed. You can also test this by applying the regex regex|regex not to the string "regex not". If the resulting match is only "regex", the engine is regex-directed. If the result is "regex not", then it is text-directed. The reason behind this is that the regex-directed engine is "eager": it reports a match as soon as the first alternative succeeds, without checking whether the second alternative would match more text.
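The alternation test can be tried directly. The sketch below assumes Python's built-in re module, which is not one of the tools named in this text; it serves only as a convenient example of a flavor that supports backreferences and lazy quantifiers. The helper name alternation_test is made up for illustration.

```python
import re

def alternation_test(search):
    """Apply the regex 'regex|regex not' to the string 'regex not'.

    `search` is any function with the same signature as re.search.
    """
    match = search(r"regex|regex not", "regex not")
    return match.group(0)

result = alternation_test(re.search)
# Only "regex" means the first alternative won: a regex-directed engine.
# "regex not" would mean the longest alternative won: a text-directed engine.
engine_kind = "regex-directed" if result == "regex" else "text-directed"
print(result, "->", engine_kind)
```

With Python's re module the match is only "regex", so the test classifies it as regex-directed.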
In this tutorial, after introducing a new regex token, I will explain step by step how the regex engine actually processes that token. This inside look may seem a bit long-winded at certain times. But understanding how the regex engine works will enable you to use its full power and help you avoid common mistakes.
The Regex-Directed Engine Always Returns the Leftmost Match
This is a very important point to understand: a regex-directed engine will always return the leftmost match, even if a "better" match could be found later. When applying a regex to a string, the engine will start at the first character of the string. It will try all possible permutations of the regular expression at the first character. Only if all possibilities have been tried and found to fail, will the engine continue with the second character in the text. Again, it will try all possible permutations of the regex, in exactly the same order. The result is that the regex-directed engine will return the leftmost match.
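A small illustration of the leftmost rule, again assuming Python's re module as a stand-in for a regex-directed engine. The sample string "ab aaaa" is invented for this sketch: a longer run of a's exists later in the string, yet the match at the first possible position is the one reported.

```python
import re

text = "ab aaaa"
m = re.search(r"a+", text)

# The single "a" at index 0 is reported, although a longer run
# of four a's starts at index 3.
leftmost = (m.start(), m.group(0))
print(leftmost)
```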
When applying cat to He captured a catfish for his cat., the engine will try to match the first token in the regex c to the first character in the string H. This fails. There are no other possible permutations of this regex, because it merely consists of a sequence of literal characters. So the regex engine tries to match the c with the e. This fails too, as does matching the c with the space. Arriving at the 4th character in the string, c matches c. The engine will then try to match the second token a to the 5th character, a. This succeeds too. But then, t fails to match p. At that point, the engine knows the regex cannot be matched starting at the 4th character in the string. So it will continue with the 5th: a. Again, c fails to match here and the engine carries on. At the 15th character in the string, c again matches c. The engine then proceeds to attempt to match the remainder of the regex at character 15 and finds that a matches a and t matches t.
The entire regular expression could be matched starting at character 15. The engine is "eager" to report a match. It will therefore report the first three letters of catfish as a valid match. The engine never proceeds beyond this point to see if there are any "better" matches. The first match is considered good enough.
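The walkthrough above can be sketched as a loop over start positions. This is a simplified model for a regex made only of literal characters, not the code of any real engine; the function name find_literal is hypothetical.

```python
def find_literal(pattern, text):
    """Return the 0-based start index of the leftmost match, or -1.

    At each start position the tokens are compared one by one, and the
    position is abandoned at the first mismatch.
    """
    for start in range(len(text) - len(pattern) + 1):
        for offset, token in enumerate(pattern):
            if text[start + offset] != token:
                break  # this start position fails; try the next one
        else:
            return start  # every token matched: report at once ("eager")
    return -1

text = "He captured a catfish for his cat."
start = find_literal("cat", text)

# Convert the 0-based index to the 1-based character count used in the text.
print(start + 1, text[start:start + 3])
```

The loop returns as soon as one start position succeeds, so the later occurrence of cat at the end of the sentence is never examined.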
In this first example of the engine's internals, our regex engine simply appears to work like a regular text search routine. A text-directed engine would have returned the same result too. However, it is important that you can follow the steps the engine takes in your mind. In the following examples, the way the engine works will have a profound impact on the matches it will find. Some of the results may be surprising. But they are always logical and predetermined, once you know how the engine works.