C++ 新特性学习（三） — Regex库

来源：互联网发布：手机电话号码数据库编辑：程序博客网时间：2024/06/05 08:57

C++ STL终于会放点实用的东西了。可喜可贺。

这个，显然是正则表达式库，作为一个强大而又NB的库，我表示对其理解甚少，只能先研究下基本用法，更具体的用法要等实际应用中用到的时候在细看了。
PS：正则表达式的资料见 http://www.regexlab.com/
更多资料见 http://www.owent.net/?p=264

就这样吧，开始。
正则表达式这玩意是用自动机搞出来的，效率当然就是自动机的效率了。当然不同的实现效率是不一样的，至于STL的效率。我就不清楚了，不过姑且相信STL吧。

第一个注意：使用正则表达式的转义的时候，不要忘了C/C++的斜杠也是要转义的
正则表达式主要函数有三
std::regex_search
std::regex_match
std::regex_replace
第三个好说，看函数名就知道什么意思，但是前两个呢？
直接报答案吧，第一个是不完全匹配，第二个是完全匹配。

同时，在正则表达式库里还有两个重要的类
enum std::regex_constants::match_flag_type 这个看名字就能知道是设置匹配选项的，具体选项看内容就很容易看懂，也不用多解释了。
另一个是类模版std::match_results，传进去的类型是类的迭代器
如以下从VC里抄来的

typedef basic_regex<char> regex;

typedef basic_regex<wchar_t> wregex;

typedef match_results<const char> cmatch;

typedef match_results<const wchar_t> wcmatch;

typedef match_results<string::const_iterator> smatch;

typedef match_results<wstring::const_iterator> wsmatch;
这都是默认定义
这个用于记录匹配结果，匹配如果成功，它里面会有多个std::sub_match对象，分别指向匹配的结果
std::sub_match里有matched成员表示该项是否匹配成功，还有first和second成员分别指向匹配的目标的起始位置和结束位置，str()函数可以获取匹配的值
而同时std::match_results的prefix()和suffix()函数分别指向整个匹配式的头和尾。返回的类型也是std::sub_match，内容和上面的类似

这里有第二个注意：匹配结果里的数据是共享的，只是指针不同，所以要注意不要随意释放资源。
另外有第三个注意：匹配返回真的时候才会对传入的匹配项的变量修改，如果返回false，传入的std::match_results是不会变化的

接下来就是std::regex_replace了，说到这个还涉及到std::match_results的format函数，这是一个表示筛选匹配项的的东东
具体的嘛，看下面（只是把BOOST里的东西简单翻译以下，没有boost扩展的部分，并且只留下了VC++里tr1包含的功能，他说是Perl风格的）

占位符

含义

整个匹配值

$MATCH

和 $& 一样

${^MATCH}

和 $& 一样

被匹配字符串去除匹配目标后的结果（即）

$PREMATCH

和 $` 一样

${^PREMATCH}

和 $` 一样

当前匹配位置之后的全部文本（不包括匹配的字符串）

$POSTMATCH

和 $' 一样

${^POSTMATCH}

和 $' 一样

字符 '$'

第n和被匹配项的值

我表示boost的功能更强大不过这些已经够了。
另外转义字符如下

Escape

Meaning

Outputs the bell character: '\a'.

Outputs the ANSI escape character (code point 27).

Outputs a form feed character: '\f'

Outputs a newline character: '\n'.

Outputs a carriage return character: '\r'.

Outputs a tab character: '\t'.

Outputs a vertical tab character: '\v'.

\xDD

Outputs the character whose hexadecimal code point is 0xDD

\x{DDDD}

Outputs the character whose hexadecimal code point is 0xDDDDD

\cX

Outputs the ANSI escape sequence "escape-X".

If D is a decimal digit in the range 1-9, then outputs the text that matched sub-expression D.

Causes the next character to be outputted, to be output in lower case.

Causes the next character to be outputted, to be output in upper case.

Causes all subsequent characters to be output in lower case, until a \E is found.

Causes all subsequent characters to be output in upper case, until a \E is found.

Terminates a \L or \U sequence.

这个就懒得翻译和测试了，都是很简单的东西。

接下来std::regex_replace里的format也是传入这种东西，返回的就是替换后的字符串了。

另外正则表达式错误，会抛出异常，当然你也可以配合std::regex_constants::match_flag_type做一些变化。

最后，贴出代码和结

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
#include <string>
#include <iostream>
#include <algorithm>
#include <regex>
#include <cstdio>
 
 
 
int main() {
    usingnamespace std;
 
    regex reg("(http|https)://([\\w\\./]*)");
    string strIn;
    std::smatch res;
    boolisUrl;
 
    // 查找
    getline(cin, strIn);
    isUrl = std::regex_search(strIn, res, reg, std::regex_constants::match_not_null);
    cout<< (isUrl?"It's a url": "It's not a url")<< endl;
    // 输入 MyBlog is http://www.owent.net/ 匹配成功
    // 匹配结果里有三项，分别是整个匹配表达式和两个子表达式
    // 以下代码输出
    // 这个时候千万不能执行类似strIn = "" 改变strIn内容的操作，
    // 因为其和res指针指向的内存是共享的，如果对其进行就该会出现RE
    for(std::smatch::size_type i = 0; i < res.size(); i ++) {
        cout<<"第"<< i + 1<< "条匹配项first地址 => "<< &(res[i].first)<< endl;
        cout<<"第"<< i + 1<< "条匹配项second地址 => "<< &(res[i].second)<< endl;
        cout<<"第"<< i + 1<< "条匹配值为 => "<< res[i].str()<< endl<< endl;
    }
 
   
    // 匹配
    isUrl = std::regex_match(strIn, res, reg);
    cout<< isUrl<<" <= Matched? ,Size =>"<<res.size()<< endl;
    // 输入 MyBlog is http://www.owent.net/ 匹配失败，但是没有修改res的值
    // 所以会输出上一次匹配的结果： 3
    
    // 替换
    string strRule ="<a href=\"$&\">$&</a><br />\nScheme is $1\nAddress is $2";
    string strOut = std::regex_replace(strIn, reg, strRule);
    cout<< strOut<< endl;
    return0;
}
 
//以下是输入“MyBlog is http://www.owent.net/ ”的输出结果：
//It's a url
//第1条匹配项first地址 => 0032EB70
//第1条匹配项second地址 => 0032EB7C
//第1条匹配值为 => http://www.owent.net/
//
//第2条匹配项first地址 => 0032EB8C
//第2条匹配项second地址 => 0032EB98
//第2条匹配值为 => http
//
//第3条匹配项first地址 => 0032EBA8
//第3条匹配项second地址 => 0032EBB4
//第3条匹配值为 => www.owent.net/
//
//0 <= Matched? ,Size =>3
//MyBlog is <a href="http://www.owent.net/">http://www.owent.net/</a><br />
//Scheme is http
//Address is www.owent.net/