Boost Study Notes: Strings & Text Processing

Source: Internet. Published by: 淘宝的观复博物馆. Editor: 程序博客网. Time: 2024/06/05 19:19

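All examples are on GitHub: https://github.com/RemidNer/BoostTestObject

Build environment: Windows 10, Visual Studio 2015, Boost 1.65.0

Overview:

lexical_cast, string_algo and format are the core of Boost's string and text processing. Together they cover:

a. Converting between numeric values and strings

b. Precisely formatted output

c. The concrete representation of strings

lexical_cast:

This function is similar to atoi in C. It converts literal values between string, int and float. A simple example follows:

Example 1: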

#include <iostream>
#include <string>
#include <typeinfo>                        // for typeid
#include <boost/lexical_cast.hpp>          // to use lexical_cast

using namespace std;
using namespace boost;

template<typename T>                       // class template that supplies an overloaded operator<<
struct outable
{
    friend ostream& operator<<(ostream& os, const T&)
    {
        os << typeid(T).name();            // writes the implementation-defined type name
        return os;
    }
};

class DemoClass : public outable<DemoClass>
{
};

void case1()
{
    cout << lexical_cast<string>(DemoClass()) << endl;     // converts the object through operator<< and prints the class name
}

int main()
{
    case1();
}

A simple usage example:

Example 2:

#include <iostream>
#include <string>
#include <boost/lexical_cast.hpp>

using namespace std;
using namespace boost;

int main()
{
     int x = lexical_cast<int>("100");               //string ---> int
     long y = lexical_cast<long>("20000");           //string ---> long
     float z = lexical_cast<float>("3.14159e5");     //string ---> float
     double j = lexical_cast<double>("2.1767675");   //string ---> double

     std::cout << x << " " << y << " " << z << " " << j << std::endl;

     /*
     * Output: 100 20000 314159 2.17677
     * (std::cout prints floating-point values with 6 significant digits by default)
     */

     /////////////////////////////////////
     string str = lexical_cast<string>(456);          //int ---> string
     std::cout << str << std::endl;
     std::cout << lexical_cast<string>(0.618) << std::endl;     //double ---> string
     std::cout << lexical_cast<string>(0x10) << std::endl;      //hexadecimal integer literal ---> string

     /*
     * Output (one value per line): 456 0.61799999999999999 16
     */
}

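Notes:

When lexical_cast converts a string to a number, the string may contain only digits, a sign and a decimal point. Letters (apart from e/E used for an exponent) and any other non-numeric characters are not allowed.

lexical_cast cannot convert strings such as "123L" or "0x100", even though they are valid C++ numeric literals. It also has no advanced format control and cannot turn a number into a string with a specified format. When finer format control is needed, use:

std::stringstream
boost::format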
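
As a rough illustration of the std::stringstream route, the sketch below parses a hexadecimal string such as "0x100", which lexical_cast<int> rejects. The helper name parse_hex is hypothetical and only serves this example; it is one possible approach, not the only one.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: parse a hexadecimal string (an optional "0x" prefix is accepted).
// Returns true and stores the value in `out` only if the whole string was consumed.
bool parse_hex(const std::string& text, unsigned long& out)
{
    std::istringstream in(text);
    unsigned long value = 0;
    in >> std::hex >> value;      // std::hex switches the stream to base 16
    if (in.fail())
        return false;             // no hexadecimal digits could be read

    char extra = 0;
    if (in >> extra)
        return false;             // trailing characters remain, so reject the input

    out = value;
    return true;
}
```

A call such as parse_hex("0x100", v) stores 256 in v, while input with trailing garbage is rejected in the same spirit as lexical_cast.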

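The bad_lexical_cast exception:

When a conversion fails, lexical_cast throws bad_lexical_cast, which derives from std::bad_cast. To keep the program robust, protect conversion code with a try/catch block, as follows:

Example 3: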

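try
{
     // The first conversion already throws, so the statements after it are never reached;
     // each of them would fail in the same way if executed on its own.
     cout << lexical_cast<int>("0x100");
     cout << lexical_cast<double>("HelloWorld");
     cout << lexical_cast<long>("1000L");
     cout << lexical_cast<bool>("flase") << endl;
}
catch(const bad_lexical_cast& e)
{
     cout << "error: " << e.what() << endl;
}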

Running the code above prints:

error: bad lexical cast: source type value could not be interpreted as target

The exception can also be used to validate numeric strings. This can be wrapped in a function template:

Example 4:

#include <cassert>
#include <boost/lexical_cast.hpp>

using namespace boost;

template<typename T>
bool Num_valid(const char *str)
try
{
     lexical_cast<T>(str);          // attempt the conversion
     return true;
}
catch(const bad_lexical_cast&)
{
     return false;
}

/*
* Num_valid uses a function-try-block to catch bad_lexical_cast.
* It returns true if lexical_cast succeeds on the string and false otherwise.
*/

int main()
{
     assert(Num_valid<double>("3.14"));
     assert(!Num_valid<int>("3.14"));
     assert(Num_valid<int>("65535"));
}

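Requirements on the converted types:

lexical_cast only imitates a cast operator; it is actually a function template. Internally it uses the standard library stream operators, so the types involved must meet the following requirements:

a. The source type must be output streamable, that is, it overloads operator<<;

b. The target type must be input streamable, that is, it overloads operator>>;

c. The target type must be default constructible and copy constructible.

Built-in types such as int and double, as well as std::string, meet all three conditions, and they are the types most often used with lexical_cast.

STL containers and other user-defined types generally do not meet these conditions and cannot be converted with lexical_cast as they are.

Using it with your own classes:

To use a class of your own as the source of a lexical_cast, overloading operator<< is enough, as in Example 1. Using it as the target additionally requires operator>>, per requirement b.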

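Format:

boost.format provides printf()-like formatting objects that format arguments into a string. Unlike printf in C, the formatting is completely type safe.

The format component lives in namespace boost. To use it, include the header:

#include <boost/format.hpp>

using namespace boost;

A simple example of boost.format:

Example 5: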

#include <iostream>
#include <boost/format.hpp>

using namespace std;
using namespace boost;

void case1()
{
     cout << format("%s:%d + %d = %d\n") % "Sum" % 1 % 2 % (1 + 2);

     format fmt("(%1% + %2%) * %2% = %3%\n");
     fmt % 2 % 5;
     fmt % ((2 + 5) * 5);
     cout << fmt.str();
}

int main()
{
     case1();
}

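The program prints:

Sum:1 + 2 = 3
(2 + 5) * 5 = 35

Walkthrough:

The first statement shows the simplest use of format. format(...) constructs a temporary (anonymous) format object whose constructor argument is the format string. The string follows standard printf() syntax and uses %x specifiers to describe each argument.

Because the number of arguments to format is not fixed, printf() relies on C variadic arguments (the ellipsis in the parameter declaration), which is not type safe. format instead imitates the stream operator << and overloads the binary operator% as its argument-feeding operator, which can likewise chain any number of arguments. So:

format(...) % a % b % c;    // can be read as the following
format(...) << a << b << c;

The operator feeds the arguments to the format object one by one to format them. The format object also supports stream output, so the formatted string it holds can be written directly to cout.

The printf() call equivalent to the first format statement is:

printf("%s:%d + %d = %d\n", "Sum", 1, 2, (1 + 2));

The last three lines show another way to use format: create a format object in advance, which later code can reuse for many formatting operations. The object still receives its arguments through operator%, and they can be fed in several steps rather than all at once, but the number of arguments must match what the format string requires. Finally, the str() member function returns the formatted string, which is written to cout.

The second format uses a syntax slightly different from printf(): "(%1% + %2%) * %2% = %3%". Somewhat like C#, %N% refers to an argument by position, so a repeated argument only has to be supplied once. This is an improvement over printf() syntax.

The printf() call equivalent to the second format object is:

printf("(%d + %d) * %d = %d\n", 2, 5, 5, (2 + 5) * 5);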

Class summary:

format is not a class in its own right but a typedef. The real implementation is basic_format, declared (abridged) as follows:

template<class charT, class Traits = std::char_traits<charT> >
class basic_format;
typedef basic_format<char> format;

// Summary of the basic_format class:
template<class charT, class Traits = std::char_traits<charT> >
class basic_format
{
     public:
          explicit basic_format(const charT *str);
          explicit basic_format(const string &s);
          basic_format& operator=(const basic_format& x);

          string_t str() const;
          size_type size() const;
          void clear();
          basic_format& parse(const string_t&);

          // pass arguments through this operator:
          template<class T> basic_format& operator%(T& x);
          friend std::basic_ostream& operator<<(...);
};//basic_format

typedef basic_format<char >        format;
typedef basic_format<wchar_t >     wformat;

string str(const format& );

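Member overview:

a. The basic_format constructors accept a C string (a zero-terminated character array) or a std::string as the format string, which follows printf-like rules. The constructors are declared explicit, so a format object must be constructed explicitly;

b. str() returns the formatted string held by the format object without clearing it. If not all of the arguments required by the format string have been supplied, it throws an exception. The library also provides a free function str() in namespace boost that returns the same formatted string;

c. size() returns the length of the formatted string, equivalent to str().size(). Likewise, it throws if not all required arguments have been supplied;

d. parse() clears the internal buffer and switches to a new format string. To only clear the buffer, use clear(), which restores the format object to its initial state. If the format string expects arguments, calling str() or size() right after either function throws, because no arguments have been fed yet;

e. format overloads operator% to accept arguments of any type to be formatted. The number of arguments fed with % must exactly equal the number the format string requires; too many or too few leads to an exception. After str() has produced the string, or after clear() has emptied the buffer, % can be used again;

f. format also overloads the stream output operator, so the formatted string can be written directly to an I/O stream, which is equivalent to writing str() to the stream.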

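Format syntax:

format largely inherits the printf format syntax. It has only a few incompatibilities with printf, which are rarely encountered in practice.

Each printf format specifier starts with % followed by the format rules, which specify alignment, width, precision and conversion type, for example:

%05d     : an integer of width 5, padded with 0
%-8.3f   : a left-aligned floating-point number of total width 8 with 3 decimal places
% 10s    : a string of width 10, padded with spaces
%05X     : an uppercase hexadecimal integer of width 5, padded with 0

Code example:

format fmt("%05d\n%-8.3f\n% 10s\n%05X\n");
cout << fmt %62 %2.236 % "123456789" %48;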

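The output is as follows (the second line is followed by three padding spaces):

00062
2.236
 123456789
00030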

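In addition to classic printf-style formatting, format adds new specifier forms:

a. %|spec|: works like the printf specifier, but the vertical bars on both sides separate the specifier more clearly from ordinary characters;

b. %N%: marks the N-th argument; it acts as a placeholder and carries no other formatting options.

Using the %|spec| form, the example above can be written as:

format fmt("%|05d|\n%|-8.3f|\n%| 10s|\n%|05X|\n");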

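Performance of format:

printf() performs no type-safety checks and writes straight to stdout, so it is very fast. format does much more checking than printf(), so it is noticeably slower, by the author's estimate roughly 2 to 5 times slower than printf() overall.

If the performance of format matters, first build a const format object and then copy it for each formatting operation. This is somewhat faster than constructing a new format object from the format string each time, for example:

const format fmt("%10d %020.8f %010X %10.5e\n");
cout << format(fmt)%62 % 2.236 % 255 % 0.618;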

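Advanced usage:

format offers printf-like functionality, but it is not the same thing as the printf function, which is the benefit of an object-oriented design. Beyond ordinary format strings, the format class has several advanced features: it can modify format options and bind arguments at run time.

a. basic_format& bind_arg(int argN,const T& val)

Fixes the argument at position argN of the format string to val. The binding survives clear() and stays until clear_bind() or clear_binds() is called;

b. basic_format& clear_bind(int argN)

Removes the binding of the argument at position argN;

c. basic_format& clear_binds()

Removes the bindings at all positions and calls clear();

d. basic_format& modify_item(int itemN,T manipulator)

Sets the format options of the item at position itemN of the format string; manipulator is an object returned by boost::io::group();

e. boost::io::group(T1 a1, ..., Var const& var)

A function template supporting up to 10 arguments (10 overloads). It applies I/O stream manipulators to specify the format of an argument value. The stream manipulators are declared in the header <iomanip>.

The following example shows these features:

Example 6: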

#include <iostream>
#include <iomanip>
#include <boost/format.hpp>

using namespace std;
using namespace boost;
using boost::io::group;

void case1()
{
     // a format object with three arguments and five format items
     format fmt("%1% %2% %3% %2% %1% \n");
     cout << fmt %1 % 2 % 3;

     fmt.bind_arg(2,10);          // fix the second argument to 10
     cout << fmt %1 %3;           // feed the remaining two arguments

     fmt.clear();                 // clear the buffer; bound arguments are kept

     // use group() inside operator% so that the first argument is shown in octal with its base prefix
     cout << fmt % group(showbase,oct, 111) % 333;

     fmt.clear_binds();          // remove all bound arguments

     // first format item: hexadecimal, width 8, right aligned, padded with *
     fmt.modify_item(1,group(hex,right,showbase,setw(8),setfill('*')));
     cout << fmt % 49 % 20 % 100;
}

int main()
{
     case1();
}

/*
* Output:
* 1 2 3 2 1
* 1 10 3 10 1
* 0157 10 333 10 0157
* ****0x31 20 100 20 49
*/

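string_algo:

string_algo is a comprehensive string algorithm library. It provides a large number of string functions, such as case-insensitive comparison, trimming and searching for substrings by pattern, and handles most string problems without regular expressions.

The library lives in namespace boost::algorithm, but using declarations bring its names into namespace boost. To use string_algo, include the header and declare:

#include <boost/algorithm/string.hpp>

using namespace boost;

Example code: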

#include <cassert>
#include <cstdlib>                              // for system()
#include <iostream>
#include <string>
#include <vector>
#include <boost/smart_ptr.hpp>
#include <boost/make_shared.hpp>
#include <boost/algorithm/string.hpp>           // for the string_algo library

using namespace std;
using namespace boost;

void case1()
{
       // shared_ptr and make_shared avoid the memory problems of raw new/delete
       boost::shared_ptr<std::string> ps = boost::make_shared<std::string>(", I made a stupid decision to leave the world forever");
       std::cout << "The ps content is: " << *ps << std::endl;
}

void case2()
{
       std::string str("ReadMe.txt");
       if (boost::ends_with(str, "txt"))                                            // check the suffix
       {
              std::cout << boost::to_upper_copy(str) + " UPPER" << std::endl;
              assert(boost::ends_with(str, "txt"));
       }

       boost::replace_first(str, "ReadMe", "followme");                            // replace in place
       cout << "The replace_first str content: " << str << endl;

       vector<char> v(str.begin(), str.end());                                     // the algorithms also work on a vector of characters
       vector<char> v2 = to_upper_copy(erase_first_copy(v, "txt"));                // erase "txt" first, then upper-case the copy

       /*
       for (int i = 0; i < v2.size(); ++i)
       {
              cout << v2[i];
       }*/
       for (auto tmp : v2)                                                         // range-based for; it costs about the same as a hand-written iterator loop
       {
              cout << tmp;
       }
       cout << endl;
}

int main()
{
       case1();
       case2();
       system("pause");
}

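This example shows the basic use of ends_with(), to_upper_copy(), replace_first(), erase_first_copy() and other functions from string_algo. Their names are self-explanatory and can be read literally.

string_algo performance overview:

string_algo is designed for strings, but the objects it processes need not be string or basic_string<T>. Any container that meets the boost.range requirements will do, and the elements need not be char or wchar_t; any copy-constructible and assignable type works. If copying or assigning the element type is expensive, however, the performance of string_algo degrades.

Algorithm names in string_algo follow standard library conventions: they are all lowercase and use prefixes or suffixes to distinguish versions. The naming rules are:

a. Prefix i: the algorithm is case insensitive; without it the algorithm is case sensitive;

b. Suffix _copy: the algorithm leaves the input unchanged and returns a processed copy; without it the algorithm works in place and the input is also the output;

c. Suffix _if: the algorithm takes a predicate function object; without it the default criterion is used;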

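The algorithms provided by string_algo fall into five categories:

a. Case conversion

b. Predicates and classification

c. Trimming

d. Find and replace

e. Split and join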

A. Case conversion:

string_algo converts the case of strings efficiently with two groups of algorithms: to_upper() and to_lower().

The two algorithms are declared as follows:

template<typename T> void to_upper(T &Input);

template<typename T> void to_lower(T &Input);

Usage:

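#include <iostream>
#include <string>
#include <boost/algorithm/string.hpp>

using namespace std;
using namespace boost;

void case1()
{
     string str("I Don't Know.\n");
     cout << "to_upper_copy: " << to_upper_copy(str);// returns an upper-case copy
     cout << "str content: " << str;                 // the original string is unchanged
     to_lower(str);                                  // convert to lower case in place
     cout << "to_lower: " << str;                    // the original string has changed
}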

Output:

to_upper_copy: I DON'T KNOW.

str content: I Don't Know.

to_lower: i don't know.

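B. Predicate algorithms:

Predicate algorithms test the relationship between two strings. They include:

1) starts_with: tests whether one string is a prefix of another

2) ends_with: tests whether one string is a suffix of another

3) contains: tests whether one string is contained in another

4) equals: tests whether two strings are equal

5) lexicographical_compare: tests whether one string is lexicographically less than another

6) all: tests whether all elements of a string satisfy the given predicate

Every algorithm except all has an i-prefixed version. Because none of these functions modifies the original string, there are no _copy versions.

Examples of these algorithms: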

#include <iostream>
#include <vector>
#include <boost/smart_ptr.hpp>
#include <boost/make_shared.hpp>
#include <boost/algorithm/string.hpp>           //for use string_algo library

using namespace std;
using namespace boost;

void case4()
{
       //starts_with() & ends_with() & contains() & equals() & lexicographical_compare() & all()
       string str("Power Bomb");

       assert(iends_with(str, "bomb"));                // case-insensitive suffix test
       assert(!ends_with(str, "bomb"));                // case-sensitive suffix test
       assert(starts_with(str, "Pow"));                // prefix test
       assert(contains(str, "er"));                    // containment test

       string str2 = to_lower_copy(str);               // lower-case copy
       assert(iequals(str, str2));                     // case-insensitive equality

       string str3 = "power suit";
       assert(lexicographical_compare(str, str3));     // case-sensitive comparison: 'P' sorts before 'p'

       assert(all(str2.substr(0, 5), is_lower()));     // every character of the substring is lower case
}

int main()
{
       /*
       case1();
       case2();
       case3();
       */
       case4();
       system("pause");
}

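C. Predicate algorithms (function objects):

string_algo extends the standard library's equal_to<> and less<> function objects: they can compare arguments of different types and come in case-insensitive forms. These function objects include:

1) is_equal: like the equals algorithm, tests whether two objects are equal

2) is_less: tests whether one object is less than the other

3) is_not_greater: tests whether one object is not greater than the other

A usage example: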

void case5()
{
       cout << "In case5() functions" << endl;
       //is_equal() & is_less() & is_not_greater()
       string str1 = "Samus", str2 = "samus";

       assert(!is_equal()(str1, str2));
       assert(is_less()(str1, str2));
}

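Note the two pairs of parentheses after the function object name: the first calls the constructor of the function object and creates a temporary, and the second is the actual call through operator().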

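D. Classification:

string_algo provides a set of classification functions that test whether a character has a certain property. They are mainly used together with other algorithms:

1) is_space: whether the character is whitespace

2) is_alnum: whether the character is a letter or a digit

3) is_alpha: whether the character is a letter

4) is_cntrl: whether the character is a control character

5) is_digit: whether the character is a decimal digit

6) is_graph: whether the character is a graphic character

7) is_print: whether the character is printable

8) is_lower: whether the character is lower case

9) is_punct: whether the character is a punctuation character

10) is_upper: whether the character is upper case

11) is_xdigit: whether the character is a hexadecimal digit

12) is_any_of: whether the character is any of the characters in the argument sequence

13) is_from_range: whether the character lies within the given range, i.e. from <= ch <= to;

Note that these functions do not test a character themselves. They are factory functions that return a function object (of type detail::is_classifiedF for the character-class functions), and it is that object's operator() that performs the actual classification check.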

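E. Trimming:

string_algo provides three trimming algorithms: trim_left, trim_right and trim.

They remove whitespace at the beginning or end of a string. Each accepts the _if and _copy suffixes, so each algorithm has four versions. The _if versions take a predicate IsSpace and remove every leading or trailing character for which IsSpace(c) is true.

Examples for sections D and E: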

void case7()
{
       format fmt("|%s|\n");

       string str = "   samus aran   ";
       cout << "Delete Both Spaces: " << fmt % trim_copy(str) << endl;      // trim both ends
       cout << "Delete Left Space : " << fmt % trim_left_copy(str) << endl;// trim the left end
       cout << "Delete Right Space: " << fmt % trim_right_copy(str) << endl;// trim the right end

       trim_right(str);                                                     // trim the right end in place
       cout << "In Situ Delete: " << fmt % str << endl;

       string str1 = "2017 is a year of egg pain;";
       cout << "Delete Left Nums: " << fmt % trim_left_copy_if(str1, is_digit());   // trim digits on the left
       cout << "Delete Right Punct: " << fmt % trim_right_copy_if(str1, is_punct());  // trim punctuation on the right
       cout << "Delete Both Nums & Punct & Spaces: " << fmt % trim_copy_if(str1, is_punct() || is_digit() || is_space());
}

int main()
{
       case7();
       system("pause");
}

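F. Find:

The find algorithms of string_algo resemble the standard library's search(), but the interface differs: instead of an iterator to the position found, they return a boost.range iterator_range covering the whole match, which carries more information.

string_algo provides the following find algorithms:

1) find_first: finds the first occurrence of a string in the input

2) find_last: finds the last occurrence of a string in the input

3) find_nth: finds the n-th occurrence of a string in the input (counting from 0)

4) find_head: takes the first N characters of a string, like substr(0,n);

5) find_tail: takes the last N characters of a string

Because these algorithms do not modify the original string, they have no _copy versions. The first three have i-prefixed versions. Example: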

void case8()
{
    //find_first & find_last & find_nth & find_head & find_tail
    format fmt("|%s| .Pos value is: %d\n");
    string str = "Long Long Ago,There Have A King;";

    iterator_range<string::iterator> rge;        // iterator range that receives each result
    rge = find_first(str, "Long");               // first occurrence, case sensitive
    cout << "Find First: " << setw(5) <<  fmt % rge % (rge.begin() - str.begin());

    rge = ifind_first(str, "Long");              // first occurrence, case insensitive
    cout << "Ifind first: " << setw(5) << fmt % rge % (rge.begin() - str.begin());

    rge = find_nth(str, "ng", 2);                // third occurrence of "ng" (the index starts at 0)
    cout << "Find nth: " << setw(5) << fmt % rge % (rge.begin() - str.begin());

    rge = find_head(str, 4);                     //Take the first four characters
    cout << "Find Head: " << setw(5) << fmt % rge % (rge.begin() - str.begin());

    rge = find_tail(str, 5);                     //Take the last five characters
    cout << "Find Tail: " << setw(5) << fmt % rge % (rge.begin() - str.begin());

    rge = find_first(str, "samus");              //Not Find
    assert(rge.empty() && !rge);

}

int main()
{
    case8();
    system("pause");
}

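G. Replace and erase:

Replace and erase are closely related to the find algorithms: they process the string after a match has been found, so their names are similar:

1) replace/erase_first: replaces/erases the first occurrence of a string in the input

2) replace/erase_last: replaces/erases the last occurrence of a string in the input

3) replace/erase_nth: replaces/erases the n-th occurrence of a string in the input (counting from 0)

4) replace/erase_all: replaces/erases all occurrences of a string in the input

5) replace/erase_head: replaces/erases the beginning of the input

6) replace/erase_tail: replaces/erases the end of the input

These algorithms form a large family. The first eight (the replace and erase forms of first, last, nth and all) each combine the "i" prefix and the "_copy" suffix, giving four versions; the last four (head and tail) only have the "_copy" suffix, giving two versions. Example: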

void case9()
{
       //replace_*** & erase_***
       string str = "Samus beat the monster.\n";

       cout << "replace_first_copy: " << replace_first_copy(str, "Samus", "samus") << endl;

       replace_last(str, "beat", "kill");
       cout << "replace_last: " << str << endl;

       cout << "ierase_all_copy: " << ierase_all_copy(str, "samus") << endl;
       cout << "replace_nth_copy: " << replace_nth_copy(str, "l", 1, "L") << endl;   // replace the second "l" (index 1)
       cout << "erase_tail_copy: " << erase_tail_copy(str, 8) << endl;
}

int main()
{
       case9();
       system("pause");
}

H. Split:

string_algo provides two splitting algorithms, find_all and split. They break a string into parts according to a strategy and copy the parts into a given container. Example:

void case10()
{
       string str = "Samus,Link.Zelda::Mario-Luigi+zelda";
       deque<string> d;

       ifind_all(d, str, "zELDA");       // collect every case-insensitive occurrence of "zelda"
       assert(d.size() == 2);
       cout << "deque size: " << d.size() << endl;

       for (BOOST_AUTO(pos, d.begin());pos != d.end();++pos)
       {
              cout << "Pos:[ " << *pos << " ]";
       }
       cout << endl;

       list <iterator_range<string::iterator>> ls;
       split(ls, str, is_any_of(",.:-+"));      //Use punctuation marks
       for (auto tmp:ls)
       {
              cout << "Pos: [ " << tmp << " ]";
       }
       cout << endl;

       ls.clear();
       split(ls, str, is_any_of(".:-"), token_compress_on);
       for (auto tmp : ls)
       {
              cout << "Pos:[ " << tmp << " ];";
       }
       cout << endl;
}

int main()
{
       case10();
       system("pause");
}

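I. Join:

join is the inverse of the split algorithms: it concatenates the strings stored in a container into a new string, with a separator of your choice between them. Example: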

#include <iostream>
#include <vector>
#include <iomanip>
#include <string>
#include <list>
#include <boost/assign.hpp>                    //for use list_of()
#include <boost/format.hpp>
#include <boost/smart_ptr.hpp>
#include <boost/make_shared.hpp>
#include <boost/typeof/typeof.hpp>
#include <boost/algorithm/string.hpp>           //for use string_algo() library

using namespace std;
using namespace boost;
using namespace boost::assign;

void case11()
{
       vector<string> str = list_of("Samus")("Link")("Zelda")("Mario");
       cout << "Vector str size is: " << str.size() << endl;
       cout << "Vector str Content: " << join(str, "+") << endl;                   // join all elements with "+"

       struct is_contains_a                                                        // predicate: does the string contain "a"?
       {
              bool operator()(const string &st) const
              {
                     return contains(st, "a");
              }
       };
       cout << "After Operator() str Content: " << join_if(str, "**", is_contains_a()) << endl;  // join only the elements accepted by the predicate
}

int main()
{
       case11();
       system("pause");
}

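J. Find and split iterators:

Besides the general find_all and split, string_algo provides two iterators, find_iterator and split_iterator. They traverse the matches in a string the way an iterator does, finding or splitting without a container to hold the results. Example: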

#include <iostream>
#include <vector>
#include <iomanip>
#include <string>
#include <list>
#include <boost/assign.hpp>                      //for use list_of() function
#include <boost/format.hpp>
#include <boost/smart_ptr.hpp>
#include <boost/make_shared.hpp>
#include <boost/typeof/typeof.hpp>
#include <boost/algorithm/string.hpp>           //for use string_algo library

using namespace std;
using namespace boost;

void case12()
{
       string str("Samus||samus||mario||||Link");

       typedef find_iterator<string::iterator> string_find_iterator; // find iterator type

       string_find_iterator pos, end;                                 // a default-constructed iterator marks the end
       for (pos = make_find_iterator(str,first_finder("samus",is_iequal()));pos != end;++pos)
       {
              cout << "Pos Content is: " << *pos << ";";
       }
       cout << endl;

       typedef split_iterator<string::iterator> string_split_iterator; // split iterator type

       string_split_iterator p, endp;                                  // a default-constructed iterator marks the end
       for (p = make_split_iterator(str,first_finder("||",is_iequal()));p != endp;++p)          // is_iequal() compares case-insensitively
       {
              cout << "P Content is: " << *p << ";";
       }
       cout << endl;
}

int main()
{
       case12();
       system("pause");
}

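How it works:

To use these iterators, first declare a find_iterator or split_iterator object. Their template parameter is an iterator type, such as string::iterator or char*.

To obtain the starting position, call first_finder() to describe what to match, then call make_find_iterator or make_split_iterator to actually create the iterator. Related finders include last_finder, nth_finder and token_finder; like the corresponding find algorithms, they select which match to look for.

Once initialized, the iterator can be advanced like a standard iterator or a pointer and dereferenced to obtain each match, until no further match is found.

Pay special attention to the split iterator: it can split on a separator string of any length, whereas the ordinary split algorithm can only use single characters as separators.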

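tokenizer:

tokenizer is a library dedicated to tokenizing strings. It offers a simple way to break a string into words. It resembles the split algorithms of string_algo, but there are also many differences.

tokenizer lives in namespace boost. To use it, include the header and declare: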

#include <boost/tokenizer.hpp>

using namespace boost;

/*
* tokenizer class synopsis
*/
template<typename TokenizerFunc = char_delimiters_separator<char>,
          typename Iterator = std::string::const_iterator,
          typename Type = std::string>
class tokenizer
{
public:
     tokenizer(Iterator first,Iterator last,const TokenizerFunc& f);
     tokenizer(const Container& c,const TokenizerFunc& f);     // Container is a template parameter of the member

     void assign(Iterator first,Iterator last);
     void assign(const Container& c);
     void assign(const Container& c,const TokenizerFunc& f);

     iterator begin() const;
     iterator end() const;
};

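Template parameters:

A. TokenizerFunc: the tokenizer function object; the default splits on whitespace and punctuation

B. Iterator: the iterator type of the character sequence

C. Type: the type that holds each token

All three template parameters have defaults. Usually only the first two vary; the third is generally std::string or std::wstring, which is why it comes last in the parameter list.

The tokenizer constructor accepts the string to tokenize either as an iterator range or as a container with begin() and end() member functions.

assign() sets a new string to tokenize so that the tokenizer can be reused.

tokenizer has a container-like interface: begin() starts tokenizing and returns an iterator to the first token, and end() marks the end of the token sequence.

Usage:

tokenizer is used much like the split iterator of string_algo, but is simpler. It can be treated like a container: construct a tokenizer from the string to tokenize, then obtain an iterator with begin() and iterate.

A detailed example: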

#include <iostream>
#include <vector>
#include <iomanip>
#include <string>
#include <list>                                 //for use list<std::string> str
#include <boost/assign.hpp>                     //for use list_of() function
#include <boost/format.hpp>                     //for use format fmt("***")
#include <boost/tokenizer.hpp>                  //for use tokenizer<> tok(std::string)
#include <boost/smart_ptr.hpp>                  //for use shared_ptr()
#include <boost/make_shared.hpp>                //for use make_shared()
#include <boost/typeof/typeof.hpp>              //for use BOOST_AUTO
#include <boost/algorithm/string.hpp>           //for use string_algo library

using namespace std;
using namespace boost;

void case13()
{
       //tokenizer<> tok(std::string);
       string str = "Link raise the master-sword.";

       tokenizer<> tok(str);                    // tokenizer with the default template arguments;
                                                // by default it splits on whitespace and punctuation

       for (BOOST_AUTO(pos,tok.begin());pos != tok.end();++pos)
       {
              cout << " Pos Content: " << *pos << endl;
       }
}

int main()
{
       case13();
       system("pause");
}

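Tokenizer function objects:

Any function object that provides a suitable operator() and has reset() semantics can be passed to the tokenizer constructor. The tokenizer library provides four predefined ones:

a. char_delimiters_separator: splits on punctuation; it is deprecated and not recommended;

b. char_separator: uses a set of characters as separators; its default behavior is similar to char_delimiters_separator;

c. escaped_list_separator: for CSV (comma-separated) data;

d. offset_separator: splits by offsets; useful for flat-file formats with fixed-width fields;

The three main ones are described below:

a. char_separator: splits on a set of characters and behaves much like the split algorithm. Its constructor is:

char_separator(const char* dropped_delims,const char* kept_delims = 0,empty_token_policy empty_tokens = drop_empty_tokens);

The constructor parameters mean:

1) dropped_delims: a set of separators; these characters do not appear in the result;

2) kept_delims: a set of separators that are kept and returned as tokens;

3) empty_tokens: similar to the eCompress parameter of the split algorithm; it controls how two adjacent separators are handled. With keep_empty_tokens, adjacent separators produce an empty token, like token_compress_off in split; with drop_empty_tokens, empty tokens are not returned.

A default-constructed char_separator is equivalent to char_separator(" ", <punctuation characters>, drop_empty_tokens): it splits on spaces and punctuation, keeps the punctuation and does not output empty tokens. Example: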

#include <iostream>
#include <vector>
#include <iomanip>
#include <string>
#include <cstring>                                     //for use strlen()
#include <list>                                        //for use list<std::string> str
#include <boost/assign.hpp>                            //for use list_of() function
#include <boost/format.hpp>                            //for use format fmt("***")
#include <boost/tokenizer.hpp>                         //for use tokenizer<> tok(std::string)
#include <boost/smart_ptr.hpp>                         //for use shared_ptr()
#include <boost/make_shared.hpp>                //for use make_shared()
#include <boost/typeof/typeof.hpp>              //for use BOOST_AUTO
#include <boost/algorithm/string.hpp>           //for use string_algo library

using namespace std;
using namespace boost;
using namespace boost::assign;

template<typename T>
void print(T &tok)
{
       for (BOOST_AUTO(pos,tok.begin()); pos != tok.end(); ++pos)
       {
              cout << " Pos Content: " << *pos << endl;
       }
}

void case14()
{
       //char_separator()
       const char *str = "Link ;; <master-sword> zelda";     // string literals are const

       char_separator<char> seq;                              // a default char_separator
       tokenizer < char_separator<char>, const char*> tok(str, str + strlen(str), seq); // build the tokenizer from the char_separator
       cout << "tokenizer: " << endl;
       print(tok);                                            // tokenize and print

       tok.assign(str, str + strlen(str), char_separator<char>(" ;-","<>"));       // tokenize again with new rules
       cout << "tok.assign: " << endl;
       print(tok);

       tok.assign(str, str + strlen(str), char_separator<char>(" ;-<>", "", drop_empty_tokens));
       cout << "Second assign: " << endl;
       print(tok);

}

int main()
{
       case14();
       system("pause");
}

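b. escaped_list_separator: a tokenizer function dedicated to the CSV (Comma-Separated Values) format. Its constructor is declared as:

escaped_list_separator(char e = '\\',char c = ',',char q = '\"');

The parameters usually keep their defaults. They mean:

1) e: the escape character used in the string, '\' by default;

2) c: the separator, ',' by default;

3) q: the quote character, '"' by default.

An example: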

#include <iostream>
#include <vector>
#include <iomanip>
#include <string>
#include <list>                                        //for use list<std::string> str
#include <boost/assign.hpp>                            //for use list_of() function
#include <boost/format.hpp>                            //for use format fmt("***")
#include <boost/tokenizer.hpp>                         //for use tokenizer<> tok(std::string)
#include <boost/smart_ptr.hpp>                         //for use shared_ptr()
#include <boost/make_shared.hpp>                //for use make_shared()
#include <boost/typeof/typeof.hpp>              //for use BOOST_AUTO
#include <boost/algorithm/string.hpp>           //for use string_algo library

using namespace std;
using namespace boost;
using namespace boost::assign;

template<typename T>
void print(T &tok)
{
       for (BOOST_AUTO(pos,tok.begin()); pos != tok.end(); ++pos)
       {
              cout << " Pos Content: " << *pos << endl;
       }
}

void case15()
{
       //escaped_list_separator()
       string str = "id,100,name,\"mario\"";

       escaped_list_separator<char> seq;
       tokenizer<escaped_list_separator<char>> tok(str, seq);
       print(tok);
}

int main()
{
       case15();
       system("pause");
}
/************************************************************************/
/* Output:
/*  Pos Content: id
/*  Pos Content: 100
/*  Pos Content: name
/*  Pos Content: mario
/* Press any key to continue . . .
/************************************************************************/

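c. offset_separator: unlike the previous two, this tokenizer function does not look for separators but works with offsets. It is very useful for text that uses fixed field widths instead of separators. Its constructor is:

template<typename Iter>
offset_separator(Iter begin,Iter end,bool wrap_offsets = true,bool return_partial_last = true);

The constructor takes two iterators (or array pointers), begin and end, that delimit a sequence of integer offsets; each element of the sequence is the width of one token field.

The bool parameter wrap_offsets decides whether tokenizing continues from the start of the sequence once the offsets are used up. The bool parameter return_partial_last decides whether a final token shorter than its offset is returned. Both default to true. Example: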

#include <iostream>
#include <vector>
#include <iomanip>
#include <string>
#include <list>                                        //for use list<std::string> str
#include <boost/assign.hpp>                            //for use list_of() function
#include <boost/format.hpp>                            //for use format fmt("***")
#include <boost/tokenizer.hpp>                         //for use tokenizer<> tok(std::string)
#include <boost/smart_ptr.hpp>                         //for use shared_ptr()
#include <boost/make_shared.hpp>                //for use make_shared()
#include <boost/typeof/typeof.hpp>              //for use BOOST_AUTO
#include <boost/algorithm/string.hpp>           //for use string_algo library

using namespace std;
using namespace boost;
using namespace boost::assign;


template<typename T>
void print(T &tok)
{
       for (BOOST_AUTO(pos,tok.begin()); pos != tok.end(); ++pos)
       {
              cout << " Pos Content: " << *pos;
       }
       cout << endl;
}

void case16()
{
       //offset_separator
       string str = "2233344445";
       int offsets[] = { 2, 3, 4 };                                              // field widths
       offset_separator seq(offsets, offsets + 3, true, true);
       tokenizer<offset_separator> tok(str, seq);
       print(tok);

       tok.assign(str, offset_separator(offsets, offsets + 3, false));          // do not wrap around
       print(tok);

       str += "56667";
       tok.assign(str, offset_separator(offsets, offsets + 3, true, false));    // wrap around, drop a short last token
       print(tok);

       /************************************************************************/
       /* Output:
       /*  Pos Content: 22 Pos Content: 333 Pos Content: 4444 Pos Content: 5
       /*  Pos Content: 22 Pos Content: 333 Pos Content: 4444
       /*  Pos Content: 22 Pos Content: 333 Pos Content: 4444 Pos Content: 55 Pos Content: 666
       /* Press any key to continue . . .*/
       /************************************************************************/
}

int main()
{
       case16();
       system("pause");
}
