Handling Chinese punctuation in C++

Source: Internet; Editor: 程序博客网; Date: 2024/06/05 04:06

1. Stripping punctuation from both ends of a string: English only

    str.erase(0, str.find_first_not_of(str_puncuation));
    str.erase(str.find_last_not_of(str_puncuation) + 1);

Here str_puncuation holds the punctuation characters to remove.
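The two calls above can be wrapped in a small helper. This is a minimal sketch: the name TrimPunctuation and the sample punctuation set in the usage note are illustrative and do not come from the original post.

```cpp
#include <string>

// Hypothetical helper: strips any characters found in `punctuation`
// from both ends of `str`. It works byte by byte, so it is only safe
// for single-byte (ASCII) text.
std::string TrimPunctuation(std::string str, const std::string& punctuation)
{
    // Remove the leading run. If the string is all punctuation,
    // find_first_not_of returns npos and erase(0, npos) clears it.
    str.erase(0, str.find_first_not_of(punctuation));
    // Remove the trailing run. On an empty string find_last_not_of
    // returns npos, and npos + 1 wraps around to 0.
    str.erase(str.find_last_not_of(punctuation) + 1);
    return str;
}
```

For example, TrimPunctuation("...hello, world!!", ".,!") yields "hello, world"; punctuation in the interior of the string is left alone.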

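Problem: this breaks when the string contains Chinese text. std::string works on bytes, and in a multibyte encoding each Chinese character or full-width punctuation mark occupies several bytes, so a byte of a punctuation mark can match part of a Chinese character (half of one, in a two-byte encoding). For example, “【” can match “网”.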
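To make the mechanism concrete, here is a sketch that assumes UTF-8 (the post does not say which encoding was in use). In UTF-8, “】” is the bytes E3 80 91 and “网” is E7 BD 91, so the two share the final byte 0x91. The bytes are written as escapes so the result does not depend on the source file's encoding; the function and constant names are illustrative.

```cpp
#include <string>

// Hypothetical helper: the byte-wise trailing trim from section 1.
std::string TrimRightBytes(std::string s, const std::string& punctuation)
{
    s.erase(s.find_last_not_of(punctuation) + 1);
    return s;
}

// UTF-8 bytes of "中网": 中 = E4 B8 AD, 网 = E7 BD 91.
const std::string kText = "\xE4\xB8\xAD\xE7\xBD\x91";
// UTF-8 bytes of "】": E3 80 91. Its last byte equals the last byte of 网.
const std::string kPunct = "\xE3\x80\x91";

// TrimRightBytes(kText, kPunct) removes the lone 0x91 byte and leaves
// E4 B8 AD E7 BD: five bytes ending in a truncated, invalid character.
```

Even though the text contains no punctuation at all, the trim removes one byte of “网” and leaves a string that is no longer valid UTF-8.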

2. Stripping punctuation from both ends of a string: Chinese and English

Convert both the string and the punctuation set to wide characters first, then do the trimming.

Converting string to wstring:

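    #include <clocale>
    #include <cstdlib>
    #include <string>
    #include <vector>

    // Convert a multibyte string (in the current locale's encoding) to a wstring.
    std::wstring StringToWstring(const std::string& str)
    {
        // Required: without this call the "C" locale is used and
        // multibyte Chinese text is not decoded correctly.
        std::setlocale(LC_CTYPE, "");
        // The result never has more wide characters than the input has bytes;
        // the extra slot holds the terminator (and keeps empty input safe).
        std::vector<wchar_t> buf(str.size() + 1);
        std::size_t n = std::mbstowcs(buf.data(), str.c_str(), buf.size());
        if (n == static_cast<std::size_t>(-1))
            return std::wstring();  // invalid multibyte sequence
        return std::wstring(buf.data(), n);
    }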

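Then strip the punctuation from both ends with the code below, where wstr and wstr_puncuation are the string and the punctuation set after conversion to wide characters.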

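    wstr.erase(0, wstr.find_first_not_of(*wstr_puncuation));
    wstr.erase(wstr.find_last_not_of(*wstr_puncuation) + 1);

(The dereference suggests that wstr_puncuation is a pointer to a std::wstring here; if it is a std::wstring object, pass it directly.)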
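The wide-character trim can be sketched as a self-contained helper. The helper name, the constant name, and the sample punctuation set are assumptions made for illustration. The characters are written as \u escapes so the example does not depend on the source file's encoding or on the runtime locale.

```cpp
#include <string>

// Hypothetical helper: the same two-step trim, but on wide characters,
// so each punctuation mark and each Chinese character is one unit
// (assuming every character involved fits in a single wchar_t).
std::wstring TrimPunctuationW(std::wstring wstr, const std::wstring& punctuation)
{
    wstr.erase(0, wstr.find_first_not_of(punctuation));
    wstr.erase(wstr.find_last_not_of(punctuation) + 1);
    return wstr;
}

// Sample punctuation set (an assumption): 【 】 ， 。 plus a few ASCII marks.
const std::wstring kPunctW = L"\u3010\u3011\uFF0C\u3002,.!";
```

With this set, TrimPunctuationW(L"\u3010\u4E2D\u7F51\u3011", kPunctW), that is “【中网】”, returns L"\u4E2D\u7F51" (“中网”), with both characters intact.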

3. Addendum: removing punctuation (Chinese and English) with regular expressions in C++

Example: match a string that ends with “网”; if the match succeeds, extract the part of the string before “网”.

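    boost::wregex wrg(L"(.*?)(网)");  // the L prefix makes this a wide-character literal
    boost::wsmatch wsm;
    bool r = boost::regex_match(wsToMatch, wsm, wrg);
    if (r)  // the match succeeded
    {
        // wsm[1] is the first parenthesized group, i.e. the text before "网";
        // wsm[0] would be the whole match, including "网".
        std::wstring ws = wsm[1].str();
        size_t iLen = wcstombs(NULL, ws.c_str(), 0);  // bytes needed, excluding the terminator
        if (iLen != (size_t)-1)  // (size_t)-1 means a character could not be converted
        {
            char *lpsz = new char[iLen + 1];
            wcstombs(lpsz, ws.c_str(), iLen);
            lpsz[iLen] = '\0';
            string sToMatch(lpsz);
            delete[] lpsz;
            cout << "result:" << sToMatch << endl;
        }
    }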
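The same match can also be written with the standard library's <regex> header instead of Boost. This is a sketch, not the author's code: std::wregex is swapped in for boost::wregex, the function name is illustrative, and “网” is written as \u7F51 so the pattern does not depend on the source file's encoding.

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: if `input` ends with 网 (U+7F51), returns
// {true, text before the final 网}; otherwise returns {false, empty}.
std::pair<bool, std::wstring> PrefixBeforeNet(const std::wstring& input)
{
    static const std::wregex pattern(L"(.*?)(\u7F51)");
    std::wsmatch m;
    // regex_match requires the whole input to match the pattern,
    // so the second group can only be the last character.
    if (!std::regex_match(input, m, pattern))
        return std::make_pair(false, std::wstring());
    return std::make_pair(true, m[1].str());  // group 1, not group 0
}
```

For the input “程序博客网” the helper reports a match and returns “程序博客”. An input that does not end with “网” reports no match.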

For the conversions between string and wstring, see: http://www.cnblogs.com/SunboyL/archive/2013/03/31/stringandwstring.html
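For the reverse direction, a counterpart to StringToWstring might look like the sketch below. The name WstringToString is an assumption, and the result is encoded in whatever the environment's locale specifies. Calling wcstombs with a null destination to obtain the required length is the same technique the regex example uses; it is supported on POSIX systems and by MSVC.

```cpp
#include <clocale>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical counterpart to StringToWstring: converts a wide string to a
// multibyte string in the current locale's encoding.
std::string WstringToString(const std::wstring& wstr)
{
    std::setlocale(LC_CTYPE, "");  // use the environment's locale for the conversion
    // Ask for the required byte count first (terminator not included).
    std::size_t len = std::wcstombs(nullptr, wstr.c_str(), 0);
    if (len == static_cast<std::size_t>(-1))
        return std::string();  // a character has no representation in this locale
    std::vector<char> buf(len + 1);
    std::wcstombs(buf.data(), wstr.c_str(), buf.size());
    return std::string(buf.data(), len);
}
```

Using a std::vector for the buffer avoids the manual new[] and delete[] pair, so the memory is released even if an exception is thrown along the way.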

1 0