UTF-8 Encoding: A Hands-On Test

Source: Internet | Editor: 程序博客网 | Time: 2024/06/06 01:30

This article uses a C++ program to inspect the binary form of UTF-8 encoded characters, to get a practical feel for how UTF-8 works.

The development environment is Ubuntu 12.04 (32-bit) with GCC 4.6.3. The system byte order is little-endian.

Take the Chinese character '一'. The online tool at http://rishida.net/tools/conversion/ shows that its Unicode code point is U+4E00 (0x4E00).

Encoding 0x4E00 as UTF-8 gives E4 B8 80, three bytes.
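To see where E4 B8 80 comes from, the standard UTF-8 bit layout can be applied by hand. The sketch below is not part of the article's code: `EncodeUtf8` is a hypothetical helper name, and it is simplified in that it does not reject surrogate code points (U+D800 to U+DFFF).

```cpp
#include <cassert>
#include <stdint.h>
#include <string>

// Hypothetical helper (not from the article): encode one Unicode code
// point as a UTF-8 byte sequence. Simplification: surrogates are not rejected.
std::string EncodeUtf8(uint32_t cp) {
  assert(cp <= 0x10FFFF);  // largest valid Unicode code point
  std::string out;
  if (cp < 0x80) {
    // 1 byte:  0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

For 0x4E00 (binary 0100 111000 000000) the three-byte branch applies: the top 4 bits give 1110 0100 = E4, the middle 6 bits give 10 111000 = B8, and the low 6 bits give 10 000000 = 80.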

The following code prints the binary representation:

    #include "test.h"
    #include "util/endian.h"
    #include "util/utf.h"
    #include <iostream>
    using namespace std;

    int main(int argc, char ** argv) {
      // TEST(3 > 2);
      char const * p = "一";
      cout << PrintStringAsBinaryString(p) << endl;
      string str = "一";
      cout << PrintStringAsBinaryString(str) << endl;
      cout << IsLittleEndian() << endl;
    }

The two functions in utf.h are implemented as follows:

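    #ifndef UTIL_UTF_H_
    #define UTIL_UTF_H_

    #include <cstring>   // strlen
    #include <sstream>   // std::stringstream
    #include <string>
    #include "util/endian.h"

    // Print every byte of a NUL-terminated C string as 8 binary digits.
    // inline, because the definition lives in a header.
    inline std::string PrintStringAsBinaryString(char const* p) {
      std::stringstream stream;
      size_t const len = strlen(p);  // compute the length once, not per iteration
      for (size_t i = 0; i < len; ++i) {
        stream << PrintIntAsBinaryString(p[i]);
        stream << " ";
      }
      return stream.str();
    }

    // Same as above, for std::string.
    inline std::string PrintStringAsBinaryString(std::string const& str) {
      std::stringstream stream;
      for (size_t i = 0; i < str.size(); ++i) {
        stream << PrintIntAsBinaryString(str[i]);
        stream << " ";
      }
      return stream.str();
    }

    #endif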

The code for PrintIntAsBinaryString is:

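    #include <stdint.h>
    #include <sstream>
    #include <string>

    // Get the value (0 or 1) of the bit at the given index.
    // index starts at 0 (the least significant bit).
    // Defined first, because PrintIntAsBinaryString calls it.
    template<class T>
    int Bit_Value(T value, uint8_t index) {
      // Shift the value instead of the mask: 1 << index overflows int at index 31.
      return ((static_cast<unsigned long long>(value) >> index) & 1ULL) ? 1 : 0;
    }

    // T must be an integer type.
    template<class T>
    std::string PrintIntAsBinaryString(T v) {
      std::stringstream stream;
      int i = sizeof(T) * 8 - 1;  // start from the most significant bit
      while (i >= 0) {
        stream << Bit_Value(v, i);
        --i;
      }
      return stream.str();
    }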

The output is:

11100100 10111000 10000000

which is exactly E4 B8 80.

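Here the leading byte E4 is stored at the lowest address, the start of the memory that char const * p points to. This looks like big-endian order, but it is not a property of the machine: UTF-8 is a sequence of single bytes whose order is fixed by the encoding itself, so the same bytes appear in the same order on the little-endian system used here. There is no BE or LE variant of UTF-8. Byte order only matters for encodings whose code units are wider than one byte, such as UTF-16 and UTF-32.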
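The difference between a byte sequence and a multi-byte integer can be checked directly. The following is a minimal sketch with hypothetical helper names (`HexBytes`, `Utf8BytesOfYi`, `Utf16UnitBytes`), none of which come from the article: it dumps the UTF-8 bytes of '一' and, for contrast, the in-memory bytes of the 16-bit value 0x4E00.

```cpp
#include <stdint.h>
#include <cstddef>
#include <string>

// Hypothetical helper: hex dump of n raw bytes, in memory order.
std::string HexBytes(const void* data, std::size_t n) {
  static const char kDigits[] = "0123456789ABCDEF";
  const unsigned char* bytes = static_cast<const unsigned char*>(data);
  std::string out;
  for (std::size_t i = 0; i < n; ++i) {
    if (i != 0) out += ' ';
    out += kDigits[bytes[i] >> 4];
    out += kDigits[bytes[i] & 0x0F];
  }
  return out;
}

// The UTF-8 bytes of U+4E00, written with escapes so the result does not
// depend on the source file encoding. Same order on every machine.
std::string Utf8BytesOfYi() {
  const char utf8[] = "\xE4\xB8\x80";
  return HexBytes(utf8, 3);
}

// The in-memory bytes of one 16-bit code unit. This order DOES depend on
// the machine: "00 4E" on little-endian, "4E 00" on big-endian.
std::string Utf16UnitBytes(uint16_t unit) {
  return HexBytes(&unit, sizeof unit);
}
```

On the little-endian test machine described above, `Utf16UnitBytes(0x4E00)` would give `00 4E`, while `Utf8BytesOfYi()` gives `E4 B8 80` regardless of the platform.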

Next, let's convert the UTF-8 bytes back into the Unicode value, that is, the code point. Note how the code_point type is defined:

    typedef uint32_t code_point;

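Now modify the test code to use the boost::locale library. The decoding algorithm could be written by hand, but time is short, so a mature library is used for now.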

    #include "test.h"
    #include "util/endian.h"
    #include "util/utf.h"
    #include <iostream>
    #include <boost/locale/utf.hpp>
    using namespace std;
    using namespace boost::locale::utf;

    int main(int argc, char ** argv) {
      // TEST(3 > 2);
      char const * p = "一";
      cout << PrintStringAsBinaryString(p) << endl;
      string str = "一";
      cout << PrintStringAsBinaryString(str) << endl;
      code_point c = utf_traits<char, sizeof(char)>::decode(p, p + 3);
      cout << "code point: 0x" << std::hex << c
           << " binary format:B" << PrintIntAsBinaryString(c) << endl;
    }

The second-to-last statement calls decode to do the decoding.

The result is:

code point: 0x4e00 binary format:B00000000000000000100111000000000

Exactly as expected. A string::iterator can also be passed as the argument, but note that str.begin() cannot be passed directly. Write it like this instead:

    string::iterator itor = str.begin();
    utf_traits<char, sizeof(char)>::decode(itor, str.end());

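This is because decode takes its first parameter by non-const reference and advances it internally with ++. str.begin() returns a temporary, and a temporary cannot bind to a non-const reference, so passing it directly fails to compile.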
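The same calling convention can be illustrated with a hand-written decoder. The sketch below is a simplified stand-in, not Boost's implementation: `DecodeOne`, `DecodeAll` and the `kInvalid` sentinel are hypothetical names, and overlong forms and surrogates are not rejected. It takes the iterator by reference in the same way, which is what lets a loop walk through a whole string one code point at a time.

```cpp
#include <stdint.h>
#include <string>
#include <vector>

// Hypothetical sentinel for malformed input (an assumption of this sketch).
const uint32_t kInvalid = 0xFFFFFFFFu;

// Decode one code point starting at `it` and advance `it` past it.
// `it` is a non-const reference, so the caller must pass a named iterator.
template <class Iterator>
uint32_t DecodeOne(Iterator& it, Iterator end) {
  if (it == end) return kInvalid;
  unsigned char lead = static_cast<unsigned char>(*it);
  ++it;
  uint32_t cp = 0;
  int trail = 0;  // number of continuation bytes still expected
  if (lead < 0x80) {
    return lead;                                   // 0xxxxxxx
  } else if ((lead & 0xE0) == 0xC0) {
    cp = lead & 0x1F; trail = 1;                   // 110xxxxx
  } else if ((lead & 0xF0) == 0xE0) {
    cp = lead & 0x0F; trail = 2;                   // 1110xxxx
  } else if ((lead & 0xF8) == 0xF0) {
    cp = lead & 0x07; trail = 3;                   // 11110xxx
  } else {
    return kInvalid;                               // stray continuation byte
  }
  for (; trail > 0; --trail) {
    if (it == end) return kInvalid;                // truncated sequence
    unsigned char c = static_cast<unsigned char>(*it);
    if ((c & 0xC0) != 0x80) return kInvalid;       // not 10xxxxxx
    ++it;
    cp = (cp << 6) | (c & 0x3F);
  }
  return cp;
}

// Decode a whole string by repeatedly calling DecodeOne.
std::vector<uint32_t> DecodeAll(const std::string& s) {
  std::vector<uint32_t> out;
  std::string::const_iterator it = s.begin();  // named lvalue, binds to Iterator&
  while (it != s.end()) {
    out.push_back(DecodeOne(it, s.end()));
  }
  return out;
}
```

Because `DecodeOne` leaves the iterator just past the bytes it consumed, `DecodeAll` needs no separate length calculation. Writing `DecodeOne(s.begin(), s.end())` would be rejected by the compiler for the reason given above.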