UTF-8 Decoding


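For the UTF-8 encoding rules, see my article: http://blog.csdn.net/sheismylife/article/details/8570015

In another article of mine, "UTF-8 Encoding in Practice" (http://blog.csdn.net/sheismylife/article/details/8571726), I used code from the boost::locale library to decode UTF-8. Let's now look closely at the decoding algorithm.

How do we tell a leading byte from a continuation byte? The key is that every continuation byte starts with the bits 10. The following function checks whether a byte is a continuation byte: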

    bool is_trail(char ci) {
        unsigned char c = ci;
        return (c & 0xC0) == 0x80;
    }

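0xC0 is 1100 0000 in binary, so ANDing it with c clears the low six bits and keeps only the top two bits of c.

0x80 is 1000 0000 in binary. If the two values are equal, the top two bits of c are 10, so c is a continuation byte and the function returns true.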
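To make the mask concrete, here is a minimal sketch that applies the same test to the three bytes of the euro sign (E2 82 AC) and to an ASCII byte. The name is_continuation is my own stand-in for is_trail, taking an unsigned byte directly; it is not part of Boost.

```cpp
#include <cstdint>

// Hypothetical stand-in for is_trail: keep the top two bits and
// compare them with the 10xxxxxx pattern.
constexpr bool is_continuation(std::uint8_t c) {
    return (c & 0xC0) == 0x80;
}

// The UTF-8 bytes of U+20AC (the euro sign) are E2 82 AC.
static_assert(!is_continuation(0xE2), "0xE2 = 1110 0010 is a leading byte");
static_assert(is_continuation(0x82), "0x82 = 1000 0010 is a continuation byte");
static_assert(is_continuation(0xAC), "0xAC = 1010 1100 is a continuation byte");
static_assert(!is_continuation(0x41), "0x41 = 0100 0001 ('A') is plain ASCII");
```

Because the checks are static_assert, the snippet verifies itself at compile time.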

With this function, checking whether a byte is a leading byte is just as simple:

    bool is_lead(char ci) {
        return !is_trail(ci);
    }

Next comes trail_length, which inspects a leading byte to determine how many continuation bytes follow it.

    int trail_length(char ci) {
        unsigned char c = ci;
        if(c < 128)
            return 0;
        if(BOOST_LOCALE_UNLIKELY(c < 194))
            return -1;
        if(c < 224)
            return 1;
        if(c < 240)
            return 2;
        if(BOOST_LOCALE_LIKELY(c <=244))
            return 3;
        return -1;
    }

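If c < 128, the byte is an ASCII character encoded in a single byte, so there are 0 continuation bytes and the UTF-8 sequence is 1 byte long.

Bytes in [128, 194) are rejected with -1: 128 to 191 have the form 10xxxxxx and are continuation bytes, while 192 and 193 (0xC0 and 0xC1) could only start an overlong two-byte encoding of an ASCII character. Since 11011111 is 223, any c in [194, 224) has 1 continuation byte, for a total length of 2 bytes.

By the same reasoning, 11101111 is 239, so a c in [224, 240) has 2 continuation bytes, for a total length of 3 bytes.

11110111 is 247, so by the bit pattern alone a c in [240, 248) would have 3 continuation bytes, for a total length of 4 bytes. The code, however, only accepts [240, 244]. The reason is that Unicode ends at U+10FFFF, and RFC 3629 limits UTF-8 accordingly: a leading byte of 245 (0xF5) or above can only produce code points beyond that limit.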
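One way to see where the upper bound of 244 comes from is to work out the code points at the boundary. The sketch below restates the classification without the Boost hint macros and checks the edges; trail_length_sketch, kLastCodePoint and kFirstWithF5 are names I made up for illustration.

```cpp
// Standalone restatement of the classification, without the Boost hint macros.
constexpr int trail_length_sketch(unsigned char c) {
    return c < 0x80 ? 0    // 0xxxxxxx: ASCII
         : c < 0xC2 ? -1   // 10xxxxxx continuation bytes, plus 0xC0/0xC1 (overlong)
         : c < 0xE0 ? 1    // 110xxxxx
         : c < 0xF0 ? 2    // 1110xxxx
         : c <= 0xF4 ? 3   // 11110xxx, capped at 0xF4
         : -1;
}

// U+10FFFF, the last Unicode code point, is encoded as F4 8F BF BF.
constexpr unsigned long kLastCodePoint =
    ((0xF4ul & 0x07) << 18) | ((0x8Ful & 0x3F) << 12) |
    ((0xBFul & 0x3F) << 6) | (0xBFul & 0x3F);
static_assert(kLastCodePoint == 0x10FFFF, "0xF4 is the last useful leading byte");

// The smallest value a 0xF5 lead could carry (F5 80 80 80) is already too large.
constexpr unsigned long kFirstWithF5 = (0xF5ul & 0x07) << 18;
static_assert(kFirstWithF5 == 0x140000, "beyond U+10FFFF");

static_assert(trail_length_sketch(0xC1) == -1, "overlong lead rejected");
static_assert(trail_length_sketch(0xC2) == 1, "first valid two-byte lead");
static_assert(trail_length_sketch(0xF4) == 3, "last valid four-byte lead");
static_assert(trail_length_sketch(0xF5) == -1, "out of Unicode range");
```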

Now let's look at the code in utf.hpp: http://www.boost.org/doc/libs/1_53_0/libs/locale/doc/html/utf_8hpp_source.html

00192         template<typename Iterator>
00193         static code_point decode(Iterator &p,Iterator e)
00194         {
00195             if(BOOST_LOCALE_UNLIKELY(p==e))
00196                 return incomplete;
00197
00198             unsigned char lead = *p++;
00199
00200             // First byte is fully validated here
00201             int trail_size = trail_length(lead);
00202
00203             if(BOOST_LOCALE_UNLIKELY(trail_size < 0))
00204                 return illegal;
00205
00206             //
00207             // Ok as only ASCII may be of size = 0
00208             // also optimize for ASCII text
00209             //
00210             if(trail_size == 0)
00211                 return lead;
00212
00213             code_point c = lead & ((1<<(6-trail_size))-1);
00214
00215             // Read the rest
00216             unsigned char tmp;
00217             switch(trail_size) {
00218             case 3:
00219                 if(BOOST_LOCALE_UNLIKELY(p==e))
00220                     return incomplete;
00221                 tmp = *p++;
00222                 if (!is_trail(tmp))
00223                     return illegal;
00224                 c = (c << 6) | ( tmp & 0x3F);
00225             case 2:
00226                 if(BOOST_LOCALE_UNLIKELY(p==e))
00227                     return incomplete;
00228                 tmp = *p++;
00229                 if (!is_trail(tmp))
00230                     return illegal;
00231                 c = (c << 6) | ( tmp & 0x3F);
00232             case 1:
00233                 if(BOOST_LOCALE_UNLIKELY(p==e))
00234                     return incomplete;
00235                 tmp = *p++;
00236                 if (!is_trail(tmp))
00237                     return illegal;
00238                 c = (c << 6) | ( tmp & 0x3F);
00239             }
00240
00241             // Check code point validity: no surrogates and
00242             // valid range
00243             if(BOOST_LOCALE_UNLIKELY(!is_valid_codepoint(c)))
00244                 return illegal;
00245
00246             // make sure it is the most compact representation
00247             if(BOOST_LOCALE_UNLIKELY(width(c)!=trail_size + 1))
00248                 return illegal;
00249
00250             return c;
00251
00252         }

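The decode function parses one UTF-8 sequence (1 to 4 bytes) from the string and returns the corresponding code point as a uint32_t integer.

It always assumes the first byte is a leading byte and calls trail_length to get the number of continuation bytes. If that number is 0, the byte is an ASCII character and is returned directly.

Lines 213 to 250 handle non-ASCII characters. Note how switch/case is used: there are no break statements, so if trail_size is 3, the statements under case 3: run first, and execution then falls through into case 2 and case 1 in turn, without matching their labels. This is the fall-through behavior of switch/case. See: http://msdn.microsoft.com/en-us/library/k0t5wee3.aspx

A loop would arguably express this more clearly, and I don't know why Artyom chose this style. Better performance, perhaps? One thing is certain: a code-scanning tool would see the three blocks as duplicated code and raise a warning. :)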
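For comparison, here is a rough sketch of what the loop form might look like. It is my own rewrite, not Boost code: read_trail_bytes is a hypothetical helper, and kIllegal and kIncomplete are placeholder sentinels standing in for the library's illegal and incomplete values.

```cpp
#include <cstdint>

// Placeholder sentinels for this sketch only.
const std::uint32_t kIllegal = 0xFFFFFFFFu;
const std::uint32_t kIncomplete = 0xFFFFFFFEu;

// Reads trail_size continuation bytes from [p, e) and merges their low
// 6 bits into c, which already holds the payload of the leading byte.
template <typename Iterator>
std::uint32_t read_trail_bytes(Iterator& p, Iterator e,
                               std::uint32_t c, int trail_size) {
    for (int i = 0; i < trail_size; ++i) {
        if (p == e)
            return kIncomplete;            // input ended mid-sequence
        unsigned char tmp = static_cast<unsigned char>(*p++);
        if ((tmp & 0xC0) != 0x80)
            return kIllegal;               // not a 10xxxxxx byte
        c = (c << 6) | (tmp & 0x3F);       // append the 6 payload bits
    }
    return c;
}
```

The loop does the same work as the three case blocks; whether the unrolled switch is measurably faster is something I have not tested.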

Now recall the wiki example quoted in my earlier article on UTF-8 encoding:

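Here is an example from the wiki showing how to encode the character € in UTF-8:
Step 1: Look up the Unicode code point of €, which is U+20AC.
Step 2: U+20AC lies between U+0800 and U+FFFF, so it takes three bytes.
Step 3: U+20AC in binary is 10000010101100, 14 bits long. A three-byte encoding carries 16 bits, so pad two zeros on the high end to get 16 bits (2 bytes): 0010000010101100. I call this the value string below.
Step 4: By the rules, the leading byte starts with 1110, which leaves 4 bits to fill. Take the top 4 bits of the value string: the leading byte becomes 11100010 and the value string becomes 000010101100.
Step 5: The first continuation byte starts with 10, which leaves 6 bits to fill. Take the top 6 bits of the value string: the first continuation byte is 10000010 and the value string becomes 101100.
Step 6: The second continuation byte also starts with 10, which leaves 6 bits to fill. Take the remaining 6 bits: the second continuation byte is 10101100.
The resulting three bytes are 11100010 10000010 10101100, or 0xE282AC in hexadecimal.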
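The steps above can be sketched in a few lines of C++. This is only an illustration for the three-byte case; encode_three_bytes is a hypothetical helper, and it skips range and surrogate checks.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: encodes a code point in [U+0800, U+FFFF] as three
// bytes. No validation is done, to keep the sketch short.
std::array<std::uint8_t, 3> encode_three_bytes(std::uint32_t cp) {
    return {{
        static_cast<std::uint8_t>(0xE0 | (cp >> 12)),          // 1110xxxx: top 4 bits
        static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)),  // 10xxxxxx: middle 6 bits
        static_cast<std::uint8_t>(0x80 | (cp & 0x3F))          // 10xxxxxx: low 6 bits
    }};
}
```

Calling encode_three_bytes(0x20AC) yields the bytes E2, 82, AC, matching the result of step 6.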

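Decoding is the reverse of encoding: take the low 4 bits of the leading byte and the low 6 bits of each of the two continuation bytes, concatenate them into a 16-bit integer, and widen it to a uint32_t.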
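As a quick check of that description, the following sketch decodes one three-byte sequence by hand. decode_three_bytes is a hypothetical helper and assumes the input is already known to be well formed.

```cpp
#include <cstdint>

// Hypothetical helper: decodes a well-formed three-byte sequence.
std::uint32_t decode_three_bytes(std::uint8_t lead,
                                 std::uint8_t t1, std::uint8_t t2) {
    std::uint32_t c = lead & 0x0F;   // low 4 bits of 1110xxxx
    c = (c << 6) | (t1 & 0x3F);      // low 6 bits of the first continuation byte
    c = (c << 6) | (t2 & 0x3F);      // low 6 bits of the second continuation byte
    return c;
}
```

For the bytes E2 82 AC, the intermediate values of c are 0x02, 0x82 and finally 0x20AC.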

In this example, the following line extracts exactly those low four bits:

code_point c = lead & ((1<<(6-trail_size))-1);

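This is a neat trick. It can be generalized into a function that takes two parameters: x, the value to extract bits from, and n, the number of low bits to extract.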

    #include <cstdint>

    // Returns the low n bits of x; meaningful for n in [0, 8].
    uint8_t GetLowNBit(uint8_t x, uint8_t n) {
        return x & ((1<<n)-1);
    }

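Why 6-trail_size? I arrived at it purely by observing the pattern, but it also follows from the byte layout: a leading byte with n continuation bytes starts with n+1 one bits followed by a zero bit, which leaves 8-(n+2) = 6-n bits for the value.

In a 2-byte UTF-8 sequence the leading byte starts with 110, so the low 5 bits carry the value; 6-trail_size = 6-1 = 5, exactly the 5 bits needed.

In a 3-byte UTF-8 sequence the leading byte starts with 1110, so the low 4 bits carry the value; 6-trail_size = 6-2 = 4.

In a 4-byte UTF-8 sequence the leading byte starts with 11110, so the low 3 bits carry the value; 6-trail_size = 6-3 = 3.

Hence the 6. Artyom has a sharp eye.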
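The three cases can be checked mechanically. In this small sketch, lead_payload_mask is a name I introduced for the expression (1<<(6-trail_size))-1; it is not a Boost function.

```cpp
// Hypothetical helper wrapping the mask expression from decode.
constexpr unsigned lead_payload_mask(int trail_size) {
    return (1u << (6 - trail_size)) - 1;
}

static_assert(lead_payload_mask(1) == 0x1F, "110xxxxx keeps the low 5 bits");
static_assert(lead_payload_mask(2) == 0x0F, "1110xxxx keeps the low 4 bits");
static_assert(lead_payload_mask(3) == 0x07, "11110xxx keeps the low 3 bits");

// Leading byte of the euro sign: 0xE2 = 1110 0010, payload 0010.
static_assert((0xE2 & lead_payload_mask(2)) == 0x02, "payload of 0xE2");
```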

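With the low bits of the leading byte extracted, the next step is to extract the low bits of the continuation bytes, which is what the switch/case does.

This part is simple: tmp & 0x3F takes the low 6 bits, because a continuation byte always starts with 10. After each byte is read, c is shifted left by 6 bits and ORed with those 6 bits, which merges them into a new integer.

The break-less use of switch/case was explained above. In the € example the block runs twice, extracting and merging the low 6 bits of the two continuation bytes.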
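The two checks that follow the switch in the listing are not analyzed above. Going by their comments ("no surrogates and valid range", "most compact representation"), they plausibly behave like the sketch below. The function names and bodies are my own assumptions, not copied from Boost.

```cpp
#include <cstdint>

// Assumed behavior of is_valid_codepoint: inside the Unicode range
// and not a UTF-16 surrogate.
constexpr bool valid_codepoint_sketch(std::uint32_t v) {
    return v <= 0x10FFFF && !(v >= 0xD800 && v <= 0xDFFF);
}

// Assumed behavior of width: the shortest UTF-8 length for a code point.
constexpr int width_sketch(std::uint32_t v) {
    return v <= 0x7F ? 1 : v <= 0x7FF ? 2 : v <= 0xFFFF ? 3 : 4;
}

// E2 82 AC decodes to U+20AC with trail_size 2, and 3 == 2 + 1: accepted.
static_assert(width_sketch(0x20AC) == 3, "shortest form of the euro sign");

// E0 80 80 decodes to 0 with trail_size 2, but 1 != 2 + 1: overlong, rejected.
static_assert(width_sketch(0x00) == 1, "U+0000 needs only one byte");

// ED A0 80 decodes to 0xD800, a surrogate: rejected.
static_assert(!valid_codepoint_sketch(0xD800), "surrogates are not valid");
```

Without the width comparison, overlong sequences such as E0 80 80 would decode to the same code point as their shortest form.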

That completes the analysis of the UTF-8 decoding algorithm.