Encoding Non-English Characters in Web Pages and URLs

Source: Internet | Published by: 全天88星座软件 | Editor: 程序博客网 | Time: 2024/05/06 17:30

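In HTML and URLs, characters whose code is above 0x7f (that is, outside the ASCII range) need to be encoded. There are two main prefix-based escape formats, "\u" and "&#"; in both, what follows the prefix is the character's Unicode code.
"Unicode Escape Formats" gives a fairly comprehensive overview of the various formats: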
http://billposer.org/Software/ListOfRepresentations.html

The "&#"-prefixed form belongs to the NCR (numeric character reference) specification; see:
https://en.wikipedia.org/wiki/Numeric_character_reference
https://en.wikipedia.org/wiki/Character_encodings_in_HTML#HTML_character_references
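References that start with "&#x" are the hexadecimal form of NCR. The hex letters are usually written in lowercase, but upper and lower case can be mixed.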
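To make the decimal/hexadecimal distinction concrete, here is a minimal sketch of a parser for a single NCR. The function name parse_ncr is hypothetical (it is not part of any library mentioned here), and it only handles one reference at a time; it accepts hex digits in either case, as described above.

```c
#include <string.h>

/* Parse one numeric character reference such as "&#20013;" or "&#x4E2d;".
   Returns the Unicode code point, or -1 if s is not a well-formed NCR.
   parse_ncr is a hypothetical helper written for illustration. */
long parse_ncr(const char *s)
{
    if (strncmp(s, "&#", 2) != 0)
        return -1;
    s += 2;

    int base = 10;
    if (*s == 'x' || *s == 'X') {   /* "&#x" selects the hexadecimal form */
        base = 16;
        s++;
    }

    long cp = 0;
    int ndigits = 0;
    for (; *s != ';'; s++, ndigits++) {
        int d;
        if (*s >= '0' && *s <= '9')
            d = *s - '0';
        else if (base == 16 && *s >= 'a' && *s <= 'f')
            d = *s - 'a' + 10;      /* lowercase hex digit */
        else if (base == 16 && *s >= 'A' && *s <= 'F')
            d = *s - 'A' + 10;      /* uppercase hex digit */
        else
            return -1;              /* bad digit, or string ended before ';' */
        cp = cp * base + d;
        if (cp > 0x10FFFF)
            return -1;              /* beyond the Unicode range */
    }
    return ndigits > 0 ? cp : -1;   /* "&#;" and "&#x;" are rejected */
}
```

With this sketch, "&#20013;" and "&#x4E2d;" both yield 0x4e2d (the character 中), while "&#4e2d;" is rejected because hex letters are not allowed in the decimal form.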

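Below is an implementation I wrote. Combined with the iconv function on Linux, it makes it easy to convert Unicode text into the various web escape formats.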

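#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Target formats selectable in main(). */
enum {
    CHARSET_UNICODE_ASCII_ESC,
    CHARSET_UNICODE_ASCII_ALIGNED,
    CHARSET_UNICODE_NCR_DEC,
    CHARSET_UNICODE_NCR_HEX
};

/* Copy ASCII code units as-is and escape everything above 0x7f with fmt.
   src holds 2-byte Unicode code units in host byte order.
   Returns the number of bytes written to dst (dst is not NUL-terminated). */
int uni2ascii(const char* fmt, const char* src, const int srclen, char* dst, const int dstsize)
{
    int i = 0, j = 0;
    assert(srclen % 2 == 0); // unicode must be 2 bytes aligned.
    while (i < srclen && j < dstsize)
    {
        unsigned short c;
        memcpy(&c, src + i, sizeof c); // safe even if src is not 2-byte aligned in memory
        if (c <= 0x7f) // 0x7f itself is still ASCII
        {
            dst[j] = (char)c;
            j++;
        }
        else
        {
            int n = snprintf(dst + j, dstsize - j, fmt, (unsigned)c);
            if (n < 0 || n >= dstsize - j)
                break; // escape does not fit; stop instead of running past dst
            j += n;
        }
        i += 2;
    }
    return j;
}

int main()
{
    // convert src to unicode first with iconv in linux. by littlefang
    // The sample input, target format and buffer size below are placeholders for illustration.
    const unsigned short src_unicode[] = { 0x41, 0x4e2d }; // "A中"
    int to_charset = CHARSET_UNICODE_NCR_HEX;
    const char* fmt;
    char dst[64];

    switch (to_charset) {
        case CHARSET_UNICODE_ASCII_ESC:
            fmt = "\\u%x;";
            break;
        case CHARSET_UNICODE_ASCII_ALIGNED:
            fmt = "\\u%04x";
            break;
        case CHARSET_UNICODE_NCR_DEC:
            fmt = "&#%u;";
            break;
        case CHARSET_UNICODE_NCR_HEX:
            fmt = "&#x%x;";
            break;
        default:
            return 1; // unknown target format
    }
    int n = uni2ascii(fmt, (const char*)src_unicode, sizeof src_unicode, dst, sizeof dst);
    printf("%.*s\n", n, dst);
    return 0;
}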
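The iconv step mentioned above could look roughly like the following sketch. It assumes the input is UTF-8 and that the host is little-endian, so that "UTF-16LE" output matches the 2-byte units the routine above reads; the helper name utf8_to_utf16le is hypothetical. Only the standard iconv_open, iconv and iconv_close calls are used.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert a NUL-terminated UTF-8 string to UTF-16LE bytes using iconv.
   Returns the number of bytes written to dst, or -1 on failure.
   utf8_to_utf16le is a hypothetical helper written for illustration. */
long utf8_to_utf16le(const char *utf8, char *dst, size_t dstsize)
{
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");   /* (tocode, fromcode) */
    if (cd == (iconv_t)-1)
        return -1;                                  /* conversion not supported */

    char *in = (char *)utf8;      /* iconv takes char **; the input is only read */
    size_t inleft = strlen(utf8);
    char *out = dst;
    size_t outleft = dstsize;

    size_t rc = iconv(cd, &in, &inleft, &out, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;                /* invalid input sequence or dst too small */

    return (long)(dstsize - outleft);
}
```

The returned byte count and dst can then be passed as the source length and source buffer of the escaping routine. Characters outside the Basic Multilingual Plane come out of iconv as surrogate pairs, which a 2-byte-per-character escaper would emit as two separate escapes.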

A fairly handy online conversion tool:
http://tool.chinaz.com/tools/unicode.aspx
Its Chinese-to-UNICODE tool produces "\u"-prefixed codes; its Chinese-to-UTF-8 tool produces "&#x"-prefixed codes, i.e. hexadecimal NCRs.

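PHP, Python, JS and similar languages have plenty of conversion tools. For C, the most usable offline converter at the moment is uni2ascii, which can be downloaded from: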
http://billposer.org/Software/uni2ascii.html
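URL encoding is simpler: the character is first converted to bytes, for Chinese text usually in either UTF-8 or GB2312, and those bytes are then escaped. The article below explains it thoroughly, so I will not repeat it here.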
http://renmin.cnblogs.com/archive/2005/10/14/254773.html
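As a small illustration of why the byte encoding matters for URLs, here is a sketch of percent-encoding in C. The function name percent_encode is hypothetical, and the choice to leave only the RFC 3986 unreserved characters (letters, digits, "-", ".", "_", "~") unescaped is an assumption; real browsers and servers differ in which characters they leave alone.

```c
#include <stddef.h>
#include <stdio.h>

/* Percent-encode the bytes of a NUL-terminated string into dst.
   Unreserved characters are copied; every other byte becomes %XX.
   Returns the length of the result, or -1 if dst is too small.
   percent_encode is a hypothetical helper written for illustration. */
int percent_encode(const char *src, char *dst, size_t dstsize)
{
    size_t j = 0;
    for (const unsigned char *p = (const unsigned char *)src; *p; p++) {
        int unreserved = (*p >= 'A' && *p <= 'Z') || (*p >= 'a' && *p <= 'z') ||
                         (*p >= '0' && *p <= '9') ||
                         *p == '-' || *p == '.' || *p == '_' || *p == '~';
        size_t need = unreserved ? 1 : 3;
        if (j + need >= dstsize)
            return -1;                       /* keep one byte for the NUL */
        if (unreserved) {
            dst[j++] = (char)*p;
        } else {
            snprintf(dst + j, dstsize - j, "%%%02X", (unsigned)*p);
            j += 3;
        }
    }
    if (dstsize == 0)
        return -1;
    dst[j] = '\0';
    return (int)j;
}
```

The function works on raw bytes, so the same character gives different results depending on how it was encoded beforehand: 中 is the bytes E4 B8 AD in UTF-8 and becomes "%E4%B8%AD", while in GB2312 it is D6 D0 and becomes "%D6%D0".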

0 0