Notes on Character-Handling Programming under Linux

 
String functions such as strcpy and strlen know nothing about wide characters, so they mishandle text in encodings like UTF-16BE.
For example, the string "你好阿.jpg":
In UTF-16BE:
                            4f60 597d 963f 002e 006a 0070 0067
In UTF-8:
                            e4bda0 e5a5bd e998bf   2e 6a 70 67
 
If strcpy or strlen is applied to the UTF-16BE form of this string, it gets truncated, because these functions expect a '\0'-terminated string: they stop at the first 0x00 byte (the high byte of '.'), so only "你好阿" survives.
 
Many functions on Linux share this problem. Here, strcpy should be replaced by memcpy with an explicit byte count; strlen cannot be used at all.
 
 
The command-line tool for charset conversion on Linux is iconv:
# iconv --list
lists all the character sets it supports.
# iconv -f UTF16BE -t UTF-8 file
converts the file file from UTF-16BE to UTF-8.
 
Examples of charset handling with the C library:
Example 1
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
 
int main()
{
    iconv_t cd;
    cd = iconv_open("UTF16BE", "UTF-8"); /* convert from UTF-8 to UTF-16BE */

    char *In_1, *Out_1;

    In_1 = (char *)malloc(128 * sizeof(char));
    Out_1 = (char *)malloc(128 * sizeof(char));
    memset(Out_1, 0, 128);

    strcpy(In_1, "你好阿.jpg");

    size_t inlen = strlen(In_1);
    size_t outlen = 128;
    printf("inlen=%zu, In=%s\n", inlen, In_1);
    char *out_1 = Out_1;  /* remember the start of the output buffer */

    size_t n = iconv(cd, &In_1, &inlen, &Out_1, &outlen);

    printf("n=%zu, in=%s, inlen=%zu, out=%s, outlen=%zu\n", n, In_1, inlen, out_1, outlen);

    int i;
    for (i = 0; i < 128; i++) {
        printf("%c", out_1[i]);
    }

    iconv_close(cd);
    return 0;
}
 
 
Running this program gives:
[root@joy test]# ./conv
inlen=13, In=你好阿.jpg
n=0, in=, inlen=0, out=O`Y}�?, outlen=114
O`Y}�?.jpg
 
 
As the output shows, the out=%s field prints only the leading Chinese part of the wide-character string; ".jpg" is dropped because its UTF-16BE bytes begin with 0x00, which printf("%s") treats as the string terminator. A wide-character string must instead be printed with an explicit length, as the byte-by-byte loop at the end of the program does.
 
Example 2
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
    iconv_t cd, ef;
    size_t outlen = 128;
    size_t n;
    cd = iconv_open("UTF16BE", "UTF-8");
    char *In_1, *Out_1;
    In_1 = (char *)malloc(128 * sizeof(char));
    Out_1 = (char *)malloc(128 * sizeof(char));
    memset(Out_1, 0, 128);
    strcpy(In_1, "你好阿.jpg");

    size_t inlen = strlen(In_1);
    printf("inlen=%zu, In=%s\n", inlen, In_1);
    char *out_1 = Out_1;

    n = iconv(cd, &In_1, &inlen, &Out_1, &outlen);
    printf("n=%zu, in=%s, inlen=%zu, out=%s, outlen=%zu\n", n, In_1, inlen, out_1, outlen);

    int i;
    for (i = 0; i < 128; i++) {
        printf("%c", out_1[i]);
    }

    iconv_close(cd);

    /* Now convert back: UTF-16BE -> UTF-8 */
    char *In_2, *Out_2, *out_2;
    In_2 = (char *)malloc(128 * sizeof(char));
    Out_2 = (char *)malloc(128 * sizeof(char));
    ef = iconv_open("UTF-8", "UTF16BE");
    memcpy(In_2, out_1, 128);  /* copy with an explicit length, not strcpy */
    Out_2 = out_1;             /* reuse the first output buffer */
    memset(Out_2, '\0', 128);
    inlen = 128;
    outlen = 128;

    out_2 = Out_2;
    n = iconv(ef, &In_2, &inlen, &Out_2, &outlen);
    printf("\nn=%zu, in=%s, inlen=%zu, out=%s, outlen=%zu\n", n, In_2, inlen, out_2, outlen);

    iconv_close(ef);

    return 0;
}
Run:
[root@joy test]# ./conv
inlen=13, In=你好阿.jpg
n=0, in=, inlen=0, out=O`Y}�?, outlen=114
O`Y}�?.jpg
n=0, in=, inlen=0, out=你好阿.jpg, outlen=58
[root@joy test]# ./conv >a
[root@joy test]# xxd a
0000000: 696e 6c65 6e3d 3133 2c20 496e 3de4 bda0 inlen=13, In=...
0000010: e5a5 bde9 98bf 2e6a 7067 0a6e 3d30 2c20 .......jpg.n=0,
0000020: 696e 3d2c 2069 6e6c 656e 3d30 2c20 6f75 in=, inlen=0, ou
0000030: 743d 4f60 597d 963f 2c20 6f75 746c 656e t=O`Y}.?, outlen
0000040: 3d31 3134 0a4f 6059 7d96 3f00 2e00 6a00 =114.O`Y}.?...j.
0000050: 7000 6700 0000 0000 0000 0000 0000 0000 p.g.............
0000060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000070: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000080: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000090: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000c0: 0000 0000 000a 6e3d 302c 2069 6e3d 2c20 ......n=0, in=,
00000d0: 696e 6c65 6e3d 302c 206f 7574 3de4 bda0 inlen=0, out=...
00000e0: e5a5 bde9 98bf 2e6a 7067 2c20 6f75 746c .......jpg, outl
00000f0: 656e 3d35 380a                           en=58.
 
 
 
Example 3
#include <stdio.h>
 
int main(int argc, char *argv[])
{
    FILE *f;

    if (argc < 2 || (f = fopen(argv[1], "r")) == NULL) {
        fprintf(stderr, "usage: read <file>\n");
        return 1;
    }

    int c;
    int n = 0;
    while ((c = fgetc(f)) != EOF) {
        n++;
        printf("%c", c);
    }

    fclose(f);
    printf("n=%d\n", n);
    return 0;
}
 
[root@joy getc]# cat text
你好
我叫joy
 
[root@joy getc]# ./read text
你好
我叫joy
n=17
 
So printf("%c", c) simply writes each byte through unchanged, and the UTF-8 terminal reassembles the multibyte sequences into characters. Note that n=17 counts bytes, not characters: 6 bytes for 你好, 6 for 我叫, 3 for joy, plus two newlines.
TODO
Read the underlying implementation of printf("%c", c).
   
 
further reading:
UTF-8 and Unicode FAQ for Unix/Linux:
http://www.cl.cam.ac.uk/~mgk25/unicode.html
 
Excerpt:
The proper way to activate UTF-8 is the POSIX locale mechanism. A locale is a configuration setting that contains information about culture-specific conventions of software behaviour, including the character encoding, the date/time notation, alphabetic sorting rules, the measurement system and common office paper size, etc. The names of locales usually consist of ISO 639-1 language and ISO 3166-1 country codes, sometimes with additional encoding names or other qualifiers.
You can get a list of all locales installed on your system (usually in /usr/lib/locale/) with the command locale -a. Set the environment variable LANG to the name of your preferred locale. When a C program executes the setlocale(LC_CTYPE, "") function, the library will test the environment variables LC_ALL, LC_CTYPE, and LANG in that order, and the first one of these that has a value will determine which locale data is loaded for the LC_CTYPE category (which controls the multibyte conversion functions). The locale data is split up into separate categories. For example, LC_CTYPE defines the character encoding and LC_COLLATE defines the string sorting order. The LANG environment variable is used to set the default locale for all categories, but the LC_* variables can be used to override individual categories. Do not worry too much about the country identifiers in the locales. Locales such as en_GB (English in Great Britain) and en_AU (English in Australia) differ usually only in the LC_MONETARY category (name of currency, rules for printing monetary amounts), which practically no Linux application ever uses. LC_CTYPE=en_GB and LC_CTYPE=en_AU have exactly the same effect.
You can query the name of the character encoding in your current locale with the command locale charmap.
[root@joy test]# locale charmap
UTF-8
This should say UTF-8 if you successfully picked a UTF-8 locale in the LC_CTYPE category. The command locale -m provides a list with the names of all installed character encodings.
If you use exclusively C library multibyte functions to do all the conversion between the external character encoding and the wchar_t encoding that you use internally, then the C library will take care of using the right encoding according to LC_CTYPE for you and your program does not even have to know explicitly what the current multibyte encoding is.
However, if you prefer not to do everything using the libc multi-byte functions (e.g., because you think this would require too many changes in your software or is not efficient enough), then your application has to find out for itself when to activate the UTF-8 mode. To do this, on any X/Open compliant systems, where <langinfo.h> is available, you can use a line such as
 utf8_mode = (strcmp(nl_langinfo(CODESET), "UTF-8") == 0);
in order to detect whether the current locale uses the UTF-8 encoding. You have of course to add a setlocale(LC_CTYPE, "") at the beginning of your application to set the locale according to the environment variables first. The standard function call nl_langinfo(CODESET) is also what locale charmap calls to find the name of the encoding specified by the current locale for you. It is available on pretty much every modern Unix now. FreeBSD added nl_langinfo(CODESET) support with version 4.6 (2002-06). If you need an autoconf test for the availability of nl_langinfo(CODESET), here is the one Bruno Haible suggested:
======================== m4/codeset.m4 ================================
#serial AM1
 
dnl From Bruno Haible.
 
AC_DEFUN([AM_LANGINFO_CODESET],
[
 AC_CACHE_CHECK([for nl_langinfo and CODESET], am_cv_langinfo_codeset,
    [AC_TRY_LINK([#include <langinfo.h>],
      [char* cs = nl_langinfo(CODESET);],
      am_cv_langinfo_codeset=yes,
      am_cv_langinfo_codeset=no)
    ])
 if test $am_cv_langinfo_codeset = yes; then
    AC_DEFINE(HAVE_LANGINFO_CODESET, 1,
      [Define if you have <langinfo.h> and nl_langinfo(CODESET).])
 fi
])
=======================================================================
[You could also try to query the locale environment variables yourself without using setlocale(). In the sequence LC_ALL, LC_CTYPE, LANG, look for the first of these environment variables that has a value. Make the UTF-8 mode the default (still overridable by command line switches) when this value contains the substring UTF-8, as this indicates reasonably reliably that the C library has been asked to use a UTF-8 locale. An example code fragment that does this is
 char *s;
 int utf8_mode = 0;
 
 if (((s = getenv("LC_ALL"))   && *s) ||
      ((s = getenv("LC_CTYPE")) && *s) ||
      ((s = getenv("LANG"))     && *s)) {
    if (strstr(s, "UTF-8"))
      utf8_mode = 1;
 }
 
This relies of course on all UTF-8 locales having the name of the encoding in their name, which is not always the case, therefore the nl_langinfo() query is clearly the better method. If you are really concerned that calling nl_langinfo() might not be portable enough, there is also Markus Kuhn’s portable public domain nl_langinfo(CODESET) emulator for systems that do not have the real thing (and another one from Bruno Haible), and you can use the norm_charmap() function to standardize the output of the nl_langinfo(CODESET) on different platforms.]
 
Are there free libraries for dealing with Unicode available?
·       Ulrich Drepper’s GNU C library glibc has featured since version 2.2 full multi-byte locale support for UTF-8, an ISO 14651 sorting order algorithm, and it can recode into many other encodings. All current Linux distributions come with glibc 2.2 or newer, so you definitely should upgrade now if you are still using an earlier Linux C library.
·       The International Components for Unicode (ICU) (formerly IBM Classes for Unicode) have become what is probably the most powerful cross-platform standard library for more advanced Unicode character processing functions.
·       X.Net’s xIUA is a package designed to retrofit existing code for ICU support by providing locale management so that users do not have to modify internal calling interfaces to pass locale parameters. It uses more familiar APIs, for example to collate you use xiua_strcoll, and is thread safe.
·       Mark Leisher’s UCData Unicode character property and bidi library as well as his wchar_t support test code.
·       Bruno Haible’s libiconv character-set conversion library provides an iconv() implementation, for use on systems which do not have one, or whose implementation cannot convert from/to Unicode.
It also contains the libcharset character-encoding query library that allows applications to determine in a highly portable way the character encoding of the current locale, avoiding the portability concerns of using nl_langinfo(CODESET) directly.
·       Bruno Haible’s libutf8 provides various functions for handling UTF-8 strings, especially for platforms that do not yet offer proper UTF-8 locales.
·       Tom Tromey’s libunicode library is part of the Gnome Desktop project, but can be built independently of Gnome. It contains various character class and conversion functions. (CVS)
·       FriBidi is Dov Grobgeld’s free implementation of the Unicode bidi algorithm.
·       Markus Kuhn’s free wcwidth() implementation can be used by applications on platforms where the C library does not yet provide an equivalent function to find, how many column positions a character or string will occupy on a UTF-8 terminal emulator screen.
·       Markus Kuhn’s transtab is a transliteration table for applications that have to make a best-effort conversion from Unicode to ASCII or some 8-bit character set. It contains a comprehensive list of substitution strings for Unicode characters, comparable to the fallback notations that people use commonly in email and on typewriters to represent unavailable characters. The table comes in ISO/IEC TR 14652 format, to allow simple inclusion into POSIX locale definition files.
 
 

 