gcc and the BOM

Source: Internet  Editor: 程序博客网  Date: 2024/06/05 05:23

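Today a process dumped core, and while debugging it I ran into two odd things:
1. In gdb, the line numbers it reported did not match the source code;
2. When I checked the svn log and opened the file, svn reported: cannot show diff because of inconsistent newlines in the file.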

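My first instinct was that the source file contained illegal characters, so I opened it in Notepad++.
The file was GB2312-encoded with UNIX-style LF line endings, but it also contained a few stray CR characters. I deleted them so the line endings were consistent.
The file also had Chinese comments in different encodings, so I decided to unify those too and use UTF-8 from now on. I converted the file with Encoding -> Convert to UTF-8.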
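The same cleanup can be scripted instead of done by hand in an editor. The following is a minimal Python sketch, not what Notepad++ does internally. The function names are made up for illustration, and it assumes the whole file decodes cleanly as GB2312 (a file with comments in mixed encodings, like the one described here, would raise UnicodeDecodeError and need manual repair first).

```python
def count_newlines(data: bytes) -> dict:
    """Count each newline style in raw file bytes."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "CR": data.count(b"\r") - crlf,  # lone CR, not part of a CRLF pair
        "LF": data.count(b"\n") - crlf,  # lone LF, not part of a CRLF pair
    }


def strip_cr(data: bytes) -> bytes:
    """Delete every CR byte, leaving LF-only line endings.

    Safe on raw GB2312 text: both bytes of a GB2312 double-byte
    character are >= 0xA1, so 0x0D never occurs inside a character.
    """
    return data.replace(b"\r", b"")


def reencode_to_utf8(data: bytes, src_encoding: str = "gb2312") -> bytes:
    """Re-encode as UTF-8; Python's "utf-8" codec does not write a BOM."""
    return data.decode(src_encoding).encode("utf-8")


# Hypothetical sample: one CRLF line and one LF line, GB2312-encoded.
sample = "int x; /* 注释 */\r\nint y;\n".encode("gb2312")
print(count_newlines(sample))             # {'CRLF': 1, 'CR': 0, 'LF': 1}
clean = reencode_to_utf8(strip_cr(sample))
print(clean.startswith(b"\xef\xbb\xbf"))  # False: no BOM was added
```

Counting first makes it easy to see whether a file really mixes newline styles before anything is rewritten.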

Compiling again then failed with the following errors:

tpn_pu_main.c:1: error: stray '\357' in program
tpn_pu_main.c:1: error: stray '\273' in program
tpn_pu_main.c:1: error: stray '\277' in program

Characters can be specified as \nnn, where nnn is the octal code of the character, so gcc is complaining about three bytes with octal values 357, 273 and 277 on line 1.

But none of these three characters was visible in the file, which made me think of the BOM marker. I looked up some background on the BOM:

Byte Order Marks are special characters at the beginning of a Unicode file to indicate whether it is big or little endian, in other words does the high or low order byte come first. These codes also tell whether the encoding is 8, 16 or 32 bit. You can recognise Unicode files by their starting byte order marks, and by the way Unicode-16 files are half zeroes and Unicode-32 files are three-quarters zeros.

A BOM can be thought of as the file's magic number. The common ones are:

EF BB BF     UTF-8 (UTF-8 needs no BOM to indicate byte order, but a BOM can be used to mark the encoding)
FF FE        UTF-16/UCS-2, little endian
FE FF        UTF-16/UCS-2, big endian
FF FE 00 00  UTF-32/UCS-4, little endian
00 00 FE FF  UTF-32/UCS-4, big endian
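Checking a file's leading bytes against these signatures is easy to script. Here is a minimal Python sketch using the BOM constants from the standard codecs module; detect_bom is a made-up helper name. One detail matters: the UTF-32 little-endian signature (FF FE 00 00) begins with the UTF-16 little-endian one (FF FE), so the longer signatures have to be tested first.

```python
import codecs

# Longer signatures first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "UTF-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),  # FF FE
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),  # FE FF
]


def detect_bom(head: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


print(detect_bom(b"\xef\xbb\xbf#include <stdio.h>\n"))  # UTF-8
print(detect_bom(b"#include <stdio.h>\n"))              # None
```

Reading only the first four bytes of a file (for example `open(path, "rb").read(4)`) is enough input for this check.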

So I worked out the octal values of the UTF-8 BOM bytes:
adtsh@adtsh-H110-4S:~$ echo "obase=8;ibase=16; EF; BB; BF" | bc
357
273
277
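This matched my guess exactly: the gcc I was using did not recognize the BOM, so compilation failed.
To double-check, I wrote a hello world program and saved it in the same UTF-8 format (with the BOM). Compiling it produced the same errors.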
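The hello world experiment can also be reproduced without an editor. This Python sketch builds a BOM-prefixed C source in memory and prints its first three bytes in the octal escape form used by the gcc messages. HELLO_C and both helper names are illustrative, and the sketch does not invoke gcc. Python's "utf-8-sig" codec is what prepends the BOM on encoding.

```python
HELLO_C = (
    "#include <stdio.h>\n"
    "\n"
    "int main(void)\n"
    "{\n"
    '    printf("hello world\\n");\n'
    "    return 0;\n"
    "}\n"
)


def with_bom(source: str) -> bytes:
    """Encode source as UTF-8 with a leading BOM (EF BB BF)."""
    return source.encode("utf-8-sig")


def leading_octal(data: bytes, n: int = 3) -> list:
    """Render the first n bytes as octal escapes, the way gcc reports them."""
    return ["\\" + format(b, "o") for b in data[:n]]


print(leading_octal(with_bom(HELLO_C)))  # ['\\357', '\\273', '\\277']
```

Writing `with_bom(HELLO_C)` to a file and compiling it with an affected gcc should reproduce the three stray-character errors. A compiler that skips the BOM will simply build it.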

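Once the cause was known, the fix was easy: I changed the file's encoding to UTF-8 without BOM, and it compiled. The source lines matched up in gdb again, and I finally tracked down the coredump as well: it was caused by writing past the end of a character array. So always use the bounded "n" variants of the printf family, such as snprintf.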

Notes:

The BOM is unpopular mainly in UNIX environments, because many UNIX programs do not recognize it.
To remove the BOM in vim: :set nobomb | set fileencoding=utf-8 | w
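For many files at once, the BOM can be stripped with a script rather than one file at a time in vim. This is a minimal Python sketch; strip_utf8_bom is a made-up name, and it rewrites the file in place, so try it on a copy first.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path) -> bool:
    """Remove a leading UTF-8 BOM from the file at path.

    Returns True if a BOM was found and removed, False if the file
    had no BOM and was left untouched.
    """
    p = Path(path)
    data = p.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    p.write_bytes(data[len(codecs.BOM_UTF8):])
    return True
```

Calling `strip_utf8_bom("tpn_pu_main.c")` would be the scripted equivalent of saving the file as UTF-8 without BOM. Because it works on raw bytes, the rest of the file is not re-encoded.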
