Core Dump Study Notes

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 17:25

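Background

A core dump is a disk file that the operating system writes when a process is terminated by certain signals. It contains the contents of the process's address space at that moment, together with other information about the process state, and is mostly used for debugging. The book APUE mentions core dumps many times, and they are probably among the most commonly used troubleshooting tools in a typical UNIX environment.
A core dump can help locate the following kinds of problems:
1. Out-of-bounds memory access
2. Use of thread-unsafe functions in a multithreaded program
3. Data read and written by multiple threads without lock protection
4. Invalid pointers
5. Stack overflow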

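Preparation

First, confirm that core dumps are enabled on the system. In a terminal, enter:
ulimit -a
This command shows the limits on all resources for the current shell. The output looks like this: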
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 31679
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 31679
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
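The first line shows whether core dumps are enabled. In the output above, core file size is 0, which means core dumps are disabled. The following command enables them for the current shell and the processes it starts, and limits the core file size to 1024 blocks:
ulimit -c 1024
Then check again with ulimit -a. The first line now reads:
core file size (blocks, -c) 1024
Core dumps are now enabled.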
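
The same limit can also be raised from inside a program instead of from the shell. The following is a minimal sketch, assuming a POSIX system that provides getrlimit and setrlimit; the helper name enable_core_dumps is hypothetical and not part of any library. Note that RLIMIT_CORE is expressed in bytes, not in the blocks that ulimit prints.

```c
#include <stdio.h>
#include <sys/resource.h>

/*
 * Raise the soft core-file size limit of the calling process to its
 * hard limit. Returns 0 on success and -1 on failure.
 * (enable_core_dumps is a hypothetical helper name.)
 */
int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        perror("getrlimit");
        return -1;
    }

    /* An unprivileged process may raise its soft limit up to the hard limit. */
    lim.rlim_cur = lim.rlim_max;

    if (setrlimit(RLIMIT_CORE, &lim) != 0) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}
```

Calling enable_core_dumps() early in main() would have an effect similar to running ulimit -c before starting the program. If the hard limit itself is 0, the soft limit stays at 0 and core dumps remain disabled.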

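Test 1: Dereferencing an Invalid Pointer

Test 1 is a very simple test whose main purpose is to exercise the most basic core dump functionality. The code is as follows: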

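#include <stdio.h>

void func2()
{
    printf("in func2.\n");
    /* write through a null pointer to trigger a segmentation fault */
    *(int *)(0) = 0;
}

void func1()
{
    printf("in func1.\n");
    func2();
}

int main()
{
    printf("i will die...\n");
    func1();
    /* never reached: the process is killed inside func2() */
    printf("i am dead.\n");
    return 0;
}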

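Compile it with gcc -o. With core dumps disabled, running it from the console gives:
$ ./test1.o
i will die...
in func1.
in func2.
Segmentation fault
With core dumps enabled, run it again:
$ ./test1.o
i will die...
in func1.
in func2.
Segmentation fault (core dumped)
This time the message includes "core dumped", and the core file is written to the same directory as the executable:
-rw------- 1 hw2user4 hw2user4 143360 Mar 8 08:50 core.22680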
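
The "(core dumped)" suffix is printed by the shell, which learns how its child ended from the wait status. The sketch below shows the same check done by hand, assuming a Linux-like system where the WCOREDUMP macro is available (it is not part of POSIX, hence the #ifdef). The names crash and run_crashing_child are hypothetical.

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Same fault as func2() in test 1: a write through a null pointer. */
static void crash(void)
{
    volatile int *p = NULL;
    *p = 0;
}

/*
 * Run crash() in a child process and report how the child ended.
 * Returns the number of the terminating signal, 0 if the child exited
 * normally, or -1 on error. (run_crashing_child is a hypothetical name.)
 */
int run_crashing_child(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        crash();
        _exit(EXIT_SUCCESS);   /* not reached once the fault is delivered */
    }
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }
    if (!WIFSIGNALED(status))
        return 0;

    printf("child killed by signal %d", WTERMSIG(status));
#ifdef WCOREDUMP
    /* Set only if the kernel actually produced a core dump. */
    if (WCOREDUMP(status))
        printf(" (core dumped)");
#endif
    printf("\n");
    return WTERMSIG(status);
}
```

Whether the core-dump flag is set depends on the core file size limit in effect when the child is killed, which is why the first run above (limit 0) printed only "Segmentation fault".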
The file can be examined with gdb. Enter:
$ gdb -c core.22680
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Core was generated by `./test1.o'.
Program terminated with signal 11, Segmentation fault.
#0  0x0804839b in ?? ()
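The last lines show that the program was terminated by signal 11, but the final line gives only the PC value and no function name. The executable has to be passed to gdb as well: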
$ gdb test1.o core.22680
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/hw2user4/test/test1.o...done.
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
Core was generated by `./test1.o'.
Program terminated with signal 11, Segmentation fault.
#0  0x0804839b in func2 () at test1.c:8
8	    *(int *)(0) = 0;
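This time gdb resolves the fault to a specific statement. Being able to pinpoint the exact C statement matters in our current project: because of our company's coding style, function bodies tend to be very large, and the OSP traceback can only trace to the enclosing function, not to a particular statement inside it. In this respect a core dump has an advantage that traceback cannot match. There is one prerequisite, though: the program must be compiled with -g so that debug information is included. Without it, gdb can show only the function, not the exact statement. Entering where at the gdb prompt shows the call chain:
(gdb) where
#0  0x0804839b in func2 () at test1.c:8
#1  0x080483ba in func1 () at test1.c:15
#2  0x080483de in main () at test1.c:23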

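Test 2: A Fault Inside a Dynamically Linked Library

The scenario in test 2 is closer to our actual runtime environment. Most of our code exists as dynamically linked libraries that are loaded at run time after sysboot starts. Here the fault is placed inside a dynamic library to check whether it can still be resolved.
The library code, lib.c, is as follows: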

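#include <stdio.h>

void func2()
{
    printf("in func2.\n");
    /* write through a null pointer to trigger a segmentation fault */
    *(int *)(0) = 0;
}

void func1()
{
    printf("in func1.\n");
    func2();
}

/* entry point looked up by the test program */
int func()
{
    printf("i will die...\n");
    func1();
    printf("i am dead.\n");
    return 0;
}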

Compile lib.c into lib.so:

gcc -o lib.so -fPIC -rdynamic -shared lib.c

In test.c, load the dynamic library and call the func function:

#include <stdio.h>
#include <unistd.h>
#include <dlfcn.h>

int main()
{
    void *handle;
    int (*func)();

    handle = dlopen("./lib.so", RTLD_NOW);
    if(handle == NULL)
    {
        printf("fail to open lib.so.\n");
        return -1;
    }

    printf("i will die.\n");
    /* look up the symbol "func" exported by lib.so */
    func = dlsym(handle, "func");
    if(func == NULL)
    {
        printf("fail to find func.\n");
        return -1;
    }
    func();
    printf("i am dead.\n");
    return 0;
}

Compile it into the executable test.o:

gcc -o test.o test.c -ldl

Run test.o:

$ ./test.o
i will die...
in func1.
in func2.
Segmentation fault (core dumped)

The fault still occurs. Next, analyze the core dump file with gdb:

$ gdb test.o core.25704
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/hw2user4/test/test.o...(no debugging symbols found)...done.
Reading symbols from /lib/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from ./lib.so...(no debugging symbols found)...done.
Loaded symbols for ./lib.so
Core was generated by `./test.o'.
Program terminated with signal 11, Segmentation fault.
#0  0x00cec471 in func2 () from ./lib.so

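Note that although lib.so is not passed to gdb explicitly, the file must still be present at the same path; otherwise gdb cannot resolve the addresses. Entering where again shows the call chain: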

(gdb) where
#0  0x00cec471 in func2 () from ./lib.so
#1  0x00cec4a2 in func1 () from ./lib.so
#2  0x00cec4cd in func () from ./lib.so
#3  0x080484a2 in main ()

Test 3: A Fault in Signal Handling

This example is taken from APUE. The code is as follows:

#include <pwd.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>

/* SIGALRM handler: deliberately calls the non-reentrant getpwnam() */
static void my_alarm(int signo)
{
    struct passwd *rootptr;

    printf("in signal handler\n");
    if((rootptr = getpwnam("root")) == NULL)
        printf("getpwnam(root) error\n");
    alarm(1);
}

int main(void)
{
    struct passwd *ptr;

    signal(SIGALRM, my_alarm);
    alarm(1);
    for( ; ; ){
        if((ptr = getpwnam("hw2user4")) == NULL)
        {
            printf("getpwnam error\n");
            continue;
        }
        /* compare against the name that was just looked up */
        if(strcmp(ptr->pw_name, "hw2user4") != 0)
        {
            printf("return value corrupted!, pw_name = %s\n", ptr->pw_name);
        }
    }
}

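This code calls getpwnam in an endless loop in main. Before the loop starts, alarm sets a 1-second timer; one second later the process receives SIGALRM and my_alarm is called. The handler calls getpwnam too, so getpwnam is re-entered while an earlier call is still in progress, much as it would be with concurrent callers. getpwnam is not reentrant, and this is what provokes the fault during signal handling.
In actual testing the behavior did not quite match expectations, possibly because of the operating system: the program hangs as soon as the first alarm signal fires. Note that terminating the process with Ctrl+C does not produce a core dump file, because the default action of SIGINT does not dump core. A signal whose default action includes a core dump, such as SIGQUIT or SIGABRT, must be sent to the process with kill, and the process must terminate, before a core file is written. This is one limitation of debugging with core dumps.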
Analyzing the core dump file gives the following result:

#0  0x00a0c402 in __kernel_vsyscall ()
#1  0x001ee783 in __lll_lock_wait_private () from /lib/libc.so.6
#2  0x001a06d8 in _L_lock_25 () from /lib/libc.so.6
#3  0x001a05d6 in getpwnam () from /lib/libc.so.6
#4  0x08048462 in my_alarm ()
#5  <signal handler called>
#6  0x0017fbf2 in strchr () from /lib/libc.so.6
#7  0x001a0f17 in _nss_files_parse_pwent () from /lib/libc.so.6
#8  0x0054c68d in _nss_files_getpwnam_r () from /lib/libnss_files.so.2
#9  0x001a0b93 in getpwnam_r@@GLIBC_2.1.2 () from /lib/libc.so.6
#10 0x001a0618 in getpwnam () from /lib/libc.so.6
#11 0x080484c7 in main ()

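The backtrace contains two execution contexts stacked on one another (they are not separate threads; the process has a single thread). One is the normal main flow entered from main (frames #6 to #11); the other is entered through the signal handler (frames #0 to #4). The signal interrupted the main flow while it was inside getpwnam. The handler then called getpwnam again and could not acquire the lock, so it waits forever. The main flow holds the lock, but it can never resume to release it because the handler never returns, and the result is a deadlock.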

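Pros and Cons

Advantages:
1. A core dump can be traced to the specific C statement where the fault occurred, whereas backtrace can only identify the function in which it occurred.
2. It records the running state of all tasks (including interrupts, signal handling, and so on).
Disadvantages:
1. The process must be terminated before a core dump is generated.
2. Core dump files take up storage space, although the size can be limited through configuration.
3. The program must be compiled with -g and must not be stripped. This matters on embedded systems, where space efficiency is especially important.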

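Improving Core Dump Usage

The restriction that the program must be compiled with -g and cannot be stripped can be worked around with objcopy.
First, compile with -g to produce the output file: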

gcc -g -o test3.o test3.c

Use objcopy to generate a separate debug information file:

objcopy --only-keep-debug test3.o test3.debug

Strip test3.o:

strip test3.o

Run test3.o, then trigger the fault to produce a core dump file. First, try analyzing the core dump with the stripped test3.o:

(gdb) where
#0  0x004f8402 in __kernel_vsyscall ()
#1  0x00ae3783 in __lll_lock_wait_private () from /lib/libc.so.6
#2  0x00a956d8 in _L_lock_25 () from /lib/libc.so.6
#3  0x00a955d6 in getpwnam () from /lib/libc.so.6
#4  0x08048492 in getpwnam ()
#5  0x08048662 in ?? ()
#6  0xbfa2d6bc in ?? ()
#7  0x009f064c in check_match.8170 () from /lib/ld-linux.so.2
#8  <signal handler called>
#9  0x004f8402 in __kernel_vsyscall ()
#10 0x00ad3926 in munmap () from /lib/libc.so.6
#11 0x00a6a48e in _IO_setb_internal () from /lib/libc.so.6
#12 0x00a68fff in _IO_new_file_close_it () from /lib/libc.so.6
#13 0x00a5ccae in fclose@@GLIBC_2.1 () from /lib/libc.so.6
#14 0x009946fd in _nss_files_getpwnam_r () from /lib/libnss_files.so.2
#15 0x00a95b93 in getpwnam_r@@GLIBC_2.1.2 () from /lib/libc.so.6
#16 0x00a95618 in getpwnam () from /lib/libc.so.6
#17 0x08048504 in getpwnam ()
#18 0x0804867c in ?? ()
#19 0x08048474 in getpwnam ()
#20 0xbfa2d5f8 in ?? ()
#21 0x08048309 in ?? ()
#22 0x00000009 in ?? ()
#23 0x0804867c in ?? ()
#24 0x08dd6018 in ?? ()
#25 0x080485a9 in getpwnam ()
#26 0x00b489b4 in lock () from /lib/libc.so.6
#27 0xbfa2d6bc in ?? ()
#28 0x08dd6008 in ?? ()
#29 0xbfa2d630 in ?? ()
#30 0x00a01ca0 in ?? () from /lib/ld-linux.so.2
#31 0x00000000 in ?? ()

Because strip removed the symbol information, the backtrace cannot be resolved correctly. Now load the test3.debug file generated earlier into gdb:

(gdb) file test3.debug
warning: core file may not match specified executable file.
Reading symbols from /home/hw2user4/test/test3.debug...done.

Then resolve the backtrace again:

(gdb) where
#0  0x004f8402 in __kernel_vsyscall ()
#1  0x00ae3783 in __lll_lock_wait_private () from /lib/libc.so.6
#2  0x00a956d8 in _L_lock_25 () from /lib/libc.so.6
#3  0x00a955d6 in getpwnam () from /lib/libc.so.6
#4  0x08048492 in my_alarm () at test3.c:17
#5  <signal handler called>
#6  0x004f8402 in __kernel_vsyscall ()
#7  0x00ad3926 in munmap () from /lib/libc.so.6
#8  0x00a6a48e in _IO_setb_internal () from /lib/libc.so.6
#9  0x00a68fff in _IO_new_file_close_it () from /lib/libc.so.6
#10 0x00a5ccae in fclose@@GLIBC_2.1 () from /lib/libc.so.6
#11 0x009946fd in _nss_files_getpwnam_r () from /lib/libnss_files.so.2
#12 0x00a95b93 in getpwnam_r@@GLIBC_2.1.2 () from /lib/libc.so.6
#13 0x00a95618 in getpwnam () from /lib/libc.so.6
#14 0x08048504 in main () at test3.c:35

This time the backtrace resolves correctly.
