Linux Oops Kernel Debugging Information && Approaches to Resolving Linux Kernel Panic Errors

Source: Internet  Published by: 成都数据库培训  Editor: 程序博客网  Date: 2024/04/28 06:35

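An Oops can be regarded as a kernel-level segmentation fault. When an application performs an illegal memory access or executes an illegal instruction, it receives a signal (SIGSEGV in the case of a bad memory access). The default behavior is a core dump, although the application may also catch the signal and handle it itself. When the kernel itself makes this kind of mistake, it prints an Oops message. Almost every address the processor uses is a virtual address, mapped to a physical address through a complex page-table structure (the exception being the physical addresses used by the memory management subsystem itself). When an invalid pointer is dereferenced, the paging mechanism cannot map it to a physical address, and the processor raises a page fault to the operating system. If the address is not valid, the kernel cannot page it in; if the processor is in supervisor mode at that point, the kernel generates an oops message.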

How an Oops is generated:
1. do_page_fault() (arch/i386/mm/fault.c). If the kernel itself made the illegal access, do_page_fault() first prints the EIP, the PDE and related information, for example:
Unable to handle kernel paging request at virtual address f899b670
printing eip:
c01de48c
*pde = 00737067
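It then calls die("Oops", regs, error_code);. After that, if the system is still alive (which requires at least two conditions: 1. the fault happened in process context, 2. panic_on_oops is not set), the current process is killed; otherwise the kernel panics and the machine hangs.
2. die() (arch/i386/kernel/traps.c). die() first prints a line such as: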
Oops: 0002 [#1]
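Here 0002 is the error code (printed in hexadecimal) and #1 is the number of Oopses that have occurred so far. The bits of error_code mean: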
*      bit 0 == 0 means no page found, 1 means protection fault
*      bit 1 == 0 means read, 1 means write
*      bit 2 == 0 means kernel, 1 means user-mode
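It then calls show_registers(regs) to print the registers, the current process, the stack, the instruction bytes and so on, for diagnosis.

3. The dmesg command shows the complete kernel log, provided of course that the messages were emitted with printk.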
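
As a rough illustration of reading the output described in steps 1 to 3, here is a minimal Python sketch that pulls the faulting address and the "Oops:" line out of a piece of kernel log text and decodes the three error_code bits listed above. The function names (decode_error_code, parse_oops) and the regular expressions are assumptions made for this example, not part of any kernel tool; the sample text is the example output quoted earlier.

```python
import re
from typing import Optional

def decode_error_code(code: int) -> dict:
    """Decode the three i386 page-fault error_code bits listed above."""
    return {
        "cause": "protection fault" if code & 0x1 else "no page found",
        "access": "write" if code & 0x2 else "read",
        "mode": "user-mode" if code & 0x4 else "kernel",
    }

# Hypothetical patterns, written to match the sample messages in this article.
ADDR_RE = re.compile(r"Unable to handle kernel .*? at virtual address ([0-9a-fA-F]+)")
OOPS_RE = re.compile(r"Oops: ([0-9a-fA-F]{4}) \[#(\d+)\]")

def parse_oops(log_text: str) -> Optional[dict]:
    """Return the fault address, error code and Oops count, or None if no Oops line."""
    oops = OOPS_RE.search(log_text)
    if oops is None:
        return None
    addr = ADDR_RE.search(log_text)
    code = int(oops.group(1), 16)  # the error code is printed in hex
    result = {
        "fault_address": addr.group(1) if addr else None,
        "error_code": code,
        "count": int(oops.group(2)),
    }
    result.update(decode_error_code(code))
    return result

# The example output quoted in steps 1 and 2.
SAMPLE = """Unable to handle kernel paging request at virtual address f899b670
printing eip:
c01de48c
*pde = 00737067
Oops: 0002 [#1]
"""

if __name__ == "__main__":
    print(parse_oops(SAMPLE))
```

For the sample, the error code 0002 decodes to a kernel-mode write to an address with no page mapped. In practice the input text could come from the output of dmesg saved to a file.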

=================================================================================================================

(1) What is a Kernel Panic?
wiki: A kernel panic is an action taken by an operating system upon detecting an internal fatal error from which it cannot safely recover. The term is largely specific to Unix and Unix-like systems; for Microsoft Windows operating systems the equivalent term is “Bug check” (or, colloquially, “Blue Screen of Death”).
The kernel routines that handle panics (in AT&T-derived and BSD Unix source code, a routine known as panic()) are generally designed to output an error message to the console, dump an image of kernel memory to disk for post-mortem debugging and then either wait for the system to be manually rebooted, or initiate an automatic reboot. The information provided is of highly technical nature and aims to assist a system administrator or software developer in diagnosing the problem.
Attempts by the operating system to read an invalid or non-permitted memory address are a common source of kernel panics. A panic may also occur as a result of a hardware failure or a bug in the operating system. In many cases, the operating system could continue operation after memory violations have occurred. However, the system is in an unstable state and rather than risking security breaches and data corruption, the operating system stops to prevent further damage and facilitate diagnosis of the error.
The kernel panic was introduced in an early version of Unix and demonstrated a major difference between the design philosophies of Unix and its predecessor Multics. Multics developer Tom van Vleck recalls a discussion of this change with Unix developer Dennis Ritchie:
I remarked to Dennis that easily half the code I was writing in Multics was error recovery code. He said, “We left all that stuff out. If there’s an error, we have this routine called panic, and when it is called, the machine crashes, and you holler down the hall, ‘Hey, reboot it.’”[1]
The original panic() function was essentially unchanged from Fifth Edition UNIX to the VAX-based UNIX 32V and output only an error message with no other information, then dropped the system into an endless idle loop. As the Unix codebase was enhanced, the panic() function was also enhanced to dump various forms of debugging information to the console.

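"Panic" means sudden alarm, and a Linux kernel panic lives up to the name: the kernel no longer knows how to proceed, so it prints as much of the information it can still obtain as possible. There are two main types of kernel panic:
1. hard panic (the "Aieee" output); 2. soft panic (the "Oops" output)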

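(2) What causes a Linux kernel panic

Only driver modules loaded into kernel space can directly cause a kernel panic. While the system is running normally, you can use lsmod to see which modules are currently loaded. In addition, components built into the kernel (such as the memory map) can also cause a panic.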
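
lsmod gets its listing from /proc/modules, so a record of the loaded modules can also be taken programmatically while the system is healthy. The sketch below parses text in the /proc/modules layout (name, size, reference count, dependent modules, state, address). The Module type, the function name and the sample values are made up for this illustration.

```python
from typing import List, NamedTuple, Optional

class Module(NamedTuple):
    name: str
    size: int                 # module size in bytes
    refcount: Optional[int]   # None when the kernel prints "-" here
    used_by: List[str]        # modules that depend on this one

def parse_proc_modules(text: str) -> List[Module]:
    """Parse text in the /proc/modules layout into Module records."""
    modules = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        name, size, refcount, deps = fields[:4]
        # The dependency field is "-" or a comma-terminated list such as "a,b,".
        used_by = [] if deps == "-" else [d for d in deps.split(",") if d]
        modules.append(Module(
            name,
            int(size),
            int(refcount) if refcount.isdigit() else None,
            used_by,
        ))
    return modules

# Hypothetical sample in the /proc/modules layout (values are invented).
SAMPLE = """e1000 86156 0 - Live 0xf8a3b000
snd_pcm 61572 2 snd_intel8x0,snd_ac97_codec, Live 0xf89f0000
"""

if __name__ == "__main__":
    for mod in parse_proc_modules(SAMPLE):
        print(mod.name, mod.size, mod.used_by)
```

On a live system the same function could be fed the contents of /proc/modules; keeping such a snapshot makes it easier to match a module name seen in a later panic trace.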

Common Linux kernel panic messages:
Kernel panic - not syncing: Fatal exception in interrupt
Kernel panic - not syncing: Attempted to kill the idle task!
Kernel panic - not syncing: killing interrupt handler!
Kernel panic - not syncing: Attempted to kill init!
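
Messages like these are easy to pick out of a saved log automatically. The following sketch extracts the reason that follows "not syncing" from each panic line. The pattern is an assumption for this example and is deliberately tolerant of the spacing, dash and colon variants that appear when such messages are copied around.

```python
import re
from typing import List

# Hypothetical pattern: accepts "-" or an en dash, and an optional ASCII or
# full-width colon, since copied messages vary in punctuation.
PANIC_RE = re.compile(
    r"kernel panic\s*[-\u2013]\s*not syncing\s*[:\uff1a]?\s*(.+)",
    re.IGNORECASE,
)

def panic_reasons(log_text: str) -> List[str]:
    """Return the reason text of every kernel panic line in log_text."""
    reasons = []
    for line in log_text.splitlines():
        match = PANIC_RE.search(line)
        if match:
            reasons.append(match.group(1).strip())
    return reasons

if __name__ == "__main__":
    sample = (
        "Kernel panic - not syncing: Attempted to kill init!\n"
        "some unrelated log line\n"
        "Kernel panic-not syncing fatal exception in interrupt\n"
    )
    for reason in panic_reasons(sample):
        print(reason)
```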

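(3) hard panic

For a hard panic, the most likely culprit is a driver module's interrupt handler, typically because the driver dereferences a null pointer inside the handler. Once that happens, the driver can no longer service new interrupt requests, and the system eventually crashes.

Depending on the state of the panic, the kernel records whatever it can before the system locks up. Because a kernel panic is a very serious error, there is no guarantee of how much gets recorded. The key information to collect is:

a. /var/log/messages: with luck, the entire kernel panic stack trace is recorded here. To check whether the stack trace is sufficient, look for a line containing "EIP", which shows the function and module that were executing when the panic occurred.
b. The dump on the console screen. Once the OS has locked up, copy and paste is out of the question, so you may need to photograph the screen.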
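
The search for "EIP" lines in item a can be scripted. The sketch below scans log text for lines containing "EIP" and, where a line has the form "EIP is at symbol+0xoffset/0xlength [module]", also extracts the function and module names. That line format is an assumption based on what i386 kernels of that era typically print, and the sample lines and names (foo_interrupt, foo_drv) are invented; lines in any other form are still returned, just without a parsed function.

```python
import re
from typing import Dict, List, Optional

# Assumed format: "EIP is at func+0x1c/0x40 [module]"; the module part is optional.
EIP_RE = re.compile(
    r"EIP is at (?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<module>\w+)\])?"
)

def find_eip_lines(log_text: str) -> List[Dict[str, Optional[str]]]:
    """Return every line mentioning EIP, with function/module when recognizable."""
    hits = []
    for line in log_text.splitlines():
        if "EIP" not in line:
            continue
        match = EIP_RE.search(line)
        hits.append({
            "line": line.strip(),
            "function": match.group("func") if match else None,
            "module": match.group("module") if match else None,
        })
    return hits

# Invented sample lines in the style of /var/log/messages.
SAMPLE = """Jan  1 00:00:00 host kernel: EIP is at foo_interrupt+0x1c/0x40 [foo_drv]
Jan  1 00:00:00 host kernel: EIP:    0060:[<c01de48c>]    Not tainted
Jan  1 00:00:01 host kernel: unrelated message
"""

if __name__ == "__main__":
    for hit in find_eip_lines(SAMPLE):
        print(hit["function"], hit["module"], "|", hit["line"])
```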

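(4) soft panic
Symptoms: less severe than a hard panic. It usually results in a segmentation fault and an oops message, which you can find by searching /var/log/messages for 'Oops'. The machine remains somewhat usable (but you should reboot it after collecting the information).

Cause: any module crash that does not originate in an interrupt handler leads to a soft panic. In this case the driver itself crashes, but not fatally for the system, because it has not locked up an interrupt handler. The causes of a hard panic apply to a soft panic as well (for example, dereferencing a null pointer at run time).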
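
The hard/soft distinction, combined with the two survival conditions mentioned earlier (process context, panic_on_oops not set), can be summarized as a small decision function. This is only a simplified model of the article's description, not the kernel's actual code; the function name and the returned strings are invented for the illustration.

```python
def oops_outcome(in_interrupt: bool, panic_on_oops: bool) -> str:
    """Simplified model: what follows a kernel fault, per the description above."""
    if in_interrupt:
        # A fault inside an interrupt handler cannot be pinned on one process.
        return "hard panic: fatal exception in interrupt, system halts"
    if panic_on_oops:
        # The administrator asked for every Oops to be treated as fatal.
        return "hard panic: panic_on_oops is set, system halts"
    # Process context: only the offending process is killed.
    return "soft panic: Oops printed, current process killed, system keeps running"

if __name__ == "__main__":
    for irq in (False, True):
        for flag in (False, True):
            print(irq, flag, "->", oops_outcome(irq, flag))
```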

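(5) fatal exception

A fatal exception is an exceptional condition that requires the program responsible for it to be shut down. In general, an exception can be any unexpected condition (not just a program error). A fatal exception is simply one that cannot be handled well enough for the program to keep running.
Software applications interact with the operating system and with other applications through several layers of code. When an exception occurs in one layer, it is passed from layer to layer in search of code that handles it. If no layer has a handler for it, the operating system displays a fatal exception error message. The message may also include some cryptic information about where the fatal exception occurred (such as a hexadecimal address within the program's memory range). This extra information means little to the user, but it can help support staff or developers debug the program.

When a fatal exception occurs, the operating system has no recourse but to close the application and, in some cases, to shut down the operating system itself.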

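(6) Sometimes a freshly installed Linux system fails to boot and reports "Kernel panic - not syncing: Fatal exception". In many cases this is caused by the onboard sound card, the network card, or CPU Hyper-Threading. The fix is to read the error output first (markers such as cut here, Modules linked in, PC is at, LR is at, Unable to handle kernel NULL pointer), identify the hardware it points to, and disable that hardware. Once the system boots, install the appropriate driver and then re-enable the hardware (this applies to Linux desktop systems).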

Reference: http://www.ixpub.net/thread-759651-1-1.html
Reference: http://hi.baidu.com/xysoul/blog/item/d85ebff2c2f1bc1bb17ec562.html
Reference: http://blog.51osos.com/linux/linux-kernel-panic/