Analysis of a System Crash Caused by Memory MCE Errors


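Today a server crashed because of a memory problem. Decoding the kernel messages with the mcelog tool showed an "Error overflow" on memory reads (the memory is ECC, but it could not keep up with that many errors). This is probably a memory hardware fault; if it happens again, the memory will have to be replaced.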


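Final cause: a hardware fault, most likely in the motherboard. Because this is a production server, the motherboard and the memory were replaced at the same time to keep planned downtime short, and that resolved the problem.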


# more /var/log/messages

Oct 31 14:19:36 pingu_fd kernel: sbridge: HANDLING MCE MEMORY ERROR

Oct 31 14:19:36 pingu_fd kernel: CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010092
Oct 31 14:19:36 pingu_fd kernel: TSC 0 ADDR 428fc8840 MISC 204808e886 PROCESSOR 0:206d6 TIME 1383200376 SOCKET 0 APIC 0
Oct 31 14:19:36 pingu_fd kernel: sbridge: HANDLING MCE MEMORY ERROR
Oct 31 14:19:36 pingu_fd kernel: CPU 0: Machine Check Exception: 0 Bank 10: 8800004800800092

Oct 31 14:19:36 pingu_fd kernel: TSC 0 ADDR 0 MISC 4900030243025000 PROCESSOR 0:206d6 TIME 1383200376 SOCKET 0 APIC 0
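When there are many such entries, it can help to pull the bank number and the raw status value out of each "Machine Check Exception" line before decoding them. The following is a minimal sketch, assuming the log lines have exactly the layout shown above; the regular expression and the helper name `find_mce_events` are illustrative and not part of mcelog or the kernel.

```python
import re

# Matches lines such as:
#   "... kernel: CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010092"
# The pattern is an assumption based on the log excerpt above.
MCE_RE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check Exception:\s*\S+\s+"
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]+)"
)


def find_mce_events(lines):
    """Return a list of (cpu, bank, status) tuples found in the given log lines."""
    events = []
    for line in lines:
        m = MCE_RE.search(line)
        if m:
            events.append(
                (int(m.group("cpu")), int(m.group("bank")), int(m.group("status"), 16))
            )
    return events


sample = [
    "Oct 31 14:19:36 pingu_fd kernel: sbridge: HANDLING MCE MEMORY ERROR",
    "Oct 31 14:19:36 pingu_fd kernel: CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010092",
    "Oct 31 14:19:36 pingu_fd kernel: CPU 0: Machine Check Exception: 0 Bank 10: 8800004800800092",
]

for cpu, bank, status in find_mce_events(sample):
    print(f"cpu={cpu} bank={bank} status={status:#018x}")
```

To scan a real log, the same function can be given an open file object, for example `find_mce_events(open("/var/log/messages"))`.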


mcelog decodes the messages as follows:

# mcelog sandybridge-ep --ascii < mcelog-manu.txt   
sbridge: HANDLING MCE MEMORY ERROR
Hardware event. This is not a software error.
CPU 0 BANK 5 
MISC 244076f686 ADDR 1a6bca040 
TIME 1383200376 Thu Oct 31 14:19:36 2013
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL2_ERR
Transaction: Memory read error
STATUS cc0000c000010092 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 45
SOCKET 0 APIC 0
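The flags mcelog prints can be cross-checked by decoding the STATUS value by hand. The sketch below assumes the MCi_STATUS bit layout from the Intel machine-check architecture (bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC) and the memory controller error code pattern `0000 0000 1MMM CCCC` in the low 16 bits. The function name `decode_status` is hypothetical, and only the fields relevant to this report are handled.

```python
# Flag bits of MCi_STATUS (assumed layout, see the lead-in above).
STATUS_FLAGS = [
    (63, "VAL"),    # register contents are valid
    (62, "OVER"),   # error overflow: another error arrived before this one was cleared
    (61, "UC"),     # uncorrected error (0 means corrected)
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MCi_MISC register valid
    (58, "ADDRV"),  # MCi_ADDR register valid
    (57, "PCC"),    # processor context corrupt
]

# MMM field of a memory controller error code.
MEM_REQUESTS = {
    0: "generic",
    1: "read",
    2: "write",
    3: "address/command",
    4: "scrub",
}


def decode_status(status):
    """Decode the flag bits and, if present, the memory controller error code."""
    flags = [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {"flags": flags, "mca_code": code}
    # Memory controller errors follow the pattern 0000 0000 1MMM CCCC.
    if (code & 0xFF80) == 0x0080:
        result["request"] = MEM_REQUESTS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        # 0b1111 means the channel is not specified.
        result["channel"] = None if channel == 0xF else channel
    return result


print(decode_status(0xCC0000C000010092))
```

For the STATUS value above, the set flags are VAL, OVER, MISCV and ADDRV with UC clear, and the error code decodes to a read request on channel 2, which agrees with the "Error overflow", "Corrected error" and "RD_CHANNEL2_ERR" lines in the mcelog output.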