Troubleshooting Diary: EDAC DIMM CE Error


2015-12-30, afternoon: Nagios monitoring raised a disk-full alert.
2015-12-31, morning: started investigating the cause.


The investigation showed that three system log files in the log directory had grown very large, reaching 8.7 GB.
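A quick way to spot oversized logs like these is to sort files by size. The sketch below is not the command used in this investigation; the helper name largest_files and the default directory /var/log are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list the N largest regular files under a directory.
# Usage: largest_files [DIR] [N]   (defaults: /var/log, 5)
largest_files() {
    dir="${1:-/var/log}"
    n="${2:-5}"
    # du -k prints each file's disk usage in KiB; sort numerically, largest first.
    find "$dir" -type f -exec du -k {} + 2>/dev/null | sort -rn | head -n "$n"
}
```

Calling largest_files /var/log 3 would print the three biggest files with their sizes in KiB.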

Reading the logs revealed a huge number of EDAC DIMM CE Error messages.
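To get a feel for how many of these messages a log contains, one can count matching lines. This is only a sketch: the helper name count_edac_lines is hypothetical, and the search string "EDAC" is an assumption about the message format.

```shell
#!/bin/sh
# Hypothetical helper: count log lines that mention EDAC in the given files.
# The search string "EDAC" is an assumption about the message format.
count_edac_lines() {
    # grep -c prints a per-file count of matching lines;
    # -H prefixes the file name even when only one file is given.
    grep -cH "EDAC" "$@"
}
```

Each output line has the form file:count, so the noisiest log stands out at a glance.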


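After some Googling, I learned that the cause is memory errors (CE stands for correctable error): the system automatically kicks off its error-recovery mechanism, the recovery fails and is written to the log, the system tries again, and this loop makes the log files balloon.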

I skimmed the Linux kernel documentation on EDAC (edac doc).

According to this passage:

    Dual channels allows for 128 bit data transfers to the CPU from memory.
    Some newer chipsets allow for more than 2 channels, like Fully Buffered
    DIMMs (FB-DIMMs). The following example will assume 2 channels:

                Channel 0   Channel 1
        ===================================
        csrow0  | DIMM_A0   | DIMM_B0 |
        csrow1  | DIMM_A0   | DIMM_B0 |
        ===================================

        ===================================
        csrow2  | DIMM_A1   | DIMM_B1 |
        csrow3  | DIMM_A1   | DIMM_B1 |
        ===================================

So I ran the following on the machine:

    root@ubuntu:/var/log# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
    /sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
    /sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
    /sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:4213901959

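Cross-referencing the documentation quoted above, csrow2 on channel 0 corresponds to DIMM_A1, so DIMM_A1 is the faulty module.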
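The same check can be scripted so that only non-zero counters are reported. The sketch below assumes the sysfs layout seen in the grep command above; the helper name nonzero_ce_counts and its optional base-directory argument are hypothetical. Mapping a csrow/channel pair to a DIMM label still has to be done by hand against the table, which is only an example layout and may differ between boards.

```shell
#!/bin/sh
# Hypothetical helper: print every EDAC per-channel CE counter that is non-zero.
# Usage: nonzero_ce_counts [BASE]   (default: /sys/devices/system/edac/mc)
nonzero_ce_counts() {
    base="${1:-/sys/devices/system/edac/mc}"
    for f in "$base"/mc*/csrow*/ch*_ce_count; do
        [ -r "$f" ] || continue        # skip if the glob matched nothing
        count=$(cat "$f")
        if [ "$count" -gt 0 ] 2>/dev/null; then
            # Derive "csrowN" and "chN" from the path for a readable report.
            csrow=$(basename "$(dirname "$f")")
            ch=$(basename "$f" _ce_count)
            echo "$csrow $ch $count"
        fi
    done
}
```

With the counters shown above, this would report a single line for csrow2 and ch0, which is the row to look up in the table.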

Next I ran dmidecode to look up the module; the entry for DIMM_A1 appears in the output:

    root@ubuntu:/var/log# dmidecode -t memory

    Memory Device
        Array Handle: 0x0032
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 4096 MB
        Form Factor: DIMM
        Set: None
        Locator: DIMM_A1
        Bank Locator: BANK0
        Type: DDR3
        Type Detail: Other
        Speed: 1333 MHz
        Manufacturer: Manufacturer0
        Serial Number: SerNum1
        Asset Tag: AssetTagNum1
        Part Number: PartNum1
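On a machine with many DIMMs, picking one record out of the dmidecode output by eye gets tedious. A minimal sketch of a filter follows, assuming records are separated by blank lines as in typical dmidecode output; the helper name memory_device_by_locator is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `dmidecode -t memory` output on stdin and print
# the record whose Locator field equals the given label.
# Usage: dmidecode -t memory | memory_device_by_locator DIMM_A1
memory_device_by_locator() {
    # RS="" puts awk in paragraph mode: blank-line-separated blocks are records.
    awk -v want="$1" 'BEGIN { RS = ""; FS = "\n" }
    {
        for (i = 1; i <= NF; i++) {
            line = $i
            sub(/^[ \t]+/, "", line)      # drop leading indentation
            # Exact match, so "Bank Locator: ..." lines are not confused with it.
            if (line == "Locator: " want) { print $0; exit }
        }
    }'
}
```

The exact comparison on the whole line is deliberate: a plain substring search for DIMM_A1 would also match labels such as DIMM_A10.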

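Follow-up:
- To keep logs from filling the disk like this again, I changed the logrotate configuration: a shorter rotation period, fewer retained backups, and compression enabled for rotated logs.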
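The actual configuration change is not shown here, so the following is only a sketch of what such a stanza could look like. The log paths, the daily period, the retention count of 3, and the helper name write_logrotate_conf are all assumptions.

```shell
#!/bin/sh
# Hypothetical helper: write an example logrotate stanza to the given file.
# The log paths and the numbers below are assumptions; adjust to the system.
write_logrotate_conf() {
    cat > "$1" <<'EOF'
/var/log/syslog /var/log/kern.log {
    # rotate every day (shorter rotation period)
    daily
    # keep only three rotated copies
    rotate 3
    # gzip rotated logs, but leave the newest rotated copy uncompressed
    compress
    delaycompress
    # tolerate a missing log file and skip empty ones
    missingok
    notifempty
}
EOF
}
```

The result can be checked with logrotate -d FILE, which runs in debug mode without rotating anything. If those logs are already covered by an existing logrotate file, it is safer to edit that stanza than to add a second one for the same paths.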


References:
- How can I find which memory have CE error?
- edac doc