Ceph troubleshooting notes
Source: Internet | Editor: 程序博客网 | Date: 2024/06/05 05:42
Ceph environment for reference (output of ceph -s):
    cluster dc4f91c1-8792-4948-b68f-2fcea75f53b9
     health HEALTH_WARN 15 requests are blocked > 32 sec; clock skew detected on mon.hh-yun-ceph-cinder025-128075
     monmap e3: 5 mons at {hh-yun-ceph-cinder015-128055=240.30.128.55:6789/0,hh-yun-ceph-cinder017-128057=240.30.128.57:6789/0,hh-yun-ceph-cinder024-128074=240.30.128.74:6789/0,hh-yun-ceph-cinder025-128075=240.30.128.75:6789/0,hh-yun-ceph-cinder026-128076=240.30.128.76:6789/0}
            election epoch 168, quorum 0,1,2,3,4 hh-yun-ceph-cinder015-128055,hh-yun-ceph-cinder017-128057,hh-yun-ceph-cinder024-128074,hh-yun-ceph-cinder025-128075,hh-yun-ceph-cinder026-128076
     osdmap e27430: 100 osds: 100 up, 100 in
      pgmap v11224834: 20544 pgs, 2 pools, 70255 GB data, 17678 kobjects
            205 TB used, 157 TB / 363 TB avail
               20540 active+clean
                   4 active+clean+scrubbing+deep
  client io 57251 kB/s rd, 44602 kB/s wr, 3797 op/s
Output of ceph health detail for reference:
1. mon.hh-yun-ceph-cinder025-128075 addr 240.30.128.75:6789/0 clock skew 0.919947s > max 0.05s (latency 0.000544046s)
2. 15 requests are blocked
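This output shows two common problems:
1. Clocks are out of sync, which makes the mons raise a warning.
2. Client requests to Ceph storage time out because of a hardware fault, network latency, or some other cause.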
Fix (clock synchronization)
Current settings on the system:
[root@hh-yun-ceph-cinder015-128055 ceph]# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep clock
  "mon_clock_drift_allowed": "0.05",      <- a mon whose clock is off by more than 0.05 s is flagged as skewed
  "mon_clock_drift_warn_backoff": "5",    <- exponential backoff factor for repeating clock-drift warnings
  "clock_offset": "0",                    <- manual offset applied to the daemon's clock (default 0)
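The admin socket prints the whole configuration as JSON, so the same filtering can be done from a script. This is a minimal Python sketch that assumes `config show` returns one flat JSON object mapping option names to string values, as the lines above suggest. The function names are illustrative and are not part of any Ceph tool.

```python
import json
import subprocess
from typing import Dict


def clock_settings(config_json: str) -> Dict[str, str]:
    """Keep only the options whose name contains 'clock' (like `grep clock`)."""
    config = json.loads(config_json)
    # Values stay as the strings Ceph printed, e.g. "0.05".
    return {name: value for name, value in config.items() if "clock" in name}


def read_daemon_config(asok_path: str) -> str:
    """Return the JSON printed by `ceph --admin-daemon <asok_path> config show`."""
    result = subprocess.run(
        ["ceph", "--admin-daemon", asok_path, "config", "show"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On a node with the admin socket above, `clock_settings(read_daemon_config("/var/run/ceph/ceph-osd.0.asok"))` should list the same options the grep shows.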
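Locating the problem
Checking the time on each machine showed that node hh-yun-ceph-cinder025-128075 was off. ceph health detail reports a skew of 0.919947 s, while only 0.05 s is allowed.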
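When several mons are involved, the skewed nodes can also be picked out of ceph health detail programmatically. This minimal sketch assumes the message format shown above, and other Ceph releases may word this warning differently. `skewed_mons` is a hypothetical helper.

```python
import re
from typing import List, NamedTuple

# Format as shown above, e.g.
# "mon.X addr 1.2.3.4:6789/0 clock skew 0.919947s > max 0.05s (latency 0.000544046s)"
SKEW_RE = re.compile(
    r"(?P<mon>mon\.\S+) addr \S+ clock skew (?P<skew>[\d.]+)s > max (?P<allowed>[\d.]+)s"
)


class Skew(NamedTuple):
    mon: str
    skew: float     # measured offset in seconds
    allowed: float  # mon_clock_drift_allowed in seconds


def skewed_mons(health_detail: str) -> List[Skew]:
    """Return every mon reported as skewed, worst offender first."""
    found = [
        Skew(m.group("mon"), float(m.group("skew")), float(m.group("allowed")))
        for m in SKEW_RE.finditer(health_detail)
    ]
    return sorted(found, key=lambda s: s.skew, reverse=True)
```

Sorting by skew puts the node that needs fixing first at the top of the list.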
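Fix (on the skewed node): stop chronyd, step the clock once against the NTP server with ntpdate, then start chronyd again: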
systemctl stop chronyd
ntpdate 10.199.129.21
systemctl start chronyd
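ntpdate steps the clock once to match the given server. To check how far a node is off before or after the fix, you can estimate the offset with a single SNTP query and the standard NTP formula offset = ((T2 - T1) + (T3 - T4)) / 2. This is a read-only Python sketch. It assumes UDP port 123 on the NTP server (10.199.129.21 in this cluster) is reachable. It only measures the offset and never changes the clock.

```python
import socket
import struct
import time

NTP_EPOCH_DELTA = 2208988800  # seconds from 1900-01-01 (NTP epoch) to 1970-01-01 (Unix epoch)


def ntp_to_unix(seconds: int, fraction: int) -> float:
    """Convert a 64-bit NTP timestamp (32.32 fixed point) to Unix time."""
    return seconds - NTP_EPOCH_DELTA + fraction / 2**32


def clock_offset(t1: float, t2: float, t3: float, t4: float) -> float:
    """t1 client send, t2 server receive, t3 server send, t4 client receive.

    A positive result means the local clock is behind the server.
    """
    return ((t2 - t1) + (t3 - t4)) / 2


def query_offset(server: str, timeout: float = 2.0) -> float:
    """Send one SNTP client request and return the estimated offset in seconds."""
    request = b"\x1b" + 47 * b"\0"  # LI=0, VN=3, Mode=3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        t1 = time.time()
        sock.sendto(request, (server, 123))
        data, _ = sock.recvfrom(48)
        t4 = time.time()
    # Receive timestamp is at bytes 32-39, transmit timestamp at bytes 40-47.
    rx_sec, rx_frac, tx_sec, tx_frac = struct.unpack("!4I", data[32:48])
    return clock_offset(t1, ntp_to_unix(rx_sec, rx_frac), ntp_to_unix(tx_sec, tx_frac), t4)
```

Run `query_offset("10.199.129.21")` on each mon host. Any value approaching mon_clock_drift_allowed (0.05 s) is worth fixing.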
After the clock has been synchronized and verified, restart the mon on that node:
/etc/init.d/ceph stop mon
/etc/init.d/ceph start mon
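Clients do not depend on a long-lived connection to any single mon. As long as the mons are redundant, a quick restart can therefore be done even during production hours.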
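The redundancy condition comes down to keeping a majority of monitors in quorum while one of them is down. The sketch below covers that arithmetic, and the helper names are illustrative. With the 5 mons in this cluster, quorum needs 3, so restarting one mon is safe.

```python
def quorum_size(num_mons: int) -> int:
    """Smallest majority of the monitor map."""
    return num_mons // 2 + 1


def safe_to_restart_mon(num_mons: int, mons_in_quorum: int) -> bool:
    """True if taking one more mon down still leaves a majority in quorum."""
    return mons_in_quorum - 1 >= quorum_size(num_mons)
```

You can read the current quorum membership from `ceph quorum_status`, in its `quorum_names` field.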
Fix (15 requests are blocked)
Reference output:
ceph health detail
HEALTH_WARN 14 requests are blocked > 32 sec; 11 osds have slow requests
7 ops are blocked > 536871 sec
2 ops are blocked > 268435 sec
2 ops are blocked > 67108.9 sec
3 ops are blocked > 33554.4 sec
1 ops are blocked > 536871 sec on osd.0
1 ops are blocked > 536871 sec on osd.10
2 ops are blocked > 536871 sec on osd.12
2 ops are blocked > 268435 sec on osd.18
1 ops are blocked > 536871 sec on osd.31
1 ops are blocked > 536871 sec on osd.38
1 ops are blocked > 67108.9 sec on osd.38
1 ops are blocked > 33554.4 sec on osd.48
1 ops are blocked > 67108.9 sec on osd.52
1 ops are blocked > 536871 sec on osd.63
1 ops are blocked > 33554.4 sec on osd.64
1 ops are blocked > 33554.4 sec on osd.69
11 osds have slow requests
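The output shows that requests are blocked, and have been for a very long time. The oldest have been waiting more than 536871 sec, which is over six days.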
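To decide which OSDs to restart, and in what order, you can summarize the per-OSD lines. This minimal sketch assumes the line format shown above. `blocked_osds` is a hypothetical helper. It skips the aggregate lines that have no `on osd.N`.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Per-OSD line, e.g. "2 ops are blocked > 536871 sec on osd.12"
BLOCKED_RE = re.compile(r"(\d+) ops are blocked > ([\d.]+) sec on osd\.(\d+)")


def blocked_osds(health_detail: str) -> List[Tuple[int, float, int]]:
    """Return (osd_id, longest_block_sec, blocked_ops), longest block first."""
    longest: Dict[int, float] = defaultdict(float)
    ops: Dict[int, int] = defaultdict(int)
    for count, secs, osd in BLOCKED_RE.findall(health_detail):
        osd_id = int(osd)
        longest[osd_id] = max(longest[osd_id], float(secs))
        ops[osd_id] += int(count)
    # Sort by longest block (descending), then by OSD id for a stable order.
    return sorted(
        ((osd_id, longest[osd_id], ops[osd_id]) for osd_id in ops),
        key=lambda item: (-item[1], item[0]),
    )
```

On the output above, it would return the 11 OSDs listed. osd.0, osd.10, osd.12, osd.31, osd.38 and osd.63 (blocked > 536871 sec) would come first.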
Fix
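Restart the OSDs listed above one at a time. Important: after restarting an OSD, wait until data recovery has completed normally before restarting the next one.
/etc/init.d/ceph stop osd.0
/etc/init.d/ceph start osd.0
Repeat this for each affected OSD (osd.0, osd.10, osd.12, osd.18, osd.31, osd.38, osd.48, osd.52, osd.63, osd.64, osd.69), always waiting for recovery to finish in between.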
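You can script the one-at-a-time rule. This sketch rests on two assumptions. First, OSDs are restarted with the sysvinit script shown above. Second, recovery counts as finished once `ceph health` no longer mentions PG states such as degraded, recovering, backfill or peering. That keyword check is a heuristic of this sketch, not a Ceph API. It deliberately ignores slow-request warnings from OSDs that have not been restarted yet, so the loop can move on.

```python
import subprocess
import time
from typing import Callable, Iterable

# Heuristic (assumption of this sketch): words in `ceph health` meaning PGs are still recovering.
RECOVERY_WORDS = ("degraded", "recover", "backfill", "peering", "undersized", "stale", "inactive", "down")


def recovery_done(health_line: str) -> bool:
    """True if the health summary no longer mentions recovery-related PG states."""
    text = health_line.lower()
    return not any(word in text for word in RECOVERY_WORDS)


def ceph_health() -> str:
    """Return the output of `ceph health`, e.g. 'HEALTH_OK' or 'HEALTH_WARN ...'."""
    result = subprocess.run(["ceph", "health"], capture_output=True, text=True, check=True)
    return result.stdout.strip()


def restart_osd(osd_id: int) -> None:
    """Stop and start one OSD with the sysvinit script used in this cluster."""
    for action in ("stop", "start"):
        subprocess.run(["/etc/init.d/ceph", action, f"osd.{osd_id}"], check=True)


def rolling_restart(
    osd_ids: Iterable[int],
    restart: Callable[[int], None] = restart_osd,
    health: Callable[[], str] = ceph_health,
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Restart OSDs strictly one at a time, waiting for recovery after each."""
    for osd_id in osd_ids:
        restart(osd_id)
        # Give the OSD time to rejoin and peering to start before the first check.
        sleep(poll_seconds)
        while not recovery_done(health()):
            sleep(poll_seconds)
```

The restart, health and sleep functions are parameters, so you can dry-run the loop with stubs before pointing it at the cluster.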
After recovery, the cluster is back to HEALTH_OK:
    cluster dc4f91c1-8792-4948-b68f-2fcea75f53b9
     health HEALTH_OK
     monmap e3: 5 mons at {hh-yun-ceph-cinder015-128055=240.30.128.55:6789/0,hh-yun-ceph-cinder017-128057=240.30.128.57:6789/0,hh-yun-ceph-cinder024-128074=240.30.128.74:6789/0,hh-yun-ceph-cinder025-128075=240.30.128.75:6789/0,hh-yun-ceph-cinder026-128076=240.30.128.76:6789/0}
            election epoch 170, quorum 0,1,2,3,4 hh-yun-ceph-cinder015-128055,hh-yun-ceph-cinder017-128057,hh-yun-ceph-cinder024-128074,hh-yun-ceph-cinder025-128075,hh-yun-ceph-cinder026-128076
     osdmap e27495: 100 osds: 100 up, 100 in
      pgmap v11231620: 20544 pgs, 2 pools, 70294 GB data, 17688 kobjects
            206 TB used, 157 TB / 363 TB avail
               20539 active+clean
                   5 active+clean+scrubbing+deep
  client io 973 kB/s rd, 22936 kB/s wr, 1334 op/s
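The status above confirms recovery because every PG is active+clean: 20539 + 5 = 20544, and the 5 extra PGs are only being deep-scrubbed. The sketch below checks this against the plain-text ceph -s layout shown here. The parsing is an assumption tied to this output format. For scripts, machine-readable output (`--format json`) is usually more robust.

```python
import re
from typing import Dict

PGS_TOTAL_RE = re.compile(r"(\d+) pgs,")
# "<count> <state>" lines under pgmap, e.g. "20539 active+clean"
PG_STATE_RE = re.compile(r"^\s*(\d+) ([a-z_]+(?:\+[a-z_]+)*)\s*$", re.MULTILINE)


def pg_states(status: str) -> Dict[str, int]:
    """Map each PG state listed in `ceph -s` output to its PG count."""
    return {state: int(count) for count, state in PG_STATE_RE.findall(status)}


def all_pgs_clean(status: str) -> bool:
    """True if every PG is active+clean; extra flags such as +scrubbing+deep are fine."""
    total = PGS_TOTAL_RE.search(status)
    if total is None:
        raise ValueError("no 'N pgs,' summary found in status output")
    clean = sum(n for state, n in pg_states(status).items() if state.startswith("active+clean"))
    return clean == int(total.group(1))
```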