ceph (luminous) data disk failure test
Purpose
Simulate a data disk failure under ceph (luminous) and walk through repairing it.
Environment
See the companion article "Manual ceph deployment notes (luminous)" for how this cluster was built.
Current ceph cluster status:
ceph -s
  cluster:
    id:     c45b752d-5d4d-4d3a-a3b2-04e73eff4ccd
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum hh-ceph-128040,hh-ceph-128214,hh-ceph-128215
    mgr: openstack(active)
    osd: 36 osds: 36 up, 36 in

  data:
    pools:   1 pools, 2048 pgs
    objects: 28024 objects, 109 GB
    usage:   331 GB used, 196 TB / 196 TB avail
    pgs:     2048 active+clean
osd tree (excerpt)
[root@hh-ceph-128214 ceph]# ceph osd tree
ID  CLASS WEIGHT    TYPE NAME                    STATUS REWEIGHT PRI-AFF
 -1       216.00000 root default
-10        72.00000     rack racka07
 -3        72.00000         host hh-ceph-128214
 12   hdd   6.00000             osd.12           up     1.00000  1.00000
 13   hdd   6.00000             osd.13           up     1.00000  1.00000
 14   hdd   6.00000             osd.14           up     1.00000  1.00000
 15   hdd   6.00000             osd.15           up     1.00000  1.00000
 16   hdd   6.00000             osd.16           up     1.00000  1.00000
 17   hdd   6.00000             osd.17           up     1.00000  1.00000
 18   hdd   6.00000             osd.18           up     1.00000  1.00000
 19   hdd   6.00000             osd.19           up     1.00000  1.00000
 20   hdd   6.00000             osd.20           up     1.00000  1.00000
 21   hdd   6.00000             osd.21           up     1.00000  1.00000
 22   hdd   6.00000             osd.22           up     1.00000  1.00000
 23   hdd   6.00000             osd.23           up     1.00000  1.00000
 -9        72.00000     rack racka12
 -2        72.00000         host hh-ceph-128040
  0   hdd   6.00000             osd.0            up     1.00000  0.50000
  1   hdd   6.00000             osd.1            up     1.00000  1.00000
  2   hdd   6.00000             osd.2            up     1.00000  1.00000
  3   hdd   6.00000             osd.3            up     1.00000  1.00000
Failure simulation
Wipe the data directory of osd.14 to simulate a failed data disk:
[root@hh-ceph-128214 ceph]# df -h | grep ceph-14
/dev/sdc1       5.5T  8.8G  5.5T   1% /var/lib/ceph/osd/ceph-14
/dev/sdn3       4.7G  2.1G  2.7G  44% /var/lib/ceph/journal/ceph-14
[root@hh-ceph-128214 ceph]# rm -rf /var/lib/ceph/osd/ceph-14/*
[root@hh-ceph-128214 ceph]# ls /var/lib/ceph/osd/ceph-14/
Check the current status
  cluster:
    id:     c45b752d-5d4d-4d3a-a3b2-04e73eff4ccd
    health: HEALTH_WARN
            1 osds down
            Degraded data redundancy: 3246/121608 objects degraded (2.669%), 124 pgs unclean, 155 pgs degraded

  services:
    mon: 3 daemons, quorum hh-ceph-128040,hh-ceph-128214,hh-ceph-128215
    mgr: openstack(active)
    osd: 36 osds: 35 up, 36 in

  data:
    pools:   1 pools, 2048 pgs
    objects: 40536 objects, 157 GB
    usage:   493 GB used, 195 TB / 196 TB avail
    pgs:     3246/121608 objects degraded (2.669%)
             1893 active+clean
             155  active+undersized+degraded

  io:
    client: 132 kB/s rd, 177 MB/s wr, 165 op/s rd, 175 op/s wr
osd tree after the failure (osd.14 is reported down):
[root@hh-ceph-128214 ceph]# ceph osd tree
ID  CLASS WEIGHT    TYPE NAME                    STATUS REWEIGHT PRI-AFF
 -1       216.00000 root default
-10        72.00000     rack racka07
 -3        72.00000         host hh-ceph-128214
 12   hdd   6.00000             osd.12           up     1.00000  1.00000
 13   hdd   6.00000             osd.13           up     1.00000  1.00000
 14   hdd   6.00000             osd.14           down   1.00000  1.00000
 15   hdd   6.00000             osd.15           up     1.00000  1.00000
 16   hdd   6.00000             osd.16           up     1.00000  1.00000
 17   hdd   6.00000             osd.17           up     1.00000  1.00000
 18   hdd   6.00000             osd.18           up     1.00000  1.00000
 19   hdd   6.00000             osd.19           up     1.00000  1.00000
 20   hdd   6.00000             osd.20           up     1.00000  1.00000
 21   hdd   6.00000             osd.21           up     1.00000  1.00000
 22   hdd   6.00000             osd.22           up     1.00000  1.00000
 23   hdd   6.00000             osd.23           up     1.00000  1.00000
 -9        72.00000     rack racka12
 -2        72.00000         host hh-ceph-128040
  0   hdd   6.00000             osd.0            up     1.00000  0.50000
  1   hdd   6.00000             osd.1            up     1.00000  1.00000
Relevant mon error log (truncated at the start):
orting failure:1
2017-11-24 16:09:24.767761 7fdd215c1700  0 log_channel(cluster) log [DBG] : osd.14 10.199.128.214:6804/11943 reported immediately failed by osd.10 10.199.128.40:6820/12617
2017-11-24 16:09:24.996514 7fdd215c1700  1 mon.hh-ceph-128040@0(leader).osd e328 prepare_failure osd.14 10.199.128.214:6804/11943 from osd.6 10.199.128.40:6812/12317 is reporting failure:1
2017-11-24 16:09:24.996545 7fdd215c1700  0 log_channel(cluster) log [DBG] : osd.14 10.199.128.214:6804/11943 reported immediately failed by osd.6 10.199.128.40:6812/12317
2017-11-24 16:09:25.083523 7fdd23dc6700  0 log_channel(cluster) log [WRN] : Health check failed: 1 osds down (OSD_DOWN)
2017-11-24 16:09:25.087241 7fdd1cdb8700  1 mon.hh-ceph-128040@0(leader).log v17642 check_sub sending message to client.94503 10.199.128.40:0/161437639 with 1 entries (version 17642)
2017-11-24 16:09:25.093344 7fdd1cdb8700  1 mon.hh-ceph-128040@0(leader).osd e329 e329: 36 total, 35 up, 36 in
2017-11-24 16:09:25.093857 7fdd1cdb8700  0 log_channel(cluster) log [DBG] : osdmap e329: 36 total, 35 up, 36 in
2017-11-24 16:09:25.094151 7fdd215c1700  0 mon.hh-ceph-128040@0(leader) e1 handle_command mon_command({"prefix": "osd metadata", "id": 30} v 0) v1
2017-11-24 16:09:25.094192 7fdd215c1700  0 log_channel(audit) log [DBG] : from='client.94503 10.199.128.40:0/161437639' entity='mgr.openstack' cmd=[{"prefix": "osd metadata", "id": 30}]: dispatch
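Rather than reading the mon log, the failed osd can also be located by filtering `ceph osd tree`. A minimal sketch, parsing a captured sample so it is self-contained (on a live node you would use `tree=$(ceph osd tree)` instead):

```shell
# Locate down osds by parsing `ceph osd tree` output.
# Sample rows captured from the output above; on a live cluster:
#   tree=$(ceph osd tree)
tree='ID CLASS WEIGHT  TYPE NAME   STATUS REWEIGHT PRI-AFF
13   hdd  6.00000      osd.13      up     1.00000  1.00000
14   hdd  6.00000      osd.14      down   1.00000  1.00000
15   hdd  6.00000      osd.15      up     1.00000  1.00000'

# On osd rows, field 4 is the osd name and field 5 its status.
down=$(printf '%s\n' "$tree" | awk '$5 == "down" { print $4 }')
echo "$down"    # -> osd.14
```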
Recovery procedure
Delete the osd.14 auth key
[root@hh-ceph-128040 tmp]# ceph auth del osd.14
updated
Remove osd.14 from the crush map
[root@hh-ceph-128214 ~]# ceph osd crush remove osd.14
removed item id 14 name 'osd.14' from crush map
Remove osd.14 from the osd map
[root@hh-ceph-128214 ~]# ceph osd rm osd.14
removed osd.14
osd tree after removal (osd.14 and its weight are gone):
Every 2.0s: ceph osd tree                               Sat Nov 25 15:27:41 2017

ID  CLASS WEIGHT    TYPE NAME                    STATUS REWEIGHT PRI-AFF
 -1       210.00000 root default
-10        66.00000     rack racka07
 -3        66.00000         host hh-ceph-128214
 12   hdd   6.00000             osd.12           up     1.00000  1.00000
 13   hdd   6.00000             osd.13           up     1.00000  1.00000
 15   hdd   6.00000             osd.15           up     1.00000  1.00000
 16   hdd   6.00000             osd.16           up     1.00000  1.00000
 17   hdd   6.00000             osd.17           up     1.00000  1.00000
 18   hdd   6.00000             osd.18           up     1.00000  1.00000
 19   hdd   6.00000             osd.19           up     1.00000  1.00000
 20   hdd   6.00000             osd.20           up     1.00000  1.00000
 21   hdd   6.00000             osd.21           up     1.00000  1.00000
 22   hdd   6.00000             osd.22           up     1.00000  1.00000
 23   hdd   6.00000             osd.23           up     1.00000  1.00000
Delete the journal file and rebuild the journal filesystem
[root@hh-ceph-128214 ceph]# rm -rf /var/lib/ceph/journal/ceph-14/journal
[root@hh-ceph-128214 /]# umount /dev/sdn3
[root@hh-ceph-128214 /]# mkfs -t xfs -f /dev/sdn3
meta-data=/dev/sdn3              isize=256    agcount=4, agsize=305152 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=1220608, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@hh-ceph-128214 ~]# mount /dev/sdn3 /var/lib/ceph/journal/ceph-14/
Rebuild the data partition
[root@hh-ceph-128214 tmp]# umount /dev/sdc1
[root@hh-ceph-128214 /]# dd if=/dev/zero of=/dev/sdc bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.59539 s, 176 MB/s
[root@hh-ceph-128214 tmp]# parted -s /dev/sdc mklabel gpt
[root@hh-ceph-128214 tmp]# parted /dev/sdc mkpart primary xfs 1 100%
Information: You may need to update /etc/fstab.
[root@hh-ceph-128214 tmp]# mkfs.xfs -f -i size=1024 /dev/sdc1
meta-data=/dev/sdc1              isize=1024   agcount=6, agsize=268435455 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=1465130240, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@hh-ceph-128214 tmp]# mount /dev/sdc1 /var/lib/ceph/osd/ceph-14/
Initialize the ceph osd (the journal file is recreated automatically)
[root@hh-ceph-128214 /]# ceph-osd -i 14 --mkfs --mkkey
2017-11-24 18:21:42.297329 7fc7dc79bd00 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2017-11-24 18:21:42.473203 7fc7dc79bd00 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2017-11-24 18:21:42.473725 7fc7dc79bd00 -1 read_settings error reading settings: (2) No such file or directory
2017-11-24 18:21:42.782000 7fc7dc79bd00 -1 created object store /var/lib/ceph/osd/ceph-14 for osd.14 fsid c45b752d-5d4d-4d3a-a3b2-04e73eff4ccd
2017-11-24 18:21:42.782044 7fc7dc79bd00 -1 auth: error reading file: /var/lib/ceph/osd/ceph-14/keyring: can't open /var/lib/ceph/osd/ceph-14/keyring: (2) No such file or directory
2017-11-24 18:21:42.782202 7fc7dc79bd00 -1 created new key in keyring /var/lib/ceph/osd/ceph-14/keyring
Create the osd (here the freed id 14 is returned)
[root@hh-ceph-128214 ~]# ceph osd create
14
Restore the auth key
[root@hh-ceph-128214 tmp]# ceph auth add osd.14 osd 'allow *' mon 'allow profile osd' -i /var/lib/ceph/osd/ceph-14/keyring
added key for osd.14
Restore file ownership
[root@hh-ceph-128214 /]# ls -l /var/lib/ceph/journal/ceph-14/ /var/lib/ceph/osd/ceph-14/
/var/lib/ceph/journal/ceph-14/:
total 2097152
-rw-r--r-- 1 root root 2147483648 Nov 24 18:21 journal

/var/lib/ceph/osd/ceph-14/:
total 36
-rw-r--r-- 1 root root 37 Nov 24 18:21 ceph_fsid
drwxr-xr-x 4 root root 61 Nov 24 18:21 current
-rw-r--r-- 1 root root 37 Nov 24 18:21 fsid
-rw------- 1 root root 57 Nov 24 18:21 keyring
-rw-r--r-- 1 root root 21 Nov 24 18:21 magic
-rw-r--r-- 1 root root  6 Nov 24 18:21 ready
-rw-r--r-- 1 root root  4 Nov 24 18:21 store_version
-rw-r--r-- 1 root root 53 Nov 24 18:21 superblock
-rw-r--r-- 1 root root 10 Nov 24 18:21 type
-rw-r--r-- 1 root root  3 Nov 24 18:21 whoami
[root@hh-ceph-128214 /]# chown ceph:ceph -R /var/lib/ceph/journal/ceph-14/ /var/lib/ceph/osd/ceph-14/
Start the ceph osd (the earlier failed start attempts tripped systemd's start-rate limit, so reset-failed is needed first)
[root@hh-ceph-128214 tmp]# systemctl status ceph-osd@14
● ceph-osd@14.service - Ceph object storage daemon osd.14
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; disabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Fri 2017-11-24 17:35:00 CST; 1min 51s ago
  Process: 106773 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
  Process: 106767 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 106773 (code=exited, status=1/FAILURE)

Nov 24 17:34:40 hh-ceph-128214.vclound.com systemd[1]: Unit ceph-osd@14.service entered failed state.
Nov 24 17:34:40 hh-ceph-128214.vclound.com systemd[1]: ceph-osd@14.service failed.
Nov 24 17:35:00 hh-ceph-128214.vclound.com systemd[1]: ceph-osd@14.service holdoff time over, scheduling restart.
Nov 24 17:35:00 hh-ceph-128214.vclound.com systemd[1]: start request repeated too quickly for ceph-osd@14.service
Nov 24 17:35:00 hh-ceph-128214.vclound.com systemd[1]: Failed to start Ceph object storage daemon osd.14.
Nov 24 17:35:00 hh-ceph-128214.vclound.com systemd[1]: Unit ceph-osd@14.service entered failed state.
Nov 24 17:35:00 hh-ceph-128214.vclound.com systemd[1]: ceph-osd@14.service failed.

[root@hh-ceph-128214 tmp]# systemctl start ceph-osd@14
Job for ceph-osd@14.service failed because start of the service was attempted too often.
See "systemctl status ceph-osd@14.service" and "journalctl -xe" for details.
To force a start use "systemctl reset-failed ceph-osd@14.service" followed by "systemctl start ceph-osd@14.service" again.
[root@hh-ceph-128214 tmp]# systemctl reset-failed ceph-osd@14
[root@hh-ceph-128214 tmp]# systemctl start ceph-osd@14
[root@hh-ceph-128214 tmp]# systemctl status ceph-osd@14
● ceph-osd@14.service - Ceph object storage daemon osd.14
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2017-11-24 17:37:17 CST; 3s ago
  Process: 106871 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 106877 (ceph-osd)
   CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@14.service
           └─106877 /usr/bin/ceph-osd -f --cluster ceph --id 14 --setuser ceph --setgroup ceph

Nov 24 17:37:17 hh-ceph-128214.vclound.com systemd[1]: Starting Ceph object storage daemon osd.14...
Nov 24 17:37:17 hh-ceph-128214.vclound.com systemd[1]: Started Ceph object storage daemon osd.14.
Nov 24 17:37:17 hh-ceph-128214.vclound.com ceph-osd[106877]: starting osd.14 at - osd_data /var/lib/ceph/osd/ceph-14 /var/lib/ceph/journal/ceph-14/journal
Nov 24 17:37:18 hh-ceph-128214.vclound.com ceph-osd[106877]: 2017-11-24 17:37:18.035052 7fbaaf369d00 -1 journal FileJournal::_open: disabling aio for non-block ...o anyway
Nov 24 17:37:18 hh-ceph-128214.vclound.com ceph-osd[106877]: 2017-11-24 17:37:18.047920 7fbaaf369d00 -1 osd.14 0 log_to_monitors {default=true}
Nov 24 17:37:18 hh-ceph-128214.vclound.com ceph-osd[106877]: 2017-11-24 17:37:18.054256 7fba96117700 -1 osd.14 0 waiting for initial osdmap
Hint: Some lines were ellipsized, use -l to show in full.
Verification
Current ceph status:
  cluster:
    id:     c45b752d-5d4d-4d3a-a3b2-04e73eff4ccd
    health: HEALTH_WARN
            Degraded data redundancy: 8965/137559 objects degraded (6.517%), 60 pgs unclean, 206 pgs degraded

  services:
    mon: 3 daemons, quorum hh-ceph-128040,hh-ceph-128214,hh-ceph-128215
    mgr: openstack(active)
    osd: 36 osds: 36 up, 36 in        <- note: all 36 osds are up and in again

  data:
    pools:   1 pools, 2048 pgs
    objects: 45853 objects, 178 GB
    usage:   540 GB used, 195 TB / 196 TB avail
    pgs:     8965/137559 objects degraded (6.517%)
             1842 active+clean
             201  active+recovery_wait+degraded
             5    active+recovering+degraded

  io:
    recovery: 168 MB/s, 42 objects/s
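Backfill can be followed by re-running `ceph -s` until the degraded ratio drops to zero. A small sketch extracting the percentage from the status line, parsing a captured sample so it is self-contained (on a live cluster you would feed it `ceph -s` output):

```shell
# Extract the degraded percentage from a `ceph -s` pgs line.
# Captured sample line; on a live cluster:  status=$(ceph -s)
status='    pgs:     8965/137559 objects degraded (6.517%)'
pct=$(printf '%s\n' "$status" | sed -n 's/.*(\([0-9.]*\)%).*/\1/p')
echo "$pct"    # -> 6.517
```

Wrapping this in a loop with a sleep gives a simple "wait until recovery finishes" check.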
osd tree after recovery (osd.14 is back, with a slightly smaller weight):
[root@hh-ceph-128214 ceph]# ceph osd tree
ID  CLASS WEIGHT    TYPE NAME                    STATUS REWEIGHT PRI-AFF
 -1       215.45609 root default
-10        71.45609     rack racka07
 -3        71.45609         host hh-ceph-128214
 12   hdd   6.00000             osd.12           up     1.00000  1.00000
 13   hdd   6.00000             osd.13           up     1.00000  1.00000
 14   hdd   5.45609             osd.14           up     1.00000  1.00000
 15   hdd   6.00000             osd.15           up     1.00000  1.00000
 16   hdd   6.00000             osd.16           up     1.00000  1.00000
 17   hdd   6.00000             osd.17           up     1.00000  1.00000
 18   hdd   6.00000             osd.18           up     1.00000  1.00000
 19   hdd   6.00000             osd.19           up     1.00000  1.00000
 20   hdd   6.00000             osd.20           up     1.00000  1.00000
 21   hdd   6.00000             osd.21           up     1.00000  1.00000
 22   hdd   6.00000             osd.22           up     1.00000  1.00000
 23   hdd   6.00000             osd.23           up     1.00000  1.00000
 -9        72.00000     rack racka12
 -2        72.00000         host hh-ceph-128040
  0   hdd   6.00000             osd.0            up     1.00000  0.50000
  1   hdd   6.00000             osd.1            up     1.00000  1.00000
  2   hdd   6.00000             osd.2            up     1.00000  1.00000
  3   hdd   6.00000             osd.3            up     1.00000  1.00000
Summary
When recovering a data disk, the failed osd must be removed from the cluster first (ceph osd rm osd.14); the older ceph 0.87 release did not require this step during recovery.
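The whole procedure above can be condensed into one script. The sketch below assumes the same layout as this test (osd id 14, data disk /dev/sdc with partition /dev/sdc1, journal partition /dev/sdn3); it also assumes `ceph osd create` hands back the freed id 14, which must be verified on a real cluster. By default it only records and prints each command (DRY_RUN=1) instead of touching a live cluster:

```shell
#!/bin/sh
# Condensed recovery sequence for a failed filestore data disk.
# Assumptions (match this test, adjust for your cluster):
#   osd id 14, data disk /dev/sdc -> /dev/sdc1, journal on /dev/sdn3.
# DRY_RUN=1 (the default) only records and prints each command.
set -eu

OSD=14
DATA_DISK=/dev/sdc
DATA_PART=/dev/sdc1
JOURNAL_PART=/dev/sdn3
DATA_MNT=/var/lib/ceph/osd/ceph-$OSD
JOURNAL_MNT=/var/lib/ceph/journal/ceph-$OSD
DRY_RUN=${DRY_RUN:-1}

PLAN=""
run() {
    PLAN="$PLAN$*
"
    [ "$DRY_RUN" = 1 ] || "$@"
}

# 1. drop the failed osd from the auth, crush and osd maps
run ceph auth del osd.$OSD
run ceph osd crush remove osd.$OSD
run ceph osd rm osd.$OSD

# 2. rebuild the journal filesystem
run umount $JOURNAL_PART
run mkfs -t xfs -f $JOURNAL_PART
run mount $JOURNAL_PART $JOURNAL_MNT

# 3. repartition and reformat the replaced data disk
run umount $DATA_PART
run parted -s $DATA_DISK mklabel gpt
run parted -s $DATA_DISK mkpart primary xfs 1 100%
run mkfs.xfs -f -i size=1024 $DATA_PART
run mount $DATA_PART $DATA_MNT

# 4. re-initialize the osd, restore its key and fix ownership
run ceph-osd -i $OSD --mkfs --mkkey
run ceph osd create   # assumed to return the freed id $OSD
run ceph auth add osd.$OSD osd 'allow *' mon 'allow profile osd' -i $DATA_MNT/keyring
run chown ceph:ceph -R $JOURNAL_MNT $DATA_MNT

# 5. start the daemon (reset-failed clears systemd's start-rate limit)
run systemctl reset-failed ceph-osd@$OSD
run systemctl start ceph-osd@$OSD

printf '%s' "$PLAN"
```

Running it with DRY_RUN=1 prints the planned commands for review; only after checking them against the actual device names would one rerun with DRY_RUN=0 as root.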