Ceph PGs stuck and unable to recover (from the ceph-devel mailing list)
Source: Internet · Editor: 程序博客网 · Posted: 2024/06/09 22:41
ceph health detail
HEALTH_WARN 2 pgs down; 2 pgs peering; 2 pgs stuck inactive
pg 1.165 is stuck inactive since forever, current state down+remapped+peering, last acting [38,48]
pg 1.60 is stuck inactive since forever, current state down+remapped+peering, last acting [66,40]
pg 1.60 is down+remapped+peering, acting [66,40]
pg 1.165 is down+remapped+peering, acting [38,48]
[root@cc1 ~]# ceph -s
cluster 8cdfbff9-b7be-46de-85bd-9d49866fcf60
health HEALTH_WARN
2 pgs down
2 pgs peering
2 pgs stuck inactive
monmap e1: 3 mons at {cc1=192.168.128.1:6789/0,cc2=192.168.128.2:6789/0,cc3=192.168.128.3:6789/0}
election epoch 872, quorum 0,1,2 cc1,cc2,cc3
osdmap e115175: 100 osds: 88 up, 86 in; 2 remapped pgs
pgmap v67583069: 3520 pgs, 17 pools, 26675 GB data, 4849 kobjects
76638 GB used, 107 TB / 182 TB avail
3515 active+clean
3 active+clean+scrubbing+deep
2 down+remapped+peering
client io 0 B/s rd, 869 kB/s wr, 14 op/s rd, 113 op/s wr
The reason you can't query a PG is that the OSD is throttling
incoming work and the throttle is exhausted (the PG can't do work, so it
isn't making progress). A workaround for jewel is to restart the OSD
serving the PG and run the query quickly after that (probably in a loop, so
that you catch it after the OSD starts up but before the throttle is
exhausted again). In luminous this is fixed.
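The restart-then-query workaround can be scripted. Below is a sketch; the PG and OSD ids are assumptions taken from the `last acting` sets in the health output above, and the commands are echoed as a dry run (remove the `echo` prefixes to actually execute on a cluster node):

```shell
#!/bin/sh
# Hypothetical ids, taken from the 'last acting' output above.
pgid=1.165
osd=38

# Restart the OSD serving the PG (systemd deployment assumed),
# then loop the query so we hit the window after startup but
# before the op throttle is exhausted again.
echo "systemctl restart ceph-osd@${osd}"
echo "until ceph pg ${pgid} query > /tmp/pg-${pgid}.json; do sleep 1; done"
```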
Once you have the query output ('ceph tell $pgid query') you'll be able to
tell what is preventing the PG from peering.
You can identify the osd(s) hosting the pg with 'ceph pg map $pgid'.
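For example, mapping one of the stuck PGs above to its hosting OSDs (again echoed as a dry run; the pgid comes from the health output, and the output shape shown in the comment is illustrative):

```shell
#!/bin/sh
# pgid from the 'ceph health detail' output above.
pgid=1.165
echo "ceph pg map ${pgid}"
# The real command prints something of the shape:
#   osdmap eNNNNN pg 1.165 (1.165) -> up [...] acting [38,48]
```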
If you haven't deleted the data, you should start the OSDs back up.
If they are partially damaged, you can use ceph-objectstore-tool to
extract just the PGs in question (to make sure you haven't lost anything),
inject them on some other OSD(s), restart those, and *then* mark the
bad OSDs as 'lost'.
If all else fails, you can just mark those OSDs 'lost', but in doing so
you might be telling the cluster to lose data.
The best thing to do is definitely to get those OSDs started again.
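The rescue path above (export the PGs from the damaged OSDs with ceph-objectstore-tool, import them on a healthy OSD, and only then mark the bad OSDs lost) might look like this. Every id and path below is an assumption for illustration, and the commands are echoed as a dry run since they require a live cluster:

```shell
#!/bin/sh
# Hypothetical ids and paths, for illustration only.
pgid=1.165
bad_osd=38
good_osd=50                        # a healthy OSD chosen to receive the PG
export_file=/tmp/pg-${pgid}.export

# Both OSDs must be stopped while ceph-objectstore-tool touches their stores.
echo "systemctl stop ceph-osd@${bad_osd}"
echo "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${bad_osd} --op export --pgid ${pgid} --file ${export_file}"
echo "systemctl stop ceph-osd@${good_osd}"
echo "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${good_osd} --op import --file ${export_file}"
echo "systemctl start ceph-osd@${good_osd}"

# Only after the data is safely imported and the PG recovers:
echo "ceph osd lost ${bad_osd} --yes-i-really-mean-it"
```

Marking the OSD lost is the irreversible step, which is why it comes last here, matching the ordering stressed in the reply above.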