Placement groups are down
Understand and troubleshoot placement groups that are in a down state.
The ceph health detail command reports that some placement groups are down. For example,
HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
...
pg 0.5 is down+peering
pg 1.4 is down+peering
...
osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651
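When many placement groups are reported, it can help to extract the IDs of the down ones programmatically. The following is a minimal sketch, not an official tool: the function name down_pgs is hypothetical, and the pattern assumes per-PG lines of the form "pg ID is STATE" as in the example output above. Other Ceph releases may word these lines differently.

```python
import re
from typing import List

# Assumed line shape, taken from the example output: "pg 0.5 is down+peering".
# A PG ID is "<pool number>.<hexadecimal PG number>".
PG_STATE_RE = re.compile(r"^\s*pg (\d+\.[0-9a-f]+) is ([a-z_+]+)", re.MULTILINE)


def down_pgs(health_detail: str) -> List[str]:
    """Return the IDs of placement groups whose state includes 'down'."""
    return [
        pg_id
        for pg_id, state in PG_STATE_RE.findall(health_detail)
        if "down" in state.split("+")  # states are joined with '+'
    ]


if __name__ == "__main__":
    sample = (
        "HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering\n"
        "pg 0.5 is down+peering\n"
        "pg 1.4 is down+peering\n"
        "osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651\n"
    )
    print(down_pgs(sample))  # ['0.5', '1.4']
```

Each returned ID can then be passed to the ceph pg ID query command described in the troubleshooting steps.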
What this means
In certain cases, the peering process can be blocked, which prevents a placement group from becoming active and usable. Usually, the failure of an OSD causes the peering failure.
For more information, see Ceph OSD peering.
Troubleshooting this problem
Determine what blocks the peering process.
ceph pg ID query
Replace ID with the ID of the placement group that is down. For example,
[ceph: root@host01 /]# ceph pg 0.5 query
{ "state": "down+peering",
...
"recovery_state": [
{ "name": "Started\/Primary\/Peering\/GetInfo",
"enter_time": "2021-08-06 14:40:16.169679",
"requested_info_from": []},
{ "name": "Started\/Primary\/Peering",
"enter_time": "2021-08-06 14:40:16.169659",
"probing_osds": [
0,
1],
"blocked": "peering is blocked due to down osds",
"down_osds_we_would_probe": [
1],
"peering_blocked_by": [
{ "osd": 1,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let us proceed"}]},
{ "name": "Started",
"enter_time": "2021-08-06 14:40:16.169513"}
]
}
The recovery_state section includes information on why the peering process is blocked.
- If the output includes the peering is blocked due to down osds error message, see Down OSDs.
- If you see any other error message, open a support ticket with IBM Support.
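The check on the recovery_state section can also be scripted. The following is a minimal sketch, assuming the query output is complete, valid JSON with the field names shown in the example above (recovery_state, blocked, peering_blocked_by, osd). The function name peering_blockers is hypothetical, and the sample is a trimmed copy of the example output with the elided part removed.

```python
import json
from typing import List, Tuple


def peering_blockers(pg_query_json: str) -> List[Tuple[str, List[int]]]:
    """Return (blocked message, blocking OSD IDs) for each blocked recovery state."""
    report = json.loads(pg_query_json)
    blockers = []
    for entry in report.get("recovery_state", []):
        reason = entry.get("blocked")
        if reason is None:
            continue  # this state does not report a blocked peering process
        osds = [item["osd"] for item in entry.get("peering_blocked_by", [])]
        blockers.append((reason, osds))
    return blockers


# Trimmed copy of the example query output (raw string keeps the "\/" escapes).
SAMPLE = r'''
{ "state": "down+peering",
  "recovery_state": [
    { "name": "Started\/Primary\/Peering\/GetInfo",
      "enter_time": "2021-08-06 14:40:16.169679",
      "requested_info_from": []},
    { "name": "Started\/Primary\/Peering",
      "enter_time": "2021-08-06 14:40:16.169659",
      "probing_osds": [0, 1],
      "blocked": "peering is blocked due to down osds",
      "down_osds_we_would_probe": [1],
      "peering_blocked_by": [
        { "osd": 1,
          "current_lost_at": 0,
          "comment": "starting or marking this osd lost may let us proceed"}]},
    { "name": "Started",
      "enter_time": "2021-08-06 14:40:16.169513"}
  ]
}
'''

if __name__ == "__main__":
    print(peering_blockers(SAMPLE))
    # [('peering is blocked due to down osds', [1])]
```

An empty result means that no recovery state reports a blocked message. On a live cluster, the input would be the text printed by ceph pg ID query rather than the embedded sample.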