Placement groups are down

Understand and troubleshoot placement groups that are in a down state.

The ceph health detail command reports that some placement groups are down. For example,

HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
...
pg 0.5 is down+peering
pg 1.4 is down+peering
...
osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651

What this means

In certain cases, the peering process can be blocked, which prevents a placement group from becoming active and usable. Usually, a failure of an OSD causes the peering failures.

For more information, see Ceph OSD peering.

Troubleshooting this problem

Determine what blocks the peering process.

ceph pg ID query

Replace ID with the ID of the placement group that is down.

For example,

[ceph: root@host01 /]#  ceph pg 0.5 query

{ "state": "down+peering",
  ...
  "recovery_state": [
       { "name": "Started\/Primary\/Peering\/GetInfo",
         "enter_time": "2021-08-06 14:40:16.169679",
         "requested_info_from": []},
       { "name": "Started\/Primary\/Peering",
         "enter_time": "2021-08-06 14:40:16.169659",
         "probing_osds": [
               0,
               1],
         "blocked": "peering is blocked due to down osds",
         "down_osds_we_would_probe": [
               1],
         "peering_blocked_by": [
               { "osd": 1,
                 "current_lost_at": 0,
                 "comment": "starting or marking this osd lost may let us proceed"}]},
       { "name": "Started",
         "enter_time": "2021-08-06 14:40:16.169513"}
   ]
}

The recovery_state section includes information on why the peering process is blocked.

If the output includes the peering is blocked due to down osds error message, see Down OSDs.
If you see any other error message, open a support ticket with IBM Support.