Stale placement groups
The ceph health command lists some placement groups (PGs) as stale:
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
What this means
The Monitor marks a placement group as stale when it does not receive any status update from the primary OSD of the placement group’s acting set, or when other OSDs report that the primary OSD is down.
Usually, PGs enter the stale state after you start the storage cluster and remain in it until the peering process completes. However, if PGs remain stale for longer than expected, the primary OSD for those PGs might be down or might not be reporting PG statistics to the Monitor. When the primary OSD that stores the stale PGs is back up, Ceph starts to recover the PGs.
The mon_osd_report_timeout setting determines how long the Monitors wait for a report from an OSD before they mark that OSD as down. In upstream Ceph, this parameter is set to 900 by default, which means that an OSD that sends no reports for 15 minutes is marked as down.
For more information, see Monitoring Placement Group Sets and Down OSDs.
Troubleshooting this problem
- Use the ceph health detail command to identify which PGs are stale and on which OSDs they are stored. The output includes information similar to the following example:

  [ceph: root@host01 /]# ceph health detail
  HEALTH_WARN 24 pgs stale; 3/300 in osds are down
  ...
  pg 2.5 is stuck stale+active+remapped, last acting [2,0]
  ...
  osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
  osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
  osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861

- Troubleshoot any problems with the OSDs that are marked as down.
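
When many PGs are stale, it can help to summarize the ceph health detail output before working through the down OSDs. The following Python sketch is one possible way to do that, not a supported tool. It assumes the output lines have the same shape as the example above, and it assumes that the first OSD listed in "last acting" is the primary. The names parse_health_detail and primaries_to_check are hypothetical helpers introduced only for this illustration.

```python
import re

# Sample text modeled on the `ceph health detail` example in this section.
# The "..." lines stand for elided output and are ignored by the parser.
SAMPLE = """\
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
...
pg 2.5 is stuck stale+active+remapped, last acting [2,0]
...
osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861
"""

# Assumed line shapes, taken from the example output only.
PG_RE = re.compile(r"^pg (\S+) is stuck (\S+), last acting \[([\d,]*)\]")
OSD_RE = re.compile(r"^osd\.(\d+) is down since epoch (\d+)")


def parse_health_detail(text):
    """Return ({pg_id: acting_set}, {down_osd_ids}) for stale PGs."""
    stale_pgs = {}
    down_osds = set()
    for raw in text.splitlines():
        line = raw.strip()
        pg = PG_RE.match(line)
        if pg:
            # PG states are joined with "+", for example stale+active+remapped.
            if "stale" in pg.group(2).split("+"):
                acting = [int(x) for x in pg.group(3).split(",") if x]
                stale_pgs[pg.group(1)] = acting
            continue
        osd = OSD_RE.match(line)
        if osd:
            down_osds.add(int(osd.group(1)))
    return stale_pgs, down_osds


def primaries_to_check(stale_pgs):
    """Return the sorted IDs of the first (assumed primary) OSD of each stale PG."""
    return sorted({acting[0] for acting in stale_pgs.values() if acting})


if __name__ == "__main__":
    stale_pgs, down_osds = parse_health_detail(SAMPLE)
    print("stale PGs and last acting sets:", stale_pgs)
    print("OSDs reported down:", sorted(down_osds))
    print("primary OSDs of stale PGs:", primaries_to_check(stale_pgs))
```

To use the sketch against a live cluster, pass the captured text of ceph health detail to parse_health_detail instead of SAMPLE. The OSDs reported as down and the primary OSDs of the stale PGs are the first candidates to troubleshoot.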