Unfound objects

Understand and troubleshoot unfound objects.

The ceph health command returns a message that contains the unfound keyword, similar to the following example:

HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)

What this means

Ceph marks objects as unfound when it knows that these objects, or newer copies of them, exist but it cannot find them. As a result, Ceph cannot recover these objects, and the recovery process cannot proceed.

Example situation

A placement group stores data on osd.1 and osd.2.

osd.1 goes down.
osd.2 handles some write operations.
osd.1 comes up.
A peering process between osd.1 and osd.2 starts, and the objects missing on osd.1 are queued for recovery.
Before Ceph copies new objects, osd.2 goes down.

As a result, osd.1 knows that these objects exist, but no running OSD has a copy of them.

In this scenario, Ceph waits for the failed node to become accessible again, and the unfound objects block the recovery process.
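The example situation above can be replayed with a small toy model. This is only an illustrative sketch, not how Ceph implements peering or recovery: the Osd class, the find_unfound function, and the object names obj_a and obj_b are hypothetical and exist only for this example.

```python
class Osd:
    """Hypothetical stand-in for an OSD: tracks up/down state and object versions."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.objects = {}  # object name -> version held by this OSD


def find_unfound(known_latest, osds):
    """Return objects whose latest known version is held by no OSD that is up."""
    return sorted(
        name
        for name, version in known_latest.items()
        if not any(o.up and o.objects.get(name) == version for o in osds)
    )


osd1, osd2 = Osd("osd.1"), Osd("osd.2")
osd1.objects["obj_a"] = osd2.objects["obj_a"] = 1  # both replicas are in sync

osd1.up = False                                    # osd.1 goes down
osd2.objects.update({"obj_a": 2, "obj_b": 1})      # osd.2 handles some writes

osd1.up = True                                     # osd.1 comes up
known_latest = dict(osd2.objects)                  # peering: osd.1 learns what exists

osd2.up = False                                    # osd.2 goes down before recovery

unfound = find_unfound(known_latest, [osd1, osd2])
print(unfound)
```

In this model, both objects are reported as unfound because the only OSD holding their latest versions is down. Setting osd2.up back to True makes them findable again, which mirrors why bringing the failed OSD back is the first thing to try.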

Troubleshooting this problem

Use the ceph health detail command to determine which placement group contains unfound objects.

For example:

[ceph: root@host01 /]# ceph health detail
HEALTH_WARN 1 pgs recovering; 1 pgs stuck unclean; recovery 5/937611 objects degraded (0.001%); 1/312537 unfound (0.000%)
pg 3.8a5 is stuck unclean for 803946.712780, current state active+recovering, last acting [320,248,0]
pg 3.8a5 is active+recovering, acting [320,248,0], 1 unfound
recovery 5/937611 objects degraded (0.001%); 1/312537 unfound (0.000%)
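When many placement groups are listed, the affected ones can be extracted from the saved command output with a short script. The following is a minimal sketch that assumes the per-PG lines look like the example output above ("pg ID is STATE, acting [...], N unfound"); the exact wording can differ between Ceph releases, so the pattern might need adjusting. The function name pgs_with_unfound is hypothetical.

```python
import re

# Matches per-PG lines such as:
#   pg 3.8a5 is active+recovering, acting [320,248,0], 1 unfound
# Assumption: the line format follows the example output in this section.
PG_UNFOUND = re.compile(r"^\s*pg (\S+) is .*?(\d+) unfound\s*$")


def pgs_with_unfound(health_detail):
    """Map each placement group ID to its number of unfound objects."""
    result = {}
    for line in health_detail.splitlines():
        match = PG_UNFOUND.match(line)
        if match:
            result[match.group(1)] = int(match.group(2))
    return result


sample = """\
HEALTH_WARN 1 pgs recovering; 1 pgs stuck unclean; recovery 5/937611 objects degraded (0.001%); 1/312537 unfound (0.000%)
pg 3.8a5 is stuck unclean for 803946.712780, current state active+recovering, last acting [320,248,0]
pg 3.8a5 is active+recovering, acting [320,248,0], 1 unfound
recovery 5/937611 objects degraded (0.001%); 1/312537 unfound (0.000%)
"""

print(pgs_with_unfound(sample))
```

The summary lines are skipped because they do not start with "pg", and the "stuck unclean" line is skipped because it does not end with an unfound count.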

List more information about the placement group.

ceph pg ID query

Replace ID with the ID of the placement group containing the unfound objects.

For example:

[ceph: root@host01 /]# ceph pg 3.8a5 query
{ "state": "active+recovering",
  "epoch": 10741,
  "up": [
        320,
        248,
        0],
  "acting": [
        320,
        248,
        0],
<snip>
  "recovery_state": [
        { "name": "Started\/Primary\/Active",
          "enter_time": "2021-08-28 19:30:12.058136",
          "might_have_unfound": [
                { "osd": "0",
                  "status": "already probed"},
                { "osd": "248",
                  "status": "already probed"},
                { "osd": "301",
                  "status": "already probed"},
                { "osd": "362",
                  "status": "already probed"},
                { "osd": "395",
                  "status": "already probed"},
                { "osd": "429",
                  "status": "osd is down"}],
          "recovery_progress": { "backfill_targets": [],
              "waiting_on_backfill": [],
              "last_backfill_started": "0\/\/0\/\/-1",
              "backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "peer_backfill_info": [],
              "backfills_in_flight": [],
              "recovering": [],
              "pg_backend": { "pull_from_peer": [],
                  "pushing": []}},
          "scrub": { "scrubber.epoch_start": "0",
              "scrubber.active": 0,
              "scrubber.block_writes": 0,
              "scrubber.finalizing": 0,
              "scrubber.waiting_on": 0,
              "scrubber.waiting_on_whom": []}},
        { "name": "Started",
          "enter_time": "2021-08-28 19:30:11.044020"}],

The might_have_unfound section includes OSDs where Ceph tried to locate the unfound objects:

The already probed status indicates that Ceph queried that OSD and could not locate the unfound objects on it.
The osd is down status indicates that Ceph cannot contact that OSD.
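For a long might_have_unfound list, grouping the OSDs by status makes the down OSDs easier to spot. The following is a minimal sketch that assumes the query output is valid JSON with the recovery_state and might_have_unfound fields shown in the example above; the sample data is a trimmed copy of that example, and the function name osds_by_probe_status is hypothetical.

```python
import json


def osds_by_probe_status(pg_query):
    """Group the OSDs listed under might_have_unfound by their status string."""
    grouped = {}
    for state in pg_query.get("recovery_state", []):
        for entry in state.get("might_have_unfound", []):
            # OSD IDs are kept as strings, exactly as they appear in the output.
            grouped.setdefault(entry["status"], []).append(entry["osd"])
    return grouped


# Trimmed sample based on the example output in this section.
sample = json.loads("""
{"state": "active+recovering",
 "recovery_state": [
   {"name": "Started/Primary/Active",
    "might_have_unfound": [
      {"osd": "0", "status": "already probed"},
      {"osd": "248", "status": "already probed"},
      {"osd": "429", "status": "osd is down"}]},
   {"name": "Started"}]}
""")

statuses = osds_by_probe_status(sample)
print(statuses.get("osd is down", []))
```

The OSDs listed under the osd is down status are the ones to troubleshoot first, because they might still hold the unfound objects.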

Troubleshoot the OSDs that are marked as down. See Down OSDs for details.
If you are unable to fix the problem that causes the OSD to be down, open a support ticket. For more information, see IBM Support.