Slow requests or requests are blocked

Understand and troubleshoot slow or blocked requests.

When a ceph-osd daemon is slow to respond to a request, the ceph health detail command returns a warning message similar to the following example:
HEALTH_WARN 30 requests are blocked > 32 sec; 3 osds have slow requests
30 ops are blocked > 268435 sec
1 ops are blocked > 268435 sec on osd.11
1 ops are blocked > 268435 sec on osd.18
28 ops are blocked > 268435 sec on osd.39
3 osds have slow requests
In addition, the Ceph logs include messages similar to the following examples:
2022-05-24 13:18:10.024659 osd.1 127.0.0.1:6812/3032 9 : cluster [WRN] 6 slow requests, 6 included below; oldest blocked for > 61.758455 secs

2022-05-25 03:44:06.510583 osd.50 [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
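
As a rough illustration of how to read the ceph health detail output shown earlier, the following Python sketch totals the blocked operations per OSD. The function name blocked_ops_by_osd and the regular expression are assumptions made for this example; they are based only on the sample output in this topic, and the exact wording of the message can differ between Ceph releases.

```python
import re

# Matches per-OSD lines such as "28 ops are blocked > 268435 sec on osd.39".
# The pattern is derived from the sample output only (an assumption).
OSD_LINE = re.compile(r"^(\d+) ops are blocked > ([\d.]+) sec on (osd\.\d+)$")


def blocked_ops_by_osd(health_detail: str) -> dict:
    """Return a mapping of OSD name to its total number of blocked ops."""
    counts = {}
    for line in health_detail.splitlines():
        match = OSD_LINE.match(line.strip())
        if match:
            ops, _threshold, osd = match.groups()
            counts[osd] = counts.get(osd, 0) + int(ops)
    return counts


# Sample text copied from the example output in this topic.
SAMPLE = """HEALTH_WARN 30 requests are blocked > 32 sec; 3 osds have slow requests
30 ops are blocked > 268435 sec
1 ops are blocked > 268435 sec on osd.11
1 ops are blocked > 268435 sec on osd.18
28 ops are blocked > 268435 sec on osd.39
3 osds have slow requests"""

if __name__ == "__main__":
    # List the OSDs with the most blocked operations first.
    ranked = sorted(blocked_ops_by_osd(SAMPLE).items(), key=lambda kv: -kv[1])
    for osd, ops in ranked:
        print(f"{osd}: {ops} blocked ops")
```

Sorting by count makes the most affected OSD stand out, which is a reasonable place to start when you look for shared hardware.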

What this means

An OSD with slow requests is any OSD that cannot service the I/O operations in its queue within the time defined by the osd_op_complaint_time parameter. By default, this parameter is set to 30 seconds.

The main causes of OSDs having slow requests are:

  • Problems with the underlying hardware, such as disk drives, hosts, racks, or network switches.
  • Problems with the network. These problems are usually connected with flapping OSDs. For more information, see Flapping OSDs.
  • System load.

Table 1 shows the types of slow requests. Use the dump_historic_ops administration socket command to determine the type of a slow request.

Table 1. Slow request types
Slow request type              Description
waiting for rw locks           The OSD is waiting to acquire a lock on a placement group for the operation.
waiting for subops             The OSD is waiting for replica OSDs to apply the operation to the journal.
no flag points reached         The OSD did not reach any major operation milestone.
waiting for degraded object    The OSDs have not yet replicated an object the specified number of times.

For more information about the administration socket, see Using the Ceph administration socket.
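
The slow request type also appears at the end of the per-request log message, after the word "currently". The following Python sketch, offered as an illustration rather than a supported tool, extracts the age and the type from such a line and matches the type against the entries in Table 1. The function name classify_slow_request and the regular expression are assumptions based on the sample log line in this topic.

```python
import re

# Default value of osd_op_complaint_time, in seconds, as stated in this topic.
DEFAULT_COMPLAINT_TIME = 30.0

# Slow request types and descriptions from Table 1.
SLOW_REQUEST_TYPES = {
    "waiting for rw locks": "waiting to acquire a lock on a placement group",
    "waiting for subops": "waiting for replica OSDs to apply the operation",
    "no flag points reached": "no major operation milestone reached",
    "waiting for degraded object": "object not yet replicated enough times",
}

# Pattern derived from the sample log line only (an assumption).
SLOW_REQUEST = re.compile(r"slow request ([\d.]+) seconds old.*currently (.+)$")


def classify_slow_request(log_line: str):
    """Return (age_in_seconds, slow_request_type), or None if no match."""
    match = SLOW_REQUEST.search(log_line.strip())
    if match is None:
        return None
    age = float(match.group(1))
    state = match.group(2)
    for known in SLOW_REQUEST_TYPES:
        if state.startswith(known):
            return age, known
    return age, "unknown"


# Sample line copied from the example log output in this topic.
SAMPLE_LINE = (
    "2022-05-25 03:44:06.510583 osd.50 [WRN] slow request 30.005692 seconds old, "
    "received at {date-time}: osd_op(client.4240.0:8 "
    "benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 "
    "currently waiting for subops from [610]"
)

if __name__ == "__main__":
    age, kind = classify_slow_request(SAMPLE_LINE)
    over = age - DEFAULT_COMPLAINT_TIME
    print(f"{kind}: {age:.1f}s old, {over:.3f}s past the complaint time")
```

The summary line that reports the number of slow requests does not contain a request type, so the function returns None for it.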

Troubleshooting this problem

  1. Determine whether the OSDs with slow or blocked requests share a common piece of hardware, for example, a disk drive, host, rack, or network switch.
  2. If the OSDs share a disk:
    1. Use the smartmontools utility to check the health of the disk, or check the logs for errors on the disk.
      Note: The smartmontools utility is included in the smartmontools package.
    2. Use the iostat utility to get the I/O wait report (%iowait) on the OSD disk to determine whether the disk is under heavy load.
      Note: The iostat utility is included in the sysstat package.
  3. If the OSDs share the node with another service:
    1. Check the RAM and CPU usage.
    2. Use the netstat utility to see the network statistics on the Network Interface Controllers (NICs) and troubleshoot any networking issues.
  4. If the OSDs share a rack, check the network switch for the rack. For example, if you use jumbo frames, verify that the NIC in the path has jumbo frames set.

  5. If you are unable to determine a common piece of hardware shared by OSDs with slow requests, or to troubleshoot and fix hardware and networking problems, open a support ticket with IBM Support.
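
The first troubleshooting step, finding hardware that the affected OSDs share, can be sketched in Python as follows. The PLACEMENT mapping, its host and rack names, and the function name shared_components are hypothetical values invented for this example. In a real cluster you would build the mapping from your own topology information, for example from the output of the ceph osd tree command.

```python
from collections import defaultdict

# Hypothetical placement of each OSD; the host and rack names are invented.
PLACEMENT = {
    "osd.11": {"host": "node-a", "rack": "r1"},
    "osd.18": {"host": "node-b", "rack": "r1"},
    "osd.39": {"host": "node-c", "rack": "r2"},
}


def shared_components(slow_osds, placement):
    """Return the components that more than one slow OSD has in common."""
    groups = defaultdict(set)
    for osd in slow_osds:
        # OSDs that are missing from the mapping are skipped.
        for level, name in placement.get(osd, {}).items():
            groups[(level, name)].add(osd)
    return {
        key: sorted(osds) for key, osds in groups.items() if len(osds) > 1
    }


if __name__ == "__main__":
    slow = ["osd.11", "osd.18", "osd.39"]
    for (level, name), osds in shared_components(slow, PLACEMENT).items():
        print(f"{level} {name} is shared by {', '.join(osds)}")
```

An empty result means that the mapping shows no common component, which corresponds to the last step of the procedure.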