Slow requests or requests are blocked
Understand and troubleshoot slow or blocked requests.
When the ceph-osd daemon is slow to respond to a request, the ceph health detail command returns an error message similar to the following example:
HEALTH_WARN 30 requests are blocked > 32 sec; 3 osds have slow requests
30 ops are blocked > 268435 sec
1 ops are blocked > 268435 sec on osd.11
1 ops are blocked > 268435 sec on osd.18
28 ops are blocked > 268435 sec on osd.39
3 osds have slow requests
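To see which OSDs account for most of the blocked operations, the per-OSD lines of this output can be tallied with a short script. The following is a minimal Python sketch, assuming the output follows the wording of the example above; the function name blocked_ops_per_osd is hypothetical, and the exact text of the health output can differ between Ceph releases.

```python
import re

# Matches per-OSD lines such as "1 ops are blocked > 268435 sec on osd.11".
# The pattern is an assumption based on the example output in this topic.
_PER_OSD = re.compile(r"(\d+) ops are blocked > [\d.]+ sec on (osd\.\d+)")


def blocked_ops_per_osd(health_detail: str) -> dict[str, int]:
    """Return a mapping of OSD name to its number of blocked operations."""
    counts: dict[str, int] = {}
    for match in _PER_OSD.finditer(health_detail):
        ops, osd = int(match.group(1)), match.group(2)
        counts[osd] = counts.get(osd, 0) + ops
    return counts


sample = """\
HEALTH_WARN 30 requests are blocked > 32 sec; 3 osds have slow requests
30 ops are blocked > 268435 sec
1 ops are blocked > 268435 sec on osd.11
1 ops are blocked > 268435 sec on osd.18
28 ops are blocked > 268435 sec on osd.39
3 osds have slow requests
"""

per_osd = blocked_ops_per_osd(sample)
# Sort so that the OSD with the most blocked operations comes first.
worst_first = sorted(per_osd.items(), key=lambda item: item[1], reverse=True)
print(worst_first)
```

In this sample, osd.39 holds 28 of the 30 blocked operations, which makes it the natural place to start investigating.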
In addition to the error message, the Ceph logs include messages similar to the following examples:
2022-05-24 13:18:10.024659 osd.1 127.0.0.1:6812/3032 9 : cluster [WRN] 6 slow requests, 6 included below; oldest blocked for > 61.758455 secs
2022-05-25 03:44:06.510583 osd.50 [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
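When many such lines appear in the logs, it can help to extract the OSD, the age of the request, and the state that follows the word "currently". The following Python sketch does this for the second example line. It assumes the log line layout shown above; the function name parse_slow_request and the regular expression are illustrative and may need adjusting for other Ceph releases.

```python
import re

# Default value of osd_op_complaint_time, as stated in this topic.
DEFAULT_COMPLAINT_TIME = 30.0

# Pattern assumed from the example log line in this topic.
_SLOW_REQUEST = re.compile(
    r"(osd\.\d+) .*slow request ([\d.]+) seconds old.* currently (.+)$"
)


def parse_slow_request(line: str):
    """Return (osd, age_in_seconds, current_state), or None if no match."""
    match = _SLOW_REQUEST.search(line)
    if match is None:
        return None
    return match.group(1), float(match.group(2)), match.group(3).strip()


line = (
    "2022-05-25 03:44:06.510583 osd.50 [WRN] slow request 30.005692 seconds old, "
    "received at {date-time}: osd_op(client.4240.0:8 "
    "benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 "
    "currently waiting for subops from [610]"
)

osd, age, state = parse_slow_request(line)
# The last value shows whether the request exceeds the default complaint time.
print(osd, age, state, age > DEFAULT_COMPLAINT_TIME)
```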
What this means
An OSD with slow requests is any OSD that cannot service the I/O operations in its queue within the time defined by the osd_op_complaint_time parameter. By default, this parameter is set to 30 seconds.
The main causes of OSDs having slow requests are:
- Problems with the underlying hardware, such as disk drives, hosts, racks, or network switches.
- Problems with the network. These problems are usually connected with flapping OSDs. For more information, see Flapping OSDs.
- System load.
Table 1 shows the types of slow requests. Use the dump_historic_ops administration socket command to determine the type of a slow request.
| Slow request type | Description |
|---|---|
| waiting for rw locks | The OSD is waiting to acquire a lock on a placement group for the operation. |
| waiting for subops | The OSD is waiting for replica OSDs to apply the operation to the journal. |
| no flag points reached | The OSD did not reach any major operation milestone. |
| waiting for degraded object | The OSDs have not yet replicated an object the specified number of times. |
For more information about the administration socket, see Using the Ceph administration socket.
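Once the state of a slow request is known, it can be matched against the types in Table 1. The following Python sketch encodes the table as a dictionary and looks up a state string by prefix. The function name describe_slow_request and the prefix-matching approach are assumptions made for illustration; they are not part of Ceph.

```python
# Slow request types and descriptions, copied from Table 1.
SLOW_REQUEST_TYPES = {
    "waiting for rw locks":
        "The OSD is waiting to acquire a lock on a placement group for the operation.",
    "waiting for subops":
        "The OSD is waiting for replica OSDs to apply the operation to the journal.",
    "no flag points reached":
        "The OSD did not reach any major operation milestone.",
    "waiting for degraded object":
        "The OSDs have not yet replicated an object the specified number of times.",
}


def describe_slow_request(state: str) -> str:
    """Return the Table 1 description whose type is a prefix of the state."""
    for request_type, description in SLOW_REQUEST_TYPES.items():
        if state.startswith(request_type):
            return description
    # States that are not listed in Table 1 are reported as unknown.
    return "Unknown slow request type."


print(describe_slow_request("waiting for subops from [610]"))
```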
Troubleshooting this problem
- Determine whether the OSDs with slow or blocked requests share a common piece of hardware, for example, a disk drive, host, rack, or network switch.
- If the OSDs share a disk:
  - Use the smartmontools utility to check the health of the disk, or check the logs for errors on the disk.
    Note: The smartmontools utility is included in the smartmontools package.
  - Use the iostat utility to get the I/O wait report (%iowait) on the OSD disk to determine whether the disk is under heavy load.
    Note: The iostat utility is included in the sysstat package.
- If the OSDs share the node with another service:
  - Check the RAM and CPU usage.
  - Use the netstat utility to see the network statistics on the Network Interface Controllers (NICs) and troubleshoot any networking issues.
- If the OSDs share a rack, check the network switch for the rack. For example, if you use jumbo frames, verify that the NIC in the path has jumbo frames set.
- If you are unable to determine a common piece of hardware shared by OSDs with slow requests, or to troubleshoot and fix hardware and networking problems, open a support ticket with IBM Support.
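The first step, finding hardware that the affected OSDs have in common, can be sketched in a few lines of Python. This example assumes that you have already recorded where each OSD lives, for instance from the output of ceph osd tree. The host and rack names, the OSD_LOCATION mapping, and the function name shared_components are all hypothetical and serve only to illustrate the grouping.

```python
from collections import defaultdict

# Hypothetical placement of OSDs; host and rack names are illustrative only.
OSD_LOCATION = {
    "osd.11": {"host": "node-a", "rack": "rack-1"},
    "osd.18": {"host": "node-b", "rack": "rack-1"},
    "osd.39": {"host": "node-c", "rack": "rack-1"},
}


def shared_components(slow_osds, locations):
    """Return {(level, name): [osds]} for components shared by two or more slow OSDs."""
    groups = defaultdict(list)
    for osd in slow_osds:
        for level, name in locations[osd].items():
            groups[(level, name)].append(osd)
    # Keep only components that more than one slow OSD has in common.
    return {key: osds for key, osds in groups.items() if len(osds) > 1}


shared = shared_components(["osd.11", "osd.18", "osd.39"], OSD_LOCATION)
print(shared)
```

In this made-up layout, the three OSDs sit on different hosts but share a rack, which would point toward the rack's network switch as the next thing to check.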