CephOSDSlowOps

An Object Storage Device (OSD) with slow requests is any OSD that cannot service the I/O operations in its queue within the time defined by the osd_op_complaint_time parameter. By default, this parameter is set to 30 seconds.

Impact: Medium

Diagnosis

You can get more information about the slow requests by using the OpenShift console.

Access the OSD pod terminal, and run the following commands:
  1. $ ceph daemon osd.<id> ops
  2. $ ceph daemon osd.<id> dump_historic_ops
Note: The OSD ID is the number in the pod name. For example, in rook-ceph-osd-0-5d86d4d8d4-zlqkx, the OSD ID is 0, so replace <id> with 0.
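The pod-name convention in the note can be turned into a small helper so that the OSD ID does not have to be read off by eye. This is a minimal sketch: osd_id_from_pod is a hypothetical function name, and it assumes pod names of the form rook-ceph-osd-<id>-<hash>-<hash>, as in the example above.

```shell
# Hypothetical helper: print the OSD ID embedded in a rook-ceph-osd pod name.
# Assumes names of the form rook-ceph-osd-<id>-<hash>-<hash>.
# Prints nothing if the name does not match (for example, an osd-prepare pod).
osd_id_from_pod() {
  printf '%s\n' "$1" | sed -n 's/^rook-ceph-osd-\([0-9][0-9]*\)-.*$/\1/p'
}

# Example use inside the OSD pod terminal (not run here):
#   id="$(osd_id_from_pod rook-ceph-osd-0-5d86d4d8d4-zlqkx)"
#   ceph daemon "osd.${id}" ops
#   ceph daemon "osd.${id}" dump_historic_ops
```

The ceph commands are left as comments because they only work where the OSD admin socket is available, that is, inside the OSD pod.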

Mitigation

The main causes of the OSDs having slow requests are:
  • Problems with the underlying hardware or infrastructure, such as disk drives, hosts, racks, or network switches. Use the OpenShift monitoring console to find alerts or errors about cluster resources. These can point to the root cause of the slow operations in the OSD.
  • Problems with the network. These problems are usually connected with flapping OSDs. See Flapping OSDs in the IBM Storage Ceph documentation. If it is a network issue, escalate to IBM Support.
  • System load. Use the OpenShift console to review the metrics of the OSD pod and of the node that runs the OSD. Adding or assigning more resources might resolve the issue.
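
For the system load check, the same pod and node metrics can also be read from the command line instead of the console. The following is a sketch under stated assumptions, not a prescribed procedure: the openshift-storage namespace and the app=rook-ceph-osd pod label are assumptions about the deployment, and osd_pod_usage and osd_node_usage are hypothetical helper names. It relies on the oc client being logged in to the cluster and on cluster metrics being available.

```shell
# Assumed namespace; adjust it to wherever the rook-ceph-osd pods run.
NS="${NS:-openshift-storage}"

# Show current CPU and memory usage of all OSD pods.
# Assumes the OSD pods carry the label app=rook-ceph-osd.
osd_pod_usage() {
  oc adm top pods -n "$NS" -l app=rook-ceph-osd
}

# Show current CPU and memory usage of the node that hosts a given OSD pod.
# $1 is the pod name, for example rook-ceph-osd-0-5d86d4d8d4-zlqkx.
osd_node_usage() {
  node="$(oc get pod "$1" -n "$NS" -o jsonpath='{.spec.nodeName}')" || return 1
  [ -n "$node" ] || return 1
  oc adm top node "$node"
}

# Example use (not run here):
#   osd_pod_usage
#   osd_node_usage rook-ceph-osd-0-5d86d4d8d4-zlqkx
```

If the OSD pod or its node is consistently close to its CPU or memory limits, that supports system load as the cause of the slow operations.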