APAR status
Closed as program error.
Error description
When QoS throttling is in use and an application uses ionice with certain I/O priorities, not only can that application experience degraded performance due to the throttling, but other file system operations can be delayed or fail. When an I/O priority is set, it is mapped to a QoS class. For example, an I/O priority of 4 is mapped to the "other" QoS class. Other priorities may be mapped to QoS classes that get a default limit of only 1 IOPS. Such a low threshold can lead to long delays in servicing that application's I/O requests, which show up as long waiters in the output of 'mmdiag --waiters' or 'mmlsnode -N waiters -L'. For example:
Waiting 45.1314 sec since 08:36:45, monitored, thread 6757 FsyncHandlerThread: on ThCond 0x1801F07A648 (QosIoTbtCondvar), reason 'tbtGetTokens'
Waiting 45.0602 sec since 08:36:45, monitored, thread 34762 CleanBufferThread: on ThCond 0x1801F07A648 (QosIoTbtCondvar), reason 'tbtGetTokens'
Waiting 45.0601 sec since 08:36:45, monitored, thread 56910 CleanBufferThread: on ThCond 0x1801F07A648 (QosIoTbtCondvar), reason 'tbtGetTokens'
Waiting 45.0601 sec since 08:36:45, monitored, thread 50045 CleanBufferThread: on ThCond 0x1801F07A648 (QosIoTbtCondvar), reason 'tbtGetTokens'
In the output of the mmlsqos command, you might see an unexpected class such as "standby" listed for some nodes. This is one example of an I/O priority (7) being mapped to a QoS class (standby) that has a low default IOPS setting (1). One possible side effect is that a command such as mmcrsnapshot or mmdelsnapshot times out and fails. Such commands must first quiesce the file system, which means that all outstanding I/O operations must complete. If this cannot be achieved within a timeout period, the command waits for some period of time and retries the quiesce. After enough failed attempts to quiesce, the command fails.
So while it may seem that an application with a low I/O priority should only affect itself, when too much QoS throttling is taking place there can actually be an impact on other file system operations, such as snapshot management.
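As a diagnostic sketch of how this condition might be spotted, the waiter and QoS output described above can be inspected. The commands below are from the Spectrum Scale CLI as mentioned in this description; the file system name gpfs0 is a placeholder:

```shell
# Look for long waiters blocked on QoS tokens (reason 'tbtGetTokens'),
# as shown in the example output above
mmdiag --waiters | grep tbtGetTokens

# List the QoS classes in effect; an unexpected class such as "standby"
# with a very low IOPS limit suggests an I/O priority was mapped to it
# (gpfs0 is a placeholder file system name)
mmlsqos gpfs0
```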
Local fix
A workaround for this problem is for the application to set its I/O priority to 4, which maps to the "other" QoS class. That class is often unlimited, or is at least likely to have a higher IOPS setting than the 1 IOPS that the "standby" class gets by default.
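As a sketch of this workaround, the application could be launched under ionice in the best-effort scheduling class with priority 4, which (per the description above) maps to the "other" QoS class. The application name is a placeholder:

```shell
# Run the application with I/O priority 4 in the best-effort class (-c 2),
# which maps to the "other" QoS class (often unlimited IOPS)
# (./my_application is a placeholder for the actual workload)
ionice -c 2 -n 4 ./my_application
```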
Problem summary
An application runs with an I/O priority that maps to an unsupported QoS class limited to 1 IOPS, so I/Os are queued waiting for enough tokens to service each operation. This causes long waiters.
Problem conclusion
Benefits of the solution: No more long waiters.
Work around: N/A
Problem trigger: Application runs with lower I/O priority when QoS is being used.
Symptom: I/O hang
Platforms affected: All Operating Systems
Functional Area affected: QoS
Customer Impact: High
Temporary fix
Comments
APAR Information
APAR number
IJ27087
Reported component name
SPEC SCALE DME
Reported component ID
5737F34AP
Reported release
504
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2020-08-17
Closed date
2020-10-07
Last modified date
2020-10-07
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
SPEC SCALE DME
Fixed component ID
5737F34AP
Applicable component levels
[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"STXKQY","label":"IBM Spectrum Scale"},"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"504","Line of Business":{"code":"LOB26","label":"Storage"}}]
Document Information
Modified date:
08 October 2020