Notification
Risk classification
HIPER (High Impact and/or Pervasive)
Risk categories
Data Loss
Abstract
IBM has identified an issue in IBM Storage Scale 5.1.7.0 - 5.1.9.8 (IBM Storage Scale System 6.1.8.0 - 6.1.9.5) and IBM Storage Scale 5.2.0.0 - 5.2.2.0 (IBM Storage Scale System 6.2.0.0 - 6.2.2.0) that impacts IBM Storage Scale Erasure Code Edition (IBM Storage Scale ECE) and IBM Storage Scale System. The issue is a race condition in which multiple threads perform a full-track read operation on the same track while disk errors exist. When the configuration parameter nsdRAIDClientOnlyChecksum is enabled, this race condition can allow data read from disks to be used, without checksum validation, to reconstruct data that could not be read because of disk errors.
Description
The race condition only occurs when all the following conditions are true:
- The system is running IBM Storage Scale ECE or IBM Storage Scale System.
- The system is running one of these code levels: IBM Storage Scale 5.1.7.0 through 5.1.9.8 (IBM Storage Scale System 6.1.8.0 through 6.1.9.5) or IBM Storage Scale 5.2.0.0 through 5.2.2.0 (IBM Storage Scale System 6.2.0.0 through 6.2.2.0).
- nsdRAIDClientOnlyChecksum is enabled.
Note: In recent IBM Storage Scale System configurations, the default is to have it enabled.
- Multiple threads are simultaneously performing full-track read operations to the same track.
- Disk errors or buffer trailer validation errors are affecting the reading of the corresponding strip data.
Under these conditions, data read from other strips might not be validated against the buffer checksum. Therefore, if there is silent data corruption within the buffer, it could be propagated to other areas in the same vtrack.
Note: If the corruption affects information in other GNR buffer trailers or descriptors (such as the track ID), this condition does not apply: buffer trailer checksums are always checked while a GNR buffer validation is being performed.
Users Affected:
This issue may affect clients that run IBM Storage Scale ECE or IBM Storage Scale System with the environment configuration parameter nsdRAIDClientOnlyChecksum enabled on the following versions of IBM Storage Scale:
- IBM Storage Scale 5.1.7.0 through 5.1.9.8 (IBM Storage Scale System 6.1.8.0 through 6.1.9.5)
- IBM Storage Scale 5.2.0.0 through 5.2.2.0 (IBM Storage Scale System 6.2.0.0 through 6.2.2.0)
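As a quick aid for checking a code level against the ranges above, the following is a minimal sketch in Python. The range tables are copied from this notification; the function names (parse_version, in_affected_range) are illustrative and not part of any IBM tooling. It assumes that version strings always have four numeric components, and it only compares version numbers. It does not check whether nsdRAIDClientOnlyChecksum is enabled or whether an interim fix is installed.

```python
# Affected IBM Storage Scale code levels from this notification (inclusive bounds).
SCALE_RANGES = [
    ((5, 1, 7, 0), (5, 1, 9, 8)),
    ((5, 2, 0, 0), (5, 2, 2, 0)),
]

# Affected IBM Storage Scale System code levels from this notification (inclusive bounds).
SYSTEM_RANGES = [
    ((6, 1, 8, 0), (6, 1, 9, 5)),
    ((6, 2, 0, 0), (6, 2, 2, 0)),
]


def parse_version(text):
    """Convert a string such as '5.1.9.8' into a tuple of four integers."""
    parts = tuple(int(p) for p in text.strip().split("."))
    if len(parts) != 4:
        raise ValueError(f"expected a four-part version, got {text!r}")
    return parts


def in_affected_range(version, ranges):
    """Return True if the version falls inside any of the inclusive ranges."""
    v = parse_version(version)
    # Tuples compare element by element, so (5, 1, 10, 0) > (5, 1, 9, 8).
    return any(low <= v <= high for low, high in ranges)


if __name__ == "__main__":
    for level in ("5.1.9.8", "5.1.9.9", "5.2.2.0", "5.2.2.1"):
        print(level, in_affected_range(level, SCALE_RANGES))
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes with multi-digit components (for example, "5.1.10.0" sorts before "5.1.9.8" as a string).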
Problem determination:
Because this is a race condition, it is difficult to determine definitively whether the problem has occurred. However, the race condition can occur only if entries like the following are present in the GPFS log (/var/adm/ras):
2025-02-10_20:17:09.569-0500: [E] Error validating buffer checksum in vdisk RG001LG002VS004 vtrack 7611 data segment 12 pdisk e1s17 psector 6135481200 vsector 15587503.
2025-02-10_20:17:09.570-0500: [E] Error validating buffer checksum in vdisk RG001LG002VS004 vtrack 7611 data segment 16 pdisk e1s17 psector 6135481264 vsector 15587567.
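To help look for such entries, the following is a minimal sketch in Python that scans log lines for the message format shown above. The regular expression is derived only from the two sample entries in this notification, so it is an assumption that all such entries follow exactly this layout. The function names are illustrative. A match only shows that a precondition for the race was present; it does not prove that data was corrupted.

```python
import re
from collections import Counter

# Pattern derived from the two sample log entries in this notification.
# Assumption: other entries of this type use the same field order.
CHECKSUM_ERROR = re.compile(
    r"^(?P<timestamp>\S+): \[E\] Error validating buffer checksum "
    r"in vdisk (?P<vdisk>\S+) vtrack (?P<vtrack>\d+) "
    r"data segment (?P<segment>\d+) pdisk (?P<pdisk>\S+) "
    r"psector (?P<psector>\d+) vsector (?P<vsector>\d+)\.?\s*$"
)

NUMERIC_FIELDS = ("vtrack", "segment", "psector", "vsector")


def find_checksum_errors(lines):
    """Return one dict per buffer checksum validation error found in lines."""
    hits = []
    for line in lines:
        match = CHECKSUM_ERROR.match(line.strip())
        if match is None:
            continue
        entry = match.groupdict()
        for field in NUMERIC_FIELDS:
            entry[field] = int(entry[field])
        hits.append(entry)
    return hits


def errors_per_track(hits):
    """Count errors per (vdisk, vtrack) pair to show which tracks were involved."""
    return dict(Counter((h["vdisk"], h["vtrack"]) for h in hits))
```

A possible usage is to open a log file from /var/adm/ras, pass the file object to find_checksum_errors, and then review the (vdisk, vtrack) pairs returned by errors_per_track.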
Recommended Action
IBM Storage Scale System customers running 6.2.0.0 - 6.2.2.0 should upgrade to IBM Storage Scale System 6.2.2.1 (or later).
IBM Storage Scale ECE customers running 5.1.7.0 - 5.1.9.8 should upgrade to IBM Storage Scale 5.1.9.9 (or later).
Customers who are unable to upgrade should request an interim fix (reference APAR IJ53784).
If an interim fix cannot be applied immediately, and the administrator wants to avoid the potential silent data corruption issue, the configuration parameter nsdRAIDClientOnlyChecksum can be disabled by using the following command:
mmchconfig nsdRAIDClientOnlyChecksum=no -i -N <server nodes or nodeclass>
If this workaround is applied:
- Read throughput might decrease while nsdRAIDClientOnlyChecksum is disabled.
- Changing this configuration parameter will not fix any data in the file system that may have been corrupted. An offline mmfsck must be run to verify the file system.
- After the fix is applied, make sure to re-enable the nsdRAIDClientOnlyChecksum parameter.
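For sites that script configuration changes, the following minimal sketch in Python builds the argument list for the workaround command and for re-enabling the parameter after the fix is applied. Only the "=no" form appears in this notification; the "=yes" value used for re-enabling is an assumption, as are the function name and the node class name "ece_servers". The sketch only builds and prints the arguments. It does not run mmchconfig.

```python
PARAMETER = "nsdRAIDClientOnlyChecksum"


def build_mmchconfig_args(enabled, nodes):
    """Build the mmchconfig argument list for the given target nodes.

    enabled=False gives the workaround command from this notification.
    enabled=True uses '=yes', which is an assumption (not shown in the notification).
    nodes is a list of server node names or a single node class name.
    """
    if not nodes:
        raise ValueError("at least one node or node class is required")
    value = "yes" if enabled else "no"
    return ["mmchconfig", f"{PARAMETER}={value}", "-i", "-N", ",".join(nodes)]


if __name__ == "__main__":
    # "ece_servers" is a hypothetical node class name used for illustration.
    print(" ".join(build_mmchconfig_args(False, ["ece_servers"])))
    print(" ".join(build_mmchconfig_args(True, ["ece_servers"])))
```

Keeping the command as an argument list makes it easy to review before passing it to a process runner on an administration node.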
Reference ID
Internal reference: D.336303
Date first published
17 March 2025
Document Information
Modified date:
30 May 2025
UID
ibm17184104