IBM Support

IBM has identified an issue in IBM Spectrum Scale (GPFS) V4.1, V4.2 and V5.0 levels, in which a node failure during file system restripe may result in silent data corruption or data loss in large files or file system structure errors

Flashes (Alerts)


Abstract

IBM has identified an issue in IBM Spectrum Scale (GPFS) V4.1.1.0 through V4.1.1.22, V4.2.0.0 through V4.2.3.12, and V5.0.0.0 through V5.0.2.2, in which a node failure during a file system restripe operation (mmdeldisk, mmrestripefs, mmchdisk, mmrpldisk, mmdelsnapshot, or mmdelfileset) may result in silent data corruption or data loss in large files, or in file system structure errors.

Content

Problem Summary:

While a file system is being restriped (with mmrestripefs or one of the other commands listed below), some data blocks of large files may be skipped if nodes in the cluster fail during the operation. Under certain conditions, a node failure causes the range of data blocks assigned to the failed node to be left out of the overall repair process. As a result, data in these blocks may be lost, or replicas may become inconsistent. The problem affects files larger than (50,000 * data block size). The following GPFS commands are impacted: mmdeldisk, mmrestripefs, mmchdisk, mmrpldisk, mmdelsnapshot, and mmdelfileset.
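As a rough illustration of the size rule above, the following sketch computes the threshold (50,000 * data block size) for a few block sizes. The block sizes shown are examples only, not values taken from this flash, and the function names are hypothetical; substitute the actual data block size of your file system.

```python
# Sketch: size above which a file counts as "large" for this issue,
# using the "50,000 * data block size" rule stated in this flash.
BLOCK_COUNT_THRESHOLD = 50_000  # multiplier quoted in the flash


def affected_size_threshold(block_size_bytes: int) -> int:
    """Return the file size in bytes above which a file is exposed."""
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    return BLOCK_COUNT_THRESHOLD * block_size_bytes


def is_large_file(file_size_bytes: int, block_size_bytes: int) -> bool:
    """True if the file is strictly larger than the threshold."""
    return file_size_bytes > affected_size_threshold(block_size_bytes)


if __name__ == "__main__":
    # Example block sizes (assumptions for illustration only).
    examples = (("256 KiB", 256 * 1024), ("1 MiB", 1024**2), ("4 MiB", 4 * 1024**2))
    for label, block_size in examples:
        gib = affected_size_threshold(block_size) / 1024**3
        print(f"block size {label}: files larger than {gib:.1f} GiB")
```

Because the threshold scales with the block size, file systems with small data block sizes reach it at much smaller file sizes.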

Users Affected:

This issue affects customers running IBM Spectrum Scale (GPFS) V4.1.1.0 through V4.1.1.22, V4.2.0.0 through V4.2.3.12, or V5.0.0.0 through V5.0.2.2, when all of the following conditions are met:

1) files of size larger than (50,000 * data block size) exist in the file system

2) one of the GPFS commands mmdeldisk, mmrestripefs, mmchdisk, mmrpldisk, mmdelsnapshot, or mmdelfileset is running

3) at least two nodes in the cluster participate in the command from condition 2)

4) at least one node fails during the period the command is running
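The version ranges and the four conditions above can be sketched as a simple check. This is a minimal illustration, assuming four-part version strings such as "5.0.2.1"; the function and parameter names are hypothetical, and the ranges are copied from this flash. It is not an IBM-provided tool.

```python
# Affected ranges as listed in this flash (inclusive on both ends).
AFFECTED_RANGES = [
    ((4, 1, 1, 0), (4, 1, 1, 22)),
    ((4, 2, 0, 0), (4, 2, 3, 12)),
    ((5, 0, 0, 0), (5, 0, 2, 2)),
]


def parse_version(text: str) -> tuple:
    """Parse 'V5.0.2.1' or '5.0.2.1' into a tuple of four integers."""
    parts = text.strip().lstrip("Vv").split(".")
    if len(parts) != 4:
        raise ValueError(f"expected a four-part version, got {text!r}")
    return tuple(int(part) for part in parts)


def is_affected_version(text: str) -> bool:
    """True if the version falls inside one of the affected ranges."""
    version = parse_version(text)
    return any(low <= version <= high for low, high in AFFECTED_RANGES)


def is_exposed(version: str, has_large_files: bool, command_running: bool,
               participating_nodes: int, node_failed: bool) -> bool:
    """All four conditions from the flash must hold on an affected level."""
    return (is_affected_version(version)
            and has_large_files            # condition 1
            and command_running            # condition 2
            and participating_nodes >= 2   # condition 3
            and node_failed)               # condition 4


if __name__ == "__main__":
    for level in ("4.1.1.22", "4.2.3.13", "5.0.2.2", "5.0.2.3"):
        print(level, "affected" if is_affected_version(level) else "not in an affected range")
```

Tuple comparison keeps the range test readable and avoids comparing version strings lexically, which would misorder levels such as 4.1.1.9 and 4.1.1.10.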

This issue could result in problems in large files (file size larger than 50,000 * data block size), such as data loss, silent data corruption, or file system structure errors indicating double de-allocation of a disk address. Taking mmdeldisk as an example: when disks are deleted from the file system, the data of large files may not be completely migrated, leaving some data blocks in those files still referring to the deleted disks. Those blocks are not accessible after the disks have been deleted, and attempting to read from or write to an affected file may result in I/O errors. In addition, if another disk is later added to the file system, the blocks that were not migrated by the earlier mmdeldisk may now point to the newly added disk. Reading from the file may then return incorrect data, and writing to the file may corrupt other user files or file system metadata. The other commands may result in similar failures, with the incompletely repaired disk addresses referring to disks that are no longer part of the file system.

Problem Determination:

If one of the commands mmdeldisk, mmrestripefs, mmchdisk, mmrpldisk, mmdelsnapshot, or mmdelfileset has been run and the conditions listed under "Users Affected" were met, running the mmfsck command may produce messages similar to the following:
 
......
Error in inode 153600 snap 0: Indirect block 22 level 1 has bad disk addr at offset 18 replica 0 addr 6:349011968 is corrupt
......
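When mmfsck output has been saved to a file, messages of this shape can be picked out with a small script. The sketch below is built only from the single sample line shown above, so the pattern and the field names are assumptions; real mmfsck output may differ in wording, and an empty result does not show that a file system is unaffected.

```python
import re

# Pattern derived from the one sample message in this flash (assumption:
# other messages of this kind follow the same wording).
BAD_ADDR_RE = re.compile(
    r"Error in inode (?P<inode>\d+) snap (?P<snap>\d+): "
    r"Indirect block (?P<block>\d+) level (?P<level>\d+) has bad disk addr "
    r"at offset (?P<offset>\d+) replica (?P<replica>\d+) "
    r"addr (?P<addr>\d+:\d+) is corrupt"
)

INT_FIELDS = ("inode", "snap", "block", "level", "offset", "replica")


def find_bad_disk_addrs(lines):
    """Return one dict per line that matches the bad-disk-address message."""
    findings = []
    for line in lines:
        match = BAD_ADDR_RE.search(line)
        if match is None:
            continue
        record = {name: int(match.group(name)) for name in INT_FIELDS}
        record["addr"] = match.group("addr")  # kept as text, not interpreted
        findings.append(record)
    return findings


if __name__ == "__main__":
    sample = [
        "......",
        "Error in inode 153600 snap 0: Indirect block 22 level 1 has bad disk "
        "addr at offset 18 replica 0 addr 6:349011968 is corrupt",
        "......",
    ]
    for record in find_bad_disk_addrs(sample):
        print(record)
```

Collecting the inode numbers this way gives a list of potentially affected files to hand to IBM Service.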

Other possible effects:

1) Reading or writing a large file may fail with an I/O error

2) Reading large files may return incorrect data if disks are added to the file system after one of the GPFS commands mmdeldisk, mmrestripefs, mmchdisk, mmrpldisk, mmdelsnapshot, or mmdelfileset has run

3) "Double-deallocate" file system structure errors in dmesg or errpt.
 
......
mmfs: [X] logAssertFailed: !"Assert on Structure Error"
mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=8884415: Invalid disk data structure. Error code 1108. Volume fs1. Sense Data
mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=8884415: 54 04 08 00 00 00 00 02
mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=8884415: 00 00 00 00 17 1E 7D 80
mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=8884415: 20 00 01 00 00 00 00 02
mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=8884415: 00 00 00 00 17 1E 7D 80
mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=8884415: 00 40 00 00 00 00 00 00
mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=8884415: 00 00 00 00 00 00 00 00
mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=8884415: 00 00 00 00
mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=8884415:
mmfs: mmfsd: Error=MMFS_GENERIC, ID=0x30D9195E, Tag=8884416
......
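Saved system log text can be screened for entries like the ones above. The following sketch is a hedged illustration that relies only on the strings visible in this sample (the assert text and the Error=, ID=, Tag= fields); the function name is hypothetical, and a match only flags entries for review, since MMFS_FSSTRUCT errors can have other causes.

```python
import re

# Field layout taken from the sample log lines in this flash.
ENTRY_RE = re.compile(
    r"Error=(?P<error>MMFS_[A-Z_]+), ID=(?P<id>0x[0-9A-Fa-f]+), Tag=(?P<tag>\d+)"
)
ASSERT_TEXT = "Assert on Structure Error"


def summarize_structure_errors(lines):
    """Return (assert_seen, sorted distinct MMFS_FSSTRUCT (error, id, tag))."""
    assert_seen = False
    entries = set()
    for line in lines:
        if ASSERT_TEXT in line:
            assert_seen = True
        # finditer copes with two log entries that ended up on one line.
        for match in ENTRY_RE.finditer(line):
            if match.group("error") == "MMFS_FSSTRUCT":
                entries.add((match.group("error"), match.group("id"),
                             int(match.group("tag"))))
    return assert_seen, sorted(entries)


if __name__ == "__main__":
    sample = [
        'mmfs: [X] logAssertFailed: !"Assert on Structure Error"',
        "mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=8884415: "
        "Invalid disk data structure. Error code 1108. Volume fs1",
        "mmfs: mmfsd: Error=MMFS_GENERIC, ID=0x30D9195E, Tag=8884416",
    ]
    print(summarize_structure_errors(sample))
```

Grouping by tag collapses the multi-line sense data dump into one entry per reported structure error.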

Recommendations:
1. Users running IBM Spectrum Scale V5.0.0.0 through V5.0.2.2 should apply IBM Spectrum Scale V5.0.2.3 or later, available from Fix Central at: https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Spectrum+Scale&release=5.0.2&platform=All&function=all
2. Users running IBM Spectrum Scale V4.2.0.0 through V4.2.3.12 should apply IBM Spectrum Scale V4.2.3.13 or later, available from Fix Central at: https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Spectrum+Scale&release=4.2.3&platform=All&function=all
3. If you cannot apply the above PTF levels, contact IBM Service to obtain and apply an efix for your level of code:
 
For IBM Spectrum Scale V5.0.0.0 through V5.0.2.2, reference APAR IJ11695
For IBM Spectrum Scale V4.2.0.0 through V4.2.3.12, reference APAR IJ11626
For IBM Spectrum Scale V4.1.1.0 through V4.1.1.22, reference APAR IJ11716
 
To contact IBM Service, see http://www.ibm.com/planetwide/
4. Until the fix is applied, restrict the GPFS commands mmdeldisk, mmrestripefs, mmchdisk, mmrpldisk, mmdelsnapshot, and mmdelfileset to a single participating node (the file system manager node) by using -N to specify the participating node. The mmlsmgr command displays which node is the file system manager.
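The workaround in item 4 can be sketched as a small helper that appends -N with the manager node to a command line before it is run. This is an illustration under assumptions: the helper name, the file system name fs1, the node name node1, and the -b (rebalance) option of mmrestripefs are examples chosen here, not prescribed by this flash. The manager node is expected to be read from mmlsmgr by the operator, and whether a given command accepts -N at your code level should be confirmed in the command reference.

```python
import shlex

# Commands named in this flash.
RESTRIPE_COMMANDS = {
    "mmdeldisk", "mmrestripefs", "mmchdisk",
    "mmrpldisk", "mmdelsnapshot", "mmdelfileset",
}


def limit_to_manager_node(command, args, manager_node):
    """Return argv for `command` restricted to one node via -N.

    The command is only assembled here, never executed.
    """
    if command not in RESTRIPE_COMMANDS:
        raise ValueError(f"{command} is not one of the commands in the flash")
    if "-N" in args:
        raise ValueError("-N is already present; review it by hand")
    if not manager_node:
        raise ValueError("manager node name is required (see mmlsmgr)")
    return [command, *args, "-N", manager_node]


if __name__ == "__main__":
    # Hypothetical names: file system "fs1", manager node "node1".
    argv = limit_to_manager_node("mmrestripefs", ["fs1", "-b"], "node1")
    print(shlex.join(argv))
```

Building the argument list separately from running it leaves room to review the exact command before it touches the file system.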
If you believe that your GPFS file system may be affected by this issue, please contact IBM Service as soon as possible for further guidance and assistance.


Document Information

Modified date:
26 September 2022

UID

ibm10869082