Flashes (Alerts)
Abstract
IBM has identified an issue in IBM Spectrum Scale (GPFS) V4.1.1.0 thru V4.1.1.22, V4.2.0.0 thru V4.2.3.12, and V5.0.0.0 thru V5.0.2.2 levels, in which a node failure during file system restripe (mmdeldisk, mmrestripefs, mmchdisk, mmrpldisk, mmdelsnapshot, mmdelfileset) may result in silent data corruption or data loss in large files or file system structure errors.
Content
Problem Summary:
While a file system is being restriped (with mmrestripefs or one of the other commands listed below), some data blocks of large files may be skipped if a node in the cluster fails during the operation. Under certain conditions, a node failure causes the range of data blocks assigned to the failed node to be left out of the overall repair process. As a result, data in these blocks may be lost, or replicas may become inconsistent. The problem affects files larger than (50,000 * data block size). The following GPFS commands are impacted: mmdeldisk, mmrestripefs, mmchdisk, mmrpldisk, mmdelsnapshot, and mmdelfileset.
Users Affected:
1) files of size larger than (50,000 * data block size) exist in the file system
2) a GPFS command such as mmdeldisk, mmrestripefs, mmchdisk, mmrpldisk, mmdelsnapshot, or mmdelfileset is running
3) at least two nodes in the cluster participate in the command specified in 2)
4) at least one node fails during the period the command is running
This issue could result in problems in large files (file size larger than "50,000 * data block size"), such as data loss, silent data corruption, or file system structure errors indicating disk address double de-allocation. Taking mmdeldisk as an example: when disks are deleted from the file system, the data of large files (larger than 50,000 * data block size) may not be completely migrated, leaving some data blocks in the large file still referring to the deleted disks. Those blocks are not accessible after the disk has been deleted, and attempting to read from or write to the affected file may result in an I/O error. In addition, if another disk is later added to the file system, the blocks that were not migrated by the previous mmdeldisk may now point to the newly added disk. In that case, reading from the file may return incorrect data, and writing to the file may corrupt other user files or file system metadata. The other commands may result in similar failures, with the incompletely repaired disk addresses referring to disks which are no longer part of the file system.
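To make the size threshold concrete, the following minimal sketch evaluates "50,000 * data block size" for a few block sizes. The helper names and the sample block sizes are illustrative assumptions, not values from this flash; use the actual data block size of your file system.

```python
# Factor taken from the flash text: files larger than
# 50,000 * data block size are in the affected class.
BLOCK_FACTOR = 50_000

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def size_threshold_bytes(block_size_bytes: int) -> int:
    """Return 50,000 * data block size, in bytes (hypothetical helper)."""
    return BLOCK_FACTOR * block_size_bytes


def is_large_file(file_size_bytes: int, block_size_bytes: int) -> bool:
    """True if the file is strictly larger than the threshold."""
    return file_size_bytes > size_threshold_bytes(block_size_bytes)


# Sample block sizes are assumptions chosen only for illustration.
for block_size in (256 * KIB, 1 * MIB, 4 * MIB):
    threshold = size_threshold_bytes(block_size)
    print(f"block size {block_size // KIB} KiB -> "
          f"threshold {threshold / GIB:.1f} GiB")
# block size 256 KiB -> threshold 12.2 GiB
# block size 1024 KiB -> threshold 48.8 GiB
# block size 4096 KiB -> threshold 195.3 GiB
```

With a 4 MiB data block size, for example, only files larger than roughly 195 GiB would meet condition 1).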
Problem Determination:
If a command such as mmdeldisk, mmrestripefs, mmchdisk, mmrpldisk, mmdelsnapshot, or mmdelfileset has been run and the conditions listed under "Users Affected" have been met, the mmfsck command may report messages similar to the following:
......
Error in inode 153600 snap 0: Indirect block 22 level 1 has bad disk addr at offset 18 replica 0 addr 6:349011968 is corrupt
......
Other possible effects:
1) Reading or writing a large file may fail with an I/O error
2) Reading large files may return incorrect data if disks are added to the same file system after running the GPFS commands mmdeldisk, mmrestripefs, mmchdisk, mmrpldisk, mmdelsnapshot, or mmdelfileset
3) "Double-deallocate" file system structure errors in dmesg or errpt.
......
mmfs: [X] logAssertFailed: !"Assert on Structure Error"
mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=8884415: Invalid disk data structure. Error code 1108. Volume fs1 . Sense Data
mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=8884415: 54 04 08 00 00 00 00 02
mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=8884415: 00 00 00 00 17 1E 7D 80
mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=8884415: 20 00 01 00 00 00 00 02
mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=8884415: 00 00 00 00 17 1E 7D 80
mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=8884415: 00 40 00 00 00 00 00 00
mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=8884415: 00 00 00 00 00 00 00 00
mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=8884415: 00 00 00 00
mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=8884415:
mmfs: mmfsd: Error=MMFS_GENERIC, ID=0x30D9195E, Tag=8884416
......
Recommendations:
For IBM Spectrum Scale V5.0.0.0 thru V5.0.2.2, reference APAR IJ11695
For IBM Spectrum Scale V4.2.0.0 thru V4.2.3.12, reference APAR IJ11626
For IBM Spectrum Scale V4.1.1.0 thru V4.1.1.22, reference APAR IJ11716
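The mapping from affected levels to APARs can be sketched as a simple range lookup. The version ranges and APAR numbers below are copied from this flash; the function names and the four-part version string format are assumptions made for illustration, and the sketch does not query an installed product.

```python
from typing import Optional, Tuple

Version = Tuple[int, ...]

# (lowest affected level, highest affected level, APAR), as listed in this flash.
AFFECTED_RANGES = [
    ((4, 1, 1, 0), (4, 1, 1, 22), "IJ11716"),
    ((4, 2, 0, 0), (4, 2, 3, 12), "IJ11626"),
    ((5, 0, 0, 0), (5, 0, 2, 2), "IJ11695"),
]


def parse_version(text: str) -> Version:
    """Parse a level such as 'V4.2.3.12' or '4.2.3.12' (hypothetical helper)."""
    parts = tuple(int(p) for p in text.strip().lstrip("Vv").split("."))
    if len(parts) != 4:
        raise ValueError(f"expected a four-part level, got {text!r}")
    return parts


def apar_for(version_text: str) -> Optional[str]:
    """Return the APAR for an affected level, or None if outside the ranges."""
    version = parse_version(version_text)
    for low, high, apar in AFFECTED_RANGES:
        # Tuples compare element by element, which matches level ordering.
        if low <= version <= high:
            return apar
    return None


print(apar_for("V4.2.1.5"))   # IJ11626
print(apar_for("5.0.2.2"))    # IJ11695
print(apar_for("5.0.2.3"))    # None
```

A result of None only means the level falls outside the ranges named here; it is not a statement about any other fix or issue.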
To contact IBM Service, see http://www.ibm.com/planetwide/
Document Information
Modified date:
26 September 2022
UID
ibm10869082