IBM Support

IBM Spectrum Scale (GPFS) V4.2 and 5.0 levels: a node or daemon failure may result in either data corruption in compressed files or undetected data corruption in snapshot files

Flashes (Alerts)


Abstract

IBM has identified an issue in IBM Spectrum Scale (GPFS) V4.2 and V5.0 levels, in which a node or daemon failure may result in either data corruption in compressed files or undetected data corruption in snapshot files.

Content

Problem Summary:

When data is written to a file or new extended attributes are set on it, the file's metadata (e.g., file size, mtime) must be updated accordingly. To keep the on-disk metadata consistent across a node or daemon failure, recovery log records are produced; these are used to recover the metadata updates after such a failure. For a small compressed file, recovery log records may not be produced for some compressed data blocks when data is appended or extended attributes are set. If a node or daemon failure happens before the metadata updates are flushed to disk, the data in these unprotected blocks may be corrupted, because no log records are available to recover the corresponding metadata updates. Writes to the file may fail with the EIO error code, or they may appear to succeed while corrupting the file content. Reads from the file may return either correct or incorrect data.

The above-stated problem can also result in undetected data corruption in snapshot files when the FileHeat feature is being used.

Running offline fsck may be able to recover the correct state of compressed files that were damaged because of the missing recovery log records, but only if the file was not written to after the node failure. If the compressed file was written to after the failure, its contents may have become corrupted and cannot be recovered.

Users Affected:

This issue affects customers running any level of IBM Spectrum Scale (GPFS) V4.2 or V5.0 when all of conditions 1-4 are met, when condition 5 is met, or both:

1. A node or daemon failure happens while data is being appended to, or extended attributes are being set on, compressed files.
2. Spectrum Scale File Compression is being used, and compressed files are present in the file system.
3. Only small compressed files in particular size ranges are affected. The minimum file size is the file system block size plus 1. The maximum file size depends on the file data replication and block size configurations and on the number of extended attributes set on the files. Contact IBM Service for more information.
4. Files were successfully compressed and occupy fewer file system blocks than the file size would otherwise indicate.

or

5. For snapshot files whose size is at least the file system block size, and when the FileHeat feature is in use, a node or daemon failure happens while the FileHeat extended attribute is being inserted into the snapshot files (V4.2.2 or later).
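
The way the conditions combine can be written out as a small sketch. It is illustrative only: the class and field names are hypothetical, and the upper size bound for condition 3 is not stated in this document, so whether a file falls in the affected size range is supplied by the caller (for example, after consulting IBM Service).

```python
from dataclasses import dataclass


def compressed_min_size(block_size: int) -> int:
    """Smallest compressed file size in scope: block size plus 1 (condition 3)."""
    return block_size + 1


def snapshot_min_size(block_size: int) -> int:
    """Smallest snapshot file size in scope: the block size (condition 5)."""
    return block_size


@dataclass
class Exposure:
    # Hypothetical field names; each maps to one numbered condition above.
    failure_during_append_or_xattr: bool  # condition 1
    compressed_files_present: bool        # condition 2
    size_in_affected_range: bool          # condition 3 (upper bound not given here)
    compression_saved_blocks: bool        # condition 4
    snapshot_fileheat_failure: bool       # condition 5

    def affected(self) -> bool:
        """True when conditions 1-4 all hold, or condition 5 holds, or both."""
        compressed_path = (
            self.failure_during_append_or_xattr
            and self.compressed_files_present
            and self.size_in_affected_range
            and self.compression_saved_blocks
        )
        return compressed_path or self.snapshot_fileheat_failure
```

The two paths are independent: a system that never uses File Compression can still be exposed through condition 5 alone.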

Problem Determination:

Compressed data in affected files cannot be decompressed and recompressed, and some log asserts may be raised during this process. The problem may cause the data in affected files to become silently corrupted. A read of an affected file may return correct data or corrupted data (zeros or other data patterns). Writes to the affected areas may fail with the EIO error or hit asserts, and writing to an area that returned incorrect data may silently corrupt the file's data further. For corrupted snapshot files, the undefined data is not necessarily zeros. If you run an fsck check (-n option) and see messages similar to the following, contact IBM Service to determine the next action.

Error in inode 2424577 snap 0: Indirect block 0 level 1 has bad disk addr at offset 3
replica 0 addr C -1:-1 does not have compression flag set
Repair disk address? no
No. SnapId InodeNum  FileType  Fix  Error(s)   Severity
--- ------ --------- --------- --- ----------- -----------
 1    0    2424577    User      N   DiskAddr   Noncritical
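
As a rough aid for scanning a saved fsck report, the following sketch looks for the message pattern shown above. The patterns are copied from this sample only, so treat them as an assumption: the real output may be worded or laid out differently, and the function name is hypothetical. A match is a prompt to contact IBM Service, not a diagnosis.

```python
import re

# Patterns taken from the sample messages in this document (assumption:
# actual fsck output may differ in wording or layout).
INODE_RE = re.compile(r"Error in inode (\d+) snap (\d+):")
FLAG_MSG = "does not have compression flag set"


def find_flagged_inodes(fsck_text):
    """Return sorted (inode, snap_id) pairs whose error is followed by the flag message."""
    flagged = set()
    current = None  # most recent "Error in inode ..." header seen
    for line in fsck_text.splitlines():
        match = INODE_RE.search(line)
        if match:
            current = (int(match.group(1)), int(match.group(2)))
        elif FLAG_MSG in line and current is not None:
            flagged.add(current)
    return sorted(flagged)


SAMPLE = """\
Error in inode 2424577 snap 0: Indirect block 0 level 1 has bad disk addr at offset 3
replica 0 addr C -1:-1 does not have compression flag set
Repair disk address? no
"""

if __name__ == "__main__":
    for inode, snap in find_flagged_inodes(SAMPLE):
        print(f"inode {inode} (snap {snap}): contact IBM Service")
```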

Recommendations:

1. Users running IBM Spectrum Scale V5.0.0.0 through V5.0.2.0 should apply IBM Spectrum Scale V5.0.2.1 or later, available from Fix Central at:  https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Spectrum+Scale&release=5.0.2&platform=All&function=all

2. Users running IBM Spectrum Scale V4.2.0.0 through V4.2.3.11 should apply IBM Spectrum Scale V4.2.3.12 or later, available from Fix Central at: https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Spectrum+Scale&release=4.2.3&platform=All&function=all

3. If you cannot apply the above PTF levels, contact IBM Service for an efix for your level of code:

IBM Spectrum Scale V5.0.0.0 through V5.0.2.0, reference APAR IJ10414
IBM Spectrum Scale V4.2.0.0 through V4.2.3.11, reference APAR IJ10474

4. Until the fix is applied, users should temporarily stop using the File Compression function to avoid potential data corruption. Under the guidance of IBM Service, run offline fsck to detect and fix the affected files. fsck can detect and fix the problem only on files that have not been written to since they were damaged by the missing recovery log records.

Until the fix is applied, the FileHeat feature should be disabled to prevent further damage to snapshot files.

5. With guidance from IBM, run offline fsck to gather information on the affected files and determine whether they can be repaired.
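
The version ranges in recommendations 1-3 can be checked mechanically. The sketch below is illustrative: the function names are hypothetical, and it assumes the installed level is available as a four-field string such as "V5.0.1.2"; how that string is obtained is left to the caller.

```python
def parse_level(level):
    """Turn 'V5.0.1.2' or '5.0.1.2' into a tuple of four integers."""
    parts = level.strip().lstrip("Vv").split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four numeric fields, got {level!r}")
    return tuple(int(p) for p in parts)


# (lowest affected, highest affected, first fixed level, APAR),
# as listed in the recommendations above.
AFFECTED = [
    ((4, 2, 0, 0), (4, 2, 3, 11), "V4.2.3.12", "IJ10474"),
    ((5, 0, 0, 0), (5, 0, 2, 0), "V5.0.2.1", "IJ10414"),
]


def recommended_fix(level):
    """Return (fix_level, apar) if the level is in an affected range, else None."""
    version = parse_level(level)
    for low, high, fix, apar in AFFECTED:
        if low <= version <= high:  # tuple comparison is field by field
            return fix, apar
    return None
```

For example, `recommended_fix("V5.0.1.2")` returns `("V5.0.2.1", "IJ10414")`, while `recommended_fix("4.2.3.12")` returns `None` because that level already contains the fix.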

To contact IBM Service, see http://www.ibm.com/planetwide/


Document Information

Modified date:
26 September 2022

UID

ibm10738705