IBM Support

FILE DATA CAN GET CORRUPTED IF A FILE IS HSM RECALLED IN GPFS AND FILE SYSTEM IS MOUNTED READ ONLY

Flashes (Alerts)


Abstract

Under certain conditions, IBM Spectrum Protect for Space Management (HSM) can corrupt a file's data during recall processing. The problem can occur on AIX or Linux when a GPFS file system is mounted in read-only mode on a compute node other than the HSM owner node and a file is recalled by more than one application at the same time.

Content

PROBLEM SUMMARY

The problem can occur if the managed file system is mounted in read-only mode on a compute node that is different from the HSM owner node, and a file is recalled by at least two applications at the same time. An application that reads or writes the corrupted data might not detect the corruption. The next migration after the file is modified transfers the corrupted data to the IBM Spectrum Protect server.

The original migrated copy of the file on the IBM Spectrum Protect server is not affected by the data corruption, and can be recovered until the HSM reconcile has deleted the migrated copy. The deletion of the original migrated copy depends on how often HSM reconcile runs and the value of the reconcile option MIGFILEEXPIRATION.

This issue affects all IBM Spectrum Protect for Space Management versions on AIX and Linux that manage IBM Spectrum Scale GPFS file systems.

CONDITIONS FOR THE PROBLEM TO OCCUR

HSM migration copies files to the IBM Spectrum Protect server. A file that resides in the file system and also has a copy on the IBM Spectrum Protect server is in the premigrated state. If the file data is removed from the file system after the copy, the file is in the migrated state. The mechanism that removes the file data from the file system is called stubbing. A new file that was never processed by HSM is in the resident state.

A migrated file can be recalled from HSM, which means that the file is copied from the IBM Spectrum Protect server back to the file system. The file state changes to premigrated if the file was recalled for a read operation, and to resident if the file was recalled for a write operation.
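The file states and transitions described above can be summarized as a small state table. The following Python sketch is for illustration only: the event names (`copy_to_server`, `stub`, `recall_for_read`, `recall_for_write`) are hypothetical labels chosen here and are not HSM commands or API calls.

```python
from enum import Enum


class FileState(Enum):
    RESIDENT = "resident"        # data only in the file system
    PREMIGRATED = "premigrated"  # data in the file system and on the server
    MIGRATED = "migrated"        # data only on the server (file is stubbed)


# (current state, event) -> next state, following the description above.
# The event names are hypothetical labels, not HSM commands.
TRANSITIONS = {
    (FileState.RESIDENT, "copy_to_server"): FileState.PREMIGRATED,
    (FileState.PREMIGRATED, "stub"): FileState.MIGRATED,
    (FileState.MIGRATED, "recall_for_read"): FileState.PREMIGRATED,
    (FileState.MIGRATED, "recall_for_write"): FileState.RESIDENT,
}


def next_state(state, event):
    """Return the state after `event`, or raise if no such transition is described."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.value} on {event}") from None


def run(events, state=FileState.RESIDENT):
    """Apply a sequence of events, starting from a file never processed by HSM."""
    for event in events:
        state = next_state(state, event)
    return state
```

For example, `run(["copy_to_server", "stub", "recall_for_read"])` returns `FileState.PREMIGRATED`, which matches a migrated file that was recalled for a read operation.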

HSM automatically recovers files left over from failed migrations and failed recalls. For files that failed migration, the files become resident. For files that failed recall, the files are stubbed again.

Automatic recovery should occur only when a file system managed by HSM is mounted on the compute node that is the HSM file system owner node. However, an error in the code can cause automatic recovery to occur when a file system is mounted on a compute node that is not the HSM file system owner node. If at least two applications recall the same file at the same time, the first recall is stopped and the recovery (stubbing) of the file takes place. The second recall continues, writes file data into the file, and changes the file state from migrated to premigrated or resident. As a result of this mismatch, the file can contain sequences of zeros in addition to valid file data.

CIRCUMVENTION

Set the HSM option MIGFILEEXPIRATION to the value 9999. This setting prevents reconcile from deleting the original migrated copy of affected files, if the copy still exists in IBM Spectrum Protect server storage. However, it also prevents space from being freed in the Spaceman pool on the IBM Spectrum Protect server. Then determine whether your environment is affected. If your environment is not affected, reset the MIGFILEEXPIRATION option to the previous value.

HOW TO DETERMINE IF YOU ARE AFFECTED

On all cluster nodes, examine the /var/log/messages file, including all available older versions of the file, for occurrences of any message that records an mmmount command with the "-o ro" option. These messages indicate that the file system was mounted read-only. Next, examine the dsmerror.log file, including all available older versions of the file, for occurrences of message "ANS4007E". If any ANS4007E messages have the same time stamp as the read-only mount entries, one or more of your space-managed files might be affected by this problem. Check the file indicated in the ANS4007E message for any unexpected sequences of zeros.

Example from /var/log/messages:
2017-07-13T17:02:03.285629+02:00 blackpearl mmfs[14062]: CLI root root [EXIT, CHANGE] 'mmmount gpfs -o ro' RC=0

Example from dsmerror.log:
07/13/2017 17:02:03 ANS4007E Error processing '/gpfs/big0': access to the object is denied
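The comparison of the two logs can be scripted. The following Python sketch assumes that both logs use exactly the formats shown in the examples above (an ISO 8601 time stamp in /var/log/messages, and MM/DD/YYYY HH:MM:SS in dsmerror.log) and that both record the same local time. Other syslog or date formats are not handled. The function names and the `tolerance_seconds` parameter are hypothetical and exist only in this sketch.

```python
import re
from datetime import datetime, timedelta

# Matches a syslog line that records an mmmount command with the "-o ro" option.
RO_MOUNT = re.compile(r"^(\S+)\s.*\bmmmount\b.*-o\s+ro\b")
# Matches an ANS4007E line and captures the time stamp and the file name.
ANS4007E = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) ANS4007E Error processing '(.+?)':"
)


def ro_mount_times(messages_lines):
    """Return local wall-clock times (to the second) of read-only mounts."""
    times = []
    for line in messages_lines:
        match = RO_MOUNT.match(line)
        if not match:
            continue
        try:
            stamp = datetime.fromisoformat(match.group(1))
        except ValueError:
            continue  # time stamp is not in the ISO form shown in the example
        # Assumption: dsmerror.log uses the same local time, so drop the offset.
        times.append(stamp.replace(tzinfo=None, microsecond=0))
    return times


def suspect_files(messages_lines, dsmerror_lines, tolerance_seconds=0):
    """Return (time, path) for ANS4007E entries that coincide with a read-only mount."""
    mounts = ro_mount_times(messages_lines)
    window = timedelta(seconds=tolerance_seconds)
    hits = []
    for line in dsmerror_lines:
        match = ANS4007E.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S")
        if any(abs(stamp - mount) <= window for mount in mounts):
            hits.append((stamp, match.group(2)))
    return hits
```

To use the sketch, pass the lines of each log file, including the older versions of the files, to `suspect_files`. With the default tolerance of 0, only entries with an identical time stamp (to the second) are reported, as described above.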

Contact IBM support to get assistance with recovery of affected files.
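A first check of an indicated file for sequences of zeros can also be scripted before you contact support. The following Python sketch reports long runs of zero bytes. The function name `zero_runs` and the `min_run` threshold of 4096 bytes are assumptions made for this sketch. A reported run is not proof of corruption, because sparse files and some data formats legitimately contain zeros, so compare the result with what you expect the file to contain. Note also that reading a file that is still in the migrated state causes the file to be recalled.

```python
import re

# Matches the first byte that is not zero.
_NONZERO = re.compile(rb"[^\x00]")


def zero_runs(path, min_run=4096, chunk_size=1 << 20):
    """Return (offset, length) for each run of at least min_run zero bytes."""
    runs = []
    run_start = None  # file offset where the current zero run began, if any
    offset = 0        # file offset of the start of the current chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            pos = 0
            while pos < len(chunk):
                if run_start is None:
                    # Look for the start of the next zero run in this chunk.
                    pos = chunk.find(b"\x00", pos)
                    if pos == -1:
                        break
                    run_start = offset + pos
                # Look for the end of the current zero run.
                match = _NONZERO.search(chunk, pos)
                if match is None:
                    break  # the run continues into the next chunk
                end = offset + match.start()
                if end - run_start >= min_run:
                    runs.append((run_start, end - run_start))
                run_start = None
                pos = match.start()
            offset += len(chunk)
    # A run that reaches the end of the file is closed here.
    if run_start is not None and offset - run_start >= min_run:
        runs.append((run_start, offset - run_start))
    return runs
```

The file is read in chunks so that large files do not have to fit in memory, and a run that spans a chunk boundary is counted as a single run.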

SOLUTION AND CLIENT PACKAGE LEVELS CONTAINING THE FIX

A fix for this problem is targeted for IBM Spectrum Protect for Space Management Versions 7.1.8 and 8.1.2, subject to change at the discretion of IBM. If you need a fix before the fixing versions are available, contact IBM support.


Product Synonym

Unix HSM

Document Information

Modified date:
25 September 2022

UID

swg22007721