IBM Support

IBM Spectrum Scale Alert: AFM recovery may incorrectly delete the files at the home or secondary site.

Flashes (Alerts)


Abstract

IBM has identified certain issues affecting Active File Management (AFM) and AFM Asynchronous Disaster Recovery (ADR) in IBM Spectrum Scale, which might result in undetected data loss.

Content

During an AFM recovery, AFM might incorrectly delete files at the home or secondary cluster, leaving those files missing there. AFM recovery is triggered when the in-memory queue is lost, for example after a gateway node restart. During recovery, AFM performs a readdir on home to detect deleted and renamed files. If the home readdir fails after some entries have been read, the readdir is retried three times without checking whether some entries were already read. As a result, duplicate entries might be logged in the home list. AFM incorrectly treats these duplicates as hard link remove operations and performs a remove operation on the corresponding files. A readdir failure during AFM recovery can occur if there is a network issue between the cache and the home system.

Data might be permanently lost if auto eviction is enabled at the cache and data was evicted for files that AFM recovery had already removed at home. Auto eviction is enabled by default and is controlled by the fileset-level configuration option afmEnableAutoEviction.
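The retry behavior described above can be illustrated with a toy model. This is only a sketch for intuition and not AFM code: the function names, the failure injection, and the list-based "home list" are all hypothetical. It shows how retrying a partially completed listing without discarding the entries already logged produces duplicates, and how resetting the list on each retry avoids them.

```python
from collections import Counter


def read_home_dir(entries, fail_at, failures, reset_on_retry):
    """Toy model of a directory listing that is retried after a failure.

    entries        -- names that the (hypothetical) home directory contains
    fail_at        -- index at which a simulated readdir failure occurs
    failures       -- how many attempts fail before one succeeds
    reset_on_retry -- discard already logged entries before each attempt
    """
    logged = []
    remaining_failures = failures
    while True:
        if reset_on_retry:
            logged = []
        failed = False
        for index, name in enumerate(entries):
            if remaining_failures > 0 and index == fail_at:
                remaining_failures -= 1
                failed = True
                break  # simulated network error partway through the listing
            logged.append(name)
        if not failed:
            return logged


def duplicates(names):
    """Return the names that were logged more than once, sorted."""
    return sorted(name for name, count in Counter(names).items() if count > 1)


home = ["a", "b", "c", "d"]
without_reset = read_home_dir(home, fail_at=2, failures=1, reset_on_retry=False)
with_reset = read_home_dir(home, fail_at=2, failures=1, reset_on_retry=True)
print(duplicates(without_reset))  # entries read before the failure appear twice
print(duplicates(with_reset))     # no duplicates when the list is reset
```

In this model the entries read before the failure are the ones that end up duplicated, which corresponds to the entries that recovery would misinterpret as hard link removes.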

Users affected:
Users running AFM and AFM DR on IBM Spectrum Scale V5.0.0.0 through V5.1.1.0 are potentially affected.
Problem Determination:
At the gateway node, remove operations are reported in the mmfs.log file in the following manner:

2021-05-03_15:35:58.135+0530: [I] AFM: /usr/lpp/mmfs/bin/tspcachescan gpfs1 sw1 6 0 1319687074 sw1.afm.27129 3 0 C0A87A0A602CD23B 2 0
2021-05-03_15:35:58.147+0530: [I] AFM: Detecting operations to be recovered...
2021-05-03_15:35:58.158+0530: [I] AFM: Found 1 remove operations...
2021-05-03_15:35:58.158+0530: [I] AFM: Found 2 hard link remove operations...
2021-05-03_15:35:58.158+0530: [I] AFM: Found 1 create operations...
2021-05-03_15:35:58.159+0530: [I] AFM: Found 1 update operation...
2021-05-03_15:35:58.159+0530: [I] AFM: Found 1 local cleanup operation...
2021-05-03_15:35:58.167+0530: [I] AFM: Starting 'queue' operation for fileset 'sw1' in file system 'gpfs1'.
2021-05-03_15:35:58.167+0530: [I] Command: tspcache gpfs1 1 sw1 0 3 1319687074 49 0 133 0
2021-05-03_15:35:58.226+0530: [I] Command: successful tspcache gpfs1 1 sw1 0 3 1319687074 49 0 133 0
If the count of hard link remove operations is unexpectedly high, or if such operations are reported although the fileset contains no hard links, AFM recovery might have deleted files at the home or secondary system. To confirm, compare the file count in the live file system between cache (or primary) and home (or secondary): run a policy scan to list all the files in the fileset at both cache and home, and then compare the two file lists. If home is not running Spectrum Scale, you can use the "find" command to generate the list of files.
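As a convenience, the log check can be scripted. The following is a minimal sketch that assumes the message format matches the sample log lines shown above; the function name and the regular expression are illustrative and not part of any IBM tooling.

```python
import re

# Matches lines such as:
# 2021-05-03_15:35:58.158+0530: [I] AFM: Found 2 hard link remove operations...
HARD_LINK_REMOVE = re.compile(
    r"^(?P<ts>\S+): \[I\] AFM: Found (?P<count>\d+) hard link remove operations?\.\.\."
)


def hard_link_remove_counts(lines):
    """Return (timestamp, count) for every hard link remove message found."""
    hits = []
    for line in lines:
        match = HARD_LINK_REMOVE.match(line.strip())
        if match:
            hits.append((match.group("ts"), int(match.group("count"))))
    return hits


sample = [
    "2021-05-03_15:35:58.147+0530: [I] AFM: Detecting operations to be recovered...",
    "2021-05-03_15:35:58.158+0530: [I] AFM: Found 1 remove operations...",
    "2021-05-03_15:35:58.158+0530: [I] AFM: Found 2 hard link remove operations...",
    "2021-05-03_15:35:58.158+0530: [I] AFM: Found 1 create operations...",
]
for timestamp, count in hard_link_remove_counts(sample):
    print(timestamp, count)
```

To scan a real log, pass an open file object for the gateway node's mmfs.log file instead of the sample list. Whether a given count is suspicious still depends on how many hard links the fileset actually contains.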
Example policy to get all the files from the indicated path:
RULE EXTERNAL LIST 'allFiles'
RULE 'allFilesRule' LIST 'allFiles' DIRECTORIES_PLUS
                              WHERE REGEX(misc_attributes,'[P]') AND
                               PATH_NAME NOT LIKE '%/.pconflicts/%' AND
                               PATH_NAME NOT LIKE '%/.afm/%' AND
                               PATH_NAME NOT LIKE '%/.ptrash/%'              

mmapplypolicy <path> -P <policy file path> -f <output file path> -L 1 -N mount -I defer
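Once file lists exist for both sides, the comparison itself is a set difference. The sketch below is an illustration under stated assumptions: it assumes that each policy list record ends with " -- " followed by the full path, that "find" output is one path per line, and that special characters in path names are not escaped in the list files. The function names and example paths are hypothetical. Verify the record layout of your own list files before relying on it.

```python
def relative_paths(lines, root):
    """Collect paths relative to root from policy list records or find output."""
    prefix = root.rstrip("/") + "/"
    paths = set()
    for line in lines:
        line = line.rstrip("\n")
        # Assumed policy list layout: "<inode> <gen> <snapid>  -- <path>".
        # Lines without " -- " are treated as plain paths (find output).
        _, separator, tail = line.partition(" -- ")
        path = tail if separator else line
        if path.startswith(prefix):
            paths.add(path[len(prefix):])
    return paths


def missing_at_home(cache_lines, cache_root, home_lines, home_root):
    """Return paths present in the cache list but absent from the home list."""
    cache = relative_paths(cache_lines, cache_root)
    home = relative_paths(home_lines, home_root)
    return sorted(cache - home)


cache_list = [
    "101 1 0  -- /gpfs1/sw1/a.txt",
    "102 1 0  -- /gpfs1/sw1/dir/b.txt",
]
home_list = ["/export/sw1/a.txt"]  # for example, generated with find
print(missing_at_home(cache_list, "/gpfs1/sw1", home_list, "/export/sw1"))
```

Paths are compared relative to each side's root because the fileset junction path at the cache and the exported path at home usually differ.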

Recommendations:

Users who see the incorrect "hard link remove operations" message in the mmfs.log file should apply the fix on the AFM gateway nodes at the cache or primary cluster.

1. Users running IBM Spectrum Scale V5.0.0.0 through V5.0.5.7 should apply IBM Spectrum Scale V5.0.5.8 or later, available from Fix Central at:

https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Spectrum+Scale&release=5.0.5&platform=All&function=all

2. Users running IBM Spectrum Scale V5.1.0.0 through V5.1.1.0 should apply IBM Spectrum Scale V5.1.1.1 or later, available from Fix Central at:

https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Spectrum+Scale&release=5.1.1&platform=All&function=all

3. If you cannot apply one of the above PTF levels, contact IBM Service to obtain and apply an efix for your level of code:

      - For IBM Spectrum Scale V5.0.0.0 through V5.0.5.7, reference APAR IJ32504

      - For IBM Spectrum Scale V5.1.0.0 through V5.1.1.0, reference APAR IJ32481
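The version ranges, fix levels, and APAR numbers listed above can be encoded in a small lookup. This is a minimal sketch: the function names are hypothetical, and it assumes version strings in the four-part form used in this document (for example 5.0.5.7, with an optional leading V).

```python
# (lowest affected, highest affected, first fixed PTF level, APAR for efix requests)
AFFECTED_RANGES = [
    ((5, 0, 0, 0), (5, 0, 5, 7), "5.0.5.8", "IJ32504"),
    ((5, 1, 0, 0), (5, 1, 1, 0), "5.1.1.1", "IJ32481"),
]


def parse_version(text):
    """Turn a string such as 'V5.0.5.7' into a comparable tuple of integers."""
    return tuple(int(part) for part in text.strip().lstrip("Vv").split("."))


def remediation(version):
    """Return (fix level, APAR) if the version is affected, otherwise None."""
    parsed = parse_version(version)
    for lowest, highest, fix_level, apar in AFFECTED_RANGES:
        if lowest <= parsed <= highest:
            return fix_level, apar
    return None


for installed in ("5.0.5.7", "V5.1.0.2", "5.1.1.1"):
    print(installed, remediation(installed))
```

A result of None means only that the level is outside the ranges named in this alert; it does not indicate whether files were already deleted while an affected level was installed.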

If you believe that your GPFS file system might be affected by this issue, contact IBM Service as soon as possible for further guidance and assistance.
Until a fix (in PTF or efix form) is applied, if the symptom described in the Problem Determination section is observed, take the action that applies to the fileset mode:
a) For a single writer mode fileset, perform a resync:
     mmafmctl device resync -j fileset
b) For an AFM DR primary mode fileset, perform changeSecondary:
     mmafmctl device changeSecondary -j fileset --new-target existingAfmTarget --inband
c) For an independent writer mode fileset, the files are also deleted in the cache during revalidation. However, those files can be found under the
     <Fileset Junction Path>/.ptrash directory. You can copy the data from the .ptrash directory back to the fileset. This synchronizes data between cache (or primary) and home (or secondary).
d) Users can also restore data from a backup if one is available, for example from HSM or from old snapshots if they exist.


Document Information

Modified date:
19 July 2021

UID

ibm16452089