APAR status
Closed as program error.
Error description
In Spectrum Scale Erasure Code Edition (ECE), all of a server's pdisks (physical disks) can become missing because of a network failure, a node failure, or a planned "node suspend" maintenance procedure. When this happens, the system continues to function as long as sufficient fault tolerance remains. However, smaller configurations with fewer ECE nodes are exposed to a race condition in which a pdisk state change interrupts a system-wide descriptor update, causing the recovery group to resign. The problem is also more likely on small ESS configurations, such as the GS1 or GS2 enclosures.
For both ESS and ECE, a possible symptom in mmfs.log is a pdisk state change followed almost immediately by a resign message that reports VCD write failures before the system fault tolerance is exceeded:
2020-12-01_19:01:36.696-0400: [D] Pdisk n004p005 of RG rg1 state changed from ok/00000.180 to missing/suspended/00050.180.
2020-12-01_19:01:36.697-0400: [E] Beginning to resign recovery group rg1 due to "VCD write failure", caller err 217 when "updating VCD: RGD"
Note that a "VCD write failure" with err 217 is a generic message that is issued when fault tolerance is exceeded during critical system updates. In this case, however, the race condition causes the recovery group to resign when only a handful of disks are missing.
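The symptom described above (a pdisk going missing, followed within milliseconds by a "VCD write failure" resign of the same recovery group) can be searched for in a copy of mmfs.log. The following Python sketch is illustrative only and is not an IBM tool. The function name `find_suspect_resigns`, the regular expressions, and the one-second window are assumptions derived from the two sample log lines in this APAR; real log messages may vary, and a match is only a hint that this race condition might be involved, not a diagnosis.

```python
import re
from datetime import datetime, timedelta

# Patterns inferred from the two sample lines in this APAR (assumption:
# other releases may word these messages differently).
LINE_RE = re.compile(r'^(\S+): \[(\w)\] (.*)$')
STATE_RE = re.compile(r'Pdisk (\S+) of RG (\S+) state changed from (\S+) to (.*)')
RESIGN_RE = re.compile(
    r'Beginning to resign recovery group (\S+) due to "VCD write failure"')


def parse_ts(text):
    """Parse a timestamp such as 2020-12-01_19:01:36.696-0400."""
    return datetime.strptime(text, "%Y-%m-%d_%H:%M:%S.%f%z")


def find_suspect_resigns(lines, window=timedelta(seconds=1)):
    """Return (rg, pdisk, delay) for each "VCD write failure" resign that
    follows a pdisk-to-missing state change in the same recovery group
    within `window` (a hypothetical threshold, not an IBM-documented one)."""
    last_missing = {}  # recovery group -> (timestamp, pdisk name)
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        try:
            ts = parse_ts(m.group(1))
        except ValueError:
            continue  # not a timestamped log line
        msg = m.group(3)
        state = STATE_RE.search(msg)
        if state and "missing" in state.group(4):
            last_missing[state.group(2)] = (ts, state.group(1))
            continue
        resign = RESIGN_RE.search(msg)
        if resign:
            rg = resign.group(1)
            prev = last_missing.get(rg)
            if prev and timedelta(0) <= ts - prev[0] <= window:
                hits.append((rg, prev[1], ts - prev[0]))
    return hits
```

One way to use it is to pass an open file object, for example `find_suspect_resigns(open("mmfs.log"))`, where the path is a placeholder for wherever the log copy is stored. Each returned tuple names the recovery group, the last pdisk that went missing before the resign, and the delay between the two messages.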
Local fix
The workaround is to allow the system to resign and recover on its own. For smaller ECE configurations, avoid leaving a node suspended for long periods of time during maintenance tasks.
Problem summary
In Spectrum Scale Erasure Code Edition (ECE), all of a server's pdisks (physical disks) can become missing because of a network failure, a node failure, or a planned "node suspend" maintenance procedure. When this happens, the system continues to function as long as sufficient fault tolerance remains. However, smaller configurations with fewer ECE nodes are exposed to a race condition in which a pdisk state change interrupts a system-wide descriptor update, causing the recovery group to resign. The problem is also more likely on small ESS configurations, such as the GS1 or GS2 enclosures.
For both ESS and ECE, a possible symptom in mmfs.log is a pdisk state change followed almost immediately by a resign message that reports VCD write failures before the system fault tolerance is exceeded:
2020-12-01_19:01:36.696-0400: [D] Pdisk n004p005 of RG rg1 state changed from ok/00000.180 to missing/suspended/00050.180.
2020-12-01_19:01:36.697-0400: [E] Beginning to resign recovery group rg1 due to "VCD write failure", caller err 217 when "updating VCD: RGD"
Note that a "VCD write failure" with err 217 is a generic message that is issued when fault tolerance is exceeded during critical system updates. In this case, however, the race condition causes the recovery group to resign when only a handful of disks are missing.
Problem conclusion
Benefits of the solution: The code fix eliminates the race condition, making the system more stable in the presence of missing disks.
Work Around: Allow the system to resign and recover on its own. For smaller ECE configurations, avoid leaving a node suspended for long periods of time during maintenance tasks.
Problem trigger: Pdisk state updates and system-wide descriptor updates on configurations with smaller recovery groups, such as a small number of nodes on ECE or a small enclosure on ESS.
Symptom: Unexpected Results/Behavior
Platforms affected: N/A
Functional Area affected: ESS/GNR
Customer Impact: Suggested
Temporary fix
Comments
APAR Information
APAR number
IJ29812
Reported component name
SPEC SCALE STD
Reported component ID
5737F33AP
Reported release
510
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2020-12-11
Closed date
2020-12-11
Last modified date
2020-12-11
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
SPEC SCALE STD
Fixed component ID
5737F33AP
Applicable component levels
Business Unit: IBM Infrastructure w/TPS (BU058)
Product: IBM Spectrum Scale (STXKQY)
Platform: Platform Independent (PF025)
Version: 510
Line of Business: Storage (LOB26)
Document Information
Modified date:
16 December 2020