IJ29812: RECOVERY GROUP RESIGNS IN A SMALL ECE SETUP IN THE PRESENCE OF MISSING DISKS BEFORE FAULT TOLERANCE IS REACHED.

APAR status

Closed as program error.

Error description

In Spectrum Scale Erasure Code Edition (ECE), it is
possible for all of a server's pdisks
(physical disks) to become missing, whether due
to network failure, node failure, or a planned
"node suspend" maintenance procedure. When this
happens, the system continues to function as long as
sufficient fault tolerance remains. However, smaller
configurations with fewer ECE nodes are exposed to a
race condition in which a pdisk state change can interrupt
a system-wide descriptor update, which
causes the recovery group to resign.
This problem can also occur, with higher
probability, on small ESS configurations,
such as the GS1 or GS2 enclosures.

For both ESS and ECE, a possible symptom appears in
mmfs.log when a pdisk state change is quickly
followed by a resign message reporting VCD write failures
before the system fault tolerance is exceeded:
2020-12-01_19:01:36.696-0400: [D] Pdisk n004p005 of RG rg1 state changed from ok/00000.180 to missing/suspended/00050.180.
2020-12-01_19:01:36.697-0400: [E] Beginning to resign recovery group rg1 due to "VCD write failure", caller err 217 when "updating VCD: RGD"

Note that a "VCD write failure" with err 217 is a generic
message issued when fault tolerance is exceeded during
critical system updates. In this case, however, the race
condition causes the recovery group to resign
when only a handful of disks are missing.
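
The symptom described above (a pdisk going missing, followed almost immediately by a resign with err 217) can be searched for in a saved copy of mmfs.log. The following is a minimal sketch, not an IBM-provided tool. It assumes each log entry occupies a single line, that the message wording matches the two sample entries shown above, and that "quickly" means within one second; the function name, the patterns, and the window value are all illustrative assumptions. Because err 217 is also issued when fault tolerance is genuinely exceeded, a match is only a hint that this APAR may apply.

```python
import re
from datetime import datetime, timedelta

# Timestamp layout inferred from the sample entries,
# e.g. 2020-12-01_19:01:36.696-0400 (assumption).
TS_FORMAT = "%Y-%m-%d_%H:%M:%S.%f%z"

# Patterns modelled only on the two sample log entries (assumption).
PDISK_MISSING = re.compile(
    r"^(?P<ts>\S+): \[D\] Pdisk (?P<pdisk>\S+) of RG (?P<rg>\S+) "
    r"state changed from \S+ to missing"
)
RG_RESIGN = re.compile(
    r'^(?P<ts>\S+): \[E\] Beginning to resign recovery group (?P<rg>\S+) '
    r'due to "VCD write failure", caller err 217'
)


def parse_ts(text):
    """Convert a log timestamp string into a timezone-aware datetime."""
    return datetime.strptime(text, TS_FORMAT)


def find_suspect_resigns(lines, window=timedelta(seconds=1)):
    """Return (rg, pdisk, gap) for each err 217 resign that follows a
    pdisk-missing entry for the same recovery group within `window`."""
    last_missing = {}  # recovery group -> (timestamp, pdisk name)
    hits = []
    for line in lines:
        m = PDISK_MISSING.match(line)
        if m:
            last_missing[m["rg"]] = (parse_ts(m["ts"]), m["pdisk"])
            continue
        m = RG_RESIGN.match(line)
        if m and m["rg"] in last_missing:
            seen_at, pdisk = last_missing[m["rg"]]
            gap = parse_ts(m["ts"]) - seen_at
            if timedelta(0) <= gap <= window:
                hits.append((m["rg"], pdisk, gap))
    return hits


if __name__ == "__main__":
    sample = [
        "2020-12-01_19:01:36.696-0400: [D] Pdisk n004p005 of RG rg1 state "
        "changed from ok/00000.180 to missing/suspended/00050.180.",
        "2020-12-01_19:01:36.697-0400: [E] Beginning to resign recovery "
        'group rg1 due to "VCD write failure", caller err 217 when '
        '"updating VCD: RGD"',
    ]
    for rg, pdisk, gap in find_suspect_resigns(sample):
        print(rg, pdisk, gap)
```

Any hit should still be checked against the number of pdisks actually missing at that time, since the same message is legitimate once fault tolerance is exceeded.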

Local fix

The workaround is to allow the system to resign and recover
on its own. For smaller ECE configurations, avoid leaving a
node suspended for long periods of time during maintenance
tasks.

Problem summary

In Spectrum Scale Erasure Code Edition (ECE), it is
possible for all of a server's pdisks
(physical disks) to become missing, whether due
to network failure, node failure, or a planned
"node suspend" maintenance procedure. When this
happens, the system continues to function as long as
sufficient fault tolerance remains. However, smaller
configurations with fewer ECE nodes are exposed to a
race condition in which a pdisk state change can interrupt
a system-wide descriptor update, which
causes the recovery group to resign.
This problem can also occur, with higher
probability, on small ESS configurations,
such as the GS1 or GS2 enclosures.

For both ESS and ECE, a possible symptom appears in
mmfs.log when a pdisk state change is quickly
followed by a resign message reporting VCD write failures
before the system fault tolerance is exceeded:
2020-12-01_19:01:36.696-0400: [D] Pdisk n004p005 of RG rg1 state changed from ok/00000.180 to missing/suspended/00050.180.
2020-12-01_19:01:36.697-0400: [E] Beginning to resign recovery group rg1 due to "VCD write failure", caller err 217 when "updating VCD: RGD"

Note that a "VCD write failure" with err 217 is a generic
message issued when fault tolerance is exceeded during
critical system updates. In this case, however, the race
condition causes the recovery group to resign
when only a handful of disks are missing.

Problem conclusion

Benefits of the solution:
The code fix eliminates the race condition, making the system
more stable in the presence of missing disks.

Work Around:
The workaround is to allow the system to resign and recover
on its own. For smaller ECE configurations, avoid leaving a
node suspended for long periods of time during maintenance
tasks.

Problem trigger:
Concurrent pdisk state updates and system-wide descriptor
updates on configurations with small recovery groups, such as
ECE with a small number of nodes or ESS with a small enclosure.

Symptom:
Unexpected Results/Behavior

Platforms affected:
N/A
Functional Area affected:
ESS/GNR

Customer Impact:
Suggested

Temporary fix

Comments

APAR Information

APAR number
IJ29812
Reported component name
SPEC SCALE STD
Reported component ID
5737F33AP
Reported release
510
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2020-12-11
Closed date
2020-12-11
Last modified date
2020-12-11

APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:

IJ29910

Fix information

Fixed component name
SPEC SCALE STD
Fixed component ID
5737F33AP

Applicable component levels

[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"STXKQY","label":"IBM Spectrum Scale"},"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"510","Line of Business":{"code":"LOB26","label":"Storage"}}]

Document Information

Modified date:
16 December 2020
