APAR status
Closed as program error.
Error description
A race condition between RG master resign/recovery and the mdi operation on the worker side could trigger a bug in RG master recovery that leaves the working index (WI) entry stuck in the Assigned state. The stuck entry then blocks the integrity manager thread and produces the long waiter "wait for working index entry to be committed" on the RG master node. In this state, the next RG master resign/recovery event could lead to data integrity issues. Note that this can happen only in ECE/ESS3K environments, not in legacy ESS environments.
Local fix
The bug does not cause a problem right away, but it exposes a condition in which additional events could lead to a recovery failure. By the time the recovery failure is detected, data has already been corrupted. Special steps are required to allow recovery to complete, and in the meantime a certain amount of data will be lost.
Problem summary
A race condition between RG master resign/recovery and the mdi operation on the worker side could trigger a bug in RG master recovery that leaves the working index (WI) entry stuck in the Assigned state. The stuck entry then blocks the integrity manager thread and produces the long waiter "wait for working index entry to be committed" on the RG master node. In this state, the next RG master resign/recovery event could lead to data integrity issues. Note that this can happen only in ECE/ESS3K environments, not in legacy ESS environments.
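The interleaving described above can be illustrated with a toy model. This is only a sketch of a generic stale-read race, not GNR code: the `Worker` class, the `Committed` state name, and the function `master_view_after_recovery` are hypothetical names introduced for illustration, and the real RPCs and recovery logic are not shown in this APAR.

```python
from enum import Enum


class WIState(Enum):
    # "Assigned" is named in the APAR; "Committed" is an assumed name
    # for the state the entry should reach.
    ASSIGNED = "Assigned"
    COMMITTED = "Committed"


class Worker:
    """Hypothetical worker holding the actual state of one WI entry."""

    def __init__(self):
        self.state = WIState.ASSIGNED

    def inquire(self):
        # Stands in for the master's RPC that inquires about mdi state.
        return self.state

    def commit(self):
        # Stands in for the worker's RPC that modifies mdi state.
        self.state = WIState.COMMITTED


def master_view_after_recovery(worker, worker_commits_mid_recovery):
    """Return the state the master acts on, based on a single snapshot."""
    snapshot = worker.inquire()
    if worker_commits_mid_recovery:
        # The worker's update lands after the master took its snapshot.
        worker.commit()
    return snapshot


worker = Worker()
view = master_view_after_recovery(worker, worker_commits_mid_recovery=True)
print(view.value, worker.state.value)  # prints "Assigned Committed"
```

In this model the master finishes recovery believing the entry is still Assigned while the worker has already moved on, which is the kind of disagreement the error description attributes to the race.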
Problem conclusion
Benefits of the solution: The code is fixed so that working index recovery is handled properly during recovery. Without the fix, data loss could happen.
Work Around: The bug does not cause a problem right away, but it exposes a condition in which additional events could lead to a recovery failure. By the time the recovery failure is detected, data has already been corrupted. Special steps are required to allow recovery to complete, and in the meantime a certain amount of data will be lost.
Problem trigger: This is normally exposed by a race condition when the RG master resign/recovery event happens in the middle of the worker-side mdi operation, specifically when the worker's RPC, which modifies mdi state, interleaves with the RG master's RPC, which inquires about the mdi state.
Symptom: The immediate effect is that a long waiter is observed on the RG (root owner node): "wait for working index entry to be committed". In this state, if the metadata block related to this working index is updated, RG recovery could roll back metadata and cause data integrity issues. The other possible outcome is a recovery failure, reported with the message "MDI VCD recovery failure in build used metaslot bitmaps: 16". Note that this long waiter can also be caused by other situations, and not all cases result in a data integrity issue.
Platforms affected: N/A
Functional Area affected: GNR/ECE/ESS3K
Customer Impact: High importance
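As a rough aid for spotting the symptoms above, the following sketch searches captured text for the two message strings quoted in this APAR. It assumes the text has already been collected by some other means (for example, saved waiter output or a log file); it makes no assumption about the surrounding line format and only does substring matching. The function name `find_symptoms` and the labels it returns are hypothetical.

```python
# Message strings quoted in this APAR.
LONG_WAITER = "wait for working index entry to be committed"
RECOVERY_FAILURE = "MDI VCD recovery failure in build used metaslot bitmaps"


def find_symptoms(text):
    """Return (line_number, label, line) for each line containing a symptom."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if LONG_WAITER in line:
            hits.append((number, "long_waiter", line.strip()))
        if RECOVERY_FAILURE in line:
            hits.append((number, "recovery_failure", line.strip()))
    return hits


# Made-up sample lines; the real output format is not assumed here.
sample = (
    "unrelated line\n"
    "example prefix: wait for working index entry to be committed\n"
)
for hit in find_symptoms(sample):
    print(hit)
```

A match on the long waiter alone does not confirm this APAR, since the same waiter can be caused by other situations.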
Temporary fix
Comments
APAR Information
APAR number
IJ26348
Reported component name
SPEC SCALE STD
Reported component ID
5737F33AP
Reported release
505
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2020-07-22
Closed date
2020-07-22
Last modified date
2020-07-22
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
SPEC SCALE STD
Fixed component ID
5737F33AP
Applicable component levels
Business Unit: IBM Infrastructure w/TPS (BU058)
Product: IBM Spectrum Scale (STXKQY)
Platform: Platform Independent (PF025)
Version: 505
Line of Business: Storage (LOB26)
Document Information
Modified date:
23 July 2020