IJ26348: LONG WAITER "WAIT FOR WORKING INDEX ENTRY TO BE COMMITTED"

APAR status

Closed as program error.

Error description

A race condition between RG master resign/recovery
and an mdi operation on the worker side can trigger
a bug in RG master recovery that leaves a working
index (WI) entry stuck in the Assigned state.
The integrity manager thread then blocks, and the
long waiter "wait for working index entry to be
committed" appears on the RG master node.
While the entry is stuck, the next RG master
resign/recovery event can lead to data integrity
issues.

Note: this can happen only in ECE/ESS3K
environments, not in legacy ESS environments.

Local fix

The immediate bug does not cause a problem
right away, but it exposes a condition in which
additional events can lead to a recovery failure.
By the time the recovery failure is detected, data
has already been corrupted. Special steps are
required to allow recovery to complete, and in
the meantime a certain amount of data will be lost.

Problem summary

A race condition between RG master resign/recovery
and an mdi operation on the worker side can trigger
a bug in RG master recovery that leaves a working
index (WI) entry stuck in the Assigned state.
The integrity manager thread then blocks, and the
long waiter "wait for working index entry to be
committed" appears on the RG master node.
While the entry is stuck, the next RG master
resign/recovery event can lead to data integrity
issues.

Note: this can happen only in ECE/ESS3K
environments, not in legacy ESS environments.

Problem conclusion

Benefits of the solution:
The code is fixed so that working index recovery
is handled properly during recovery.
Without the fix, data loss could happen.

Work Around:
The immediate bug does not cause a problem
right away, but it exposes a condition in which
additional events can lead to a recovery failure.
By the time the recovery failure is detected, data
has already been corrupted. Special steps are
required to allow recovery to complete, and in
the meantime a certain amount of data will be lost.

Problem trigger:
This is normally exposed by a race condition
in which an RG master resign/recovery event happens
in the middle of a worker-side mdi operation,
specifically when the worker's RPC that modifies
mdi state interleaves with the RG master's RPC
that inquires about the mdi state.
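
The interleaving described above can be pictured with a small toy model.
This is only an illustrative sketch, not the GNR implementation: the
Worker and Master classes, the state names "in_progress" and "done",
and the rule that the master commits only when it observed "done" are
all assumptions made up for the example. It shows how an inquiry that
lands before a modification can leave a decision based on a stale
answer.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    # Hypothetical stand-in for the worker-side mdi state.
    mdi_state: str = "in_progress"

    def handle_modify(self) -> None:
        # Models the worker's RPC that modifies mdi state.
        self.mdi_state = "done"

    def handle_inquire(self) -> str:
        # Models the master's RPC that inquires about mdi state.
        return self.mdi_state


@dataclass
class Master:
    # Hypothetical working index entry state on the master.
    wi_entry: str = "Assigned"

    def recover(self, observed: str) -> None:
        # Toy rule (an assumption): commit only if the inquiry saw "done".
        if observed == "done":
            self.wi_entry = "Committed"


def run(order):
    """Replay the two RPCs in the given order, then run toy recovery."""
    worker, master = Worker(), Master()
    observed = None
    for step in order:
        if step == "inquire":
            observed = worker.handle_inquire()
        elif step == "modify":
            worker.handle_modify()
    master.recover(observed)
    return master.wi_entry, worker.mdi_state


print(run(["modify", "inquire"]))  # ('Committed', 'done')
print(run(["inquire", "modify"]))  # ('Assigned', 'done')
```

In the second ordering the worker has finished, but the toy master
acted on the earlier answer, so its entry stays in Assigned. The real
defect and its fix are internal to the product and are not reproduced
here.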

Symptom:
The immediate effect is that a long waiter is
observed on the RG (root owner node):
"wait for working index entry to be committed".
In this state, if the metadata block related to
this working index is updated, RG recovery could
roll back metadata and cause data integrity
issues. The other possible outcome is a recovery
failure, reported with the failure message "MDI VCD
recovery failure in build used metaslot bitmaps: 16".

Note that this long waiter can also be caused by
other situations, and not all cases result in a
data integrity issue.
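
As a rough aid for spotting the two symptom strings quoted above, the
following sketch scans already-captured text (for example, saved waiter
output or log lines) for them. The function name find_symptoms and the
sample lines are hypothetical. The sample lines do not represent the
actual output format of any Spectrum Scale command, and only the two
quoted strings come from this APAR.

```python
# The two strings quoted in the Symptom text of this APAR.
WAITER_REASON = "wait for working index entry to be committed"
RECOVERY_FAILURE = "MDI VCD recovery failure in build used metaslot bitmaps"


def find_symptoms(lines):
    """Return the lines that mention either symptom string (case-insensitive)."""
    hits = {"long_waiter": [], "recovery_failure": []}
    for line in lines:
        lowered = line.lower()
        if WAITER_REASON.lower() in lowered:
            hits["long_waiter"].append(line.rstrip("\n"))
        if RECOVERY_FAILURE.lower() in lowered:
            hits["recovery_failure"].append(line.rstrip("\n"))
    return hits


# Made-up sample input; the layout is an assumption for illustration only.
sample = [
    "thread 101 waiting 912.4 sec, reason 'wait for working index entry to be committed'",
    "thread 202 waiting 0.3 sec, reason 'some other reason'",
    "MDI VCD recovery failure in build used metaslot bitmaps: 16",
]

hits = find_symptoms(sample)
print(len(hits["long_waiter"]), len(hits["recovery_failure"]))  # 1 1
```

A match on the waiter string alone is not proof of this defect, since
the same waiter can have other causes.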

Platforms affected: N/A

Functional Area affected: GNR/ECE/ESS3K

Customer Impact: High Importance

Temporary fix

Comments

APAR Information

APAR number: IJ26348
Reported component name: SPEC SCALE STD
Reported component ID: 5737F33AP
Reported release: 505
Status: CLOSED PER
PE: NoPE
HIPER: NoHIPER
Special Attention: NoSpecatt / Xsystem
Submitted date: 2020-07-22
Closed date: 2020-07-22
Last modified date: 2020-07-22

APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:

Fix information

Fixed component name: SPEC SCALE STD
Fixed component ID: 5737F33AP

Applicable component levels

Product: IBM Spectrum Scale (STXKQY)
Version: 505
Platform: Platform Independent (PF025)
Line of Business: Storage (LOB26)
Business Unit: IBM Infrastructure w/TPS (BU058)

Document Information

Modified date:
23 July 2020
