APAR status
Closed as program error.
Error description
ABSTRACT: Readmit failed to fix stale strips with pdisk in transient pathWait state
Error Description: Rebalance/readmit fails to fix all stale strips when the pdisk state is in transient pathWait.
Reported in: Spectrum Scale 5.1.7.1
Known Impact: Stale strips are left for some vtracks. If new pdisk failures occur, vtrack reconstruction fails to fix the affected vtrack, which causes an access error.
Verification steps: If the DA is already in scrub state, run this command on each ECE IO node:
mmfsadm dump vdisk | grep 'vQueue. pQueue.' | grep -iv "count 0"
If it shows non-zero counts, you have hit this problem.
Recovery action: N/A
Local Fix: N/A
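The verification step above can be wrapped in a small helper so the same check is easy to repeat on every ECE IO node. This is a minimal sketch, not an IBM-provided tool: the function names nonzero_queue_lines and has_stale_queue_counts are hypothetical, and the two grep filters are copied as printed in this APAR.

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical, not Spectrum Scale commands.

# Reads "mmfsadm dump vdisk" output on stdin and prints the lines that
# match the APAR's pattern and do not contain "count 0".
nonzero_queue_lines() {
    grep 'vQueue. pQueue.' | grep -iv "count 0"
}

# Exits 0 when at least one such line is present on stdin, 1 otherwise.
has_stale_queue_counts() {
    [ -n "$(nonzero_queue_lines)" ]
}

# Intended use on each ECE IO node:
#   mmfsadm dump vdisk | nonzero_queue_lines
#   mmfsadm dump vdisk | has_stale_queue_counts && echo "possible hit"
```

Reading from stdin keeps the filter separate from the dump command, so a saved dump file can be checked the same way as live output.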
Local fix
Problem summary
When the daemon restarts on a worker node, a race condition can cause the worker's local state change to take place after GNR's readmit operation, which is intended to repair tracks with stale data. The delayed state change can cause the readmit operation to fail to repair the data on the given disks, leaving stale sectors in tracks that could have been repaired once the delayed state change had taken place. If more disks fail before the next cycle of scan and repair operations has a chance to repair these vtracks, data loss can result if the number of faults exceeds the fault tolerance of the vdisk.
Problem conclusion
This problem is fixed in 5.1.9.5.
To see all Spectrum Scale APARs and their respective fix solutions, refer to: https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html
Benefits of the solution: The code was fixed to avoid the race condition.
Work Around: Until the fix is installed, manually verify whether any vtracks are stuck in stale state.
Problem trigger: Daemon restart on an individual ECE node or on a shared ESS node (although much less likely there), followed by more failing disks.
Symptom: Daemon crash
Platforms affected: All
Functional Area affected: GNR
Customer Impact: High Importance
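To decide whether a node still needs the manual workaround, its installed level can be compared with the fix level 5.1.9.5. The following is a minimal sketch under stated assumptions: fix_level_reached is a hypothetical helper name, the installed version string must be supplied by the caller, and the comparison relies on GNU "sort -V". The result is only meaningful within the 5.1 release stream; for other streams, consult the APAR list page.

```shell
#!/bin/sh
# Sketch only: fix_level_reached is a hypothetical helper, not a Spectrum Scale command.
# Exits 0 when the version passed as $1 is at or above the fix level 5.1.9.5
# named in this APAR, 1 otherwise. Uses version sort so that 5.1.9.10 > 5.1.9.5.
fix_level_reached() {
    fix_level="5.1.9.5"
    lowest=$(printf '%s\n%s\n' "$fix_level" "$1" | sort -V | head -n 1)
    [ "$lowest" = "$fix_level" ]
}

# Example:
#   fix_level_reached 5.1.7.1 || echo "fix not installed; run the manual check"
```

Version sort is used instead of plain string comparison because a four-part level such as 5.1.9.10 would otherwise sort before 5.1.9.5.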
Temporary fix
Comments
APAR Information
APAR number
IJ49862
Reported component name
SPEC SCALE STD
Reported component ID
5737F33AP
Reported release
517
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2024-01-26
Closed date
2024-07-22
Last modified date
2024-07-22
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
SPEC SCALE STD
Fixed component ID
5737F33AP
Applicable component levels
Document Information
Modified date:
23 July 2024