A fix is available
APAR status
Closed as program error.
Error description
During a GDPS site failure, the primary sysplex CDS volume was lost and some OS LPARs in the sysplex were also lost. IXC256A was issued naming the LPARs that had not responded to the CDS switch. A V XCF,lpar,OFFLINE command was issued, but the IXC371D confirmation prompt was never seen, so the LPARs named in IXC256A could not be removed.

The SADUMP provided showed that the task processing the V XCF,lpar,OFFLINE command in ASID(1) was waiting for the WTOR ID from the console task. The console task was running in IEAVM616 (UA76732), had invoked an XCF service to read the console group record, and was waiting for that read to complete. This CDS READ request was made shortly before XCF declared the loss of the sysplex CDS (i.e. QUCB_GLBComm_NoDS). Had the request been made after that point, the console task could have used an alternative path to produce the reply ID.

ANALYSIS:
Sysplex CDS switching cannot be completed because of the LPARs at the failed site, and none of those LPARs can be removed by GDPS control because IXC371D cannot be issued without a reply ID. Sysplex operation is therefore completely stopped.

KNOWN IMPACT:
Because the LPARs named in IXC256A cannot be removed from the sysplex, sysplex CDS switching cannot be completed. Sysplex-wide operation is impacted or stopped. For example, a CFRM CDS switch cannot take place, and no structure rebuild can occur for structures lost on the CF that resides at the failed site.

VERIFICATION STEPS:
1) In the COUPLE SERIAL output, look for the outstanding requests and find the request related to:
   Request ID: 000xxxxx
   Request Type: 00000000
   Record Type/Number: CONSOLE 00000001
   Record Subtype/Number: N/A
   Ownership: Global Waiter Owning System: N/A
   ASID: x'000A'
   TCB Address: 00tttttt
   Diag002: 08040005
   Diag054: 00000000 0000138B
2) Review the task structure under TCB tttttt in SUMMARY FORMAT ASID(x'A') to locate the linkage stack entry with PC 00B02, then confirm that the PSWE address is within IEAVM616, reading the group record.
3) In the MSGCACHE output, search for 'V XCF,' or 'VARY XCF,' to locate the offline commands, then search for IXC371D and confirm that it is not found.
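Verification step 3 can be scripted once the MSGCACHE output has been saved as plain text. The Python sketch below is illustrative only. The function name check_msgcache, the assumption that the output is available as a list of text lines, and the sample lines (including the system name SYSB) are hypothetical and are not part of this APAR.

```python
import re

# Matches the two command prefixes named in verification step 3.
VARY_XCF = re.compile(r"\b(?:V|VARY) XCF,", re.IGNORECASE)


def check_msgcache(lines):
    """Scan saved MSGCACHE text for VARY XCF commands and IXC371D."""
    vary_commands = [ln.strip() for ln in lines if VARY_XCF.search(ln)]
    confirmations = [ln.strip() for ln in lines if "IXC371D" in ln]
    return {
        "vary_commands": vary_commands,
        "confirmations": confirmations,
        # Symptom described here: commands were issued, no IXC371D followed.
        "symptom_match": bool(vary_commands) and not confirmations,
    }


if __name__ == "__main__":
    # Hypothetical sample lines; real MSGCACHE output is formatted differently.
    sample = [
        "IXC256A (message text omitted)",
        "V XCF,SYSB,OFFLINE",
        "VARY XCF,SYSB,OFFLINE",
    ]
    result = check_msgcache(sample)
    print(result["vary_commands"])
    print("symptom match:", result["symptom_match"])
```

A symptom match only narrows the search. Steps 1 and 2 are still needed to confirm that the console task is waiting on the CDS read.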
Local fix
BYPASS/CIRCUMVENTION: No bypass.
RECOVERY ACTION: Plex-wide IPL to recover all LPARs.
Problem summary
****************************************************************
* USERS AFFECTED:                                              *
*                 Installations exploiting sysplex with a      *
*                 sysplex couple data set (CDS) formatted for *
*                 9 or more systems at z/OS V2R4 (HBB77C0)     *
*                 and above.                                   *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
*                      Deadlock trying to remove both a        *
*                      system from the sysplex and a           *
*                      sysplex CDS, when partitioning          *
*                      tries to issue MSGIXC371D and           *
*                      WTOR processing is unable to read       *
*                      the required data from the              *
*                      sysplex CDS.                            *
*                                                              *
*                      SYSPLEXDS                               *
****************************************************************
* RECOMMENDATION:                                              *
*                 Install the applicable PTF on each system    *
*                 in the sysplex. A rolling IPL is             *
*                 sufficient to activate the fix.              *
****************************************************************

Deadlock can occur in a scenario like the following:
o A system issues a VARY XCF command to remove an unresponsive system from the sysplex. XCF initiates WTOR IXC371D to confirm the request.
o Consoles attempts to read its record in the sysplex CDS as part of WTOR processing. The read of the primary sysplex CDS encounters I/O delays.
o With the Consoles read request in progress but delayed, XCF Serialization recognizes a device error and begins processing to remove the primary sysplex CDS.
o The outgoing unresponsive system does not participate in the CDS removal process.

At this point, we have a three-way deadlock between Serialization, Partitioning, and Consoles:
o Removal of the primary sysplex CDS cannot progress because a system has failed to send the required participation signals.
o The VARY command to remove the unresponsive system is hung trying to issue MSGIXC371D because the Consoles task responsible for obtaining a reply ID is suspended waiting for the CDS read to complete.
o The CDS read cannot be processed until removal of the primary sysplex CDS completes (at which time the read would be able to access the surviving alternate CDS).
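The three-way deadlock above can be pictured as a cycle in a wait-for graph. The Python sketch below is only a model of the dependencies described in this summary, not XCF code. The function find_cycle and the dictionary layout are assumptions made for illustration. The second graph models the effect of failing the delayed read so that Consoles no longer waits.

```python
def find_cycle(waits_for):
    """Return one cycle in a wait-for graph as a list of nodes, or None."""
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Back edge found: slice out the cycle and close it.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for nxt in waits_for.get(node, ()):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in waits_for:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# Each component waits for the component it maps to.
before_fix = {
    "Serialization": ["Partitioning"],  # needs the unresponsive system removed
    "Partitioning": ["Consoles"],       # needs a reply ID for IXC371D
    "Consoles": ["Serialization"],      # CDS read waits for CDS removal
}

# With the read failed early, Consoles has nothing left to wait for.
after_fix = dict(before_fix, Consoles=[])

if __name__ == "__main__":
    print(find_cycle(before_fix))
    print(find_cycle(after_fix))
```

Removing any single edge breaks the cycle. The model only shows why releasing the Consoles wait is sufficient.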
Problem conclusion
During removal of the sysplex CDS type, fail in-progress read requests against the applicable sysplex CDS record with a new return / reason code combination indicating that the read has been bypassed to avoid deadlock.
o After reporting failure, XCF will allow the in-flight request to complete asynchronously when CDS removal completes so that it can clean up resources associated with the request.
o On receipt of the failing return / reason code, Consoles will exploit existing processing that allows it to process the WTOR without information from the CDS. It may reuse a previous reply ID or issue the WTOR with reply ID 0 in this scenario.

There are no publication updates required for this APAR. However, MSGIEA402A has an existing text fill-in of the form
  XCF RETURN CODE xxxxxxxx, REASON CODE yyyyyyyy
This APAR introduces new return / reason code combination 0Cx / 20x (RC0C RSN20) indicating that a couple data set access request has been bypassed to avoid a potential deadlock.
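If saved message text is being reviewed after the fix is applied, the new code combination can be picked out of the MSGIEA402A fill-in with a short script. The Python sketch below is a hedged illustration. It assumes that the xxxxxxxx and yyyyyyyy fields are hexadecimal values and that 0Cx / 20x denotes hex 0C and hex 20. The function name is_deadlock_bypass is hypothetical.

```python
import re

# Fill-in format quoted in this APAR; the hex field format is an assumption.
FILLIN = re.compile(
    r"XCF RETURN CODE\s+([0-9A-Fa-f]{1,8}),\s*REASON CODE\s+([0-9A-Fa-f]{1,8})"
)


def is_deadlock_bypass(message_text):
    """True if the fill-in carries return code 0C with reason code 20."""
    match = FILLIN.search(message_text)
    if match is None:
        return False
    return_code = int(match.group(1), 16)
    reason_code = int(match.group(2), 16)
    return (return_code, reason_code) == (0x0C, 0x20)


if __name__ == "__main__":
    # Only the fill-in portion is shown; the rest of the message is omitted.
    print(is_deadlock_bypass("XCF RETURN CODE 0000000C, REASON CODE 00000020"))
    print(is_deadlock_bypass("XCF RETURN CODE 00000008, REASON CODE 00000020"))
```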
Temporary fix
*********
* HIPER *
*********
Comments
APAR Information
APAR number
OA53790
Reported component name
GRS
Reported component ID
5752SCSDS
Reported release
7A0
Status
CLOSED PER
PE
NoPE
HIPER
YesHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2017-08-31
Closed date
2021-11-09
Last modified date
2021-12-01
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
UJ07064 UJ07065
Modules/Macros
IXCL1RS IXCL1SWT IXCL1QUE IXCYCON IXCL1TSK IXCS2TSK IXCL1PCD IXCF1TF3 IXCL1ERE IXCL1RED IXCF1TX2 IXCF1SCF IXCL1PCX IXCL1UNL IXCE1TNM
Fix information
Fixed component name
XCF
Fixed component ID
5752SCXCF
Applicable component levels
R7C0 PSY UJ07064
UP21/11/24 P F111
Fix is available
Select the PTF appropriate for your component level. You will be required to sign in. Distribution on physical media is not available in all countries.
Document Information
Modified date:
06 December 2021