IBM Support

OA53790: IXC256A RELATED TO SYSPLEX CDS SWITCHING, THEN V XCF,XXX,OFFLINE ISSUED BUT NO IXC371D

A fix is available


APAR status

  • Closed as program error.

Error description

  • During a GDPS site failure, the primary sysplex CDS volume was
    lost, and some OS LPARs in the plex were also lost. IXC256A was
    issued naming the LPARs that had not responded to the CDS
    switch. A V XCF,lpar,OFFLINE command was issued, but the
    IXC371D confirmation prompt was never seen, so the LPARs named
    in IXC256A could NOT be removed.
    The SADUMP provided showed that the task processing the
    V XCF,lpar,OFFLINE command in asid(1) was waiting for the
    console task to supply the WTOR ID. The console task was
    running in IEAVM616 (UA76732), had invoked an XCF service to
    read the console group record, and was waiting for that read.
    This CDS READ request was made shortly before XCF declared
    the loss of the sysplex CDS (i.e. QUCB_GLBComm_NoDS). Otherwise,
    the console task could have used an alternative path to produce
    the reply ID.
    
    ANALYSIS:
    Sysplex CDS switching cannot be completed for the LPARs at the
    failed site, and none of them can be removed by GDPS control
    because IXC371D can NOT be issued without a reply ID. As a
    result, plex operation is completely stopped.
    
    KNOWN IMPACT:
    Because the LPARs named in IXC256A can NOT be removed from the
    plex, sysplex CDS switching can NOT be completed. Plexwide
    operation is impacted or stopped: for example, the CFRM CDS
    switch can NOT take place, and no structure rebuild can occur
    for structures lost on the CF that resides at the failed site.
    
    VERIFICATION STEPS:
    1) In the COUPLE SERIAL output, look for outstanding requests
    and find the request related to:
                Request ID:  000xxxxx
              Request Type:  00000000
        Record Type/Number:  CONSOLE  00000001
     Record Subtype/Number:  N/A
                 Ownership:  Global Waiter
             Owning System:  N/A
                      ASID:  x'000A'
               TCB Address:  00tttttt
                   Diag002:  08040005
                   Diag054:  00000000 0000138B
    
    2) Review the task structure under the TCB tttttt in SUMMARY
    FORMAT ASID(x'A') to locate the linkage stack entry with PC
    00B02 then confirm the PSWE address within IEAVM616 for reading
    the group record.
    
    3) In the MSGCACHE output, search for 'V XCF,' or 'VARY XCF,' to
    locate the offline commands, then search for IXC371D and confirm
    that it is not found.
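
Verification step 3 can be approximated with a small script once the MSGCACHE output has been saved as text. The following is a minimal sketch only: the function name `summarize` and the sample lines are hypothetical, and it assumes each message or command occupies one line of the saved output, which this APAR does not specify.

```python
import re

# Matches 'V XCF,<name>,OFFLINE' or 'VARY XCF,<name>,OFFLINE', ignoring case.
VARY_OFFLINE = re.compile(r"\b(?:V|VARY)\s+XCF,([^,\s]+),OFFLINE\b", re.IGNORECASE)


def summarize(lines):
    """Collect VARY XCF offline targets and count IXC371D occurrences."""
    targets = []
    confirmations = 0
    for line in lines:
        match = VARY_OFFLINE.search(line)
        if match:
            targets.append(match.group(1).upper())
        if "IXC371D" in line:
            confirmations += 1
    return {
        "vary_offline_targets": targets,
        "ixc371d_count": confirmations,
        # Symptom of this APAR: offline commands were issued but no IXC371D followed.
        "symptom_match": bool(targets) and confirmations == 0,
    }


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from a real dump.
    sample = ["IXC256A (hypothetical message text)", "V XCF,SYSB,OFFLINE"]
    print(summarize(sample))
```

A `symptom_match` of True only indicates that the saved text is consistent with step 3; steps 1 and 2 are still needed to confirm the problem.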
    

Local fix

  • BYPASS/CIRCUMVENTION:
    no bypass
    
    RECOVERY ACTION:
    Plexwide IPL to recover all LPARs
    

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * Installations exploiting sysplex with a                      *
    * sysplex couple data set (CDS) formatted for                  *
    * 9 or more systems at z/OS V2R4 (HBB77C0)                     *
    * and above.                                                   *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * Deadlock trying to remove both a                             *
    * system from the sysplex and a                                *
    * sysplex CDS, when partitioning                               *
    * tries to issue MSGIXC371D and                                *
    * WTOR processing is unable to read                            *
    * the required data from the                                   *
    * sysplex CDS.                                                 *
    *                                                              *
    * SYSPLEXDS                                                    *
    ****************************************************************
    * RECOMMENDATION:                                              *
    * Install the applicable PTF on each system                    *
    * in the sysplex.  A rolling IPL is                            *
    * sufficient to activate the fix.                              *
    ****************************************************************
    Deadlock can occur in a scenario like the following:
    
    o A system issues a VARY XCF command to remove an unresponsive
      system from the sysplex.  XCF initiates WTOR IXC371D to
      confirm the request.
    
    o Consoles attempts to read its record in the sysplex CDS as
      part of WTOR processing.  The read of the primary sysplex CDS
      encounters I/O delays.
    
    o With the Consoles read request in progress but delayed, XCF
      Serialization recognizes a device error and begins processing
      to remove the primary sysplex CDS.
    
    o The outgoing unresponsive system does not participate in the
      CDS removal process.
    
    At this point, we have a three-way deadlock between
    Serialization, Partitioning, and Consoles:
    
    o Removal of the primary sysplex CDS cannot progress because a
      system has failed to send the required participation signals.
    
    o The VARY command to remove the unresponsive system is hung
      trying to issue MSGIXC371D because the Consoles task
      responsible for obtaining a reply ID is suspended waiting for
      the CDS read to complete.
    
    o The CDS read cannot be processed until removal of the primary
      sysplex CDS completes (at which time the read would be able to
      access the surviving alternate CDS).
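
The three-way deadlock above can be pictured as a cycle in a wait-for graph. The sketch below is illustrative only and is not XCF code: the dictionary layout and the function `find_cycle` are hypothetical, and the direction of each edge is one reading of the three bullets (Serialization waits on Partitioning to remove the silent system, Partitioning waits on Consoles for a reply ID, Consoles waits on Serialization to finish CDS removal).

```python
def find_cycle(waits_for, start):
    """Follow single 'waits for' edges from start; return the cycle found, or None."""
    path = []
    position = {}
    node = start
    while node is not None and node not in position:
        position[node] = len(path)
        path.append(node)
        node = waits_for.get(node)
    if node is None:
        return None  # The chain ended at a component that is not waiting.
    return path[position[node]:]


# Wait-for edges as described in the scenario (component names from the APAR text).
before_fix = {
    "Serialization": "Partitioning",  # CDS removal needs the silent system gone
    "Partitioning": "Consoles",       # VARY needs a reply ID for IXC371D
    "Consoles": "Serialization",      # CDS read needs CDS removal to complete
}

# With the in-progress read failed back to Consoles, Consoles no longer waits.
after_fix = {k: v for k, v in before_fix.items() if k != "Consoles"}

if __name__ == "__main__":
    print(find_cycle(before_fix, "Partitioning"))
    print(find_cycle(after_fix, "Partitioning"))
```

Removing any one edge breaks the cycle; the edge removed in `after_fix` corresponds to the approach described under Problem conclusion, where the delayed read is failed rather than left waiting.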
    

Problem conclusion

  • During removal of the sysplex CDS type, fail in-progress read
    requests against the applicable sysplex CDS record with a new
    return / reason code combination indicating that the read has
    been bypassed to avoid deadlock.
    
    o After reporting failure, XCF will allow the in-flight request
      to complete asynchronously when CDS removal completes so that
      it can clean up resources associated with the request.
    
    o On receipt of the failing return / reason code, Consoles will
      exploit existing processing that allows it to process the WTOR
      without information from the CDS.  It may reuse a previous
      reply ID or issue the WTOR with reply ID 0 in this scenario.
    
    There are no publication updates required for this APAR.
    However, MSGIEA402A has an existing text fillin of the form
    
    XCF RETURN CODE xxxxxxxx, REASON CODE yyyyyyyy
    
    This APAR introduces a new return / reason code combination
    0Cx / 20x (RC0C RSN20) indicating that a couple data set
    access request has been bypassed to avoid a potential
    deadlock.
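
For automation that scans message text, the new combination could be recognized from the fill-in shown above. This is a minimal sketch under stated assumptions: the function name `is_deadlock_bypass` and the sample strings are hypothetical, and it assumes the xxxxxxxx and yyyyyyyy fields are printed as hexadecimal digits, which this APAR implies (0Cx / 20x) but does not state outright.

```python
import re

# Pattern for the fill-in 'XCF RETURN CODE xxxxxxxx, REASON CODE yyyyyyyy'.
FILLIN = re.compile(
    r"XCF RETURN CODE ([0-9A-F]{1,8}), REASON CODE ([0-9A-F]{1,8})",
    re.IGNORECASE,
)


def is_deadlock_bypass(text):
    """Return True if the text carries return code 0C and reason code 20 (hex)."""
    match = FILLIN.search(text)
    if match is None:
        return False
    return_code = int(match.group(1), 16)
    reason_code = int(match.group(2), 16)
    return (return_code, reason_code) == (0x0C, 0x20)


if __name__ == "__main__":
    # Hypothetical sample; only the fill-in wording comes from the APAR text.
    print(is_deadlock_bypass("XCF RETURN CODE 0000000C, REASON CODE 00000020"))
```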
    

Temporary fix

  • *********
    * HIPER *
    *********
    

Comments

APAR Information

  • APAR number

    OA53790

  • Reported component name

    GRS

  • Reported component ID

    5752SCSDS

  • Reported release

    7A0

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    YesHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2017-08-31

  • Closed date

    2021-11-09

  • Last modified date

    2021-12-01

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

    UJ07064 UJ07065

Modules/Macros

  • IXCL1RS  IXCL1SWT IXCL1QUE IXCYCON  IXCL1TSK IXCS2TSK IXCL1PCD
    IXCF1TF3 IXCL1ERE IXCL1RED IXCF1TX2 IXCF1SCF IXCL1PCX IXCL1UNL
    IXCE1TNM
    

Fix information

  • Fixed component name

    XCF

  • Fixed component ID

    5752SCXCF

Applicable component levels

  • R7C0 PSY UJ07064

       UP21/11/24 P F111 ¢

Fix is available

  • Select the PTF appropriate for your component level. You will be required to sign in. Distribution on physical media is not available in all countries.


Document Information

Modified date:
06 December 2021