Handling coupling facility connections that hang

If a connection to a coupling facility structure hangs, an operator can take a series of escalating actions to recover the connection.

About this task

When a member terminates abnormally, cross-system extended services for z/OS® (XES) puts the member's connections to coupling facility structures into a FAILING state. Each connection remains in the FAILING state until all surviving members of the group have responded to the XES Disconnected/Failed Connection (DiscFailConn) event for that structure. XES sends this event to each surviving member of the group so that the surviving members can take the necessary recovery actions in response to the failed member.

After all surviving members of the group perform the necessary recovery actions and provide DiscFailConn responses to XES for a given coupling facility structure, XES changes the failed member's connection status for that coupling facility structure from FAILING to FAILED PERSISTENT. The member can reconnect to the coupling facility structure during restart when the member's status is FAILED PERSISTENT.
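
The status transition described above can be pictured with a small model. This is only an illustrative sketch, not how XES is implemented: the class name, its fields, and the second surviving member `DB2_DB2N` are hypothetical and were introduced for the example.

```python
from dataclasses import dataclass, field

FAILING = "FAILING"
FAILED_PERSISTENT = "FAILED PERSISTENT"


@dataclass
class FailedConnection:
    """Illustrative model of one failed member's connection to one structure."""
    structure: str
    failed_member: str
    # Surviving members that still owe a DiscFailConn response.
    pending: set = field(default_factory=set)

    @property
    def status(self) -> str:
        # The connection stays FAILING until every survivor has responded.
        return FAILING if self.pending else FAILED_PERSISTENT

    def respond(self, survivor: str) -> None:
        """Record a DiscFailConn response from one surviving member."""
        self.pending.discard(survivor)

    def can_reconnect(self) -> bool:
        # Reconnecting is possible only once the status is FAILED PERSISTENT.
        return self.status == FAILED_PERSISTENT


# DB2_DB2N is a made-up second survivor, used only for illustration.
conn = FailedConnection("DB2GR0W_SCA", "DB2_DB2V", {"DB2_DB2M", "DB2_DB2N"})
conn.respond("DB2_DB2M")
print(conn.status, conn.can_reconnect())   # FAILING False
conn.respond("DB2_DB2N")
print(conn.status, conn.can_reconnect())   # FAILED PERSISTENT True
```

A single survivor that never responds keeps `pending` non-empty, which is the hung condition that the procedure in this topic addresses.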

When you restart the member immediately following a connection failure, the member can attempt to reconnect to a coupling facility structure while its connection is still in a FAILING state. If this occurs, XES denies the reconnect request with a 0C27 reason code. Db2 responds to this by entering a connection-retry loop until the connection succeeds or until it reaches the maximum retry count.

For the SCA, Db2 retries the connection up to 200 times, with a 3-second interval between attempts. For the group buffer pools, Db2 retries up to 5 times, with a 10-second interval between attempts. You might notice a message similar to the following message, which indicates a failed connection attempt:
IXL013I IXLCONN REQUEST FOR STRUCTURE DB2GR0W_SCA FAILED.
JOBNAME: DB2VMSTR ASID: 05E1 CONNECTION NAME: DB2_DB2V
IXLCONN RETURN CODE: 0000000C,   REASON CODE: 02010C27

The preceding message might be displayed multiple times while Db2 is in a connection-retry loop. This is normal.
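
To get a feel for how long the retry loop can run, the counts and intervals given above can be multiplied out. The following is a rough sketch under stated assumptions: it ignores the time each attempt itself takes, and the function names and the `try_connect` and `sleep` callables are hypothetical, not part of Db2.

```python
from typing import Callable

# (maximum attempts, seconds between attempts), as stated in the text above.
RETRY_POLICY = {"SCA": (200, 3), "GBP": (5, 10)}


def retry_window_seconds(kind: str) -> int:
    """Upper bound on the time spent waiting between attempts."""
    attempts, interval = RETRY_POLICY[kind]
    return attempts * interval


def connect_with_retry(kind: str,
                       try_connect: Callable[[], bool],
                       sleep: Callable[[float], None]) -> int:
    """Return the number of the attempt that succeeded, or 0 if none did."""
    attempts, interval = RETRY_POLICY[kind]
    for attempt in range(1, attempts + 1):
        if try_connect():
            return attempt
        sleep(interval)  # wait before the next attempt
    return 0


print(retry_window_seconds("SCA"))  # 600 seconds, about 10 minutes
print(retry_window_seconds("GBP"))  # 50 seconds
```

Under these assumptions, the SCA retry window is roughly 10 minutes and the group buffer pool window is under a minute.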

In rare cases, one or more surviving members of a group cannot provide the DiscFailConn response to XES for a given coupling facility structure. When this happens, XES issues a message similar to the following message for each member that does not respond within two minutes:
IXL041I CONNECTOR NAME: DB2_DB2M, JOBNAME: DB2MMSTR, ASID: 0086
HAS NOT RESPONDED TO THE DISCONNECTED/FAILED CONNECTION EVENT FOR
SUBJECT CONNECTION: DB2_DB2V.
DISCONNECT/FAILURE PROCESSING FOR STRUCTURE DB2GR0W_SCA
CANNOT CONTINUE.
MONITORING FOR RESPONSE STARTED: 08/08/2002 23:50:23.
DIAG: 0000 0000 00000000
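
When several members are involved, it can help to pull the unresponsive connector names out of captured console output. The sketch below is a hypothetical helper that assumes the message layout matches the sample shown above; real console output might wrap or prefix lines differently, so treat the pattern as a starting point only.

```python
import re
from typing import Dict, List

# Pattern based only on the sample message layout in this topic.
# The trailing E of STRUCTURE is optional to tolerate transcribed text.
IXL041I = re.compile(
    r"IXL041I\s+CONNECTOR NAME:\s*(?P<connector>\S+?),\s*"
    r"JOBNAME:\s*(?P<jobname>\S+?),\s*ASID:\s*(?P<asid>[0-9A-F]+)"
    r".*?SUBJECT CONNECTION:\s*(?P<subject>[^\s.]+)"
    r".*?FOR STRUCTURE?\s+(?P<structure>\S+)",
    re.DOTALL,
)


def unresponsive_connectors(log: str) -> List[Dict[str, str]]:
    """Return one dict per IXL041I message found in the captured text."""
    return [m.groupdict() for m in IXL041I.finditer(log)]


SAMPLE = """\
IXL041I CONNECTOR NAME: DB2_DB2M, JOBNAME: DB2MMSTR, ASID: 0086
HAS NOT RESPONDED TO THE DISCONNECTED/FAILED CONNECTION EVENT FOR
SUBJECT CONNECTION: DB2_DB2V.
DISCONNECT/FAILURE PROCESSING FOR STRUCTURE DB2GR0W_SCA
CANNOT CONTINUE.
"""

for hit in unresponsive_connectors(SAMPLE):
    print(hit["connector"], "owes a response for", hit["subject"],
          "on", hit["structure"])
```

For the sample, the helper reports that DB2_DB2M owes a response for DB2_DB2V on DB2GR0W_SCA.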

In extreme cases, the maximum number of connection retries might be reached. If this happens for the SCA, the failed member cannot restart, and Db2 issues a message similar to the following message:
DSN7506A  -DB2V DSN7LSTK
CONNECTION TO THE SCA STRUCTURE DB2GR0W_SCA FAILED.
 MVS IXLCONN RETURN CODE = 0000000C,
 MVS IXLCONN REASON CODE = 02010C27.
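
Both sample messages show the full reason code 02010C27, while the text refers to it as 0C27. The sketch below assumes that the low-order four hexadecimal digits are the part the text means; the function names are hypothetical, and the pattern is based only on the two message layouts shown in this topic.

```python
import re

# Matches "REASON CODE: 02010C27" (IXL013I) and "REASON CODE = 02010C27" (DSN7506A).
REASON = re.compile(r"REASON CODE\s*[:=]\s*([0-9A-Fa-f]{8})")


def low_halfword(reason_hex: str) -> str:
    """Return the low-order 16 bits of an 8-digit hex reason code."""
    return f"{int(reason_hex, 16) & 0xFFFF:04X}"


def is_failing_state_denial(message: str) -> bool:
    """True if the message carries a reason code ending in 0C27."""
    m = REASON.search(message)
    return bool(m) and low_halfword(m.group(1)) == "0C27"


print(low_halfword("02010C27"))  # 0C27
print(is_failing_state_denial(" MVS IXLCONN REASON CODE = 02010C27."))  # True
```

A check like this can separate the retry condition described in this topic from connection failures that have other causes.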

Procedure

To recover from coupling facility structure connections that hang:

  1. Save a dump of all Db2 and IRLM members with SDATA=(COUPLE,XESDATA) so that IBM® Support can determine what is causing the hung connections. See informational APAR II10850 for more information.
  2. Attempt a REBUILD of the lock structure.
    This can sometimes clear the condition that is causing the DiscFailConn response to hang. If the REBUILD of the lock structure works, XES issues a message similar to the following message for each group member as it provides the required DiscFailConn response:
    IXL043I CONNECTION NAME: DB2_DB2M, JOBNAME: DB2MMSTR, ASID: 0086
                    HAS PROVIDED THE REQUIRED RESPONSE. THE REQUIRED RESPONSE
                    FOR THE DISCONNECTED/FAILED CONNECTION EVENT
                    FOR SUBJECT CONNECTION DB2_DB2V,
                    STRUCTURE DB2GR0W_SCA IS NO LONGER EXPECTED.
    If the REBUILD does not work, proceed to step 3.
  3. Issue the D XCF,STR,STRNM=<strname>,CONNM=<conname> command for the structure or connector that is in the FAILING state.
    Alternatively, issue the D XCF,STR,STRNM=<strname>,CONNM=ALL command. Both commands display the status of the structures and connectors that are used by XES.

    If this command identifies the unresponsive members, skip to step 6. If it does not identify the unresponsive members, proceed to step 4.

  4. Attempt a structure REBUILD for the affected structure, if you have not already done this.
  5. If the REBUILD hangs, issue the D XCF,STR,STRNM=<strname> command to identify the unresponsive connector.

    This identifies the members that are unresponsive to the REBUILD. These members are probably the same members that are unresponsive to the DiscFailConn event.

  6. Cancel and recycle the unresponsive members.
    The STOP DB2 command might not work because internal Db2 processes are hung, so cancel IRLM or Db2 MSTR.

    As each member terminates, verify that XES issues message IXL043I to indicate that it no longer expects a DiscFailConn response from that member. When all members that owe responses have been stopped, all connections to the SCA should be ACTIVE or FAILED PERSISTENT.

  7. Issue the D XCF,STR,STRNM=<sca>,CONNM=ALL command to verify the status of the connections to the SCA.
  8. Restart all members with FAILED PERSISTENT connections.

    As each member successfully reconnects to the SCA, XES issues message IXL014I. If a problem still exists, proceed to step 9.

  9. Stop and restart the systems on which the unresponsive members are running. If restarting the system does not fix the unresponsive members, proceed to step 10.
  10. Cancel and recycle all connectors to the coupling facility structure. If a problem still exists, proceed to step 11.
  11. Stop and restart all systems.
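
The display commands in steps 3, 5, and 7, and the status check after step 6, can be sketched as small helpers. This is an illustration only: the function names are hypothetical, and the `states` dictionary is assumed to be filled in by hand from the command output, because the sketch does not try to parse the D XCF display. `DB2_DB2N` is a made-up connection name.

```python
from typing import Dict, List, Optional

# After step 6, every SCA connection should be in one of these states.
EXPECTED = {"ACTIVE", "FAILED PERSISTENT"}


def display_structure(strname: str, conname: Optional[str] = None) -> str:
    """Build the D XCF,STR command text used in the procedure."""
    cmd = f"D XCF,STR,STRNM={strname}"
    if conname is not None:
        cmd += f",CONNM={conname}"  # a connection name, or ALL
    return cmd


def stuck_connections(states: Dict[str, str]) -> List[str]:
    """Connections that are neither ACTIVE nor FAILED PERSISTENT."""
    return sorted(n for n, s in states.items() if s.upper() not in EXPECTED)


def members_to_restart(states: Dict[str, str]) -> List[str]:
    """Connections in FAILED PERSISTENT state, which step 8 restarts."""
    return sorted(n for n, s in states.items()
                  if s.upper() == "FAILED PERSISTENT")


print(display_structure("DB2GR0W_SCA", "DB2_DB2V"))
print(display_structure("DB2GR0W_SCA", "ALL"))

# Hand-entered example states; DB2_DB2N is hypothetical.
states = {"DB2_DB2M": "ACTIVE", "DB2_DB2V": "FAILING",
          "DB2_DB2N": "FAILED PERSISTENT"}
print(stuck_connections(states))   # ['DB2_DB2V']
print(members_to_restart(states))  # ['DB2_DB2N']
```

If `stuck_connections` still returns names after step 6, some member has not yet provided its response, and the later steps of the procedure apply.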