Recovery
Recovery actions occur only if RECOVERY is specified on the SYSPLEXMONITOR parameter of the GLOBALCONFIG statement. If the default setting, SYSPLEXMONITOR NORECOVERY, is in effect, the stack issues the message but takes no further action, even if the problem is not corrected.
The VARY TCPIP,,SYSPLEX,LEAVEGROUP command can be used to manually force the sysplex member to leave the TCP/IP sysplex group. When a stack leaves the TCP/IP sysplex group, messages EZZ9670E and EZD1170E are cleared. All other outstanding eventual action messages are cleared when their underlying condition is cleared (for example, when VTAM® is started). For information on the VARY TCPIP,,SYSPLEX command, see z/OS Communications Server: IP System Administrator's Commands.
If RECOVERY is specified on the SYSPLEXMONITOR parameter of the GLOBALCONFIG statement in the TCP/IP profile, and this stack is not the only member of the TCP/IP sysplex group, the stack leaves the TCP/IP sysplex group when one of the messages is issued. The one exception is EZZ9672E, which is only an OMPROUTE warning message; no action occurs unless the corresponding OMPROUTE message EZZ9678E is subsequently issued.
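The conditions in the two preceding paragraphs can be summarized as a small decision function. This is a hypothetical model for illustration only, not the stack's implementation: the names `SysplexAction` and `action_for_message` are invented, and the sketch assumes that every sysplex problem message other than EZZ9672E triggers the leave when RECOVERY is active and other members exist.

```python
from enum import Enum


class SysplexAction(Enum):
    """Hypothetical outcomes of a sysplex problem message (illustration only)."""

    MESSAGE_ONLY = "issue the message only"
    LEAVE_GROUP = "leave the TCP/IP sysplex group"


# EZZ9672E is only an OMPROUTE warning; no action follows unless EZZ9678E is issued.
WARNING_ONLY_MESSAGES = frozenset({"EZZ9672E"})


def action_for_message(msg_id: str, recovery: bool, member_count: int) -> SysplexAction:
    """Return the action taken when a sysplex problem message is issued.

    Simplifying assumptions: msg_id is one of the sysplex problem messages,
    and member_count is the number of stacks currently in the TCP/IP sysplex group.
    """
    if not recovery:
        # Default SYSPLEXMONITOR NORECOVERY: the message is issued and nothing else.
        return SysplexAction.MESSAGE_ONLY
    if msg_id in WARNING_ONLY_MESSAGES:
        # Warning only; a later EZZ9678E would be evaluated on its own.
        return SysplexAction.MESSAGE_ONLY
    if member_count <= 1:
        # The only member of the group does not leave it.
        return SysplexAction.MESSAGE_ONLY
    return SysplexAction.LEAVE_GROUP
```

For example, `action_for_message("EZZ9678E", recovery=True, member_count=3)` yields `LEAVE_GROUP`, while the same call for EZZ9672E yields `MESSAGE_ONLY`.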
To determine whether the stack is currently joined to a TCP/IP sysplex group, issue the DISPLAY TCPIP,,SYSPLEX,GROUP command. If the stack is not currently joined to a TCP/IP sysplex group, this command displays the following message:
EZZ8269I tcpstackname mvsname NOT A MEMBER OF A SYSPLEX GROUP
If the stack is currently joined, the name of the TCP/IP XCF group is displayed.
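Automation that reacts to the DISPLAY TCPIP,,SYSPLEX,GROUP response might test for the EZZ8269I text shown above. The following is a minimal sketch, assuming the command response has already been captured as text by some other means; the function name is hypothetical, and any response without the EZZ8269I "not a member" text is treated as joined, because the exact format of the joined response is not reproduced here.

```python
NOT_MEMBER_MSG_ID = "EZZ8269I"
NOT_MEMBER_TEXT = "NOT A MEMBER OF A SYSPLEX GROUP"


def is_joined(display_output: str) -> bool:
    """Interpret captured DISPLAY TCPIP,,SYSPLEX,GROUP output (hypothetical helper).

    Returns False if any line carries EZZ8269I with the NOT A MEMBER text,
    True otherwise. Tolerates prefixes (such as timestamps) before the message ID.
    """
    if not display_output.strip():
        raise ValueError("no command output to interpret")
    for line in display_output.splitlines():
        fields = line.split()
        if NOT_MEMBER_MSG_ID in fields and NOT_MEMBER_TEXT in line:
            return False
    return True
```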
From any member of the sysplex, use the D XCF,GROUP,groupname command to see the systems currently in the sysplex group. The groupname value is EZBTCPCS or, if subplexing is being used, EZBTvvtt, where vv is the VTAM XCF group ID suffix and tt is the TCP group ID suffix.
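The group naming rule above can be expressed as a short helper that also builds the D XCF,GROUP command text. This is a sketch under stated assumptions: the function names are hypothetical, each suffix is assumed to be a number rendered as two decimal digits, and the sketch requires both suffixes whenever subplexing is in use. Check the actual suffix rules in the z/OS Communications Server documentation before relying on it.

```python
from typing import Optional

DEFAULT_GROUP = "EZBTCPCS"


def tcp_xcf_group_name(vtam_suffix: Optional[int] = None,
                       tcp_suffix: Optional[int] = None) -> str:
    """Return the XCF group name used by the TCP/IP sysplex (hypothetical helper).

    Without subplexing the group is EZBTCPCS; with subplexing it is EZBTvvtt.
    Assumption: each suffix is rendered as two decimal digits.
    """
    if vtam_suffix is None and tcp_suffix is None:
        return DEFAULT_GROUP
    if vtam_suffix is None or tcp_suffix is None:
        raise ValueError("this sketch expects both suffixes when subplexing")
    for suffix in (vtam_suffix, tcp_suffix):
        if not 0 <= suffix <= 99:
            raise ValueError(f"suffix {suffix} does not fit in two digits")
    return f"EZBT{vtam_suffix:02d}{tcp_suffix:02d}"


def display_xcf_group_command(group: str) -> str:
    """Build the D XCF,GROUP command text for one group name."""
    return f"D XCF,GROUP,{group}"
```

With these assumptions, `display_xcf_group_command(tcp_xcf_group_name())` produces `D XCF,GROUP,EZBTCPCS`, and suffixes 2 and 5 produce the group name `EZBT0205`.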
If the RECOVERY option is specified and a TCP/IP stack initiates an automated recovery action by leaving the TCP/IP sysplex group, all local DVIPAs are deactivated and all the VIPADYNAMIC block definitions are saved. Any applications bound to dynamically created DVIPAs (VIPARANGE or MODDVIPA) receive the asynchronous error EUNATCH (3448), which indicates that the protocol required to support the address family is unavailable.
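The effects of an automated leave described above can be modeled as follows. This is a hypothetical illustration, not stack internals: the `Dvipa` and `LeaveResult` types, the origin labels, and the `leave_group` function are invented for the sketch, and it records EUNATCH (3448) only for applications bound to DVIPAs created through VIPARANGE or MODDVIPA, which is the case the text describes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

EUNATCH = 3448  # errno value quoted in the text
DYNAMICALLY_CREATED = frozenset({"VIPARANGE", "MODDVIPA"})


@dataclass
class Dvipa:
    address: str
    origin: str  # hypothetical label, for example "VIPADEFINE", "VIPARANGE", "MODDVIPA"
    bound_apps: List[str] = field(default_factory=list)


@dataclass
class LeaveResult:
    deactivated: List[str]
    saved_config: List[str]
    app_errors: List[Tuple[str, int]]


def leave_group(dvipas: List[Dvipa], vipadynamic_block: List[str]) -> LeaveResult:
    """Model the automated leave: deactivate every local DVIPA, save the
    VIPADYNAMIC definitions, and list applications that receive EUNATCH."""
    deactivated = [d.address for d in dvipas]
    saved_config = list(vipadynamic_block)  # kept so the stack can reprocess it on rejoin
    app_errors = [
        (app, EUNATCH)
        for d in dvipas
        if d.origin in DYNAMICALLY_CREATED
        for app in d.bound_apps
    ]
    return LeaveResult(deactivated, saved_config, app_errors)
```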
If internal problems prevent the removal of these resources, eventual action message EZZ9675E is issued, and the stack must be restarted before it can become part of the TCP/IP sysplex group again. If all DVIPAs are successfully removed, eventual action message EZZ9676E is issued, indicating that sysplex problem detection cleanup has succeeded. After a successful cleanup, the stack can rejoin the sysplex group and clear this message in either of two ways:
- If AUTOREJOIN was configured on the SYSPLEXMONITOR parameter of the GLOBALCONFIG statement, the stack automatically rejoins the group and reprocesses its saved VIPADYNAMIC configuration when all detected problems have been relieved. The AUTOREJOIN option is the recommended setting when the RECOVERY option is configured.
- Issue the VARY TCPIP,,SYSPLEX,JOINGROUP command to cause the stack to rejoin the group and reprocess its VIPADYNAMIC configuration.
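The paths back into the group after a leave can be summarized as a small decision function. This is a hypothetical sketch based only on the text above; `RejoinPath` and `rejoin_path` are invented names, and the model picks a single path, so it does not represent an operator issuing JOINGROUP when AUTOREJOIN is also configured.

```python
from enum import Enum


class RejoinPath(Enum):
    RESTART_STACK = "EZZ9675E: restart the stack"
    WAIT_FOR_AUTOREJOIN = "EZZ9676E: AUTOREJOIN waits until all problems are relieved"
    AUTO_REJOIN = "rejoin automatically and reprocess saved VIPADYNAMIC configuration"
    MANUAL_JOINGROUP = "issue VARY TCPIP,,SYSPLEX,JOINGROUP"


def rejoin_path(cleanup_succeeded: bool, autorejoin: bool,
                problems_outstanding: bool) -> RejoinPath:
    """Return how a stack that left the group gets back in (hypothetical model)."""
    if not cleanup_succeeded:
        # Resources could not be removed: only a stack restart allows rejoining.
        return RejoinPath.RESTART_STACK
    if not autorejoin:
        # Cleanup succeeded but no AUTOREJOIN: an operator must issue JOINGROUP.
        return RejoinPath.MANUAL_JOINGROUP
    if problems_outstanding:
        return RejoinPath.WAIT_FOR_AUTOREJOIN
    return RejoinPath.AUTO_REJOIN
```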
Recovery is the preferred method of operation because it allows other members of the TCP/IP sysplex to take over the functions of a failing member automatically, with no operator action required. IBM® Health Checker for z/OS® can be used to check whether the RECOVERY parameter has been specified when the IPCONFIG DYNAMICXCF or IPCONFIG6 DYNAMICXCF statement has been specified. For more details about IBM Health Checker for z/OS, see z/OS Communications Server: IP Diagnosis Guide.
There are, however, some environments and scenarios where this automated recovery action might not be desirable and should not be enabled:
- DVIPAs or distributed DVIPAs are defined, but no backup TCP/IP stacks are identified or no provisions are made to move the DVIPAs in cases of failure.
The basic premise of the automated recovery actions is that one or more other TCP/IP stacks in the sysplex can take over ownership of any DVIPAs owned by the failing TCP/IP stack. If this is not the case, carefully evaluate the benefits of designating backup TCP/IP stacks and implement a configuration that includes backup capabilities. If this is not possible or desirable, do not specify RECOVERY; automated recovery actions are disabled by default.
For example, you might use only DVIPAs that are always bound to a specific TCP/IP stack (that is, in place of static VIPAs). Because ownership of these DVIPAs can never be transferred automatically, triggering the automated recovery action has no value. In this scenario, consider not enabling the automated recovery function, or use static VIPAs instead, because static VIPAs are not affected by the automated recovery actions.
- Test environments where individual system images have very limited resources (CPU, storage, and so on).
This can include environments where z/OS runs as a second-level guest under z/VM®, or in LPARs with shared processors and very limited resources. Not enabling the automated recovery actions in these environments can help prevent unwanted recovery actions triggered by false positive conditions, such as severe resource shortages that are detected only because the environment is artificially constrained.
- Environments where VTAM or OMPROUTE is stopped for intervals longer than the TIMERSECS value specified on the SYSPLEXMONITOR parameter.
If your operating procedures provide for stopping VTAM or OMPROUTE for extended periods, consider disabling automated recovery processing before you stop these components and re-enabling it after you restart them. You can do this with the VARY TCPIP,,OBEYFILE command.
Alternatively, you can increase the TIMERSECS value to accommodate the longest period that you expect VTAM or OMPROUTE to be inactive during normal operating procedures. A drawback of this approach is that monitoring of other conditions, and the triggering of automatic recovery for them, becomes less responsive.
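Both options for planned VTAM or OMPROUTE outages can be sketched as a helper that renders GLOBALCONFIG SYSPLEXMONITOR statements for obey files, plus a check of TIMERSECS against the longest planned outage. The function names and the two module-level constants are hypothetical; the keywords RECOVERY, NORECOVERY, AUTOREJOIN, and TIMERSECS come from this section, and which sub-parameters can be changed with VARY TCPIP,,OBEYFILE should be confirmed in z/OS Communications Server: IP Configuration Reference.

```python
from typing import Optional


def sysplexmonitor_statement(recovery: bool,
                             autorejoin: Optional[bool] = None,
                             timersecs: Optional[int] = None) -> str:
    """Render a GLOBALCONFIG SYSPLEXMONITOR statement (hypothetical helper)."""
    parts = ["GLOBALCONFIG", "SYSPLEXMONITOR", "RECOVERY" if recovery else "NORECOVERY"]
    if autorejoin is not None:
        parts.append("AUTOREJOIN" if autorejoin else "NOAUTOREJOIN")
    if timersecs is not None:
        parts += ["TIMERSECS", str(timersecs)]
    return " ".join(parts)


def timersecs_covers_outage(timersecs: int, longest_outage_secs: int) -> bool:
    """True if TIMERSECS is longer than the longest planned VTAM or OMPROUTE outage."""
    return timersecs > longest_outage_secs


# Statements for two hypothetical obey files applied around a planned outage.
BEFORE_OUTAGE = sysplexmonitor_statement(recovery=False)
AFTER_OUTAGE = sysplexmonitor_statement(recovery=True)
```

Each statement would go in its own obey file data set, applied with VARY TCPIP,,OBEYFILE before VTAM or OMPROUTE is stopped and again after it is restarted; the data set names are site-specific and are not shown.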