Sysplex sympathy sickness

If you find that more than one system in the sysplex is experiencing problems, the first thing to check for is sysplex sympathy sickness. Sysplex sympathy sickness refers to a state where one unhealthy, non-responsive system impacts the health of other systems in the sysplex. For example, sysplex sympathy sickness might occur as a result of a system hanging while holding global resources or not completing sysplex partitioning.

XCF detects and reports that a system is hanging when that system has not updated the sysplex couple data set at regular intervals. The system issues message IXC427A, IXC426D, or IXC101I indicating ‘status update missing’ for the hanging system. If the system cannot be successfully fenced, the system may also issue write to operator with reply (WTOR) message IXC102A, IXC402D, or IXC409D, prompting the operator to reset the sick system and reply ‘DOWN’. If the ‘DOWN’ reply is delayed, other systems usually experience sympathy sickness and show the symptoms listed under Symptoms.

Symptoms

Symptoms of sysplex sympathy sickness include:
  • Message ISG361A indicating GRS list lock contention
  • Message $HASP263 indicating JES checkpoint contention
  • Message ISG633I indicating that GRS is running impaired
  • Messages IOS071I and IOS431I indicating START PENDING status for devices because of reserves held by the sick system
  • Global ENQ resource contention
  • Multiple XCF group members or structure connectors detected hung, with accompanying system messages such as IXC431I, IXC631I, IXC640E, IXL040E, IXL041E, and IXL045E
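
When a saved copy of the system log is available, the message IDs in the list above can be counted with a short script. The following Python sketch is illustrative only: the function name count_symptoms, the short descriptions in the table, and the sample log lines are assumptions made for this example, not actual message text or an IBM-supplied tool. It matches message IDs as whole tokens and makes no assumption about the rest of the log line format.

```python
import re
from collections import Counter

# Message IDs from the symptom list, each with a short description
# paraphrased from that list (not the actual message text).
SYMPTOM_MESSAGES = {
    "ISG361A": "GRS list lock contention",
    "$HASP263": "JES checkpoint contention",
    "ISG633I": "GRS running impaired",
    "IOS071I": "START PENDING status for a device",
    "IOS431I": "START PENDING status for a device",
    "IXC431I": "XCF group member or structure connector hung",
    "IXC631I": "XCF group member or structure connector hung",
    "IXC640E": "XCF group member or structure connector hung",
    "IXL040E": "XCF group member or structure connector hung",
    "IXL041E": "XCF group member or structure connector hung",
    "IXL045E": "XCF group member or structure connector hung",
}

# Match a message ID only when it is not embedded in a longer token.
_PATTERN = re.compile(
    r"(?<![A-Z0-9$])("
    + "|".join(re.escape(msg_id) for msg_id in SYMPTOM_MESSAGES)
    + r")(?![A-Z0-9])"
)


def count_symptoms(lines):
    """Count occurrences of each symptom message ID in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for msg_id in _PATTERN.findall(line):
            counts[msg_id] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical log lines, invented for illustration.
    sample = [
        "SYSB ISG361A sample text",
        "SYSB $HASP263 sample text",
        "SYSC IXC431I sample text",
    ]
    for msg_id, n in count_symptoms(sample).items():
        print(f"{msg_id} x{n}: {SYMPTOM_MESSAGES[msg_id]}")
```

Several different symptom messages appearing on more than one system at about the same time is a hint to look for a hung system, as described in the next section.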

How to investigate

Check for system message IXC101I followed by an outstanding WTOR message IXC102A, IXC402D, or IXC409D prompting the operator to reset the sick or hung system. Reset the system and reply ‘DOWN’ to the WTOR so that the system can be partitioned out of the sysplex. Sysplex partitioning cleanup of resources by various functions and products does not occur until after the operator replies ‘DOWN’ to the WTOR.
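
The check for IXC101I followed by one of the reset WTORs can be sketched in Python against a saved copy of the log. This is a minimal sketch under stated assumptions: the function name wtors_after_partitioning and the sample lines are hypothetical, and the script only looks for message IDs as whitespace-separated tokens. It cannot tell whether a WTOR is still outstanding; that must be confirmed on the console.

```python
PARTITIONING_MSG = "IXC101I"
RESET_WTORS = ("IXC102A", "IXC402D", "IXC409D")


def wtors_after_partitioning(lines):
    """Return (line_index, message_id) pairs for reset WTORs seen after IXC101I.

    WTORs that appear before the first IXC101I are ignored.
    """
    seen_partitioning = False
    hits = []
    for index, line in enumerate(lines):
        tokens = line.split()
        if PARTITIONING_MSG in tokens:
            seen_partitioning = True
            continue
        if seen_partitioning:
            for msg_id in RESET_WTORS:
                if msg_id in tokens:
                    hits.append((index, msg_id))
    return hits


if __name__ == "__main__":
    # Hypothetical log lines, invented for illustration.
    sample = [
        "SYSA IXC101I sample text",
        "SYSA *05 IXC102A sample text",
    ]
    for index, msg_id in wtors_after_partitioning(sample):
        print(f"line {index}: {msg_id} follows {PARTITIONING_MSG}")
```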

Best practices

  • Activate an SFM policy specifying ISOLATETIME(0) and CONNFAIL(YES) to automate sysplex partitioning.
  • Enable the system status detection (SSD) function SYSTATDETECT and configure the sysplex with the base control program internal interface (BCPii). The SSD partitioning protocol exploits BCPii interfaces using z/Series hardware services to determine the status of failed systems in a sysplex. If SSD determines that a system has failed, that system is automatically removed from the sysplex. SYSTATDETECT and BCPii almost eliminate the need for operator intervention and ensure that failing systems are removed promptly from the sysplex. See the topic on using the SSD partitioning protocol and BCPii in z/OS MVS Setting Up a Sysplex.