ISG177E and ISG178E recovery
If GRS seems hung after an ISG177E or ISG178E disruption message, use the process that follows to recover the sysplex.
- Issue D GRS and D XCF,S,ALL on all the systems to obtain the status of each system.
- GRS auto-restart processing might be in progress if any system has a GRS status of ACTIVE. Give GRS enough time, usually 4 to 6 minutes, to restart without manual intervention.
- If all the systems in the GRS display show either INACTIVE or
QUIESCE, issue XCF PATHIN and PATHOUT commands to ensure the entire sysplex has good connectivity. If XCF on any system is unable to deliver signals, one of the systems might not have proper status, which keeps GRS from restarting. Recover paths as necessary to ensure that all systems have good connectivity.
- If the sysplex has good connectivity, yet all the systems in the
GRS display still show either INACTIVE or QUIESCE, you can restart
the ring by manually driving the GRS group notification exits. To
do this, temporarily stop a system using either the hardware console
or the QUIESCE command.
Attention: Do not IPL.
- Stop the system for one GRS TOLINT interval. Restart the system
after the GRS TOLINT interval has expired. Restarting the system should
re-drive the GRS group notification exits. If stopping and restarting
a system does not restart GRS, then GRS on that system might not be
the problem. Pick a different system and try stopping and restarting
that system.
Note:
- If you are running with an SFM policy that will take a stopped
system out of the sysplex, stop the policy before stopping the system
by using:
SETXCF STOP,POLICY,TYPE=SFM
- If you are able to restart the ring, start the SFM policy using:
SETXCF START,POLICY,TYPE=SFM,POLNAME=XXXXX
where XXXXX is the name of the SFM policy.
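The decision flow in the steps above can be summarized in a short sketch. This is only an illustration in Python, not an IBM-supplied tool: the function name `next_step`, the `NextStep` values, and the idea of passing in statuses that an operator has transcribed from the D GRS output are all assumptions made for the example. It does not parse real command output, and it treats QUIESCE and QUIESCED as the same state.

```python
from enum import Enum


class NextStep(Enum):
    """Hypothetical labels for the actions described in the procedure."""
    WAIT_FOR_AUTO_RESTART = "wait for GRS auto-restart (usually 4 to 6 minutes)"
    RECOVER_XCF_PATHS = "recover XCF signalling paths"
    STOP_AND_RESTART_SYSTEM = "stop one system for a TOLINT interval, then restart it"
    NOT_COVERED = "status combination not covered by this procedure"


# Assumption: both spellings denote the same stalled state.
STALLED = {"INACTIVE", "QUIESCE", "QUIESCED"}


def next_step(grs_status, connectivity_ok):
    """Pick the next recovery action.

    grs_status: dict of system name -> GRS status read from D GRS.
    connectivity_ok: True if the XCF path checks showed good connectivity.
    """
    if not grs_status:
        raise ValueError("no system statuses supplied")
    statuses = {s.strip().upper() for s in grs_status.values()}
    # Any ACTIVE system means auto-restart might already be in progress.
    if "ACTIVE" in statuses:
        return NextStep.WAIT_FOR_AUTO_RESTART
    # All systems stalled: check connectivity first, then drive the exits.
    if statuses <= STALLED:
        if not connectivity_ok:
            return NextStep.RECOVER_XCF_PATHS
        return NextStep.STOP_AND_RESTART_SYSTEM
    return NextStep.NOT_COVERED


if __name__ == "__main__":
    print(next_step({"SYS1": "INACTIVE", "SYS2": "QUIESCE"}, True).value)
```

The sketch deliberately returns NOT_COVERED for any status the procedure does not mention, rather than guessing an action.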
If you have completed the steps above on all the systems and the D GRS
output still displays INACTIVE, you can restart the sysplex
using the process that follows.
- Use the hardware console or the QUIESCE command to temporarily stop the systems until only one is remaining.
Attention: Do not IPL.
- Use D XCF,S,ALL to check systems status. The XCF display output should show only one system as ACTIVE and the other systems as MONITOR-DETECTED STOP.
- When only one system is ACTIVE, wait a TOLINT interval until the remaining system restarts as a one system ring.
- Issuing a D GRS after waiting a TOLINT interval will show one system as ACTIVE and the other systems as QUIESCED.
- When the D GRS command displays an ACTIVE system, start the other systems to have them join the ring.
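The check in the second step, that exactly one system is ACTIVE and every other system is MONITOR-DETECTED STOP, can be expressed as a small predicate. As before, this is a hedged Python sketch: `only_one_system_active` is a hypothetical helper, and its input is assumed to be a mapping of system names to states that an operator copied from the D XCF,S,ALL display.

```python
def only_one_system_active(xcf_status):
    """Return True when exactly one system is ACTIVE and all others
    are MONITOR-DETECTED STOP.

    xcf_status: dict of system name -> state read from D XCF,S,ALL.
    """
    states = [s.strip().upper() for s in xcf_status.values()]
    active = states.count("ACTIVE")
    stopped = states.count("MONITOR-DETECTED STOP")
    # Any state other than the two expected ones makes the check fail.
    return active == 1 and active + stopped == len(states)


if __name__ == "__main__":
    sample = {"SYS1": "ACTIVE", "SYS2": "MONITOR-DETECTED STOP"}
    print(only_one_system_active(sample))
```

Only when this condition holds does the procedure move on to waiting a TOLINT interval for the one system ring to form.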
If GRS is not able to restart, obtain the following data before calling IBM® service:
- SYSLOG from all systems.
- LOGREC from all systems.
- SADUMP from any system that requires an IPL.
Use the following DUMP command and replies to obtain a dump of the primary and alternate
CDS:
DUMP COMM=(your dump title)
R x,ASID=(1,6,7,A),REMOTE=(SYSLIST=*(1,6,7,A),DSPNAME,SDATA),CONT
R y,DSPNAME=('XCFAS'.*,'GRS'.*),CONT
R z,SDATA=(COUPLE,XESDATA,GRSQ,RGN,ALLNUC,CSA,PSA,SQA,SUM,TRT),END
where x, y, and z are reply numbers.
Note: If a SADUMP of a critical system is not possible, take a console
dump.