Investigating why failover occurred
Because failover can be initiated by either the primary or the backup domain, it is important to identify which domain initiated failover.
Perform either of the following actions:
- Review the Virtual Domain log file ($NCHOME/log/precision/ncp_virtualdomain.DOMAIN.log) and the Event Gateway log file ($NCHOME/log/precision/ncp_g_event.DOMAIN.log).
- Review the ItnmHealthChk and ItnmFailover events in the Active Event List. (This is the simpler approach.)
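If you take the log file approach, a small script can narrow down the relevant entries. The following is a minimal sketch, not a product utility. It assumes that the log files are plain text and that relevant entries contain the word "failover", which is only a guess at the wording. The function names are hypothetical, and you should adjust the keyword to match the messages in your own logs.

```python
import os
import sys
from pathlib import Path


def failover_log_paths(domain, nchome=None):
    """Return the Virtual Domain and Event Gateway log paths for a domain."""
    # Falls back to the NCHOME environment variable when nchome is not given.
    base = Path(nchome or os.environ.get("NCHOME", "")) / "log" / "precision"
    return [
        base / f"ncp_virtualdomain.{domain}.log",
        base / f"ncp_g_event.{domain}.log",
    ]


def find_lines(path, keyword="failover"):
    """Yield (line_number, text) for lines containing the keyword, ignoring case."""
    # The default keyword is an assumption; the real log wording may differ.
    needle = keyword.lower()
    with open(path, errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if needle in line.lower():
                yield number, line.rstrip("\n")


if __name__ == "__main__":
    # Pass the domain name as the first argument.
    domain = sys.argv[1] if len(sys.argv) > 1 else "DOMAIN"
    for log in failover_log_paths(domain):
        if not log.exists():
            print(f"{log}: not found")
            continue
        for number, text in find_lines(log):
            print(f"{log.name}:{number}: {text}")
```

Run the script once with the primary domain name and once with the backup domain name, then compare the timestamps of the matching lines to see which domain reported the failover first.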
If the primary domain initiated failover, this indicates a failure of one of the primary domain processes. You can check the status of the processes by querying the database of the ncp_ctrl process. The serviceState field in the services.inTray database table shows the current operational state for each of the processes.
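The query against the ncp_ctrl database can be sketched as follows. This example only builds a command line for the OQL Service Provider; it does not run it. The ncp_oql options shown (-domain, -service, -query) and the service name ctrl are assumptions that you should verify against the command-line reference for your installed version. The function name is hypothetical, and depending on your configuration you might also have to supply credentials.

```python
import shlex


def build_ctrl_query_command(domain, query="select * from services.inTray;"):
    """Build an argument list that queries the ncp_ctrl database for a domain."""
    # Assumed ncp_oql options; confirm them for your installation before use.
    return ["ncp_oql", "-domain", domain, "-service", "ctrl", "-query", query]


if __name__ == "__main__":
    # Print a shell-quoted command that you can review and then run manually.
    print(shlex.join(build_ctrl_query_command("DOMAIN")))
```

In the query output, inspect the serviceState value for each process to find the one that is not in its normal operational state.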
If the backup domain initiated failover, this indicates a failure to route health check events through the system due to one of the following reasons:
- The primary domain did not raise a health check event (for example, because the primary server was down).
- The Probe for Tivoli Netcool/OMNIbus and the Event Gateway processes in the two domains are not all configured to access the same ObjectServer.
- The Event Gateway Failover plug-in is not enabled.
- The Probe for Tivoli Netcool/OMNIbus rules file has been modified such that the health check event does not contain the required information.
- The backup Event Gateway is not letting health check events through the nco2ncp filter.
Also ensure that Virtual Domain is configured (in the $NCHOME/etc/precision/CtrlServices.cfg file) to have a dependency on all processes listed in the $NCHOME/etc/precision/VirtualDomainSchema.cfg file.