Investigating why failover occurred

Because failover can be initiated by either the primary or the backup domain, it is important to identify which domain initiated failover.

To identify the initiating domain, perform either of the following actions:
  • Review the Virtual Domain log file ($NCHOME/log/precision/ncp_virtualdomain.DOMAIN.log) and the Event Gateway log file ($NCHOME/log/precision/ncp_g_event.DOMAIN.log).
  • Review the ItnmHealthChk and ItnmFailover events in the Active Event List. (This is the simpler approach.)
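
As a rough illustration of the log review in the first option, the following Python sketch builds the two log file paths and prints the lines that mention failover or health checks. The keyword pattern, the function names, the fallback NCHOME location, and the NCOMS domain name are assumptions made for illustration. The actual wording of log messages depends on the product version and logging level.

```python
import os
import re
from pathlib import Path

# Assumed search terms; adjust them to the messages your logs actually contain.
KEYWORDS = re.compile(r"failover|health", re.IGNORECASE)


def log_paths(nchome, domain):
    """Return the Virtual Domain and Event Gateway log paths for a domain."""
    log_dir = Path(nchome) / "log" / "precision"
    return [
        log_dir / f"ncp_virtualdomain.{domain}.log",
        log_dir / f"ncp_g_event.{domain}.log",
    ]


def matching_lines(lines, pattern=KEYWORDS):
    """Yield (line_number, text) for each line that matches the pattern."""
    for number, line in enumerate(lines, start=1):
        if pattern.search(line):
            yield number, line.rstrip("\n")


def scan_logs(nchome, domain):
    """Print matching lines from each log file that exists."""
    for path in log_paths(nchome, domain):
        if not path.is_file():
            print(f"{path}: not found")
            continue
        with path.open(errors="replace") as handle:
            for number, text in matching_lines(handle):
                print(f"{path.name}:{number}: {text}")


if __name__ == "__main__":
    # The fallback path and the NCOMS domain name are placeholders.
    scan_logs(os.environ.get("NCHOME", "/opt/IBM/tivoli/netcool"), "NCOMS")
```

Run the script on both the primary and the backup server and compare the timestamps of the first matching lines to see which domain reacted first.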

If the primary domain initiated failover, one of the primary domain processes has failed. To check the status of the processes, query the database of the ncp_ctrl process. The serviceState field in the services.inTray database table shows the current operational state of each process.
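
The query against the ncp_ctrl database might be scripted as in the following Python sketch, which assembles an ncp_oql command line and returns its raw output. The option names (-domain, -service, -query, -username), the serviceName column, and the helper function names are assumptions; verify them against the ncp_oql reference for your installation. The sketch does not interpret serviceState values.

```python
import subprocess


def build_oql_command(domain, username=None):
    """Build an ncp_oql command line that lists process states in ncp_ctrl."""
    # serviceName is an assumed column; serviceState is the field of interest.
    query = "select serviceName, serviceState from services.inTray;"
    command = ["ncp_oql", "-domain", domain, "-service", "ctrl", "-query", query]
    if username:
        command += ["-username", username]
    return command


def run_query(domain, username=None):
    """Run the query and return whatever ncp_oql prints, unparsed."""
    result = subprocess.run(
        build_oql_command(domain, username),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

For example, run_query("NCOMS") would return the output for a domain named NCOMS, where NCOMS is a placeholder for the primary domain name.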

If the backup domain initiated failover, health check events were not routed through the system, for one of the following reasons:
  • The primary domain did not raise a health check event (for example, because the primary server was down).
  • The Probe for Tivoli Netcool/OMNIbus or the Event Gateway processes in the two domains are not configured to access the same ObjectServer.
  • The Event Gateway Failover plug-in is not enabled.
  • The Probe for Tivoli Netcool/OMNIbus rules file has been modified such that the health check event does not contain the required information.
  • The nco2ncp filter of the backup Event Gateway is blocking health check events.

Also ensure that Virtual Domain is configured (in the $NCHOME/etc/precision/CtrlServices.cfg file) to depend on every process listed in the $NCHOME/etc/precision/VirtualDomainSchema.cfg file.
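
The dependency check amounts to a set comparison, as in the following minimal Python sketch. It assumes that you have already copied the two lists of process names by hand from the configuration files. The function name and the sample values are hypothetical, and the sketch does not parse either file.

```python
def missing_dependencies(virtual_domain_depends_on, schema_processes):
    """Return schema processes absent from the Virtual Domain dependency list."""
    declared = set(virtual_domain_depends_on)
    return sorted(p for p in set(schema_processes) if p not in declared)


# Hypothetical values copied by hand from the two configuration files.
depends_on = ["ncp_poller", "ncp_g_event"]
schema = ["ncp_poller", "ncp_g_event", "ncp_model"]
print(missing_dependencies(depends_on, schema))  # ['ncp_model']
```

An empty result means that every process in the schema list is covered by the dependency list.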