Tracking failover of the Network Manager core processes

You can perform a number of actions and checks to verify that failover of the Network Manager core processes is operating as expected.

Tracking failover on startup

To ensure that the primary domain starts running as the active domain, start the primary domain and its Virtual Domain process before starting the backup domain. If the backup domain is started before the primary Virtual Domain process has started, the backup domain can become active, start polling the network, and raise health check problem events about the primary domain. This issue, however, resolves itself after the primary Virtual Domain starts and health check events are transmitted between the domains.

At startup, the topology and policies are copied from the primary domain to the backup domain. The backup domain, however, cannot become active (on failover) until it has initialized its topology. To verify that the topology has been initialized:
  • Check for a non-zero size topology cache file (Store.Cache.ncimCache.entityData.domain) in the $NCHOME/var/precision directory in the backup domain, where domain is the name of the current domain.
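The file check above can be scripted. The following is a minimal sketch, not a product utility: the function name is hypothetical, and it assumes that NCHOME is set in the environment (or passed in explicitly) and that the script runs on the backup domain server.

```python
import os
from pathlib import Path
from typing import Optional


def topology_cache_initialized(domain: str, nchome: Optional[str] = None) -> bool:
    """Return True if the topology cache file for `domain` exists and is non-empty.

    `nchome` defaults to the NCHOME environment variable (assumed to be set).
    """
    base = Path(nchome if nchome is not None else os.environ["NCHOME"])
    # File name pattern taken from the text: Store.Cache.ncimCache.entityData.<domain>
    cache = base / "var" / "precision" / f"Store.Cache.ncimCache.entityData.{domain}"
    return cache.is_file() and cache.stat().st_size > 0
```

For example, `topology_cache_initialized("NCOMS")` checks the cache file for a domain named NCOMS, which is a placeholder domain name used only for illustration.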

Event generation for startup: Monitor the Event Viewer for ItnmServiceState and ItnmFailoverConnection Network Manager events to verify that the Virtual Domain processes are running and that the TCP socket connection between them has been established:
  • After each local ncp_virtualdomain process starts, the ncp_ctrl process generates an ItnmServiceState resolution event.
  • When a TCP connection is established between the Virtual Domain processes, an ItnmFailoverConnection resolution event is generated.

Tracking failover when the system is in a steady state

Normal, steady-state failover behavior can be achieved only after the Virtual Domain processes in the primary and backup domains have started and connected. Steady-state behavior can be defined as follows:
  • The primary domain is active, and operates as if it were the sole domain. The discovery process discovers the network, the poller monitors it, and the Event Gateway enriches events.
  • The backup domain is in standby mode. Discovery is not initiated, and the poller keeps track of the policies configured in the primary domain, but does not poll any devices. The Event Gateway does not update events in the ObjectServer.

You can run OQL queries on each domain to check the status of processes:
  • You can check the status of individual Network Manager processes by querying the database of the ncp_ctrl process. All processes that are running without issue should have the setting serviceState = 4 in the services.inTray database table, to indicate that the service is alive and running.
  • The ncp_poller and ncp_g_event processes each have an associated config.failover database table, which identifies their current failover state. When running successfully in a steady state, these processes have the setting FailedOver = 0 in the config.failover OQL table in both domains. (The Virtual Domain process periodically updates the FailedOver field.)
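Once the rows of the services.inTray table have been retrieved, the serviceState check can be automated. The sketch below assumes the rows have already been exported as Python dictionaries; that representation, the function name, and the use of a serviceName key to identify each process are assumptions made for illustration. Only the value 4 is interpreted, because that is the only value described here.

```python
ALIVE_AND_RUNNING = 4  # serviceState value described above as "alive and running"


def services_not_running(rows):
    """Return the names of services whose serviceState is not 4.

    `rows` is an iterable of dicts with (assumed) keys
    "serviceName" and "serviceState".
    """
    return [
        row["serviceName"]
        for row in rows
        if row.get("serviceState") != ALIVE_AND_RUNNING
    ]


# Illustrative data only; the state value 0 is a placeholder, not a documented code.
sample_rows = [
    {"serviceName": "ncp_poller", "serviceState": 4},
    {"serviceName": "ncp_g_event", "serviceState": 0},
]
print(services_not_running(sample_rows))
```

An empty result means that every listed process reports serviceState = 4.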

Event generation while in a steady state: Each domain generates events about its state, based on the filters in the $NCHOME/etc/precision/VirtualDomainSchema.cfg file. These events are generated at the interval configured in the m_HealthCheckInterval field. Monitor the Event Viewer for ItnmHealthChk and ItnmDatabaseConnection Network Manager events to check whether the primary and backup domains are in good health:
  • Each domain generates ItnmHealthChk resolution events while it is healthy.
  • The primary domain generates an ItnmDatabaseConnection problem event if the connection to the primary NCIM database is lost. If the connection is not re-established within the time interval defined for the NCIM state.filters entry in the VirtualDomainSchema.cfg file, the primary domain generates an ItnmHealthChk problem event about the primary domain.
  • If the backup domain does not receive an ItnmHealthChk resolution event from the primary domain within the configured m_FailoverTime interval, the backup domain generates a synthetic ItnmHealthChk problem event on behalf of the primary domain.

If either the primary or backup domain generates an ItnmHealthChk problem event for the primary domain, failover is triggered, and the backup domain becomes active. If the primary domain is still running, it goes into standby mode.

Tip: For health check events, the Node field identifies the domain for which the health check event is generated. The Summary field identifies the domain raising the event and the domain the event is about.
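
The timeout rule that the backup domain applies to health check events from the primary domain can be modeled as a simple watchdog. This is an illustrative sketch of the rule as described above, not the Virtual Domain implementation; the class and method names are hypothetical, and the use of seconds as the time unit is an assumption.

```python
from dataclasses import dataclass


@dataclass
class HealthCheckWatchdog:
    """Models the backup domain waiting for primary ItnmHealthChk resolution events."""

    failover_time: float    # stands in for m_FailoverTime (assumed to be in seconds)
    last_resolution: float  # time at which the last resolution event was received

    def record_resolution(self, now: float) -> None:
        """Note that a resolution event from the primary domain arrived at `now`."""
        self.last_resolution = now

    def primary_timed_out(self, now: float) -> bool:
        """True once no resolution event has arrived within failover_time.

        In the behavior described above, this is the point at which the backup
        domain raises a synthetic problem event on behalf of the primary domain.
        """
        return now - self.last_resolution > self.failover_time
```

With `failover_time=300` and a last resolution event at time 0, `primary_timed_out(300)` is still false, and it becomes true for any later time until another resolution event is recorded.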

Tracking failover and failback

When failover occurs, the backup domain becomes active, the backup poller monitors the network, and the Event Gateway updates ObjectServer events. You can run OQL queries to check on the status of the ncp_poller and ncp_g_event processes. These processes each have an associated config.failover database table, which identifies their current failover state. When the backup domain is active, these processes have the setting FailedOver = 1 in the config.failover table, to indicate that they are in a failover state. (If the primary domain is still running, the associated processes are also assigned the value of FailedOver = 1.)
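
The FailedOver values read from the config.failover tables can be combined into a single verdict. The sketch below assumes the values have already been collected into a dictionary keyed by (domain role, process name); the function name, the dictionary layout, and the "inconsistent" label for mixed values are assumptions for illustration. Mixed values might simply reflect the period before the Virtual Domain process has updated every table.

```python
def classify_failover_state(flags):
    """Classify the failover state from FailedOver values.

    `flags` maps (domain_role, process_name) to the FailedOver value read
    from that process's config.failover table, for example
    {("primary", "ncp_poller"): 0, ("backup", "ncp_g_event"): 0}.
    """
    if not flags:
        raise ValueError("no FailedOver values supplied")
    values = set(flags.values())
    if values == {0}:
        return "steady"        # primary domain active, backup in standby
    if values == {1}:
        return "failed_over"   # backup domain active
    return "inconsistent"      # mixed or unexpected values; investigate or recheck
```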

When failback occurs, the backup domain goes into standby mode, and the primary domain becomes active again. The behavior on failback is analogous to the behavior on startup.

Event generation on failover and failback: Monitor the Event Viewer for ItnmHealthChk and ItnmFailover Network Manager events to confirm failover and failback behavior:
  • An ItnmHealthChk problem event about the primary domain indicates that failover has been triggered. A subsequent ItnmHealthChk resolution event about the primary domain indicates that failback has been triggered.
  • ItnmFailover events are generated to indicate when a Network Manager domain fails over or fails back. The event description states whether the domain is the primary or backup, and whether it has become active or gone into standby mode.
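
The interpretation of ItnmHealthChk events described above can be expressed as a small state tracker. This sketch assumes the events have been exported in time order as dictionaries; the keys "name", "type", and "node" and the function name are hypothetical. The "node" key corresponds to the Node field, which identifies the domain that the health check event is about. A resolution event counts as failback only if it follows a problem event, because a healthy primary domain also generates resolution events in steady state.

```python
def track_transitions(events, primary_domain):
    """Return the ordered list of "failover" / "failback" transitions.

    `events` is a time-ordered iterable of dicts with (assumed) keys:
    "name" (event name), "type" ("problem" or "resolution"), and
    "node" (the domain the event is about).
    """
    failed_over = False
    transitions = []
    for event in events:
        # Only health check events about the primary domain drive failover.
        if event["name"] != "ItnmHealthChk" or event["node"] != primary_domain:
            continue
        if event["type"] == "problem" and not failed_over:
            failed_over = True
            transitions.append("failover")
        elif event["type"] == "resolution" and failed_over:
            failed_over = False
            transitions.append("failback")
    return transitions
```

The resulting list can then be compared with the ItnmFailover events seen in the Event Viewer to confirm that each domain reported the expected change of state.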