Tracking failover of the Network Manager core processes
You can perform a number of actions and checks to verify whether failover of the Network Manager core processes is operating as expected.
Tracking failover on startup
To ensure that the primary domain starts running as the active domain, start the primary domain and its Virtual Domain process before starting the backup domain. If the backup domain is started before the primary Virtual Domain process has started, the backup domain can become active, start polling the network, and raise health check problem events about the primary domain. This issue, however, resolves itself after the primary Virtual Domain starts and health check events are transmitted between the domains.
- Check for a non-zero size topology cache file (Store.Cache.ncimCache.entityData.domain) in the $NCHOME/var/precision directory in the backup domain, where domain is the name of the current domain.
- After each local ncp_virtualdomain process starts, the ncp_ctrl process generates an ItnmServiceState resolution event.
- When a TCP connection is established between the Virtual Domain processes, an ItnmFailoverConnection resolution event is generated.
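The topology cache check in the list above can be scripted. The following is a minimal sketch, not a product utility: the function name `topology_cache_is_populated` and the `nchome` parameter are hypothetical, and the sketch assumes it runs on the backup domain server, where the cache file is readable.

```python
import os
from pathlib import Path
from typing import Optional


def topology_cache_is_populated(domain: str, nchome: Optional[str] = None) -> bool:
    """Return True if the topology cache file for `domain` exists and is non-empty.

    `nchome` defaults to the NCHOME environment variable (assumption: the
    variable is set in the shell that runs this check).
    """
    base = Path(nchome if nchome is not None else os.environ["NCHOME"])
    # File name pattern taken from the text: Store.Cache.ncimCache.entityData.domain
    cache = base / "var" / "precision" / f"Store.Cache.ncimCache.entityData.{domain}"
    return cache.is_file() and cache.stat().st_size > 0
```

For example, `topology_cache_is_populated("NCOMS")` checks the cache for a domain called NCOMS; the domain name here is only a placeholder.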
Tracking failover when the system is in a steady state
- The primary domain is active, and operating as if it is the sole domain. The discovery process discovers the network, which is monitored by the poller, and events are enriched by the Event Gateway.
- The backup domain is in standby mode. Discovery is not initiated, and the poller keeps track of the policies configured in the primary domain, but does not poll any devices. The Event Gateway also does not update events in the ObjectServer.
- You can check the status of individual Network Manager processes by querying the database of the ncp_ctrl process. All processes that are running without issue should have the setting serviceState = 4 in the services.inTray database table, to indicate that the service is alive and running.
- The ncp_poller and ncp_g_event processes each have an associated config.failover database table, which identifies their current failover state. When running successfully in a steady state, these processes have the setting FailedOver = 0 in the config.failover OQL table in both domains. (The Virtual Domain process periodically updates the FailedOver field.)
- Each domain generates ItnmHealthChk resolution events while it is healthy.
- The primary domain generates an ItnmDatabaseConnection problem event if connection to the primary NCIM database is lost. If the connection is not re-established within the time interval defined for the NCIM state.filters entry in the VirtualDomainSchema.cfg file, the primary domain generates an ItnmHealthChk problem event about the primary domain.
- If the backup domain does not receive an ItnmHealthChk resolution event from the primary domain within the configured m_FailoverTime interval, the backup domain generates a synthetic ItnmHealthChk problem event on behalf of the primary domain.
If either the primary or backup domain generates an ItnmHealthChk problem event for the primary domain, failover is triggered, and the backup domain becomes active. If the primary domain is still running, it goes into standby mode.
Tip: For health check events, the Node field identifies the domain for which the health check event is generated. The Summary field identifies the domain raising the event and the domain the event is about.
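The failover trigger described above can be modeled as two small checks. This is an illustrative sketch of the decision logic only, not the Virtual Domain implementation: the `HealthCheck` class and both function names are hypothetical, the interval is assumed to be in seconds, and the strict greater-than comparison at the boundary is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class HealthCheck:
    """Simplified ItnmHealthChk event (hypothetical model)."""
    node: str         # domain the event is about (the Node field)
    is_problem: bool  # True for a problem event, False for a resolution event


def heartbeat_missed(seconds_since_last_resolution: float, failover_time: float) -> bool:
    """True if no resolution event arrived within the m_FailoverTime interval.

    In that case the backup domain would raise a synthetic problem event
    on behalf of the primary domain.
    """
    return seconds_since_last_resolution > failover_time


def failover_triggered(events: Iterable[HealthCheck], primary_domain: str) -> bool:
    """True if any problem event is about the primary domain.

    It does not matter which domain raised the event; only the domain
    the event is about is checked.
    """
    return any(e.is_problem and e.node == primary_domain for e in events)
```

Because the check looks only at the domain the event is about, it covers both cases in the text: a problem event raised by the primary domain itself, and a synthetic one raised by the backup domain.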
Tracking failover and failback
When failover occurs, the backup domain becomes active, the backup poller monitors the network, and the Event Gateway updates ObjectServer events. You can run OQL queries to check on the status of the ncp_poller and ncp_g_event processes. These processes each have an associated config.failover database table, which identifies their current failover state. When the backup domain is active, these processes have the setting FailedOver = 1 in the config.failover table, to indicate that they are in a failover state. (If the primary domain is still running, the associated processes are also assigned the value of FailedOver = 1.)
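Once the FailedOver values have been read from the config.failover table of each process, interpreting them is mechanical. The sketch below assumes the values were already retrieved by OQL query and collected into a mapping keyed by process name. The function name and the returned labels are hypothetical, and treating mixed values as "inconsistent" is an assumption (for example, the Virtual Domain process may not yet have updated every table).

```python
from typing import Mapping


def summarize_failed_over(values: Mapping[str, int]) -> str:
    """Classify FailedOver flags collected from config.failover tables.

    `values` maps a process name (for example "ncp_poller") to its
    FailedOver value. Returns one of:
      "steady"       - every process reports FailedOver = 0
      "failed-over"  - every process reports FailedOver = 1
      "inconsistent" - the processes disagree
      "unknown"      - no values, or values other than 0 and 1
    """
    flags = set(values.values())
    if flags == {0}:
        return "steady"
    if flags == {1}:
        return "failed-over"
    if flags == {0, 1}:
        return "inconsistent"
    return "unknown"
```

For example, `summarize_failed_over({"ncp_poller": 1, "ncp_g_event": 1})` returns `"failed-over"`, which matches the state expected while the backup domain is active.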
When failback occurs, the backup domain goes into standby, and the primary domain becomes active again. This is analogous to startup.
- An ItnmHealthChk problem event about the primary domain indicates that failover has been triggered. A subsequent ItnmHealthChk resolution event about the primary domain indicates that failback has been triggered.
- ItnmFailover events are generated to indicate when a Network Manager domain fails over or fails back. The event description states whether the domain is the primary or backup, and whether it has become active or gone into standby mode.
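The relationship between ItnmHealthChk events and failover or failback can be expressed as a small state machine. The following sketch is illustrative only: the function name and the `(node, kind)` tuple format are hypothetical, and it assumes the events are supplied in chronological order.

```python
from typing import Iterable, List, Tuple


def track_transitions(events: Iterable[Tuple[str, str]], primary_domain: str) -> List[str]:
    """Derive failover and failback transitions from ItnmHealthChk events.

    Each event is (node, kind): `node` is the domain the event is about,
    and `kind` is "problem" or "resolution".
    """
    transitions: List[str] = []
    failed_over = False
    for node, kind in events:
        if node != primary_domain:
            continue  # only events about the primary domain drive failover
        if kind == "problem" and not failed_over:
            failed_over = True
            transitions.append("failover")
        elif kind == "resolution" and failed_over:
            failed_over = False
            transitions.append("failback")
        # Resolution events while healthy, and repeated problem events
        # while failed over, do not change state.
    return transitions
```

Resolution events received in the steady state are ignored here because, as described earlier, each healthy domain generates them routinely.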