Network Manager failover and failback

Failover can be initiated by either the primary or backup domain, and is triggered when a health check problem event is generated for the primary domain. Failback is triggered by a subsequent health check resolution event for the primary domain.

An ItnmFailover event is generated by ncp_virtualdomain when a Network Manager domain fails over or fails back.

Failing over

When failover occurs, the primary Network Manager domain goes into standby mode (if it is still running), and the backup domain becomes active.

The following changes occur when the backup domain becomes active:

  • The Event Gateway synchronizes the events with the ObjectServer.
  • The ncp_poller process resumes polling.
  • The Event Gateway switches from the standby filter (StandbyEventFilter) to the incoming event filter (EventFilter).
  • Network Manager continues to monitor the network and perform root-cause analysis (RCA). However, network discovery is not performed, and the network topology remains static.

When a primary Network Manager server goes into standby mode, the following changes occur:

  • The Event Gateway switches from the incoming event filter (EventFilter) to the standby filter (StandbyEventFilter).
  • The ncp_poller process suspends all polls.
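
The role changes listed above can be summarized as a small state model. The following Python sketch is illustrative only and is not Network Manager code: the Domain class, its attribute names, and the fail_over function are hypothetical, and the returned dictionary merely stands in for the ItnmFailover event. Only the filter names (EventFilter, StandbyEventFilter) and the polling and discovery behavior are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Domain:
    """Hypothetical model of the failover-relevant state of one domain."""
    name: str
    is_primary: bool
    mode: str = "standby"
    gateway_filter: str = "StandbyEventFilter"
    polling: bool = False

    def become_active(self):
        # Event Gateway uses the incoming event filter; ncp_poller polls.
        self.mode = "active"
        self.gateway_filter = "EventFilter"
        self.polling = True

    def become_standby(self):
        # Event Gateway uses the standby filter; ncp_poller suspends all polls.
        self.mode = "standby"
        self.gateway_filter = "StandbyEventFilter"
        self.polling = False

    @property
    def discovery_allowed(self):
        # While the backup domain is active, discovery is not performed
        # and the topology remains static.
        return self.is_primary and self.mode == "active"


def fail_over(primary, backup, primary_running=True):
    """Swap roles and return a stand-in for the ItnmFailover event."""
    if primary_running:
        primary.become_standby()
    backup.become_active()
    # Hypothetical event shape; the real event fields are not described here.
    return {"event": "ItnmFailover", "active_domain": backup.name}
```

For example, calling fail_over(primary, backup) on a primary that is active leaves the primary in standby mode with polling suspended, and the backup active with polling resumed but discovery not allowed.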

The Apache Storm realtime computation system, which is used to aggregate raw poll data into historical poll data, fails over differently from the other Network Manager core processes. In a failover configuration, each Network Manager has a fully functioning Apache Storm server that performs aggregation of poll data, but only one of these Apache Storm servers is active at any one time. When an Apache Storm server starts, it checks the database to see whether the poll data is already being processed by another Apache Storm server:

  • If it is, the Apache Storm server that is starting goes into a standby state.
  • If it is not, the Apache Storm server that is starting assumes responsibility for processing the poll data.

When Network Manager failover occurs, the active Apache Storm server does not stop processing. The Apache Storm server fails over only if it is stopped by the user or is unable to update the database.
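
The Apache Storm startup check can be pictured as a simple claim on a shared record. The sketch below is a rough illustration under stated assumptions, not the product's implementation: a plain dictionary stands in for the database, the key name aggregation_owner and the StormServer class are hypothetical, and the way a standby server notices a released claim (here, by repeating the startup check) is an assumption because the text does not describe it.

```python
class StormServer:
    """Hypothetical stand-in for one Apache Storm server in a failover pair."""

    def __init__(self, name, db):
        self.name = name
        self.db = db          # dict standing in for the shared database
        self.state = "stopped"

    def start(self):
        # Startup check: is another server already processing the poll data?
        owner = self.db.get("aggregation_owner")
        if owner is not None and owner != self.name:
            self.state = "standby"
        else:
            self.db["aggregation_owner"] = self.name
            self.state = "active"
        return self.state

    def stop(self):
        # A user stopping the active server releases the claim, which is
        # one of the two conditions under which Storm fails over.
        if self.db.get("aggregation_owner") == self.name:
            del self.db["aggregation_owner"]
        self.state = "stopped"

    def on_network_manager_failover(self):
        # Network Manager failover alone does not change the Storm state.
        return self.state
```

In this model, the second server to start goes into standby, a Network Manager failover leaves the active server untouched, and the standby server takes over only after the active server has been stopped.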

Failing back

When a primary Network Manager server in standby mode resumes normal operation, it generates a health check resolution event.

The health check resolution event passes through the system, and the recovered Network Manager server becomes active again.

When the Virtual Domain process on the backup Network Manager server receives the health check resolution event, Virtual Domain switches the backup server back to standby mode.

The GenericClear automation in the ObjectServer is triggered by the health check resolution event, and clears the existing health check problem event.
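
The failback sequence can be sketched as two reactions to the same health check resolution event. The Python below is a simplified illustration, not the Virtual Domain process or the GenericClear automation themselves: the event dictionaries and their keys (domain, kind, severity) are hypothetical, and matching a resolution to its problem event on the domain name alone is a simplifying assumption.

```python
def backup_mode_after(event, current_mode, primary_domain):
    """Mode of the backup server after it sees a health check event."""
    if event["domain"] != primary_domain:
        return current_mode
    if event["kind"] == "problem":
        return "active"       # failover: backup takes over
    if event["kind"] == "resolution":
        return "standby"      # failback: backup returns to standby
    return current_mode


def clear_problem_events(events, resolution):
    """Clear open problem events for the resolved domain; return the count."""
    cleared = 0
    for ev in events:
        if (ev["kind"] == "problem"
                and ev["domain"] == resolution["domain"]
                and ev["severity"] > 0):
            ev["severity"] = 0  # 0 is used here to mean "cleared"
            cleared += 1
    return cleared


problem = {"domain": "PRIMARY", "kind": "problem", "severity": 5}
resolution = {"domain": "PRIMARY", "kind": "resolution", "severity": 1}

mode = backup_mode_after(problem, "standby", "PRIMARY")
mode_after_failback = backup_mode_after(resolution, mode, "PRIMARY")
cleared_count = clear_problem_events([problem], resolution)
```

Here the problem event first drives the backup server to active mode, and the later resolution event both returns it to standby mode and clears the outstanding problem event.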