Outage prevention

Learn about the solutions that are available to prevent planned and unplanned outages.

Comparison of solutions

Table 1. Comparison of solutions for the different deployment modes

Capability | Full cloud stack | Hybrid
High Availability Disaster Recovery (HADR) | No | Yes. See Setting up high availability disaster recovery in a hybrid deployment.
Geo-redundancy | Yes, as a technology preview in a full cloud deployment. See Installing a geo-redundant deployment on Red Hat OpenShift Container Platform. | No

Geo-redundancy

In a geo-redundant deployment, the primary deployment is located on one Red Hat® OpenShift Container Platform cluster and the secondary deployment is located on a different cluster. The individual Cassandra data centers in each deployment are replicated to synchronize the event and topology data across the clusters. Geo-redundancy is available only as a technology preview in a full cloud deployment. For more information, see Installing a geo-redundant deployment on Red Hat OpenShift Container Platform.

High availability disaster recovery

A high availability disaster recovery (HADR) hybrid deployment is composed of cloud native components on Red Hat OpenShift Container Platform along with an on-premises installation that has multiple IBM Netcool/OMNIbus WebGUI instances.

The on-premises WebGUI or DASH servers can be set up behind an HTTP server that load balances the on-premises UI traffic. If the primary WebGUI instance fails, users are routed to the backup WebGUI instance seamlessly.

For disaster recovery, automatic and manual failover and failback between Netcool Operations Insight deployments is supported. If the primary ObjectServer fails, the secondary ObjectServer takes over. In an HADR hybrid deployment, only cloud native analytics policies are pushed to the backup cluster, through the backup or restore pods. No event or topology data is synchronized across the Cassandra instances, because the Cassandra instances do not communicate with each other.

HADR features include:
  • Continuous grouping of events between two hybrid deployments.
  • Connection of more than one WebGUI instance to the same hybrid deployment.
  • Automatic and manual failover and failback between deployments.
  • Backup and restore of cloud native analytics policies.
A general overview of the HADR architecture is presented in Figure 1. For more information, see Setting up high availability disaster recovery in a hybrid deployment.
Figure 1. HADR architecture on a Netcool Operations Insight hybrid deployment

On-premises WebGUI access is through the HTTP load balancer. The HTTP load balancer enables high availability by distributing the workload among the WebGUI instances.
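The routing behavior described above can be sketched as follows. This is a minimal, hypothetical illustration of round-robin load balancing with health-based failover; the instance names and health flags are assumptions for the example, not part of the product.

```python
# Hypothetical sketch: how an HTTP load balancer distributes requests
# across WebGUI instances and fails over when an instance is unhealthy.
# Names and the health-check mechanism are illustrative only.

from dataclasses import dataclass
from itertools import cycle


@dataclass
class WebGuiInstance:
    name: str
    healthy: bool = True


class RoundRobinBalancer:
    def __init__(self, instances):
        self.instances = instances
        self._cycle = cycle(instances)

    def next_instance(self):
        # Walk the rotation, skipping unhealthy instances.
        for _ in range(len(self.instances)):
            candidate = next(self._cycle)
            if candidate.healthy:
                return candidate
        raise RuntimeError("no healthy WebGUI instance available")


primary = WebGuiInstance("webgui-primary")
backup = WebGuiInstance("webgui-backup")
balancer = RoundRobinBalancer([primary, backup])

primary.healthy = False                # simulate a primary WebGUI failure
print(balancer.next_instance().name)   # requests are routed to the backup
```

In the real deployment, the health check and routing are performed by the HTTP load balancer itself; the sketch only shows the decision logic conceptually.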

DASH is set up to use single sign-on (SSO) with the ObjectServer as the repository to store the OAuth tokens. Public-private key pairs on each DASH instance confirm the validity of the LTPA tokens.

ObjectServer traffic flows between the on-premises aggregation ObjectServer and the WebGUI instances. The traffic includes UI configuration metadata, authentication, and event data.

The console integration with the on-premises HTTP load balancer is updated by the active deployment. At any one time, either the primary or the backup cloud deployment updates the console integration.

Certificate authority (CA) signed certificates secure communication between the WebGUI instances. These certificates are loaded into the HTTP load balancer and are also added to the user-certificates configmap. The common UI services load the CA signed certificates from the configmap for the cluster connection to the HTTP load balancer.

HAProxy directs users to the currently active deployment. The cloud Netcool Operations Insight UI components query HAProxy to determine the OAuth token for the associated WebGUI instance.

The coordinator service in the backup deployment connects to the coordinator service in the primary deployment through HAProxy to determine the state of the primary deployment. If the primary coordinator service is not reachable, the backup coordinator service initiates failover.
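The failover decision described above can be sketched as a simple reachability probe. This is a hedged illustration only: the function name, retry count, and probe callable are assumptions for the example and do not reflect the actual coordinator service implementation.

```python
# Hypothetical sketch of the backup coordinator's failover decision:
# probe the primary coordinator (through HAProxy) and fail over only
# after every probe attempt fails. Names and thresholds are illustrative.

def should_fail_over(probe_primary, attempts=3):
    """Return True when the primary coordinator is unreachable.

    probe_primary is a callable that returns True if the primary
    coordinator service answered the probe.
    """
    for _ in range(attempts):
        if probe_primary():
            return False   # primary is reachable; backup stays in standby
    return True            # primary unreachable; backup takes over

# Simulate an unreachable primary and a reachable one.
print(should_fail_over(lambda: False))  # True  -> backup fails over
print(should_fail_over(lambda: True))   # False -> backup stays in standby
```

Retrying before failing over avoids flapping on a transient network blip between the clusters; the real coordinator services negotiate state through HAProxy rather than a direct callable.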