IBM Support

QRadar: About high-availability (HA) failover conditions

Question & Answer


Question

What is the sequence of events that can lead to a high-availability (HA) failover?

Cause

The QRadar high-availability deployments guide in the IBM Knowledge Center discusses HA failover. When one peer in an HA pair is active and the other is in the standby state, the pair is ready to fail over. A failover can occur if the active host goes offline, becomes unavailable on the network, or has mount points that are read-only.
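
One of the conditions above, a read-only mount point, can be checked by hand on any Linux host. The following is a minimal sketch, not the check that the QRadar HA manager performs: the function name read_only_mounts and the sample device names are hypothetical, and the input is assumed to be text in the standard /proc/mounts format. Some system filesystems are mounted read-only by design, so the result usually needs to be filtered to the mount points of interest.

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose mount options include 'ro'.

    Expects text in the /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if "ro" in options:
            result.append(mount_point)
    return result


# Hypothetical sample input; the device names are placeholders.
sample = (
    "/dev/sda1 / ext4 rw,relatime 0 0\n"
    "/dev/sdb1 /store xfs ro,relatime 0 0\n"
)
print(read_only_mounts(sample))  # prints ['/store']
```

On a live host, the same function can be given the contents of /proc/mounts instead of the sample string.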

What happens during a failover from active to standby:
  • Services on the active appliance stop, and the standby appliance begins to start services.
  • Event collection and search functions are briefly interrupted.
  • If the HA host going through a failover is the console, the user interface (UI) is also briefly unavailable.
  • Any open SSH connections that use the virtual IP address (VIP) are disconnected.

 

Answer

On a healthy HA host, one of the HA peers is active and the other is in a standby state. In this state, the data is synchronized across the peers, and regular Network Connectivity and Heartbeat Ping tests occur. Such an HA host is ready to fail over. The IBM Knowledge Center lists a number of circumstances that can trigger a failover. When the HA manager on either peer detects one of these events, the host in the standby state starts "going active". The end state of the active host depends on the exact trigger and is one of three possible states:

  • Offline: The offline status is set manually by an administrator when maintenance or a manual failover is required. In the offline state, the HA manager does not provide failover to another appliance, and services are not running on the host. For the system to be ready for a failover, the administrator must set the HA appliance to the online state.
  • Failed: The failed status is set when one of the HA manager services determines that the system is accessible but not capable of providing services. This status is usually encountered during soft failures of the host or some forms of network issues, such as failed or delayed pings or failed connectivity tests.
  • Unknown: The unknown status is set when the active host is not able to determine the state of the HA host. This status can be seen during hardware problems, such as hard failures of the host, kernel crashes, file system corruption, or network failures.
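
The three end states can be summarized as a small decision function. This is only an illustrative reading of the descriptions above, not the logic of the HA manager: the names EndState and classify_end_state and the three boolean inputs are assumptions made for this sketch.

```python
from enum import Enum


class EndState(Enum):
    OFFLINE = "offline"  # set manually by an administrator
    FAILED = "failed"    # host accessible but unable to provide services
    UNKNOWN = "unknown"  # state of the host cannot be determined


def classify_end_state(set_offline_by_admin, state_known, can_provide_services):
    """Map three hypothetical observations to an end state.

    Returns None when no failover end state applies (healthy host).
    """
    if set_offline_by_admin:
        return EndState.OFFLINE
    if not state_known:
        return EndState.UNKNOWN
    if not can_provide_services:
        return EndState.FAILED
    return None


# Example: the host answers on the network but cannot provide services.
print(classify_end_state(False, True, False))  # prints EndState.FAILED
```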

If possible, before the standby host goes active, the active host attempts to shut down gracefully by transitioning into either the Offline state or the Failed state. During this transition, the following actions occur:
  1. The application is shut down: The hostcontext service and its associated subcomponents, the tomcat service (if it exists on the host), and the hostservices service shut down, in that order. While this is happening, operations such as log collection, searches, and data accumulation are interrupted. The UI also becomes unavailable if this is occurring on the console. Services are not available until the peer has gone active.
  2. Shared filesystems are released: At this stage, shared filesystems such as the /store partition are released on the peer that is going offline or failed. See Technote 1993804: QRadar: High Availability (HA) Peer data replication for further information on how data is replicated.
  3. Virtual IP is released: HA clusters use a virtual IP address (VIP) that is held by the active host. At this stage, ownership of the VIP is released. Any connections that are still open and use the VIP, such as SSH connections, are dropped as a result. The VIP is not available on the network until the peer claims it.
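
The ordering of the graceful shutdown can be written down as a short plan builder. This sketch only models the order described in the three steps above; it does not stop any services, and the function name graceful_shutdown_plan and the has_tomcat parameter are hypothetical.

```python
# Service stop order as described in step 1.
SHUTDOWN_SERVICE_ORDER = ["hostcontext", "tomcat", "hostservices"]


def graceful_shutdown_plan(has_tomcat):
    """Return the ordered list of actions for a graceful shutdown.

    has_tomcat: whether the tomcat service exists on this host.
    """
    steps = []
    for service in SHUTDOWN_SERVICE_ORDER:
        if service == "tomcat" and not has_tomcat:
            continue  # tomcat is stopped only where it exists
        steps.append("stop service " + service)
    steps.append("release shared filesystems (for example /store)")
    steps.append("release virtual IP")
    return steps


for step in graceful_shutdown_plan(has_tomcat=True):
    print(step)
```

A host without tomcat gets the same plan minus the tomcat step, so the filesystems and the VIP are always released after the application services have stopped.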

At this stage, the standby host is ready to go active. Note that the standby host can also start going active if the peer host becomes unavailable. If the peer host is unavailable, the shared resources are assumed to be no longer in use by the peer host. When the standby host goes active, the following HA failover event sequence occurs:
  1. Shared filesystems are mounted: The /store partition becomes available.
  2. Virtual IP is taken over: At this stage, any connections made to the VIP reach the newly active host.
  3. Application is started: The hostservices, tomcat (if available), and hostcontext services are started. When tomcat is fully started, the UI becomes available again. Configuration communications resume, and event collection, processing, and searching start functioning.
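
Because the VIP and the UI are briefly unavailable during this sequence, a client-side script may want to wait until the VIP answers again before reconnecting. The following is a minimal sketch of such a wait loop using only the Python standard library. It is not part of QRadar; the helper names tcp_probe and wait_until, the retry counts, and the delay are assumptions chosen for illustration.

```python
import socket
import time


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until(probe, attempts=30, delay=5.0, sleep=time.sleep):
    """Call probe() until it returns True or the attempts run out.

    Returns the number of the successful attempt, or None on failure.
    The sleep function is injectable so the loop can be tested quickly.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

For example, wait_until(lambda: tcp_probe(vip_address, 22)) would poll the SSH port on the VIP, where vip_address is a placeholder for the virtual IP address of the HA cluster. A successful TCP connection shows only that the newly active host has claimed the VIP; it does not show that all services have finished starting.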

Product: IBM Security QRadar SIEM
Component: High Availability
Platform: Linux
Version: 7.2

Document Information

Modified date:
18 December 2020

UID

swg21994665