Installing IBM Cloud Pak for AIOps on a multi-zone architecture (multi-zone HA)

  The use of the multi-zone high availability (HA) deployment architecture for IBM Cloud Pak for AIOps is available only as a technology preview, and is not intended for production use.

Review the following sections:

Overview

IBM Cloud Pak® for AIOps is starting to implement multi-zone availability, as provided by the Red Hat® OpenShift® Container Platform. This feature is not available for deployments of IBM Cloud Pak for AIOps on Linux®.

Installing IBM Cloud Pak for AIOps on a Red Hat OpenShift cluster that spans multiple availability zones improves resiliency. The distribution of pod replicas in different zones means that if a zone becomes unavailable, then remaining replicas in other zones are still able to service requests. Red Hat OpenShift is also able to reschedule failed nodes' workloads from a zone that is subject to an outage to nodes in an alternative healthy zone.

For more information about the benefits of multi-zone architectures, see Running in multiple zones in the Kubernetes documentation.

Red Hat OpenShift tolerates nodes (and by extension zones) being down for five minutes, in addition to a node monitoring interval. For that reason, and accounting for the additional image pull and pod startup times, node and zone failures take approximately six minutes to start recovering. For more information, see Using remote worker nodes at the network edge: Kubernetes zones in the Red Hat OpenShift documentation.
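
The five-minute window comes from the tolerations that Kubernetes adds to pods by default for the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints, with tolerationSeconds set to 300. If you want to confirm this behavior on your own cluster, you can inspect the tolerations of any running pod; the pod and namespace names here are placeholders:

  oc get pod <pod-name> -n <namespace> -o jsonpath='{.spec.tolerations}'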

Note: Multi-zone availability is supported only for production deployments of IBM Cloud Pak for AIOps. A starter deployment of IBM Cloud Pak for AIOps can be deployed on a multi-zone architecture, but it will not be highly available.

Installing on a multi-zone architecture

Use the following steps to install IBM Cloud Pak for AIOps on a multi-zone architecture.

  1. Prepare your cluster

    Before you install IBM Cloud Pak for AIOps, you must ensure that your Red Hat OpenShift cluster is configured for multi-zone by labeling the cluster nodes with a zone. Cluster administrator permission is required to add labels to your nodes.

    For multi-zone availability to be effective, it is important to plan the physical locations of the zones so that a failure at one physical location is unlikely to affect more than one zone. For example, set up zones for hardware that is physically located in different buildings or in server racks with independent power supplies.

    If you are installing IBM Cloud Pak for AIOps on a cloud platform, then refer to your cloud provider's documentation to learn how to select a multi-zone type when you are setting up your cluster, and then skip to step 2, Configure storage.

    If you are installing directly on Red Hat OpenShift on-premises, then label your Red Hat OpenShift cluster nodes as follows.

    1. Label cluster nodes with zones.

      Distribute your cluster nodes across three availability zones, ensuring that you have a cluster master node in each availability zone. Allocate each node to an availability zone by labeling the node with the zone:

      oc label node <node-name> topology.kubernetes.io/zone=<zone-name>
      

      Where

      • <node-name> is the fully qualified domain name of your node, for example worker0.mycluster.subdomain.example.com.
      • <zone-name> is the name of the zone that you want the node to be in.
    2. Label cluster nodes with regions.

      IBM Cloud Pak for AIOps has a dependency that requires the Kubernetes region label to be set on each node. The region label can be set to the same value on all nodes.

      oc label node <node-name> topology.kubernetes.io/region=<region-name>
      

      Where

      • <node-name> is the fully qualified domain name of your node, for example worker0.mycluster.subdomain.example.com.
      • <region-name> is the region name.
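
    For example, on a hypothetical cluster with three worker nodes, using placeholder node, zone, and region names, you might label the nodes and then verify the labels as follows:

      oc label node worker0.mycluster.subdomain.example.com topology.kubernetes.io/zone=zone-a
      oc label node worker1.mycluster.subdomain.example.com topology.kubernetes.io/zone=zone-b
      oc label node worker2.mycluster.subdomain.example.com topology.kubernetes.io/zone=zone-c
      oc label node worker0.mycluster.subdomain.example.com topology.kubernetes.io/region=region-1
      oc label node worker1.mycluster.subdomain.example.com topology.kubernetes.io/region=region-1
      oc label node worker2.mycluster.subdomain.example.com topology.kubernetes.io/region=region-1
      oc get nodes -L topology.kubernetes.io/zone -L topology.kubernetes.io/region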

    When the topology.kubernetes.io/zone label is specified for the nodes in a cluster, Kubernetes tries to schedule a ReplicaSet's pods on different worker nodes in different zones. For example, a deployment of three replicas should result in one replica being deployed in each zone of a three-zone cluster.

    Note: Average latency between nodes across zones should not exceed single-digit milliseconds; ideally, it should be in the low single digits.
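
    As a rough way to gauge inter-zone latency, you can open a debug shell on a node in one zone and ping a node in another zone; the node name and target IP address are placeholders:

      oc debug node/worker0.mycluster.subdomain.example.com -- chroot /host ping -c 5 <target-node-ip>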

  2. Configure storage

    Refer to your storage provider's documentation for instructions on configuring storage classes for a multi-zone configuration that enables data to remain available if a cluster node becomes unavailable.
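
    The exact configuration depends on your storage provider, but as a minimal sketch, a storage class for a multi-zone cluster typically uses the WaitForFirstConsumer volume binding mode so that each volume is provisioned in the same zone as the pod that first uses it. Create a file that is named, for example, multizone-storageclass.yaml (a placeholder name) with content like the following, where the provisioner value is a placeholder that must be replaced with your storage provider's CSI driver name, and apply it with oc apply -f multizone-storageclass.yaml:

      apiVersion: storage.k8s.io/v1
      kind: StorageClass
      metadata:
        name: multizone-block
      provisioner: example.csi.vendor.com
      volumeBindingMode: WaitForFirstConsumer
      allowVolumeExpansion: true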

  3. Install IBM Cloud Pak for AIOps

    Starter deployments are not configured to support multi-zone HA. Use the following instructions to perform a production installation of IBM Cloud Pak for AIOps.

    Installing IBM Cloud Pak for AIOps (Production installation)

    IBM Cloud Pak for AIOps uses a preferred scheduling approach instead of a hard cluster topology requirement. This approach provides greater administrative flexibility and helps IBM Cloud Pak for AIOps to remain functional despite any cluster topology issues. Red Hat OpenShift distributes pods on the nodes according to node availability and other considerations such as available resources, node taints, or whether an image already exists on a node. These considerations might sometimes conflict.

    The installation procedure has an optional step to run a status checker script. This script reports whether pods are optimally distributed for resilience within your multi-zone cluster. Manual intervention might sometimes be necessary to move pods to the correct zone. Deleting one or more incorrectly placed pods is usually sufficient, but it might be necessary to cordon nodes in a zone to help direct Red Hat OpenShift to distribute the pods to the proper zone. For more information about rebalancing, see Rebalancing workloads on a multi-zone architecture (multi-zone HA).
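
    For example, the following hypothetical sequence lists pod placement, deletes an incorrectly placed pod so that Red Hat OpenShift reschedules it, and temporarily cordons a node to steer the replacement pod toward a different zone; the pod, node, and namespace names are placeholders:

      oc get pods -n <namespace> -o wide
      oc delete pod <incorrectly-placed-pod> -n <namespace>
      oc adm cordon <node-name>
      # When the replacement pod is running in the wanted zone:
      oc adm uncordon <node-name>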

    Note: The Secure Tunnel function is enhanced to better handle zone outages. If you use Secure Tunnel, when you configure connector and tunnel worker replicas, ensure that the replica count is equal to or greater than the number of zones in your environment. The number of replicas is preferably a multiple of the number of zones. For example, if your environment has three zones then you might configure two replicas for each zone, to give a total of six replicas. For more information about configuring connector and tunnel replicas, see Creating Secure Tunnel connections.

Some resources might not fail over as expected, and might require intervention. Contact IBM Support if you encounter any problems when you are using this technology preview.

Limitations while workloads are rescheduled

If the following functions are affected by a zone outage, then they are compromised until Red Hat OpenShift reschedules the affected workloads to nodes in an unaffected zone. This takes approximately six minutes.

  • Integrations do not collect or forward data.
  • Alerts, Events, and Incidents from IBM® Netcool® Operations Insight® probes are paused.
  • IBM® Netcool® Agile Service Manager observers do not run.
  • AI model management UI is not accessible.
  • AI training might be paused, and no new training of selected analytics occurs.
  • You cannot view, edit, delete, create, or deploy integrations.
  • Insights Dashboard is not accessible.
  • Resource enrichment functionality, which matches resources to resources, updates status, and adds queryable tags, is paused.
  • The matching of resources to events is paused.
  • Timed jobs such as alert expiration and closure do not run.
  • Netcool Operations Insight probe data archiving and events might be paused.
  • Secure Tunnel controller pod: changes to the tunnel connections or application mappings fail. If one of the connection worker pods is not available for Secure Tunnel to use, then performance might be impacted, depending on the current workload.
  • Runbook Automation: SSH actions and HTTP actions that are running might lose connection to their target if the outage affects the pod that is running them. In this scenario, no further output is received from the target. For SSH actions, the execution of the remote script on the target machine stops. Ansible actions are run by the Ansible Automation Platform. For HADR considerations of Ansible actions, refer to the Ansible Automation Platform documentation.
  • No new usage metrics are recorded.