Rebalancing workloads on a multi-zone architecture (IBM Cloud Pak for AIOps on OpenShift)

  The multi-zone high availability (HA) deployment architecture for IBM Cloud Pak for AIOps is available only as a technology preview, and is not for production use.

Use these steps to manually redistribute IBM Cloud Pak® for AIOps workloads across a multi-zone Red Hat® OpenShift® Container Platform cluster.

Overview

For more information about deploying IBM Cloud Pak for AIOps on a multi-zone cluster, see Installing IBM Cloud Pak for AIOps on a multi-zone architecture (multi-zone HA).

The distribution of IBM Cloud Pak for AIOps workloads across the zones in your multi-zone cluster is important to help ensure high availability. You might need to rebalance your workloads manually to achieve an optimal distribution.

If a zone outage occurs in a multi-zone cluster, Red Hat OpenShift automatically reschedules IBM Cloud Pak for AIOps workloads on the affected zone to alternative available zones. This helps IBM Cloud Pak for AIOps to continue functioning during a zone outage, but it means that workloads are not optimally distributed when the affected zone is restored. Rebalancing is then needed to help ensure that all available zones have sufficient replicas to maintain high availability if another zone outage occurs. Red Hat OpenShift does not automatically rebalance workloads because it can be disruptive and cause further problems. Rebalancing must be done manually when convenient, for example during a scheduled maintenance window.

Procedure

Use the following steps to correct the distribution of workloads, for example after the restoration of a failed zone.

  1. Check whether all the recovered nodes are healthy.

    Run the following command to validate the readiness of all nodes:

    oc get nodes
    

    Example output:

    $ oc get nodes
     NAME                STATUS   ROLES                  AGE    VERSION
     master0.acme.com    Ready    control-plane,master   4d3h   v1.27.13+048520e
     master1.acme.com    Ready    control-plane,master   4d3h   v1.27.13+048520e
     master2.acme.com    Ready    control-plane,master   4d3h   v1.27.13+048520e
     worker0.acme.com    Ready    infra,worker           4d3h   v1.27.13+048520e
     worker1.acme.com    Ready    infra,worker           4d3h   v1.27.13+048520e
     worker10.acme.com   Ready    worker                 4d3h   v1.27.13+048520e
     worker11.acme.com   Ready    worker                 4d3h   v1.27.13+048520e
     worker12.acme.com   Ready    worker                 4d3h   v1.27.13+048520e
     worker13.acme.com   Ready    worker                 4d3h   v1.27.13+048520e
     worker14.acme.com   Ready    worker                 4d3h   v1.27.13+048520e
     worker2.acme.com    Ready    infra,worker           4d3h   v1.27.13+048520e
     worker3.acme.com    Ready    worker                 4d3h   v1.27.13+048520e
     worker4.acme.com    Ready    worker                 4d3h   v1.27.13+048520e
     worker5.acme.com    Ready    worker                 4d3h   v1.27.13+048520e
     worker6.acme.com    Ready    worker                 4d3h   v1.27.13+048520e
     worker7.acme.com    Ready    worker                 4d3h   v1.27.13+048520e
     worker8.acme.com    Ready    worker                 4d3h   v1.27.13+048520e
     worker9.acme.com    Ready    worker                 4d3h   v1.27.13+048520e
    

    Do not proceed until all nodes have a STATUS of Ready.
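
    The readiness check can also be scripted. The following is a minimal sketch (the helper function is an assumption, not part of the product): it counts nodes whose STATUS column is not exactly Ready from the output of oc get nodes.

```shell
# Hypothetical helper: counts nodes whose STATUS column is not exactly "Ready"
# in the output of `oc get nodes --no-headers`.
not_ready_count() {
  awk '$2 != "Ready" { n++ } END { print n + 0 }'
}

# Typical use against a live cluster (requires a logged-in oc CLI):
#   oc get nodes --no-headers | not_ready_count
# Proceed to the next step only when this prints 0.
```

    Note that a cordoned node reports a STATUS of Ready,SchedulingDisabled and is counted as not ready by this check, which is the safe behavior here.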

  2. Assess whether workloads are distributed well.

    Download the status checker plug-in from github.com/IBM. Run the following command to assess the distribution of IBM Cloud Pak for AIOps workloads across your cluster zones:

    oc waiops multizone status
    

    No further action is needed if the script reports INFO: All pods are distributed appropriately for multizone.

    Example output where pods are not optimally distributed across zones:

    $ oc waiops multizone status
    
    Regions : CentralNC
    Zones   : ChapelHill Durham Raleigh
    
    NAME                ZONE         REGION
    worker11.acme.com   ChapelHill   CentralNC
    worker14.acme.com   ChapelHill   CentralNC
    worker2.acme.com    ChapelHill   CentralNC
    worker5.acme.com    ChapelHill   CentralNC
    worker8.acme.com    ChapelHill   CentralNC
    worker1.acme.com    Durham       CentralNC
    worker10.acme.com   Durham       CentralNC
    worker13.acme.com   Durham       CentralNC
    worker4.acme.com    Durham       CentralNC
    worker7.acme.com    Durham       CentralNC
    worker0.acme.com    Raleigh      CentralNC
    worker12.acme.com   Raleigh      CentralNC
    worker3.acme.com    Raleigh      CentralNC
    worker6.acme.com    Raleigh      CentralNC
    worker9.acme.com    Raleigh      CentralNC
    
    INFO: Checking the namespace aiops...................................done.
    
    WARNING: Not all pods are distributed optimally.
    
    NAMESPACE   NAME                                                      ZONE         NODE                KIND
    aiops       aimanager-aio-log-anomaly-detector-7554c44d69-4ztcb       Durham       worker7.acme.com    Deployment
    aiops       aimanager-aio-log-anomaly-detector-7554c44d69-tsq87       Durham       worker13.acme.com   Deployment
    aiops       aimanager-aio-log-anomaly-detector-7554c44d69-27w6h       Raleigh      worker0.acme.com    Deployment
    aiops       aimanager-aio-log-anomaly-detector-7554c44d69-47hf9       Raleigh      worker12.acme.com   Deployment
    aiops       aiops-base-ui-59d8b785d8-g7v58                            Durham       worker7.acme.com    Deployment
    aiops       aiops-base-ui-59d8b785d8-f2mwb                            Raleigh      worker6.acme.com    Deployment
    aiops       aiops-base-ui-59d8b785d8-hbv7s                            Raleigh      worker0.acme.com    Deployment
    aiops       aiops-ir-lifecycle-datarouting-66b7fdd5df-4ml7r           Raleigh      worker9.acme.com    Deployment
    aiops       aiops-ir-lifecycle-datarouting-66b7fdd5df-l56l8           Raleigh      worker12.acme.com   Deployment
    aiops       aiops-ir-lifecycle-policy-registry-svc-7cd8dcbdff-h4zcv   Durham       worker10.acme.com   Deployment
    aiops       aiops-ir-lifecycle-policy-registry-svc-7cd8dcbdff-q584z   Durham       worker4.acme.com    Deployment
    aiops       platform-auth-service-69847b59dd-dfzgm                    Durham       worker10.acme.com   Deployment
    aiops       platform-auth-service-69847b59dd-nkzwf                    Durham       worker4.acme.com    Deployment
    aiops       platform-auth-service-69847b59dd-nh22p                    Raleigh      worker12.acme.com   Deployment
    aiops       platform-identity-management-6c6fd4dcc8-wknsp             Durham       worker4.acme.com    Deployment
    aiops       platform-identity-management-6c6fd4dcc8-ps565             Raleigh      worker12.acme.com   Deployment
    aiops       platform-identity-management-6c6fd4dcc8-zktfc             Raleigh      worker6.acme.com    Deployment
    aiops       platform-identity-provider-649c6dd6cd-45v7c               Durham       worker7.acme.com    Deployment
    aiops       platform-identity-provider-649c6dd6cd-jfhmc               Durham       worker4.acme.com    Deployment
    aiops       platform-identity-provider-649c6dd6cd-hc9v5               Raleigh      worker12.acme.com   Deployment
    

    In the preceding example output, pods that are not optimally distributed are listed after WARNING: Not all pods are distributed optimally. The distribution is suboptimal because pod replicas are in the same zone, instead of being spread across distinct zones.
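
    You can summarize the per-zone replica counts from this table with a small helper. The following sketch (the function is an assumption, not part of the status checker) reads the pod table printed after the WARNING line (columns: NAMESPACE NAME ZONE NODE KIND) and, for a given deployment name prefix, prints how many replicas sit in each zone.

```shell
# Hypothetical helper: given a deployment name prefix, count replicas per zone
# from the status checker's pod table (NAMESPACE NAME ZONE NODE KIND).
zone_counts() {
  awk -v p="$1" 'index($2, p) == 1 { n[$3]++ } END { for (z in n) print z, n[z] }'
}

# Example use, piping in the table from the status checker output:
#   oc waiops multizone status | zone_counts platform-identity-provider
```

    In the example output above, platform-identity-provider shows two replicas in Durham, one in Raleigh, and none in ChapelHill, which signals that one Durham pod should be deleted in the next step.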

  3. Manually redistribute the identified pods on the multi-zone cluster to achieve an optimal distribution for high availability.

    For each microservice identified in the previous step, delete pods that are in the same zone as another pod of that microservice (an example is given at the end of this step). The order of deletion is significant. If the pods identified by the status checker plug-in include any of the following, delete them in the order listed before you delete any other pods:

    • iaf-system-zookeeper
    • iaf-system-kafka
    • aiops-topology-cassandra
    • aiops-ibm-elasticsearch-es-server-zone0
    • aiops-ibm-elasticsearch-es-server-zone1
    • aiops-ibm-elasticsearch-es-server-zone2
    • aiops-ir-analytics-spark-worker
    • aiops-lad-flink-taskmanager
    • aiops-lad-flink
    • aiops-ir-lifecycle-flink-taskmanager
    • aiops-ir-lifecycle-flink
    • <instance>-edb-postgres, where <instance> is the name of your IBM Cloud Pak for AIOps instance, usually ibm-cp-aiops.
    • aiops-ir-lifecycle-datarouting
    • aimanager-aio-similar-incidents-service
    • aimanager-aio-log-anomaly-golden-signals
    • aiops-topology-topology
    • aimanager-aio-change-risk
    • aiops-ir-core-ncoprimary
    • aiops-ir-core-ncobackup
    • <instance>-redis-server, where <instance> is the name of your IBM Cloud Pak for AIOps instance, usually ibm-cp-aiops.
    • aiops-topology-layout

    Note: To achieve an optimal distribution, delete pods individually. If you need to reduce the total disruption time, you can delete multiple pods together.

    Run the following command to delete a pod:

    oc delete pod -n <project> <pod_to_delete>
    

    Where

    • <project> is the project that your IBM Cloud Pak for AIOps installation is deployed in.
    • <pod_to_delete> is a pod with multiple replicas in the same zone.

    For example, the platform-identity-provider microservice in the example output for step 2 has two pods in Durham and one pod in Raleigh. Delete a pod in Durham so that a new pod can be scheduled on a node in the ChapelHill zone, achieving distribution across all zones, as in the following example:

    Example command:

    oc delete pod -n aiops platform-identity-provider-649c6dd6cd-45v7c
    
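
    Selecting which pod to delete can also be sketched programmatically. The following helper is an assumption, not part of the product: from the status checker's pod table on stdin, it prints the name of one pod of the given deployment prefix that shares a zone with another replica, which makes it a candidate for oc delete pod.

```shell
# Hypothetical helper: print one pod of the given deployment prefix that sits
# in a zone already holding more than one replica of that deployment.
pick_duplicate() {
  awk -v p="$1" 'index($2, p) == 1 { n[$3]++; pod[$3] = $2 }
    END { for (z in n) if (n[z] > 1) { print pod[z]; exit } }'
}

# Example use, piping in the table from the status checker output:
#   oc waiops multizone status | pick_duplicate platform-identity-provider
```

    Delete the reported pod, wait for its replacement to become Ready, and then rerun the status checker before deleting the next pod.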

  4. Validate whether the distribution is now optimal.

    Re-run the status checker plug-in to check that all pods are now optimally distributed across the cluster zones.

    oc waiops multizone status
    

    Example output where pods are optimally distributed across zones:

    $ oc waiops multizone status
    
    Regions : CentralNC
    Zones   : ChapelHill Durham Raleigh
    
    NAME                ZONE         REGION
    worker11.acme.com   ChapelHill   CentralNC
    worker14.acme.com   ChapelHill   CentralNC
    worker2.acme.com    ChapelHill   CentralNC
    worker5.acme.com    ChapelHill   CentralNC
    worker8.acme.com    ChapelHill   CentralNC
    worker1.acme.com    Durham       CentralNC
    worker10.acme.com   Durham       CentralNC
    worker13.acme.com   Durham       CentralNC
    worker4.acme.com    Durham       CentralNC
    worker7.acme.com    Durham       CentralNC
    worker0.acme.com    Raleigh      CentralNC
    worker12.acme.com   Raleigh      CentralNC
    worker3.acme.com    Raleigh      CentralNC
    worker6.acme.com    Raleigh      CentralNC
    worker9.acme.com    Raleigh      CentralNC
    
    INFO: Checking the namespace aiops...................................done.
    
    INFO: All pods are distributed appropriately for multizone.
    

    If the status checker plug-in still reports a nonoptimal distribution, then repeat step 3. If the service remains incorrectly distributed, then you can run the following command to cordon nodes within a zone so that deleted pods cannot be rescheduled on them:

    oc adm cordon $(oc get no -l topology.kubernetes.io/zone=<zone_name> -o jsonpath='{.items[*].metadata.name}')
    

    Where <zone_name> is the name of the zone that you do not want pods to be scheduled in.

    Example command to cordon nodes in the Durham zone to help ensure that a pod is not scheduled in the Durham zone:

    oc adm cordon $(oc get no -l topology.kubernetes.io/zone=Durham -o jsonpath='{.items[*].metadata.name}')
    

    When the service has an optimal distribution, uncordon the nodes:

    oc adm uncordon $(oc get no -l topology.kubernetes.io/zone=<zone_name> -o jsonpath='{.items[*].metadata.name}')
    

    Where <zone_name> is the name of the zone that you previously cordoned.

    Example command:

    oc adm uncordon $(oc get no -l topology.kubernetes.io/zone=Durham -o jsonpath='{.items[*].metadata.name}')
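
    After uncordoning, confirm that no nodes remain cordoned. The following is a minimal sketch (the helper function is an assumption, not part of the product): it counts nodes whose STATUS still contains SchedulingDisabled in the output of oc get nodes.

```shell
# Hypothetical helper: counts nodes still cordoned (STATUS contains
# SchedulingDisabled) in the output of `oc get nodes --no-headers`.
cordoned_count() {
  awk '$2 ~ /SchedulingDisabled/ { n++ } END { print n + 0 }'
}

# Typical use against a live cluster (requires a logged-in oc CLI):
#   oc get nodes --no-headers | cordoned_count
# All zones accept new workloads again only when this prints 0.
```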