Upgrading Events Operator in foundational services version 3.17.x and earlier or 3.23.0 and later

Operator upgrade fails when upgrading directly from 3.17 to 3.23.0

Foundational services version 3.23.0 introduces Events Operator 4.5.0, which is based on Kafka version 3.3.1. Events Operator 4.5.0 uses enhanced inter-broker communication with separated control-plane and data-plane listeners. Brokers must be able to communicate through these listeners to maintain a quorum during the rolling upgrade, and not all prior versions are compatible with the new listeners. Therefore, some prior versions can be upgraded directly to foundational services version 3.23.0, while other versions require a multi-step upgrade process.

The following are the recommended upgrade paths based on the code stream currently deployed:

Following the recommended path allows the pods to upgrade successfully, without the cluster falling into a degraded state that impairs performance and risks data loss if further failures occur.

Important: It is strongly recommended that you follow one of the upgrade paths above, depending on the version of foundational services that is currently deployed.

Procedure

Attempting to upgrade from foundational services version 3.17.x or earlier (May 2022 release) to 3.23.x in a single step might leave one or more Kafka pods in a hung state, because the old pods are not aware of the separated listeners and are unable to form a quorum with the new pods. The result is a Kafka cluster with degraded performance that, in certain cases, is at risk of data loss if further failures occur before the upgrade completes.

To determine if any pods are hung, you can inspect the Kafka custom resource by running:

oc get kafka my-kafka -n my-ns -o yaml

Note: Substitute your Kafka cluster name and namespace where appropriate. The status section in the returned YAML document will show the overall state of the cluster. In cases where the rolling upgrade has failed, it will look similar to:

status:
  conditions:
  - lastTransitionTime: "2023-01-12T10:35:30.462360234Z"
    message: Pod my-kafka-kafka-1 is currently not rollable
    reason: UnforceableProblem # Or 'ForceableProblem'
    status: "True"
    type: NotReady
  observedGeneration: 2
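
The check described above can also be reduced to a small helper that inspects only the condition reasons. The following is a minimal sketch, not part of the product: the function name `upgrade_is_stuck` is hypothetical, and it assumes that the reasons were already fetched as a space-separated string, for example with `oc get kafka my-kafka -n my-ns -o jsonpath='{.status.conditions[*].reason}'`.

```shell
#!/bin/bash

# upgrade_is_stuck REASONS  (hypothetical helper, for illustration only)
# REASONS is the space-separated list of condition reasons taken from
# the status section of the Kafka custom resource.
# Returns 0 if the rolling upgrade looks stuck, 1 otherwise.
upgrade_is_stuck() {
  case "$1" in
    # Both reasons must be listed: "UnforceableProblem" has a lowercase "f",
    # so a single pattern on "ForceableProblem" would not match it.
    *UnforceableProblem*|*ForceableProblem*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example with a reason copied from the sample status output.
if upgrade_is_stuck "UnforceableProblem"; then
  echo "rolling upgrade is stuck"
fi
```

Passing the reasons in as an argument keeps the helper independent of the cluster, so it can be tried out before it is pointed at a live Kafka custom resource.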

The following script will attempt to remediate failed direct upgrades from foundational services version 3.17.x to foundational services version 3.23.0.

Important: Run the following script during a period when message traffic and producer or consumer activity are low.

#!/bin/bash

### Script to bounce the Kafka pods if a Kafka pod is still running
### with an older version - #ControlPlaneListener issue

FG_CYAN='\033[0;36m'
FG_PURPLE='\033[0;35m'
FG_DEF="\033[39m"
FG_RED='\033[0;31m'
FG_GREEN='\033[0;32m'

RESTART_REQUIRED="false"

script_name=$(basename "$0")


## Check if user has entered `namespace` argument to the script
## If not show the help text and exit
if [ -z "$1" ]
  then
    echo -e "${FG_RED}namespace is not supplied ${FG_DEF}"
    echo -e "${FG_CYAN}usage: $script_name <namespace>, example:  $script_name myproject ${FG_DEF}"
    exit 1
fi

### Accept namespace as an input parameter
namespace=$1

## Check the Kafka CR status to see whether a Kafka pod is still running
## with an old version after the upgrade. If so, set RESTART_REQUIRED="true".
cr_status_reason=$(oc get kafkas.ibmevents.ibm.com -n "${namespace}" -o jsonpath='{.items[0].status.conditions[*].reason}')
if [[ "$cr_status_reason" == *"UnforceableProblem"* || "$cr_status_reason" == *"ForceableProblem"* ]]; then
   RESTART_REQUIRED="true"
   echo -e "${FG_RED} One or more Kafka pods in the ${namespace} namespace are running an old version of Kafka and need a restart. ${FG_DEF}"
else
   echo -e "${FG_GREEN} All Kafka pods in the ${namespace} namespace are running the latest version of Kafka. No action is needed. ${FG_DEF}"
   exit 0
fi

## Events Operator versions 3.15.0 and earlier ship Kafka
## 2.6.0, 2.6.1, and 2.7.0. Upgrading from these versions
## to the latest version (3.3.1) causes UnforceableProblem.
kafka_versions=("2.6.0.jar" "2.6.1.jar" "2.7.0.jar")

## If the Kafka CR status shows `UnforceableProblem` or `ForceableProblem`,
## loop through the Kafka pods and check whether each one is running
## any of these versions (2.6.0, 2.6.1, 2.7.0). If so,
## bounce the pod to progress the upgrade.
if [ "$RESTART_REQUIRED" = "true" ]; then
   echo -e "${FG_CYAN}Checking Kafka version for namespace $namespace ${FG_DEF}"
   for i in $(oc get pods -l app.kubernetes.io/name=kafka -o name -n $namespace)
      do
         echo -e "${FG_PURPLE}Kafka Pod $i ${FG_DEF}"
         version=$(oc -n $namespace rsh $i ls libs | grep kafka_)
         echo $version
         for item in "${kafka_versions[@]}"; do
            if [[ "${version#*-}" == *$item* ]]; then
               oc delete $i -n $namespace
               count=0
               while :
               do
               PODS=$(oc get pods -n "$namespace" -l app.kubernetes.io/name=kafka -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.status.phase}{"\n"}{end}')
               if [[ "Running" == $(echo "$PODS" | grep kafka | awk '{print $2}' | uniq) ]]; then
                  echo -e "${FG_CYAN}Kafka Pod is successfully bounced!!!! ${FG_DEF}"
                  break
               else
                  ((count+=1))
                  if (( count <= 24 )); then
                     echo -e "${FG_PURPLE}Waiting for the Kafka pod to roll.  Recheck in 10 seconds ${FG_DEF}"
                     sleep 10
                  else
                     echo -e "${FG_RED}Pods taking too long.  Giving up.${FG_DEF}"
                     exit 1
                  fi
               fi
               done
            fi
         done
      done
fi
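
The version check in the script relies on the Kafka jar name in the `libs` directory. The following minimal sketch isolates that check so it can be tried without a cluster. The function name `is_old_kafka_jar` is hypothetical, and the sketch assumes jar names of the form `kafka_<scala-version>-<kafka-version>.jar`. Unlike the script, which uses a substring match, it compares the Kafka version exactly.

```shell
#!/bin/bash

# Kafka versions that the remediation script treats as old brokers.
old_versions=("2.6.0" "2.6.1" "2.7.0")

# is_old_kafka_jar JAR_NAME  (hypothetical helper, for illustration only)
# Returns 0 if JAR_NAME carries one of the old Kafka versions, 1 otherwise.
is_old_kafka_jar() {
  local jar="$1"
  local kafka_version="${jar#*-}"        # drop the "kafka_<scala-version>-" prefix
  kafka_version="${kafka_version%.jar}"  # drop the ".jar" suffix
  local v
  for v in "${old_versions[@]}"; do
    if [[ "$kafka_version" == "$v" ]]; then
      return 0
    fi
  done
  return 1
}

# Example: a broker that still runs Kafka 2.7.0 needs a restart.
if is_old_kafka_jar "kafka_2.13-2.7.0.jar"; then
  echo "restart needed"
fi
```

An exact comparison avoids accidental matches on auxiliary jars whose names merely contain one of the version strings.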