IBM Support

AIOps 4.13 Upgrade Failure Caused by aiops-ir-lifecycle-policy-upgrade-job

Flashes (Alerts)


Abstract

The Problem:

Upgrades from 4.12 to 4.13 fail with exceptions for certain savepoints in the lifecycle component.

This has been diagnosed as a defect.

Content

DIAGNOSING THE PROBLEM:

Upgrades from 4.12 to 4.13 fail with the following exception for certain savepoints:

java.io.IOException: User defined function KeyedStateReaderFunction#readKey threw an exception
at org.apache.flink.state.api.input.KeyedStateInputFormat.nextRecord(KeyedStateInputFormat.java:225)
at org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:98)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:113)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:71)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:338)
Caused by: java.lang.RuntimeException: java.lang.ClassCastException: com.ibm.aiops.lifecycle.sdk.util.RegistryUpdate incompatible with java.util.List
at com.ibm.aiops.flink.state.StateManager.lambda$serialize$10(StateManager.java:259)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
at java.base/java.util.HashMap$EntrySpliterator.forEachRemaining(HashMap.java:1850)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:522)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:512)
at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:239)
at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
at com.ibm.aiops.flink.state.StateManager.serialize(StateManager.java:269)
at com.ibm.aiops.lifecycle.stateupgrade.state.StateManagerStateReader$1.collect(StateManagerStateReader.java:85)
at com.ibm.aiops.lifecycle.stateupgrade.state.StateManagerStateReader$1.collect(StateManagerStateReader.java:77)
at com.ibm.aiops.lifecycle.stateupgrade.strategies.v412.Version412ExecutorStateMapper.accept(Version412ExecutorStateMapper.java:141)
at com.ibm.aiops.lifecycle.stateupgrade.strategies.v412.Version412ExecutorStateMapper.accept(Version412ExecutorStateMapper.java:38)
at com.ibm.aiops.lifecycle.stateupgrade.state.StateManagerStateReader.readKey(StateManagerStateReader.java:75)
at com.ibm.aiops.lifecycle.stateupgrade.state.StateManagerStateReader.readKey(StateManagerStateReader.java:31)
at org.apache.flink.state.api.input.operator.KeyedStateReaderOperator.processElement(KeyedStateReaderOperator.java:76)
at org.apache.flink.state.api.input.operator.KeyedStateReaderOperator.processElement(KeyedStateReaderOperator.java:51)
at org.apache.flink.state.api.input.KeyedStateInputFormat.nextRecord(KeyedStateInputFormat.java:223)
... 4 more
Caused by: java.lang.ClassCastException: com.ibm.aiops.lifecycle.sdk.util.RegistryUpdate incompatible with java.util.List
at org.apache.flink.api.common.typeutils.base.ListSerializer.serialize(ListSerializer.java:42)
at com.ibm.aiops.flink.state.StateManager.lambda$serialize$5(StateManager.java:225)
at io.vavr.control.Try.run(Try.java:131)
at com.ibm.aiops.flink.state.StateManager.lambda$serialize$6(StateManager.java:225)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
at com.ibm.aiops.flink.state.StateManager.lambda$serialize$10(StateManager.java:225)
... 21 more

This occurs when both of the following are true:

  1. There are active policy executions at the time of the savepoint.
  2. Previous policies in those executions have stored global variables.
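
To confirm that a failed upgrade matches this defect, the job log can be searched for the ClassCastException message from the stack trace above. The following is a minimal sketch, not an official diagnostic: the helper name `has_registry_update_cast_error` is hypothetical, and the usage line assumes that the upgrade job's pods carry the same `app.kubernetes.io/component=policy-upgrade-job` label as the job itself.

```shell
# Hypothetical helper: reads log text on stdin and succeeds (exit 0) only if
# it contains the ClassCastException message shown in the stack trace.
has_registry_update_cast_error() {
  grep -qF 'com.ibm.aiops.lifecycle.sdk.util.RegistryUpdate incompatible with java.util.List'
}

# Example usage against a live cluster (not executed here).
# Assumption: the job's pods can be selected with the job's component label.
#   oc logs -l app.kubernetes.io/component=policy-upgrade-job --tail=-1 \
#     | has_registry_update_cast_error && echo "Log matches the known defect"
```

The `--tail=-1` option is included because `oc logs` returns only the last few lines by default when a label selector is used.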

RESOLVING THE PROBLEM:

The hotfix applies only to the 4.13 release of AIOps:

image = "cp.icr.io/cp/cp4waiops/lifecycle-trigger@sha256:fe77ee33558a08361dcf5fad8dbabc06ac3c19c915efe196c1e75ca24db4b66a"
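
Because the hotfix targets 4.13 only, it may be worth checking the installed version before patching. The sketch below is illustrative: the helper name `is_4_13_release` is hypothetical, and the usage lines assume that the `spec.version` field of the installed lifecycle CSV tracks the AIOps release number, which should be verified on your cluster.

```shell
# Hypothetical helper: succeeds (exit 0) if the version string passed as $1
# belongs to the 4.13 release line, for example 4.13.0.
is_4_13_release() {
  case "$1" in
    4.13|4.13.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage against a live cluster (not executed here).
# Assumption: the CSV's spec.version matches the AIOps release number.
#   csv="$(oc get subscriptions.operators.coreos.com ibm-aiops-ir-lifecycle -o jsonpath='{.status.installedCSV}')"
#   version="$(oc get clusterserviceversions.operators.coreos.com "$csv" -o jsonpath='{.spec.version}')"
#   is_4_13_release "$version" || echo "Hotfix does not apply to version $version"
```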

This hotfix patches the code that manages the policy state upgrade and resolves the defect.

Applying the Patch

  1. Set your project to the AIOps namespace:

    oc project <AIOps project / namespace>
  2. Back up the lifecycle CSV:

    oc get clusterserviceversions.operators.coreos.com "$(oc get subscriptions.operators.coreos.com ibm-aiops-ir-lifecycle -o jsonpath='{.status.installedCSV}')" -o yaml > ibm-aiops-ir-lifecycle-csv-back.yaml
  3. Set the new image to patch:

    export PATCHED_AIOPS_IR_LIFECYCLE_IMAGE="cp.icr.io/cp/cp4waiops/lifecycle-trigger@sha256:fe77ee33558a08361dcf5fad8dbabc06ac3c19c915efe196c1e75ca24db4b66a"
  4. Patch the CSV with the updated image:

    oc patch clusterserviceversions.operators.coreos.com "$(oc get subscriptions.operators.coreos.com ibm-aiops-ir-lifecycle -o jsonpath='{.status.installedCSV}')" --type='json' -p="[{'op': 'replace', 'path': '/spec/install/spec/deployments/0/spec/template/metadata/annotations/olm.relatedImage.lifecycle-trigger', 'value': '${PATCHED_AIOPS_IR_LIFECYCLE_IMAGE}'}]"
  5. Wait for the new operator pod to start:

    oc get po --watch | grep lifecycle-operator

    This is complete when the new ir-lifecycle-operator pod is shown in the Running state.
  6. Delete the failed upgrade job so that a new one is scheduled:

    oc delete job -l app.kubernetes.io/component=policy-upgrade-job
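
The patch steps above can be guarded with a few checks. The following is a minimal sketch rather than part of the official procedure: the function names (`is_digest_pinned`, `current_trigger_image`, `patch_is_applied`) are hypothetical, while the resource names, the annotation path, and the job label are taken from the commands in the steps.

```shell
# Hypothetical helper: succeeds if $1 is an image reference pinned by a full
# sha256 digest (64 hex characters), which catches a truncated copy-paste.
is_digest_pinned() {
  printf '%s\n' "$1" | grep -Eq '^[^@[:space:]]+@sha256:[0-9a-f]{64}$'
}

# Hypothetical helper: prints the lifecycle-trigger image annotation that is
# currently set on the installed lifecycle CSV. Dots inside the annotation
# key are escaped so that jsonpath treats the key as a single field name.
current_trigger_image() {
  _csv="$(oc get subscriptions.operators.coreos.com ibm-aiops-ir-lifecycle -o jsonpath='{.status.installedCSV}')" || return 1
  oc get clusterserviceversions.operators.coreos.com "$_csv" \
    -o jsonpath='{.spec.install.spec.deployments[0].spec.template.metadata.annotations.olm\.relatedImage\.lifecycle-trigger}'
}

# Hypothetical helper: succeeds if the CSV annotation equals the image in $1.
patch_is_applied() {
  [ "$(current_trigger_image)" = "$1" ]
}

# Example usage against a live cluster (not executed here):
#   is_digest_pinned "$PATCHED_AIOPS_IR_LIFECYCLE_IMAGE" || echo "Image reference looks malformed"
#   patch_is_applied "$PATCHED_AIOPS_IR_LIFECYCLE_IMAGE" && echo "CSV carries the hotfix image"
#   oc get job -l app.kubernetes.io/component=policy-upgrade-job
```

Running the digest check before step 4 and the annotation check after it gives a quick confirmation that the CSV carries the intended image before the failed job is deleted.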


Product Synonym

CP4AIOps

Document Information

Modified date:
20 April 2026

UID

ibm17270091