Troubleshooting Apache Flink jobs

If you observe that no events are flowing to OpenSearch and that Flink job logs report errors, explore possible diagnoses and solutions.
Tip:

Make sure that the jq command-line JSON processor is installed. Some of these troubleshooting procedures require this tool. The jq tool is available from this page: https://stedolan.github.io/jq/

In code lines, the NAMESPACE, bai_namespace, or ${NAMESPACE} placeholder is the namespace where Business Automation Insights is deployed. The CR_NAME or custom_resource_name placeholder is the name of the custom resource that was used to deploy IBM Business Automation Insights.

Flink job registration failed due to incorrect task slot assignment

Problem

The job manager log reports errors such as the following one.

org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Could not allocate all requires slots within timeout of 300000 ms. 
Slots required: 8, slots allocated: 0
Cause
If the issue happens after you have updated your Business Automation Insights configuration, the problem might indicate that Flink did not correctly update the metadata about task slot assignment after a failing task manager recovered.
Solution
Restart all task managers by running this delete command.
oc delete pod -l component="taskmanager"
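
To verify that the task managers come back and that the jobs return to RUNNING state, you can watch the task manager pods and then list the jobs from the job manager. This is a minimal verification sketch that reuses the component labels and the Flink CLI command shown elsewhere in this topic.
oc get pods -l component="taskmanager"
JOB_MANAGER_POD=$(oc get pods -l component=jobmanager --no-headers -o custom-columns=Name:.metadata.name | head -n 1)
oc exec -c flink-main-container -it ${JOB_MANAGER_POD} -- /opt/flink/bin/flink list | grep RUNNING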

Flink job registration failed due to missing resources

Problem
The job manager log reports errors such as the following one.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not acquire the minimum required resources.
Cause

If you are using custom processing applications, the value of the flink.additional_task_managers parameter might need to be increased to allow the deployment of all custom processing applications.

Solution
  1. Verify that the value of the flink.additional_task_managers parameter allows all custom processing applications to be deployed. Each processing application that is deployed as a Flink job requires a number of task managers equal to its parallelism value. Update the value if additional task managers are needed (see the sketch after this procedure).

  2. If the error still occurs, list all running Flink jobs with the following command:
    JOB_MANAGER_POD=$(oc get pods -l component=jobmanager --no-headers -o custom-columns=Name:.metadata.name | head -n 1)
    oc exec -c flink-main-container -it ${JOB_MANAGER_POD} -- /opt/flink/bin/flink list | grep RUNNING
  3. If custom processing application jobs should be cancelled, run the following Flink CLI command to cancel the Flink job and free the slots that are allocated to it:
    kubectl exec -c flink-main-container -it ${JOB_MANAGER_POD} -- /opt/flink/bin/flink cancel <jobid>
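
As an illustration of step 1, the following sketch shows how an increase of flink.additional_task_managers might look in the custom resource. The spec.bai_configuration.flink location is assumed from the other Flink parameters in this topic, and the value 2 is only an example; set it so that the extra task managers cover the parallelism of all your custom processing applications.
spec:
  bai_configuration:
    flink:
      # Assumed location; the value shown is illustrative only.
      additional_task_managers: 2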

After an update of the job submitter, the processing job is in Canceled state and issues an error message

Problem
The following error message is displayed.
Get latest completed checkpoint for <job-id> job
REST endpoint did not return latest completed checkpoint. Getting it from Persistent Volume...
Error: There is no checkpoint to recover from.
Diagnosis
This error can happen when the version of a job is updated, for example to try to fix a failure, and this failure is preventing the creation of new checkpoints and savepoints.
Solution
Restart the job from the latest successful checkpoint or savepoint.
  1. You can find the latest successful checkpoint in the <bai-pv>/checkpoints/<job-name>/<job-id> directory.
    <bai-pv>
    The directory where the persistent volume (PV) was created. Set this variable to /mnt/pv, which is the folder where the PV is mapped within the job submitters.
    <job-name>
    The name of the failing job, for example bai/bpmn.
    <job-id>
    The job identifier that is indicated by the error message. Pick the most recent checkpoint, that is, the highest <checkpoint-id> value, and verify that the folder is not empty.
  2. If all <checkpoint-id> folders are empty, and only in this case, use the latest savepoint of the corresponding processing job, which you can find in the <bai-pv>/savepoints/<job-name> directory.
  3. Update the <job_name>.recovery_path parameter of the failing job by following the procedure in Updating your Business Automation Insights custom resource.
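
As an illustration of step 3, the following sketch shows what the recovery path might look like in the custom resource for the bai/bpmn job. The spec.bai_configuration.bpmn location is assumed from the other parameters in this topic; substitute the <job-id> and <checkpoint-id> values that you identified in the previous steps.
spec:
  bai_configuration:
    bpmn:
      # Assumed location; point to the latest non-empty checkpoint (or savepoint) folder.
      recovery_path: /mnt/pv/checkpoints/bai/bpmn/<job-id>/<checkpoint-id>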

Job pods, such as <custom_resource_name>-bai-bpmn or <custom_resource_name>-bai-icm, are stuck in Init:0/1 status

Problem

The pods of the <custom_resource_name>-bai-bpmn, <custom_resource_name>-bai-bawadv, and <custom_resource_name>-bai-icm jobs first require that the <custom_resource_name>-bai-setup job completes successfully. The <custom_resource_name>-bai-setup job attempts up to three retries on failure. After these three retries, it does not trigger any new pod creation. As a side effect, this can cause pods of the <custom_resource_name>-bai-bpmn, <custom_resource_name>-bai-bawadv, and <custom_resource_name>-bai-icm jobs to remain stuck in Init:0/1 status.

  • When you run the get pods command, you might observe the following results.
    oc get pods -n <cp4ba_namespace>
    Table 1. Pod retrieval results
    NAME                                    READY   STATUS     RESTARTS   AGE
    ...
    <custom_resource_name>-bai-bpmn-aaaaa   0/1     Init:0/1   0          2h
    ...
    <custom_resource_name>-bai-icm-bbbbb    0/1     Init:0/1   0          2h
    ...

    Look up the logs of the <custom_resource_name>-bai-bpmn, <custom_resource_name>-bai-bawadv, and <custom_resource_name>-bai-icm pods.

    The logs contain messages showing that the expected OpenSearch mapping version is not found:

    kubectl logs <custom_resource_name>-bai-bpmn -c wait-elasticsearch
    …
    Checking if mappings version is up-to-date … (iteration 67)
    …
    Checking if mappings version is up-to-date … (iteration 68)
    …

Cause

This situation can happen if OpenSearch did not start as expected when installing the release. First, make sure that OpenSearch is properly up and running. After OpenSearch is up and running, you can apply the following solution.
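
One quick way to confirm that OpenSearch is up, assuming the OpenSearch pod names contain "opensearch" in your deployment, is to check that the pods are Running and Ready:
oc get pods -n <cp4ba_namespace> | grep -i opensearch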

Solution
  1. Delete all the pods that were previously created by the <custom_resource_name>-bai-setup job.
    kubectl delete pod -l component=bai-setup
  2. Run the following command to re-create the <custom_resource_name>-bai-setup job.
    kubectl get job <custom_resource_name>-bai-setup -o json | jq 'del(.spec.selector)' | jq 'del(.spec.template.metadata.labels)' | kubectl replace --force -f -
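
After the job is re-created, you can check that it completes and that the dependent job pods leave the Init:0/1 status. This is a minimal verification sketch; adjust the timeout to your environment.
kubectl wait --for=condition=complete job/<custom_resource_name>-bai-setup --timeout=600s
kubectl get pods | grep -E "bai-bpmn|bai-bawadv|bai-icm"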

You are trying to remove an operator without first creating savepoints

Problem
The job submitter pods are in Error state and you find errors in the logs, such as the following one.
Caused by: java.lang.IllegalStateException: Failed to rollback to checkpoint/savepoint file:/mnt/pv/savepoints/dba/bai-<job-name>/savepoint-<savepoint-id> Cannot map checkpoint/savepoint state for operator xxxxxxxxxxxxxx to the new program, because the operator is not available in the new program.
Cause
This error occurs if you are trying to update your release and remove an operator but you did not first create the necessary savepoints and no values were passed to the <job_name>.recovery_path parameter of the jobs.
Solution
The error message contains a path to a savepoint that is created dynamically to try to update the job. You can use that savepoint to restart the job by updating the IBM Business Automation Insights release and passing the correct value for each job in its <job_name>.recovery_path parameter. For more information about the parameters that need to be updated in the release, see Advanced updates to your IBM Business Automation Insights deployment.
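
For example, the recovery path for a failing job might look like the following sketch in the custom resource, using the savepoint path reported in the error message. The spec.bai_configuration.<job_name> location is assumed from the other parameters in this topic; replace the placeholders with the values from your own error message.
spec:
  bai_configuration:
    <job_name>:
      # Assumed location; use the savepoint path reported in the error message.
      recovery_path: /mnt/pv/savepoints/dba/bai-<job-name>/savepoint-<savepoint-id>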

Tracing IBM Business Automation Insights raw events in Flink jobs

Problem
The Flink jobs that process IBM Cloud Pak® for Business Automation raw events have been successfully submitted to the Flink job manager. However, these jobs do not process the raw event types that they are designed to handle.
Attention: For 25.0.0, this feature works only for BPMN and ICM jobs.
Cause
There are multiple possible causes. The connection settings to the Kafka brokers can be incorrect, or some Flink jobs might have failed before they could process the raw event types. Tracing raw events allows you to validate that the events sent by emitters are received by the Flink jobs.
Solution
The solution consists of activating verbose logs and Elasticsearch time series. With verbose_logs enabled, whenever a Flink job receives a raw event to process into a time series, the raw event is written as an INFO log message in the task manager log.
  1. Enable the options for verbose logs and elasticsearch timeseries in the custom resource (CR) YAML file.
    spec.bai_configuration.flink.verbose_logs: true
    If bpmn is enabled:
    spec.bai_configuration.bpmn.force_elasticsearch_timeseries: true
    If icm is enabled:
    spec.bai_configuration.icm.force_elasticsearch_timeseries: true
  2. Wait until the Flink job is restarted, and check that each raw event is written at INFO log level to the task manager log.
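
After the job restarts, one way to check for the traced events is to search the task manager logs for INFO messages; this is a minimal sketch, and the exact message content depends on your event payloads.
kubectl logs -l component=taskmanager -c flink-main-container --tail=1000 | grep INFO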

The bai-odm Kubernetes job is in Error state and its log contains a schema validation error

Problem
The bai-odm Kubernetes job is in Error status and the Flink job for processing ODM events is not running. After the bai-odm job is restarted with the following command, the bai-odm pod still shows the schema validation error.
kubectl get job <custom_resource_name>-bai-odm -o json | jq 'del(.spec.selector)' | jq 'del(.spec.template.metadata.labels)' | kubectl replace --force -f -
Schema validation error: processing-conf.json not conform to schema specification
Solution
  1. Retrieve the identifier of a management pod.
    kubectl get pods | grep insights-engine-management
  2. Connect to that pod and delete the directory that holds the configuration for Operational Decision Manager event processing in its mounted persistent volume.
    kubectl exec -it <management-pod> -- bash -c "rm -rf /mnt/pv/processing-conf/dba/bai-odm"
  3. Retrieve and restart the bai-setup Kubernetes job.
    kubectl get job <custom_resource_name>-bai-setup -o json | jq 'del(.spec.selector)' | jq 'del(.spec.template.metadata.labels)' | kubectl replace --force -f -
  4. After the bai-setup pod reaches 0/1 Completed status, restart the bai-odm Kubernetes job.
    kubectl get job <custom_resource_name>-bai-odm -o json | jq 'del(.spec.selector)' | jq 'del(.spec.template.metadata.labels)' | kubectl replace --force -f -
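
After the bai-odm pod completes, you can verify that the Flink job for processing ODM events is running again by reusing the job listing command from earlier in this topic.
JOB_MANAGER_POD=$(kubectl get pods -l component=jobmanager --no-headers -o custom-columns=Name:.metadata.name | head -n 1)
kubectl exec -c flink-main-container -it ${JOB_MANAGER_POD} -- /opt/flink/bin/flink list | grep RUNNING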

Pods fail after an OpenShift cluster is restarted

If a Red Hat® OpenShift® cluster is restarted, JobManager pods with High Availability (HA) enabled may enter a CrashLoopBackOff state and fail to recover. To properly restart the cluster, see Pods fail after an OpenShift cluster is restarted.