Troubleshooting Apache Flink jobs

If you observe that no events are flowing to OpenSearch and that Flink job logs report errors, explore possible diagnoses and solutions.
Tip:

Make sure that the jq command-line JSON processor is installed. Some of these troubleshooting procedures require this tool. The jq tool is available from this page: https://stedolan.github.io/jq/

In code lines, the NAMESPACE, bai_namespace, or ${NAMESPACE} placeholder is the namespace where Business Automation Insights is deployed. The CR_NAME or custom_resource_name placeholder is the name of the custom resource that was used to deploy IBM Business Automation Insights.

Flink job registration failed due to incorrect task slot assignment

Problem

The job manager log reports errors such as the following one.

org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Could not allocate all requires slots within timeout of 300000 ms. 
Slots required: 8, slots allocated: 0
Cause
If the issue happens after you have updated your Business Automation Insights configuration, the problem might indicate that Flink did not correctly update the metadata about task slot assignment after a failing task manager recovered.
Solution
Restart all task managers by running this delete command.
oc delete pod -l component="taskmanager"
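
To verify that the task managers come back and that the jobs return to RUNNING state, you can watch the task manager pods and then list the jobs from the job manager. This is a minimal verification sketch that reuses the component labels and the Flink CLI command shown elsewhere in this topic.
oc get pods -l component="taskmanager"
JOB_MANAGER_POD=$(oc get pods -l component=jobmanager --no-headers -o custom-columns=Name:.metadata.name | head -n 1)
oc exec -c flink-main-container -it ${JOB_MANAGER_POD} -- /opt/flink/bin/flink list | grep RUNNING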

Flink job registration failed due to missing resources

Problem
The job manager log reports errors such as the following one.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not acquire the minimum required resources.
Cause

If you are using custom processing applications, the value of the flink.additional_task_managers parameter might need to be increased to allow the deployment of all custom processing applications.

Solution
  1. Verify that the value of the flink.additional_task_managers parameter allows all custom processing applications to be deployed. Each processing application that is deployed as a Flink job requires a number of task managers equal to its parallelism value. Update the value if additional task managers are needed (see the sketch after this procedure).

  2. If the error still occurs, list all running Flink jobs with the following command:
    JOB_MANAGER_POD=$(oc get pods -l component=jobmanager --no-headers -o custom-columns=Name:.metadata.name | head -n 1)
    oc exec -c flink-main-container -it ${JOB_MANAGER_POD} -- /opt/flink/bin/flink list | grep RUNNING
  3. If custom processing application jobs should be cancelled, run the following Flink CLI command to cancel the Flink job and free the slots that are allocated to it:
    kubectl exec -c flink-main-container -it ${JOB_MANAGER_POD} -- /opt/flink/bin/flink cancel <jobid>
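
As an illustration of step 1, the following sketch shows how an increase of flink.additional_task_managers might look in the custom resource. The spec.bai_configuration.flink location is assumed from the other Flink parameters in this topic, and the value 2 is only an example; set it so that the extra task managers cover the parallelism of all your custom processing applications.
spec:
  bai_configuration:
    flink:
      # Assumed location; the value shown is illustrative only.
      additional_task_managers: 2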

After an update of the job submitter, the processing job is in Canceled state and issues an error message

Problem
The following error message is displayed.
Get latest completed checkpoint for <job-id> job
REST endpoint did not return latest completed checkpoint. Getting it from Persistent Volume...
Error: There is no checkpoint to recover from.
Diagnosis
This error can happen when the version of a job is updated, for example to try to fix a failure, and this failure is preventing the creation of new checkpoints and savepoints.
Solution
Restart the job from the latest successful checkpoint or savepoint.
  1. You can find the latest successful checkpoint in the <bai-pv>/checkpoints/<job-name>/<job-id> directory.
    <bai-pv>
    The directory where the persistent volume (PV) was created. Set this variable to /mnt/pv, which is the folder where the PV is mapped within the job submitters.
    <job-name>
    The name of the failing job, for example bai/bpmn.
    <job-id>
    The job identifier that is indicated by the error message. Pick the most recent checkpoint, that is, the highest <checkpoint-id> value, and verify that the folder is not empty.
  2. If all <checkpoint-id> folders are empty, and only in this case, use the latest savepoint of the corresponding processing job, which you can find in the <bai-pv>/savepoints/<job-name> directory.
  3. Update the <job_name>.recovery_path parameter of the failing job by following the procedure in Updating your Business Automation Insights custom resource.
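
As an illustration of step 3, the following sketch shows what the recovery path might look like in the custom resource for the bai/bpmn job. The spec.bai_configuration.bpmn location is assumed from the other parameters in this topic; substitute the <job-id> and <checkpoint-id> values that you identified in the previous steps.
spec:
  bai_configuration:
    bpmn:
      # Assumed location; point to the latest non-empty checkpoint (or savepoint) folder.
      recovery_path: /mnt/pv/checkpoints/bai/bpmn/<job-id>/<checkpoint-id>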

Job pods, such as <custom_resource_name>-bai-bpmn or <custom_resource_name>-bai-icm, are stuck in Init:0/1 status

Problem

The pods of the <custom_resource_name>-bai-bpmn, <custom_resource_name>-bai-bawadv, and <custom_resource_name>-bai-icm jobs first require that the <custom_resource_name>-bai-setup job completes successfully. The <custom_resource_name>-bai-setup job attempts up to three retries on failure. After these three retries, it does not trigger any new pod creation. As a side effect, this can cause pods of the <custom_resource_name>-bai-bpmn, <custom_resource_name>-bai-bawadv, and <custom_resource_name>-bai-icm jobs to remain stuck in Init:0/1 status.

  • When you run the get pods command, you might observe the following results.
    oc get pods -n <cp4ba_namespace>
    Table 1. Pod retrieval results
    NAME                                    READY   STATUS     RESTARTS   AGE
    ...
    <custom_resource_name>-bai-bpmn-aaaaa   0/1     Init:0/1   0          2h
    ...
    <custom_resource_name>-bai-icm-bbbbb    0/1     Init:0/1   0          2h
    ...

    Look up the logs of the <custom_resource_name>-bai-bpmn, <custom_resource_name>-bai-bawadv, and <custom_resource_name>-bai-icm pods.

    The logs contain messages showing that the expected OpenSearch mapping version is not found:

    kubectl logs <custom_resource_name>-bai-bpmn -c wait-elasticsearch
    …
    Checking if mappings version is up-to-date … (iteration 67)
    …
    Checking if mappings version is up-to-date … (iteration 68)
    …

Cause

This situation can happen if OpenSearch did not start as expected when installing the release. First, make sure that OpenSearch is properly up and running. After OpenSearch is up and running, you can apply the following solution.
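
One quick way to confirm that OpenSearch is up, assuming the OpenSearch pod names contain "opensearch" in your deployment, is to check that the pods are Running and Ready:
oc get pods -n <cp4ba_namespace> | grep -i opensearch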

Solution
  1. Delete all the pods that were previously created by the <custom_resource_name>-bai-setup job.
    kubectl delete pod -l component=bai-setup
  2. Run the following command to re-create the <custom_resource_name>-bai-setup job.
    kubectl get job <custom_resource_name>-bai-setup -o json | jq 'del(.spec.selector)' | jq 'del(.spec.template.metadata.labels)' | kubectl replace --force -f -
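
After the job is re-created, you can check that it completes and that the dependent job pods leave the Init:0/1 status. This is a minimal verification sketch; adjust the timeout to your environment.
kubectl wait --for=condition=complete job/<custom_resource_name>-bai-setup --timeout=600s
kubectl get pods | grep -E "bai-bpmn|bai-bawadv|bai-icm"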

You are trying to remove an operator without first creating savepoints

Problem
The job submitter pods are in Error state and you find errors in the logs, such as the following one.
Caused by: java.lang.IllegalStateException: Failed to rollback to checkpoint/savepoint file:/mnt/pv/savepoints/dba/bai-<job-name>/savepoint-<savepoint-id> Cannot map checkpoint/savepoint state for operator xxxxxxxxxxxxxx to the new program, because the operator is not available in the new program.
Cause
This error occurs if you are trying to update your release and remove an operator but you did not first create the necessary savepoints and no values were passed to the <job_name>.recovery_path parameter of the jobs.
Solution
The error message contains a path to a savepoint that is created dynamically to try to update the job. You can use that savepoint to restart the job by updating the IBM Business Automation Insights release and passing the correct value for each job in its <job_name>.recovery_path parameter. For more information about the parameters that need to be updated in the release, see Advanced updates to your IBM Business Automation Insights deployment.
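
For example, the recovery path for a failing job might look like the following sketch in the custom resource, using the savepoint path reported in the error message. The spec.bai_configuration.<job_name> location is assumed from the other parameters in this topic; replace the placeholders with the values from your own error message.
spec:
  bai_configuration:
    <job_name>:
      # Assumed location; use the savepoint path reported in the error message.
      recovery_path: /mnt/pv/savepoints/dba/bai-<job-name>/savepoint-<savepoint-id>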

Tracing IBM Business Automation Insights raw events in Flink jobs

Problem
The Flink jobs that process IBM Cloud Pak® for Business Automation raw events have been successfully submitted to the Flink job manager. However, these jobs do not process the raw event types that they are designed to handle.
Attention: For 25.0.0, this feature works only for BPMN and ICM jobs.
Cause
There are multiple possible causes. The connection settings to the Kafka brokers can be incorrect, or some Flink jobs might have failed before they could process the raw event types. Tracing raw events allows you to validate that the events sent by emitters are received by the Flink jobs.
Solution
The solution consists of activating verbose logs and Elasticsearch time series. With verbose_logs enabled, whenever a Flink job receives a raw event to process into a time series, the raw event is written as an INFO log message in the task manager log.
  1. Enable the options for verbose logs and elasticsearch timeseries in the custom resource (CR) YAML file.
    spec.bai_configuration.flink.verbose_logs: true
    If bpmn is enabled:
    spec.bai_configuration.bpmn.force_elasticsearch_timeseries: true
    If icm is enabled:
    spec.bai_configuration.icm.force_elasticsearch_timeseries: true
  2. Wait until the Flink job is restarted, and check that each raw event is written at INFO log level to the task manager log.
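
After the job restarts, one way to check for the traced events is to search the task manager logs for INFO messages; this is a minimal sketch, and the exact message content depends on your event payloads.
kubectl logs -l component=taskmanager -c flink-main-container --tail=1000 | grep INFO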

The bai-odm Kubernetes job is in Error state and its log contains a schema validation error

Problem
The bai-odm Kubernetes job is in Error status and the Flink job for processing ODM events is not running. After the bai-odm job is restarted with the following command, the bai-odm pod still shows the schema validation error.
kubectl get job <custom_resource_name>-bai-odm -o json | jq 'del(.spec.selector)' | jq 'del(.spec.template.metadata.labels)' | kubectl replace --force -f -
Schema validation error: processing-conf.json not conform to schema specification
Solution
  1. Retrieve the identifier of a management pod.
    kubectl get pods | grep insights-engine-management
  2. Connect to that pod and delete the directory that holds the configuration for Operational Decision Manager event processing in its mounted persistent volume.
    kubectl exec -it <management-pod> -- bash -c "rm -rf /mnt/pv/processing-conf/dba/bai-odm"
  3. Retrieve and restart the bai-setup Kubernetes job.
    kubectl get job <custom_resource_name>-bai-setup -o json | jq 'del(.spec.selector)' | jq 'del(.spec.template.metadata.labels)' | kubectl replace --force -f -
  4. After the bai-setup pod reaches 0/1 Completed status, restart the bai-odm Kubernetes job.
    kubectl get job <custom_resource_name>-bai-odm -o json | jq 'del(.spec.selector)' | jq 'del(.spec.template.metadata.labels)' | kubectl replace --force -f -
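
After the bai-odm pod completes, you can verify that the Flink job for processing ODM events is running again by reusing the job listing command from earlier in this topic.
JOB_MANAGER_POD=$(kubectl get pods -l component=jobmanager --no-headers -o custom-columns=Name:.metadata.name | head -n 1)
kubectl exec -c flink-main-container -it ${JOB_MANAGER_POD} -- /opt/flink/bin/flink list | grep RUNNING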

Pods fail after an OpenShift cluster is restarted

If a Red Hat® OpenShift® cluster is restarted, JobManager pods with High Availability (HA) enabled may enter a CrashLoopBackOff state and fail to recover. To properly restart the cluster, see Pods fail after an OpenShift cluster is restarted.