Checkpoints and savepoints

To ensure recovery from possible job failures, IBM Business Automation Insights uses Apache Flink checkpoints and savepoints, together with high availability mode.

Checkpoints

Business Automation Insights processing jobs are stateful. To make this state fault tolerant, Flink takes checkpoints at regular intervals. Checkpoints are stored in the persistent volume (PV), in the /mnt/pv/checkpoints/ folder. For each job, only the three most recent successful checkpoints are kept in the persistent volume.

The /mnt/pv/checkpoints/ folder contains one subfolder for each processing job.

You can modify the checkpointing interval by setting the value of the bai_configuration.flink.job_checkpointing_interval parameter in the YAML file of your Business Automation Insights custom resource (CR). This property sets the number of milliseconds between two consecutive checkpoints; the default value is 5000 ms (5 seconds).
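For example, to take a checkpoint every 10 seconds instead, you might set the parameter as in the following sketch. Only the relevant fragment of the custom resource is shown; the surrounding fields of your CR are unchanged.

  # Fragment of the Business Automation Insights custom resource (illustrative)
  bai_configuration:
    flink:
      # Interval between two consecutive checkpoints, in milliseconds
      job_checkpointing_interval: 10000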

Savepoints

Job submitters create a savepoint each time a new version of a processing job is deployed, so that the current job stops safely and the new job version redeploys from that savepoint. Savepoints are stored in the persistent volume, in the /mnt/pv/savepoints folder.

The /mnt/pv/savepoints folder contains one subfolder for each processing job.
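The savepoints that the job submitters create are standard Apache Flink savepoints. Purely as an illustration of the underlying mechanism, and not as a step that you normally run yourself, a savepoint for a running job can be triggered manually with the Flink command-line interface. Here, <job-id> is a placeholder for the Flink job ID.

  # Trigger a savepoint for a running job and write it to the persistent volume
  flink savepoint <job-id> /mnt/pv/savepoints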

RocksDB state backend

Business Automation Insights processing jobs use the RocksDB state backend to store checkpoints and savepoints in the persistent volume. To configure RocksDB, use the Flink predefined configuration options for this state backend. You can update those options by setting the bai_configuration.flink.rocks_db_properties_config_map parameter in your custom resource YAML file. See Specifying a configuration map for RocksDB properties.
Note: Modifying RocksDB options might increase or decrease memory consumption by task managers.
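As an illustration, the rocks_db_properties_config_map parameter references a configuration map that carries the RocksDB options. The configuration map name and data layout in the following sketch are assumptions; follow Specifying a configuration map for RocksDB properties for the exact format that Business Automation Insights expects.

  # Hypothetical configuration map holding RocksDB options (layout is an assumption)
  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: bai-rocksdb-options
  data:
    # One of the Flink predefined RocksDB option profiles
    state.backend.rocksdb.predefined-options: SPINNING_DISK_OPTIMIZED
  ---
  # Corresponding fragment of the Business Automation Insights custom resource
  bai_configuration:
    flink:
      rocks_db_properties_config_map: bai-rocksdb-options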

Failure and recovery

Job recovery proceeds as follows.
  1. If any of the cluster components crashes, Flink recovers the jobs automatically after Kubernetes brings a new instance back. The recovered job restarts from the latest successful checkpoint, and processing resumes from where it was before the failure.
  2. If the failure occurs while a processing job is in progress, for example because a wrong job parameter value results in unauthorized access to the OpenSearch servers, the job restarts from the latest checkpoint.
  3. If the issue is not corrected, the job fails again, restarts from the latest checkpoint, and so on. In such cases, you can resolve the issue on the external service; the job then eventually runs correctly again, and event processing resumes from where it left off before the failure.
  4. If the issue can be solved by updating the job configuration, update the configuration with the new values. In this case, a savepoint is created before the job is updated; the job is then canceled, and the new version is submitted to the cluster from that savepoint so that processing resumes from where it was before the failure, as illustrated in the sketch after this list.
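Under the covers, this update sequence corresponds to standard Flink savepoint operations, which the job submitters run for you. A minimal hand-driven equivalent might look like the following sketch; <job-id>, the generated savepoint folder name, and the job JAR are placeholders.

  # Stop the running job and write a savepoint to the persistent volume
  flink stop --savepointPath /mnt/pv/savepoints <job-id>

  # Submit the updated job so that it resumes from that savepoint
  flink run -s /mnt/pv/savepoints/<generated-savepoint-folder> <updated-job>.jar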

For more information

You can find reference information about Flink parameters in Apache Flink parameters.

For more information about Flink predefined configuration options, see the Enum PredefinedOptions page of the version 1.11 Flink documentation.