Restarting from a checkpoint or savepoint

You can restart event processing from a specific checkpoint or savepoint, for example after an unrecoverable error.

Before you begin

Make sure you have the jq command-line JSON processor installed. The jq tool is available from this page: https://stedolan.github.io/jq/.
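
For example, you can verify that jq is installed and can parse JSON on the machine where you run the kubectl commands. The sample document in this check is illustrative only; it mimics the shape of the Flink REST API response that is queried later in this procedure.
    # Check that jq is on the PATH
    jq --version
    # Quick parsing test with an illustrative JSON document
    echo '{"latest":{"completed":{"external_path":"file:/mnt/pv/checkpoints/example"}}}' | jq -r ".latest.completed.external_path"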

About this task

Restarting from a checkpoint or savepoint gives you the opportunity to fix an unrecoverable error and resume processing from a known valid state. You create savepoints as part of the upgrade procedure, as documented in Upgrading Business Automation Insights.
Important: Starting from savepoints is mandatory when you upgrade Business Automation Insights and the new version is based on a new Apache Flink version.

You can also create savepoints at any time by running the exec command as described next. If a job failure is preventing you from creating a savepoint, that is, if the create-savepoints.sh script returns an error, use the latest successful checkpoint.

You can then restart processing from either a savepoint or a checkpoint by setting the <job_name>.recoveryPath parameter of each job submitter in the release properties.
Tip: Prefer savepoints over checkpoints because savepoints are retained until you explicitly delete them. Use checkpoints only when savepoint creation fails. Because only the three latest successful checkpoints are retained, cancel the job before you restart from a checkpoint so that a new checkpoint cannot replace the one you need. On cancellation, the three latest checkpoints are kept.

 New in 18.0.2  For upgrades that include a new version of Flink, a savepoint is required for each processing job. You cannot use checkpoints in this case.
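
If you want to see which savepoints already exist before you choose a recovery path, you can list the savepoint directory on the persistent volume. This is a sketch only: it assumes that the volume is mounted at /mnt/pv in the job manager pod, as the savepoint paths in the procedure below indicate, and it reuses the $JOBMANAGER variable that step 1 of the procedure defines.
    # List existing savepoints for a job (directory layout taken from the paths in this topic)
    kubectl exec -it $JOBMANAGER --namespace <my-namespace> -- ls /mnt/pv/savepoints/dba/bai-<job-name>
    # A path such as file:/mnt/pv/savepoints/dba/bai-<job-name>/savepoint-<id> becomes the value
    # of the <job_name>.recoveryPath parameter for that job.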

Procedure

  1. Retrieve the name of the job manager pod.
    JOBMANAGER=`kubectl get pods --selector=release=<custom_resource_name> --namespace <my-namespace> | grep bai-flink-jobmanager | awk '{print $1}'`
  2. Create savepoints for all the running processing jobs by using the script provided in the job manager pod.
    kubectl exec -it $JOBMANAGER --namespace <my-namespace> -- scripts/create-savepoints.sh
    The script returns the path of the created savepoints for the following job names: bai-ingestion, bai-bpmn, bai-bawadv, bai-icm, bai-odm, bai-content, bai-adw
    Savepoint completed. Path: file:/mnt/pv/savepoints/dba/bai-<job-name>/savepoint-<id>
    Savepoints are created for all the jobs while they continue running.
  3. If you need to stop the processing right after the creation of the savepoints, run the script with the -s flag.
    Typically, this action is required for an upgrade.
    kubectl exec -it $JOBMANAGER --namespace <my-namespace> -- scripts/create-savepoints.sh -s
    This command stops the jobs and returns the path to the created savepoints.
    Savepoint stored in file:/mnt/pv/savepoints/dba/bai-<job-name>/savepoint-<id>.
  4. Optional: If, and only if, the create-savepoints.sh script returns an error while creating the savepoints, use the latest successful checkpoint instead. A combined sketch of this fallback follows the procedure.
    The create-savepoints.sh script returns the names and identifiers of the jobs that failed to create savepoints.
    The savepoint for job 'dba/bai-<job-name>' with ID: <job-id> could not be created.
    1. Cancel the jobs to prevent the creation of new checkpoints.
      kubectl exec -it $JOBMANAGER --namespace <my-namespace> -- flink cancel <job-id>
    2. Retrieve the latest successful checkpoint.
      kubectl exec -it $JOBMANAGER --namespace <my-namespace> -- curl -sk https://localhost:8081/jobs/<job-id>/checkpoints | jq ".latest.completed.external_path"
    Attention: The error that prevented the creation of savepoints might reoccur when you restart from the latest successful checkpoint. Before you restart the job from the checkpoint, examine the job logs to identify the problem and take the necessary steps to fix it. For information about monitoring jobs, see Troubleshooting.
  5. To update the <job_name>.recoveryPath parameter, follow the procedure in Updating your Business Automation Insights custom resource.

    By default, you can restart a job from the same checkpoint or savepoint only once. This restriction is a safety mechanism in case you forget to remove the value of the <job_name>.recoveryPath parameter. If you try to restart more than once, the job submitter enters an error state and returns a message such as Error: The savepoint <path/to/savepoint> was already used. The Job won't be run from there.

    The job resumes processing from where it was when the specified checkpoint or savepoint was created.

  6. Optional: If you need to restart a job from the same checkpoint or savepoint more than once, first delete the /recovery/<job-name>/<savepoint-id> entry on the persistent volume (PV).
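
For convenience, the checkpoint fallback of step 4 can be scripted as the following sketch. It is not part of the product scripts: it assumes the $JOBMANAGER variable from step 1, adds -r to jq so that the path is printed without quotation marks, and drops the -t option of kubectl exec because no interactive terminal is needed when the output is captured.
    # Cancel the failing job first so that its latest checkpoints are retained
    kubectl exec -i $JOBMANAGER --namespace <my-namespace> -- flink cancel <job-id>
    # Retrieve the path of the latest successful checkpoint
    CHECKPOINT=$(kubectl exec -i $JOBMANAGER --namespace <my-namespace> -- curl -sk https://localhost:8081/jobs/<job-id>/checkpoints | jq -r ".latest.completed.external_path")
    # Use this value as the <job_name>.recoveryPath parameter in step 5
    echo "Set <job_name>.recoveryPath to: $CHECKPOINT"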

Results

Jobs are restored from the savepoint or checkpoint by using the allowNonRestoredState Flink parameter. That parameter allows state that no longer maps to an operator in the restarted job to be skipped, so that you can remove operators such as HDFS storage or Kafka egress. Therefore, be careful when you restore from a savepoint or checkpoint, and check the logs. When state is skipped, the job manager log contains a message such as the following one. Such messages appear only when you disable operators, and only once for each operator.
2019-01-08 18:11:04,737 INFO org.apache.flink.runtime.checkpoint.Checkpoints - Skipping savepoint state for operator <operator-id>.
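
You do not pass the allowNonRestoredState parameter yourself; the job submitters add it when they resubmit the jobs. For context only, on a standalone Flink 1.7 to 1.10 installation the equivalent command line looks roughly like the following sketch, where -s selects the savepoint and -n (--allowNonRestoredState) allows state that no longer matches an operator to be skipped.
    # Conceptual illustration only; the Business Automation Insights job submitters run the equivalent for you
    flink run -s file:/mnt/pv/savepoints/dba/bai-<job-name>/savepoint-<id> -n <job-jar-and-arguments>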

What to do next

The allowNonRestoredState parameter
For more information, see the Restore a savepoint section of the Flink documentation for version 1.7, 1.9, or 1.10, depending on the Flink version that you use.
Disabling HDFS storage or Kafka egress
For detailed instructions, see Advanced updates to your IBM Business Automation Insights deployment.