Restarting from a checkpoint or savepoint
You can restart event processing from a specific checkpoint or savepoint, for example after an unrecoverable error.
Before you begin
About this task
Restarting from a checkpoint or savepoint lets you fix the cause of an unrecoverable error
and then resume processing from a valid savepoint or checkpoint. You create savepoints as part of
the upgrade procedure, as documented in Upgrading Business Automation Insights.
Important: Starting from savepoints is mandatory when you upgrade Business Automation Insights and the new version is based on a
new Apache Flink version.
You can also create savepoints at any time by running the exec command as described next. If a job failure prevents you from creating a savepoint, that is, if the create-savepoints.sh script returns an error, use the latest successful checkpoint instead.
You can then restart processing by setting the <job_name>.recoveryPath
parameter of each job submitter in the release properties to either a savepoint or a
checkpoint.
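As a rough illustration of the one-entry-per-job rule, the following Python sketch builds a `<job_name>.recoveryPath` entry for each job from a mapping of job names to savepoint or checkpoint paths. The job names, the paths, the `recovery_path_entries` helper, and the `key=value` output are all hypothetical. The actual syntax and location of the release properties depend on your deployment.

```python
def recovery_path_entries(paths_by_job):
    """Return one '<job_name>.recoveryPath' entry per job, sorted by job name.

    paths_by_job maps a job name to the savepoint or checkpoint path
    that the job should be restored from.
    """
    entries = {}
    for job_name, path in sorted(paths_by_job.items()):
        if not path:
            # Every job submitter needs its own recovery path.
            raise ValueError(f"no savepoint or checkpoint path for job {job_name!r}")
        entries[f"{job_name}.recoveryPath"] = path
    return entries


if __name__ == "__main__":
    # Hypothetical job names and storage paths, for illustration only.
    example = {
        "jobA": "/savepoints/jobA/savepoint-1",
        "jobB": "/checkpoints/jobB/chk-42",
    }
    for key, value in recovery_path_entries(example).items():
        print(f"{key}={value}")
```

Failing on an empty path is a deliberate choice in this sketch: a job without a recovery path would otherwise be left out without any warning.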
Tip: Prefer savepoints over checkpoints because savepoints are retained until you
explicitly delete them. Use checkpoints only when savepoint creation fails. Only the three latest
successful checkpoints are retained, and the oldest one is deleted each time a new checkpoint is
created. Therefore, cancel the job first so that the checkpoint you select is not deleted before
you restart from it. On cancellation, the latest three checkpoints are retained.
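After the job is canceled, you still need to identify the latest retained checkpoint. The sketch below assumes the usual Flink layout, in which each retained checkpoint is a `chk-<n>` directory that contains a `_metadata` file. It also assumes that the checkpoint directory of the job is reachable as a local file system path. The `latest_checkpoint` helper is hypothetical and is not part of Business Automation Insights.

```python
import re
from pathlib import Path

# Assumed Flink naming convention for retained checkpoint directories.
_CHK_DIR = re.compile(r"^chk-(\d+)$")


def latest_checkpoint(job_checkpoint_dir):
    """Return the highest-numbered chk-<n> directory that has a _metadata file.

    Returns None when no complete checkpoint directory is found.
    """
    best = None
    best_id = -1
    for child in Path(job_checkpoint_dir).iterdir():
        match = _CHK_DIR.match(child.name)
        if not match or not child.is_dir():
            continue  # skip entries that are not chk-<n> directories
        if not (child / "_metadata").is_file():
            continue  # skip directories without checkpoint metadata
        chk_id = int(match.group(1))
        if chk_id > best_id:  # numeric comparison, so chk-10 ranks above chk-9
            best, best_id = child, chk_id
    return best
```

The checkpoint numbers are compared as integers rather than as strings, because a plain alphabetical sort would rank `chk-9` above `chk-10`.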
New in 18.0.2: For upgrades that include a new version of Flink, a savepoint is required for each processing job. You cannot use checkpoints in this case.
Procedure
Results
Jobs are restored from the savepoint or checkpoint by using the
allowNonRestoredState Flink parameter. That parameter is set so that you
can remove operators, such as HDFS storage or Kafka egress. Because the parameter lets Flink
ignore state that no longer maps to an operator, check the logs whenever you restore from a
savepoint or checkpoint. When a state is ignored, the job manager log contains a message such as
the following one. Similar messages appear only when you disable operators, and only once for
each operator.
2019-01-08 18:11:04,737 INFO org.apache.flink.runtime.checkpoint.Checkpoints - Skipping savepoint state for operator <operator-id>.
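To check a job manager log for such messages, a small script can collect the affected operator IDs. The following Python sketch matches only the message text shown above. The `skipped_operators` helper and the operator ID in the usage example are hypothetical, and how you obtain the log lines depends on your deployment.

```python
import re

# Matches the job manager message shown above and captures the operator ID.
SKIPPED_STATE = re.compile(r"Skipping savepoint state for operator (\S+?)\.?$")


def skipped_operators(log_lines):
    """Return the operator IDs whose state was skipped, in order of first appearance."""
    seen = []
    for line in log_lines:
        match = SKIPPED_STATE.search(line.rstrip())
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


if __name__ == "__main__":
    # Hypothetical operator ID, for illustration only.
    sample = [
        "2019-01-08 18:11:04,737 INFO org.apache.flink.runtime.checkpoint.Checkpoints"
        " - Skipping savepoint state for operator 0123456789abcdef.",
    ]
    print(skipped_operators(sample))
```

If the returned list contains an operator that you did not intentionally disable, state for that operator was dropped during the restore and the job configuration is worth reviewing.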
What to do next
- The allowNonRestoredState parameter: For more information, see the Restore a savepoint section of the Flink documentation for version 1.7, 1.9, or 1.10, depending on the Flink version that you use.
- Disabling HDFS storage or Kafka egress: For detailed instructions, see Advanced updates to your IBM Business Automation Insights deployment.