Data loss prevention
In a grid environment, there may be hundreds or thousands of compute hosts distributed in a cluster. In a typical risk management application, there may be hundreds of thousands of perturbations of market data and conditions, each of which can be a workload unit.
When you submit this workload to a grid, you expect the grid system to distribute the workload across the grid and to guarantee that every workload unit is processed, even if hardware or software fails in:
- Grid management machines or software
- Compute machines and service applications
A reliable grid system should guarantee transactional handling of application execution on the grid. A failure, or even an entire system reboot, should not require rerunning the workload from the beginning.
One problem with a traditional MPI-based parallel application is that a failure in the distributed environment may cause the application to fail and have to be rerun from the beginning. Rerunning a large part of the workload, or the entire workload, not only wastes time and resources but may also cause you to miss the time window of a business opportunity.
Add recovery with recoverable sessions
IBM® Spectrum Symphony supports reliable computing by persisting Symphony session and task inputs and outputs. However, sometimes you may not want to recover your workload when a failure or error happens, or you may want to trade persistency for performance: task persistency takes time and disk space and may slow down the overall system response time.
You can define whether a session is recoverable or non-recoverable in the application profile through the session type. In the client application, you can then specify the appropriate session type in createSession().
Choose a recoverable session when:
- You have a long session that may last hours to compute many CPU-intensive tasks, and you do not want to waste CPU cycles to resubmit tasks in the session if a failure or error occurs.
- It is difficult or impossible to resubmit tasks in the session when a failure or error occurs.
- You have a mission-critical session that has to be finished before a deadline.
Choose a non-recoverable session when:
- You have a short session that may only last for minutes, and you can always create a new session to resubmit tasks if a failure or error occurs.
- You want Symphony to immediately clean up the session and release the CPUs if a failure or error happens. Keeping the session running in the system would just be a waste of CPU cycles.
- You have an interactive online session that requires quick response time.
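The criteria above can be summarized as a small decision helper. This is only an illustrative sketch: the function, the SessionTraits fields, the 60-minute threshold, and the two session type names are hypothetical, and the order in which conflicting criteria are resolved is an assumption. Use the session type names that your own application profile defines.

```python
from dataclasses import dataclass


@dataclass
class SessionTraits:
    """Hypothetical description of a session, used only for this sketch."""
    expected_minutes: float   # rough expected duration of the session
    resubmittable: bool       # can the client recreate and resubmit the tasks?
    has_hard_deadline: bool   # mission-critical, must finish before a deadline
    interactive: bool         # online session that needs quick response time


def choose_session_type(traits, long_session_minutes=60.0):
    """Return a session type name (placeholder names, not product-defined)."""
    # Work that cannot be resubmitted, or that must meet a deadline,
    # is assumed to outweigh the cost of persistence.
    if not traits.resubmittable or traits.has_hard_deadline:
        return "RecoverableSession"
    # Interactive sessions favor response time over persistence.
    if traits.interactive:
        return "NonRecoverableSession"
    # Long sessions are expensive to redo, so persist them.
    if traits.expected_minutes >= long_session_minutes:
        return "RecoverableSession"
    # Short, easily resubmitted sessions do not need recovery.
    return "NonRecoverableSession"
```

The returned name would then be the session type that the client passes when it creates the session.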
Implement application-level checkpointing for sessions
If you have long-running tasks, you may not want to rerun a task from the beginning in case of failure.
A good practice is to have a long-running task periodically persist its intermediate results, for example every 10 minutes, so that when Symphony reruns the task, it can continue from the last persisted intermediate results.
You need a persistent shared location, such as a persistent shared data cache or a shared file system, because a task may be rerun on a different machine than the one it ran on previously.
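As a minimal sketch of this practice, the following example sums a list of numbers and periodically writes its progress to a checkpoint file. It does not use the Symphony API: run_task, save_checkpoint, load_checkpoint, the JSON state layout, and the fail_at parameter (which simulates a failure) are all hypothetical names chosen for illustration, and the checkpoint path is assumed to be on a file system that every compute host can reach.

```python
import json
import os
import tempfile
import time


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it, so readers never see a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # rename within the same directory
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the last persisted state, or a fresh state if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_index": 0, "partial_sum": 0}


def run_task(items, checkpoint_path, interval_seconds=600, fail_at=None):
    """Sum items, persisting progress at most once per interval_seconds."""
    state = load_checkpoint(checkpoint_path)  # resume if a checkpoint exists
    last_saved = time.monotonic()
    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated host failure")
        state["partial_sum"] += items[i]
        state["next_index"] = i + 1
        if time.monotonic() - last_saved >= interval_seconds:
            save_checkpoint(checkpoint_path, state)
            last_saved = time.monotonic()
    save_checkpoint(checkpoint_path, state)  # persist the final result
    return state["partial_sum"]
```

Because run_task always starts by loading the checkpoint, a rerun on any host that can read the shared path continues from the last saved index instead of from the beginning. Work done after the last checkpoint is repeated, so the interval is a trade-off between persistence overhead and the amount of work that can be lost.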
Once a task can persist its intermediate results, you can perform application-level checkpointing by suspending the session.
A service instance can get an interrupt event by calling serviceContext.getLastInterruptEvent(), and use the grace period to persist intermediate results in a persistent shared location. Later, when either the whole suspended session is resumed or the unfinished task is re-dispatched, another service instance picks up the task and restores the intermediate results from the shared location.
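The interrupt-and-resume flow can be sketched as follows. This example does not call the Symphony API: the interrupt_pending callable is a hypothetical stand-in for polling the service context for an interrupt event, and invoke, persist, restore, and the JSON state layout are illustrative names. It assumes the checkpoint path is on a shared location and that the grace period is long enough to write the state.

```python
import json
import os


def persist(path, state):
    """Write state via a temp file and rename, to avoid partial checkpoints."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def restore(path):
    """Load the saved state, or start fresh if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "result": 0}


def invoke(total_steps, checkpoint_path, interrupt_pending):
    """Run steps until done or interrupted; return (finished, result)."""
    state = restore(checkpoint_path)
    for step in range(state["next_step"], total_steps):
        # Poll for an interrupt between units of work.
        if interrupt_pending():
            # Use the grace period to save progress, then give up the task.
            persist(checkpoint_path, state)
            return False, state["result"]
        state["result"] += step * step  # stand-in for the real computation
        state["next_step"] = step + 1
    persist(checkpoint_path, state)
    return True, state["result"]
```

A second call to invoke with the same checkpoint path, for example from another service instance after the session is resumed, restores the saved state and runs only the remaining steps. Polling between steps keeps the check cheap, but each step must be short relative to the grace period for the state to be saved in time.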