DB2 Version 10.1 for Linux, UNIX, and Windows

Member restart and crash recovery

Member restart is the process of restarting the database server processes on a failed member and performing member crash recovery on each database that requires it.

If a software or hardware failure on a host causes a member to fail, DB2® cluster services detects the failure and automatically restarts the member. A member restart is either a local restart, in which the member is restarted on its original host (its home host), or a restart light, in which the member is restarted on a different host:

Local restart: If a member fails because of a software failure but the member's home host is still active, DB2 cluster services attempts a local restart. The local member restart uses a reduced memory model so that the in-flight data is quickly recovered to a consistent state. After the in-flight data has been recovered, the database's normal memory model is initialized for full transaction processing.
Restart light: If a member's home host is inactive or if a local restart attempt fails, the member is automatically restarted as a guest member on another member's home host using a minimal amount of resources. A member running in restart light mode does not process new transactions because its sole purpose is to perform member crash recovery.
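The choice between the two restart modes can be summarized as a small decision function. The following is a minimal sketch that only restates the two rules above; the names `RestartMode`, `choose_restart_mode`, and `accepts_new_transactions` are hypothetical and are not part of any DB2 interface.

```python
from enum import Enum


class RestartMode(Enum):
    LOCAL = "local restart"
    LIGHT = "restart light"


def choose_restart_mode(home_host_active: bool, local_restart_failed: bool = False) -> RestartMode:
    """Hypothetical model of the restart-mode rules described above (not DB2 code)."""
    if home_host_active and not local_restart_failed:
        # Home host is still up: try restarting the member in place.
        return RestartMode.LOCAL
    # Home host is down, or the local attempt failed: restart as a guest
    # member on another member's home host, for crash recovery only.
    return RestartMode.LIGHT


def accepts_new_transactions(mode: RestartMode) -> bool:
    # A restart-light member exists only to perform member crash recovery;
    # a locally restarted member goes on to full transaction processing.
    return mode is RestartMode.LOCAL


print(choose_restart_mode(home_host_active=True).value)                              # local restart
print(choose_restart_mode(home_host_active=False).value)                             # restart light
print(choose_restart_mode(home_host_active=True, local_restart_failed=True).value)   # restart light
```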

Multiple member failures can generally be recovered through independent, concurrent member restarts, so a group restart is not usually required. The database remains open for access through the surviving members; only data that was in-flight on the failed members is unavailable while those members are restarted.

Member crash recovery

Member crash recovery rolls back incomplete transactions and completes committed transactions as part of member restart. This ensures that the database data modified by this member is brought to a consistent state. If indoubt transactions exist at the end of member crash recovery, the member is made available so that they can be resolved.
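To illustrate what "completing committed transactions" and "rolling back incomplete transactions" mean, the following toy sketch applies a generic redo/undo pass to one member's log stream. It is a simplified textbook model, not DB2's actual recovery algorithm; the record types (`update`, `prepare`, `commit`) and the names `LogRecord` and `member_crash_recovery` are hypothetical. Transactions that were prepared but never committed are treated as indoubt: their changes are kept and their IDs are returned for later resolution.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    kind: str           # hypothetical record types: "update", "prepare", "commit"
    txid: int
    key: str = ""
    before: object = None
    after: object = None


def member_crash_recovery(log, pages):
    """Toy redo/undo pass over one member's log stream (not DB2 code).

    `pages` is the on-disk state at crash time, which may hold only some of
    the logged changes. Returns the set of indoubt transaction IDs.
    """
    committed = {r.txid for r in log if r.kind == "commit"}
    prepared = {r.txid for r in log if r.kind == "prepare"}
    indoubt = prepared - committed

    # Redo: reapply every logged change in log order so committed work is complete.
    for r in log:
        if r.kind == "update":
            pages[r.key] = r.after

    # Undo: roll back incomplete (non-committed, non-indoubt) transactions
    # in reverse log order.
    for r in reversed(log):
        if r.kind == "update" and r.txid not in committed and r.txid not in indoubt:
            pages[r.key] = r.before

    return indoubt


log = [
    LogRecord("update", 1, "a", before=0, after=10),
    LogRecord("update", 2, "b", before=0, after=20),
    LogRecord("commit", 1),
    LogRecord("update", 3, "c", before=0, after=30),
    LogRecord("prepare", 3),
]
pages = {"a": 0, "b": 20, "c": 0}  # the crash left a partial mix of changes on disk
indoubt = member_crash_recovery(log, pages)
print(pages)    # {'a': 10, 'b': 0, 'c': 30}
print(indoubt)  # {3}
```

Redoing everything before undoing the incomplete transactions keeps the sketch simple: the undo pass can then rely on every logged change being present on "disk".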

Member crash recovery is performed when a viable cluster caching facility is still available, the database is activated on a member, and the member's log stream is determined to be inconsistent on disk because of an abnormal termination. In most cases, member crash recovery is invoked automatically through member restart and the automatic recovery agent. The automatic recovery agent, which is started when the instance is brought up, takes action when it detects that a database on the member is inconsistent.

After member crash recovery of a database completes, the member can accept incoming connection requests from other applications, provided that the member was restarted on its original host.
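The trigger conditions for member crash recovery and the post-recovery connection rule can be modeled as a single pass of a recovery agent. This is a minimal sketch under stated assumptions: `MemberState`, `needs_member_crash_recovery`, and `recovery_agent_pass` are hypothetical names, and the recovery work itself is reduced to marking the log stream consistent.

```python
from dataclasses import dataclass


@dataclass
class MemberState:
    cf_viable: bool              # a viable cluster caching facility is still available
    db_activated: bool           # the database is activated on this member
    log_stream_consistent: bool  # False after an abnormal termination
    on_home_host: bool           # False for a restart-light guest member


def needs_member_crash_recovery(s: MemberState) -> bool:
    # All three conditions described above must hold.
    return s.cf_viable and s.db_activated and not s.log_stream_consistent


def recovery_agent_pass(members: dict) -> dict:
    """One pass of a hypothetical automatic recovery agent (not DB2 code)."""
    actions = {}
    for name, s in members.items():
        if needs_member_crash_recovery(s):
            # Placeholder for the actual redo/undo work.
            s.log_stream_consistent = True
            # Only a member restarted on its home host accepts new connections.
            actions[name] = "accepting connections" if s.on_home_host else "crash recovery only"
    return actions


members = {
    "m0": MemberState(cf_viable=True, db_activated=True, log_stream_consistent=False, on_home_host=True),
    "m1": MemberState(cf_viable=True, db_activated=True, log_stream_consistent=False, on_home_host=False),
    "m2": MemberState(cf_viable=True, db_activated=True, log_stream_consistent=True, on_home_host=True),
}
print(recovery_agent_pass(members))
# {'m0': 'accepting connections', 'm1': 'crash recovery only'}
```

Member `m2` is skipped because its log stream is already consistent, so no recovery is needed.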