How Db2 recovers from coupling facility failures

Failures of the coupling facility can be classified into two main groups: structure failures and connectivity failures.

Important: To ensure high availability for Db2 data sharing and reduce the time and effort for recovering from failures, follow the practices described in Planning for coupling facility availability for data sharing.

Structure failures occur when structures such as the group buffer pool, SCA, or lock structures in the coupling facility become damaged, but the coupling facility continues to operate.

Connectivity failures occur when a z/OS system loses access to the coupling facility, up to and including a total failure of the coupling facility. They can be caused by the following types of problems:

  • Problems with the attachment of the z/OS® system to the coupling facility
  • A power failure that affects the coupling facility but leaves one or more z/OS systems running
  • Deactivation of the coupling facility partition
  • Failure of the coupling facility control code
  • Failure of the coupling facility CPC or LPAR

The following summary shows, for each failure type, how Db2 recovers when the structures are duplexed and when they are simplexed.

Failure type: Group buffer pool structure

Recovery if duplexed: The failing structure is deallocated, and processing continues with the running structure. This recovery is usually fast and unnoticeable.

Recovery if simplexed: Recovery from the log can occur manually, as the result of a START DATABASE command, or automatically, if the group buffer pool is defined with the AUTOREC(YES) option.

Failure type: SCA and lock structure

Recovery if duplexed: The failing structure is deallocated, and processing continues on the running structure.

Recovery if simplexed: Db2 uses information that is contained in its virtual storage to quickly rebuild the structures.

Recovery of a simplexed SCA is usually fast and has a minimal impact on performance. Recovery of a simplexed lock structure is also fast in normal cases, but the performance impact depends on the number of held locks at the time of the failure.

Db2 can rebuild a simplexed SCA and lock structure in the same coupling facility or in an alternate coupling facility, assuming that the following conditions are true:

  • You specified the alternate coupling facility in the CFRM policy preference list.
  • You allocated enough storage in the alternate coupling facility to rebuild the structures there.

If Db2 fails to rebuild the SCA and lock structure from virtual storage, all active members in the group terminate abnormally, and you must perform a group restart to recover the necessary information from the logs.

Failure type: Connectivity

Recovery if duplexed: Db2 switches to the structure with good connectivity.

Recovery if simplexed: Db2 rebuilds simplexed structures on the alternate coupling facility that is specified in the CFRM policy. Recovery of a simplexed SCA is fast and has little or no impact on performance. How long recovery of a lock structure takes depends on the number of modify locks in the lock structure at the time of the failure, but recovery is also fast in most cases.

However, recovery of simplexed group buffer pool structures is very disruptive to the system.

In rebuilding these structures, Db2 attempts to allocate storage on the alternate coupling facility. Db2 uses the current size of the structure for the initial size of the structure on the alternate coupling facility. If Db2 cannot allocate the storage for the SCA or lock structure, the rebuild fails. If z/OS cannot allocate the storage for the group buffer pools, the changed pages are written to disk.
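
The allocation outcomes described above for a rebuild on the alternate coupling facility can be sketched as a small decision function. This is an illustrative model only, not Db2 or z/OS code: the names Structure and rebuild_on_alternate, the size parameters, and the simple comparison of free storage against the current structure size are assumptions made for the example.

```python
from enum import Enum


class Structure(Enum):
    # Hypothetical labels for the three structure types named in the text.
    SCA = "SCA"
    LOCK = "lock structure"
    GBP = "group buffer pool"


def rebuild_on_alternate(structure: Structure,
                         current_size: int,
                         alternate_free: int) -> str:
    """Return the outcome the text describes for a simplexed rebuild.

    current_size   -- current size of the failed structure (any unit)
    alternate_free -- storage available on the alternate coupling
                      facility, in the same unit (simplified assumption)
    """
    # The current size of the structure is used as the initial size
    # requested on the alternate coupling facility.
    if alternate_free >= current_size:
        return "rebuilt on alternate coupling facility"
    # Storage could not be allocated: the outcome depends on the structure.
    if structure in (Structure.SCA, Structure.LOCK):
        return "rebuild fails"
    return "changed pages written to disk"


if __name__ == "__main__":
    for s in Structure:
        print(s.value, "->", rebuild_on_alternate(s, 100, 50))
```

The sketch makes the asymmetry explicit: a storage shortage is fatal to the rebuild of the SCA or lock structure, whereas for a group buffer pool the changed pages are written to disk instead.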

Another type of failure is a channel failure, which occurs when the channel connecting a CPC to a coupling facility no longer operates. Without dual channels (sometimes called links), a channel failure is more likely to occur than a failure in the coupling facility. Losing connectivity to the SCA or lock structure can bring down the affected member, unless you specify duplexing or an alternate coupling facility in the CFRM policy preference list.
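
The exposure described above can be expressed as a simple check over a description of the policy. The following sketch is hypothetical: StructurePolicy and exposed_structures are illustrative names, the structure and coupling facility names are made up, and the data is entered by hand rather than read from a real CFRM policy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StructurePolicy:
    # Hand-entered summary of one structure's policy settings (assumed shape).
    name: str
    preference_list: List[str]  # coupling facilities, in preference order
    duplexed: bool              # True if the structure is duplexed


def exposed_structures(policies: List[StructurePolicy]) -> List[str]:
    """Return names of structures with neither duplexing nor an alternate
    coupling facility in the preference list."""
    return [
        p.name
        for p in policies
        if not p.duplexed and len(p.preference_list) < 2
    ]


if __name__ == "__main__":
    policies = [
        StructurePolicy("EXAMPLE_SCA", ["CF01"], duplexed=False),
        StructurePolicy("EXAMPLE_LOCK1", ["CF01", "CF02"], duplexed=False),
    ]
    print(exposed_structures(policies))
```

A structure that this check reports is one for which a loss of connectivity could bring the affected member down, because there is no second structure instance and no alternate coupling facility to rebuild in.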