If properly configured, Db2 and IRLM can recover very quickly and with very little disruption from any kind of coupling facility failure. If coupling facilities are not properly configured, coupling facility failures can cause serious outages for users.
About this task
When planning your data sharing configuration for high availability, the primary concerns are physical protection of the coupling facility, protection of the structures in the coupling facility, and the network connectivity. The SCA and lock structure are both necessary for the group to function.
Multiple coupling facilities are always required for high availability because a single coupling facility is a single point of failure for the data sharing group. With multiple coupling facilities, you can specify that structures be allocated in the secondary coupling facility if the primary coupling facility is damaged. You can also consider duplexing the SCA, lock, and group buffer pool structures. With duplexing, a secondary structure is always on standby in another coupling facility. This secondary structure is ready to take over if the primary structure fails or if a connectivity failure occurs. If you have three or more coupling facilities, you can even maintain duplexing while performing maintenance on one of the coupling facilities.
For more information about the different types of coupling facility failures, and how to recover from them, see How Db2 recovers from coupling facility failures.
Procedure
To ensure the highest availability for Db2 data sharing and reduce the time and effort for recovering from failures, use the following practices, which are listed in priority order, to configure the coupling facilities:
- Duplex the group buffer pool so that Db2 can switch to the secondary structure if the primary structure fails.
The performance cost of duplexed group buffer pools is negligible in most cases, and the availability benefits are very high. Although the loss of a group buffer pool does not require a group restart, availability for users and important applications requires that data in a group buffer pool be available as quickly as possible after a failure. Duplexing group buffer pool structures ensures minimal impact from failures. Duplexing can also help you avoid the hours of recovery time that are often required to recover simplexed group buffer pool structures.
Duplexing the SCA and lock structure is not as important for high availability because these structures can be rebuilt dynamically on an alternate coupling facility, if the coupling facility that contains the SCA and lock structure fails. However, if the SCA and lock structure are not duplexed, the coupling facility that contains these structures should be failure isolated as described below. If failure isolation is not possible for the SCA and lock structures, use the following approaches to duplex the structures:
- Use system-managed duplexing for the SCA structure. It has a performance cost, but the usage pattern of this structure might make it tolerable.
- Use asynchronous duplexing for the lock structure. It has a performance cost, but it is less than the cost of system-managed duplexing.
For more information about duplexing, see Duplexed structures.
- Isolate failures by configuring multiple coupling facilities, and physically separate each coupling facility on a separate central processor complex (CPC) from all z/OS sysplex members, including any Db2 data sharing members that use the coupling facility.
A failure-isolated coupling facility resides in a central processor complex (CPC) that does not also contain any data sharing member that is connected to structures in that coupling facility.
By separating the SCA and lock structure from the systems that use them, you can minimize the chances of performing a lengthy group restart after a lengthy outage. Db2 uses information in the lock structure and SCA for damage assessment, which determines which databases must be recovered so that group buffer pools can be recovered quickly. If you lose the lock structure or SCA at the same time as one or more group buffer pools, Db2 waits until the lock structure and SCA are rebuilt before doing damage assessment.
If the SCA and lock structure reside in a non-failure-isolated coupling facility (a coupling facility that contains the SCA and lock structure and resides in a CPC that also contains a member of that data sharing group), the CPC becomes a single point of failure. If the CPC fails, the entire data sharing group comes down. Duplexing the SCA and lock structure, or keeping the SCA and lock structure in a failure-isolated coupling facility, avoids this single point of failure.
Also, consider putting the lock structure and SCA in a coupling facility that does not contain important cache structures (such as group buffer pool 0). You are less likely to lose the SCA, lock structure, and the group buffer pool at the same time if you carefully separate these structures by placing them in different coupling facilities.
- Use non-volatile coupling facilities that use uninterruptible power supply equipment.
If a coupling facility is configured to be non-volatile (using the proper power backup, such as a battery backup), volatility is generally a transient state that might occur if, for example, you remove the battery. If you lose power to a non-volatile coupling facility, the coupling facility enters power save mode and saves the data that is contained in the structures. When power is returned, you do not need to do a group restart or recover the data from the structures.
When a coupling facility is in a volatile state, data in the coupling facility is not saved in the event of a power failure. The following messages indicate that a coupling facility is in a vulnerable state.
- DSNB302I or DSNB301I for group buffer pools.
- DXR141I for the LOCK1 structure.
- DSN7507I or DSN7509I for the SCA structure.
- Take the following actions for simplexed structures:
- Specify system weights in an active system failure management (SFM) policy. Unless you do this, it is not possible to automatically rebuild simplexed coupling facility structures. If the SCA and lock structure cannot be rebuilt, Db2 abnormally terminates the members affected by the loss of those structures, or the loss of connectivity to those structures. If the group buffer pool cannot be rebuilt, which is only attempted when a subset of members lose connectivity, those members disconnect from the group buffer pool.
- Specify a REBUILDPERCENT value in the CFRM policy for all Db2-related structures. In general, specify a low REBUILDPERCENT value to allow for automatic rebuild when a member loses connectivity. For more information, see Specifying when structure rebuilds occur after connectivity is lost.
- Configure adequate storage in an alternate coupling facility to rebuild or reallocate structures as needed. For rebuild, z/OS uses the current size of the structure to allocate storage on the alternate coupling facility that the CFRM policy specifies. If z/OS cannot allocate enough storage to rebuild the SCA or lock structure, the rebuild fails. If it cannot allocate enough storage for the group buffer pool, Db2 must write the changed pages to disk instead of rebuilding them into the alternate group buffer pool.
- Enable automatic recovery for the group buffer pool. For more information, see Automatic recovery requirements for group buffer pools.
- To prepare for channel failures, consider using dual channels between each CPC and a coupling facility.
Without dual channels (sometimes called links), a channel failure is more likely to occur than a failure in the coupling facility. Losing connectivity to the SCA or lock structure can bring that particular member down, unless you specify duplexing or an alternate coupling facility in the CFRM policy preference list.
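The failure-isolation rule in the procedure above can be checked mechanically against a planned configuration. The following Python fragment is a minimal sketch of that check, not a Db2 or z/OS tool: the class names, the function names, and the sample names (CF01, CPC1, DB1A, and so on) are all hypothetical, and the model considers only CPC placement, structure placement, and whether a structure is duplexed.

```python
from dataclasses import dataclass

# Structures that the group cannot function without, per the text above.
CRITICAL = frozenset({"SCA", "LOCK1"})


@dataclass(frozen=True)
class CouplingFacility:
    name: str              # hypothetical coupling facility name
    cpc: str               # CPC that hosts this coupling facility
    structures: frozenset  # structure names allocated here


@dataclass(frozen=True)
class Member:
    name: str                 # hypothetical Db2 member name
    cpc: str                  # CPC that hosts this member
    connected_cfs: frozenset  # coupling facilities this member connects to


def is_failure_isolated(cf, members):
    """True if no connected member shares a CPC with the coupling facility."""
    return not any(
        m.cpc == cf.cpc and cf.name in m.connected_cfs for m in members
    )


def single_points_of_failure(cfs, members, duplexed=frozenset()):
    """List (cpc, cf, structures) where a CPC failure could bring down the group.

    A CPC is flagged when it hosts a coupling facility that is not
    failure isolated and that holds a non-duplexed SCA or lock structure.
    """
    risky = []
    for cf in cfs:
        exposed = (cf.structures & CRITICAL) - duplexed
        if exposed and not is_failure_isolated(cf, members):
            risky.append((cf.cpc, cf.name, sorted(exposed)))
    return risky


# Hypothetical configuration: CF01 shares CPC1 with member DB1A.
cfs = [
    CouplingFacility("CF01", "CPC1", frozenset({"SCA", "LOCK1"})),
    CouplingFacility("CF02", "CPC3", frozenset({"GBP0"})),
]
members = [
    Member("DB1A", "CPC1", frozenset({"CF01", "CF02"})),
    Member("DB2A", "CPC2", frozenset({"CF01", "CF02"})),
]
print(single_points_of_failure(cfs, members))
```

In this sample, CPC1 is reported because CF01 holds the simplexed SCA and lock structure on the same CPC as a connected member. Passing `duplexed={"SCA", "LOCK1"}`, or moving CF01 to a CPC that hosts no member, clears the finding, which mirrors the two remedies that the procedure describes.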
Results
The following table summarizes how Db2 recovers from structure and connectivity failures for different structures and configurations.
| Failure type | Recovery if duplexed | Recovery if simplexed |
| --- | --- | --- |
| Group buffer pool structure | The failing structure is deallocated, and processing continues with the running structure. This recovery is usually fast and unnoticeable. | Recovery from the log can occur manually, as the result of a START DATABASE command, or if the group buffer pool is defined with the AUTOREC(YES) option, it can occur automatically. |
| SCA and lock structure | The failing structure is deallocated, and processing continues on the running structure. | Db2 uses information that is contained in its virtual storage to quickly rebuild the structures. Recovery of a simplexed SCA is usually fast and has a minimal impact on performance. Recovery of a simplexed lock structure is also fast in normal cases, but the performance impact depends on the number of held locks at the time of the failure. Db2 can rebuild a simplexed SCA and lock structure in the same coupling facility or in an alternate coupling facility, assuming that you specified the alternate coupling facility in the CFRM policy preference list and that you allocated enough storage in the alternate coupling facility to rebuild the structures there. If Db2 fails to rebuild the SCA and lock structure from virtual storage, all active members in the group terminate abnormally, and you must perform a group restart to recover the necessary information from the logs. |
| Connectivity | Db2 switches to the structure with good connectivity. | Db2 rebuilds simplexed structures on the alternate coupling facility that is specified in the CFRM policy. Recovery of a simplexed SCA is fast and has little or no impact on performance. The performance of the recovery of a lock structure depends on the number of modify locks in the lock structure at failure time, but it is also fast in most cases. However, recovery of simplexed group buffer pool structures is very disruptive to the system. In rebuilding these structures, Db2 attempts to allocate storage on the alternate coupling facility. Db2 uses the current size of the structure for the initial size of the structure on the alternate coupling facility. If Db2 cannot allocate the storage for the SCA or lock structure, the rebuild fails. If z/OS cannot allocate the storage for the group buffer pools, the changed pages are written to disk. |
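The storage rule for rebuilding simplexed structures on an alternate coupling facility can be summarized as a small decision function. The following Python fragment is an illustrative sketch only: the function name, the outcome names, and the kilobyte units are hypothetical, and it treats allocation as all-or-nothing against the current structure size, which is a simplification of what actually happens during a rebuild.

```python
from enum import Enum


class Outcome(Enum):
    REBUILT = "structure rebuilt in the alternate coupling facility"
    REBUILD_FAILS = "rebuild fails"
    PAGES_TO_DISK = "changed pages are written to disk"


def rebuild_outcome(structure, current_size_kb, free_kb_on_alternate):
    """Predict the result of rebuilding a simplexed structure.

    structure: "SCA", "LOCK1", or a group buffer pool name such as "GBP0".
    current_size_kb: current size of the structure (hypothetical units).
    free_kb_on_alternate: storage available in the alternate coupling facility.
    """
    if free_kb_on_alternate >= current_size_kb:
        return Outcome.REBUILT
    if structure in ("SCA", "LOCK1"):
        # Not enough storage for the SCA or lock structure: the rebuild fails.
        return Outcome.REBUILD_FAILS
    # Not enough storage for a group buffer pool: changed pages go to disk.
    return Outcome.PAGES_TO_DISK


for name, size, free in [("SCA", 32_000, 64_000),
                         ("LOCK1", 64_000, 16_000),
                         ("GBP0", 512_000, 128_000)]:
    print(name, "->", rebuild_outcome(name, size, free).value)
```

A check like this is useful at planning time to confirm that the alternate coupling facility has room for every Db2-related structure at its current size, not only at its initial size.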
For more information about recovering from coupling facility failures, see How Db2 recovers from coupling facility failures.