This post is part of the Db2 12 GA feature series, which highlights new features and capabilities introduced by Db2 12 for z/OS at general availability (GA) in October 2016.
By Paul McWilliams and Eric Katayama.
Db2 12 for z/OS introduced the capability for Db2 data sharing members to participate in the recovery of failed peer members. With this peer recovery capability, a Db2 data sharing group can automatically recover retained locks for failed members, without the need for other automation or manual procedures.
When a data sharing member fails, the active locks that Db2 uses to control updates to the data (also called "modify locks") become retained locks. The coupling facility stores information about the retained locks until they are released during restart of the Db2 member that held them. The retained locks protect any data that the failed member was updating from being accessed in an inconsistent manner by the remaining active members of the group. As the active members continue processing the workload at reduced capacity, they cannot access the data that is protected by retained locks held for the failed member. Restarting the failed member, which cleans up the retained locks and frees other held resources, is a manual procedure unless you configure automation in your environment, such as z/OS automatic restart manager (ARM) or another automation solution.
When peer recovery is enabled, one of the assisting members initiates the restart operation for the failed member. The LIGHT(YES) option is used for the restart, along with the subsystem parameter settings that were specified in the parameter module last used by the failed member.
Restart light enables Db2 to restart with a minimal storage footprint and then terminate normally after the locks are released. So after you resolve the problem that caused the original failure, you still need to start the failed member in its original configuration to restore the normal operating capacity of the data sharing group.
The first assisting peer member to obtain the lock for the failed member attempts the restart light operation. If this restart fails, that assisting member does not issue another restart light operation for the same instance of the failure, but another assisting member can then make its own attempt. Each assisting member tries to restart the failed member only one time. If all of the restart attempts fail, the failed member remains failed, and manual intervention is required to restart the member and remove its retained locks.
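The "one attempt per assisting member" rule can be modeled in a few lines. The following Python sketch only illustrates the behavior as described in this post and is not Db2 code: the function name attempt_peer_recovery, the member names, and the try_restart_light callback are all hypothetical, and the order of the list stands in for the order in which members happen to obtain the lock.

```python
from typing import Callable, Iterable, Optional


def attempt_peer_recovery(
    assisting_members: Iterable[str],
    try_restart_light: Callable[[str], bool],
) -> Optional[str]:
    """Model one failure instance of a peer member.

    assisting_members: hypothetical names of assisting members, in the
        order in which they obtain the lock for the failed member.
    try_restart_light: hypothetical callback that returns True if the
        restart light operation issued by that member succeeds.

    Returns the name of the member whose attempt succeeded, or None if
    every attempt failed and manual intervention is needed.
    """
    attempted = set()
    for member in assisting_members:
        # Each assisting member tries only once per failure instance.
        if member in attempted:
            continue
        attempted.add(member)
        if try_restart_light(member):
            return member
    return None


# Hypothetical outcome of each member's single attempt.
outcomes = {"DB2A": False, "DB2B": True, "DB2C": True}
winner = attempt_peer_recovery(
    ["DB2A", "DB2B", "DB2A", "DB2C"], lambda m: outcomes[m]
)
print(winner)
```

In this sketch, a result of None corresponds to the case where the failed member remains failed and its retained locks stay in place until someone restarts it manually.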
The PEER_RECOVERY subsystem parameter value controls the peer recovery capability for each data sharing member. ASSIST means that the member attempts to initiate the peer recovery for other failed members. RECOVER means that other members assist the data sharing member if it fails. You can also specify BOTH, which means that ASSIST and RECOVER apply for that member. You can also specify NONE to disable this feature for specific members.
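To make the interaction between the four values concrete, here is a small Python sketch of how the settings combine across a group. It is an illustrative model only: the PeerRecovery class, the eligible_assistants function, and the member names are hypothetical, and the sketch assumes that every member other than the failed one is active.

```python
from enum import Enum
from typing import Dict, List


class PeerRecovery(Enum):
    """The four PEER_RECOVERY values described above."""

    NONE = "NONE"
    ASSIST = "ASSIST"
    RECOVER = "RECOVER"
    BOTH = "BOTH"

    @property
    def assists_peers(self) -> bool:
        # ASSIST or BOTH: this member tries to recover failed peers.
        return self in (PeerRecovery.ASSIST, PeerRecovery.BOTH)

    @property
    def recovered_by_peers(self) -> bool:
        # RECOVER or BOTH: peers may recover this member if it fails.
        return self in (PeerRecovery.RECOVER, PeerRecovery.BOTH)


def eligible_assistants(settings: Dict[str, str], failed: str) -> List[str]:
    """List the members that would try to recover the failed member.

    settings maps hypothetical member names to PEER_RECOVERY values.
    """
    if not PeerRecovery(settings[failed]).recovered_by_peers:
        return []
    return [
        name
        for name, value in settings.items()
        if name != failed and PeerRecovery(value).assists_peers
    ]


# Hypothetical four-member data sharing group.
group = {"DB2A": "BOTH", "DB2B": "ASSIST", "DB2C": "RECOVER", "DB2D": "NONE"}
print(eligible_assistants(group, "DB2C"))
```

The model shows why the two halves matter separately: a member set to ASSIST helps others but is not itself recovered by peers, while a member set to RECOVER is recovered by peers but never assists.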
If you already have automation configured for restarting failed members in your Db2 data sharing group, it's best to continue using that, and to keep PEER_RECOVERY set to the default value NONE. However, the new peer recovery capability is a good option for cases where such automation is not already configured.
If you do not have other automation configured, you can use BOTH in most cases. However, the ASSIST and RECOVER options give you the flexibility to control this capability for each member, so that you can tailor the configuration to any specific constraints in your data sharing group.