Backout records for in-doubt and long-running units of recovery

A unit of recovery can become in-doubt between the time when the resource managers reply positively to the prepare notification from RRS and the time when RRS begins the commit phase. A unit of recovery can become in-doubt only if one of the interested resource managers has taken on the role of distributed or communication resource manager, and if more than one system is involved. In this case, after a commit request is issued and both systems have responded positively to the prepare request, the following processing occurs:

The unit of recovery on the system that issued the commit request goes into the in-commit state.
The unit of recovery on the other system goes into the in-doubt state and stays there until the syncpoint resource manager receives the commit decision from the system that initiated the commit request.
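
The two state changes described above can be illustrated with a small model. This is only a sketch, not RRS code: the names URState, states_after_positive_prepare, and on_initiator_response are hypothetical, and the sketch assumes that the in-doubt unit of recovery moves to in-commit once the awaited response from the initiating system arrives.

```python
from enum import Enum


class URState(Enum):
    """Only the two states named in the text; real RRS has more."""
    IN_DOUBT = "in-doubt"
    IN_COMMIT = "in-commit"


def states_after_positive_prepare(initiator: str, partner: str) -> dict:
    """Model the point where both systems have voted yes to prepare.

    The system that issued the commit request goes in-commit; the
    other system goes in-doubt because it does not yet know the outcome.
    """
    return {initiator: URState.IN_COMMIT, partner: URState.IN_DOUBT}


def on_initiator_response(states: dict, partner: str) -> dict:
    """Return a new mapping in which the partner has left in-doubt.

    Assumption: the response from the initiator carries a commit
    outcome, so the partner moves to in-commit.
    """
    resolved = dict(states)  # leave the caller's mapping untouched
    if resolved[partner] is URState.IN_DOUBT:
        resolved[partner] = URState.IN_COMMIT
    return resolved


if __name__ == "__main__":
    before = states_after_positive_prepare("SYSA", "SYSB")
    after = on_initiator_response(before, "SYSB")
    print(before["SYSB"].value, "->", after["SYSB"].value)
```

The system names SYSA and SYSB are placeholders. The in-doubt window is exactly the interval between the two calls; a failure of communication in that interval leaves the partner unable to resolve the unit of recovery on its own.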

A unit of recovery is considered long-running if it survives two activity keypoints without a sync point. This can cause the unit of recovery to hold a large number of locks until the next sync point, as well as to write a large number of log records.
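
The long-running rule can be sketched as a counter that is incremented at each activity keypoint and reset at each sync point. The class and method names below are hypothetical and do not correspond to any DFSMStvs interface; only the threshold of two keypoints comes from the text.

```python
from dataclasses import dataclass

# Threshold taken from the text: two activity keypoints without a sync point.
LONG_RUNNING_KEYPOINTS = 2


@dataclass
class UnitOfRecovery:
    """Hypothetical bookkeeping record for one unit of recovery."""
    ur_id: str
    keypoints_survived: int = 0

    def on_activity_keypoint(self) -> None:
        # Called once per activity keypoint while the UR is still active.
        self.keypoints_survived += 1

    def on_sync_point(self) -> None:
        # A sync point ends the current span, so the count starts over.
        self.keypoints_survived = 0

    @property
    def is_long_running(self) -> bool:
        return self.keypoints_survived >= LONG_RUNNING_KEYPOINTS


if __name__ == "__main__":
    ur = UnitOfRecovery("UR0001")
    ur.on_activity_keypoint()
    ur.on_activity_keypoint()
    print(ur.ur_id, "long-running:", ur.is_long_running)
```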

The backout records for in-doubt and long-running units of recovery present space-management problems within undo log streams. Ideally, the backout records in an undo log stream have a short life cycle. A short life cycle enables the obsolete portion of the log stream to be deleted, and it means that the system logger does not need to offload the log data from the coupling facility to DASD data sets. Units of recovery that do not reach end-of-unit-of-recovery status within a short period of time interfere with this space-management algorithm.
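
To see why a single old unit of recovery interferes with space management, consider a deliberately simplified model in which a log stream is an ordered list of records and only the leading records that belong to no active unit of recovery can be deleted. The function trimmable_prefix and the record layout are assumptions made for illustration; they do not describe the actual system logger algorithm.

```python
def trimmable_prefix(log, active_urs):
    """Return how many leading records could be deleted.

    log        -- list of (ur_id, payload) tuples, oldest first
    active_urs -- set of ur_ids that have not reached end of UR

    Deletion stops at the oldest record of any active unit of
    recovery, because that record may still be needed for backout.
    """
    for index, (ur_id, _payload) in enumerate(log):
        if ur_id in active_urs:
            return index
    return len(log)


if __name__ == "__main__":
    log = [("UR1", "a"), ("UR2", "b"), ("UR1", "c"), ("UR3", "d")]
    # Only the newest UR is active: almost the whole log is obsolete.
    print(trimmable_prefix(log, {"UR3"}))
    # The oldest UR is still active: nothing can be deleted.
    print(trimmable_prefix(log, {"UR1"}))
```

In this model, one unit of recovery that wrote an early record and never completes pins every later record in the log, no matter how many other units of recovery have since finished.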

If too much old data is left in the log, either of two conditions can occur when DFSMStvs attempts to write records to a log stream. First, the system logger can return a return and reason code indicating that the coupling facility storage limit was reached. Second, the system logger can return a return and reason code indicating that the staging data set storage limit was reached. In either case, the system logger offloads data to DASD. DFSMStvs cannot write any further information to its log streams until the problem has been resolved.

DFSMStvs uses a secondary log, called a shunt log. This log tracks units of recovery for which DFSMStvs is unable to complete processing, for example, because of an I/O error or the unavailability of a resource such as a volume or a cache.