CRR Functional Flow
This section provides a generalized functional flow of CRR. This information may help a programmer who is responsible for designing code that enables an IBM or non-IBM resource to participate in CRR. For more information about CRR participation, see z/VM: CMS Application Development Guide.
In this z/VM® example, assume an application:
- Updates two protected resources
- Starts a protected conversation with a distributed application
- Issues a commit
Also, assume the protected resources are SFS file pools and the participating resource managers are the SFS file pool servers.
The following describes the CRR functional flow for this application:
- Application accesses the protected resources by means of the SFS resource adapter. The application can also use the Set Transaction Tag CSL routine (DMSSETAG) to set a transaction tag, which can supply useful recovery information about the coordinated transaction if an error occurs. The transaction tag data is written to the CRR logs, and SFS also writes it to the SFS logs. See z/VM: CMS Callable Services Reference for more information about the DMSSETAG CSL routine.
- SFS resource adapter:
  - Registers the protected resources with the SPM that resides in the application's virtual machine. See z/VM: CMS Callable Services Reference and z/VM: CMS Application Development Guide for more information about the resource adapter registration CSL routine (DMSREG) and the other participation interfaces.
  - Obtains information about the CRR recovery server that is on the same system as this application and passes the information to the participating resource managers.
- Participating resource managers determine if they must do an initial exchange of log names (ELN) with the CRR recovery server. For more information on the initial ELN, also known as resynchronization initialization, see z/VM: CMS Application Development Guide.
- Application starts a protected conversation with a distributed application.
- SPM obtains an LUWID, which identifies this coordinated transaction, from the CRR recovery server.
- The protected conversation is registered with the SPM that resides in the application's virtual machine. Also, the protected conversation is registered with the SPM that resides in the partner distributed application (target of the protected conversation).
- Application issues a commit, which is the start of sync point processing that consists of these SPM exits:
  - Precoordination exit–The SPM asks whether the resource adapter of the participating resource manager is ready for sync point processing.
  - Coordination exit–Calls the participating resource manager for sync point processing, which is an implementation of the two-phase commit protocol.
  - Postcoordination exit–Allows the participating resource manager to clean up after sync point processing.
- During the precoordination exit, the participating resource manager is asked if it is ready for sync point processing. If it is not, control is returned to the application to correct the situation and retry the commit.
- At the start of the coordination exit processing, the CRR recovery server, at the SPM's request, writes a CRR log record that describes the participants in this sync point and identifies its LUWID.
The SPM then goes through the coordination exit (two-phase commit of sync point processing) with all protected resources and protected conversations.
SFS resource adapters, during the first phase of the two-phase commit, pass the LUWID and transaction tag to the participating resource managers.
Other participating resource adapters must pass the LUWID to their participating resource managers prior to the end of the first phase of the two-phase commit.
Note: Passing the transaction tag is optional, but very beneficial.
Each participating resource manager writes a resource-specific log record that describes the data to be committed.
If any protected resource or protected conversation cannot complete the first phase of the two-phase commit, all updates to all protected resources in this LUWID are rolled back to their original values.
If all protected resources can commit, the second phase of the two-phase commit tells all the participating resource managers to commit the data.
If there is an error during the two-phase commit, the resynchronization function attempts to complete the coordinated transaction.
If the application had called the Set Sync Point Options CSL routine (DMSSSPTO) to set WAIT=NO, the application makes one attempt at resynchronization. If that attempt fails, the application does not wait for the automatic periodic retry of resynchronization, which runs asynchronously to the application. If the application had set WAIT=YES, the application waits for resynchronization processing to complete.
At the start of resynchronization, the CRR recovery server does an ELN and a compare states with the participating resource manager. This process is called resynchronization recovery. The compare states provides the action (for example, commit or backout) for the participating resource manager to follow to recover the protected resource that failed in the LUWID instance.
For protected conversations, the CRR recovery server does an ELN and compare states with the CRR recovery server at the node that is the target of the protected conversation. The target CRR recovery server then does whatever processing is necessary to recover all protected resources associated with the failed LUWID instance. For more information on the ELN and compare states, also known as resynchronization recovery, see z/VM: CMS Application Development Guide.
After resynchronization completes, all protected resources involved in the failed LUWID should be in a consistent state (that is, all the protected resources are either committed or backed out).
If resynchronization processing cannot complete the LUWID, the LUWID instance enters the automatic periodic retry of resynchronization, which consists of cycles of timed waits and resynchronization attempts.
If recovery is required sooner than the automatic periodic retry of resynchronization can provide it, manual intervention is needed. Manual intervention involves the CRR operator, the participating resource manager operator, or both:
- Determining whether the protected resource should be committed or backed out. This is a heuristic decision that involves using the CRR QUERY LU and CRR QUERY LUWID operator commands and resource-specific commands such as SFS's QUERY PREPARED operator command.
- Forcing the protected resource to be committed or backed out. The force can be done by the CRR operator using the CRR RESYNC command or by the participating resource manager operator using resource-specific commands such as SFS's FORCE PREPARED operator command.
For more information about manual intervention, see Participation in CRR (SFS only) and Problem Management.
- At the completion of the coordination exit, the SPM drives the participating resource adapters through the postcoordination exit to allow the participating resource managers to clean up.
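The sync point steps described above (precoordination, the CRR log record written at the start of the coordination exit, the two phases of the commit, and postcoordination cleanup) can be illustrated with a small Python model. This is a simplified sketch, not a z/VM or CMS interface: the Participant class, its methods, the log tuples, and the sync_point function are all hypothetical stand-ins for the SPM, the CRR log, and the participating resource managers.

```python
"""Hypothetical model of CRR sync point processing (not a z/VM interface)."""
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Participant:
    """Hypothetical stand-in for a protected resource or protected conversation."""
    name: str
    ready: bool = True          # answer given in the precoordination exit
    can_prepare: bool = True    # whether phase one succeeds for this participant
    state: str = "in-flight"
    log: list = field(default_factory=list)  # resource-specific log records

    def precoordinate(self) -> bool:
        return self.ready

    def prepare(self, luwid: str, tag: Optional[str]) -> bool:
        if self.can_prepare:
            # Log record describing the data to be committed, with LUWID and tag.
            self.log.append(("prepared", luwid, tag))
            self.state = "prepared"
        return self.can_prepare

    def commit(self) -> None:
        self.state = "committed"

    def backout(self) -> None:
        self.state = "backed-out"

    def postcoordinate(self) -> None:
        self.log.append(("cleanup",))


def sync_point(participants: List[Participant], luwid: str,
               crr_log: list, tag: Optional[str] = None) -> str:
    # Precoordination exit: if any participant is not ready, control returns
    # to the application so it can correct the situation and retry the commit.
    if not all(p.precoordinate() for p in participants):
        return "retry"

    # Start of the coordination exit: record the participants and the LUWID.
    crr_log.append((luwid, [p.name for p in participants]))

    # Phase one: all participants must prepare (all() stops at the first failure).
    prepared = all(p.prepare(luwid, tag) for p in participants)

    # Phase two: commit everywhere, or back out every update in the LUWID.
    for p in participants:
        if prepared:
            p.commit()
        else:
            p.backout()

    # Postcoordination exit: let each participant clean up.
    for p in participants:
        p.postcoordinate()
    return "committed" if prepared else "backed-out"
```

The model backs out every participant as soon as any one of them fails phase one, mirroring the all-or-nothing rollback described for the LUWID. It does not model failures between the two phases; those are what resynchronization handles.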
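The resynchronization steps described above (one attempt with WAIT=NO, waiting with WAIT=YES, compare states deciding commit or backout, and cycles of timed waits and retries) can be sketched as a small Python model. It rests on stated assumptions and is not CRR's implementation: compare_states simply returns the outcome the coordinator logged for the LUWID, the partner's availability is simulated by a callable, the `wait` flag stands in for the DMSSSPTO WAIT option, the sketch assumes that with WAIT=YES the application blocks through the retry cycles, and the `max_cycles` bound is invented so that the sketch terminates and hands off to manual intervention. All function names are hypothetical.

```python
"""Hypothetical model of CRR resynchronization (not a z/VM interface)."""
import time
from typing import Callable, Dict


def compare_states(logged_outcomes: Dict[str, str], luwid: str) -> str:
    # Assumption: the action is the outcome the coordinator logged for the LUWID.
    return logged_outcomes[luwid]


def resync_attempt(reachable: Callable[[], bool], logged_outcomes: Dict[str, str],
                   luwid: str, participant: dict) -> bool:
    """One ELN plus compare-states exchange; False if the partner is unreachable."""
    if not reachable():
        return False
    action = compare_states(logged_outcomes, luwid)
    participant["state"] = "committed" if action == "commit" else "backed-out"
    return True


def resynchronize(attempt: Callable[[], bool], wait: bool, interval: float = 1.0,
                  max_cycles: int = 3,
                  sleep: Callable[[float], None] = time.sleep) -> str:
    if attempt():
        return "resolved"
    if not wait:
        # WAIT=NO: one attempt only. The asynchronous periodic retry that CRR
        # continues in the background is not modeled here.
        return "pending"
    # WAIT=YES: cycles of timed waits and resynchronization attempts.
    # max_cycles is an invented bound; past it, an operator must decide.
    for _ in range(max_cycles):
        sleep(interval)
        if attempt():
            return "resolved"
    return "needs manual intervention"
```

Passing `sleep` as a parameter keeps the timed waits testable; for example, `sleep=lambda s: None` runs the retry cycles without delay.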