Putting PPRC error recovery procedures into effect

Table 1 identifies how the system processes an error to a volume in a volume pair depending on how the volume and path were established. The options for the CESTPATH CGROUP parameter and the CESTPAIR CRIT parameter both affect the way the recovery system processes the volume pair when a device error occurs.

Table 1. Choosing the level of volume recovery for PPRC
What level of error recovery do you want?	Specify . . .	In the event of a PPRC volume error, the ESS storage subsystem does the following . . .	Some considerations . . .
Total data consistency between primary and secondary volumes.	CRIT(YES) on CESTPAIR command (see Note).	Suspends the volume pair that has the error. Unit checks all write I/O to the primary device. Issues an IEA494I message for the suspended condition.	PPRC stops all I/O to the volume pair, and suspends the pair, on any volume error.
Automated error recovery procedures.	CRIT(YES) on CESTPAIR command, and CGROUP(YES) on CESTPATH command. Note: If a CGROUP FREEZE command is issued to the volume's logical subsystem when the volume is in Extended Long Busy state, the following situation occurs: The volume status is changed to CRIT(NO). Writes are not inhibited (unit checked) when the Extended Long Busy state ends.	Issues an IEA494I message for the Suspending and Extended-Long-Busy state conditions. Suspends the volume pair that has the error. Queues application I/O, for volume pairs added with CESTPATH CGROUP(YES), in cache for two minutes. Issues an IEA494I message for the suspended condition.	After two minutes, queued I/O resumes to the storage subsystem, if possible. Each installation is responsible for the viability of the automated recovery procedures.
A manual error recovery process.	The default for the CESTPAIR command is CRIT(NO), and the default for the CESTPATH command is CGROUP(NO).	Suspends the volume pair that has the error. Does not unit check the primary device. Issues an IEA494I message for the suspended condition.	Each installation is responsible for the local error recovery procedures.
Note: The CRIT(YES) parameter has two modes of operation, depending on how the storage subsystem is configured. CRIT(YES-PATHS) inhibits all write operations when all paths to a secondary volume are unavailable. CRIT(YES-ALL) inhibits writes on any failure, including secondary device failures. For this reason, use CRIT(YES) only if IBM® support personnel specifically recommend it.
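
The rows of Table 1 can be read as a lookup keyed on the two parameters. The following Python sketch is illustrative only. The names RecoveryBehavior, TABLE_1, and recovery_behavior are hypothetical and model the table; they are not an IBM interface. Table 1 does not describe the combination of CRIT(NO) with CGROUP(YES), so the sketch rejects it rather than guessing, and it records "not stated" (None) where a row does not say whether primary writes are unit checked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecoveryBehavior:
    level: str                           # recovery level named in Table 1
    unit_checks_writes: Optional[bool]   # None = not stated in that row of Table 1
    queues_io_seconds: int               # time application I/O is queued (0 = not queued)


# Keys are (CRIT, CGROUP): CRIT is given on CESTPAIR, CGROUP on CESTPATH.
TABLE_1 = {
    (True, False): RecoveryBehavior("total data consistency", True, 0),
    (True, True): RecoveryBehavior("automated error recovery", None, 120),
    (False, False): RecoveryBehavior("manual error recovery", False, 0),
}


def recovery_behavior(crit: bool = False, cgroup: bool = False) -> RecoveryBehavior:
    """Look up a Table 1 row; the defaults mirror CRIT(NO) and CGROUP(NO)."""
    try:
        return TABLE_1[(crit, cgroup)]
    except KeyError:
        raise ValueError("combination not described in Table 1") from None
```

Calling recovery_behavior() with no arguments returns the manual recovery row, which matches the command defaults listed in the table.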

Typically, a PPRC volume pair enters suspended state when the primary volume’s storage control fails to complete a write operation to the secondary volume’s storage control. The failed I/O operation receives channel end, device end, and unit check (with ‘FB’ indicated in sense byte 8), and the MVS host disk ERP issues an IEA49xx message. The NVS in the primary volume’s storage control records the cylinders that change while the volume pair is in suspended state. After you have corrected the conditions that caused the suspension, recopy these cylinders to return both volumes to a synchronized state.
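
The change-recording idea described above can be sketched as a small model. This is a toy illustration, not the storage control's implementation: the ChangeRecorder class and its methods are hypothetical, cylinders are plain integers, and the sketch assumes that changes are recorded only while the pair is suspended.

```python
class ChangeRecorder:
    """Toy model of the changed-cylinder record kept for one primary volume."""

    def __init__(self):
        self.suspended = False
        self.changed = set()  # cylinder numbers updated while suspended

    def suspend(self):
        """Mark the pair as suspended, for example after a failed secondary write."""
        self.suspended = True

    def write(self, cylinder):
        # While the pair is not suspended the update reaches both volumes,
        # so only writes made during suspension need to be remembered.
        if self.suspended:
            self.changed.add(cylinder)

    def resync(self):
        """Return the cylinders to recopy and leave the pair synchronized."""
        to_copy = sorted(self.changed)
        self.changed.clear()
        self.suspended = False
        return to_copy


rec = ChangeRecorder()
rec.write(10)          # pair not suspended: nothing recorded
rec.suspend()
rec.write(42)
rec.write(7)
rec.write(42)          # repeated update to the same cylinder is recorded once
print(rec.resync())    # [7, 42]
```

Only the cylinders touched during the suspension are recopied, which is why a resynchronization is normally much cheaper than copying the whole volume.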

Table 2, Table 3, and Table 4 list general failures, failures detected by the host, and failures detected by the storage subsystem, and how to recover from them.

Table 2. General failure recovery procedures for PPRC
Error condition	Recovery action
Room power failure	If power is lost to the primary storage subsystem, all volumes are placed in suspended state when power is restored. Use the CQUERY command to determine the exact state of all PPRC volume pairs. If power is lost to the secondary storage subsystem, all volumes are placed in suspended state, and the primary volume’s storage control records changes to the volume in its NVS. Issue the CESTPATH command to reestablish paths, and the CESTPAIR command with the RESYNC option to resynchronize a suspended PPRC volume pair and return the volumes to full duplex state. Because room power failures do not power off all devices at exactly the same time, the RESYNC option is necessary in order to copy all updates that may have been made to the primary volumes but not yet secured on the recovery site subsystem. The primary volume’s subsystem maintains an NVS bitmap of all changed cylinders. This is important when power is restored to the primary hosts and the primary volume’s subsystem before it is restored to the secondary volume’s subsystem. The CESTPAIR command with the RESYNC option resynchronizes the volume pairs when power is restored to the secondary volume’s subsystem.

Table 3. Host failure recovery procedures for PPRC
Error condition	Recovery action
Application host failure	Re-IPL the host. If power has not been lost to the subsystems, the peer-to-peer remote copy pairs are still operational and in duplex state. Issue the CQUERY command to verify the PPRC volume state for each PPRC volume pair. If the volume pair is in duplex state, no further action is required. If the volume pair is in suspended state, issue the CESTPAIR command with the RESYNC option to recover the PPRC volume pair.
Last channel path failure	Host connectivity to the subsystem is lost. PPRC pair operation continues, assuming that the channel path error does not affect the PPRC link. When host connectivity is restored, issue the CQUERY command to verify the PPRC volume state for each PPRC volume pair. If the volume pair is in duplex state, no further action is required. If the volume pair is in suspended state, issue the CESTPATH command, and the CESTPAIR command with the RESYNC option to recover the PPRC volume pair.
Last PPRC path failure	When the last path to the recovery site storage control is lost, the PPRC pair is placed into suspended state. When the paths are repaired or reconfigured, the primary volume’s storage control automatically attempts to reestablish the paths. Issue the CESTPAIR command with the RESYNC option to return the pair to duplex state.
Device varied offline	Depending on how the device is varied offline, host connectivity to the subsystem may or may not be lost. Application connectivity to the subsystem is lost, but assuming that the device has not been varied offline in “force” or “boxed” modes, the PPRC pairs continue in operation. When the device is varied back online, verify the PPRC volume status with the CQUERY command. If the volume pair is in duplex state, no further action is required. If the volume pair is in suspended state, issue the CESTPAIR command with the RESYNC option to recover the PPRC volume pair.
Missing interrupt	A missing interrupt generally occurs when a host issues an I/O operation to a subsystem, and the operation does not complete within a specific amount of time. Analyze and correct the missing interrupt condition, using information contained in the message that the host sent to the operator console. Depending on the cause of the missing interrupt, the PPRC pair operation may or may not continue. When you have corrected the missing interrupt condition, issue the CQUERY command to verify the state of the affected PPRC volume pair. If the volume pair is in duplex state, no further action is required. If the volume pair is in suspended state, issue the CESTPAIR command with the RESYNC option to recover the PPRC volume pair.
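
Most rows of Table 3 share one decision: query the pair state, then either do nothing or resynchronize. The following Python sketch captures that pattern under stated assumptions. The function name host_failure_recovery, the state labels "duplex" and "suspended", and the reestablish_paths flag are hypothetical; the returned strings describe the steps in words and are not literal command syntax or CQUERY output.

```python
from typing import List


def host_failure_recovery(state: str, reestablish_paths: bool = False) -> List[str]:
    """Return the recovery steps for one volume pair after a host-detected failure.

    state             -- pair state as determined with CQUERY (hypothetical labels)
    reestablish_paths -- True for cases such as a last channel path failure,
                         where Table 3 lists CESTPATH before CESTPAIR
    """
    state = state.lower()
    if state == "duplex":
        return []  # no further action is required
    if state == "suspended":
        steps = ["CESTPAIR with the RESYNC option"]
        if reestablish_paths:
            steps.insert(0, "CESTPATH")
        return steps
    raise ValueError("state not covered by this sketch: " + state)
```

For example, host_failure_recovery("suspended", reestablish_paths=True) lists the path step first and the resynchronization second, matching the order given for a last channel path failure.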

Table 4. Storage subsystem failure recovery procedures for PPRC
Error condition	Recovery action
Equipment check	A number of errors fall into this category. The error may or may not be detected on the actual I/O operation that caused the failure. Generally, an equipment check that suspends a PPRC volume pair is reported as follows: The subsequent I/O command is unit checked (with ‘FB’ sense) with indicators set for equipment check, permanent error, and environmental data present. The disk ERP issues the appropriate IEA49xx error message. The PPRC volume pair is placed into suspended state. For the primary PPRC volume’s storage control: Change recording will not continue for the primary volume. When you have corrected the error condition on the primary volume’s storage control, issue the CESTPAIR command with the COPY or RESYNC option to recover the PPRC volume pair. See the note at the end of this table for more information on CESTPAIR command options. For the secondary PPRC volume’s storage control: Change recording will continue for the primary volume. When you have corrected the error condition on the secondary volume’s storage control, issue the CESTPAIR command with the RESYNC option to recover the PPRC volume pair, or recopy the entire volume if its data has been corrupted by the recovery process.
Storage control failure	A number of errors fall into this category. The error may or may not be detected on the actual I/O operation that caused the failure. Most storage control failures are reported as follows: The I/O command that encountered the error is unit checked (with ‘FB’ sense) with indicators set for equipment check, permanent error, and environmental data present. The disk ERP issues an IEA49xx error message. The PPRC volume pair is placed into suspended state. For the primary PPRC volume’s storage control: Change recording will not continue for the primary volume. When you have corrected the error condition on the primary volume’s storage control, issue the CESTPAIR command with the COPY or RESYNC option to recover the PPRC volume pair. See the note at the end of this table for more information on CESTPAIR command options. For the secondary PPRC volume’s storage control: Change recording will continue for the primary volume. When you have corrected the error condition on the secondary volume’s storage control, issue the CESTPAIR command with the RESYNC option to recover the PPRC volume pair.
Invalid track format	An invalid track format error is reported as follows: Returned sense indicates an equipment check, permanent error, with environmental data present. The disk ERP issues a “Volume Suspended” error message (the same action that occurs with dual copy). For the primary PPRC volume’s storage control: Change recording will not continue for the primary volume. When you have corrected the error condition on the primary volume’s storage control, issue the CESTPAIR command with the COPY or RESYNC option to recover the PPRC volume pair. See the note at the end of this table for more information on CESTPAIR command options. For the secondary PPRC volume’s storage control: Change recording will continue for the primary volume. When you have corrected the error condition on the secondary volume’s storage control, issue the CESTPAIR command with the RESYNC option to recover the PPRC volume pair.
Permanent data check	For a permanent data check on a primary PPRC volume, the PPRC volume pair status is not changed as a result of this specific error. The following actions occur: A message is sent to the operator console. The application program receives channel end, device end, and unit check in response to its I/O operation. The disk ERP takes the appropriate recovery actions, including issuing the operator message and logging the error. The application program performs its recovery actions. During a destage operation, a permanent data check on a track in the primary volume of a PPRC volume pair causes the primary volume’s storage control to pin that track in NVS. A subsequent I/O command is unit checked with ‘FB’ sense. The disk ERP issues the appropriate IEA49xx error message, and the PPRC volume pair is placed into suspended state. Change recording continues for the primary volume. When the pinned data condition is corrected on the primary volume’s storage control, issue the CESTPAIR command with the RESYNC option to recover the PPRC volume pair. A permanent data check on a track in a secondary volume of a PPRC volume pair causes the secondary volume’s storage control to pin that track in NVS. A subsequent I/O command is unit checked with ‘FB’ sense. The disk ERP issues the appropriate IEA49xx error message, and the PPRC volume pair is placed into suspended state. Change recording continues for the primary volume. When the pinned data condition is corrected on the secondary volume’s storage control, issue the CESTPAIR command with the RESYNC option to recover the PPRC volume pair.
Intervention required	For the primary PPRC volume, this normal system operating condition is handled the same as it is for simplex devices. The host issues an “intervention required” message for the device, then takes the appropriate actions based on the operator’s reply. No data is changed on the primary volume, and the PPRC volume pair state remains unchanged. When the intervention condition is cleared, normal operations resume. For the secondary PPRC volume: The subsequent I/O command is unit checked (with ‘FB’ sense) with equipment check, permanent error, and environmental data present indicators set. The disk ERP issues an IEA49xx error message indicating “Intervention Required” for the secondary device. The PPRC volume pair is placed into suspended state. Change recording will continue for the primary volume. When you have corrected the error condition on the secondary volume, issue the CESTPAIR command with the RESYNC option to recover the PPRC volume pair.
Cache or NVS failure, or cache reinitialization	PPRC implements write I/O operations to both primary and secondary volumes. The cache and NVS in each respective storage control hold the write updates until the updates are destaged. An NVS or cache error, or the reinitialization of cache, puts the PPRC pair into suspended or duplex state. Change recording continues. For a cache or NVS failure or cache reinitialization that affects the storage control at either end, issue the CESTPAIR command with the RESYNC option to the suspended volume after the cache and NVS have been made available. This action reestablishes the volume pair and copies any cylinders that were modified while the pair was suspended. The storage control automatically copies all cylinders from the primary volume to the secondary volume whenever an NVS failure affects the primary volume’s storage control. This is necessary so that the primary volume’s storage control can maintain the changed-cylinder map within its NVS.
Note: Generally, repairs to volumes are done using ICKDSF. The volume pair must be reinitialized if corrective actions change the data on a track so that the primary and secondary volume tracks are no longer identical. That track can be restored from a backup volume if the error causes data on a specific track to be overwritten. Select the appropriate CESTPAIR copy mode based on the completed repair action and the status of the data on the primary and secondary volumes.
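
The choice between the COPY and RESYNC options discussed in Table 4 and the note can be sketched as a simple rule. This is a conservative illustration, not IBM guidance: the function cestpair_mode and both of its parameters are hypothetical, and the rule assumes that a full copy is the safe choice whenever the changed-cylinder record cannot be relied on or a repair has altered track data.

```python
def cestpair_mode(change_recording_continued: bool, repair_altered_data: bool) -> str:
    """Suggest a CESTPAIR copy mode for recovering a suspended pair.

    change_recording_continued -- True if the primary volume's storage control kept
                                  recording changed cylinders during the suspension
    repair_altered_data        -- True if corrective actions changed track data so that
                                  primary and secondary tracks are no longer identical
    """
    if repair_altered_data:
        return "COPY"    # the pair must be reinitialized with a full copy
    if not change_recording_continued:
        return "COPY"    # assumption: without a reliable record, recopy everything
    return "RESYNC"      # only the recorded cylinders need to be recopied


print(cestpair_mode(change_recording_continued=True, repair_altered_data=False))  # RESYNC
```

The rule deliberately favors COPY in doubtful cases; an installation that knows the changed-cylinder record is still valid may prefer RESYNC where Table 4 permits either option.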