Table 1 shows how the system handles an error on a volume in a volume pair, depending on how the volume and path were established. The CGROUP parameter of the CESTPATH command and the CRIT parameter of the CESTPAIR command both affect the way the recovery system processes the volume pair when a device error occurs.
What level of error recovery do you want? | Specify . . . | In the event of a PPRC volume error, the ESS storage subsystem does the following . . . | Some considerations . . . |
---|---|---|---|
Total data consistency between primary and secondary volumes. | CRIT(YES) on CESTPAIR command (see Note). |
|
PPRC stops all I/O to the volume pair, and suspends the pair, on any volume error. |
Automated error recovery procedures. | CRIT(NO) on CESTPAIR command, and CGROUP(YES) on CESTPATH command. Note:
If a CGROUP FREEZE command is issued to the volume's logical subsystem when the
volume is in Extended Long Busy state, the following situation occurs:
|
|
After two minutes, queued I/O resumes to the storage subsystem, if possible. Each installation is responsible for the viability of the automated recovery procedures. |
A manual error recovery process. | The default for the CESTPAIR command is CRIT(NO), and the default for the CESTPATH command is CGROUP(NO). |
|
Each installation is responsible for the local error recovery procedures. |
Note: The CRIT(YES) parameter has two modes of operation, depending on how the storage subsystem is
configured. CRIT(YES-PATHS) inhibits all write operations when all paths to a secondary volume are
unavailable. CRIT(YES-ALL) inhibits writes on any failure, including secondary device failures.
For this reason, use CRIT(YES) only if specifically recommended by IBM® support personnel.
|
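The two CRIT(YES) modes described in the note can be sketched as a small decision function. This is only an illustrative model, not storage subsystem code: the names `CritMode` and `writes_inhibited` are hypothetical, and the assumption that CRIT(NO) never inhibits writes is inferred from the table rather than stated in the note.

```python
from enum import Enum


class CritMode(Enum):
    """Hypothetical labels for the CRIT settings discussed in the note."""
    NO = "CRIT(NO)"
    YES_PATHS = "CRIT(YES-PATHS)"
    YES_ALL = "CRIT(YES-ALL)"


def writes_inhibited(mode: CritMode, all_paths_lost: bool, other_failure: bool) -> bool:
    """Return True if writes to the primary would be inhibited in this toy model.

    all_paths_lost: every path to the secondary volume is unavailable.
    other_failure:  any other failure, such as a secondary device failure.
    """
    if mode is CritMode.YES_ALL:
        # YES-ALL inhibits writes on any failure.
        return all_paths_lost or other_failure
    if mode is CritMode.YES_PATHS:
        # YES-PATHS inhibits writes only when all paths are unavailable.
        return all_paths_lost
    # Assumption: CRIT(NO) does not inhibit writes.
    return False
```

For example, a secondary device failure with paths still available inhibits writes under `CritMode.YES_ALL` but not under `CritMode.YES_PATHS`.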
Typically, a PPRC volume pair enters suspended state when the primary volume’s storage control fails to complete a write operation to the secondary volume’s storage control. The failed I/O operation receives channel end, device end, and unit check (with ‘FB’ indicated in sense byte 8), and the MVS host disk ERP issues an IEA49xx message. The NVS in the primary volume’s storage control records cylinders that have changed while the volume pair is in suspended or duplex state. After you have corrected the conditions that caused the suspension, recopy these cylinders in order to return both volumes to a synchronized state.
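The change-recording idea in the preceding paragraph can be illustrated with a toy model. The class below is a hypothetical sketch, not the storage control's implementation: `ChangeRecorder` and its methods are invented names, and for simplicity it records changed cylinders only while the pair is suspended.

```python
from typing import List


class ChangeRecorder:
    """Toy model of the changed-cylinder record kept in the primary's NVS."""

    def __init__(self, cylinders: int) -> None:
        self.cylinders = cylinders
        self.changed = set()      # cylinder numbers written while suspended
        self.suspended = False

    def suspend(self) -> None:
        """Model the pair entering suspended state."""
        self.suspended = True

    def write(self, cylinder: int) -> None:
        """Model a host write; remember the cylinder if the pair is suspended."""
        if not 0 <= cylinder < self.cylinders:
            raise ValueError("cylinder out of range: %d" % cylinder)
        if self.suspended:
            self.changed.add(cylinder)

    def resync(self) -> List[int]:
        """Return the cylinders to recopy, then clear the record."""
        to_copy = sorted(self.changed)
        self.changed.clear()
        self.suspended = False
        return to_copy
```

In this model, a resynchronization copies only the cylinders written during the suspension instead of the whole volume, which is the purpose of keeping the record.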
Table 2, Table 3, and Table 4 list general failures, failures detected by the host, and failures detected by the storage subsystem, and how to recover from them.
Error condition | Recovery action |
---|---|
Room power failure | If power is lost to the:
Issue the CESTPATH command to reestablish paths, and the CESTPAIR command with the RESYNC option to resynchronize a suspended or duplex PPRC volume pair and return the volumes to full duplex state. Because a room power failure does not power off all devices at exactly the same time, the RESYNC option is necessary to copy any updates that were made to the primary volumes but not yet secured on the recovery site subsystem. The primary volume’s subsystem maintains an NVS bitmap of all changed cylinders. This is important where power is restored to the primary hosts and the primary volume’s subsystem before it is restored to the secondary volume’s subsystem. The CESTPAIR command with the RESYNC option resynchronizes the volume pairs when power is restored to the secondary volume’s subsystem. |
Error condition | Recovery action |
---|---|
Application host failure |
|
Last channel path failure | Host connectivity to the subsystem is lost. PPRC pair operation continues,
assuming that the channel path error does not affect the PPRC link.
|
Last PPRC path failure | When the last path to the recovery site storage control is lost, the PPRC pair is placed into suspended or duplex state. When the paths are repaired or reconfigured, the primary volume’s storage control automatically attempts to reestablish the paths. Issue the CESTPAIR command with the RESYNC option to return the pair to duplex state. |
Device varied offline | Depending on how the device is varied offline, host connectivity to the
subsystem may or may not be lost. Application connectivity to the subsystem is lost, but assuming
that the device has not been varied offline in “force” or “boxed” modes, the PPRC pairs continue in
operation.
|
Missing interrupt | A missing interrupt generally occurs when a host issues an I/O operation to a
subsystem, and the operation does not complete within a specific amount of time.
|
Error condition | Recovery action |
---|---|
Equipment check | A number of errors fall into this category. The error may or may
not be detected on the actual I/O operation that caused the failure. Generally, an equipment check
that suspends a PPRC volume pair is reported as follows:
For the primary PPRC volume’s storage control: Change recording will not continue for the primary volume. When you have corrected the error condition on the primary volume’s storage control, issue the CESTPAIR command with the COPY or RESYNC option to recover the PPRC volume pair. See the note at the end of this table for more information on CESTPAIR command options. For the secondary PPRC volume’s storage control: Change recording will continue for the primary volume. When you have corrected the error condition on the secondary volume’s storage control, issue the CESTPAIR command with the RESYNC option to recover the PPRC volume pair, or recopy the entire volume if its data has been corrupted by the recovery process. |
Storage control failure | A number of errors fall into this category. The error may or may
not be detected on the actual I/O operation that has caused the failure. Most storage control
failures are reported as follows:
For the primary PPRC volume’s storage control: Change recording will not continue for the primary volume. When you have corrected the error condition on the primary volume’s storage control, issue the CESTPAIR command with the COPY or RESYNC option to recover the PPRC volume pair. See the note at the end of this table for more information on CESTPAIR command options. For the secondary PPRC volume’s storage control: Change recording will continue for the primary volume. When you have corrected the error condition on the secondary volume’s storage control, issue the CESTPAIR command with the RESYNC option to recover the PPRC volume pair. |
Invalid track format | An invalid track format error is reported as follows:
For the primary PPRC volume’s storage control: Change recording will not continue for the primary volume. When you have corrected the error condition on the primary volume’s storage control, issue the CESTPAIR command with the COPY or RESYNC option to recover the PPRC volume pair. See the note at the end of this table for more information on CESTPAIR command options. For the secondary PPRC volume’s storage control: Change recording will continue for the primary volume. When you have corrected the error condition on the secondary volume’s storage control, issue the CESTPAIR command with the RESYNC option to recover the PPRC volume pair. |
Permanent data check | For a permanent data check on a primary PPRC volume, the PPRC
volume pair status is not changed as a result of this specific error. The following actions occur:
During a destage operation, a permanent data check on a track in the primary volume of a PPRC volume pair causes the primary volume’s storage control to pin that track in NVS. A subsequent I/O command is unit checked with ‘FB’ sense. The disk ERP issues the appropriate IEA49xx error message, and the PPRC volume pair is placed into suspended or duplex state. Change recording continues for the primary volume. When the pinned data condition is corrected on the primary volume’s storage control, issue the CESTPAIR command with the RESYNC option to recover the PPRC volume pair. A permanent data check on a track in a secondary volume of a PPRC volume pair causes the secondary volume’s storage control to pin that track in NVS. A subsequent I/O command is unit checked with ‘FB’ sense. The disk ERP issues the appropriate IEA49xx error message, and the PPRC volume pair is placed into suspended or duplex state. Change recording continues for the primary volume. When the pinned data condition is corrected on the secondary volume’s storage control, issue the CESTPAIR command with the RESYNC option to recover the PPRC volume pair. |
Intervention required | For the primary PPRC volume, this normal system operating condition
is handled the same as it is for simplex devices. The host issues an “intervention required” message
for the device, then takes the appropriate actions based on the operator’s reply. No data is changed
on the primary volume, and the PPRC volume pair state remains unchanged. When the intervention
condition is cleared, normal operations resume. For the secondary PPRC volume:
Change recording will continue for the primary volume. When you have corrected the error condition on the secondary volume, issue the CESTPAIR command with the RESYNC option to recover the PPRC volume pair. |
Cache or NVS failure, or cache reinitialization | PPRC implements write I/O operations to both primary and secondary
volumes. The cache and NVS in each respective storage control hold the write updates until the
updates are destaged. An NVS or cache error, or the reinitialization of cache, puts the PPRC pair
into suspended or duplex state. Change recording continues. For a cache or NVS failure or cache reinitialization that affects the storage control at either end, issue the CESTPAIR command with the RESYNC option to the suspended volume after the cache and NVS have been made available. This action reestablishes the volume pair and copies any cylinders that were modified while the pair was suspended. The storage control automatically copies all cylinders from the primary volume to the secondary volume whenever an NVS failure affects the primary volume’s storage control. This is necessary because the primary volume’s storage control maintains the changed-cylinder map within its NVS. |
Note: Generally, repairs to volumes are done using ICKDSF. The volume pair must be reinitialized if
corrective actions change the data on a track so that the primary and secondary volume tracks are no
longer identical. That track can be restored from a backup volume if the error causes data on a
specific track to be overwritten. Select the appropriate CESTPAIR copy mode based on the completed
repair action and the status of the data on the primary and secondary volumes.
|
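As a quick reference, part of the preceding table can be condensed into a lookup from error condition and affected side to the CESTPAIR copy options the table names. This is a hypothetical helper for illustration only: `RECOVERY_OPTIONS` and `cestpair_options` are invented names, the cache and NVS row is omitted, and the final choice of copy mode still depends on the completed repair action, as the note explains.

```python
from typing import Dict, Tuple

# (error condition, side that reported it) -> CESTPAIR copy options from the table.
# An empty tuple means the pair state is unchanged and no CESTPAIR is called for.
RECOVERY_OPTIONS: Dict[Tuple[str, str], Tuple[str, ...]] = {
    ("equipment check", "primary"): ("COPY", "RESYNC"),
    ("equipment check", "secondary"): ("RESYNC",),
    ("storage control failure", "primary"): ("COPY", "RESYNC"),
    ("storage control failure", "secondary"): ("RESYNC",),
    ("invalid track format", "primary"): ("COPY", "RESYNC"),
    ("invalid track format", "secondary"): ("RESYNC",),
    ("permanent data check", "primary"): ("RESYNC",),
    ("permanent data check", "secondary"): ("RESYNC",),
    ("intervention required", "primary"): (),
    ("intervention required", "secondary"): ("RESYNC",),
}


def cestpair_options(condition: str, side: str) -> Tuple[str, ...]:
    """Look up the copy options the table lists for a condition and side."""
    key = (condition.strip().lower(), side.strip().lower())
    if key not in RECOVERY_OPTIONS:
        raise KeyError("no table entry for %r" % (key,))
    return RECOVERY_OPTIONS[key]
```

For instance, `cestpair_options("Equipment check", "primary")` yields both COPY and RESYNC because change recording does not continue when the primary volume's storage control is affected, so a full copy may be needed.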