How Mirroring Handles DASD Problems

Troubleshooting

Problem

This document provides a general overview of how mirroring handles DASD problems.

Resolving The Problem

Mirroring keeps the system available when a DASD-related error occurs. IBM i uses a scatter-loading technique across all disks that it believes are available (that is, configured). As a result, if the system loses a drive, it stops, because vital code might reside on that drive, in whole or in part. Mirroring provides each drive in an ASP with a hot backup so that, if a drive is determined to be bad, the system can continue to operate. The following outlines how those determinations are made:

1.	For an unrecoverable device error (the entire unit fails):
	a. The system disables the failing unit. Mirroring is suspended for this mirrored pair. If the other unit of the pair is already suspended, the system halts.
	b. The system continues operation with the other unit of the mirrored pair.
	c. The system sends messages to QSYSOPR and QHST indicating what has happened.
	d. The CE replaces or repairs the failing unit using concurrent maintenance; that is, while the system continues to run.
	e. The CE resumes mirroring on the pair.
	f. The system synchronizes the repaired unit with its pair; that is, it writes any changes made since the suspension to the repaired drive.
	g. The system sends messages to QSYSOPR and QHST indicating that mirroring has been resumed.
2.	For a permanent read error (failed to read correct data from disk):
	a. The system reads from the other unit of the pair. If the permanent read error also occurs on the other unit of the mirrored pair, the error is not recovered, and the original read request completes with a permanent read error.
	b. If the read from the other unit is successful, the system writes the data from the second disk to an alternate sector on the first disk and marks the original sector as bad. If the write back fails, the first unit is determined to have an unrecoverable error, and the situation is handled as described in Step 1.
	c. The system attempts the read again. If it is successful, the error has been recovered. The error logs or the Product Activity Log (PAL) contain entries indicating what happened.
3.	For a nonoperational device (power loss, not ready): The system stops operation with SRC A6XX0244 or SRC A6XX0266 (XX can be any value) and attempts recovery. If the unit can be recovered, the system resumes normal operation with no other intervention. A time limit determines how long the system waits for the unit to recover, but there is no definitive guide as to how long recovery could take. Therefore, any call that reports these SRCs should be referred immediately to the hardware queue. If the time limit is exceeded, the scenario described in Step 1 applies.
4.	For a connection failure on a device (device time out):
	a. The system attempts connection recovery. Any job with I/O to that unit waits during the connection recovery. If the connection recovery is successful, normal system operation continues with mirroring protection, and no suspend or resume is required.
	b. If connection recovery fails, the unit is considered to have an unrecoverable error, and the procedure outlined in Step 1 applies.
5.	For an IOP or bus failure:
	a. The system determines whether all DASD units attached to the failing IOP or bus still have active mirrored units on different buses or IOPs. If not, the system fails with an SRC.
	b. The system disables each DASD unit attached to the failing IOP or bus as outlined in Step 1.
	c. The system dumps the failing IOP so that the problem can be diagnosed, and it continues normal operation unless problems are detected on the remaining disks.
6.	For a load source DASD failure during IPL and before Storage Management Recovery: The system determines whether there is a mirrored load source unit that has not failed. If not, the system stops. If so, the system continues the IPL by issuing a programmed IPL to the other load source unit.
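
The decisions in Steps 1 and 2 can be sketched as a small model. The following Python example is an illustration only and is not IBM i code: the names Unit, MirroredPair, Outcome, handle_unit_failure, and handle_read_error are hypothetical, and the sketch assumes that both units of the pair are active when a read error is reported.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    CONTINUES = "continues on the other unit of the pair"
    HALTS = "system halts"
    RECOVERED = "read recovered from the mirror"
    READ_ERROR = "permanent read error returned to the requester"


@dataclass
class Unit:
    # Hypothetical model of one disk unit; not an IBM i structure.
    name: str
    suspended: bool = False
    unreadable: set = field(default_factory=set)  # sectors with permanent read errors
    remapped: set = field(default_factory=set)    # sectors rewritten to an alternate sector
    write_ok: bool = True                         # False simulates a failed write back


@dataclass
class MirroredPair:
    first: Unit
    second: Unit

    def partner(self, unit: Unit) -> Unit:
        """Return the other unit of the mirrored pair."""
        return self.second if unit is self.first else self.first


def handle_unit_failure(pair: MirroredPair, failing: Unit) -> Outcome:
    """Step 1: suspend the failing unit, or halt if its mirror is already suspended."""
    if pair.partner(failing).suspended:
        return Outcome.HALTS
    failing.suspended = True
    return Outcome.CONTINUES


def handle_read_error(pair: MirroredPair, failing: Unit, sector: int) -> Outcome:
    """Step 2: read from the mirror, then write the data back to an alternate sector."""
    other = pair.partner(failing)
    if sector in other.unreadable:
        # Both copies are unreadable, so the error is not recovered.
        return Outcome.READ_ERROR
    if not failing.write_ok:
        # The write back failed, so treat the unit as unrecoverable (Step 1).
        return handle_unit_failure(pair, failing)
    failing.unreadable.discard(sector)
    failing.remapped.add(sector)
    return Outcome.RECOVERED


pair = MirroredPair(Unit("unit 1"), Unit("unit 1 mirror"))
print(handle_unit_failure(pair, pair.first).value)   # first failure of the pair
print(handle_unit_failure(pair, pair.second).value)  # second failure of the same pair
```

The sketch deliberately routes a failed write back through the same function as a whole-unit failure, mirroring the way Steps 2 through 5 fall back to the procedure in Step 1. Message logging, the recovery time limit, and resynchronization are left out.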

Historical Number

7873047

Document Information

Modified date:
01 April 2025

UID

nas8N1010271
