Recovery for channel path errors

When you define your I/O configuration, many devices share common hardware components (such as channels, channel cards, switches, control unit ports, control unit adapter cards, and fiber-optic links). For example, all devices for a specific control unit definition share the same hardware components since they share the same channels and control unit ports. Therefore, when a hardware-related error occurs on a channel path, multiple devices are affected.

When an error occurs on a channel path, the system performs path recovery, which consists of issuing one or more recovery-related I/Os to test whether the channel path is still usable. If path recovery determines that the channel path is no longer usable, the path is removed (varied offline) from the affected device. Otherwise, the channel path remains online to the device.

Path recovery is typically performed one device at a time. This means that when an error occurs on one device, only that device is processed. Errors on other devices are processed independently, even if they share common hardware components. This may affect application performance since the application is delayed while the system performs path recovery and then retries the original I/O request. If the application uses multiple devices that share a failing or malfunctioning hardware component, additional errors are encountered and further delays occur.

Additionally, certain types of path errors can be intermittent. That is, an error occurs, but path recovery is successful, so the path is not removed from the device. This also affects performance because applications may encounter errors multiple times. If this occurs, you may need to manually remove the bad path or paths from the affected devices to stop the errors from occurring.

The PATH_SCOPE option, together with the PATH_THRESHOLD and PATH_INTERVAL options, can be specified on the RECOVERY statement in the IECIOSxx parmlib member and on the SETIOS RECOVERY command. These options allow you to reduce the elapsed time it takes for the system to recover from channel path-related errors, and help prevent the system performance problems that can occur when a significant amount of time is spent in repetitive channel path error recovery. For more information on the syntax of the RECOVERY statement in IECIOSxx, see z/OS MVS Initialization and Tuning Reference. For more information on the syntax of the SETIOS RECOVERY command, see SETIOS command.

Specify a PATH_SCOPE of either CU or DEVICE to enable path recovery either for all devices attached to the control unit (CU) or on a device-by-device basis (DEVICE). The default is PATH_SCOPE=DEVICE.

If PATH_SCOPE=DEVICE is specified, then path recovery is on a device-by-device basis and no monitoring of intermittent errors is performed. The PATH_INTERVAL and PATH_THRESHOLD keywords may not be specified with PATH_SCOPE=DEVICE.
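The rules above (DEVICE is the default, and PATH_INTERVAL and PATH_THRESHOLD are not accepted with PATH_SCOPE=DEVICE) can be sketched as a small check. This is an illustrative Python model only, not system code; the function name validate_recovery_options and its return format are hypothetical.

```python
def validate_recovery_options(path_scope="DEVICE", path_interval=None, path_threshold=None):
    """Check a combination of RECOVERY path options against the rules in the text.

    Hypothetical helper for illustration; it does not parse IECIOSxx syntax.
    """
    scope = path_scope.upper()
    if scope not in ("CU", "DEVICE"):
        raise ValueError("PATH_SCOPE must be CU or DEVICE")
    # PATH_INTERVAL and PATH_THRESHOLD may not be specified with PATH_SCOPE=DEVICE.
    if scope == "DEVICE" and (path_interval is not None or path_threshold is not None):
        raise ValueError("PATH_INTERVAL and PATH_THRESHOLD require PATH_SCOPE=CU")
    return {
        "PATH_SCOPE": scope,
        "PATH_INTERVAL": path_interval,
        "PATH_THRESHOLD": path_threshold,
    }
```

Calling validate_recovery_options() with no arguments returns the default scope, DEVICE, while passing an interval or threshold together with DEVICE raises an error.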

If PATH_SCOPE=CU is specified and path recovery determines that a channel path needs to be removed from a device, the path is removed from all devices defined to the control unit. Additionally, for intermittent channel path errors, the system collects error statistics over a period of time, and if the number of errors reaches or exceeds a threshold value, the channel path is removed from all devices defined to the control unit. The time period and threshold are controlled by the PATH_INTERVAL and PATH_THRESHOLD parameters. For example, specifying a PATH_INTERVAL of 10 (minutes) and a PATH_THRESHOLD of 20 (errors per minute) means that at least 20 errors must occur every minute for 10 consecutive minutes before the path from the failing hardware component is removed from all the affected devices.
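The example above can be modeled with a short Python sketch. It assumes the rule exactly as worded in the example: the threshold must be met in every minute of a run of consecutive minutes as long as the interval. The function name and the per-minute list of error counts are hypothetical, and the system's actual accounting may differ.

```python
def should_remove_path(errors_per_minute, path_interval, path_threshold):
    """Return True once `path_interval` consecutive minutes each had
    at least `path_threshold` errors.

    errors_per_minute: error counts for one channel path, one entry per minute.
    """
    consecutive = 0
    for count in errors_per_minute:
        if count >= path_threshold:
            consecutive += 1
            if consecutive >= path_interval:
                return True
        else:
            # A minute below the threshold breaks the run.
            consecutive = 0
    return False
```

With PATH_INTERVAL=10 and PATH_THRESHOLD=20, ten minutes of 20 errors each satisfy the condition, but a single minute with 19 errors in the middle of the run restarts the count.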
Note: Do not set the PATH_INTERVAL and PATH_THRESHOLD values very low (for example, setting the PATH_INTERVAL to 1 minute or setting the PATH_THRESHOLD to 1 error) because this may interfere with normal system recovery and cause the system to remove channel paths unnecessarily. When a channel path becomes not operational, the system takes the path offline to the affected devices. Later, when the path becomes operational, the channel subsystem notifies the system so that it can bring the path back online. If there are I/Os that are active when the channel path becomes not operational, these I/Os are terminated with an interface control check. If PATH_SCOPE=CU is specified, these interface control checks are counted towards the PATH_THRESHOLD value and, if the PATH_THRESHOLD and PATH_INTERVAL values are too small, may cause the system to internally remove the path from all devices on the control unit. In that case, when the channel subsystem later notifies the system that the channel path is available again, the system does not automatically bring the channel path online; the channel path must be brought online manually.

When PATH_SCOPE=CU is specified and the system internally varies the path offline to all devices on a control unit, the system does not remove the last path to a device if the device is online, allocated, reserved, or in use by a system component. However, if the path becomes not operational because of a link threshold condition, then the last path is taken offline. A link threshold condition, also known as a flapping links condition, occurs when a channel path transitions between not operational and operational multiple times within a short period of time. This is usually a sign of some type of hardware problem. These transitions cause the system to perform path-related recovery, which delays applications until the recovery completes. If the channel path transitions too many times within a short period of time, the channel subsystem keeps the channel path offline to prevent further path recovery.
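The last-path rule described above can be illustrated with a minimal Python sketch. The Device class, its fields, and the function name are hypothetical; the in_use flag stands in for "online, allocated, reserved, or in use by a system component".

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    online_paths: set = field(default_factory=set)
    in_use: bool = False  # online, allocated, reserved, or used by a system component


def vary_path_offline_for_cu(devices, chpid, link_threshold=False):
    """Remove `chpid` from every device on a control unit, keeping the last
    path to an in-use device unless a link threshold condition applies.

    Returns the names of the devices that lost the path.
    """
    removed = []
    for dev in devices:
        if chpid not in dev.online_paths:
            continue
        is_last_path = len(dev.online_paths) == 1
        if is_last_path and dev.in_use and not link_threshold:
            continue  # the last path to an in-use device is preserved
        dev.online_paths.discard(chpid)
        removed.append(dev.name)
    return removed
```

In this model, a device whose only online path is the failing one keeps that path while it is in use, but loses it when link_threshold is set to True.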

When PATH_SCOPE=CU is specified, channel paths that are internally varied offline by the system are not varied back online automatically. You must bring the path back online manually, using an operator command, after ensuring that the problem that caused the path errors has been resolved.
Note: Bring the path online to a single device first, and then wait a short period of time (minutes) to allow I/Os to be issued to the device before bringing the path online to the remaining devices. This helps confirm that the problem has been resolved before the path is restored to every device.