"Hung" Command Recovery
This function detects hung commands, which often result in multisystem outages. Three situations are distinguished:
- Commands that inhibit other commands from completing execution
- Commands that inhibit jobs from completing execution
- Jobs that inhibit commands from completing execution
Automation examines ENQ contention associated with command processing and builds a list of blockers and waiters. The SA z/OS policy is then examined to see how long waiting commands and waiting jobs are allowed to wait before automated action is taken. The policy is also examined to determine what action (DUMP, NODUMP, KEEP or exclude) is to be taken against the blocking command or job, as follows:
- When a command inhibits other commands from completing and no policy definitions exist for any of the waiting commands, no automated action is taken.
- When a command inhibits jobs from completing and no policy definitions exist for the blocking command, no automated action is taken.
- When a job inhibits commands from completing and no policy definitions exist for any of the waiting commands, no automated action is taken.
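The three rules above can be sketched as a small decision function. This is only an illustration of the logic as described here, not SA z/OS code: the `Contention` record, the `policy_applies` function, and the representation of the policy as a set of command names are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Contention:
    """One blocker and its waiters (hypothetical structure for illustration)."""
    blocker: str
    blocker_is_command: bool
    waiters: Tuple[str, ...]
    waiters_are_commands: bool


def policy_applies(contention: Contention, defined_commands: FrozenSet[str]) -> bool:
    """Return True if a policy definition exists that permits automated action.

    defined_commands is an assumed stand-in for the commands that have
    policy definitions.
    """
    if contention.waiters_are_commands:
        # A command or a job is blocking commands: the definitions of the
        # waiting commands decide whether action may be taken.
        return any(w in defined_commands for w in contention.waiters)
    if contention.blocker_is_command:
        # A command is blocking jobs: the definition of the blocking
        # command decides.
        return contention.blocker in defined_commands
    # A job blocking jobs is not one of the three situations.
    return False
```

For example, with a definition for only `D GRS`, a `VARY` command that blocks a waiting `D GRS` qualifies for action, whereas a `VARY` command that blocks only jobs does not.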
If long-running ENQ and hung command recovery detect that the same resource requires automated action at the same time, the long-running ENQ policy definitions take precedence and long-running ENQ recovery automates the resource.
The action taken (DUMP, NODUMP, KEEP or exclude) is identical to the long-running ENQ recovery action.
In either case, only commands that are waiting on blocked resources are considered. "Hung" command recovery considers only those resources that are not being monitored by long-running ENQ recovery. If long-running ENQ recovery is disabled, all resources, even those defined as long-running ENQ resources, are considered for "hung" command recovery. Note also that if long-running ENQ recovery is enabled and a generic "catchall" resource definition applies, "hung" command recovery cannot occur, because long-running ENQ recovery always takes precedence.
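The resource selection described above can be sketched as follows. The function name, its parameters, and the use of shell-style wildcard matching for generic resource definitions are assumptions made for illustration; the actual matching rules are those of the SA z/OS policy.

```python
from fnmatch import fnmatchcase
from typing import Iterable


def considered_for_hung_command_recovery(
    resource: str,
    lrenq_enabled: bool,
    lrenq_patterns: Iterable[str],
) -> bool:
    """Return True if "hung" command recovery would consider this resource.

    lrenq_patterns is an assumed list of long-running ENQ resource
    definitions, where "*" stands for a generic "catchall" definition.
    """
    if not lrenq_enabled:
        # Long-running ENQ recovery is disabled: every resource is considered.
        return True
    # Otherwise only resources not monitored by long-running ENQ recovery
    # are left for "hung" command recovery.
    return not any(fnmatchcase(resource, pattern) for pattern in lrenq_patterns)
```

With a catchall pattern of `*` and long-running ENQ recovery enabled, the function returns False for every resource, which mirrors the statement that "hung" command recovery cannot occur in that configuration.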
Commands are executed by the master and console address spaces. Thus, when a resource blocker belongs to either of these address spaces, it is considered to be a blocking command rather than a blocking job.
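A minimal sketch of this classification follows. It assumes that the two address spaces appear under the names `*MASTER*` and `CONSOLE`; the constant and the function name are hypothetical.

```python
# Assumed names of the master and console address spaces.
COMMAND_ADDRESS_SPACES = frozenset({"*MASTER*", "CONSOLE"})


def blocker_kind(address_space: str) -> str:
    """Classify a resource blocker as a "command" or a "job"."""
    if address_space in COMMAND_ADDRESS_SPACES:
        return "command"
    return "job"
```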
As with resources, you can make definitions for commands that determine how long a command is permitted to lock a resource while other commands are waiting for that resource.
If the resource blocker is a job, recovery actions are taken only when the job has blocked the command for three consecutive iterations of "hung" command recovery processing. This limits the time that a job can block a command to between 90 and just under 120 seconds.
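The consecutive-iteration rule can be illustrated with a small counter. This is a sketch under assumptions: the `JobBlockTracker` class and the representation of a blocking relationship as a (job, command) pair are hypothetical, and the sketch makes no statement about how often an iteration runs.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]  # (blocking job, waiting command)


class JobBlockTracker:
    """Count consecutive iterations in which a job blocks a command."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self._counts: Dict[Pair, int] = {}

    def observe(self, blocking_pairs: Set[Pair]) -> Set[Pair]:
        """Record one iteration and return the pairs that are due for recovery.

        Pairs that are absent from this iteration are dropped, so their
        count starts again from one if they reappear later.
        """
        self._counts = {
            pair: self._counts.get(pair, 0) + 1 for pair in blocking_pairs
        }
        return {
            pair for pair, count in self._counts.items()
            if count >= self.threshold
        }
```

A pair is reported only on the third successive call to `observe` that contains it; an iteration without the pair resets its count.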
Recovery action for the blocking job or the job that issued the blocking command is the same as that specified for long-running ENQ recovery automation.