Hard CPU errors

A hard machine check indicates that the current instruction could not complete. The system records the error in the logrec data set. Then the system either abnormally ends the interrupted task or retries the interrupted task at a predefined retry point. Even though the task may be ended, the system usually continues to run.

The CPU errors that cause hard machine checks are:

System Damage (SD): A malfunction has caused the processor to lose control over the operation it was performing to the extent that the cause of the error cannot be determined.
Instruction Processing Damage (PD): A malfunction has occurred in the processing of an instruction.
Invalid PSW or Registers (IV): The hardware was unable to store the PSW or registers at the time of the error, as indicated by validity bits in the machine check interruption code (MCIC). Any error associated with these validity bits, even a soft machine check, is treated as a hard machine check because the operating system does not have a valid address at which to resume operation. The error goes through recovery processing.
Timing Facility Damage: Damage to the following has been detected:
- TOD clock (TC)
- Processor timer (PT)
- Clock comparator (CC)
- External Time Reference (ETR)
The four types of ETR-related machine checks are: primary synchronization damage, ETR attachment damage, switch to local, and ETR synchronization check.
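
The classification rule described above can be sketched as follows. This is a minimal illustration, not the system's implementation: the MachineCheck record, its field names, and the "SOFT" label are hypothetical and do not reflect the real MCIC layout. The sketch only shows that invalid PSW or register validity bits force an error to be handled as a hard machine check (IV), whatever its reported type.

```python
from dataclasses import dataclass

# Hard machine check types and abbreviations, as listed in the text.
HARD_TYPES = {
    "SD": "System Damage",
    "PD": "Instruction Processing Damage",
    "IV": "Invalid PSW or Registers",
    "TC": "TOD clock damage",
    "PT": "Processor timer damage",
    "CC": "Clock comparator damage",
    "ETR": "External Time Reference damage",
}


@dataclass
class MachineCheck:
    """Hypothetical, simplified view of a machine check (not the real MCIC)."""
    kind: str                     # reported type, e.g. "PD"; "SOFT" is a placeholder
    psw_valid: bool = True        # assumed stand-in for the PSW validity bits
    registers_valid: bool = True  # assumed stand-in for the register validity bits


def classify(check: MachineCheck) -> str:
    """Return the hard type abbreviation, or "SOFT" if the check is not hard."""
    # Invalid PSW or registers: always hard, even if the error itself was soft,
    # because there is no valid address at which to resume.
    if not (check.psw_valid and check.registers_valid):
        return "IV"
    if check.kind in HARD_TYPES:
        return check.kind
    return "SOFT"
```

For example, classify(MachineCheck("SOFT", psw_valid=False)) yields "IV", while the same check with valid PSW and registers stays "SOFT".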

To limit the effects of repeated hard machine checks, the operator can use the MODE command to define a threshold for each type of hard machine check. When a threshold is reached, alternate CPU recovery (ACR) configures the failing processor offline. Thus, the operator can control whether, and to what extent, the system monitors the frequency of hard machine checks, and can define a separate threshold and time interval for each type.

The default threshold value for most hard machine checks is 5. The default for PD machine checks is 16. The default for ETR machine checks is 5 occurrences in 300 seconds.
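
To make the threshold behavior concrete, here is a minimal sketch of a per-processor, per-type counter. It is an illustration under stated assumptions, not the system's algorithm: the ThresholdMonitor class and its method names are hypothetical, the sliding-window counting is an assumption, and the 300-second interval is stated in the text only for ETR machine checks and is merely assumed here for the other types. The threshold values (5 for most types, 16 for PD) come from the text.

```python
from collections import defaultdict, deque

DEFAULT_THRESHOLD = 5                # default for most hard machine checks (from the text)
THRESHOLDS = {"PD": 16, "ETR": 5}    # per-type defaults given in the text
ASSUMED_INTERVAL_SECONDS = 300       # stated for ETR only; assumed for other types


class ThresholdMonitor:
    """Hypothetical sliding-window counter of hard machine checks."""

    def __init__(self, thresholds=None, interval=ASSUMED_INTERVAL_SECONDS):
        self.thresholds = dict(THRESHOLDS if thresholds is None else thresholds)
        self.interval = interval
        # One queue of timestamps per (processor, machine check type) pair.
        self._events = defaultdict(deque)

    def record(self, cpu, kind, now):
        """Record one machine check at time `now` (seconds).

        Returns True when the threshold for this type is reached within the
        interval, which is the point at which the processor would be
        configured offline.
        """
        window = self._events[(cpu, kind)]
        window.append(now)
        # Discard occurrences that have aged out of the interval.
        while window and now - window[0] >= self.interval:
            window.popleft()
        limit = self.thresholds.get(kind, DEFAULT_THRESHOLD)
        return len(window) >= limit
```

In this sketch, five SD machine checks on the same processor within the interval make record return True on the fifth call, whereas PD machine checks do so only on the sixteenth. Occurrences spaced further apart than the interval never accumulate to the threshold.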