CPU errors

CPU errors result from a malfunction of a hardware element, such as a timing facility, instruction-processing hardware, or microcode. When a CPU error occurs, recovery processing generally proceeds in up to two stages, depending on the severity and type of error:
  1. When possible, the hardware retries the failing operation a certain number of times. If a retry succeeds, the hardware may issue a recovery machine check interruption, which is repressible, so that the operating system can record the error in the logrec data set. After recording the error, the operating system returns control to the interrupted task.
  2. If the error is too severe for hardware retry or the retries fail, the hardware issues either a hard or ending machine check interruption. The system determines the severity of the error and takes the appropriate action, which may range from ending the interrupted task to ending the entire system.
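The two stages above can be pictured as a small decision flow. The following Python sketch is only a conceptual model of that flow, not how the hardware or the operating system is implemented. Every name in it (`recover`, `Interruption`, `max_retries`, `repressed`, `ending`) is hypothetical, and the retry count of 3 is an arbitrary assumption, since the text says only "a certain number of times".

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class Interruption(Enum):
    """Machine check interruption kinds named in the text."""
    RECOVERY = "recovery"
    HARD = "hard"
    ENDING = "ending"


@dataclass
class Result:
    interruption: Optional[Interruption]  # None if the report was repressed
    attempts: int                         # number of hardware retries made
    logrec: List[str] = field(default_factory=list)  # stand-in for logrec


def recover(operation: Callable[[], bool],
            retryable: bool,
            max_retries: int = 3,     # assumption: the real count is not given
            repressed: bool = False,  # hypothetical: recovery reports repressed
            ending: bool = False) -> Result:  # hypothetical severity flag
    """Model the two-stage flow: hardware retry, then hard/ending check."""
    attempts = 0

    # Stage 1: when possible, retry the failing operation.
    if retryable:
        for attempts in range(1, max_retries + 1):
            if operation():
                if repressed:
                    # Retry worked, but no interruption is presented.
                    return Result(None, attempts)
                # Retry worked: the system records the error, then the
                # interrupted task resumes.
                log = [f"recovered after {attempts} retry attempt(s)"]
                return Result(Interruption.RECOVERY, attempts, log)

    # Stage 2: too severe for retry, or all retries failed.
    kind = Interruption.ENDING if ending else Interruption.HARD
    return Result(kind, attempts)
```

For example, an operation that fails once and then succeeds yields a recovery interruption and one log entry, while an operation that never succeeds falls through to stage 2 after `max_retries` attempts. What the system does after a hard or ending interruption (ending the task or the whole system) is deliberately left out of the sketch.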
The next topics describe specific CPU errors, followed by the recovery actions of alternate CPU recovery (ACR).