Terminating errors on CPUs

A terminating machine check occurs when the operating system or the hardware considers a failure severe enough that a processor cannot continue operation.

In a uniprocessor (UP), the operating system enters a disabled non-restartable wait state, such as X'A01' or X'A26', and issues the following message:
IGF910W UNRECOVERABLE MACHINE FAILURE, RE-IPL SYSTEM
In a multiprocessor (MP), the action taken is as follows:
  • If the hardware determines that a processor cannot continue operation, it places the processor in a check-stop state and attempts to signal the other processor(s) by issuing a malfunction alert (MFA) external interruption. The hardware issues an MFA when the failing processor:
    • Cannot store the machine check logout data about the error.
    • Cannot load the machine check new PSW.
    • Is disabled for hard machine checks when a hard error is detected.
  • If the operating system determines that a processor cannot continue operation, it attempts to signal the other processor(s) by issuing a Signal Processor (SIGP) instruction to cause an emergency-signal (EMS) external interruption. The operating system issues a SIGP instruction when:
    • A machine check that cannot be handled occurs while the system is still processing an earlier machine check.
    • A hard-machine-check threshold, which is an installation option established by entering the MODE command, has been reached.
    • Channel subsystem damage is detected.
    • The content of the machine check interruption code (MCIC) is incorrect.
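
The two multiprocessor cases above can be summarized as a lookup from the detected condition to the signal that the other processors receive. The following Python sketch is only an illustration of that mapping, not system code: the condition names, the Signal enumeration, and the terminating_signal function are hypothetical names chosen to mirror the list items.

```python
from enum import Enum, auto


class Signal(Enum):
    MFA = auto()  # malfunction alert, issued by the hardware
    EMS = auto()  # emergency signal, caused by the operating system through SIGP


# Hypothetical condition names that mirror the list items above;
# they are not real system identifiers.
HARDWARE_CONDITIONS = {
    "cannot_store_logout_data",
    "cannot_load_machine_check_new_psw",
    "disabled_for_hard_machine_checks",
}
OS_CONDITIONS = {
    "nested_unhandled_machine_check",
    "hard_machine_check_threshold_reached",
    "channel_subsystem_damage",
    "incorrect_mcic",
}


def terminating_signal(condition: str) -> Signal:
    """Return the external interruption the other processors would receive."""
    if condition in HARDWARE_CONDITIONS:
        return Signal.MFA
    if condition in OS_CONDITIONS:
        return Signal.EMS
    raise ValueError(f"not a terminating condition: {condition!r}")
```

For example, terminating_signal("channel_subsystem_damage") yields Signal.EMS, because that condition is detected by the operating system rather than by the hardware.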

When a processor receives either an MFA or EMS external interruption for these conditions, the system receives control. The system, in turn, invokes ACR processing, which takes the malfunctioning processor offline and initiates recovery processing for that processor.

In a multiprocessor environment, an MFA or EMS is received by all the other online processors. On the first processor to receive the signal, the system tests and sets a flag before starting to process the error. When the other processors receive the interruption, the system sees that the error is already being processed and returns to the interrupted task.
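
The test-and-set behavior described above can be modeled with a small Python sketch in which threads stand in for the receiving processors. This is an illustration under stated assumptions rather than the actual implementation: RecoveryGate, on_external_interruption, and simulate are hypothetical names, and a lock is used here to make the test-and-set step atomic.

```python
import threading


class RecoveryGate:
    """Lets exactly one receiver of an MFA or EMS start recovery processing."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_progress = False

    def test_and_set(self) -> bool:
        """Atomically set the flag; return True only for the first caller."""
        with self._lock:
            if self._in_progress:
                return False
            self._in_progress = True
            return True


def on_external_interruption(gate: RecoveryGate, cpu_id: int, handled_by: list) -> None:
    """Model one receiving processor (hypothetical handler)."""
    if gate.test_and_set():
        # First receiver: this is where recovery for the failed processor
        # would be started.
        handled_by.append(cpu_id)
    # Any later receiver sees the flag already set and simply returns,
    # which models the return to the interrupted task.


def simulate(receivers: int = 4) -> list:
    """Deliver the signal to several receivers; return those that ran recovery."""
    gate = RecoveryGate()
    handled_by: list = []
    threads = [
        threading.Thread(target=on_external_interruption, args=(gate, cpu, handled_by))
        for cpu in range(receivers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handled_by
```

Whichever receiver reaches the flag first, the list returned by simulate contains exactly one processor identifier, so recovery is started once per failure.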