Processor deallocation process
AIX can stop a failing processor by deallocating it.
The typical flow of events for processor deallocation is as follows:
- The firmware detects that a recoverable error threshold has been reached by one of the processors.
- The firmware error report is logged in the system error log. If AIX is running on a machine that supports processor deallocation, it starts the deallocation process.
- AIX notifies non-kernel processes and threads bound to the last bind CPU.
- AIX waits up to ten minutes for all the bound threads to move away from the last bind CPU. If threads remain bound, AIX aborts the deallocation.
- If all processes and threads have unbound from the ailing processor, AIX invokes the previously registered High Availability Event Handlers (HAEHs). An HAEH can return an error, which aborts the deallocation.
- Unless aborted, the deallocation process ultimately stops the failing processor.
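The flow above can be sketched as a small decision model. This is an illustrative Python simulation, not AIX kernel code: the `Deallocation` class, its fields, and the log strings are all hypothetical names chosen for the example, and the HAEHs are represented by plain callables that return `True` to allow the deallocation or `False` to veto it.

```python
from dataclasses import dataclass, field
from typing import Callable, List

WAIT_LIMIT_SECONDS = 10 * 60  # "up to ten minutes" from the flow above


@dataclass
class Deallocation:
    """Toy model of one deallocation attempt (hypothetical, not an AIX API)."""
    # Seconds each bound thread needs before it moves off the last bind CPU.
    unbind_delays: List[float]
    # Hypothetical stand-ins for registered HAEHs: True allows, False vetoes.
    handlers: List[Callable[[], bool]] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def run(self) -> bool:
        self.log.append("firmware error report logged")
        self.log.append("bound processes and threads notified")
        # Threads needing longer than the wait limit are still bound at the deadline.
        still_bound = [d for d in self.unbind_delays if d > WAIT_LIMIT_SECONDS]
        if still_bound:
            self.log.append(f"aborted: {len(still_bound)} thread(s) still bound")
            return False
        # Handlers run only after every thread has unbound; any veto aborts.
        for index, handler in enumerate(self.handlers):
            if not handler():
                self.log.append(f"aborted: HAEH {index} returned an error")
                return False
        self.log.append("processor stopped")
        return True
```

Calling `Deallocation([5, 700]).run()` models a thread that stays bound past the ten-minute limit: the attempt returns `False`, and the last log entry records the cause, mirroring the logged failure that the administrator would later inspect.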
If the deallocation fails at any point, the failure and its cause are logged. The system administrator can examine the error log, take corrective action where possible, and restart the deallocation. For instance, if the deallocation was aborted because an application did not unbind its bound threads, the system administrator can stop the application, restart the deallocation, and then restart the application.
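The recovery sequence in the example above (stop the application, restart the deallocation, restart the application) can be sketched in the same spirit. The `recover` function and its three callables are hypothetical placeholders for whatever the administrator actually runs; restarting the application even when the retry fails is an assumption of this sketch, not something the text prescribes.

```python
from typing import Callable, List


def recover(attempt: Callable[[], bool],
            stop_app: Callable[[], None],
            start_app: Callable[[], None]) -> List[str]:
    """Stop the offending application, retry the deallocation, restart the application."""
    steps = []
    stop_app()  # releases the threads that were still bound
    steps.append("application stopped")
    succeeded = attempt()  # hypothetical stand-in for restarting the deallocation
    steps.append("deallocation " + ("completed" if succeeded else "aborted again"))
    # Assumption: the application is restarted either way so it is not left down.
    start_app()
    steps.append("application restarted")
    return steps
```

The ordering is the point of the sketch: the retry happens only while the application is stopped, so no thread can hold a binding while the deallocation runs.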