Alternate Processor Recovery and Partition Availability Priority
Power Systems servers have some pretty interesting availability features. The following is one that you may not have known about. Processor failures are very rare, but with PowerVM and the Power Systems hypervisor, some magic can happen.
If processor instruction retry for a fault within a core occurs multiple times without success, the fault is considered to be a solid failure. In some instances PowerVM can work with the processor hardware to migrate workload running on the failing processor to a spare or alternate processor. This migration is accomplished by migrating the pertinent processor core state from one core to another with the new core taking over at the instruction that failed on the faulty core. Successful migration keeps the application running during the migration without need to terminate the failing application.
Successful migration does require sufficient spare capacity so that the overall processing capacity within the system can be reduced by one processor core. Typically, in a highly virtualized environment, the requirements of partitions can simply be reduced to accomplish this without any further impact to running applications.
In systems without sufficient reserve capacity, it may be necessary to terminate at least one partition to free up resources for the migration. PowerVM users can identify in advance which partitions have the highest priority and which do not. With this Partition Priority feature of PowerVM, if a partition must be terminated for Alternate Processor Recovery to complete, the system can terminate lower-priority partitions to keep the higher-priority partitions up and running, even when the unrecoverable error occurred on a core running the highest-priority workload.
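To make the idea concrete, here is a minimal Python sketch of priority-based selection. It is only an illustration and not the hypervisor's actual algorithm: the `Partition` class, the `choose_partitions_to_terminate` function, and the assumption that capacity is freed strictly in ascending priority order are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    priority: int   # higher value = more important (hypothetical model)
    cores: float    # processor capacity the partition currently holds


def choose_partitions_to_terminate(partitions, cores_needed=1.0):
    """Pick the lowest-priority partitions until enough capacity is freed.

    Illustrative only: the real PowerVM decision logic is not described
    in this post and may differ.
    """
    freed = 0.0
    victims = []
    # Walk partitions from least to most important.
    for part in sorted(partitions, key=lambda p: p.priority):
        if freed >= cores_needed:
            break
        victims.append(part.name)
        freed += part.cores
    if freed < cores_needed:
        raise ValueError("not enough capacity can be freed")
    return victims


if __name__ == "__main__":
    lpars = [
        Partition("prod", priority=200, cores=2.0),
        Partition("test", priority=50, cores=0.5),
        Partition("dev", priority=10, cores=0.5),
    ]
    print(choose_partitions_to_terminate(lpars, cores_needed=1.0))
```

In this toy example the two lowest-priority partitions are selected and the production partition keeps running, which is the behavior the feature is meant to provide.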
Partition Availability Priority is assigned to partitions by using a weight value or integer rating. The lowest priority partition is rated at 0 (zero) and the highest priority partition is rated at 255. The default value is set to 127 for standard partitions and 192 for Virtual I/O Server (VIOS) partitions. On systems managed by a Hardware Management Console (HMC), priorities can be modified through the HMC.
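The range and defaults above can be captured in a few lines of Python. This is a sketch for illustration only; the constant names and the `effective_priority` helper are hypothetical and are not part of any HMC or PowerVM interface. The numeric values are the ones quoted in this post.

```python
MIN_PRIORITY = 0             # lowest priority, per the text above
MAX_PRIORITY = 255           # highest priority, per the text above
DEFAULT_PRIORITY = 127       # default for standard partitions, per the text
DEFAULT_VIOS_PRIORITY = 192  # default for VIOS partitions, per the text


def effective_priority(requested=None, is_vios=False):
    """Return the priority a partition would carry in this toy model.

    If no value is requested, fall back to the default for the
    partition type; otherwise check that the value is in range.
    """
    if requested is None:
        return DEFAULT_VIOS_PRIORITY if is_vios else DEFAULT_PRIORITY
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(requested, bool) or not isinstance(requested, int):
        raise TypeError("priority must be an integer")
    if not MIN_PRIORITY <= requested <= MAX_PRIORITY:
        raise ValueError("priority must be between 0 and 255")
    return requested
```

Because VIOS partitions default to a higher value than standard partitions, a system left at its defaults would favor keeping the Virtual I/O Servers running.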
For more information, look for an updated Chapter 4 in the following papers:
- IBM Power Systems S812L and S822L Technical Overview and Introduction, Redpapers
- IBM Power Systems S814 and S824 Technical Overview and Introduction, Redpapers
- IBM Power System S822 Technical Overview and Introduction, Redpapers