What happens if a processor fails in my IBM Power System?
- Will it all stop working?
- Will one LPAR stop working?
- Will it carry on but just have less resource?
This is, understandably, quite a common query, so I thought that I'd write up my fairly standard answer.
Executive summary: It depends :-)
To try to be a little more helpful, let's look at various things.
Firstly, the Power Systems are specifically designed with high RAS (Reliability, Availability and Serviceability) in mind. The Enterprise class systems are architected for the highest RAS, but even the Scale-out servers have far more RAS features than is normal in other similar-sized servers in the industry.
When discussing this subject, it is worth observing that the word "processor" means different things to different people. For example, we have Single and Dual Chip Modules (SCMs and DCMs), which fit in sockets. A DCM has two chips, and each chip can have one or, more typically, several processor cores (with associated caches etc.) plus other components such as accelerator units.
Here is a Whitepaper about RAS in POWER8 Processor based systems:
It goes into far more detail than this blog entry and is recommended for further reading.
It is often possible to predict a likely failure in a processor module, for example by noting a number of retries, and the system can deconfigure the module, or more likely just a part of it (such as a core or a cache unit) before an actual failure occurs. This is the default mode of operation. In the event of a processor, or part of one, being deconfigured in this way, the impact is likely to be very minor as the deconfiguration happens in a controlled manner.
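To make the idea of predictive deconfiguration a little more concrete, here is a minimal sketch in Python. It is only a toy model and not how the firmware is actually implemented: the Core class, the record_corrected_error function and the threshold of 3 are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical threshold, chosen for illustration; not a real firmware value.
RETRY_THRESHOLD = 3


@dataclass
class Core:
    core_id: int
    corrected_errors: int = 0
    configured: bool = True


def record_corrected_error(core: Core, threshold: int = RETRY_THRESHOLD) -> bool:
    """Count a recovered error and deconfigure the core at the threshold.

    Returns True only if this call caused the core to be deconfigured.
    """
    if not core.configured:
        return False  # already taken out of service, nothing more to do
    core.corrected_errors += 1
    if core.corrected_errors >= threshold:
        core.configured = False  # controlled removal, before a hard failure
        return True
    return False


core = Core(core_id=5)
events = [record_corrected_error(core) for _ in range(4)]
print(events)           # [False, False, True, False]
print(core.configured)  # False
```

The point of the sketch is that the core is removed in a controlled way on the third recovered error, while it is still working, rather than after it has failed outright.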
It is also worth mentioning that the available RAS features depend on the hypervisor in use and the way that the OS behaves in various situations. PowerVM and an IBM OS (AIX or IBM i) will give you the best RAS, compared with, for example, using PowerKVM and/or Linux, or even bare metal. This is not to say that any of those is bad, and the RAS features for them are constantly improving, but the IBM "metal to OS" approach will give the best integration when it comes to RAS.
OK, so what do we have?
Worst case scenario, you could lose everything, for example if the module short-circuits or bursts into flames (NB: this is very unlikely). This is the case, I believe, for all servers from all manufacturers.
It is far more likely that, if a chip did fail, it would not be that disastrous. :-)
It is unlikely, but possible, that a whole chip (with multiple cores) fails while it is not being used by any LPARs and none of the memory paths associated with the socket are in use at all, so there would be little effect apart from messages etc.
People don't tend to buy machines to have spare activated resources though, so the most likely situation is somewhere in between.
Again, it depends.
Chips can fail totally and suddenly, and if that happened, all LPARs using the chip would be affected. Any processes running on its CPU cores at the time would be lost, along with their register state. Your application may or may not be able to handle killed processes without a restart.
It is much more likely for an individual CPU core to fail than a whole chip. Cores normally suffer glitches that can be caught and handled by POWER8. This actually happens more often than you might think, because of cosmic ray bit flips and other environmental issues. A glitch is seen as an instruction failure and is detected by memory access and internal register sum checks. This gives the POWER8 an opportunity to correct the problem or work around it without losing the processor state and without failing the process it's running.
At this point bear in mind all the good instruction retry type features in our OSes/hypervisors/firmware/processors. The processor state can be retried on an alternative CPU core and the failing core taken offline with no loss of a process.
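The retry-then-move behaviour described above can be sketched in a few lines of Python. This is purely an illustration of the control flow, not the real hardware or hypervisor logic: TransientFault, run_with_recovery, the core names and the retry count of 2 are all assumptions of mine.

```python
class TransientFault(Exception):
    """Stands in for a detected, recoverable core error (hypothetical)."""


def run_with_recovery(instruction, cores, max_retries=2):
    """Run an instruction, retrying on the same core before moving on.

    `instruction` is a callable that takes a core name.
    Returns (result, core_used, cores_taken_offline).
    """
    offline = []
    for core in cores:
        # One initial attempt plus up to `max_retries` retries on this core.
        for _attempt in range(1 + max_retries):
            try:
                return instruction(core), core, offline
            except TransientFault:
                continue
        # Retries exhausted: take the core offline and try the next one.
        offline.append(core)
    raise RuntimeError("no healthy core left to run the instruction")


def flaky(core):
    # Hypothetical workload: core0 always faults, any other core works.
    if core == "core0":
        raise TransientFault(core)
    return 42


result, used, offline = run_with_recovery(flaky, ["core0", "core1"])
print(result, used, offline)  # 42 core1 ['core0']
```

The caller still gets its result: the work completes on core1 and core0 ends up offline, which mirrors the "no loss of a process" outcome.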
If LPARs are using dedicated CPU cores, the failed CPU cores would not be available to those LPARs. If an LPAR loses all of its CPU cores, clearly it will fail. If an LPAR loses some of its CPU cores, it may be able to survive at reduced capacity; it depends on what processes were running at the time.
Dynamic Processor Sparing: In the event of a processor being disabled, if there are spare processors available in the system for Capacity on Demand, the system will automatically activate the necessary resources (in place of the failing ones) so that the system runs with the licensed quantity of resources available. This is automatic and does not count against activation days. (Incidentally, this RAS feature also applies to memory).
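The arithmetic behind Dynamic Processor Sparing is simple enough to show as a small Python sketch. The function name apply_sparing and the core counts in the examples are made up for illustration, and the sketch assumes that the number of failed cores never exceeds the licensed quantity.

```python
def apply_sparing(licensed: int, failed: int, inactive_spares: int) -> dict:
    """Replace failed cores with inactive Capacity on Demand cores.

    Assumes failed <= licensed. All names here are hypothetical.
    """
    activated = min(failed, inactive_spares)
    return {
        "activated_spares": activated,
        "usable_cores": licensed - failed + activated,
        "spares_left": inactive_spares - activated,
    }


# Enough spares: the system stays at its licensed capacity.
enough = apply_sparing(licensed=16, failed=2, inactive_spares=4)
print(enough)  # {'activated_spares': 2, 'usable_cores': 16, 'spares_left': 2}

# Not enough spares: capacity drops by the shortfall.
short = apply_sparing(licensed=16, failed=3, inactive_spares=1)
print(short)   # {'activated_spares': 1, 'usable_cores': 14, 'spares_left': 0}
```

In the first case the failure is invisible from a capacity point of view; in the second, the system runs two cores short until the failed parts are replaced.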
If LPARs are using shared processors, things can be better. The loss of available CPU cores is felt by more than one LPAR, but they can share the loss, so they may have fewer cycles scheduled to them by the hypervisor, but at least they do get those cycles. It is unlikely that an LPAR running an IBM OS would fail, due to the RAS brought by using the IBM combination of OS/VIOS/PowerVM/hypervisor/POWER8 with all of the prediction and retry technology. On the other hand, I could not rule out that the whole system could go down if, for example, an uncorrectable error occurred while some system-critical hypervisor code was running (as I say though, this would be the case for all manufacturers' kit, and is a pretty standard disclaimer).
Another factor here is the memory. Memory hangs off processor sockets, and the fabric for the memory comes from the processor module being in place. In the most likely scenario of the processor module suffering CPU core failures, it is highly likely that the memory paths would survive, meaning that LPARs would not go down due to memory "disappearing". The IBM POWER8 memory has parity, chip kill, bit steering and whole line failure recovery built in.
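As a rough picture of what "sharing the loss" means in a shared pool, here is a deliberately simplified Python sketch. It scales every LPAR down by the same proportion, which is an assumption of mine for illustration only; the real hypervisor scheduling takes entitlements, weights and capping into account and is considerably more sophisticated. The function name and LPAR names are hypothetical.

```python
def share_remaining_capacity(entitlements: dict, pool_cores: float,
                             failed_cores: float) -> dict:
    """Scale LPAR allocations proportionally if the pool can no longer
    cover them all. A toy model, not the PowerVM scheduling algorithm."""
    remaining = pool_cores - failed_cores
    total = sum(entitlements.values())
    if total <= remaining:
        return dict(entitlements)  # the pool still covers everyone in full
    scale = remaining / total
    return {name: ent * scale for name, ent in entitlements.items()}


lpars = {"lparA": 4.0, "lparB": 2.0, "lparC": 2.0}
after = share_remaining_capacity(lpars, pool_cores=8, failed_cores=2)
print(after)  # {'lparA': 3.0, 'lparB': 1.5, 'lparC': 1.5}
```

Every LPAR slows down a little, but none of them loses all of its processing capacity, which is the contrast with the dedicated-core case.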
IBM Power Systems employ "First Failure Data Capture" (FFDC). This is a hardware based technology, so all/any operating systems benefit from it.
Each subsystem in the processor hardware has registers devoted to collecting and reporting fault information as faults occur. The design for error checking is rigorous and detailed. The value of data is generally checked wherever it is stored. This is true, of course, for data used in computations, but also for nearly any other data structure, including arrays used only to store performance and debug data. The RAS Whitepaper has much more detail, but in short, being able to detect faults at the source enables superior fault isolation, which translates to identifying the correct part which has failed - first time.
It must also be stressed that even in the event of a failure which causes some outage, the system will de-configure the failing part and it will be "guarded" (locked out of the configured components), even through a re-IPL (complete power cycle) of the entire system. So, unlike servers from some other manufacturers, the failing item will not cause a subsequent outage. In normal circumstances, if the system is managed by an HMC, it will be able to "call home" to IBM so that a hardware call can be opened automatically and new parts and an engineer quickly sent to the customer.
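The important property of guarding is that the record of the failed part survives a restart. A minimal Python sketch of that idea follows. It uses a JSON file in a temporary directory to stand in for the persistent guard record; the file name, the function names and the component names are all hypothetical, and this is not how the service processor actually stores guard data.

```python
import json
import os
import tempfile


def load_guarded(path):
    """Read the persistent guard list (empty if none has been written)."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(json.load(f))


def guard(path, component):
    """Add a component to the guard list and write it back to storage."""
    guarded = load_guarded(path)
    guarded.add(component)
    with open(path, "w") as f:
        json.dump(sorted(guarded), f)


def ipl(path, installed):
    """Simulate a boot: configure everything installed except guarded parts."""
    guarded = load_guarded(path)
    return [c for c in installed if c not in guarded]


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "guard.json")  # hypothetical storage location
    installed = ["core0", "core1", "core2"]
    guard(path, "core1")                  # core1 has failed and is guarded
    first_boot = ipl(path, installed)
    second_boot = ipl(path, installed)    # still excluded after a re-IPL

print(first_boot)   # ['core0', 'core2']
print(second_boot)  # ['core0', 'core2']
```

Because the guard list is read at every boot, the failed core stays out of the configuration until someone replaces it and clears the record, so it cannot cause the same outage twice.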
IBM Redbooks and Redpapers about specific servers have detailed information about UEs (Uncorrectable Errors) and the way that our systems can often cope - whereas lesser* systems would usually fail (*you might think that I am talking about Intel-based servers, but I neither confirm nor deny). For example, see REDP-5097, IBM Power Systems S814 and S824 Technical Overview and Introduction.
Another great blog entry is here:
It was written by Chet Mehta who is an IBM Distinguished Engineer. He is the POWER Firmware Technical Lead (so he knows what he's talking about). You can find Chet on Twitter as @Chet_Mehta.
For more information, do read the 90-page Whitepaper and the relevant Redbooks.