Handling DIMM failure

When a compute node has a DIMM failure and the faulty DIMM is in use by the ESXi kernel, the compute node cannot evacuate the virtual machines that run on it.

Problem

According to VMware, no operations on a compute node with a DIMM failure are safe. It is the job of the hardware to protect the ESXi and Guest OS layers from possible data corruption in such cases. If the underlying hardware can isolate the broken memory blocks, all operations on these layers continue to work as expected. However, if the hardware fails to do so, you might see catastrophic issues at the ESXi and Guest OS level.

Resolution

According to VMware, to avoid unpredictable business impact, request managed downtime immediately when you see the DIMM failure event, and shut down all the virtual machines on the impacted compute node. Do not migrate the virtual machines until you have restarted the compute node. From the IBM® Cloud Pak System perspective, to avoid data corruption on the virtual machines during any DIMM failure, consider these actions:
  • First, shut down the running virtual machines on that host.
  • Restart the host.
  • Put the host into the quiesce state.
  • Add more compute nodes if additional capacity is needed to balance the resources while the virtual machines are in the poweredOff state.
  • Put the impacted host into maintenance mode so that the powered-off virtual machines are migrated to other hosts.
  • Power on the virtual machines.
  • Then, get the hardware issue fixed.
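
The ordering in the actions above matters: virtual machines are shut down first, the host is restarted next, and migration happens only after the restart. The following Python sketch illustrates that ordering as a small state model. It is not a Cloud Pak System or VMware API; the `HostState` class and the three functions are hypothetical names that are introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HostState:
    """Hypothetical, simplified model of a compute node with a DIMM failure."""
    running_vms: set = field(default_factory=set)
    powered_off_vms: set = field(default_factory=set)
    restarted: bool = False
    in_maintenance: bool = False


def shut_down_vms(host: HostState) -> None:
    # Power off every running virtual machine on the impacted host.
    host.powered_off_vms |= host.running_vms
    host.running_vms = set()


def restart_host(host: HostState) -> None:
    # The host is restarted only after all virtual machines are powered off.
    if host.running_vms:
        raise RuntimeError("shut down all virtual machines before restarting the host")
    host.restarted = True


def enter_maintenance_and_migrate(host: HostState) -> list:
    # Migration is refused until the host has been restarted.
    if not host.restarted:
        raise RuntimeError("do not migrate virtual machines before the host is restarted")
    host.in_maintenance = True
    migrated = sorted(host.powered_off_vms)
    host.powered_off_vms = set()
    return migrated
```

In this model, calling `enter_maintenance_and_migrate` on a host that has not been restarted raises an error, which mirrors the guidance not to migrate virtual machines from a host with a DIMM failure before the restart.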
Follow these steps:
  1. Request critical managed downtime.
  2. Gracefully shut down the compute node.
  3. Restart the compute node and recover the virtual machines on it.
  4. Migrate the virtual machines to the other compute nodes in the cloud group.
  5. Replace the compute node with a healthy compute node in the environment, and add the new compute node to the cloud group so that the workloads (virtual machines) can move back onto it.
  6. Put the impacted compute node in maintenance mode so that it cannot be used until the faulty hardware is fixed.
  7. Track the callhome ticket so that the hardware service replaces the faulty DIMM. The compute node with the faulty DIMM must not be used until the hardware service is completed.