High availability overview for virtual machines (VMs)

Cloud Pak System provides a high availability framework for virtual machines (VMs) that eliminates single points of failure and provides peer-to-peer failover for multiple Platform System Managers.

Virtual machine availability

Within a single system, Cloud Pak System provides the following high availability features to handle failures and keep virtual machine instances running:
  • Redundant hardware, such as networking, storage, compute, and power supplies.
  • No single point of failure for management node cloud groups that have high availability active and contain two management nodes.
  • Virtual machines remain available during system maintenance updates or hardware failures, by using reserved capacity and mobility actions within the system.
  • Additional capacity can be added and used with no service interruption.

Note: Power off is disabled for the leader Platform System Manager virtual machine, so you cannot perform the shutdown operation on it.

For a system to be highly available, all of its components must be highly available. Currently, the cloud group is the only component whose high availability mode you can control. When high availability is active for a cloud group, physical capacity is reserved so that, even during peak utilization, the system stays functional and healthy while virtual machine instances are evacuated during system failures and updates. The amount of reserved physical capacity is determined by the cloud group type: dedicated or average.

The following table shows how a virtual machine's CPU and memory reservations are mapped to hardware resources:
Table 1. Mapping of CPU and memory reservations to hardware resources
Type       | CPU count (1 vCPU)   | Virtual memory (1 MB)
Dedicated  | 0.9 pCPU per vCPU    | 1 physical MB
Average    | 0.1125 pCPU per vCPU | 1 physical MB
Note: VMware CPU overhead is amortized over each physical CPU. There is a 10% overhead that is reserved on each pCPU for ESX giving you 0.9 of a core for a dedicated cloud group, and 0.1125 for an average cloud group.
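
As a rough illustration of the mapping in Table 1, the following sketch converts the vCPU and memory settings of one virtual machine into the physical resources that are reserved for it. The function and constant names are hypothetical and are not part of any Cloud Pak System interface; only the ratios come from the table.

```python
# Illustrative only: names are hypothetical, ratios are copied from Table 1.
PCPU_PER_VCPU = {"dedicated": 0.9, "average": 0.1125}
PHYSICAL_MB_PER_VIRTUAL_MB = 1  # same for both cloud group types


def physical_reservation(cloud_group_type, vcpus, memory_mb):
    """Return (pCPU, physical MB) reserved for one virtual machine."""
    key = cloud_group_type.lower()
    if key not in PCPU_PER_VCPU:
        raise ValueError(f"unknown cloud group type: {cloud_group_type!r}")
    pcpu = vcpus * PCPU_PER_VCPU[key]
    physical_mb = memory_mb * PHYSICAL_MB_PER_VIRTUAL_MB
    return pcpu, physical_mb


if __name__ == "__main__":
    # Same virtual machine shape placed in each cloud group type.
    for group_type in ("dedicated", "average"):
        pcpu, mb = physical_reservation(group_type, vcpus=4, memory_mb=8192)
        print(f"{group_type}: {pcpu:.4f} pCPU, {mb} physical MB")
```

Arithmetically, 0.1125 is 0.9 divided by 8, so eight vCPUs in an average cloud group reserve the same physical CPU as one vCPU in a dedicated cloud group.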

Optionally, a cloud group can be set to reserve resources for high availability. This option reserves resources (CPU and memory) within the cloud group equivalent to one compute node. The reserved capacity in a cloud group containing N compute nodes is 1 / N of the resources (CPU and memory) on each compute node.
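
The 1 / N rule can be worked through with a short calculation. The sketch below assumes that all compute nodes in the cloud group have identical capacity, which the text above does not state; the function name and the example node sizes are hypothetical.

```python
# Illustrative only: assumes identical compute nodes; names are hypothetical.
def availability_reserve(node_count, cpu_per_node, memory_mb_per_node):
    """Return the capacity set aside when resources are reserved for availability."""
    if node_count < 1:
        raise ValueError("a cloud group needs at least one compute node")
    fraction = 1 / node_count  # share reserved on each compute node
    cpu_per_node_reserved = cpu_per_node * fraction
    mb_per_node_reserved = memory_mb_per_node * fraction
    return {
        "fraction_per_node": fraction,
        "reserved_cpu_per_node": cpu_per_node_reserved,
        "reserved_mb_per_node": mb_per_node_reserved,
        # Summed over all N nodes, the reserve equals one node's capacity.
        "reserved_cpu_total": cpu_per_node_reserved * node_count,
        "reserved_mb_total": mb_per_node_reserved * node_count,
    }


if __name__ == "__main__":
    # Hypothetical cloud group: 4 compute nodes, 32 cores and 512 GB each.
    for name, value in availability_reserve(4, 32, 524288).items():
        print(f"{name}: {value}")
```

With four nodes, a quarter of each node is held back, and the four quarters together equal the capacity of one compute node.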

If the Reserve resources for availability option is enabled, any required evacuation of virtual machine instances from one compute node to another always completes successfully, without impacting the virtual machine instances, because the required resources within the cloud group were set aside in advance.

If the Reserve resources for availability option is disabled, one of three situations regarding evacuation can occur during the system update process:
  • When an evacuation is not required, the updates can complete without moving virtual machines off their existing compute nodes.
  • When an evacuation is required and the cloud group has enough physical capacity for the evacuation to occur, as determined by the cloud group type, the updates can complete without impacting the running virtual machine instances.
  • When an evacuation is required and the cloud group does not have enough physical capacity for the evacuation to occur, the updates cannot complete without affecting the virtual machine instances.
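
The three situations above can be summarized as a small decision function. This is a simplified sketch, not the logic that the system actually runs: it assumes that "enough physical capacity" means the spare pCPU and memory on the remaining compute nodes cover what the evacuated instances need, and all names are hypothetical.

```python
# Illustrative only: a simplified model of the three update-time situations.
from enum import Enum


class UpdateOutcome(Enum):
    NO_EVACUATION = "updates complete without moving virtual machines"
    EVACUATION_OK = "updates complete without impacting running instances"
    INSTANCES_AFFECTED = "updates cannot complete without affecting instances"


def classify_update(evacuation_required, spare_pcpu, spare_mb,
                    needed_pcpu, needed_mb):
    """Classify an update when resources are NOT reserved for availability."""
    if not evacuation_required:
        return UpdateOutcome.NO_EVACUATION
    # Assumption: capacity is sufficient when both CPU and memory fit.
    if needed_pcpu <= spare_pcpu and needed_mb <= spare_mb:
        return UpdateOutcome.EVACUATION_OK
    return UpdateOutcome.INSTANCES_AFFECTED


if __name__ == "__main__":
    print(classify_update(False, 0, 0, 0, 0).name)
    print(classify_update(True, 10.0, 65536, 7.2, 32768).name)
    print(classify_update(True, 5.0, 65536, 7.2, 32768).name)
```

The required pCPU and memory for the instances being moved would follow from the cloud group type, because that type determines how much physical capacity each vCPU consumes.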
Note: When you evacuate a virtual machine instance from one compute node to another, the log file can include numerous DUPLICATE IP ADDRESS DETECTED messages. These messages are for informational purposes only and no action is required.