High availability overview for virtual machines (VMs)
Cloud Pak System provides a high availability framework for virtual machines (VMs) that eliminates single points of failure and provides peer-to-peer failover for multiple Platform System Managers.
Virtual machine availability
- Redundant hardware, such as networking, storage, compute, and power supplies.
- No single point of failure for management node cloud groups that have high availability active and contain two management nodes.
- Virtual machines remain available during system maintenance updates or hardware failures, leveraging reserved capacity and mobility actions within the system.
- Additional capacity can be added and used with no service interruption.
Power off is disabled for the leader Platform System Manager virtual machine, so you cannot shut it down.
For a system to be highly available, all of its components must be highly available. Currently, the cloud group is the only component whose high availability mode you can control. When high availability is active for a cloud group, physical capacity is reserved so that, even during peak utilization, the overall functionality and state of the system remain healthy while virtual machine instances are evacuated during system failures and updates. The amount of reserved physical capacity is determined by the cloud group type: dedicated or average.
Type | Physical CPU per 1 vCPU | Physical memory per 1 MB of virtual memory |
---|---|---|
Dedicated | 0.9 pCPU per vCPU | 1 physical MB |
Average | 0.1125 pCPU per vCPU | 1 physical MB |
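As a rough illustration of how the table reads, the following Python sketch converts a virtual machine's vCPU and memory allocation into the physical capacity behind it. The names `PCPU_PER_VCPU` and `physical_footprint` are hypothetical and are not part of any Cloud Pak System API; only the ratios come from the table.

```python
# Hypothetical helper for illustration; the ratios are taken from the table above.
PCPU_PER_VCPU = {"dedicated": 0.9, "average": 0.1125}
PHYSICAL_MB_PER_VIRTUAL_MB = 1  # same for both cloud group types


def physical_footprint(group_type, vcpus, memory_mb):
    """Return (physical CPUs, physical MB) backing a VM in the given cloud group type."""
    ratio = PCPU_PER_VCPU[group_type.lower()]  # KeyError for an unknown type
    return vcpus * ratio, memory_mb * PHYSICAL_MB_PER_VIRTUAL_MB


if __name__ == "__main__":
    for kind in ("Dedicated", "Average"):
        pcpu, pmem = physical_footprint(kind, vcpus=8, memory_mb=16384)
        print(f"{kind}: {pcpu:.4f} pCPU, {pmem} MB")
```

Under these ratios, the same 8 vCPU virtual machine is backed by eight times more physical CPU in a dedicated cloud group than in an average one, while the memory backing is identical.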
Optionally, a cloud group can be set to reserve resources for high availability. This option reserves resources (CPU and memory) within the cloud group equivalent to one compute node. The reserved capacity in a cloud group containing N compute nodes is 1 / N of the resources (CPU and memory) on each compute node.
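The 1 / N rule can be checked with a short Python sketch. It assumes that all compute nodes in the cloud group are identical; the function name and the sample node size are hypothetical values chosen for illustration.

```python
def reserved_per_node(node_cpus, node_memory_mb, n_nodes):
    """Return the (CPU, memory MB) set aside on each node: 1 / N of that node's resources."""
    if n_nodes < 1:
        raise ValueError("a cloud group needs at least one compute node")
    return node_cpus / n_nodes, node_memory_mb / n_nodes


if __name__ == "__main__":
    # Assumed example: 4 identical nodes, each with 32 physical cores and 512 GB of memory.
    n = 4
    cpu, mem = reserved_per_node(32, 524288, n)
    print(f"per node: {cpu} cores, {mem} MB")
    print(f"group total: {cpu * n} cores, {mem * n} MB")
```

Summed over the N nodes, the reservation equals the capacity of one compute node, which is why the workload of any single node can be absorbed by the rest of the cloud group.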
If the Reserve resources for availability option is enabled, any required evacuation of virtual machine instances from one compute node to another always completes without impacting the instances, because the required resources within the cloud group have been set aside in advance.
- When an evacuation is not required, the updates can complete without moving virtual machines off their existing compute nodes.
- When an evacuation is required and the cloud group has enough physical capacity for the evacuation to occur, as determined by the cloud group type, the updates can complete without impacting the running virtual machine instances.
- When an evacuation is required and the cloud group does not have enough physical capacity for the evacuation to occur, the updates cannot complete without affecting the virtual machine instances.
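The three update cases above can be summarized as a small decision function. This is only a sketch of the logic as described here: the function name, its parameters, and the returned strings are hypothetical, and capacity is reduced to a single number for simplicity.

```python
def update_outcome(evacuation_required, free_capacity, needed_capacity):
    """Classify an update according to the three cases described above."""
    if not evacuation_required:
        return "completes without moving virtual machines"
    if free_capacity >= needed_capacity:
        return "completes without impacting running instances"
    return "cannot complete without affecting instances"


if __name__ == "__main__":
    print(update_outcome(False, 0, 0))
    print(update_outcome(True, 16, 12))
    print(update_outcome(True, 8, 12))
```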