Recovering hosts, VMs, and applications

After you configure the resources and policies in the KSYS subsystem, the KSYS continues to monitor the environment for any failures or issues. When any planned or unplanned outage occurs, the KSYS subsystem restarts the virtual machines on another host based on the specified policies.

The KSYS subsystem can be configured to monitor the hosts, virtual machines (VMs), and applications to perform recovery operations. By default, all VMs in the host group are managed. If you unmanage one or more VMs and run the discovery operation, the KSYS subsystem does not monitor and manage those VMs for high availability. Similarly, the KSYS subsystem does not monitor the specified resources for high availability if you disable HA monitoring for the entire system or a host group. However, you can monitor the health of VMs and applications only when you install the VM agent and enable HA monitoring in each VM that requires it.

Recovering virtual machines in an unplanned outage

The KSYS subsystem performs the following types of recoveries for unplanned outages based on specified policies:
Automatic restart of virtual machines
When a host, VM, or critical application fails and the restart_policy attribute is set to auto, the KSYS subsystem restarts the virtual machines automatically on other hosts. The KSYS notifies you about the events; you do not have to take any actions.

However, if the KSYS subsystem cannot successfully stop the VMs in the source host, the VMs are not restarted automatically. Also, if the KSYS subsystem identifies a problem but cannot determine its cause, the VMs are not restarted on other hosts automatically, to avoid an unnecessary outage caused by false failure detection. In both cases, the KSYS subsystem notifies you about the problem. You must review the notified problem and then manually start the VMs, if necessary.

Manual recovery of virtual machines
When a host, VM, or critical application fails and the restart_policy attribute is set to advisory_mode, the KSYS notifies you about the issue. You can review the issue and manually restart the virtual machines on another host.

If you have configured the VM agent in each of your virtual machines, the KSYS notifies you when a virtual machine or a registered critical application fails or stops working correctly. In such cases, you can also restart the virtual machines on another host based on the specified policies.
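
The two restart policies and the exceptions described above can be summarized as a small decision function. This is an illustrative sketch only, not part of the KSYS software: the function name recovery_action, its yes/no arguments, and the action labels it prints are all hypothetical. Only the restart_policy values auto and advisory_mode come from the text.

```shell
#!/bin/sh
# recovery_action POLICY STOP_OK CAUSE_KNOWN   (hypothetical helper)
#   POLICY       value of restart_policy: auto | advisory_mode
#   STOP_OK      yes if the VMs were stopped successfully on the source host
#   CAUSE_KNOWN  yes if the KSYS subsystem could determine the issue
# Prints the expected behavior as an illustrative label.
recovery_action() {
    policy="$1"
    stop_ok="$2"
    cause_known="$3"
    case "$policy" in
        auto)
            # Automatic restart happens only when both conditions hold.
            if [ "$stop_ok" = yes ] && [ "$cause_known" = yes ]; then
                echo "automatic-restart"
            else
                echo "notify-then-manual-restart"
            fi
            ;;
        advisory_mode)
            # The KSYS only notifies; the administrator restarts the VMs.
            echo "notify-then-manual-restart"
            ;;
        *)
            echo "unknown restart_policy: $policy" >&2
            return 2
            ;;
    esac
}
```

For example, recovery_action auto no yes prints the manual label, because the VMs could not be stopped on the source host.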

To restart the virtual machines manually on another host, complete the following steps:
  1. Restart specific virtual machines or all virtual machines in a host by running the following command:
    ksysmgr [-f] restart vm vmname1[,vmname2,...] [to=hostname|uuid]
    Or,
    ksysmgr [-f] restart host hostname|uuid [to=hostname|uuid]
    After the virtual machines are restarted successfully, the KSYS subsystem automatically cleans up the VMs on the source host and removes the LPAR profile from the HMC.
  2. If the output of the ksysmgr restart command indicates cleanup errors, clean up the VM details manually in the source host by running the following command:
    ksysmgr cleanup vm vmname host=hostname
  3. If the restart operations fail, recover the virtual machine on the host where it is currently located by running the following command:
    ksysmgr [-f] recover vm vmname
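
The manual recovery steps above could be wrapped in a small script, for example as follows. This is a minimal sketch under stated assumptions: the function names and the KSYSMGR variable are hypothetical, and the script assumes that ksysmgr returns a nonzero exit status when an operation fails, which the text above does not confirm. By default the script only prints the commands. Cleanup errors are reported in the command output, so the cleanup step stays a separate function that you call after reviewing that output.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set KSYSMGR=ksysmgr to run them on the KSYS node.
KSYSMGR="${KSYSMGR:-echo ksysmgr}"

# restart_or_recover VMNAME [TARGET_HOST]   (hypothetical helper)
# Step 1: restart the VM on another host.
# Step 3: if the restart fails, recover the VM on its current host.
# Assumption: a failed ksysmgr operation returns a nonzero exit status.
restart_or_recover() {
    vm="$1"
    target="$2"
    if [ -z "$vm" ]; then
        echo "usage: restart_or_recover vmname [target_host]" >&2
        return 2
    fi
    if [ -n "$target" ]; then
        $KSYSMGR restart vm "$vm" "to=$target" && return 0
    else
        $KSYSMGR restart vm "$vm" && return 0
    fi
    $KSYSMGR recover vm "$vm"
}

# cleanup_vm VMNAME SOURCE_HOST   (hypothetical helper)
# Step 2: run only if the restart output indicates cleanup errors.
cleanup_vm() {
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "usage: cleanup_vm vmname source_host" >&2
        return 2
    fi
    $KSYSMGR cleanup vm "$1" "host=$2"
}
```

In dry-run mode, restart_or_recover vm1 hostB prints the restart command that would be run, which makes it easy to check the arguments before you run the script against a live environment.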

Planned migration of virtual machines

The KSYS subsystem uses the HMC-provided Live Partition Mobility (LPM) capability to support planned HA management. You can also use the HMC to perform the LPM operations; the KSYS adapts to the movements of the VMs within the host group as part of its regular discovery operation. If you plan a host maintenance or upgrade operation, you can move all the virtual machines to another host by using the LPM operation and restore the virtual machines to the original host after the maintenance or upgrade operation is complete. You can also use LPM validation to test whether the virtual machines can be moved to another host successfully, without actually moving them. This validation helps you avoid errors that might occur during the relocation of virtual machines.

To migrate the virtual machines to another host by using the LPM operation, complete the following steps:
  1. Validate the LPM operation without migrating the virtual machines by running one of the following commands:
    • To validate the LPM operation for specific virtual machines, run the following command:
      ksysmgr [-f] lpm vm vmname1[,vmname2,...] action=validate
    • To validate the LPM operation for all virtual machines in a specific host, run the following command:
      ksysmgr [-f] lpm host hostname|uuid action=validate
    If the output displays any errors, you must resolve those errors.
  2. Migrate the virtual machines from the source host to another host by running one of the following commands:
    • To migrate specific virtual machines, run the following command:
      ksysmgr [-f] lpm vm vmname1[,vmname2,...] [to=hostname|uuid]
    • To migrate all virtual machines from a specific host, run the following command:
      ksysmgr [-f] lpm host hostname|uuid [to=hostname|uuid]
    When you run this command, the virtual machines are restarted on another host according to the specified policies in the KSYS configuration settings. If you have not specified the destination host where the virtual machines must be started, the KSYS subsystem identifies the most suitable host that can be used to start each virtual machine.

    If you have HMC Version 9 Release 9.3.0, or later, you can view the LPM progress as a percentage value.

  3. After each LPM operation, update the LPM validation state by running the discovery and verify operations with the following command:
    ksysmgr discover host_group hg_name verify=true
  4. After the maintenance or upgrade activities are complete in the source host, restore all virtual machines by running the following command:
    ksysmgr restore host hostname|uuid
Note: If the discovery operation fails because of an error, the HAState attribute displays DISCOVERY_FAILED. Similarly, if the verify operation fails because of an error, the HAState attribute displays VERIFY_FAILED.
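
The planned migration procedure for a whole host can be sketched as two shell functions, one to run before the maintenance window and one to run after it. This is an illustration under stated assumptions, not a supported tool: the function names and the KSYSMGR variable are hypothetical, and the script assumes that ksysmgr returns a nonzero exit status when validation or migration fails. By default the script only prints the commands.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set KSYSMGR=ksysmgr to run them on the KSYS node.
KSYSMGR="${KSYSMGR:-echo ksysmgr}"

# lpm_evacuate HOST HOST_GROUP [TARGET_HOST]   (hypothetical helper)
# Runs steps 1 to 3: validate, migrate, then discover and verify.
# Assumption: a failed ksysmgr operation returns a nonzero exit status.
lpm_evacuate() {
    host="$1"
    hg="$2"
    target="$3"
    if [ -z "$host" ] || [ -z "$hg" ]; then
        echo "usage: lpm_evacuate host host_group [target_host]" >&2
        return 2
    fi
    # Step 1: validate first; stop if validation reports a failure.
    $KSYSMGR lpm host "$host" action=validate || return 1
    # Step 2: migrate all VMs; without a target, the KSYS picks the host.
    if [ -n "$target" ]; then
        $KSYSMGR lpm host "$host" "to=$target" || return 1
    else
        $KSYSMGR lpm host "$host" || return 1
    fi
    # Step 3: refresh the LPM validation state.
    $KSYSMGR discover host_group "$hg" verify=true
}

# lpm_restore HOST   (hypothetical helper)
# Step 4: bring the VMs back after maintenance is complete.
lpm_restore() {
    if [ -z "$1" ]; then
        echo "usage: lpm_restore host" >&2
        return 2
    fi
    $KSYSMGR restore host "$1"
}
```

Splitting the flow into two functions mirrors the procedure: lpm_evacuate runs before the maintenance or upgrade activity, and lpm_restore runs only after that activity is complete on the source host.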