VM Recovery Manager HA limitations

Consider the following restrictions for the VM Recovery Manager HA solution.

KSYS limitations

  • The KSYS subsystem follows the KSYS_<peer domain>_<HG_ID> format to name an HA SSP. The KSYS subsystem uses this format to differentiate between an HA SSP and a user-defined SSP. Therefore, you must not use this format for any user-defined SSP.
  • The following commands can be run without considering any of the policies:
    ksysmgr verify host_group <host_group_name>
    ksysmgr lpm host <host_name> action=validation
    Therefore, successful completion of the verification and validation operations does not mean that the virtual machines can be relocated successfully.
  • When you remove the KSYS cluster, the KSYS subsystem fails to delete HA-specific VM and VIOS adapters if the cleanup operation continues for a long time. You must delete the VIOS adapters manually to avoid inconsistencies across the Virtual I/O Servers. If you create the KSYS cluster again, the KSYS subsystem can reuse the previous HA-specific adapters.
    • Workaround:
      • To remove the KSYS cluster, run the following command with the -f option:
        ksysmgr -f remove ksyscluster <ksyscluster_name>
      • To remove the Shared Storage Pool (SSP) cluster, run the following command on one of the Virtual I/O Servers (VIOS) in the SSP cluster:
        cluster -remove
      • To check if the hsmon daemon is stopped in all Virtual I/O Servers (VIOS), run the following command:
        lssrc -s ksys_hsmon
      • To stop the hsmon daemon in the VIOS, run the following command in all Virtual I/O Servers:
        stopsrc -s ksys_hsmon -c
  • The KSYS subsystem supports the Shared Storage Pool (SSP) cluster's high availability disk only when the Shared Storage Pool (SSP) is created from the KSYS subsystem. The KSYS subsystem does not display the high availability disk in any query when you use a user-defined SSP cluster.
  • You cannot modify a KSYS subsystem's high availability disk after creating the SSP cluster from the KSYS node. If you want to modify the HA disk, you must delete the host group and re-create the host group with the HA disk details.
  • After you configure applications in the VMRM environment, if you shut down a virtual machine (VM), the KSYS subsystem does not change the status of the VM to red; the status remains green. However, if you shut down the same VM through the HMC, the KSYS subsystem changes the status of the application to red.
  • A maximum of 10 scripts can be added in the KSYS subsystem by using the add notify command.
  • On VIOS nodes, if the disks of a shared storage pool (SSP) are not accessible after the system is reactivated following a shutdown or reboot, the disk state continues to be down. This affects the start of the pool, which requires a quorum to come back online. As a workaround, choose one of the following options. If you do not want to reboot your VIOS, use workaround option 1.
    • Workaround option 1: Complete the following procedure:
      1. Restore the disk connectivity.
      2. Run the cfgmgr command as a root user to make the system aware of the disks.
      3. As the padmin user, run the clstartstop -stop -m <node> command.
      4. As the padmin user, run the clstartstop -start -m <node> command.
    • Workaround option 2: Complete the following procedure:
      1. Restore the disk connectivity.
      2. Reboot the VIOS node.
  • For a VM with vSCSI disk, the cleanup operation fails in the local database mode.
    • Workaround: You must bring the SSP cluster back to the global mode.
  • The KSYS subsystem does not handle the application dependency if the VM has been shut down manually and the dependent application is part of the VM.
  • VM Recovery Manager HA does not work if the Live Partition Mobility (LPM) feature is disabled at the firmware level.
  • If the current repository disk is down, automatic replacement does not occur on a previously used repository disk that has the same cluster signature. In this case, a free backup repository disk might not be available, and the automatic replacement operation fails.
    • Workaround: Run the following command to clear the previous cluster signatures:
      cleandisk -r <diskname> 
  • In a scalability environment where the VMs are spread across the hosts of a host group, running the LPM verification operation on the host group can, depending on the configuration, send many requests to a single host. If the number of requests exceeds the maximum that the host can handle, the verification operation might fail with the following error:
    HSCLB401 The maximum number of partition migration commands allowed are already in progress. 
  • In the KSYS LPAR, if you upgrade the AIX operating system after upgrading the KSYS software, a few class IDs might be missing in the /usr/sbin/rsct/cfg/ct_class_ids file and the KSYS daemon might stop working.
    • Workaround: Run the following command to check whether the class IDs are reserved:
      cat /usr/sbin/rsct/cfg/ct_class_ids
      IBM.VMR_HMC                                     510
      IBM.VMR_CEC                                     511
      IBM.VMR_LPAR                                    512
      IBM.VMR_VIOS                                    513
      IBM.VMR_SSP                                     514
      IBM.VMR_SITE                                    515
      IBM.VMR_SA                                      516
      IBM.VMR_DP                                      517
      IBM.VMR_DG                                      518
      IBM.VMR_KNODE                                   519
      IBM.VMR_KCLUSTER                                520
      IBM.VMR_HG                                      521
      IBM.VMR_APP                                     522
      IBM.VMR_CLOUD                                   523
      IBM.VMR_DP_CLD                                  524
      IBM.VMR_SA_CLD                                  525
      IBM.VMR_LPAR_CLD                                526
      IBM.VMR_SITE_CLD                                527
      IBM.VMR_VMG_CLD                                 528
      IBM.VMR_APP_CLD                                 529
      If any of the class IDs that are displayed in the preceding output are missing in your output, add the missing entries to the /usr/sbin/rsct/cfg/ct_class_ids file so that the VMR services can be restarted.
  • If the ha_monitor attribute is enabled for a virtual machine by using the KSYS subsystem, and you shut down the virtual machine by using the immediate option from the HMC, the KSYS subsystem does not display an error message when you run the discover or verify operation.
  • For a multi-node KSYS cluster, VM Recovery Manager HA Version 1.7 supports only two KSYS nodes.
  • To avoid addMS/addVM related issues, ensure that the ksys_hsmon daemon is not running on any VIOS before the first discovery operation. If the previous cluster is not cleaned up properly, the ksys_hsmon daemon remains in an active state. If you create a cluster while the ksys_hsmon daemon is running, addMS/addVM related issues occur in the KSYS subsystem.
  • For a multi-node KSYS setup, it is recommended that the operating system version and the RSCT version be the same on all nodes when you add clusters. If one node runs an earlier version of the operating system than the other node, specify the node with the earlier version before the node with the later version when you run the command to add nodes to the cluster.
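
The class ID check in the AIX upgrade workaround above can be scripted instead of compared by eye. The following is a minimal sketch, not a product-supplied tool: the function name missing_class_ids is hypothetical, and the sketch only checks that each class name from the list above appears as the first field of a line. It does not validate the numeric IDs and does not modify the file.

```shell
#!/bin/sh
# Sketch: report which VMR class names are absent from an RSCT class ID file.
# The function name is illustrative. The class names are the ones listed in
# the workaround; the numeric IDs are not checked.

EXPECTED_CLASSES="IBM.VMR_HMC IBM.VMR_CEC IBM.VMR_LPAR IBM.VMR_VIOS IBM.VMR_SSP
IBM.VMR_SITE IBM.VMR_SA IBM.VMR_DP IBM.VMR_DG IBM.VMR_KNODE IBM.VMR_KCLUSTER
IBM.VMR_HG IBM.VMR_APP IBM.VMR_CLOUD IBM.VMR_DP_CLD IBM.VMR_SA_CLD
IBM.VMR_LPAR_CLD IBM.VMR_SITE_CLD IBM.VMR_VMG_CLD IBM.VMR_APP_CLD"

# Usage: missing_class_ids [file]
# Prints one missing class name per line; prints nothing if all are present.
missing_class_ids() {
    file=${1:-/usr/sbin/rsct/cfg/ct_class_ids}
    for class in $EXPECTED_CLASSES; do
        # A class counts as present only if it is the first field of a line,
        # so IBM.VMR_SA does not match IBM.VMR_SA_CLD.
        if ! awk -v c="$class" '$1 == c { found = 1 } END { exit !found }' "$file"; then
            echo "$class"
        fi
    done
}
```

Calling missing_class_ids with no argument on the KSYS LPAR reads the default file. Any names that are printed are the entries to add back by hand, with the IDs shown in the listing above.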

KSYS LPM limitations

  • You cannot run the Live Partition Mobility (LPM) operation simultaneously on multiple hosts by using the ksysmgr command. Instead, you must specify multiple virtual machines, in a comma-separated list, in the ksysmgr command. Also, you can perform the LPM operation on a list of virtual machines simultaneously only if all the virtual machines are on the same host.
  • The flexible capacity policy is applicable only for VM failover operations. The flexible capacity function is not supported for virtual machines that are migrated by using the LPM operation.
  • The flexible capacity policy is applicable only on CPU and memory resources. It is not applied on I/O resources. You must ensure enough I/O resources are available in the target host.
  • If a VM migrates from host1 to host2 and the applications in the VM become stable, and the VM later needs to be migrated from host2 because of an application failure, host1 is not considered as a backup for the application failure migration because the VM had previously failed on host1. If host1 needs to be considered as a backup for future application failures, use the following workaround.
    • Workaround: After the VM is stable on the host2, clear the FailedHostList list of the VM. Run the command chrsrc -s 'Name="VMName"' IBM.VMR_LPAR VmRestartFailedCecs='{""}' to clear the FailedHostList list for the VM.
  • The discovery operation or the KSYS restart operation automatically starts the dependency applications that were stopped by the user before the discovery or the restart of the KSYS subsystem.
    • Workaround: Complete the following procedure:
      1. Do not perform the discovery operation after stopping the dependency application.
      2. Disable the auto discover and the quick discovery features.
      3. Do not perform the KSYS subsystem restart.
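
Because a simultaneous LPM operation accepts only virtual machines that are on the same host, it can help to group the VMs by host before building the comma-separated list. The following is a minimal sketch under stated assumptions: the function name group_vms_by_host and the input format (one "vm_name host_name" pair per line) are hypothetical, and the sketch only prepares the lists. It does not call ksysmgr.

```shell
#!/bin/sh
# Sketch: turn "vm host" pairs (one per line on stdin) into one line per host
# of the form "host vm1,vm2,...". The input format and the function name are
# assumptions made for illustration only.
group_vms_by_host() {
    awk 'NF >= 2 {
        if (!($2 in list)) {
            order[++n] = $2        # remember hosts in first-seen order
            list[$2] = $1
        } else {
            list[$2] = list[$2] "," $1
        }
    }
    END {
        for (i = 1; i <= n; i++) print order[i], list[order[i]]
    }'
}
```

Each output line then corresponds to one LPM request, for example by passing the comma-separated list for a single host to the ksysmgr command.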

VM agent limitations

  • The ksysvmmgr start|stop app command supports only one application at a time.
  • The ksysvmmgr suspend|resume command is not supported for the applications that are configured in an application dependency setup.
  • For all applications that are installed on the non-rootvg disks, you must enable the automatic varyon option for volume groups and the auto mount option for file systems after the virtual machine is restarted on the AIX® operating system.
  • If the application is in any of the failure states, for example, NOT_STOPPABLE, NOT_STARTABLE, ABNORMAL, or FAILURE, you must fix the failure issue, and then use the ksysvmmgr start|resume application command to start and monitor the application.
  • If the KSYS cluster is deleted, or if a virtual machine is not included for the HA management, the VM agent daemon becomes inoperative. You must manually re-start the VM agent daemon in the virtual machine to bring the VM agent daemon to operative state.
  • For VMs that run the Linux VM agent, the restart operation might take longer than expected, and the rediscovery operation might fail and display the following message:
    Rediscovery has encountered error for VM VM_Name
    • Workaround: Run the discovery operation after the virtual machine is in the active state.
  • The state of an application that is running on a VM does not change automatically to normal when the VM is recovered by the KSYS subsystem. The state of the application changes to normal when you run the resume command from the VM.
    • Workaround: To resume the application, run the following command:
      ksysvmmgr -s resume app <app name>
      An output that is similar to the following example is displayed:
      Modifying application "App1" into daemon configuration successfully performed.

GUI limitations

  • The VM Recovery Manager HA GUI does not support multiple sessions that are originating from the same computer.
  • The VM Recovery Manager HA GUI does not support duplicate names for host group, HMC, host, VIOS, and VMs. If a duplicate name exists in the KSYS configuration, the GUI might have issues during host group creation or in displaying the dashboard data.
  • The VM Recovery Manager HA GUI refreshes automatically after each topology change (for example, VM migration operation and host migration operation). After the refresh operation is complete, the default KSYS dashboard is displayed. You must expand the topology to view the log information in the Activity window for a specific entity.
  • Any operation performed by a user from the command-line interface of VM Recovery Manager HA is not displayed in the activity window of the VM Recovery Manager HA GUI.

Miscellaneous

  • The VM Recovery Manager HA solution does not support the Internet Small Computer System Interface (iSCSI) disk type. Only N_Port ID virtualization (NPIV) and virtual Small Computer System Interface (vSCSI) disk types are supported.
  • In a user-defined SSP cluster, if you want to add a host or VIOS to the environment, you must add it in the shared storage pool (SSP) cluster first. Then, you can add the host or VIOS to the KSYS cluster. Also, if you want to remove a host or VIOS from the environment, you must first remove it from the KSYS cluster and then remove it from the SSP cluster.
  • VM Recovery Manager HA supports only detailed-type snapshot.
  • After each manage VIOS operation and unmanage VIOS operation, you must perform the discovery operation.
  • If you have configured an application as critical on a virtual machine, ensure that the KDB option for the virtual machine is disabled.

Errors that the KSYS subsystem cannot handle

The KSYS subsystem automatically restarts the VMs only when the KSYS subsystem is certain of the failures. If the KSYS subsystem is unsure, it sends an alert message to the administrator to review the issue and to manually restart VMs, if required.

Sometimes, the KSYS subsystem cannot identify whether a host failure is real or is caused by a network partition. The KSYS subsystem does not automatically restart VMs in the following example scenarios:
  • When the KSYS subsystem cannot connect to the HMC to quiesce the failed VM (fencing operation) on the source host before restarting the VM on the target host. The fencing operation is required to ensure that the VM is not running on two hosts simultaneously.
  • The host monitor module and the VIOS can monitor their own network and storage. Sometimes, network and storage errors are reported by the VIOS and these error events are notified to the administrator through email and text messages. In these cases, the KSYS subsystem does not move the VMs automatically to avoid false relocation.
  • When a host group is spread across two buildings with storage subsystem technologies such as IBM® SAN Volume Controller (SVC) HyperSwap®, where HMCs, hosts and other required resources exist in each building and the KSYS LPAR is deployed on the backup building, the following scenarios cannot be automatically handled:
    • Power failure in the main building: The KSYS subsystem cannot connect to the HMCs and hosts in the main building. The KSYS subsystem detects the host failure and notifies the administrator.
    • Issues in network and storage partitioning between the buildings: The KSYS subsystem cannot connect to the HMCs, and therefore notifies the administrator about the host failure. The administrator must review the environment and decide whether to move the VMs. The VMs might be operating correctly on the main host. The administrator can rectify the network links between the hosts, after which the KSYS subsystem resumes operating in normal mode.