Solving common problems

This section describes the solutions to some problems that you might encounter when you use the VM Recovery Manager HA solution.

The discovery operation for a host group, host, or VM failed

Problem
The discovery operation failed or a VM is not discovered by the KSYS subsystem during the discovery operation.
Solution
  1. Ensure that you have completed all the prerequisites that are specified in the Requirements section and the configuration steps that are specified in the Configuring section.
  2. Check whether the ha_monitor attribute is enabled or disabled for site, host group, host, and VM by using the following commands:
    lsrsrc IBM.VMR_SITE HAmonitor 
    lsrsrc IBM.VMR_HG Name HAmonitor 
    lsrsrc IBM.VMR_CEC HAmonitor 
    lsrsrc IBM.VMR_LPAR Name HAmonitor 
  3. If the HAmonitor field shows Disabled or is not set, enable the ha_monitor attribute by using one of the following commands:
    chrsrc -c IBM.VMR_SITE HAmonitor="Enabled"
    ksysmgr modify system|hg|vm [<name>] ha_monitor=enable
    If the ha_monitor attribute for a VM is not enabled at a VM-level, host group-level, or system-level, the VM is not considered for the discovery operation.
  4. Ensure that you have started the VM monitor daemon by running the ksysvmmgr start vmm command. The VM agent can send heartbeats to the host monitor only when you start the VM monitor daemon.
  5. Ensure that you have set all the HMC options that are specified in the Requirements section.
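For example, assuming a VM named vm1 in a host group named HG1 (both names are illustrative), you can enable HA monitoring for the VM and rerun the discovery operation from the KSYS node, and then confirm inside the VM that the VM monitor daemon is active by using a command such as ksysvmmgr status:
ksysmgr modify vm vm1 ha_monitor=enable
/opt/IBM/ksys/ksysmgr -t discover host_group HG1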

The verification operation for a host group, host, or VM failed

Problem
The Phase or the PhaseDetail attribute of a VM indicates that the verification operation failed.
Solution
During the verification operation, you can troubleshoot the possible issues as described in the following steps:
  1. Analyze the verification operation flow in the /var/ct/<cluster_name>/log/mc/IBM.VMR/trace.ksys.* trace files and check the Phase, PhaseDetail, and HAState fields during each operation.
  2. If any of the remote copy programs (RCP) for host group, host, or LPAR is deleted or does not exist, re-create the host group and add the hosts by using the ksysmgr command. Run the discovery operation and the verification operation to check whether the issue is resolved.
  3. If the verification lock is acquired by some other process, the verification process might fail. Check the trace.ksys.* trace files to ensure that verification locks are not acquired by other threads before you start the verification operation.
  4. The verification process for a VM cannot start if 32 threads are already running. Ensure that a sufficient number of free threads are available for the VM task to complete within the specified time.
  5. Check the /var/ct/<cluster_name>/log/mc/IBM.VMR/trace.krestlong.* trace files to identify whether the LPM operation is successful. Rerun the verification operation, if required. Also, check whether the previous Phase and PhaseDetail fields are cleared before you start the LPM operation.
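For example, to review the Phase and PhaseDetail values that are stored in the registry for a specific VM (the VM name vm1 is illustrative), you can run a command similar to the following on the KSYS node:
# lsrsrc -s 'Name="vm1"' IBM.VMR_LPAR Name Phase PhaseDetail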

The flexible capacity values cannot be calculated correctly

Problem
The verification operation failed because of an error in the calculated flexible capacity values.
Solution
  1. If any of the remote copy programs (RCP) for host group, host, or LPAR is deleted or does not exist, the verification thread might fail. Recreate the host group and run the discovery operation.
  2. If the calculated flexible capacity values are not correct, consider the following solutions:
    • Review the /var/ct/<cluster_name>/log/mc/IBM.VMR/trace.ksys.* trace files to check whether the policy map is created correctly by the policy manager of the target host and check the logical memory block (LMB) size that is calculated by the target host in the trace files.
    • Ensure that the specified capacity is between the minimum and maximum capacity for a VM; otherwise, the VM moves to the target host with the same capacity.
    • Ensure that the flexible capacity tables are set properly in the host group by using the following commands:
      #lsrsrc -s 'Name="hg_name"' IBM.VMR_HG FlexProcCapacityTable
      #lsrsrc -s 'Name="hg_name"' IBM.VMR_HG FlexMemCapacityTable
      #chrsrc -s 'Name="hg_name"' IBM.VMR_HG FlexMemCapacityTable={"100","70","50"}
      #chrsrc -s 'Name="hg_name"' IBM.VMR_HG FlexProcCapacityTable={"100","70","50"}
      #lsrsrc -s 'Name="hg_name"' IBM.VMR_HG Priority
      #chrsrc -s 'Name="hg_name"' IBM.VMR_HG Priority="Medium"
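For example, after you update the tables, you can cross-check the values that are stored in the registry against the priority-based memory and CPU capacity settings that are reported by the following command (the repository disk topic later in this section shows sample output of this command):
/opt/IBM/ksys/ksysmgr -v query host_group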

The LPM verification or LPM operation failed

Problem
The VM cannot be moved to another host while performing the Live Partition Mobility (LPM) operation.
Solution
If the VM move operation failed while performing the LPM operation on the VMs from the KSYS subsystem or while restarting the VMs in another host, you must check the events, the VM state, and the log files to diagnose and resolve the issue. Use the following troubleshooting steps to diagnose the issue:
  1. If you received an event notification, check the event details in the /var/ksys/events.log file and review the suggested action.
  2. Identify the reason for the move operation (LPM or restart) failure in the /var/ksys/log/ksysmgr.log file.
  3. Ensure that the RMC connection between the HMC and the VM is working. If the VM contains any firewall service, ensure that the RMC connection is allowed by the firewall. You can use the HMC diagrmc command to verify and correct the RMC connection.
  4. If the move operation failed because of policy conflicts (such as collocation policy and anti-collocation policy), resolve the conflicts and run the move operation again. If the move operation failed because of insufficient capacity resources in the target host, increase the target host capacity and retry the move operation.
  5. If the LPM or restore operation failed and the VM exists on both the source and target hosts, recover the VM by using the following command:
    ksysmgr [-f] lpm vm vmname1 action=recover
  6. Run the discovery and verification operations after each LPM operation to update the LPM validation state.
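For example, if a VM named vm1 in the host group HG1 is left on both hosts after a failed LPM operation (both names are illustrative), you can recover the VM and then refresh the validation state:
ksysmgr -f lpm vm vm1 action=recover
/opt/IBM/ksys/ksysmgr -t discover host_group HG1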

The restarted VM does not move to a stable state

Problem
When you restart a failed VM on another host, the VM does not move to a stable state.
Solution
The LPAR profile on the source host is deleted after the virtual machine is restarted successfully on the target host. However, if the virtual machine does not move to a proper state, perform the following steps:
  1. Restart the virtual machines in the target host by running the following command:
    ksysmgr [-f] restart vm|host [vmname1[,vmname2,...]] 
  2. If the restart operations fail, recover the virtual machine in the same host where it is located currently by running the following command:
    ksysmgr [-f] recover vm vmname
  3. If the output of the restart command indicates cleanup errors, run the cleanup command manually to clean up the VM details in the source host by running the following command:
    ksysmgr cleanup vm vmname host=hostname
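For example, for a VM named vm1 whose source host is Host_A (both names are illustrative), the commands in the previous steps take the following form:
ksysmgr restart vm vm1
ksysmgr recover vm vm1
ksysmgr cleanup vm vm1 host=Host_A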

The KSYS subsystem cannot find suitable target hosts

Problem
The failed VMs cannot be restarted on other hosts successfully because the policy manager module cannot find suitable target hosts.
Solution
Whenever a VM or a host failure is identified by the KSYS subsystem, the policy manager module finds the best fit target host where the VM can be migrated based on various policies such as collocation, anti-collocation, affinity, priority, and blacklist. Sometimes, the policy manager module cannot find the target host for a VM because of policy conflict or resource check failures. Perform the following steps to troubleshoot this error:
  1. Search for the string Policymap VmtoCECMapTable in the /var/ct/<cluster_name>/log/mc/IBM.VMR/trace.user.* trace file to find the VM-host mapping table.
  2. Review the miscellaneous column of the VM-host mapping table to check the policies that have conflicts. Resolve the conflicts, run the discovery and verification operations, and then retry the recovery operations.
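For example, you can locate the trace file that contains the VM-host mapping table by running a command similar to the following on the KSYS node; the search string is the one mentioned in step 1:
# grep -l "Policymap VmtoCECMapTable" /var/ct/<cluster_name>/log/mc/IBM.VMR/trace.user.*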

Restart operation failed because of a version mismatch

Problem
The failed VMs cannot be restarted on other hosts successfully because of version mismatch.
Solution
The major versions of host monitor filesets and VM monitor filesets must match for successful heartbeat operation. Run the following commands to identify the fileset versions of VM monitor and host monitor:
ksysmgr query vm 
ksysmgr query vios
If the major versions of the host monitor and VM monitor filesets do not match, upgrade the filesets so that the major versions match.
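In addition to the ksysmgr query commands, you can list the installed KSYS-related filesets directly in a VM or on a VIOS (after running oem_setup_env) to compare the versions; the exact fileset names depend on your release:
# lslpp -l | grep -i ksys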

The KSYS node restarted during the automatic restart operation

Problem
The VM restart operation is interrupted by the KSYS node restart operation.
Solution
If the KSYS node restarts during the automatic restart operation of virtual machines, the move operation is added to the policy manager module automatically and is resumed from the point where the move operation was interrupted before the KSYS node reboot operation. The move operations of virtual machines whose restart operation is already complete are not added to the policy manager module. Check the /var/ct/<cluster_name>/log/mc/IBM.VMR/trace.ksys.* trace files for the details of these operations.

The SSP cluster is not created or updated successfully

Problem
The KSYS subsystem cannot create an SSP cluster, cannot add VIOS to an existing SSP cluster, or cannot collect required SSP information.
Solution
  • Re-create the host group to fix the initial SSP remote copy program (RCP).
  • If the SSP cluster is not created successfully, search the following text in the /var/ct/<cluster_name>/log/mc/IBM.VMR/trace.ksys.* trace files: Could not create single node cluster on VIOS. Based on your analysis, perform the following steps:
    1. If any values of the input variables for the create_ssp() API are missing or incorrect, the problem might be in the KSYS configuration settings. Check and update the KSYS configuration settings and rerun the discovery operation.
    2. Check the libkrest logs for the kriSubmitCreateSSP() KREST API by examining the return code and error message to identify whether the problem is from the HMC or the VIOS.
    3. The HMC might get overloaded with multiple retry requests from the KSYS subsystem. Therefore, if you receive a message that HMC is busy, wait for some time and then retry the operation.
    4. Run the cluster -create command on the VIOS to identify whether the VIOS has any problems creating the SSP cluster. For more information about running the cluster command, see the cluster command documentation in VIOS.
  • If the KSYS subsystem cannot add the VIOS to an existing SSP cluster, search the following text in the /var/ct/<cluster_name>/log/mc/IBM.VMR/trace.ksys.* trace files: Could not add VIOS: xxxx to cluster xxx. Based on your analysis, perform the following steps:
    1. If any values of the input variables for the add_SSP_node() API are missing or incorrect, the problem might be in the KSYS configuration settings. Check and update the KSYS configuration settings, and rerun the discovery operation.
    2. Check the kriSubmitaddSSPNode() API for the return code and error message to verify whether the problem is from the HMC or the VIOS. The KSYS subsystem uses the HMC REST APIs to handle the requests; therefore, the HMC waits to get the acknowledgment of job completion from the VIOS. An error in the first few retry operations does not necessarily mean that the request failed.
    3. Run the cluster -addnode command on the VIOS to identify whether the VIOS has any problems adding a node to the SSP cluster. For more information about running the cluster command, see the cluster command documentation in VIOS.
  • If the KSYS subsystem fails to collect all the required SSP information, one or more attributes of the SSP cluster might be missing the values that are collected during the discovery operation. Check whether the storage pools are in an operational state by running the cluster -status command in the VIOS. If any of the pools are not in an operational state, the KSYS subsystem fails to collect the required SSP data. Refer to the VIOS SSP documentation to fix the issue.
  • Re-run the discovery operation so that the SSP data can be updated in the registry.
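For example, you can check the state of the SSP cluster and its pools directly on a VIOS by running a command similar to the following as the padmin user; the cluster name KSYS_env30_ha_1 is taken from the sample output later in this section and is illustrative:
$ cluster -status -clustername KSYS_env30_ha_1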

The repository disk failed

Problem
You receive a repository disk event, REPOSITORY_DISK_FAILURE.
Solution 1
When a repository disk fails, you can manually replace the repository disk by running the ksysmgr modify host_group command with a new repository disk ID from the KSYS subsystem:
ksysmgr modify host_group <name> options [repo_disk=<ViodiskID>] 
Solution 2
When a repository disk fails, you can manually replace the repository disk by completing the following steps in the HMC GUI:
  1. Log in to the HMC GUI in a web browser as the hscroot user.
  2. Go to Resources > All Shared Storage Pool Clusters.
  3. Select your cluster and click the cluster name.
  4. Click Replace Disk.
  5. In the Replace Repository Disk panel, select one of the available free shared physical volumes as the new repository disk to replace the existing repository disk.
  6. Click UUID value to validate the complete UUID and the local hdisk name on each VIOS.
  7. Click OK to replace the repository disk. After the operation is complete, in the Shared Storage Pool Cluster window, click the repository disk UUID value to check whether it matches the selected new repository disk.
  8. Run the discovery operation to update the KSYS configuration settings by running the following command:
    /opt/IBM/ksys/ksysmgr -t discover host_group <HGName>
  9. After the discovery operation is complete, run the following command to verify whether the updated repository disk UUID in SSP remote copy matches the UUID in HMC:
    /opt/IBM/ksys/ksysmgr -v query host_group
    For example,
    (0) root @ ksys305: /var/ksys
    # /opt/IBM/ksys/ksysmgr -v q host_group
    Name:                HG1
    Hosts:               hk-8247-22L-2139F7A
                         ko-8284-22A-10FDC13
    Memory_capacity:     Priority Based Settings
                         high:100
                         medium:100
                         low:100
    CPU_capacity:        Priority Based Settings
                         high:100
                         medium:100
                         low:100
    Skip_power_on:       No
    HA_monitor:          enable
    Restart_policy:      auto
    VM_failure_detection_speed:    normal
    Host_failure_detection_time:   90
    
    SSP Cluster Attributes
    Sspname:             KSYS_env30_ha_1
    Sspstate:            UP
    Ssp_version:         VIOS 3.1.0.00 
    VIOS:                kov2
                         hkv2
                         hkv1
                         kov1
    Repo_disk:           01M0lCTTIxMDc5MDA2MDA1MDc2MzAzRkZEM0ZGMDAwMDAwMDAwMDAwMDQxNg==
    HA_disk:             01M0lCTTIxMDc5MDA2MDA1MDc2MzAzRkZEM0ZGMDAwMDAwMDAwMDAwMDQxNw==
    SspUuid:             d5e5d382-dc01-38e8-ab55-083ca3ffe826
    XSD_version:         4.00
    PoolName:            default_pool_1
    PoolUuid:            000000000A281695000000005BD73628
    

Virtual switch, trunk adapter, or Ethernet adapters are not created successfully

Problem
The KSYS subsystem cannot create or delete a virtual switch on a host, cannot create or delete trunk adapters on a VIOS, or cannot create or delete Ethernet adapters on VMs.
Solution
  1. Ensure that the KSYS network configuration is set properly by performing the following steps:
    1. Each managed host must contain a switch for the KSYS configuration that is created during the discovery operation. Check the switch UUID that is stored in the KSYS registry by running the lsrsrc IBM.VMR_CEC SwitchUUID command.
    2. Each VIOS must contain two trunk adapters with VLAN IDs 101 and 102 in the associated host. Check the ActiveHM fields for the VIOS by running the lsrsrc IBM.VMR_VIOS ActiveHM AdapterMACAddress AdapterSlot command. Other virtual I/O servers show the ActiveHM field as blank.
    3. Each managed virtual machine must contain two virtual adapters. Check the MAC addresses and UUIDs that are saved in the registry by running the lsrsrc IBM.VMR_LPAR AdapterMACAddress1 AdapterUUID1 AdapterMACAddress2 AdapterUUID2 command.
  2. Rerun the discovery operation. If the required KSYS network is still not configured, manual intervention might be required.
  3. If you find the following errors in the trace.ksys.* trace files:
    “Dynamic add of virtual I/O resources failed”
    “The specified slot contains a device or devices that are currently configured.”
    “This virtual switch cannot be deleted since the following virtual networks are using this virtual switch…”
    Investigate these errors at the HMC level and resolve them as specified in the error message.
  4. Access the HMC and run the chhwres command to create or delete switches and adapters. After you verify the configuration on the HMC, run the discovery operation to update the KSYS configuration settings.
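For example, before you correct the configuration with the chhwres command, you can list the virtual switches that exist on a managed system by running a command similar to the following from the HMC command line; the managed system name Host_A is illustrative:
lshwres -r virtualio -m Host_A --rsubtype vswitch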

The host or VM failures are not detected correctly

Problem
The failure detection engine (FDE) module of the KSYS subsystem cannot connect to the VIOS or is not working correctly.
Solution
The Failure Detection Engine (FDE) module is a KSYS module that detects the health of a virtual machine and initiates the relocation request if the VM must be moved. The most common reasons for monitoring failures by the FDE module are as follows:
  • The HA monitoring at the site or system level is disabled.
  • The HA monitoring at the VM level is disabled.
  • The KSYS daemon might be repeatedly restarting.
Diagnose the issue by performing the following steps:
  1. Check whether the FDE module is active by performing the following steps:
    1. Check whether the HA monitoring is enabled by searching for the Trace Started - Pid string in the /var/ct/<cluster_name>/log/mc/IBM.VMR/trace.fde.* trace files. For example:
       [00] 06/12/18 ____ 14:14:29.274292  ******************* Trace Started - Pid = 6553828 ********************** 
      [00] 06/12/18 _VMR 14:15:20.910220 DEBUG FDEthread.C[125]: Monitoring Enabled.
    2. Review the HAmonitor attribute settings by using the lsrsrc command in the KSYS node as follows:
      1. Check the persistent values that are specific to the HA monitoring and are saved in the VMR_SITE class by running the following command:
        # lsrsrc -c IBM.VMR_SITE HAmonitor hostFDT VMFDS VMthreshold FDEpollInt
        Resource Class Persistent Attributes for IBM.VMR_SITE
        resource 1:
                HAmonitor   = "Enabled"
                hostFDT     = 90
                VMFDS       = "Normal"
                VMthreshold = 50
                FDEpollInt  = 20
        The high-availability monitoring is enabled at a global level based on the HAmonitor value.
      2. Check the persistent values that are specific to the HA monitoring and are saved in the VMR_HG (host group) class by running the following command:
        # lsrsrc IBM.VMR_HG Name HAmonitor hostFDT
        Resource Persistent Attributes for IBM.VMR_HG
        resource 1:
                Name      = "HG1"
                HAmonitor = "Enabled"
                hostFDT   = 0
        If the failure detection time for host (hostFDT value) is 0, the value specified at the site level is used.
      3. Check the persistent values that are specific to the HA monitoring and are saved in the VMR_CEC (host) class by running the following command:
        # lsrsrc IBM.VMR_CEC HAmonitor VMFDS
        Resource Persistent Attributes for IBM.VMR_CEC
        resource 1:
                HAmonitor = "Enabled"
                VMFDS     = ""
      4. Check the persistent values that are specific to the HA monitoring and are saved in the VMR_VIOS class by running the following command:
        # lsrsrc IBM.VMR_VIOS Name ViosUuid CecUuid ActiveHM CAAstate HMrespSec HMresponsive CAAstateReason MonitorMode SlowIO
        Resource Persistent Attributes for IBM.VMR_VIOS
        resource 1:
                Name           = "lasagnav1"
                ViosUuid       = "7B5099B5-019E-443F-98EB-E04A680D6DA6"
                CecUuid        = "ac420f59-cdbf-3ab8-b523-37f66d461741"
                ActiveHM       = 102
                CAAstate       = "UP"
                HMrespSec      = 1
                HMresponsive   = "Yes"
                CAAstateReason = ""
                MonitorMode    = "GLOBAL"
                SlowIO         = 0
        If a VIOS is running in a LOCAL mode, the MonitorMode field is set to LOCAL. If the host monitor is not operating correctly, the MonitorMode field is set to DOWN.
      5. Check the persistent values that are specific to the HA monitoring and are saved in the VMR_LPAR class by running the following command:
        lsrsrc IBM.VMR_LPAR Name LparUuid CecUuid HMstate HAmonitor 
             HBmissed HMstateHM1 HBmissedHM1 HMstateHM2 HBmissedHM2 notAvailHM1 notAvailHM2
        ...
        resource 8:
                Name        = "romano001"
                LparUuid    = "2C55D2BB-1C50-49F1-B1A3-5C952E7070C7"
                CecUuid     = "caffee0a-4206-3ee7-bfc2-f9d2bd3e866f"
                HMstate     = "STARTED"
                HAmonitor   = "Enabled"
                HBmissed    = 0
                HMstateHM1  = ""
                HBmissedHM1 = 0
                HMstateHM2  = ""
                HBmissedHM2 = 0
                notAvailHM1 = 0
                notAvailHM2 = 0
        The variables HMstateHM1, HBmissedHM1, HMstateHM2, HBmissedHM2, notAvailHM1, and notAvailHM2 are applicable only for the LOCAL mode. The HMstateHM1, HMstateHM2, HBmissedHM1, and HBmissedHM2 variables store the state of the VM as observed from VIOS1 and VIOS2. If the notAvailHM1 and notAvailHM2 variables are set to 1, it implies that no data was available for this VM from the VIOS.
  2. Check whether the FDE module is requesting health information from the VIOS and whether it obtained data from the VIOS by performing the following steps:
    1. Identify the VIOS that is associated with the request. For example:
      [00] 06/12/18 _VMR 14:15:20.910261 DEBUG FDEthread.C[190]: Use VIOS 1F5D7FFC-34BD-45B6-BD4F-101512D9BD2A for polling
    2. Check whether a REST request was initiated. For example:
      [00] 06/12/18 _VMR 14:15:21.096723 DEBUG VMR_HMC.C[6728]: 
      getQuickQuery: Calling kriSubmitQuickQuery!. HMC:9.3.18.186, 
      viosUuid: 1F5D7FFC-34BD-45B6-BD4F-101512D9BD2A
      [00] 06/12/18 _VMR 14:16:04.250468 DEBUG VMR_HMC.C[6617]:
      getNeedAttn: Calling kriSubmitNeedAttn!. HMC:9.3.18.186, 
      viosUuid: 1F5D7FFC-34BD-45B6-BD4F-101512D9BD2A
    3. Check whether the REST request was successful. For example:
      [00] 06/12/18 _VMR 14:16:05.537662 DEBUG VMR_HG.C[10768]: 
      FDE doNeedAttn success GLOBAL_DATA
    4. Determine the VIOS health packet content. For example:
      [00] 06/12/18 _VMR 14:16:05.537635 DEBUG VMR_HMC.C[6666]: JobOutput
      [00] 06/12/18 _VMR 14:16:05.537635 DEBUG <VIO><Response>
      … XML nodes here with data inside the Response node …
  3. Identify the actions taken by the FDE module, if any, by performing the following steps:
    1. Search for the string Task added to check whether the FDE module has passed the tasks to other components. For example:
       [00] 06/13/18 _VMR 13:18:02.631206 DEBUG needAttn.C[918]: 
      RESYNC HM TASK ADDED: vios 1f5d7ffc-34bd-45b6-bd4f-101512d9bd2a
      If the FDE module passed the task, the task is added to the KSYS queue. The trace.ksys.* trace files might contain further details.
    2. Check whether a move operation is initiated by searching for the RECOVERY TASK ADDED for LPAR string. If you cannot find this string, the VM has not met the criteria for a move operation, for example, the threshold for missed heartbeats has not been reached:
      [15] 06/11/18 _VMR 12:34:08.266355 DEBUG VMR_LPAR.C[14541]: 
      ssetHBmissed 46 for romano001: 2C55D2BB-1C50-49F1-B1A3-5C952E7070C7
    3. Check whether the FDE module enabled the local mode. For example:
      [06] 06/11/18 _VMR 09:18:48.906817 DEBUG FDEthread.C[209]: 
          Did not find a VIOS - Going into local database mode
      [06] 06/11/18 _VMR 09:18:48.906874 DEBUG FDEthread.C[679]: 
          Use VIOS 6F97A18C-3738-4DE6-901A-96A338A3BA80 for local DB VLANID 101
      [06] 06/11/18 _VMR 09:18:48.907018 DEBUG FDEthread.C[679]: 
          Use VIOS 50C3E089-2254-4322-9B98-57038A701813 for local DB VLANID 102
      [06] 06/11/18 _VMR 09:18:48.907065 DEBUG VMR_HG.C[10841]: 
          FDE performing doNeedAttn LOCAL_DATA
      In the global mode, the request is sent to the VIOS and the FDE module waits for a response. The response is parsed and the FDE module either takes action or moves the task to the KSYS subsystem. The local mode provides information about when a heartbeat was missed.
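For example, you can quickly locate the trace files that report the monitoring status or a queued recovery task by running commands similar to the following on the KSYS node; the search strings are the ones referenced in the previous steps:
# grep -l "Monitoring Enabled" /var/ct/<cluster_name>/log/mc/IBM.VMR/trace.fde.*
# grep -l "RECOVERY TASK ADDED" /var/ct/<cluster_name>/log/mc/IBM.VMR/trace.fde.*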

You cannot restore a previously backed up KSYS configuration snapshot

Problem
When you attempt to restore a previously backed up KSYS configuration snapshot, you receive error messages indicating that the restore operation is not successful.
Solution
  1. When you receive error messages during the snapshot operations, search the /var/ksys/log/ksysmgr.log file for text such as add snapshot, removing old configuration, creating new cluster, creating HMC, host, host group, and so on, to find the cause of the error.
  2. Ensure that the existing KSYS node is not in a corrupted state. If the KSYS node is corrupted, reinstall all the KSYS filesets.
  3. Query the snapshots to check whether all resource attribute values are set correctly by using the following command:
    ksysmgr query snapshot filepath=filename
    The default location for a saved snapshot is /var/ksys/snapshots/.
  4. If you receive a host group creation error, one of the HA disks (ha_disk) and repository disks (repo_disk) might not be available. In this case, check and resolve the disk availability.
  5. If you receive error messages about cluster type, check whether you have set the type of the cluster. After a cluster is created and the IBM.VMR daemon is started, set the ClusterType persistent attribute for the IBM.VMR_SITE class by running the following command:
    chrsrc -c IBM.VMR_SITE 'ClusterType="HA|DR"'
  6. Ensure that the IBM.VMR daemon is in the active state. If not, reinstall the daemon.
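For example, before you retry the restore operation, you can confirm that the IBM.VMR daemon is active and review the contents of a saved snapshot by running commands similar to the following; the snapshot file name is illustrative and the default snapshot directory is the one mentioned in step 3:
# lssrc -s IBM.VMR
# ksysmgr query snapshot filepath=/var/ksys/snapshots/<snapshot_file>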

The failed application does not move to a stable state after a restart operation

Problem
The VM agent subsystem cannot restart the failed application successfully.
Solution
  1. Run the ksysvmmgr query app <NAME> command to check the state and UUID of an application. The application is in one of the following states:
    UNSET
    State of an application when the application monitoring starts, but its status is not set.
    TO_START
    State of an application when the application monitoring has failed. The application is successfully stopped and must be started.
    NORMAL
    State of an application when the application is monitored properly.
    NOT_MONITORED
    State of an application when the application is not monitored because the daemon is not started or because the application monitoring is suspended.
    FAILING
    State of an application when the application is receiving monitor script errors. The application has not yet failed because the number of successive failures that triggers a restart operation has not been reached.
    TO_STOP
    State of an application when the application monitoring has failed, the threshold frequency of application monitoring has been exceeded, and the application has failed and must be restarted (first stopped, then restarted).
    NOT_STOPPABLE
    State of an application when the application cannot be stopped. Although the stop script is run, the stop operation fails continuously or times out.
    NOT_STARTABLE
    State of an application when the application cannot be started. Although the start script is run, the start operation fails continuously.
    ABNORMAL
    State of an application when an abnormal condition occurs during the monitoring, stopping, or starting operations. For example, the monitor, stop, or start scripts are not found or cannot be run.
    FAILURE
    State of an application when the application can be restarted but remains in the failure state even after successful restart operations.
  2. Search the UUID of the associated application in the /var/ksys/log/ksys_vmm.log file to get more information about the application failure such as heartbeat requests, VM removal requests, and application reporting.
  3. Ensure that you have provided sufficient inputs to the application agents. The VM agent supports the following application agents:
    ORACLE
    1. Ensure that you provide the correct instance name and oracle database name to the Oracle agent scripts. For example: oracle (instance name) and DBRESP (database name).
    2. Ensure that you specify the correct listener.ora file in the ORACLE_HOME/TNS_ADMIN location for the listener processes to work.
    3. Ensure that the specified start, stop, and monitor scripts are working correctly with the database.
    4. Analyze the /var/ksys/log/agents/oracle_agent/<logfilename> file to diagnose the agent script failures. These log files contain information about any missing attribute or parameter.
    DB2
    1. Ensure that you provide the correct DB2 instance owner name to the DB2 agent scripts. For example: db2inst1 (instance owner).
    2. Ensure that you create the DB2 database before running any script. The scripts monitor the created database for the instance owner.
    3. Analyze the /var/ksys/log/agents/db2_agent/<logfilename> file to diagnose the agent script failures. These log files contain information about any missing attribute or parameter.
    SAPHANA
    1. Ensure that you provide the correct instance name and the database number to SAP HANA agent scripts. For example: S01 (instance name) and HDB01 (database number).
    2. Ensure that you specify the application version, instance name, and database number while adding the application. Otherwise, the application version field remains empty.
    3. Analyze the log files in the /var/ksys/log/agents/saphana/ directory to diagnose the agent script failures. These log files contain information about any missing attribute or parameter.
  4. Ensure that you have marked the application as critical by using the ksysvmmgr modify app <NAME> critical=yes command. The KSYS subsystem restarts a failed application only when you mark the application as critical. When a critical application in a VM reports a permanent failure state, diagnose the issue in the VM by checking the ksys_vmm.log file. When a non-critical application fails, the KSYS subsystem flags this application as failed and notifies you to take further action.
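For example, to review the state of an application that is registered with the VM agent and to mark it as critical (the application name app1 is illustrative), run the following commands in the VM:
# ksysvmmgr query app app1
# ksysvmmgr modify app app1 critical=yes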

Core dump error in the POSTGRES database

Problem
The POSTGRES database generates a core dump error on a VIOS node.
Solution
Verify whether the POSTGRES database is running on all the VIOS nodes. If the database is not running, run the following command to restart the POSTGRES database:
start - vdba -cm -start

You cannot log in to the VM Recovery Manager HA GUI

Problem
You cannot log in to the VM Recovery Manager HA GUI.
Solution
  1. Check for issues in the /opt/IBM/ksys/ui/server/logs/uiserver.log file.
  2. If you received an error message, Permission missing on Smuiauth: login will not be done, verify that the smuiauth command is installed correctly. Also, verify that the smuiauth command has the correct permissions by running the ls -l command on the /opt/IBM/ksys/ui/server/lib/auth/smuiauth file. An example output follows:
    -r-x------    1 root     system        21183 Jun 11      21:48
  3. Verify that you can run the smuiauth command successfully by running the command along with the -h flag.
  4. Verify that the pluggable authentication module (PAM) framework is configured correctly by locating the following lines in the /etc/pam.conf file:
    smuiauth        auth       required     pam_aix
    smuiauth        account    required     pam_aix
    The PAM is configured when you install the ksys.ui.server fileset.
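For example, you can combine the checks in steps 2 and 3 by running the following commands on the GUI server; the path is the one shown in step 2:
# ls -l /opt/IBM/ksys/ui/server/lib/auth/smuiauth
# /opt/IBM/ksys/ui/server/lib/auth/smuiauth -h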

You cannot register a KSYS node in the VM Recovery Manager HA GUI server

Problem
You cannot register a KSYS node in the VM Recovery Manager HA GUI server.
Solution
  1. Check for issues in the /opt/IBM/ksys/ui/server/logs/uiserver.log file by performing the following steps:
    1. If SSH File Transfer Protocol (SFTP)-related signatures exist in the log file, such as Received exit code 127 while establishing SFTP session, a problem exists with the SSH communication between the VM Recovery Manager HA GUI server and the KSYS node that you are trying to add.
    2. From the command line, verify that you can connect to the target system by using SFTP. If you cannot connect, verify that the sshd daemon is running on the GUI server and the target node by running the ps -ef | grep -w sshd | grep -v grep command.
    3. Check the SFTP subsystem configuration in the /etc/ssh/sshd_config file and verify that the following path is correct.
      Subsystem       sftp    /usr/sbin/sftp-server
      If the path is not correct, you must enter the correct path in the /etc/ssh/sshd_config file, and then restart the sshd subsystem.
  2. Check for issues in the /opt/IBM/ksys/ui/agent/logs/agent_deploy.log file on the target cluster.
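For example, you can verify the SSH and SFTP prerequisites from the GUI server by running commands similar to the following; the KSYS node host name ksysnode1 is illustrative:
# ps -ef | grep -w sshd | grep -v grep
# grep -i "^Subsystem" /etc/ssh/sshd_config
# sftp root@ksysnode1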

Node server down: GUI fails to start

Problem
The VM Recovery Manager HA GUI server is not working correctly.
Solution
If the applications are not running correctly, the node server status might be causing the issue. Run the ps -ef | grep node command to check the status and run the startsrc -s vmruiserver command to start the node server.

Unplanned system reboot causes fallover attempt to start GUI

Problem
You cannot access the VM Recovery Manager HA GUI because of an unplanned GUI server or GUI agent node reboot operation.
Solution
During the system node reboot operation, you cannot access the GUI. Run the lssrc -s vmruiserver command to check the status of the vmruiserver subsystem.
#lssrc -s vmruiserver
Subsystem       Group     PID     Status
vmruiserver     vmrui             inoperative
If the status of the vmruiserver subsystem is displayed as inoperative, run the startsrc -s vmruiserver command to restart the UI server node from the command line. You can then access the GUI and register the agent nodes again.

Unsuccessful Deployment: Dependency file missing during installation

Problem
A dependency file is missing during the installation of the GUI server and the GUI agent filesets.
Solution
Determine the missing file from the log files that you received by using the installp -e flag and install that dependency file from a certified host.
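For example, a reinstallation command that writes an installation log you can inspect might look like the following; the image directory, log file name, and fileset name are illustrative:
# installp -acgXd /tmp/vmr_images -e /tmp/install.log ksys.ui.server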

You cannot stop or start the GUI server and GUI agent processes

Problem
You cannot stop or start the GUI server and agent processes.
Solution
  • GUI server: Stop the GUI server by running the following command: stopsrc -s vmruiserver.

    Restart the GUI server by running the following command: startsrc -s vmruiserver. If you are starting the GUI server for the first time after installing the GUI server, run the vmruiinst.ksh command. For information about running this command, see Installing GUI server filesets.

  • GUI agent: Stop the GUI agent process by running the following command in the guest VM: stopsrc -s vmruiagent. This command unregisters the KSYS node from the GUI server and the KSYS node will no longer be accessible from the GUI server.

    Restart the GUI agent by running the following command: startsrc -s vmruiagent. This command registers the KSYS node again.

The database node (DBN) lost network connectivity or lost access to the pool of disks

Problem
The database node (DBN) of the SSP cluster loses network connectivity or loses access to the pool of disks for a long time.
Solution
When the DBN loses network connectivity or loses access to the pool of disks for a long time, all Virtual I/O Servers operate in the local mode.

Postgres memory dumps and unable to access the VIOS database

Problem
The Postgres database generates memory core dumps and you cannot access the VIOS database. When you query the VIOS cluster, the following error message is displayed:
Unable to connect to Database
The issue occurs when the Postgres database does not have the required storage capacity during a critical write period. The Postgres database cannot restart without a memory core dump.
Solution
  1. Log in through the HMC console to a VIOS that is getting core dumps.
  2. To go to the root directory, run the following command:
    $ oem_setup_env
  3. To find the database node (DBN), run the following command from any Virtual I/O Server (VIOS) that is part of the cluster:
    # clcmd ls -l /var/vio/SSP/<clustername>/D_E_F_A_U_L_T*/VIOSCFG
  4. To find the Postgres version, run the following command from the database node:
    # cat /var/vio/SSP/<clustername>/D_E_F_A_U_L_T*/VIOSCFG/DB/PG/PG_VERSION
    The output is either 13 or 10.
    • If you get 13 as output, run the following command:
      # cd /usr/ios/db/postgres13/bin
    • If you get 10 as output, run the following command:
      # cd /usr/ios/db/postgres10/bin
  5. To switch to the database administrator user (vpgadmin), run the following command:
    # su vpgadmin
  6. To reset the write-ahead log of the database, run the following command (use the postgres10 path if your Postgres version is 10):
    $ /usr/ios/db/postgres13/bin/pg_resetwal -f /var/vio/SSP/<clustername>/D_E_F_A_U_L_T_061310/VIOSCFG/DB/PG