Solving common problems
This section describes the solutions to some problems that you might encounter when you use the VM Recovery Manager HA solution.
- The discovery operation for a host group, host, or VM failed
- The verification operation for a host group, host, or VM failed
- The flexible capacity values cannot be calculated correctly
- The LPM verification or LPM operation failed
- The restarted VM does not move to a stable state
- The KSYS subsystem cannot find suitable target hosts
- Restart operation failed because of a version mismatch
- The KSYS node restarted during the automatic restart operation
- The SSP cluster is not created or updated successfully
- The repository disk failed
- Virtual switch, trunk adapter, or Ethernet adapters are not created successfully
- The host or VM failures are not detected correctly
- You cannot restore a previously backed up KSYS configuration snapshot
- The failed application does not move to a stable state after a restart operation
- You cannot log in to the VM Recovery Manager HA GUI
- You cannot register a KSYS node in the VM Recovery Manager HA GUI server
- Node server down: GUI fails to start
- Unplanned system reboot causes fallover attempt to start GUI
- Unsuccessful Deployment: Dependency file missing during installation
- You cannot stop or start the GUI server and GUI agent processes
The discovery operation for a host group, host, or VM failed
- Problem
- The discovery operation failed or a VM is not discovered by the KSYS subsystem during the discovery operation.
- Solution
-
- Ensure that you have completed all the prerequisites that are specified in the Requirements section and the configuration steps that are specified in the Configuring section.
- Check whether the ha_monitor attribute is enabled or disabled at the site, host group, host, and VM levels by using the following commands:
lsrsrc -c IBM.VMR_SITE HAmonitor
lsrsrc IBM.VMR_HG Name HAmonitor
lsrsrc IBM.VMR_CEC Name HAmonitor
lsrsrc IBM.VMR_LPAR Name HAmonitor
- If the HAmonitor field shows Disabled or is not set, enable the ha_monitor attribute by using one of the following commands:
chrsrc -c IBM.VMR_SITE HAmonitor="Enabled"
ksysmgr modify system|hg|vm [<name>] ha_monitor=enable
If the ha_monitor attribute is not enabled at the VM level, host group level, or system level, the VM is not considered for the discovery operation.
- Ensure that you have started the VM monitor daemon by running the ksysvmmgr start vmm command. The VM agent can send heartbeats to the host monitor only when you start the VM monitor daemon.
- Ensure that you have set all the HMC options that are specified in the Requirements section.
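The HAmonitor checks in the steps above can be scripted. The following sketch scans captured lsrsrc-style output for disabled monitoring; the sample records are hypothetical, and on a real KSYS node you would pipe the output of the lsrsrc commands instead.

```shell
# Sketch: flag resources whose HA monitoring is disabled or unset in
# captured 'lsrsrc' output. The sample below is hypothetical; on a KSYS
# node, pipe real output, for example:
#   lsrsrc -c IBM.VMR_SITE HAmonitor | awk ...
sample='resource 1:
        HAmonitor = "Disabled"
resource 2:
        HAmonitor = "Enabled"'

echo "$sample" | awk -F'"' '/HAmonitor/ {
    if ($2 != "Enabled") disabled++
} END {
    if (disabled > 0)
        print disabled " resource(s) have HA monitoring disabled or unset"
    else
        print "HA monitoring is enabled everywhere"
}'
# prints: 1 resource(s) have HA monitoring disabled or unset
```

Any resource reported here would then be enabled with the chrsrc or ksysmgr modify commands shown above.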
The verification operation for a host group, host, or VM failed
- Problem
- The Phase or PhaseDetail attribute of a VM indicates that the verification operation failed.
- Solution
- During the verification operation, you can troubleshoot the possible issues as described in the
following steps:
- Analyze the verification operation flow in the /var/ct/<cluster_name>/log/mc/IBM.VMR/trace.ksys.* trace files and check the Phase, PhaseDetail, and HAState fields during each operation.
- If any of the remote copy programs (RCP) for the host group, host, or LPAR is deleted or does not exist, re-create the host group and add the hosts by using the ksysmgr command. Run the discovery operation and the verification operation to check whether the issue is resolved.
- If the verification lock is acquired by some other process, the verification process might fail. Check the trace.ksys.* trace files to ensure that verification locks are not acquired by other threads before you start the verification operation.
- The verification process for a VM cannot start if 32 threads are already running. Ensure that a sufficient number of free threads is available for the VM tasks to complete in the specified time.
- Check the /var/ct/<cluster_name>/log/mc/IBM.VMR/trace.krestlong.* trace files to identify whether the LPM operation is successful. Rerun the verification operation, if required. Also, check whether the previous Phase and PhaseDetail fields are cleared before you start the LPM operation.
The flexible capacity values cannot be calculated correctly
- Problem
- The verification operation failed because of an error in the calculated flexible capacity values.
- Solution
-
- If any of the remote copy programs (RCP) for host group, host, or LPAR is deleted or does not exist, the verification thread might fail. Recreate the host group and run the discovery operation.
- If the calculated flexible capacity values are not correct, consider the following solutions:
- Review the /var/ct/<cluster_name>/log/mc/IBM.VMR/trace.ksys.* trace files to check whether the policy map is created correctly by the policy manager of the target host and check the logical memory block (LMB) size that is calculated by the target host in the trace files.
- Ensure that the specified capacity is between the minimum and maximum capacity for the VM; otherwise, the VM moves to the target host with the same capacity.
- Ensure that the flexible capacity table is set properly in the host group by using the following commands:
# lsrsrc -s 'Name="hg_name"' IBM.VMR_HG FlexProcCapacityTable
# lsrsrc -s 'Name="hg_name"' IBM.VMR_HG FlexMemCapacityTable
# chrsrc -s 'Name="hg_name"' IBM.VMR_HG FlexMemCapacityTable={"100","70","50"}
# chrsrc -s 'Name="hg_name"' IBM.VMR_HG FlexProcCapacityTable={"100","70","50"}
# lsrsrc -s 'Name="hg_name"' IBM.VMR_HG Priority
# chrsrc -s 'Name="hg_name"' IBM.VMR_HG Priority="Medium"
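The flexible-capacity rule above can be illustrated with simple arithmetic: a table entry is a percentage applied to the VM's desired value, and the result must stay within the profile minimum and maximum. All numbers in this sketch are hypothetical.

```shell
# Sketch: verify that a flexible-capacity percentage applied to a VM's
# desired memory stays within its profile minimum/maximum.
# All values below are hypothetical examples.
min_mem=2048       # profile minimum (MB)
desired_mem=8192   # profile desired (MB)
max_mem=16384      # profile maximum (MB)
flex_pct=70        # entry from FlexMemCapacityTable, e.g. "70"

target=$(( desired_mem * flex_pct / 100 ))
if [ "$target" -lt "$min_mem" ] || [ "$target" -gt "$max_mem" ]; then
    echo "adjusted value $target MB is outside [$min_mem, $max_mem]; VM keeps its full capacity"
else
    echo "VM can restart with $target MB"
fi
# prints: VM can restart with 5734 MB
```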
The LPM verification or LPM operation failed
- Problem
- The VM cannot be moved to another host while performing the Live Partition Mobility (LPM) operation.
- Solution
- If the VM move operation failed while performing the LPM operation on the VMs from the KSYS
subsystem or while restarting the VMs in another host, you must check the events, the VM state, and
the log files to diagnose and resolve the issue. Use the following troubleshooting steps to diagnose
the issue:
- If you received an event notification, check the event details in the /var/ksys/events.log file and review the suggested action.
- Identify the reason for the move operation (LPM or restart) failure in the /var/ksys/log/ksysmgr.log file.
- Ensure that the RMC connection between the HMC and the VM is working. If the VM contains any firewall service, ensure that the RMC connection is allowed by the firewall. You can use the HMC diagrmc command to verify and correct the RMC connection.
- If the move operation failed because of policy conflicts (such as collocation policy and anti-collocation policy), resolve the conflicts and run the move operation again. If the move operation failed because of insufficient capacity resources in the target host, increase the target host capacity and retry the move operation.
- If the LPM or restart operation failed and the VM exists on both the source and target hosts, recover the VM by using the following command:
ksysmgr [-f] lpm vm vmname1 action=recover
- Run the discovery and verification operations after each LPM operation to update the LPM validation state.
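The event check in step 1 can be scripted as a filter over the events log. The log lines and the VM_MOVE_FAILED event name below are hypothetical examples of the format; substitute the actual entries from /var/ksys/events.log.

```shell
# Sketch: pull failed-move events from a captured events.log extract and
# print the matching recovery command. The log format and event names
# here are hypothetical.
sample='2024-01-10 10:01:02 VM_MOVE_FAILED vm1 source=host1 target=host2
2024-01-10 10:05:10 VM_MOVE_COMPLETED vm2 source=host1 target=host2'

echo "$sample" | grep 'VM_MOVE_FAILED' | while read -r date time event vm rest; do
    echo "move failed for $vm at $date $time; run: ksysmgr lpm vm $vm action=recover"
done
# prints: move failed for vm1 at 2024-01-10 10:01:02; run: ksysmgr lpm vm vm1 action=recover
```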
The restarted VM does not move to a stable state
- Problem
- When you restart a failed VM on another host, the VM does not move to a stable state.
- Solution
- The LPAR profile on the source host is deleted after the virtual machine is restarted successfully on the target host. However, if the virtual machine does not move to a proper state, perform the following steps:
- Restart the virtual machines on the target host by running the following command:
ksysmgr [-f] restart vm|host [vmname1[,vmname2,...]]
- If the restart operation fails, recover the virtual machine on the host where it is currently located by running the following command:
ksysmgr [-f] recover vm vmname
- If the output of the restart command indicates cleanup errors, run the cleanup command manually to clean up the VM details on the source host by running the following command:
ksysmgr cleanup vm vmname host=hostname
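The escalation above (restart, then recover in place, then manual cleanup) can be sketched as shell control flow. The ksysmgr command is mocked as a shell function here so the flow can run without a live KSYS node; every mocked call "fails" to show the full escalation path.

```shell
# Sketch of the restart -> recover -> cleanup escalation described above.
# 'ksysmgr' is a mock function, not the real command; replace it with the
# actual /opt/IBM/ksys/ksysmgr binary in practice.
ksysmgr() { echo "ksysmgr $*"; return 1; }   # mock: every call fails

vm=vmname1
if ! ksysmgr restart vm "$vm"; then
    echo "restart failed, trying in-place recovery"
    if ! ksysmgr recover vm "$vm"; then
        echo "recovery failed, cleaning up source-host state"
        ksysmgr cleanup vm "$vm" host=hostname
    fi
fi
echo "escalation finished"
```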
The KSYS subsystem cannot find suitable target hosts
- Problem
- The failed VMs cannot be restarted on other hosts successfully because the policy manager module cannot find suitable target hosts.
- Solution
- Whenever a VM or a host failure is identified by the KSYS subsystem, the policy manager module
finds the best fit target host where the VM can be migrated based on various policies such as
collocation, anti-collocation, affinity, priority, and blacklist. Sometimes, the policy manager
module cannot find the target host for a VM because of policy conflict or resource check failures.
Perform the following steps to troubleshoot this error:
- Search for the string Policymap VmtoCECMapTable in the /var/ct/<cluster_name>/log/mc/IBM.VMR/trace.user.* trace file to find the VM-host mapping table.
- Review the miscellaneous column of the VM-host mapping table to check the policies that have conflicts. Resolve the conflicts, run the discovery and verification operations, and then retry the recovery operation.
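The table lookup above can be automated with a small awk filter: find the Policymap VmtoCECMapTable marker, then report rows that mention a conflict. The trace excerpt and its column layout below are hypothetical.

```shell
# Sketch: locate the VM-to-host mapping table in a trace extract and show
# rows whose miscellaneous column reports a conflict. The trace content
# below is a hypothetical example of the layout.
sample='... Policymap VmtoCECMapTable ...
vm1  host2  OK
vm2  -      collocation conflict with vm5'

echo "$sample" | awk '/Policymap VmtoCECMapTable/ {found=1; next}
found && /conflict/ {print "conflict: " $0}'
# prints: conflict: vm2  -      collocation conflict with vm5
```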
Restart operation failed because of a version mismatch
- Problem
- The failed VMs cannot be restarted on other hosts successfully because of version mismatch.
- Solution
- The major versions of the host monitor filesets and VM monitor filesets must match for the heartbeat operation to succeed. Run the following commands to identify the fileset versions of the VM monitor and host monitor:
ksysmgr query vm
ksysmgr query vios
If the host monitor and VM monitor major versions do not match, upgrade the older fileset so that the major versions match.
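The major-version comparison can be done directly in the shell with parameter expansion. The version strings below are hypothetical examples of what the query commands might report.

```shell
# Sketch: compare the major versions of the host monitor and VM monitor
# filesets. Both version strings are hypothetical examples.
hm_version="2.0.0.0"    # e.g. from 'ksysmgr query vios'
vmm_version="1.5.0.2"   # e.g. from 'ksysmgr query vm'

hm_major=${hm_version%%.*}
vmm_major=${vmm_version%%.*}

if [ "$hm_major" = "$vmm_major" ]; then
    echo "major versions match ($hm_major); heartbeat is compatible"
else
    echo "major version mismatch: host monitor $hm_major vs VM monitor $vmm_major"
fi
# prints: major version mismatch: host monitor 2 vs VM monitor 1
```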
The KSYS node restarted during the automatic restart operation
- Problem
- The VM restart operation is interrupted by the KSYS node restart operation.
- Solution
- If the KSYS node restarts during the automatic restart operation of virtual machines, the move operation is added to the policy manager module automatically and resumes from the point where it was interrupted before the KSYS node was rebooted. Move operations for virtual machines whose restart operation is already complete are not added to the policy manager module. Check the /var/ct/<cluster_name>/log/mc/IBM.VMR/trace.ksys.* trace files for the details of these operations.
The SSP cluster is not created or updated successfully
- Problem
- The KSYS subsystem cannot create an SSP cluster, cannot add VIOS to an existing SSP cluster, or cannot collect required SSP information.
- Solution
-
- Re-create the host group to fix the initial SSP remote copy program (RCP).
- If the SSP cluster is not created successfully, search for the following text in the /var/ct/<cluster_name>/log/mc/IBM.VMR/trace.ksys.* trace files: Could not create single node cluster on VIOS. Based on your analysis, perform the following steps:
- If any values of the input variables for the create_ssp() API are missing or incorrect, the problem might be in the KSYS configuration settings. Check and update the KSYS configuration settings and rerun the discovery operation.
- Check the libkrest logs for the kriSubmitCreateSSP() KREST API by examining the return code and error message to identify whether the problem is from the HMC or the VIOS.
- The HMC might get overloaded with multiple retry requests from the KSYS subsystem. Therefore, if you receive a message that HMC is busy, wait for some time and then retry the operation.
- Run the cluster -create command on the VIOS to identify whether the VIOS has any problems creating the SSP cluster. For more information, see the cluster command documentation in VIOS.
- If the KSYS subsystem cannot add the VIOS to an existing SSP cluster, search for the following text in the /var/ct/<cluster_name>/log/mc/IBM.VMR/trace.ksys.* trace files: Could not add VIOS: xxxx to cluster xxx. Based on your analysis, perform the following steps:
- If any values of the input variables for the add_SSP_node() API are missing or incorrect, the problem might be in the KSYS configuration settings. Check and update the KSYS configuration settings, and rerun the discovery operation.
- Check the kriSubmitaddSSPNode() API return code and error message to verify whether the problem is from the HMC or the VIOS. The KSYS subsystem uses the HMC REST APIs to handle the requests; therefore, the HMC waits for an acknowledgment of job completion from the VIOS. An error in the first few retry operations does not necessarily mean that the request failed.
- Run the cluster -addnode command on the VIOS to identify whether the VIOS has any problems adding a node to the SSP cluster. For more information, see the cluster command documentation in VIOS.
- If the KSYS subsystem fails to collect all the required SSP information, one or more attributes of the SSP cluster might be missing their values that are collected during the discovery operation. Check whether the storage pools are in operational state by running the cluster -status command in the VIOS. If any of the pools are not in operational state, the KSYS subsystem fails to collect the required SSP data. Refer to VIOS SSP documentation to fix the issue.
- Re-run the discovery operation so that the SSP data can be updated in the registry.
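The two trace-file searches described above can be combined into one triage step: match the failure signature and print the corresponding next action. The sample trace line is hypothetical; the quoted failure strings follow the patterns named in this section.

```shell
# Sketch: map an SSP failure signature from a trace extract to the next
# diagnostic step. The sample line is hypothetical.
sample='[00] 06/12/18 _VMR ERROR: Could not create single node cluster on VIOS vios1'

case "$sample" in
    *"Could not create single node cluster on VIOS"*)
        echo "check create_ssp() inputs, libkrest logs, and try 'cluster -create' on the VIOS" ;;
    *"Could not add VIOS"*)
        echo "check add_SSP_node() inputs and try 'cluster -addnode' on the VIOS" ;;
    *)
        echo "no known SSP failure signature found" ;;
esac
# prints: check create_ssp() inputs, libkrest logs, and try 'cluster -create' on the VIOS
```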
The repository disk failed
- Problem
- You receive a repository disk event, REPOSITORY_DISK_FAILURE.
- Solution 1
- When a repository disk fails, you can manually replace the repository disk by running the following command from the KSYS subsystem with a new repository disk ID:
ksysmgr modify host_group <name> options [repo_disk=<ViodiskID>]
- Solution 2
- When a repository disk fails, you can manually replace the repository disk by completing the following steps in the HMC GUI:
- Log in to the HMC GUI in a web browser as the hscroot user.
- Go to Resources > All Shared Storage Pool Clusters.
- Select your cluster and click on the cluster name.
- Click Replace Disk.
- In the Replace Repository Disk panel, select one of the available free shared physical volumes as the new repository disk to replace the existing repository disk.
- Click the UUID value to validate the complete UUID and the local hdisk name on each VIOS.
- Click OK to replace the repository disk. After the operation is complete, in the Shared Storage Pool Cluster window, click the repository disk UUID value to check whether it matches the selected new repository disk.
- Run the discovery operation to update the KSYS configuration settings by running the following command:
/opt/IBM/ksys/ksysmgr -t discover host_group <HGName>
- After the discovery operation is complete, run the following command to verify whether the updated repository disk UUID in the SSP remote copy matches the UUID in the HMC:
/opt/IBM/ksys/ksysmgr -v query host_group
For example:
(0) root @ ksys305: /var/ksys
# /opt/IBM/ksys/ksysmgr -v q host_group
Name: HG1
Hosts: hk-8247-22L-2139F7A
       ko-8284-22A-10FDC13
Memory_capacity: Priority Based Settings
                 high:100
                 medium:100
                 low:100
CPU_capacity: Priority Based Settings
              high:100
              medium:100
              low:100
Skip_power_on: No
HA_monitor: enable
Restart_policy: auto
VM_failure_detection_speed: normal
Host_failure_detection_time: 90
SSP Cluster Attributes
Sspname: KSYS_env30_ha_1
Sspstate: UP
Ssp_version: VIOS 3.1.0.00
VIOS: kov2
      hkv2
      hkv1
      kov1
Repo_disk: 01M0lCTTIxMDc5MDA2MDA1MDc2MzAzRkZEM0ZGMDAwMDAwMDAwMDAwMDQxNg==
HA_disk: 01M0lCTTIxMDc5MDA2MDA1MDc2MzAzRkZEM0ZGMDAwMDAwMDAwMDAwMDQxNw==
SspUuid: d5e5d382-dc01-38e8-ab55-083ca3ffe826
XSD_version: 4.00
PoolName: default_pool_1
PoolUuid: 000000000A281695000000005BD73628
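The final UUID comparison in the steps above can be scripted: extract the Repo_disk value from a saved ksysmgr query output and compare it with the UUID read from the HMC GUI. Both identifiers in this sketch are fabricated placeholders, not real disk IDs.

```shell
# Sketch: compare the repository disk ID from a saved
# 'ksysmgr -v query host_group' output with the value shown in the HMC
# GUI. Both IDs below are fabricated placeholders.
sample='Sspstate: UP
Repo_disk: FAKEUUID01
HA_disk: FAKEUUID02'

hmc_repo_uuid="FAKEUUID01"   # value read manually from the HMC GUI

ksys_repo_uuid=$(echo "$sample" | awk '/Repo_disk:/ {print $2}')
if [ "$ksys_repo_uuid" = "$hmc_repo_uuid" ]; then
    echo "repository disk UUIDs match"
else
    echo "mismatch: KSYS has $ksys_repo_uuid, HMC has $hmc_repo_uuid"
fi
# prints: repository disk UUIDs match
```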
Virtual switch, trunk adapter, or Ethernet adapters are not created successfully
- Problem
- The KSYS subsystem cannot create or delete virtual switch on a host, cannot create or delete trunk adapters on a VIOS, or cannot create or delete Ethernet adapters on VMs.
- Solution
-
- Ensure that the KSYS network configuration is set properly by performing the following steps:
- Each managed host must contain a switch for the KSYS configuration that is created during the discovery operation. Check the switch UUID that is stored in the KSYS registry by running the lsrsrc IBM.VMR_CEC SwitchUUID command.
- Each VIOS must contain two trunk adapters with VLAN IDs 101 and 102 in the associated host. Check the ActiveHM fields for the VIOS by running the lsrsrc IBM.VMR_VIOS ActiveHM AdapterMACAddress AdapterSlot command. Other virtual I/O servers show the ActiveHM field as blank.
- Each managed virtual machine must contain two virtual adapters. Check the MAC addresses and UUIDs that are saved in the registry by running the lsrsrc IBM.VMR_LPAR AdapterMACAddress1 AdapterUUID1 AdapterMACAddress2 AdapterUUID2 command.
- Rerun the discovery operation. If the required KSYS network is still not configured, manual intervention might be required.
- If you find the following errors in the trace.ksys.* trace files, investigate these errors at the HMC level and resolve them as specified in the error message:
"Dynamic add of virtual I/O resources failed"
"The specified slot contains a device or devices that are currently configured."
"This virtual switch cannot be deleted since the following virtual networks are using this virtual switch…"
- Access the HMC and run the chhwres command to create or delete switches and adapters. After you verify the configuration on the HMC, run the discovery operation to update the KSYS configuration settings.
The host or VM failures are not detected correctly
- Problem
- The failure detection engine (FDE) module of the KSYS subsystem cannot connect to the VIOS or is not working correctly.
- Solution
- The Failure Detection Engine (FDE) module is a KSYS module that detects the health of a virtual
machine and initiates the relocation request if the VM must be moved. The most common reasons for
monitoring failures by the FDE module are as follows:
- The HA monitoring at the site or system level is disabled.
- The HA monitoring at the VM level is disabled.
- The KSYS daemon might be repeatedly restarting.
- Check whether the FDE module is active by performing the following steps:
- Check whether the HA monitoring is enabled by searching for the Trace Started - Pid string in the /var/ct/<cluster_name>/log/mc/IBM.VMR/trace.fde.* trace files. For example:
[00] 06/12/18 ____ 14:14:29.274292 ******************* Trace Started - Pid = 6553828 **********************
[00] 06/12/18 _VMR 14:15:20.910220 DEBUG FDEthread.C[125]: Monitoring Enabled.
- Review the HAmonitor attribute settings by using the lsrsrc command on the KSYS node as follows:
- Check the persistent values that are specific to the HA monitoring and are saved in the VMR_SITE class by running the following command:
# lsrsrc -c IBM.VMR_SITE HAmonitor hostFDT VMFDS VMthreshold FDEpollInt
Resource Class Persistent Attributes for IBM.VMR_SITE
resource 1:
        HAmonitor   = "Enabled"
        hostFDT     = 90
        VMFDS       = "Normal"
        VMthreshold = 50
        FDEpollInt  = 20
The high-availability monitoring is enabled at a global level based on the HAmonitor value.
- Check the persistent values that are specific to the HA monitoring and are saved in the VMR_HG (host group) class by running the following command:
# lsrsrc IBM.VMR_HG Name HAmonitor hostFDT
Resource Persistent Attributes for IBM.VMR_HG
resource 1:
        Name      = "HG1"
        HAmonitor = "Enabled"
        hostFDT   = 0
If the failure detection time for the host (hostFDT value) is 0, the value that is specified at the site level is used.
- Check the persistent values that are specific to the HA monitoring and are saved in the VMR_CEC (host) class by running the following command:
# lsrsrc IBM.VMR_CEC HAmonitor VMFDS
Resource Persistent Attributes for IBM.VMR_CEC
resource 1:
        HAmonitor = "Enabled"
        VMFDS     = ""
- Check the persistent values that are specific to the HA monitoring and are saved in the VMR_VIOS class by running the following command:
# lsrsrc IBM.VMR_VIOS Name ViosUuid CecUuid ActiveHM CAAstate HMrespSec HMresponsive CAAstateReason MonitorMode SlowIO
Resource Persistent Attributes for IBM.VMR_VIOS
resource 1:
        Name           = "lasagnav1"
        ViosUuid       = "7B5099B5-019E-443F-98EB-E04A680D6DA6"
        CecUuid        = "ac420f59-cdbf-3ab8-b523-37f66d461741"
        ActiveHM       = 102
        CAAstate       = "UP"
        HMrespSec      = 1
        HMresponsive   = "Yes"
        CAAstateReason = ""
        MonitorMode    = "GLOBAL"
        SlowIO         = 0
If a VIOS is running in LOCAL mode, the MonitorMode field is set to LOCAL. If the host monitor is not operating correctly, the MonitorMode field is set to DOWN.
VMR_LPAR
class by running the following command:
The variableslsrsrc IBM.VMR_LPAR Name LparUuid CecUuid HMstate HAmonitor HBmissed HMstateHM1 HBmissedHM1 HMstateHM2 HBmissedHM2 notAvailHM1 notAvailHM2 ... resource 8: Name = "romano001" LparUuid = "2C55D2BB-1C50-49F1-B1A3-5C952E7070C7" CecUuid = "caffee0a-4206-3ee7-bfc2-f9d2bd3e866f" HMstate = "STARTED" HAmonitor = "Enabled" HBmissed = 0 HMstateHM1 = "" HBmissedHM1 = 0 HMstateHM2 = "" HBmissedHM2 = 0 notAvailHM1 = 0 notAvailHM2 = 0
HMstateHM1
,HBmissedHM1
,HMstateHM2
,HBmissedHM2
,notAvailHM1
, andnotAvailHM2
are applicable only for theLOCAL
mode. TheHMstateHM1
,HMstateHM2
,HBmissedHM1
, andHBmissedHM2
variables store the state of the VM as observed from VIOS1 and VIOS2. If thenotAvailHM1
andnotAvailHM2
variables are set to 1, it implies that no data was available for this VM from the VIOS.
- Check whether the FDE module is requesting health information from the VIOS and whether it
obtained data from the VIOS by performing the following steps:
- Identify the VIOS that is associated with the request. For example:
[00] 06/12/18 _VMR 14:15:20.910261 DEBUG FDEthread.C[190]: Use VIOS 1F5D7FFC-34BD-45B6-BD4F-101512D9BD2A for polling
- Check whether a REST request was initiated. For example:
[00] 06/12/18 _VMR 14:15:21.096723 DEBUG VMR_HMC.C[6728]: getQuickQuery: Calling kriSubmitQuickQuery!. HMC:9.3.18.186, viosUuid: 1F5D7FFC-34BD-45B6-BD4F-101512D9BD2A
[00] 06/12/18 _VMR 14:16:04.250468 DEBUG VMR_HMC.C[6617]: getNeedAttn: Calling kriSubmitNeedAttn!. HMC:9.3.18.186, viosUuid: 1F5D7FFC-34BD-45B6-BD4F-101512D9BD2A
- Check whether the REST request was successful. For example:
[00] 06/12/18 _VMR 14:16:05.537662 DEBUG VMR_HG.C[10768]: FDE doNeedAttn success GLOBAL_DATA
- Determine the VIOS health packet content. For example:
[00] 06/12/18 _VMR 14:16:05.537635 DEBUG VMR_HMC.C[6666]: JobOutput
[00] 06/12/18 _VMR 14:16:05.537635 DEBUG <VIO><Response>
… XML nodes here with data inside the Response node …
- Identify the actions taken by the FDE module, if any, by performing the following steps:
- Search for the string Task added to check whether the FDE module has passed the tasks to other components. For example:
[00] 06/13/18 _VMR 13:18:02.631206 DEBUG needAttn.C[918]: RESYNC HM TASK ADDED: vios 1f5d7ffc-34bd-45b6-bd4f-101512d9bd2a
If the FDE module passed the task, the task is added to the KSYS queue. The trace.ksys.* trace files might contain further details.
- Check whether a move operation is initiated by searching for the RECOVERY TASK ADDED for LPAR string. If you cannot find this string, the VM has not met the criteria for a move operation; for example, the threshold for missed heartbeats has not been reached:
[15] 06/11/18 _VMR 12:34:08.266355 DEBUG VMR_LPAR.C[14541]: ssetHBmissed 46 for romano001: 2C55D2BB-1C50-49F1-B1A3-5C952E7070C7
- Check whether the FDE module enabled the local mode. For example:
[06] 06/11/18 _VMR 09:18:48.906817 DEBUG FDEthread.C[209]: Did not find a VIOS - Going into local database mode
[06] 06/11/18 _VMR 09:18:48.906874 DEBUG FDEthread.C[679]: Use VIOS 6F97A18C-3738-4DE6-901A-96A338A3BA80 for local DB VLANID 101
[06] 06/11/18 _VMR 09:18:48.907018 DEBUG FDEthread.C[679]: Use VIOS 50C3E089-2254-4322-9B98-57038A701813 for local DB VLANID 102
[06] 06/11/18 _VMR 09:18:48.907065 DEBUG VMR_HG.C[10841]: FDE performing doNeedAttn LOCAL_DATA
In the global mode, the request is sent to the VIOS and the FDE module waits for a response. The response is parsed and the FDE module either takes action or moves the task to the KSYS subsystem. The local mode provides information about when the heartbeat was missed.
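The HBmissed values examined above can be screened in bulk: extract each VM's missed-heartbeat count from lsrsrc-style output and flag counts approaching the VMthreshold value (50 in the site output shown earlier). The records below are hypothetical.

```shell
# Sketch: flag VMs whose missed-heartbeat count is close to the
# VMthreshold value. The lsrsrc-style records below are hypothetical.
threshold=50
sample='Name = "romano001"
HBmissed = 0
Name = "romano002"
HBmissed = 46'

echo "$sample" | awk -F' = ' -v t="$threshold" '
$1 == "Name"     { gsub(/"/, "", $2); name = $2 }
$1 == "HBmissed" { if ($2 + 0 >= t * 0.9)
                       print name ": " $2 " missed heartbeats (threshold " t ")" }'
# prints: romano002: 46 missed heartbeats (threshold 50)
```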
You cannot restore a previously backed up KSYS configuration snapshot
- Problem
- When you attempt to restore a previously backed up KSYS configuration snapshot, you receive error messages indicating that the restore operation is not successful.
- Solution
-
- When you receive error messages during the snapshot operations, search the /var/ksys/log/ksysmgr.log file for any of the following text to find the cause of the error: add snapshot, removing old configuration, creating new cluster, creating HMC, host, host group, and so on.
- Ensure that the existing KSYS node is not in a corrupted state. If the KSYS node is corrupted, reinstall all the KSYS filesets.
- Query the snapshots to check whether all resource attribute values are set correctly by using the following command:
ksysmgr query snapshot filepath=filename
The default location for a saved snapshot is /var/ksys/snapshots/.
- If you receive a host group creation error, one of the HA disks (ha_disk) or repository disks (repo_disk) might not be available. In this case, check and resolve the disk availability.
) might not be available. In this case, check and resolve the disk availability. - If you receive error messages about cluster type, check whether you have set the type of the
cluster. After a cluster is created and the
IBM.VMR
daemon is started, set the ClusterType persistent attribute for the IBM.VMR_SITE class by running the following command:chrsrc -c IBM.VMR_SITE 'ClusterType="HA|DR"‘
- Ensure that the IBM.VMR daemon is in the active state. If it is not, reinstall the daemon.
The failed application does not move to a stable state after a restart operation
- Problem
- The VM agent subsystem cannot restart the failed application successfully.
- Solution
-
- Run the ksysvmmgr query app <NAME> command to check the state and UUID of
an application. The application is in one of the following states:
- UNSET
- State of an application when the application monitoring starts, but its status is not set.
- TO_START
- State of an application when the application monitoring has failed. The application is successfully stopped and must be started.
- NORMAL
- State of an application when the application is monitored properly.
- NOT_MONITORED
- State of an application when the application is not monitored because the daemon is not started or because the application monitoring is suspended.
- FAILING
- State of an application when the application is receiving monitor script errors. The application has not yet been marked as failed because the number of successive failures that triggers a restart operation has not been reached.
- TO_STOP
- State of an application when the application monitoring has failed and the failure threshold has been passed. The application has failed and must be restarted (first stopped, then started).
- NOT_STOPPABLE
- State of an application when the application cannot be stopped. Although the stop script is run, the stop operation fails continuously or times out.
- NOT_STARTABLE
- State of an application when the application cannot be started. Although the start script is run, the start operation fails continuously.
- ABNORMAL
- State of an application when an abnormal condition occurs during monitoring, stopping or starting operations. For example, monitor, stop, or start scripts are not found or cannot be run.
- FAILURE
- State of an application when the application can be restarted but remains in the failure state even after successful restart operations.
- Search the UUID of the associated application in the /var/ksys/log/ksys_vmm.log file to get more information about the application failure such as heartbeat requests, VM removal requests, and application reporting.
- Ensure that you have provided sufficient inputs to the application agents. The VM agent supports
the following application agents:
- ORACLE
-
- Ensure that you provide the correct instance name and Oracle database name to the Oracle agent scripts. For example: oracle (instance name) and DBRESP (database name).
- Ensure that you specify the correct listener.ora file in the ORACLE_HOME/TNS_ADMIN location for the listener processes to work.
- Ensure that the specified start, stop, and monitor scripts are working correctly with the database.
- Analyze the /var/ksys/log/agents/oracle_agent/<logfilename> file to diagnose the agent script failures. These log files contain information about any missing attribute or parameter.
- DB2
-
- Ensure that you provide the correct DB2 instance owner name to the DB2 agent scripts. For example: db2inst1 (instance owner).
- Ensure that you create the DB2 database before running any script. The scripts monitor the created database for the instance owner.
- Analyze the /var/ksys/log/agents/db2_agent/<logfilename> file to diagnose the agent script failures. These log files contain information about any missing attribute or parameter.
- SAPHANA
-
- Ensure that you provide the correct instance name and database number to the SAP HANA agent scripts. For example: S01 (instance name) and HDB01 (database number).
- Ensure that you specify the application version, instance name, and database number while adding the application. Otherwise, the application version field remains empty.
- Analyze the log files in the /var/ksys/log/agents/saphana/ directory to diagnose the agent script failures. These log files contain information about any missing attribute or parameter.
- Ensure that you have marked the application as critical by using the ksysvmmgr modify app <NAME> critical=yes command. The KSYS subsystem restarts a failed application only when you mark the application as critical. When a critical application in a VM reports a permanent failure state, diagnose the issue in the VM by checking the ksys_vmm.log file. When a non-critical application fails, the KSYS subsystem flags this application as failed and notifies you to take further action.
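The application states listed above can be mapped to a next diagnostic step with a simple case statement. The state value below is a hypothetical example of output from ksysvmmgr query app; the suggested actions summarize the state descriptions in this section.

```shell
# Sketch: translate an application state reported by
# 'ksysvmmgr query app <NAME>' into a next step. The state value is a
# hypothetical example.
state="NOT_STARTABLE"

case "$state" in
    NORMAL)        echo "application is healthy" ;;
    FAILING)       echo "monitor script reports errors; failure threshold not yet reached" ;;
    NOT_STOPPABLE) echo "stop script keeps failing or timing out; check the stop script" ;;
    NOT_STARTABLE) echo "start script keeps failing; check the start script and its logs" ;;
    ABNORMAL)      echo "monitor, stop, or start script is missing or cannot be run" ;;
    FAILURE)       echo "application stays failed after restarts; inspect it inside the VM" ;;
    *)             echo "transient state ($state); query again shortly" ;;
esac
# prints: start script keeps failing; check the start script and its logs
```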
Core dump error in the POSTGRES database
Verify whether the POSTGRES database is running on all the VIOS nodes. If the database is
not running, run the following command to restart the POSTGRES database:
vdba -cm -start
You cannot log in to the VM Recovery Manager HA GUI
- Problem
- You cannot log in to the VM Recovery Manager HA GUI.
- Solution
-
- Check for issues in the /opt/IBM/ksys/ui/server/logs/uiserver.log file.
- If you received the error message Permission missing on Smuiauth: login will not be done, verify that the smuiauth command is installed correctly. Also, verify that the smuiauth command has the correct permissions by running the ls -l command from the /opt/IBM/ksys/ui/server/lib/auth/smuiauth directory. An example output follows:
-r-x------ 1 root system 21183 Jun 11 21:48
- Verify that you can run the smuiauth command successfully by running the command along with the -h flag.
- Verify that the pluggable authentication module (PAM) framework is configured correctly by locating the following lines in the /etc/pam.conf file:
smuiauth auth required pam_aix
smuiauth account required pam_aix
The PAM is configured when you install the ksys.ui.server fileset.
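A minimal check that both smuiauth entries are present can be sketched as follows. The sketch uses a temporary sample file; on the GUI server you would point it at /etc/pam.conf instead.

```shell
# Create a sample pam.conf fragment (on a real system, use /etc/pam.conf).
PAMCONF=$(mktemp)
cat > "$PAMCONF" <<'EOF'
smuiauth        auth    required        pam_aix
smuiauth        account required        pam_aix
EOF

# Count how many of the two required smuiauth entries are found.
found=0
for svc in auth account; do
  if grep -Eq "^smuiauth[[:space:]]+$svc[[:space:]]+required[[:space:]]+pam_aix" "$PAMCONF"; then
    echo "smuiauth $svc entry present"
    found=$((found+1))
  fi
done
rm -f "$PAMCONF"
```

If fewer than two entries are found, reinstall the ksys.ui.server fileset or add the missing lines.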
You cannot register a KSYS node in the VM Recovery Manager HA GUI server
- Problem
- You cannot register a KSYS node in the VM Recovery Manager HA GUI server.
- Solution
-
- Check for issues in the /opt/IBM/ksys/ui/server/logs/uiserver.log file by
performing the following steps:
- If SSH File Transfer Protocol (SFTP)-related signatures exist in the log file, such as Received exit code 127 while establishing SFTP session, a problem exists with the SSH communication between the VM Recovery Manager HA GUI server and the KSYS node that you are trying to add.
- From the command line, verify that you can connect to the target system by using SFTP. If you cannot connect, verify that the sshd daemon is running on the GUI server and the target node by running the ps -ef | grep -w sshd | grep -v grep command.
- Check the SFTP subsystem configuration in the /etc/ssh/sshd_config file and verify that the following path is correct:
Subsystem sftp /usr/sbin/sftp-server
If the path is not correct, you must enter the correct path in the /etc/ssh/sshd_config file, and then restart the sshd subsystem.
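The Subsystem check can be sketched as a one-line grep. This sketch uses a temporary sample file; on the GUI server or KSYS node you would run the same grep against /etc/ssh/sshd_config.

```shell
# Create a sample sshd_config fragment (on a real system, use /etc/ssh/sshd_config).
CFG=$(mktemp)
cat > "$CFG" <<'EOF'
Subsystem       sftp    /usr/sbin/sftp-server
EOF

# Verify that the sftp subsystem points at the expected server binary.
if grep -Eq '^[[:space:]]*Subsystem[[:space:]]+sftp[[:space:]]+/usr/sbin/sftp-server' "$CFG"; then
  ok=yes
  echo "sftp subsystem path OK"
else
  ok=no
  echo "fix the Subsystem line in /etc/ssh/sshd_config and restart sshd"
fi
rm -f "$CFG"
```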
- Check for issues in the /opt/IBM/ksys/ui/agent/logs/agent_deploy.log file on the target cluster.
Node server down: GUI fails to start
- Problem
- The VM Recovery Manager HA GUI server is not working correctly.
- Solution
- If the applications are not running correctly, the node server status might be causing the issue. Run the ps -ef | grep node command to check the status and run the startsrc -s vmruiserver command to start the node server.
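The decision in the solution above can be sketched as a small helper that inspects `ps` output. The sample `ps` lines are hypothetical; on the GUI server you would pipe in the real output of `ps -ef | grep -w node | grep -v grep`.

```shell
# Decide whether the GUI node server needs a restart, based on ps output on stdin.
check_node_server() {
  if grep -qw node; then
    echo "node server running"
  else
    echo "node server down - run: startsrc -s vmruiserver"
  fi
}

# Hypothetical sample lines: one with a node process, one without.
up=$(printf '%s\n' 'root 4129078 1 0 Jun11 - 12:34 /usr/bin/node server.js' | check_node_server)
down=$(printf '%s\n' 'root 1 0 0 Jun11 - 0:01 /etc/init' | check_node_server)
echo "$up"
echo "$down"
```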
Unplanned system reboot causes fallover attempt to start GUI
- Problem
- You cannot access the VM Recovery Manager HA GUI because of an unplanned GUI server or GUI agent node reboot operation.
- Solution
- During the system node reboot operation, you cannot access the GUI. Run the lssrc -s vmruiserver command to check the status of the vmruiserver subsystem:
# lssrc -s vmruiserver
Subsystem         Group            PID          Status
 vmruiserver       vmrui                         inoperative
If the status of the vmruiserver subsystem is displayed as inoperative, run the startsrc -s vmruiserver command to restart the UI server node from the command line. You can then access the GUI and register the agent nodes again.
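Extracting the Status column from the `lssrc` output can be sketched as follows. The sample output is taken from the example above; on the KSYS node you would pipe in the real output of `lssrc -s vmruiserver`.

```shell
# Sample `lssrc -s vmruiserver` output (header line plus one data line).
sample='Subsystem         Group            PID          Status
 vmruiserver       vmrui                         inoperative'

# The Status value is the last field of the second (data) line.
status=$(printf '%s\n' "$sample" | awk 'NR==2 { print $NF }')
echo "status: $status"
if [ "$status" = "inoperative" ]; then
  echo "restart with: startsrc -s vmruiserver"
fi
```

Note that when the subsystem is inoperative, the PID column is empty, so taking the last field still yields the status.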
Unsuccessful Deployment: Dependency file missing during installation
- Problem
- A dependency file is missing during the installation of the GUI server and the GUI agent filesets.
- Solution
- Determine the missing file from the installation log file that you generated by running the installp command with the -e flag, and then install that dependency file from a certified host.
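Finding the missing dependency in the log can be sketched as a grep for requisite failures. The log content below is a hypothetical sample; the real log is the file you named with `installp -e /path/to/logfile`.

```shell
# Hypothetical sample installp log (on a real system, use the file passed to installp -e).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
Requisite failures:
MISSING REQUISITES: ksys.ui.common 1.3.0.0
EOF

# Count and print lines that report missing requisites.
hits=$(grep -ic 'missing requisite' "$LOG")
grep -i 'missing requisite' "$LOG"
rm -f "$LOG"
```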
You cannot stop or start the GUI server and GUI agent processes
- Problem
- You cannot stop or start the GUI server and agent processes.
- Solution
-
- GUI server: Stop the GUI server by running the following command:
stopsrc -s vmruiserver
Restart the GUI server by running the following command:
startsrc -s vmruiserver
If you are starting the GUI server for the first time after installing the GUI server, run the vmruiinst.ksh command. For information about running this command, see Installing GUI server filesets.
- GUI agent: Stop the GUI agent process by running the following command in the guest VM:
stopsrc -s vmruiagent
This command unregisters the KSYS node from the GUI server, and the KSYS node is no longer accessible from the GUI server. Restart the GUI agent by running the following command:
startsrc -s vmruiagent
This command registers the KSYS node again.
- The database network (DBN) node lost network connectivity or lost access to the pool of disks
- When the database network (DBN) node loses network connectivity or loses access to the pool of disks for a long time, all Virtual I/O Servers operate in local mode.
Postgres memory dumps and unable to access the VIOS database
- Problem
- The Postgres database memory dumps and you are unable to access the VIOS database. When you
query the VIOS cluster, the following error message is displayed:
Unable to connect to Database
- Solution
-
- Log in to a VIOS that is generating core dumps by using the HMC console.
- To go to the root directory, run the following command:
$ oem_setup_env
- To find the database node (DBN), run the following command from any Virtual I/O Server (VIOS) that is part of the cluster:
# clcmd ls -l /var/vio/SSP/<clustername>/D_E_F_A_U_L_T*/VIOSCFG
- To find the Postgres version, run the following command from the database node:
# cat /var/vio/SSP/<clustername>/D_E_F_A_U_L_T*/VIOSCFG/DB/PG/PG_VERSION
The output is either 13 or 10.
- If you get 13 as output, run the following
command:
# cd /usr/ios/db/postgres13/bin
- If you get 10 as output, run the following
command:
# cd /usr/ios/db/postgres10/bin
- To change user credential to the admin user, run the following command:
# su vpgadmin
- Run the following
command:
$ /usr/ios/db/postgres13/bin/pg_resetwal -f /var/vio/SSP/<clustername>/D_E_F_A_U_L_T_061310/VIOSCFG/DB/PG
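The version-to-directory choice in the steps above can be sketched as a small helper. The paths come from the steps themselves; everything else in the sketch is illustrative.

```shell
# Map the PG_VERSION value (13 or 10) to the matching Postgres bin directory.
pg_bin_for_version() {
  case "$1" in
    13) echo /usr/ios/db/postgres13/bin ;;
    10) echo /usr/ios/db/postgres10/bin ;;
    *)  echo "unexpected PG_VERSION: $1" >&2; return 1 ;;
  esac
}

bin13=$(pg_bin_for_version 13)
bin10=$(pg_bin_for_version 10)
echo "$bin13"
echo "$bin10"
# On the DBN, as user vpgadmin, you would then run pg_resetwal from the
# directory that matches the PG_VERSION value you read in the earlier step.
```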