Solving common problems

When a ksysmgr command fails, the command output shows the error and a suggested resolution. If you cannot determine the cause from that output, review the following approaches to diagnose the issue.

The discovery operation failed.

If a problem occurs during the discovery steps, analyze the ksysmgr.log file for any failures. After the discovery is complete, you can query the IBM.VMR_LPAR resource class to confirm that the discovery operation completed successfully:

lsrsrc IBM.VMR_LPAR

The output might be similar to the following sample:

        Name                = "xxx"
        LparUuid            = "59C8CFxx-4Bxx-43E2-A0CE-F028AEB5Fxxx"
        LparIPList          = {}
        SiteCleanupTastList = {}
        ActiveSiteID        = 80708xxxx
        LCB                 = {  }
        BootDiskList        = {}
        CecUuid             = "6ce366c5-f05d-3a12-94f8-94a3fdfcxxxx"
        ErrMsg              = ""
        Phase               = "READY"
        PhaseDetail         = 4194305
        Memory              = "4352"
        Processors          = "0.1"
        ActivePeerDomain    = "vmdr"

If the discovery operation encounters an error, the Phase field is set to VERIFY and the ErrMsg field contains the error details. After a successful discovery operation, the Phase field is set to READY.
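When many virtual machines are managed, it can help to filter the lsrsrc output instead of reading every record. The following is a minimal sketch, assuming the output uses the "Field = value" layout of the sample above and that Name is the first field of each record. The list_not_ready function is a hypothetical helper, not a KSYS or RSCT command.

```shell
# Print Name, Phase, and ErrMsg (tab-separated) for every LPAR record
# whose Phase is not READY. Reads lsrsrc-style records on standard input.
# list_not_ready is a hypothetical helper name.
list_not_ready() {
    awk '
        # Return the text after the first " = ", without quotes or trailing blanks.
        function value(line,    v) {
            v = substr(line, index(line, " = ") + 3)
            sub(/[ \t\r]+$/, "", v)
            sub(/^"/, "", v)
            sub(/"$/, "", v)
            return v
        }
        # Report the record collected so far, then reset for the next one.
        function emit() {
            if (name != "" && phase != "READY")
                printf "%s\t%s\t%s\n", name, phase, errmsg
            name = ""; phase = ""; errmsg = ""
        }
        index($0, " = ") == 0 { next }
        { key = substr($0, 1, index($0, " = ") - 1); gsub(/[ \t]/, "", key) }
        key == "Name"   { emit(); name = value($0) }
        key == "Phase"  { phase = value($0) }
        key == "ErrMsg" { errmsg = value($0) }
        END { emit() }
    '
}
```

Piping the output of lsrsrc IBM.VMR_LPAR into list_not_ready would then print one line for each virtual machine that still needs attention.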

The discovery operation failed with the getlcb error.

The cause for this error might be that the virtual machine's Fibre Channel port in the storage area network (SAN) fabric is zoned with a storage port that does not provide any logical unit numbers (LUNs) to the virtual machine. You can resolve this error by completing one of the following steps:

Ensure that the virtual machine is zoned only with those storage ports that provide LUNs to the virtual machines.
Run the cfgmgr command in the virtual machine where the getlcb failure occurred and then run the discovery operation again.

The discovery operation failed indicating that the storage disk was already a part of an existing composite group.

If any of the storage disks in the GDR solution are already a part of an existing composite group, the discovery operation cannot complete successfully. A storage disk in the GDR solution must be associated with a single composite group, which is asynchronously consistent. Remove the older composite groups, and run the discovery operation again.

The discovery operation failed indicating that the disk group is created in only one site.

Review the /var/ksys/log/ksys_srdf.log file for any consistency-enabled issues. Ensure that all the disks that belong to a Remote Data Facility (RDF) group are also a part of the composite group.
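A quick way to start the log review is to pull out the most recent lines that mention an error or a failure. The following sketch assumes that relevant entries contain the words "error" or "fail". The keywords, the default of 20 lines, and the recent_failures function name are illustrative assumptions, and the actual log format might differ.

```shell
# Show the most recent lines in a log file that mention an error or failure,
# prefixed with their line numbers.
# recent_failures is a hypothetical helper; the keywords are assumptions.
recent_failures() {
    log_file=$1
    max_lines=${2:-20}
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 1
    fi
    grep -n -i -E 'error|fail' "$log_file" | tail -n "$max_lines"
}

# Example: recent_failures /var/ksys/log/ksys_srdf.log
```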

The verification phase failed.

After the validation is complete, you can query the IBM.VMR_LPAR resource class to ensure that the virtual machines are ready to be moved during a disaster situation:

lsrsrc IBM.VMR_LPAR

The output might be similar to the following sample:

        Name                = "xxx"
        LparUuid            = "59C8CFxx-4Bxx-43E2-A0CE-F028AEB5Fxxx"
        LparIPList          = {}
        SiteCleanupTastList = {}
        ActiveSiteID        = 80708xxxx
        LCB                 = {  }
        BootDiskList        = {}
        CecUuid             = "6ce366c5-f05d-3a12-94f8-94a3fdfcxxxx"
        ErrMsg              = ""
        Phase               = "READY_TO_MOVE"
        PhaseDetail         = 4194305
        Memory              = "4352"
        Processors          = "0.1"
        ActivePeerDomain    = "vmdr"

If the configuration validation encounters an error, review the error details in the ErrMsg field. After a successful verification operation, the Phase field is set to READY_TO_MOVE.
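Before relying on a move operation, you might want a single pass or fail answer for the whole site. The following sketch, which assumes the same "Field = value" layout as the sample above, succeeds only when every Phase field reads READY_TO_MOVE. The all_ready_to_move function is a hypothetical helper, not part of KSYS.

```shell
# Succeed (exit status 0) only if every Phase field on standard input is
# READY_TO_MOVE. Prints a count for each other phase that is found.
# all_ready_to_move is a hypothetical helper name.
all_ready_to_move() {
    awk '
        {
            i = index($0, " = ")
            if (i == 0) next
            key = substr($0, 1, i - 1)
            gsub(/[ \t]/, "", key)
            if (key != "Phase") next
            val = substr($0, i + 3)
            gsub(/[" \t\r]/, "", val)
            total++
            if (val != "READY_TO_MOVE") { other[val]++; bad++ }
        }
        END {
            for (p in other) printf "%d LPAR(s) in phase %s\n", other[p], p
            # Fail when nothing was read or when any LPAR is in another phase.
            if (total == 0 || bad > 0) exit 1
        }
    '
}
```

The exit status makes the check usable as a guard in a larger script, for example by running the lsrsrc query, piping it into all_ready_to_move, and stopping if the function fails.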

The test-discovery step in the DR failover rehearsal operation is failing with the error message: Tertiary storage copy is missing.

This error occurs when one or more of the tertiary (third copy) disks that are required for cloning the backup storage data are missing. For each backup (second copy) disk, a corresponding tertiary disk must exist in the backup site. Check the availability and accessibility of the tertiary disks in the storage subsystem. You can also check the status of the cloning relationship by using commands that are provided by the specific storage vendor.

The test-discovery step in the DR failover rehearsal operation is failing with the error message: Storage agent is not accessible.

This error occurs because of a communication problem between the KSYS subsystem and the storage subsystem. Check for any hardware issues. For example, ensure proper connectivity between all the subsystems. You can also identify the issue by analyzing the resource manager trace log files.

The HMC interface indicates that the LPAR has no Resource Monitoring and Control (RMC) connection or that the RMC connection is inactive.

Check whether the LPAR properties also indicate an RMC issue between the HMC and VIOS. The RMC connectivity issue can occur because of the security mode that is set in the LPAR. The security mode for both the HMC and LPAR must be set to the same value.

For example, list the security mode for the LPAR by running the following command:

/usr/sbin/rsct/bin/lssecmode

The output might look like the following sample:

Current Security Mode Configuration
Compliance Mode : nist_sp800_131a
Asymmetric Key Type : rsa2048_sha256
Symmetric Key Type : default

Similarly, list the security mode for the HMC by running the following command:

/usr/sbin/rsct/bin/lssecmode

The output might look like the following sample:

Current Security Mode Configuration
Compliance Mode : none
Asymmetric Key Type : rsa512
Symmetric Key Type : default

In this case, the LPAR has the nist_sp800_131a compliance mode enabled, but the compliance mode of the HMC is none. This mismatch can occur if another HMC was connected or if a security mode was set before any reset operation was started.
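If you save the lssecmode output from the LPAR and from the HMC to files, the comparison can be scripted. The following is a minimal sketch that compares only the Compliance Mode line, assuming the output format of the samples above. The function names and the file arguments are hypothetical.

```shell
# Extract the "Compliance Mode" value from a file that holds lssecmode output.
compliance_mode() {
    awk -F':' '/^[ \t]*Compliance Mode/ { gsub(/[ \t\r]/, "", $2); print $2; exit }' "$1"
}

# Compare two saved outputs; return 0 only when both compliance modes match.
# Both function names and the file arguments are hypothetical.
same_security_mode() {
    lpar_mode=$(compliance_mode "$1")
    hmc_mode=$(compliance_mode "$2")
    if [ -n "$lpar_mode" ] && [ "$lpar_mode" = "$hmc_mode" ]; then
        echo "match: $lpar_mode"
    else
        echo "mismatch: LPAR=$lpar_mode HMC=$hmc_mode"
        return 1
    fi
}
```

With the two sample outputs above, the function would report a mismatch between nist_sp800_131a and none.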

Errors occur when you register an EMC storage agent.

Check the /var/symapi/config/netcnfg file to determine whether the configuration contains at least two EMC subsystems.

You want to view all the storage disks in the active site and the backup site.

Run the following command to list all the storage disks:

# symcfg list

The output might be similar to the following sample:

                                S Y M M E T R I X                         

                                       Mcode    Cache      Num Phys  Num Symm
    SymmID       Attachment  Model     Version  Size (MB)  Devices   Devices 

    000196800508 Local       VMAX100K  5977      217088        65      9964
    000194901326 Remote      VMAX-1SE  5876       28672         0       904
    000196800573 Remote      VMAX100K  5977      217088         0      7275
    000198701861 Remote      VMAX10K   5876       59392         0       255
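To separate the local entry from the remote entries in a long listing, the rows can be reduced to their first three columns. The following sketch assumes the column layout of the sample above, in which each data row starts with a 12-digit SymmID followed by Local or Remote. The list_arrays function is a hypothetical helper.

```shell
# Print "SymmID Attachment Model" for each data row of symcfg list output.
# A data row is recognised by a 12-digit SymmID in the first column and
# Local or Remote in the second, as in the sample output.
# list_arrays is a hypothetical helper name.
list_arrays() {
    awk '$1 ~ /^[0-9]+$/ && length($1) == 12 && ($2 == "Local" || $2 == "Remote") {
        print $1, $2, $3
    }'
}
```

Piping the symcfg list output into list_arrays skips the banner and header lines and leaves one short line per row.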

The output of the ksysmgr query command is not updated for hosts, HMCs, and VIOS even when the entities are updated.

Sometimes, the ksysmgr query command displays static information even when the hosts or HMCs are modified. To update the ksysmgr query command output dynamically, complete the following steps:

Unpair the hosts across the sites by using the following command:
```
ksysmgr pair host host_name pair=none
```
Remove the hosts from the site by using the following command:
```
ksysmgr remove host host_name
```

Add the new HMC to the site by using the following command:

ksysmgr add hmc hmc_name login=login_name password=login_password 
       ip=ip_address  site=site_name

Add the corresponding managed hosts to the site by using the following command:
```
ksysmgr add host host_name
```
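Because the order of these steps matters, it can be useful to print the full command sequence for review before you run anything. The following sketch only echoes the commands from the steps above with your values filled in and executes nothing. The refresh_plan function and its arguments are hypothetical, and the password is left as a placeholder on purpose.

```shell
# Print, in order, the ksysmgr commands from the steps above for one host
# and one HMC. Nothing is executed; review the output and run it manually.
# refresh_plan and its arguments are hypothetical placeholders.
refresh_plan() {
    if [ $# -ne 5 ]; then
        echo "usage: refresh_plan host hmc login ip site" >&2
        return 1
    fi
    host=$1 hmc=$2 login=$3 ip=$4 site=$5
    echo "ksysmgr pair host $host pair=none"
    echo "ksysmgr remove host $host"
    echo "ksysmgr add hmc $hmc login=$login password=<login_password> ip=$ip site=$site"
    echo "ksysmgr add host $host"
}
```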

The disaster recovery move operation failed.

When a disaster occurs and you initiate a move operation, the KSYS subsystem coordinates the start of the virtual machines on the backup site. During this process, LPAR profiles are created through HMCs on the backup site hosts. If an error during the LPAR profile creation prevents the KSYS subsystem from communicating with the HMCs, the move operation might fail. Any virtual machines that are only partially created on the HMC at that point must be restarted manually. For the rest of the virtual machines, you can use the ksysmgr recover command to recover and start the virtual machines.

An unplanned move operation from the active site to the backup site is successfully completed and the specified flexible capacity policy is followed. Later, another unplanned move operation from the backup site to the source site fails. When the virtual machines are recovered to the source site, the virtual machines are started, but without any change in the processor and memory values (that is, without following the flexible capacity policy).

This situation can happen when the active hosts and backup hosts are connected to the same HMC. You must connect the source hosts and target hosts to different HMCs to continue unplanned move operations in case of source HMC failures.