GDPC and IBM Spectrum Scale replication FAQ

This FAQ answers common questions about problems in geographically dispersed Db2® pureScale® cluster (GDPC) and IBM Spectrum Scale replication environments.

What do I do when I cannot bring the disks online after a storage failure on one site was rectified?

If the nodes come online before the storage device, ensure that the disk configurations are defined and available before you try to restart the failed disks. If the Device and DevType fields show a - when you list the network shared disks (NSDs) with the mmlsnsd -X command, the operating system has not configured the underlying devices; make the device configurations available before attempting to restart the disks. Consult the operating system and device driver manuals for the exact steps to configure the devices. On AIX® platforms, you can run the cfgmgr command to automatically configure devices that were added since the system was last rebooted.
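On AIX, a minimal recovery sequence might look like the following; db2fs1 is a placeholder file system name, and the mmchdisk step applies only once mmlsnsd -X shows the devices as available again:
cfgmgr
/usr/lpp/mmfs/bin/mmlsnsd -X
/usr/lpp/mmfs/bin/mmchdisk db2fs1 start -a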

What do I do if a computer's IP address that is used for the IB interface cannot be pinged after a reboot?

Ensure the InfiniBand (IB) related devices are available:
root> lsdev -C | grep ib
ib0         Available      IP over InfiniBand Network Interface
iba0        Available      InfiniBand host channel adapter
If the devices are not available, bring them online with chdev:
chdev -l ib0 -a state=up
Ensure that the ib0, icm, and iba0 properties are set correctly, that ib0 references an IB adapter such as iba0, and that the properties persist across reboots. Use the -P option of chdev to make changes persistent across reboots.
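For example, on AIX the current interface attributes can be inspected and then changed persistently; the address values below are placeholders, and the exact attribute names should be confirmed with lsattr before making any change:
lsattr -El ib0
chdev -l ib0 -a netaddr=10.1.1.21 -a netmask=255.255.255.0 -a state=up -P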

What do I do if access to the IBM Spectrum Scale file systems hangs for a long time after a storage controller failure?

Ensure the device driver parameters are set properly on each machine in the cluster.
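On AIX, the parameters of interest typically include the disk rw_timeout and the Fibre Channel protocol device error recovery attributes; the device names and values shown here are illustrative only, so confirm the recommended settings for your storage subsystem:
lsattr -El hdisk7 -a rw_timeout
lsattr -El fscsi0 -a fc_err_recov -a dyntrk
chdev -l fscsi0 -a fc_err_recov=fast_fail -a dyntrk=yes -P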

What do I do if the cluster comes down following a site failure?

Check the system logs on the surviving site to see if IBM® Spectrum Scale has triggered a kernel panic due to outstanding I/O requests:
GPFS Deadman Switch timer has expired and there are still outstanding I/O requests
If so, ensure that the device driver parameters have been set properly.
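For example, on AIX the system error report and the IBM Spectrum Scale log (assuming the default log location) can be checked for the message with:
errpt -a | more
grep -i deadman /var/adm/ras/mmfs.log.latest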

What happens when one site loses Ethernet connectivity and the LPARs on that site are expelled from the IBM Spectrum Scale cluster?

If the IBM Spectrum Scale cluster manager is on the tiebreaker site, this behavior is expected: the cluster manager does not have InfiniBand (IB) or Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) connectivity and can no longer talk to the site that lost Ethernet connectivity. If the IBM Spectrum Scale cluster manager is not on the tiebreaker site, but is on the site that retains Ethernet connectivity, then ensure that the tiebreaker site is an IBM Spectrum Scale quorum-client, not a quorum-manager, as designated with the mmaddnode command. If the tiebreaker host is a quorum-manager, its status can be changed to client with the /usr/lpp/mmfs/bin/mmchnode --client -N hostT command. The status of the nodes as managers or clients can be verified with the /usr/lpp/mmfs/bin/mmlscluster command. Also ensure that the IBM Spectrum Scale subnets parameter refers to the IP subnet that uses the IB or RoCE interfaces; the /usr/lpp/mmfs/bin/mmlsconfig command can be used to verify that the subnets parameter is set correctly.
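For example, assuming the commands are in the default /usr/lpp/mmfs/bin location, the node roles and the subnets setting can be checked with:
/usr/lpp/mmfs/bin/mmlscluster
/usr/lpp/mmfs/bin/mmlsconfig | grep subnets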

What do I do when one site loses Ethernet connectivity and the members on that site remain stuck in the STOPPED state instead of performing a restart light and moving to the WAITING_FOR_FAILBACK state?

Ensure that LSR has been disabled.

How can I remove unused IBM Spectrum Scale Network Shared Disks (NSDs)?

Scenarios that can lead to the need to manually remove unused NSDs:
  1. User-driven or abnormal termination of a db2cluster command with the CREATE FILESYSTEM or ADD DISK option.
  2. Unused NSDs were created manually at some earlier point and left in the system.
The free NSDs must be removed before the underlying disks can be used by the db2cluster command with either the CREATE FILESYSTEM or ADD DISK option. Use the following instructions to remove them:
Note: Run all the following commands on the same host.
  1. Run mmlsnsd -XF to list the free NSDs and their corresponding device names.
     root@coralpib21a:/> mmlsnsd -XF
    
     Disk name    NSD volume ID      Device         Devtype  Node name                Remarks
     ---------------------------------------------------------------------------------------------------
     gpfs2118nsd  09170151FFFFD473   /dev/hdisk7    hdisk    coralpib21a.torolab.ibm.com
  2. Find the NSD name that matches the target device to be removed.
    gpfs2118nsd
  3. Run mmdelnsd <NSD name> to remove the desired unused NSD.
    root@coralpib21a:/> mmdelnsd gpfs2118nsd
    mmdelnsd: Processing disk gpfs2118nsd
    mmdelnsd: Propagating the cluster configuration data to all
    affected nodes.  This is an asynchronous process.
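After the command completes, rerun mmlsnsd -XF on the same host to confirm that the NSD no longer appears among the free disks; the underlying device can then be reused by the db2cluster command with the CREATE FILESYSTEM or ADD DISK option.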