IBM Support

IBM ESS: ESS 6.1.2.0 and 6.1.2.1 may exhibit restarts of the Spectrum Scale daemon or fail to mount a file system when configured with RDMA

Troubleshooting


Problem

  • Intermittent restart of the Spectrum Scale daemon
  • File system fails to mount at the client
  • The system uses an InfiniBand or RoCE interconnect with RDMA enabled
The Spectrum Scale log shows entries like this:
logAssertFailed: wcOpcode == IBV_WC_SEND || wcOpcode == IBV_WC_RDMA_READ || wcOpcode == IBV_WC_RDMA_WRITE
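To check a node for this signature, the log can be searched for the assert text. The following is a minimal sketch, not an official IBM tool: the function name count_rdma_asserts is hypothetical, and the log path in the usage comment assumes the conventional Spectrum Scale location /var/adm/ras/mmfs.log.latest, which may differ on your system.

```shell
# Hypothetical helper: count log lines that contain the RDMA assert signature.
# Argument 1: path to a Spectrum Scale log file.
count_rdma_asserts() {
    log_file="$1"
    # -F matches the text literally, -c prints the number of matching lines.
    # grep exits non-zero when there are no matches, so "|| true" keeps the
    # helper usable in scripts that run with "set -e".
    grep -F -c 'logAssertFailed: wcOpcode == IBV_WC_SEND' -- "$log_file" || true
}

# Example (assumed default log location):
#   count_rdma_asserts /var/adm/ras/mmfs.log.latest
```

A non-zero count on a node running MOFED 5.4-2.X suggests that the node is hitting the problem described here.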

Cause

The MOFED layer 5.4-2.X contained in ESS 6.1.2.0 and ESS 6.1.2.1 might return incompatible or unsupported responses to I/O requests issued by Spectrum Scale. As part of recovery, the Spectrum Scale daemon might terminate and restart. This issue was introduced in MOFED version 5.4-2.0. Earlier versions of MOFED are not affected.

Environment

All ESS 3000, ESS 3200, and ESS 5000 systems running release 6.1.2.0 or 6.1.2.1 with MOFED version 5.4-2.X.

Resolving The Problem

While a permanent fix is being developed, downgrading MOFED to version 5.4-1.X is an acceptable interim fix. The steps below describe how to downgrade MOFED to version 5.4-1.0.3.0.
Downgrading MOFED version to 5.4-1.0.3.0:
Note: Firmware levels do not change in this process and remain at the 5.4-2 level.
Prerequisites:
  1. Check the MOFED level on each cluster node to see whether the cluster is affected: ofed_info -s
  2. Make sure you have the correct MOFED ISO image for the node architecture, for example:
-rw-r--r-- 1 root root 366528512 Jul 16 11:57 MLNX_OFED_LINUX-5.4-1.0.3.0-rhel8.2-ppc64le.iso 
-rw-r--r-- 1 root root 410331136 Jul 16 11:58 MLNX_OFED_LINUX-5.4-1.0.3.0-rhel8.2-x86_64.iso
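The version check in prerequisite 1 can be scripted. This is a minimal sketch under stated assumptions: the function name mofed_is_affected is hypothetical, and it assumes that ofed_info -s prints a string of the form MLNX_OFED_LINUX-<version>: (verify the output format on your nodes). It treats any 5.4-2.X level as affected, as described in the Cause section.

```shell
# Hypothetical helper: succeed (exit 0) if the given MOFED version string
# is a 5.4-2.X level, fail (exit 1) otherwise.
# Argument 1: version string, for example the output of "ofed_info -s".
mofed_is_affected() {
    case "$1" in
        *-5.4-2.*) return 0 ;;   # 5.4-2.X: affected, downgrade recommended
        *)         return 1 ;;   # any other level: not matched by this check
    esac
}

# Example use on a node:
#   if mofed_is_affected "$(ofed_info -s)"; then
#       echo "MOFED 5.4-2.X detected: downgrade required"
#   fi
```

Running this on every cluster node gives a quick list of the nodes that need the downgrade.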
Steps for downgrading:
Note: These are high-level instructions for downgrading from MOFED 5.4-2 to MOFED 5.4-1. They assume the cluster stays online. The examples below are based on the ppc64le architecture; adjust them for other architectures where appropriate.
  1. Log in to the first node requiring MOFED downgrade
    •  ssh node1
  2. Check the MOFED version (a downgrade is required if the version is later than 5.4-1.0.3.0)
    • ofed_info -s
  3. Verify quorum availability (if the cluster must remain online).
    • mmgetstate -s reports quorum status and helps determine whether one or more nodes can be taken down safely while keeping the cluster up.
  4. Shut down Spectrum Scale (on this node)
    • mmshutdown
  5. Verify Spectrum Scale is shut down (on this node)
    • mmgetstate
  6. Uninstall the existing MOFED version
    •  /sbin/ofed_uninstall.sh --force
  7. Copy the MOFED ISO and firmware binary to the node (example using ppc64le and sftp)
    • sftp node1
    • cd /tmp
    • mput MLNX_OFED_LINUX-5.4-1.0.3.0-rhel8.2-ppc64le.iso
  8. Mount the MOFED ISO
    • cd /tmp ; mount -o loop MLNX_OFED_LINUX-5.4-1.0.3.0-rhel8.2-ppc64le.iso /mnt
  9. Install MOFED (with sample options)
    • cd /mnt ; ./mlnxofedinstall --add-kernel-support --disable-kmp --without-fw-update
  10. Remove MOFED udev rules (if applicable)
    • MOFED 5.x applies custom networking udev rules that can interfere with user-defined versions of those rules.
    • To determine whether a modification is needed, check the kernel command line:
      • cat /proc/cmdline
    • If the output contains the parameters [biosdevname=0 net.ifnames=0], do the following:
      • mv -f /lib/udev/rules.d/82-net-setup-link.rules /lib/udev/rules.d/82-net-setup-link.rules.bak
  11. Re-create initramfs (example for RHEL using dracut)
    • dracut -f
  12. Reboot the node
    • systemctl reboot
  13. Confirm MOFED driver is running
    • ibstat
    • ofed_info -s
    • ibdev2netdev
  14. Start Spectrum Scale
    • mmstartup
  15. Confirm Spectrum Scale is active and quorum is met
    • mmgetstate -a
    • mmgetstate -s
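
The decision in step 10 can be expressed as a small check on the kernel command line. This is a minimal sketch, not part of the official procedure: the function name cmdline_disables_predictable_names is hypothetical, and it assumes that both biosdevname=0 and net.ifnames=0 must be present before the udev rule file is moved aside.

```shell
# Hypothetical helper: succeed (exit 0) only if the given kernel command line
# contains both biosdevname=0 and net.ifnames=0 as separate parameters.
# Argument 1: kernel command line text, for example "$(cat /proc/cmdline)".
cmdline_disables_predictable_names() {
    # Pad with spaces so that each parameter is matched as a whole word.
    case " $1 " in
        *" biosdevname=0 "*) ;;          # first parameter found, keep checking
        *) return 1 ;;
    esac
    case " $1 " in
        *" net.ifnames=0 "*) return 0 ;; # both parameters found
        *) return 1 ;;
    esac
}

# Example use on a node (step 10):
#   if cmdline_disables_predictable_names "$(cat /proc/cmdline)"; then
#       mv -f /lib/udev/rules.d/82-net-setup-link.rules \
#             /lib/udev/rules.d/82-net-setup-link.rules.bak
#   fi
```

If your site treats either parameter alone as sufficient, relax the check accordingly.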

Document Location

Worldwide


Document Information

Modified date:
09 February 2022

UID

ibm16554496