
GPUDirect Storage issues

The troubleshooting information for GPUDirect Storage (GDS) is primarily available in the NVIDIA documentation. IBM Spectrum Scale also provides some troubleshooting options for GDS-related issues.

Before you run GDS workloads, run the NVIDIA GDS utility gdscheck -p to verify the setup. Python 3 must be installed on the node to run this utility. Verify the status of PCIe Access Control Services (ACS) and the PCIe Input/Output Memory Management Unit (IOMMU), because these components affect GDS function and performance. The output of gdscheck -p must show the following status for the IOMMU and ACS components:
IOMMU disabled
ACS disabled
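
The check can be scripted. The following is a minimal sketch, assuming that the gdscheck -p output reports each status in the form shown above, optionally with a colon after the component name. The function name check_gds_platform is hypothetical and is not part of any NVIDIA or IBM tool.

```shell
# check_gds_platform: reads `gdscheck -p` output on stdin and reports
# whether both IOMMU and ACS are shown as disabled.
# Assumption: status lines look like "IOMMU disabled" or "IOMMU: disabled".
check_gds_platform() {
  gds_output=$(cat)
  gds_status=0
  for component in IOMMU ACS; do
    if printf '%s\n' "$gds_output" |
        grep -E -i -q "${component}:?[[:space:]]+disabled"; then
      echo "$component: ok"
    else
      echo "$component: not reported as disabled"
      gds_status=1
    fi
  done
  # Non-zero return code means at least one component needs attention.
  return "$gds_status"
}
```

A possible invocation is gdscheck -p | check_gds_platform, where the path to gdscheck depends on your CUDA installation.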

To enable tracing of your CUDA application, adjust the following settings in /etc/cufile.json:

  • Log level: Level of information to be logged.
  • Log location: By default, the trace is written into the current working directory of the CUDA application.
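
To review the current values before you change them, you can print the logging block of the configuration file. The following is a minimal sketch, assuming that the block starts at a line that contains "logging" and ends at the next closing brace, as in the default cufile.json that NVIDIA ships. The function name show_cufile_logging is hypothetical. The log level and log directory are expected to appear in this block, but verify the exact key names against your installed file and the NVIDIA documentation.

```shell
# show_cufile_logging: prints the "logging" block of a cuFile configuration.
# Assumption: the block starts at a line containing "logging" and ends at
# the first following line that contains a closing brace.
show_cufile_logging() {
  config_file=${1:-/etc/cufile.json}
  sed -n '/"logging"/,/}/p' "$config_file"
}
```

Called without an argument, the function reads /etc/cufile.json. Pass another path to inspect a copy of the file.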

For troubleshooting information in the NVIDIA documentation, see GPUDirect Storage Troubleshooting.

Troubleshooting options available in IBM Spectrum Scale

GPUDirect Storage (GDS) in IBM Spectrum Scale is integrated with system health monitoring.

You can use the mmhealth command to monitor the health status. Check for the following details in the mmhealth command output:
  • File system manager status: Examples - down and quorum.
  • Network status: Examples - IB fabric broken and devices down.
  • File system status: Examples - Not mounted and broken.
You can run the mmhealth command as shown in the following example:
# mmhealth node show

Node name:      fscc-x36m3-32-hs
Node status:    DEGRADED
Status Change:  19 days ago

Component      Status        Status Change     Reasons
----------------------------------------------------------------------------
FILESYSMGR     HEALTHY       17 days ago       -
GPFS           DEGRADED      19 days ago       mmfsd_abort_warn
NETWORK        HEALTHY       29 days ago       -
FILESYSTEM     HEALTHY       14 days ago       -
GUI            HEALTHY       29 days ago       -
PERFMON        HEALTHY       18 days ago       -
THRESHOLD      HEALTHY       29 days ago       -
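
On nodes with many components, it can help to filter the output down to the components that are not healthy. The following is a minimal sketch that parses the table layout shown in the example above. The function name unhealthy_components is hypothetical, and the parsing assumes that the component table follows a separator row of dashes and that the reason is the last field of each row.

```shell
# unhealthy_components: reads `mmhealth node show` output on stdin and
# prints "<component> <status> <reason>" for every row whose status is
# not HEALTHY.
# Assumption: the table rows follow a separator line made of dashes.
unhealthy_components() {
  awk '
    /^-+[ \t]*$/ { in_table = 1; next }   # separator row starts the table
    in_table && NF >= 2 && $2 != "HEALTHY" { print $1, $2, $NF }
  '
}
```

A possible invocation is mmhealth node show | unhealthy_components. For the example output above, the function prints one line for the GPFS component with the reason mmfsd_abort_warn.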

You can use the mmhealth node show GDS command to check the health status of the GDS component. For more information about the various options that are available with the mmhealth command, see mmhealth command.

Restriction counters

If performance is lower than expected, some of the GDS restrictions might have been encountered, which causes I/O to fall back to compatibility mode.

You can obtain the current counter values by using the mmfsadm dump verbs command. Each counter represents the number of GDS I/O requests that caused a fallback to compatibility mode for a particular reason.

These counters are part of the GDS section of the output. You can print their values as shown in the following example:
# mmfsadm dump verbs | tail
  Unsupported file operation counters:
    file less than 4k  = 4
    sparse file        = 0
    snapshot file      = 100
    clone file         = 10
    encrypted file     = 2
    memory mapped file = 0
    compressed file    = 0
    dioWanted fail     = 10
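
To see at a glance which restrictions are causing fallbacks, you can reduce the counter list to the non-zero entries and a total. The following is a minimal sketch that parses the "name = value" layout shown in the example above. The function name gds_fallback_counters is hypothetical, and the parsing assumes that the counters follow the "Unsupported file operation counters:" line.

```shell
# gds_fallback_counters: reads `mmfsadm dump verbs` output on stdin and
# prints every non-zero restriction counter, followed by the total.
# Assumption: counters appear as "name = value" lines after the
# "Unsupported file operation counters:" heading.
gds_fallback_counters() {
  awk -F'=' '
    /Unsupported file operation counters:/ { in_section = 1; next }
    in_section && NF == 2 {
      name = $1; value = $2
      gsub(/^[ \t]+|[ \t]+$/, "", name)    # trim the counter name
      gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim the counter value
      if (value + 0 > 0) {
        printf "%s=%d\n", name, value
        total += value
      }
    }
    END { printf "total=%d\n", total }
  '
}
```

A possible invocation is mmfsadm dump verbs | gds_fallback_counters. For the example values above, the total is 126 fallback requests, most of them caused by snapshot files.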

mmfslog

GDS in IBM Spectrum Scale writes log entries to mmfslog. Proper initialization requires that the following three steps complete successfully, as shown in this example:
# grep "VERBS DC" mmfs.log.latest
2021-05-05_15:55:32.729-0400: [I] VERBS DC RDMA library libmlx5.so loaded.
2021-05-05_15:55:32.729-0400: [I] VERBS DC API loaded.
2021-05-05_15:55:32.986-0400: [I] VERBS DC API initialized.
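
The three initialization messages can be checked in one step. The following is a minimal sketch, assuming that the message texts match the example above. The function name check_verbs_dc_init is hypothetical. The match on the first message stops before the library name, because the library name might differ on your system.

```shell
# check_verbs_dc_init: verifies that a log file contains all three
# "VERBS DC" initialization messages and names any that are missing.
# Assumption: the message texts match the example in this section.
check_verbs_dc_init() {
  log_file=$1
  init_missing=0
  for message in "VERBS DC RDMA library" \
                 "VERBS DC API loaded" \
                 "VERBS DC API initialized"; do
    if ! grep -F -q "$message" "$log_file"; then
      echo "missing: $message"
      init_missing=1
    fi
  done
  # Returns 0 only when all three messages were found.
  return "$init_missing"
}
```

A possible invocation is check_verbs_dc_init mmfs.log.latest, run from the directory that contains the log file.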

Syslog

You can find the details of the NVIDIA driver registration and deregistration in the syslog, as shown in the following example:
Apr 12 00:48:53 c73u34 kernel: ibm_scale_v1_register_nvfs_dma_ops()
Apr 12 00:49:14 c73u34 kernel: ibm_scale_v1_unregister_nvfs_dma_ops()
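
Because registration and deregistration can occur several times, the most recent entry determines the current state. The following is a minimal sketch that reads syslog lines and reports the last event, assuming that the kernel messages use the function names shown in the example above. The function name nvfs_registration_state is hypothetical.

```shell
# nvfs_registration_state: reads syslog lines on stdin and prints
# "registered", "unregistered", or "unknown" based on the most recent
# ibm_scale_v1_*register_nvfs_dma_ops message.
nvfs_registration_state() {
  last_event=$(grep -E -o 'ibm_scale_v1_(un)?register_nvfs_dma_ops' | tail -n 1)
  case "$last_event" in
    ibm_scale_v1_register_nvfs_dma_ops)   echo "registered" ;;
    ibm_scale_v1_unregister_nvfs_dma_ops) echo "unregistered" ;;
    *)                                    echo "unknown" ;;
  esac
}
```

The source of the syslog lines depends on the distribution. For example, you might pipe the output of journalctl -k or the content of /var/log/messages into the function. Both are assumptions about your logging setup.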

Traces

You can collect a GDS I/O trace by using the mmtracectl command. For more details, see mmtracectl command.

Support data

If the previous steps do not resolve the issue and support needs debug data, use the gpfs.snap command to collect all relevant files and diagnostic data for analysis. For more details about the various options that are available with the gpfs.snap command, see gpfs.snap command.
