GPUDirect Storage troubleshooting

The troubleshooting information for GPUDirect Storage (GDS) is primarily available in the NVIDIA documentation. IBM Storage Scale also provides some troubleshooting options for GDS-related issues.

Run the NVIDIA GDS utility gdscheck -p before you run GDS workloads to verify the setup. Python 3 must be installed on the node to run this utility. Verify the status of PCIe Access Control Services (ACS) and the PCIe input/output memory management unit (IOMMU), because these components affect GDS functionality and performance. The output of gdscheck -p must display the following status for the IOMMU and ACS components:
IOMMU disabled
ACS disabled
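For example, you can check these two items with a command similar to the following. The path to the utility is an assumption; it depends on where your CUDA installation places the GDS tools:
# /usr/local/cuda/gds/tools/gdscheck.py -p | grep -iE 'iommu|acs'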
If you want to enable tracing for your CUDA application, adjust the following settings, which are available in /etc/cufile.json.
Note: These settings may impact GDS performance.
  • Log level: Level of information to be logged.
  • Log location: By default, the trace is written into the current working directory of the CUDA application.
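For example, the logging settings in /etc/cufile.json can be displayed as shown in the following sketch. The output is illustrative; the exact keys, comments, and default values depend on the installed CUDA release:
# grep -A 3 '"logging"' /etc/cufile.json
    "logging": {
            //"dir": "/home/<user>",
            "level": "ERROR"
    },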

Troubleshooting information in NVIDIA documentation is available at GPUDirect Storage Troubleshooting.

Troubleshooting options available in IBM Storage Scale

GPUDirect Storage (GDS) in IBM Storage Scale is integrated with the system health monitoring.

You can use the mmhealth command to monitor the health status. Check for the following details in the mmhealth command output:
  • File system manager status: for example, down or quorum loss.
  • Network status: for example, InfiniBand (IB) fabric broken or devices down.
  • File system status: for example, not mounted or broken.
You can run the mmhealth command as shown in the following example:
# mmhealth node show

Node name:      fscc-x36m3-32-hs
Node status:    DEGRADED
Status Change:  19 days ago

Component      Status        Status Change     Reasons
----------------------------------------------------------------------------
FILESYSMGR     HEALTHY       17 days ago       -
GPFS           DEGRADED      19 days ago       mmfsd_abort_warn
NETWORK        HEALTHY       29 days ago       -
FILESYSTEM     HEALTHY       14 days ago       -
GUI            HEALTHY       29 days ago       -
PERFMON        HEALTHY       18 days ago       -
THRESHOLD      HEALTHY       29 days ago       -

You can use the mmhealth node show GDS command to check the health status of the GDS component. For more information about the various options that are available with the mmhealth command, see mmhealth command.
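If you want to process the health status in scripts, you can add the -Y option to produce machine-readable, colon-delimited output, for example:
# mmhealth node show GDS -Y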

Error recovery

CUDA retries failed GDS read and GDS write requests in compatibility mode. Because the retry is a regular POSIX read() or write() system call, the general GPFS limitations regarding error recovery apply.

Restriction counters

If performance is not as expected, this can indicate that one or more of the restrictions for GDS reads or writes were encountered.

Each counter represents the number of GDS I/O requests that caused a fallback to the compatibility mode for a particular reason.

In IBM Storage Scale 5.1.3, the mmdiag command is enhanced to print diagnostic information for GDS. The mmdiag --gds command displays a list of counters that represent GDS operations that were returned to CUDA because of a restriction. A restricted GDS operation is returned to the CUDA layer and retried in compatibility mode. The following output shows the GDS restriction counters:
# mmdiag --gds

=== mmdiag: gds ===

GPU Direct Storage restriction counters:

  file less than 4k                                  0
  sparse file                                        0
  snapshot file                                      0
  clone file                                         0
  encrypted file                                     0
  memory mapped file                                 0
  compressed file                                    0
  append to file                                     0
  increase file size                                 0
  dioWanted fail                                     0
  nsdServerDownlevel                                 0
  nsdServerGdsRead                                   0
  RDMA target port is down                           0
  RDMA initiator port is down                        0
  RDMA work request errors                           0
  no RDMA connection to NSD server (transient error) 0
  no RDMA connection to NSD server (permanent error) 0
The following list describes the restriction counters:
  • file less than 4k: GDS performs a read on a file with a file size of less than 4096 bytes.
  • sparse file: GDS performs a read on a sparse section within a file.
  • snapshot file: GDS performs a read on a snapshot file.
  • clone file: GDS performs a read on a clone section within a file.
  • encrypted file: GDS performs a read on an encrypted file.
  • memory mapped file: GDS performs a read on a memory-mapped file.
  • compressed file: GDS performs a read on a compressed file.
  • append to file: A new block had to be allocated for appending data to a file.
  • increase file size: The file size had to be increased and a new block had to be allocated.
  • dioWanted fail: GDS performs a read on a file for which the internal function dioWanted failed.
  • nsdServerDownlevel: GDS performs a read on file data that is stored on an NSD server that runs GPFS 5.1.1 or an earlier version.
  • nsdServerGdsRead: GDS performs a read on file data that is stored on a disk that is attached to the local GPFS node.
  • RDMA target port is down: GDS performs a read through an RDMA adapter port on a GDS client that is in the down state.
  • RDMA initiator port is down: GDS performs a read through an RDMA adapter port on an NSD server that is in the down state.
  • RDMA work request errors: The RDMA operation for a GDS read request failed.
  • no RDMA connection to NSD server (transient error): Transient RDMA error.
  • no RDMA connection to NSD server (permanent error): Permanent RDMA error.

mmfslog

The GDS feature in IBM Storage Scale provides specific entries in the mmfs log file that indicate a successful initialization.
# grep "VERBS DC" mmfs.log.latest
2021-05-05_15:55:32.729-0400: [I] VERBS DC RDMA library libmlx5.so loaded.
2021-05-05_15:55:32.729-0400: [I] VERBS DC API loaded.
2021-05-05_15:55:32.986-0400: [I] VERBS DC API initialized.
If the IBM Storage Scale log file contains the following warning message:
[W] VERBS RDMA open error verbsPort <port> due to missing support for atomic operations for device <device>

If this warning appears, check the description of the verbsRdmaWriteFlush configuration variable in the mmchconfig command for possible options.
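To display the value that is currently set on the cluster, you can query the variable with the mmlsconfig command, for example:
# mmlsconfig verbsRdmaWriteFlush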

Syslog

In case of errors, detailed information about the NVIDIA driver registration and deregistration can be found in the syslog. The corresponding messages look similar to the following example:
Apr 12 00:48:53 c73u34 kernel: ibm_scale_v1_register_nvfs_dma_ops()
Apr 12 00:49:14 c73u34 kernel: ibm_scale_v1_unregister_nvfs_dma_ops()
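Depending on the distribution, you can search the kernel messages for these entries with journalctl or directly in /var/log/messages, for example:
# journalctl -k | grep nvfs_dma_ops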

Traces

Specific GDS I/O traces can be generated by using the mmtracectl command. For more details, see mmtracectl command.
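A minimal tracing session looks like the following sketch. The trace classes and levels that are relevant for GDS are described in the mmtracectl command; by default, the trace files are written to /tmp/mmfs:
# mmtracectl --start
Run the GDS workload that shows the problem.
# mmtracectl --stop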

Support data

If all previous steps do not help and support needs to collect debug data, use the gpfs.snap command to gather all relevant files and diagnostic data for analyzing the potential issues. For more details about the various options that are available with the gpfs.snap command, see gpfs.snap command.
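For example, to collect debug data from the GDS client nodes that show the problem (the node names are placeholders):
# gpfs.snap -N gdsnode1,gdsnode2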

Common errors

  1. RDMA is not enabled.
    GPUDirect Storage (GDS) requires RDMA to be enabled. If RDMA is not enabled, an I/O error (EIO=-5) occurs as shown in the following example:
    # gdsio -f /ibm/gpfs0/gds/file.dat -x 0 -I 0 -s 1G -i 1m -d 0 -w 1
    Error: IO failed stopping traffic, fd :33 ret:-5 errno :1
    io failed :ret :-5 errno :1, file offset :0, block size  :1048576
    

    When such an error occurs, verify that the system is configured correctly. For more information, see Configuring GPUDirect Storage for IBM Storage Scale.

    Important: Ensure that the verbsRdma option is enabled (verbsRdma=enable). A quick way to verify this setting is shown in the example after this list.
  2. RDMA device addresses are set incorrectly.
    If the list of addresses of the RDMA devices in the /etc/cufile.json file is empty, the following error occurs:
    # gdsio -f /ibm/gpfs0/gds/file.dat -x 0 -I 0 -s 1G -i 1m -d 0 -w 1
    Error: IO failed stopping traffic, fd :27 ret:-5008 errno :17
    io failed : GPUDirect Storage not supported on current file, file offset :0, block size  :1048576
    

    When such an error occurs, verify that the system is configured correctly. For more information, see Configuring GPUDirect Storage for IBM Storage Scale.

    Important: Ensure that the rdma_dev_addr_list configuration parameter has the correct value in the /etc/cufile.json file. See the verification example after this list.
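The following commands, run on the GDS client node, provide a quick way to verify both settings that are mentioned in the preceding list. The output is illustrative and the IP addresses are placeholders:
# mmlsconfig verbsRdma
verbsRdma enable

# grep rdma_dev_addr_list /etc/cufile.json
        "rdma_dev_addr_list": ["192.168.12.11", "192.168.12.12"],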