GPUDirect Storage troubleshooting
The troubleshooting information for GPUDirect Storage (GDS) is primarily available in the NVIDIA documentation. IBM Storage Scale also provides some troubleshooting options for GDS-related issues.
The NVIDIA documentation covers, among other items:
- IOMMU disabled
- ACS disabled
- Log level: Level of information to be logged.
- Log location: By default, the trace is written into the current working directory of the CUDA application.
Troubleshooting information in NVIDIA documentation is available at GPUDirect Storage Troubleshooting.
Troubleshooting options available in IBM Storage Scale
GPUDirect Storage (GDS) in IBM Storage Scale is integrated with the system health monitoring.
The health status of the following components also covers GDS-relevant states:
- File system manager status: Examples - down and quorum.
- Network status: Examples - IB fabric broken and devices down.
- File system status: Examples - Not mounted and broken.
The following example shows the overall node health status as reported by the mmhealth command:
# mmhealth node show
Node name: fscc-x36m3-32-hs
Node status: DEGRADED
Status Change: 19 days ago
Component Status Status Change Reasons
----------------------------------------------------------------------------
FILESYSMGR HEALTHY 17 days ago -
GPFS DEGRADED 19 days ago mmfsd_abort_warn
NETWORK HEALTHY 29 days ago -
FILESYSTEM HEALTHY 14 days ago -
GUI HEALTHY 29 days ago -
PERFMON HEALTHY 18 days ago -
THRESHOLD HEALTHY 29 days ago -
You can use the mmhealth node show GDS command to check the health status of the GDS component. For more information about the various options that are available with the mmhealth command, see mmhealth command.
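For example, a quick check that is limited to the GDS component can look like the following sketch; the -Y option, which produces colon-delimited output that is suitable for scripting, is described in the mmhealth command documentation:
# mmhealth node show GDS
# mmhealth node show GDS -Y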
Error recovery
CUDA retries failed GDS read and GDS write requests in compatibility mode. Because the retry is a regular POSIX read() or write() system call, all GPFS limitations regarding error recovery apply.
Restriction counters
If performance is not as expected, this might indicate that one or more of the GDS restrictions for GDS reads or writes have been encountered. IBM Storage Scale maintains restriction counters, where each counter represents the number of GDS I/O requests that caused a fallback to compatibility mode for a particular reason. Use the mmdiag --gds command to display the counters, as shown in the following example:
# mmdiag --gds
=== mmdiag: gds ===
GPU Direct Storage restriction counters:
file less than 4k 0
sparse file 0
snapshot file 0
clone file 0
encrypted file 0
memory mapped file 0
compressed file 0
append to file 0
increase file size 0
dioWanted fail 0
nsdServerDownlevel 0
nsdServerGdsRead 0
RDMA target port is down 0
RDMA initiator port is down 0
RDMA work request errors 0
no RDMA connection to NSD server (transient error) 0
no RDMA connection to NSD server (permanent error) 0
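To find out which restriction a particular workload hits, you can snapshot the counters before and after a test run and compare them. The following sketch assumes a gdsio test against a file in /ibm/gpfs0/gds; the file path and the gdsio parameters are placeholders that must match your environment:
# mmdiag --gds > /tmp/gds_counters_before.txt
# gdsio -f /ibm/gpfs0/gds/file.dat -x 0 -I 0 -s 1G -i 1m -d 0 -w 1
# mmdiag --gds > /tmp/gds_counters_after.txt
# diff /tmp/gds_counters_before.txt /tmp/gds_counters_after.txt
Counters that increase during the run indicate the reason why GDS fell back to compatibility mode.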
The following table describes the restriction counters.

| Counter Name | Description |
|---|---|
| file less than 4k | GDS read on a file with a file size of less than 4096 bytes. |
| sparse file | GDS read on a sparse section within a file. |
| snapshot file | GDS read on a snapshot file. |
| clone file | GDS read on a clone section within a file. |
| encrypted file | GDS read on an encrypted file. |
| memory mapped file | GDS read on a memory-mapped file. |
| compressed file | GDS read on a compressed file. |
| append to file | A new block had to be allocated to append data to a file. |
| increase file size | The file size had to be increased and a new block had to be allocated. |
| dioWanted fail | GDS read on a file for which the internal function dioWanted failed. |
| nsdServerDownlevel | GDS read on file data that is stored on an NSD server that runs GPFS 5.1.1 or an earlier version. |
| nsdServerGdsRead | GDS read on file data that is stored on a disk attached to the local GPFS node. |
| RDMA target port is down | GDS read through an RDMA adapter port on the GDS client that is in the down state. |
| RDMA initiator port is down | GDS read through an RDMA adapter port on the NSD server that is in the down state. |
| RDMA work request errors | The RDMA operation for a GDS read request failed. |
| no RDMA connection to NSD server (transient error) | Transient RDMA error. |
| no RDMA connection to NSD server (permanent error) | Permanent RDMA error. |
mmfslog
GDS-related RDMA messages are written to the mmfs.log file. For example, the following command shows the messages for loading and initializing the VERBS DC RDMA support:
# grep "VERBS DC" mmfs.log.latest
2021-05-05_15:55:32.729-0400: [I] VERBS DC RDMA library libmlx5.so loaded.
2021-05-05_15:55:32.729-0400: [I] VERBS DC API loaded.
2021-05-05_15:55:32.986-0400: [I] VERBS DC API initialized.
If an RDMA device does not support atomic operations, a warning like the following can appear in the log:
[W] VERBS RDMA open error verbsPort <port> due to missing support for atomic operations for device <device>
Check the description of the verbsRdmaWriteFlush configuration variable in the mmchconfig command for possible options.
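To review the current setting, you can query the configuration with mmlsconfig, for example (a minimal sketch; the valid values for verbsRdmaWriteFlush are listed in the mmchconfig command documentation):
# mmlsconfig verbsRdmaWriteFlush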
Syslog
The registration and deregistration of the GDS DMA callbacks are recorded in the syslog, for example:
Apr 12 00:48:53 c73u34 kernel: ibm_scale_v1_register_nvfs_dma_ops()
Apr 12 00:49:14 c73u34 kernel: ibm_scale_v1_unregister_nvfs_dma_ops()
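To confirm that the registration took place on a node, you can search the kernel log for these messages, for example as follows (assuming kernel messages are written to /var/log/messages; use journalctl -k on systems with systemd journaling):
# grep nvfs_dma_ops /var/log/messages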
Traces
Specific GDS I/O traces can be generated by using the mmtracectl command. For more details, see mmtracectl command.
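A typical capture wraps the trace around a failing GDS workload, as in the following sketch; the gdsio call is only an example workload, and any GDS-specific trace classes or levels are set beforehand as described in the mmtracectl command documentation:
# mmtracectl --start
# gdsio -f /ibm/gpfs0/gds/file.dat -x 0 -I 0 -s 1G -i 1m -d 0 -w 1
# mmtracectl --stop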
Support data
If the previous steps do not help and IBM Support needs to collect debug data, use the gpfs.snap command to gather all relevant files and diagnostic data for analyzing the potential issues. For more details about the various options that are available with the gpfs.snap command, see gpfs.snap command.
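For example, a snapshot that is limited to the affected GDS client node can be collected as follows; the node name is a placeholder, and the -N option restricts the collection to the listed nodes (a minimal sketch):
# gpfs.snap -N gds-client-01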
Common errors
- RDMA is not enabled.
  GPUDirect Storage (GDS) requires RDMA to be enabled. If RDMA is not enabled, an I/O error (EIO=-5) occurs as shown in the following example:
  # gdsio -f /ibm/gpfs0/gds/file.dat -x 0 -I 0 -s 1G -i 1m -d 0 -w 1
  Error: IO failed stopping traffic, fd :33 ret:-5 errno :1
  io failed :ret :-5 errno :1, file offset :0, block size :1048576
  When such an error occurs, verify that the system is configured correctly. For more information, see Configuring GPUDirect Storage for IBM Storage Scale.
  Important: Ensure that the verbsRdma option is enabled (verbsRdma=enable). A verification sketch follows this list.
- RDMA device addresses are set incorrectly.
  If the list of addresses of the RDMA devices in the /etc/cufile.json file is empty, the following error occurs:
  # gdsio -f /ibm/gpfs0/gds/file.dat -x 0 -I 0 -s 1G -i 1m -d 0 -w 1
  Error: IO failed stopping traffic, fd :27 ret:-5008 errno :17
  io failed : GPUDirect Storage not supported on current file, file offset :0, block size :1048576
  When such an error occurs, verify that the system is configured correctly. For more information, see Configuring GPUDirect Storage for IBM Storage Scale.
  Important: Ensure that the rdma_dev_addr_list configuration parameter has the correct value in the /etc/cufile.json file. See the sketch after this list.
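The following sketch shows one way to verify both settings. The mmlsconfig call displays the current value of verbsRdma; the cufile.json excerpt is illustrative only, and the IP addresses are placeholders for the RDMA interfaces of the GDS client:
# mmlsconfig verbsRdma
Illustrative excerpt from /etc/cufile.json:
"properties": {
    "rdma_dev_addr_list": [ "192.168.1.10", "192.168.2.10" ]
}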