GPUDirect Storage issues
The troubleshooting information for GPUDirect Storage (GDS) is primarily available in the NVIDIA documentation. IBM Spectrum Scale also provides some troubleshooting options for GDS-related issues.
IOMMU disabled
ACS disabled
To enable tracing of your CUDA application, adjust the following settings in /etc/cufile.json:
- Log level: Level of information to be logged.
- Log location: By default, the trace is written into the current working directory of the CUDA application.
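Before changing these settings, it can help to see what is currently configured. The following is a minimal sketch, not an official tool. The helper name show_cufile_logging is hypothetical, and it assumes that the log level and log location are kept in a "logging" section of the file, as in the configuration that NVIDIA ships. Verify the section name against your own /etc/cufile.json.

```shell
# Print the "logging" section of a cuFile configuration file.
# Assumption: the log settings live in a section that starts with "logging".
# The default path is /etc/cufile.json, as named in the text above.
show_cufile_logging() {
  conf="${1:-/etc/cufile.json}"
  # Print from the line that contains "logging" through the next line
  # that contains a closing brace.
  sed -n '/"logging"/,/}/p' "$conf"
}
```

Call show_cufile_logging without arguments to inspect /etc/cufile.json, or pass the path of another configuration file as the first argument.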
For troubleshooting information in the NVIDIA documentation, see GPUDirect Storage Troubleshooting.
Troubleshooting options available in IBM Spectrum Scale
GPUDirect Storage (GDS) in IBM Spectrum Scale is integrated with system health monitoring. The monitored areas include the following:
- File system manager status. Examples: down and quorum.
- Network status. Examples: IB fabric broken and devices down.
- File system status. Examples: Not mounted and broken.
# mmhealth node show
Node name: fscc-x36m3-32-hs
Node status: DEGRADED
Status Change: 19 days ago
Component     Status      Status Change    Reasons
----------------------------------------------------------------------------
FILESYSMGR    HEALTHY     17 days ago      -
GPFS          DEGRADED    19 days ago      mmfsd_abort_warn
NETWORK       HEALTHY     29 days ago      -
FILESYSTEM    HEALTHY     14 days ago      -
GUI           HEALTHY     29 days ago      -
PERFMON       HEALTHY     18 days ago      -
THRESHOLD     HEALTHY     29 days ago      -
You can use the mmhealth node show GDS command to check the health status of the GDS component. For more information about the various options that are available with the mmhealth command, see mmhealth command.
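On a node with many components, it can be convenient to list only the components that are not healthy. The following is a minimal sketch. The function name unhealthy_components is hypothetical, and it assumes the column layout of the sample mmhealth node show output in this topic: component name in the first column, status in the second.

```shell
# Read `mmhealth node show` output on stdin and print every component
# whose status is anything other than HEALTHY.
# Assumption: component rows start with an uppercase component name
# followed by an uppercase status word, as in the sample output.
unhealthy_components() {
  awk '$1 ~ /^[A-Z_]+$/ && $2 ~ /^[A-Z]+$/ && $2 != "HEALTHY" {
         print $1, $2
       }'
}
```

A typical call is mmhealth node show | unhealthy_components. For the sample output in this topic, the sketch prints only the GPFS component with the DEGRADED status.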
Restriction counters
If performance is lower than expected, some of the GDS restrictions might have been encountered, which causes I/O to fall back to the compatibility mode.
You can obtain the current counter values by using the mmfsadm dump verbs command. Each counter represents the number of GDS I/O requests that caused a fallback to the compatibility mode for a particular reason.
# mmfsadm dump verbs | tail
Unsupported file operation counters:
file less than 4k = 4
sparse file = 0
snapshot file = 100
clone file = 10
encrypted file = 2
memory mapped file = 0
compressed file = 0
dioWanted fail = 10
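To see which restrictions are actually being hit, you can filter the counters down to the nonzero ones. The following is a minimal sketch. The function name nonzero_gds_fallbacks is hypothetical, and it assumes the "name = value" layout of the sample output and that the counter block is the last section of the dump, as the use of tail in the sample suggests.

```shell
# Read `mmfsadm dump verbs` output on stdin and print each fallback
# counter with a value greater than zero, as name=value.
# Assumption: counters follow the "Unsupported file operation counters:"
# header, one "name = value" pair per line.
nonzero_gds_fallbacks() {
  awk -F' *= *' '
    /Unsupported file operation counters:/ { in_block = 1; next }
    in_block && NF == 2 && $2 + 0 > 0 {
      sub(/^ +/, "", $1)          # drop any leading indentation
      printf "%s=%d\n", $1, $2
    }'
}
```

A typical call is mmfsadm dump verbs | nonzero_gds_fallbacks. Running it twice around a workload and comparing the results shows which counters increased while that workload was active.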
mmfslog
# grep "VERBS DC" mmfs.log.latest
2021-05-05_15:55:32.729-0400: [I] VERBS DC RDMA library libmlx5.so loaded.
2021-05-05_15:55:32.729-0400: [I] VERBS DC API loaded.
2021-05-05_15:55:32.986-0400: [I] VERBS DC API initialized.
Syslog
Apr 12 00:48:53 c73u34 kernel: ibm_scale_v1_register_nvfs_dma_ops()
Apr 12 00:49:14 c73u34 kernel: ibm_scale_v1_unregister_nvfs_dma_ops()
Traces
You can collect a GDS I/O trace by using the mmtracectl command. For more details, see mmtracectl command.
Support data
If the previous steps do not resolve the issue and support needs debug data, use the gpfs.snap command to collect all relevant files and diagnostic data for analysis. For more details about the various options that are available with the gpfs.snap command, see gpfs.snap command.