IBM Support

IBM Spectrum Scale: NFS operations may fail with IO-Error

Flashes (Alerts)


Abstract

IBM has identified an issue with IBM Spectrum Scale 5.0.0.0 Protocol support for NFSv3/v4 in which IO-errors may be returned to the NFS client if the NFS server accumulates file-descriptor resources beyond the defined limit. Accumulation of file descriptor resources will occur when NFSv3 file create operations are sent against files that are already in use.

Content

Problem Summary

The number of "files in use" (file descriptors) grows under specific conditions and is not reduced even when the workload is stopped. When this number reaches the limit defined by the variable "nofiles", any further NFS IO may fail with IO-Error.

To trigger the accumulation, an NFSv3 file create command must be sent for a file that is still present in the file descriptor cache.

The file descriptor cache is cleared after 90 seconds (or sooner if cache usage is high).

Following the common practice of checking for file existence before creating a file prevents the condition.
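As an illustration of the existence check described above, the following is a minimal client-side sketch. The helper name `create_if_absent` is hypothetical and not part of any IBM tooling. The check and the create are two separate steps, so another client can still create the file in between; the sketch only shows the pattern and does not guarantee that the condition is never hit.

```python
import os


def create_if_absent(path: str, data: bytes = b"") -> bool:
    """Create `path` only if it does not already exist.

    Returns True if the file was created, False if it was already present.
    Hypothetical helper: the check and the create are not atomic.
    """
    if os.path.exists(path):
        # The file is already there, so no create request is issued.
        return False
    with open(path, "wb") as fh:
        fh.write(data)
    return True
```

An application would call `create_if_absent("/mnt/nfs/export/file.dat")` (the path is a placeholder) instead of issuing an unconditional create.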

The Ganesha NFS server uses a limit "nofiles" that is derived as 80% of the value of the GPFS configuration setting "maxFilesToCache".
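The 80% relationship can be sketched as a short calculation. The function name `nofiles_limit` is hypothetical, and rounding down to a whole number is an assumption; the exact rounding used by the product is not stated here.

```python
def nofiles_limit(max_files_to_cache: int) -> int:
    """Return 80% of maxFilesToCache, rounded down (rounding is an assumption)."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return max_files_to_cache * 80 // 100


# Example values taken from the Recommendations section of this flash.
print(nofiles_limit(100_000))    # 80000
print(nofiles_limit(1_000_000))  # 800000
```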

Ganesha restarts due to configuration changes or node reboots will release the accumulated file descriptors.

The accumulation may take months to reach the "nofiles" limit (depending on the number of hits of the special condition and the setting of "nofiles").

Users affected

All IBM Spectrum Scale 5.0.0.0 NFSv3 users who run clients or applications that perform concurrent file creates from multiple clients or multiple client threads and do not follow the common practice of checking for file existence before sending a "create" command.

Users running IBM Spectrum Scale 5.0.0.0 that do not restart the NFS server.

Note: No errors are returned while the number of open files stays below the defined limit.

Recommendations

Reconsider the value of the GPFS configuration parameter "maxFilesToCache". 100k is the recommended minimum; 1M should be possible on modern CES machines.

Monitor the file descriptor usage of the NFS server. If it approaches the limit, restart the NFS server on the affected CES node (for example, using "systemctl restart nfs-ganesha").
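One possible way to monitor the usage on a Linux CES node is to count the entries under /proc/<pid>/fd for the NFS server process and compare the count with the "nofiles" limit. The sketch below is a minimal illustration, not an IBM-provided tool. The function names and the 90% warning threshold are assumptions, the caller must supply the process ID of the NFS server, and reading another process's fd directory usually requires root privileges.

```python
import os


def open_fd_count(pid: int) -> int:
    """Count open file descriptors of process `pid` via /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def should_restart(open_fds: int, limit: int, threshold: float = 0.9) -> bool:
    """Return True once usage reaches `threshold` of the limit.

    The 0.9 default is an assumed safety margin, not an IBM recommendation.
    """
    return open_fds >= limit * threshold


# Example usage with a pid supplied by the caller and a "nofiles" limit of 80000:
#   if should_restart(open_fd_count(pid), 80_000):
#       print("File descriptor usage is close to the limit; plan a restart")
```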


Document Information

Modified date:
25 September 2022

UID

ssg1S1011791