IBM Support

NVIDIA GPGPU Adapters system memory addressing limitations - IBM Systems

Troubleshooting


Problem

Careful planning should be done before ordering or installing the NVIDIA GPGPU adapters listed in the Affected configurations section into any system. These precautions are designed to prevent undetected data corruption issues or general system instability. Due to memory addressing limitations of the NVIDIA Grid Kx, NVIDIA Tesla Kxx, and NVIDIA Quadro Kxxxx cards, the maximum amount of system memory that can be used when one or more of these NVIDIA cards is installed in the system is one Terabyte (1 TB). Any product that can support 1 TB of system memory is affected.

Resolving The Problem

Source

RETAIN tip: H213010

Symptom

Careful planning should be done before ordering or installing the NVIDIA General-purpose computing on graphics processing units (GPGPU) adapters listed in the Affected configurations section into any system. These precautions are designed to prevent undetected data corruption issues or general system instability.

Due to memory addressing limitations of the NVIDIA Grid Kx, NVIDIA Tesla Kxx, and NVIDIA Quadro Kxxxx cards, the maximum amount of system memory that can be used when one or more of these NVIDIA cards is installed in the system is one Terabyte (1 TB).

Any product that can support 1 TB or more of system memory is affected.

Affected configurations

The system is configured with one or more of the following IBM options:

This tip is not system specific.

This tip is not software specific.

The NVIDIA device driver for the NVIDIA Grid, NVIDIA Tesla, or NVIDIA Quadro card is affected.

Solution

This is a design limitation of the NVIDIA Grid Kx, NVIDIA Tesla Kxx, and NVIDIA Quadro Kxxxx cards. Therefore, this is a permanent restriction associated with these NVIDIA cards.

Workaround

Remove Dual In-line Memory Modules (DIMMs) from the system to reduce the total amount of installed system memory to less than 1 TB.

Alternatively, or if exactly 1 TB of physical memory is installed, use the boot options described below to restrict the memory address range of the system:

On Linux operating systems, the NVIDIA Linux driver attempts to identify the scenario where the host system has more memory than a given GPU can address (which is 1 TB on current generation GPUs). If this scenario is detected, the NVIDIA driver falls back to allocations from the 4 GB Direct Memory Access (DMA) zone to avoid address truncation. This means that the driver uses the __GFP_DMA32 flag and limits itself to memory addresses below the 4 GB boundary.

This is done on a per-GPU basis, so limiting one GPU does not limit other GPUs in the system. For example, if an NVIDIA Quadro K6000 (which can address 1 TB) and a Quadro 5000 (which can address 512 GB) are installed in a system with 1 TB of memory, then the Quadro K6000 operates normally with no limitations, while the Quadro 5000 falls back to the 4 GB limit. This is the behavior on R331 drivers starting with 331.93 and R340 drivers starting with 340.28.

It is also possible to use the mem=1024G or max_addr=1024G kernel parameters to limit the amount of system memory that the operating system can access. This restriction is for the case where the system has remapped any physical memory based on the usage of lower memory addresses assigned to Memory Mapped Input/Output (MMIO).
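The per-GPU fallback described above can be illustrated with a minimal sketch. This is not the NVIDIA driver's code: the function name dma_mode, the mode labels, and the choice to compare against the highest physical address in use are assumptions made for illustration only. The two GPU limits are the ones given in the example above.

```python
GIB = 1 << 30
TIB = 1 << 40


def dma_mode(gpu_addressable_bytes: int, highest_host_address: int) -> str:
    """Hypothetical per-GPU decision: 'normal' or 'dma32' (4 GB fallback)."""
    # Addresses are zero-based, so a GPU that can address N bytes
    # reaches physical addresses 0 .. N - 1.
    if highest_host_address < gpu_addressable_bytes:
        return "normal"
    # Host memory extends past what this GPU can address: restrict this
    # GPU's allocations to the zone below 4 GB.
    return "dma32"


# Addressing limits taken from the example in the text.
gpus = {"Quadro K6000": 1 * TIB, "Quadro 5000": 512 * GIB}

# Assumption: a 1 TB host whose memory has not been remapped above 1 TB.
top_of_memory = 1 * TIB - 1

# The decision is made independently for each GPU.
modes = {name: dma_mode(limit, top_of_memory) for name, limit in gpus.items()}
print(modes)
```

Because each GPU is evaluated separately, the Quadro 5000 falling back to the 4 GB zone has no effect on the Quadro K6000 in this sketch, which matches the per-GPU behavior described in the text.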

On VMware ESXi 5.1, ESXi 5.5, and ESXi 6.0, the NVIDIA VMware VIB drivers mimic the behavior of the NVIDIA Linux drivers; that is, they fall back to the 4 GB limit when there is more than 1 TB of system memory.

On Windows Operating Systems (OSs) (Server 2008 R2, Server 2012, and Server 2012 R2) with an NVIDIA Tesla card in the default Tesla Compute Cluster (TCC) mode on systems with 1 TB or more of system memory, the Microsoft Device Manager shows a yellow exclamation mark for the GPU device with a 'This device cannot start. (Code 10)' error after the NVIDIA Windows driver installer has run and completed. NVIDIA Tesla cards in TCC mode therefore do not operate on Windows systems with 1 TB or more of system memory.

On Windows OSs (Server 2008 R2, Server 2012, and Server 2012 R2) with an NVIDIA Tesla, Quadro, or GRID card in Windows Display Driver Model (WDDM) mode on systems with 1 TB or more of system memory, the Microsoft Device Manager indicates 'This device is working properly' after the NVIDIA Windows driver installer has run and completed, even though the system will not be functional.

However, after the system is restarted, the Microsoft Device Manager will indicate 'Windows has stopped this device because it has reported problems. (Code 43)'. On systems with multiple NVIDIA cards, multiple restarts may be required before the Microsoft Device Manager indicates the Code 43 error. This is the behavior on Windows R331 drivers starting with 333.44 and R340 drivers starting with 340.66.

Alternatively, for Microsoft Windows 2003 or prior releases, it is possible to use the /maxmem=1048576 boot option in the boot.ini file. For Microsoft Windows 2008 and later releases, it is possible to use the bcdedit /set {current} truncatememory 0xFFFFFFFFFF boot option.
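As a quick check of the two values above, the following sketch confirms that both correspond to the 1 TB boundary. It assumes that /maxmem takes its value in megabytes and that truncatememory takes a physical address in bytes; the variable names are illustrative only.

```python
MIB = 1 << 20
TIB = 1 << 40

maxmem_mb = 1048576           # value given to /maxmem (assumed to be in MB)
truncate_addr = 0xFFFFFFFFFF  # value given to truncatememory (assumed byte address)

# 1048576 MB is exactly 1 TB.
maxmem_bytes = maxmem_mb * MIB
print(maxmem_bytes == TIB)

# 0xFFFFFFFFFF is the last byte address below 1 TB and fits in 40 bits.
print(truncate_addr == TIB - 1)
print(truncate_addr.bit_length())
```

Both options therefore keep the operating system from using physical memory at or above the 1 TB boundary that the affected cards cannot address.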

Additional information

The NVIDIA Grid Kx, Tesla Kxx, and Quadro Kxxxx/Mxxxx cards were designed with a memory addressing capability of 40 bits, which means they can address memory locations up to 1 TB. If system memory exceeds 1 TB, the NVIDIA card may truncate memory addresses that exceed 1 TB, causing the card to access incorrect memory locations. This limitation will be documented on the ServerProven page for the affected systems under the NVIDIA adapters.
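The effect of a 40-bit address limit can be sketched as follows. The sketch assumes that truncation simply discards the address bits above bit 39; the actual hardware behavior is not documented here, and the names ADDRESS_BITS and truncate are illustrative.

```python
ADDRESS_BITS = 40                # addressing capability stated in the text
MASK = (1 << ADDRESS_BITS) - 1   # keeps only the low 40 bits


def truncate(addr: int) -> int:
    """Model of address truncation: drop every bit above bit 39."""
    return addr & MASK


below_limit = (1 << 40) - 4096   # last 4 KB page below 1 TB
above_limit = (1 << 40) + 4096   # a page just above 1 TB

# An address below 1 TB is unchanged.
print(hex(truncate(below_limit)))

# An address above 1 TB wraps to a low address, so the card would
# access an unrelated location near the start of memory.
print(hex(truncate(above_limit)))
```

Under this assumption, an access intended for memory above 1 TB lands on a valid but wrong low address, which is consistent with the undetected data corruption risk described in the Symptom section.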

Note: With exactly 1 TB of physical memory installed in the system, the memory that overlaps Low Memory-mapped I/O (MMIO) is reclaimed above the Top of High Memory, which can shift the addresses of physical memory above the 1 TB addressing limit of the NVIDIA cards. Using the boot options described above in the Workaround section can prevent physical memory from receiving addresses above the 1 TB addressing limit of the NVIDIA cards.

Address Range                      Memory
0 to 3 GB/2 GB/1 GB                Low Memory (3 GB or 2 GB or 1 GB)
3 GB/2 GB/1 GB to 4 GB             Low MMIO (1 GB or 2 GB or 3 GB)
4 GB to 1 TB + 1 GB/2 GB/3 GB      High Memory
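The address ranges above follow from simple arithmetic, sketched below for a system with exactly 1 TB installed. The function name top_of_high_memory and the layout model (low memory ends where Low MMIO begins, and all remaining memory is placed from 4 GB upward) are assumptions for illustration.

```python
GIB = 1 << 30
TIB = 1 << 40


def top_of_high_memory(installed: int, low_mmio: int) -> int:
    """Return the end address of High Memory after remapping (illustrative)."""
    # Low Memory occupies the space below 4 GB not claimed by Low MMIO.
    low_memory = 4 * GIB - low_mmio
    # The remaining installed memory is placed starting at 4 GB.
    high_memory = installed - low_memory
    return 4 * GIB + high_memory


for mmio_gib in (1, 2, 3):
    top = top_of_high_memory(1 * TIB, mmio_gib * GIB)
    # How far High Memory extends past the 1 TB limit of the cards.
    overshoot_gib = (top - TIB) // GIB
    print(f"Low MMIO {mmio_gib} GB: High Memory ends at 1 TB + {overshoot_gib} GB")
```

In each case the amount of memory pushed above 1 TB equals the size of the Low MMIO region, which is why a system with exactly 1 TB installed is still affected.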

 

Document Location

Worldwide

Operating System

System x Hardware Options:Operating system independent / None

Lenovo x86 servers:Operating system independent / None


Document Information

Modified date:
30 January 2019

UID

ibm1MIGR-5096047