Troubleshooting
Problem
Plan carefully before ordering or installing the NVIDIA GPGPU adapters listed in the Affected configurations section into any system. These precautions are designed to prevent undetected data corruption or general system instability. Due to memory addressing limitations of the NVIDIA Grid Kx, NVIDIA Tesla Kxx, and NVIDIA Quadro Kxxxx cards, the maximum amount of system memory that can be used when one or more of these NVIDIA cards is installed in the system is one terabyte (1 TB). Any product that can support 1 TB of system memory is affected.
Resolving The Problem
Source
RETAIN tip: H213010
Symptom
Plan carefully before ordering or installing the NVIDIA general-purpose computing on graphics processing units (GPGPU) adapters listed in the Affected configurations section into any system. These precautions are designed to prevent undetected data corruption or general system instability.
Due to memory addressing limitations of the NVIDIA Grid Kx, NVIDIA Tesla Kxx, and NVIDIA Quadro Kxxxx cards, the maximum amount of system memory that can be used when one or more of these NVIDIA cards is installed in the system is one terabyte (1 TB).
Any product that can support 1 TB of system memory is affected.
Affected configurations
The system is configured with one or more of the following IBM options:
- NVIDIA 16 GB VGX K1 PCI-Express Compute Card, option part number 00J6160, any replacement part number
- NVIDIA 8 GB VGX K2 PCI-Express X16 Compute Card, option part number 00J6161, any replacement part number
- NVIDIA Grid K1, option part number 00FP671, any replacement part number 90Y2432
- NVIDIA Grid K2, option part number 00FP674, any replacement part number 90Y2395
- NVIDIA Quadro K2000, option part number 90Y2379, any replacement part number
- NVIDIA Quadro K4000, option part number 00FP675, any 90Y2375
- NVIDIA Quadro K4000, option part number 90Y2375, any replacement part number
- NVIDIA Quadro K5000, option part number 90Y2387, any replacement part number
- NVIDIA Quadro K600, option part number 90Y2383, any replacement part number
- NVIDIA Quadro K6000, option part number 00FP672, any 90Y2371
- NVIDIA Tesla K10 GPU Adapter, option part number 00D4192, any replacement part number
- NVIDIA Tesla K20 GPU Adapter, option part number 47C2119, any NVIDIA Tesla K20
- NVIDIA Tesla K20, option part number 00FP673, any 90Y2391
- NVIDIA Tesla K20, option part number 00J6164, replacement part number 90Y2346
- NVIDIA Tesla K20X GPU Adapter, option part number 00J6165, any NVIDIA Tesla K20X
- NVIDIA Tesla K40, option part number 00FL133, any replacement part number 90Y2412
- NVIDIA Tesla K40, option part number 00FP676, any 90Y2408
- NVIDIA Tesla K40, option part number 47C2137, any replacement part number 90Y2412
This tip is not system specific.
This tip is not software specific.
The NVIDIA device driver for the NVIDIA Grid, NVIDIA Tesla, or NVIDIA Quadro card is affected.
Solution
This is a design limitation of the NVIDIA Grid Kx, NVIDIA Tesla Kxx, and NVIDIA Quadro Kxxxx cards. Therefore, this is a permanent restriction associated with these NVIDIA cards.
Workaround
Remove system Dual In-line Memory Modules (DIMMs) from the system to reduce the total amount of installed system memory to less than one (1) TB.
Alternatively, or if exactly 1 TB of physical memory is installed, use the boot options described below to restrict the memory address range of the system:
On Linux operating systems, the NVIDIA Linux driver attempts to identify the scenario where the host system has more memory than a given GPU can address (which is 1 TB on current generation GPUs). If this scenario is detected, the NVIDIA driver falls back to allocations from the 4 GB Direct Memory Access (DMA) zone to avoid address truncation. This means that the driver uses the __GFP_DMA32 flag and limits itself to memory addresses below the 4 GB boundary.
The fallback is applied on a per-GPU basis, so limiting one GPU does not limit other GPUs in the system. For example, if an NVIDIA Quadro K6000 (which can address 1 TB) and a Quadro 5000 (which can address 512 GB) are installed in a system with 1 TB of memory, the Quadro K6000 operates normally with no limitations, while the Quadro 5000 falls back to the 4 GB limit. This is the behavior on R331 drivers starting with 331.93 and R340 drivers starting with 340.28.
It is also possible to use the mem=1024G or max_addr=1024G kernel parameters to limit the amount of system memory that the operating system can access. Limiting memory this way covers the case where the system has remapped physical memory to higher addresses because lower memory addresses are assigned to Memory Mapped Input/Output (MMIO).
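One way to see whether a Linux host actually has RAM mapped at or above the 1 TB boundary is to look at the top-level "System RAM" ranges in /proc/iomem. The following is a minimal sketch, not part of the original tip: the function names and the sample text are hypothetical, and the sample models a 1 TB system with a 1 GB low MMIO hole.

```python
import re

ONE_TB = 1 << 40  # 40-bit addressing limit described in this tip

# Top-level /proc/iomem lines look like "100000000-1003fffffff : System RAM".
# Nested entries are indented and are deliberately not matched.
_RAM_LINE = re.compile(r"^([0-9a-fA-F]+)-([0-9a-fA-F]+) : System RAM\s*$")

# Hypothetical sample: 3 GB of low memory, then high memory from 4 GB
# up to 1 TB + 1 GB (the last byte is at 0x1003fffffff).
SAMPLE_IOMEM = """\
00001000-0009fbff : System RAM
00100000-bfffffff : System RAM
100000000-1003fffffff : System RAM
"""

def highest_ram_address(iomem_text):
    """Return the highest physical address covered by System RAM, or None."""
    highest = None
    for line in iomem_text.splitlines():
        match = _RAM_LINE.match(line)
        if match:
            end = int(match.group(2), 16)
            if highest is None or end > highest:
                highest = end
    return highest

def exceeds_gpu_limit(iomem_text, limit=ONE_TB):
    """True when any System RAM byte sits at or above the addressing limit."""
    top = highest_ram_address(iomem_text)
    return top is not None and top >= limit

print(exceeds_gpu_limit(SAMPLE_IOMEM))  # True
```

On a real host, the text would come from reading /proc/iomem as root, since unprivileged reads typically show zeroed addresses.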
On VMware ESXi 5.1, ESXi 5.5, and ESXi 6.0, the NVIDIA VMware VIB drivers mimic the behavior of the NVIDIA Linux drivers; that is, they fall back to the 4 GB limit when there is more than 1 TB of system memory.
On Windows operating systems (OSs) (Server 2008 R2, Server 2012, and Server 2012 R2) with an NVIDIA Tesla card in the default Tesla Compute Cluster (TCC) mode on systems with 1 TB or more of system memory, once the NVIDIA Windows driver installer has completed, the Microsoft Device Manager shows a yellow exclamation mark for the GPU device with the error 'This device cannot start. (Code 10)'. NVIDIA Tesla cards in TCC mode therefore do not operate on Windows systems with 1 TB or more of system memory.
On Windows OSs (Server 2008 R2, Server 2012, and Server 2012 R2) with an NVIDIA Tesla, Quadro, or GRID card in Windows Display Driver Model (WDDM) mode on systems with 1 TB or more of system memory, once the NVIDIA Windows driver installer has completed, the Microsoft Device Manager indicates 'This device is working properly' even though the system is not functional.
However, after the system is restarted, the Microsoft Device Manager indicates 'Windows has stopped this device because it has reported problems. (Code 43)'. On systems with multiple NVIDIA cards, multiple restarts may be required before the Microsoft Device Manager shows this Code 43 message. This is the behavior on Windows R331 drivers starting with 333.44 and R340 drivers starting with 340.66.
Alternatively, for Microsoft Windows 2003 or prior releases, it is possible to use the /maxmem=1048576 boot option in the boot.ini file. For Microsoft Windows 2008 and later releases, it is possible to use the bcdedit /set {current} truncatememory 0xFFFFFFFFFF boot option.
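The two Windows values above both correspond to the 1 TB boundary. The short check below is only an arithmetic illustration; it assumes that /maxmem is expressed in megabytes and that truncatememory takes a physical address, and the variable names are hypothetical.

```python
MIB = 1 << 20  # bytes in one megabyte (binary)
TIB = 1 << 40  # bytes in one terabyte (binary)

maxmem_mb = 1048576           # value passed to /maxmem in boot.ini
truncate_addr = 0xFFFFFFFFFF  # value passed to bcdedit truncatememory

# /maxmem=1048576 caps usable memory at exactly 1 TB.
print(maxmem_mb * MIB == TIB)      # True

# 0xFFFFFFFFFF is the last address that fits in 40 bits (1 TB - 1).
print(truncate_addr == TIB - 1)    # True
print(truncate_addr.bit_length())  # 40
```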
Additional information
The NVIDIA Grid Kx, Tesla Kxx, and Quadro Kxxxx/Mxxxx cards were designed with 40-bit memory addressing, which means they can address memory locations up to one terabyte (1 TB). If system memory exceeds 1 TB, the NVIDIA card may truncate memory addresses above 1 TB, causing the card to access incorrect memory locations. This limitation will be documented on the ServerProven page for the affected systems under the NVIDIA adapters.
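To illustrate why truncation leads to incorrect accesses, the sketch below models a 40-bit device as simply dropping every address bit above bit 39. This is an assumption made for illustration; the tip states only that the card may truncate addresses, not how the hardware does it, and the names used are hypothetical.

```python
ADDRESS_BITS = 40
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1  # 0xFFFFFFFFFF

def truncate_to_device(addr):
    """Model (assumed) truncation: keep only the low 40 bits of the address."""
    return addr & ADDRESS_MASK

# A host buffer located 4 KB above the 1 TB boundary...
host_addr = (1 << 40) + 0x1000

# ...would alias to an unrelated location 4 KB from the start of memory.
print(hex(truncate_to_device(host_addr)))  # 0x1000

# Addresses below 1 TB are unaffected.
print(truncate_to_device(0x12345678) == 0x12345678)  # True
```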
Note: With exactly 1 TB of physical memory installed in the system, memory that overlaps the low Memory-mapped I/O (MMIO) region is reclaimed above the top of high memory, which can shift the addresses of some physical memory above the 1 TB addressing limit of the NVIDIA cards. Using the boot options described in the Workaround section can prevent physical memory from receiving addresses above that limit.
| Address Range | Memory |
|---|---|
| 0 to 3 GB/2 GB/1 GB | Low Memory (3 GB or 2 GB or 1 GB) |
| 3 GB/2 GB/1 GB to 4 GB | Low MMIO (1 GB or 2 GB or 3 GB) |
| 4 GB to 1 TB + 1 GB/2 GB/3 GB | High Memory |
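The layout in the table can be turned into a small calculation of where physical memory ends for a given amount of installed memory and a given low MMIO size. This is a sketch under the assumption that all memory displaced by the low MMIO region is remapped contiguously above 4 GB; the function name is hypothetical.

```python
GB = 1 << 30
TB = 1 << 40

def top_of_memory(installed, low_mmio):
    """Return the address just past the last byte of RAM.

    Low memory occupies 0 .. (4 GB - low_mmio); whatever does not fit
    there is assumed to be placed contiguously starting at 4 GB.
    """
    low_memory = 4 * GB - low_mmio
    if installed <= low_memory:
        return installed  # everything fits below the MMIO region
    return 4 * GB + (installed - low_memory)

# With exactly 1 TB installed, RAM ends above the 1 TB boundary
# by the size of the low MMIO region.
for mmio in (1 * GB, 2 * GB, 3 * GB):
    top = top_of_memory(1 * TB, mmio)
    print((top - TB) // GB, "GB above 1 TB")
```

For the three MMIO sizes this reports 1, 2, and 3 GB above the boundary, matching the last row of the table.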
Document Location
Worldwide
Document Information
Modified date:
30 January 2019
UID
ibm1MIGR-5096047