GPU environments (Jupyter Notebooks with Python with GPU)

With GPU environments, you can reduce the training time that is needed to run compute-intensive machine learning models. With more computing power, you can run more training iterations when you are fine-tuning your machine learning models. GPU environments are available for Python only.

This service is not available by default. An administrator must install the service. To determine whether the service is installed, click your avatar and then click About > Version details. If the service is installed and ready to use, it is marked as Deployed.

NVIDIA Multi-Instance GPU (MIG) enables partitioning of a physical GPU into multiple smaller, independent instances, which are known as MIG devices. These MIG devices offer isolation and allocation capabilities, enabling efficient resource sharing.

For key aspects of using NVIDIA MIG, see Using MIG Devices in notebooks and scripts.

GPU environment templates

To use a GPU environment, you must create a new GPU environment template.

You must have the Admin or Editor role within the project to create an environment template.

To create a GPU environment template:

  1. From the Manage tab of your project, select the Environments page, then under Templates, click New template.

  2. Enter a name and a description.

  3. Select the GPU environment configuration type.

  4. Select the hardware configuration. Choose a size that matches the complexity of your model operations, the number of training iterations that you want to run, and your available resources. The default size is 1 GPU, 1 vCPU, and 2 GB RAM.

  5. Select the Python software version.

The environment template details are displayed. You can change your hardware settings by hovering over the setting.

The GPU environments with Python 3.11 include data science libraries from the 24.1 Runtime release that work with the NVIDIA CUDA Toolkit 12.2.0.
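To check which GPU or MIG devices are visible to a runtime, one option is to parse the output of nvidia-smi -L. The following is a minimal sketch, assuming nvidia-smi is available on the runtime's PATH; the sample listing and its UUIDs are illustrative, not real output:

```python
import re

def list_visible_device_uuids(listing: str) -> list:
    """Extract device UUIDs from `nvidia-smi -L` style output."""
    return re.findall(r"\(UUID:\s*([^)\s]+)\)", listing)

# Illustrative listing; real UUIDs are longer
sample = (
    "GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-1234abcd)\n"
    "  MIG 1g.5gb Device 0: (UUID: MIG-5678efgh)\n"
)
print(list_visible_device_uuids(sample))  # ['GPU-1234abcd', 'MIG-5678efgh']

# In a runtime you would capture the listing with, for example:
# subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout
```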

You can add your own custom libraries in addition to the libraries that are preinstalled for you. To do this, create a customization. See Customizing environment templates.

Using GPU environments in notebooks

After you create a GPU environment template, you can start assigning this environment to notebooks.

In a project, you can run more than one notebook that uses the same GPU environment template. This means that if you open a second notebook with the same environment template in the same project, a second kernel is started in the same runtime. The runtime resources are shared by the Jupyter kernels that you start in the runtime. The runtime is started per single user and not per notebook.

Using MIG Devices in notebooks and scripts

When you use MIG devices, be aware of these limitations:

  • By design, a single CUDA process can use a single MIG device. This means that parallel execution of CUDA processes across multiple MIG devices is not supported. Each process must be assigned to a specific MIG device. This is why multiple MIG slices cannot be easily used in a distributed manner.
  • watsonx currently supports only the single-process strategy, which exposes MIG instances as standard GPUs to OpenShift.
  • If your cluster contains other GPUs that do not provide MIG support, you must taint these nodes and work with a custom runtime definition that contains tolerations. This way you ensure that users do not accidentally select a complete GPU instead of a single MIG device.

When you use a single MIG device, you can apply the standard workflow for working with GPUs. The following snippet shows how to use the device parameter in PyTorch:

import torch

# Allocate the tensor directly on the GPU (the MIG device is exposed as 'cuda')
zeros_tensor_gpu = torch.zeros((50, 50), device='cuda')

Alternatively, you can construct a device object with torch.device("cuda:0") and pass it as the device argument.

If you use TensorFlow, you can use the tf.device context manager:

import tensorflow as tf

# Operations created inside this context are placed on the first visible GPU
with tf.device('/GPU:0'):
  a = tf.constant([[1.0, 2.0], [4.0, 5.0]])

Note: It is not recommended to assign more than one MIG device to a runtime. Instead, you can reconfigure MIG profiles to provide more computing power and VRAM.

Accessing MIG UUIDs programmatically with multiple MIGs per runtime

If a runtime requests more than one MIG device, you must configure which MIG device each CUDA process uses. To do that, use the CUDA_VISIBLE_DEVICES environment variable. The following code snippet programmatically queries the NVIDIA Management Library (NVML) for all available MIG UUIDs (it requires the py3nvml library to be installed):

import os
from py3nvml.py3nvml import (
    nvmlInit,
    nvmlShutdown,
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetUUID,
)

nvmlInit()
device_count = nvmlDeviceGetCount()

for device_index in range(device_count):
    handle = nvmlDeviceGetHandleByIndex(device_index)
    device_uuid = nvmlDeviceGetUUID(handle)

    # Set CUDA_VISIBLE_DEVICES to this UUID before starting the subprocess:
    # os.environ["CUDA_VISIBLE_DEVICES"] = device_uuid

nvmlShutdown()

Note: One CUDA process can use only one MIG device at a time. Therefore, you must pass the MIG UUID to the respective subprocess.
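The following sketch shows one way to hand a MIG UUID to a child process through CUDA_VISIBLE_DEVICES. The UUID here is a placeholder; in a runtime you would use a UUID that NVML returned:

```python
import os
import subprocess
import sys

# Placeholder UUID for illustration; use a real MIG UUID queried from NVML
mig_uuid = "MIG-abcdef12-3456-7890-abcd-ef1234567890"

env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = mig_uuid  # the child process sees only this device

# A real child process would import torch or TensorFlow and run the CUDA
# workload; here it simply reports the device that it was assigned.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
    env=env,
    capture_output=True,
    text=True,
)
print(result.stdout.strip())
```

Because the environment variable is set before the child process initializes CUDA, each subprocess is pinned to exactly one MIG device, which matches the single-process limitation described above.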

Parent topic: Environments