Configuring MIG support in Red Hat OpenShift

Beginning with Cloud Pak for Data version 4.8.4, Watson Machine Learning supports GPU inferencing for single-model deployments. This feature is not available in Cloud Pak for Data versions 4.8.3 and earlier.

Red Hat OpenShift Container Platform provides a platform for configuring and using GPU resources. The NVIDIA GPU Operator provisions the necessary software components for the GPU, such as the NVIDIA drivers that enable CUDA. To use CUDA software specifications for your deployment, you must configure NVIDIA Multi-Instance GPU (MIG) support in a Red Hat OpenShift® cluster. To configure MIG support, see the NVIDIA guide for configuring MIG support.

Configuring MIG profiles within a cluster

To enable different MIG profiles, assign a MIG profile to each node and update the runtime definition. This applies to CUDA-enabled runtime definitions only.

Assigning MIG profile to nodes

To assign a MIG profile to a node, label the node by using the following command:

oc label nodes node1 nvidia.com/mig.config=all-1g.10gb --overwrite=true
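
If you need to apply the same label to several nodes, the command can be scripted. The following sketch builds the same oc label command for each node and runs it; it assumes that the oc CLI is on your PATH and that you are logged in to the cluster, and the node names node1 and node2 are hypothetical placeholders:

```python
import shutil
import subprocess

def mig_label_command(node: str, profile: str) -> list:
    """Build the oc command that assigns a MIG profile label to a node."""
    return ["oc", "label", "nodes", node,
            f"nvidia.com/mig.config={profile}", "--overwrite=true"]

# Hypothetical node names; replace with the GPU nodes in your cluster.
nodes = ["node1", "node2"]
commands = [mig_label_command(n, "all-1g.10gb") for n in nodes]

# Run the commands only if the oc CLI is available on PATH.
if shutil.which("oc"):
    for cmd in commands:
        subprocess.run(cmd, check=True)
```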

Accessing supported MIG profiles

To find the list of supported MIG profiles for your GPU, see the mig-parted-config configmap in the GPU Operator namespace.
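
The configmap data is a YAML document whose mig-configs section maps each profile name to its device configuration. As a minimal sketch, the following Python function extracts the profile names from such a document without a YAML parser; the sample_config fragment is illustrative only and assumes the two-space indentation used by the default configuration (retrieve the real document from your cluster):

```python
# A fragment in the shape of the mig-parted configuration (illustrative only).
sample_config = """\
version: v1
mig-configs:
  all-disabled:
    - devices: all
      mig-enabled: false
  all-1g.10gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        1g.10gb: 7
"""

def mig_profile_names(config_text: str) -> list:
    """Collect the profile names: keys indented one level under mig-configs."""
    names, in_configs = [], False
    for line in config_text.splitlines():
        if line.startswith("mig-configs:"):
            in_configs = True
            continue
        if in_configs:
            if line and not line.startswith(" "):
                break  # left the mig-configs block
            if (line.startswith("  ") and not line.startswith("   ")
                    and line.rstrip().endswith(":")):
                names.append(line.strip().rstrip(":"))
    return names

print(mig_profile_names(sample_config))  # → ['all-disabled', 'all-1g.10gb']
```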

The standard setup uses a single MIG profile across the entire Cloud Pak for Data cluster and does not require any custom runtime definitions to be configured. To use a standard setup, label all nodes with the same MIG profile.

Following the setup, you can start a GPU runtime and select a single GPU to get a MIG device assigned.

Updating runtime definition for nodes

To update the runtime definition, follow these steps:

  1. Download the runtime definition for the GPU runtime (for example, runtime-23.1-py3.10-cuda). For more information, see Downloading the runtime configuration.

  2. In the runtime definition, add the nodeAffinity property to specify the MIG profile:

    "nodeAffinity": {
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [
                {
                    "matchExpressions": [
                        {
                            "key": "nvidia.com/mig.config",
                            "operator": "In",
                            "values": ["all-1g.10gb"]
                        }
                    ]
                }
            ]
        }
    }
    
  3. Update the runtime definition by using the service ID credentials:

    a. To get the service ID credentials, find the namespace that contains the wdp-service-id secret:

    oc get secret -A | grep wdp-service-id
    

    b. Get the required service-id-credentials token:

    oc get secret -n <NAMESPACE> wdp-service-id -o jsonpath='{.data.service-id-credentials}' | base64 --decode
    

    c. Update the runtime definition by sending a PUT request to /v2/runtime_definitions/<runtime_id>.

    The following Python code updates the runtime definition. The <runtime_id> is the ID of the runtime definition that is being updated, and new_rd is the updated runtime definition JSON.

    import requests

    # Replace <service-id-credentials> with the token from the previous step
    headers = {'Authorization': 'Basic <service-id-credentials>',
               'Content-Type': 'application/json'}

    response = requests.put(
        f"{CPD_URL}/v2/runtime_definitions/<runtime_id>",
        json=new_rd,
        headers=headers,
        verify=False)
    

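The sub-steps above can be combined into a small script. The following sketch assumes that the decoded service ID credentials are plain user:password text (if your decoded secret is already a Basic token, use it directly) and that nodeAffinity is a top-level property of the runtime definition, as in the snippet in step 2. The helper names basic_auth_header and add_mig_affinity are hypothetical:

```python
import base64
import json

def basic_auth_header(credentials: str) -> dict:
    """Build a Basic Authorization header from `user:password` credentials."""
    token = base64.b64encode(credentials.encode()).decode()
    return {"Authorization": f"Basic {token}",
            "Content-Type": "application/json"}

# The nodeAffinity property from step 2, with a placeholder profile value.
MIG_AFFINITY = {
    "requiredDuringSchedulingIgnoredDuringExecution": {
        "nodeSelectorTerms": [{
            "matchExpressions": [{
                "key": "nvidia.com/mig.config",
                "operator": "In",
                "values": ["all-1g.10gb"],
            }]
        }]
    }
}

def add_mig_affinity(runtime_definition: dict, profile: str) -> dict:
    """Return a copy of the runtime definition with nodeAffinity for `profile`."""
    rd = json.loads(json.dumps(runtime_definition))      # deep copy
    affinity = json.loads(json.dumps(MIG_AFFINITY))      # deep copy
    affinity["requiredDuringSchedulingIgnoredDuringExecution"][
        "nodeSelectorTerms"][0]["matchExpressions"][0]["values"] = [profile]
    rd["nodeAffinity"] = affinity
    return rd
```

With these helpers, the PUT request in step 3c becomes requests.put(f"{CPD_URL}/v2/runtime_definitions/<runtime_id>", json=add_mig_affinity(rd, "all-1g.10gb"), headers=basic_auth_header(credentials), verify=False).
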
After the custom runtime definitions are updated, you can create deployments that are scheduled on the nodes that offer the MIG profile specified in the runtime definition.

Parent topic: Frameworks and software specifications in Watson Machine Learning