Requirements for deploying custom foundation models on MIG-enabled clusters

Review the considerations and requirements for deploying a custom foundation model on an MIG-enabled cluster.

You can deploy custom foundation models on an MIG-enabled cluster in both the lightweight and full-service watsonx.ai™ installation modes. As you prepare to deploy a custom foundation model, review these requirements:

Hardware requirements

The standard supported hardware configurations to deploy custom foundation models on MIG-enabled clusters are as follows:
  • NVIDIA A100 GPUs with 80 GB RAM
  • NVIDIA H100 GPUs with 80 GB RAM
  • NVIDIA H200 GPUs with 141 GB RAM
If your GPU configuration is different (for example, NVIDIA H100 GPUs with 40 GB RAM), you must create a custom hardware specification.
Restriction: You cannot use NVIDIA L40S GPUs with 48 GB RAM to deploy custom foundation models on MIG-enabled clusters.
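
As a rough way to compare a node's GPUs against the list above, the following sketch classifies each GPU by name and memory size. The function name classify_gpu, the result labels, and the memory thresholds (in MiB) are illustrative assumptions, not part of the product. The nvidia-smi query runs only where that tool is available.

```shell
#!/bin/sh
# Hypothetical helper: classify a GPU as "standard", "non-standard"
# (needs a custom hardware specification), or "unsupported".
# The MiB thresholds are assumptions chosen to sit just below the
# advertised 80 GB and 141 GB capacities.
classify_gpu() {
  local name="$1" mem_mib="$2"
  case "$name" in
    *L40S*)
      echo "unsupported" ;;
    *A100*|*H100*)
      if [ "$mem_mib" -ge 80000 ]; then echo "standard"; else echo "non-standard"; fi ;;
    *H200*)
      if [ "$mem_mib" -ge 140000 ]; then echo "standard"; else echo "non-standard"; fi ;;
    *)
      echo "non-standard" ;;
  esac
}

# Query the local GPUs only when nvidia-smi is present on this host.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits |
    while IFS=, read -r name mem; do
      # Strip the space that follows the comma in the CSV output.
      echo "${name}: $(classify_gpu "$name" "${mem# }")"
    done
fi
```

A result of "non-standard" only indicates that the GPU is outside the standard list and would need a custom hardware specification. It does not confirm that the GPU supports MIG.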

Configuring MIG support to deploy custom foundation models

The cluster administrator must perform the following tasks to deploy custom foundation models on MIG-enabled clusters:
  1. Enable MIG partitioning on the required GPU nodes at the cluster level. To learn more about MIG partitioning, see the NVIDIA documentation for configuring MIG Support in OpenShift Container Platform. The MIG profiles and their corresponding resource names are as follows:
      • 1g.10gb: nvidia.com/mig-1g.10gb
      • 2g.20gb: nvidia.com/mig-2g.20gb
      • 3g.40gb: nvidia.com/mig-3g.40gb
      • 7g.80gb: nvidia.com/mig-7g.80gb
  2. Validate and add support for the NVIDIA MIG single strategy. With the single strategy, all GPUs on a node are partitioned into MIG instances of the same fixed size. For more information, see Configuring single strategy for MIG support.
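
The profile-to-resource-name pattern in the list above can be expressed as a small helper, together with a query that shows which nvidia.com resources a node currently reports. This is a minimal sketch: the function names are hypothetical, the query assumes that oc and jq are installed, and which resource names a node actually reports depends on the configured MIG strategy.

```shell
#!/bin/sh
# Hypothetical helper: map a MIG profile name to the resource name
# pattern shown in the list above.
mig_resource_name() {
  printf 'nvidia.com/mig-%s\n' "$1"
}

# List every nvidia.com/* resource that a node reports as allocatable.
# Requires oc and jq; pass the node name as the first argument.
list_gpu_resources() {
  oc get "node/$1" -o json |
    jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/")))'
}

# Usage:
#   mig_resource_name 3g.40gb
#   list_gpu_resources myworker.redhat.com
```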

Configuring single strategy for MIG support

Follow these steps to configure the single strategy for MIG support:
  1. Set the MIG advertisement strategy to single.

    Specify the host name, strategy, and configuration label in environment variables.

    NODE_NAME=myworker.redhat.com
    STRATEGY=single
    MIG_CONFIGURATION=all-3g.40gb
  2. Apply the desired MIG partitioning profile.

    For example, label a node to create two 3g.40gb instances on each GPU with the following command:

    oc label node/${NODE_NAME} nvidia.com/mig.config=${MIG_CONFIGURATION} --overwrite
  3. Verify the MIG configuration:
    1. Confirm that the correct label is applied to the node:
      oc get node/${NODE_NAME} -o json | jq '.metadata.labels'
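    2. Check that the configuration was applied successfully by listing the GPU and MIG devices. Run this command where the NVIDIA driver is available, for example inside the NVIDIA driver pod on the node: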
      nvidia-smi -L
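
The steps above can be sketched as a set of shell functions. This is an illustration under stated assumptions, not the product's own tooling: it assumes that the GPU Operator ClusterPolicy is named gpu-cluster-policy, that the strategy is stored at spec.mig.strategy, and that the MIG manager reports progress in the nvidia.com/mig.config.state node label. The function names are hypothetical. Confirm the resource and label names in your cluster before you rely on them.

```shell
#!/bin/sh
# Assumption: the ClusterPolicy name; override it if yours differs.
CLUSTER_POLICY="${CLUSTER_POLICY:-gpu-cluster-policy}"

# Build the JSON patch that sets spec.mig.strategy.
build_strategy_patch() {
  printf '[{"op": "replace", "path": "/spec/mig/strategy", "value": "%s"}]' "$1"
}

# Step 1: set the MIG advertisement strategy on the ClusterPolicy.
set_mig_strategy() {
  oc patch "clusterpolicy/${CLUSTER_POLICY}" --type=json -p "$(build_strategy_patch "$1")"
}

# Step 2: label the node with the MIG partitioning profile.
apply_mig_config() {
  oc label "node/$1" "nvidia.com/mig.config=$2" --overwrite
}

# Step 3: poll the state label until it reads "success" or "failed".
# Arguments: node name, number of attempts, delay in seconds.
wait_for_mig_state() {
  local node="$1" attempts="${2:-30}" delay="${3:-10}"
  local state i=0
  while [ "$i" -lt "$attempts" ]; do
    state="$(oc get "node/${node}" \
      -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}')"
    case "$state" in
      success) return 0 ;;
      failed)  echo "MIG configuration failed on ${node}" >&2; return 1 ;;
    esac
    sleep "$delay"
    i=$((i + 1))
  done
  echo "Timed out waiting for MIG configuration on ${node}" >&2
  return 1
}

# Usage, with the environment variables from step 1:
#   set_mig_strategy "$STRATEGY"
#   apply_mig_config "$NODE_NAME" "$MIG_CONFIGURATION"
#   wait_for_mig_state "$NODE_NAME"
```

Polling the state label before running nvidia-smi -L avoids checking the device list while repartitioning is still in progress.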

To learn more about configuring the single strategy for MIG support, see Example of configuring single strategy for MIG in the NVIDIA documentation.