Planning for foundation model tuning in IBM watsonx.ai

Customizing a foundation model by tuning it requires extra resources, such as more GPUs and storage. The amount of extra resources depends on the tuning method that you plan to use, the size of the base foundation model, and how you configure the tuning job.

Hardware requirements for foundation model tuning

When you determine resource requirements for tuning a foundation model, account for the extra GPUs that the tuning experiment requires and the storage that the newly created tuned model asset requires. The following table shows general resource requirements for the tuning methods that are supported in IBM watsonx.ai.

Note: The average resource requirements for the tuning methods are based on NVIDIA A100 GPUs with 80 GB of GPU memory. Adjust the number of GPUs based on the GPU types in your deployment.

Table 1. Average resource requirements by tuning method
Tuning method	Average resource requirements	Supported GPU types
Full fine tuning	4 GPUs	NVIDIA A100, NVIDIA H100, NVIDIA H200, NVIDIA L40S
Low-rank adaptation (LoRA) fine tuning	2 GPUs	NVIDIA A100, NVIDIA H100, NVIDIA H200, NVIDIA L40S
Quantized low-rank adaptation (QLoRA) fine tuning	1 GPU	NVIDIA A100, NVIDIA H100, NVIDIA H200

Restriction: You cannot tune foundation models with Intel Gaudi 3 AI Accelerator GPUs. Regardless of the tuning method that you choose, you cannot partition any of the GPUs that will be used for tuning a foundation model.
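
To make Table 1 and the restriction easier to apply during planning, the following minimal Python sketch encodes them as a lookup. The names `BASELINE_GPUS`, `SUPPORTED_GPU_TYPES`, and `baseline_gpus` are hypothetical and are not part of any watsonx.ai API. The values are copied from Table 1 and therefore reflect averages for NVIDIA A100 GPUs with 80 GB. The sketch does not try to scale the count for other GPU types, because no scaling rule is given here.

```python
# Average GPU counts from Table 1 (measured on NVIDIA A100, 80 GB).
# All names in this sketch are illustrative, not a product API.
BASELINE_GPUS = {
    "full": 4,
    "lora": 2,
    "qlora": 1,
}

# Supported GPU types per tuning method, as listed in Table 1.
# Intel Gaudi 3 accelerators are absent because they cannot be used for tuning.
SUPPORTED_GPU_TYPES = {
    "full": {"A100", "H100", "H200", "L40S"},
    "lora": {"A100", "H100", "H200", "L40S"},
    "qlora": {"A100", "H100", "H200"},
}


def baseline_gpus(method: str, gpu_type: str) -> int:
    """Return the Table 1 average GPU count for a method and GPU type.

    Raises ValueError if the method is unknown or the GPU type is not
    listed as supported for that method.
    """
    key = method.lower()
    if key not in BASELINE_GPUS:
        raise ValueError(f"Unknown tuning method: {method!r}")
    if gpu_type.upper() not in SUPPORTED_GPU_TYPES[key]:
        raise ValueError(
            f"{gpu_type} is not listed as supported for {method} tuning"
        )
    return BASELINE_GPUS[key]


if __name__ == "__main__":
    print(baseline_gpus("lora", "H100"))
```

Treat the returned number as a starting point only. The factors that follow can raise the actual requirement well above the average.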

The actual GPU resources required vary significantly based on the following factors:

Base model size
For example, applying LoRA fine tuning to a 70 billion parameter model with a maximum sequence length of 1,024 and a batch size of 32 requires 8 GPUs.
Batch size for the tuning job
For example, for an 8 billion parameter model:
- You can use 1 GPU if you set the batch size to 2 and the maximum sequence length to 512.
- If you set the batch size to 128 and the maximum sequence length to 8,192, you might need as many as 8 GPUs.
Maximum sequence length, also referred to as context window length
For example, for an 8 billion parameter model:
- To run full fine tuning with a maximum sequence length of 32,000 and batch size of 1, you need 2 GPUs.
- To run LoRA fine tuning with a maximum sequence length of 64,000 and batch size of 1, you need 2 GPUs.
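
The examples above can be collected into a small Python sketch that shows how batch size and maximum sequence length combine. The `Example` type and `tokens_per_batch` helper are hypothetical names introduced for illustration. Using batch size multiplied by maximum sequence length as a rough indicator of per-step load is an assumption of this sketch, not a sizing formula from the product. The tuning method is recorded as `None` where the example does not state it.

```python
from typing import NamedTuple, Optional


class Example(NamedTuple):
    params_billion: int      # base model size, in billions of parameters
    method: Optional[str]    # None where the example does not name a method
    batch_size: int
    max_seq_len: int
    gpus: int                # GPU count stated in the example


# Data points copied from the examples in this section.
DOCUMENTED_EXAMPLES = [
    Example(70, "lora", 32, 1024, 8),
    Example(8, None, 2, 512, 1),
    Example(8, None, 128, 8192, 8),   # stated as "as many as 8 GPUs"
    Example(8, "full", 1, 32000, 2),
    Example(8, "lora", 1, 64000, 2),
]


def tokens_per_batch(example: Example) -> int:
    """Upper bound on tokens processed per batch (assumed proxy for load)."""
    return example.batch_size * example.max_seq_len


if __name__ == "__main__":
    for ex in DOCUMENTED_EXAMPLES:
        method = ex.method or "unspecified"
        print(
            f"{ex.params_billion}B, {method}: "
            f"{tokens_per_batch(ex):,} tokens per batch -> {ex.gpus} GPU(s)"
        )
```

Comparing the rows suggests that the GPU count does not follow from any single setting. Model size, tuning method, batch size, and sequence length all contribute, so validate an estimate against a trial tuning job where possible.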

Tuning a foundation model

When you add models that can be fine tuned to your deployment, the models are downloaded to the cluster storage. However, unlike most other foundation models, no inference servers are started for the models. As a result, these foundation models cannot be inferenced right away. If you want to inference a foundation model before you tune the model, you can deploy the model as a custom foundation model and inference the model deployment. For a list of foundation models that can be tuned, see Foundation models available for tuning.

When you tune a foundation model, the model is loaded onto a node with access to the required number of GPUs so that the tuning job can access and adjust the model. You can apply the full fine tuning method to a foundation model by using the API or from the Tuning Studio in the product UI. After the model is tuned, you can deploy the tuned model and inference the deployment. For LoRA or QLoRA fine tuning, you must deploy both a new instance of the base model and the tuned model adapters in the same deployment space.

For more information about these tuning methods, see Methods for tuning foundation models.