Planning for foundation model tuning in IBM watsonx.ai
Customizing a foundation model by tuning it requires extra resources, such as more GPUs and storage. The amount of extra resources required depends on the tuning method that you plan to use, the size of the base foundation model, and how you configure the tuning job.
Hardware requirements for foundation model tuning
| Tuning method | Average resource requirements | Supported GPU types |
|---|---|---|
| Full fine tuning | 4 GPUs | NVIDIA A100, NVIDIA H100, NVIDIA H200, NVIDIA L40S |
| Low-rank adaptation (LoRA) fine tuning | 2 GPUs | NVIDIA A100, NVIDIA H100, NVIDIA H200, NVIDIA L40S |
| Quantized low-rank adaptation (QLoRA) fine tuning | 1 GPU | NVIDIA A100, NVIDIA H100, NVIDIA H200 |
The number of GPUs that a tuning job needs varies with the following factors:
- Base model size
  For example, applying LoRA fine tuning to a 70 billion parameter model with a maximum sequence length of 1,024 and a batch size of 32 requires 8 GPUs.
- Batch size for the tuning job
  For example, for an 8 billion parameter model:
  - You can use 1 GPU if you set the batch size to 2 and the maximum sequence length to 512.
  - If you set the batch size to 128 and the maximum sequence length to 8,192, you might need as many as 8 GPUs.
- Maximum sequence length, also referred to as context window length
  For example, for an 8 billion parameter model:
  - To run full fine tuning with a maximum sequence length of 32,000 and a batch size of 1, you need 2 GPUs.
  - To run LoRA fine tuning with a maximum sequence length of 64,000 and a batch size of 1, you need 2 GPUs.
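For capacity planning, the table and the worked examples above can be captured in a small lookup. The following Python sketch is illustrative only: the names `TuningProfile`, `PROFILES`, `DOCUMENTED_EXAMPLES`, `planning_estimate`, and `is_gpu_supported` are hypothetical and are not part of any watsonx.ai API. It returns a documented GPU count when a configuration matches one of the examples exactly, and otherwise falls back to the average from the table, which is a starting point and not a sizing guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TuningProfile:
    """Average GPU count and supported GPU types for one tuning method."""
    average_gpus: int
    supported_gpus: frozenset


# Values copied from the hardware requirements table.
PROFILES = {
    "full": TuningProfile(4, frozenset({"A100", "H100", "H200", "L40S"})),
    "lora": TuningProfile(2, frozenset({"A100", "H100", "H200", "L40S"})),
    "qlora": TuningProfile(1, frozenset({"A100", "H100", "H200"})),
}

# Worked examples from the text that name a tuning method, keyed by
# (method, parameters in billions, maximum sequence length, batch size).
# The batch-size examples are omitted because they do not name a method.
DOCUMENTED_EXAMPLES = {
    ("lora", 70, 1024, 32): 8,
    ("full", 8, 32000, 1): 2,
    ("lora", 8, 64000, 1): 2,
}


def planning_estimate(method, params_b=None, max_seq_len=None, batch_size=None):
    """Return a documented GPU count if one matches exactly,
    otherwise the table average for the method (an assumption, not a guarantee)."""
    key = (method, params_b, max_seq_len, batch_size)
    if key in DOCUMENTED_EXAMPLES:
        return DOCUMENTED_EXAMPLES[key]
    return PROFILES[method].average_gpus


def is_gpu_supported(method, gpu_type):
    """Check a GPU type against the supported list for the tuning method."""
    return gpu_type in PROFILES[method].supported_gpus
```

For instance, `is_gpu_supported("qlora", "L40S")` is false because the table does not list the NVIDIA L40S for QLoRA fine tuning.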
Tuning a foundation model
When you add tunable foundation models to your deployment, the models are downloaded to the cluster storage. However, unlike most other foundation models, no inference servers are started for these models, so you cannot inference them right away. If you want to inference a foundation model before you tune it, you can deploy the model as a custom foundation model and inference that deployment. For a list of foundation models that can be tuned, see Foundation models available for tuning.
When you tune a foundation model, the model is loaded onto a node with access to the required number of GPUs so that the tuning job can access and adjust the model. You can apply the full fine tuning method to a foundation model by using the API or from the Tuning Studio in the product UI. After the model is tuned, you can deploy the tuned model and inference the deployment. For LoRA or QLoRA fine tuning, you must deploy both a new instance of the base model and the tuned model adapters in the same deployment space.
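The difference in what you deploy after tuning can be summarized in a short helper. This is a minimal sketch for planning purposes, assuming the method labels `"full"`, `"lora"`, and `"qlora"`; the function `assets_to_deploy` is hypothetical and does not call or represent the watsonx.ai API.

```python
# Adapter-based methods produce adapters that are served with the base model.
ADAPTER_METHODS = {"lora", "qlora"}


def assets_to_deploy(method):
    """List what must be deployed, in one deployment space, after tuning."""
    if method == "full":
        # Full fine tuning yields a tuned model that is deployed on its own.
        return ["tuned model"]
    if method in ADAPTER_METHODS:
        # LoRA and QLoRA need a new base model instance plus the adapters.
        return ["base model (new instance)", "tuned model adapters"]
    raise ValueError(f"Unknown tuning method: {method!r}")
```

A call such as `assets_to_deploy("qlora")` returns two entries, which is a reminder to budget resources for the base model deployment as well as for the adapters.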
For more information about these tuning methods, see Methods for tuning foundation models.