Methods for tuning foundation models
Learn more about different tuning methods available in watsonx.ai and how to choose the method that's right for your solution.
Foundation models can be tuned in the following ways:
- Prompt tuning: Adjusts the content of the prompt that is passed to the model to guide the model to generate output that matches a pattern you specify. The underlying foundation model and its parameter weights are not changed; only the prompt input is altered.
Although the result of prompt tuning is a new tuned model asset, the prompt-tuned model merely adds a layer of function that runs before the input is processed by the underlying foundation model. Because the underlying foundation model is not changed, the same model can address different business needs without being retrained each time, which reduces computational needs and inference costs. See Prompt tuning. A code sketch of prompt tuning follows this list.
- Full fine tuning: Starting from the base model's pretrained knowledge, full fine tuning tailors the model by training it on a smaller, task-specific dataset. This method changes the parameter weights of a previously trained model to customize the model for a task.
The result of full fine tuning is an entirely new model. Because all of the model weights are tuned, full fine tuning is more expensive than parameter-efficient tuning techniques, and more compute and storage resources are required to host the new tuned model that you create. See Full fine tuning. A code sketch of full fine tuning follows this list.
- Low-rank adaptation (LoRA) fine tuning: Adapts a foundation model for a task by changing the weights of a representative subset of the model parameters, called low-rank adapters, instead of the base model weights during tuning. At inference time, weights from the tuned adapters are added to the weights from the base foundation model to generate output that is tuned for a task. See Low-rank adaptation (LoRA) fine tuning. A code sketch of LoRA tuning follows this list.
You can run LoRA fine tuning experiments programmatically starting with the 2.1.1 release.
- Quantized low-rank adaptation (QLoRA) fine tuning: A variant of LoRA that incorporates quantization to further reduce the memory footprint and computational resources that are required during tuning. See Quantized low-rank adaptation (QLoRA) fine tuning. A code sketch of QLoRA tuning follows this list.
You can run QLoRA fine tuning experiments programmatically starting with the 2.1.1 release.
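To make the prompt tuning mechanics concrete, the following minimal sketch uses the open-source Hugging Face peft library rather than the watsonx.ai tuning API; the model name, initialization text, and number of virtual tokens are illustrative assumptions, not recommended values.

```python
# Minimal prompt tuning sketch (illustrative only).
# Trainable prompt vectors are prepended to the model input;
# the base model weights stay frozen.
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloomz-560m"  # assumed small example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(model_name)

peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,  # initialize vectors from text
    prompt_tuning_init_text="Classify the sentiment of this review:",
    num_virtual_tokens=20,                     # length of the tuned prompt vector
    tokenizer_name_or_path=model_name,
)
model = get_peft_model(base_model, peft_config)
model.print_trainable_parameters()  # only the prompt vectors are trainable
```

Because only the small prompt vector is trained, the same frozen base model can sit behind many prompt-tuned assets.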
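For contrast with the parameter-efficient methods, here is a minimal full fine-tuning sketch that uses the open-source Hugging Face transformers Trainer rather than the watsonx.ai tuning API; the model name, the training file (train.jsonl with a "text" field), and the hyperparameters are hypothetical placeholders.

```python
# Minimal full fine-tuning sketch (illustrative only).
# Every parameter of the base model is updated, which is what makes
# this method the most resource-intensive.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "bigscience/bloomz-560m"  # assumed small example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # all weights trainable

# "train.jsonl" is a hypothetical task-specific dataset.
dataset = load_dataset("json", data_files="train.jsonl")["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="full-ft", num_train_epochs=1,
                           per_device_train_batch_size=1, learning_rate=1e-5),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # produces an entirely new set of model weights
```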
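The following minimal LoRA sketch again uses the open-source Hugging Face peft library for illustration; the rank, scaling, and target module names are assumptions that vary by model architecture.

```python
# Minimal LoRA sketch (illustrative only).
# Low-rank adapter matrices are trained while the base model weights
# stay frozen; adapter weights are added to the base weights at inference.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                       # rank of the adapter matrices
    lora_alpha=16,             # scaling applied to the adapter output
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # attention projections in BLOOM-style models
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # a small fraction of total parameters
```

Because each adapter is small, several task-specific adapters can be swapped in and out over one shared base model.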
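Finally, a minimal QLoRA sketch under the same assumptions: it combines 4-bit quantization of the base model (via the open-source bitsandbytes integration in transformers) with LoRA adapters from peft, and is not the watsonx.ai tuning API.

```python
# Minimal QLoRA sketch (illustrative only):
# 4-bit quantized base model + trainable LoRA adapters.
import torch
from peft import (LoraConfig, TaskType, get_peft_model,
                  prepare_model_for_kbit_training)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype for matmuls
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants
)
base_model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloomz-560m", quantization_config=bnb_config
)
base_model = prepare_model_for_kbit_training(base_model)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16,
    target_modules=["query_key_value"],
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # adapters train in full precision
```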
Tuning method comparison
The following table compares the available tuning methods based on common criteria for choosing a tuning method.
| Criteria | Prompt tuning | Full fine tuning | LoRA fine tuning | QLoRA fine tuning |
| --- | --- | --- | --- | --- |
| Tuning technique | Prompt vectors are tuned; base model parameters remain fixed. | All base model parameters are fine-tuned on the target task. | Adapters that represent a subset of model parameters are tuned; base model parameters remain fixed during tuning. | Adapters that represent a subset of model parameters are tuned; base model parameters remain fixed during tuning. |
| Tuned model outcomes | Effective when the target task is similar to the pretrained knowledge of the model. | Effective at customizing a model for a new task or domain when given sufficient data and compute resources. | High performance with reduced risk of overfitting; might not reach the performance of full fine tuning. | High performance with reduced risk of overfitting; might not reach the performance of full fine tuning, and quantization can introduce some quality degradation. |
| Required compute resources | Low. Minimal resources are required because only the prompt vector is tuned; the underlying model is unaltered. | High. Large computational resources and memory are required to update all of the model parameters. | Moderate. Requires fewer resources than full fine tuning because only the adapters are tuned; the underlying model is unaltered during tuning. | Low. Requires fewer resources than LoRA fine tuning because the model weights are quantized, which reduces computational and storage needs. |
| Tuning time | Short. The quickest method because only the prompt vectors are changed; can range from 10 minutes to a few hours. | Long. The exact duration depends on the model and dataset sizes. | Moderate. Faster than full fine tuning, but modifying the adapters takes time; can range from one to many hours. | Moderate. Faster than full fine tuning, but modifying the adapters takes time; can range from one to many hours. |
| Cost | Fewer resources are needed to prompt-tune and host the tuned model. An effectively prompt-tuned smaller model can do the equivalent work of a larger model and requires fewer resources to host. | Factor in the cost of the extra resources that are required both to fine-tune the model and to deploy and host the new fine-tuned model. | Requires fewer storage and compute resources. Multiple LoRA adapters can be served with the same base model to save costs. | Requires fewer storage and compute resources than LoRA. Multiple adapters can be served with the same quantized base model to save costs. |
| Purpose | Most suitable for quick adaptation tasks, especially when computational resources are limited or the task is closely related to the model's pretraining. | Best when maximum accuracy and task-specific adaptation are critical and the extra resources and cost are justified for the use case. | A good option for creating task-specific adapters that tune one foundation model for multiple tasks. | A good option for creating task-specific adapters that tune one quantized foundation model for multiple tasks. |
Parent topic: Tuning foundation models