Low-rank adaptation (LoRA) fine tuning
Low-rank adaptation (LoRA) fine tuning adapts a foundation model for a task by adjusting the weights of a small set of added parameters, called low-rank adapters, instead of the base model weights. At inference time, the weights from the tuned adapters are added to the weights of the base foundation model to generate output that is tuned for the task.
How low-rank adaptation (LoRA) tuning works
Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) technique that adds a small set of parameters to the frozen base foundation model and updates only that set during the tuning experiment, without modifying the parameters of the base model. At inference time, the new weights from the added parameters are combined with the weights of the base model to generate output that is customized for a task.
How the added parameters are created involves some mathematics. Remember that the neural network of a foundation model is composed of layers, each with a large matrix of parameters. These parameters have weight values that are set when the foundation model is initially trained. The parameters that are used for LoRA tuning are based on rank decomposition. The rank of a matrix is the number of vectors in the matrix that are linearly independent from one another. Rank decomposition is a matrix factorization method that represents a large matrix as two much smaller matrices that, when multiplied, form a matrix of the same size as the original. With this method, the two smaller matrices together capture key patterns and relationships from the larger matrix, but with far fewer parameters. In LoRA, the two smaller matrices represent the change to a base weight matrix, rather than the matrix itself. The smaller matrices are called low-rank matrices or low-rank adapters.
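For example, the following minimal sketch in PyTorch (the matrix dimensions and rank are illustrative, not values from watsonx.ai) shows how two small adapter matrices stand in for a full-size weight update:

```python
import torch

d, k = 4096, 4096        # dimensions of one base-model weight matrix (illustrative)
r = 8                    # rank chosen for the decomposition

A = torch.randn(r, k)    # low-rank adapter, r x k
B = torch.zeros(d, r)    # low-rank adapter, d x r (starts at zero, so the
                         # product contributes nothing before tuning begins)

delta_W = B @ A          # the product has the same shape as the base matrix: d x k
print(delta_W.shape)     # torch.Size([4096, 4096])

# The two adapters hold far fewer parameters than the full matrix:
print(d * k)             # 16777216 weights in the full matrix
print(d * r + r * k)     # 65536 weights in A and B combined
```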
During a LoRA tuning experiment, the weight values of the low-rank adapters are adjusted. Because the adapters have far fewer parameters, the tuning experiment is faster and needs fewer resources to store and compute changes. Although the adapter matrices cannot capture all of the information in the base model matrices, the LoRA tuning method is effective because large foundation models typically have many more parameters than a single task requires.
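The following sketch shows the idea as a PyTorch module. It is an illustration of the technique, not the watsonx.ai implementation; the class name, initialization values, and defaults are assumptions for the example.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch of a linear layer with LoRA adapters (illustrative)."""

    def __init__(self, base: nn.Linear, rank: int = 8,
                 alpha: float = 16.0, dropout: float = 0.05):
        super().__init__()
        self.base = base
        for p in self.base.parameters():     # the base weights stay frozen
            p.requires_grad_(False)
        out_features, in_features = base.weight.shape
        # Only these two small matrices are trained during the experiment.
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))
        self.dropout = nn.Dropout(p=dropout)
        self.scaling = alpha / rank          # alpha scales the adapter update

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base output plus the scaled low-rank correction.
        lora_out = self.dropout(x) @ self.A.T @ self.B.T
        return self.base(x) + lora_out * self.scaling
```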
The output of a LoRA fine-tuning experiment is a set of adapters that contain new weights. When these tuned adapters are multiplied, their product is a matrix that is the same size as the corresponding base model matrix. At inference time, the new weights from the product of the adapters are added directly to the base model weights to generate the fine-tuned output.
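Because the adapter product can be folded into the base weights ahead of time, serving the tuned model costs no extra computation per request. A minimal sketch of the merge, with random values standing in for real tuned weights:

```python
import torch

d, k, r = 4096, 4096, 8
W = torch.randn(d, k)             # frozen base weight matrix
A = torch.randn(r, k) * 0.01      # tuned adapter returned by the experiment
B = torch.randn(d, r) * 0.01      # tuned adapter returned by the experiment
scaling = 16.0 / r                # alpha / rank

# Fold the adapter product into the base weights once, before serving.
W_merged = W + (B @ A) * scaling  # same shape as W

# A single matrix multiply now produces the adapted output, so inference
# runs as fast as the unmodified base model.
x = torch.randn(1, k)
y = x @ W_merged.T
```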
You can configure parameters of the LoRA tuning experiment, such as which base foundation model layers to target and the rank to use for the adapter matrices. For more details, see Parameters for tuning foundation models.
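For illustration only (this is not the watsonx.ai API), the kinds of values such an experiment configures might look like the following; the keys mirror the tuning parameters described on this page:

```python
# Hypothetical settings object; names and values are illustrative.
lora_experiment_settings = {
    "target_modules": ["query", "value"],  # base-model layers that get adapters
    "rank": 8,                             # size of the low-rank matrices
    "alpha": 16,                           # scaling factor for adapter updates
    "dropout": 0.05,                       # regularization on adapter inputs
    "learning_rate": 1e-4,                 # step size for weight adjustments
}
```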
When you deploy the adapter asset, you must deploy it in a deployment space where the base model is also deployed. In watsonx.ai, you can use the LoRA fine-tuning method only to tune non-quantized foundation models.
The benefits of using the LoRA fine-tuning technique include:
- The smaller, trainable adapters used by the LoRA technique require fewer storage and computational resources during tuning.
- Adjustments from the adapters are applied at inference time without impacting the context window length or the speed of model responses.
- You can deploy one base foundation model and use the model with different adapters to customize outputs for different tasks.
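For example, one deployed base model can serve multiple tasks by swapping in different adapter weights. The following PyTorch sketch uses random stand-in weights and hypothetical task names to illustrate the pattern:

```python
import torch

d, k, r = 1024, 1024, 8
W = torch.randn(d, k)        # one deployed base weight matrix, shared by tasks
scaling = 16.0 / r

# Tuned adapter pairs for two tasks (random values stand in for real weights).
adapters = {
    "summarize": (torch.randn(d, r) * 0.01, torch.randn(r, k) * 0.01),
    "classify":  (torch.randn(d, r) * 0.01, torch.randn(r, k) * 0.01),
}

def adapted_output(x: torch.Tensor, task: str) -> torch.Tensor:
    B, A = adapters[task]
    # The base weights never change; only the small adapter product differs.
    return x @ (W + (B @ A) * scaling).T

x = torch.randn(1, k)
y_summarize = adapted_output(x, "summarize")
y_classify = adapted_output(x, "classify")
```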
Low-rank adaptation (LoRA) fine-tuning workflow
During the LoRA fine-tuning experiment, the parameter weights of a small set of added parameters, called low-rank adapters, are repeatedly adjusted so that the predictions of the tuned foundation model improve over time.
The following diagram illustrates the steps that occur during a LoRA fine-tuning experiment run.
The parts of the experiment flow that you can configure are highlighted with a user icon. These decision points correspond with experiment tuning parameters that you control. See Parameters for tuning foundation models.
The diagram shows the following steps of the experiment. A simplified code sketch that ties the steps together follows the list.
1. The experiment reads the training data, tokenizes it, and converts it into batches. The size of the batches is determined by the batch size parameter.
2. Low-rank adapters, which are small trainable matrices that are added alongside the base model parameters, are created. The adapters are attached to the model layers that you specify in the target_modules parameter, and their size is determined by the value that you specify for the rank parameter.
3. The experiment sends input from the examples in the batch to the LoRA adapters, and then to the foundation model, to process and generate output.
4. The experiment compares the model's output to the output from the training data that corresponds to the submitted input. It then computes the loss, which quantifies the difference between the predicted output and the actual output, and the loss gradient, which indicates how to change the adapter weights to reduce the loss. The experiment adjusts the LoRA adapter parameter weights based on the computed loss. When this adjustment occurs depends on how the Accumulation steps parameter is configured.
5. Adjustments are applied to the parameter weights of the LoRA adapters. The degree to which the weights are changed is controlled by a combination of the learning rate, alpha, and dropout parameter values.
6. Input from the next example in the training data is submitted to the LoRA adapters. The adapters apply the latest weight changes, which are added to the base foundation model weights to adjust them for the task.
7. The process repeats until all of the examples in all of the batches are processed.
8. The entire set of batches is processed again as many times as is specified in the Number of epochs parameter.
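The following simplified training loop shows how these steps fit together. It is a PyTorch sketch with toy data and illustrative hyperparameter values, not the watsonx.ai implementation:

```python
import torch
import torch.nn as nn

# Illustrative values for the tuning parameters described in the steps above.
batch_size, accumulation_steps, num_epochs, learning_rate = 8, 4, 3, 1e-4
d, r = 64, 4

base = nn.Linear(d, d)                        # stands in for a targeted layer
for p in base.parameters():
    p.requires_grad_(False)                   # base model weights stay frozen
A = nn.Parameter(torch.randn(r, d) * 0.01)    # only the two adapter matrices
B = nn.Parameter(torch.zeros(d, r))           # are registered for training
optimizer = torch.optim.AdamW([A, B], lr=learning_rate)

# Toy training pairs stand in for tokenized, batched examples (step 1).
inputs, targets = torch.randn(320, d), torch.randn(320, d)
batches = list(zip(inputs.split(batch_size), targets.split(batch_size)))

for epoch in range(num_epochs):                    # step 8: repeat for each epoch
    for step, (x, y) in enumerate(batches):        # step 7: process every batch
        pred = base(x) + x @ A.T @ B.T             # step 3: generate output
        loss = nn.functional.mse_loss(pred, y)     # step 4: compare to target
        loss.backward()                            # compute the loss gradient
        if (step + 1) % accumulation_steps == 0:   # Accumulation steps parameter
            optimizer.step()                       # step 5: adjust the adapters
            optimizer.zero_grad()
```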
Learn more about LoRA
- IBM.com: What is LoRA (low-rank adaptation)?
- LoRA research paper
- Parameters for tuning foundation models
Parent topic: Methods for tuning foundation models