Full fine-tuning, like the pre-training process it resembles, is very computationally demanding. For modern deep learning models with hundreds of millions or even many billions of parameters, it’s often prohibitively costly and impractical.

Parameter-efficient fine-tuning (PEFT) encompasses a range of methods that reduce the number of trainable parameters that must be updated to effectively adapt a large pre-trained model to specific downstream applications. In doing so, PEFT significantly decreases the computational resources and memory storage needed to yield an effectively fine-tuned model. PEFT methods have often been demonstrated to be more stable than full fine-tuning methods, particularly for NLP use cases.3



Partial fine-tuning

Also called selective fine-tuning, partial fine-tuning methods aim to reduce computational demands by updating only the select subset of pre-trained parameters most critical to model performance on relevant downstream tasks. The remaining parameters are “frozen,” ensuring that they will not be changed.

The most intuitive partial fine-tuning approach is to update only the outer layers of the neural network. In most model architectures, the inner layers of the model (closest to the input layer) capture only broad, generic features: for example, in a CNN used for image classification, early layers typically discern edges and textures; each subsequent layer discerns progressively finer features until final classification is predicted at the outermost layer. Generally speaking, the more similar the new task (for which the model is being fine-tuned) is to the original task, the more useful the pre-trained weights of the inner layers will already be for this new, related task, and thus the fewer layers need to be updated.
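As a rough illustration of freezing everything except the outermost layers, the following Python sketch models a network as a list of layers and marks only the last k as trainable. The `Layer` class, the layer names, and the parameter counts are hypothetical and chosen only for illustration; a real framework would expose its own mechanism for freezing parameters.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Hypothetical stand-in for one layer of a pre-trained network."""
    name: str
    num_params: int
    trainable: bool = True


def freeze_all_but_last(layers, k):
    """Freeze every layer except the last k (those nearest the output)."""
    cutoff = len(layers) - k
    for index, layer in enumerate(layers):
        layer.trainable = index >= cutoff
    return layers


def trainable_fraction(layers):
    """Share of all parameters that would still receive updates."""
    total = sum(layer.num_params for layer in layers)
    trainable = sum(layer.num_params for layer in layers if layer.trainable)
    return trainable / total


# Illustrative sizes only; these are assumptions, not measurements.
model = [
    Layer("conv1", 10_000),
    Layer("conv2", 50_000),
    Layer("conv3", 200_000),
    Layer("classifier", 40_000),
]

freeze_all_but_last(model, k=1)
fraction = trainable_fraction(model)
print(f"trainable fraction: {fraction:.2%}")
```

Raising `k` unfreezes more layers, which corresponds to the case where the new task is less similar to the original one.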

Other partial fine-tuning methods include updating only the bias terms of the model (rather than its weights)4 and “sparse” fine-tuning methods that update only a select subset of overall weights throughout the model.5
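To give a sense of how small the bias-only subset can be, the sketch below counts weights and biases for a few fully connected layers. The layer shapes are hypothetical, and the count assumes one bias per output unit, which is the usual convention for a linear layer.

```python
def linear_param_counts(d_in, d_out):
    """Return (weight count, bias count) for one fully connected layer."""
    weights = d_in * d_out  # one weight per input-output connection
    biases = d_out          # one bias per output unit
    return weights, biases


# Hypothetical layer shapes, for illustration only.
layer_shapes = [(768, 768), (768, 3072), (3072, 768)]

total_weights = sum(linear_param_counts(i, o)[0] for i, o in layer_shapes)
total_biases = sum(linear_param_counts(i, o)[1] for i, o in layer_shapes)
bias_fraction = total_biases / (total_weights + total_biases)

print(f"weights: {total_weights}, biases: {total_biases}")
print(f"bias share of parameters: {bias_fraction:.4%}")
```

Under these assumed shapes, the bias terms make up well under 1% of the parameters, which is why training them alone is cheap.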



Additive fine-tuning

Rather than fine-tuning the existing parameters of a pre-trained model, additive methods add extra parameters or layers to the model, freeze the existing pre-trained weights, and train only those new components. This approach helps retain stability of the model by ensuring that the original pre-trained weights remain unchanged.

While this can increase training time, it significantly reduces memory requirements because there are far fewer gradients and optimizer states to store: according to Lialin et al., training all of a model’s parameters requires 12–20 times more GPU memory than the model weights alone.6 Further memory savings can be achieved through quantization of the frozen model weights: a reduction in the precision used to represent model parameters, conceptually similar to lowering the bitrate of an audio file.
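The idea of quantization can be sketched with a simple symmetric 8-bit scheme: each weight is divided by a shared scale and rounded to an integer in [-127, 127], so it can be stored in one byte instead of the four bytes of a 32-bit float. This particular scheme and the example weights are assumptions for illustration; production methods typically use more elaborate variants.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical weight values, for illustration only.
weights = [0.12, -0.5, 0.33, 1.27, -1.0]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized:", q)
print("largest round-trip error:", max_error)
```

The round-trip error is bounded by half the scale, which is the precision given up in exchange for the smaller storage footprint.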

One sub-branch of additive methods is prompt tuning. Conceptually, it’s similar to prompt engineering, which refers to tailoring “hard prompts” (that is, prompts written by a human in natural language) to guide the model toward the desired output, such as by specifying a certain tone or by providing examples that facilitate few-shot learning. Prompt tuning introduces AI-authored soft prompts: learnable vector embeddings that are concatenated to the embeddings of the user’s hard prompt. Rather than retraining the model, prompt tuning freezes the model weights and instead trains the soft prompt itself. Fast and efficient, prompt tuning allows models to switch between specific tasks more easily, albeit with a tradeoff in interpretability.
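A minimal sketch of how a soft prompt might be attached to a hard prompt is shown below. The embedding width, the soft prompt length, the tiny vocabulary, and the choice to place the soft prompt in front of the hard prompt are all assumptions made for illustration; the training loop that would update `soft_prompt` is omitted.

```python
import random

D_MODEL = 8          # hypothetical embedding width
NUM_SOFT_TOKENS = 4  # hypothetical soft prompt length

random.seed(0)


def random_vector(dim):
    """Small random vector, standing in for an initialized embedding."""
    return [random.gauss(0.0, 0.02) for _ in range(dim)]


# Frozen: stand-in for the pre-trained model's embedding table.
vocab = ["summarize", "this", "review", ":"]
frozen_embeddings = {token: random_vector(D_MODEL) for token in vocab}

# Trainable: one learnable vector per virtual token of the soft prompt.
soft_prompt = [random_vector(D_MODEL) for _ in range(NUM_SOFT_TOKENS)]


def build_model_input(hard_prompt_tokens):
    """Concatenate the soft prompt with the embedded hard prompt."""
    hard = [frozen_embeddings[token] for token in hard_prompt_tokens]
    return soft_prompt + hard


inputs = build_model_input(["summarize", "this", "review", ":"])
trainable_params = NUM_SOFT_TOKENS * D_MODEL

print("sequence length seen by the model:", len(inputs))
print("trainable parameters:", trainable_params)
```

Only the `NUM_SOFT_TOKENS * D_MODEL` values in `soft_prompt` would be updated during training, so switching tasks amounts to swapping in a different soft prompt.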



Adapters

Another subset of additive fine-tuning injects adapter modules—new, task-specific layers added to the neural network—and trains these adapter modules in lieu of fine-tuning any of the pre-trained model weights (which are frozen). According to the original paper, which measured results on the BERT masked language model, adapters attained performance equivalent to that of full fine-tuning while training only 3.6% as many parameters.7
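One common shape for an adapter module is a bottleneck: project the hidden vector down to a small dimension, apply a nonlinearity, project back up, and add the result to the original vector. The sketch below assumes that design, omits bias terms for brevity, and uses hypothetical dimensions and weights; it is not a reproduction of any specific paper's implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, x) for x in vector]


def adapter_forward(h, w_down, w_up):
    """Bottleneck adapter with a residual (skip) connection."""
    z = relu(matvec(w_down, h))   # project from d down to m dimensions
    delta = matvec(w_up, z)       # project from m back up to d dimensions
    return [a + b for a, b in zip(h, delta)]


def adapter_param_count(d, m):
    """Trainable parameters in the two projections (biases omitted)."""
    return 2 * d * m


# Hypothetical sizes: hidden width d = 4, bottleneck width m = 2.
h = [1.0, -2.0, 0.5, 3.0]
w_down = [[0.1, 0.0, 0.2, 0.0],
          [0.0, 0.3, 0.0, 0.1]]
w_up_zero = [[0.0, 0.0] for _ in range(4)]  # zero up-projection

# With a zero up-projection the adapter leaves the input unchanged,
# so inserting it does not disturb the frozen model before training.
output = adapter_forward(h, w_down, w_up_zero)

print("output:", output)
print("adapter params for d=768, m=64:", adapter_param_count(768, 64))
```

For the assumed sizes d = 768 and m = 64, the adapter holds 98,304 trainable parameters, compared with 589,824 weights in a single 768 x 768 layer.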



Reparameterization

Reparameterization-based methods like Low Rank Adaptation (LoRA) leverage low-rank transformation of high-dimensional matrices (like the massive matrix of pre-trained model weights in a transformer model). These low-rank representations omit inconsequential higher-dimensional information in order to capture the underlying low-dimensional structure of model weights, greatly reducing the number of trainable parameters. This dramatically speeds up fine-tuning and reduces memory needed to store model updates.

LoRA eschews direct optimization of the matrix of model weights and instead optimizes a matrix of updates to model weights (or delta weights), which is inserted into the model. That matrix of weight updates is, in turn, represented as the product of two smaller, lower-rank matrices, greatly reducing the number of parameters to be updated. The pre-trained model weights themselves remain frozen.
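The decomposition can be sketched as follows: for a frozen weight matrix W of shape d x k, the update is the product of a d x r matrix B and an r x k matrix A, so only r(d + k) values are trained instead of d x k. The matrix names B and A, the tiny example values, and the sizes d = k = 4096 and r = 8 are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][n] * b[n][j] for n in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_effective_weight(w, b, a):
    """Frozen weight plus the low-rank update: W + B A."""
    delta = matmul(b, a)
    return [[w_ij + d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def lora_param_count(d, k, r):
    """Trainable values in B (d x r) and A (r x k)."""
    return r * (d + k)


# Tiny hypothetical example: d = k = 3, rank r = 1.
w = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]   # frozen pre-trained weights
b = [[1.0], [2.0], [0.0]]   # trainable, shape d x r
a = [[0.5, 0.0, -1.0]]      # trainable, shape r x k

w_adapted = lora_effective_weight(w, b, a)

full = 4096 * 4096
low_rank = lora_param_count(4096, 4096, 8)
print("adapted weights:", w_adapted)
print(f"trainable share at rank 8: {low_rank / full:.4%}")
```

With the assumed sizes, the low-rank factors hold 65,536 values against 16,777,216 in the full matrix, or 1/256 of the parameters.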

An added benefit of LoRA is that, since what’s being optimized and stored are not new model weights but rather the difference (or delta) between the original pre-trained weights and fine-tuned weights, different task-specific LoRAs can be “swapped in” as needed to adapt the pre-trained model—whose actual parameters remain unchanged—to a given use case.
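The swapping behavior can be illustrated with a small registry that keeps one frozen base matrix and several task-specific pairs of low-rank factors. The class name `SwappableLayer`, its methods, and the task names are hypothetical and exist only for this sketch.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][n] * b[n][j] for n in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


class SwappableLayer:
    """Hypothetical layer with one frozen weight and many LoRA deltas."""

    def __init__(self, base_weight):
        self.base_weight = base_weight  # frozen; never modified
        self.adapters = {}              # task name -> (B, A)
        self.active = None

    def register(self, task, b, a):
        self.adapters[task] = (b, a)

    def activate(self, task):
        if task is not None and task not in self.adapters:
            raise KeyError(task)
        self.active = task

    def effective_weight(self):
        """Base weight plus the active task's delta, if any."""
        if self.active is None:
            return [row[:] for row in self.base_weight]
        b, a = self.adapters[self.active]
        delta = matmul(b, a)
        return [[w + d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.base_weight, delta)]


layer = SwappableLayer([[1.0, 0.0], [0.0, 1.0]])
layer.register("summarize", b=[[1.0], [0.0]], a=[[0.0, 2.0]])
layer.register("translate", b=[[0.0], [1.0]], a=[[3.0, 0.0]])

layer.activate("summarize")
w_summarize = layer.effective_weight()
layer.activate("translate")
w_translate = layer.effective_weight()
layer.activate(None)
w_plain = layer.effective_weight()
```

Because each task stores only its small factors, many task-specific adaptations can share a single copy of the pre-trained weights.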

A variety of LoRA derivatives has been developed, such as QLoRA, which further reduces memory requirements by quantizing the frozen weights of the pre-trained transformer model before applying LoRA.