Published: 15 August 2024
Contributors: Ivan Belcic, Cole Stryker
Parameter-efficient fine-tuning (PEFT) is a method of improving the performance of pretrained large language models (LLMs) and neural networks for specific tasks or data sets. By training a small set of parameters and preserving most of the large pretrained model’s structure, PEFT saves time and computational resources.
Neural networks trained for general tasks such as natural language processing (NLP) or image classification can specialize in a related new task without being entirely retrained. PEFT is a resource-efficient way to build highly specialized models without starting from scratch each time.
PEFT works by freezing most of the pretrained language model’s parameters and layers while inserting a small number of trainable parameters, known as adapters, into the model for predetermined downstream tasks.
The fine-tuned models retain all the learning gained during pretraining while specializing in their respective downstream tasks. Many PEFT methods further enhance efficiency with gradient checkpointing, a memory-saving technique in which the model recomputes intermediate activations during backpropagation rather than storing them all at once.
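For illustration, here is a minimal sketch of this freeze-then-train pattern using PyTorch and the Hugging Face transformers library. The model name and the choice of task head are placeholder assumptions, not specifics from this article:

```python
from transformers import AutoModelForSequenceClassification

# Placeholder setup: adapt a general pretrained model to a two-class task.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Freeze every pretrained parameter...
for param in model.parameters():
    param.requires_grad = False

# ...then unfreeze only the small task-specific classification head.
for param in model.classifier.parameters():
    param.requires_grad = True

# Gradient checkpointing (mentioned above) trades extra compute for
# memory by recomputing activations during the backward pass.
model.gradient_checkpointing_enable()

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Training {trainable:,} of {total:,} parameters")
```

Only the unfrozen head is updated during training, so the optimizer state and gradients for the vast majority of the network never need to be stored.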
Parameter-efficient fine-tuning balances efficiency and performance to help organizations maximize computational resources while minimizing storage costs. When tuned with PEFT methods, transformer-based models such as GPT-3, LLaMA and BERT can draw on all the knowledge contained in their pretrained parameters while performing better on specialized tasks than they would without fine-tuning.
PEFT is often used during transfer learning, where models trained on one task are applied to a second, related task. For example, a model trained on image classification might be put to work on object detection. If a base model is too large to retrain in full, or if the new task differs from the original, PEFT can be an ideal solution.
Traditional full fine-tuning methods involve slight adjustments to all the parameters in pretrained LLMs to adapt them for specific tasks. But as developments in artificial intelligence (AI) and deep learning (DL) have led models to grow larger and more complex, full fine-tuning has become prohibitively demanding in computational resources and energy.
Also, each fine-tuned model is the same size as the original. All these models take up significant amounts of storage space, further driving up costs for the organizations that use them. While fine-tuning does create more efficient machine learning (ML), the process of fine-tuning LLMs has itself become inefficient.
PEFT adjusts only the handful of parameters most relevant to the model’s intended use case, delivering specialized model performance while shrinking the number of trainable weights, for significant savings in computational cost and time.
Parameter-efficient fine-tuning brings a wealth of benefits that have made it popular with organizations that use LLMs in their work:
Most large language models used in generative AI (gen AI) run on expensive graphics processing units (GPUs) made by manufacturers such as Nvidia. Each LLM consumes large amounts of computational resources and energy. Adjusting only the most relevant parameters yields large savings in energy and cloud computing costs.
Time-to-value is the amount of time that it takes to develop, train and deploy an LLM so it can begin generating value for the organization that uses it. Because PEFT tweaks only a few trainable parameters, it takes far less time to update a model for a new task. PEFT can deliver comparable performance to a full fine-tuning process at a fraction of the time and expense.
Catastrophic forgetting happens when LLMs lose or “forget” the knowledge gained during the initial training process as they are retrained or tuned for new use cases. Because PEFT preserves most of the initial parameters, it also safeguards against catastrophic forgetting.
Overfitting is when a model hews too closely to its training data during the training process, making it unable to generate accurate predictions in other contexts. Transformer models tuned with PEFT are much less prone to overfitting as most of their parameters remain static.
By focusing on a few parameters, PEFT lowers the training data requirements for the fine-tuning process. Full fine-tuning requires a much larger training data set because all the model’s parameters will be adjusted during the fine-tuning process.
Without PEFT, the costs of developing a specialized LLM are too high for many small and medium-sized organizations to bear. PEFT makes LLMs available to teams who might not otherwise have the time or resources to train and fine-tune models.
PEFT enables data scientists and other professionals to customize general LLMs to individual use cases. AI teams can experiment with model optimization without worrying as much about burning through computational, energy and storage resources.
AI teams have various PEFT techniques and algorithms at their disposal, each with its relative advantages and specializations. Many of the most popular PEFT tools are available on Hugging Face and in numerous GitHub repositories.
Adapters were among the first PEFT techniques applied to natural language processing (NLP) models. Researchers sought to overcome the challenge of training a model for multiple downstream tasks while minimizing the number of new weights added per task. Adapter modules were the answer: small add-ons that insert a handful of trainable, task-specific parameters into each transformer layer of the model.
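A minimal sketch of such a bottleneck adapter in PyTorch appears below; the hidden and bottleneck sizes are illustrative assumptions, not values from the original work:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """A minimal bottleneck adapter: a down-projection, a nonlinearity
    and an up-projection, wrapped in a residual connection."""

    def __init__(self, hidden_size: int = 768, bottleneck_size: int = 64):
        super().__init__()
        self.down_project = nn.Linear(hidden_size, bottleneck_size)
        self.activation = nn.GELU()
        self.up_project = nn.Linear(bottleneck_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection means a near-identity adapter leaves
        # the frozen transformer layer's output almost unchanged.
        return hidden_states + self.up_project(
            self.activation(self.down_project(hidden_states))
        )

# Inserted after a frozen transformer layer, only the adapter's
# roughly 2 * hidden_size * bottleneck_size parameters are trained.
adapter = BottleneckAdapter()
out = adapter(torch.randn(1, 10, 768))  # (batch, sequence, hidden)
```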
Introduced in 2021, low-rank adaptation of large language models (LoRA) represents weight updates as the product of two smaller low-rank decomposition matrices, shrinking the subset of trainable parameters even further.
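The arithmetic behind the decomposition can be sketched in a few lines of PyTorch; the dimensions, rank and scaling factor below are illustrative placeholders:

```python
import torch

# Illustrative sizes: a 768 x 768 attention weight with rank r = 8.
d, k, r, alpha = 768, 768, 8, 16

W = torch.randn(d, k)           # frozen pretrained weight
A = torch.randn(r, k) * 0.01    # trainable low-rank factor
B = torch.zeros(d, r)           # zero-initialized, so the update starts at 0

# The weight update is the product of the two small matrices,
# scaled by alpha / r.
delta_W = (alpha / r) * (B @ A)
W_effective = W + delta_W       # used in the forward pass

print(f"Full weight: {W.numel():,} values; "
      f"trainable LoRA factors: {A.numel() + B.numel():,}")
# Full weight: 589,824 values; trainable LoRA factors: 12,288
```

Because only A and B are trained, the example above needs fewer than 2.1% as many trainable values as updating the full weight matrix would.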
QLoRA is an extended version of LoRA that quantizes the weight of each pretrained parameter to just 4 bits of precision, down from the typical 32 bits. As such, QLoRA offers significant memory savings and makes it possible to fine-tune a large LLM on a single GPU.
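As a sketch of how this is commonly set up with the Hugging Face transformers, peft and bitsandbytes libraries (the model name and hyperparameters here are placeholders, and a CUDA GPU is required for 4-bit loading):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization settings in the style of the QLoRA recipe.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute still runs in 16-bit
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters are then trained on top of the frozen 4-bit base model.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections in OPT
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```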
Specifically created for natural language generation (NLG) models, prefix-tuning appends a task-specific continuous vector, known as a prefix, to each transformer layer while keeping all pretrained parameters frozen. As a result, prefix-tuned models require storing over a thousandfold fewer task-specific parameters than fully fine-tuned models, with comparable performance.
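Here is a minimal sketch using the Hugging Face peft library; the model name and prefix length are illustrative assumptions:

```python
from transformers import AutoModelForSeq2SeqLM
from peft import PrefixTuningConfig, TaskType, get_peft_model

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

config = PrefixTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    num_virtual_tokens=20,  # length of the trainable prefix added at each layer
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # prints the tiny trainable fraction
```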
Prompt-tuning simplifies prefix-tuning, training models by injecting tailored prompts into the input or training data. Hard prompts are manually created, while soft prompts are learned sequences of numbers, or embedding vectors, that draw knowledge from the base model. Soft prompts have been found to outperform human-generated hard prompts during tuning.
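A minimal configuration sketch with the peft library follows; the virtual-token count, initialization text and tokenizer name are placeholder assumptions:

```python
from peft import PromptTuningConfig, PromptTuningInit, TaskType

# A soft prompt of 8 virtual tokens, warm-started from a hard prompt.
config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=8,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Classify the sentiment of this review:",
    tokenizer_name_or_path="gpt2",
)
# Passed to get_peft_model(), this trains only the 8 prompt embeddings
# while every weight of the base model stays frozen.
```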
P-tuning is a variation of prompt-tuning designed for natural language understanding (NLU) tasks. Rather than relying on manually created prompts, P-tuning introduced trainable prompt embeddings that are generated and refined automatically, yielding more effective prompts over the course of training.
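The peft library exposes this technique through a prompt-encoder configuration; the sizes below are illustrative, not prescribed values:

```python
from peft import PromptEncoderConfig, TaskType

# P-tuning learns its virtual tokens through a small trainable
# prompt encoder rather than direct embedding lookup.
config = PromptEncoderConfig(
    task_type=TaskType.SEQ_CLS,  # a natural language understanding task
    num_virtual_tokens=20,
    encoder_hidden_size=128,     # hidden size of the prompt encoder
)
# As with prompt-tuning, only the prompt encoder's parameters are
# updated; the base model's weights remain frozen.
```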