Published: 5 April 2024
Contributors: Dave Bergmann
Instruction tuning is a technique for fine-tuning large language models (LLMs) on a labeled dataset of instructional prompts and corresponding outputs. It improves model performance not only on specific tasks, but on following instructions in general, thus helping adapt pre-trained models for practical use.
Instruction tuning is a subset of the broader category of fine-tuning techniques used to adapt pre-trained foundation models for downstream tasks. Foundation models can be fine-tuned for a variety of purposes, from style customization to supplementing the core knowledge and vocabulary of the pre-trained model to optimizing performance for a specific use case. Though fine-tuning is not exclusive to any specific domain or artificial intelligence model architecture, it has become an integral part of the LLM lifecycle. For example, Meta’s Llama 2 model family is offered (in multiple sizes) as a base model, as a variant fine-tuned for dialogue (Llama-2-chat) and as a variant fine-tuned for coding (Code Llama).
Instruction tuning is not mutually exclusive with other fine-tuning techniques. For example, chat models often undergo both instruction tuning and reinforcement learning from human feedback (RLHF), a fine-tuning technique that aims to improve abstract qualities like helpfulness and honesty; models fine-tuned for coding often undergo both instruction tuning (to broadly optimize responses for instruction following) and additional fine-tuning on programming-specific data (to augment the model’s knowledge of coding syntax and vocabulary).
While the genesis of LLMs traces back to the 2017 “Attention is All You Need” paper that introduced large-scale transformer models to natural language processing (NLP) tasks, the incorporation of instruction tuning and RLHF—driven by influential papers from Google (in 2021)1 and OpenAI (in 2022),2 respectively—yielded the modern LLMs that initiated the current era of generative AI with the launch of ChatGPT.
The utility of instruction tuning, like that of most fine-tuning techniques, lies in the fact that pre-trained LLMs are not optimized for conversations or instruction following. In a literal sense, LLMs do not answer a prompt: they only append text to it. Instruction tuning helps make that appended text more useful.
The pre-training process for autoregressive language models—LLMs used for generating text, like Meta’s Llama 2, OpenAI’s GPT, Google’s Gemini or IBM’s Granite—optimizes these LLMs to simply predict the next word(s) in a given sequence until it’s complete.
LLMs are pre-trained using self-supervised learning on a massive corpus of written content. In pre-training, autoregressive models are provided the beginning of a text sample and repeatedly tasked with predicting the next word in the sequence until the end of the excerpt. For each prediction, the actual next word of the original sample sentence serves as “ground truth.” Through optimization algorithms like gradient descent that iteratively adjust model parameters—the varying weights and biases applied to the mathematical operations occurring at each node in a neural network—in a way that brings the model’s predictions closer to the original text, the model “learns” the linguistic patterns in its training data (and, by extension, the “knowledge” conveyed in those linguistic patterns).
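To make the next-word objective concrete, here is a minimal sketch of a single pre-training step, assuming the PyTorch and Hugging Face Transformers libraries; the small "gpt2" checkpoint and the one-sentence "corpus" are stand-ins for illustration only.

```python
# Minimal sketch of the self-supervised next-token objective (illustrative only).
# Assumes PyTorch and Hugging Face Transformers; "gpt2" is just a small example model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

text = "The cat sat on the mat."
batch = tokenizer(text, return_tensors="pt")

# Passing labels=input_ids makes the model compute cross-entropy between each
# predicted next token and the actual next token in the sequence.
outputs = model(**batch, labels=batch["input_ids"])
loss = outputs.loss

# One gradient-descent step: nudge parameters so predictions better match the text.
loss.backward()
optimizer.step()
optimizer.zero_grad()
```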
Though this pre-training process imparts an impressive ability to generate linguistically coherent text, it doesn’t necessarily align model performance with the practical needs of human users. Without fine-tuning, a base model might respond to a prompt of “teach me how to bake bread” with “in a home oven.” That’s a grammatically sound way to complete the sentence, but not what the user wanted.
Nevertheless, pre-training an LLM for any specific purpose (like following instructions) is impractical. The “large” in “large language models” refers to the fact that these models often have billions of parameters: training these huge models from scratch entails a tremendous amount of energy, time, computational resources and training data. Conversely, fine-tuning an already-trained LLM requires far less data and, especially when using parameter efficient fine-tuning (PEFT) methods like partial fine-tuning or low rank adaptation (LoRA), only a fraction of the computational demands.
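As a rough illustration of how PEFT keeps those demands low, the sketch below uses the open source Hugging Face peft library to wrap a base model with LoRA adapters; the base model name and the target_modules list are assumptions that vary by architecture.

```python
# Sketch of parameter-efficient fine-tuning with LoRA via the Hugging Face peft library.
# The base model name and target_modules are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,                  # rank of the low-rank update matrices
    lora_alpha=16,        # scaling factor applied to the LoRA update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (model-specific)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
# Only the small LoRA matrices are trainable; the base model's billions of weights stay frozen.
model.print_trainable_parameters()
```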
Though fine-tuning can be achieved through nearly any machine learning paradigm, including reinforcement learning, semi-supervised learning or additional self-supervised learning, instruction tuning entails supervised learning on labeled (input, output) pairs. What distinguishes instruction tuning from other forms of supervised fine-tuning (SFT) is that the input samples in an instruction dataset consist entirely of tasks that resemble requests users might make in their prompts; the outputs demonstrate desirable responses to those requests. In adjusting model weights to make the LLM’s outputs resemble the examples in the instruction dataset, the LLM “learns” to respond to a prompt like “teach me how to bake bread” by appending text that contains actual advice for baking bread.
Instruction tuning thus helps to bridge the gap between the model’s fundamental objective—next-word prediction—and the user’s goal of having the model follow instructions and perform specific tasks. This makes model behavior more useful and predictable.
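A hedged sketch of what one such labeled pair might look like, and how it could be concatenated into a single training string, follows; the "### Instruction / ### Response" template is an illustrative convention, not a fixed standard.

```python
# Illustrative sketch: turning one (instruction, output) pair into a training example.
# The template below is an assumption for demonstration; real projects use many variants.
sample = {
    "instruction": "Teach me how to bake bread.",
    "output": (
        "1. Mix flour, water, yeast and salt into a dough.\n"
        "2. Knead, then let the dough rise until doubled.\n"
        "3. Shape the loaf, proof again, and bake until golden."
    ),
}

def format_example(example: dict) -> str:
    """Concatenate the instruction and desired response into one training string."""
    return (
        "### Instruction:\n"
        f"{example['instruction']}\n\n"
        "### Response:\n"
        f"{example['output']}"
    )

print(format_example(sample))
```

During supervised fine-tuning, the loss is often computed (or weighted most heavily) on the response portion, so the model learns to produce the demonstrated answer rather than merely restate the instruction.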
Fine-tuning LLMs on a labeled dataset of varied instruction-following tasks yields greater ability to follow instructions in general, reducing the amount of in-context information needed for effective prompts. Instruction datasets can be either human-created or generated by another LLM.
As articulated in Google Research’s influential 2021 paper, “Finetuned Language Models are Zero-Shot Learners,” the goal of instruction tuning is to improve the ability of LLMs to respond to NLP instructions. To do so, instruction tuning “combines appealing aspects of both the pretrain–finetune and prompting paradigms.” In essence, by organically incorporating the principles of prompt engineering into supervised fine-tuning, instruction tuning reduces the amount of prompt engineering and few-shot exemplars required to elicit a useful, accurate response from the fine-tuned model.1
Each training sample in an instruction dataset comprises three elements:
- An instruction: a natural language input that specifies a given task, such as “translate this sentence from English to Spanish.”
- Additional information: optional, supplementary context relevant to the task at hand, such as a passage of text to accompany a reading comprehension question.
- Desired output: the target response to the given instruction, which serves as ground truth for the model’s predictions.
The Google paper noted that the resulting instruction-tuned variant of their LaMDA-PT model, dubbed FLAN (for Finetuned Language Net), experienced the greatest improvements on tasks that are naturally articulated as instructions, like translation, question-answering, reading comprehension and natural language inference (NLI)—the task of determining whether a given “hypothesis” follows logically from a given “premise.”
To explain this, the FLAN paper notes an observation made by Brown, et al in the research paper released for the original GPT-3 model in 2020: one explanation for why pre-trained LLMs (absent additional fine-tuning) struggle with tasks like NLI is that passages resembling a typical NLI task are unlikely to occur naturally in the corpus of unlabeled data used for self-supervised pre-training.3 Conversely, for tasks that more closely resemble the straightforward language modeling objective of pre-training—like commonsense reasoning tasks that ultimately require the model to complete a sentence correctly—instructions are largely redundant (and thus instruction tuning imparts less benefit).
Perhaps most importantly, the paper demonstrated that adding additional tasks to the instruction tuning dataset improved the instruction-tuned model’s performance even on novel tasks that were not represented in the instruction dataset. Therein lies the fundamental benefit of instruction tuning: a holistic improvement in the model’s ability to follow instructions in general.
The FLAN paper also included an ablation study that explored whether the apparent benefits of instruction fine-tuning were due to the instructions themselves or simply attributable to fine-tuning the model on multiple NLP tasks. To examine the role of instructions in fine-tuning, the ablation study fine-tuned the base model on three different setups:
- No template: the model saw only each task’s inputs and outputs, with no instructions or framing.
- Dataset name: each input was prepended with the name of the task and dataset, but no natural language instruction.
- FLAN instructions: each input was framed with a full natural language instruction, as in the standard FLAN setup.
The ablation study then measured the results of each fine-tuned language model on a series of zero-shot instruction-following tasks. The instruction-tuned model achieved over 18% greater accuracy than the “no template” model and over 8% greater accuracy than the “dataset name” model. This indicates that training with the instructions themselves is crucial to enhancing zero-shot performance on unseen tasks.
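The three setups can be pictured as three ways of presenting the same underlying example to the model; the snippet below is an illustrative reconstruction in Python rather than the paper’s exact templates.

```python
# Illustrative reconstruction of the three FLAN ablation setups for one NLI example.
# The exact template wording is an assumption; only the structure matters here.
premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."

no_template = f"{premise} {hypothesis}"          # inputs only, no framing
dataset_name = f"[SNLI] {premise} {hypothesis}"  # task/dataset name prepended
instruction = (                                  # full natural language instruction
    f"Premise: {premise}\n"
    f"Hypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? Answer yes, no or maybe."
)
```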
Chain-of-thought (CoT) prompting asks an LLM to not only answer a question but also generate a rationale for how it arrived at an answer. This can be achieved through few-shot prompting with exemplars of sequential reasoning, or by simply appending “think step by step” to the end of a prompt. Research has demonstrated that CoT prompting significantly enhances the zero-shot capabilities of large models across diverse arithmetic, symbolic reasoning and other logical reasoning tasks.5 Wei, et al found that instruction tuning that does not include CoT tasks in the instruction dataset significantly degrades model performance on CoT evaluations—but that adding CoT datasets improves performance on all evaluations.6
Furthermore, their research found that instruction fine-tuning on CoT tasks—both with and without few-shot exemplars—increases a model’s ability to perform CoT reasoning in a zero-shot setting. An intuitive explanation for this benefit is that by being fine-tuned to work through a problem in logical steps, rather than leaping to an answer that merely seems linguistically coherent, models learn to better produce and apply their own reasoning skills.
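As a small illustration of the zero-shot CoT variant described above, the reasoning trigger is simply appended to the question; the exact trigger wording below follows common practice and is not the only option.

```python
# Zero-shot chain-of-thought prompting: append a reasoning trigger to the question.
# The trigger phrase and the sample question are illustrative.
question = "A bakery sells 12 loaves per hour. How many loaves does it sell in 7 hours?"
cot_prompt = f"Q: {question}\nA: Let's think step by step."
```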
A number of datasets exist for the purpose of instruction tuning LLMs, many of which are open source. These datasets can comprise directly written (or collected) natural language (instruction, output) pairs, use templates to convert existing annotated datasets into instructions or even use other LLMs to generate examples.
While directly authoring (instruction, output) pairs is straightforward, it’s a labor-intensive process that ultimately entails a significant amount of time and cost. Various methods have been proposed to transform natural language datasets into instructions, typically by applying templates. The release of multiple open source human-crafted datasets has helped defray the cost of fine-tuning on organic data.
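As a hedged example of the template approach, the sketch below rewrites a record from a hypothetical parallel translation corpus into an (instruction, output) pair; the template wording is an assumption.

```python
# Sketch: converting a record from an existing translation dataset into an
# (instruction, output) pair via a template. The wording is illustrative.
record = {"en": "The weather is beautiful today.", "fr": "Il fait beau aujourd'hui."}

TEMPLATE = "Translate the following sentence from English to French:\n{en}"

instruction_example = {
    "instruction": TEMPLATE.format(**record),
    "output": record["fr"],
}
```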
Prominent open source human-created instruction datasets include:
Motivated by the prohibitive amount of cost and labor required to manually generate instructions and target outputs, many instruction datasets use the responses of larger LLMs to generate prompts, outputs or both. The use of LLM-generated datasets often has the added effect of teaching smaller models to emulate the behavior of larger models, sometimes in a deliberate teacher/learner dynamic.
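A hedged, self-instruct-style sketch of this approach follows: a handful of seed instructions are shown to a larger “teacher” model, which drafts new ones. The OpenAI Python client, the model name and the prompt wording are all assumptions standing in for whichever larger LLM a project actually uses.

```python
# Hedged sketch of LLM-generated instruction data, loosely following a self-instruct recipe.
# The model name, prompt wording and number of samples are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

seed_instructions = [
    "Summarize the following news article in two sentences.",
    "Explain the difference between a list and a tuple in Python.",
]

prompt = (
    "Here are examples of instructions for a language model:\n"
    + "\n".join(f"- {s}" for s in seed_instructions)
    + "\nWrite 5 new, diverse instructions in the same style, one per line."
)

response = client.chat.completions.create(
    model="gpt-4o",  # stand-in for whichever larger "teacher" model is used
    messages=[{"role": "user", "content": prompt}],
)

new_instructions = response.choices[0].message.content.splitlines()
# Each generated instruction would then be fed back to the teacher model to draft
# a target output, producing synthetic (instruction, output) pairs for fine-tuning.
```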
As the power of LLMs increases, the utility of LLM-generated instruction tuning datasets has similarly increased. A 2023 paper replicated the Alpaca fine-tuning paradigm—which fine-tuned LLaMA on InstructGPT-generated instructions—while repeating the process in parallel using GPT-4 to generate instructions. The resultant model, dubbed LLaMA-GPT4, significantly outperformed its Alpaca equivalent on “Helpfulness” scores and came close to matching GPT-4 itself in measures of “Helpfulness,” “Honesty” and “Harmlessness.”11
Though instruction tuning techniques have yielded important advances in LLMs, work remains to diversify instruction tuning datasets and fully clarify its benefits.
Chief among the challenges of instruction tuning is the creation of high-quality instructions for use in fine-tuning. The resources required to craft a suitably large instruction dataset have centralized instruction tuning around a handful of open source datasets, which can have the effect of decreasing model diversity. Though the use of larger, proprietary LLMs to generate instructions has helped reduce costs, this has the potential downside of reinforcing the biases and shortcomings of those proprietary LLMs across the spectrum of open source LLMs. This problem is compounded by the fact that proprietary models are often used, in an effort to circumvent the intrinsic bias of human researchers, to evaluate the performance of smaller models.
On a technical level, some researchers have raised concerns that using larger models to improve smaller models may help smaller models imitate the larger models’ style, but not their actual functionality. A 2023 empirical study suggested that many of the impressive performance gains attained through instruction tuning may come from picking up superficial patterns, rather than from genuine improvement in logical reasoning.12
Similarly, other researchers have posited that some reported improvements may stem from evaluating instruction-tuned models on tasks too closely related to those in their instruction tuning datasets. Through more targeted testing of models instruction tuned in this fashion, Gudibande, et al concluded that “the highest leverage action for improving open-source models is to tackle the difficult challenge of developing better base [language models], rather than taking the shortcut of imitating proprietary systems.”13
Train, validate, tune and deploy generative AI, foundation models and machine learning capabilities with ease and build AI applications in a fraction of the time with a fraction of the data.
The watsonx AI studio offers a library of cost-effective, enterprise-grade foundation models developed by IBM, open-source models and models sourced from third-party providers to help clients and partners quickly scale and operationalize generative AI with minimal risk.
Granite is IBM's flagship series of LLM foundation models based on a decoder-only transformer architecture. Granite language models are trained on trusted enterprise data spanning internet, academic, code, legal and finance domains.
Learn about large language model operation (LLMOps): the specialized practices and workflows that speed development, deployment and management of AI models throughout their complete lifecycle.
Learn how, why, and when to tune a foundation model in watsonx.ai with this series of tutorials and video guides.
NOTE: All links reside outside ibm.com.
1 "Finetuned Language Models Are Zero-Shot Learners", Google (via arXiv), 3 September 2021 (last revised 8 February 2022).
2 "Aligning language models to follow instructions", OpenAI, 27 January 2022.
3 "Language Models are Few-Shot Learners", arXiv, 22 July 2020.
4 "WMT 2014", Papers With Code, 27 June 2014.
5 "Language Models are Zero-Shot Reasoners", arXiv, 24 May 2022 (last revised 29 January 2023).
6 "Scaling Instruction-Finetuned Language Models", Google (via arXiv), 6 December, 2022.
7 "Alpaca: A Strong, Replicable Instruction-Following Model", Stanford Center for Research on Foundation Models, 13 March 2023.
8 "WizardLM: Empowering Large Language Models to Follow Complex Instructions", arXiv, 10 June 2023.
9 "Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality", LMSYS Org, 30 March 2023.
10 "Orca: Progressive Learning from Complex Explanation Traces of GPT-4", Microsoft, June 2023.
11 "Instruction Tuning with GPT-4", arXiv, 6 April 2023.
12 "Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning", arXiv, 19 May 2023.
13 "The False Promise of Imitating Proprietary LLMs", arXiv, 25 May 2023.