In many real-world settings, an artificial intelligence model’s accuracy and capacity are not, in and of themselves, enough to make the model useful: the model must also fit within the available budget of time, memory, money and computational resources.
The top-performing models for a given task are often too large, slow or expensive for most practical use cases, yet they tend to have unique qualities that emerge from the combination of their size and their pre-training on massive quantities of data. These emergent abilities are especially evident in autoregressive large language models (LLMs), such as GPT or Llama, which exhibit capabilities beyond their explicit training objective of simply predicting the next word in a sequence. Conversely, small models are faster and less computationally demanding, but they lack the accuracy, refinement and knowledge capacity of a large model with far more parameters.
In the seminal 2015 paper “Distilling the Knowledge in a Neural Network,” Hinton et al. proposed circumventing these limitations by using different models for training and for deployment, each optimized for its distinct purpose. The authors offered an analogy: whereas many insects have a larval form optimized for extracting energy and nutrients from the environment and a totally different adult form optimized for traveling and reproduction, conventional deep learning uses the same model for both the training and deployment stages, despite their very different requirements.
Taking inspiration from both nature and the earlier work of Caruana et al., Hinton et al. suggested that training large, cumbersome models is worthwhile if doing so is the best way to extract structure from data, but introduced a different kind of training, distillation, to transfer that knowledge to a small model more suitable for real-time deployment.2
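To make the idea concrete, here is a minimal sketch of the soft-target objective at the heart of that paper, written in PyTorch: the student is trained to match the teacher’s temperature-softened output distribution while still learning from the ground-truth labels. The function name and the default temperature and weighting below are illustrative choices, not values prescribed by the paper.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Soft-target distillation loss in the style of Hinton et al. (2015).

    Blends a KL-divergence term between the temperature-softened teacher and
    student distributions with ordinary cross-entropy on the hard labels.
    The default temperature and alpha are illustrative, not prescribed values.
    """
    # Soften both output distributions with the same temperature T.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # Penalize the student for deviating from the teacher's soft predictions.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Standard supervised loss against the ground-truth class labels.
    ce_term = F.cross_entropy(student_logits, labels)

    # Weighted blend of the two objectives; alpha is tuned in practice.
    return alpha * kd_term + (1.0 - alpha) * ce_term
```

During training, the teacher’s weights stay frozen and only the student’s parameters are updated; raising the temperature exposes the relative probabilities the teacher assigns to incorrect classes, which carry much of the knowledge being transferred.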
Knowledge distillation (KD) techniques aim not only to replicate the outputs of teacher models, but also to emulate their “thought processes.” In the era of LLMs, KD has enabled the transfer of abstract qualities like style, reasoning abilities and alignment with human preferences and values.3
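For autoregressive LLMs, that transfer is often implemented as sequence-level distillation: the teacher generates complete responses, including step-by-step reasoning where relevant, and the student is fine-tuned on those outputs with an ordinary next-token prediction loss. The sketch below uses Hugging Face Transformers, with GPT-2 and DistilGPT-2 standing in simply as a small, publicly available teacher/student pair that shares a tokenizer; it illustrates the pattern rather than a production training loop.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 and DistilGPT-2 serve here only as a conveniently small pair of
# public checkpoints that share a tokenizer; any causal LM pair would do.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
teacher = AutoModelForCausalLM.from_pretrained("gpt2").eval()
student = AutoModelForCausalLM.from_pretrained("distilgpt2")

prompt = "Explain, step by step, why the sky appears blue."
inputs = tokenizer(prompt, return_tensors="pt")

# 1. The teacher writes out a full demonstration, making its "thought
#    process" explicit as text.
with torch.no_grad():
    demonstration = teacher.generate(**inputs, max_new_tokens=128)

# 2. The student is trained with plain next-token prediction on the
#    teacher's demonstration; repeated over many prompts, this transfers
#    style, reasoning patterns and preferences rather than raw labels.
#    (A real setup would typically mask the prompt tokens in the labels.)
outputs = student(input_ids=demonstration, labels=demonstration)
outputs.loss.backward()  # an optimizer step would follow in a real loop
```

Unlike the logit-matching loss above, this approach needs only the teacher’s generated text, so it can be used even when the teacher is reachable solely through an API.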
Furthermore, smaller models are inherently more explainable: in a model with hundreds of billions of parameters, it’s difficult to interpret the contributions of different parts of the neural network. Transferring the representations learned by large “black-box” models to simpler models can help surface transformative insights in fields like medical diagnostics and molecular discovery.4