Date: 2024-04-05

Modern deep learning models are built from artificial neural networks, comprising multiple layers of interconnected nodes (or “neurons”). Each neuron has an activation function: a mathematical operation performed on data received from the previous layer, whose output informs the input fed to the following layer. Classic feed-forward neural networks (FFNs) process information by progressively passing input data from neurons in one layer to neurons in the following layer until it reaches an output layer where final predictions occur. Some neural network architectures incorporate additional elements, like the self-attention mechanisms of transformer models, that capture additional patterns and dependencies in input data.
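The layer-by-layer flow described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names (`dense_layer`, `feed_forward`), the ReLU activation, and the tiny hand-picked weights are all assumptions chosen for readability.

```python
def relu(x):
    """Activation function: passes positive values, zeroes out negatives."""
    return max(0.0, x)


def dense_layer(inputs, weights, biases, activation):
    """Compute one layer: each neuron takes a weighted sum of the previous
    layer's outputs, adds its bias, and applies the activation function."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(weighted_sum))
    return outputs


def feed_forward(inputs, layers):
    """Pass data through every layer in order; the last layer's output
    is the network's prediction."""
    activations = inputs
    for weights, biases, activation in layers:
        activations = dense_layer(activations, weights, biases, activation)
    return activations


# Hypothetical network: 2 inputs, a hidden layer of 2 neurons, 1 output neuron.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),  # hidden layer
    ([[1.0, 2.0]], [0.1], lambda z: z),             # output layer (identity)
]
prediction = feed_forward([2.0, 1.0], layers)
print(prediction)  # approximately [4.1]
```

Note that every neuron in every layer is evaluated for every input, which is what makes this a dense computation.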

The connections between different layers and neurons are mediated by learnable model parameters: variable weights (and biases) that amplify or diminish the influence a given part of the network’s output has on other parts of the network. A deep learning model “learns” by adjusting these parameters, using optimization algorithms like gradient descent, in a way that increases the accuracy of its predictions.
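To make the idea of learning by parameter adjustment concrete, here is a small sketch of gradient descent on a single neuron with one weight and one bias. The mean-squared-error loss, the learning rate, the step count, and the toy data (generated from y = 2x + 1) are all assumptions made for illustration.

```python
def train_linear_neuron(data, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # parameters start at arbitrary values
    n = len(data)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        # Step each parameter against its gradient to reduce the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data drawn from y = 2x + 1 (hypothetical).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train_linear_neuron(data)
print(round(w, 4), round(b, 4))  # close to 2.0 and 1.0
```

Real deep learning models apply the same principle to millions or billions of parameters, with gradients computed by backpropagation rather than by hand.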

While a larger number of parameters increases the model’s capacity—its ability to absorb information and patterns therein—it also increases the computational resources needed to train and operate the model. In a typical deep learning model—what in this context is referred to as a dense model—the entire network is executed in order to process any and all inputs. This creates a tradeoff between model capacity and practicality.

Unlike conventional dense models, mixture of experts (MoE) uses conditional computation to enforce sparsity: rather than using the entire network for every input, MoE models learn a computationally cheap mapping function that determines which portions of the network (in other words, which experts) are best suited to process a given input, such as an individual token representing a word or word fragment in NLP tasks.
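One common way to realize such a mapping function is top-k gating: a single linear map scores every expert, and only the k highest-scoring experts are run. The sketch below assumes that scheme. The names `route_top_k` and `moe_layer`, the softmax over the selected scores, and the toy experts and gate weights are hypothetical and not taken from any particular library or model.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(token, gate_weights, k):
    """Score every expert with one cheap linear map, then keep the top k."""
    scores = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    return list(zip(top, weights))


def moe_layer(token, gate_weights, experts, k=2):
    """Run only the selected experts and blend their outputs."""
    routed = route_top_k(token, gate_weights, k)
    output = [0.0] * len(token)
    for expert_index, weight in routed:
        expert_out = experts[expert_index](token)  # unselected experts never run
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, [i for i, _ in routed]


# Hypothetical experts: each just scales the token, and counts its calls.
calls = [0, 0, 0, 0]

def make_expert(index, scale):
    def expert(token):
        calls[index] += 1
        return [scale * x for x in token]
    return expert

experts = [make_expert(i, s) for i, s in enumerate([1.0, 2.0, 3.0, 4.0])]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

output, chosen = moe_layer([3.0, 1.0], gate_weights, experts, k=2)
print(chosen)  # [0, 1]
print(calls)   # [1, 1, 0, 0]
```

The call counter shows the conditional computation at work: for this token, two of the four experts are executed and the other two are skipped entirely.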

This allows the capacity of the model to be increased (by expanding the total number of parameters) without a corresponding increase in the computational burden required to train and run it (because not all of those parameters will necessarily be used at any given time).
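A rough back-of-the-envelope calculation illustrates the gap between total and active parameters. The figures below are hypothetical, and the helper `moe_parameter_counts` is a simplification that ignores the (typically small) parameters of the routing function itself.

```python
def moe_parameter_counts(shared_params, params_per_expert, num_experts, active_experts):
    """Return (total parameters, parameters used per input) for a simple MoE model."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + active_experts * params_per_expert
    return total, active


# Hypothetical model: 8 experts, 2 of which are selected for each input.
total, active = moe_parameter_counts(
    shared_params=1_000_000,
    params_per_expert=5_000_000,
    num_experts=8,
    active_experts=2,
)
print(total)   # 41000000
print(active)  # 11000000
```

Under these assumptions, doubling the number of experts to 16 would raise the total to 81,000,000 parameters while the per-input count stays at 11,000,000, which is the sense in which capacity grows without a matching growth in compute.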