Each of these myriad neuron-to-neuron connections is multiplied by a unique weight, which amplifies (or diminishes) the influence of each connection. The input provided to each neuron’s activation function can be understood as the weighted sum of the outputs of each neuron in the previous layer. There’s usually also a unique bias term added to that weighted sum, which plays a role similar to the bias (or intercept) term of a common regression function.
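As a rough illustration, a single neuron’s pre-activation input is the dot product of the previous layer’s outputs with that neuron’s incoming weights, plus its bias. The sketch below assumes a sigmoid activation and invented values; the variable names are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

# Outputs of the three neurons in the previous layer (illustrative values)
prev_outputs = np.array([0.5, -1.2, 0.8])

# This neuron's unique weight for each incoming connection, plus its bias
weights = np.array([0.4, 0.7, -0.3])
bias = 0.1

# Weighted sum of incoming signals, then the activation function
z = np.dot(weights, prev_outputs) + bias
activation = sigmoid(z)
print(activation)
```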
During training, the neural network “learns” through adjustments to each of these weights and bias terms that yield more accurate outputs. These are the model’s parameters: when you read about, for instance, a large language model (LLM) having 8 billion “parameters,” that number reflects every single weighted neuron-to-neuron connection and neuron-specific bias in the model’s neural network.
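To see where such a count comes from, consider a toy fully connected network: each layer contributes (inputs × outputs) weights plus one bias per output neuron. The layer sizes below are invented for illustration.

```python
# Hypothetical layer sizes: 100 inputs, two hidden layers, 10 outputs
layer_sizes = [100, 64, 32, 10]

total_params = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    weights = n_in * n_out   # one weight per neuron-to-neuron connection
    biases = n_out           # one bias per neuron in the layer
    total_params += weights + biases

print(total_params)  # 8,874 parameters for this toy network
```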
The intermediate layers, called the network’s hidden layers, are where most of the learning occurs. It’s the inclusion of multiple hidden layers that distinguishes a deep learning model from a “non-deep” neural network, such as a restricted Boltzmann machine (RBM) or standard multilayer perceptron (MLP). The presence of multiple hidden layers allows a deep learning model to learn complex hierarchical features of data, with earlier layers identifying simple, granular patterns and deeper layers combining them into broader, more abstract ones.
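To make the deep versus “non-deep” distinction concrete, the sketch below uses PyTorch (chosen only as an illustration) to define a shallow MLP with a single hidden layer alongside a deeper network with several; the layer sizes are arbitrary.

```python
import torch.nn as nn

# A "non-deep" multilayer perceptron: one hidden layer between input and output
shallow_mlp = nn.Sequential(
    nn.Linear(100, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# A deep network: multiple hidden layers, enabling hierarchical feature learning
deep_net = nn.Sequential(
    nn.Linear(100, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 10),
)
```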
To perform inference, the network completes a forward pass: the input layer receives input data, usually in the form of a vector embedding, with each input neuron processing an individual feature of the input vector. For example, a model that works with 10x10-pixel grayscale images will typically have 100 neurons in its input layer, with each input neuron corresponding to an individual pixel. Neural networks therefore typically require input vectors of a fixed size, though preprocessing techniques like pooling or normalization can provide some flexibility with regard to the size of the original input data itself.
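For instance, the 10x10 grayscale example above could be flattened into a 100-element vector before being handed to the input layer. This is a minimal sketch of that preprocessing step, assuming 8-bit pixel intensities; in practice, libraries typically handle the reshaping and normalization for you.

```python
import numpy as np

# A hypothetical 10x10 grayscale image with pixel intensities in [0, 255]
image = np.random.randint(0, 256, size=(10, 10))

# Normalize pixel values to [0, 1] and flatten into a 100-element input vector
input_vector = (image / 255.0).flatten()
print(input_vector.shape)  # (100,) -- one feature per input neuron
```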
The data is progressively transformed and passed along to the nodes of each subsequent layer until it reaches the final layer. The activation functions of the neurons in the output layer compute the network’s final prediction. For instance, the output layer of a deep classification model might apply a softmax function, which scales the raw numerical score of each output node into a probability between 0 and 1 that the input belongs to the corresponding classification category. The model would then output the category corresponding to whichever output node yielded the highest probability.
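A minimal sketch of that final step, assuming a three-category classifier: the raw output-layer scores (logits) are passed through softmax, and the prediction is whichever category receives the highest probability. The scores and category names here are invented for illustration.

```python
import numpy as np

def softmax(logits):
    # Shift by the max for numerical stability, then normalize to probabilities
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

# Hypothetical raw output-layer scores for three categories
logits = np.array([2.0, 0.5, -1.0])
categories = ["cat", "dog", "bird"]

probs = softmax(logits)
prediction = categories[np.argmax(probs)]
print(probs)       # approximately [0.786, 0.175, 0.039] -- sums to 1
print(prediction)  # "cat"
```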