Starting from the final layer, a "backward pass" differentiates the loss function to compute how each individual parameter of the network contributes to the overall error for a single input.
Returning to our earlier example of the classifier model, we would start with the 5 neurons in the final layer, which we’ll call layer L. The softmax value of each output neuron represents the probability, between 0 and 1, that an input belongs to its category. In a perfectly trained model, the neuron representing the correct classification would have an output value close to 1 and the other neurons would have an output value close to 0.
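As a quick illustration, here is a minimal softmax sketch for that 5-neuron output layer. The raw scores (logits) are hypothetical values standing in for whatever the network actually computes:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical raw scores from the 5 output neurons
logits = np.array([2.0, 0.5, -1.0, 0.1, 1.2])
probs = softmax(logits)
print(probs)        # five probabilities that sum to 1
print(probs.sum())  # 1.0
```

Whichever neuron receives the largest raw score ends up with the largest probability; training aims to push the correct neuron's probability toward 1.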
For now, we’ll focus on the output unit representing the correct prediction, which we’ll call Lc. Lc’s activation function is a composite function, containing the many nested activation functions of the entire neural network from the input layer to the output layer. Minimizing the loss function would entail making adjustments throughout the network that bring the output of Lc’s activation function closer to 1.
To do so, we’ll need to know how any change in previous layers will change Lc’s own output. In other words, we’ll need to find the partial derivatives of Lc’s activation function.
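Because Lc’s output is a composition of nested functions, those partial derivatives come from the chain rule: the derivative of a composition is the product of the local derivatives along the way. A minimal sketch, using a sigmoid activation and hypothetical values for a single weight w and input x:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, x = 0.8, 1.5
a = sigmoid(w * x)           # forward: a = sigma(w * x)

# Chain rule: da/dw = sigma'(w*x) * d(w*x)/dw = a*(1 - a) * x
da_dw = a * (1 - a) * x

# Sanity check against a numerical (finite-difference) derivative
eps = 1e-6
numeric = (sigmoid((w + eps) * x) - sigmoid((w - eps) * x)) / (2 * eps)
print(da_dw, numeric)        # the two values agree closely
```

The analytic derivative and the finite-difference estimate match, which is exactly the relationship backpropagation exploits layer after layer.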
The output of Lc’s activation function depends on the contributions that it receives from neurons in the penultimate layer, which we’ll call layer L-1. One way to change Lc’s output is to change the weights between the neurons in L-1 and Lc. By calculating the partial derivative of Lc’s output with respect to each of those weights, we can see how increasing or decreasing any of them will bring the output of Lc closer to (or further away from) 1.
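Those last-layer partial derivatives can be sketched concretely. This example assumes a softmax output with cross-entropy loss, for which the combined gradient with respect to each raw output score is simply (probability − target); the layer sizes and values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
a_prev = rng.normal(size=4)             # activations of 4 hypothetical L-1 neurons
W = rng.normal(size=(5, 4))             # weights from L-1 into the 5 output neurons

z = W @ a_prev                          # raw scores of the output layer
probs = np.exp(z - z.max())
probs /= probs.sum()                    # softmax probabilities

target = np.zeros(5)
target[2] = 1.0                         # suppose the correct class (Lc) is index 2

dL_dz = probs - target                  # gradient of the loss w.r.t. each raw score
dL_dW = np.outer(dL_dz, a_prev)         # one partial derivative per L-1 -> L weight
print(dL_dW.shape)                      # (5, 4): one value for each weight
```

Note that the gradient for the correct neuron’s raw score (probs[2] − 1) is negative, so stepping against the gradient pushes Lc’s probability upward, toward 1.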
But that’s not the only way to change Lc’s output. The contributions Lc receives from L-1 neurons are determined not just by the weights applied to L-1’s output values, but also by the actual (pre-weight) output values themselves. The L-1 neurons’ output values, in turn, are influenced by the weights applied to inputs they receive from L-2. So we can differentiate the activation functions in L-1 to find the partial derivatives with respect to the weights applied to L-2’s contributions. These partial derivatives show us how any change to an L-2 weight will affect the outputs in L-1, which would subsequently affect the output value of Lc and thereby affect the loss function.
By that same logic, we could also influence the output values that L-1 neurons receive from L-2 neurons by adjusting the contributions that L-2 neurons receive from neurons in L-3. So we find the partial derivatives in L-3, and so on, recursively repeating this process until we’ve reached the input layer. When we’re done, we have the gradient of the loss function: a vector containing its partial derivative with respect to each weight and bias parameter in the network.
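The recursion above can be sketched end to end on a tiny network. This is a minimal illustration, not a general implementation: it assumes two hidden layers with sigmoid activations, a softmax output with cross-entropy loss, and hypothetical sizes and values throughout. Each layer’s gradient is reused to compute the gradient of the layer before it:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.normal(size=3)                  # hypothetical input
W1 = rng.normal(size=(4, 3))            # input -> layer L-2
W2 = rng.normal(size=(4, 4))            # layer L-2 -> layer L-1
W3 = rng.normal(size=(5, 4))            # layer L-1 -> output layer L
y = np.zeros(5); y[0] = 1.0             # correct class

def forward(W1, W2, W3):
    a1 = sigmoid(W1 @ x)                # L-2 activations
    a2 = sigmoid(W2 @ a1)               # L-1 activations
    z = W3 @ a2
    p = np.exp(z - z.max()); p /= p.sum()
    return a1, a2, p

def loss(W1, W2, W3):
    _, _, p = forward(W1, W2, W3)
    return -np.log(p @ y)               # cross-entropy loss

# Backward pass: each layer's gradient feeds the layer before it.
a1, a2, p = forward(W1, W2, W3)
dz3 = p - y                             # output layer (softmax + cross-entropy)
dW3 = np.outer(dz3, a2)
dz2 = (W3.T @ dz3) * a2 * (1 - a2)      # propagate back into L-1
dW2 = np.outer(dz2, a1)
dz1 = (W2.T @ dz2) * a1 * (1 - a1)      # propagate back into L-2
dW1 = np.outer(dz1, x)

# Spot-check one early-layer weight against a finite difference:
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
numeric = (loss(W1p, W2, W3) - loss(W1, W2, W3)) / eps
print(dW1[0, 0], numeric)               # analytic and numeric gradients agree
```

The key efficiency is that the backward pass never recomputes anything: the term propagated into each layer already summarizes how every later layer responds, which is what makes backpropagation tractable for deep networks.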
We’ve now completed a forward pass and backward pass for a single training example. However, our goal is to train the model to generalize well to new inputs. Doing so requires training on a large number of samples that reflect the diversity and range of inputs the model will be tasked with making predictions on after training.