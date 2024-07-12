In most cases, classification loss is calculated in terms of entropy. Entropy, in plain language, is a measure of uncertainty within a system. For an intuitive example, compare flipping coins to rolling dice: the former has lower entropy, as there are fewer potential outcomes in a coin flip (2) than in a dice toss (6).
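To make the coin-versus-die comparison concrete, the following minimal sketch computes Shannon entropy for both. It assumes a fair coin and a fair die and measures entropy in bits (base-2 logarithm); the function name `shannon_entropy` is illustrative, not taken from any library.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    # Outcomes with probability 0 contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0)

coin = [1 / 2] * 2  # fair coin: two equally likely outcomes (assumption)
die = [1 / 6] * 6   # fair die: six equally likely outcomes (assumption)

coin_entropy = shannon_entropy(coin)            # 1 bit
die_entropy = shannon_entropy(die)              # log2(6), roughly 2.585 bits
certain_entropy = shannon_entropy([1.0, 0.0])   # a certain outcome: zero entropy

print(coin_entropy, die_entropy, certain_entropy)
```

The last case mirrors a ground truth label: when one outcome is certain, entropy is zero.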

In supervised learning, model predictions are compared to the ground truth classifications provided by data labels. Those ground truth labels are certain and thus have low or no entropy. As such, we can measure loss in terms of the difference between the certainty we’d have using the ground truth labels and the certainty of the labels predicted by the model.

The formula for cross-entropy loss (CEL) is derived from that of Kullback-Leibler divergence (KL divergence), which measures the difference between two probability distributions. Ultimately, minimizing loss entails minimizing the difference between the ground truth distribution of probabilities assigned to each potential label and the relative probabilities for each label predicted by the model.
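The relationship between cross-entropy and KL divergence can be checked numerically: cross-entropy equals the entropy of the ground truth distribution plus the KL divergence from the prediction to the ground truth. The sketch below uses natural logarithms and a hypothetical three-class prediction `q`; the helper names are illustrative only.

```python
import math

def entropy(p):
    # H(p) = -sum_k p_k * log(p_k), skipping zero-probability outcomes
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def cross_entropy(p, q):
    # H(p, q) = -sum_k p_k * log(q_k); assumes q_k > 0 wherever p_k > 0
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

def kl_divergence(p, q):
    # KL(p || q) = sum_k p_k * log(p_k / q_k)
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [1.0, 0.0, 0.0]  # ground truth: the first class is certain
q = [0.7, 0.2, 0.1]  # hypothetical model prediction

h_p = entropy(p)           # 0, because the label is certain
h_pq = cross_entropy(p, q)
kl = kl_divergence(p, q)

# H(p, q) = H(p) + KL(p || q)
print(h_pq, h_p + kl)
```

Because the ground truth entropy is zero here, minimizing cross-entropy and minimizing KL divergence amount to the same thing.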



Binary cross-entropy (log loss)

Binary cross-entropy loss, also called log loss, is used for binary classification. Binary classification algorithms typically output a likelihood value between 0 and 1. For example, in an email spam detection model, email inputs that result in outputs closer to 1 might be labeled “spam.” Inputs yielding outputs closer to 0 would be classified as “not spam.” An output of 0.5 would indicate maximum uncertainty or entropy.

Though the algorithm will output values between 0 and 1, the ground truth values for the correct predictions are exactly “0” or “1.” Minimizing binary cross-entropy loss thus entails not only penalizing incorrect predictions but also penalizing predictions with low certainty. This incentivizes the model to learn parameters that yield predictions that are not only correct but also confident. Furthermore, focusing on the logarithms of predicted likelihood values results in the algorithm more heavily penalizing predictions that are confidently wrong.

To maintain the common convention of lower loss values meaning less error, the result is multiplied by -1. Log loss for a single example $i$ is thus calculated as

$$-\big( y_i \cdot \log(p(y_i)) + (1 - y_i) \cdot \log(1 - p(y_i)) \big)$$

where $y_i$ is the true likelihood (either 0 or 1) and $p(y_i)$ is the predicted likelihood. Average loss across an entire set of $n$ training examples is thus calculated as

$$-\frac{1}{n} \sum_{i=1}^{n} \big( y_i \cdot \log(p(y_i)) + (1 - y_i) \cdot \log(1 - p(y_i)) \big)$$
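As a rough illustration of the averaged log loss formula, the sketch below evaluates it on four made-up labels under three sets of hypothetical predictions. The function name, the sample values, and the clipping constant `eps` (used only to avoid taking the log of 0) are assumptions for illustration.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean log loss over labels (0 or 1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip predictions away from exactly 0 or 1 so log() stays finite.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    # Multiply by -1 so that lower values mean less error.
    return -total / len(y_true)

labels = [1, 0, 1, 0]  # made-up ground truth

confident_right = binary_cross_entropy(labels, [0.9, 0.1, 0.9, 0.1])  # about 0.105
unsure = binary_cross_entropy(labels, [0.5, 0.5, 0.5, 0.5])           # about 0.693
confident_wrong = binary_cross_entropy(labels, [0.1, 0.9, 0.1, 0.9])  # about 2.303

print(confident_right, unsure, confident_wrong)
```

The ordering of the three results reflects the behavior described above: uncertain predictions cost more than confident correct ones, and confidently wrong predictions cost the most.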



Categorical cross-entropy loss

Categorical cross-entropy loss (CCEL) applies this same principle to multi-class classification. A multi-class classification model will usually output a value for each potential class, representing the probability of an input belonging to each respective category. In other words, it outputs its predictions as a probability distribution.

In deep learning, neural network classifiers typically use a softmax activation function for neurons in the output layer. Each output neuron’s value is mapped to a number between 0 and 1, with the values collectively summing up to 1.
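A minimal sketch of softmax follows, assuming the raw output-layer values (often called logits) arrive as a plain list of floats. Subtracting the maximum before exponentiating is a common numerical-stability step and does not change the result; the sample logits are made up.

```python
import math

def softmax(logits):
    """Map raw scores to values in (0, 1) that sum to 1."""
    # Shifting by the max keeps exp() from overflowing on large inputs.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical output-layer values
probs = softmax(logits)   # roughly [0.659, 0.242, 0.099]

print(probs, sum(probs))
```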

For a data point that belongs to exactly one category, the ground truth values comprise “1” for the true class and “0” for each incorrect class. Minimizing CCEL entails increasing the output value for the correct class and decreasing output values for incorrect classes, thereby bringing the predicted probability distribution closer to that of the ground truth. For each example, log loss must be calculated for each potential classification predicted by the model. With ground truth values of this form, every term for an incorrect class is multiplied by 0, so only the log of the probability assigned to the true class contributes to that example’s loss.
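The calculation can be sketched as follows, assuming each ground truth label is a list with “1” for the true class and “0” elsewhere, and each prediction is a probability distribution such as a softmax output. The function name, the sample data, and the `eps` floor are illustrative assumptions.

```python
import math

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy over a batch of examples."""
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        # Sum -t * log(p) over every class; max(p, eps) avoids log(0).
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(truth, pred))
    return total / len(y_true)

y_true = [[1, 0, 0], [0, 1, 0]]             # made-up labels for two examples
good = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]   # most mass on the true class
poor = [[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]]   # most mass on a wrong class

good_loss = categorical_cross_entropy(y_true, good)  # about 0.290
poor_loss = categorical_cross_entropy(y_true, poor)

print(good_loss, poor_loss)
```

Shifting probability toward the true class lowers the loss, which is the behavior training aims to produce.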