The specific training objective used for diffusion models is closely related to the reconstruction loss term used to optimize variational autoencoders (VAEs). Like VAEs, diffusion models are optimized by maximizing the variational lower bound (VLB), also called the evidence lower bound (ELBO), which breaks down into a combination of multiple loss terms.
Maximizing the VLB is used in variational inference to approximate the intractable score function ∇x log p(x): instead of directly minimizing error, it reformulates the problem as maximizing a minimum estimate (or lower bound) of the accuracy of the model's predictions.
Each loss term reflects the Kullback-Leibler divergence (or “KL divergence,” usually denoted as DKL) between the outcome of a forward diffusion step of q and the corresponding reverse step predicted by pθ. KL divergence measures the difference between two probability distributions: for instance, between the distribution of pixel values in one image and the distribution of pixel values in another.
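As a concrete illustration, here is a minimal Python sketch of KL divergence between two discrete distributions; the toy pixel-intensity histograms are hypothetical, invented purely for this example.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)) for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p /= p.sum()  # normalize raw counts into probability distributions
    q /= q.sum()
    # eps guards against log(0) and division by zero in empty bins
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Toy 8-bin pixel-intensity histograms from two (hypothetical) images
hist_a = [12, 30, 55, 80, 70, 40, 20, 5]
hist_b = [10, 25, 50, 85, 75, 45, 15, 8]
print(kl_divergence(hist_a, hist_b))  # small value: the distributions are similar
```

Note that KL divergence is not symmetric: kl_divergence(hist_a, hist_b) and kl_divergence(hist_b, hist_a) will generally differ.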
Specifically, the loss function for diffusion models combines three loss terms: LT, Lt and L0.
- LT reflects the KL divergence between q(xT|x0) and pθ(xT): in other words, the difference between the fully noised end result of the forward process q and the starting point of the reverse process. This term can generally be ignored during training, because xT is pure Gaussian noise and q has no learnable parameters.
- Lt reflects the KL divergence between q(xt−1|xt, x0) and pθ(xt−1|xt) at each step. In other words, the accuracy of each of pθ’s denoising predictions during reverse diffusion as compared to the corresponding noising step of q during the forward diffusion process for the original image, x0.
- L0 measures −log pθ(x0|x1). In other words, L0 reflects the negative log likelihood of the model’s prediction of the fully denoised image x0. The gradient of L0 is the score matching term described earlier in the article. The log likelihood is negated so that minimizing the loss function becomes equivalent to maximizing the likelihood of the model’s predictions.
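Putting the three terms together, the full decomposition is commonly written as follows. This is the standard DDPM-style notation, which may differ cosmetically from other presentations:

$$
L_{\mathrm{VLB}} = \mathbb{E}_q\bigg[\underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p_\theta(x_T)\big)}_{L_T} + \sum_{t=2}^{T} \underbrace{D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{L_t} \underbrace{-\,\log p_\theta(x_0 \mid x_1)}_{L_0}\bigg]
$$

Minimizing this quantity is equivalent to maximizing the ELBO described above.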
Though its complex mathematical derivation is beyond the scope of this article, the VLB can ultimately be simplified down to the mean-squared error (MSE) between the noise predicted by the model, εθ(xt, t), and the true noise added in the forward process, ε, at each timestep. This explains why the model’s output is a prediction of the noise at each step, rather than the denoised image itself.
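In symbols, this simplified objective is typically stated as:

$$
L_{\mathrm{simple}}(\theta) = \mathbb{E}_{t,\,x_0,\,\epsilon}\Big[\big\|\,\epsilon - \epsilon_\theta(x_t, t)\,\big\|^2\Big]
$$

where xt is the partially noised image produced by the forward process at timestep t, ε is the Gaussian noise actually added, and εθ(xt, t) is the model’s prediction of that noise.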
By calculating the gradient of the loss function during backpropagation and then adjusting model weights through gradient descent to minimize that loss, the model’s predictions become more accurate across the entire training data set.
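To make the full loop concrete, here is a minimal sketch of a single training step in PyTorch. It assumes a noise-prediction network model(x_t, t), an optimizer, and a precomputed tensor alpha_bar of cumulative noise-schedule products; all of these names are hypothetical placeholders, not a specific library’s API.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, x0, alpha_bar, T=1000):
    """One DDPM-style training step on a batch of clean images x0 of shape (B, C, H, W)."""
    batch = x0.shape[0]
    t = torch.randint(0, T, (batch,), device=x0.device)   # random timestep per sample
    eps = torch.randn_like(x0)                            # true noise added by the forward process
    a = alpha_bar[t].view(batch, 1, 1, 1)                 # broadcast schedule over image dimensions
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps          # noised image at timestep t
    eps_pred = model(x_t, t)                              # model predicts the added noise
    loss = F.mse_loss(eps_pred, eps)                      # simplified VLB: MSE on the noise
    optimizer.zero_grad()
    loss.backward()                                       # gradient of the loss via backpropagation
    optimizer.step()                                      # gradient descent update of the weights
    return loss.item()
```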