Simulated quantization: Both the forward and backward passes are executed in floating-point precision. Keeping the backward pass in floating point is crucial, because accumulating gradients in a quantized format can cause vanishing gradients or significant error propagation, particularly with low-precision representations. The model weights and activations are nevertheless quantized after each gradient update, in a manner similar to projected gradient descent. Because the model is trained with the quantization perturbation already in place, it can converge to a point with lower loss than it would reach if its parameters were quantized only after training.
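As a rough illustration of simulated ("fake") quantization, the sketch below quantizes a value to a uniform grid and immediately maps it back to floating point, so that downstream arithmetic stays in floating point. The function name `fake_quantize`, the fixed range `[-1, 1]`, and the asymmetric uniform grid are assumptions made for this example, not details taken from the text.

```python
def fake_quantize(x, num_bits=8, x_min=-1.0, x_max=1.0):
    """Quantize x to a uniform grid, then map it back to floating point."""
    levels = 2 ** num_bits - 1             # number of steps between x_min and x_max
    scale = (x_max - x_min) / levels       # width of one quantization step
    clipped = min(max(x, x_min), x_max)    # clip to the representable range
    q = round((clipped - x_min) / scale)   # integer code in [0, levels]
    return x_min + q * scale               # dequantize: a float that lies on the grid


# Hypothetical weights; the last one falls outside the range and is clipped.
weights = [0.137, -0.52, 0.9991, 1.7]
fq = [fake_quantize(w, num_bits=4) for w in weights]
```

Each returned value is still a floating-point number, but it can take only one of `2 ** num_bits` distinct values, which is the perturbation the training procedure learns to tolerate.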

Gradient calculation: The straight-through estimator (STE) is used to approximate the gradient of the quantization operation, whose true derivative is zero almost everywhere, by treating the operation as an identity function in the backward pass. This approximation allows the gradient to pass through the quantization step, so that the model can still be updated during training.

* STE has been shown to work effectively in practice, except in extreme cases like binary quantization, where the gradient approximation is less accurate.
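The STE can be sketched for a single weight as follows. The one-parameter model `y = quantize(w) * x`, the squared-error loss, and the step size of 0.25 are assumptions chosen to keep the example small.

```python
def quantize(w, step=0.25):
    """Round w to the nearest multiple of `step` (a uniform quantizer)."""
    return round(w / step) * step


def loss_and_grad_ste(w, x, target, step=0.25):
    """Squared-error loss for y = quantize(w) * x and its STE gradient w.r.t. w."""
    w_q = quantize(w, step)           # forward pass uses the quantized weight
    y = w_q * x
    loss = (y - target) ** 2
    dloss_dy = 2.0 * (y - target)
    dy_dwq = x
    dwq_dw = 1.0                      # STE: treat quantize() as the identity here
    return loss, dloss_dy * dy_dwq * dwq_dw


loss, grad = loss_and_grad_ste(w=0.6, x=2.0, target=2.0)
```

With these inputs the quantized weight is 0.5, so the loss is 1.0 and the STE gradient is -4.0. The exact derivative of the rounding step would be zero, which would leave the weight without any training signal.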

Quantization of parameters: After each gradient update, the model weights are quantized. This projection step ensures that the weights used by the model adhere to the quantization scheme (for example, 8-bit or 4-bit). A floating-point copy of the parameters is kept throughout training and receives the gradient updates, which avoids issues such as small updates underflowing to zero.
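The benefit of keeping the parameters in floating point can be seen in a toy experiment. The sketch below fits a single weight to one sample with a quantized forward pass; the function `train`, the flag `keep_float_master`, the learning rate, and the step size are all hypothetical choices for illustration.

```python
def quantize(w, step=0.25):
    """Round w to the nearest multiple of `step`."""
    return round(w / step) * step


def train(keep_float_master, steps=200, lr=0.01, step=0.25):
    """Fit y = w * x to one sample (x = 1, target = 1) with a quantized forward pass."""
    x, target = 1.0, 1.0
    w = 0.0
    for _ in range(steps):
        w_q = quantize(w, step)                # projection onto the quantization grid
        grad = 2.0 * (w_q * x - target) * x    # STE gradient of the squared error
        w = w - lr * grad                      # update applied to the stored weight
        if not keep_float_master:
            w = quantize(w, step)              # store only the quantized value
    return quantize(w, step)


with_master = train(keep_float_master=True)
without_master = train(keep_float_master=False)
```

In this setup each update is smaller than half a quantization step. When only the quantized value is stored, every update is rounded away and the weight never leaves 0.0; when the floating-point copy is kept, the small updates accumulate and the quantized weight reaches the target of 1.0.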

Learning quantization parameters: In some advanced versions of QAT, the quantization parameters (such as clipping ranges or step sizes) are learned during training. For example, parameterized clipping activation (PACT) learns the clipping ranges for activations, and learned step size quantization (LSQ) learns scaling factors for activations. Both methods help fine-tune the quantization process to enhance accuracy.
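A simplified sketch of the PACT idea is given below: activations are clipped to `[0, alpha]` and quantized, and the clipping level `alpha` receives a gradient only from activations that were clipped. The function names, the sample values, and the plain gradient-descent update are assumptions for this example; any regularization of `alpha` is omitted.

```python
def pact_forward(x, alpha, num_bits=4):
    """Clip x to [0, alpha], then quantize uniformly to num_bits."""
    levels = 2 ** num_bits - 1
    y = min(max(x, 0.0), alpha)
    return round(y * levels / alpha) * alpha / levels


def pact_grad_alpha(x, alpha):
    """STE gradient of the output w.r.t. alpha: 1 where x was clipped, else 0."""
    return 1.0 if x >= alpha else 0.0


# Hypothetical activations and upstream gradients (dLoss/dOutput).
activations = [0.3, 1.2, 2.5, 4.0]
upstream = [0.1, -0.2, 0.5, 0.3]
alpha = 2.0

outputs = [pact_forward(a, alpha) for a in activations]
grad_alpha = sum(g * pact_grad_alpha(a, alpha) for a, g in zip(activations, upstream))
new_alpha = alpha - 0.1 * grad_alpha    # one gradient-descent step on alpha
```

Only the two activations above `alpha` contribute to `grad_alpha`, so the clipping range moves in response to how much signal is being cut off.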

Retraining overhead: QAT requires extensive retraining, sometimes over hundreds of epochs, to recover the lost accuracy, particularly at low-bit precision. This makes QAT computationally expensive, but for models that require high accuracy and long lifetimes, the retraining is often worthwhile. In contrast, for models with shorter lifespans or less critical accuracy needs, the computational cost might not justify the benefits.

Validation and benchmarking: The fully quantized model is validated against a test dataset to ensure that its accuracy remains acceptable. During this step, metrics such as accuracy, precision, recall, and F1 score are compared against those of the original floating-point model. If, for example, the accuracy drops below a predetermined threshold, it might be necessary to return to the previous steps for additional training. If the model passes, it moves on to deployment [4, 5, 6].
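The comparison described above might be sketched as a simple acceptance check. The helper names, the binary-classification setting, the toy predictions, and the `max_drop` threshold are assumptions for illustration only.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def passes_validation(float_metrics, quant_metrics, max_drop=0.01):
    """True if no metric drops by more than max_drop relative to the float model."""
    return all(float_metrics[k] - quant_metrics[k] <= max_drop for k in float_metrics)


# Hypothetical test labels and predictions from the two models.
labels      = [1, 0, 1, 1, 0, 1, 0, 0]
float_preds = [1, 0, 1, 1, 0, 1, 0, 1]
quant_preds = [1, 0, 0, 1, 0, 1, 0, 1]

float_metrics = classification_metrics(labels, float_preds)
quant_metrics = classification_metrics(labels, quant_preds)
ready_to_deploy = passes_validation(float_metrics, quant_metrics, max_drop=0.01)
```

Here the quantized model loses more than the allowed margin on every metric, so `ready_to_deploy` is false and the model would go back for additional training.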