What is quantization aware training?

15 May 2025

Authors

Bryan Clark

Senior Technology Advocate

What is quantization aware training (QAT)?

Quantization aware training (QAT) is a method of quantization that integrates weight precision reduction directly into the pretraining or fine-tuning process of large language models (LLMs). This differs from post-training quantization (PTQ), which quantizes an already pretrained model without any additional training or fine-tuning. While QAT typically yields better accuracy than PTQ, it requires significant computational resources and access to representative datasets for training purposes.1

What is quantization?

Quantization is a machine learning compression technique that, when applied to LLMs, converts high-precision data formats (like 32-bit or 16-bit floating-point numbers) to lower-precision formats (such as 8-bit or even 4-bit integers). This process reduces the computational demands, memory footprint and overall model size of LLMs while accelerating inference speed. Though quantization inevitably introduces some accuracy loss (quantization error) as values are compressed to fewer bits, the goal is to maintain nearly identical model performance and integrity despite the reduced precision. By shrinking the number of bits that need to be processed during each model run, quantization significantly decreases computational costs and allows LLMs to generate responses more quickly, making it a valuable optimization strategy for deployment.2
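To make the size savings concrete, here is a quick back-of-the-envelope calculation for a hypothetical 7-billion-parameter model (weights only, ignoring activations and runtime overhead). The parameter count is an illustrative assumption, not a figure from the article:

```python
# Back-of-the-envelope weight-storage arithmetic for a hypothetical 7B-parameter LLM.
params = 7_000_000_000

for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    size_gb = params * bits / 8 / 1e9   # bits -> bytes -> gigabytes (decimal GB)
    print(f"{name}: ~{size_gb:.1f} GB of weights")

# Prints roughly: fp32 ~28.0 GB, fp16 ~14.0 GB, int8 ~7.0 GB, int4 ~3.5 GB
```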


How does QAT work?

Quantization aware training (QAT) works by simulating the effects of low-precision arithmetic during training. It inserts fake quantization operations into the computation graph, which mimic how weights and activations would behave under reduced precision (for example, 8-bit or 4-bit). By introducing quantization effects during training, QAT enables the model to optimize its parameters with awareness of reduced numerical precision, leading to better performance under quantized inference.

In QAT, the standard quantizer function transforms continuous, high-precision values into discrete, lower-precision representations by applying an elementwise operation to weight tensors. This function follows the uniform quantization approach, which is defined in the following algorithm:

Q(x) = Δ · round(clip(x/Δ, l, u))

Where:

Δ represents the quantization step size

The clip function constrains values between lower bound l and upper bound u:

clip(x, l, u) = l if x < l
clip(x, l, u) = x if l ≤ x ≤ u
clip(x, l, u) = u if x > u
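As a minimal sketch of this quantizer (written in PyTorch for convenience), the step size and clipping bounds below are illustrative values rather than ones prescribed above:

```python
import torch

def uniform_quantize(x: torch.Tensor, delta: float, lower: float, upper: float) -> torch.Tensor:
    """Q(x) = delta * round(clip(x / delta, l, u)) -- uniform quantization."""
    return delta * torch.clamp(x / delta, lower, upper).round()

# Example: snap random weights onto an int8-like grid (illustrative settings)
w = torch.randn(4, 4)
w_q = uniform_quantize(w, delta=0.05, lower=-128, upper=127)
```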

Forward and backward passes both use these simulated quantized values for computation. In the forward pass, fake quantization is applied to the weights and activations; in the backward pass, gradients are computed and the weights are updated. Gradients are computed by using the straight-through estimator (STE), which overcomes the nondifferentiability of the quantization function by using the identity function's gradient (Q'(x) = 1) during backpropagation. This enables weight updates in quantized neural networks despite the true gradient being zero almost everywhere. However, the model weights are maintained in full precision (floating point), ensuring they can be updated smoothly without numerical instability. This means that each training step uses quantized versions for calculations but updates the precise weights, which are then requantized for the next step.3
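One common way to obtain this straight-through behavior in PyTorch-style code is the "detach trick": the forward pass sees the quantized values, while the backward pass treats quantization as the identity. This is a minimal sketch under those assumptions, not the implementation of any particular framework:

```python
import torch

def fake_quantize_ste(x: torch.Tensor, delta: float, lower: float, upper: float) -> torch.Tensor:
    # Forward: the output equals the fake-quantized value x_q.
    # Backward: the detached term contributes no gradient, so gradients flow through x unchanged.
    x_q = delta * torch.clamp(x / delta, lower, upper).round()
    return x + (x_q - x).detach()

w = torch.randn(3, 3, requires_grad=True)
loss = fake_quantize_ste(w, 0.1, -8, 7).sum()
loss.backward()
print(w.grad)  # all ones: the gradient passed straight through the quantizer
```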

This concept can be a bit confusing, especially if you are new to the world of quantization. Imagine for a moment that you are the driver for a racing team and you are navigating an off-road course for the first time. The terrain is narrow and can be very difficult to traverse, so the first time you navigate the course you might drive a bit slower to avoid crashing. Think of this as the forward pass: the model runs its computations under quantized conditions and experiences the ruggedness introduced by quantization noise. After the first lap, your driving coach instructs you on how to navigate the course more efficiently based on your initial run. The backward pass is like your driving coach's critiques and corrections: it calculates the gradients and updates the weights based on its review of the forward pass (your first go around the rocky racecourse). Now, imagine your driving coach giving you feedback by pretending the rugged terrain were a smooth racetrack. This guidance allows you to improve your technique without getting hung up on the details of the rough terrain. Like this advice, the STE ignores the nondifferentiable quantization steps and passes gradients through as if quantization were a smooth operation. To recap: the forward pass is like navigating a rugged racecourse, the backward pass is like your driving coach's instructions based on your first lap, and the STE is the advice to pretend the rocky course were a smooth track.

QAT steps

Simulated quantization: Both the forward and backward passes are executed in floating-point precision. Maintaining floating-point precision during the backward pass is crucial, because accumulating gradients in quantized formats can lead to vanishing gradients or significant error propagation, particularly at low-precision representations. However, the model weights and activations are quantized after each gradient update, similar to a projected gradient descent approach. This method ensures that the model can converge to a better loss point, even after quantization introduces a perturbation to the model parameters.
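One reading of this projected-gradient description is sketched below: a full-precision master copy of the weights receives the gradient updates, and the model's weights are re-projected onto the quantization grid before each forward pass. The toy model, data, learning rate and step size are placeholders for illustration:

```python
import torch

# Toy model and data, assumed purely for illustration.
model = torch.nn.Linear(16, 4)
x, y = torch.randn(32, 16), torch.randn(32, 4)
lr, delta = 0.01, 0.05

# Full-precision "master" weights kept alongside the (quantized) model weights.
master = {n: p.detach().clone() for n, p in model.named_parameters()}

for step in range(100):
    with torch.no_grad():  # projection: snap master weights onto the quantization grid
        for n, p in model.named_parameters():
            p.copy_(delta * (master[n] / delta).clamp(-128, 127).round())
    loss = torch.nn.functional.mse_loss(model(x), y)  # forward pass uses the quantized weights
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for n, p in model.named_parameters():
            master[n] -= lr * p.grad  # gradient update applied to the precise master copy
```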

Gradient calculation: Straight-through estimator (STE) is used to approximate the gradient by treating the quantization operation as an identity function. This approximation allows the gradient to pass through the quantization step, facilitating model updates during training.

* STE has been shown to work effectively in practice, except in extreme cases like binary quantization, where the gradient approximation is less accurate.
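The same straight-through behavior can also be expressed as a custom autograd function whose backward pass simply returns the incoming gradient, treating the rounding as an identity. This is an illustrative sketch rather than any framework's internal implementation:

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Round in the forward pass; pass gradients through unchanged in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return x.round()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # identity gradient: Q'(x) is treated as 1

x = torch.tensor([0.2, 0.7, 1.4], requires_grad=True)
RoundSTE.apply(x).sum().backward()
print(x.grad)  # tensor([1., 1., 1.])
```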

Quantization of parameters: After each gradient update, the model weights are quantized. This projection step ensures that the updated quantized weights adhere to the quantization scheme (for example, 8-bit or 4-bit). The model parameters remain in floating-point precision during training to avoid issues like underflow from small gradient updates.

Learning quantization parameters: In some advanced versions of QAT, the quantization parameters (such as clipping ranges or step sizes) are learned during training. For example, parameterized clipping activation (PACT) learns clipping ranges for activations, while learned step size quantization (LSQ) learns scaling factors; both help fine-tune the quantization process to enhance accuracy.5
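A rough, LSQ-inspired sketch (not the exact published algorithm, which also rescales the step-size gradient) shows how a step size can be made a learnable parameter that the optimizer tunes alongside the weights:

```python
import torch
import torch.nn as nn

class LearnedStepQuantizer(nn.Module):
    """Fake quantizer with a learnable step size (LSQ-inspired sketch)."""

    def __init__(self, init_step: float = 0.1, lower: int = -128, upper: int = 127):
        super().__init__()
        self.step = nn.Parameter(torch.tensor(init_step))  # learned during training
        self.lower, self.upper = lower, upper

    def forward(self, x):
        x_scaled = x / self.step
        x_q = x_scaled.clamp(self.lower, self.upper).round()
        # Straight-through estimator for the rounding; the step size still receives gradients.
        x_q = x_scaled + (x_q - x_scaled).detach()
        return x_q * self.step

quantizer = LearnedStepQuantizer()
w = torch.randn(8, 8, requires_grad=True)
quantizer(w).pow(2).sum().backward()
print(quantizer.step.grad)  # nonzero: the step size can be optimized along with the weights
```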

Retraining overhead: QAT requires extensive retraining, sometimes over hundreds of epochs, to recover the lost accuracy, particularly at low-bit precision. This makes QAT computationally expensive, but for models that require high accuracy and long lifetimes, the retraining is often worthwhile. In contrast, for models with shorter lifespans or less critical accuracy needs, the computational cost might not justify the benefits.

Validation and benchmarking: The fully quantized model is validated against a test dataset to ensure that the accuracy remains acceptable. During this step, metrics like accuracy, precision, recall and F1 score are compared against the original floating-point model. If, for example, the accuracy drops below a predetermined threshold, it might be necessary to return to the previous steps for additional training. If the model passes, it moves on to deployment.4, 5, 6
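A lightweight way to run this comparison is to compute the same metrics for both models on a held-out test set. In this sketch the models, data and 1-point threshold are dummy stand-ins chosen only so the example runs:

```python
import torch
from sklearn.metrics import accuracy_score, f1_score

def evaluate(model, loader):
    """Collect predicted and true labels for a classifier (placeholder helper)."""
    preds, labels = [], []
    model.eval()
    with torch.no_grad():
        for x, y in loader:
            preds.extend(model(x).argmax(dim=1).tolist())
            labels.extend(y.tolist())
    return labels, preds

# Dummy stand-ins: in practice these would be the original fp32 model,
# the QAT-converted int8 model and a real test DataLoader.
fp32_model = torch.nn.Linear(16, 4)
int8_model = torch.nn.Linear(16, 4)
test_loader = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(4)]

y_true, y_fp32 = evaluate(fp32_model, test_loader)
_, y_int8 = evaluate(int8_model, test_loader)

drop = accuracy_score(y_true, y_fp32) - accuracy_score(y_true, y_int8)
print(f"accuracy drop: {drop:.4f}, int8 F1: {f1_score(y_true, y_int8, average='macro'):.4f}")
if drop > 0.01:  # hypothetical 1-point threshold
    print("Accuracy drop exceeds the threshold; consider returning for more QAT fine-tuning.")
```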

QAT with PyTorch and TensorFlow

At a high level, both PyTorch and TensorFlow implement QAT within Python environments by simulating quantization effects, or "fake quants," during training to maintain model accuracy after conversion. PyTorch's approach centers on preparing the model by specifying quantization configurations, then inserting observer modules to collect tensor statistics during calibration. PyTorch QAT then replaces floating-point operations with quantized equivalents while maintaining backpropagation compatibility. TensorFlow's QAT works through the TensorFlow Model Optimization Toolkit's Keras API, which creates a quantization-aware model wrapper that emulates the quantization math through fake-quant nodes, by using a straightforward quantize_model() API that requires minimal code changes to existing models.
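Assuming the standard TensorFlow Model Optimization Toolkit workflow, a minimal Keras sketch looks roughly like the following; the architecture and training details are illustrative, and the fit call is commented out because the training data is a placeholder:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# A small Keras model (illustrative architecture).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Wrap the model so fake-quantization nodes are inserted during training.
qat_model = tfmot.quantization.keras.quantize_model(model)
qat_model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# qat_model.fit(x_train, y_train, epochs=3)  # train as usual on your own data

# Convert the quantization-aware model to a fully quantized TFLite model.
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_int8_model = converter.convert()
```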

PyTorch's QAT framework offers two paths: eager mode quantization with torch.quantization for rapid prototyping and FX graph mode for more comprehensive model transformations. It provides fine-grained control through customizable observers and supports both symmetric and asymmetric quantization schemes. TensorFlow's QAT implementation follows a clear workflow in which the quantization-aware Keras model is trained and then converted to a fully quantized model for inference by using the TFLite converter. TensorFlow specializes in 8-bit integer (int8) quantization targeting mobile and edge devices. Both frameworks simulate quantization during training with floating-point computations that mimic fixed-point math but differ in their integration approaches: PyTorch emphasizes flexibility with its modular design, while TensorFlow focuses on deployment-ready optimization with stronger mobile acceleration support. For more information, check out the PyTorch and TensorFlow repos on GitHub for tutorials.7, 8
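On the PyTorch side, a sketch of the eager-mode workflow using the torch.ao.quantization APIs might look like this; the tiny model and the qconfig choice are assumptions made for illustration:

```python
import torch
import torch.nn as nn
import torch.ao.quantization as tq

class TinyNet(nn.Module):
    """Minimal network with quant/dequant stubs marking the quantized region (illustrative)."""
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.fc = nn.Linear(16, 4)
        self.relu = nn.ReLU()
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.relu(self.fc(self.quant(x))))

model = TinyNet()
model.qconfig = tq.get_default_qat_qconfig("fbgemm")  # default int8 QAT config for x86 backends
model_prepared = tq.prepare_qat(model.train())        # insert fake-quant modules and observers

# ... run the usual training loop on model_prepared here ...

model_int8 = tq.convert(model_prepared.eval())        # swap in real int8 modules for inference
```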

 


Use cases for QAT

Model deployment on edge devices and mobile phones: QAT is pivotal for running LLMs on devices where memory, compute and power are limited. It enables the use of efficient INT8 or even INT4 arithmetic, which accelerates inference and reduces model size without considerable loss in accuracy. Modern mobile devices are already optimized for low-precision operations, making QAT models well suited to these platforms.9

Computer vision tasks: QAT can be applied to convolutional neural networks (CNNs) for image classification. CNNs are a type of deep learning model designed to process grid-structured data such as images. On the MNIST digit classification task, for example, QAT enables a CNN to achieve accuracy close to that of the original floating-point model.10, 11

What's next?

Quantization aware training (QAT) is a critical technique for achieving efficient large language model deployment without significant accuracy loss. By integrating quantization during training, QAT enables reliable inference on resource-constrained devices, making it essential for edge computing and mobile applications. As hardware improves, QAT will unlock AI applications in IoT, real-time analytics and more personalized experiences. Its growing adoption, already supported by frameworks like PyTorch and TensorFlow, emphasizes its critical role in advancing scalable, energy-efficient AI systems.

Footnotes

    Hasan, Jahid. 2024. “Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques.” ArXiv.org. 2024. https://arxiv.org/abs/2411.06084

    Gholami, Amir, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer. 2021. “A Survey of Quantization Methods for Efficient Neural Network Inference.” ArXiv:2103.13630 [Cs], June. https://arxiv.org/abs/2103.13630

    Krishnamoorthi, Raghuraman. 2018. “Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper.” ArXiv:1806.08342 [Cs, Stat], June. https://arxiv.org/abs/1806.08342

    Ashkboos, Saleh, Bram Verhoef, Torsten Hoefler, Evangelos Eleftheriou, and Martino Dazzi. 2024. "EfQAT: An Efficient Framework for Quantization-Aware Training." ArXiv.org. 2024. https://arxiv.org/abs/2411.11038

    Esser, Steven K., Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S. Modha. 2020. “Learned Step Size Quantization.” ArXiv.org. May 6, 2020. https://doi.org/10.48550/arXiv.1902.08153

    Yang, Zhun, et al. “Injecting Logical Constraints into Neural Networks via Straight-through Estimators.” ArXiv.org, 2023, https://arxiv.org/abs/2307.04347

    “Quantization — PyTorch Master Documentation.” 2019. Pytorch.org. 2019. https://pytorch.org/docs/stable/quantization.html

    “Quantization Aware Training | TensorFlow Model Optimization.” n.d. TensorFlow. https://www.tensorflow.org/model_optimization/guide/quantization/training

    Tan, Fuwen, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, and Brais Martinez. 2024. “MobileQuant: Mobile-Friendly Quantization for On-Device Language Models.” ArXiv.org. 2024. https://arxiv.org/abs/2408.13933

    Gupta, Suyog, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. "Deep Learning with Limited Numerical Precision." ArXiv.org. 2015. https://arxiv.org/abs/1502.02551

    Kaziha, Omar, and Talal Bonny. 2020. "Exploring Quantization-Aware Training on a Convolution Neural Network," November, 1–5. https://doi.org/10.1109/ccci49893.2020.9256498