**Published:** 29 July 2024

**Contributors:** Bryan Clark

Quantization is the process of reducing the precision of a digital signal, typically from a higher-precision format to a lower-precision format. This technique is widely used in various fields, including signal processing, data compression and machine learning.

Quantization is a technique used within large language models (LLMs) to convert the weights and activation values of high-precision data types, usually 32-bit floating point (FP32) or 16-bit floating point (FP16), to a lower-precision data type, such as 8-bit integer (INT8). High-precision data types (FP32 and FP16) get their name because models that use them typically achieve higher accuracy. When the data is compressed into a format like INT8, it is squeezed into a smaller size, which reduces accuracy; the resulting discrepancy is referred to as quantization error. An activation value is a number (between zero and one) assigned to an artificial neuron of the neural network. 8-bit quantization is generally the goal, but quantized data of 4-bit integer (INT4) and lower has been achieved successfully. Essentially, the quantization process applies compression techniques to a neural network to represent its values with a smaller number of bits. ^{1}

The computational requirements of operating an LLM using FP32 can be immense, and inference (the process of an LLM generating a response to a user's query) is slowed as well. Quantization can be a powerful optimization tool that both reduces the computational burden and increases the inference speed of an LLM. The quantization process rests on the premise that the weights can be converted to a lower-precision data type while the model's performance remains almost identical. Converting the weights to a lower-precision data type lowers computational costs because fewer bits need processing each time the model runs, which in turn means each query to the LLM is processed more quickly.


By utilizing quantization to convert the floating-point data types to integers, the calculations can be completed more rapidly. This decreases the model's overall latency and improves inference speed at some cost in accuracy. This tradeoff is crucial for real-time applications, especially those running on mobile CPUs. ^{2}

Quantization is crucial when attempting to run machine learning models on devices that cannot handle higher computational requirements. When quantization converts floating-point to integer representation, it reduces the computational demands of the machine learning model. This makes it possible to utilize these quantized models within real-world applications on laptops, tablets and smartphones. ^{2}

Typically, quantized models have lower computational requirements. Therefore, quantization results in increased energy efficiency, which is key for running these models on laptops, tablets and mobile phones. ^{3}

Utilizing quantization allows current machine learning models to run using integer operations. This makes the quantized models compatible with older platforms that do not support floating-point operations. This also makes these models much more accessible, making it possible to run them on consumer GPUs. ^{4}

There are around 4 billion values in the set of possible FP32 values, ranging from -3.4 × 10³⁸ to 3.4 × 10³⁸. With INT8, there are only 256 possible values, ranging from -128 to 127. Because the latter is a significantly smaller set of values, matrix multiplication can occur much faster. Given the immense computational cost of deep learning models, accurate and efficient quantization algorithms are essential.
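A quick sketch using NumPy's standard type metadata helpers (`np.finfo` and `np.iinfo`) makes these ranges concrete:

```python
import numpy as np

# FP32: 2^32 distinct bit patterns spanning roughly +/-3.4e38
fp32 = np.finfo(np.float32)
fp32_patterns = 2 ** 32

# INT8: only 256 values, from -128 to 127
int8 = np.iinfo(np.int8)
int8_values = int(int8.max) - int(int8.min) + 1

print(f"FP32 range: {fp32.min:.1e} to {fp32.max:.1e}")  # -3.4e+38 to 3.4e+38
print(f"INT8 range: {int8.min} to {int8.max}")          # -128 to 127
print(int8_values)                                       # 256
```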

Quantization begins by determining the best way to project the 32-bit floating-point values into the INT8 range. There are multiple algorithms for quantizing a model; here we will look at two methods, absolute max quantization and affine quantization.

**Absolute max quantization**

To calculate the mapping between a floating-point number and its corresponding INT8 number in absolute max quantization, you first divide each value by the absolute maximum value of the tensor and then multiply by the upper bound of the INT8 range (127).

For example, we will apply the absolute max quantization algorithm to the vector [1.6, -0.7, -3.4, 1.7, -2.9, 0.5, 2.3, 6.2]. First extract its absolute maximum, which is 6.2 in this case. INT8 has a range of [-127, 127], so we divide 127 by 6.2 and obtain 20.5 for the scaling factor. Multiplying the original vector by this factor gives the quantized vector [33, -14, -70, 35, -59, 10, 47, 127]. Because these numbers are rounded, there is some precision loss. ^{5}
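The same steps can be sketched in a few lines of NumPy (a minimal illustration with made-up function names, not a production implementation):

```python
import numpy as np

def absmax_quantize(x):
    """Symmetric (absolute-max) quantization of an FP32 vector to INT8."""
    scale = 127 / np.max(np.abs(x))            # e.g. 127 / 6.2 = 20.5
    x_q = np.round(x * scale).astype(np.int8)  # rounding loses some precision
    return x_q, scale

def dequantize(x_q, scale):
    """Map INT8 values back to approximate FP32 values."""
    return x_q.astype(np.float32) / scale

x = np.array([1.6, -0.7, -3.4, 1.7, -2.9, 0.5, 2.3, 6.2], dtype=np.float32)
x_q, scale = absmax_quantize(x)
print(x_q)  # [ 33 -14 -70  35 -59  10  47 127]
```

Dequantizing the result shows the quantization error: each value is recovered only to within half an INT8 step (about 0.024 here).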

**Affine quantization**

To implement affine quantization, we first define the range of our 32-bit floating-point values as [a, b]. The algorithm is as follows:

*x_q = round((1/S)x + Z)*

- **x_q** is the quantized INT8 value that corresponds to the 32-bit floating point value x.

- **S** is an FP32 scaling factor and is a positive 32-bit floating point number.

- **Z** is the zero-point: the INT8 value that corresponds to zero in the 32-bit floating point field.

- **round** refers to rounding the resulting value to the nearest integer.
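Putting the formula together, here is a small NumPy sketch; the function names are illustrative, not from any particular library, and the scale and zero-point are chosen so that a maps to -128 and b maps to 127:

```python
import numpy as np

def affine_quantize(x, a, b):
    """Affine quantization of FP32 values in [a, b]: x_q = round(x/S + Z)."""
    S = (b - a) / 255.0             # scale: FP32 units per INT8 step
    Z = int(round(-128 - a / S))    # zero-point: INT8 value for FP32 zero
    x_q = np.clip(np.round(x / S + Z), -128, 127).astype(np.int8)
    return x_q, S, Z

def affine_dequantize(x_q, S, Z):
    """Recover approximate FP32 values from the INT8 representation."""
    return (x_q.astype(np.float32) - Z) * S

x = np.array([-2.0, -0.5, 0.0, 1.2, 4.0], dtype=np.float32)
x_q, S, Z = affine_quantize(x, a=-2.0, b=4.0)
print(x_q)  # [-128  -64  -43    8  127]
```

Unlike absolute max quantization, the zero-point lets an asymmetric range like [-2, 4] use all 256 INT8 values.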

Next, to establish the [min, max] range of our 32-bit floating-point values, we need to take any outliers into account. Overlooking these outliers can lead to them being mapped as the min or max, skewing the accuracy of the quantized model. To counter this, the model can be quantized in blocks: the weights are broken into groups of, for example, 64 or 128, and each group is quantized separately so that outliers only affect their own block, minimizing the risk of lowered precision. ^{6}
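The block-wise idea can be sketched as well, here as a hypothetical absolute-max variant: each block gets its own scaling factor, so an outlier only distorts the block it lives in:

```python
import numpy as np

def blockwise_quantize(w, block_size=64):
    """Quantize a weight vector in independent blocks of `block_size`.

    Each block gets its own scaling factor, so a single outlier only
    affects the precision of its own block, not the whole tensor.
    """
    blocks = w.reshape(-1, block_size)
    scales = 127 / np.max(np.abs(blocks), axis=1, keepdims=True)
    return np.round(blocks * scales).astype(np.int8), scales

# Two blocks of small weights; the second block also contains a large outlier.
w = np.full(128, 0.5, dtype=np.float32)
w[64] = 100.0  # outlier in the second block only
q, scales = blockwise_quantize(w)

# First block keeps full precision: 0.5 maps to 127.
# Second block's scale is dominated by the outlier: 0.5 maps to just 1.
```

Without blocking, the single outlier would have flattened every weight in the tensor toward zero.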

**Post-training quantization (PTQ)**

Post-training quantization occurs when quantization is applied to an existing model. It converts the model from a floating-point representation to lower-precision fixed-point integers without the need for any retraining. This method does not require as much data as quantization-aware training and is much faster. However, because an already existing model is essentially being converted to a smaller size, post-training quantization can degrade performance. PTQ is a good choice when you already have a working model and want to increase its speed and efficiency: because it takes place after a model is trained, it does not require a large amount of training data. ^{7}

**Quantization-aware training (QAT)**

Quantization-aware training incorporates the weight conversion during the pre-training or fine-tuning of an LLM. This allows for enhanced performance, but it demands a large amount of computational power and requires representative training data. Overall, quantization-aware training usually produces a model with higher performance, but it is more expensive and demands far more computing power. QAT makes sense when you possess an adequate amount of training data and a larger budget; because this process takes place during the training stage, it is not suited to a model that is already trained. ^{7}
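One common way QAT simulates quantization during training is "fake quantization": weights are quantized and immediately dequantized in the forward pass, so the network learns to tolerate the rounding error (in practice, gradients flow through the rounding via a straight-through estimator). A minimal sketch, assuming the absolute-max scheme from earlier:

```python
import numpy as np

def fake_quantize(w):
    """Quantize-dequantize ("fake quantization") used in a QAT forward pass."""
    scale = 127 / np.max(np.abs(w))
    w_int = np.round(w * scale)  # simulated INT8 grid values
    return w_int / scale         # back to FP32, now carrying the rounding error

w = np.array([0.31, -0.12, 0.97, -0.55], dtype=np.float32)
w_sim = fake_quantize(w)
# w_sim lies on the INT8 grid but stays FP32, so downstream layers (and the
# loss) see the same error the deployed INT8 model will produce.
```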

**Dynamic quantization versus static quantization techniques**

These two techniques differ in how the clipping range, often referred to as calibration, is selected. In dynamic quantization, the clipping range is computed dynamically for each activation. This typically results in higher accuracy, but it can be highly expensive. As its name states, static quantization uses a fixed clipping range for all inputs; because of dynamic quantization's cost, the static form is used more often.
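The distinction can be sketched with the affine formula from earlier; the only difference is where the clipping range comes from (the function name and the calibrated range below are illustrative):

```python
import numpy as np

def quantize_activations(x, clip_range=None):
    """Affine-quantize activations to INT8.

    dynamic: clip_range=None, the [min, max] range is computed per input
    static:  clip_range=(a, b), a fixed range chosen during calibration
    """
    a, b = (float(x.min()), float(x.max())) if clip_range is None else clip_range
    S = (b - a) / 255.0
    Z = int(round(-128 - a / S))
    return np.clip(np.round(x / S + Z), -128, 127).astype(np.int8)

x = np.array([-1.0, 0.0, 2.0], dtype=np.float32)
q_dyn = quantize_activations(x)                           # range computed from x
q_stat = quantize_activations(x, clip_range=(-3.0, 2.1))  # fixed calibrated range
```

The dynamic call always stretches each input across the full INT8 range, which is more accurate but adds a min/max pass at every inference step; the static call reuses one precomputed range.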

When the weights are converted during quantization, there is sometimes a loss of accuracy in the quantized values. Model size should also be taken into consideration: when quantizing exceptionally large LLMs with numerous parameters and layers, significant quantization error can accumulate. ^{8}

Training machine learning models at scale can be extremely costly, especially with quantization-aware training (QAT). This makes post-training quantization (PTQ) the better choice from a cost-effective standpoint. However, this does limit the model in some respects, as QAT will typically produce a more accurate model. ^{9}


¹ Dong Liu, Meng Jiang, Kaiser Pister, "LLMEasyQuant - An Easy to Use Toolkit for LLM Quantization", https://arxiv.org/pdf/2406.19657v2 (link resides outside ibm.com).

² Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, Dmitry Kalenichenko, "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference", https://arxiv.org/pdf/1712.05877v1 (link resides outside ibm.com).

³ Ravi Kishore Kodali, Yatendra Prasad Upreti, Lakshmi Boppana, "A Quantization Approach for the Reduced Size of Large Language Models", https://ieeexplore.ieee.org/document/10499664 (link resides outside ibm.com).

⁴ Xiao Sun, Naigang Wang, Chia-yu Chen, Jia-min Ni, Ankur Agrawal, Xiaodong Cui, Swagath Venkataramani, Kaoutar El Maghraoui, Vijayalakshmi Srinivasan, "Ultra-Low Precision 4-bit Training of Deep Neural Networks", https://research.ibm.com/publications/ultra-low-precision-4-bit-training-of-deep-neural-networks.

⁵ Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer, "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale", https://arxiv.org/pdf/2208.07339 (link resides outside ibm.com).

⁶ Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer, "A Survey of Quantization Methods for Efficient Neural Network Inference", https://arxiv.org/pdf/2103.13630 (link resides outside ibm.com).

⁷ Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev, Paulius Micikevicius, "Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation", https://arxiv.org/pdf/2004.09602 (link resides outside ibm.com).

⁸ Zhuocheng Gong, Jiahao Liu, Jingang Wang, Xunliang Cai, Dongyan Zhao, Rui Yan, "What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation", https://arxiv.org/pdf/2403.06408v1 (link resides outside ibm.com).

⁹ Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer, "SqueezeLLM: Dense-and-Sparse Quantization", https://arxiv.org/pdf/2306.07629v4 (link resides outside ibm.com).