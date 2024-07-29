FP32 can represent roughly 4 billion distinct values, spanning approximately -3.4 × 10^38 to 3.4 × 10^38. INT8, by contrast, can represent only 256 values, ranging from -128 to 127. Because INT8 values occupy 8 bits instead of 32 and form a far smaller set, matrix multiplication on them can run much faster. Given the immense computational cost of deep learning models, quantization algorithms that are both accurate and efficient are essential.
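
The sizes and ranges quoted above can be checked with a few lines of standard-library Python. This is only an illustrative sketch; the variable names are made up for this example, and the FP32 count is the number of 32-bit patterns, which also includes infinities and NaNs.

```python
import struct

# Largest finite FP32 value: sign bit 0, exponent 0xFE, mantissa all ones.
fp32_max = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]

# Number of distinct 32-bit patterns (about 4.29 billion, NaNs and infinities included).
fp32_bit_patterns = 2 ** 32

# Signed 8-bit integers in two's complement: 256 values from -128 to 127.
int8_values = 2 ** 8
int8_min, int8_max = -(2 ** 7), 2 ** 7 - 1

print(f"FP32 max: {fp32_max:.4e}")  # prints "FP32 max: 3.4028e+38"
print(f"FP32 bit patterns: {fp32_bit_patterns}")
print(f"INT8 values: {int8_values}, range [{int8_min}, {int8_max}]")
```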

Quantization begins by determining how best to project the 32-bit floating-point values onto the INT8 range. There are multiple algorithms for quantizing a model. We will look at two of them: absolute max quantization and affine quantization.



Absolute max quantization

To map a floating-point number to its corresponding INT8 number in absolute max quantization, you first divide it by the absolute maximum value of the tensor, then multiply by the largest value of the target range (127 for INT8), and finally round the result to the nearest integer.

For example, we will apply the absolute max quantization algorithm to the vector [1.6, -0.7, -3.4, 1.7, -2.9, 0.5, 2.3, 6.2]. Its absolute maximum is 6.2. Absolute max quantization uses the symmetric INT8 range [-127, 127], so we divide 127 by 6.2 and obtain a scaling factor of about 20.5. Multiplying the original vector by this factor and rounding gives the quantized vector [33, -14, -70, 35, -59, 10, 47, 127]. Because the results are rounded, some precision is lost. [5]
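
The worked example can be reproduced with a short sketch in plain Python. The function names `absmax_quantize` and `absmax_dequantize` are hypothetical, chosen for this illustration, and the clipping step is a defensive assumption rather than part of the description above.

```python
def absmax_quantize(values, q_max=127):
    """Symmetric absolute max quantization of a list of floats to INT8."""
    abs_max = max(abs(v) for v in values)
    if abs_max == 0.0:
        # All-zero input: any scale works, so return zeros with scale 1.
        return [0] * len(values), 1.0
    scale = q_max / abs_max
    # Python's round() rounds exact halves to the nearest even integer.
    quantized = [max(-q_max, min(q_max, round(v * scale))) for v in values]
    return quantized, scale


def absmax_dequantize(quantized, scale):
    """Map INT8 values back to approximate floating-point values."""
    return [q / scale for q in quantized]


vector = [1.6, -0.7, -3.4, 1.7, -2.9, 0.5, 2.3, 6.2]
quantized, scale = absmax_quantize(vector)
recovered = absmax_dequantize(quantized, scale)

print(quantized)        # [33, -14, -70, 35, -59, 10, 47, 127]
print(round(scale, 1))  # 20.5
print(max(abs(x - r) for x, r in zip(vector, recovered)))  # worst-case rounding error
```

Dequantizing divides by the same scaling factor, so each recovered value differs from the original by at most half a quantization step (0.5 divided by the scaling factor).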



Affine quantization

To implement the affine quantization algorithm, we will define our range of 32-bit floating point values as [a, b]. The affine quantization algorithm is as follows:

$$x_q = \operatorname{round}\left(\frac{1}{S}\,x + Z\right)$$

- $x_q$ is the quantized INT8 value that corresponds to the 32-bit floating-point value $x$.

- $S$ is the scaling factor, a positive FP32 value.

- $Z$ is the zero-point: the INT8 value that corresponds to zero in the 32-bit floating-point range.

- round denotes rounding the result to the nearest integer.
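
The formula above can be sketched in plain Python as follows. The text does not say how S and Z are obtained, so the derivation in `affine_params` (mapping [a, b] linearly onto [-128, 127]) is an assumption based on the common convention, and all function names are hypothetical.

```python
def affine_params(a, b, q_min=-128, q_max=127):
    """Derive scale S and zero-point Z so that [a, b] maps onto [q_min, q_max].

    Assumed convention: S = (b - a) / (q_max - q_min), Z = round(q_min - a / S).
    """
    if not a < b:
        raise ValueError("expected a < b")
    scale = (b - a) / (q_max - q_min)
    zero_point = round(q_min - a / scale)
    return scale, zero_point


def affine_quantize(x, scale, zero_point, q_min=-128, q_max=127):
    """x_q = round(x / S + Z), clipped to the INT8 range."""
    q = round(x / scale + zero_point)
    return max(q_min, min(q_max, q))


def affine_dequantize(q, scale, zero_point):
    """Approximate inverse: x is roughly S * (x_q - Z)."""
    return scale * (q - zero_point)


vector = [1.6, -0.7, -3.4, 1.7, -2.9, 0.5, 2.3, 6.2]
scale, zero_point = affine_params(min(vector), max(vector))
quantized = [affine_quantize(x, scale, zero_point) for x in vector]
recovered = [affine_dequantize(q, scale, zero_point) for q in quantized]

print(scale, zero_point)
print(quantized)
print(recovered)
```

With this choice of parameters, 0.0 quantizes exactly to the zero-point, and the asymmetric range [a, b] uses all 256 INT8 levels instead of leaving part of the range unused.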

Next, to establish the [min, max] range of our 32-bit floating-point values, we need to take outliers into account. If outliers are overlooked, they end up defining min or max, which stretches the range and can skew the accuracy of the quantized model. To counter this, the model can be quantized in blocks: the weights are broken up into groups of 64 or 128, and each group is quantized separately, so that an outlier affects only its own group and the risk of lowered precision is reduced. [6]
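
The effect of block-wise quantization can be illustrated with a small experiment. This sketch assumes absolute max scaling inside each block, which the text does not specify, and uses synthetic weights with one injected outlier; the helper names are hypothetical.

```python
import math


def absmax_roundtrip(values, q_max=127):
    """Quantize with absolute max scaling, then map back to floats."""
    abs_max = max(abs(v) for v in values)
    if abs_max == 0.0:
        return list(values)
    scale = q_max / abs_max
    return [round(v * scale) / scale for v in values]


def blockwise_roundtrip(values, block_size=64):
    """Apply absmax_roundtrip to each consecutive block independently."""
    recovered = []
    for start in range(0, len(values), block_size):
        recovered.extend(absmax_roundtrip(values[start:start + block_size]))
    return recovered


def mean_abs_error(original, recovered):
    return sum(abs(o - r) for o, r in zip(original, recovered)) / len(original)


# Synthetic weights: 256 small values plus a single large outlier.
weights = [0.1 * math.sin(i) for i in range(256)]
weights[10] = 8.0

per_tensor_error = mean_abs_error(weights, absmax_roundtrip(weights))
blockwise_error = mean_abs_error(weights, blockwise_roundtrip(weights, 64))

print(f"per-tensor mean error: {per_tensor_error:.6f}")
print(f"block-wise mean error: {blockwise_error:.6f}")
```

Only the block that contains the outlier inherits its coarse scale; the remaining blocks get a scale fitted to their own small values, so the overall error drops. The cost is one stored scaling factor per block instead of one per tensor.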

