10. Quantization

Until this point we have mostly used 32-bit floating point numbers (FP32) for training and applying our machine learning models. FP32 hardware is widely available, and its numerical accuracy is typically more than sufficient for machine learning workloads. Yet, FP32 operations are costly compared to floating point or integer number formats with fewer bits.

The 16-bit floating point data type bfloat16 (BF16) is a popular alternative to FP32. FP32 and BF16 numbers have the same number of exponent bits, which makes conversions between the two formats simple. Oftentimes BF16 can be used as a drop-in replacement for FP32 data in machine learning models, i.e., from a user perspective one simply uses BF16 tensors instead of FP32 tensors. BF16 support is widely available on recent CPUs, GPUs and tailored machine learning accelerators.
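In PyTorch, for example, switching such a computation from FP32 to BF16 is largely a dtype change. The following minimal sketch assumes a reasonably recent PyTorch build with BF16 support on the host:

```python
import torch

# FP32 baseline: a tiny matrix-vector product
w = torch.randn(4, 4)       # default dtype is torch.float32
x = torch.randn(4)
y_fp32 = w @ x

# BF16 as a drop-in replacement: same code, only the dtype changes
w_bf16 = w.to(torch.bfloat16)
x_bf16 = x.to(torch.bfloat16)
y_bf16 = w_bf16 @ x_bf16    # the matmul now runs in bfloat16

print(y_bf16.dtype)                                      # torch.bfloat16
print((y_fp32 - y_bf16.to(torch.float32)).abs().max())   # precision lost w.r.t. FP32
```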

When performing neural network inference, we may be able to use even fewer bits per value, e.g., through eight- or four-bit data types. Note that the training of neural networks is typically more demanding w.r.t. the required number of bits.

Hint

A floating point number has a single sign bit, a certain number of mantissa bits and a certain number of exponent bits. An IEEE-754 FP32 floating point number uses eight bits for the exponent and 23 bits for the mantissa. Normalized FP32 numbers have a hidden bit, thus we effectively obtain 24 mantissa bits.
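To make this bit layout concrete, the following small sketch (using NumPy, which is otherwise not required in this lab) extracts the sign, exponent and mantissa fields of an FP32 value:

```python
import numpy as np

def fp32_fields(value):
    """Split an FP32 value into its IEEE-754 sign, exponent and mantissa bits."""
    bits = int(np.array([value], dtype=np.float32).view(np.uint32)[0])
    sign     = bits >> 31            # 1 sign bit
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits (biased by 127)
    mantissa = bits & 0x7FFFFF       # 23 stored mantissa bits (hidden bit not stored)
    return sign, exponent, mantissa

# 1.5 = (-1)^0 * 1.1b * 2^0: sign 0, biased exponent 127, mantissa 0b100...0
print(fp32_fields(1.5))   # (0, 127, 4194304)
```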

For a floating point number format, we have to decide how many bits we invest in the mantissa (precision) and how many we invest in the exponent (dynamic range). Which trade-off suits machine learning best is actively discussed by the community. For example, NVIDIA, Arm and Intel recently presented suitable FP8 formats for deep learning. The recommended E4M3 encoding uses four exponent bits and three mantissa bits, and the recommended E5M2 encoding uses five exponent bits and two mantissa bits.

This lab takes a different route and studies how we can use integer operations for deep learning inference. Once we have tackled the challenge of converting the FP32 data of a model to an integer data type, e.g., int8, we may exploit the full performance of state-of-the-art fixed-point accelerators. As an example, consider Meta’s AI inference accelerator MTIA v1, which has a peak performance of 102.4 TOPS when using int8. Section 11 of this class will use the Hexagon Tensor Processor of the Snapdragon 8 Gen 2 system on chip and deploy our quantized model on a mobile device.

10.1. AI Model Efficiency Toolkit

The AI Model Efficiency Toolkit (AIMET) is open-source software which allows us to study and optimize the quantization of neural network models. AIMET has a PyTorch interface which we may use to feed our PyTorch models into the software’s quantization simulation and optimization mechanisms. In this part of the lab we will set up AIMET and apply its min-max quantization scheme to a neural network consisting of a single linear layer.

Hint

The work A White Paper on Neural Network Quantization by Qualcomm AI Research provides a detailed overview of different quantization approaches available in AIMET.

By keeping the setup simple, we confirm that the min-max scheme follows the signed symmetric quantization scheme discussed in the lectures. Given a set \(R\) of floating-point-valued numbers, we obtain the scaling factor \(s\) through:

\[s = \frac{ \text{max}( \text{abs}( R ) ) }{ 2^{b-1} -1 }.\]

The parameter \(b\) gives the number of bits of the used integer data type. For example, when using int8, we obtain \(b=8\). Now, we derive the quantized value \(x_\text{quant}\) for a real-valued scalar \(x_\text{float}\) by applying scaling, rounding and clamping:

\[x_\text{quant} = \text{clamp} \left( \text{rne}\left( \frac{x_\text{float}}{s} \right), -2^{b-1}, 2^{b-1} - 1 \right).\]

\(\text{rne}\) is the “round to nearest, ties to even” operator, and the function \(\text{clamp}\) is defined as:

\[\begin{split}\text{clamp}(x,a,b) = \begin{cases} a, \quad \text{if} \quad x < a \\ b, \quad \text{if} \quad x > b \\ x, \quad \text{otherwise}. \end{cases}\end{split}\]

Thus, for signed int8 quantization the minimum possible value after quantization is -128 and the maximum value is 127. This leads to two types of quantization errors. First, applying the \(\text{rne}\) function may lead to a rounding error. Second, if \(x_\text{float} / s < -2^{b-1}\) or \(x_\text{float} / s > 2^{b-1}-1\), the clamping is active and we obtain a clipping error.
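The scheme fits into a few lines of PyTorch. The sketch below only illustrates the formulas above and is independent of AIMET; note that torch.round rounds ties to even and therefore matches \(\text{rne}\):

```python
import torch

def quantize_symmetric(t, b):
    """Signed symmetric min-max quantization of the tensor t to b bits."""
    s = t.abs().max() / (2 ** (b - 1) - 1)               # scaling factor s
    q = torch.round(t / s)                               # rne: torch.round ties to even
    q = torch.clamp(q, -2 ** (b - 1), 2 ** (b - 1) - 1)  # clamp to the integer range
    return q, s

x_float = torch.tensor([0.25, 0.17, -0.31, 0.55])
x_quant, s_x = quantize_symmetric(x_float, b=8)
print(x_quant)        # tensor([ 58.,  39., -72., 127.])
print(s_x * x_quant)  # dequantized values, close to x_float up to rounding
```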

The file simple_linear.py sets up a simple neural network consisting of a linear layer without a bias. The weights \(W_\text{float}\) of the linear layer are initialized to:

\[\begin{split}W_\text{float} = \begin{bmatrix} 0.25 & -0.32 & 1.58 & 2.10 \\ -1.45 & 1.82 & -0.29 & 3.78 \\ -2.72 & -0.12 & 2.24 & -1.84 \\ 1.93 & 0.49 & 0.00 & -3.19 \\ \end{bmatrix}\end{split}\]

The desired output \(y_\text{float}\) is computed by applying the network to the input \(x_\text{float} = (0.25, 0.17, -0.31, 0.55)\), i.e., by computing \(y_\text{float} = W_\text{float} \, x_\text{float}\). In this task we will quantize the input \(x_\text{float}\) and the weight matrix \(W_\text{float}\) using AIMET. This gives us the quantized input \(x_\text{quant}\) and the quantized weight matrix \(W_\text{quant}\), which we use to compute the quantized result \(y_\text{quant}\):

\[y_\text{quant} = W_\text{quant} \, x_\text{quant}.\]

Further, we can invert the scaling and obtain an approximate floating-point result \(\tilde{y}_\text{float}\) from the quantized one:

\[y_\text{float} \approx \tilde{y}_\text{float} = s \cdot y_\text{quant},\]

where the combined scaling factor \(s = s_W \cdot s_x\) is the product of the weight and input scaling factors.
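As a plain-PyTorch cross-check of these formulas, independent of AIMET and of simple_linear.py, the following sketch quantizes the weights and input introduced above to int8, performs the integer-valued matrix-vector product, and dequantizes with the combined scale:

```python
import torch

W_float = torch.tensor([[ 0.25, -0.32,  1.58,  2.10],
                        [-1.45,  1.82, -0.29,  3.78],
                        [-2.72, -0.12,  2.24, -1.84],
                        [ 1.93,  0.49,  0.00, -3.19]])
x_float = torch.tensor([0.25, 0.17, -0.31, 0.55])

b = 8
s_W = W_float.abs().max() / (2 ** (b - 1) - 1)   # per-tensor scale of the weights
s_x = x_float.abs().max() / (2 ** (b - 1) - 1)   # per-tensor scale of the input

W_quant = torch.clamp(torch.round(W_float / s_W), -128, 127)
x_quant = torch.clamp(torch.round(x_float / s_x), -128, 127)

y_quant = W_quant @ x_quant       # integer-valued matrix-vector product
y_tilde = s_W * s_x * y_quant     # dequantize with the combined scale

print(W_float @ x_float)          # exact FP32 reference
print(y_tilde)                    # approximation obtained via the quantized path
```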

Tasks

  1. Make yourself familiar with the PyTorch Model Quantization API of AIMET. Explain the purpose of compute_encodings in your own words. What is the parameter forward_pass_callback used for?

  2. Quantize the provided SimpleLinear model l_model and apply the quantized model to the input vector l_x (a rough sketch of this workflow follows the task list):

    • Use four bits as the weight and activation bitwidths,

    • Use the AIMET quantization scheme aimet_common.defs.QuantScheme.post_training_tf,

    • Use the provided function calibrate as your forward_pass_callback argument when invoking compute_encodings.

  3. Print min, max and delta values for AIMET’s input, weight and output quantizers. Explain the values!

  4. Export the quantized model.
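The sketch below outlines one possible way to work through tasks 2, 3 and 4. It assumes the aimet_torch 1.x API (QuantizationSimModel, compute_encodings, export); the definitions of l_model, l_x and calibrate are stand-ins reconstructed from the description above and should be replaced by the objects provided in simple_linear.py, whose exact signatures may differ:

```python
import torch
from aimet_common.defs import QuantScheme
from aimet_torch.quantsim import QuantizationSimModel

# Stand-ins reconstructed from the description; in the lab, use the objects
# provided by simple_linear.py instead.
l_model = torch.nn.Linear(4, 4, bias=False)
with torch.no_grad():
    l_model.weight.copy_(torch.tensor([[ 0.25, -0.32,  1.58,  2.10],
                                       [-1.45,  1.82, -0.29,  3.78],
                                       [-2.72, -0.12,  2.24, -1.84],
                                       [ 1.93,  0.49,  0.00, -3.19]]))
l_x = torch.tensor([[0.25, 0.17, -0.31, 0.55]])

def calibrate(model, _):
    # A single calibration pass so the quantizers observe realistic ranges.
    with torch.no_grad():
        model(l_x)

sim = QuantizationSimModel(model=l_model,
                           dummy_input=l_x,
                           quant_scheme=QuantScheme.post_training_tf,
                           default_param_bw=4,     # 4-bit weights
                           default_output_bw=4)    # 4-bit activations

sim.compute_encodings(forward_pass_callback=calibrate,
                      forward_pass_callback_args=None)

print(sim.model(l_x))   # inference through the quantization-simulated model
sim.export(path='.', filename_prefix='simple_linear', dummy_input=l_x)
```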

10.2. Post-Training Quantization

In this part of the lab, we will leave the world of quantization toy problems and quantize a trained neural network which fulfills an actual purpose 😉. The lectures discussed this using the MLP which we trained in Section 4. You may quantize any trained model in this lab (including the MLP), but keep in mind that this is the model which you will deploy in Section 11.

When searching for suitable models, consider PyTorch Hub, which contains a rich collection of pre-trained models across different application domains. Further, the AIMET Model Zoo hosts many FP32 models together with their quantized counterparts. The provided accuracy metrics may serve as a guideline for what is achievable with AIMET’s advanced quantization features.

Tasks

  1. Pick a trained model which you quantize in the following task. Import the model in PyTorch and check its FP32 accuracy on suitable test data (a rough starting point is sketched after this task list).

  2. Use AIMET to perform an int8 quantization of the model’s weights and activations. You may use advanced techniques, e.g., cross-layer equalization or AdaRound. Document your approach and the AIMET features used. Report the accuracy of your quantized model.

  3. Export the quantized model.
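As a possible starting point for tasks 1 and 2, the sketch below loads a pre-trained ResNet-18 from PyTorch Hub and sets up an int8 QuantizationSimModel, again assuming the aimet_torch 1.x API and, for simplicity, the post_training_tf scheme from Section 10.1. The calibration data loader is a placeholder you have to provide; advanced steps such as cross-layer equalization or AdaRound, as well as the accuracy evaluation itself, are left out:

```python
import torch
from aimet_common.defs import QuantScheme
from aimet_torch.quantsim import QuantizationSimModel

# Example model: a pre-trained ResNet-18 from PyTorch Hub (any trained model works).
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)

sim = QuantizationSimModel(model=model,
                           dummy_input=dummy_input,
                           quant_scheme=QuantScheme.post_training_tf,
                           default_param_bw=8,     # int8 weights
                           default_output_bw=8)    # int8 activations

def calibrate(model, data_loader):
    # Run a few representative batches so the activation ranges are observed.
    with torch.no_grad():
        for images, _ in data_loader:
            model(images)

# With a calibration data loader at hand:
# sim.compute_encodings(forward_pass_callback=calibrate,
#                       forward_pass_callback_args=calibration_loader)
# ... evaluate sim.model on the test data, then:
# sim.export(path='.', filename_prefix='resnet18_int8', dummy_input=dummy_input)
```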