.. _ch:mobile_inference:

Mobile Inference
================

.. _fig:soc_hdk8550:

.. figure:: data_mobile_inference/hdk8550.jpg
   :align: center
   :width: 100 %

   Photo of the HDK8550 Development Kit used in this lab.
   The main board hosts an SM8550P system on chip.

This lab deploys a trained deep learning model on mobile devices.
For this we will use the `SM8550P `__ System on Chip (SoC) of the `HDK8550 Development Kit `__.
The SoC features different accelerators which we may harness for machine learning inference: a Central Processing Unit (CPU), a Graphics Processing Unit (GPU) and a Hexagon Tensor Processor (HTP).
As `outlined `__ in the Qualcomm developer network, the accelerators have different characteristics.
In particular, low-precision floating-point computations and quantization play a crucial role in fully exploiting the SoC at inference time.

Model and Data Conversions
--------------------------

This section discusses the on-device deployment of a trained machine learning model.
While the approach is transferable, we limit our discussion to the `resnet18 `__ model for brevity.
The AIMET export of the resnet18 model is shared in the file :download:`aimet_export.tar.xz `.
In addition to the graph and the FP32 weights, the respective quantization parameters generated by AIMET are provided.

Our goal is to run the resnet18 inference workload on the available machine learning accelerators, i.e., the CPU, the GPU and the HTP of the SM8550P SoC.
For now, we will limit our efforts to the required preprocessing, execution and postprocessing steps of the inference pipeline.
Once accomplished, we'll benchmark the performance of the accelerators in :numref:`ch:mobile_inference_benchmark`.

.. note::
   You may deploy other models in a similar fashion.
   For example, detailed documentation for `SESR `__ or `EnhancedGAN `__ networks is `provided `__ as part of the Qualcomm developer network.
   Further, the documentation of the Qualcomm AI Engine Direct SDK provides a tutorial for the Inception v3 model.
   The SDK's documentation is available on the server ``soc.inf-ra.uni-jena.de`` in the directory ``/opt/HexagonSDK/5.3.0.0/docs``.

All tools required for deploying the exported model on the mobile devices are preinstalled on the server ``soc.inf-ra.uni-jena.de``.
The server is also connected to the development kits such that you may issue file transfers and executions without additional hops.
We will use the `Qualcomm AI Engine Direct SDK `__ to deploy our models.
In general, we have to perform the following steps for on-device inference:

#. Translate the PyTorch-exported model to a format which can run on the targeted accelerator.
   Possibly, enable the model for quantized inference in this step.

#. Prepare the input data such that it can be used on-device.
   We will use raw FP32 values of the `face-blurred ImageNet `__ test dataset.
   The data preparation step is already done for you and the (preprocessed and shuffled) data are available in the directory ``/opt/data/imagenet/raw_test/batch_size_32`` on the server.
   Each batch has the size [32, 224, 224, 3], i.e., we have 32 samples per batch, 224x224 pixels per image and three color channels.
   A sketch which reads such a batch on the host is given after this list.

#. Transfer the prepared model and data from the host to a mobile device.

#. Run the inference workload on the device.

#. Transfer the results from the mobile device back to the host, and analyze the results.

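Regarding the second step, the preprocessed batches can be inspected directly on the host.
The following is a minimal sketch which reads one batch, assuming that every batch is stored as a single raw file of FP32 values; the file name ``batch_0.raw`` is a placeholder for the actual files in the directory.

.. code-block:: python

   import numpy as np

   # Placeholder file name; adjust it to the actual batch files in
   # /opt/data/imagenet/raw_test/batch_size_32.
   batch = np.fromfile(
       "/opt/data/imagenet/raw_test/batch_size_32/batch_0.raw",
       dtype=np.float32,
   )

   # Reinterpret the flat buffer as 32 images of 224x224 pixels with
   # three color channels each.
   batch = batch.reshape(32, 224, 224, 3)
   print(batch.shape, batch.dtype)
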
Of course, in more sophisticated settings one could directly interface with the mobile device's sensors for data inputs, or process the obtained results on the device, e.g., in a user-facing app, without relying on a possibly nonexistent host.

.. literalinclude:: data_mobile_inference/env_setup.sh
   :linenos:
   :language: bash
   :caption: Environment setup for the AI Engine Direct SDK on the SoC server ``soc.inf-ra.uni-jena.de``.
   :name: lst:mobile_inference_env_setup

To get started, set up your environment using the steps in :numref:`lst:mobile_inference_env_setup`.
The conda environment ``ai_direct`` is installed system-wide on the server.
If conda is not available in your terminal, run ``/opt/anaconda3/bin/conda init bash`` first.

Host CPU
^^^^^^^^

.. literalinclude:: data_mobile_inference/host_cpu.sh
   :linenos:
   :language: bash
   :caption: Model preparation and execution on the host CPU.
   :name: lst:mobile_inference_host_cpu

As a sanity check, we first run our model on the host CPU through the Qualcomm AI Engine Direct SDK.
This means that we simply use the CPU of the SoC server for the time being.
The required steps are outlined in :numref:`lst:mobile_inference_host_cpu`.
Given the exported ONNX version of our model, lines 2-7 produce the C++ version ``resnet18_fp32.cpp`` of the model together with the model's weights ``resnet18_fp32.bin``.
The two files are then used to generate the library ``libresnet18_fp32.so`` in lines 10-13.
This shared library is used in lines 20-25 to run the model.
The data which is passed to our model is specified in the file ``target_raw_list_host.txt``.
In the given example, we simply instruct ``qnn-net-run`` to use the first 10 batches of the preprocessed face-blurred ImageNet test dataset by specifying their locations in line 17.

.. admonition:: Task

   Run the resnet18 model on the host CPU of the SoC server.
   Write a Python script which interprets the generated results in ``output/host_fp32``.
   Compare your results to the true labels.
   Be aware that the provided resnet18 has an accuracy of about 68.4% w.r.t. the face-blurred ImageNet dataset.

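A minimal sketch of such an interpretation script is given below.
It assumes that ``qnn-net-run`` wrote one raw FP32 logits file per batch into per-batch subdirectories of ``output/host_fp32``, and that the true labels are available as a plain text file with one integer label per sample in the order of the preprocessed batches; ``labels.txt`` and the glob pattern are placeholders which you have to adapt to the actual directory contents.

.. code-block:: python

   import glob
   import numpy as np

   # Assumption: one subdirectory per batch, each containing a single raw
   # FP32 file with the logits; adapt the pattern to the actual layout.
   result_files = sorted(glob.glob("output/host_fp32/Result_*/*.raw"))

   predictions = []
   for path in result_files:
       # 32 samples per batch, 1000 ImageNet classes per sample.
       logits = np.fromfile(path, dtype=np.float32).reshape(32, 1000)
       predictions.append(logits.argmax(axis=1))
   predictions = np.concatenate(predictions)

   # Placeholder label file: one integer label per (shuffled) test sample,
   # in the same order as the preprocessed batches.
   labels = np.loadtxt("labels.txt", dtype=np.int64)[: predictions.size]

   accuracy = (predictions == labels).mean()
   print(f"top-1 accuracy: {accuracy:.4f}")
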
Kryo CPU
^^^^^^^^

Now, let's move ahead and run our inference workload for the first time on a mobile device.
A set of HDK8550 development kits is connected to the SoC server.
The development kits run the Android operating system in version 13.
Thus, we have to use the Android Debug Bridge (adb) to transfer data to and from an HDK8550 kit, and to run our workload on a development kit.
In comparison to previous labs, we'll use the same user to log into the kits.
This means that you share the kits and the user accounts with your fellow students.

.. important::
   * Create your own directory in ``/data/local/tmp/``.
     Use your first name for the name of that directory.
     Do not touch files anywhere else on the file system!
   * The development kits have root access enabled.
     Run root commands only if explicitly instructed to do so!
   * Coordinate with your peers when using the kits.
     Use the class's matrix channel for this when not in the lab room.

The call of ``qnn-model-lib-generator`` in lines 10-13 of :numref:`lst:mobile_inference_host_cpu` already converted our model into a version which can run on the SM8550P's Kryo CPU.
Note that the Kryo CPU has Armv9 cores whereas our SoC server has x86 cores, which requires different libraries and executables for the two architectures.

.. literalinclude:: data_mobile_inference/kryo_cpu.sh
   :linenos:
   :language: bash
   :caption: Model preparation and execution on the SM8550P SoC's Kryo CPU.
   :name: lst:mobile_inference_kryo_cpu

:numref:`lst:mobile_inference_kryo_cpu` outlines the steps for the execution on the CPU of the mobile platform.
Line 2 sets the id of the development kit on which we run our inference workload.
To list the ids of all available kits, use the command ``adb devices -l``.
Next, line 4 sets the working directory on the mobile device which is used in later commands.
Lines 10-20 copy all required files from the host to the development kit.
Further, in lines 16-20, a ``target_raw_list.txt`` is assembled which contains the on-device locations of our input data.
Line 23 runs the inference workload on the mobile platform.
The computed results are copied back from the device to the host in line 26.

.. admonition:: Task

   Run the resnet18 model on the Kryo CPU of the SM8550P SoC.
   Verify that your results match those obtained when running on the host CPU.

Adreno GPU
^^^^^^^^^^

We successfully ran our first inference workload on a mobile device 🥳.
Now, let's tap into the performance of the Adreno 740 GPU to accelerate our workload.
As before, we are in the lucky position that the ``qnn-model-lib-generator`` call in :numref:`lst:mobile_inference_host_cpu` already performed the required conversions for us.

.. literalinclude:: data_mobile_inference/adreno_gpu.sh
   :linenos:
   :language: bash
   :caption: Model preparation and execution on the SM8550P SoC's Adreno 740 GPU.
   :name: lst:mobile_inference_adreno_gpu

As shown in :numref:`lst:mobile_inference_adreno_gpu`, running the GPU version of our model is very similar to what we have done for the SoC's Kryo CPU.
Keep in mind, however, that the CPU and GPU are vastly different computer architectures; this complexity is hidden from us by the AI Engine Direct SDK.

.. admonition:: Task

   Run the resnet18 model on the Adreno 740 GPU of the SM8550P SoC.
   Verify that your results match those obtained when running on the host CPU.

HTP
^^^

The `Hexagon Tensor Processor `__ (HTP) is the last accelerator which we will use in this lab.
The HTP has a matrix unit which targets high-performance on-device machine learning inference.
In particular, we will leave the FP32 world at this point, and derive a quantized int8 version of the resnet18 model which may run on the HTP.
Note that we derived and exported quantization parameters, e.g., scaling factors, when using AIMET in :numref:`ch:quantization`.
However, the framework still exported the weights as FP32 data, which means that the actual conversion to integer data for on-device deployment is still outstanding.

.. literalinclude:: data_mobile_inference/htp.sh
   :linenos:
   :language: bash
   :caption: Model preparation (incl. quantization) and execution on the SM8550P SoC's Hexagon Tensor Processor.
   :name: lst:mobile_inference_htp

:numref:`lst:mobile_inference_htp` provides the required workflow for deploying a quantized resnet18 model on the HTP using the AI Engine Direct SDK.
The overall procedure is similar to what we have done for the FP32 model when targeting the SoC's CPU and GPU.
However, two key differences set the HTP procedure apart.

First, the call of ``qnn-onnx-converter`` in lines 6-15 generates a quantized model.
We see that the AIMET-derived quantization parameters are used (line 11) and that int8 is used for the weights and activations (lines 12-13).
The ``--input_list`` argument in line 8 is mandatory.
Here, one could derive the quantization parameters directly in the SDK without going through AIMET.
The locations of the required example data would then be given in the file ``target_raw_list_host.txt``.
This step is unnecessary in our case, since we overwrite all parameters with those derived in AIMET.

Second, in lines 24-29 a serialized version of the quantized model is derived for HTP deployment.
This model is then deployed in lines 35-48.
If we would like to run the quantized model on the CPU or GPU, we could, as before, simply use the dynamic library produced by ``qnn-model-lib-generator`` in lines 18-21.

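To make the outstanding conversion to integer data concrete, the following is a minimal sketch of an affine int8 quantization, i.e., the kind of mapping which turns FP32 values into integers using a scale and an offset.
The exact convention (sign of the offset, rounding mode) differs between frameworks, and all values below are purely illustrative; the actual parameters come from the AIMET export consumed by ``qnn-onnx-converter``.

.. code-block:: python

   import numpy as np

   def quantize_int8(x, scale, offset):
       # One common affine convention: q = round(x / scale) + offset,
       # clamped to the int8 range.
       q = np.round(x / scale) + offset
       return np.clip(q, -128, 127).astype(np.int8)

   def dequantize_int8(q, scale, offset):
       # Inverse mapping back to (approximate) FP32 values.
       return (q.astype(np.float32) - offset) * scale

   # Illustrative stand-in for an FP32 weight tensor and its parameters.
   weights = 0.1 * np.random.randn(64, 3, 7, 7).astype(np.float32)
   scale, offset = 0.003, 0

   q = quantize_int8(weights, scale, offset)
   error = np.abs(dequantize_int8(q, scale, offset) - weights).max()
   print(f"max absolute quantization error: {error:.6f}")
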
.. admonition:: Task

   Run the quantized resnet18 model on the CPU, GPU and HTP of the SM8550P SoC.
   Verify that the three accelerators produce the same results.
   Do you observe a loss in accuracy?

.. _ch:mobile_inference_benchmark:

Benchmarking the Accelerators
-----------------------------

The AI Engine Direct SDK provides the tool ``qnn_bench.py`` which benchmarks the performance of one or more accelerators.

.. literalinclude:: data_mobile_inference/resnet18_fp32.json
   :linenos:
   :language: json
   :caption: ``qnn_bench.py`` configuration for the FP32 version of the resnet18 model.
   :name: lst:mobile_inference_qnn_bench

As users, we have to provide a JSON configuration which describes the parameters we are interested in.
An example for our FP32 resnet18 model is given in :numref:`lst:mobile_inference_qnn_bench`.
The benchmarked model is specified in line 11 and the benchmarking input in line 12.
We instruct the tool to benchmark the CPU and GPU in line 18.
The remaining parameters are described in the "Benchmarking" section of the AI Engine Direct SDK reference guide.

.. admonition:: Tasks

   #. Benchmark the performance of the FP32 resnet18 model on the CPU and GPU of the SM8550P SoC.
      Report the inference performance as "required time in seconds per sample".

   #. Benchmark the performance of the int8 resnet18 model on the CPU, GPU and HTP of the SM8550P SoC.
      Report the inference performance as "required time in seconds per sample".

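Since each inference processes a batch of 32 samples, the measured timings have to be normalized before reporting them as seconds per sample.
The following is a minimal sketch of this conversion, assuming you have already extracted an average per-batch inference time in microseconds from the benchmark's result files; the value is a placeholder.

.. code-block:: python

   # Placeholder: average time for one inference, i.e., one batch of
   # 32 samples, in microseconds, taken from the qnn_bench.py results.
   time_per_batch_us = 12345.0
   batch_size = 32

   # Convert microseconds per batch to seconds per sample.
   time_per_sample_s = time_per_batch_us / batch_size * 1.0e-6
   print(f"{time_per_sample_s:.6e} seconds per sample")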