11. Mobile Inference


Fig. 11.1 Photo of the HDK8550 Development Kit used in this lab. The main board hosts an SM8550P system on chip.

This lab deploys a trained deep learning model on mobile devices. For this we will use the SM8550P System on Chip (SoC) of the HDK8550 Development Kit. The SoC features different accelerators which we may harness for machine learning inference: a Central Processing Unit (CPU), a Graphics Processing Unit (GPU) and a Hexagon Tensor Processor (HTP). As outlined in the Qualcomm developer network, the accelerators have different characteristics. In particular, low-precision floating-point computations and quantization play a crucial role in fully exploiting the SoC at inference time.

11.1. Model and Data Conversions

This section discusses the on-device deployment of a trained machine learning model. While the approach is transferable, we limit our discussion to the resnet18 model for brevity. The AIMET export of the resnet18 model is shared in the file aimet_export.tar.xz. In addition to the graph and the FP32 weights, the respective quantization parameters generated by AIMET are provided.

Our goal is to run the resnet18 inference workload on the available machine learning accelerators, i.e., the CPU, the GPU and the HTP of the SM8550P SoC. For now, we will limit our efforts to the required preprocessing, execution and postprocessing steps of the inference pipeline. Once accomplished, we’ll benchmark the performance of the accelerators in Section 11.2.

Note

You may deploy other models in a similar fashion. For example, detailed documentation for SESR or EnhancedGAN networks is provided as part of the Qualcomm developer network. Further, the documentation of the Qualcomm AI Engine Direct SDK provides a tutorial for the Inception v3 model. The SDK’s documentation is available on the server soc.inf-ra.uni-jena.de from the directory /opt/HexagonSDK/5.3.0.0/docs.

All tools required for deploying the exported model on the mobile devices are preinstalled on the server soc.inf-ra.uni-jena.de. The server is also connected to the development kits such that you may issue file transfers and executions without additional hops. We will use the Qualcomm AI Engine Direct SDK to deploy our models. In general, we have to perform the following steps for on-device inference:

  1. Translate the PyTorch-exported model to a format which can run on the targeted accelerator. Possibly, enable the model for quantized inference in this step.

  2. Prepare the input data such that it can be used on-device. We will use raw FP32 values of the face-blurred ImageNet test dataset. The data preparation step is already done for you and the (preprocessed and shuffled) data are available from the directory /opt/data/imagenet/raw_test/batch_size_32 on the server. Each batch has the size [32, 224, 224, 3], i.e., we have 32 samples per batch, 224x224 pixels per image and three color channels.

  3. Transfer the prepared model and data from the host to a mobile device.

  4. Run the inference workload on the device.

  5. Transfer the results from the mobile device back to the host, and analyze the results.

Of course, in more sophisticated settings one could directly interface with the mobile device’s sensors for data inputs or process the obtained results on the device, e.g., in a user-facing app, without relying on a possibly nonexistent host.

Listing 11.1.1 Environment setup for the AI Engine Direct SDK on the SoC server soc.inf-ra.uni-jena.de.
1  conda activate ai_direct
2  export QNN_SDK_ROOT=/opt/qcom/aistack/qnn/2.11.0.230603
3  source ${QNN_SDK_ROOT}/bin/envsetup.sh
4  export PATH=/opt/software/sdk/ndk/25.2.9519653/:$PATH
5  export ANDROID_NDK_ROOT=/opt/software/sdk/ndk/25.2.9519653/

To get started, set up your environment using the steps in Listing 11.1.1. The conda environment ai_direct is installed system-wide on the server. If conda is not available in your terminal, run /opt/anaconda3/bin/conda init bash first.
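Before converting the model, it is worth verifying on the host that you can read the preprocessed input data described in step 2. The following minimal Python sketch loads the first batch with NumPy; the file name inputs_0.raw and the [32, 224, 224, 3] layout are taken from the data description above, and the printed statistics are just a quick plausibility check.

import numpy as np

# first preprocessed batch of the face-blurred ImageNet test data (see step 2)
batch_file = "/opt/data/imagenet/raw_test/batch_size_32/inputs_0.raw"

# the .raw files contain plain FP32 values without a header, so we reinterpret
# the bytes and restore the [32, 224, 224, 3] batch layout
batch = np.fromfile(batch_file, dtype=np.float32).reshape(32, 224, 224, 3)

print("shape:       ", batch.shape)
print("min/max/mean:", batch.min(), batch.max(), batch.mean())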

Host CPU

Listing 11.1.2 Model preparation and execution on the host CPU.
 1  # produces a network which expects 32,224,224,3 input data (see model/resnet18_fp32.cpp)
 2  ${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-onnx-converter \
 3    --input_network aimet_export/resnet18/resnet18.onnx \
 4    --input_encoding 'input' other \
 5    --batch 32 \
 6    --debug \
 7    --output model/resnet18_fp32.cpp
 8
 9  # compile a dynamic library which represents the model
10  ${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-model-lib-generator \
11    -c model/resnet18_fp32.cpp \
12    -b model/resnet18_fp32.bin \
13    -o model_libs
14
15  # generate list of inputs
16  touch target_raw_list_host.txt
17  for i in $(seq 0 9); do echo /opt/data/imagenet/raw_test/batch_size_32/inputs_$i.raw >> target_raw_list_host.txt; done
18
19  # run the model
20  ${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-net-run \
21    --log_level=info \
22    --backend ${QNN_SDK_ROOT}/lib/x86_64-linux-clang/libQnnCpu.so \
23    --model model_libs/x86_64-linux-clang/libresnet18_fp32.so \
24    --input_list target_raw_list_host.txt \
25    --output_dir=output/host_fp32

As a sanity check, we first run our model on the host CPU through the Qualcomm AI Engine Direct SDK. This means that we simply use the CPU of the SoC server for the time being. The required steps are outlined in Listing 11.1.2. Given the exported ONNX version of our model, lines 2-7 produce the C++ version resnet18_fp32.cpp of the model together with the model’s weights resnet18_fp32.bin. The two files are then used to generate the library libresnet18_fp32.so in lines 10-13. This shared library is used in lines 20-25 to run the model. The data which is passed to our model is specified in the file target_raw_list_host.txt. In the given example, we simply instruct qnn-net-run to use the first 10 batches of the preprocessed face-blurred ImageNet test dataset by specifying their locations in line 17.

Task

Run the resnet18 model on the host CPU of the SoC server. Write a Python script which interprets the generated results in output/host_fp32. Compare your results to the true labels. Be aware that the provided resnet18 has an accuracy of about 68.4% w.r.t. the face-blurred ImageNet dataset.
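A possible starting point for such a script is sketched below. It assumes that qnn-net-run wrote one Result_<i> subdirectory per input batch into output/host_fp32, each containing the FP32 logits of the batch as a single .raw file, and that the ground-truth labels are available per batch in files named labels_<i>.raw; the label location in particular is hypothetical, so adapt it to wherever the true labels are provided in your setup.

import glob
import os
import numpy as np

output_dir = "output/host_fp32"  # --output_dir used in Listing 11.1.2
batch_size = 32
num_classes = 1000

correct = 0
total = 0

# one Result_<i> directory per input batch (assumed qnn-net-run output layout)
for i, result_dir in enumerate(sorted(glob.glob(os.path.join(output_dir, "Result_*")))):
    raw_file = glob.glob(os.path.join(result_dir, "*.raw"))[0]
    logits = np.fromfile(raw_file, dtype=np.float32).reshape(batch_size, num_classes)
    predictions = logits.argmax(axis=1)

    # hypothetical label files: one int32 label per sample, one file per batch
    labels = np.fromfile(f"labels_{i}.raw", dtype=np.int32)

    correct += int((predictions == labels).sum())
    total += batch_size

print(f"top-1 accuracy: {correct / total:.4f}")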

Kryo CPU

Now, let’s move ahead and run our inference workload for the first time on a mobile device. A set of HDK8550 development kits is connected to the SoC server. The development kits run version 13 of the Android operating system. Thus, we have to use the Android Debug Bridge (adb) to transfer data to and from an HDK8550 kit, and to run our workload on a development kit. In contrast to previous labs, we’ll use the same user to log into the kits. This means that you share the kits and the user accounts with your fellow students.

Important

  • Create your own directory in /data/local/tmp/. Use your first name for the name of that directory. Do not touch files anywhere else on the file system!

  • Root access is enabled on the development kits. Run root commands only if explicitly instructed to do so!

  • Coordinate with your peers when using the kits. Use the class’s matrix channel for this when not in the lab room.

The call of qnn-model-lib-generator in lines 10-13 of Listing 11.1.2 already converted our model into a version which can run on the SM8550P’s Kryo CPU. Note that the Kryo CPU has Armv9 cores whereas our SoC server has x86 cores, which requires different libraries and executables for the two architectures.

Listing 11.1.3 Model preparation and execution on the SM8550P SoC’s Kryo CPU.
 1  # set serial of android device
 2  export ANDROID_SERIAL=3000a4df
 3  # set user directory on the device
 4  export DEVICE_USER_DIR=/data/local/tmp/alex
 5
 6  # create directory to host the model and data on device
 7  adb shell "mkdir -p ${DEVICE_USER_DIR}/resnet18_cpu_fp32"
 8
 9  # copy the runner and compiled model to the device
10  adb push ${QNN_SDK_ROOT}/bin/aarch64-android/qnn-net-run ${DEVICE_USER_DIR}/resnet18_cpu_fp32
11  adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnCpu.so ${DEVICE_USER_DIR}/resnet18_cpu_fp32
12  adb push model_libs/aarch64-android/libresnet18_fp32.so ${DEVICE_USER_DIR}/resnet18_cpu_fp32
13
14  # copy data from host to device and set up target list on device
15  adb shell "mkdir -p ${DEVICE_USER_DIR}/data/imagenet/raw_test/batch_size_32"
16  adb shell "touch ${DEVICE_USER_DIR}/resnet18_cpu_fp32/target_raw_list.txt"
17  for batch in $(seq 0 9); do \
18    adb shell "echo ${DEVICE_USER_DIR}/data/imagenet/raw_test/batch_size_32/inputs_${batch}.raw >> ${DEVICE_USER_DIR}/resnet18_cpu_fp32/target_raw_list.txt"
19    adb push /opt/data/imagenet/raw_test/batch_size_32/inputs_${batch}.raw ${DEVICE_USER_DIR}/data/imagenet/raw_test/batch_size_32/inputs_${batch}.raw
20  done
21
22  # execute the model on the device CPU
23  adb shell "cd ${DEVICE_USER_DIR}/resnet18_cpu_fp32; LD_LIBRARY_PATH=. ./qnn-net-run --backend libQnnCpu.so --model libresnet18_fp32.so --input_list target_raw_list.txt"
24
25  # copy results from device to host
26  adb pull ${DEVICE_USER_DIR}/resnet18_cpu_fp32/output output/cpu_fp32

Listing 11.1.3 outlines the steps for the execution on the CPU of the mobile platform. Line 2 sets the ID of the development kit on which we run our inference workload. To list the IDs of all available kits, use the command adb devices -l. Next, line 4 sets the working directory on the mobile device which is used in later commands. Lines 10-20 copy all required files from the host to the development kit. Further, in lines 16-20, a target_raw_list.txt is assembled which contains the on-device locations of our input data. Line 23 runs the inference workload on the mobile platform. The computed results are copied back from the device to the host in line 26.

Task

Run the resnet18 model on the Kryo CPU of the SM8550P SoC. Verify that your results match those obtained when running on the host CPU.
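One way to verify this is to compare the pulled results in output/cpu_fp32 element-wise with the host results, for example with the sketch below; it reuses the Result_* output layout assumed above.

import glob
import os
import numpy as np

host_files = sorted(glob.glob(os.path.join("output/host_fp32", "Result_*", "*.raw")))
device_files = sorted(glob.glob(os.path.join("output/cpu_fp32", "Result_*", "*.raw")))
assert len(host_files) == len(device_files) > 0

for host_file, device_file in zip(host_files, device_files):
    host = np.fromfile(host_file, dtype=np.float32)
    device = np.fromfile(device_file, dtype=np.float32)
    # both runs compute in FP32, so the logits should agree up to small rounding differences
    print(os.path.dirname(host_file), "max abs diff:", np.abs(host - device).max())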

Adreno GPU

We successfully ran our first inference workload on a mobile device 🥳. Now, let’s tap into the performance of the Adreno 740 GPU to accelerate our workload. As before, we are in the lucky position that the qnn-model-lib-generator call in Listing 11.1.2 already performed the required conversions for us.

Listing 11.1.4 Model preparation and execution on the SM8550P SoC’s Adreno 740 GPU.
 1  # assuming ANDROID_SERIAL and DEVICE_USER_DIR to be set
 2  # assuming that the imagenet data is already present
 3
 4  # create directory to host the model and data on device
 5  adb shell "mkdir -p ${DEVICE_USER_DIR}/resnet18_gpu_fp32"
 6
 7  # copy the runner and compiled model to the device
 8  adb push ${QNN_SDK_ROOT}/bin/aarch64-android/qnn-net-run ${DEVICE_USER_DIR}/resnet18_gpu_fp32/
 9  adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnGpu.so ${DEVICE_USER_DIR}/resnet18_gpu_fp32/
10  adb push model_libs/aarch64-android/libresnet18_fp32.so ${DEVICE_USER_DIR}/resnet18_gpu_fp32/
11
12  # set up target list on device
13  adb shell "touch ${DEVICE_USER_DIR}/resnet18_gpu_fp32/target_raw_list.txt"
14  for batch in $(seq 0 9); do \
15    adb shell "echo ${DEVICE_USER_DIR}/data/imagenet/raw_test/batch_size_32/inputs_${batch}.raw >> ${DEVICE_USER_DIR}/resnet18_gpu_fp32/target_raw_list.txt"
16  done
17
18  # execute the model on the device GPU
19  adb shell "cd ${DEVICE_USER_DIR}/resnet18_gpu_fp32; LD_LIBRARY_PATH=. ./qnn-net-run --backend libQnnGpu.so --model libresnet18_fp32.so --input_list target_raw_list.txt"
20
21  # copy results from device to host
22  adb pull ${DEVICE_USER_DIR}/resnet18_gpu_fp32/output output/gpu_fp32

As shown in Listing 11.1.4, running the GPU version of our model is very similar to what we have done for the SoC’s Kryo CPU. Keep in mind, however, that the CPU and GPU are vastly different computer architectures; this complexity is hidden from us by the AI Engine Direct SDK.

Task

Run the resnet18 model on the Adreno 740 GPU of the SM8550P SoC. Verify that your results match those obtained when running on the host CPU.

HTP

The Hexagon Tensor Processor (HTP) is the last accelerator which we will use in this lab. The HTP has a matrix unit which targets high-performance on-device machine learning inference. In particular, we will leave the FP32 world at this point and derive a quantized int8 version of the resnet18 model which may run on the HTP. Note that we derived and exported quantization parameters, e.g., scaling factors, when using AIMET in Section 10. However, the framework still exported the weights as FP32 data, which means that the actual conversion to integer data for on-device deployment is still outstanding.
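To build some intuition for what this conversion does, the sketch below illustrates generic affine int8 quantization of a small weight tensor. The exact integer range, sign conventions and parameter names differ between tools, so treat this as an illustration of the idea rather than the SDK’s or AIMET’s exact scheme.

import numpy as np

def quantize(x, scale, zero_point):
    # map FP32 values to int8: q = round(x / scale) + zero_point
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    # approximate reconstruction of the FP32 values: x ≈ scale * (q - zero_point)
    return scale * (q.astype(np.float32) - zero_point)

# toy example: weights in [-1, 1] with a symmetric mapping (zero_point = 0)
rng = np.random.default_rng(0)
weights = rng.uniform(-1.0, 1.0, size=8).astype(np.float32)
scale = 1.0 / 127.0

q_weights = quantize(weights, scale, zero_point=0)
error = np.abs(weights - dequantize(q_weights, scale, zero_point=0)).max()

print("int8 weights:            ", q_weights)
print("max reconstruction error:", error)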

Listing 11.1.5 Model preparation (incl. quantization) and execution on the SM8550P SoC’s Hexagon Tensor Processor.
 1  # assuming ANDROID_SERIAL and DEVICE_USER_DIR to be set
 2  # assuming that the imagenet data is already present
 3  # assuming a file target_raw_list_host.txt is present on the host
 4
 5  # produces a quantized network which expects 32,224,224,3 input data (see model/resnet18_int8.cpp)
 6  ${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-onnx-converter \
 7    --input_network aimet_export/resnet18/resnet18.onnx \
 8    --input_list target_raw_list_host.txt \
 9    --input_encoding 'input_data' other \
10    --batch 32 \
11    --quantization_overrides aimet_export/resnet18/resnet18.encodings \
12    --act_bw=8 \
13    --weight_bw=8 \
14    --debug \
15    --output model/resnet18_int8.cpp
16
17  # compile a dynamic library which represents the model
18  ${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-model-lib-generator \
19    -c model/resnet18_int8.cpp \
20    -b model/resnet18_int8.bin \
21    -o model_libs
22
23  # generate a serialized context for HTP execution
24  ${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-context-binary-generator \
25    --backend ${QNN_SDK_ROOT}/lib/x86_64-linux-clang/libQnnHtp.so \
26    --model $(pwd)/model_libs/x86_64-linux-clang/libresnet18_int8.so \
27    --output_dir model \
28    --binary_file resnet18_int8.serialized \
29    --log_level info
30
31  # create directory to host the model and data on device
32  adb shell "mkdir -p ${DEVICE_USER_DIR}/resnet18_htp_int8"
33
34  # copy the runner and compiled model to the device
35  adb push ${QNN_SDK_ROOT}/bin/aarch64-android/qnn-net-run ${DEVICE_USER_DIR}/resnet18_htp_int8/
36  adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtp.so ${DEVICE_USER_DIR}/resnet18_htp_int8/
37  adb push ${QNN_SDK_ROOT}/lib/hexagon-v73/unsigned/libQnnHtpV73Skel.so ${DEVICE_USER_DIR}/resnet18_htp_int8/
38  adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV73Stub.so ${DEVICE_USER_DIR}/resnet18_htp_int8/
39  adb push model/resnet18_int8.serialized.bin ${DEVICE_USER_DIR}/resnet18_htp_int8/
40
41  # set up target list on device
42  adb shell "touch ${DEVICE_USER_DIR}/resnet18_htp_int8/target_raw_list.txt"
43  for batch in $(seq 0 9); do \
44    adb shell "echo ${DEVICE_USER_DIR}/data/imagenet/raw_test/batch_size_32/inputs_${batch}.raw >> ${DEVICE_USER_DIR}/resnet18_htp_int8/target_raw_list.txt"
45  done
46
47  # execute the model on the device HTP
48  adb shell "cd ${DEVICE_USER_DIR}/resnet18_htp_int8; LD_LIBRARY_PATH=. ./qnn-net-run --backend libQnnHtp.so --retrieve_context resnet18_int8.serialized.bin --input_list target_raw_list.txt"
49
50  # copy results from device to host
51  adb pull ${DEVICE_USER_DIR}/resnet18_htp_int8/output output/htp_int8

Listing 11.1.5 provides the required workflow for deploying a quantized resnet18 model on the HTP using the AI Engine Direct SDK. The overall procedure is similar to what we have done for the FP32 model when targeting the SoC’s CPU and GPU. However, two key differences set the HTP procedure apart.

First, the call of qnn-onnx-converter in lines 6-15 generates a quantized model. We see that the AIMET-derived quantization parameters (line 11) are used and that int8 is used for the weights and activations (lines 12-13). The --input_list argument in line 8 is mandatory: one could derive the quantization parameters directly in the SDK without going through AIMET, in which case the locations of the required example data would be given in the file target_raw_list_host.txt. This step is unnecessary in our case, since we override all quantization parameters with those derived in AIMET.
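If you want to inspect the quantization parameters that are passed to the converter, you can peek into the AIMET-exported encodings file. The sketch below assumes the file is JSON with activation_encodings and param_encodings sections, which matches recent AIMET exports; adapt the keys if your file is structured differently.

import json

# encodings file passed to qnn-onnx-converter via --quantization_overrides
with open("aimet_export/resnet18/resnet18.encodings") as f:
    encodings = json.load(f)

for group in ("activation_encodings", "param_encodings"):
    print(group, "->", len(encodings.get(group, {})), "tensors")

# show the quantization parameters (scale, offset, bitwidth, ...) of one weight tensor
name, enc = next(iter(encodings.get("param_encodings", {}).items()))
print(name, enc)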

Second, in lines 24-29 a serialized version of the quantized model is derived for HTP deployment. This model is then deployed in lines 35-48. If we wanted to run the quantized model on the CPU or GPU, we could, as before, simply use the dynamic library produced by qnn-model-lib-generator in lines 18-21.

Task

Run the quantized resnet18 model on the CPU, GPU and HTP of the SM8550P SoC. Verify that the three accelerators produce the same results. Do you observe a loss in accuracy?

11.2. Benchmarking the Accelerators

The AI Engine Direct SDK provides the tool qnn_bench.py which benchmarks the performance of one or more accelerators.

Listing 11.2.1 qnn_bench.py configuration for the FP32 version of the resnet18 model.
 1  {
 2    "Name":"resnet18_fp32",
 3    "HostRootPath": "resnet18_fp32.repo",
 4    "HostResultsDir":"resnet18_fp32.repo/results",
 5    "DevicePath":"/data/local/tmp/alex/qnnbm.repo",
 6    "Devices":["3000a4df"],
 7    "Runs":10,
 8
 9    "Model": {
10      "Name": "resnet18_fp32",
11      "qnn_model": "model_libs/aarch64-android/libresnet18_fp32.so",
12      "InputList": "./target_raw_list.txt",
13      "Data": [
14        "data"
15      ]
16    },
17
18    "Backends":["CPU", "GPU"],
19    "Measurements": ["timing"]
20  }

As users, we have to provide a JSON configuration which describes the parameters we are interested in. An example for our FP32 resnet18 model is given in Listing 11.2.1. The benchmarked model is specified in line 11 and the benchmarking inputs in line 12. We instruct the tool to benchmark the CPU and GPU in line 18. The remaining parameters are described in the “Benchmarking” section of the AI Engine Direct SDK reference guide.

Tasks

  1. Benchmark the performance of the FP32 resnet18 model on the CPU and GPU of the SM8550P SoC. Report the inference performance as “required time in seconds per sample”.

  2. Benchmark the performance of the int8 resnet18 model on the CPU, GPU and HTP of the SM8550P SoC. Report the inference performance as “required time in seconds per sample”.
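Independent of the accelerator, converting a measured timing into the requested metric is simple arithmetic. The helper below assumes that you extract an average per-batch inference time in microseconds from the qnn_bench.py results; the 50,000 µs used in the example is a made-up value for illustration only.

batch_size = 32  # samples per batch, see Section 11.1

def seconds_per_sample(avg_batch_time_us: float, samples_per_batch: int = batch_size) -> float:
    # microseconds per batch -> seconds per sample
    return (avg_batch_time_us * 1.0e-6) / samples_per_batch

# made-up example: 50,000 us per batch of 32 samples corresponds to about 1.56 ms per sample
print(seconds_per_sample(50000.0))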