9. Single Instruction Multiple Data

This lab covers the floating-point support of the Advanced SIMD (ASIMD) architecture, also called Neon. We begin by examining the behavior of a SIMD operation in Section 9.1. Once this behavior is understood, we can implement a small and efficient FP32 matrix-matrix multiplication kernel in Section 9.2. As discussed in Section 9.3, this approach extends to matrices of arbitrary size. Lastly, Section 9.4 allows us to run the class’s workloads on recent mobile processors.

9.1. Neon

../_images/registers_neon.svg

Fig. 9.1.1 Illustration of the thirty-two 128-bit Neon registers V0-V31 visible to the A64 instruction set, along with the floating-point control register (FPCR) and the floating-point status register (FPSR).

As discussed in the lectures and shown in Fig. 9.1.1, Neon provides 32 128-bit registers. These registers may be used independently of the general-purpose registers which we used in Section 8. In this lab, we limit our considerations to single-precision floating-point (FP32) numbers. Since every FP32 number has 32 bits, we may store up to four FP32 numbers in a single Neon vector register.

Listing 9.1.1 Neon fused multiply-add (by element) instruction operating on vector registers v4 and v8 and element 2 of v10.
fmla v4.4s, v8.4s, v10.s[2]

This lab relies on vector instructions to perform the heavy lifting. Specifically, we will use the vector, single-precision FMLA (by element) instruction when writing a matrix kernel in Section 9.2.

Tasks

  1. Look up the vector, single-precision FMLA (by element) instruction. Briefly describe what the example instruction in Listing 9.1.1 does.

  2. Implement a small function showcase_fmla_element in assembly language which showcases fmla v4.4s, v8.4s, v10.s[2]. Embed your function in a driver to showcase the instruction’s behavior.

9.2. Matrix Multiplication Kernel

Efficient implementations of matrix-matrix multiplication are at the core of many applications. In this part of the lab we’ll write a kernel for the high-level operation \(C\mathrel{+}=AB^T\) with matrices \(A \in \mathbb{R}^{8 \times 4}\), \(B \in \mathbb{R}^{4 \times 4}\), and \(C \in \mathbb{R}^{8 \times 4}\). Specifically, we implement an FP32 kernel with the following signature:

void gemm_simd_8_4_4( float const * a,
                      float const * b,
                      float       * c );

The kernel gets the three pointers a, b, and c to matrices \(A\), \(B\) and \(C\) as parameters. As shown in Fig. 9.2.1, we assume that \(A\) and \(C\) are stored in column-major format, while \(B\) is stored in row-major format (equivalent to a column-major \(B^T\)).

../_images/gemm_memory.svg

Fig. 9.2.1 Illustration of the operation \(C\mathrel{+}=AB^T\). The numbers inside the matrices show the IDs of the matrices’ elements w.r.t. their 1D arrays in linear memory.

We now implement a high-performance version of this kernel. To ease your development efforts, a code skeleton is provided. The code skeleton contains all boilerplate code to verify and benchmark your matrix kernel. Only the loads of \(A\) and \(B\), as well as appropriate fmla instructions, are omitted from kernels/gemm_simd_8_4_4.s and must be implemented before tuning the kernel for more performance.

Hint

Neon has multiple variants of FMLA instructions. It is sufficient to exclusively rely on the variant discussed in Section 9.1 to finish the kernel. If you are interested in the theoretical performance of a Cortex-A76 core, you may have a look at the Cortex®-A76 Software Optimization Guide.

Tasks

  1. Finish the kernel gemm_simd_8_4_4 in the file kernels/gemm_simd_8_4_4.s. Make sure that the computed results are correct. Report the obtained floating-point performance on one of the provided Raspberry Pis. Include the ID, e.g., pirate03, and the output of lscpu of the machine you used in your report.

  2. Optimize your kernel! This means maximizing its performance. Eliminating unnecessary stack transfers yields immediate performance gains. Document your optimizations and report the performance of your optimized kernel. Note: Your kernel must produce correct results and adhere to the procedure call standard. A minimum performance of 15 FP32-GFLOPS is required to pass this task. Provide a creative team name together with your obtained performance. We will honour the three best teams.

9.3. Loops

We decompose the implementation of kernels for larger matrix operations into two parts. First, we write suitable nanokernels which are completely unrolled, i.e., they do not contain any loops. One goal for the nanokernels is an optimal utilization of the available vector registers. Typically, this means maximizing accumulator block size. Second, we repeatedly execute these nanokernels inside nested loops operating on blocks of the matrices.

After finishing Section 9.2, a natural extension is implementing a kernel gemm_simd_8_4_64 which assumes \(A \in \mathbb{R}^{8 \times 64}\), \(B \in \mathbb{R}^{64 \times 4}\), and \(C \in \mathbb{R}^{8 \times 4}\).

Tasks

Implement the kernel gemm_simd_8_4_64 by adding a loop. Verify your kernel! Document the performance of your kernel. A minimum performance of 30 FP32-GFLOPS is required to pass this task. Provide a creative team name together with your obtained performance. We will honour the three best teams.

9.4. Snapdragon

The system-on-chip of the Raspberry Pi 5 features Cortex-A76 cores implementing the ARMv8-A architecture. This provides an opportunity to study a relevant and fairly recent ISA on comparatively inexpensive hardware. However, the core was introduced in 2018 and is therefore somewhat dated.

../_images/xiaomi_12pro.jpeg

Fig. 9.4.1 Picture of the Xiaomi 12 Pro smartphone accessible in this part of the lab. The phone has a Snapdragon 8+ Gen 1 SoC, with a CPU implementing ARMv9.

This part of the lab allows us to run our assembly kernels on the Snapdragon 8+ Gen 1 (Waipio) and Snapdragon 8 Gen 2 (Kailua) platforms, each with a CPU comprising eight ARMv9-A cores. The Kryo CPU of Waipio has a single Cortex-X2 core (prime), three Cortex-A710 cores (gold), and four Cortex-A510 cores (silver). The Kryo CPU of the Kailua SoC comprises one Cortex-X3 core (prime), two Cortex-A715 and two Cortex-A710 cores (gold), and three Cortex-A510 cores (silver).

Tasks

  1. Since some extra steps are required to access the devices, contact the teaching team on how to access the hardware.

  2. Benchmark your SIMD workloads on Waipio, on Kailua, or on both platforms.