9. Single Instruction Multiple Data

This lab covers the floating-point support of the Advanced SIMD (ASIMD) architecture, also called NEON. We’ll get started by showcasing the behavior of a vector operation in Section 9.1. Once this behavior is understood, we can implement a small and efficient FP32 matrix-matrix multiplication kernel in Section 9.2. As discussed in Section 9.3, this approach can be extended to more general matrix sizes. Lastly, Section 9.4 allows us to run the class’s workloads on recent processors.

9.1. ASIMD/NEON

../_images/registers_asimd.svg

Fig. 9.1.1 Illustration of ASIMD’s 32 128-bit vector registers.

As discussed in the lectures and shown in Fig. 9.1.1, ASIMD uses 32 128-bit registers. These registers may be used independently of the general purpose registers which we harnessed in Section 8. In this lab we limit our considerations to single-precision floating-point (FP32) numbers. Since every single-precision number occupies 32 bits, we may store up to four FP32 numbers in a single ASIMD vector register.

Listing 9.1.1 ASIMD fused multiply-add instruction operating on vector registers v8, v0 and v4.
fmla v8.4s, v0.4s, v4.s[2]

This lab relies on vector instructions to do our heavy lifting. Specifically, we’ll use the vector, single-precision FMLA (by element) instruction when writing a matrix kernel in Section 9.2.
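In C-like terms, with each 128-bit register modeled as an array of four FP32 lanes, the instruction’s behavior can be sketched as follows. This is merely an illustrative model: the function name and the array-based register model are ours, and the model ignores that the hardware fuses the multiplication and addition into a single rounding step.

#include <stdio.h>

// Scalar C model of: fmla v8.4s, v0.4s, v4.s[2]
// The "by element" form broadcasts one lane of the last operand (here:
// lane 2 of v4) and multiply-accumulates it into all four lanes of v8.
void fmla_4s_by_element( float       v8[4],    // accumulator register
                         float const v0[4],    // vector operand
                         float const v4[4] ) { // element operand
  for( int l = 0; l < 4; l++ ) {
    v8[l] += v0[l] * v4[2]; // v8.4s += v0.4s * v4.s[2]
  }
}

int main() {
  float l_v8[4] = { 0.0f, 1.0f, 2.0f, 3.0f };
  float l_v0[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
  float l_v4[4] = { 5.0f, 6.0f, 7.0f, 8.0f };

  fmla_4s_by_element( l_v8, l_v0, l_v4 );

  for( int l = 0; l < 4; l++ ) {
    printf( "lane %d: %f\n", l, l_v8[l] ); // 7, 15, 23, 31
  }
  return 0;
}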

Tasks

  1. Look up the vector, single-precision FMLA (by element) instruction. Briefly describe (in your own words) what the example instruction in Listing 9.1.1 does.

  2. Implement a small function showcase_fmla_element in assembly language which showcases fmla v8.4s, v0.4s, v4.s[2]. As done in the lectures for fmul, embed your function in a driver to showcase the instruction’s behavior.
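For the second task, a minimal driver could be structured along the following lines. Note that the signature of showcase_fmla_element is only an assumption for this sketch; adapt it to whatever interface your assembly implementation exposes.

#include <stdio.h>

// Assumed interface of the assembly routine: it loads v0 and v4 from
// the two input arrays, loads v8 from io_result, executes
// "fmla v8.4s, v0.4s, v4.s[2]" and stores v8 back to io_result.
extern void showcase_fmla_element( float const * i_v0,
                                   float const * i_v4,
                                   float       * io_result );

int main() {
  float l_v0[4]  = { 1.0f, 2.0f, 3.0f, 4.0f };
  float l_v4[4]  = { 5.0f, 6.0f, 7.0f, 8.0f };
  float l_res[4] = { 0.0f, 1.0f, 2.0f, 3.0f };

  showcase_fmla_element( l_v0, l_v4, l_res );

  for( int l = 0; l < 4; l++ ) {
    printf( "lane %d: %f\n", l, l_res[l] );
  }
  return 0;
}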

9.2. Matrix Multiplication Kernel

Efficient implementations of matrix-matrix multiplications are at the core of many applications. In this part of the lab we’ll write a kernel for the high-level operation \(C+=AB\) with \(A,B,C \in \mathbb{R}^{4 \times 4}\). Further, we limit ourselves to a single-precision kernel which has the following function signature:

void gemm_asm_asimd_4_4_4( float const * i_a,
                           float const * i_b,
                           float       * io_c );

The kernel gets the three pointers i_a, i_b and io_c to the matrices \(A\), \(B\) and \(C\) as parameters. As shown in Fig. 9.2.1, we assume that all matrices are stored in column-major order.

../_images/gemm_memory.svg

Fig. 9.2.1 Illustration of the operation \(C+=AB\) for column-major matrices \(A\), \(B\) and \(C\). The numbers inside the matrices show the IDs of the matrices’ elements w.r.t. their 1d arrays in linear memory.
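As a point of reference, the column-major layout of Fig. 9.2.1 translates to the following scalar C version of the operation. This is a verification sketch, not the kernel itself; the function name is ours.

// Scalar reference for C += A*B with 4x4 column-major matrices:
// element (m,n) of a matrix is stored at index n*4 + m of its 1d array.
void gemm_ref_4_4_4( float const * i_a,
                     float const * i_b,
                     float       * io_c ) {
  for( int n = 0; n < 4; n++ ) {     // columns of C and B
    for( int k = 0; k < 4; k++ ) {   // columns of A / rows of B
      for( int m = 0; m < 4; m++ ) { // four lanes of one vector register
        io_c[ n*4 + m ] += i_a[ k*4 + m ] * i_b[ n*4 + k ];
      }
    }
  }
}

One natural vectorization maps each iteration of the two outer loops to a single FMLA (by element): column k of \(A\) serves as the vector operand, element (k,n) of \(B\) as the broadcast scalar, and column n of \(C\) as the accumulator.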

In this part of the lab, we target a fast implementation of this kernel. To ease your development efforts, a code frame is provided. The code frame contains all boilerplate code to verify and benchmark your matrix kernel. Only the appropriate fmla instructions are missing from kernels/gemm_asm_asimd_4_4_4.s; you have to add them before we may tune the kernel for more performance.

Hint

ASIMD has different types of FMLA instructions. It’s sufficient to exclusively rely on the variant discussed in Section 9.1 to finish the kernel. If you are interested in the theoretical performance of a Cortex-A72 core, you may have a look at the Cortex®-A72 Software Optimization Guide.

Tasks

  1. Finish the kernel gemm_asm_asimd_4_4_4 in the file kernels/gemm_asm_asimd_4_4_4.s. Make sure that the computed results are correct. Report the obtained floating-point performance on one of the lab room’s Raspberry Pis. Include the ID (e.g., rspi05) and the lscpu output of the board you used in your report.

  2. Optimize your kernel! This means: maximize its performance! Unnecessary stack transfers should be considered low-hanging fruit. Document your optimizations and report the performance of your optimized kernel. Note: Your kernel must produce correct results and adhere to the procedure call standard.

9.3. Microkernels and Loops

We typically split the implementation of kernels for larger matrix operations into two parts. First, we write suitable microkernels which are completely unrolled, i.e., they don’t contain any loops. One goal for the microkernels is an optimal utilization of the available vector registers. Typically, this means maximizing the size of the used accumulator block. Second, we repeatedly execute these microkernels inside nested loops operating on blocks of the matrices.

Note

Interested in the low-level implementation of highly efficient matrix kernels? Maybe the class High Performance Computing is something for you 😇.

After finishing Section 9.2, a straightforward extension is the implementation of a microkernel gemm_asm_asimd_16_4_4 which assumes \(A,C \in \mathbb{R}^{16 \times 4}\) and \(B \in \mathbb{R}^{4 \times 4}\). We can then add a simple loop to realize \(C+=AB\) with \(A \in \mathbb{R}^{16 \times 12}\), \(B \in \mathbb{R}^{12 \times 4}\) and \(C \in \mathbb{R}^{16 \times 4}\) in gemm_asm_asimd_16_4_12, as sketched below.
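Modeled in scalar C, the intended structure might look as follows. This is only a sketch: the leading dimensions follow from the matrix sizes above, and a possible assembly version would realize the same pointer arithmetic with offsets and a loop counter in general purpose registers.

// C model of gemm_asm_asimd_16_4_12: C += A*B with column-major
// A (16x12, ld=16), B (12x4, ld=12) and C (16x4, ld=16).
void gemm_model_16_4_12( float const * i_a,
                         float const * i_b,
                         float       * io_c ) {
  // loop over the three K-blocks; each iteration corresponds to one
  // execution of the 16x4x4 microkernel
  for( int kb = 0; kb < 3; kb++ ) {
    float const * l_a = i_a + kb*4*16; // next four columns of A
    float const * l_b = i_b + kb*4;    // next four rows of B

    // body of the 16x4x4 microkernel (written with loops here; the
    // assembly microkernel is fully unrolled)
    for( int n = 0; n < 4; n++ ) {
      for( int k = 0; k < 4; k++ ) {
        for( int m = 0; m < 16; m++ ) {
          io_c[ n*16 + m ] += l_a[ k*16 + m ] * l_b[ n*12 + k ];
        }
      }
    }
  }
}

Also note the size of the accumulator block: the 16x4 block of \(C\) holds 64 FP32 values, i.e., 16 of the 32 available vector registers, and can stay resident for the whole computation.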

Tasks

  1. Implement and verify the microkernel gemm_asm_asimd_16_4_4. Document the performance of your microkernel.

  2. Add a loop to realize the kernel gemm_asm_asimd_16_4_12. Verify your kernel! Document the performance of your kernel.

9.4. N1, Snapdragon 8+ Gen 1 and RISC-V

The Raspberry Pi 4 board uses a Cortex-A72 processor which implements ARMv8-A. This is great, since we can study a relevant and rather recent ISA on comparatively cheap hardware. However, introduced in 2015, the processor itself is quite dated.

../_images/xiaomi_12pro.jpeg

Fig. 9.4.1 Picture of the Xiaomi 12 Pro smartphone accessible in this part of the lab. The phone has a Snapdragon 8+ Gen 1 SoC with a CPU implementing ARMv9.

This part of the lab allows us to run our assembly kernels on two recent processors. The first one is the Neoverse N1 server CPU, which is available, for example, in AWS or OCI. The second one is a Snapdragon 8+ Gen 1 System on Chip (SoC), which is used in recent smartphones and has a single Cortex-X2 core, three Cortex-A710 cores and four Cortex-A510 cores.

Lastly, we might also change gears entirely and have a look at RISC-V. For this, two VisionFive development boards are available. The respective SoC features two U74 cores, which can execute 64-bit RISC-V code (RV64GBC, S+U+M Mode).

Since some extra hops are required to get you onto these machines, working on this hardware is limited to the open labs.

Tasks

  1. During the regular labs of this week, tell the teaching team that you are interested in using the N1, the Snapdragon 8+ Gen 1 or the RISC-V boards. Also tell the team during which open lab you’d like to access the hardware.

  2. Show up to the respective open lab, and we’ll get you started.

  3. Run class-related workloads on the CPUs and share your experiences. This is a free-style section, but at least a minimal report on your experiences is expected.