9. Single Instruction Multiple Data
This lab covers the floating point support of the Advanced SIMD (ASIMD) architecture, also called NEON. We get started by showcasing the behavior of a vector operation in Section 9.1. Building on this understanding, we implement a small and efficient FP32 matrix-matrix multiplication kernel in Section 9.2. As discussed in Section 9.3, this approach can be extended to more general matrix sizes. Lastly, Section 9.4 allows us to run the class's workloads on recent processors.
9.1. ASIMD/NEON
As discussed in the lectures and shown in Fig. 9.1.1, ASIMD uses 32 128-bit registers. These registers may be used independently of the general-purpose registers which we harnessed in Section 8. In this lab we limit our considerations to single-precision floating point (FP32) numbers. Since every single-precision number is 32 bits wide, we may store up to four FP32 numbers in a single ASIMD vector register.
fmla v8.4s, v0.4s, v4.s[2]
Listing 9.1.1: Example of the vector, single-precision FMLA (by element) instruction.
This lab relies on vector instructions to do our heavy lifting. Specifically, we’ll use the vector, single-precision FMLA (by element) instruction when writing a matrix kernel in Section 9.2.
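To make the semantics concrete, the following scalar C model (a sketch, not part of the provided code) computes the same result as fmla v8.4s, v0.4s, v4.s[2]; the arrays stand in for the four lanes of the respective vector registers:

#include <stdio.h>

int main( void ) {
  // lanes of v0 (vector source), v4 (element source) and v8 (accumulator)
  float l_v0[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
  float l_v4[4] = { 5.0f, 6.0f, 7.0f, 8.0f };
  float l_v8[4] = { 0.5f, 0.5f, 0.5f, 0.5f };

  // fmla v8.4s, v0.4s, v4.s[2]: every lane of v0 is multiplied by
  // lane 2 of v4 and added to the corresponding lane of v8
  for( int l_la = 0; l_la < 4; l_la++ ) {
    l_v8[l_la] += l_v0[l_la] * l_v4[2];
  }

  for( int l_la = 0; l_la < 4; l_la++ ) {
    printf( "v8.s[%d]: %f\n", l_la, l_v8[l_la] );
  }

  return 0;
}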
Tasks
1. Look up the vector, single-precision FMLA (by element) instruction. Briefly describe (in your own words) what the example instruction in Listing 9.1.1 does.
2. Implement a small function showcase_fmla_element in assembly language which showcases fmla v8.4s, v0.4s, v4.s[2]. As done in the lectures for fmul, embed your function in a driver to showcase the instruction's behavior. A sketch of such a driver follows below.
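The driver could look similar to the following C sketch. Note that the shown signature of showcase_fmla_element is only an assumption: here the function is expected to load the first two arguments into v0 and v4, load and update v8, and store v8 back through the third pointer.

#include <stdio.h>

// assumed signature: loads *i_v0 into v0 and *i_v4 into v4, loads *io_v8
// into v8, executes fmla v8.4s, v0.4s, v4.s[2], and stores v8 to io_v8
void showcase_fmla_element( float const * i_v0,
                            float const * i_v4,
                            float       * io_v8 );

int main( void ) {
  float l_v0[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
  float l_v4[4] = { 5.0f, 6.0f, 7.0f, 8.0f };
  float l_v8[4] = { 0.0f, 0.0f, 0.0f, 0.0f };

  showcase_fmla_element( l_v0, l_v4, l_v8 );

  for( int l_la = 0; l_la < 4; l_la++ ) {
    printf( "l_v8[%d]: %f\n", l_la, l_v8[l_la] );
  }

  return 0;
}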
9.2. Matrix Multiplication Kernel
Efficient implementations of matrix-matrix multiplications are at the core of many applications. In this part of the lab we’ll write a kernel for the high-level operation \(C+=AB\) with \(A,B,C \in \mathbb{R}^{4 \times 4}\). Further, we limit ourselves to a single-precision kernel which has the following function signature:
void gemm_asm_asimd_4_4_4( float const * i_a,
float const * i_b,
float * io_c );
The kernel gets the three pointers i_a, i_b and io_c to the matrices \(A\), \(B\) and \(C\) as parameters. As shown in Fig. 9.2.1, we assume that all matrices are stored in column-major order.
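To make the column-major indexing concrete, the following scalar reference (a sketch for verification purposes, not the targeted vectorized kernel) computes \(C+=AB\); element \((m,n)\) of a matrix with four rows is stored at offset \(m + 4n\):

// scalar reference for C += A * B with 4x4 column-major FP32 matrices
void gemm_ref_4_4_4( float const * i_a,
                     float const * i_b,
                     float       * io_c ) {
  for( int l_m = 0; l_m < 4; l_m++ ) {
    for( int l_n = 0; l_n < 4; l_n++ ) {
      for( int l_k = 0; l_k < 4; l_k++ ) {
        // column-major: element (m,n) sits at offset m + 4*n
        io_c[l_m + 4*l_n] += i_a[l_m + 4*l_k] * i_b[l_k + 4*l_n];
      }
    }
  }
}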
In this part of the lab, we target a fast implementation of this kernel. To ease your development efforts, a code frame is provided. The code frame contains all boilerplate code required to verify and benchmark your matrix kernel. Only the appropriate fmla instructions are missing in kernels/gemm_asm_asimd_4_4_4.s and have to be added by you before we may tune the kernel for more performance.
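If you would like to study the required access pattern before writing assembly, the following NEON-intrinsics sketch (an illustration of the idea, not the requested assembly) computes the same 4×4 update. Every call to vfmaq_laneq_f32 corresponds to one vector, single-precision FMLA (by element) instruction:

#include <arm_neon.h>

// intrinsics sketch of C += A * B for 4x4 column-major FP32 matrices
void gemm_intrin_4_4_4( float const * i_a,
                        float const * i_b,
                        float       * io_c ) {
  // the four columns of A
  float32x4_t l_a0 = vld1q_f32( i_a      );
  float32x4_t l_a1 = vld1q_f32( i_a +  4 );
  float32x4_t l_a2 = vld1q_f32( i_a +  8 );
  float32x4_t l_a3 = vld1q_f32( i_a + 12 );

  for( int l_n = 0; l_n < 4; l_n++ ) {
    float32x4_t l_b = vld1q_f32( i_b  + 4*l_n ); // column n of B
    float32x4_t l_c = vld1q_f32( io_c + 4*l_n ); // column n of C (accumulator)

    // one fmla (by element) per column of A
    l_c = vfmaq_laneq_f32( l_c, l_a0, l_b, 0 );
    l_c = vfmaq_laneq_f32( l_c, l_a1, l_b, 1 );
    l_c = vfmaq_laneq_f32( l_c, l_a2, l_b, 2 );
    l_c = vfmaq_laneq_f32( l_c, l_a3, l_b, 3 );

    vst1q_f32( io_c + 4*l_n, l_c );
  }
}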
Hint
ASIMD has different types of FMLA instructions. It is sufficient to exclusively rely on the variant discussed in Section 9.1 to finish the kernel. If you are interested in the theoretical performance of a Cortex-A72 core, you may have a look at the Cortex®-A72 Software Optimization Guide.
Tasks
1. Finish the kernel gemm_asm_asimd_4_4_4 in the file kernels/gemm_asm_asimd_4_4_4.s. Make sure that the computed results are correct. Report the obtained floating point performance on one of the lab room's Raspberry Pis. Include the ID, e.g., rspi05, and the output of lscpu of the board you used in your report.
2. Optimize your kernel! This means: maximize its performance! Unnecessary stack transfers should be considered low-hanging fruit. Document your optimizations and report the performance of your optimized kernel. Note: Your kernel must produce correct results and adhere to the procedure call standard.
9.3. Microkernels and Loops
We typically split the implementation of kernels for larger matrix operations into two parts. First, we write suitable microkernels which are completely unrolled, i.e., they don’t contain any loops. One goal for the microkernels is an optimal utilization of the available vector registers. Typically, this means maximizing the size of the used accumulator block. Second, we repeatedly execute these microkernels inside nested loops operating on blocks of the matrices.
Note
Interested in the low-level implementation of highly efficient matrix kernels? Maybe the class High Performance Computing is something for you 😇.
After finishing Section 9.2, a straightforward extension is given by the implementation of a microkernel gemm_asm_asimd_16_4_4 which assumes \(A,C \in \mathbb{R}^{16 \times 4}\) and \(B \in \mathbb{R}^{4 \times 4}\).
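A quick register-budget check shows why this shape is attractive: the \(16 \times 4\) accumulator block of \(C\) holds \(64\) FP32 values, i.e., \(64/4 = 16\) of the 32 ASIMD registers, which leaves the remaining 16 registers for holding data of \(A\) and \(B\).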
We can then add a simple loop to realize \(C+=AB\) with \(A \in \mathbb{R}^{16 \times 12}\), \(B \in \mathbb{R}^{12 \times 4}\) and \(C \in \mathbb{R}^{16 \times 4}\) in gemm_asm_asimd_16_4_12.
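Conceptually, gemm_asm_asimd_16_4_12 performs three 16×4×4 block updates. The following scalar sketch models this blocking (it is not the assembly kernel itself); note that inside a block the columns of \(B\) keep their stride of 12, so the loop typically lives inside the assembly kernel and simply advances the \(A\) and \(B\) pointers:

// scalar model of C (16x4) += A (16x12) * B (12x4) as three 16x4x4
// block updates; all matrices are stored in column-major order
void gemm_model_16_4_12( float const * i_a,
                         float const * i_b,
                         float       * io_c ) {
  for( int l_bk = 0; l_bk < 3; l_bk++ ) {      // k-blocks of width 4
    for( int l_n = 0; l_n < 4; l_n++ ) {       // columns of B and C
      for( int l_k = 0; l_k < 4; l_k++ ) {     // k inside the block
        int l_kt = 4*l_bk + l_k;               // global k
        for( int l_m = 0; l_m < 16; l_m++ ) {  // rows of A and C
          io_c[l_m + 16*l_n] += i_a[l_m + 16*l_kt] * i_b[l_kt + 12*l_n];
        }
      }
    }
  }
}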
Tasks
1. Implement and verify the microkernel gemm_asm_asimd_16_4_4. Document the performance of your microkernel.
2. Add a loop to realize the kernel gemm_asm_asimd_16_4_12. Verify your kernel! Document the performance of your kernel.
9.4. N1, Snapdragon 8+ Gen 1 and RISC-V
The Raspberry Pi 4 board uses a Cortex-A72 processor which implements ARMv8-A. This is great, since we can study a relevant and rather recent ISA on comparatively cheap hardware. However, having been introduced in 2015, the processor itself is quite dated.
This part of the lab allows us to run our assembly kernels on two recent processors. The first one is the Neoverse N1 server CPU, which is, for example, available in AWS or OCI. Secondly, we may use a Snapdragon 8+ Gen 1 System on Chip (SoC) which is used in recent smartphones and has a single Cortex-X2 core, three Cortex-A710 cores and four Cortex-A510 cores.
Lastly, we might also change gears entirely and have a look at RISC-V. For this, two VisionFive development boards are available. The respective SoC features two U74 cores which may execute 64-bit RISC-V code (RV64GBC, S+U+M Mode).
Since some extra hops are required to get you onto these machines, working on this hardware is limited to the open labs.
Tasks
1. During the regular labs of this week, tell the teaching team that you are interested in using the N1, the Snapdragon 8+ Gen 1 or the RISC-V boards. Also tell the team during which open lab you'd like to access the hardware.
2. Show up to the respective open lab, and we'll get you started.
3. Run class-related workloads on the CPUs and share your experiences. This is a free-style task but at least a minimal report on your experiences is expected.