7. Small GEMMs: SVE

We wrote our first high-performance matrix kernel using the Advanced SIMD (ASIMD) extension in Section 6. ASIMD is available on all AArch64 processors, including those that support the more advanced Scalable Vector Extension (SVE). Since SVE is key for high performance on Arm HPC processors, we’ll switch gears in this lab. After getting familiar with SVE in Section 7.1, we develop an SVE-based microkernel for small GEMMs in Section 7.2. We then generalize this approach by adding nested \(K\), \(M\) and \(N\) loops around the microkernel in Section 7.2, Section 7.3 and Section 7.4. In Section 7.5, we extend our approach to handle values of \(M\) which are not multiples of the vector length.

7.1. Getting Started

We’ll get started by making ourselves familiar with SVE; we require SVE instructions for our GEMM microkernel in Section 7.2. SVE, together with its superset SVE2, represents the future of vector processing on Arm processors. The A64FX processor, which powers the Fugaku supercomputer, and the Graviton3 processor already support SVE.

Once again we are on the lookout for information on the web. As before, the information is available through different channels, e.g., official documentation, tutorials, or recent announcements and news articles. The links provided below are meant as a starting point so that we can get going quickly. Ultimately, it’s important to be able to locate such resources independently.

Tasks

7.2. The Unrolled Part

In this part we’ll write an SVE-based microkernel for small GEMMs. Our targeted kernel gemm_asm_sve_32_6_1 has the following signature:

void gemm_asm_sve_32_6_1( float const * i_a,
                          float const * i_b,
                          float       * io_c );

and performs the operation

\[C \mathrel{+}= A B\]

on 32-bit floating point data with

\[M = 32, \; N = 6, \; K = 1, \quad ldA = 32, \; ldB = 1, \; ldC = 32.\]
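
Here \(ldA\), \(ldB\) and \(ldC\) are the leading dimensions. Assuming the column-major storage convention of Section 6, the operands are addressed as

\[a_{i,j} = \texttt{i\_a}[i + j \cdot ldA], \quad b_{i,j} = \texttt{i\_b}[i + j \cdot ldB], \quad c_{i,j} = \texttt{io\_c}[i + j \cdot ldC].\]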

We’ll follow the ASIMD approach of Section 6.2 and completely unroll the kernel, i.e., write every instruction explicitly without adding any control structures such as loops. Once done, we’ll add loops to increase the matrix dimensions. As before, this is particularly simple for \(K\), which we’ll already tackle at the end of this part.

Once again we have to preserve some SIMD registers in order to follow the procedure call standard AAPCS64. The template in Listing 7.2.1 may serve as a starting point for the implementation of your SVE microkernel.

Listing 7.2.1 Template for the \((32\times6) \mathrel{+}= (32\times1)(1\times6)\) matrix kernel. The template temporarily saves the general-purpose registers X19-X30 and the lower 64 bits of the SIMD registers V8-V15 on the stack. Additionally, the instruction ptrue p0.b sets all bits of the predicate register P0 to 1.
        .text
        .type gemm_asm_sve_32_6_1, %function
        .global gemm_asm_sve_32_6_1
        /*
         * Performs the matrix multiplication C += A * B
         * with the shapes (32x6) += (32x1) * (1x6).
         * The input data is of type float.
         *
         * @param x0 pointer to A.
         * @param x1 pointer to B.
         * @param x2 pointer to C.
         */
gemm_asm_sve_32_6_1:
        // set all bits of predicate register p0 to 1
        ptrue p0.b

        // save callee-saved registers
        stp x19, x20, [sp, #-16]!
        stp x21, x22, [sp, #-16]!
        stp x23, x24, [sp, #-16]!
        stp x25, x26, [sp, #-16]!
        stp x27, x28, [sp, #-16]!
        stp x29, x30, [sp, #-16]!

        stp  d8,  d9, [sp, #-16]!
        stp d10, d11, [sp, #-16]!
        stp d12, d13, [sp, #-16]!
        stp d14, d15, [sp, #-16]!

        // your matrix kernel goes here!

        // restore callee-saved registers
        ldp d14, d15, [sp], #16
        ldp d12, d13, [sp], #16
        ldp d10, d11, [sp], #16
        ldp  d8,  d9, [sp], #16

        ldp x29, x30, [sp], #16
        ldp x27, x28, [sp], #16
        ldp x25, x26, [sp], #16
        ldp x23, x24, [sp], #16
        ldp x21, x22, [sp], #16
        ldp x19, x20, [sp], #16

        ret
        .size gemm_asm_sve_32_6_1, (. - gemm_asm_sve_32_6_1)
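
Before turning to the tasks, let’s sketch what the kernel body might look like. The following fragment, which computes the first column of \(C\), assumes a 256-bit SVE implementation (8 FP32 elements per vector) as found, e.g., on Graviton3; the register choices are illustrative, not prescriptive:

        // load the single column of A (32 values) into four vectors
        ld1w { z0.s }, p0/z, [x0]               // rows  0- 7
        ld1w { z1.s }, p0/z, [x0, #1, mul vl]   // rows  8-15
        ld1w { z2.s }, p0/z, [x0, #2, mul vl]   // rows 16-23
        ld1w { z3.s }, p0/z, [x0, #3, mul vl]   // rows 24-31

        // first column of C: broadcast b(0,0), load C, FMA, store
        ld1rw { z4.s }, p0/z, [x1]              // broadcast b(0,0)
        ld1w { z8.s },  p0/z, [x2]
        ld1w { z9.s },  p0/z, [x2, #1, mul vl]
        ld1w { z10.s }, p0/z, [x2, #2, mul vl]
        ld1w { z11.s }, p0/z, [x2, #3, mul vl]
        fmla z8.s,  p0/m, z0.s, z4.s
        fmla z9.s,  p0/m, z1.s, z4.s
        fmla z10.s, p0/m, z2.s, z4.s
        fmla z11.s, p0/m, z3.s, z4.s
        st1w { z8.s },  p0, [x2]
        st1w { z9.s },  p0, [x2, #1, mul vl]
        st1w { z10.s }, p0, [x2, #2, mul vl]
        st1w { z11.s }, p0, [x2, #3, mul vl]

The remaining five columns of \(C\) follow the same pattern: the broadcast moves through \(B\) in 4-byte steps, and the pointer into \(C\) advances by \(ldC \cdot 4 = 128\) bytes per column.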

Tasks

  1. Implement and verify the unrolled matrix kernel C += AB for M=32, N=6, K=1, ldA=32, ldB=1, ldC=32.

  2. Tune your kernel to squeeze more performance out of the core. You may change everything, e.g., the type or order of the used instructions, but you have to follow the rules introduced in Section 6.2. Report and document your optimizations.

  3. Add a loop over K to realize C += AB for M=32, N=6, K=48, ldA=32, ldB=48, ldC=32; a possible loop skeleton is sketched after this list.

  4. Submit your team name together with your entries for “time (s)”, “#executions”, “GFLOPS” and “%peak” for the two kernels.
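
As referenced in task 3, a minimal sketch of the K-loop structure, assuming the unrolled microkernel above as the loop body and (hypothetically) x3 as the loop counter:

        mov x3, #48                // K = 48 iterations
loop_over_k:
        // unrolled (32x6) += (32x1) * (1x6) microkernel goes here

        add x0, x0, #128           // next column of A: ldA * 4 bytes
        add x1, x1, #4             // next row of B: 4 bytes
        subs x3, x3, #1
        b.ne loop_over_k

Reloading and storing \(C\) in every iteration works but wastes bandwidth; keeping the 24 accumulators of \(C\) in Z registers for the entire loop and writing them back once afterwards is typically much faster. Also note that with this 4-byte stepping of x1, the broadcast for column \(j\) of \(B\) sits at byte offset \(j \cdot ldB \cdot 4 = j \cdot 192\); since LD1RW only supports immediate offsets up to 252, columns \(j \geq 2\) need a second pointer or explicit address arithmetic.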

7.3. Loop over M

First, we looked at a completely unrolled implementation with shapes \((32 \times 6) \mathrel{+}= (32 \times 1) * (1 \times 6)\). Next, we extended our kernel by adding a loop over \(K\) and implemented \((32 \times 6) \mathrel{+}= (32 \times 48) * (48 \times 6)\).

The remaining parts of the lab study two directions of generalization:

  • Larger sizes \(M\) and \(N\) for our matrices.

  • Values for \(M\) which are not divisible by the vector length.

In this part, we’ll write a kernel which performs the operation

\[C \mathrel{+}= A B\]

on 32-bit floating point data with

\[M = 128, \; N = 6, \; K = 48, \quad ldA = 128, \; ldB = 48, \; ldC = 128.\]

Due to the sizes of the involved matrices, it is not advisable to completely unroll the kernel. In the general case one writes a microkernel and wraps it in three loops over \(M\), \(N\) and \(K\). We already wrote the microkernel in Section 7.2 and added a loop over \(K\). Since our microkernel also used \(N=6\), only the additional loop over \(M\) has to be added. As in Section 7.2, the new loop requires us to slightly adjust our code to account for the changing data locations in the loop’s iterations. For now, we are stuck with performing these steps manually. On the bright side: once understood, these steps are easy to abstract when generating code at runtime in Section 10.
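
A possible M-loop skeleton; the register assignments (x3 as block counter, x4-x6 as saved base pointers) are our choice, not mandated:

        mov x4, x0                 // base of the current block of A
        mov x5, x1                 // base of B (shared by all blocks)
        mov x6, x2                 // base of the current block of C
        mov x3, #4                 // M / 32 = 128 / 32 = 4 blocks
loop_over_m:
        mov x0, x4                 // working pointers for the microkernel
        mov x1, x5
        mov x2, x6

        // (32x6) += (32x48) * (48x6) kernel of Section 7.2 goes here

        add x4, x4, #128           // next block of A: 32 rows * 4 bytes
        add x6, x6, #128           // next block of C: 32 rows * 4 bytes
        subs x3, x3, #1
        b.ne loop_over_m

Note that inside the microkernel the column strides change: columns of \(A\) are now \(ldA \cdot 4 = 512\) bytes apart (previously 128), and likewise columns of \(C\) are \(ldC \cdot 4 = 512\) bytes apart.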

Tasks

  • Implement and verify the matrix kernel C += AB for M=128, N=6, K=48, ldA=128, ldB=48, ldC=128. Re-use the code of your microkernel, implemented in Section 7.2.

  • Optimize your matrix kernel. Respect the rules of Section 6.2. Report and document your optimizations.

  • Submit the metrics “time (s)”, “#executions”, “GFLOPS” and “%peak” together with your team name for your best-performing variant.

7.4. Loop over N

Let’s increase the complexity of our matrix kernel further. Compared to Section 7.3 we increase the size of dimension \(N\) from 6 to 48. Specifically, we implement a kernel which performs the operation

\[C \mathrel{+}= A B\]

on 32-bit floating point data with

\[M = 128, \; N = 48, \; K = 48, \quad ldA = 128, \; ldB = 48, \; ldC = 128.\]

Once again, we simply have to add another loop and adjust the data locations accordingly; a possible skeleton is sketched below.
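
A sketch of the additional loop, reusing the hypothetical base pointers x5 (B) and x6 (C) of the previous sketch:

        mov x7, #8                 // N / 6 = 48 / 6 = 8 blocks
loop_over_n:
        // (128x6) += (128x48) * (48x6) kernel of Section 7.3 goes here,
        // with A reset to its base pointer for every block of columns

        add x5, x5, #1152          // next block of B: 6 cols * 48 * 4 bytes
        add x6, x6, #3072          // next block of C: 6 cols * 128 * 4 bytes
        subs x7, x7, #1
        b.ne loop_over_n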

Tasks

  • Implement and verify the matrix kernel C += AB for M=128, N=48, K=48, ldA=128, ldB=48, ldC=128. Re-use the code of your kernel, implemented in Section 7.3.

  • Optimize your matrix kernel. Respect the rules of Section 6.2. Report and document your optimizations.

  • Submit the metrics “time (s)”, “#executions”, “GFLOPS” and “%peak” together with your team name for your best-performing variant.

7.5. Arbitrary Values for M

Supporting arbitrary values for \(K\) is simple: we only have to change the number of iterations of the loop over \(K\). Arbitrary values for \(N\) are slightly more difficult since we might have to rethink our blocking and wrap a new microkernel. The difficult case is a value of \(M\) which is not a multiple of the vector length. That’s the challenge we’ll tackle now by implementing a kernel which performs the operation

\[C \mathrel{+}= A B\]

on 32-bit floating point data with

\[M = 31, \; N = 6, \; K = 48, \quad ldA = 31, \; ldB = 48, \; ldC = 31.\]
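
The key new ingredient is a predicate that deactivates the unused lanes of the last vector. Assuming 256-bit vectors (8 FP32 lanes), \(M=31\) splits into three full vectors (rows 0-23) and a 7-element remainder; a sketch of the setup and its use, with p1 as our (hypothetical) tail predicate:

        // p0 (all lanes active, set in the template) covers rows 0-23
        mov w9, #24
        mov w10, #31
        whilelt p1.s, w9, w10       // lanes 0-6 active: rows 24-30

        // the last vector of each column of A and C then uses p1
        ld1w { z3.s }, p1/z, [x0, #3, mul vl]   // rows 24-30, lane 7 zeroed
        fmla z11.s, p1/m, z3.s, z4.s
        st1w { z11.s }, p1, [x2, #3, mul vl]    // writes only 7 elements

All other loads, stores and FMAs keep using the all-true predicate p0. Also remember that the leading dimensions change: columns of \(A\) and \(C\) are now \(31 \cdot 4 = 124\) bytes apart.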

Tasks

  • Implement and verify the matrix kernel C += AB for M=31, N=6, K=48, ldA=31, ldB=48, ldC=31. Use predicated SVE instructions to tackle \(M=31\), which is not a multiple of the 8 elements per vector. Re-use the code of your microkernel, implemented in Section 7.2.

  • Optimize your matrix kernel. Respect the rules of Section 6.2. Report and document your optimizations.

  • Submit the metrics “time (s)”, “#executions”, “GFLOPS” and “%peak” together with your team name for your best-performing variant.