GEMM#

We typically write a few standalone small GEMM kernels for different matrix shapes to get a feel for the hardware and ISA. Once this is done, we move on and implement a more general version in our JITters. This section describes the development of our standalone GEMM kernels.

Accelerate#

Apple’s Accelerate is a library of vendor-optimized implementations of common workloads. BLAS-3 GEMM is one of these workloads, and we can perform a matrix-matrix multiplication in Accelerate through its CBLAS interface:

#include <Accelerate/Accelerate.h>

// alpha = beta = 1, i.e., C += op(A) * op(B)
cblas_sgemm( CblasColMajor,
             i_trans_a == 0 ? CblasNoTrans : CblasTrans,
             i_trans_b == 0 ? CblasNoTrans : CblasTrans,
             i_m,
             i_n,
             i_k,
             1.0f,   // alpha
             l_a,
             i_lda,
             l_b,
             i_ldb,
             1.0f,   // beta
             l_c,
             i_ldc );

We ran Accelerate to benchmark the operation \(C \mathrel{+}= AB\) for square matrices (M=N=K), optionally with a transposed B:

| M/N/K | TransB | P-Core FP32 GFLOPS | E-Core FP32 GFLOPS |
|------:|-------:|-------------------:|-------------------:|
| 16    | 0      | 54.8               | 6.9                |
| 16    | 1      | 73.1               | 9.4                |
| 32    | 0      | 443.0              | 36.9               |
| 32    | 1      | 922.1              | 62.6               |
| 64    | 0      | 1024.0             | 57.8               |
| 64    | 1      | 1431.4             | 71.1               |
| 128   | 0      | 1430.9             | 72.3               |
| 128   | 1      | 1687.8             | 74.8               |
| 256   | 0      | 1606.4             | 71.7               |
| 256   | 1      | 1780.5             | 78.3               |
| 512   | 0      | 1748.0             | 74.4               |
| 512   | 1      | 1825.1             | 81.7               |
| 1024  | 0      | 1604.3             | 69.8               |
| 1024  | 1      | 1600.7             | 72.3               |
| 2048  | 0      | 1511.8             | 69.2               |
| 2048  | 1      | 1528.8             | 68.9               |

To summarize the measurements: our highest measured GEMM performance on a performance core, 1825.1 FP32 GFLOPS, was obtained for M=N=K=512 with transposed B and corresponds to 91% of our measured peak. We also obtained the highest efficiency-core performance for the M=N=K=512 setting, where Accelerate achieved 81.7 FP32 GFLOPS, which is only 70% of the microbenchmarked peak. Since we can expect Accelerate to be highly optimized for sufficiently large GEMMs, these numbers should be a reasonable estimate of the performance attainable on M4.
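For reference, GFLOPS numbers like the ones above are derived in the usual way from the wall-clock time and the 2·M·N·K floating-point operations of a single GEMM. The following is a minimal sketch of such a measurement for the non-transposed square case; the function name and repetition parameter are ours, not the exact harness we used:

#include <time.h>
#include <Accelerate/Accelerate.h>

// Times i_reps repetitions of C += A*B for square matrices (M = N = K = i_m,
// column-major) and returns the sustained FP32 GFLOPS.
double bench_sgemm( int           i_m,
                    long          i_reps,
                    const float * i_a,
                    const float * i_b,
                    float       * io_c ) {
  struct timespec l_tp0, l_tp1;
  clock_gettime( CLOCK_MONOTONIC, &l_tp0 );
  for( long l_re = 0; l_re < i_reps; l_re++ ) {
    cblas_sgemm( CblasColMajor, CblasNoTrans, CblasNoTrans,
                 i_m, i_m, i_m,
                 1.0f, i_a, i_m,
                       i_b, i_m,
                 1.0f, io_c, i_m );
  }
  clock_gettime( CLOCK_MONOTONIC, &l_tp1 );
  double l_dur = ( l_tp1.tv_sec  - l_tp0.tv_sec )
               + ( l_tp1.tv_nsec - l_tp0.tv_nsec ) * 1.0E-9;
  return 2.0 * i_m * i_m * i_m * i_reps / l_dur * 1.0E-9;
}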

Microkernel#

A microkernel is the innermost part of our small GEMMs. It contains the instructions that do the heavy lifting and is free of control structures. A microkernel does at least one full update of the accumulator block and is called repeatedly in a loop over K, which adds to the intermediate results. We can also unroll a few iterations of the K-loop in the microkernel if it helps with performance. In the case of SME, we did not observe any speedup from unrolling, so we opted for a single K update in our first FP32 kernel.
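This scheme can be written down in scalar C: the microkernel performs one full rank-1 (outer-product) update of the M x N accumulator block, and the surrounding K-loop calls it once per iteration. The sketch below is purely conceptual, with names of our choosing and the B-transposed data layout (B stored as N x K) that we target later; it is not the actual assembly kernel:

#include <stddef.h>

// One full update of the M x N accumulator block: a single rank-1
// (outer-product) contribution, which is what the microkernel implements.
static void microkernel_rank1( int           i_m,
                               int           i_n,
                               const float * i_a_col,   // column k of A (M values)
                               const float * i_b_col,   // N values multiplying that column
                               float       * io_c ) {   // M x N accumulator, column-major
  for( int l_n = 0; l_n < i_n; l_n++ ) {
    for( int l_m = 0; l_m < i_m; l_m++ ) {
      io_c[ l_n * i_m + l_m ] += i_a_col[ l_m ] * i_b_col[ l_n ];
    }
  }
}

// K-loop around the microkernel; unrolling would place several of the
// rank-1 updates inside the microkernel instead.
static void gemm_small( int i_m, int i_n, int i_k,
                        const float * i_a, const float * i_b, float * io_c ) {
  for( int l_ko = 0; l_ko < i_k; l_ko++ ) {
    microkernel_rank1( i_m, i_n,
                       i_a + (size_t)l_ko * i_m,
                       i_b + (size_t)l_ko * i_n,
                       io_c );
  }
}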

Register blocking is another key to high performance. When targeting vector instructions (Neon/SVE), we invest up to 24 of the 32 vector registers in the accumulator block, with the motivation of keeping as much of C’s data in the registers as possible. For SME, the situation is different because we have dedicated tiles for accumulation. Since we have four tiles, each capable of holding 16x16 values, we get three different blocking strategies for the microkernel’s accumulator block: M=64, N=16, or M=32, N=32, or M=16, N=64. This is analogous to the AMX chapter. In our first kernel, we chose the M=32, N=32 blocking strategy. The motivation behind this is to minimize the A and B loads when iterating over dimension K: in the case of M=32, N=32, we have to load a total of 64 values of A and B for four FMOPA instructions, whereas the other two blocking strategies would require us to load a total of 80 values for the same number of outer products.
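The load-count argument can be illustrated with a scalar sketch of a single K iteration under the M=32, N=32 blocking. Each 16x16 quadrant of the accumulator corresponds to one ZA tile and is updated by one FMOPA, and the two input vectors together are the 64 loaded values (a 64x16 or 16x64 blocking would need 64 + 16 = 80 values for the same four outer products). The function name is ours:

// One K iteration of the 32x32 accumulator update (scalar reference).
// i_a_col and i_b_col each hold 32 freshly loaded values, 64 loads in total;
// the four quadrant updates correspond to four FMOPAs, one per ZA tile.
void update_32x32( const float * i_a_col,
                   const float * i_b_col,
                   float       * io_c ) {        // 32 x 32, column-major (ldc = 32)
  for( int l_tn = 0; l_tn < 2; l_tn++ ) {        // tile index in N
    for( int l_tm = 0; l_tm < 2; l_tm++ ) {      // tile index in M
      for( int l_n = 0; l_n < 16; l_n++ ) {
        for( int l_m = 0; l_m < 16; l_m++ ) {
          io_c[ ( l_tn * 16 + l_n ) * 32 + ( l_tm * 16 + l_m ) ]
            += i_a_col[ l_tm * 16 + l_m ] * i_b_col[ l_tn * 16 + l_n ];
        }
      }
    }
  }
}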

Since our first kernel performs the operation \(C \mathrel{+}= AB^T\) with M=32, N=32, and K=32, we need to initialize the ZA tiles with the values of C. We then accumulate \(AB^T\) in the K-loop and store the updated matrix C back into memory when finished. Interestingly, the SME kernels in Arm’s Compute Library first load the data from C into the streaming SVE registers and then copy the data from the vector registers into the ZA tiles. We tried both options, loading directly into the ZA tiles and loading first into the vector registers. Since the indirect approach was faster in our tests, we stuck with it in our kernel and plan to write a microbenchmark to compare the two options. We also used the indirect approach for storing C.
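In plain C, the structure of this first kernel looks roughly as follows. The local accumulator array stands in for the four ZA tiles, and the explicit copy-in/copy-out mirrors the (indirect) load and store of C described above. This is a sketch for illustration, not the assembly implementation:

// Structural sketch of the 32x32x32 kernel for C += A * B^T.
// A and B are 32x32 column-major matrices; B is accessed in transposed fashion.
void gemm_32_32_32_trans_b( const float * i_a,
                            const float * i_b,
                            float       * io_c ) {
  float l_acc[ 32 * 32 ];                        // stand-in for the ZA tiles

  // "load" C into the accumulator (in the real kernel: memory -> streaming
  //  SVE registers -> ZA tiles)
  for( int l_i = 0; l_i < 32 * 32; l_i++ ) {
    l_acc[ l_i ] = io_c[ l_i ];
  }
  // K-loop: one rank-1 update per iteration (four FMOPAs in the real kernel)
  for( int l_ko = 0; l_ko < 32; l_ko++ ) {
    for( int l_n = 0; l_n < 32; l_n++ ) {
      for( int l_m = 0; l_m < 32; l_m++ ) {
        l_acc[ l_n * 32 + l_m ] += i_a[ l_ko * 32 + l_m ] * i_b[ l_ko * 32 + l_n ];
      }
    }
  }
  // "store" the accumulator back to C (again via the vector registers)
  for( int l_i = 0; l_i < 32 * 32; l_i++ ) {
    io_c[ l_i ] = l_acc[ l_i ];
  }
}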

We measured the following performance on M4 with 1-4 threads (23efa23):

| Threads | M  | N  | K  | Accelerate FP32 GFLOPS | Kernel FP32 GFLOPS |
|--------:|---:|---:|---:|-----------------------:|-------------------:|
| 1       | 32 | 32 | 32 | 907                    | 923                |
| 2       | 32 | 32 | 32 | 1206                   | 1263               |
| 3       | 32 | 32 | 32 | 1331                   | 1364               |
| 4       | 32 | 32 | 32 | 1301                   | 1320               |

We see that our kernel is slightly faster than Accelerate, which is to be expected for a sufficiently optimized kernel tailored to a single use case (M=N=K=32, transB). Accelerate receives the GEMM configuration as function parameters and therefore has to execute additional control logic at run time. This is a key advantage of just-in-time kernel generation: we can tailor the generated code to a single kernel configuration. Of course, the kernel must then be used often enough to amortize the overhead introduced by JITting.