Introduction#

M1 is Apple’s first System on a Chip (SoC) for the Mac and was announced on November 10, 2020. The SoC features an Arm CPU, a GPU and an NPU (Apple Neural Engine). M1 marks Apple’s transition from x86 to Arm in its Mac product line and has helped the Arm architecture gain traction in the notebook and desktop segments.

Neon#

The CPU of M1 has four performance cores (Firestorm) and four efficiency cores (Icestorm). The floating-point capabilities of the cores are accessible through Neon instructions. A Firestorm core has an estimated frequency of 3.2 GHz, resulting in a theoretical peak performance of 102.4 FP32 GFLOPS or 51.2 FP64 GFLOPS.
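The peak numbers can be reproduced with a short calculation. The sketch below assumes four 128-bit FMA pipelines per Firestorm core; this figure is not stated in the text but is consistent with the quoted peaks. The function name is illustrative.

```python
def neon_peak_gflops(freq_ghz, fma_pipes, vector_bits, element_bits):
    """Theoretical peak of a single core in GFLOPS."""
    lanes = vector_bits // element_bits      # elements per vector register
    flops_per_cycle = fma_pipes * lanes * 2  # one FMA counts as multiply + add
    return freq_ghz * flops_per_cycle


# Assumption: four 128-bit FMA pipelines per Firestorm core at 3.2 GHz.
fp32_peak = neon_peak_gflops(3.2, 4, 128, 32)
fp64_peak = neon_peak_gflops(3.2, 4, 128, 64)
print(f"FP32: {fp32_peak:.1f} GFLOPS, FP64: {fp64_peak:.1f} GFLOPS")
```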

One important workload that relies heavily on fast floating-point units is general matrix-matrix multiplication (GEMM). A GEMM computes the matrix-matrix product $C = A B$ with $A \in \mathbb{R}^{M \times K}$, $B \in \mathbb{R}^{K \times N}$ and $C \in \mathbb{R}^{M \times N}$. The LIBXSMM library just-in-time generates fast machine code for small GEMMs, i.e., $M, N, K \leq 64$ for vector instructions. LIBXSMM supports Neon vector instructions and can therefore generate kernels optimized for M1’s CPU.

Fig. 1: Illustration of the first three FMLA instructions in an FP32 GEMM kernel for $M = 16$, $N = 4$ and $K = 4$.

The structure of a generated FP32 GEMM kernel using Neon is shown in Fig. 1. The disassembled implementation can be found in the file gemm_asimd_16_4_4_fmla.dis. The kernel loads a column of A into the four 128-bit vector registers v1, v2, v3 and v4. It also loads a single value of matrix B into the lower 32 bits of vector register v0. The vector registers holding the column of A are then multiplied by the broadcast value of B, and the result is added to the accumulator registers v16, v17, v18 and v19, which hold the first column of C. These multiplications and additions are performed by four 128-bit FMLA (by element) vector instructions.

Now the next value of the first row of B is loaded and multiplied by the first column of A. The result is added to the second column of C. This process continues until we have added the entire outer product of A’s first column and B’s first row to C. We continue similarly by adding the outer products of A’s second column and B’s second row, A’s third column and B’s third row, and A’s fourth column and B’s fourth row to C.
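The loop structure described above can be sketched in plain Python. This is a functional model only, not the generated machine code: the column-major flat-list layout, the function names and the test data are assumptions made for illustration, and the innermost loop over `lanes` stands in for a single 128-bit FMLA (by element) instruction.

```python
def gemm_reference(a, b, c, m, n, k):
    """Textbook GEMM C += A B on column-major flat lists."""
    for j in range(n):
        for i in range(m):
            for p in range(k):
                c[i + j * m] += a[i + p * m] * b[p + j * k]


def gemm_outer_product(a, b, c, m, n, k, lanes=4):
    """C += A B as a sum of k outer products (m must be a multiple of lanes)."""
    for p in range(k):                    # column p of A times row p of B
        col_a = a[p * m:(p + 1) * m]      # column of A (v1..v4 for m=16)
        for j in range(n):                # walk along row p of B
            b_pj = b[p + j * k]           # single value of B (element of v0)
            for r in range(0, m, lanes):  # one FMLA per 4-lane register
                for lane in range(lanes):
                    c[r + lane + j * m] += col_a[r + lane] * b_pj


m, n, k = 16, 4, 4
a = [float(i % 7) for i in range(m * k)]
b = [float(i % 5) for i in range(k * n)]
c_ref = [1.0] * (m * n)
c_out = [1.0] * (m * n)
gemm_reference(a, b, c_ref, m, n, k)
gemm_outer_product(a, b, c_out, m, n, k)
print("results match:", c_ref == c_out)
```

For M=16, N=4 and K=4 the model issues 4 x 4 x 4 = 64 four-lane updates, one per FMLA instruction in the kernel.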

In summary, we have implemented the matrix-matrix multiplication by a series of outer products, which in turn are implemented by a series of FMLA instructions. The outer product formulation is the driving mechanism behind AMX and SME, which will be discussed after a brief performance evaluation.

Fig. 2: Sustained floating-point performance of just-in-time generated Neon kernels on a single Firestorm core. Shown is the performance for the operation $C \mathrel{+}= A B$ as a function of the matrix sizes M, N and K. The height of the bars indicates the performance relative to the core’s theoretical FP32 peak (102.4 GFLOPS) and FP64 peak (51.2 GFLOPS). Additionally, the measured FP32 and FP64 GFLOPS are written at the top of the bars.

Fig. 2 shows the Firestorm GEMM performance for a set of square matrices (M=N=K). The generated kernels reach 90% of the theoretical peak for M=N=K=32, but performance drops steeply for large matrices. The reason is the scope of the generated kernels: they are optimized for small matrix sizes and do not apply the cache-blocking techniques that large matrices usually require.

Apple AMX#

Shortly after the release of M1, reports of one or more hidden matrix coprocessors surfaced. Unlike the NPU and GPU, the matrix coprocessor is programmed by issuing AMX instructions from the CPU. The instruction encodings are outside the regular Arm Instruction Set Architecture (ISA) and are not documented by Apple. However, a detailed analysis by Dougall Johnson describes their structure and makes it possible to program the coprocessor.

Fig. 3: Illustration of M1’s AMX register blocks X, Y and Z. X consists of eight columns and Y of eight rows. Each column/row has 64 bytes for a total of 512 bytes per register block. Z has a total of 4,096 bytes divided into 64 columns. A 32-bit AMX-FMA instruction is shown in red. This instruction reads the first column of X and the first row of Y. It then computes the outer product and adds it to Z with a four-column stride.

The AMX load and store instructions transfer data between memory and the AMX register blocks X, Y, and Z. The register blocks are shown in Fig. 3: X has eight 64-byte columns, Y has eight 64-byte rows, and Z has 64 64-byte columns. A set of AMX data processing instructions computes the outer product of an X column and a Y row, and adds the result to Z with a data-type-dependent stride: every 4th column is used for FP32 instructions and every 8th column is used for FP64 instructions. Additionally, we can control the offsets of the X column, Y row and Z column. In the figure, the corresponding parts of X, Y, and Z for the AMXFMA32 instruction without offsets are highlighted in red. AMX mode must also be enabled and disabled with a special instruction. Other than that, none of the typical accelerator-related boilerplate is required because AMX instructions operate in the same address space as the Firestorm and Icestorm cores.
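The strided accumulation into Z can be sketched as a small functional model in Python. It follows the orientation described for Fig. 3 and is an assumption-laden illustration, not a description of the undocumented instruction encodings; the function name `amx_fma32_model` and the `z_offset` parameter are hypothetical.

```python
Z_COLUMNS = 64    # Z register block: 64 columns of 64 bytes each
LANES_FP32 = 16   # 64 bytes / 4 bytes per FP32 value
STRIDE_FP32 = 4   # every 4th Z column is used in FP32


def amx_fma32_model(x_col, y_row, z, z_offset=0):
    """Add the outer product of x_col and y_row to z with a four-column stride."""
    for j in range(LANES_FP32):
        z_col = z[z_offset + STRIDE_FP32 * j]  # target column in Z
        for i in range(LANES_FP32):
            z_col[i] += x_col[i] * y_row[j]


z = [[0.0] * LANES_FP32 for _ in range(Z_COLUMNS)]
x_col = [float(i + 1) for i in range(LANES_FP32)]
y_row = [float(j + 1) for j in range(LANES_FP32)]
amx_fma32_model(x_col, y_row, z)

# Columns of Z that received a contribution.
touched = [col for col in range(Z_COLUMNS) if any(z[col])]
print(touched)
```

In this model, choosing `z_offset` from 0 to 3 selects one of four interleaved 16x16 FP32 accumulators, which is why Z can hold up to four such blocks at once.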

Assuming a throughput of a single outer product per cycle, we can expect 16x16x2=512 operations in FP32 arithmetic and 8x8x2=128 operations in FP64 arithmetic per cycle. At a frequency of 3.2 GHz, this results in a theoretical peak performance of 1638.4 FP32 GFLOPS and 409.6 FP64 GFLOPS. Since little is known about the AMX coprocessor, we microbenchmarked the AMX floating-point performance by issuing compute instructions with maximum read-after-write distances. This resulted in an observed performance of 1528 FP32 GFLOPS and 382 FP64 GFLOPS, about 93% of the theoretical peak in both cases. More importantly, this is roughly a 15x improvement over the theoretical peak of a single Firestorm core for FP32 and a 7.5x improvement for FP64 math.
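The arithmetic behind these numbers is easy to check. The following sketch recomputes the theoretical peaks under the same assumption of one outer product per cycle at 3.2 GHz and relates the measured values quoted above to them; the function and variable names are illustrative.

```python
FREQ_GHZ = 3.2  # estimated clock frequency used in the text


def amx_peak_gflops(vector_bytes, element_bytes, freq_ghz=FREQ_GHZ):
    """Peak GFLOPS assuming one full outer product per cycle."""
    lanes = vector_bytes // element_bytes  # 16 for FP32, 8 for FP64
    ops_per_cycle = lanes * lanes * 2      # multiply + add per element
    return ops_per_cycle * freq_ghz


peak_fp32 = amx_peak_gflops(64, 4)
peak_fp64 = amx_peak_gflops(64, 8)

# Microbenchmark results quoted in the text.
measured = {"FP32": 1528.0, "FP64": 382.0}
for name, peak in (("FP32", peak_fp32), ("FP64", peak_fp64)):
    fraction = measured[name] / peak
    print(f"{name}: peak {peak:.1f} GFLOPS, measured {fraction:.1%} of peak")
```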

Fig. 4: Floating-point performance of possible FP32 microkernels (left) and FP64 microkernels (right). The K dimension was chosen to be large in the experiments performed. The kernels were benchmarked on a Firestorm core.

Our AMX microkernels operate on accumulator blocks of the operation $C \mathrel{+}= A B^{T}$. Block sizes must be multiples of 16 in FP32 and 8 in FP64. In addition, the blocks must fit into the Z register block. The performance of all supported FP32 and FP64 configurations is shown in Fig. 4. We see that the performance is dominated by the latencies of the AMX instructions. In FP32, the highest performance was obtained for the accumulator block size M=N=32, while in FP64 there are a number of high-performance configurations. The reason for this is the larger FP64 stride, which leaves more room to hide latencies.
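The two constraints on the accumulator blocks can be turned into a short enumeration. The sketch below lists every (M, N) pair that satisfies exactly the conditions stated above (multiples of 16 or 8, and a footprint of at most 4,096 bytes in Z); the set of configurations actually benchmarked in Fig. 4 may differ, and the function name is hypothetical.

```python
Z_BYTES = 4096  # capacity of the Z register block


def accumulator_blocks(element_bytes, vector_bytes=64, z_bytes=Z_BYTES):
    """All (M, N) accumulator block sizes that fit into Z."""
    step = vector_bytes // element_bytes  # 16 for FP32, 8 for FP64
    capacity = z_bytes // element_bytes   # number of elements Z can hold
    sizes = range(step, capacity // step + 1, step)
    return [(m, n) for m in sizes for n in sizes if m * n <= capacity]


fp32_blocks = accumulator_blocks(4)
fp64_blocks = accumulator_blocks(8)
print(len(fp32_blocks), "FP32 blocks:", fp32_blocks)
print(len(fp64_blocks), "FP64 blocks:", fp64_blocks)
```

Under these constraints the largest square FP32 block is 32x32, which fills Z completely and matches the best-performing FP32 configuration reported above.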

Fig. 5: Performance evaluation of the AMX-enhanced just-in-time code generation. The code generator falls back to Neon code for unsupported matrix sizes. The kernels were executed on a Firestorm core.

Fig. 5 shows the sustained performance of the AMX-enhanced JITter for the operation $C \mathrel{+}= A B^{T}$ on M1. Unsupported configurations fall back to the standard Neon JITter. We see that the AMX unit significantly outperforms a Firestorm core for sufficiently large matrices. The highest performance was achieved at M=N=K=256 with a peak utilization of over 80%, corresponding to 1348 FP32 GFLOPS (14.9x over a Firestorm core) and 357 FP64 GFLOPS (8.2x over a Firestorm core).

Scalable Matrix Extension (SME)#

In mid-2021, Arm announced the first technical details of its upcoming Scalable Matrix Extension (SME). SME is based on an outer-product engine and its instructions are available as part of the Arm A-profile A64 Instruction Set Architecture.

At its core, SME is very similar to Apple’s AMX and programming it is like meeting an old friend. SME outer-product instructions can only be executed in a special mode called “Streaming SVE mode and SME architectural state”. Just as AMX mode has to be enabled first, the SMSTART instruction must be executed before any SME computations can be performed. Once in streaming mode, the scalable vector registers act as the source registers for the outer-product instructions (similar to the X and Y register blocks in AMX). In addition, SME introduces a new matrix array ZA. The role of ZA is very similar to that of the Z register block in Apple AMX. Direct loads to ZA can be done with LDR (array vector) and direct stores from ZA with STR (array vector). The floating-point outer-product instructions are called FMOPA and take the place of, for example, the AMXFMA32 instruction. When all computations are finished, we can exit the special mode by executing SMSTOP.

Note that SME and especially SME2 are much more powerful than the AMX version in M1. For example, we can use SME to execute outer products with a variety of data types, or use predication to mask inactive elements. The latter in particular makes it easy to write SME GEMMs with arbitrarily shaped microkernels.
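The effect of predication on an outer product can be illustrated with a functional model in Python. This is a sketch of the semantics only (an element of the tile is updated when both its row predicate and its column predicate are active), not SME code; the function name `fmopa_model` and the vector length of four FP32 elements are assumptions chosen to keep the example small.

```python
def fmopa_model(za, pn, pm, zn, zm):
    """Predicated outer product: za[i][j] += zn[i] * zm[j] where pn[i] and pm[j]."""
    for i, row_active in enumerate(pn):
        if not row_active:
            continue
        for j, col_active in enumerate(pm):
            if col_active:
                za[i][j] += zn[i] * zm[j]


VL = 4  # hypothetical vector length in FP32 elements
za = [[0.0] * VL for _ in range(VL)]
zn = [1.0, 2.0, 3.0, 4.0]
zm = [10.0, 20.0, 30.0, 40.0]
pn = [True, True, True, False]   # three active rows
pm = [True, True, False, False]  # two active columns

fmopa_model(za, pn, pm, zn, zm)
for row in za:
    print(row)
```

With these predicates only a 3x2 corner of the tile is updated, which is how a microkernel can handle remainder blocks whose shape is not a multiple of the vector length.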