Overview

On May 7, 2024, Apple announced the M4 chip at its “Let Loose” event. M4 is the first publicly available silicon supporting Arm’s Scalable Matrix Extension (SME). SME has been eagerly awaited by the HPC community for quite some time, and this page is dedicated to information about M4’s SME support.

One last thing: Hello SME!

LIBXSMM

Initially, this web page described only some early SME microbenchmarks on M4. Meanwhile, we have upstreamed SME capabilities to the just-in-time code generation of tensor processing primitives in the open-source library LIBXSMM. It’s just a few terminal commands if you want to try out the code:

 1  git clone https://github.com/libxsmm/libxsmm.git
 2  cd libxsmm
 3  make -j BLAS=0
 4
 5  cd samples/xgemm
 6  make -j
 7
 8  ./gemm_kernel F32 F32 F32 F32 512 512 512 512 512 512 1 1 0 0 0 0 0 0 0 nopf nobr 0 1 10000 0
 9  ./gemm_kernel F32 F32 F32 F32 512 512 512 512 512 512 1 1 0 0 0 1 0 0 0 nopf nobr 0 1 10000 0
10  # for other settings just run ./gemm_kernel

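Before looking at the numbers, it can be useful to confirm that the machine actually exposes SME. On macOS the kernel reports CPU capabilities via sysctl; a quick check along the following lines should work on an M4 machine (the exact FEAT_* key names are an assumption on our part and may differ between macOS versions):

# list the Arm feature flags reported by the kernel and look for the SME entries
# (on an M4 Mac, keys such as hw.optional.arm.FEAT_SME are expected to report 1)
sysctl hw.optional.arm | grep -i sme
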
The two examples in lines 8 and 9 execute GEMMs with all matrix dimensions set to 512. The first example (line 8) computes C += A × B and reaches about 1755 FP32 GFLOPS on M4:

------------------------------------------------
RUNNING (512x512) X (512x512) = (512x512)
a:F32, b:F32, comp:F32, c:F32, BR=1
------------------------------------------------
function pointer address: 10079c000
0.000071s for creating jit

Printing Norms:
L1 reference  : 487176.6939176287269219756
L1 test       : 487176.6939176287269219756
L2 abs.error  : 0.000000000000000000000000
L2 rel.error  : 0.000000000000000000000000
Linf abs.error: 0.000000000000000000000000
Linf rel.error: 0.000000000000000000000000
Check-norm    : 0.000000000000000000000000

1.529381s for libxsmm
1755.189780 GFLOPS for libxsmm
max. error: 0.000000
------------------------------------------------


Total Max Error 0.000000

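The reported rate is easy to sanity-check: one 512×512×512 GEMM performs 2 * 512^3 floating-point operations, and the invocation above requests 10000 repetitions (assuming the trailing 10000 argument is the repetition count), so dividing by the measured 1.529381 s reproduces the printed number:

# 2*M*N*K FLOP per GEMM, 10000 repetitions, 1.529381 s total runtime
echo "scale=3; 2 * 512^3 * 10000 / 1.529381 / 10^9" | bc -l
# prints roughly 1755.189, matching the GFLOPS value above
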
The second example (line 9), whose invocation differs from line 8 only in a single flag, computes C += A × B^T and reaches about 1833 FP32 GFLOPS on M4:

------------------------------------------------
RUNNING (512x512) X (512x512)^T = (512x512)
a:F32, b:F32, comp:F32, c:F32, BR=1
------------------------------------------------
function pointer address: 100aec000
0.000063s for creating jit

Printing Norms:
L1 reference  : 486134.3142449855804443359
L1 test       : 486134.3142449855804443359
L2 abs.error  : 0.000000000000000000000000
L2 rel.error  : 0.000000000000000000000000
Linf abs.error: 0.000000000000000000000000
Linf rel.error: 0.000000000000000000000000
Check-norm    : 0.000000000000000000000000

1.463767s for libxsmm
1833.867522 GFLOPS for libxsmm
max. error: 0.000000
------------------------------------------------


Total Max Error 0.000000

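Under the same assumption about the repetition count, the arithmetic also reproduces the figure for the transposed case:

echo "scale=3; 2 * 512^3 * 10000 / 1.463767 / 10^9" | bc -l
# prints roughly 1833.867 GFLOPS
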
Further Information

We made the initial benchmarking code available on GitHub. The goal of this effort was to move quickly, so some bugs may remain; if you find any, please let us know via Matrix or by submitting an issue. We have also published the paper Hello SME! Generating Fast Matrix Multiplication Kernels Using the Scalable Matrix Extension, which describes some of the benchmarks performed as well as our just-in-time code generation of matrix kernels.

Here is a small collection of links related to M4 and SME: