Overview

On May 7, 2024, Apple announced the M4 chip at its “Let Loose” event. M4 is the first publicly available silicon supporting Arm’s Scalable Matrix Extension (SME). SME has been eagerly awaited by the HPC community for quite some time, and this page is dedicated to information about M4’s SME support.

One last thing: Hello SME!

LIBXSMM

Initially, this web page described only some early SME microbenchmarks on M4. Meanwhile, we have upstreamed SME capabilities to the just-in-time code generation of tensor processing primitives in the open-source library LIBXSMM. It’s just a few terminal commands if you want to try out the code:

 1  git clone https://github.com/libxsmm/libxsmm.git
 2  cd libxsmm
 3  make -j BLAS=0
 4
 5  cd samples/xgemm
 6  make -j
 7
 8  ./gemm_kernel F32 F32 F32 F32 512 512 512 512 512 512 1 1 0 0 0 0 0 0 0 nopf nobr 0 1 10000 0
 9  ./gemm_kernel F32 F32 F32 F32 512 512 512 512 512 512 1 1 0 0 0 1 0 0 0 nopf nobr 0 1 10000 0
10  # for other settings just run ./gemm_kernel

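Before looking at the numbers, it can be useful to confirm that the machine actually exposes SME. On macOS the kernel reports CPU capabilities via sysctl; a quick check along the following lines should work on an M4 machine (the exact FEAT_* key names are an assumption on our part and may differ between macOS versions):

# list the Arm feature flags reported by the kernel and look for the SME entries
# (on an M4 Mac, keys such as hw.optional.arm.FEAT_SME are expected to report 1)
sysctl hw.optional.arm | grep -i sme
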
The two examples in lines 8 and 9 execute GEMMs with all matrix dimensions set to 512. The first example (line 8) computes C += A × B and reaches about 1755 FP32 GFLOPS on M4:

------------------------------------------------
RUNNING (512x512) X (512x512) = (512x512)
a:F32, b:F32, comp:F32, c:F32, BR=1
------------------------------------------------
function pointer address: 10079c000
0.000071s for creating jit

Printing Norms:
L1 reference  : 487176.6939176287269219756
L1 test       : 487176.6939176287269219756
L2 abs.error  : 0.000000000000000000000000
L2 rel.error  : 0.000000000000000000000000
Linf abs.error: 0.000000000000000000000000
Linf rel.error: 0.000000000000000000000000
Check-norm    : 0.000000000000000000000000

1.529381s for libxsmm
1755.189780 GFLOPS for libxsmm
max. error: 0.000000
------------------------------------------------


Total Max Error 0.000000

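The reported rate is easy to sanity-check: one 512×512×512 GEMM performs 2 * 512^3 floating-point operations, and the invocation above requests 10000 repetitions (assuming the trailing 10000 argument is the repetition count), so dividing by the measured 1.529381 s reproduces the printed number:

# 2*M*N*K FLOP per GEMM, 10000 repetitions, 1.529381 s total runtime
echo "scale=3; 2 * 512^3 * 10000 / 1.529381 / 10^9" | bc -l
# prints roughly 1755.189, matching the GFLOPS value above
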
The second example (line 9), whose invocation differs from line 8 only in a single flag, computes C += A × B^T and reaches about 1833 FP32 GFLOPS on M4:

------------------------------------------------
RUNNING (512x512) X (512x512)^T = (512x512)
a:F32, b:F32, comp:F32, c:F32, BR=1
------------------------------------------------
function pointer address: 100aec000
0.000063s for creating jit

Printing Norms:
L1 reference  : 486134.3142449855804443359
L1 test       : 486134.3142449855804443359
L2 abs.error  : 0.000000000000000000000000
L2 rel.error  : 0.000000000000000000000000
Linf abs.error: 0.000000000000000000000000
Linf rel.error: 0.000000000000000000000000
Check-norm    : 0.000000000000000000000000

1.463767s for libxsmm
1833.867522 GFLOPS for libxsmm
max. error: 0.000000
------------------------------------------------


Total Max Error 0.000000

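Under the same assumption about the repetition count, the arithmetic also reproduces the figure for the transposed case:

echo "scale=3; 2 * 512^3 * 10000 / 1.463767 / 10^9" | bc -l
# prints roughly 1833.867 GFLOPS
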
Further Information

We made the initial benchmarking code available on GitHub. The goal of this effort was to move quickly, so some bugs may remain; if you find any, please let us know via Matrix or by submitting an issue. We have also published the paper Hello SME! Generating Fast Matrix Multiplication Kernels Using the Scalable Matrix Extension, which describes some of the benchmarks performed as well as our just-in-time code generation of matrix kernels.

Here is a small collection of links related to M4 and SME: