The Scalable Matrix Extension (SME) made its debut in the M4 system-on-a-chip in the 2024 iPad Pro. Since then, more products with SME support have become available. Following our initial SME sprint, we upstreamed an SME code generator for tensor processing primitives to the LIBXSMM library.

Small matrix-matrix multiplications are one of the supported primitives, and the code generation can be tested with a few commands:

# fetch LIBXSMM and build the library (BLAS=0: no BLAS fallback is linked)
git clone https://github.com/libxsmm/libxsmm.git
cd libxsmm; make -j BLAS=0
# build the GEMM sample drivers
cd samples/xgemm; make -j
# JIT-compile and benchmark an FP32 kernel with M = N = K = 512 and
# leading dimensions of 512; "nopf" disables prefetching, "nobr" disables
# batch-reduce GEMM, and the kernel is timed over 10000 repetitions
./gemm_kernel F32 F32 F32 F32 512 512 512 512 512 512 \
              1 1 0 0 0 1 0 0 0 nopf nobr 0 1 10000 0

On a 2024 Mac mini with an M4, this results in a performance of about 1833 GFLOPS in FP32 arithmetic:

------------------------------------------------
RUNNING (512x512) X (512x512)^T = (512x512)
a:F32, b:F32, comp:F32, c:F32, BR=1
------------------------------------------------
[...]
1.464460s for libxsmm
1832.998967 GFLOPS for libxsmm
max. error: 0.000000
------------------------------------------------
[...]
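The reported rate is consistent with the benchmark parameters: a single 512x512x512 GEMM performs 2 × 512³ ≈ 2.68 × 10⁸ floating-point operations, so 10000 repetitions completing in 1.464460 s correspond to 2 × 512³ × 10000 / 1.464460 s ≈ 1833 GFLOPS.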
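The sample driver goes through LIBXSMM's JIT dispatch machinery, which can also be used directly from C. The following is a minimal sketch using the classic libxsmm_smmdispatch interface; newer LIBXSMM versions additionally expose a descriptor-based dispatch API, so the exact entry points may differ in the version you build:

/* sketch: JIT-dispatch and call a 512x512x512 FP32 GEMM kernel
 * (classic LIBXSMM API; entry points may differ in newer versions) */
#include <libxsmm.h>
#include <stdlib.h>

int main(void) {
  const libxsmm_blasint m = 512, n = 512, k = 512;
  const float alpha = 1.0f, beta = 1.0f;
  /* column-major buffers; zero-initialized so the call is well-defined */
  float *const a = calloc((size_t)m * k, sizeof(float));
  float *const b = calloc((size_t)k * n, sizeof(float));
  float *const c = calloc((size_t)m * n, sizeof(float));

  libxsmm_init();
  /* JIT-compile (or fetch from the code cache) a kernel for
   * C = alpha * A * B + beta * C; NULL leading dimensions and
   * flags select the defaults */
  const libxsmm_smmfunction kernel = libxsmm_smmdispatch(m, n, k,
    NULL /*lda*/, NULL /*ldb*/, NULL /*ldc*/, &alpha, &beta,
    NULL /*flags*/, NULL /*prefetch*/);
  if (NULL != kernel && NULL != a && NULL != b && NULL != c) {
    kernel(a, b, c); /* one kernel invocation */
  }
  libxsmm_finalize();
  free(a); free(b); free(c);
  return 0;
}

Dispatching is paid once per shape; the returned function pointer can be cached and invoked many times, which is exactly what the benchmark above does over its 10000 repetitions. When linking against a library built with BLAS=0, -lxsmmnoblas is typically needed in addition to -lxsmm.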

For more details, see our SME web page, our SME paper, and our paper on tensor processing primitives.