Tensor Processing Primitives#
Small general matrix-matrix multiplications (GEMMs) are among the most important primitives for tensor computations. We now discuss the implementation of a just-in-time code generator that produces optimized kernels tailored to the M4. Before automating the process for a wide range of primitive configurations, e.g., the matrix sizes of a GEMM, it is advisable to write a few standalone assembly kernels for different matrix shapes to get a feel for the hardware and the ISA. Once this is done, we can move on and implement a more general version in a JITter. The just-in-time code generator we are aiming for is based on the LIBXSMM library.
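Before diving into the generated assembly, it may help to see how such a JITted kernel is obtained from user code. The following sketch uses LIBXSMM's classic single-precision dispatch interface (libxsmm_smmdispatch); newer versions of the library expose a different dispatch API, so treat the exact function names and signature as an assumption rather than a prescription:

```c
#include <stdio.h>
#include <libxsmm.h>

int main(void) {
  /* Request a JITted FP32 kernel for a small GEMM (column-major storage,
   * alpha = beta = 1, i.e., C += A * B); LIBXSMM generates and caches the
   * machine code at runtime. */
  const libxsmm_blasint m = 32, n = 32, k = 32;
  const float alpha = 1.0f, beta = 1.0f;
  const libxsmm_smmfunction kernel = libxsmm_smmdispatch(
      m, n, k, NULL /* lda = m */, NULL /* ldb = k */, NULL /* ldc = m */,
      &alpha, &beta, NULL /* flags */, NULL /* prefetch */);

  if (kernel) {
    float a[32 * 32] = { 0 }, b[32 * 32] = { 0 }, c[32 * 32] = { 0 };
    kernel(a, b, c);  /* executes the generated kernel: C += A * B */
    printf("kernel executed, c[0] = %f\n", c[0]);
  }
  else {
    fprintf(stderr, "no kernel available for this configuration\n");
  }
  return 0;
}
```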
In this section, we explain how the JITter performs matrix multiplication using the SME unit. We use a blocking strategy that combines different microkernels, and we use SME instructions to transpose blocks of matrix B on the stack before computing the product. We limit the discussion to FP32 arithmetic, as this is the most appropriate precision for the SME unit in the M4 chip.
Microkernel#
A microkernel is the innermost part of our GEMMs. It contains the instructions that do the heavy lifting and is free of control flow. It performs at least one full update of the accumulator block and is called repeatedly in a loop over K that sums the intermediate results. Since we use FP32 as our datatype, four ZA tiles are available for the accumulator block, i.e., for storing a block of matrix C. Each ZA tile holds 16x16 FP32 values, so there are three different ways to build a block of C from the available ZA tiles: 32x32, 16x64, and 64x16. This suggests three possible microkernels, one per block shape.
An example of JITter output for such a microkernel might look like this:
```asm
 1  kloop:
 2    sub   x8, x8, #0x1
 3    ld1w  {z0.s, z1.s}, pn8/z, [x0]
 4    ld1w  {z2.s, z3.s}, pn9/z, [x1]
 5    add   x0, x0, x9
 6    add   x1, x1, x10
 7    fmopa za0.s, p1/m, p0/m, z2.s, z0.s
 8    fmopa za1.s, p1/m, p2/m, z2.s, z1.s
 9    fmopa za2.s, p3/m, p0/m, z3.s, z0.s
10    fmopa za3.s, p3/m, p2/m, z3.s, z1.s
11    cbnz  x8, kloop
```
The example shows a 32x32 microkernel that loops over the entire K dimension.
To update all four ZA tiles, we need four streaming SVE registers. The two registers Z0 and Z1 are used for a partial column of A (line 3), and the two registers Z2 and Z3 are used for a partial row of B (line 4). The corresponding memory addresses are stored in the general-purpose registers X0 and X1.
After the loads, the addresses are updated directly to point to the next column and row (lines 5-6).
Finally, FMOPA instructions are executed to compute the outer products and add them to the corresponding ZA tiles (lines 7-10).
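For reference, the arithmetic performed by this loop can be written as plain C. The following sketch uses illustrative names and element-count strides (the assembly advances X0 and X1 by the byte strides held in X9 and X10); it is meant only to spell out the accumulation, not the exact mapping of elements to ZA tile rows and columns:

```c
/* Scalar reference of the 32x32 microkernel: per iteration of the K loop,
 * one 32-element column of A and one 32-element row of B are combined as a
 * rank-1 update of the 32x32 accumulator block held in the four ZA tiles. */
static void microkernel_32x32_ref(const float *a, const float *b,
                                  float c[32][32], long k,
                                  long stride_a, long stride_b) {
  for (long p = 0; p < k; ++p) {
    const float *a_col = a + p * stride_a;   /* x0 += x9 in the assembly  */
    const float *b_row = b + p * stride_b;   /* x1 += x10 in the assembly */
    for (long i = 0; i < 32; ++i)
      for (long j = 0; j < 32; ++j)
        c[i][j] += a_col[i] * b_row[j];      /* what the four FMOPAs accumulate */
  }
}
```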
The microkernels for the other two options are very similar. The main difference is that the 16x64 and 64x16 kernels require five streaming SVE registers. For example, in the 64x16 case, four registers are required for A and one for B. Thus, the 32x32 kernel is the most efficient in terms of the number of registers required.
Blocking#
Using only one of these microkernels, possibly with additional masking, for a GEMM may not be sufficient, as we may run into a situation where not all tiles are used for accumulation. This can lead to a performance hit of up to 4x, as discussed in the microbenchmarking section.
The following example shows two blocking strategies for an 80x80 matrix C, i.e. M=80 and N=80 for the GEMM:
Fig. 21 Illustration of a GEMM kernel executing 64x16 microkernels with and without masking.#
Fig. 22 Illustration of a GEMM kernel executing 32x32, 16x64, and 64x16 microkernels.#
The left side shows a configuration where the 64x16 microkernel is used without masking for the first 64 rows and with masking for the last 16 rows. We see that only half of the kernels use all the ZA tiles. In fact, the five 16x16 blocks are each accumulated into a single ZA tile. Similar problems would occur if we were using only the 32x32 microkernel or the 16x64 microkernel.
We need to find a good blocking strategy for a given size of matrix C, possibly using more than one of the three microkernels discussed for an instantiation of a GEMM primitive. Such a blocking strategy is shown on the right. Here, four blocks of C are computed by the green 32x32 microkernel. The remaining parts are computed by our 16x64 (brown) and 64x16 (purple) microkernels. The last 16x16 values in the lower right corner of the matrix are computed by the 64x16 microkernel, where, similar to before, the lower 48 rows are masked for simplicity. The advantage of the blocking strategy is that only seven kernels need to be executed instead of ten. In addition, all but one microkernel use all available ZA tiles. We use this heterogeneous blocking in the just-in-time code generator.
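To make the decomposition concrete, the following C sketch enumerates such a heterogeneous plan. The names and the exact remainder handling are illustrative and not LIBXSMM's actual selection logic; the sketch assumes M and N are multiples of 16, as in the 80x80 example:

```c
#include <stdio.h>

/* One planned microkernel invocation: position and shape of its C block. */
typedef struct { long row, col, rows, cols; } block_t;

/* Illustrative blocking for an M x N matrix C (M, N multiples of 16):
 * 32x32 blocks where possible, 16x64 blocks for the bottom remainder,
 * 64x16 blocks (masked if needed) for the right remainder. */
static long plan_blocks(long m, long n, block_t *out) {
  long count = 0;
  const long m_main = (m / 32) * 32, n_main = (n / 32) * 32;

  /* Interior: full 32x32 blocks, each using all four ZA tiles. */
  for (long i = 0; i < m_main; i += 32)
    for (long j = 0; j < n_main; j += 32)
      out[count++] = (block_t){ i, j, 32, 32 };

  /* Bottom remainder rows: 16x64 blocks under the 32x32 region. */
  if (m > m_main)
    for (long j = 0; j < n_main; j += 64)
      out[count++] = (block_t){ m_main, j, m - m_main,
                                (n_main - j < 64) ? n_main - j : 64 };

  /* Right remainder columns, including the lower-right corner:
   * 64x16 blocks, masked in M where fewer than 64 rows remain. */
  if (n > n_main)
    for (long i = 0; i < m; i += 64)
      out[count++] = (block_t){ i, n_main, (m - i < 64) ? m - i : 64,
                                n - n_main };
  return count;
}

int main(void) {
  block_t plan[64];
  const long n_blocks = plan_blocks(80, 80, plan);  /* the example above */
  for (long b = 0; b < n_blocks; ++b)
    printf("C block at (%ld, %ld): %ld x %ld\n",
           plan[b].row, plan[b].col, plan[b].rows, plan[b].cols);
  return 0;
}
```

For M = N = 80, this plan yields exactly the seven blocks of the right-hand figure: four 32x32 blocks, one 16x64 block, one full 64x16 block, and one masked 64x16 block for the lower right corner.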
Performance#
The outer-product engine expects columns of A and rows of B in the registers. Since our contiguous loads from column-major storage deliver columns of B, which are the rows of \(B^T\), the kernel described so far is suitable for computations of the form \(C \mathrel{+}= A \times B^T\).
We used a shell script to benchmark the performance of the GEMMs. The script uses the gemm_kernel executable in LIBXSMM. This sample implementation tests and verifies the GEMM routines generated by the library.
The following figure shows the performance of the benchmarked GEMMs:
Fig. 23 Performance of the GEMM \(C \mathrel{+}= A \times B^T\) for varying matrix sizes.#
The matrix dimension K is kept constant at 512, while the sizes of the M and N dimensions are varied from 1 to 512. Up to a size of 256, the tests use a leading dimension of 256 for A and C; for larger sizes, a leading dimension of 512 is used. We see that kernels that use all ZA tiles and execute only unmasked FMOPA instructions achieve almost 95% of the theoretical peak performance.
Transposing B#
Matrix B must be transposed for FMOPA if it is stored in column-major format and we compute GEMMs of the form \(C \mathrel{+}= A \times B\): the rows of B required by the outer-product engine are then not contiguous in memory. In order to use the same microkernel code generation routines and to avoid unnecessary loads or shifts, we transpose blocks of B on the stack. Instead of transposing the entire matrix B, we transpose up to 64 columns at a time. The transposed block of matrix B is written to the stack and used by the microkernel.
We perform the actual transposition using two different views of the two-dimensional ZA tiles. First, the values are loaded into the ZA tiles using a vertical view, and then transferred to the streaming registers using a horizontal view. This process produces the desired transposition of a block of B.
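The net effect of this data movement can be summarized by a scalar sketch (hypothetical names; the generated code uses vertical ZA loads followed by horizontal reads into the streaming registers, not scalar copies):

```c
/* Scalar sketch of the on-stack transposition of a block of B.
 * B is column-major with leading dimension ldb; up to 64 columns of a
 * K-panel are written transposed into a stack buffer so that the
 * microkernel can read contiguous rows of B. */
static void transpose_b_block(const float *b, long ldb,
                              long k_len, long n_cols,   /* n_cols <= 64 */
                              float *b_stack) {          /* n_cols * k_len */
  for (long j = 0; j < n_cols; ++j)     /* column j of the B block ...      */
    for (long k = 0; k < k_len; ++k)    /* ... becomes row j on the stack   */
      b_stack[j + k * n_cols] = b[k + j * ldb];
}
```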
The following figure shows the performance of different \(C \mathrel{+}= A \times B\) GEMMs generated by LIBXSMM:
Fig. 24 Performance of the GEMM \(C \mathrel{+}= A \times B\) for varying matrix sizes.#
Due to the additional stack transposition, the performance is slightly lower, because no computation takes place during the transpositions. However, under optimal conditions, the kernels achieve up to 87% of the maximum possible performance.