5. Scalable Matrix Extension#

The Scalable Matrix Extension (SME) is designed to accelerate dense linear algebra workloads and is interesting in several ways. First and foremost, it has a set of Matrix Outer Product Accumulate (MOPA) instructions. Compared to “regular” Neon or SVE vector instructions such as FMLA, MOPA instructions perform more operations relative to the number of register bits accessed. Second, SME adds a new two-dimensional ZA array. Tiles of ZA are the destinations of all MOPA instructions. Third, SME requires the use of a special streaming mode to enable its capabilities. In streaming mode, the regular Neon and SVE capabilities are generally not available. Instead, streaming mode provides Streaming SVE (SSVE) for vector processing. Finally, the design of SME has many parallels to the Apple Matrix Extension (AMX) and related patents (US10346163B2, US11042373B2). On May 7, 2024, Apple introduced the fourth generation of its M-Series system-on-a-chip line. Unlike previous generations, M4 supports SME in addition to AMX.

5.1. ZA Array#

Table 5.1.1 Size of the ZA array depending on the Streaming Vector Length (SVL).#
SVL (bits)	SVL (bytes)	SVL (words)	ZA (bits)	ZA (bytes)	ZA (words)
128	16	4	2048	256	64
256	32	8	8192	1024	256
512	64	16	32768	4096	1024
1024	128	32	131072	16384	4096

The size of the two-dimensional ZA array is determined by the Streaming Vector Length (SVL): the array holds SVL⨉SVL bytes, where SVL is given in bytes. For example, suppose the SVL is 512 bits (64 bytes). In this case, we get a ZA array size of 64⨉64 = 4096 bytes, which means that we can store 1024 FP32 values in the ZA array. Table 5.1.1 shows the ZA array sizes for an SVL of 128, 256, 512 and 1024 bits. Apple’s M4 has an SVL of 512 bits.
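The size relation can be checked with a few lines of arithmetic. The following Python sketch recomputes the rows of Table 5.1.1. The function name `za_size` is only an illustrative helper and is not part of any SME toolchain.

```python
def za_size(svl_bits):
    """Return the ZA array size as (bits, bytes, words) for an SVL in bits."""
    svl_bytes = svl_bits // 8
    za_bytes = svl_bytes * svl_bytes  # the ZA array holds SVL x SVL bytes
    za_bits = za_bytes * 8
    za_words = za_bytes // 4          # a word has 4 bytes
    return za_bits, za_bytes, za_words


# Print one line per SVL listed in Table 5.1.1.
for svl_bits in (128, 256, 512, 1024):
    print(svl_bits, za_size(svl_bits))
```

For an SVL of 512 bits, the helper returns 32768 bits, 4096 bytes and 1024 words, which matches the third row of the table.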

../_images/za_array.svg — Fig. 5.1.1 Illustration of the ZA array for an SVL of 64 bytes. The array has 64 rows and 64 columns. It can thus hold a total of 4096 bytes. The rows of the array can be accessed as array vectors.#

Fig. 5.1.1 shows the two-dimensional ZA array for an SVL of 64 bytes. The ZA array consists of 64 rows and 64 columns. We can access the data of the array in different ways. One option is to access it through ZA array vectors. Each ZA array vector corresponds to a row of the ZA array and has a size of SVL bytes. Thus, for the ZA array shown, we have 64 ZA array vectors, each of which has a size of 64 bytes.

5.2. ZA Tiles#

An additional way to access the ZA array is given by the concept of ZA tiles. ZA tiles can have byte, halfword, word, doubleword or quadword elements. The element size determines the number of corresponding ZA tiles. When considering the size of the square ZA tiles, it is convenient to express the SVL in units of the element size. In detail, ${SVL}_{B}$ refers to the SVL in bytes, whereas ${SVL}_{H}$, ${SVL}_{S}$, ${SVL}_{D}$ and ${SVL}_{Q}$ refer to it in halfwords, words, doublewords and quadwords.

Table 5.2.1 Names of the ZA tiles and the number of elements per tile, depending on the element size.#
Element Size (bytes)	Tiles	Elements per Tile
Byte (1)	ZA0.B	${SVL}_{B} \times {SVL}_{B}$
Halfword (2)	ZA0.H, ZA1.H	${SVL}_{H} \times {SVL}_{H}$
Word (4)	ZA0.S, ZA1.S, ZA2.S, ZA3.S	${SVL}_{S} \times {SVL}_{S}$
Doubleword (8)	ZA0.D - ZA7.D	${SVL}_{D} \times {SVL}_{D}$
Quadword (16)	ZA0.Q - ZA15.Q	${SVL}_{Q} \times {SVL}_{Q}$

Table 5.2.1 lists the tile names and the number of elements per tile based on the underlying element size. For example, consider the case of word-sized elements and an SVL of 64 bytes or 16 words. In this case, we can access the 4096 bytes of the ZA array through the four tiles ZA0.S - ZA3.S, each of which holds 256 elements, resulting in a tile size of 1024 bytes.
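The relationships in Table 5.2.1 can be sketched in Python as follows. The helper `tile_layout` is a hypothetical name introduced for illustration. It assumes, as the table suggests, that the number of tiles equals the element size in bytes.

```python
def tile_layout(svl_bytes, elem_bytes):
    """Return (number of tiles, elements per tile, bytes per tile)."""
    num_tiles = elem_bytes            # 1, 2, 4, 8 or 16 tiles (Table 5.2.1)
    dim = svl_bytes // elem_bytes     # SVL expressed in elements, e.g. SVL_S
    elems_per_tile = dim * dim        # tiles are square: dim x dim elements
    return num_tiles, elems_per_tile, elems_per_tile * elem_bytes


# Element sizes: byte, halfword, word, doubleword, quadword (SVL of 64 bytes).
for elem_bytes in (1, 2, 4, 8, 16):
    print(elem_bytes, tile_layout(64, elem_bytes))
```

For every element size, the number of tiles times the bytes per tile gives the full 4096 bytes of the ZA array.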

A ZA tile occupies a subset of the rows in the ZA array. The ID of the first row occupied is the ID of the ZA tile. For example, the first row of ZA2.S is the ZA array row with ID 2. The remaining rows follow with a stride equal to the element size in bytes. In the ZA2.S example, the word-sized elements have a size of 4 bytes. This means that the IDs of the rows that make up ZA2.S are 2, 2+4, 2+8, and so on. In other words, assuming $E_{B}$ is the element size in bytes, $T_{ID}$ is the ID of the tile, and $R_{ID}$ is the ID of a ZA array row occupied by the tile, then $R_{ID} \bmod E_{B} = T_{ID}$.
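As a minimal sketch of this row mapping, the following Python snippet enumerates the ZA array rows that belong to a tile. The function name `tile_rows` is an assumption made for illustration.

```python
def tile_rows(tile_id, elem_bytes, svl_bytes):
    """Return the IDs of the ZA array rows occupied by a tile."""
    # The first row has the tile's ID, the stride is the element size in bytes.
    return list(range(tile_id, svl_bytes, elem_bytes))


# ZA2.S for an SVL of 64 bytes: word-sized elements have 4 bytes.
rows = tile_rows(tile_id=2, elem_bytes=4, svl_bytes=64)
print(rows)
```

For ZA2.S and an SVL of 64 bytes, this yields the sixteen rows 2, 6, 10, ..., 62, each of which satisfies the modulo relation stated above.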

../_images/za0s.svg — Fig. 5.2.1 Illustration of the ZA tile ZA0.S. The left side shows the location of the tile in the ZA array. The right side shows the two-dimensional layout of the tile, which consists of 16 rows, each containing 16 word-size values.#

../_images/za1s.svg — Fig. 5.2.2 Illustration of the ZA tile ZA1.S.#

../_images/za2s.svg — Fig. 5.2.3 Illustration of the ZA tile ZA2.S.#

../_images/za3s.svg — Fig. 5.2.4 Illustration of the ZA tile ZA3.S.#

Fig. 5.2.1, Fig. 5.2.2, Fig. 5.2.3 and Fig. 5.2.4 illustrate the four tiles ZA0.S, ZA1.S, ZA2.S and ZA3.S and their respective locations in the ZA array. We see that each of the four tiles has a size of 16⨉16 words and occupies sixteen rows in the ZA array.

5.3. ZA Tile Slices#

A ZA tile slice is either a row or a column of a ZA tile. Section B1.4 of the Arm Architecture Reference Manual for A-profile architecture uses the term horizontal slice when referring to a row of a ZA tile, and the term vertical slice when referring to a column of a ZA tile.

../_images/za2hs6.svg — Fig. 5.3.1 Illustration of the horizontal ZA tile slice ZA2H.S[6] in the ZA array (left) and the ZA tile ZA2.S (right). The location of ZA2H.S[6] is highlighted in dark gray.#

The location of a horizontal tile slice in the ZA array is straightforward, since every row of a ZA tile maps directly to one row of the ZA array. For example, suppose we are interested in the ZA array location of the horizontal slice ZA2H.S[6] or, in other words, the row with ID 6 of tile ZA2.S. In this case, the row in the ZA array has ID $2 + 6 \cdot 4 = 26$. Fig. 5.3.1 illustrates the location of the slice ZA2H.S[6] in the ZA array and the ZA tile ZA2.S.

../_images/za2vs6.svg — Fig. 5.3.2 Illustration of the vertical ZA tile slice ZA2V.S[6] in the ZA array (left) and the ZA tile ZA2.S (right). The location of ZA2V.S[6] is highlighted in dark gray.#

The location of a vertical tile slice in the ZA array is more complex because one column of a ZA tile partially occupies multiple columns of the ZA array. Fig. 5.3.2 illustrates the situation for the vertical slice ZA2V.S[6]. As shown on the right side of the figure, ZA2V.S[6] is short for the column with ID 6 of ZA tile ZA2.S. Since each element in ZA2.S has four bytes, the vertical slice partially occupies four of the byte-sized ZA array columns. In particular, ZA2V.S[6] occupies the intersection of the ZA array rows that belong to the tile with ZA array columns 24-27. This is also shown on the left side of Fig. 5.3.2.
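The two slice mappings described in this section can be sketched in Python as follows. Both function names are hypothetical helpers. The column indices refer to the byte-sized columns of the ZA array.

```python
def horizontal_slice_row(tile_id, slice_id, elem_bytes):
    """Return the ZA array row ID of a horizontal tile slice."""
    return tile_id + slice_id * elem_bytes


def vertical_slice_location(tile_id, slice_id, elem_bytes, svl_bytes):
    """Return (rows, byte columns) of the ZA array touched by a vertical slice."""
    rows = list(range(tile_id, svl_bytes, elem_bytes))  # all rows of the tile
    first_col = slice_id * elem_bytes                   # first byte column
    cols = list(range(first_col, first_col + elem_bytes))
    return rows, cols


# ZA2H.S[6] and ZA2V.S[6] for an SVL of 64 bytes.
print(horizontal_slice_row(2, 6, 4))
print(vertical_slice_location(2, 6, 4, 64))
```

For ZA2H.S[6], the sketch gives row 26. For ZA2V.S[6], it gives the sixteen rows of ZA2.S together with the byte columns 24 to 27, in agreement with Fig. 5.3.1 and Fig. 5.3.2.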

5.4. Streaming Mode#

AArch64 cores with the FEAT_SME feature or its extension FEAT_SME2 support Streaming SVE (SSVE) mode. In SSVE mode, instructions can access the 32 SVE scalable vector registers Z0-Z31 and the 16 predicate registers P0-P15. The size of the vector and predicate registers in SSVE mode is determined by the SVL. Therefore, a vector register has a size of SVL bytes, and a predicate register has one bit for each byte of a vector register. In addition, access to the ZA storage can be enabled or disabled separately. Apple’s M4 chip is an example of a processor that implements both FEAT_SME and FEAT_SME2. Since the SVL on M4 is 64 bytes, the vector registers have a size of 64 bytes, while the predicate registers have a size of 64 bits.

We can enter SSVE mode and enable ZA storage by using the SMSTART instruction. In our use cases, we typically use SSVE mode and ZA storage together, so we enable both. This is accomplished by simply using SMSTART without any options. Similarly, we can exit SSVE mode and disable ZA storage by using the SMSTOP instruction.

Note that entering SSVE mode can also limit the availability of instruction classes. In particular, many of the Neon instructions are illegal unless FEAT_SME_FA64 is enabled. This means that we must exit SSVE mode before running Neon instructions in such a case.

When entering or exiting SSVE mode, all bits of the SVE registers are set to zero. The lower 128 bits of registers Z0-Z31 overlap with the 128-bit Neon vector registers V0-V31, so a mode switch also clears the Neon registers. This means that as a callee, we must preserve the contents of D8-D15 in addition to the standard requirements discussed in Section 2.6. We can therefore use the Neon template from Listing 4.3.1 to follow the PCS.

Listing 5.4.1 Streaming mode example code illustrating the clearing of SVE register bits by SMSTART and SMSTOP.#

    .text
    .type smode, %function
    .global smode
smode:
    /*
     * Prologue
     */
    stp fp, lr, [sp, #-16]!
    mov fp, sp
    stp  d8,  d9, [sp, #-16]!
    stp d10, d11, [sp, #-16]!
    stp d12, d13, [sp, #-16]!
    stp d14, d15, [sp, #-16]!

    /*
     * Code
     */
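    fmov s0, #16                    // S0 = 16.0f, outside of streaming mode
    smstart sm                      // enter streaming mode, clears SVE registers
    str s0, [x0]                    // stores 0.0f
    // illegal instruction without FEAT_SME_FA64
    // fmla v0.4s, v0.4s, v0.4s
    fmov s0, #16                    // S0 = 16.0f, in streaming mode
    str s0, [x1]                    // stores 16.0f
    smstop sm                       // exit streaming mode, clears SVE registers
    str s0, [x2]                    // stores 0.0f
    fmov s0, #16                    // S0 = 16.0f, outside of streaming mode
    str s0, [x3]                    // stores 16.0f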

    /*
     * Epilogue
     */
    ldp d14, d15, [sp], #16
    ldp d12, d13, [sp], #16
    ldp d10, d11, [sp], #16
    ldp  d8,  d9, [sp], #16
    ldp fp, lr, [sp], #16

    ret

.size smode, .-smode

Listing 5.4.1 shows an example of how SMSTART and SMSTOP clear the SVE register bits. In particular, four cases are shown:

1. The fmov s0, #16 instruction in line 18 sets the value of register S0 to the FP32 value 16.0f outside of streaming mode. Then smstart sm in line 19 enters streaming mode. The instruction also clears all bits of the SVE registers, including the lower 32 bits of Z0 that overlap with S0. Thus, the str s0, [x0] instruction in line 20 is executed in streaming mode and stores the FP32 value 0.0f in memory.
2. The fmov s0, #16 instruction in line 23 is executed in streaming mode and sets register S0 to the FP32 value 16.0f. Then, still in streaming mode, the str s0, [x1] instruction in line 24 stores this value of register S0 in memory.
3. The smstop sm instruction in line 25 exits streaming mode. This also clears all bits of the SVE registers, so the following str s0, [x2] instruction in line 26 stores the FP32 value 0.0f in memory.
4. The two instructions fmov s0, #16 and str s0, [x3] in lines 27 and 28 are executed outside of streaming mode. This sets the FP32 value 16.0f in register S0 and stores the value of register S0 in memory.

Note that the commented FMLA (vector) instruction in line 22 is illegal unless FEAT_SME_FA64 is enabled.
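The four cases can be replayed with a toy model of the register state. The following Python sketch is not an emulator: the class `ToyCore` and the function `run_listing` are hypothetical names, and the model tracks only register S0 and the streaming-mode flag under the assumption, stated above, that every mode switch zeroes the register.

```python
class ToyCore:
    """Toy model of a core with a single register S0 and a streaming flag."""

    def __init__(self):
        self.s0 = 0.0
        self.streaming = False

    def fmov_s0(self, value):
        self.s0 = float(value)

    def smstart_sm(self):
        self.streaming = True
        self.s0 = 0.0  # entering streaming mode clears the SVE registers

    def smstop_sm(self):
        self.streaming = False
        self.s0 = 0.0  # exiting streaming mode clears the SVE registers


def run_listing():
    """Replay the code section of the listing, return the four stored values."""
    core = ToyCore()
    stored = []
    core.fmov_s0(16)
    core.smstart_sm()
    stored.append(core.s0)  # str s0, [x0]
    core.fmov_s0(16)
    stored.append(core.s0)  # str s0, [x1]
    core.smstop_sm()
    stored.append(core.s0)  # str s0, [x2]
    core.fmov_s0(16)
    stored.append(core.s0)  # str s0, [x3]
    return stored


print(run_listing())  # prints [0.0, 16.0, 0.0, 16.0]
```

The printed values correspond to the contents of the memory locations addressed by X0, X1, X2 and X3 after the function returns.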

5.5. Loads and Stores#

We have already discussed loads and stores accessing registers Z0-Z31 in the chapter on SVE, so we will limit our attention to loads and stores accessing the ZA array.

5.5.1. ZA Array Vectors#

The LDR (array vector) and STR (array vector) instructions access the ZA array as array vectors. Their structure differs from that of the loads and stores we have seen as part of the Base, Neon, or SVE instructions. In particular, the instructions use a vector select register, one of W12-W15, that determines the row of the ZA array being accessed.
