5. Scalable Matrix Extension#
The Scalable Matrix Extension (SME) is designed to accelerate dense linear algebra workloads and is interesting in several ways. First and foremost, it has a set of Matrix Outer Product Accumulate (MOPA) instructions. Compared to “regular” Neon or SVE vector instructions such as FMLA, MOPA instructions perform more operations relative to the number of register bits accessed. Second, SME adds a new two-dimensional ZA array. Tiles of ZA are the destinations of all MOPA instructions. Third, SME requires the use of a special streaming mode to enable its capabilities. In streaming mode, many Neon instructions and some SVE instructions are not available. Instead, streaming mode provides Streaming SVE (SSVE) for vector processing. Finally, the design of SME has many parallels to the Apple Matrix Extension (AMX) and related patents (US10346163B2, US11042373B2). On May 7, 2024, Apple introduced the fourth generation of its M-Series system-on-a-chip line. Unlike previous generations, M4 supports SME in addition to AMX.
5.1. ZA Array#
SVL (bits) | SVL (bytes) | SVL (words) | ZA (bits) | ZA (bytes) | ZA (words) |
---|---|---|---|---|---|
128 | 16 | 4 | 2048 | 256 | 64 |
256 | 32 | 8 | 8192 | 1024 | 256 |
512 | 64 | 16 | 32768 | 4096 | 1024 |
1024 | 128 | 32 | 131072 | 16384 | 4096 |
The size of the two-dimensional ZA array is determined by the Streaming Vector Length (SVL) in bytes: SVL⨉SVL. For example, suppose the SVL is 512 bits (64 bytes). In this case, we get a ZA array size of 4096 bytes, which means that we can store 1024 FP32 values in the ZA array. Table 5.1.1 shows the ZA array sizes for an SVL of 128, 256, 512 and 1024 bits. Apple’s M4 has an SVL of 512 bits.
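The arithmetic behind the table can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name za_array_sizes and the dictionary keys are made up for this example and are not part of any SME tooling.

```python
def za_array_sizes(svl_bits):
    """Return SVL and ZA array sizes in bits, bytes and 32-bit words."""
    svl_bytes = svl_bits // 8
    # The ZA array has SVL_B rows, each of which holds SVL_B bytes.
    za_bytes = svl_bytes * svl_bytes
    return {
        "svl_bits": svl_bits,
        "svl_bytes": svl_bytes,
        "svl_words": svl_bytes // 4,
        "za_bits": za_bytes * 8,
        "za_bytes": za_bytes,
        "za_words": za_bytes // 4,
    }


# Print one row per SVL value listed in the table.
for svl_bits in (128, 256, 512, 1024):
    print(za_array_sizes(svl_bits))
```

For an SVL of 512 bits, the sketch yields 4096 bytes or 1024 words, matching the example in the text.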
Fig. 5.1.1 Illustration of the ZA array for an SVL of 64 bytes. The array has 64 rows and 64 columns. It can thus hold a total of 4096 bytes. The rows of the array can be accessed as array vectors.#
Fig. 5.1.1 shows the two-dimensional ZA array for an SVL of 64 bytes. The ZA array consists of 64 rows and 64 columns. We can access the data of the array in different ways. One option is to access it through ZA array vectors. Each ZA array vector corresponds to a row of the ZA array and has a size of SVL bytes. Thus, for the shown ZA array, we have 64 ZA array vectors, each of which has a size of 64 bytes.
5.2. ZA Tiles#
ZA tiles provide an additional way to access the ZA array. ZA tiles can have byte, halfword, word, doubleword or quadword elements. The element size determines the number of corresponding ZA tiles. When considering the size of square ZA tiles, it is convenient to express the SVL in units of the element size. In detail, \(\text{SVL}_\text{B}\) refers to the SVL in bytes, whereas \(\text{SVL}_\text{H}\), \(\text{SVL}_\text{S}\), \(\text{SVL}_\text{D}\) and \(\text{SVL}_\text{Q}\) refer to it in halfwords, words, doublewords and quadwords.
Element Size (bytes) | Tiles | Elements per Tile |
---|---|---|
Byte (1) | ZA0.B | \(\text{SVL}_\text{B} \times \text{SVL}_\text{B}\) |
Halfword (2) | ZA0.H, ZA1.H | \(\text{SVL}_\text{H} \times \text{SVL}_\text{H}\) |
Word (4) | ZA0.S, ZA1.S, ZA2.S, ZA3.S | \(\text{SVL}_\text{S} \times \text{SVL}_\text{S}\) |
Doubleword (8) | ZA0.D - ZA7.D | \(\text{SVL}_\text{D} \times \text{SVL}_\text{D}\) |
Quadword (16) | ZA0.Q - ZA15.Q | \(\text{SVL}_\text{Q} \times \text{SVL}_\text{Q}\) |
Table 5.2.1 lists the tile names and the number of elements per tile based on the underlying element size. For example, consider the case of word-sized elements and an SVL of 64 bytes or 16 words. In this case, we can access the 4096 bytes of the ZA array through the four tiles ZA0.S - ZA3.S, each of which holds 256 elements, resulting in a tile size of 1024 bytes.
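As a quick cross-check of these numbers, the following Python sketch derives the tile names, the elements per tile and the tile size from the SVL and the element suffix. The helper tile_info and the ELEMENT_BYTES mapping are hypothetical names introduced only for this illustration.

```python
# Element size in bytes for each tile suffix.
ELEMENT_BYTES = {"B": 1, "H": 2, "S": 4, "D": 8, "Q": 16}


def tile_info(svl_bytes, suffix):
    """Return (tile names, elements per tile, bytes per tile)."""
    element_bytes = ELEMENT_BYTES[suffix]
    # The number of tiles equals the element size in bytes.
    names = [f"ZA{t}.{suffix}" for t in range(element_bytes)]
    # A tile is square: SVL expressed in elements, squared.
    dim = svl_bytes // element_bytes
    elements = dim * dim
    return names, elements, elements * element_bytes


names, elements, tile_bytes = tile_info(64, "S")
print(names, elements, tile_bytes)
```

For any element size, the number of tiles times the bytes per tile gives the full ZA array size, here 4 times 1024 bytes for an SVL of 64 bytes.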
A ZA tile occupies a subset of the rows in the ZA array. The ID of the first row occupied is the ID of the ZA tile. For example, the set of rows that make up ZA2.S contains row 2. The other rows follow with a stride given by the number of element bytes. In the ZA2.S example, the word-sized elements have a size of 4 bytes. This means that the IDs of the rows that make up ZA2.S are 2, 2+4, 2+8, and so on. In other words, assuming \(\text{E}_\text{B}\) is the element size in bytes, \(\text{T}_\text{ID}\) is the ID of the tile, and \(\text{R}_\text{ID}\) is the ID of a ZA array row occupied by the tile, then \(\text{R}_\text{ID} \; \text{mod} \; \text{E}_\text{B} = \text{T}_\text{ID}\).
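The row mapping can be written down as a short Python sketch. The function name tile_rows is an assumption made for this example; the code simply enumerates all ZA array rows whose ID leaves the tile ID as remainder when divided by the element size in bytes.

```python
def tile_rows(tile_id, element_bytes, svl_bytes):
    """Return the IDs of the ZA array rows occupied by a tile."""
    if not 0 <= tile_id < element_bytes:
        raise ValueError("tile ID must be smaller than the element size in bytes")
    # First row is the tile ID; the stride is the element size in bytes.
    return list(range(tile_id, svl_bytes, element_bytes))


rows = tile_rows(tile_id=2, element_bytes=4, svl_bytes=64)
print(rows)
```

For ZA2.S and an SVL of 64 bytes, this yields the sixteen rows 2, 6, 10, and so on up to 62.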
Fig. 5.2.1 Illustration of the ZA tile ZA0.S. The left side shows the location of the tile in the ZA array. The right side shows the two-dimensional layout of the tile, which consists of 16 rows, each containing 16 word-size values.#
Fig. 5.2.2 Illustration of the ZA tile ZA1.S.#
Fig. 5.2.3 Illustration of the ZA tile ZA2.S.#
Fig. 5.2.4 Illustration of the ZA tile ZA3.S.#
Fig. 5.2.1, Fig. 5.2.2, Fig. 5.2.3 and Fig. 5.2.4 illustrate the four tiles ZA0.S, ZA1.S, ZA2.S and ZA3.S and their respective locations in the ZA array. We see that each of the four tiles has a size of 16⨉16 words and occupies sixteen rows in the ZA array.
5.3. ZA Tile Slices#
A ZA tile slice is either a row or a column of a ZA tile. Section B1.4 of the Arm Architecture Reference Manual for A-profile architecture uses the term horizontal slice when referring to a row of a ZA tile, and the term vertical slice when referring to a column of a ZA tile.
Fig. 5.3.1 Illustration of the horizontal ZA tile slice ZA2H.S[6] in the ZA array (left) and the ZA tile ZA2.S (right). The location of ZA2H.S[6] is highlighted in dark gray.#
The location of a horizontal tile slice in the ZA array is straightforward, since every row of a ZA tile maps directly to one row of the ZA array. For example, suppose we are interested in the ZA array location of the horizontal slice ZA2H.S[6] or, in other words, the row with ID 6 of tile ZA2.S. In this case, the row in the ZA array has ID \(2+6 \cdot 4=26\). Fig. 5.3.1 illustrates the location of the slice ZA2H.S[6] in the ZA array and the ZA tile ZA2.S.
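The same computation, written as a small Python sketch. The function horizontal_slice_row and its parameter names are hypothetical and chosen only for this illustration.

```python
def horizontal_slice_row(tile_id, slice_id, element_bytes, svl_bytes=64):
    """Return the ZA array row ID of a horizontal tile slice."""
    slices_per_tile = svl_bytes // element_bytes
    if not 0 <= tile_id < element_bytes:
        raise ValueError("invalid tile ID for this element size")
    if not 0 <= slice_id < slices_per_tile:
        raise ValueError("slice ID exceeds the number of tile rows")
    # Tile rows are interleaved with a stride of element_bytes array rows.
    return tile_id + slice_id * element_bytes


# ZA2H.S[6]: tile 2, slice 6, word-sized (4-byte) elements.
print(horizontal_slice_row(tile_id=2, slice_id=6, element_bytes=4))
```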
Fig. 5.3.2 Illustration of the vertical ZA tile slice ZA2V.S[6] in the ZA array (left) and the ZA tile ZA2.S (right). The location of ZA2V.S[6] is highlighted in dark gray.#
The location of a vertical tile slice in the ZA array is more complex because one column of a ZA tile partially occupies multiple columns of the ZA array. Fig. 5.3.2 illustrates the situation for the vertical slice ZA2V.S[6]. As shown on the right side of the figure, ZA2V.S[6] is short for the column with ID 6 of ZA tile ZA2.S. Since each element in ZA2.S has four bytes, the vertical slice partially occupies four of the byte-sized ZA array columns. In particular, ZA2V.S[6] occupies the intersection of the rows of the ZA tile in the ZA array with ZA array columns 24-27. This is also shown on the left side of Fig. 5.3.2.
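A minimal Python sketch of this mapping follows. The function vertical_slice_location is a made-up helper for illustration; it returns the ZA array rows and the byte-sized ZA array columns whose intersection forms the vertical slice.

```python
def vertical_slice_location(tile_id, slice_id, element_bytes, svl_bytes):
    """Return (rows, byte columns) of a vertical tile slice in the ZA array."""
    if not 0 <= slice_id < svl_bytes // element_bytes:
        raise ValueError("slice ID exceeds the number of tile columns")
    # The slice touches every ZA array row that belongs to the tile.
    rows = list(range(tile_id, svl_bytes, element_bytes))
    # Each element spans element_bytes consecutive byte columns.
    first_col = slice_id * element_bytes
    cols = list(range(first_col, first_col + element_bytes))
    return rows, cols


rows, cols = vertical_slice_location(tile_id=2, slice_id=6, element_bytes=4, svl_bytes=64)
print(rows)
print(cols)
```

For ZA2V.S[6], the byte columns are 24 to 27, and the rows are the sixteen rows of ZA2.S.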
5.4. Streaming Mode#
AArch64 cores with the FEAT_SME feature or its extension FEAT_SME2 support Streaming SVE (SSVE) mode. In SSVE mode, instructions can access the 32 SVE scalable vector registers Z0-Z31 and the 16 predicate registers P0-P15. The size of the vector and predicate registers in SSVE mode is determined by the SVL. A vector register therefore holds \(\text{SVL}_\text{B}\) bytes, and a predicate register holds one bit per vector byte, i.e., \(\text{SVL}_\text{B}\) bits. In addition, access to the ZA storage can be enabled or disabled separately. Apple’s M4 chip implements both FEAT_SME and FEAT_SME2. Since the SVL on M4 is 64 bytes, the vector registers have a size of 64 bytes, while the predicate registers have a size of 64 bits.
We can enter SSVE mode and enable ZA storage by using the SMSTART instruction. Since our use cases typically need SSVE mode and ZA storage together, we usually enable both at once by using SMSTART without any options. Similarly, SMSTOP without any options exits SSVE mode and disables ZA storage. The variants with the SM or ZA option, e.g., SMSTART SM, affect only streaming mode or only ZA storage.
Note that entering SSVE mode can also limit the availability of instruction classes. In particular, many of the Neon instructions are illegal unless FEAT_SME_FA64 is enabled. This means that we must exit SSVE mode before running Neon instructions in such a case.
When entering or exiting SSVE mode, all bits of the SVE registers are set to zero. Because the lower 128 bits of registers Z0-Z31 overlap with the 128-bit Neon vector registers V0-V31, a mode switch also clears the Neon registers. As a callee that switches modes, we must therefore save and restore the contents of D8-D15 in addition to the standard requirements discussed in Section 2.6. We can use the Neon template from Listing 4.3.1 to follow the PCS.
 1     .text
 2     .type smode, %function
 3     .global smode
 4 smode:
 5     /*
 6      * Prologue
 7      */
 8     stp fp, lr, [sp, #-16]!
 9     mov fp, sp
10     stp d8, d9, [sp, #-16]!
11     stp d10, d11, [sp, #-16]!
12     stp d12, d13, [sp, #-16]!
13     stp d14, d15, [sp, #-16]!
14
15     /*
16      * Code
17      */
18     fmov s0, #16
19     smstart sm
20     str s0, [x0]
21     // illegal instruction without FEAT_SME_FA64
22     // fmla v0.4s, v0.4s, v0.4s
23     fmov s0, #16
24     str s0, [x1]
25     smstop sm
26     str s0, [x2]
27     fmov s0, #16
28     str s0, [x3]
29
30     /*
31      * Epilogue
32      */
33     ldp d14, d15, [sp], #16
34     ldp d12, d13, [sp], #16
35     ldp d10, d11, [sp], #16
36     ldp d8, d9, [sp], #16
37     ldp fp, lr, [sp], #16
38
39     ret
40
41     .size smode, .-smode
Listing 5.4.1 shows an example of how SMSTART and SMSTOP clear the SVE register bits. In particular, four cases are shown:
1. The fmov s0, #16 instruction in line 18 sets the value of register S0 to the FP32 value 16.0f outside of streaming mode. Then smstart sm in line 19 enters streaming mode. The instruction also clears all bits of the SVE registers, including the lower 32 bits of Z0 that overlap with S0. Thus, the str s0, [x0] instruction in line 20 is executed in streaming mode and stores the FP32 value 0.0f in memory.
2. The fmov s0, #16 instruction in line 23 is executed in streaming mode and sets register S0 to the FP32 value 16.0f. Then, still in streaming mode, the str s0, [x1] instruction in line 24 stores this value of register S0 in memory.
3. The smstop sm instruction in line 25 exits streaming mode. This also clears all bits of the SVE registers, so the following str s0, [x2] instruction in line 26 stores the FP32 value 0.0f in memory.
4. The two instructions fmov s0, #16 and str s0, [x3] in lines 27 and 28 are executed outside of streaming mode. This sets the FP32 value 16.0f in register S0 and stores the value of register S0 in memory.
Note that the commented FMLA (vector) instruction in line 22 is illegal unless FEAT_SME_FA64 is enabled.
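The four cases can be traced with a toy Python model. This is a deliberately simplified sketch and not an emulator: the class ToyCore and its methods are invented for this illustration, it tracks only register S0 and the streaming-mode flag, and the only behavior it encodes is that entering or exiting streaming mode zeroes the register.

```python
class ToyCore:
    """Toy model that tracks only S0 and the streaming-mode flag."""

    def __init__(self):
        self.s0 = 0.0
        self.streaming = False

    def fmov_s0(self, value):
        self.s0 = float(value)

    def smstart_sm(self):
        # Entering streaming mode zeroes Z0-Z31 and therefore S0.
        self.streaming = True
        self.s0 = 0.0

    def smstop_sm(self):
        # Exiting streaming mode zeroes Z0-Z31 and therefore S0.
        self.streaming = False
        self.s0 = 0.0

    def str_s0(self, memory):
        memory.append(self.s0)


def run_listing():
    """Replay the code part of the listing; return values stored via x0-x3."""
    core, memory = ToyCore(), []
    core.fmov_s0(16)      # line 18
    core.smstart_sm()     # line 19
    core.str_s0(memory)   # line 20, [x0]
    core.fmov_s0(16)      # line 23
    core.str_s0(memory)   # line 24, [x1]
    core.smstop_sm()      # line 25
    core.str_s0(memory)   # line 26, [x2]
    core.fmov_s0(16)      # line 27
    core.str_s0(memory)   # line 28, [x3]
    return memory


print(run_listing())
```

In this model, the values stored through x0 and x2 are zero, while those stored through x1 and x3 are 16.0, matching the four cases above.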
5.5. Loads and Stores#
We have already discussed loads and stores accessing registers Z0-Z31, so we will limit our attention to loads and stores accessing the ZA array.
5.5.1. ZA Array Vectors#
The LDR (array vector) and STR (array vector) instructions access the ZA array as array vectors. The structure is different from the loads and stores we have seen as part of the Base, Neon, or SVE instructions. In particular, the instructions use a vector select register W12-W15 that determines the row of the ZA array being accessed.