4. Neon#
Instruction set architectures have dedicated registers and instructions for vector and matrix processing. In the case of AArch64, Neon is required in all AArch64-compliant processors and is also called Advanced SIMD (ASIMD). It supports scalar floating-point operations and vector operations on vector registers with up to 128 bits. Depending on the microarchitecture (see Section 1.1), we can also use the Scalable Vector Extension (SVE) for vector processing or the Scalable Matrix Extension (SME) for matrix processing. The vector and matrix processing instructions are used in conjunction with the base instructions introduced in Section 3. Here, the base instructions are used for address calculation, branching, and conditional code execution, while the vector and matrix instructions do the actual heavy lifting in terms of computation. This chapter discusses Neon’s vector instructions.
Note
Neon was extended with some matrix processing capabilities in 2019. In this book, we will not discuss these instructions and will instead limit our matrix processing discussions to SME. However, if you are interested in Neon’s matrix instructions, the BFMMLA instruction is a good place to start.
Theoretically, we could build the entire tensor compiler using only Neon. However, in most cases, using the more advanced SVE and SME, if available, will improve performance. There are two ways to proceed:
1. Study Neon only, i.e., skip everything about SVE and SME and continue to the end of the book. Then, when everything works, go back and add SVE and SME support to the compiler.
2. Study Neon, SVE, and SME before continuing. Then write a unified tensor compiler that can generate code for all three options.
If you have the patience to stick with the ISA for a bit longer, and have access to SVE and SME hardware, the second option is recommended. The rationale behind this recommendation is that you will have a better awareness of more powerful instructions when making design decisions in the code generator.
This section follows the structure of Section 2 and Section 3. That is, we introduce the SIMD and floating-point registers, discuss load and store instructions, and finally introduce data processing instructions.
4.1. Registers#
Neon has thirty-two 128-bit SIMD and floating-point registers that are visible to the A64 instruction set. The registers are architecturally named V0 to V31.
Fig. 4.1.1 Illustration of the thirty-two 128-bit Neon registers V0-V31 visible to the A64 instruction set, the floating-point control register (FPCR), and the floating-point status register (FPSR).#
As shown in Fig. 4.1.1, the registers can be accessed as:
8-bit registers: B0 to B31.
16-bit registers: H0 to H31.
32-bit registers: S0 to S31.
64-bit registers: D0 to D31.
128-bit registers: Q0 to Q31.
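As a hedged illustration (the instruction choices here are ours, not from the text), the following sketch writes through different views; all views of a register Vn alias the same physical register:

```asm
fmov s0, wzr        // zero the 32-bit view S0 (also zeroes the upper bits of V0)
fmov d1, #2.0       // write the constant 2.0 into the 64-bit view D1
mov  b2, v3.b[0]    // copy lane 0 of V3 into the 8-bit view B2
```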
In addition, we can use the registers as 128-bit vectors of elements or 64-bit vectors of elements. We will discuss this view of the registers in Section 4.2. Neon also has special-purpose registers, two of which are also shown in Fig. 4.1.1:
- Floating-point Control Register (FPCR)
Controls floating-point behavior. For example, we could enable/disable NaNs, set rounding modes, or enable/disable flushing of denormalized numbers to zero.
- Floating-point Status Register (FPSR)
Provides floating-point status information. For example, exception bits in the register are set when a division by zero or saturation occurs.
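Both registers are accessed through the system-register instructions MRS and MSR. A minimal sketch (the general-purpose register choices are arbitrary):

```asm
mrs x0, fpcr    // read the floating-point control register into X0
msr fpcr, x0    // write X0 back to the control register (unchanged here)
mrs x1, fpsr    // read the floating-point status register into X1
```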
4.2. Arrangement Specifiers#
Many Neon loads and stores, as well as all data-processing instructions, use arrangement specifiers.
An arrangement specifier is a suffix of the form .&lt;N&gt;&lt;T&gt; used when referring to a register. This suffix encodes the number of lanes and the lane width each instruction operates on. Thus, arrangement specifiers determine how to partition a register’s 64- or 128-bit view into lanes.
| Specifier | Vector Width (bits) | Number of Lanes | Lane Width (bits) |
|---|---|---|---|
| 2D | 128 | 2 | 64 |
| 4S | 128 | 4 | 32 |
| 8H | 128 | 8 | 16 |
| 16B | 128 | 16 | 8 |
| 1D | 64 | 1 | 64 |
| 2S | 64 | 2 | 32 |
| 4H | 64 | 4 | 16 |
| 8B | 64 | 8 | 8 |
Table 4.2.1 shows the arrangement specifiers available in Neon.
In an instruction, we apply these specifiers to the vector registers introduced in Section 4.1.
For example, V17.4S means that the instruction treats register Q17 as a vector containing four 32-bit values.
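As a hedged illustration of how the specifiers appear in practice (the FADD instruction itself belongs to the data-processing instructions discussed later), the same operation can be applied to different arrangements:

```asm
fadd v0.4s, v1.4s, v2.4s    // four lane-wise 32-bit additions (128-bit view)
fadd v0.2s, v1.2s, v2.2s    // two lane-wise 32-bit additions (64-bit view)
fadd v0.2d, v1.2d, v2.2d    // two lane-wise 64-bit additions (128-bit view)
```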
4.3. Procedure Call Standard#
The procedure call standard defines the role of the Neon registers in function calls. V0-V7 are used to pass values into a function and to return values. Registers V8-V31 are scratch registers, where V8-V15 are callee-saved and V16-V31 are caller-saved. Unlike the GPRs, we do not have to preserve the entire contents of the callee-saved Neon registers. Instead, only the lower 64 bits of V8-V15 need to be preserved, i.e., the values in D8-D15.
        .text
        .type pcs, %function
        .global pcs
pcs:
        // save frame pointer and link register
        stp fp, lr, [sp, #-16]!
        // update frame pointer to current stack pointer
        mov fp, sp

        // save callee-saved registers
        stp x19, x20, [sp, #-16]!
        stp x21, x22, [sp, #-16]!
        stp x23, x24, [sp, #-16]!
        stp x25, x26, [sp, #-16]!
        stp x27, x28, [sp, #-16]!

        stp d8, d9, [sp, #-16]!
        stp d10, d11, [sp, #-16]!
        stp d12, d13, [sp, #-16]!
        stp d14, d15, [sp, #-16]!

        // use registers as needed

        // restore callee-saved registers
        ldp d14, d15, [sp], #16
        ldp d12, d13, [sp], #16
        ldp d10, d11, [sp], #16
        ldp d8, d9, [sp], #16

        ldp x27, x28, [sp], #16
        ldp x25, x26, [sp], #16
        ldp x23, x24, [sp], #16
        ldp x21, x22, [sp], #16
        ldp x19, x20, [sp], #16

        // restore frame pointer and link register
        ldp fp, lr, [sp], #16

        ret
Listing 4.3.1 shows an updated version of the template we originally introduced in Section 2.6. Now, in addition to X19-X30, we temporarily store the contents of D8-D15 on the stack. Of course, we can eliminate the corresponding stack transfers if the lower 64 bits of a register in V8-V15 are not modified.
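For example, if a function clobbers only V8 among the callee-saved registers, a minimal sketch of the required save and restore is:

```asm
str d8, [sp, #-16]!    // preserve the lower 64 bits of V8 (keeps 16-byte stack alignment)
// ... code that overwrites v8 ...
ldr d8, [sp], #16      // restore D8
ret
```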
4.4. Loads and Stores#
As with the base instructions, a group of instructions allows us to transfer data between memory and the SIMD&FP registers.
The LDR (immediate, SIMD&FP) and STR (immediate, SIMD&FP) instructions work similarly to LDR (immediate) and STR (immediate) of the base instructions discussed. However, we can now use the B, H, S, D, or Q view on the SIMD&FP registers. Similarly, LDP (SIMD&FP) and STP (SIMD&FP) allow us to transfer data between memory and two SIMD&FP registers. We give high-level descriptions for some example instructions:
ldr d5, [x0]
    Load 64 bits (double word) from memory into register D5. In memory, the data is located at the 64-bit address held in register X0.

ldr q1, [x3]
    Load 128 bits (quad word) from memory into register Q1. In memory, the data is located at the 64-bit address held in register X3.

str h1, [x3, #32]
    Store 16 bits (half word) from register H1 into memory. The memory address is calculated by adding offset 32 to the value in register X3.

ldp q3, q8, [x2]
    Load 2x128 bits from memory into registers Q3 and Q8. In memory, the data is at the 64-bit address held in register X2.
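Combining these instructions, a hedged sketch (register choices are ours) that copies 32 bytes from the address in X0 to the address in X1 might look as follows:

```asm
ldp q0, q1, [x0]    // load 2x128 bits from the address in X0
stp q0, q1, [x1]    // store 2x128 bits to the address in X1
```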
A particularly interesting pair of load and store instructions in Neon are LD1 (multiple structures) and ST1 (multiple structures).
LD1 (multiple structures) allows us to load data from memory into up to four consecutive SIMD&FP registers, while ST1 (multiple structures) allows us to store data from up to four consecutive registers into memory.
The term “consecutive” means that if the first register has the ID Vt, then the following registers must have the IDs (Vt+1)%32, (Vt+2)%32, and (Vt+3)%32.
Again, we provide high-level descriptions for some examples:
ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [x0]
    Load 4x4x32 bits (512 bits total) from memory into registers V0, V1, V2 and V3. In memory, the data is located at the 64-bit address held in register X0.

st1 {v31.2d, v0.2d, v1.2d, v2.2d}, [x3], #64
    Store 4x2x64 bits (512 bits total) from registers V31, V0, V1, and V2 into memory. The memory address is held in register X3. In addition, the value of register X3 is incremented by 64.
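The post-index form makes LD1 and ST1 convenient for streaming through memory. As a hedged sketch (the register roles are our assumptions, not from the text), a loop that copies X2 blocks of 64 bytes from the address in X0 to the address in X1 could look like this:

```asm
copy_loop:
    ld1  {v0.4s, v1.4s, v2.4s, v3.4s}, [x0], #64    // load 64 bytes, advance source pointer
    st1  {v0.4s, v1.4s, v2.4s, v3.4s}, [x1], #64    // store 64 bytes, advance destination pointer
    subs x2, x2, #1                                  // decrement block counter, set flags
    b.ne copy_loop                                   // repeat until the counter reaches zero
```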