3. Base Instructions#

The A64 base instructions form the basis of the A64 instruction set. They are further subdivided into the following instruction groups:

  • Loads and stores associated with general-purpose registers.

  • Data processing (immediate).

  • Data processing (register).

  • Branch, exception generation, and system instructions.

We cover each of the four instruction groups in a separate section by discussing some important examples and related concepts. The details of all base instructions are available in the Arm A-profile A64 Instruction Set Architecture.

3.1. Loads and Stores#

The first group of instructions transfers data between memory and the general-purpose registers. An instruction that transfers data from memory to a register is called a load. Conversely, an instruction that transfers data from a register to memory is called a store.

3.1.1. LDR (immediate)#

We start by discussing the LDR (immediate) instruction in detail. It transfers data from memory into a register. As its name suggests, we can use the instruction in assembly code through the mnemonic ldr. The instruction has encodings for different addressing modes (“post-index”, “pre-index” and “unsigned offset”). Specifically, the unsigned offset class has a 32-bit variant with the encoding LDR <Wt>, [<Xn|SP>{, #<pimm>}] and a 64-bit variant with the encoding LDR <Xt>, [<Xn|SP>{, #<pimm>}].

The A64 ISA describes the assembler symbols as follows:

<Wt>

Is the 32-bit name of the general-purpose register to be transferred, encoded in the “Rt” field.

<Xn|SP>

Is the 64-bit name of the general-purpose base register or stack pointer, encoded in the “Rn” field.

<Xt>

Is the 64-bit name of the general-purpose register to be transferred, encoded in the “Rt” field.

<pimm>

For the “32-bit” variant: is the optional positive immediate byte offset, a multiple of 4 in the range 0 to 16380, defaulting to 0 and encoded in the “imm12” field as <pimm>/4. For the “64-bit” variant: is the optional positive immediate byte offset, a multiple of 8 in the range 0 to 32760, defaulting to 0 and encoded in the “imm12” field as <pimm>/8.

For someone new to assembly language programming, this information is difficult to parse. However, most of it makes sense immediately when looking at some examples. Suppose we use one of the following LDR (immediate) instructions in our assembly program. Then we can give high-level descriptions of the instructions:

ldr w5, [x0]

Load 32 bits (word) from memory into register W5. In memory, the data is located at the 64-bit address held in register X0.

ldr x1, [x3]

Load 64 bits (double word) from memory into register X1. In memory, the data is located at the 64-bit address held in register X3.

ldr x1, [x3, #32]

Load 64 bits (double word) from memory into register X1. In memory, the data is located at the 64-bit address obtained by adding 32 to the value in register X3.

Examples

Knowing the procedure call standard (see Section 2.6) and load instructions, we can write our first simple functions in assembly language.

Listing 3.1.1 Syntax of the function load_store_0.#
int32_t load_store_0( int32_t const * a );

Suppose we want to implement a function load_store_0 with the syntax given in Listing 3.1.1. The function should take a single address as a parameter and load 32 bits of data from that address. The return value should be the 32 bits loaded.

Listing 3.1.2 Implementation of the function load_store_0, which loads 32 bits.#
    .text
    .type load_store_0, %function
    .global load_store_0
load_store_0:
    ldr w0, [x0]
    ret

An example implementation of the load_store_0 function is shown in Listing 3.1.2. According to the PCS, the 64-bit address is passed to the function through register X0. The address is then used in the ldr w0, [x0] instruction, which loads 32 bits from this address into W0. Since W0 is simply a different view of R0, the data in X0 is overwritten. That is, the instruction replaces the lower 32 bits with the loaded data and sets the upper 32 bits to zero. The ret instruction jumps back to the address in the link register. We will discuss the details of RET and other branch instructions at a later point. According to the PCS, we know that the return value of the function is expected by the caller to be in register W0.
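The effect of writing to the W view of a register can be modeled in C. The following is only a sketch for illustration; the helper name write_w is hypothetical and simply mirrors the rule described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: models the 64-bit register Xn after an
 * instruction has written a 32-bit value to its Wn view. */
uint64_t write_w( uint64_t old_x, uint32_t new_w ) {
  (void) old_x;             /* the previous contents do not survive   */
  return (uint64_t) new_w;  /* lower 32 bits: data, upper 32 bits: 0  */
}
```

For example, write_w( 0xffffffffffffffff, 0x12345678 ) yields 0x0000000012345678, regardless of what the register held before.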

Listing 3.1.3 Syntax of the function load_store_1.#
int32_t load_store_1( int32_t const * a );

Now suppose we want to implement the function load_store_1 with the syntax shown in Listing 3.1.3. While the syntax is the same as load_store_0, the goal is to load the data from the address with an additional offset of 16 bytes.

Listing 3.1.4 Implementation of the function load_store_1, which loads 32 bits.#
    .text
    .type load_store_1, %function
    .global load_store_1
load_store_1:
    ldr w0, [x0, #16]
    ret

Listing 3.1.4 shows an example implementation of the load_store_1 function. This time, the instruction ldr w0, [x0, #16] loads data from the address obtained by adding 16 to the value in X0.
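To relate the byte offset to C pointer arithmetic, we can sketch a C reference for the same function. The name load_store_1_ref is an assumption made for this illustration and is not part of the original listings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C reference for load_store_1: a byte offset of 16
 * corresponds to the element index 16 / sizeof(int32_t) = 4. */
int32_t load_store_1_ref( int32_t const * a ) {
  assert( a != NULL );
  return a[4];
}
```

Called with a pointer to an array of at least five 32-bit integers, the function returns the fifth element, just as ldr w0, [x0, #16] would load it.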

3.1.2. STR (immediate)#

Store instructions transfer data from the registers to memory. STR (immediate) is an example of a store instruction. Analogous to LDR (immediate), it has encodings for different addressing modes. The unsigned offset class of STR (immediate) also has a 32-bit variant and a 64-bit variant. The encodings of the two variants are given as STR <Wt>, [<Xn|SP>{, #<pimm>}] and STR <Xt>, [<Xn|SP>{, #<pimm>}], where the assembler symbols are similar to the LDR (immediate) counterpart.

Again, we formulate high-level descriptions for some examples:

str w5, [x0]

Store 32 bits (word) from register W5 into memory. The data is written into memory at the 64-bit address held in register X0.

str x1, [x3]

Store 64 bits (double word) from register X1 into memory. The data is written into memory at the 64-bit address held in register X3.

str x1, [x3, #32]

Store 64 bits (double word) from register X1 into memory. The memory address is calculated by adding offset 32 to the value in register X3.

Example

Again, we can write an example function that uses STR. This time we use both LDR and STR to copy 64 bits of data.

Listing 3.1.5 Syntax of the function load_store_2.#
void load_store_2( uint64_t const * a,
                   uint64_t       * b );

The syntax of the example function load_store_2 is shown in Listing 3.1.5. The function takes two 64-bit pointers to two 64-bit unsigned integers as parameters and has no return value.

Listing 3.1.6 Implementation of the load_store_2 function, which copies 64 bits from one memory location to another.#
    .text
    .type load_store_2, %function
    .global load_store_2
load_store_2:
    ldr x0, [x0]
    str x0, [x1]
    ret

Listing 3.1.6 shows an example implementation of the function. According to the PCS, the two 64-bit parameters are passed through the general-purpose registers X0 and X1. The first instruction ldr x0, [x0] loads 64 bits of data from the address held in register X0 and overwrites the register with the loaded data. The second instruction str x0, [x1] stores the data in X0 into memory at the address held in register X1. Both instructions together copy 64 bits of data from one memory location to another. The last instruction ret jumps back to the address held in the link register (X30).
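The reuse of X0, first as an address and then as a data register, can be made explicit in a small register-level model in C. This is a sketch under the assumption that the variables x0 and x1 stand in for the registers of the same name; the function name load_store_2_model is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical register-level model of Listing 3.1.6. */
void load_store_2_model( uint64_t const * a, uint64_t * b ) {
  assert( a != NULL && b != NULL );

  uint64_t x0 = (uint64_t) (uintptr_t) a;  /* first parameter in X0  */
  uint64_t x1 = (uint64_t) (uintptr_t) b;  /* second parameter in X1 */

  /* ldr x0, [x0]: the address in x0 is replaced by the loaded data */
  memcpy( &x0, (void const *) (uintptr_t) x0, sizeof( x0 ) );

  /* str x0, [x1]: write the data to the address held in x1 */
  memcpy( (void *) (uintptr_t) x1, &x0, sizeof( x0 ) );
}
```

After the first memcpy, the model no longer knows the source address, which matches the behavior of the assembly implementation.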

3.1.3. Load/Store Pair#

There are also some instructions that can load data from memory to multiple registers at once, or store data from multiple registers into memory. LDP and STP are such instructions that use two general-purpose registers. The 64-bit variant of LDP in signed offset addressing mode has the encoding LDP <Xt1>, <Xt2>, [<Xn|SP>{, #<imm>}], while the 64-bit variant of STP in signed offset addressing mode has the encoding STP <Xt1>, <Xt2>, [<Xn|SP>{, #<imm>}]. Both instructions load/store a total of 128 bits of data.

As before, we formulate high-level descriptions of some examples and then show an example application:

ldp x5, x7, [x2]

Load 2x64 bits from memory into registers X5 and X7. In memory, the data is located at the 64-bit address held in register X2.

stp w2, w5, [x3]

Store 2x32 bits from registers W2 and W5 into memory. The data is written into memory at the 64-bit address held in register X3.

Example

In this example, we use LDP and STP to copy two 64-bit integers from one memory location to another.

Listing 3.1.7 Syntax of the function load_store_3.#
void load_store_3( int64_t const * a,
                   int64_t       * b );

The syntax of such a function could be similar to that of load_store_3 shown in Listing 3.1.7. The function takes two 64-bit pointers to two arrays of signed 64-bit integers as parameters and has no return value.

Listing 3.1.8 Implementation of the load_store_3 function, which copies 128 bits from one memory location to another.#
    .text
    .type load_store_3, %function
    .global load_store_3
load_store_3:
    ldp x2, x3, [x0]
    stp x2, x3, [x1]
    ret

An example implementation is shown in Listing 3.1.8. The implementation consists of three instructions. First, ldp x2, x3, [x0] loads 2x64 bits from memory into registers X2 and X3 using the address in parameter register X0. Second, stp x2, x3, [x1] stores 2x64 bits from registers X2 and X3 to memory using the address in parameter register X1. Finally, ret branches to the program address in the link register (X30).
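For comparison, a C reference of the same copy could look as follows. The function name load_store_3_ref and the local variable names are assumptions chosen to mirror the registers used in Listing 3.1.8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C reference for load_store_3. */
void load_store_3_ref( int64_t const * a, int64_t * b ) {
  assert( a != NULL && b != NULL );

  /* ldp x2, x3, [x0]: two consecutive 64-bit values are loaded,
   * the first register receives the value at the lower address */
  int64_t x2 = a[0];
  int64_t x3 = a[1];

  /* stp x2, x3, [x1]: both values are stored consecutively */
  b[0] = x2;
  b[1] = x3;
}
```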

3.2. Machine Code#

In Section 2.3 we have already seen that we can use the assembler to translate assembly code into instruction words. We use the term instruction word to refer to the full 32-bit encoding of a single A64 instruction (sometimes also called the instruction encoding). However, we only briefly looked at the instruction stp x29, x30, [sp, -16]! and its instruction word 0b10101001101111110111101111111101. Now, with our knowledge of the structure of LDR (immediate), we can easily see the structure of the instruction word.

The following table shows the assembly code and corresponding instruction words of LDR (immediate), unsigned offset instructions:

| Bit IDs              | 31-22      | 21-10        | 9-5   | 4-0   |
|----------------------|------------|--------------|-------|-------|
| Field                |            | imm12        | Rn    | Rt    |
| Pattern              | 1s11100101 | iiiiiiiiiiii | nnnnn | ttttt |
| ldr w0, [x0]         | 1011100101 | 000000000000 | 00000 | 00000 |
| ldr w1, [x0]         | 1011100101 | 000000000000 | 00000 | 00001 |
| ldr w5, [x0]         | 1011100101 | 000000000000 | 00000 | 00101 |
| ldr x1, [x3]         | 1111100101 | 000000000000 | 00011 | 00001 |
| ldr x1, [x3, #32]    | 1111100101 | 000000000100 | 00011 | 00001 |
| ldr x1, [x3, #4088]  | 1111100101 | 000111111111 | 00011 | 00001 |
| ldr w1, [x3, #2044]  | 1011100101 | 000111111111 | 00011 | 00001 |
| ldr x0, [x0, #32760] | 1111100101 | 111111111111 | 00000 | 00000 |
| .word 0xb9400000     | 1011100101 | 000000000000 | 00000 | 00000 |
| .word 0xb9400001     | 1011100101 | 000000000000 | 00000 | 00001 |
| .word 0xf97ffc00     | 1111100101 | 111111111111 | 00000 | 00000 |

The first row shows the IDs of the bits, where the most significant instruction bit is given by the bit with ID 31 and the least significant one by the bit with ID 0. The second row shows the location of the three fields imm12, Rn, and Rt as described in the A64 ISA. The immediate imm12 has a total of 12 bits and encodes the offset. The IDs of the two registers used are encoded in the fields Rn and Rt. The third row shows a short form of the instruction word pattern. The nine bits with IDs 22-29 and ID 31 are fixed, all other bits depend on how the instruction is used. We will go through the examples one by one:

ldr w0, [x0]

General-purpose register W0 has ID 0, encoded in field Rt, so we have the value 00000 in bits 4-0. Register X0 also has the ID 0, encoded in the Rn field. We have not specified an offset. According to the A64 ISA, the offset is set to 0 by default and we get the value 000000000000 in bits 21-10. We are loading the data into a W register, so we are using the 32-bit variant of the instruction. This information is stored in the size bit s, which has bit ID 30 and is 0 in this case.

ldr w1, [x0]

Compared to the previous example, only the ID of the 32-bit W register has changed. Therefore, we now have the value 00001 in bits 4-0.

ldr w5, [x0]

Again, only the ID of the W register has changed and we get 00101 for bits 4-0.

ldr x1, [x3]

The IDs of both registers changed. Thus, we get 00001 for Rt and 00011 for Rn. Additionally, we are now loading 64 bits into register X1. This means that the s bit with ID 30 is set to 1.

ldr x1, [x3, #32]

Compared to the previous instruction, we now load from a different location in memory. Specifically, the address is given by the value in register X3 with an added offset of 32 bytes. The offset is encoded in the field imm12 as 000000000100. Note that the offset is specified in eight-byte increments. In other words, $100_2 = 4_{10}$, i.e. we obtain the effective offset as $4 \cdot 8 = 32$.

ldr x1, [x3, #4088]

Only the offset has changed, which is now encoded as 000111111111 in the imm12 field: $111111111_2 = 511_{10}$ and $511 \cdot 8 = 4088$.

ldr w1, [x3, #2044]

This is an interesting example because, at first glance, two properties change. First, we are now loading 32 bits instead of 64 bits. Second, we have an offset of 2044 bytes instead of 4088 bytes. However, the offset of the 32-bit variant of LDR (immediate) is specified in 4-byte increments. This means that the numeric value in the field imm12 is identical to the one before: 000111111111. Comparing the instruction word of this instruction to the previous one, only the s bit with ID 30 changes from 1 to 0.

ldr x0, [x0, #32760]

In this case, we load 64 bits from memory into register X0. The data is loaded from the address that we get by adding the offset 32760 to the value in register X0. The immediate is now 111111111111, i.e. 32760 is the largest offset that can be encoded in the imm12 field of the 64-bit variant of the instruction.

.word 0xb9400000

We can also write the instruction word of an instruction directly. In this case, 0xb9400000 is simply the hexadecimal representation of the 32 instruction bits of ldr w0, [x0].

.word 0xb9400001

Hexadecimal representation of the instruction ldr w1, [x0].

.word 0xf97ffc00

Hexadecimal representation of the instruction ldr x0, [x0, #32760].
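The field layout discussed above can also be checked programmatically. The following C sketch splits an instruction word into the fields of the table; the type ldr_uimm_fields and the functions is_ldr_uimm and decode_ldr_uimm are hypothetical names introduced for this illustration, and the sketch only covers LDR (immediate) in the unsigned offset class with general-purpose registers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the fields of an
 * LDR (immediate), unsigned offset instruction word. */
typedef struct {
  uint32_t s;       /* size bit (bit 30): 0 = 32-bit, 1 = 64-bit */
  uint32_t imm12;   /* bits 21-10 */
  uint32_t rn;      /* bits 9-5   */
  uint32_t rt;      /* bits 4-0   */
  uint32_t offset;  /* byte offset: imm12 scaled by 4 or 8 */
} ldr_uimm_fields;

/* Returns 1 if the nine fixed bits match the pattern 1s11100101. */
int is_ldr_uimm( uint32_t word ) {
  return ( word & 0xbfc00000u ) == 0xb9400000u;
}

/* Extracts the variable fields of the instruction word. */
ldr_uimm_fields decode_ldr_uimm( uint32_t word ) {
  assert( is_ldr_uimm( word ) );

  ldr_uimm_fields f;
  f.s      = ( word >> 30 ) & 0x1u;
  f.imm12  = ( word >> 10 ) & 0xfffu;
  f.rn     = ( word >>  5 ) & 0x1fu;
  f.rt     =   word         & 0x1fu;
  f.offset = f.imm12 * ( f.s ? 8u : 4u );
  return f;
}
```

Decoding 0xf97ffc00 in this way gives s = 1, imm12 = 4095, Rn = 0, Rt = 0 and a byte offset of 32760, which matches the last row of the table.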

Note

Since we have powerful assemblers that can translate assembly code into machine code, knowledge of the structure and use of instruction words may seem unnecessary. However, there are situations where this knowledge comes in handy, for example:

  • Assemblers may not support all available instructions. Apple’s proprietary AMX instructions are such a case, where the matrix accelerator on the M1, M2, and M3 series is only accessible by writing machine code directly. Only with the introduction of M4 in 2024 did Apple start to support the standardized Scalable Matrix Extension (SME).

  • One might want to bypass the overhead of translating ASCII assembly code into machine code. This is the case in situations where machine code is generated at runtime and written directly to memory to be executed. We will discuss just-in-time machine code generation as a core building block for our tensor compiler in Section 8.
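To give an impression of what generating machine code at runtime involves, the sketch below assembles the instruction word of LDR (immediate), unsigned offset, from its fields. The function name encode_ldr_uimm and its parameters are assumptions made for this illustration; a real code generator would need many more instructions and checks.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoder for LDR (immediate), unsigned offset.
 * is_64bit: 0 selects the W variant, 1 selects the X variant.
 * rt, rn:   register IDs in the range 0-31.
 * pimm:     byte offset, a multiple of 4 (W) or 8 (X). */
uint32_t encode_ldr_uimm( int      is_64bit,
                          uint32_t rt,
                          uint32_t rn,
                          uint32_t pimm ) {
  uint32_t scale = is_64bit ? 8u : 4u;
  assert( rt < 32u && rn < 32u );
  assert( pimm % scale == 0u && pimm / scale < 4096u );

  uint32_t word = 0xb9400000u;            /* fixed bits with s = 0 */
  word |= ( is_64bit ? 1u : 0u ) << 30;   /* size bit s            */
  word |= ( pimm / scale ) << 10;         /* imm12                 */
  word |= rn << 5;                        /* Rn                    */
  word |= rt;                             /* Rt                    */
  return word;
}
```

For example, encode_ldr_uimm( 1, 0, 0, 32760 ) returns 0xf97ffc00, the instruction word of ldr x0, [x0, #32760].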

3.3. Addressing Modes#

The A64 instruction set has several addressing modes for loading and storing data. In general, load and store instructions use a 64-bit base address along with an optional offset. The base address is held in one of the 31 general-purpose registers, the stack pointer, or the program counter. The offset can be encoded directly as an immediate in the 32 instruction bits or held in an offset register.

Following C1.3.3 of the Arm Architecture Reference Manual for A-profile architecture, the most important addressing modes can be summarized as follows:

Base register only: [base(, #0)]

In the simplest case, we use only the base register to determine the memory address.

Listing 3.3.1 Implementation of the addressing_modes_0 function, which copies 32 bits from one memory location to another.#
    .section .text
    .type addressing_modes_0, %function
    .global addressing_modes_0
addressing_modes_0:
    ldr w2, [x0, #0]
    str w2, [x1, #0]
    ret

The example function in Listing 3.3.1 shows the base register only addressing mode. Specifically, the function copies 32 bits from memory at the address given by the value in X0 to the memory address given by the value in X1. The immediate offsets of load ldr w2, [x0, #0] and store str w2, [x1, #0] are explicitly set to zero. Alternatively, we could write ldr w2, [x0] for the load and str w2, [x1] for the store. Note that both forms result in the same instruction word because the offset is set to zero by default in the unsigned offset class of LDR (immediate) and STR (immediate).

Base plus immediate offset: [base(, #imm)]

Another addressing mode is given by a base address together with an immediate offset. In this case, the effective memory address is calculated by adding the immediate offset to the base address. Since the immediate is limited in size, the number of possible immediate offsets is also limited. In the unsigned offset class of LDR (immediate), the immediate has a size of 12 bits. Furthermore, in the 32-bit variant of the instruction, only non-negative multiples of 4 are encoded as immediates. This results in an offset range of 0 to 16380 for this variant because $(2^{12}-1) \cdot 4 = 16380$.
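The offset limits can be written down as a short calculation. The following C sketch uses the hypothetical helpers max_unsigned_offset and offset_is_encodable; it assumes a 12-bit unsigned immediate that is scaled by the access size, as in the unsigned offset class of LDR (immediate).

```c
#include <assert.h>
#include <stdint.h>

/* Largest byte offset encodable in a 12-bit unsigned immediate
 * that is scaled by the access size in bytes (4 or 8). */
uint32_t max_unsigned_offset( uint32_t access_bytes ) {
  assert( access_bytes == 4u || access_bytes == 8u );
  return ( ( 1u << 12 ) - 1u ) * access_bytes;
}

/* Returns 1 if the byte offset is a multiple of the access size
 * and does not exceed the largest encodable offset. */
int offset_is_encodable( uint32_t offset, uint32_t access_bytes ) {
  return offset % access_bytes == 0u
      && offset <= max_unsigned_offset( access_bytes );
}
```

The helper returns 16380 for 4-byte accesses and 32760 for 8-byte accesses. An offset of 6 is rejected for a 32-bit load because it is not a multiple of 4.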

Listing 3.3.2 Implementation of the addressing_modes_1 function, which copies 32 bits from one memory location to another.#
    .section .text
    .type addressing_modes_1, %function
    .global addressing_modes_1
addressing_modes_1:
    ldr w2, [x0, #8]
    str w2, [x1, #12]
    ret

The example function in Listing 3.3.2 shows the base plus immediate offset addressing mode. Unlike before, the effective address for the 32-bit load ldr w2, [x0, #8] is determined by adding 8 to the value in register X0. The effective address for the 32-bit store str w2, [x1, #12] is determined by adding 12 to the value in register X1.
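Expressed in C, the two byte offsets turn into array indices. The sketch below is a possible C reference; the function name addressing_modes_1_ref and its signature are assumptions, since the original text does not give a C declaration for this function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C reference for addressing_modes_1. */
void addressing_modes_1_ref( uint32_t const * a, uint32_t * b ) {
  assert( a != NULL && b != NULL );
  uint32_t w2 = a[ 8 / sizeof( uint32_t ) ];   /* ldr w2, [x0, #8]  */
  b[ 12 / sizeof( uint32_t ) ] = w2;           /* str w2, [x1, #12] */
}
```

The function reads the third element of a and writes it to the fourth element of b.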

Base plus register offset: [base, Xm(, LSL #imm)]

In addition to immediate offsets, we can also provide the offset through an additional register. In this case, the offset can be any 64-bit value. We can also encode an additional left shift of the offset as part of the instruction.

Listing 3.3.3 Implementation of the addressing_modes_2 function, which copies 32 bits from one memory location to another.#
    .section .text
    .type addressing_modes_2, %function
    .global addressing_modes_2
addressing_modes_2:
    ldr w4, [x0, x1]
    str w4, [x2, x3]
    ret

The example function in Listing 3.3.3 illustrates the base plus register offset addressing mode. The ldr w4, [x0, x1] instruction uses LDR (register). This means that the effective memory address for the load is calculated by adding the 64-bit value in X1 to the base address in X0. Likewise, the effective address for the store str w4, [x2, x3] is given by adding the values in X2 and X3.
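A possible C reference for this function takes the two byte offsets as additional parameters. The name addressing_modes_2_ref and the parameter names are assumptions; the parameter order follows the registers X0 to X3 used in the listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C reference for addressing_modes_2.
 * The offsets are given in bytes, as in the register offset form. */
void addressing_modes_2_ref( void const * a, uint64_t off_a,
                             void       * b, uint64_t off_b ) {
  assert( a != NULL && b != NULL );
  uint32_t w4;

  /* ldr w4, [x0, x1]: load from base address plus register offset */
  memcpy( &w4, (unsigned char const *) a + off_a, sizeof( w4 ) );

  /* str w4, [x2, x3]: store to base address plus register offset */
  memcpy( (unsigned char *) b + off_b, &w4, sizeof( w4 ) );
}
```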

Pre-indexed: [base, #imm]!

So far we have only discussed addressing modes that leave the register holding the base address untouched. In contrast, pre-indexed instructions calculate the effective address by adding the immediate offset to the base address and also write this address back to the register that held the base address.

Listing 3.3.4 Implementation of the addressing_modes_3 function, which copies 2x32 bits from one memory location to another.#
    .section .text
    .type addressing_modes_3, %function
    .global addressing_modes_3
addressing_modes_3:
    ldr w2, [x0, #4]!
    str w2, [x1, #4]!
    ldr w2, [x0, #4]!
    str w2, [x1, #4]!
    ret

Listing 3.3.4 illustrates the use of pre-indexed loads and stores. The pre-indexed load ldr w2, [x0, #4]! calculates the effective address by adding 4 to the value in register X0. The address is not only used for the memory access, but is also written back to register X0. So, if we interpret the original value in X0 as a C pointer to an array of 32-bit data, we increment the pointer by one, which is a four-byte increment in pointer arithmetic, and load the second value of the array through the incremented pointer. Next, the instruction str w2, [x1, #4]! stores the 32 bits in W2 at the address obtained by adding 4 to the value in X1. It also writes the value incremented by 4 back to X1. The following two instructions ldr w2, [x0, #4]! and str w2, [x1, #4]! do the same with the incremented values in X0 and X1. In total, the function copies the second and third 32-bit values of the source array to the second and third positions of the destination array.
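In C, pre-indexed accesses correspond to the pre-increment idiom on pointers. The following sketch is a possible C reference; the name addressing_modes_3_ref and its signature are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C reference for addressing_modes_3. */
void addressing_modes_3_ref( uint32_t const * a, uint32_t * b ) {
  assert( a != NULL && b != NULL );
  uint32_t w2;

  w2   = *++a;  /* ldr w2, [x0, #4]!: advance the pointer, then load  */
  *++b = w2;    /* str w2, [x1, #4]!: advance the pointer, then store */
  w2   = *++a;
  *++b = w2;
}
```

With a source array { 10, 20, 30 } and a zero-initialized destination of the same size, the destination holds { 0, 20, 30 } afterwards.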

Post-indexed: [base], #imm

Post-indexed instructions use the base address for the memory access, but also add an immediate offset to the value in the register that held the base address.

Listing 3.3.5 Implementation of the addressing_modes_4 function, which copies 2x32 bits from one memory location to another.#
    .section .text
    .type addressing_modes_4, %function
    .global addressing_modes_4
addressing_modes_4:
    ldr w2, [x0], #4
    str w2, [x1], #4
    ldr w2, [x0], #4
    str w2, [x1], #4
    ret

Post-indexed loads and stores are shown in Listing 3.3.5. The example is similar to the pre-indexed example in Listing 3.3.4. However, this time post-indexed loads and stores are used instead of pre-indexed ones. In the case of ldr w2, [x0], #4, this means that the instruction first loads the 32 bits from memory using the address in X0, and then increments the value in X0 by 4.
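The post-indexed form matches the post-increment idiom in C. As before, the function name addressing_modes_4_ref and its signature are assumptions for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C reference for addressing_modes_4. */
void addressing_modes_4_ref( uint32_t const * a, uint32_t * b ) {
  assert( a != NULL && b != NULL );
  uint32_t w2;

  w2   = *a++;  /* ldr w2, [x0], #4: load, then advance the pointer  */
  *b++ = w2;    /* str w2, [x1], #4: store, then advance the pointer */
  w2   = *a++;
  *b++ = w2;
}
```

With a source array { 10, 20, 30 } and a zero-initialized destination of the same size, the destination holds { 10, 20, 0 } afterwards.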

Literal (PC-relative): label

We can also use the value in the program counter as the base address, along with an immediate offset. For example, in the case of LDR (literal), the signed immediate offset has a size of 19 bits and encodes multiples of 4. This results in a range of $-2^{19-1+2}$ to $2^{19-1+2}-4$ bytes for the offset. So we can reach addresses from -1 MiB to just under +1 MiB relative to the program counter. In assembly code, we can use labels for the immediate offset so that the assembler will insert the correct numeric value for us.

Listing 3.3.6 Implementation of the function addressing_modes_5, which loads 32 bits from a PC-relative location and then writes them to memory.#
    .section .text
    .type addressing_modes_5, %function
    .global addressing_modes_5
addressing_modes_5:
    ldr w1, my_data
    str w1, [x0]
    ret

my_data:
    .word 128

The use of a PC-relative load is illustrated in Listing 3.3.6. In the case of ldr w1, my_data, the assembler would determine the immediate offset by comparing the location of the label my_data with that of the instruction. At runtime, the instruction then loads the 32 bits at the effective address obtained by adding the immediate offset to the value in the program counter. In this example, the instruction loads the value 128 to register W1.
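The conversion from a byte offset to the 19-bit immediate can be sketched in C. The helper name ldr_literal_imm19 is hypothetical; the sketch only assumes what is stated above, namely a signed 19-bit field that encodes multiples of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts a PC-relative byte offset into the
 * 19-bit immediate field. Returns 1 on success and 0 if the offset
 * is not a multiple of 4 or lies outside the encodable range. */
int ldr_literal_imm19( int64_t byte_offset, uint32_t * imm19 ) {
  int64_t const limit = (int64_t) 1 << 20;  /* 2^(19-1+2) bytes = 1 MiB */
  assert( imm19 != NULL );

  if( byte_offset % 4 != 0 ) return 0;
  if( byte_offset < -limit || byte_offset >= limit ) return 0;

  /* keep the lower 19 bits of the two's complement representation */
  *imm19 = (uint32_t) ( byte_offset / 4 ) & 0x7ffffu;
  return 1;
}
```

In Listing 3.3.6, the label my_data follows three 4-byte instructions, so it is 12 bytes past the ldr instruction and the helper yields the immediate 3.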

3.4. Data Processing#

In addition to the load and store instructions, the data-processing instructions form another large group. These instructions read operands from one or more source registers, perform the computation, and write the result back to a destination register. The registers that an instruction reads from are called source registers. The registers an instruction writes to are called destination registers. A register that an instruction both reads from and writes to is also referred to as a destination register in the corresponding A64 ISA instruction description. In addition to register data, some instructions use immediate values. Since the ISA contains many data-processing instructions, we will provide high-level descriptions of a few examples.

ADD (immediate): add x3, x8, #16

Add the unsigned immediate value of 16 to the value in source register X8 and write the result to destination register X3.

MOV (register): mov w0, w1

Copy the 32-bit value from source register W1 to destination register W0.

AND (shifted register): and x17, x8, x2, lsl #2

Perform a bitwise AND of the 64-bit value in source register X8 with the 64-bit value in source register X2 after it has been logically shifted to the left by 2. Write the result to destination register X17.

EOR (shifted register): eor w16, w16, w16

Perform a bitwise exclusive-OR of the 32-bit value in source register W16 with the 32-bit value in source register W16 and write the result to destination register W16. This effectively zeroes the 64-bit register X16, since the 32 upper bits are also implicitly set to zero.

MADD: madd x0, x1, x2, x3

Multiply the 64-bit values in source registers X1 and X2, add the intermediate result to the 64-bit value in source register X3, and write the final 64-bit result to destination register X0.
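The example instructions can be mirrored by small C functions. This is only a sketch: the function names are hypothetical, the parameters are named after the source registers, and unsigned arithmetic is used so that results wrap around at 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* add x3, x8, #16 */
uint64_t add_imm16( uint64_t x8 ) {
  return x8 + 16u;
}

/* and x17, x8, x2, lsl #2 */
uint64_t and_lsl2( uint64_t x8, uint64_t x2 ) {
  return x8 & ( x2 << 2 );
}

/* eor w16, w16, w16: returns the full 64-bit register X16 afterwards */
uint64_t eor_self_w( uint64_t x16 ) {
  uint32_t w16 = (uint32_t) x16;   /* only the W view is read */
  return (uint64_t) ( w16 ^ w16 ); /* upper 32 bits become 0  */
}

/* madd x0, x1, x2, x3 */
uint64_t madd( uint64_t x1, uint64_t x2, uint64_t x3 ) {
  return x3 + x1 * x2;
}
```

For example, madd( 2, 3, 4 ) returns 10, and eor_self_w returns 0 for any input.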

A special class of data-processing instructions sets the NZCV condition flags based on the result of the instruction. We call these instructions flag-setting instructions. The NZCV condition flags are defined as follows:

N

Negative condition flag. Set to 1 if the result of the last flag-setting instruction was negative.

Z

Zero condition flag. Set to 1 if the result of the last flag-setting instruction was zero, and to 0 otherwise. A result of zero often indicates an equal result from a comparison.

C

Carry condition flag. Set to 1 if the last flag-setting instruction resulted in a carry condition, for example an unsigned overflow on an addition.

V

Overflow condition flag. Set to 1 if the last flag-setting instruction resulted in an overflow condition, for example a signed overflow on an addition.

—Arm Limited: C5.2.11 of the Arm Architecture Reference Manual for A-profile architecture
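To make the flag definitions concrete, the following C sketch computes the four flags for a 64-bit flag-setting addition. The type nzcv_flags and the function adds64_flags are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the four condition flags. */
typedef struct {
  int n;
  int z;
  int c;
  int v;
} nzcv_flags;

/* Models the flags set by a 64-bit flag-setting addition of a and b. */
nzcv_flags adds64_flags( uint64_t a, uint64_t b ) {
  uint64_t r = a + b;  /* result, wraps around at 64 bits */
  nzcv_flags f;

  f.n = (int) ( r >> 63 );  /* sign bit of the result        */
  f.z = ( r == 0u );        /* result is zero                */
  f.c = ( r < a );          /* unsigned overflow (carry out) */
  /* signed overflow: both operands have the same sign,
   * but the sign of the result differs */
  f.v = (int) ( ( ~( a ^ b ) & ( a ^ r ) ) >> 63 );
  return f;
}
```

Adding 1 to 0xffffffffffffffff sets Z and C, while adding 1 to 0x7fffffffffffffff sets N and V.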

Many flag-setting instructions can be distinguished from their “standard” counterparts by the s suffix, for example: ADDS (shifted register), ADDS (immediate), SUBS (shifted register), SUBS (immediate). An often-used special case is CMP, which exists in extended register, immediate, and shifted register variants. These instructions are aliases of their SUBS counterparts, where only the condition flags are set, but the actual result of the subtraction is discarded. For example, cmp x5, #16 is equivalent to subs xzr, x5, #16.
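The behavior of CMP as a subtraction whose result is discarded can be modeled in C as follows. The type cmp_result and the function cmp64 are hypothetical names; the sketch relies on the A64 convention that the carry flag of a subtraction is set when no borrow occurs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the flags produced by a comparison. */
typedef struct {
  int n;
  int z;
  int c;
  int v;
} cmp_result;

/* Models cmp a, b for 64-bit operands: only the flags are returned. */
cmp_result cmp64( uint64_t a, uint64_t b ) {
  uint64_t r = a - b;  /* result of the subtraction, not kept */
  cmp_result f;

  f.n = (int) ( r >> 63 );  /* a - b is negative as a signed value */
  f.z = ( r == 0u );        /* a equals b                          */
  f.c = ( a >= b );         /* carry set: no borrow occurred       */
  /* signed overflow: operands differ in sign and the sign
   * of the result differs from the sign of a */
  f.v = (int) ( ( ( a ^ b ) & ( a ^ r ) ) >> 63 );
  return f;
}
```

For cmp x5, #16 with the value 5 in X5, the model yields N = 1, Z = 0, C = 0 and V = 0. If X5 holds 16 instead, it yields Z = 1 and C = 1.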