3. Base Instructions
The A64 base instructions form the basis of the A64 instruction set. They are further subdivided into the following instruction groups:
- Loads and stores associated with general-purpose registers.
- Data processing (immediate).
- Data processing (register).
- Branch, exception generation, and system instructions.
We cover each of the four instruction groups in a separate section by discussing some important examples and related concepts. The details of all base instructions are available in the Arm A-profile A64 Instruction Set Architecture.
3.1. Loads and Stores
The first group of instructions transfers data between memory and the general-purpose registers. An instruction that transfers data from memory to a register is called a load. Conversely, an instruction that transfers data from a register to memory is called a store.
3.1.1. LDR (immediate)
We start by discussing the LDR (immediate) instruction in detail.
It transfers data from memory into a register.
As its name suggests, we can use the instruction in assembly code through the mnemonic ldr.
The instruction has encodings for different addressing modes (“post-index”, “pre-index” and “unsigned offset”).
Specifically, the unsigned offset class has a 32-bit variant with the encoding LDR <Wt>, [<Xn|SP>{, #<pimm>}] and a 64-bit variant with the encoding LDR <Xt>, [<Xn|SP>{, #<pimm>}].
The A64 ISA describes the assembler symbols as follows:
<Wt>
Is the 32-bit name of the general-purpose register to be transferred, encoded in the “Rt” field.
<Xn|SP>
Is the 64-bit name of the general-purpose base register or stack pointer, encoded in the “Rn” field.
<Xt>
Is the 64-bit name of the general-purpose register to be transferred, encoded in the “Rt” field.
<pimm>
For the “32-bit” variant: is the optional positive immediate byte offset, a multiple of 4 in the range 0 to 16380, defaulting to 0 and encoded in the “imm12” field as <pimm>/4. For the “64-bit” variant: is the optional positive immediate byte offset, a multiple of 8 in the range 0 to 32760, defaulting to 0 and encoded in the “imm12” field as <pimm>/8.
If you are new to assembly language programming, this information can be difficult to parse. However, most of it makes sense immediately when looking at some examples. Suppose we use one of the following LDR (immediate) instructions in our assembly program. Then we can give high-level descriptions of the instructions:
ldr w5, [x0]
Load 32 bits (word) from memory into register W5. In memory, the data is located at the 64-bit address held in register X0.
ldr x1, [x3]
Load 64 bits (double word) from memory into register X1. In memory, the data is located at the 64-bit address held in register X3.
ldr x1, [x3, #32]
Load 64 bits (double word) from memory into register X1. In memory, the data is located at the 64-bit address obtained by adding 32 to the value in register X3.
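The rules for <pimm> quoted from the A64 ISA above can be sketched as a small C check. This is only an illustration: the function name ldr_pimm_to_imm12 and its interface are hypothetical and not part of any Arm tooling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: checks whether a byte offset is a valid <pimm> for the
 * unsigned offset class of LDR (immediate). On success, the scaled value that
 * would go into the imm12 field is written to *imm12. */
bool ldr_pimm_to_imm12(uint32_t pimm, bool is_64bit, uint32_t *imm12)
{
    assert(imm12 != NULL);
    uint32_t scale = is_64bit ? 8u : 4u;   /* access size in bytes */
    if (pimm % scale != 0) {
        return false;                      /* must be a multiple of 4 or 8 */
    }
    uint32_t scaled = pimm / scale;
    if (scaled > 0xfffu) {
        return false;                      /* imm12 holds only 12 bits */
    }
    *imm12 = scaled;
    return true;
}
```

Under these assumptions, the offset 32 of ldr x1, [x3, #32] is accepted for the 64-bit variant and scaled to 4, while an offset of 12 is rejected for the 64-bit variant because it is not a multiple of 8.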
Examples
Knowing the procedure call standard (see Section 2.6) and load instructions, we can write our first simple functions in assembly language.
int32_t load_store_0( int32_t const * a );
Suppose we want to implement a function load_store_0 with the signature given in Listing 3.1.1.
The function should take a single address as a parameter and load 32 bits of data from that address.
The return value should be the 32 bits loaded.
    .text
    .type load_store_0, %function
    .global load_store_0
load_store_0:
    ldr w0, [x0]
    ret
An example implementation of the load_store_0 function is shown in Listing 3.1.2.
According to the PCS, the 64-bit address is passed to the function through register X0.
The address is then used in the ldr w0, [x0] instruction, which loads 32 bits from this address into W0.
Since W0 is simply a different view of R0, the data in X0 is overwritten.
That is, the instruction replaces the lower 32 bits with the loaded data and sets the upper 32 bits to zero.
The ret instruction jumps back to the address in the link register.
We will discuss the details of RET and other branch instructions at a later point.
According to the PCS, the caller expects the return value of the function in register W0.
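The effect of writing to the W view of a register can be modeled in C. The following is a minimal sketch rather than a description of the hardware: the names write_w_view and model_ldr_w0_x0 are made up for this illustration, and a uint64_t stands in for register X0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of a write to the W view of a 64-bit register: the lower 32 bits take
 * the new value and the upper 32 bits become zero. */
uint64_t write_w_view(uint64_t x_old, uint32_t w_new)
{
    (void)x_old;               /* the previous contents are not preserved */
    return (uint64_t)w_new;    /* zero-extension to 64 bits */
}

/* Model of "ldr w0, [x0]": X0 holds the address before the instruction and
 * the zero-extended 32-bit data afterwards. Returns the new value of X0. */
uint64_t model_ldr_w0_x0(const int32_t *address)
{
    assert(address != NULL);
    uint32_t bits;
    memcpy(&bits, address, sizeof bits);   /* read the raw 32 bits */
    return write_w_view((uint64_t)(uintptr_t)address, bits);
}
```

For example, loading the value -1 leaves 0x00000000ffffffff in the modeled X0. The caller only looks at W0 for an int32_t return value, so it still sees -1.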
int32_t load_store_1( int32_t const * a );
Now suppose we want to implement the function load_store_1 with the signature shown in Listing 3.1.3.
While the signature matches that of load_store_0, the goal is to load the data from the address with an additional offset of 16 bytes.
    .text
    .type load_store_1, %function
    .global load_store_1
load_store_1:
    ldr w0, [x0, #16]
    ret
Listing 3.1.4 shows an example implementation of the load_store_1 function.
This time, the instruction ldr w0, [x0, #16] loads data from the address obtained by adding 16 to the value in X0.
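To relate the byte offset to C, the following sketch gives two reference versions of load_store_1. The function names are hypothetical. The first version uses array indexing, the second an explicit byte offset of 16, which corresponds to element 4 of an array of 4-byte integers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reference version using array indexing: element 4 of an int32_t array. */
int32_t load_store_1_ref(const int32_t *a)
{
    assert(a != NULL);
    return a[4];
}

/* Same load expressed with an explicit byte offset, as in
 * "ldr w0, [x0, #16]". */
int32_t load_store_1_bytes(const int32_t *a)
{
    assert(a != NULL);
    int32_t value;
    memcpy(&value, (const unsigned char *)a + 16, sizeof value);
    return value;
}
```

Both versions return the same value, since pointer arithmetic on an int32_t pointer advances in steps of 4 bytes.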
3.1.2. STR (immediate)
Store instructions transfer data from the registers to memory.
STR (immediate) is an example of a store instruction.
Analogous to LDR (immediate), it has encodings for different addressing modes.
The unsigned offset class of STR (immediate) also has a 32-bit variant and a 64-bit variant.
The encodings of the two variants are given as STR <Wt>, [<Xn|SP>{, #<pimm>}] and STR <Xt>, [<Xn|SP>{, #<pimm>}], where the assembler symbols are similar to the LDR (immediate) counterpart.
Again, we formulate high-level descriptions for some examples:
str w5, [x0]
Store 32 bits (word) from register W5 into memory. The data is written into memory at the 64-bit address held in register X0.
str x1, [x3]
Store 64 bits (double word) from register X1 into memory. The data is written into memory at the 64-bit address held in register X3.
str x1, [x3, #32]
Store 64 bits (double word) from register X1 into memory. The memory address is calculated by adding the offset 32 to the value in register X3.
Example
Again, we can write an example function that uses STR. This time we use both LDR and STR to copy 64 bits of data.
void load_store_2( uint64_t const * a,
uint64_t * b );
The signature of the example function load_store_2 is shown in Listing 3.1.5.
The function takes two 64-bit pointers to two 64-bit unsigned integers as parameters and has no return value.
Listing 3.1.6 Implementation of the load_store_2 function, which copies 64 bits from one memory location to another.
    .text
    .type load_store_2, %function
    .global load_store_2
load_store_2:
    ldr x0, [x0]
    str x0, [x1]
    ret
Listing 3.1.6 shows an example implementation of the function.
According to the PCS, the two 64-bit parameters are passed through the general-purpose registers X0 and X1.
The first instruction ldr x0, [x0] loads 64 bits of data from the address held in register X0 and overwrites the register with the loaded data.
The second instruction str x0, [x1] stores the data in X0 into memory at the address held in register X1.
Both instructions together copy 64 bits of data from one memory location to another.
The last instruction ret jumps back to the address held in the link register (X30).
3.1.3. Load/Store Pair
There are also some instructions that can load data from memory to multiple registers at once, or store data from multiple registers into memory.
LDP and STP are such instructions that use two general-purpose registers.
The 64-bit variant of LDP in signed offset addressing mode has the encoding LDP <Xt1>, <Xt2>, [<Xn|SP>{, #<imm>}], while the 64-bit variant of STP in signed offset addressing mode has the encoding STP <Xt1>, <Xt2>, [<Xn|SP>{, #<imm>}].
Both instructions load/store a total of 128 bits of data.
As before, we formulate high-level descriptions of some examples and then show an example application:
ldp x5, x7, [x2]
Load 2x64 bits from memory into registers X5 and X7. In memory, the data is located at the 64-bit address held in register X2.
stp w2, w5, [x3]
Store 2x32 bits from registers W2 and W5 into memory. The data is written into memory at the 64-bit address held in register X3.
Example
In this example, we use LDP and STP to copy two 64-bit integers from one memory location to another.
void load_store_3( int64_t const * a,
int64_t * b );
The signature of such a function could be similar to that of load_store_3 shown in Listing 3.1.7.
The function takes two 64-bit pointers to two arrays of signed 64-bit integers as parameters and has no return value.
Listing 3.1.8 Implementation of the load_store_3 function, which copies 128 bits from one memory location to another.
    .text
    .type load_store_3, %function
    .global load_store_3
load_store_3:
    ldp x2, x3, [x0]
    stp x2, x3, [x1]
    ret
An example implementation is shown in Listing 3.1.8.
The implementation consists of three instructions.
First, ldp x2, x3, [x0] loads 2x64 bits from memory into registers X2 and X3 using the address in parameter register X0.
Second, stp x2, x3, [x1] stores 2x64 bits from registers X2 and X3 to memory using the address in parameter register X1.
Finally, ret branches to the program address in the link register (X30).
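A C reference version can help when testing such a function. The sketch below is an assumption about how one might write it, with the hypothetical name load_store_3_ref. It mirrors the order of the assembly: both values are read into locals, which stand in for X2 and X3, before anything is written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference version of load_store_3: both values are read before the first
 * one is written, just as ldp completes before stp starts. */
void load_store_3_ref(const int64_t *a, int64_t *b)
{
    assert(a != NULL && b != NULL);
    int64_t x2 = a[0];   /* ldp x2, x3, [x0] */
    int64_t x3 = a[1];
    b[0] = x2;           /* stp x2, x3, [x1] */
    b[1] = x3;
}
```

The order matters when the two memory regions overlap. Copying within the array {1, 2, 3} from element 0 to element 1 yields {1, 1, 2}, whereas a copy that interleaves loads and stores element by element would yield {1, 1, 1}.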
3.2. Machine Code
In Section 2.3 we have already seen that we can use the assembler to translate assembly code into instruction words.
We use the term instruction word to refer to the full 32-bit encoding of a single A64 instruction (sometimes also called the instruction encoding).
However, we only briefly looked at the instruction stp x29, x30, [sp, -16]! and its instruction word 0b10101001101111110111101111111101.
Now, with our knowledge of the structure of LDR (immediate), we can easily see the structure of the instruction word.
The following table shows the assembly code and corresponding instruction words of LDR (immediate), unsigned offset instructions:
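Bit IDs | 31-22 | 21-10 | 9-5 | 4-0 |
---|---|---|---|---|
Field | | imm12 | Rn | Rt |
Pattern | 1s11100101 | imm12 | Rn | Rt |
ldr w0, [x0] | 1011100101 | 000000000000 | 00000 | 00000 |
ldr w1, [x0] | 1011100101 | 000000000000 | 00000 | 00001 |
ldr w5, [x0] | 1011100101 | 000000000000 | 00000 | 00101 |
ldr x1, [x3] | 1111100101 | 000000000000 | 00011 | 00001 |
ldr x1, [x3, #32] | 1111100101 | 000000000100 | 00011 | 00001 |
ldr x1, [x3, #4088] | 1111100101 | 000111111111 | 00011 | 00001 |
ldr w1, [x3, #2044] | 1011100101 | 000111111111 | 00011 | 00001 |
ldr x0, [x0, #32760] | 1111100101 | 111111111111 | 00000 | 00000 |
.word 0xb9400000 | 1011100101 | 000000000000 | 00000 | 00000 |
.word 0xb9400001 | 1011100101 | 000000000000 | 00000 | 00001 |
.word 0xf97ffc00 | 1111100101 | 111111111111 | 00000 | 00000 |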
The first row shows the IDs of the bits, where the most significant instruction bit is given by the bit with ID 31 and the least significant one by the bit with ID 0.
The second row shows the location of the three fields imm12, Rn, and Rt as described in the A64 ISA.
The immediate imm12 has a total of 12 bits and encodes the offset.
The IDs of the two registers used are encoded in the fields Rn and Rt.
The third row shows a short form of the instruction word pattern.
The nine bits with IDs 22-29 and ID 31 are fixed; all other bits depend on how the instruction is used.
We will go through the examples one by one:
ldr w0, [x0]
General-purpose register W0 has ID 0, encoded in field Rt, so we have the value 00000 in bits 4-0. Register X0 also has the ID 0, encoded in the Rn field. We have not specified an offset. According to the A64 ISA, the offset is set to 0 by default and we get the value 000000000000 in bits 21-10. We are loading the data into a W register, so we are using the 32-bit variant of the instruction. This information is stored in the size bit s, which has bit ID 30 and is 0 in this case.
ldr w1, [x0]
Compared to the previous example, only the ID of the 32-bit W register has changed. Therefore, we now have the value 00001 in bits 4-0.
ldr w5, [x0]
Again, only the ID of the W register has changed and we get 00101 for bits 4-0.
ldr x1, [x3]
The IDs of both registers changed. Thus, we get 00001 for Rt and 00011 for Rn. Additionally, we are now loading 64 bits into register X1. This means that the s bit with ID 30 is set to 1.
ldr x1, [x3, #32]
Compared to the previous instruction, we now load from a different location in memory. Specifically, the address is given by the value in register X3 with an added offset of 32 bytes. The offset is encoded in the field imm12 as 000000000100. Note that the offset is specified in eight-byte increments. In other words, $000000000100_2 = 4$, i.e. we obtain the effective offset as $4 \cdot 8 = 32$.
ldr x1, [x3, #4088]
Only the offset has changed, which is now encoded as 000111111111 in the imm12 field: $000111111111_2 = 511$ and $511 \cdot 8 = 4088$.
ldr w1, [x3, #2044]
This is an interesting example because, at first glance, two properties change. First, we are now loading 32 bits instead of 64 bits. Second, we have an offset of 2044 bytes instead of 4088 bytes. However, the offset of the 32-bit variant of LDR (immediate) is specified in 4-byte increments. This means that the numeric value in the field imm12 is identical to the one before: 000111111111. Comparing the instruction word of this instruction to the previous one, only the s bit with ID 30 changes from 1 to 0.
ldr x0, [x0, #32760]
In this case, we load 64 bits from memory into register X0. The data is loaded from the address that we get by adding the offset 32760 to the value in register X0. The immediate is now 111111111111, i.e. 32760 is the largest offset that can be encoded in the imm12 field of the 64-bit variant of the instruction.
.word 0xb9400000
We can also write the instruction word of an instruction directly. In this case, 0xb9400000 is simply the hexadecimal representation of the 32 instruction bits of ldr w0, [x0].
.word 0xb9400001
Hexadecimal representation of the instruction ldr w1, [x0].
.word 0xf97ffc00
Hexadecimal representation of the instruction ldr x0, [x0, #32760].
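The field layout discussed above can be turned into a small encoder. The following C function is a sketch for illustration only: the name encode_ldr_uoff and its parameters are hypothetical, and it covers only the unsigned offset class of LDR (immediate).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: builds the instruction word of LDR (immediate),
 * unsigned offset, from the register IDs and the byte offset. */
uint32_t encode_ldr_uoff(bool is_64bit, unsigned rt, unsigned rn, uint32_t pimm)
{
    uint32_t scale = is_64bit ? 8u : 4u;      /* offsets are stored scaled */
    assert(rt < 32 && rn < 32);               /* register IDs use 5 bits */
    assert(pimm % scale == 0 && pimm / scale <= 0xfffu);

    uint32_t word = 0xb9400000u;              /* fixed bits 31 and 29-22, s = 0 */
    word |= (uint32_t)is_64bit << 30;         /* s bit: 1 selects the 64-bit variant */
    word |= (pimm / scale) << 10;             /* imm12 in bits 21-10 */
    word |= (uint32_t)rn << 5;                /* Rn in bits 9-5 */
    word |= (uint32_t)rt;                     /* Rt in bits 4-0 */
    return word;
}
```

With this sketch, encode_ldr_uoff(false, 0, 0, 0) gives 0xb9400000 and encode_ldr_uoff(true, 0, 0, 32760) gives 0xf97ffc00, which matches the .word examples.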
Note
Since we have powerful assemblers that can translate assembly code into machine code, knowledge of the structure and use of instruction words may seem unnecessary. However, there are situations where this knowledge comes in handy, for example:
- Assemblers may not support all available instructions. Apple’s proprietary AMX instructions are such a case, where the matrix accelerator on the M1, M2, and M3 series is only accessible by writing machine code directly. Only with the introduction of M4 in 2024 did Apple start to support the standardized Scalable Matrix Extension (SME).
- One might want to bypass the overhead of translating ASCII assembly code into machine code. This is the case in situations where machine code is generated at runtime and written directly to memory to be executed. We will discuss just-in-time machine code generation as a core building block for our tensor compiler in Section 8.
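As a first impression of the second situation, the following C sketch writes the two instruction words of load_store_0 into a byte buffer. The function names are hypothetical. The word 0xd65f03c0 is the encoding of ret, which the text above does not derive, and A64 instruction words are stored in little-endian byte order. The sketch stops at filling the buffer. Actually executing the code would additionally require executable memory and instruction cache maintenance, which are not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Appends one 32-bit instruction word to a byte buffer in little-endian
 * byte order and returns the new write position. */
size_t emit_word(uint8_t *buffer, size_t capacity, size_t position, uint32_t word)
{
    assert(buffer != NULL && position + 4 <= capacity);
    for (size_t i = 0; i < 4; i++) {
        buffer[position + i] = (uint8_t)(word >> (8 * i));
    }
    return position + 4;
}

/* Writes the two instruction words of load_store_0 into the buffer and
 * returns the number of bytes written. */
size_t emit_load_store_0(uint8_t *buffer, size_t capacity)
{
    size_t position = 0;
    position = emit_word(buffer, capacity, position, 0xb9400000u); /* ldr w0, [x0] */
    position = emit_word(buffer, capacity, position, 0xd65f03c0u); /* ret */
    return position;
}
```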
3.3. Addressing Modes
The A64 instruction set has several addressing modes for loading and storing data. In general, load and store instructions use a 64-bit base address along with an optional offset. The base address is held in one of the 31 general-purpose registers, the stack pointer, or the program counter. The offset can be encoded directly as an immediate in the 32 instruction bits or held in an offset register.
Following C1.3.3 of the Arm Architecture Reference Manual for A-profile architecture, the most important addressing modes can be summarized as follows:
- Base register only:
[base(, #0)]
In the simplest case, we use only the base register to determine the memory address.
Listing 3.3.1 Implementation of the addressing_modes_0 function, which copies 32 bits from one memory location to another.
    .section .text
    .type addressing_modes_0, %function
    .global addressing_modes_0
addressing_modes_0:
    ldr w2, [x0, #0]
    str w2, [x1, #0]
    ret
The example function in Listing 3.3.1 shows the base register only addressing mode. Specifically, the function copies 32 bits from memory at the address given by the value in X0 to the memory address given by the value in X1. The immediate offsets of the load ldr w2, [x0, #0] and the store str w2, [x1, #0] are explicitly set to zero. Alternatively, we could write ldr w2, [x0] for the load and str w2, [x1] for the store. Note that both forms result in the same instruction word because the offset is set to zero by default in the unsigned offset class of LDR (immediate) and STR (immediate).
- Base plus immediate offset:
[base(, #imm)]
Another addressing mode is given by a base address together with an immediate offset. In this case, the effective memory address is calculated by adding the immediate offset to the base address. Since the immediate is limited in size, the number of possible immediate offsets is also limited. In the case of LDR (immediate), the immediate has a size of 12 bits. Furthermore, in the 32-bit variant of the instruction, only non-negative multiples of 4 are encoded as immediates. This results in an offset range of 0 to 16380 for this variant because $(2^{12}-1) \cdot 4 = 16380$.
Listing 3.3.2 Implementation of the addressing_modes_1 function, which copies 32 bits from one memory location to another.
    .section .text
    .type addressing_modes_1, %function
    .global addressing_modes_1
addressing_modes_1:
    ldr w2, [x0, #8]
    str w2, [x1, #12]
    ret
The example function in Listing 3.3.2 shows the base plus immediate offset addressing mode. Unlike before, the effective address for the 32-bit load ldr w2, [x0, #8] is determined by adding 8 to the value in register X0. The effective address for the 32-bit store str w2, [x1, #12] is determined by adding 12 to the value in register X1.
- Base plus register offset:
[base, Xm(, LSL #imm)]
In addition to immediate offsets, we can also provide the offset through an additional register. In this case, the offset can be any 64-bit value. We can also encode an additional left shift of the offset as part of the instruction.
Listing 3.3.3 Implementation of the addressing_modes_2 function, which copies 32 bits from one memory location to another.
    .section .text
    .type addressing_modes_2, %function
    .global addressing_modes_2
addressing_modes_2:
    ldr w4, [x0, x1]
    str w4, [x2, x3]
    ret
The example function in Listing 3.3.3 illustrates the base plus register offset addressing mode. The ldr w4, [x0, x1] instruction uses LDR (register). This means that the effective memory address for the load is calculated by adding the 64-bit value in X0 to the value in X1. The effective address for the store str w4, [x2, x3] is given by adding the values in X2 and X3.
- Pre-indexed:
[base, #imm]!
So far we have only discussed addressing modes that leave the register holding the base address untouched. In contrast, pre-indexed instructions calculate the effective address by adding the immediate offset to the base address and also write this address back to the register that held the base address.
Listing 3.3.4 Implementation of the addressing_modes_3 function, which copies 2x32 bits from one memory location to another.
    .section .text
    .type addressing_modes_3, %function
    .global addressing_modes_3
addressing_modes_3:
    ldr w2, [x0, #4]!
    str w2, [x1, #4]!
    ldr w2, [x0, #4]!
    str w2, [x1, #4]!
    ret
Listing 3.3.4 illustrates the use of pre-indexed loads and stores. The pre-indexed load ldr w2, [x0, #4]! calculates the effective address by adding 4 to the value in register X0. The address is not only used for the memory access, but is also written back to register X0. So, if we interpret the original value in X0 as a C pointer to an array of 32-bit data, we increment the pointer by one, which is a four-byte increment in pointer arithmetic, and load the value it then points to, i.e. the second value of the array. Next, the instruction str w2, [x1, #4]! stores the 32 bits in W2 at the address obtained by adding 4 to the value in X1. It also writes the value incremented by 4 back to X1. The following two instructions ldr w2, [x0, #4]! and str w2, [x1, #4]! do the same with the incremented values in X0 and X1.
- Post-indexed:
[base], #imm
Post-indexed instructions use the base address for the memory access, but also add an immediate offset to the value in the register that held the base address.
Listing 3.3.5 Implementation of the addressing_modes_4 function, which copies 2x32 bits from one memory location to another.
    .section .text
    .type addressing_modes_4, %function
    .global addressing_modes_4
addressing_modes_4:
    ldr w2, [x0], #4
    str w2, [x1], #4
    ldr w2, [x0], #4
    str w2, [x1], #4
    ret
Post-indexed loads and stores are shown in Listing 3.3.5. The example is similar to the pre-indexed example in Listing 3.3.4. However, this time post-indexed loads and stores are used instead of pre-indexed ones. In the case of ldr w2, [x0], #4, this means that the instruction first loads the 32 bits from memory using the address in X0, and then increments the value in X0 by 4.
- Literal (PC-relative):
label
We can also use the value in the program counter as the base address, along with an immediate offset. For example, in the case of LDR (literal), the positive or negative immediate offset has a size of 19 bits and encodes multiples of 4. This results in a range of $-2^{18} \cdot 4 = -1048576$ to $(2^{18}-1) \cdot 4 = 1048572$ for the immediate. So we can have an offset from -1MiB to just under 1MiB relative to the program counter. In assembly code, we can use labels for the immediate offset so that the assembler will insert the correct numeric value for us.
Listing 3.3.6 Implementation of the function addressing_modes_5, which loads 32 bits from a PC-relative location and then writes them to memory.
    .section .text
    .type addressing_modes_5, %function
    .global addressing_modes_5
addressing_modes_5:
    ldr w1, my_data
    str w1, [x0]
    ret
my_data:
    .word 128
The use of a PC-relative load is illustrated in Listing 3.3.6. In the case of ldr w1, my_data, the assembler determines the immediate offset by comparing the location of the label my_data with that of the instruction. At runtime, the instruction then loads the 32 bits at the effective address obtained by adding the immediate offset to the value in the program counter. In this example, the instruction loads the value 128 into register W1.
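The difference between the register-updating and non-updating addressing modes can be modeled with C pointers. The following sketch is only an analogy under stated assumptions: all function names are hypothetical, a byte pointer stands in for the base register, and only loads are modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reads 32 bits from a byte address, as a 32-bit LDR would. */
static uint32_t load32(const unsigned char *address)
{
    uint32_t value;
    memcpy(&value, address, sizeof value);
    return value;
}

/* Base plus immediate offset, [base, #imm]: the base is left unchanged. */
uint32_t load_base_imm(const unsigned char *base, size_t imm)
{
    return load32(base + imm);
}

/* Base plus register offset, [base, Xm]: the offset comes from a register. */
uint32_t load_base_reg(const unsigned char *base, uint64_t xm)
{
    return load32(base + (size_t)xm);
}

/* Pre-indexed, [base, #imm]!: update the base first, then access memory. */
uint32_t load_pre_index(const unsigned char **base, ptrdiff_t imm)
{
    assert(base != NULL && *base != NULL);
    *base += imm;
    return load32(*base);
}

/* Post-indexed, [base], #imm: access memory first, then update the base. */
uint32_t load_post_index(const unsigned char **base, ptrdiff_t imm)
{
    assert(base != NULL && *base != NULL);
    uint32_t value = load32(*base);
    *base += imm;
    return value;
}
```

For an array {100, 101, 102, 103} of 32-bit values, two pre-indexed loads with an offset of 4 return 101 and 102, while two post-indexed loads return 100 and 101. In both cases the modeled base register ends up 8 bytes past its original value.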
3.4. Data Processing
In addition to the load and store instructions, another large group of instructions are the data-processing instructions. These instructions read operands from up to three source registers, perform the computation, and write the result back to a destination register. The registers that an instruction reads from are called source registers. The registers an instruction writes to are called destination registers. When an instruction reads from and writes to a register, it is also referred to as a destination register in the corresponding A64 ISA instruction description. In addition to register data, some instructions use immediate values. Since the ISA contains many data-processing instructions, we will provide high-level descriptions of a few examples.
- ADD (immediate):
add x3, x8, #16
Add the unsigned immediate value of 16 to the value in source register X8 and write the result to destination register X3.
- MOV (register):
mov w0, w1
Copy the 32-bit value from source register W1 to destination register W0.
- AND (shifted register):
and x17, x8, x2, lsl #2
Perform a bitwise AND of the 64-bit value in source register X8 with the 64-bit value in source register X2 after it has been logically shifted to the left by 2. Write the result to destination register X17.
- EOR (shifted register):
eor w16, w16, w16
Perform a bitwise exclusive-OR of the 32-bit value in source register W16 with the 32-bit value in source register W16 and write the result to destination register W16. This effectively zeroes the 64-bit register X16, since the 32 upper bits are also implicitly set to zero.
- MADD:
madd x0, x1, x2, x3
Multiply the 64-bit values in source registers X1 and X2, add the intermediate result to the 64-bit value in source register X3, and write the final 64-bit result to destination register X0.
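The five examples above can be mirrored in C with fixed-width unsigned integers. The function names below are hypothetical, and each function returns the value that the full 64-bit destination register would hold after the instruction. Treat it as a sketch of the arithmetic, not of the hardware.

```c
#include <assert.h>
#include <stdint.h>

/* add x3, x8, #16: returns the new value of X3. */
uint64_t model_add_imm16(uint64_t x8)
{
    return x8 + 16u;                 /* unsigned arithmetic wraps modulo 2^64 */
}

/* mov w0, w1: returns the new value of X0 (upper 32 bits are zero). */
uint64_t model_mov_w(uint32_t w1)
{
    return (uint64_t)w1;
}

/* and x17, x8, x2, lsl #shift: returns the new value of X17. */
uint64_t model_and_lsl(uint64_t x8, uint64_t x2, unsigned shift)
{
    assert(shift < 64);              /* larger shift amounts are undefined in C */
    return x8 & (x2 << shift);
}

/* eor w16, w16, w16: returns the new value of X16, which is always zero. */
uint64_t model_eor_self(uint64_t x16)
{
    uint32_t w16 = (uint32_t)x16;    /* the instruction only sees the W view */
    return (uint64_t)(w16 ^ w16);
}

/* madd x0, x1, x2, x3: returns the new value of X0. */
uint64_t model_madd(uint64_t x1, uint64_t x2, uint64_t x3)
{
    return x1 * x2 + x3;             /* lower 64 bits of the product plus X3 */
}
```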
A special class of data-processing instructions sets the NZCV condition flags based on the result of the instruction. We call these instructions flag-setting instructions. The NZCV condition flags are defined as follows:
- N
Negative condition flag. Set to 1 if the result of the last flag-setting instruction was negative.
- Z
Zero condition flag. Set to 1 if the result of the last flag-setting instruction was zero, and to 0 otherwise. A result of zero often indicates an equal result from a comparison.
- C
Carry condition flag. Set to 1 if the last flag-setting instruction resulted in a carry condition, for example an unsigned overflow on an addition.
- V
Overflow condition flag. Set to 1 if the last flag-setting instruction resulted in an overflow condition, for example a signed overflow on an addition.
—Arm Limited: C5.2.11 of the Arm Architecture Reference Manual for A-profile architecture
Many flag-setting instructions can be distinguished from their “standard” counterparts by the s suffix, for example: ADDS (shifted register), ADDS (immediate), SUBS (shifted register), SUBS (immediate).
An often-used special case is CMP, which exists in extended register, immediate, and shifted register variants.
These instructions are aliases of their SUBS counterparts, where only the condition flags are set, but the actual result of the subtraction is discarded.
For example, cmp x5, #16 is equivalent to subs xzr, x5, #16.
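How a 64-bit SUBS, and therefore CMP, sets the four flags can be sketched in C. The type nzcv_t and the function model_subs64 are hypothetical names for this illustration. One detail is not spelled out in the text above: for a subtraction, the Arm architecture sets C to 1 when no borrow occurs, that is, when the first operand is greater than or equal to the second as unsigned numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The four condition flags as separate booleans. */
typedef struct {
    bool n, z, c, v;
} nzcv_t;

/* Model of a 64-bit SUBS: computes a - b and sets the flags. Passing NULL
 * for result discards the difference, which mirrors the alias CMP. */
void model_subs64(uint64_t a, uint64_t b, uint64_t *result, nzcv_t *flags)
{
    assert(flags != NULL);
    uint64_t difference = a - b;            /* wraps modulo 2^64 */
    flags->n = (difference >> 63) != 0;     /* sign bit of the result */
    flags->z = difference == 0;
    flags->c = a >= b;                      /* carry set means no borrow */
    /* Signed overflow: the operands differ in sign and the sign of the
     * result differs from the sign of a. */
    flags->v = (((a ^ b) & (a ^ difference)) >> 63) != 0;
    if (result != NULL) {
        *result = difference;
    }
}
```

Calling model_subs64(x5, 16, NULL, &flags) then plays the role of cmp x5, #16. If x5 holds 16, the Z flag is set, which a subsequent conditional instruction can interpret as "equal".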