3. Base Instructions

The A64 base instructions form the basis of the A64 instruction set. They are further subdivided into the following instruction groups:

  • Loads and stores associated with general-purpose registers.

  • Data processing (immediate).

  • Data processing (register).

  • Branches, exception generation, and system instructions.

We cover each of the four instruction groups in a separate section by discussing some important examples and related concepts. The details of all base instructions are available in the Arm A-profile A64 Instruction Set Architecture.

3.1. Loads and Stores

The first group of instructions transfers data between memory and the general-purpose registers. An instruction that transfers data from memory to a register is called a load. Conversely, an instruction that transfers data from a register to memory is called a store.

3.1.1. LDR (immediate)

We start by discussing the LDR (immediate) instruction in detail. It transfers data from memory into a register. As its name suggests, we can use the instruction in assembly code through the mnemonic ldr. The instruction has encodings for different addressing modes (“post-index”, “pre-index”, and “unsigned offset”). Specifically, the unsigned offset class has a 32-bit variant with the encoding LDR <Wt>, [<Xn|SP>{, #<pimm>}] and a 64-bit variant with the encoding LDR <Xt>, [<Xn|SP>{, #<pimm>}].

The A64 ISA describes the assembler symbols as follows:

<Wt>

Is the 32-bit name of the general-purpose register to be transferred, encoded in the “Rt” field.

<Xn|SP>

Is the 64-bit name of the general-purpose base register or stack pointer, encoded in the “Rn” field.

<Xt>

Is the 64-bit name of the general-purpose register to be transferred, encoded in the “Rt” field.

<pimm>

For the “32-bit” variant: is the optional positive immediate byte offset, a multiple of 4 in the range 0 to 16380, defaulting to 0 and encoded in the “imm12” field as <pimm>/4. For the “64-bit” variant: is the optional positive immediate byte offset, a multiple of 8 in the range 0 to 32760, defaulting to 0 and encoded in the “imm12” field as <pimm>/8.

—Arm Limited: Arm A-profile A64 Instruction Set Architecture.

This information can be difficult to parse for those new to assembly language programming, but it becomes clearer when examining examples. Suppose we use one of the following LDR (immediate) instructions in our assembly program; then we can give high-level descriptions of the instructions:

ldr w5, [x0]

Load 32 bits (word) from memory into register W5. In memory, the data is located at the 64-bit address held in register X0.

ldr x1, [x3]

Load 64 bits (double word) from memory into register X1. In memory, the data is located at the 64-bit address held in register X3.

ldr x1, [x3, #32]

Load 64 bits (double word) from memory into register X1. In memory, the data is located at the 64-bit address obtained by adding 32 to the value in register X3.

Examples

Knowing the procedure call standard (see Section 2.6) and load instructions, we can write our first simple functions in assembly language.

Listing 3.1.1 Syntax of the function load_store_0.
int32_t load_store_0( int32_t const * a );

Suppose we want to implement a function load_store_0 with the syntax given in Listing 3.1.1. The function should take a single address as a parameter and load 32 bits of data from that address. The return value should be the 32 bits loaded.

Listing 3.1.2 Implementation of the function load_store_0, which loads 32 bits.
    .text
    .type load_store_0, %function
    .global load_store_0
load_store_0:
    ldr w0, [x0]
    ret

An example implementation of the load_store_0 function is shown in Listing 3.1.2. According to the PCS, the 64-bit address is passed to the function through register X0. The address is then used in the ldr w0, [x0] instruction, which loads 32 bits from this address into W0. Since W0 is simply a view of R0, the data in X0 is overwritten. That is, the instruction replaces the lower 32 bits with the loaded data and sets the upper 32 bits to zero. The ret instruction jumps back to the address in the link register. We will discuss the details of RET and other branch instructions in Section 3.5. According to the PCS, the return value of the function is expected by the caller to be in register W0.

Listing 3.1.3 Syntax of the function load_store_1.
int32_t load_store_1( int32_t const * a );

Now suppose we want to implement the function load_store_1 with the syntax shown in Listing 3.1.3. While the syntax is the same as load_store_0, the goal is to load data from the address with an additional offset of 16 bytes.

Listing 3.1.4 Implementation of the function load_store_1, which loads 32 bits.
    .text
    .type load_store_1, %function
    .global load_store_1
load_store_1:
    ldr w0, [x0, #16]
    ret

Listing 3.1.4 shows an example implementation of the load_store_1 function. This time, the instruction ldr w0, [x0, #16] loads data from the address obtained by adding 16 to the value in X0.
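For comparison, the behavior of the two functions can be sketched in C. The names load_store_0_ref and load_store_1_ref are hypothetical and not part of the original listings. The sketch assumes that a points to at least five valid int32_t values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C counterpart of load_store_0: ldr w0, [x0] */
int32_t load_store_0_ref( int32_t const * a ) {
  assert( a );
  return a[0];
}

/* Hypothetical C counterpart of load_store_1: ldr w0, [x0, #16].
 * A byte offset of 16 corresponds to four int32_t elements (16 / 4). */
int32_t load_store_1_ref( int32_t const * a ) {
  assert( a );
  return a[4];
}
```

Note that the assembly offset is given in bytes, while C pointer arithmetic counts in elements.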

3.1.2. STR (immediate)

Store instructions transfer data from registers to memory. STR (immediate) is an example of a store instruction. Analogous to LDR (immediate), it has encodings for different addressing modes. The unsigned offset class of STR (immediate) also has a 32-bit variant and a 64-bit variant. The encodings of the two variants are given as STR <Wt>, [<Xn|SP>{, #<pimm>}] and STR <Xt>, [<Xn|SP>{, #<pimm>}]. The assembler symbols are similar to those of the LDR (immediate) counterpart.

Again, we formulate high-level descriptions for some examples:

str w5, [x0]

Store 32 bits (word) from register W5 into memory. The data is written into memory at the 64-bit address held in register X0.

str x1, [x3]

Store 64 bits (double word) from register X1 into memory. The data is written into memory at the 64-bit address held in register X3.

str x1, [x3, #32]

Store 64 bits (double word) from register X1 into memory. The memory address is calculated by adding offset 32 to the value in register X3.

Example

We can write an example function that uses STR. This time we use both LDR and STR to copy 64 bits of data.

Listing 3.1.5 Syntax of the function load_store_2.
void load_store_2( uint64_t const * a,
                   uint64_t       * b );

The syntax of the example function load_store_2 is shown in Listing 3.1.5. The function takes two 64-bit pointers to two 64-bit unsigned integers as parameters and has no return value.

Listing 3.1.6 Implementation of the load_store_2 function, which copies 64 bits from one memory location to another.
    .text
    .type load_store_2, %function
    .global load_store_2
load_store_2:
    ldr x0, [x0]
    str x0, [x1]
    ret

Listing 3.1.6 shows an example implementation of the function. According to the PCS, the two 64-bit parameters are passed through the general-purpose registers X0 and X1. The first instruction ldr x0, [x0] loads 64 bits of data from the address held in register X0 and overwrites the register with the loaded data. The second instruction str x0, [x1] stores the data in X0 into memory at the address held in register X1. Both instructions together copy 64 bits of data from one memory location to another. The last instruction ret jumps back to the address held in the link register (X30).

3.1.3. Load/Store Pair

There are also some instructions that can load data from memory to multiple registers at once, or store data from multiple registers into memory. LDP and STP are such instructions that use two general-purpose registers. The 64-bit LDP variant in signed offset addressing mode has the encoding LDP <Xt1>, <Xt2>, [<Xn|SP>{, #<imm>}], while the 64-bit variant of STP in signed offset addressing mode has the encoding STP <Xt1>, <Xt2>, [<Xn|SP>{, #<imm>}]. Both instructions load/store a total of 128 bits of data.

As before, we formulate high-level descriptions of some examples and then show an example application:

ldp x5, x7, [x2]

Load 2x64 bits from memory into registers X5 and X7. In memory, the data is located at the 64-bit address held in register X2.

stp w2, w5, [x3]

Store 2x32 bits from registers W2 and W5 into memory. The data is written into memory at the 64-bit address held in register X3.

Example

In this example, we use LDP and STP to copy two 64-bit integers from one memory location to another.

Listing 3.1.7 Syntax of the function load_store_3.
void load_store_3( int64_t const * a,
                   int64_t       * b );

The syntax of such a function could be similar to load_store_3 shown in Listing 3.1.7. The function takes two 64-bit pointers to two arrays of signed 64-bit integers as parameters and has no return value.

Listing 3.1.8 Implementation of the load_store_3 function, which copies 128 bits from one memory location to another.
    .text
    .type load_store_3, %function
    .global load_store_3
load_store_3:
    ldp x2, x3, [x0]
    stp x2, x3, [x1]
    ret

An example implementation is shown in Listing 3.1.8. The implementation consists of three instructions. First, ldp x2, x3, [x0] loads 2x64 bits from memory into registers X2 and X3 using the address in parameter register X0. Second, stp x2, x3, [x1] stores 2x64 bits from registers X2 and X3 to memory using the address in parameter register X1. Finally, ret branches to the program address in the link register (X30).
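The two copy functions can be mirrored in C to make the data movement explicit. This is only a sketch: the names load_store_2_ref and load_store_3_ref and the temporaries are hypothetical, and a compiler is free to choose different instructions than the ones named in the comments.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C counterpart of load_store_2: one 64-bit load, one 64-bit store. */
void load_store_2_ref( uint64_t const * a, uint64_t * b ) {
  assert( a && b );
  uint64_t tmp = a[0];   /* ldr x0, [x0] */
  b[0] = tmp;            /* str x0, [x1] */
}

/* Hypothetical C counterpart of load_store_3: one pair load, one pair store. */
void load_store_3_ref( int64_t const * a, int64_t * b ) {
  assert( a && b );
  int64_t t0 = a[0];     /* ldp x2, x3, [x0] loads both values */
  int64_t t1 = a[1];
  b[0] = t0;             /* stp x2, x3, [x1] stores both values */
  b[1] = t1;
}
```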

3.2. Machine Code

In Section 2.3 we have already seen that we can use the assembler to translate assembly code into instruction words. We use the term instruction word to refer to the full 32-bit encoding of a single A64 instruction (sometimes also called the instruction encoding). However, we have only briefly looked at the instruction stp x29, x30, [sp, -16]! and its instruction word 0b10101001101111110111101111111101. Now, with our knowledge of the structure of LDR (immediate), we can easily see the structure of the instruction word.

The following table shows the assembly code and corresponding instruction words of LDR (immediate), unsigned offset instructions:

Bit IDs                 31-22       21-10         9-5    4-0
Field                               imm12         Rn     Rt
Pattern                 1s11100101  iiiiiiiiiiii  nnnnn  ttttt
ldr w0, [x0]            1011100101  000000000000  00000  00000
ldr w1, [x0]            1011100101  000000000000  00000  00001
ldr w5, [x0]            1011100101  000000000000  00000  00101
ldr x1, [x3]            1111100101  000000000000  00011  00001
ldr x1, [x3, #32]       1111100101  000000000100  00011  00001
ldr x1, [x3, #4088]     1111100101  000111111111  00011  00001
ldr w1, [x3, #2044]     1011100101  000111111111  00011  00001
ldr x0, [x0, #32760]    1111100101  111111111111  00000  00000
.word 0xb9400000        1011100101  000000000000  00000  00000
.word 0xb9400001        1011100101  000000000000  00000  00001
.word 0xf97ffc00        1111100101  111111111111  00000  00000

The first row shows the IDs of the bits, where the most significant instruction bit is given by the bit with ID 31 and the least significant one by the bit with ID 0. The second row shows the location of the three fields imm12, Rn, and Rt as described in the A64 ISA. The immediate imm12 has a total of 12 bits and encodes the offset. Bit 30 (the size bit) determines whether this is the 32-bit (0) or 64-bit (1) variant. The base register and destination register IDs are encoded in fields Rn (bits 9-5) and Rt (bits 4-0) respectively. The third row shows a short form of the instruction word pattern. The nine bits with IDs 22-29 and ID 31 are fixed; all other bits depend on how the instruction is used. We go through the examples one by one:

ldr w0, [x0]

General-purpose register W0 has ID 0, encoded in field Rt, so we have the value 00000 in bits 4-0. Register X0 also has the ID 0, encoded in the Rn field. We have not specified an offset. According to the A64 ISA, the offset is set to 0 by default and we get the value 000000000000 in bits 21-10. We are loading the data into a W register, so we are using the 32-bit variant of the instruction. This information is stored in the size bit s, which has bit ID 30 and is 0 in this case.

ldr w1, [x0]

Compared to the previous example, only the ID of the 32-bit W register has changed. Therefore, we now have the value 00001 in bits 4-0.

ldr w5, [x0]

Again, only the ID of the W register has changed and we get 00101 for bits 4-0.

ldr x1, [x3]

The IDs of both registers changed. Thus, we get 00001 for Rt and 00011 for Rn. Additionally, we are now loading 64 bits into register X1. This means that the s bit with ID 30 is set to 1.

ldr x1, [x3, #32]

Compared to the previous instruction, we now load from a different location in memory. Specifically, the address is given by the value in register X3 with an added offset of 32 bytes. The offset is encoded in the field imm12 as 000000000100. Note that for the 64-bit variant, the offset is specified in eight-byte increments. In other words, \(100_2 = 4_{10}\), i.e., we obtain the effective offset as \(4 \cdot 8=32\).

ldr x1, [x3, #4088]

Only the offset has changed, which is now encoded as 000111111111 in the imm12 field: \(1 1111 1111_2 = 511_{10}\) and \(511 \cdot 8=4088\).

ldr w1, [x3, #2044]

This is an interesting example because, at first glance, two properties change. First, we are now loading 32 bits instead of 64 bits. Second, we have an offset of 2044 bytes instead of 4088 bytes. However, the offset of the 32-bit variant of LDR (immediate) is specified in 4-byte increments. This means that the numeric value in the field imm12 is identical to the one before: 000111111111. Comparing the instruction word of this instruction to the previous one, only the s bit with ID 30 changes from 1 to 0.

ldr x0, [x0, #32760]

In this case, we load 64 bits from memory into register X0. The data is loaded from the address that we get by adding the offset 32760 to the value in register X0. The immediate is now 111111111111, i.e., 32760 is the largest offset that can be encoded in the imm12 field of the 64-bit variant of the instruction.

.word 0xb9400000

We can also write the instruction word of an instruction directly. In this case, 0xb9400000 is simply the hexadecimal representation of the 32 instruction bits of ldr w0, [x0].

.word 0xb9400001

Hexadecimal representation of the instruction ldr w1, [x0].

.word 0xf97ffc00

Hexadecimal representation of the instruction ldr x0, [x0, #32760].
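The field layout discussed above can be turned into a small encoder. The following C function is a minimal sketch; the name encode_ldr_uoff and its parameters are hypothetical. It assembles the instruction word of LDR (immediate), unsigned offset, from the size bit, the two register IDs, and the byte offset.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: builds the instruction word of
 * LDR (immediate), unsigned offset.
 *   is_64bit: 0 selects the W variant, nonzero the X variant (bit 30)
 *   rt, rn:   register IDs in the range 0 to 31
 *   pimm:     byte offset, a multiple of 4 (W) or 8 (X) */
uint32_t encode_ldr_uoff( int is_64bit, uint32_t rt, uint32_t rn, uint32_t pimm ) {
  uint32_t scale = is_64bit ? 8 : 4;
  assert( rt < 32 && rn < 32 );
  assert( pimm % scale == 0 && pimm / scale < 4096 );

  uint32_t word = 0xb9400000u;                 /* fixed bits with s = 0 */
  word |= (uint32_t) ( is_64bit != 0 ) << 30;  /* size bit s */
  word |= ( pimm / scale ) << 10;              /* imm12 in bits 21-10 */
  word |= rn << 5;                             /* Rn in bits 9-5 */
  word |= rt;                                  /* Rt in bits 4-0 */
  return word;
}
```

For instance, encode_ldr_uoff( 1, 0, 0, 32760 ) yields 0xf97ffc00, the instruction word of ldr x0, [x0, #32760] from the table.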

Note

Since we have powerful assemblers that can translate assembly code into machine code, knowledge of the structure and use of instruction words may seem unnecessary. However, there are situations where this knowledge comes in handy, for example:

  • Assemblers may not support all available instructions. Apple’s proprietary AMX instructions are such a case, where the matrix accelerator on the M1, M2, and M3 series is only accessible by writing machine code directly. Only with the introduction of M4 in 2024 did Apple start to support the standardized Scalable Matrix Extension (SME).

  • One might want to bypass the overhead of translating ASCII assembly code into machine code. This is the case in situations where machine code is generated at runtime and written directly to memory to be executed. We will discuss just-in-time machine code generation as a core building block for our tensor compiler in Section 8.

3.3. Addressing Modes

The A64 instruction set has several addressing modes for loading and storing data. In general, load and store instructions use a 64-bit base address along with an optional offset. The base address is held in one of the 31 general-purpose registers, the stack pointer, or the program counter. The offset can be encoded directly as an immediate in the 32 instruction bits or held in an offset register.

Following C1.3.3 of the Arm Architecture Reference Manual for A-profile architecture, the most important addressing modes can be summarized as follows:

Base register only: [base{, #0}]

In the simplest case, we use only the base register to determine the memory address.

Listing 3.3.1 Implementation of the addressing_modes_0 function, which copies 32 bits from one memory location to another.
    .section .text
    .type addressing_modes_0, %function
    .global addressing_modes_0
addressing_modes_0:
    ldr w2, [x0, #0]
    str w2, [x1, #0]
    ret

The example function in Listing 3.3.1 shows the base register only addressing mode. Specifically, the function copies 32 bits from the memory address in X0 to the memory address in X1. The immediate offsets of load ldr w2, [x0, #0] and store str w2, [x1, #0] are explicitly set to zero. Alternatively, we could write ldr w2, [x0] for the load and str w2, [x1] for the store. Note that both forms result in the same instruction word because the offset is set to zero by default in the unsigned offset class of LDR (immediate) and STR (immediate).

Base plus immediate offset: [base{, #imm}]

Another addressing mode is given by a base address together with an immediate offset. In this case, the effective memory address is calculated by adding the immediate offset to the base address. Since the immediate is limited in size, the number of possible immediate offsets is also limited. In the case of the unsigned offset encoding of LDR (immediate), the immediate field has a size of 12 bits. For the 32-bit variant, the immediate is scaled by 4, allowing offsets from 0 to 16380 in multiples of 4.

Listing 3.3.2 Implementation of the addressing_modes_1 function, which copies 32 bits from one memory location to another.
    .section .text
    .type addressing_modes_1, %function
    .global addressing_modes_1
addressing_modes_1:
    ldr w2, [x0, #8]
    str w2, [x1, #12]
    ret

The example function in Listing 3.3.2 shows the base plus immediate offset addressing mode. Unlike before, the effective address for the 32-bit load ldr w2, [x0, #8] is determined by adding 8 to the value in register X0. The effective address for the 32-bit store str w2, [x1, #12] is determined by adding 12 to the value in register X1.

Base plus register offset: [base, Xm{, LSL #imm}]

In addition to immediate offsets, we can also provide the offset through an additional register. In this case, the offset can be any 64-bit value. We can also encode an additional left shift of the offset as part of the instruction.

Listing 3.3.3 Implementation of the addressing_modes_2 function, which copies 32 bits from one memory location to another.
    .section .text
    .type addressing_modes_2, %function
    .global addressing_modes_2
addressing_modes_2:
    ldr w4, [x0, x1]
    str w4, [x2, x3]
    ret

The example function in Listing 3.3.3 illustrates the base plus register offset addressing mode. The ldr w4, [x0, x1] instruction uses LDR (register). This means that the effective memory address is calculated by adding the 64-bit value in X0 to the value in X1. The effective address for the store str w4, [x2, x3] is given by adding the values in X2 and X3.
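The address calculation of the register offset mode, with and without the optional left shift, can be modeled in C. The sketch below uses hypothetical function names and assumes that the computed address lies inside a valid array.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of ldr w4, [x0, x1]:
 * the offset register holds a byte offset. */
int32_t load_reg_offset( int32_t const * base, uint64_t offset_bytes ) {
  assert( base );
  int32_t value;
  memcpy( &value, (char const *) base + offset_bytes, sizeof( value ) );
  return value;
}

/* Hypothetical model of ldr w4, [x0, x1, lsl #2]:
 * the offset register holds an element index that is shifted left by 2,
 * i.e., multiplied by 4, the size of a 32-bit element. */
int32_t load_reg_offset_lsl2( int32_t const * base, uint64_t index ) {
  return load_reg_offset( base, index << 2 );
}
```

With the shift, the offset register can hold an array index directly instead of a precomputed byte offset.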

Pre-indexed: [base, #imm]!

So far we have only discussed addressing modes that leave the register holding the base address untouched. In contrast, pre-indexed instructions calculate the effective address by adding the immediate offset to the base address and also write this address back to the register that held the base address.

Listing 3.3.4 Implementation of the addressing_modes_3 function, which copies 2x32 bits from one memory location to another.
    .section .text
    .type addressing_modes_3, %function
    .global addressing_modes_3
addressing_modes_3:
    ldr w2, [x0, #4]!
    str w2, [x1, #4]!
    ldr w2, [x0, #4]!
    str w2, [x1, #4]!
    ret

Listing 3.3.4 illustrates pre-indexed loads and stores. The pre-indexed load ldr w2, [x0, #4]! calculates the effective address by adding 4 to the value in register X0. The address is not only used for the memory access, but is also written back to register X0. So, if we interpret the original value in X0 as a C pointer to an array of 32-bit data, we first increment the pointer by one, which is a four-byte increment in pointer arithmetic, and then load the value it now points to, i.e., the second value of the array. Next, the instruction str w2, [x1, #4]! stores the 32 bits in W2 at the address obtained by adding 4 to the value in X1. It also writes the value incremented by 4 back to X1. The following two instructions ldr w2, [x0, #4]! and str w2, [x1, #4]! do the same with the incremented values in X0 and X1.

Post-indexed: [base], #imm

Post-indexed instructions use the base address for the memory access, but also add an immediate offset to the value in the register that held the base address.

Listing 3.3.5 Implementation of the addressing_modes_4 function, which copies 2x32 bits from one memory location to another.
    .section .text
    .type addressing_modes_4, %function
    .global addressing_modes_4
addressing_modes_4:
    ldr w2, [x0], #4
    str w2, [x1], #4
    ldr w2, [x0], #4
    str w2, [x1], #4
    ret

Post-indexed loads and stores are shown in Listing 3.3.5. The example is similar to the pre-indexed example in Listing 3.3.4. However, this time post-indexed loads and stores are used instead of pre-indexed ones. In the case of ldr w2, [x0], #4, this means that the instruction first loads the 32 bits from memory using the address in X0, and then increments the value in X0 by 4.
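The difference between the two write-back modes maps closely to the pre-increment and post-increment operators in C. The following sketch uses the hypothetical names copy_pre_indexed and copy_post_indexed and assumes that both arrays hold at least three elements.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of Listing 3.3.4: pre-indexed accesses.
 * Each pointer is advanced first and then used. */
void copy_pre_indexed( int32_t const * a, int32_t * b ) {
  assert( a && b );
  *++b = *++a;   /* ldr w2, [x0, #4]! followed by str w2, [x1, #4]! */
  *++b = *++a;
}

/* Hypothetical model of Listing 3.3.5: post-indexed accesses.
 * Each pointer is used first and then advanced. */
void copy_post_indexed( int32_t const * a, int32_t * b ) {
  assert( a && b );
  *b++ = *a++;   /* ldr w2, [x0], #4 followed by str w2, [x1], #4 */
  *b++ = *a++;
}
```

The pre-indexed version therefore copies the elements at indices 1 and 2, while the post-indexed version copies the elements at indices 0 and 1.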

Literal (PC-relative): label

We can also use the value in the program counter as the base address, along with an immediate offset. For example, in the case of LDR (literal), the positive or negative immediate offset has a size of 19 bits and encodes multiples of 4. This results in a range of ±1 MiB (±1,048,576 bytes) for the offset, calculated as \(\pm 2^{18} \times 4\) bytes. Because the immediate is a signed two's-complement value, the largest positive offset is 4 bytes short of 1 MiB, so the offset ranges from -1 MiB to 1 MiB minus 4 bytes relative to the program counter. In assembly code, we can use labels for the immediate offset so that the assembler will insert the correct numeric value for us.

Listing 3.3.6 Implementation of the function addressing_modes_5, which loads 32 bits from a PC-relative location and then writes them to memory.
    .section .text
    .type addressing_modes_5, %function
    .global addressing_modes_5
addressing_modes_5:
    ldr w1, my_data
    str w1, [x0]
    ret

my_data:
    .word 128

The use of a PC-relative load is illustrated in Listing 3.3.6. In the case of ldr w1, my_data, the assembler would determine the immediate offset by comparing the location of the label my_data with that of the instruction. At runtime, the instruction then loads the 32 bits at the effective address obtained by adding the immediate offset to the value in the program counter. In this example, the instruction loads the value 128 to register W1.
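The offset calculation done by the assembler can be sketched as follows. The helper encode_ldr_literal_w is hypothetical; it builds the instruction word of the 32-bit LDR (literal) variant from a byte offset relative to the address of the instruction itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: instruction word of the 32-bit LDR (literal)
 * variant for a byte offset relative to the instruction. */
uint32_t encode_ldr_literal_w( uint32_t rt, int32_t offset_bytes ) {
  assert( rt < 32 );
  assert( offset_bytes % 4 == 0 );
  assert( offset_bytes >= -( 1 << 20 ) && offset_bytes < ( 1 << 20 ) );

  /* signed offset in multiples of 4, truncated to 19 bits */
  uint32_t imm19 = (uint32_t) ( offset_bytes / 4 ) & 0x7ffffu;
  return 0x18000000u | ( imm19 << 5 ) | rt;
}
```

In Listing 3.3.6, the label my_data is located three instructions, i.e., 12 bytes, after the load. Under that assumption, encode_ldr_literal_w( 1, 12 ) produces the word 0x18000061 for ldr w1, my_data.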

3.4. Data Processing

In addition to load and store instructions, another large group is the data-processing instructions. These instructions read operands from one or more source registers, perform the computation, and write the result back to a destination register. The registers that an instruction reads from are called source registers. The registers an instruction writes to are called destination registers. When an instruction reads from and writes to a register, that register is also referred to as a destination register in the corresponding A64 ISA instruction description. In addition to register data, some instructions use immediate values. Since the ISA contains many data-processing instructions, we formulate high-level descriptions for a few examples:

ADD (immediate): add x3, x8, #16

Add the unsigned immediate value of 16 to the value in source register X8 and write the result to destination register X3.

MOV (register): mov w0, w1

Copy the 32-bit value from source register W1 to destination register W0.

AND (shifted register): and x17, x8, x2, lsl #2

Perform a bitwise AND of the 64-bit value in source register X8 with the 64-bit value in source register X2 after it has been logically shifted to the left by 2. Write the result to destination register X17.

EOR (shifted register): eor w16, w16, w16

Perform a bitwise exclusive-OR of the 32-bit value in source register W16 with itself and write the result to destination register W16. This clears the lower 32 bits to zero. Since this is a 32-bit operation, the upper 32 bits of X16 are also implicitly set to zero, effectively zeroing the entire 64-bit register.

MADD: madd x0, x1, x2, x3

Multiply the 64-bit values in source registers X1 and X2, add the intermediate result to the 64-bit value in source register X3, and write the final 64-bit result to destination register X0.
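The arithmetic of four of these examples can be written down as C expressions. The function names below are hypothetical; each function takes the source register values as parameters and returns the value written to the 64-bit destination register.

```c
#include <assert.h>
#include <stdint.h>

/* add x3, x8, #16 */
uint64_t model_add_imm( uint64_t x8 ) {
  return x8 + 16;
}

/* and x17, x8, x2, lsl #2 */
uint64_t model_and_lsl( uint64_t x8, uint64_t x2 ) {
  return x8 & ( x2 << 2 );
}

/* eor w16, w16, w16: the 32-bit result is zero-extended to 64 bits */
uint64_t model_eor_w( uint64_t x16 ) {
  uint32_t w16 = (uint32_t) x16;
  return (uint64_t) ( w16 ^ w16 );
}

/* madd x0, x1, x2, x3: the result wraps around modulo 2^64 */
uint64_t model_madd( uint64_t x1, uint64_t x2, uint64_t x3 ) {
  return x3 + x1 * x2;
}
```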

A special class of data-processing instructions sets the NZCV condition flags based on the result of the instruction. We call these instructions flag-setting instructions. The NZCV condition flags are defined as follows:

N

Negative condition flag. Set to 1 if the result of the last flag-setting instruction was negative.

Z

Zero condition flag. Set to 1 if the result of the last flag-setting instruction was zero, and to 0 otherwise. A result of zero often indicates an equal result from a comparison.

C

Carry condition flag. Set to 1 if the last flag-setting instruction resulted in a carry condition, for example an unsigned overflow on an addition.

V

Overflow condition flag. Set to 1 if the last flag-setting instruction resulted in an overflow condition, for example a signed overflow on an addition.

—Arm Limited: C5.2.11 of the Arm Architecture Reference Manual for A-profile architecture

Many flag-setting instructions can be distinguished from their “standard” counterparts by the s suffix, for example: ADDS (shifted register), ADDS (immediate), SUBS (shifted register), SUBS (immediate).

An often-used special case is CMP, which exists in extended register, immediate, and shifted register variants. These instructions are aliases of their SUBS counterparts, where only the condition flags are set, but the actual result of the subtraction is discarded. For example, cmp x5, #16 is equivalent to subs xzr, x5, #16, where xzr is the zero register.
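How a 64-bit SUBS or CMP derives the four flags can be modeled in C. The type nzcv_t and the function flags_subs64 are hypothetical names used for illustration. Note that for a subtraction, the carry flag is 1 if no borrow occurred, that is, if the first operand is greater than or equal to the second when both are interpreted as unsigned values.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int n, z, c, v; } nzcv_t;

/* Hypothetical model of the flags set by a 64-bit SUBS/CMP computing a - b. */
nzcv_t flags_subs64( uint64_t a, uint64_t b ) {
  uint64_t r = a - b;
  nzcv_t f;
  f.n = (int) ( r >> 63 );                          /* sign bit of the result */
  f.z = ( r == 0 );                                 /* result is zero */
  f.c = ( a >= b );                                 /* no unsigned borrow */
  f.v = (int) ( ( ( a ^ b ) & ( a ^ r ) ) >> 63 );  /* signed overflow */
  return f;
}
```

For cmp x5, #16 with X5 holding 16, the model yields Z = 1 and C = 1, while N and V are 0.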

3.5. Branches

The last group of base instructions is branches, exception generation, and system instructions. Branching instructions are the most important category out of these and the topic of this section. Until this point, we discussed load and store instructions, as well as data processing instructions. Several of these instructions in a code block are simply executed one after another (see also Section 2.1), leading to a linear program flow. Branching instructions allow us to deviate from the linear program flow and jump to different positions in our program. We already used the two branching instructions BL and RET to jump to the start of a function and back to the calling context.

Before discussing example applications, we briefly introduce the most important branching instructions BL, RET, B, B.cond, CBZ, CBNZ:

Branch with Link (BL)

Branch to an offset relative to the program counter and set the link register (X30) to PC+4. The 26-bit immediate in the field imm26 encodes a signed integer that is multiplied by four to obtain the offset. Thus, we obtain an offset range of ±128 MiB, calculated as \(\pm 2^{25} \times 4\) bytes.

In assembly code we can use a label instead of a numeric value. If the offset to the label is outside of the ±128 MiB range, the linker inserts a veneer to reach the destination.

Return from subroutine (RET)

Branch unconditionally to the address in a register. The register defaults to LR (X30).

Branch (B)

Branch unconditionally to an offset relative to the program counter. The offset is stored in a 26-bit immediate, leading to ±128 MiB range.

Branch conditionally (B.cond)

Branch conditionally to an offset relative to the program counter. The 19-bit immediate in the field imm19 encodes the offset with range ±1 MiB. The branching condition is encoded in the field cond. For example, 0000 refers to the equal (EQ) condition, while 1011 refers to less than (LT). The condition specifies a bit pattern for a subset of the NZCV condition flags that has to be fulfilled for the branch to be executed. In the case of the EQ condition, the zero condition flag (Z) has to be 1 for the branch to execute. For LT condition, the negative condition flag (N) has to be unequal to the overflow condition flag (V).

Compare and branch on zero (CBZ)

Branch to an offset relative to the program counter if a register is zero. The 5-bit register ID is encoded in the field Rt. The offset has a range of ±1 MiB and is encoded in the 19-bit immediate in field imm19.

Compare and branch on nonzero (CBNZ)

Branch to an offset relative to the program counter if a register is nonzero. The field Rt encodes the 5-bit register ID, and the field imm19 the 19-bit immediate, leading to a ±1 MiB offset range.

Examples

The following examples are available in the archive branches.tar.xz together with a driver showcasing them.

Listing 3.5.1 Implementation of the function branches_0.
1     .text
2     .type branches_0, %function
3     .global branches_0
4 branches_0:
5     mov w0, #1
6     b my_label
7     mov w0, #2
8 my_label:
9     ret

Our first example has the function signature int32_t branches_0() and is shown in Listing 3.5.1. In line 5 the function sets the value of the return register W0 to 1. Next, the unconditional branch in line 6 is executed and jumps to my_label. Thus, we jump 8 bytes forward relative to the current program counter. Effectively, we jump over mov w0, #2 and execute ret next.

In summary, calling branches_0 will always return 1.
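The relation between the byte offset of a branch and its imm26 field can be sketched in C. The helper encode_branch is a hypothetical name; it builds the instruction word of B or BL for an offset given relative to the branch instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: instruction word of B (link = 0) or BL (link != 0)
 * for a byte offset relative to the branch instruction. */
uint32_t encode_branch( int link, int32_t offset_bytes ) {
  assert( offset_bytes % 4 == 0 );
  assert( offset_bytes >= -( 1 << 27 ) && offset_bytes < ( 1 << 27 ) );

  /* signed offset in multiples of 4, truncated to 26 bits */
  uint32_t imm26 = (uint32_t) ( offset_bytes / 4 ) & 0x3ffffffu;
  return ( link ? 0x94000000u : 0x14000000u ) | imm26;
}
```

For the 8-byte forward jump of b my_label in branches_0, encode_branch( 0, 8 ) gives 0x14000002, i.e., imm26 holds the value 2.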

Listing 3.5.2 Implementation of the function branches_1.
 1     .text
 2     .type branches_1, %function
 3     .global branches_1
 4 branches_1:
 5     cmp w0, #25
 6     mov w0, #1
 7     b.eq my_label
 8     mov w0, #2
 9 my_label:
10     ret

Listing 3.5.2 shows the implementation of the function int32_t branches_1(int32_t). The instruction cmp w0, #25 sets the NZCV condition flags based on the result of the subtraction of 25 from the value in the first parameter register W0. Specifically, the Z condition flag is set to 1 if W0 holds the value 25, and set to 0 in all other cases.

Next, the instruction in line 6 sets the value of W0 to 1. Line 7 contains the conditional branch b.eq my_label. The branch jumps to my_label if the Z condition flag is 1, as implied by the EQ condition. If not, the function continues linearly and executes mov w0, #2 which sets the value of W0 to 2. ret is the last instruction executed in either case and jumps to the address in the link register (LR).

In summary, calling branches_1 returns 1 if the parameter 25 is passed and 2 in all other cases.

Listing 3.5.3 Implementation of the function branches_2.
1     .text
2     .type branches_2, %function
3     .global branches_2
4 branches_2:
5     cbz w0, my_label
6     mov w0, #1
7 my_label:
8     ret

Listing 3.5.3 shows an example function using CBZ. In the given case, cbz w0, my_label jumps only if parameter register W0 is zero.

Thus, the function returns 0 if called with parameter 0, because the branch skips mov w0, #1 and W0 is left unmodified. In all other cases, mov w0, #1 is executed and the function returns 1.

Listing 3.5.4 Implementation of the function branches_3.
1     .text
2     .type branches_3, %function
3     .global branches_3
4 branches_3:
5     cbnz w0, my_label
6     mov w0, #1
7 my_label:
8     ret

The function branches_3 in Listing 3.5.4 is similar to branches_2. However, in this case the CBNZ instruction in line 5 branches if the value in W0 is nonzero. Therefore, the function returns 1 if called with the parameter 0: branches_3(0), since the instruction mov w0, #1 is executed. In all other cases, the function does not modify W0, meaning that the value of the parameter is also the function’s return value.
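The control flow of the three listings can be cross-checked against C sketches that follow the assembly code instruction by instruction. The names with the suffix _ref are hypothetical and not part of the archive.

```c
#include <assert.h>
#include <stdint.h>

/* Listing 3.5.2: cmp w0, #25 / mov w0, #1 / b.eq skips mov w0, #2 */
int32_t branches_1_ref( int32_t a ) {
  return ( a == 25 ) ? 1 : 2;
}

/* Listing 3.5.3: cbz skips mov w0, #1 if the parameter is zero */
int32_t branches_2_ref( int32_t a ) {
  return ( a == 0 ) ? a : 1;
}

/* Listing 3.5.4: cbnz skips mov w0, #1 if the parameter is nonzero */
int32_t branches_3_ref( int32_t a ) {
  return ( a != 0 ) ? a : 1;
}
```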

Listing 3.5.5 Implementation of the function branches_4.
 1     .text
 2     .type branches_4, %function
 3     .global branches_4
 4 branches_4:
 5     mov x2, x0
 6
 7     cbnz x1, my_label
 8     mov x0, #1
 9     b my_end
10
11 my_label:
12     subs x1, x1, #1
13     b.eq my_end
14
15 my_loop:
16     subs x1, x1, #1
17     mul x0, x0, x2
18     b.ne my_loop
19
20 my_end:
21     ret

The example in Listing 3.5.5 implements the function uint64_t branches_4(uint64_t, uint64_t). Effectively, branches_4 computes the n-th power of a number. For example, the function returns \(2^8=256\) when passing the parameters 2 as base and 8 as exponent: branches_4(2, 8).

mov x2, x0 copies the base from register X0 to X2. cbnz x1, my_label in line 7 jumps to my_label if the exponent is nonzero.

If the exponent is zero, program execution continues linearly and the instructions in lines 8 and 9 are executed next. mov x0, #1 sets return register X0 to 1 and b my_end jumps to the end of the function.

If the exponent is nonzero, the instructions in lines 12 and 13 are executed next. subs x1, x1, #1 decrements the value of X1 and sets the NZCV condition flags. Next, the conditional branch in line 13 jumps to the end of the function if the Z condition flag is set to 1. This is the case if the result of the previous instruction was zero, which holds if the passed exponent was 1. In that case the value in register X0 is returned unmodified, meaning that we simply return the base if the exponent is 1.

The block in lines 15-18 covers the general case where the exponent is greater than 1. In line 16 the value of X1 is decremented and the NZCV condition flags are set. mul x0, x0, x2 multiplies the current value in X0 with the base in register X2, and writes the result to X0. The conditional branch b.ne my_loop only jumps to label my_loop if the Z condition flag is not set. This means that the jump is only executed if the result of the SUBS instruction in line 16 was nonzero. In summary, we have written a loop that multiplies X0 by the base exponent-1 times, effectively computing the exponentiation.
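The structure of branches_4 corresponds to two early exits followed by a do-while loop. The following C function is a sketch that mirrors the assembly code; the name branches_4_ref and the variable names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C counterpart of branches_4 that follows its control flow. */
uint64_t branches_4_ref( uint64_t base, uint64_t exponent ) {
  uint64_t result = base;     /* X0 holds the base on entry */

  if( exponent == 0 ) {       /* cbnz x1, my_label not taken */
    return 1;                 /* mov x0, #1 and b my_end */
  }

  exponent--;                 /* subs x1, x1, #1 */
  if( exponent == 0 ) {       /* b.eq my_end */
    return result;
  }

  do {
    exponent--;               /* subs x1, x1, #1 */
    result *= base;           /* mul x0, x0, x2 */
  } while( exponent != 0 );   /* b.ne my_loop */

  return result;
}
```

As in the assembly version, the result wraps around modulo 2^64 if the power does not fit into 64 bits.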