2. Assembly Language#

Assembly language is a low-level language that is tied to specific classes of hardware. In contrast, C/C++ is a high-level language: any system with an appropriate compiler can compile a C/C++ program into assembly language. This is why C/C++ code is portable and assembly code is not. If we want to target a different core, say an x86-64 core instead of an AArch64 core, we simply recompile the C/C++ code and generate new assembly code. If we had written the assembly code ourselves, we would be stuck and would have to start over. So why do we care about assembly code? The answer is simple: performance.

This chapter covers the basics of assembly language for the Arm architecture. Section 3 introduces base instructions, followed by Section 4, which introduces vector instructions for the heavy lifting in tensor workloads. We’ll then use these basics to write a set of performance-critical primitives for our tensor compiler in Section 7.

2.1. AArch64#

../_images/arch.svg — Fig. 2.1.1 Illustration of a load-store architecture.#

Before jumping into assembly language, let us define a very simple model of a CPU core to help us understand how programs are executed in hardware. Our core, shown in Fig. 2.1.1, has the following components:

Memory: Large but slow storage for instructions and data.
Registers: Fast but small storage for data.
Program Counter (PC): Location of the current instruction in memory.
Arithmetic Logic Unit (ALU): Circuit that performs arithmetic and logical operations on binary numbers.

We also assume that we are programming a load-store architecture. This means that the CPU core strictly separates data movement instructions from data processing instructions. Data movement instructions are responsible for copying data from memory to registers (load) and from registers to memory (store). Data processing instructions only modify data in registers, never data in memory. In this book, we will learn how to program AArch64, the 64-bit execution state of the Arm architecture. AArch64 is a load-store architecture and each instruction has a fixed length of 32 bits or 4 bytes.

In a very simplified way, the core executes instructions using the following procedure:

Fetch the instruction from memory at the address held in the PC.
Increment the PC by 4 so that it holds the address of the next instruction.
Execute the instruction (possibly changing the PC).
Repeat

Example

Suppose we want to write a simple function that adds two eight-bit binary numbers. The numbers are initially located in memory, and the result should also be located in memory after our function has finished. We could write an assembly program that contains the following instructions:

Load the first binary number from memory into a first register.
Load the second binary number from memory into a second register.
Add the data in the first and second registers and write the result into a third register.
Store the data in the third register to memory.

Assume that the values of the two binary numbers in memory are 01010111 and 00010101 and that our simple program is stored in memory starting at address my_prog. Since memory is byte-addressable, we would find the four instructions at addresses my_prog+0, my_prog+4, my_prog+8, and my_prog+12. By initializing the program counter to my_prog+0, we can execute the program by repeating the above procedure four times:

Action	PC	1st Register	2nd Register	3rd Register
Fetch first load instruction from memory at address `my_prog+0`	`my_prog+0`
Increment PC by 4	`my_prog+4`
Execute load instruction	`my_prog+4`	`01010111`
Fetch second load instruction from memory at address `my_prog+4`	`my_prog+4`	`01010111`
Increment PC by 4	`my_prog+8`	`01010111`
Execute load instruction	`my_prog+8`	`01010111`	`00010101`
Fetch add instruction from memory at address `my_prog+8`	`my_prog+8`	`01010111`	`00010101`
Increment PC by 4	`my_prog+12`	`01010111`	`00010101`
Execute add instruction	`my_prog+12`	`01010111`	`00010101`	`01101100`
Fetch store instruction from memory at address `my_prog+12`	`my_prog+12`	`01010111`	`00010101`	`01101100`
Increment PC by 4	`my_prog+16`	`01010111`	`00010101`	`01101100`
Execute store instruction	`my_prog+16`	`01010111`	`00010101`	`01101100`

At this point, we have achieved our goal, as 01101100, the result of adding 01010111 and 00010101, is now located in memory.

2.2. GNU Assembly Syntax#

Listing 2.2.1 Simple C program that prints “Hello World!”. It is assumed that the source code is stored in the file hello_world.c.#

#include <stdio.h>

int main() {
  puts( "Hello World!" );

  return 0;
}

We start to learn about assembly language by looking at some compiler-generated code. Let us take the simple example program in Listing 2.2.1. We can compile this program into assembly code using the GNU Compiler Collection (GCC) with the following command:

gcc -S hello_world.c

Alternatively, we can do the same with the Clang compiler using the command:

clang -S hello_world.c

Both compilers will produce an assembly file named hello_world.s. Since compilation depends on many parameters, we will typically get (very) different results on different systems and with different compilers.

We can also write assembly code directly. Assembly language is the human-readable language closest to hardware. It has different flavors and syntaxes. We use the GNU Assembly Syntax (GAS), which is also the default when the GCC or Clang compiler generates assembly code.

Listing 2.2.2 Example assembly program stored in the file hello_world.s.#

    // read-only data section
    .section .rodata
my_msg:
    // null-terminated string
    .asciz  "Hello World!"

    // text section
    .text
    .global main
main:
    // save frame pointer and link register
    stp     x29, x30, [sp, #-16]!
    // update frame pointer to current stack pointer
    mov     x29, sp

    // print "Hello World!" by calling puts
    adr     x0, my_msg
    bl      puts

    // set return value
    mov     w0, #0

    // restore frame pointer and link register
    ldp     x29, x30, [sp], #16

    ret

On an AArch64 Linux system, we could write assembly code similar to that shown in Listing 2.2.2 for our “Hello World!” example. In a nutshell, the assembly program uses four important building blocks:

Assembler Directives

Names begin with a period “.”.
Define symbols, allocate memory, control the assembly process.
There are many directives available, but we only need a few.
Directives in the example: .section, .asciz, .text, .global.

Labels

Label definitions are immediately followed by a colon “:”.
Allow us to abstractly refer to locations in an assembly program, such as the location of a function.
Labels in the example: my_msg:, main:.

Assembly Instructions

An instruction performs a specific operation.
May be broken down by the hardware into multiple micro-ops (µOPs).
Instructions in the example: stp, mov, adr, bl, ldp, ret.

Comments

Help us understand the assembly code.
Ignored by the assembler.
C-style syntax
Use them extensively, since assembly code is not self-explanatory.

2.3. Assembler#

The previous section discussed the basic structure of assembly code. Now we will assemble the program in Listing 2.2.2, that is, translate it so that an AArch64 machine can run it.

We can invoke the GNU assembler with the following command:

as hello_world.s -o hello_world.o

The resulting file hello_world.o contains binary data. We can display its contents in ASCII by using od:

od -A x -t x1 hello_world.o > hello_world.hex

where -A x specifies hexadecimal addresses and -t x1 specifies the output format of the data as hexadecimal with one byte per column.

Listing 2.3.1 Hexadecimal dump hello_world.hex of the file hello_world.o generated by the tool od.#

000000 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
000010 01 00 b7 00 01 00 00 00 00 00 00 00 00 00 00 00
000020 00 00 00 00 00 00 00 00 d0 01 00 00 00 00 00 00
000030 00 00 00 00 40 00 00 00 00 00 40 00 09 00 08 00
000040 fd 7b bf a9 fd 03 00 91 00 00 00 10 00 00 00 94
000050 00 00 80 52 fd 7b c1 a8 c0 03 5f d6 48 65 6c 6c
000060 6f 20 57 6f 72 6c 64 21 00 00 00 00 00 00 00 00
000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000080 00 00 00 00 00 00 00 00 00 00 00 00 03 00 01 00
000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

The first 10 lines of the resulting ASCII file will be similar to those shown in Listing 2.3.1. At first glance, this may not seem very helpful. We make sense of the bytes by looking at the sections in the ELF file:

readelf -S hello_world.o > hello_world.relf

The result shows us the location of each section in the assembled file.

Listing 2.3.2 Output of the tool readelf in the file hello_world.relf.#

There are 9 section headers, starting at offset 0x1d0:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .text             PROGBITS         0000000000000000  00000040
       000000000000001c  0000000000000000  AX       0     0     4
  [ 2] .rela.text        RELA             0000000000000000  00000160
       0000000000000030  0000000000000018   I       6     1     8
  [ 3] .data             PROGBITS         0000000000000000  0000005c
       0000000000000000  0000000000000000  WA       0     0     1
  [ 4] .bss              NOBITS           0000000000000000  0000005c
       0000000000000000  0000000000000000  WA       0     0     1
  [ 5] .rodata           PROGBITS         0000000000000000  0000005c
       000000000000000d  0000000000000000   A       0     0     1
  [ 6] .symtab           SYMTAB           0000000000000000  00000070
       00000000000000d8  0000000000000018           7     7     8
  [ 7] .strtab           STRTAB           0000000000000000  00000148
       0000000000000015  0000000000000000           0     0     1
  [ 8] .shstrtab         STRTAB           0000000000000000  00000190
       0000000000000039  0000000000000000           0     0     1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  D (mbind), p (processor specific)

For our “Hello World!” example, the output will be similar to that shown in Listing 2.3.2. The offset values in the last column indicate the location in the file, given in bytes. Returning to our example program in Listing 2.2.2, we are interested in the two sections .rodata and .text. Starting with .rodata, we can see that the section starts at offset 0x5c and has a size of 0xd bytes. Looking at the hexadecimal dump in Listing 2.3.1, we find the following 13 bytes, which we can convert back to ASCII:

File Offset	Byte	ASCII Character
0x5c	48	H
0x5d	65	e
0x5e	6c	l
0x5f	6c	l
0x60	6f	o
0x61	20	(space)
0x62	57	W
0x63	6f	o
0x64	72	r
0x65	6c	l
0x66	64	d
0x67	21	!
0x68	00	NUL (null)

Great, we found the null-terminated string Hello World! in the .rodata section.

Next we look at the .text section. This section starts at offset 0x40 and has a size of 0x1c bytes. From Section 2.1 we already know that each instruction is 32 bits or 4 bytes long. So the 0x1c bytes correspond to 7 instructions, which is what we expect. The assembler has translated the human-readable mnemonics into machine code. The instructions are treated as little-endian which means that the least significant byte is stored first. The encoding of instructions is defined in the instruction set architecture and will be discussed in the following sections.

Example

The first instruction in the sample code in Listing 2.2.2 is stp x29, x30, [sp, #-16]!, which is fd 7b bf a9 in the hexdump shown in Listing 2.3.1. Taking the little-endian byte order into account, we read these four bytes as the 32-bit value 0xa9bf7bfd, which is 0b1010'1001'1011'1111'0111'1011'1111'1101 in binary. This is the 64-bit variant of the pre-index encoding of STP in the instruction set architecture, encoded as: 1010 1001 10 imm7 Rt2 Rn Rt where imm7 is a 7-bit immediate value and Rt2, Rn, and Rt are register IDs.

Manually looking up the machine code of instructions in an ELF file is tedious. We can use the objdump tool to automate this process and disassemble the binary file:

objdump --syms -S -d hello_world.o > hello_world.dis

This will create a hello_world.dis file containing the disassembled code.

Listing 2.3.3 Output of the tool objdump when applied to hello_world.o.#

hello_world.o:     file format elf64-littleaarch64

SYMBOL TABLE:
0000000000000000 l    d  .text	0000000000000000 .text
0000000000000000 l    d  .data	0000000000000000 .data
0000000000000000 l    d  .bss	0000000000000000 .bss
0000000000000000 l    d  .rodata	0000000000000000 .rodata
0000000000000000 l       .rodata	0000000000000000 my_msg
0000000000000000 g       .text	0000000000000000 main
0000000000000000         *UND*	0000000000000000 puts



Disassembly of section .text:

0000000000000000 <main>:
   0:	a9bf7bfd 	stp	x29, x30, [sp, #-16]!
   4:	910003fd 	mov	x29, sp
   8:	10000000 	adr	x0, 0 <main>
   c:	94000000 	bl	0 <puts>
  10:	52800000 	mov	w0, #0x0                   	// #0
  14:	a8c17bfd 	ldp	x29, x30, [sp], #16
  18:	d65f03c0 	ret

Sample output for our “Hello World!” example is shown in Listing 2.3.3. As we can see, objdump shows mnemonics for the machine code.

Note

Each instruction in assembly code has exactly one representation as 32 machine code bits. However, in some cases, different mnemonics can result in the same machine code. In such cases objdump shows the preferred disassembly when disassembling the machine code. One such case is CMP (extended register), which is an alias of SUBS (extended register), where CMP is the preferred disassembly.

However, the program cannot be executed yet because it contains undefined parts. For example, the puts function is marked as *UND* in the symbol table. This is because puts is provided in libc, the standard library for the C language. We still need to use a linker, the topic of the next section, to make it available.

2.4. Linker#

The goal of this section is to create an executable from the object file hello_world.o that we assembled in Section 2.3. To do this, we need to perform a step called linking. During linking we, for example, combine object files, resolve dependencies and assign addresses. We can choose from several linkers to perform this step, including the GNU linker ld or the LLVM linker lld. However, manually linking with ld or lld requires us to perform some extra steps, such as writing startup and exit code that communicates with Linux through system calls. Instead, we will use a wrapper that does the extra steps for us automatically.

Using the GNU Compiler Collection, we can create our executable by simply using gcc:

gcc -o hello_world hello_world.o

Similarly, using the compiler and tooling infrastructure Clang, we can do the following:

clang -o hello_world hello_world.o

Both will create the executable hello_world that we can run:

./hello_world

We see that the string “Hello World!” is printed to the command line. We have just successfully executed our first assembly program.

If we look at the symbol table of the executable using objdump --syms hello_world, we see many additional entries. Before, we had the *UND* entry for puts. Now, a glibc version is mentioned in the name, e.g., puts@GLIBC_2.17, but the section is still given as *UND*. This is because dynamic linking is the default behavior for external libraries in most cases. The respective symbols are resolved by the dynamic linker at runtime of the program, which also loads the required dynamic libraries. If we want the implementation of puts to be part of our executable, we can opt for static linking. Using GCC, this can be achieved with the -static flag:

gcc -static -o hello_world hello_world.o

Looking at the symbol table again, we now see that the implementation of puts is available in our executable:

0000000000401640  w    F .text	000000000000025c puts

2.5. Registers#

The AArch64 Application Level Programmers’ Model has thirty-one general-purpose registers (GPRs) that are visible to the A64 instruction set: R0-R30. The GPRs are also called integer registers. They can be accessed as 64-bit X registers or as 32-bit W registers.

../_images/gprs.svg — Fig. 2.5.1 Illustration of the thirty-one general-purpose registers R0-R30 visible to the A64 instruction set, some special-purpose registers and the PSTATE fields.#

As shown in Fig. 2.5.1, the lower 32 bits of the X registers overlap with the W registers.

In addition to the general-purpose registers, AArch64 defines a number of special-purpose registers and vector registers. We will discuss the vector registers as part of the Neon chapter. The most important abstractions are the following:

Link Register (LR): The link register is simply another name for the general-purpose register X30. It stores the return address when a function is called. This means that after a function call, we return to the calling scope by jumping to the address in the link register.
Zero Register (ZR): The zero register can be used in some instructions. It always reads as zero, and writes to the register are simply ignored. Note that the AArch64 programmers’ model assumes that there is a zero register. However, this does not imply that the register exists in hardware as a physical register; typically it will not.
Stack Pointer (SP): Dedicated 64-bit stack pointer register that holds the address where the stack ends. Note that the stack grows with decreasing virtual addresses, i.e. we have to decrement the stack pointer to allocate memory.
Program Counter (PC): 64-bit program counter that holds the address of the current instruction. We cannot write directly to the program counter, but some instructions modify it. We will use the program counter for branching, for example, when implementing loops or conditional code.
Process State (PSTATE): The process state (PSTATE) is an abstraction that holds process state information. Of particular interest to us are the NZCV condition flags that are part of this state. Simply put, we will use them to store the result of comparisons in conditional code execution, and then jump conditionally based on these flags.

2.6. Procedure Call Standard#

The Procedure Call Standard (PCS) used by the Application Binary Interface (ABI) defines the role of the GPRs in function calls. GPRs R0-R7 are used to pass arguments to a function and to return values. Registers R8-R17 and R19-R28 can be used as scratch registers, subject to the caller-saved and callee-saved rules described below. R18 and R29 can be used as temporary registers in some cases, but it is advisable to avoid using them altogether. One such place with strict requirements is Apple platforms:

Apple platforms adhere to the following choices:

The platforms reserve register x18. Don’t use this register.

The frame pointer register (x29) must always address a valid frame record.

—Apple: Writing ARM64 code for Apple platforms

The PCS provides the rules for calling functions of others and for writing functions that are called by others.

In the first case, we are the caller of the other function. According to the PCS, it is our responsibility to save intermediate data in the caller-saved registers before calling the function. In other words: The called function may overwrite the data in any caller-saved registers. Then, after the function returns to our scope, we must restore the data. In the PCS, registers R0-R18 and R30 are caller-saved.

In the second case, our function is the callee. Again, according to the PCS, we must preserve the data in a set of registers. These registers are called callee-saved registers. So, if we plan to overwrite the data in a callee-saved register, we must save the intermediate data before modifying the register. Then, before jumping back to the caller’s scope, we must restore the data, since the caller may rely on it being preserved. According to the PCS, registers R19-R29 are saved by the callee.

In practice, when writing assembly code, we can use the stack to temporarily store the contents of registers. So, when implementing a function, we would first identify all the registers that our function modifies and that we need to preserve. Then, at the beginning of the function, we save the contents of all these registers on the stack. Now we can proceed with implementing the function and use the registers without worrying about the PCS. When we are finished, just before jumping back to the scope of the caller, we simply restore the contents of the previously saved registers by loading them from the stack.

Listing 2.6.1 Example assembly program that sets the frame pointer register and temporarily stores registers X19-X30 on the stack.#

    .text
    .type pcs_gprs, %function
    .global pcs_gprs
 pcs_gprs:
    // save frame pointer and link register
    stp fp, lr, [sp, #-16]!
    // update frame pointer to current stack pointer
    mov fp, sp

    // save callee-saved registers
    stp x19, x20, [sp, #-16]!
    stp x21, x22, [sp, #-16]!
    stp x23, x24, [sp, #-16]!
    stp x25, x26, [sp, #-16]!
    stp x27, x28, [sp, #-16]!

    // use registers as needed

    // restore callee-saved registers
    ldp x27, x28, [sp], #16
    ldp x25, x26, [sp], #16
    ldp x23, x24, [sp], #16
    ldp x21, x22, [sp], #16
    ldp x19, x20, [sp], #16

    // restore frame pointer and link register
    ldp fp, lr, [sp], #16

    ret

Listing 2.6.1 shows an example implementation that adheres to the PCS. Specifically, we use the pre-index encoding of STP to copy the contents of registers X19-X30 to the stack. The STP instructions first decrement the stack pointer by 16, thus allocating 16 bytes on the stack, before storing the 16 bytes of data distributed across two X registers to the stack.

Note

We will discuss base instructions (including STP and LDP) in more detail in Section 3.

LDP’s post-index encoding has the opposite effect. First, each instruction loads 16 bytes from the stack into two X registers. Then the stack pointer is incremented by 16, effectively freeing the memory.

The first two instructions, in lines 6 and 8 of Listing 2.6.1, create the stack frame. First, in line 6, the data of the frame pointer register (X29) and the link register (X30) are stored to the stack. Then, the MOV (to/from SP) instruction in line 8 copies the data from the stack pointer to the frame pointer register (X29), as required on Apple platforms, for example.

Note that we also temporarily store the data of the frame pointer register (X29) and the link register (X30) on the stack. Although FP (X29) and LR (X30) are typically not used directly in function implementations, their values must be preserved because they may be modified during nested function calls. In this case, we must restore them after the function calls.