2. Assembly Language
Assembly language is a low-level language and is tied to specific classes of hardware. In contrast, C/C++ is a high-level language. Any system with an appropriate compiler can compile a C/C++ program into assembly language. This is the reason why C/C++ code is portable and assembly code is not: If we want to target a different core, say an x86-64 core instead of an AArch64 core, we simply recompile the C/C++ code and generate new assembly code. If we had written the assembly code ourselves, we would be stuck and would have to start over. So why do we care about assembly code? The answer is simple: performance.
This chapter covers the basics of assembly language for the Arm architecture. Section 3 introduces base instructions, followed by Section 4, which introduces vector instructions for the heavy lifting in tensor workloads. We’ll then use these basics to write a set of performance-critical primitives for our tensor compiler in Section 7.
2.1. AArch64
Fig. 2.1.1 Illustration of a load-store architecture.
Before jumping into assembly language, let us define a very simple model of a CPU core to help us understand how programs are executed in hardware. Our core, shown in Fig. 2.1.1, has the following components:
- Memory
Large but slow storage for instructions and data.
- Registers
Fast but small storage for data.
- Program Counter (PC)
Location of the current instruction in memory.
- Arithmetic Logic Unit (ALU)
Circuit that performs arithmetic and logical operations on binary numbers.
We also assume that we are programming a load-store architecture. This means that the CPU core strictly separates data movement instructions from data processing instructions. Data movement instructions are responsible for copying data from memory to registers (load) and from registers to memory (store). Data processing instructions only modify data in registers, never data in memory. In this book, we will learn how to program AArch64, the 64-bit execution state of the Arm architecture. AArch64 is a load-store architecture and each instruction has a fixed length of 32 bits or 4 bytes.
In a very simplified way, the core executes instructions using the following procedure:
1. Fetch the instruction from memory at the address held in the PC.
2. Increment the PC by 4, so that it now holds the address of the next instruction.
3. Execute the instruction (possibly changing the PC).
4. Repeat.
Example
Suppose we want to write a simple function that adds two eight-bit binary numbers. The numbers are initially located in memory, and the result should also be located in memory after our function has finished. We could write an assembly program that contains the following instructions:
1. Load the first binary number from memory into a first register.
2. Load the second binary number from memory into a second register.
3. Add the data in the first and second registers and write the result into a third register.
4. Store the data in the third register to memory.
Assume that the values of the two binary numbers in memory are 01010111 and 00010101 and that our simple program is stored in memory starting at address my_prog.
Since memory is byte-addressable, we would find the four instructions at addresses my_prog+0, my_prog+4, my_prog+8, and my_prog+12.
By initializing the program counter to my_prog+0, we can execute the program by repeating the above procedure four times:
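Action | PC | 1st Register | 2nd Register | 3rd Register |
---|---|---|---|---|
Fetch first load instruction from memory at address my_prog+0 | my_prog+0 | | | |
Increment PC by 4 | my_prog+4 | | | |
Execute load instruction | my_prog+4 | 01010111 | | |
Fetch second load instruction from memory at address my_prog+4 | my_prog+4 | 01010111 | | |
Increment PC by 4 | my_prog+8 | 01010111 | | |
Execute load instruction | my_prog+8 | 01010111 | 00010101 | |
Fetch add instruction from memory at address my_prog+8 | my_prog+8 | 01010111 | 00010101 | |
Increment PC by 4 | my_prog+12 | 01010111 | 00010101 | |
Execute add instruction | my_prog+12 | 01010111 | 00010101 | 01101100 |
Fetch store instruction from memory at address my_prog+12 | my_prog+12 | 01010111 | 00010101 | 01101100 |
Increment PC by 4 | my_prog+16 | 01010111 | 00010101 | 01101100 |
Execute store instruction | my_prog+16 | 01010111 | 00010101 | 01101100 |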
At this point, we have achieved our goal, as 01101100, the result of adding 01010111 and 00010101, is now located in memory.
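The fetch, increment, execute procedure and the four-instruction example can be sketched as a small simulator in C. This is only a toy model under stated assumptions: the `toy_` names, the three operations, and the three-byte data memory are hypothetical and do not correspond to real AArch64 encodings.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical operations of the toy core; these are not AArch64 encodings. */
enum toy_op { TOY_LOAD, TOY_ADD, TOY_STORE };

/* One toy instruction. We pretend that each one occupies 4 bytes of memory. */
struct toy_instr {
  enum toy_op op;
  unsigned reg; /* register that is written (LOAD, ADD) or read (STORE) */
  unsigned a;   /* data index (LOAD, STORE) or first source register (ADD) */
  unsigned b;   /* second source register (ADD), unused otherwise */
};

struct toy_core {
  uint32_t pc;     /* byte offset of the next instruction relative to my_prog */
  uint8_t regs[3]; /* three eight-bit registers */
  uint8_t data[3]; /* data memory: two inputs and one result */
};

/* Executes n_instrs instructions by repeating fetch, increment, execute. */
void toy_run(struct toy_core *core, const struct toy_instr *prog,
             unsigned n_instrs) {
  core->pc = 0;
  for (unsigned step = 0; step < n_instrs; step++) {
    assert(core->pc % 4 == 0 && core->pc / 4 < n_instrs);
    const struct toy_instr in = prog[core->pc / 4]; /* fetch */
    core->pc += 4;                                  /* increment PC by 4 */
    switch (in.op) {                                /* execute */
      case TOY_LOAD:
        core->regs[in.reg] = core->data[in.a];
        break;
      case TOY_ADD:
        core->regs[in.reg] = (uint8_t)(core->regs[in.a] + core->regs[in.b]);
        break;
      case TOY_STORE:
        core->data[in.a] = core->regs[in.reg];
        break;
    }
  }
}

/* Runs the example program and returns the byte that was stored to memory. */
uint8_t toy_example(void) {
  const struct toy_instr prog[4] = {
    { TOY_LOAD,  0, 0, 0 }, /* my_prog+0:  first register  <- first number  */
    { TOY_LOAD,  1, 1, 0 }, /* my_prog+4:  second register <- second number */
    { TOY_ADD,   2, 0, 1 }, /* my_prog+8:  third register  <- sum           */
    { TOY_STORE, 2, 2, 0 }, /* my_prog+12: memory          <- third register */
  };
  /* 0x57 is 01010111 and 0x15 is 00010101 */
  struct toy_core core = { 0, { 0, 0, 0 }, { 0x57, 0x15, 0x00 } };
  toy_run(&core, prog, 4);
  assert(core.pc == 16); /* the PC has moved past the fourth instruction */
  return core.data[2];
}
```

Calling `toy_example()` walks through the same twelve steps as the table and returns the byte stored by the last instruction.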
2.2. GNU Assembly Syntax
Listing 2.2.1 hello_world.c
#include <stdio.h>

int main() {
  puts( "Hello World!" );
  return 0;
}
We start to learn about assembly language by looking at some compiler-generated code. Let us take the simple example program in Listing 2.2.1. We can compile this program into assembly code using the GNU Compiler Collection (GCC) with the following command:
gcc -S hello_world.c
Alternatively, we can do the same with the Clang compiler using the command:
clang -S hello_world.c
Both compilers will produce an assembly file named hello_world.s.
Since compilation depends on many parameters, we will typically get (very) different results on different systems and with different compilers.
We can also write assembly code directly. Assembly language is the human-readable language closest to hardware. It has different flavors and syntaxes. We use the GNU Assembly Syntax (GAS), which is also the default when the GCC or Clang compiler generates assembly code.
    // read-only data section
    .section .rodata
my_msg:
    // null-terminated string
    .asciz "Hello World!"

    // text section
    .text
    .global main
main:
    // save frame pointer and link register
    stp x29, x30, [sp, #-16]!
    // update frame pointer to current stack pointer
    mov x29, sp

    // print "Hello World!" by calling puts
    adr x0, my_msg
    bl puts

    // set return value
    mov w0, #0

    // restore frame pointer and link register
    ldp x29, x30, [sp], #16
    ret
On an AArch64 Linux system, we could write assembly code similar to that shown in Listing 2.2.2 for our “Hello World!” example. In a nutshell, the assembly program uses four important building blocks:
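- Assembler Directives
Instruct the assembler itself rather than the processor, for example to switch sections or to export a symbol.
Directives in the example: .section, .asciz, .text, .global.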
- Labels
Label definitions are immediately followed by a colon “:”.
Allow us to abstractly refer to locations in an assembly program, such as the location of a function.
Labels in the example: my_msg:, main:.
- Assembly Instructions
An instruction performs a specific operation.
May be broken down by the hardware into multiple micro-ops (µOPs).
Instructions in the example: stp, mov, adr, bl, ldp, ret.
- Comments
Help us understand the assembly code.
Ignored by the assembler.
C-style syntax
Use them extensively: assembly code is not self-explanatory.
2.3. Assembler
The previous section discussed the basic structure of assembly code. Now we will assemble the program in Listing 2.2.2, that is, translate it so that an AArch64 machine can run it.
We can invoke the GNU assembler with the following command:
as hello_world.s -o hello_world.o
The resulting file hello_world.o contains binary data.
We can display its contents in ASCII by using od:
od -A x -t x1 hello_world.o > hello_world.hex
where -A x specifies hexadecimal addresses and -t x1 specifies the output format of the data as hexadecimal with one byte per column.
000000 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
000010 01 00 b7 00 01 00 00 00 00 00 00 00 00 00 00 00
000020 00 00 00 00 00 00 00 00 d0 01 00 00 00 00 00 00
000030 00 00 00 00 40 00 00 00 00 00 40 00 09 00 08 00
000040 fd 7b bf a9 fd 03 00 91 00 00 00 10 00 00 00 94
000050 00 00 80 52 fd 7b c1 a8 c0 03 5f d6 48 65 6c 6c
000060 6f 20 57 6f 72 6c 64 21 00 00 00 00 00 00 00 00
000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000080 00 00 00 00 00 00 00 00 00 00 00 00 03 00 01 00
000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
The first 10 lines of the resulting ASCII file will be similar to those shown in Listing 2.3.1. At first glance, this may not seem very helpful. We make sense of the bytes by looking at the sections in the ELF file:
readelf -S hello_world.o > hello_world.relf
The result shows us the location of each section in the assembled file.
There are 9 section headers, starting at offset 0x1d0:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .text PROGBITS 0000000000000000 00000040
000000000000001c 0000000000000000 AX 0 0 4
[ 2] .rela.text RELA 0000000000000000 00000160
0000000000000030 0000000000000018 I 6 1 8
[ 3] .data PROGBITS 0000000000000000 0000005c
0000000000000000 0000000000000000 WA 0 0 1
[ 4] .bss NOBITS 0000000000000000 0000005c
0000000000000000 0000000000000000 WA 0 0 1
[ 5] .rodata PROGBITS 0000000000000000 0000005c
000000000000000d 0000000000000000 A 0 0 1
[ 6] .symtab SYMTAB 0000000000000000 00000070
00000000000000d8 0000000000000018 7 7 8
[ 7] .strtab STRTAB 0000000000000000 00000148
0000000000000015 0000000000000000 0 0 1
[ 8] .shstrtab STRTAB 0000000000000000 00000190
0000000000000039 0000000000000000 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
D (mbind), p (processor specific)
For our “Hello World!” example, the output will be similar to that shown in Listing 2.3.2.
The offset values in the last column indicate the location in the file, given in bytes.
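The first line of the readelf output (9 section headers starting at offset 0x1d0) can be cross-checked against the first 64 bytes of the hexdump. The sketch below assumes the standard ELF64 header layout, in which `e_machine` sits at byte 0x12, `e_shoff` at byte 0x28, and `e_shnum` at byte 0x3c. The helper `read_le` and the array name `elf_header` are hypothetical names chosen for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First 64 bytes of hello_world.o, copied from the hexdump listing. */
const uint8_t elf_header[64] = {
  0x7f, 0x45, 0x4c, 0x46, 0x02, 0x01, 0x01, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x01, 0x00, 0xb7, 0x00, 0x01, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0xd0, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x40, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x40, 0x00, 0x09, 0x00, 0x08, 0x00,
};

/* Reads an n_bytes-wide little-endian unsigned integer at bytes[offset]. */
uint64_t read_le(const uint8_t *bytes, size_t offset, size_t n_bytes) {
  assert(n_bytes <= 8);
  uint64_t value = 0;
  for (size_t i = 0; i < n_bytes; i++) {
    value |= (uint64_t)bytes[offset + i] << (8 * i);
  }
  return value;
}

/* e_machine: target architecture (183 identifies AArch64). */
uint64_t elf_machine(void) { return read_le(elf_header, 0x12, 2); }

/* e_shoff: file offset of the section header table. */
uint64_t elf_section_header_offset(void) { return read_le(elf_header, 0x28, 8); }

/* e_shnum: number of section headers. */
uint64_t elf_section_count(void) { return read_le(elf_header, 0x3c, 2); }
```

With the bytes above, the three accessors return 183, 0x1d0, and 9, which agrees with what readelf reports for this object file.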
Returning to our example program in Listing 2.2.2, we are interested in the two sections .rodata and .text.
Starting with .rodata, we can see that the section starts at offset 0x5c and has a size of 0xd bytes.
Looking at the hexadecimal dump in Listing 2.3.1, we find the following 13 bytes, which we can convert back to ASCII:
File Offset | Byte | ASCII Character |
---|---|---|
0x5c | 48 | H |
0x5d | 65 | e |
0x5e | 6c | l |
0x5f | 6c | l |
0x60 | 6f | o |
0x61 | 20 | (space) |
0x62 | 57 | W |
0x63 | 6f | o |
0x64 | 72 | r |
0x65 | 6c | l |
0x66 | 64 | d |
0x67 | 21 | ! |
0x68 | 00 | NUL (null) |
Great, we found the null-terminated string Hello World! in the .rodata section.
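The same lookup can be done in a few lines of C. The sketch below copies rows 000050 and 000060 of the hexdump into an array and reads the string at the .rodata offset 0x5c. The names `dump_rows`, `bytes_at`, and `section_string_size` are hypothetical helpers for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Rows 000050 and 000060 of the hexdump, i.e. file offsets 0x50 to 0x6f. */
const uint8_t dump_rows[32] = {
  0x00, 0x00, 0x80, 0x52, 0xfd, 0x7b, 0xc1, 0xa8,
  0xc0, 0x03, 0x5f, 0xd6, 0x48, 0x65, 0x6c, 0x6c,
  0x6f, 0x20, 0x57, 0x6f, 0x72, 0x6c, 0x64, 0x21,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
};

/* Returns a pointer to the byte at the given file offset within the rows. */
const char *bytes_at(size_t file_offset) {
  assert(file_offset >= 0x50 && file_offset < 0x50 + sizeof dump_rows);
  return (const char *)dump_rows + (file_offset - 0x50);
}

/* Size in bytes of the null-terminated string at file_offset, including NUL. */
size_t section_string_size(size_t file_offset) {
  return strlen(bytes_at(file_offset)) + 1;
}
```

Here `bytes_at(0x5c)` points at the string, and `section_string_size(0x5c)` matches the section size of 0xd bytes reported by readelf.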
Next we look at the .text section.
This section starts at offset 0x40 and has a size of 0x1c bytes.
From Section 2.1 we already know that each instruction is 32 bits or 4 bytes long.
So the 0x1c bytes correspond to 7 instructions, which is what we expect.
The assembler has translated the human-readable mnemonics into machine code.
Each instruction is stored in little-endian byte order, which means that the least significant byte comes first.
The encoding of instructions is defined in the instruction set architecture and will be discussed in the following sections.
Example
The first instruction in the sample code in Listing 2.2.2 is stp x29, x30, [sp, #-16]!, which is fd 7b bf a9 in the hexdump shown in Listing 2.3.1.
Being aware of the little-endian encoding of the instructions, we can convert the hexadecimal value to binary, which in the case of 0xa9bf7bfd is 0b1010'1001'1011'1111'0111'1011'1111'1101.
This is the 64-bit variant of the pre-index encoding of STP in the instruction set architecture, encoded as: 1010 1001 10 imm7 Rt2 Rn Rt where imm7 is a 7-bit immediate value and Rt2, Rn, and Rt are register IDs.
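The decoding steps of this example can be sketched in C: first assemble the 32-bit word from the four little-endian bytes, then extract the fields of the encoding quoted above. The names `instr_word`, `stp_fields`, and `decode_stp` are hypothetical. The sketch also assumes that the 64-bit variant scales the signed imm7 by 8 to obtain the byte offset, which is consistent with the offset of -16 in the source line.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes at file offsets 0x40 to 0x43 of the hexdump listing. */
const uint8_t first_instr_bytes[4] = { 0xfd, 0x7b, 0xbf, 0xa9 };

/* Assembles a 32-bit instruction word from four little-endian bytes. */
uint32_t instr_word(const uint8_t bytes[4]) {
  return (uint32_t)bytes[0] | (uint32_t)bytes[1] << 8 |
         (uint32_t)bytes[2] << 16 | (uint32_t)bytes[3] << 24;
}

/* Fields of the 64-bit pre-index STP encoding: 1010 1001 10 imm7 Rt2 Rn Rt. */
struct stp_fields {
  int is_stp_pre64; /* 1 if bits [31:22] equal 1010 1001 10 */
  int offset;       /* byte offset: sign-extended imm7 multiplied by 8 */
  unsigned rt2;     /* second register that is stored */
  unsigned rn;      /* base register; ID 31 denotes SP in this encoding */
  unsigned rt;      /* first register that is stored */
};

struct stp_fields decode_stp(uint32_t word) {
  struct stp_fields f;
  f.is_stp_pre64 = (word >> 22) == 0x2a6; /* 0b1010100110 */
  int imm7 = (int)((word >> 15) & 0x7f);
  if (imm7 & 0x40) {
    imm7 -= 0x80; /* sign-extend the 7-bit immediate */
  }
  f.offset = imm7 * 8;
  f.rt2 = (word >> 10) & 0x1f;
  f.rn = (word >> 5) & 0x1f;
  f.rt = word & 0x1f;
  return f;
}

/* Checks the decoded fields against stp x29, x30, [sp, #-16]!. */
void check_first_instruction(void) {
  struct stp_fields f = decode_stp(instr_word(first_instr_bytes));
  assert(f.is_stp_pre64 && f.offset == -16);
  assert(f.rt == 29 && f.rt2 == 30 && f.rn == 31);
}
```

The decoded register IDs 29 and 30 correspond to x29 and x30, and the base register ID 31 stands for the stack pointer here.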
Manually looking up the machine code of instructions in an ELF file is tedious. We can use the objdump tool to automate this process and disassemble the binary file:
objdump --syms -S -d hello_world.o > hello_world.dis
This will create a hello_world.dis file containing the disassembled code.
hello_world.o: file format elf64-littleaarch64
SYMBOL TABLE:
0000000000000000 l d .text 0000000000000000 .text
0000000000000000 l d .data 0000000000000000 .data
0000000000000000 l d .bss 0000000000000000 .bss
0000000000000000 l d .rodata 0000000000000000 .rodata
0000000000000000 l .rodata 0000000000000000 my_msg
0000000000000000 g .text 0000000000000000 main
0000000000000000 *UND* 0000000000000000 puts
Disassembly of section .text:
0000000000000000 <main>:
0: a9bf7bfd stp x29, x30, [sp, #-16]!
4: 910003fd mov x29, sp
8: 10000000 adr x0, 0 <main>
c: 94000000 bl 0 <puts>
10: 52800000 mov w0, #0x0 // #0
14: a8c17bfd ldp x29, x30, [sp], #16
18: d65f03c0 ret
Sample output for our “Hello World!” example is shown in Listing 2.3.3.
As we can see, objdump shows mnemonics for the machine code.
Note
Each instruction in assembly code has exactly one representation as 32 machine code bits. However, in some cases, different mnemonics can result in the same machine code. In such cases objdump shows the preferred disassembly when disassembling the machine code. One such case is CMP (extended register), which is an alias of SUBS (extended register), where CMP is the preferred disassembly.
However, the program cannot be executed yet because it contains undefined parts.
For example, the puts function is marked as *UND* in the symbol table.
This is because puts is provided in libc, the standard library for the C language.
We still need to use a linker, the topic of the next section, to make it available.
2.4. Linker
The goal of this section is to create an executable from the object file hello_world.o that we assembled in Section 2.3.
To do this, we need to perform a step called linking. Among other things, linking combines object files, resolves dependencies, and assigns addresses.
We can choose from several linkers to perform this step, including the GNU linker ld or the LLVM linker lld.
However, manually linking with ld or lld requires us to perform some extra steps, such as writing startup and exit code that communicates with Linux through system calls.
Instead, we will use a wrapper that does the extra steps for us automatically.
Using the GNU Compiler Collection, we can create our executable by simply using gcc:
gcc -o hello_world hello_world.o
Similarly, using the compiler and tooling infrastructure Clang, we can do the following:
clang -o hello_world hello_world.o
Both will create the executable hello_world that we can run:
./hello_world
We see that the string “Hello World!” is printed to the command line. We have just successfully executed our first assembly program.
If we look at the symbol table of the executable using objdump --syms hello_world, we see many additional entries. Before, we had the *UND* entry for puts. Now, a glibc version is mentioned in the name, e.g., puts@GLIBC_2.17, but the section is still given as *UND*. This is because dynamic linking is the default behavior for external libraries in most cases. The respective symbols are resolved at runtime by the dynamic linker, which also loads the required dynamic libraries. If we want the implementation of puts to be part of our executable, we can opt for static linking. Using GCC, this can be achieved with the -static flag:
gcc -static -o hello_world hello_world.o
Looking at the symbol table again, we now see that the implementation of puts is available in our executable:
0000000000401640 w F .text 000000000000025c puts
2.5. Registers
The AArch64 Application Level Programmers’ Model has thirty-one general-purpose registers (GPRs) that are visible to the A64 instruction set: R0-R30. The GPRs are also called integer registers. They can be accessed as 64-bit X registers or as 32-bit W registers.
Fig. 2.5.1 Illustration of the thirty-one general-purpose registers R0-R30 visible to the A64 instruction set, some special-purpose registers and the PSTATE fields.
As shown in Fig. 2.5.1, the lower 32 bits of the X registers overlap with the W registers.
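The overlap of the X and W views can be modeled in C with a 64-bit array element per register. This is a sketch with hypothetical names (`gpr_file`, `read_w`, `write_w`, and so on). It also encodes one fact that the text above does not state: in AArch64, a write to a W register sets the upper 32 bits of the corresponding X register to zero.

```c
#include <assert.h>
#include <stdint.h>

/* Register file holding the thirty-one general-purpose registers R0-R30. */
struct gpr_file {
  uint64_t r[31];
};

/* X view: all 64 bits of Rn. */
uint64_t read_x(const struct gpr_file *f, unsigned n) {
  assert(n < 31);
  return f->r[n];
}

/* W view: the lower 32 bits of Rn. */
uint32_t read_w(const struct gpr_file *f, unsigned n) {
  assert(n < 31);
  return (uint32_t)f->r[n];
}

void write_x(struct gpr_file *f, unsigned n, uint64_t value) {
  assert(n < 31);
  f->r[n] = value;
}

/* Writing the W view zero-extends: the upper 32 bits of Rn become zero. */
void write_w(struct gpr_file *f, unsigned n, uint32_t value) {
  assert(n < 31);
  f->r[n] = value;
}

/* Writes X0 and returns what W0 reads afterwards. */
uint32_t w0_after_x_write(void) {
  struct gpr_file f = { { 0 } };
  write_x(&f, 0, 0x1122334455667788ull);
  return read_w(&f, 0);
}

/* Writes X0, then W0, and returns what X0 reads afterwards. */
uint64_t x0_after_w_write(void) {
  struct gpr_file f = { { 0 } };
  write_x(&f, 0, 0x1122334455667788ull);
  write_w(&f, 0, 0xdeadbeefu);
  return read_x(&f, 0);
}
```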
In addition to the general-purpose registers, AArch64 defines a number of special-purpose registers and vector registers. We will discuss the vector registers as part of the Neon chapter. The most important abstractions are the following:
- Link Register (LR)
The link register is simply another name for the general-purpose register X30. It stores the return address when a function is called. This means that after a function call, we return to the calling scope by jumping to the address in the link register.
- Zero Register (ZR)
The zero register can be used in some instructions. It always reads as zero, and writes to it are simply ignored. Note that the AArch64 programmers’ model assumes that there is a zero register. However, this does not imply that the register exists in hardware as a physical register; typically it will not.
- Stack Pointer (SP)
Dedicated 64-bit stack pointer register that holds the address where the stack ends. Note that the stack grows with decreasing virtual addresses, i.e. we have to decrement the stack pointer to allocate memory.
- Program Counter (PC)
64-bit program counter that holds the address of the current instruction. We cannot write directly to the program counter, but some instructions modify it. We will use the program counter for branching, for example, when implementing loops or conditional code.
- Process State (PSTATE)
The process state (PSTATE) is an abstraction that holds process state information. Of particular interest to us are the NZCV condition flags that are part of this state. Simply put, we will use them to store the result of comparisons in conditional code execution, and then jump conditionally based on these flags.
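To make the role of the NZCV flags more concrete, the following C sketch computes the flags that a 64-bit comparison (a subtraction whose result is discarded) would set, and derives three conditions from them. The names `nzcv`, `cmp_flags`, and `cond_*` are hypothetical. The flag definitions used here (N: negative, Z: zero, C: no borrow, V: signed overflow) are stated as assumptions of the sketch, since the text above does not define them.

```c
#include <assert.h>
#include <stdint.h>

struct nzcv {
  unsigned n; /* result is negative when read as a signed number */
  unsigned z; /* result is zero */
  unsigned c; /* carry: set if the subtraction did not borrow */
  unsigned v; /* signed overflow */
};

/* Flags set by comparing a with b, i.e. by computing a - b. */
struct nzcv cmp_flags(uint64_t a, uint64_t b) {
  const uint64_t result = a - b;
  struct nzcv f;
  f.n = (unsigned)(result >> 63);
  f.z = result == 0;
  f.c = a >= b;
  /* overflow: operands differ in sign and the result's sign differs from a */
  f.v = (unsigned)(((a ^ b) & (a ^ result)) >> 63);
  return f;
}

/* Equal: Z is set. */
int cond_eq(struct nzcv f) { return f.z == 1; }

/* Signed less than: N differs from V. */
int cond_lt(struct nzcv f) { return f.n != f.v; }

/* Unsigned lower: C is clear. */
int cond_lo(struct nzcv f) { return f.c == 0; }

/* Usage: a few comparisons and the conditions that hold afterwards. */
void cmp_flags_examples(void) {
  assert(cond_eq(cmp_flags(5, 5)));
  assert(cond_lt(cmp_flags(3, 7)) && cond_lo(cmp_flags(3, 7)));
  assert(!cond_lt(cmp_flags(7, 3)) && !cond_lo(cmp_flags(7, 3)));
}
```

A conditional branch then only has to test one of these conditions, which is how loops and if-statements are built on top of comparisons.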
2.6. Procedure Call Standard
The Procedure Call Standard (PCS) used by the Application Binary Interface (ABI) defines the role of the GPRs in function calls. GPRs R0-R7 are used to pass arguments to a function and to return values. Registers R8-R17 and R19-R28 are scratch registers. R18 and R29 can be used as temporary registers in some cases, but it is advisable to avoid using them altogether. One such place with strict requirements is Apple platforms:
- Apple platforms adhere to the following choices:
The platforms reserve register x18. Don’t use this register.
The frame pointer register (x29) must always address a valid frame record.
The PCS provides the rules for calling functions of others and for writing functions that are called by others.
In the first case, we are the caller of the other function. According to the PCS, it is our responsibility to save intermediate data in the caller-saved registers before calling the function. In other words: The called function may overwrite the data in any caller-saved registers. Then, after the function returns to our scope, we must restore the data. In the PCS, registers R0-R18 and R30 are caller-saved.
In the second case, our function is the callee. Again, according to the PCS, we must preserve the data in a set of registers. These registers are called callee-saved registers. So, if we plan to overwrite the data in a callee-saved register, we must save the intermediate data before modifying the register. Then, before jumping back to the caller’s scope, we must restore the data, since the caller may rely on it being preserved. According to the PCS, registers R19-R29 are saved by the callee.
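The split into caller-saved and callee-saved registers described above can be written down as a small lookup in C. The sketch uses hypothetical names (`save_role`, `gpr_save_role`, `callee_must_preserve`) and follows the register ranges exactly as given in this section: R0-R18 and R30 are caller-saved, R19-R29 are callee-saved.

```c
#include <assert.h>
#include <stdint.h>

enum save_role { CALLER_SAVED, CALLEE_SAVED };

/* Role of general-purpose register Rn (0 <= n <= 30) in a function call. */
enum save_role gpr_save_role(unsigned n) {
  assert(n <= 30);
  return (n >= 19 && n <= 29) ? CALLEE_SAVED : CALLER_SAVED;
}

/* Takes a bit mask of the registers a function modifies (bit n set for Rn)
 * and returns the subset that the function itself has to save and restore. */
uint32_t callee_must_preserve(uint32_t modified) {
  uint32_t result = 0;
  for (unsigned n = 0; n <= 30; n++) {
    if ((modified >> n & 1u) && gpr_save_role(n) == CALLEE_SAVED) {
      result |= 1u << n;
    }
  }
  return result;
}
```

For example, a function that modifies R0, R19, R20, and R30 only has to preserve R19 and R20 on behalf of its caller.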
In practice, when writing assembly code, we can use the stack to temporarily store the contents of registers. So, when implementing a function, we would first identify all the registers that our function modifies and that we need to preserve. Then, at the beginning of the function, we save the contents of all these registers on the stack. Now we can proceed with implementing the function and use the registers without worrying about the PCS. When we are finished, just before jumping back to the scope of the caller, we simply restore the contents of the previously saved registers by loading them from the stack.
1 .text
2 .type pcs_gprs, %function
3 .global pcs_gprs
4 pcs_gprs:
5 // save frame pointer and link register
6 stp fp, lr, [sp, #-16]!
7 // update frame pointer to current stack pointer
8 mov fp, sp
9
10 // save callee-saved registers
11 stp x19, x20, [sp, #-16]!
12 stp x21, x22, [sp, #-16]!
13 stp x23, x24, [sp, #-16]!
14 stp x25, x26, [sp, #-16]!
15 stp x27, x28, [sp, #-16]!
16
17 // use registers as needed
18
19 // restore callee-saved registers
20 ldp x27, x28, [sp], #16
21 ldp x25, x26, [sp], #16
22 ldp x23, x24, [sp], #16
23 ldp x21, x22, [sp], #16
24 ldp x19, x20, [sp], #16
25
26 // restore frame pointer and link register
27 ldp fp, lr, [sp], #16
28
29 ret
Listing 2.6.1 shows an example implementation that adheres to the PCS. Specifically, we use the pre-index encoding of STP to copy the contents of registers X19-X30 to the stack. The STP instructions first decrement the stack pointer by 16, thus allocating 16 bytes on the stack, before storing the 16 bytes of data distributed across two X registers to the stack.
Note
We will discuss base instructions (including STP and LDP) in more detail in Section 3.
LDP’s post-index encoding has the opposite effect. First, each instruction loads 16 bytes from the stack into two X registers. Then the stack pointer is incremented by 16, effectively freeing the memory.
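The order of operations in the two addressing modes can be modeled in C. The sketch below uses a hypothetical `toy_stack` with 64 eight-byte slots: `stp_pre_index` mimics `stp a, b, [sp, #-16]!` (decrement first, then store), and `ldp_post_index` mimics `ldp a, b, [sp], #16` (load first, then increment). It is an illustration of the semantics only, not of how the hardware implements them.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stack: sp is a byte offset into slots and starts just past the end. */
struct toy_stack {
  uint64_t slots[64];
  uint64_t sp;
};

/* Pre-index store pair: SP -= 16, then a goes to [sp] and b to [sp + 8]. */
void stp_pre_index(struct toy_stack *s, uint64_t a, uint64_t b) {
  assert(s->sp >= 16);
  s->sp -= 16;
  s->slots[s->sp / 8] = a;
  s->slots[s->sp / 8 + 1] = b;
}

/* Post-index load pair: a comes from [sp] and b from [sp + 8], then SP += 16. */
void ldp_post_index(struct toy_stack *s, uint64_t *a, uint64_t *b) {
  assert(s->sp + 16 <= sizeof s->slots);
  *a = s->slots[s->sp / 8];
  *b = s->slots[s->sp / 8 + 1];
  s->sp += 16;
}

/* Saves two register pairs, overwrites them, and restores them in reverse
 * order. Returns 1 if all values and the stack pointer are back to normal. */
int save_restore_roundtrip(void) {
  struct toy_stack s = { { 0 }, 0 };
  s.sp = sizeof s.slots; /* empty stack */
  uint64_t x19 = 19, x20 = 20, x21 = 21, x22 = 22;

  stp_pre_index(&s, x19, x20);
  stp_pre_index(&s, x21, x22);

  x19 = x20 = x21 = x22 = 0; /* use registers as needed */

  ldp_post_index(&s, &x21, &x22);
  ldp_post_index(&s, &x19, &x20);

  return x19 == 19 && x20 == 20 && x21 == 21 && x22 == 22 &&
         s.sp == sizeof s.slots;
}
```

Restoring in the reverse order of saving matters: the pair that was stored last sits at the lowest address and is therefore the first one the stack pointer reaches.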
The two instructions in lines 6 and 8 create the stack frame. First, in line 6, the data of the frame pointer register (X29) and the link register (X30) are stored to the stack. Then, the MOV (to/from SP) instruction in line 8 copies the data from the stack pointer to the frame pointer register (X29), as required on Apple platforms, for example.
Note that we also temporarily store the data of the frame pointer register (X29) and the link register (X30) on the stack. Although FP (X29) and LR (X30) are typically not used directly in function implementations, their values must be preserved because they may be modified during nested function calls. In this case, we must restore them after the function calls.