5. Writing Assembly Code: AArch64
Writing code in assembly language gives us fine-grained control over the executed instructions. This section covers the basics and our first math kernel. Once we are comfortable with assembly language, we’ll use these skills to write high-performance kernels for small GEMMs.
5.1. A New World
We get started by repeating the concepts introduced in the lectures. This part is freestyle and aims at getting more comfortable with our new knowledge and tools.
Tasks
Choose one or two assembly examples from the lectures. For these examples, follow the lectures to perform the following steps:
Use the assembler to generate machine code.
Run and test your obtained machine code by writing C/C++ drivers.
Create hex-dumps of the assembled code.
Use the disassembler.
Modify the code to illustrate and test the following two concepts:
Aliases are syntactic sugar: we can replace them with the underlying instructions. Of course, this implies that you chose an example with an alias.
Instead of mnemonics one can also write machine code directly.
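Both concepts can be illustrated with a small sketch. It assumes the A64 encoding of ORR (shifted register) with 64-bit operands and uses the alias mov x0, x1, which stands for orr x0, xzr, x1. The helper names encode_orr_x, memory_bytes and check_mov_alias are hypothetical and only serve this illustration; they are not part of the lectures.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Encodes "orr Xd, Xn, Xm" (64-bit, shifted register, LSL #0).
// Assumed field layout: sf=1 | opc=01 | 01010 | shift=00 | N=0 | Rm | imm6=0 | Rn | Rd
inline uint32_t encode_orr_x( uint32_t i_rd,
                              uint32_t i_rn,
                              uint32_t i_rm ) {
  return 0xaa000000u | ( (i_rm & 31u) << 16 )
                     | ( (i_rn & 31u) <<  5 )
                     |   (i_rd & 31u);
}

// Returns the four bytes of an instruction word in memory order.
// A64 instructions are stored little-endian, so the least significant
// byte comes first, as it would in a hex-dump of the assembled code.
inline std::array< uint8_t, 4 > memory_bytes( uint32_t i_word ) {
  return { { static_cast< uint8_t >( i_word       ),
             static_cast< uint8_t >( i_word >>  8 ),
             static_cast< uint8_t >( i_word >> 16 ),
             static_cast< uint8_t >( i_word >> 24 ) } };
}

// Sanity check: the alias "mov x0, x1" and the underlying
// "orr x0, xzr, x1" (register number 31 is xzr here) share one encoding.
inline void check_mov_alias() {
  assert( encode_orr_x( 0, 31, 1 ) == 0xaa0103e0u );
}
```

With the GNU assembler, such a word could then be written in place of the mnemonic through the .inst directive, e.g., .inst 0xaa0103e0. Comparing the disassembly of both variants is a quick way to confirm that they are identical.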
5.2. GDB and Valgrind
As beginners, we will find that writing code in assembly language is error-prone. The tools GDB and Valgrind are helpful when debugging our code.
We’ll get used to both tools by trying out the following code:
 1      .text
 2      .align 4
 3      .type load_asm, %function
 4      .global load_asm
 5  load_asm:
 6      ldr x1, [x0, #8]!
 7      ldp x2, x3, [x0]
 8      ldp x4, x5, [x0, #16]
 9
10      ret
11      .size load_asm, (. - load_asm)
 1  #include <cstdint>
 2  #include <cstdlib>
 3
 4  extern "C" {
 5    void load_asm( uint64_t const * i_a );
 6  }
 7
 8  int main() {
 9    uint64_t * l_a = new uint64_t[10];
10    for( unsigned short l_va = 0; l_va < 10; l_va++ ) {
11      l_a[l_va] = (l_va+1)*100;
12    }
13
14    // ok
15    load_asm( l_a+2 );
16
17    // not ok #1
18    // load_asm( l_a+12 );
19
20    // not ok #2
21    // load_asm( l_a+8 );
22
23    // not ok #3
24    // load_asm( l_a+6 );
25
26    delete[] l_a;
27
28    return EXIT_SUCCESS;
29  }
Tasks
Explain the assembly code. When executing the function call load_asm( l_a+2 ) in the driver, what are the contents of registers X1-X5 before ret is executed in line 10?
Compile and execute the code. Use -g as compile flag. Now, run the code through GDB:
Set a break-point when entering the function load_asm: break load_asm.
Show the contents of the registers: info registers.
Now step through the load instructions by using step and show the registers’ contents after every step.
Why are lines 18, 21, and 24 in the driver troublesome? Run the uncommented troublemakers through Valgrind and explain the output!
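To reason about the memory accesses before running the tools, the loads can be modeled in a few lines of C++. This is only a sketch under the assumption that the function is called with the pointer l_a + i_off and that every element has 8 bytes; the names LoadModel, model_load_asm and in_bounds are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Element indices (relative to the start of the array) which are read
// by load_asm when it is called with the pointer l_a + i_off.
struct LoadModel {
  std::size_t first; // lowest element index read
  std::size_t last;  // highest element index read
};

inline LoadModel model_load_asm( std::size_t i_off ) {
  // ldr x1, [x0, #8]!     : pre-index, x0 += 8 bytes (one element),
  //                         then loads element i_off+1
  // ldp x2, x3, [x0]      : loads elements i_off+1 and i_off+2
  // ldp x4, x5, [x0, #16] : loads elements i_off+3 and i_off+4
  return { i_off + 1, i_off + 4 };
}

// True if every modeled access stays inside an array of i_size elements.
inline bool in_bounds( std::size_t i_off,
                       std::size_t i_size ) {
  assert( i_size > 0 );
  return model_load_asm( i_off ).last < i_size;
}
```

For the array of ten elements in the driver, in_bounds( 2, 10 ) holds, while the offsets 12, 8 and 6 all reach past the last valid index 9. The number of out-of-bounds elements differs between the three calls, which should be reflected in the invalid reads reported by Valgrind.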
5.3. Copying Data
Now, let us load and store some data. Assume the following piece of code:
#include <cstdint>
#include <cstdlib>

extern "C" {
  void copy_asm( uint32_t const * i_a,
                 uint64_t       * o_b );
  void copy_c( uint32_t const * i_a,
               uint64_t       * o_b );
}

int main() {
  uint32_t l_a[7] = { 1, 21, 43, 78, 89, 91, 93 };
  uint64_t l_b[7] = { 0 };

  copy_asm( l_a,
            l_b );

  // copy_c( l_a,
  //         l_b );

  return EXIT_SUCCESS;
}
The two functions copy_asm and copy_c are supposed to do the same: Copy seven values from one location in memory to another. However, the input array i_a has 32 bits per value while the output array o_b uses 64 bits per value.
Tasks
Implement the function copy_asm in assembly language. Use the filename copy.s for your implementation.
Write “similar” code in C. Use the function name copy_c and filename copy.c.
Compare your implementation to the one generated by the compiler. For the comparison, try two approaches:
Instruct the compiler to generate assembly code using the -S flag.
Compile the code and use the disassembler to generate respective assembly code.
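When testing both implementations, a reference to compare against is handy. The following is a minimal sketch of the widening copy, not the intended solution; the names g_n_values, copy_ref and outputs_match are hypothetical, and the count of seven values is taken from the driver above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of values copied; matches the arrays in the driver.
constexpr std::size_t g_n_values = 7;

// Copies seven 32-bit values to a 64-bit array.
// The conversion uint32_t -> uint64_t zero-extends every value.
inline void copy_ref( uint32_t const * i_a,
                      uint64_t       * o_b ) {
  assert( i_a != nullptr && o_b != nullptr );
  for( std::size_t l_va = 0; l_va < g_n_values; l_va++ ) {
    o_b[l_va] = i_a[l_va];
  }
}

// Compares two output arrays element by element.
inline bool outputs_match( uint64_t const * i_b0,
                           uint64_t const * i_b1 ) {
  for( std::size_t l_va = 0; l_va < g_n_values; l_va++ ) {
    if( i_b0[l_va] != i_b1[l_va] ) return false;
  }
  return true;
}
```

A driver could call copy_asm and copy_ref on separate output arrays and compare them with outputs_match. In assembly, the zero-extension requires no extra instruction if the values are loaded into W registers, since a write to a W register clears the upper 32 bits of the corresponding X register.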
5.4. A Mini Matrix Kernel
Let’s write our first matrix kernel in assembly language. For this we’ll use general purpose registers and the respective instructions. In practice, however, one would typically use vector registers and vector instructions. We’ll do this soon, but work exclusively on general purpose registers to get started.
Our targeted matrix kernel gemm_asm_gp has the following signature:
void gemm_asm_gp( uint32_t const * i_a,
                  uint32_t const * i_b,
                  uint32_t       * io_c );
and performs the operation
$$C \mathrel{+}= A B$$
on 32-bit unsigned integer data with the shapes (4x2) = (4x2) * (2x2), i.e., $M = 4$, $N = 2$ and $K = 2$.
The following template puts all callee-saved registers on the stack and restores them at the end of the function. This allows us to use all general purpose registers in our implementation:
        .text
        .type gemm_asm_gp, %function
        .global gemm_asm_gp
        /*
         * Performs the matrix-multiplication C+=A*B
         * with the shapes (4x2) = (4x2) * (2x2).
         * The input-data is of type uint32_t.
         *
         * @param x0 pointer to A.
         * @param x1 pointer to B.
         * @param x2 pointer to C.
         */
gemm_asm_gp:
        // store
        stp x19, x20, [sp, #-16]!
        stp x21, x22, [sp, #-16]!
        stp x23, x24, [sp, #-16]!
        stp x25, x26, [sp, #-16]!
        stp x27, x28, [sp, #-16]!
        stp x29, x30, [sp, #-16]!

        // your matrix-kernel goes here!

        // restore
        ldp x29, x30, [sp], #16
        ldp x27, x28, [sp], #16
        ldp x25, x26, [sp], #16
        ldp x23, x24, [sp], #16
        ldp x21, x22, [sp], #16
        ldp x19, x20, [sp], #16

        ret
        .size gemm_asm_gp, (. - gemm_asm_gp)
Tasks
The multiply-add instruction (MADD) performs a scalar multiplication and addition on general purpose registers. Look it up and try it out!
Implement the gemm_asm_gp kernel above! In your implementation completely unroll the kernel, i.e., write every instruction explicitly. There’s no need to write any loops or other control structures. This means that you may implement the entire kernel by using the template above and by adding loads (ldr or ldp), multiply-adds (madd) and stores (str or stp).
Embed your implementation in a driver and ensure its correctness!
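For the correctness check, a plain reference kernel in the driver is sufficient. The sketch below makes one assumption which is not stated in this section: all matrices are stored in column-major order with the leading dimensions ldA = 4, ldB = 2 and ldC = 4. If your kernel uses a different layout, the index computations have to be adjusted. The names madd_w and gemm_ref are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Models "madd Wd, Wn, Wm, Wa" on 32-bit registers:
// returns i_a + i_n * i_m, wrapping around modulo 2^32.
inline uint32_t madd_w( uint32_t i_n,
                        uint32_t i_m,
                        uint32_t i_a ) {
  return i_a + i_n * i_m;
}

// Reference for C += A * B with the shapes (4x2) = (4x2) * (2x2).
// ASSUMPTION: column-major storage with ldA = 4, ldB = 2, ldC = 4.
inline void gemm_ref( uint32_t const * i_a,
                      uint32_t const * i_b,
                      uint32_t       * io_c ) {
  assert( i_a != nullptr && i_b != nullptr && io_c != nullptr );
  for( unsigned int l_n = 0; l_n < 2; l_n++ ) {
    for( unsigned int l_m = 0; l_m < 4; l_m++ ) {
      for( unsigned int l_k = 0; l_k < 2; l_k++ ) {
        // one multiply-add per (m, n, k) triple: 16 in total
        io_c[ l_n*4 + l_m ] = madd_w( i_a[ l_k*4 + l_m ],
                                      i_b[ l_n*2 + l_k ],
                                      io_c[ l_n*4 + l_m ] );
      }
    }
  }
}
```

A driver could initialize A, B and two copies of C, call gemm_asm_gp on one copy and gemm_ref on the other, and compare the eight entries. The three loops also show what the unrolled kernel has to cover: every iteration of the innermost loop corresponds to one madd instruction.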