5. Writing Assembly Code: AArch64

Writing code in assembly language gives us fine-grained control over the executed instructions. This section covers the basics and our first math kernel. Once we are comfortable with assembly language, we’ll use these skills to write high-performance kernels for small GEMMs.

5.1. A New World

We get started by repeating the concepts introduced in the lectures. This part is freestyle and aims to make us more comfortable with our new knowledge and tools.

Tasks

Choose one or two assembly examples from the lectures. For these examples, follow the lectures to perform the following steps:

  • Use the assembler to generate machine code.

  • Run and test your obtained machine code by writing C/C++ drivers.

  • Create hex-dumps of the assembled code.

  • Use the disassembler.

  • Modify the code to illustrate and test the following two concepts:

    • Aliases are syntactic sugar: we can replace them with the underlying instructions. Of course, this implies that you choose an example which uses an alias.

    • Instead of mnemonics one can also write machine code directly.
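The last two points can be made concrete with a small encoder. The sketch below assumes the A64 encoding of ORR (shifted register) in its 64-bit form without an applied shift; the helper names encode_orr_x and encode_mov_x are made up for this illustration and are not part of the lecture material. It shows that the alias mov x0, x1 is the instruction orr x0, xzr, x1, and it produces the 32-bit word that one could write in place of the mnemonic (with the GNU assembler, for example, through the .inst directive).

```cpp
#include <cstdint>

// Encodes the 64-bit form of ORR (shifted register) with shift type LSL and
// shift amount 0, i.e., orr x<rd>, x<rn>, x<rm>.
// Register number 31 denotes xzr in this instruction.
// encode_orr_x is a hypothetical helper used for illustration only.
std::uint32_t encode_orr_x( unsigned i_rd,
                            unsigned i_rn,
                            unsigned i_rm ) {
  std::uint32_t l_word = 0xAA000000u;  // sf=1, opc=01, fixed bits 01010, shift=00, N=0
  l_word |= ( i_rm & 0x1Fu ) << 16;    // Rm in bits 20..16
  l_word |= ( i_rn & 0x1Fu ) << 5;     // Rn in bits 9..5
  l_word |= ( i_rd & 0x1Fu );          // Rd in bits 4..0
  return l_word;
}

// mov x<rd>, x<rm> is an alias of orr x<rd>, xzr, x<rm>.
std::uint32_t encode_mov_x( unsigned i_rd,
                            unsigned i_rm ) {
  return encode_orr_x( i_rd, 31, i_rm );
}
```

Calling encode_mov_x( 0, 1 ) yields 0xAA0103E0, which should match the word shown in a hex-dump or disassembly of mov x0, x1.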

5.2. GDB and Valgrind

As beginners, we will find that writing assembly code is error-prone. GDB and Valgrind are helpful tools for debugging our code.

We’ll get used to both tools by trying out the following code:

 1        .text
 2        .align 4
 3        .type   load_asm, %function
 4        .global load_asm
 5load_asm:
 6        ldr x1,     [x0, #8]!
 7        ldp x2, x3, [x0]
 8        ldp x4, x5, [x0, #16]
 9
10        ret
11        .size load_asm, (. - load_asm)

 1#include <cstdint>
 2#include <cstdlib>
 3
 4extern "C" {
 5  void load_asm( uint64_t const * i_a );
 6}
 7
 8int main() {
 9  uint64_t * l_a = new uint64_t[10];
10  for( unsigned short l_va = 0; l_va < 10; l_va++ ) {
11    l_a[l_va] = (l_va+1)*100;
12  }
13
14  // ok
15  load_asm( l_a+2 );
16
17  // not ok #1
18  // load_asm( l_a+12 );
19
20  // not ok #2
21  // load_asm( l_a+8 );
22
23  // not ok #3
24  // load_asm( l_a+6 );
25
26  delete[] l_a;
27
28  return EXIT_SUCCESS;
29}

Tasks

  1. Explain the assembly code. When executing the function call load_asm( l_a+2 ) in the driver, what are the contents of registers X1-X5 before ret is executed in line 10?

  2. Compile the code with the -g flag and execute it. Then run the code through GDB:

    • Set a break-point when entering the function load_asm: break load_asm.

    • Show the contents of the registers: info registers.

    • Now step through the load instructions by using step (or stepi, which advances by exactly one machine instruction) and show the registers’ contents after every step.

  3. Why are lines 18, 21, and 24 in the driver troublesome? Uncomment each of the troublesome calls in turn, run it through Valgrind, and explain the output!
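One way to cross-check the register values observed in GDB is a plain C++ model of the three loads. The sketch below is only an illustration: the names LoadResult and load_model are hypothetical, and the model covers nothing but the address arithmetic of the pre-index and offset addressing modes used in load_asm.

```cpp
#include <cstdint>

// Values of registers x1-x5 after the three loads.
struct LoadResult {
  std::uint64_t x1, x2, x3, x4, x5;
};

// Models load_asm in plain C++; i_x0 plays the role of register x0.
LoadResult load_model( std::uint64_t const * i_x0 ) {
  LoadResult l_res{};

  // ldr x1, [x0, #8]!
  // Pre-index: x0 is advanced by 8 bytes (one 64-bit element) first,
  // then the load uses the updated address.
  i_x0 += 1;
  l_res.x1 = i_x0[0];

  // ldp x2, x3, [x0]
  // Loads two consecutive 64-bit values at x0 and x0+8.
  l_res.x2 = i_x0[0];
  l_res.x3 = i_x0[1];

  // ldp x4, x5, [x0, #16]
  // Offset addressing: loads from x0+16 and x0+24, x0 stays unchanged.
  l_res.x4 = i_x0[2];
  l_res.x5 = i_x0[3];

  return l_res;
}
```

The model touches the elements one to four positions behind the passed pointer, which is a useful starting point when reasoning about the calls that Valgrind complains about.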

5.3. Copying Data

Now, let us load and store some data. Consider the following driver:

#include <cstdint>
#include <cstdlib>

extern "C" {
  void copy_asm( uint32_t const * i_a,
                 uint64_t       * o_b );
  void copy_c( uint32_t const * i_a,
               uint64_t       * o_b );
}

int main() {
  uint32_t l_a[7] = { 1, 21, 43, 78, 89, 91, 93 };
  uint64_t l_b[7] = { 0 };

  copy_asm( l_a,
            l_b );

  // copy_c( l_a,
  //         l_b );

  return EXIT_SUCCESS;
}

The two functions copy_asm and copy_c are supposed to do the same thing: copy seven values from one location in memory to another. However, the input array i_a has 32 bits per value, while the output array o_b uses 64 bits per value. Since the values are unsigned, every value has to be zero-extended on its way from i_a to o_b.

Tasks

  1. Implement the function copy_asm in assembly language. Use the filename copy.s for your implementation.

  2. Write “similar” code in C. Use the function-name copy_c and filename copy.c.

  3. Compare your implementation to the one generated by the compiler. For the comparison, try two approaches:

    1. Instruct the compiler to generate assembly code using the -S flag.

    2. Compile the code and use the disassembler to generate the corresponding assembly code.
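To see whether copy_asm and copy_c really produce the expected output, the driver could compare l_b against the input after each call. The following sketch assumes a hypothetical helper copy_matches with an explicit element count; neither the name nor the extra parameter is part of the assignment.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true if every 64-bit output value equals the zero-extended
// 32-bit input value at the same position.
// copy_matches is a hypothetical helper used for illustration only.
bool copy_matches( std::uint32_t const * i_a,
                   std::uint64_t const * i_b,
                   std::size_t           i_n ) {
  for( std::size_t l_i = 0; l_i < i_n; l_i++ ) {
    // the implicit conversion from uint32_t to uint64_t zero-extends
    std::uint64_t l_expected = i_a[l_i];
    if( i_b[l_i] != l_expected ) {
      return false;
    }
  }
  return true;
}
```

In the driver above, a call such as copy_matches( l_a, l_b, 7 ) right after copy_asm( l_a, l_b ) would flag wrong values as well as upper halves of the 64-bit outputs that were left uninitialized by accident.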

5.4. A Mini Matrix Kernel

Let’s write our first matrix kernel in assembly language. For this, we’ll use general-purpose registers and the corresponding instructions. In practice, one would typically use vector registers and vector instructions. We’ll do this soon, but work exclusively on general-purpose registers to get started.

Our targeted matrix kernel gemm_asm_gp has the following signature:

void gemm_asm_gp( uint32_t const * i_a,
                  uint32_t const * i_b,
                  uint32_t       * io_c );

and performs the operation

\[C \mathrel{+}= A B\]

on 32-bit unsigned integer data with

\[M = 4, \; N = 2, \; K = 2, \quad ldA = 4, \; ldB = 2, \; ldC = 4.\]
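For orientation, the operation can be written down as a plain C++ reference. The sketch assumes column-major storage, which the leading dimensions suggest but which is not stated explicitly here; the function name gemm_ref is hypothetical. It is meant as a baseline to compare against, not as the kernel itself.

```cpp
#include <cstdint>

// Reference for C += A * B with M=4, N=2, K=2 and ldA=4, ldB=2, ldC=4.
// Assumption: all matrices are stored in column-major order, i.e.,
// element (row, col) of a matrix X is located at X[ col*ldX + row ].
void gemm_ref( std::uint32_t const * i_a,
               std::uint32_t const * i_b,
               std::uint32_t       * io_c ) {
  constexpr unsigned l_m   = 4;
  constexpr unsigned l_n   = 2;
  constexpr unsigned l_k   = 2;
  constexpr unsigned l_lda = 4;
  constexpr unsigned l_ldb = 2;
  constexpr unsigned l_ldc = 4;

  for( unsigned l_col = 0; l_col < l_n; l_col++ ) {
    for( unsigned l_red = 0; l_red < l_k; l_red++ ) {
      for( unsigned l_row = 0; l_row < l_m; l_row++ ) {
        // unsigned 32-bit arithmetic wraps around on overflow
        io_c[ l_col*l_ldc + l_row ] +=   i_a[ l_red*l_lda + l_row ]
                                       * i_b[ l_col*l_ldb + l_red ];
      }
    }
  }
}
```

With these sizes the loop nest performs 4 * 2 * 2 = 16 multiply-add updates in total, which gives an idea of how many multiply-add instructions a fully unrolled kernel contains.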

The following template puts all callee-saved registers on the stack and restores them at the end of the function. This allows us to use all general purpose registers in our implementation:

        .text
        .type gemm_asm_gp, %function
        .global gemm_asm_gp
        /*
         * Performs the matrix-multiplication C+=A*B
         * with the shapes (4x2) = (4x2) * (2x2).
         * The input-data is of type uint32_t.
         *
         * @param x0 pointer to A.
         * @param x1 pointer to B.
         * @param x2 pointer to C.
         */
gemm_asm_gp:
        // store
        stp x19, x20, [sp, #-16]!
        stp x21, x22, [sp, #-16]!
        stp x23, x24, [sp, #-16]!
        stp x25, x26, [sp, #-16]!
        stp x27, x28, [sp, #-16]!
        stp x29, x30, [sp, #-16]!

        // your matrix-kernel goes here!

        // restore
        ldp x29, x30, [sp], #16
        ldp x27, x28, [sp], #16
        ldp x25, x26, [sp], #16
        ldp x23, x24, [sp], #16
        ldp x21, x22, [sp], #16
        ldp x19, x20, [sp], #16

        ret
        .size gemm_asm_gp, (. - gemm_asm_gp)

Tasks

  1. The multiply-add instruction (MADD) performs a scalar multiplication and addition on general purpose registers. Look it up and try it out!

  2. Implement the gemm_asm_gp kernel above! In your implementation, completely unroll the kernel, i.e., write every instruction explicitly. There’s no need to write any loops or other control structures. This means that you may implement the entire kernel by using the template above and adding loads (ldr or ldp), multiply-adds (madd), and stores (str or stp).

  3. Embed your implementation in a driver and ensure its correctness!
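When trying out MADD in the first task, it can help to have its semantics spelled out in C++. The sketch below models the 32-bit form, madd wd, wn, wm, wa, which computes wd = wa + wn * wm and keeps the lower 32 bits of the result; the function name madd32 is hypothetical.

```cpp
#include <cstdint>

// Models the 32-bit form of MADD: returns i_a + i_n * i_m,
// truncated to the lower 32 bits.
// madd32 is a hypothetical helper used for illustration only.
std::uint32_t madd32( std::uint32_t i_n,
                      std::uint32_t i_m,
                      std::uint32_t i_a ) {
  // compute in 64 bits, then keep the lower 32 bits explicitly
  std::uint64_t l_wide = static_cast< std::uint64_t >( i_n ) * i_m + i_a;
  return static_cast< std::uint32_t >( l_wide );
}
```

A single entry of C is then updated by K = 2 chained calls, where the result of the first call is passed as the addend of the second. This mirrors how two madd instructions can accumulate into the same register in the unrolled kernel.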