Writing Assembly Code: AArch64
==============================
Writing code in assembly language gives us fine-grained control over the executed instructions.
This section covers the basics and our first math kernel.
Once we are comfortable with assembly language, we'll use these skills to write high-performance kernels for small GEMMs.

A New World
-----------
We get started by repeating the concepts introduced in the lectures.
This part is freestyle and aims to make us more comfortable with our new knowledge and tools.

.. admonition:: Tasks

   Choose one or two assembly examples from the lectures.
   For these examples, follow the lectures to perform the following steps:
   
   * Use the assembler to generate machine code.
   * Run and test your obtained machine code by writing C/C++ drivers.
   * Create hex-dumps of the assembled code.
   * Use the disassembler.
   * Modify the code to illustrate and test the following two concepts:

     * Aliases are syntactic sugar: we can replace them with the underlying instructions.
       Of course, this implies that you have to choose an example with an alias.
     * Instead of mnemonics, one can also write machine code directly.
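
The last two concepts can be illustrated without touching the assembler.
The following C++ sketch is not taken from the lectures: the helper names ``encode_orr_x``, ``encode_mov_x`` and ``to_hex_dump`` are made up for illustration.
It builds the 32-bit instruction word of ``orr x<d>, x<n>, x<m>`` by hand, shows that the alias ``mov x<d>, x<m>`` is the very same word with ``xzr`` as first source register, and prints the four bytes in the little-endian order in which a hex-dump lists them.

.. code-block:: c++

    #include <cassert>
    #include <cstdint>
    #include <cstdio>
    #include <string>

    // Hypothetical helper: encodes the 64-bit ORR (shifted register)
    // instruction with shift LSL #0, i.e., orr x<rd>, x<rn>, x<rm>.
    // Register number 31 selects XZR in this encoding.
    std::uint32_t encode_orr_x( std::uint32_t i_rd,
                                std::uint32_t i_rn,
                                std::uint32_t i_rm ) {
      assert( i_rd < 32 && i_rn < 32 && i_rm < 32 );

      std::uint32_t l_ins = 0xaa000000u; // sf=1, opc=01, 01010, shift=00, N=0
      l_ins |= i_rm << 16;               // Rm in bits 20:16, imm6 stays zero
      l_ins |= i_rn << 5;                // Rn in bits 9:5
      l_ins |= i_rd;                     // Rd in bits 4:0
      return l_ins;
    }

    // mov x<rd>, x<rm> is an alias of orr x<rd>, xzr, x<rm>.
    std::uint32_t encode_mov_x( std::uint32_t i_rd,
                                std::uint32_t i_rm ) {
      return encode_orr_x( i_rd, 31, i_rm );
    }

    // Formats an instruction word as it appears byte by byte in memory.
    // A64 instructions are stored little-endian, lowest byte first.
    std::string to_hex_dump( std::uint32_t i_ins ) {
      char l_buf[12];
      std::snprintf( l_buf, sizeof( l_buf ), "%02x %02x %02x %02x",
                     static_cast< unsigned >(   i_ins         & 0xffu ),
                     static_cast< unsigned >( ( i_ins >>  8 ) & 0xffu ),
                     static_cast< unsigned >( ( i_ins >> 16 ) & 0xffu ),
                     static_cast< unsigned >( ( i_ins >> 24 ) & 0xffu ) );
      return std::string( l_buf );
    }

For example, ``encode_mov_x( 0, 1 )`` yields ``0xaa0103e0``, which is the word the disassembler shows for ``mov x0, x1``.
In an assembly file, such a word can be placed directly, for example with the GNU assembler's ``.inst`` directive.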

GDB and Valgrind
----------------
As beginners, we will find that writing code in assembly language is error-prone.
The tools `GDB <https://www.gnu.org/software/gdb/>`__ and `Valgrind <https://valgrind.org/>`__ are helpful when debugging our code.

We'll get used to both tools by trying out the following code:

.. literalinclude:: data_aarch64/load_asm.s
    :linenos:
    :language: asm

.. literalinclude:: data_aarch64/driver_bugs.cpp
    :linenos:
    :language: cpp

.. admonition:: Tasks

   #. Explain the assembly code.
      When executing the function call ``load_asm( l_a+2 )`` in the driver,
      what are the contents of registers X1-X5 before ``ret`` is executed in line 10?
   #. Compile and execute the code.
      Use ``-g`` as compile flag.
      Now, run the code through GDB:

      * Set a breakpoint when entering the function ``load_asm``: ``break load_asm``.
      * Show the contents of the registers: ``info registers``.
      * Now step through the load instructions by using ``step`` and show the registers' contents after every step.
   #. Why are lines 18, 21, and 24 in the driver troublesome?
      Uncomment the troublemakers, run them through Valgrind, and explain the output!

Copying Data
------------
Now, let us load and store some data.
Assume the following piece of code:

.. literalinclude:: data_aarch64/driver_copy.cpp
    :linenos:
    :language: cpp

The two functions ``copy_asm`` and ``copy_c`` are supposed to do the same thing:
copy seven values from one location in memory to another.
However, the input array ``i_a`` has 32 bits per value while the output array ``o_b`` uses 64 bits per value.

.. admonition:: Tasks

   #. Implement the function ``copy_asm`` in assembly language.
      Use the filename ``copy.s`` for your implementation.
   #. Write "similar" code in C.
      Use the function name ``copy_c`` and the filename ``copy.c``.
   #. Compare your implementation to the one generated by the compiler.
      For the comparison, try two approaches:

      a) Instruct the compiler to generate assembly code using the ``-S`` flag.
      b) Compile the code and use the disassembler to obtain the corresponding assembly code.
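
To make the widening from 32 to 64 bits concrete, here is a minimal C++ sketch.
It assumes that both arrays hold unsigned values; the name ``copy_ref`` and its signature are hypothetical and not taken from the driver.

.. code-block:: c++

    #include <cassert>
    #include <cstdint>

    // Hypothetical reference: copies seven 32-bit values into 64-bit slots.
    // Assumption: the values are unsigned, so every value is zero-extended.
    void copy_ref( std::uint32_t const * i_a,
                   std::uint64_t       * o_b ) {
      assert( i_a != nullptr && o_b != nullptr );

      for( int l_i = 0; l_i < 7; l_i++ ) {
        // implicit conversion uint32_t -> uint64_t fills the upper 32 bits with zeros
        o_b[l_i] = i_a[l_i];
      }
    }

In assembly, the same widening happens implicitly: a 32-bit load such as ``ldr w1, [x0]`` clears the upper 32 bits of ``x1``, so a subsequent 64-bit store of ``x1`` writes the zero-extended value.
If the data were signed, one would load with ``ldrsw`` instead to sign-extend.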

A Mini Matrix Kernel
--------------------
Let's write our first matrix kernel in assembly language.
For this, we'll use general purpose registers and the corresponding instructions.
In practice, however, one would typically use vector registers and vector instructions.
We'll do this soon, but work exclusively with general purpose registers to get started.

Our targeted matrix kernel ``gemm_asm_gp`` has the following signature:

.. code-block:: c++

     void gemm_asm_gp( uint32_t const * i_a,
                       uint32_t const * i_b,
                       uint32_t       * io_c );

and performs the operation

.. math::

  C \mathrel{+}= A B

on 32-bit unsigned integer data with

.. math::

  M = 4, \; N = 2, \; K = 2, \quad ldA = 4, \; ldB = 2, \; ldC = 4.
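
The operation can be written down as a plain triple loop.
The following C++ sketch assumes column-major storage for all three matrices, which is what the leading dimensions above suggest; the name ``gemm_ref`` is hypothetical.
Such a reference is handy for checking the assembly kernel later on.

.. code-block:: c++

    #include <cassert>
    #include <cstdint>

    // Hypothetical reference for C += A B with M=4, N=2, K=2.
    // Assumption: column-major storage, i.e., element (row, col) of a
    // matrix with leading dimension ld is located at index col*ld + row.
    void gemm_ref( std::uint32_t const * i_a,
                   std::uint32_t const * i_b,
                   std::uint32_t       * io_c ) {
      assert( i_a != nullptr && i_b != nullptr && io_c != nullptr );

      constexpr int l_m = 4, l_n = 2, l_k = 2;
      constexpr int l_lda = 4, l_ldb = 2, l_ldc = 4;

      for( int l_col = 0; l_col < l_n; l_col++ ) {
        for( int l_row = 0; l_row < l_m; l_row++ ) {
          for( int l_red = 0; l_red < l_k; l_red++ ) {
            // unsigned 32-bit arithmetic wraps around on overflow
            io_c[ l_col*l_ldc + l_row ] +=   i_a[ l_red*l_lda + l_row ]
                                           * i_b[ l_col*l_ldb + l_red ];
          }
        }
      }
    }

With these sizes, A occupies eight values, B four values, and C eight values in memory.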

The following template puts all callee-saved registers on the stack and restores them at the end of the function.
This allows us to use all general purpose registers in our implementation:

.. literalinclude:: data_aarch64/template.s
    :linenos:
    :language: asm

.. admonition:: Tasks

   #. The multiply-add instruction (`MADD <https://developer.arm.com/documentation/ddi0602/latest/Base-Instructions/MADD--Multiply-Add->`__) performs a scalar multiplication and addition on general purpose registers.
      Look it up and try it out!
   #. Implement the ``gemm_asm_gp`` kernel above!
      In your implementation, completely unroll the kernel, i.e., write every instruction explicitly.
      There's no need to write any loops or other control structures.
      This means that you may implement the entire kernel by using the template above and by adding loads (``ldr`` or ``ldp``), multiply-adds (``madd``) and stores (``str`` or ``stp``).
   #. Embed your implementation in a driver and ensure its correctness!
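
To get a feeling for what a single ``madd`` contributes, the 32-bit form ``madd wd, wn, wm, wa`` can be modeled in C++ as ``wd = wa + wn * wm`` with wrap-around at 32 bits.
The sketch below uses the hypothetical helpers ``madd32`` and ``c00_unrolled`` and assumes column-major storage.
It spells out the two multiply-adds that update the first element of C; the unrolled kernel repeats this pattern for all eight elements.

.. code-block:: c++

    #include <cassert>
    #include <cstdint>

    // Hypothetical model of "madd wd, wn, wm, wa": returns wa + wn * wm.
    // Unsigned 32-bit arithmetic wraps around, just like the W registers.
    std::uint32_t madd32( std::uint32_t i_wn,
                          std::uint32_t i_wm,
                          std::uint32_t i_wa ) {
      return i_wa + i_wn * i_wm;
    }

    // Fully unrolled update of C(0,0) for K=2.
    // Assumption: column-major storage with ldA=4 and ldB=2.
    std::uint32_t c00_unrolled( std::uint32_t const * i_a,
                                std::uint32_t const * i_b,
                                std::uint32_t const * i_c ) {
      assert( i_a != nullptr && i_b != nullptr && i_c != nullptr );

      std::uint32_t l_c = i_c[0];            // load C(0,0)
      l_c = madd32( i_a[0], i_b[0], l_c );   // += A(0,0) * B(0,0)
      l_c = madd32( i_a[4], i_b[1], l_c );   // += A(0,1) * B(1,0)
      return l_c;                            // value to store back to C(0,0)
    }

In the assembly kernel, the loads and the final store are the ``ldr``/``ldp`` and ``str``/``stp`` instructions, and every call to ``madd32`` corresponds to one ``madd``.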