.. _ch:aarch64:

AArch64
=======
This lab represents a short interlude from our hardware focus.
For the time being we'll shift our attention to the AArch64 Instruction Set Architecture (ISA).
An AArch64 instruction can be written in two ways:

1) Through human-readable assembly code; or
2) Through machine code which is understood directly by the processor.

In this lab we'll first get a feeling for the ISA by writing a few simple functions in assembly code.
Once this is accomplished, we'll learn how to directly use and understand machine code in :numref:`ch:single_cycle`.
Understanding machine code is not only helpful but crucial when studying the structure of processors.

.. _ch:aarch64_copy:

Copying Data
------------
.. figure:: /chapters/data_aarch64/load_store_gprs.svg
   :name: fig:aarch64_load_store_gprs
   :align: center
   :width: 70%

   Left: Illustration of a load-store architecture.
   The ALU is only able to access data in registers.
   Data residing in the memory subsystem has to be loaded to registers first before it can be processed.
   Right: Illustration of AArch64's 31 64-bit `general purpose registers <https://developer.arm.com/documentation/den0024/a/ARMv8-Registers>`__, the `special registers <https://developer.arm.com/documentation/den0024/a/ARMv8-Registers/AArch64-special-registers>`__ ZR, SP, PC, and `PSTATE <https://developer.arm.com/documentation/den0024/a/ARMv8-Registers/Processor-state>`__.
   The architectural names of the general purpose registers are R0 through R30.

AArch64 is a load-store architecture (see :numref:`fig:aarch64_load_store_gprs`).
This means that instructions either perform memory accesses or operate on data in registers.
Note that an instruction may not do both, i.e., access memory and process data, at the same time.

A memory access instruction transfers data from memory to the registers (load) or transfers data from the registers to memory (store).
In this task we'll copy data which is located at one memory location to another memory location.
Since we cannot directly move data between two memory locations, we first load the data from the first location to the registers and then write it back to the target memory location.
For this task we'll use AArch64's general purpose registers which are shown in :numref:`fig:aarch64_load_store_gprs`.
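
The load-then-store pattern described above can be sketched in C for a single value.
This is only an illustration under stated assumptions: the function name ``copy_one_sketch`` and the local variable ``l_tmp`` are made up for this example and are not part of the lab's templates.
The local variable merely stands in for a general purpose register; whether a compiler actually keeps it in a register is the compiler's decision.

.. code-block:: c

   #include <stdint.h>

   /* Hypothetical helper, not part of the lab's templates.
    * Copies a single 64-bit value from i_src to o_dst. */
   void copy_one_sketch( uint64_t const * i_src,
                         uint64_t       * o_dst ) {
     /* "load": memory -> local variable (stands in for a register) */
     uint64_t l_tmp = *i_src;
     /* "store": local variable -> memory */
     *o_dst = l_tmp;
   }

Calling ``copy_one_sketch( &l_a, &l_b )`` for two ``uint64_t`` variables leaves ``l_a`` unchanged and writes its value to ``l_b``.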

.. admonition:: Optional Note

   Certain recent extensions of the Arm architecture violate the concept of a strict "load-store architecture" 🙄.
   One such example is the `LDADD <https://developer.arm.com/documentation/ddi0602/2021-09/Base-Instructions/LDADD--LDADDA--LDADDAL--LDADDL--Atomic-add-on-word-or-doubleword-in-memory->`_ instruction which loads data from memory, adds a value in a register to it, and writes the result back to memory.
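
   At the C level, the closest standard counterpart is probably C11's ``atomic_fetch_add``, which adds a value to a memory location and returns the previous content.
   The sketch below is an assumption-laden illustration: the wrapper name ``fetch_add_sketch`` is hypothetical, and whether a compiler really emits an ``LDADD`` variant for it depends on the toolchain and the targeted architecture extension.

   .. code-block:: c

      #include <stdatomic.h>
      #include <stdint.h>

      /* Hypothetical helper: atomically adds i_value to *io_mem and
       * returns the value that was in memory before the addition. */
      uint64_t fetch_add_sketch( _Atomic uint64_t * io_mem,
                                 uint64_t           i_value ) {
        return atomic_fetch_add( io_mem, i_value );
      }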

.. literalinclude:: data_aarch64/driver_copy.cpp
    :language: cpp
    :caption: C++ driver for the C and assembly copy kernels.
    :name: listing:driver_copy

.. literalinclude:: data_aarch64/copy_c.c
    :language: c
    :caption: C copy kernel.
    :name: listing:copy_c

.. literalinclude:: data_aarch64/copy_asm.s
    :language: asm
    :caption: Template for the copy kernel in assembly language.
    :name: listing:copy_asm

The code in :numref:`listing:driver_copy` and :numref:`listing:copy_asm` provides the required boilerplate for your kernel.
Further, a reference implementation of the copy function in C is given in :numref:`listing:copy_c`.
Your task is to copy :numref:`listing:driver_copy`'s seven `64-bit unsigned integers <https://en.cppreference.com/w/cpp/types/integer>`_ in array ``l_a`` to array ``l_b_1`` by implementing the function ``copy_asm`` in assembly language.

.. note::
  Use the instructions `LDR (immediate) <https://developer.arm.com/documentation/ddi0602/2021-12/Base-Instructions/LDR--immediate---Load-Register--immediate-->`_ and `STR (immediate) <https://developer.arm.com/documentation/ddi0602/2021-12/Base-Instructions/STR--immediate---Store-Register--immediate-->`_ for the loads and stores in your implementation.
  Do not implement any stack transfers and only use the first 18 general purpose registers, i.e., ``R0`` - ``R17``, to adhere to the `procedure call standard <https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst>`_.

  Use the flags ``-pedantic -Wall -Wextra -Werror`` whenever invoking ``gcc`` or ``g++``.
  Do this not only here but in all tasks.

.. admonition:: Tasks

  #. Implement the function ``copy_asm`` in the file ``copy_asm.s``.
     Use the template in :numref:`listing:copy_asm` for your implementation.
     Follow the ideas of the C implementation in :numref:`listing:copy_c`, i.e., do not use any loops in your code.

  #. Compile the C kernel ``copy_c`` given in :numref:`listing:copy_c` using the optimization flag ``-O2``.
     Disassemble the compiler-generated machine code.
     Briefly explain the obtained assembly code.

  #. Implement a new function ``copy_asm_loop`` in the file ``copy_asm.s``.
     In this implementation use a loop to copy the seven values.

Adding Two Arrays
-----------------
Great, we are able to move data from A to B.
Even better if we could process our data, don't you think?
Let's do another simple example for this!

Assume that you have two memory addresses which are stored in the pointers ``l_a`` and ``l_b``.
Each address marks the start of an array of 64-bit unsigned integers which are stored consecutively in memory.
For example, if you have 10 values, each array is 10 :math:`\times` 64 bits = 640 bits large.
This is the same as 80 bytes per array or 160 bytes for all values together.
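
The size arithmetic above can be double-checked with a few lines of C.
The helper name ``array_bytes_sketch`` is made up for this illustration; the only assumption is that ``uint64_t`` occupies eight bytes, which holds on AArch64.

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical helper: size in bytes of an array holding
    * i_n_values 64-bit unsigned integers. */
   size_t array_bytes_sketch( size_t i_n_values ) {
     return i_n_values * sizeof( uint64_t );
   }

For 10 values the helper returns 80, i.e., 80 bytes per array and twice that, 160 bytes, for both input arrays.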

Now, our goal is to add the values in the two arrays ``l_a`` and ``l_b``, and store the result at a third location in memory.
Getting the data into the general purpose registers and back to memory is simple: we just programmed a kernel for this in :numref:`ch:aarch64_copy`.
The only missing piece of the puzzle is an instruction which processes the data and effectively adds the values in two general purpose registers.
For this, we once again have a look at the `base instructions <https://developer.arm.com/documentation/ddi0602/2021-12/Base-Instructions>`__ of the ISA.
`ADD (shifted register) <https://developer.arm.com/documentation/ddi0602/2021-12/Base-Instructions/ADD--shifted-register---Add--shifted-register-->`__ is a suitable instruction.

.. literalinclude:: data_aarch64/driver_add.cpp
    :language: cpp
    :caption: C++ driver for the C and assembly addition kernels.
    :name: listing:driver_add

.. literalinclude:: data_aarch64/add_c.c
    :language: c
    :caption: C kernel which adds the ``i_n_values`` values of the two arrays ``i_a`` and ``i_b`` and writes them to ``o_c``.
    :name: listing:add_c

Once again, to supercharge your coding, a template for the required C++ driver is given in :numref:`listing:driver_add`.
Further, a reference C implementation of the addition kernel is given in :numref:`listing:add_c`.
Thus, the only missing part is the assembly kernel: Time to get to work!


.. admonition:: Tasks

  #. Implement the function ``add_asm`` in assembly language and use the file ``add_asm.s`` for your implementation.
     Follow the ideas of the C implementation in :numref:`listing:add_c`.

  #. Compile the C kernel ``add_c`` given in :numref:`listing:add_c` using the optimization flag ``-O2``.
     Disassemble the compiler-generated machine code.
     Briefly explain the obtained assembly code!

Computing Fibonacci Numbers
---------------------------
Let's program something useful for a change 😂.
The Fibonacci numbers are given by the following sequence:

.. math::
  F_0 &= 0,\\
  F_1 &= 1,\\
  F_n &= F_{n-1} + F_{n-2} \quad \forall n \ge 2.
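
As a short worked example, applying the recurrence to the first few indices gives:

.. math::
  F_2 &= F_1 + F_0 = 1 + 0 = 1,\\
  F_3 &= F_2 + F_1 = 1 + 1 = 2,\\
  F_4 &= F_3 + F_2 = 2 + 1 = 3,\\
  F_5 &= F_4 + F_3 = 3 + 2 = 5.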

:numref:`listing:driver_fibonacci` provides the usual C++ driver.
As shown in lines 6 and 7, the C and assembly functions take the id :math:`n` as input and return the respective Fibonacci number, i.e., :math:`F_n`.
Once again, we'll get started by implementing a C function which is somewhat close to assembly code.
This will then be our recipe for the assembly variant.

.. literalinclude:: data_aarch64/driver_fibonacci.cpp
    :linenos:
    :language: cpp
    :caption: Driver for the C and assembly kernels which compute Fibonacci numbers.
    :name: listing:driver_fibonacci

.. admonition:: Tasks

  #. Implement the reference version ``fibonacci_c`` in the file ``fibonacci_c.c``.
     Try to keep your implementation close to what you would do in assembly language.

  #. Implement the assembly version ``fibonacci_asm`` in the file ``fibonacci_asm.s``.
     Keep your implementation dynamic, i.e., the function should accept :math:`n` as input argument.
     This is also underlined by the function declaration's argument ``uint64_t i_id`` in line 7 of :numref:`listing:driver_fibonacci`.

     .. hint::

        Keep in mind the `procedure call standard <https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst>`_, i.e., the compiler will make the input ``i_id`` available in ``X0``. You have to return the ``uint64_t`` result in ``X0`` as well.