.. _ch:assembly_building_blocks:

Assembly Building Blocks
========================

Several code constructs appear frequently in source code.
Examples are conditional code execution based on if-then-else statements, or loops.
In :numref:`ch:conds_and_loops` we'll implement some of these building blocks in assembly code.
This not only extends our toolbox but also helps us to understand the impact of their high-level equivalents on the hardware.

:numref:`ch:assembly_building_blocks_unrolled` interprets "code constructs" by means of BLAS3.
Here, we'll develop a microkernel for small GEMMs using the Advanced SIMD (ASIMD) extension.
This microkernel may function as a building block for more general GEMMs harnessing ASIMD.
However, in this class, we'll only generalize the kernel in the contraction dimension :math:`K`.
We'll move to the Scalable Vector Extension (SVE) for a more general GEMM implementation in :numref:`ch:gemms_sve`.
This allows us to get a solid understanding of the differences between ASIMD and SVE.

.. _ch:conds_and_loops:

Conditions and Loops
--------------------

First, we'll have a look at some small C/C++ functions and formulate them in assembly language.
For this, three source files and the respective output are given:

.. literalinclude:: data_building_blocks/building_blocks/high_level.h
   :linenos:
   :language: cpp
   :caption: File ``high_level.h`` which defines the C/C++ functions' signatures.

.. literalinclude:: data_building_blocks/building_blocks/high_level.cpp
   :linenos:
   :language: cpp
   :caption: File ``high_level.cpp`` which implements the C/C++ functions.

.. literalinclude:: data_building_blocks/building_blocks/driver.cpp
   :linenos:
   :language: cpp
   :caption: File ``driver.cpp`` which calls the given C/C++ functions.

.. literalinclude:: data_building_blocks/building_blocks/run.log
   :linenos:
   :caption: Output when running the high-level implementation.

.. admonition:: Tasks

   #. Explain in 1-2 sentences what each of the eight functions does.
   #. Implement the functions in assembly language. Use the file names ``low_level.h`` and ``low_level.cpp`` and matching names for the functions, i.e., ``low_lvl_0``, ``low_lvl_1``, ..., ``low_lvl_7``.
   #. Verify your low-level versions by extending the driver.

.. _ch:assembly_building_blocks_unrolled:

Small GEMM: ASIMD
-----------------

In this part of the lab we'll write our first high-performance matrix kernel relying on floating-point math.
Our targeted kernel ``gemm_asm_asimd_16_6_1`` has the following signature:

.. code-block:: c++

   void gemm_asm_asimd_16_6_1( float const * i_a,
                               float const * i_b,
                               float       * io_c );

and performs the operation

.. math::

   C \mathrel{+}= A B

on 32-bit floating-point data with

.. math::

   M = 16, \; N = 6, \; K = 1, \quad ldA = 16, \; ldB = 1, \; ldC = 16.

In our implementation we'll completely unroll the kernel, i.e., write every instruction explicitly without adding any control structures such as loops.
Once done, we'll add a :math:`K` loop to increase the size of the kernel's contraction dimension.

As with the general purpose registers, we have to preserve some SIMD registers to adhere to the procedure call standard `AAPCS64`_.
The template in :numref:`lst:gemm_asm_asimd_16_6_1` extends the one we used before when working solely on general purpose registers.
Now, we also write the lowest 64 bits of registers V8-V15 to the stack and restore them at the end of the function.
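Before turning to the full template below, the following minimal sketch illustrates what such a prologue and epilogue can look like. It is an illustrative assumption, not the course template: the function name is made up, and a real kernel would only save the registers it actually modifies.

.. code-block:: asm

       // Hypothetical sketch (not the course template): save the callee-saved
       // general purpose registers X19-X30 and the lowest 64 bits (D8-D15) of
       // the SIMD registers V8-V15, and restore them before returning.
           .text
           .type sketch_save_restore, %function
           .global sketch_save_restore
       sketch_save_restore:
           // save frame pointer and link register
           stp x29, x30, [sp, #-16]!
           mov x29, sp

           // save callee-saved general purpose registers
           stp x19, x20, [sp, #-16]!
           stp x21, x22, [sp, #-16]!
           stp x23, x24, [sp, #-16]!
           stp x25, x26, [sp, #-16]!
           stp x27, x28, [sp, #-16]!

           // save lowest 64 bits of V8-V15
           stp  d8,  d9, [sp, #-16]!
           stp d10, d11, [sp, #-16]!
           stp d12, d13, [sp, #-16]!
           stp d14, d15, [sp, #-16]!

           // ... kernel body would go here ...

           // restore lowest 64 bits of V8-V15
           ldp d14, d15, [sp], #16
           ldp d12, d13, [sp], #16
           ldp d10, d11, [sp], #16
           ldp  d8,  d9, [sp], #16

           // restore callee-saved general purpose registers
           ldp x27, x28, [sp], #16
           ldp x25, x26, [sp], #16
           ldp x23, x24, [sp], #16
           ldp x21, x22, [sp], #16
           ldp x19, x20, [sp], #16

           // restore frame pointer and link register
           ldp x29, x30, [sp], #16
           ret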
.. literalinclude:: data_building_blocks/template.s
   :name: lst:gemm_asm_asimd_16_6_1
   :linenos:
   :language: asm
   :caption: Template for the :math:`(16\times6) \mathrel{+}= (16\times1)(1\times6)` ASIMD matrix kernel. The template temporarily saves the general purpose registers X19-X30 and the lowest 64 bits of the SIMD registers V8-V15 on the stack.

Since we are hunting for performance, we'll do a little competition.
The prize? The three best-performing teams will receive an honorable mention on the :ref:`ch:performace_board` of this homepage 😉.

.. admonition:: Rules

   * Respect the procedure call standard. But: if you don't modify a register, you don't have to save it to the stack.
   * Verify your kernels.
   * Run your kernels for at least 1 second in your performance measurements through repeated executions.

For now, the current best-performing teams are Alex's ASM and LIBXSMM:

.. _tab:blocks_gemm_16_6_1:

.. table:: Sustained performance on the Graviton3 processor for the single precision matrix kernel C += AB with M=16, N=6, K=1, ldA=16, ldB=1, ldC=16.

   +---------------------------+----------+-------------+--------+-------+
   | Team                      | Time (s) | #executions | GFLOPS | %peak |
   +===========================+==========+=============+========+=======+
   | Alex's ASM                | 1.287    | 100000000   | 14.92  | 23.3  |
   +---------------------------+----------+-------------+--------+-------+
   | LIBXSMM, 59410c81 (ASIMD) | 1.811    | 100000000   | 10.60  | 16.6  |
   +---------------------------+----------+-------------+--------+-------+

.. _tab:blocks_gemm_16_6_48:

.. table:: Sustained performance on the Graviton3 processor for the single precision matrix kernel C += AB with M=16, N=6, K=48, ldA=16, ldB=48, ldC=16.

   +---------------------------+----------+-------------+--------+-------+
   | Team                      | Time (s) | #executions | GFLOPS | %peak |
   +===========================+==========+=============+========+=======+
   | Alex's ASM                | 18.08    | 100000000   | 50.97  | 79.6  |
   +---------------------------+----------+-------------+--------+-------+
   | LIBXSMM, 59410c81 (ASIMD) | 18.74    | 100000000   | 49.17  | 76.8  |
   +---------------------------+----------+-------------+--------+-------+

.. admonition:: Tasks

   #. Implement and verify the unrolled matrix kernel C += AB for M=16, N=6, K=1, ldA=16, ldB=1, ldC=16.
   #. Tune your kernel to squeeze more performance out of a core. You may change everything, e.g., the type or the order of the used instructions, but you have to follow the rules above. Report and document your optimizations.
   #. Add a loop over K to realize C += AB for M=16, N=6, K=48, ldA=16, ldB=48, ldC=16. A hedged sketch of one possible loop structure is given at the end of this section.
   #. Come up with a creative team name and submit it together with your entries for "Time (s)", "#executions", "GFLOPS" and "%peak" in :numref:`tab:blocks_gemm_16_6_1` and :numref:`tab:blocks_gemm_16_6_48`. Assume a theoretical single-core peak of 64 GFLOPS for the used ``c7g.xlarge`` instance.

.. _AAPCS64: https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#simd-and-floating-point-registers
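As a starting point for the :math:`K`-loop task above, here is a minimal sketch of one possible loop structure for M=16, N=6, K=48, ldA=16, ldB=48. The register allocation, the label name and the elided parts (the accumulators for the 16x6 block of C in V8-V31 are assumed to be loaded before the loop and stored after it) are assumptions, not the reference solution:

.. code-block:: asm

       // Hypothetical sketch of a K loop (assumptions: x0 points to A,
       // x1 to B; the 16x6 block of C already sits in v8-v31 as
       // accumulators; using v8-v15 requires the stack save from the template).
       mov x3, #48                    // K loop counter
   loop_k:
       // load the current 16x1 column of A (16 FP32 values, ldA=16 -> contiguous)
       ldp q0, q1, [x0], #32
       ldp q2, q3, [x0], #32

       // first column of C: A's column times the scalar b(k,0)
       ldr s4, [x1]                   // columns of B are ldB*4 = 192 bytes apart
       fmla  v8.4s, v0.4s, v4.s[0]
       fmla  v9.4s, v1.4s, v4.s[0]
       fmla v10.4s, v2.4s, v4.s[0]
       fmla v11.4s, v3.4s, v4.s[0]

       // ... analogous fmla blocks for b(k,1) at [x1, #192]
       //     through b(k,5) at [x1, #960] ...

       add x1, x1, #4                 // advance B by one row, i.e., to the next k
       subs x3, x3, #1                // decrement loop counter
       b.ne loop_k                    // branch back while counter != 0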