.. _ch:jit:

JITed Kernels in Praxis
=======================

Up to this point we implemented a collection of functions and worked on boosting their performance.
We would either write them in high-level C code or use assembly language.
Then, to run our functions, we used a compiler or an assembler and translated our code to machine code which we could execute.
Now, we'll go down a different route and use Just-In-Time (JIT) code generation:
Instead of generating an executable where our functions are hardwired, we'll generate the functions at runtime.
This makes our life more difficult:
Without the help of an assembler or compiler, we have to generate the machine code ourselves.
On the plus side, instead of being forced to write functions upfront which fit all of our needs, we can tailor them to specific use cases at runtime.
Through JITting, we are heading for an automated way to get from compiler-generated code for a generic matrix kernel to single-purpose but high-performing assembly kernels which support only a few configurations.

Getting Started
---------------

In this section we'll use the small JITter :download:`mini_jit`, which provides a few classes to ease our first steps:

* ``mini_jit::backend::Kernel`` allows us to collect the machine code of our instructions in a code buffer, copy it to a region in memory, and set this region executable.
  Additionally, the class remembers locations in the JITted code as required by loops which jump relative to the program counter.

* ``mini_jit::instructions::Base`` provides a few functions which generate the AArch64 base instructions used in the shipped generators ``Simple`` and ``Loop`` (see below).
  You'll have to write your own wrapping functions for missing instructions.

* ``mini_jit::instructions::Asimd`` provides a few functions which generate AArch64 ASIMD&FP instructions.
  These wrappers are not used in any of the provided generators.
  They are intended to kickstart the development of ASIMD kernels in the tasks below.

* ``mini_jit::instructions::Sve`` provides a few functions which generate some SVE instructions.
  These wrappers are not used in any of the provided generators.
  They are intended to kickstart the development of SVE kernels in the tasks below.

* ``mini_jit::generators::Simple`` is a simple generator which generates a kernel consisting of the two instructions ``mov w0, #3`` and ``ret``.

* ``mini_jit::generators::Loop`` illustrates how one might generate a loop at runtime by using branches relative to the program counter.
  The generator takes the number of iterations as a ``uint32_t`` input and then generates a loop which performs the given number of iterations.
  The generated kernel counts the number of performed iterations in ``w0`` and then returns this count as its result.
  Note that the number of iterations is hardwired in the generated kernel, thus the actual kernel call takes no arguments.

.. admonition:: Tasks

   #. Compile mini_jit and test the two provided generators ``Simple`` and ``Loop``.
      Compile and run the unit tests!

   #. Disassemble the dumped code buffers.
      This might be done through the following ``objdump`` options:

      .. code-block:: bash

         objdump -m aarch64 -b binary -D my_dump.bin

   #. Implement a new generator ``mini_jit::generators::MyExample`` which uses at least one new base instruction which is not yet wrapped in ``mini_jit::instructions::Base``.
      You are free to implement any functionality through your generator but provide a short description of your new generator.

   #. Build and run mini_jit on your host system.
      Note that you can't execute AArch64 instructions on x86 systems unless they are emulated.
      However, you can still generate the code and inspect it through ``objdump``.
      If using Fedora, you might install the AArch64 GCC tools through the package ``gcc-c++-aarch64-linux-gnu`` and then use ``aarch64-linux-gnu-objdump`` for the disassembly.
      For the LLVM toolchain you can disassemble the binary dump via

      .. code-block:: bash

         llvm-objcopy -I binary -O elf64-littleaarch64 --rename-section=.data=.text,code my_example.bin my_example.elf
         objdump -d my_example.elf

ASIMD: First Steps
------------------

Our JITter only ships with support for a few base and ASIMD&FP instructions of the AArch64 ISA.
Additional instructions we might require have to be added by us.
In this section we'll have another look at the triad example already used in the SVE chapter.
The high-level representation of the triad was given as:

.. literalinclude:: data_sneak_peek/vla/triad_high.cpp
   :linenos:
   :language: cpp
   :caption: File ``triad_high.cpp`` which implements the triad function in C/C++.

When writing SVE code, we could use predicated instructions and the concept of VLA programming.
Now, we'll use mini_jit to generate tailored ASIMD code.

.. admonition:: Tasks

   #. Extend the class ``mini_jit::instructions::Asimd`` with member functions which generate machine code for `FMADD `_ and `FMLA (vector) `_.
      It is sufficient if you only support FP32 and FP64.

   #. Implement and verify a new generator ``mini_jit::generators::Triad`` which takes the number of values as input to the ``generate`` function.
      This means, once the code is generated, it is tailored to a specific number of elements.
      Thus, ``generate`` has the following signature:

      .. code-block:: cpp

         void ( *generate( uint64_t i_n_values ) )( float const * i_a,
                                                    float const * i_b,
                                                    float       * o_c );

      Now, if we called ``generate( 7 )``, for example, the generated function would always operate on seven values.
      It is sufficient to support values for ``i_n_values`` which are below :math:`2^{16} = 65{,}536`.

   #. How would you extend your kernel generation to support arrays with arbitrary sizes?
      If your kernel generation already supports these: How did you do it?

Small GEMMs
-----------

We'll now use our new JITting skills to generate small matrix kernels.
This is, in a simpler way, also what the LIBXSMM library does.
Before, the goal was to obtain maximum performance for a single matrix kernel with fixed sizes and leading dimensions.
Now, the goal is to write a generic generator which can generate fast code for different specifications:
We have to combine our knowledge of writing fast assembly code with our JITting knowledge.
We call our new generator ``mini_jit::generators::SmallGemmSve``.
The generator implements the operation:

.. math::

   C \mathrel{+}= A B

with different matrix shapes.
It should at least support the following values for M, N, K, ldA, ldB and ldC on 32-bit floating point data:

+------+------+------+------+------+------+------+
| id   | M    | N    | K    | ldA  | ldB  | ldC  |
+======+======+======+======+======+======+======+
| 0    | 32   | 6    | 1    | 32   | 1    | 32   |
+------+------+------+------+------+------+------+
| 1    | 32   | 6    | 48   | 32   | 48   | 32   |
+------+------+------+------+------+------+------+
| 2    | 128  | 6    | 48   | 128  | 48   | 128  |
+------+------+------+------+------+------+------+
| 3    | 128  | 48   | 48   | 128  | 48   | 128  |
+------+------+------+------+------+------+------+

In this part we assume that the leading dimensions match M, N and K.
Thus, the signature of the ``generate`` function is:

.. code-block:: cpp

   void ( *mini_jit::generators::SmallGemmSve::generate( uint32_t i_m,
                                                         uint32_t i_n,
                                                         uint32_t i_k ) )( float const * i_a,
                                                                           float const * i_b,
                                                                           float       * io_c )

.. admonition:: Tasks

   #. Design, implement, verify and optimize ``mini_jit::generators::SmallGemmSve``.

   #. Submit the metrics “time (s)”, “#executions”, “GFLOPS” and “%peak” together with your team name.
      Your submission should include individual numbers for the four variants above and the arithmetic mean over all of them.
      The ranking will be based on the highest obtained mean value.

   #. Have a look at `LIBXSMM `__ and identify the parts you just implemented standalone.

.. hint::

   The generation of SVE microkernels is located in `generator_gemm_aarch64.c `__.
   The AArch64 ISA is wrapped in `generator_aarch64_instructions.h `__ and `generator_aarch64_instructions.c `__.

.. hint::

   You may use llvm-mc to get the machine code corresponding to a line of assembly code.
   For example, run the following to assemble ``ldr z0, [x0]``:

   .. code-block:: bash

      echo "ldr z0, [x0]" | llvm-mc -triple=aarch64 -mattr=+sve --show-encoding
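As a sanity check, encodings printed by llvm-mc can also be reproduced by hand before wrapping an instruction in mini_jit.
The sketch below is a standalone illustration (not mini_jit code, and the helper names ``movz_w`` and ``ret`` are made up for this example): it assembles the two instructions of the ``Simple`` generator, ``mov w0, #3`` and ``ret``, by filling in the bit fields of the A64 instruction words.
The printed words should match the little-endian byte sequences reported by ``llvm-mc --show-encoding``.

.. code-block:: cpp

   #include <cstdint>
   #include <cstdio>

   // MOVZ Wd, #imm16 (32-bit variant, hw=0): base opcode 0x52800000,
   // imm16 occupies bits 20:5, the destination register bits 4:0.
   uint32_t movz_w( uint32_t i_reg, uint32_t i_imm16 ) {
     return 0x52800000u | ( ( i_imm16 & 0xffffu ) << 5 ) | ( i_reg & 0x1fu );
   }

   // RET (branch to the address in X30) has the fixed encoding 0xd65f03c0.
   uint32_t ret() {
     return 0xd65f03c0u;
   }

   int main() {
     // The two instructions of the Simple generator: mov w0, #3; ret
     uint32_t l_code[2] = { movz_w( 0, 3 ), ret() };
     for ( uint32_t l_insn : l_code ) {
       std::printf( "0x%08x\n", l_insn );
     }
     // On an AArch64 host, mini_jit::backend::Kernel would copy these words
     // into an executable memory region and call them as a function
     // returning 3; on other hosts we can only inspect the bytes.
     return 0;
   }

Generating the words in software like this, instead of copying opaque hex constants, is exactly what the wrapper functions in ``mini_jit::instructions::Base`` do for each supported instruction.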