10. JITed Kernels in Praxis
Up to this point we implemented a collection of functions and worked on boosting their performance. We would either write them in high-level C code or use assembly language. Then, to run our functions, we used a compiler or an assembler and translated our code to machine code which we could execute.
Now, we’ll go down a different route and use Just-In-Time (JIT) code generation: Instead of generating an executable where our functions are hardwired, we’ll generate the functions at runtime. This makes our life more difficult. Without the help of the assembler or compiler, we have to generate the machine code ourselves. On the plus side: Instead of being forced to write functions upfront which fit all of our needs, we can tailor them to specific use cases at runtime. Through JITting, we aim to automate the journey from compiler-generated code for a generic matrix kernel to single-purpose but high-performing assembly kernels which support only a few configurations.
10.1. Getting Started
In this section we’ll use the small JITter mini_jit, which provides a few classes to ease our first steps:

- mini_jit::backend::Kernel allows us to collect the machine code of our instructions in a code buffer, copy it to a region in memory, and set this region executable. Additionally, the class remembers locations in the JITted code as required by loops which jump relative to the program counter. A standalone sketch of this mechanism is given after this list.
- mini_jit::instructions::Base provides a few functions which generate the AArch64 base instructions used in the shipped generators Simple and Loop (see below). You’ll have to write your own wrapping functions for missing instructions.
- mini_jit::instructions::Asimd provides a few functions which generate AArch64 ASIMD&FP instructions. These wrappers are not used in any of the provided generators. They are intended to kickstart the development of ASIMD kernels in the tasks below.
- mini_jit::instructions::Sve provides a few functions which generate some SVE instructions. These wrappers are not used in any of the provided generators. They are intended to kickstart the development of SVE kernels in the tasks below.
- mini_jit::generators::Simple is a simple generator which generates a kernel consisting of the two instructions mov w0, #3 and ret.
- mini_jit::generators::Loop illustrates how one might generate a loop at runtime by using branches relative to the program counter. The generator takes the number of iterations as a uint32_t input and then generates a loop which performs the given number of iterations. The generated kernel counts the number of performed iterations in w0 and then returns this count as a result. Note that the number of iterations is hardwired in the generated kernel, thus the actual kernel call takes no arguments.
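To make the job of mini_jit::backend::Kernel more concrete, the following standalone sketch (independent of mini_jit and not its actual API; it assumes a Linux AArch64 system) hand-encodes the same mov w0, #3 / ret kernel that the Simple generator produces, copies it into an executable memory region, and calls it:

#include <cstdint>
#include <cstring>
#include <iostream>
#include <sys/mman.h>

int main() {
  // machine code of the two-instruction kernel: "mov w0, #3" and "ret"
  uint32_t l_code[2] = { 0x52800060,   // mov w0, #3 (MOVZ)
                         0xd65f03c0 }; // ret

  // allocate a writable page and copy the code buffer into it (error handling omitted)
  std::size_t l_size = 4096;
  void * l_mem = mmap( nullptr, l_size,
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0 );
  std::memcpy( l_mem, l_code, sizeof(l_code) );

  // set the region executable and invalidate the instruction cache
  mprotect( l_mem, l_size, PROT_READ | PROT_EXEC );
  __builtin___clear_cache( (char*) l_mem, (char*) l_mem + l_size );

  // call the JITted kernel through a function pointer; prints 3
  uint32_t (*l_kernel)() = reinterpret_cast< uint32_t (*)() >( l_mem );
  std::cout << l_kernel() << std::endl;

  munmap( l_mem, l_size );
  return 0;
}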
Tasks
- Compile mini_jit and test the two provided generators “Simple” and “Loop”. Compile and run the unit tests!
- Disassemble the dumped code buffers. This might be done through the following objdump options:
  objdump -m aarch64 -b binary -D my_dump.bin
- Implement a new generator mini_jit::generators::MyExample which uses at least one new base instruction which is not yet wrapped in mini_jit::instructions::Base. You are free to implement any functionality through your generator, but provide a short description of your new generator. A sketch of such an instruction wrapper is given after this list.
- Build and run mini_jit on your host system. Note that you can’t execute AArch64 instructions on x86 systems unless they are emulated. However, you can still generate the code and inspect it through objdump. If using Fedora you might install the AArch64 GCC tools through the package gcc-c++-aarch64-linux-gnu and then use aarch64-linux-gnu-objdump for the disassembly. For the LLVM toolchain you can disassemble the binary dump via:
  llvm-objcopy -I binary -O elf64-littleaarch64 --rename-section=.data=.text,code my_example.bin my_example.elf
  objdump -d my_example.elf
10.2. ASIMD: First Steps
Our JITter only ships with support for a few base and ASIMD&FP instructions of the AArch64 ISA. Additional instructions we require have to be added by us. In this section we’ll have another look at the triad example already used in the SVE chapter. The high-level representation of the triad was given as:
#include "triad_high.h"

void triad_high( uint64_t i_n_values,
                 float const * i_a,
                 float const * i_b,
                 float       * o_c ) {
  for( uint64_t l_va = 0; l_va < i_n_values; l_va++ ) {
    o_c[l_va] = i_a[l_va] + 2.0f * i_b[l_va];
  }
}
When writing SVE code, we could rely on predication and the concept of VLA programming. Now, we’ll use mini_jit to generate tailored ASIMD code.
Tasks
- Extend the class mini_jit::instructions::Asimd with member functions which generate machine code for FMADD and FMLA (vector). It is sufficient if you only support FP32 and FP64. A sketch of such a wrapper is given after this list.
- Implement and verify a new generator mini_jit::generators::Triad which takes the number of values as input to the generate function. This means, once the code is generated, it is tailored to a specific number of elements. Thus, generate has the following signature:
  void ( *generate( uint64_t i_n_values ) )( float const * i_a, float const * i_b, float * o_c );
  If we were, for example, to call generate( 7 ), the generated function would always operate on seven values. It is sufficient to support values for i_n_values which are below \(2^{16}=65,536\).
- How would you extend your kernel generation to support arrays with arbitrary sizes? If your kernel generation already supports these: How did you do it?
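To illustrate the kind of wrapper the first task asks for, the following is a hand-rolled encoding of FMLA (vector). The function name and interface are invented for this sketch and are not the existing mini_jit::instructions::Asimd API:

#include <cstdint>

// hypothetical wrapper: returns the machine code of
//   fmla vd.4s, vn.4s, vm.4s   (i_double == false)
//   fmla vd.2d, vn.2d, vm.2d   (i_double == true)
// i_reg_dst, i_reg_src_1 and i_reg_src_2 are vector register numbers (0-31)
uint32_t fmlaVector( uint32_t i_reg_dst,
                     uint32_t i_reg_src_1,
                     uint32_t i_reg_src_2,
                     bool     i_double ) {
  uint32_t l_ins = 0x4e20cc00;              // fmla (vector), Q=1, single precision
  if( i_double ) l_ins |= 0x1 << 22;        // sz=1 switches to double precision
  l_ins |= (i_reg_src_2 & 0x1f) << 16;      // second source register Vm
  l_ins |= (i_reg_src_1 & 0x1f) <<  5;      // first source register Vn
  l_ins |=  i_reg_dst   & 0x1f;             // destination register Vd
  return l_ins;
}

For example, fmlaVector( 0, 1, 2, false ) returns 0x4e22cc20, which disassembles to fmla v0.4s, v1.4s, v2.4s.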
10.3. Small GEMMs
We’ll now use our new JITting skills to generate small matrix kernels. This is, in a simplified way, also what the LIBXSMM library does. Before, the goal was to obtain maximum performance for a single matrix kernel with fixed sizes and leading dimensions. Now, the goal is to write a generic generator which can generate fast code for different configurations: we have to combine our knowledge of writing fast assembly code with our JITting knowledge.
We call our new generator mini_jit::generators::SmallGemmSve.

The generator implements the operation \( C \mathrel{+}= A \cdot B \) for different matrix shapes, i.e., it multiplies the \(M \times K\) matrix A with the \(K \times N\) matrix B and adds the result to the \(M \times N\) matrix C. It should at least support the following values for M, N, K, ldA, ldB and ldC on 32-bit floating point data:
id   M    N    K    ldA   ldB   ldC
0    32   6    1    32    1     32
1    32   6    48   32    48    32
2    128  6    48   128   48    128
3    128  48   48   128   48    128
In this part we assume that the leading dimensions match the matrix dimensions, i.e., ldA = M, ldB = K and ldC = M. Thus, the signature of the generate function is:
void ( *mini_jit::generators::SmallGemmSve::generate( uint32_t i_m,
uint32_t i_n,
uint32_t i_k ) )( float const * i_a,
float const * i_b,
float * io_c )
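For verification of the generated kernels, one may compare against a straightforward reference loop. The sketch below assumes column-major storage, which is consistent with the leading dimensions in the table above; the function name and interface are ours, not part of mini_jit:

#include <cstdint>

// reference kernel used for verification: C += A * B with column-major
// matrices and leading dimensions i_ld_a, i_ld_b and i_ld_c
void gemmReference( uint32_t      i_m,
                    uint32_t      i_n,
                    uint32_t      i_k,
                    uint32_t      i_ld_a,
                    uint32_t      i_ld_b,
                    uint32_t      i_ld_c,
                    float const * i_a,
                    float const * i_b,
                    float       * io_c ) {
  for( uint32_t l_n = 0; l_n < i_n; l_n++ ) {
    for( uint32_t l_k = 0; l_k < i_k; l_k++ ) {
      for( uint32_t l_m = 0; l_m < i_m; l_m++ ) {
        io_c[ l_n*i_ld_c + l_m ] +=   i_a[ l_k*i_ld_a + l_m ]
                                    * i_b[ l_n*i_ld_b + l_k ];
      }
    }
  }
}

The output of the JITted kernel can then be compared element-wise against io_c of the reference, e.g., within a small relative tolerance.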
Tasks
- Design, implement, verify and optimize mini_jit::generators::SmallGemmSve.
- Submit the metrics “time (s)”, “#executions”, “GFLOPS” and “%peak” together with your team name. Your submission should include individual numbers for the four variants above and the arithmetic mean over all of them. The ranking will be based on the highest obtained mean value. A note on counting the floating point operations is given after this list.
- Have a look at LIBXSMM and identify the parts you just implemented standalone.
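When reporting GFLOPS and %peak, we assume the usual convention (not spelled out in the task itself) of counting \(2 \cdot M \cdot N \cdot K\) floating point operations per kernel execution. This gives \( \text{GFLOPS} = \frac{2 \cdot M \cdot N \cdot K \cdot \#\text{executions}}{\text{time (s)} \cdot 10^{9}} \) and \( \%\text{peak} = \text{GFLOPS} / \text{GFLOPS}_{\text{peak}} \cdot 100 \), where \(\text{GFLOPS}_{\text{peak}}\) is the theoretical peak performance of the used core(s).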
Hint
The generation of SVE microkernels is located in generator_gemm_aarch64.c. The AArch64 ISA is wrapped in generator_aarch64_instructions.h and generator_aarch64_instructions.c.
Hint
You may use llvm-mc to get the machine code corresponding to a line of assembly code.
For example, run the following to assemble ldr z0, [x0]:
echo "ldr z0, [x0]" | llvm-mc -triple=aarch64 -mattr=+sve --show-encoding