10. JITed Kernels in Praxis
Up to this point we implemented a collection of functions and worked on boosting their performance. We would either write them in high-level C code or use assembly language. Then, to run our functions, we used a compiler or an assembler and translated our code to machine code which we could execute.
Now, we’ll go down a different route and use Just-In-Time (JIT) code generation: Instead of generating an executable where our functions are hardwired, we’ll generate the functions at runtime. This makes our life more difficult. Without the help of the assembler or compiler, we have to generate the machine code ourselves. On the plus side: Instead of being forced to write functions upfront which fit all of our needs, we can tailor them to specific use cases at runtime. Through JITting, we aim to automate the journey from compiler-generated code for a generic matrix kernel to single-purpose but high-performing assembly kernels which support only a few configurations.
10.1. Getting Started
In this section we’ll use the small JITter mini_jit, which provides a few classes to ease our first steps:

- mini_jit::backend::Kernel allows us to collect the machine code of our instructions in a code buffer, copy it to a region in memory, and set this region executable. Additionally, the class remembers locations in the JITted code as required by loops which jump relative to the program counter. A standalone sketch of this mechanism is given after this list.
- mini_jit::instructions::Base provides a few functions which generate the AArch64 base instructions used in the shipped generators Simple and Loop (see below). You’ll have to write your own wrapping functions for missing instructions.
- mini_jit::instructions::Asimd provides a few functions which generate AArch64 ASIMD&FP instructions. These wrappers are not used in any of the provided generators. They are intended to kickstart the development of ASIMD kernels in the tasks below.
- mini_jit::instructions::Sve provides a few functions which generate some SVE instructions. These wrappers are not used in any of the provided generators. They are intended to kickstart the development of SVE kernels in the tasks below.
- mini_jit::generators::Simple is a simple generator which generates a kernel consisting of the two instructions mov w0, #3 and ret.
- mini_jit::generators::Loop illustrates how one might generate a loop at runtime by using branches relative to the program counter. The generator takes the number of iterations as a uint32_t input and then generates a loop which performs the given number of iterations. The generated kernel counts the number of performed iterations in w0 and then returns this count as a result. Note that the number of iterations is hardwired in the generated kernel, thus the actual kernel call takes no arguments.
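To make the job of mini_jit::backend::Kernel more concrete, the following standalone sketch (independent of mini_jit and not its actual API; it assumes a Linux AArch64 system) hand-encodes the same mov w0, #3 / ret kernel that the Simple generator produces, copies it into an executable memory region, and calls it:

#include <cstdint>
#include <cstring>
#include <iostream>
#include <sys/mman.h>

int main() {
  // machine code of the two-instruction kernel: "mov w0, #3" and "ret"
  uint32_t l_code[2] = { 0x52800060,   // mov w0, #3 (MOVZ)
                         0xd65f03c0 }; // ret

  // allocate a writable page and copy the code buffer into it (error handling omitted)
  std::size_t l_size = 4096;
  void * l_mem = mmap( nullptr, l_size,
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0 );
  std::memcpy( l_mem, l_code, sizeof(l_code) );

  // set the region executable and invalidate the instruction cache
  mprotect( l_mem, l_size, PROT_READ | PROT_EXEC );
  __builtin___clear_cache( (char*) l_mem, (char*) l_mem + l_size );

  // call the JITted kernel through a function pointer; prints 3
  uint32_t (*l_kernel)() = reinterpret_cast< uint32_t (*)() >( l_mem );
  std::cout << l_kernel() << std::endl;

  munmap( l_mem, l_size );
  return 0;
}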
Tasks
- Compile mini_jit and test the two provided generators “Simple” and “Loop”. Compile and run the unit tests!
- Disassemble the dumped code buffers. This might be done through the following objdump options:
  objdump -m aarch64 -b binary -D my_dump.bin
- Implement a new generator mini_jit::generators::MyExample which uses at least one new base instruction which is not yet wrapped in mini_jit::instructions::Base. You are free to implement any functionality through your generator, but provide a short description of your new generator. A sketch of such an instruction wrapper is given after this list.
- Build and run mini_jit on your host system. Note that you can’t execute AArch64 instructions on x86 systems unless they are emulated. However, you can still generate the code and inspect it through objdump. If using Fedora you might install the AArch64 GCC tools through the package gcc-c++-aarch64-linux-gnu and then use aarch64-linux-gnu-objdump for the disassembly. For the LLVM toolchain you can disassemble the binary dump via:
  llvm-objcopy -I binary -O elf64-littleaarch64 --rename-section=.data=.text,code my_example.bin my_example.elf
  objdump -d my_example.elf
10.2. ASIMD: First Steps
Our JITter only ships with support for a few base and ASIMD&FP instructions of the AArch64 ISA. Additional instructions we require have to be added by us. In this section we’ll have another look at the triad example already used in the SVE chapter. The high-level representation of the triad was given as:
#include "triad_high.h"

void triad_high( uint64_t i_n_values,
                 float const * i_a,
                 float const * i_b,
                 float       * o_c ) {
  for( uint64_t l_va = 0; l_va < i_n_values; l_va++ ) {
    o_c[l_va] = i_a[l_va] + 2.0f * i_b[l_va];
  }
}
When writing SVE code, we could rely on predication and the concept of VLA programming. Now, we’ll use mini_jit to generate tailored ASIMD code.
Tasks
- Extend the class mini_jit::instructions::Asimd with member functions which generate machine code for FMADD and FMLA (vector). It is sufficient if you only support FP32 and FP64. A sketch of such a wrapper is given after this list.
- Implement and verify a new generator mini_jit::generators::Triad which takes the number of values as input to the generate function. This means, once the code is generated, it is tailored to a specific number of elements. Thus, generate has the following signature:
  void ( *generate( uint64_t i_n_values ) )( float const * i_a, float const * i_b, float * o_c );
  If we were, for example, to call generate( 7 ), the generated function would always operate on seven values. It is sufficient to support values for i_n_values which are below \(2^{16}=65,536\).
- How would you extend your kernel generation to support arrays with arbitrary sizes? If your kernel generation already supports these: How did you do it?
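To illustrate the kind of wrapper the first task asks for, the following is a hand-rolled encoding of FMLA (vector). The function name and interface are invented for this sketch and are not the existing mini_jit::instructions::Asimd API:

#include <cstdint>

// hypothetical wrapper: returns the machine code of
//   fmla vd.4s, vn.4s, vm.4s   (i_double == false)
//   fmla vd.2d, vn.2d, vm.2d   (i_double == true)
// i_reg_dst, i_reg_src_1 and i_reg_src_2 are vector register numbers (0-31)
uint32_t fmlaVector( uint32_t i_reg_dst,
                     uint32_t i_reg_src_1,
                     uint32_t i_reg_src_2,
                     bool     i_double ) {
  uint32_t l_ins = 0x4e20cc00;              // fmla (vector), Q=1, single precision
  if( i_double ) l_ins |= 0x1 << 22;        // sz=1 switches to double precision
  l_ins |= (i_reg_src_2 & 0x1f) << 16;      // second source register Vm
  l_ins |= (i_reg_src_1 & 0x1f) <<  5;      // first source register Vn
  l_ins |=  i_reg_dst   & 0x1f;             // destination register Vd
  return l_ins;
}

For example, fmlaVector( 0, 1, 2, false ) returns 0x4e22cc20, which disassembles to fmla v0.4s, v1.4s, v2.4s.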
10.3. Small GEMMs
We’ll now use our new JITting skills to generate small matrix kernels. This is, in a simplified way, also what the LIBXSMM library does. Before, the goal was to obtain maximum performance for a single matrix kernel with fixed sizes and leading dimensions. Now, the goal is to write a generic generator which can generate fast code for different configurations: we have to combine our knowledge of writing fast assembly code with our JITting knowledge.
We call our new generator mini_jit::generators::SmallGemmSve.

The generator implements the operation \( C \mathrel{+}= A \cdot B \) for different matrix shapes, i.e., it multiplies the \(M \times K\) matrix A with the \(K \times N\) matrix B and adds the result to the \(M \times N\) matrix C. It should at least support the following values for M, N, K, ldA, ldB and ldC on 32-bit floating point data:
id   M    N    K    ldA   ldB   ldC
0    32   6    1    32    1     32
1    32   6    48   32    48    32
2    128  6    48   128   48    128
3    128  48   48   128   48    128
In this part we assume that the leading dimensions match the matrix dimensions, i.e., ldA = M, ldB = K and ldC = M. Thus, the signature of the generate function is:
void ( *mini_jit::generators::SmallGemmSve::generate( uint32_t i_m,
uint32_t i_n,
uint32_t i_k ) )( float const * i_a,
float const * i_b,
float * io_c )
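For verification of the generated kernels, one may compare against a straightforward reference loop. The sketch below assumes column-major storage, which is consistent with the leading dimensions in the table above; the function name and interface are ours, not part of mini_jit:

#include <cstdint>

// reference kernel used for verification: C += A * B with column-major
// matrices and leading dimensions i_ld_a, i_ld_b and i_ld_c
void gemmReference( uint32_t      i_m,
                    uint32_t      i_n,
                    uint32_t      i_k,
                    uint32_t      i_ld_a,
                    uint32_t      i_ld_b,
                    uint32_t      i_ld_c,
                    float const * i_a,
                    float const * i_b,
                    float       * io_c ) {
  for( uint32_t l_n = 0; l_n < i_n; l_n++ ) {
    for( uint32_t l_k = 0; l_k < i_k; l_k++ ) {
      for( uint32_t l_m = 0; l_m < i_m; l_m++ ) {
        io_c[ l_n*i_ld_c + l_m ] +=   i_a[ l_k*i_ld_a + l_m ]
                                    * i_b[ l_n*i_ld_b + l_k ];
      }
    }
  }
}

The output of the JITted kernel can then be compared element-wise against io_c of the reference, e.g., within a small relative tolerance.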
Tasks
- Design, implement, verify and optimize mini_jit::generators::SmallGemmSve.
- Submit the metrics “time (s)”, “#executions”, “GFLOPS” and “%peak” together with your team name. Your submission should include individual numbers for the four variants above and the arithmetic mean over all of them. The ranking will be based on the highest obtained mean value. A note on counting the floating point operations is given after this list.
- Have a look at LIBXSMM and identify the parts you just implemented standalone.
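When reporting GFLOPS and %peak, we assume the usual convention (not spelled out in the task itself) of counting \(2 \cdot M \cdot N \cdot K\) floating point operations per kernel execution. This gives \( \text{GFLOPS} = \frac{2 \cdot M \cdot N \cdot K \cdot \#\text{executions}}{\text{time (s)} \cdot 10^{9}} \) and \( \%\text{peak} = \text{GFLOPS} / \text{GFLOPS}_{\text{peak}} \cdot 100 \), where \(\text{GFLOPS}_{\text{peak}}\) is the theoretical peak performance of the used core(s).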
Hint
The generation of SVE microkernels is located in generator_gemm_aarch64.c. The AArch64 ISA is wrapped in generator_aarch64_instructions.h and generator_aarch64_instructions.c.
Hint
You may use llvm-mc to get the machine code corresponding to a line of assembly code.
For example, run the following to assemble ldr z0, [x0]:
echo "ldr z0, [x0]" | llvm-mc -triple=aarch64 -mattr=+sve --show-encoding