8. Code Generation#
At this point, we know how to write efficient AArch64 kernels that instantiate primitives. A limiting factor of our current approach is that each of our assembly kernels is still limited to instantiating a set of hardcoded parameters. Consider the BRGEMM primitive as an example. In this case, the sizes of the matrix dimensions and the datatype are hardcoded. This means that until now, we had to write a new kernel whenever we needed a kernel with new hardcoded parameters.
In this chapter, we discuss just-in-time (JIT) code generation as an approach to solving this problem. JIT code generation allows us to continue hardcoding performance-critical parameters into our kernels, but leaves the actual instantiation as a runtime decision. Unlike prewritten kernels, the JIT approach allows us to delay kernel optimization until we know the full problem specifications in our tensor compiler.
1. |
Write instruction words to a temporary buffer. |
2. |
Allocate writable pages with |
3. |
Copy data from the buffer to those pages. |
4. |
Flush the instruction cache. |
5. |
Set the pages to read+execute with |
6. |
Cast the page pointer to a function pointer and call it. |
Table 8.1 provides a high-level overview of the six JIT steps covered in this chapter. More specifically, Section 8.1 discusses steps 2-5 (setting up the executable memory), while Section 8.2 shows examples of steps 1 and 6 (building the buffer & calling kernels). Finally, in Section 8.3 discusses a structured approach to generating instruction words as part of our JIT toolchain.
8.1. Executable Memory#
At its core, the idea of just-in-time code generation is remarkably simple. Instead of loading instruction words from a file into memory and then executing the code from memory, we generate instruction words, write them directly into memory, and then execute them. We discussed the structure of instruction words in Section 3.2. Therefore, generating the instruction words that represent our kernel and writing them to memory is an application of that knowledge. The only missing piece is a way to get an executable region of memory that contains our instruction words.
Most operating systems enforce the write-xor-execute (W^X) security policy.
W^X means that a page of memory is either writable or executable, but not both.
In short, we will use mmap to request page-aligned, writable, and zero-initialized memory from our Linux operating system.
In practice, mmap
rounds the requested number of bytes up to full pages.
We can then write our kernel to the memory returned by mmap
.
Finally, we use mprotect to make the pages read-only and executable.
21void * mini_jit::Kernel::alloc_mmap( std::size_t num_bytes ) const {
22 void* l_mem = mmap( 0,
23 num_bytes,
24 PROT_READ | PROT_WRITE,
25 MAP_PRIVATE | MAP_ANONYMOUS,
26 -1,
27 0 );
28
29 if( l_mem == MAP_FAILED ) {
30 throw std::runtime_error( "Failed to allocate memory: "
31 + std::string( std::strerror(errno) ) );
32 }
33
34 return l_mem;
35}
Listing 8.1.1 shows the alloc_mmap
function, which wraps mmap
and returns a pointer to a writable memory region of the desired size.
The first mmap
parameter in line 22 is set to 0
, resulting in page-aligned memory, while the MAP_ANONYMOUS
bit in line 25 results in zero-initialized memory.
It is used in the minimal example code mini_jit
to allocate memory for the instruction words of the kernels.
47void mini_jit::Kernel::set_exec( std::size_t num_bytes,
48 void * mem ) const {
49 int l_res = mprotect( mem,
50 num_bytes,
51 PROT_READ | PROT_EXEC );
52
53 if( l_res == -1 ) {
54 throw std::runtime_error( "Failed to set memory executable: "
55 + std::string( std::strerror(errno) ) );
56 }
57}
Once the instruction words are copied to the memory pages allocated by mmap
, the mini_jit
code uses the set_exec
function shown in Listing 8.1.2 to make them executable.
set_exec
uses mprotect
to make the pages read-only and executable, thus following the W^X security model.
59void mini_jit::Kernel::set_kernel() {
60 release_memory();
61
62 if( m_buffer.empty() ) {
63 return;
64 }
65
66 // alloc kernel memory
67 m_size_alloc = m_buffer.size() * 4;
68 try {
69 m_kernel = (void *) alloc_mmap( m_size_alloc );
70 }
71 catch( std::runtime_error & e ) {
72 throw std::runtime_error( "Failed to allocate memory for kernel: "
73 + std::string(e.what()) );
74 }
75
76 // copy instruction words from buffer to kernel memory
77 for( std::size_t l_in = 0; l_in < m_buffer.size(); l_in++ ) {
78 reinterpret_cast< uint32_t * >(m_kernel)[l_in] = m_buffer[l_in];
79 }
80
81 // clear cache
82 char * l_kernel_ptr = reinterpret_cast< char * >(m_kernel);
83 __builtin___clear_cache( l_kernel_ptr,
84 l_kernel_ptr + m_buffer.size() * 4 );
85
86 // set executable
87 set_exec( m_size_alloc,
88 m_kernel );
89}
The code in Listing 8.1.3 shows how we can use the two functions together.
The mini_jit::Kernel
class has an internal buffer, m_buffer
, to which we can add the instruction words one at a time.
Once all the instruction words are added, we call the set_kernel
function, which allocates memory for the kernel by calling alloc_mmap
in line 69, copies the instruction words into the allocated memory in lines 77-79, and makes the memory read-only and executable by calling the set_exec
function in lines 87-88.
The built-in function __builtin___clear_cache is called to clear the instruction cache and ensure that the generated code is visible to the core that generated it.
8.2. Kernels#
We will now discuss some example kernels that illustrate the use of the mini_jit::Kernel
class.
These kernels are very simple by design, relying mostly on hardcoded instructions.
In real just-in-time code generation software, additional wrappers are used to avoid code bloat and improve readability.
Section 8.3 discusses a better way to generate instruction words.
All examples can also be found directly in the kernel_examples.cpp
file of mini_jit
.
7void example_0() {
8 std::cout << "example_0" << std::endl;
9
10 mini_jit::Kernel l_kernel;
11 l_kernel.add_instr( 0xd28000a0 ); // mov x0, #5
12 l_kernel.add_instr( 0xd65f03c0 ); // ret
13 l_kernel.write( "example_0.bin" );
14 l_kernel.set_kernel();
15
16 int64_t (* l_func)() = nullptr;
17 l_func = (int64_t (*)()) l_kernel.get_kernel();
18 std::cout << " result: " << l_func() << std::endl;
19}
Listing 8.2.1 shows our first example kernel.
We see that the two instruction words 0xd28000a0
and 0xd65f03c0
are passed to l_kernel
by calling the add_instr
instruction in lines 11 and 12.
Internally, the class adds the passed instruction words to the buffer m_buffer
.
After the two instruction words are added, the internal buffer has a size of 8 bytes.
Next, line 13 calls the write
function, which writes the 8 instruction bytes to the binary file example_0.bin
.
The write
function itself is fairly standard, and details on its implementation can be found in the Kernel.cpp
file.
We can use this file output along with standard command line tools for debugging.
One option is to use objdump
to disassemble the code:
objdump -m aarch64 -b binary -D example_0.bin
For the code in Listing 8.2.1, we get the following output:
example_0.bin: file format binary
Disassembly of section .data:
0000000000000000 <.data>:
0: d28000a0 mov x0, #0x5 // #5
4: d65f03c0 ret
We can see that the two instructions can be written in assembly code as mov x0, #5
and ret
, so the kernel simply writes the value 5
to the X0
register before branching back to the calling scope with ret
.
Since the file output is for debugging, we simply disable it once our just-in-time code generation works as expected.
Continuing with the code in Listing 8.2.1, we next call the set_kernel
function in line 14.
As discussed earlier, set_kernel
copies the instruction words into page-aligned memory and makes it executable.
In line 17, we use the get_kernel
function to get a void const
pointer to this memory region.
The C-style cast (int64_t (*)())
converts this pointer to a function pointer.
Note
Converting a void *
object pointer to a function pointer is implementation-defined in ISO C/C++. On Linux with GCC or Clang it works reliably, but in other cases it may not.
The function takes no arguments and returns an int64_t
value.
This function pointer is stored in the variable l_func
.
Finally, in line 18, the function l_func
is called, which means that our kernel is executed.
mov
instruction is set dynamically.#26void example_1( int16_t imm16 ) {
27 std::cout << "example_1" << std::endl;
28
29 mini_jit::Kernel l_kernel;
30 uint32_t l_ins = 0xd2800000;
31 l_ins |= imm16 << 5;
32 l_kernel.add_instr( l_ins ); // mov x0, #imm16
33 l_kernel.add_instr( 0xd65f03c0 ); // ret
34 l_kernel.write( "example_1.bin" );
35 l_kernel.set_kernel();
36
37 int64_t (* l_func)() = nullptr;
38 l_func = (int64_t (*)()) l_kernel.get_kernel();
39 std::cout << " result: " << l_func() << std::endl;
40}
The example in Listing 8.2.2 is very similar to the previous one.
However, this time the 16-bit immediate of the MOV (wide immediate) instruction is set to imm16
at runtime.
The immediate is encoded in the instruction bits with IDs 5-20.
We perform three steps to generate the desired instruction word:
Initialize the instruction word with
0xd2800000
and write the result tol_ins
(line 30). The 16 immediate bits are set to zero, that is we use the instruction word formov x0, #0
as the initial value.Shift
imm16
by 5 to the left (line 31). So if we had the 16-bit value0b0000'0000'0000'0011
before, we get the 32-bit value0b0000'0000'0000'0000'0000'0000'0110'0000
after the shift.Do a bitwise OR of
l_ins
and the shiftedimm16
(line 31). This effectively sets the 16 immediate bits to the desired value.
The rest follows the steps of the first example.
Now, in line 39, the immediate value is returned by the function call l_func()
.
This example already outlines the general approach for encoding instructions.
First, we initialize all free bits to zero.
Second, we set the free bits based on runtime parameters.
More detailed examples are given in Section 8.3.
int64_t
.#45void example_2() {
46 std::cout << "example_2" << std::endl;
47
48 mini_jit::Kernel l_kernel;
49 l_kernel.add_instr( 0x91001400 ); // add x0, x0, #5
50 l_kernel.add_instr( 0xd65f03c0 ); // ret
51 l_kernel.write( "example_2.bin" );
52 l_kernel.set_kernel();
53
54 int64_t (* l_func)( int64_t ) = nullptr;
55 l_func = (int64_t (*)( int64_t )) l_kernel.get_kernel();
56 std::cout << " result: " << l_func( 7 ) << std::endl;
57}
Our next example is shown in Listing 8.2.3.
Again, we generate a kernel consisting of two instruction words.
This time, however, the kernel takes a 64-bit integer as its only input parameter.
As a result, the l_func
function in lines 54-55 has a different signature.
In line 56, the value 7
is passed to the kernel through the X0
parameter register.
The kernel adds the immediate #5
to X0
and branches back to the surrounding scope.
Therefore, the return value of l_func( 7 )
is 12
.
62void example_3() {
63 std::cout << "example_3" << std::endl;
64
65 mini_jit::Kernel l_kernel;
66 l_kernel.add_instr( 0xd2804000 ); // mov x0, #512
67 l_kernel.add_instr( 0xd2800001 ); // mov x1, #0
68 l_kernel.add_instr( 0xd1000400 ); // sub x0, x0, #1
69 l_kernel.add_instr( 0x91000821 ); // add x1, x1, #2
70 l_kernel.add_instr( 0xb5ffffc0 ); // cbnz x0, #-8
71 l_kernel.add_instr( 0xaa0103e0 ); // mov x0, x1
72 l_kernel.add_instr( 0xd65f03c0 ); // ret
73 l_kernel.write( "example_3.bin" );
74 l_kernel.set_kernel();
75
76 int64_t (* l_func)() = nullptr;
77 l_func = (int64_t (*)()) l_kernel.get_kernel();
78 std::cout << " result: " << l_func() << std::endl;
79}
Our final example in Listing 8.2.4 shows just-in-time code generation of a loop.
The loop consists of a loop counter in register X0
and a branch instruction.
The instruction word 0xd2804000
in line 66 initializes the loop counter to 512
, while 0xd1000400
in line 68 decrements the loop counter.
The branch instruction is given by the instruction word 0xb5ffffc0
and jumps #-8
bytes relative to the program counter, that is two instructions back.
8.3. Instruction Word Generators#
This section discusses instruction word generators that set the free bits of an instruction based on user input. We have already seen a simple example of this in Section 8.2, where we dynamically set the 16-bit immediate of MOV (wide immediate) using a shift operation together with a bitwise OR.
Our goal in this section is to systematically raise the level of abstraction so that users of our just-in-time code generator can interface with instruction word generation instead of having to provide the instruction words manually. In summary, using instruction word generators will feel similar to writing assembly code.
Bit IDs |
31 |
30-24 |
23-5 |
4-0 |
---|---|---|---|---|
Field |
sf |
imm19 |
Rt |
|
Pattern |
|
|
|
|
|
|
|
|
|
We will discuss the implementation of generators for two instructions.
As before, all functions discussed are part of the mini_jit
example code.
The first instruction is CBNZ, and its general structure is shown in Table 8.3.1.
Bits 30-24 are fixed bits, while all other bits are free bits.
In particular, the sf
field determines whether we are using the 32-bit or 64-bit variant of the instruction, that is whether the instruction compares a W
register or an X
register to zero.
The 19 immediate bits are stored in the field imm19
and determine how far the instruction branches relative to the program counter.
imm19
is interpreted as a signed value in two’s complement form, and the jump in bytes is obtained by multiplying imm19
by 4.
In other words, imm19
stores how many instructions backward (negative value) or forward (positive value) we branch.
The remaining free bits are given by the field Rt
, which encodes the ID of the register, that is it is compared to zero.
23uint32_t mini_jit::InstGen::base_br_cbnz( gpr_t reg,
24 int32_t imm19 ) {
25 uint32_t l_ins = 0x35000000;
26
27 // set register id
28 uint32_t l_reg_id = reg & 0x1f;
29 l_ins |= l_reg_id;
30
31 // set size of the register
32 uint32_t l_reg_size = reg & 0x20;
33 l_ins |= l_reg_size << (32-6);
34
35 // set immediate
36 uint32_t l_imm = imm19 & 0x7ffff;
37 l_ins |= l_imm << 5;
38
39 return l_ins;
40}
Listing 8.3.1 shows the base_br_cbnz
function, which generates CBNZ instruction words.
The function takes two parameters, reg
and imm19
.
The first parameter reg
contains the ID of the general-purpose register used and the view of that register.
Both are encoded in the enumerator gpr_t
.
Specifically, we assign the enumerator values 0-30 to the registers W0-W30 and the values 32-62 to the registers X0-X30.
The second parameter imm19
contains the 19-bit immediate of the instruction.
Going through the code line by line, we see that the local variable l_ins
is initialized with the instruction word 0x35000000
, that is all free bits are set to zero, as shown in Table 8.3.1.
Next, lines 28 and 29 set the instruction bits of the Rt
field.
Specifically, we first mask the input parameter reg
with 0x1f
, which is 0b11111
in binary.
The mask sets all but the last five bits to zero.
Thus, regardless of the register type, X
or W
, we only set the Rt
bits by the bitwise OR in line 29.
Lines 32 and 33 set the size field sf
, which is stored in bit 31 of a CBNZ instruction word.
Our instructions must have a value of 0
in sf
when working with a W
register and a value of 1
when working with an X
register.
So, in line 32 we apply the mask 0x20
to isolate the size bit in our gpr_t
enumerator type.
Then, in line 33, we shift the size bit, now at local ID 5 of l_reg_size
, by 26 to the left so that the following bitwise OR modifies instruction bit 31 for an X
register.
The last two bit manipulations in lines 36 and 37 set the 19-bit immediate of the instruction.
As before, we first apply the mask 0x7ffff
to eliminate all but the lower 19 bits in imm19
.
The masked value in l_imm
is then written to bits 23-5 of l_ins
by left shifting and bitwise ORing.
Bit IDs |
31 |
30 |
29-23 |
22 |
21 |
20-16 |
15-10 |
9-5 |
4-0 |
---|---|---|---|---|---|---|---|---|---|
Field |
Q |
sz |
Rm |
Rn |
Rd |
||||
Pattern |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Our second example is the FP32 and FP64 versions of FMLA (vector).
The overall structure of the instruction is shown in Table 8.3.2.
We see that we need to set the five fields Q
, sz
, Rm
, Rn
, and Rd
.
Rn
contains the ID of the first SIMD&FP source register, Rm
contains the ID of the second SIMD&FP source register, and Rd
contains the ID of the SIMD&FP destination register.
The two fields Q
and sz
together encode the arrangement specifier of the instruction.
Specifically, both are zero for 2S
, only Q
is 1
for 4S
, and both are 1
for 2D
.
42uint32_t mini_jit::InstGen::neon_dp_fmla_vector( simd_fp_t reg_dest,
43 simd_fp_t reg_src1,
44 simd_fp_t reg_src2,
45 arr_spec_t arr_spec ) {
46 uint32_t l_ins = 0x0e20cc00;
47
48 // set destination register id
49 uint32_t l_reg_id = reg_dest & 0x1f;
50 l_ins |= l_reg_id;
51
52 // set first source register id
53 l_reg_id = reg_src1 & 0x1f;
54 l_ins |= l_reg_id << 5;
55
56 // set second source register id
57 l_reg_id = reg_src2 & 0x1f;
58 l_ins |= l_reg_id << 16;
59
60 // set arrangement specifier
61 uint32_t l_arr_spec = arr_spec & 0x40400000;
62 l_ins |= l_arr_spec;
63
64 return l_ins;
65}
Listing 8.3.2 shows the neon_dp_fmla_vector
function, which can generate FP32 and FP64 FMLA (vector) instruction words.
The register IDs are passed by the parameters reg_dest
, reg_src1
, and reg_src2
.
The simd_fp_t
enumeration simply uses the values 0-31 for the SIMD&FP registers V0-V31.
The arrangement specifier is passed through the arr_spec
parameter, where the arr_spec_t
enumerator is a bit special.
It enumerates the value 0x0
for 2S
, the value 0x40000000
for 4S
, and the value 0x40400000
for 2D
.
This means that for 2S
all bits are set to zero, for 4S
only the bit with ID 30 is set to 1
, and for 2D
only the two bits with ID 30 and 22 are set to 1
.
The following code sets the Rd
, Rn
and Rm
fields in lines 49-58.
Because of the arr_spec_t
values used, the generator can simply do a bitwise OR to set the Q
and sz
fields in line 62 after masking.
In summary, we have seen that the implementation of instruction word generators is straightforward. The two CBNZ and FMLA (vector) generators discussed use only left shifts, bitwise ANDs, and bitwise ORs. This reliance on simple bit operations results in fast just-in-time code generation. This can be an advantage in situations where code generation time is critical.