8. Code Generation#

At this point, we know how to write efficient AArch64 kernels that instantiate primitives. A limiting factor of our current approach is that each of our assembly kernels is still limited to instantiating a set of hardcoded parameters. Consider the BRGEMM primitive as an example. In this case, the sizes of the matrix dimensions and the datatype are hardcoded. This means that until now, we had to write a new kernel whenever we needed a kernel with new hardcoded parameters.

In this chapter, we discuss just-in-time (JIT) code generation as an approach to solving this problem. JIT code generation allows us to continue hardcoding performance-critical parameters into our kernels, but leaves the actual instantiation as a runtime decision. Unlike prewritten kernels, the JIT approach allows us to delay kernel optimization until we know the full problem specifications in our tensor compiler.

Table 8.1 Overview of our approach to just-in-time code generation for AArch64 and the Linux operating system.#

1.

Write instruction words to a temporary buffer.

2.

Allocate writable pages with mmap.

3.

Copy data from the buffer to those pages.

4.

Flush the instruction cache.

5.

Set the pages to read+execute with mprotect.

6.

Cast the page pointer to a function pointer and call it.

Table 8.1 provides a high-level overview of the six JIT steps covered in this chapter. More specifically, Section 8.1 discusses steps 2-5 (setting up the executable memory), while Section 8.2 shows examples of steps 1 and 6 (building the buffer & calling kernels). Finally, in Section 8.3 discusses a structured approach to generating instruction words as part of our JIT toolchain.

8.1. Executable Memory#

At its core, the idea of just-in-time code generation is remarkably simple. Instead of loading instruction words from a file into memory and then executing the code from memory, we generate instruction words, write them directly into memory, and then execute them. We discussed the structure of instruction words in Section 3.2. Therefore, generating the instruction words that represent our kernel and writing them to memory is an application of that knowledge. The only missing piece is a way to get an executable region of memory that contains our instruction words.

Most operating systems enforce the write-xor-execute (W^X) security policy. W^X means that a page of memory is either writable or executable, but not both. In short, we will use mmap to request page-aligned, writable, and zero-initialized memory from our Linux operating system. In practice, mmap rounds the requested number of bytes up to full pages. We can then write our kernel to the memory returned by mmap. Finally, we use mprotect to make the pages read-only and executable.

Listing 8.1.1 alloc_mmap function, which uses the POSIX.1 mmap function to allocate memory.#
21void * mini_jit::Kernel::alloc_mmap( std::size_t num_bytes ) const {
22  void* l_mem = mmap( 0,
23                      num_bytes,
24                      PROT_READ | PROT_WRITE,
25                      MAP_PRIVATE | MAP_ANONYMOUS,
26                      -1,
27                      0 );
28
29  if( l_mem == MAP_FAILED ) {
30    throw std::runtime_error( "Failed to allocate memory: "
31                              + std::string( std::strerror(errno) ) );
32  }
33
34  return l_mem;
35}

Listing 8.1.1 shows the alloc_mmap function, which wraps mmap and returns a pointer to a writable memory region of the desired size. The first mmap parameter in line 22 is set to 0, resulting in page-aligned memory, while the MAP_ANONYMOUS bit in line 25 results in zero-initialized memory. It is used in the minimal example code mini_jit to allocate memory for the instruction words of the kernels.

Listing 8.1.2 Function set_exec using the POSIX.1 mprotect function.#
47void mini_jit::Kernel::set_exec( std::size_t   num_bytes,
48                                 void        * mem ) const {
49  int l_res = mprotect( mem,
50                        num_bytes,
51                        PROT_READ | PROT_EXEC );
52
53  if( l_res == -1 ) {
54    throw std::runtime_error( "Failed to set memory executable: "
55                              + std::string( std::strerror(errno) ) );
56  }  
57}

Once the instruction words are copied to the memory pages allocated by mmap, the mini_jit code uses the set_exec function shown in Listing 8.1.2 to make them executable. set_exec uses mprotect to make the pages read-only and executable, thus following the W^X security model.

Listing 8.1.3 Function set_kernel, using the two functions alloc_mmap and set_exec.#
59void mini_jit::Kernel::set_kernel() {
60  release_memory();
61
62  if( m_buffer.empty() ) {
63    return;
64  }
65
66  // alloc kernel memory
67  m_size_alloc = m_buffer.size() * 4;
68  try {
69    m_kernel = (void *) alloc_mmap( m_size_alloc );
70  }
71  catch( std::runtime_error & e ) {
72    throw std::runtime_error( "Failed to allocate memory for kernel: "
73                              + std::string(e.what()) );
74  }
75
76  // copy instruction words from buffer to kernel memory
77  for( std::size_t l_in = 0; l_in < m_buffer.size(); l_in++ ) {
78    reinterpret_cast< uint32_t * >(m_kernel)[l_in] = m_buffer[l_in];
79  }
80
81  // clear cache
82  char * l_kernel_ptr = reinterpret_cast< char * >(m_kernel);
83  __builtin___clear_cache( l_kernel_ptr,
84                           l_kernel_ptr + m_buffer.size() * 4 );
85
86  // set executable
87  set_exec( m_size_alloc,
88            m_kernel );
89}

The code in Listing 8.1.3 shows how we can use the two functions together. The mini_jit::Kernel class has an internal buffer, m_buffer, to which we can add the instruction words one at a time. Once all the instruction words are added, we call the set_kernel function, which allocates memory for the kernel by calling alloc_mmap in line 69, copies the instruction words into the allocated memory in lines 77-79, and makes the memory read-only and executable by calling the set_exec function in lines 87-88. The built-in function __builtin___clear_cache is called to clear the instruction cache and ensure that the generated code is visible to the core that generated it.

8.2. Kernels#

We will now discuss some example kernels that illustrate the use of the mini_jit::Kernel class. These kernels are very simple by design, relying mostly on hardcoded instructions. In real just-in-time code generation software, additional wrappers are used to avoid code bloat and improve readability. Section 8.3 discusses a better way to generate instruction words. All examples can also be found directly in the kernel_examples.cpp file of mini_jit.

Listing 8.2.1 Example generation of a kernel with two instructions.#
 7void example_0() {
 8  std::cout << "example_0" << std::endl;
 9
10  mini_jit::Kernel l_kernel;
11  l_kernel.add_instr( 0xd28000a0 ); // mov x0, #5
12  l_kernel.add_instr( 0xd65f03c0 ); // ret
13  l_kernel.write( "example_0.bin" );
14  l_kernel.set_kernel();
15
16  int64_t (* l_func)() = nullptr;
17  l_func = (int64_t (*)()) l_kernel.get_kernel();
18  std::cout << "  result: " << l_func() << std::endl;
19}

Listing 8.2.1 shows our first example kernel. We see that the two instruction words 0xd28000a0 and 0xd65f03c0 are passed to l_kernel by calling the add_instr instruction in lines 11 and 12. Internally, the class adds the passed instruction words to the buffer m_buffer. After the two instruction words are added, the internal buffer has a size of 8 bytes.

Next, line 13 calls the write function, which writes the 8 instruction bytes to the binary file example_0.bin. The write function itself is fairly standard, and details on its implementation can be found in the Kernel.cpp file. We can use this file output along with standard command line tools for debugging. One option is to use objdump to disassemble the code:

objdump -m aarch64 -b binary -D example_0.bin

For the code in Listing 8.2.1, we get the following output:

example_0.bin:     file format binary


Disassembly of section .data:

0000000000000000 <.data>:
   0:	d28000a0 	mov	x0, #0x5                   	// #5
   4:	d65f03c0 	ret

We can see that the two instructions can be written in assembly code as mov x0, #5 and ret, so the kernel simply writes the value 5 to the X0 register before branching back to the calling scope with ret. Since the file output is for debugging, we simply disable it once our just-in-time code generation works as expected.

Continuing with the code in Listing 8.2.1, we next call the set_kernel function in line 14. As discussed earlier, set_kernel copies the instruction words into page-aligned memory and makes it executable. In line 17, we use the get_kernel function to get a void const pointer to this memory region. The C-style cast (int64_t (*)()) converts this pointer to a function pointer.

Note

Converting a void * object pointer to a function pointer is implementation-defined in ISO C/C++. On Linux with GCC or Clang it works reliably, but in other cases it may not.

The function takes no arguments and returns an int64_t value. This function pointer is stored in the variable l_func. Finally, in line 18, the function l_func is called, which means that our kernel is executed.

Listing 8.2.2 Example generation of a kernel with two instructions. The 16-bit immediate of the mov instruction is set dynamically.#
26void example_1( int16_t imm16 ) {
27  std::cout << "example_1" << std::endl;
28
29  mini_jit::Kernel l_kernel;
30  uint32_t l_ins = 0xd2800000;
31  l_ins |= imm16 << 5;
32  l_kernel.add_instr( l_ins ); // mov x0, #imm16
33  l_kernel.add_instr( 0xd65f03c0 ); // ret
34  l_kernel.write( "example_1.bin" );
35  l_kernel.set_kernel();
36
37  int64_t (* l_func)() = nullptr;
38  l_func = (int64_t (*)()) l_kernel.get_kernel();
39  std::cout << "  result: " << l_func() << std::endl;
40}

The example in Listing 8.2.2 is very similar to the previous one. However, this time the 16-bit immediate of the MOV (wide immediate) instruction is set to imm16 at runtime. The immediate is encoded in the instruction bits with IDs 5-20. We perform three steps to generate the desired instruction word:

  1. Initialize the instruction word with 0xd2800000 and write the result to l_ins (line 30). The 16 immediate bits are set to zero, that is we use the instruction word for mov x0, #0 as the initial value.

  2. Shift imm16 by 5 to the left (line 31). So if we had the 16-bit value 0b0000'0000'0000'0011 before, we get the 32-bit value 0b0000'0000'0000'0000'0000'0000'0110'0000 after the shift.

  3. Do a bitwise OR of l_ins and the shifted imm16 (line 31). This effectively sets the 16 immediate bits to the desired value.

The rest follows the steps of the first example. Now, in line 39, the immediate value is returned by the function call l_func(). This example already outlines the general approach for encoding instructions. First, we initialize all free bits to zero. Second, we set the free bits based on runtime parameters. More detailed examples are given in Section 8.3.

Listing 8.2.3 Example generation of a kernel with two instructions. The kernel has a single function parameter of type int64_t.#
45void example_2() {
46  std::cout << "example_2" << std::endl;
47
48  mini_jit::Kernel l_kernel;
49  l_kernel.add_instr( 0x91001400 ); // add x0, x0, #5
50  l_kernel.add_instr( 0xd65f03c0 ); // ret
51  l_kernel.write( "example_2.bin" );
52  l_kernel.set_kernel();
53
54  int64_t (* l_func)( int64_t ) = nullptr;
55  l_func = (int64_t (*)( int64_t )) l_kernel.get_kernel();
56  std::cout << "  result: " << l_func( 7 ) << std::endl;
57}

Our next example is shown in Listing 8.2.3. Again, we generate a kernel consisting of two instruction words. This time, however, the kernel takes a 64-bit integer as its only input parameter. As a result, the l_func function in lines 54-55 has a different signature. In line 56, the value 7 is passed to the kernel through the X0 parameter register. The kernel adds the immediate #5 to X0 and branches back to the surrounding scope. Therefore, the return value of l_func( 7 ) is 12.

Listing 8.2.4 Example generation of a kernel with seven instructions. The kernel performs 512 loop iterations.#
62void example_3() {
63  std::cout << "example_3" << std::endl;
64
65  mini_jit::Kernel l_kernel;
66  l_kernel.add_instr( 0xd2804000 ); // mov x0, #512
67  l_kernel.add_instr( 0xd2800001 ); // mov x1, #0
68  l_kernel.add_instr( 0xd1000400 ); // sub x0, x0, #1
69  l_kernel.add_instr( 0x91000821 ); // add x1, x1, #2
70  l_kernel.add_instr( 0xb5ffffc0 ); // cbnz x0, #-8
71  l_kernel.add_instr( 0xaa0103e0 ); // mov x0, x1
72  l_kernel.add_instr( 0xd65f03c0 ); // ret
73  l_kernel.write( "example_3.bin" );
74  l_kernel.set_kernel();
75
76  int64_t (* l_func)() = nullptr;
77  l_func = (int64_t (*)()) l_kernel.get_kernel();
78  std::cout << "  result: " << l_func() << std::endl;
79}

Our final example in Listing 8.2.4 shows just-in-time code generation of a loop. The loop consists of a loop counter in register X0 and a branch instruction. The instruction word 0xd2804000 in line 66 initializes the loop counter to 512, while 0xd1000400 in line 68 decrements the loop counter. The branch instruction is given by the instruction word 0xb5ffffc0 and jumps #-8 bytes relative to the program counter, that is two instructions back.

8.3. Instruction Word Generators#

This section discusses instruction word generators that set the free bits of an instruction based on user input. We have already seen a simple example of this in Section 8.2, where we dynamically set the 16-bit immediate of MOV (wide immediate) using a shift operation together with a bitwise OR.

Our goal in this section is to systematically raise the level of abstraction so that users of our just-in-time code generator can interface with instruction word generation instead of having to provide the instruction words manually. In summary, using instruction word generators will feel similar to writing assembly code.

Table 8.3.1 Structure of the CBNZ instruction.#

Bit IDs

31

30-24

23-5

4-0

Field

sf

imm19

Rt

Pattern

s

0110101

iiiiiiiiiiiiiiiiiii

ttttt

cbnz w0, #0

0

0110101

0000000000000000000

00000

We will discuss the implementation of generators for two instructions. As before, all functions discussed are part of the mini_jit example code. The first instruction is CBNZ, and its general structure is shown in Table 8.3.1. Bits 30-24 are fixed bits, while all other bits are free bits. In particular, the sf field determines whether we are using the 32-bit or 64-bit variant of the instruction, that is whether the instruction compares a W register or an X register to zero. The 19 immediate bits are stored in the field imm19 and determine how far the instruction branches relative to the program counter. imm19 is interpreted as a signed value in two’s complement form, and the jump in bytes is obtained by multiplying imm19 by 4. In other words, imm19 stores how many instructions backward (negative value) or forward (positive value) we branch. The remaining free bits are given by the field Rt, which encodes the ID of the register, that is it is compared to zero.

Listing 8.3.1 Instruction word generator for the base instruction CBNZ.#
23uint32_t mini_jit::InstGen::base_br_cbnz( gpr_t   reg,
24                                          int32_t imm19 ) {
25  uint32_t l_ins = 0x35000000;
26
27  // set register id
28  uint32_t l_reg_id = reg & 0x1f;
29  l_ins |= l_reg_id;
30
31  // set size of the register
32  uint32_t l_reg_size = reg & 0x20;
33  l_ins |= l_reg_size << (32-6);
34
35  // set immediate
36  uint32_t l_imm = imm19 & 0x7ffff;
37  l_ins |= l_imm << 5;
38
39  return l_ins;
40}

Listing 8.3.1 shows the base_br_cbnz function, which generates CBNZ instruction words. The function takes two parameters, reg and imm19. The first parameter reg contains the ID of the general-purpose register used and the view of that register. Both are encoded in the enumerator gpr_t. Specifically, we assign the enumerator values 0-30 to the registers W0-W30 and the values 32-62 to the registers X0-X30. The second parameter imm19 contains the 19-bit immediate of the instruction.

Going through the code line by line, we see that the local variable l_ins is initialized with the instruction word 0x35000000, that is all free bits are set to zero, as shown in Table 8.3.1. Next, lines 28 and 29 set the instruction bits of the Rt field. Specifically, we first mask the input parameter reg with 0x1f, which is 0b11111 in binary. The mask sets all but the last five bits to zero. Thus, regardless of the register type, X or W, we only set the Rt bits by the bitwise OR in line 29.

Lines 32 and 33 set the size field sf, which is stored in bit 31 of a CBNZ instruction word. Our instructions must have a value of 0 in sf when working with a W register and a value of 1 when working with an X register. So, in line 32 we apply the mask 0x20 to isolate the size bit in our gpr_t enumerator type. Then, in line 33, we shift the size bit, now at local ID 5 of l_reg_size, by 26 to the left so that the following bitwise OR modifies instruction bit 31 for an X register.

The last two bit manipulations in lines 36 and 37 set the 19-bit immediate of the instruction. As before, we first apply the mask 0x7ffff to eliminate all but the lower 19 bits in imm19. The masked value in l_imm is then written to bits 23-5 of l_ins by left shifting and bitwise ORing.

Table 8.3.2 Structure of the FP32 and FP64 versions of the FMLA (vector) instruction.#

Bit IDs

31

30

29-23

22

21

20-16

15-10

9-5

4-0

Field

Q

sz

Rm

Rn

Rd

Pattern

0

q

0011100

s

1

mmmmm

110011

nnnnn

ddddd

fmla v0.2s, v0.2s, v0.2s

0

0

0011100

0

1

00000

110011

00000

00000

fmla v0.4s, v0.4s, v0.4s

0

1

0011100

0

1

00000

110011

00000

00000

fmla v0.2d, v0.2d, v0.2d

0

1

0011100

1

1

00000

110011

00000

00000

Our second example is the FP32 and FP64 versions of FMLA (vector). The overall structure of the instruction is shown in Table 8.3.2. We see that we need to set the five fields Q, sz, Rm, Rn, and Rd. Rn contains the ID of the first SIMD&FP source register, Rm contains the ID of the second SIMD&FP source register, and Rd contains the ID of the SIMD&FP destination register. The two fields Q and sz together encode the arrangement specifier of the instruction. Specifically, both are zero for 2S, only Q is 1 for 4S, and both are 1 for 2D.

Listing 8.3.2 Instruction generator for the FP32 and FP64 versions of the FMLA (vector) instruction.#
42uint32_t mini_jit::InstGen::neon_dp_fmla_vector( simd_fp_t  reg_dest,
43                                                 simd_fp_t  reg_src1,
44                                                 simd_fp_t  reg_src2,
45                                                 arr_spec_t arr_spec ) {
46  uint32_t l_ins = 0x0e20cc00;
47
48  // set destination register id
49  uint32_t l_reg_id = reg_dest & 0x1f;
50  l_ins |= l_reg_id;
51
52  // set first source register id
53  l_reg_id = reg_src1 & 0x1f;
54  l_ins |= l_reg_id << 5;
55
56  // set second source register id
57  l_reg_id = reg_src2 & 0x1f;
58  l_ins |= l_reg_id << 16;
59
60  // set arrangement specifier
61  uint32_t l_arr_spec = arr_spec & 0x40400000;
62  l_ins |= l_arr_spec;
63
64  return l_ins;
65}

Listing 8.3.2 shows the neon_dp_fmla_vector function, which can generate FP32 and FP64 FMLA (vector) instruction words. The register IDs are passed by the parameters reg_dest, reg_src1, and reg_src2. The simd_fp_t enumeration simply uses the values 0-31 for the SIMD&FP registers V0-V31. The arrangement specifier is passed through the arr_spec parameter, where the arr_spec_t enumerator is a bit special. It enumerates the value 0x0 for 2S, the value 0x40000000 for 4S, and the value 0x40400000 for 2D. This means that for 2S all bits are set to zero, for 4S only the bit with ID 30 is set to 1, and for 2D only the two bits with ID 30 and 22 are set to 1.

The following code sets the Rd, Rn and Rm fields in lines 49-58. Because of the arr_spec_t values used, the generator can simply do a bitwise OR to set the Q and sz fields in line 62 after masking.

In summary, we have seen that the implementation of instruction word generators is straightforward. The two CBNZ and FMLA (vector) generators discussed use only left shifts, bitwise ANDs, and bitwise ORs. This reliance on simple bit operations results in fast just-in-time code generation. This can be an advantage in situations where code generation time is critical.