10. Datapath

We have all parts together and are ready to start designing our own processors. In theory, we could do this in SystemVerilog and support AArch64 in its entirety. Our FPGA skills would then prove extremely helpful to conduct first tests and take our processor designs for a spin. If only there was some more time.. 😅

Instead we don’t pursue the SystemVerilog avenue and also limit our efforts to a simple single cycle core which only supports a few A64 instructions. This lab studies the datapath of this microarchitecture. In detail, Section 10.1 discusses machine code and the underlying instruction encoding used in the Arm architecture. An Arm processor is able to parse this machine code and behaves accordingly. Second, in Section 10.2 we’ll study how our single cycle design interprets and executes machine code. This allows us to get an idea about how instructions are fetched, decoded and executed by a CPU.

10.1. Machine Code

In Section 8 and Section 9 we wrote simple functions in assembly language. Until this point the assembler took care of translating our assembly code to machine code. The machine code is how our executable programs are actually stored in memory. In this exercise we’ll manually translate a program from assembly language to machine code. This is nothing one would typically do in practice but crucial to understand the encoding of A64 instructions.

AArch64 has a fixed instruction size of 32 bits. Thus, a function with 10 instructions fits in 10 $\times$ 32 bits = 320 bits = 40 bytes. Logically an AArch64 processor reads bits 0-31, decodes the instruction and executes a corresponding operation. Next, the processor reads and decodes bits 32-63 and executes the respective operation. After that bits 64-95 are processed and so on. Branches behave differently from that. Here, potentially based on the result of a previous instruction, the processor might not simply jump to the next 32 bits but to another program position.

Optional Note

Certain recent extensions of the Arm architecture violate the idea that every operation is encoded in 32 bits. An example for more “complex” operations are those using prefix instructions of the Scalable Vector Extension (SVE). MOVPRFX (unpredicated), for example, allows us to perform a four-operand Fused-Multiply-Add (FMA4) operation. Effectively one would use two instructions, i.e., 2 $\times$ 32 bits = 64 bits, to realize FMA4. Some details on this are available from Fujitsu’s 2018 Hot Chips (HC30) presentation on A64FX.

Listing 10.1.1, Listing 10.1.2 and Listing 10.1.3 contain our usual structure: A C++ driver, a C reference implementation and the corresponding part in assembly language. This time, however, the assembly code is already provided in Listing 10.1.3’s function machine_code_asm_0. Our goal is the manual translation of the human-readable instructions to machine code. This allows us to re-implement the function in machine_code_asm_1 by using only machine code.

We’ll do this by looking up the instructions of machine_code_asm_0 in the ISA and writing down the respective machine code in Table 10.1.2. Good news, you are in luck! Table 10.1.1 already contains most of the relevant links to the ISA and a short form of respective instruction encoding. Only SUBS (immediate) is missing.

Listing 10.1.1 C++ driver for the machine code example.

#include <cstdint>
#include <cstdlib>
#include <iostream>

extern "C" {
  uint64_t machine_code_c();
  uint64_t machine_code_asm_0();
  uint64_t machine_code_asm_1();
}

int main() {
  uint64_t l_result_c = 0;
  uint64_t l_result_asm_0 = 0;
  uint64_t l_result_asm_1 = 0;

  // machine_code_c
  std::cout << "### calling machine_code_c ###" << std::endl;
  l_result_c = machine_code_c();
  std::cout << l_result_c << std::endl;

  // machine_code_asm_0
  std::cout << "### calling machine_code_asm_0 ###" << std::endl;
  l_result_asm_0 = machine_code_asm_0();
  std::cout << l_result_asm_0 << std::endl;

  // machine_code_asm_1
  std::cout << "### calling machine_code_asm_1 ###" << std::endl;
  l_result_asm_1 = machine_code_asm_1();
  std::cout << l_result_asm_1 << std::endl;

  return EXIT_SUCCESS;
}

Listing 10.1.2 C kernel which provides the high-level reference of our machine code example.

#include <stdint.h>

uint64_t machine_code_c() {
  uint64_t l_tmp_0 = 3;
  uint64_t l_tmp_1 = 0;
  uint64_t l_tmp_2 = 0;

  while( 1 ) {
    l_tmp_0--;
    l_tmp_1 = l_tmp_1 + 3;
    l_tmp_2 = l_tmp_2 + 7;

    if( l_tmp_0 == 0 ) break;
  }

  l_tmp_0 = l_tmp_1 + l_tmp_2;

  return l_tmp_0;
}

Listing 10.1.3 Assembly kernel machine_code_asm_0 which serves as the reference implementation.

        .text
        .align 4
        .type   machine_code_asm_0, %function
        .global machine_code_asm_0
machine_code_asm_0:
        orr x0, xzr, #3
        and x1, x1, xzr
        and x2, x2, xzr

my_loop:
        subs x0, x0, #1
        add x1, x1, #3
        add x2, x2, #7
        b.ne my_loop

        add x0, x1, x2

        ret
        .size   machine_code_asm_0, (. - machine_code_asm_0)


        .text
        .align 4
        .type   machine_code_asm_1, %function
        .global machine_code_asm_1
machine_code_asm_1:
        // TODO: implement machine code version
        .size   machine_code_asm_1, (. - machine_code_asm_1)

Table 10.1.1 Instruction encodings for the function `machine_code_asm_0`.
Instruction	Instruction Encoding
ORR (immediate)	`x011 0010 0nrr rrrr ssss ssnn nnnd dddd`
AND (shifted register)	`x000 1010 hh0m mmmm iiii iinn nnnd dddd`
SUBS (immediate)	`???? ???? ???? ???? ???? ???? ???? ????`
ADD (immediate)	`x001 0001 0hii iiii iiii iinn nnnd dddd`
B.cond	`0101 0100 iiii iiii iiii iiii iii0 cccc`
ADD (shifted register)	`x000 1011 hh0m mmmm iiii iinn nnnd dddd`
RET	`1101 0110 0101 1111 0000 00nn nnn0 0000`

Table 10.1.2 Instructions written in assembly language and machine code for the function `machine_code_asm_0`.
Assembly Language	Machine Code (binary)	Machine Code (hex)
`orr x0, xzr, #3`	`1011 0010 0100 0000 0000 0111 1110 0000`	`b24007e0`
`and x1, x1, xzr`	`1000 1010 0001 1111 0000 0000 0010 0001`	`8a1f0021`
`and x2, x2, xzr`
`subs x0, x0, #1`
`add x1, x1, #3`	`1001 0001 0000 0000 0000 1100 0010 0001`	`91000c21`
`add x2, x2, #7`
`b.ne my_loop`
`add x0, x1, x2`
`ret`	`1101 0110 0101 1111 0000 0011 1100 0000`	`d65f03c0`

Tasks

Complete Table 10.1.1 by looking up the instruction encoding of SUBS (immediate). Use a short format similar to the already provided instructions.
Look up the instruction encoding of SUB (immediate). How does it differentiate from SUBS (immediate)?
Complete Table 10.1.2 by providing the binary and hexadecimal machine code of all instructions.
Hint
- The immediate imm19 of B.cond is in two’s complement representation. This means that 1111 1111 1111 1111 111 represents the decimal value -1.
- A description of the condition codes is available from the Arm Architecture Reference Manual Armv8, for A-profile architecture in Chapter C1.2.4 / Table C1-1.
Implement the function machine_code_asm_1 in Listing 10.1.3 by only using your hex version of the machine code.

Hint

The directive .inst allows you to provide machine code for an instruction in your assembly files. For example, for the first instruction orr x0, xzr, #3 one may alternatively write .inst 0xb24007e0.

Hint

You may assemble a single instruction by using the tool llvm-mc. For example, you could use the following command to assemble the instruction adds x1, x1, #3:

echo "add x1, x1, #3" | llvm-mc -triple=aarch64 --show-encoding

Conversely, you may also use llvm-mc to disassemble a single instruction. For example, to disassemble 0x91000c21 you could use:

echo "0x21 0x0c 0x00 0x91" | llvm-mc -triple=aarch64 -disassemble --show-encoding

10.2. Simulation

This subsection simulates a simple program using the datapath of the single cycle core developed in the lectures. “Single cycle” means that within one cycle the core fetches an instruction from instruction memory, decodes it and executes it entirely. The simulation allows us to study the behavior of the datapath down to individual gates.

Optional Note

Modern processors work on many instructions in parallel and often require more than a single cycle to execute a single instruction. One heavily used mechanism for Instruction-Level Parallelism (ILP) is called pipelining.

Listing 10.2.1 Simple program comprising five instructions.

orr x0, xzr, #7
and x1, x1, xzr
add x2, x0, #3
add x3, x1, #4
sub x0, x2, x3

Tasks

For the code in Listing 10.2.1, provide the values of registers x0, x1, x2 and x3 after every executed instruction. Use - for undefined values and base 16 for valid values in your description, example:
- Initially we have x0: -, x1: -, x2: -, x3: -.
- After executing orr x0, xzr, #7, we obtain x0: 7, x1: -, x2: -, x3: -
Use the (incomplete) datapath part_4_shifted_register.dig of the lectures’ single cycle core to execute the instructions in Listing 10.2.1. Export the core’s configuration as SVG including your control signals (RegSrc, RegW, ImmType, AluSrc, AluCtrl, MemW, Mem2Reg) whenever the clock signal is low, i.e., before completing the respective instruction.