10. Datapath
We have gathered all the parts and are ready to start designing our own processors. In theory, we could do this in SystemVerilog and support AArch64 in its entirety. Our FPGA skills would then prove extremely helpful to conduct initial tests and take our processor designs for a spin. If only there were some more time…
Instead, we will limit our efforts to a simple single-cycle core that supports only a few A64 instructions. This lab studies the datapath of this microarchitecture. First, Section 10.1 discusses machine code and the underlying instruction encoding used in the Arm architecture. An Arm processor parses this machine code and behaves accordingly. Second, in Section 10.2, we will study how our single-cycle design interprets and executes machine code. This gives us an idea of how a CPU fetches, decodes, and executes instructions.
10.1. Machine Code
In Section 8 and Section 9, we wrote simple functions in assembly language. Up to this point, the assembler has taken care of translating our assembly code into machine code. Machine code is how our executable programs are actually stored in memory. In this exercise, we will manually translate a program from assembly language to machine code. This is not something one would typically do in practice, but it is crucial for understanding the encoding of A64 instructions.
AArch64 has a fixed instruction size of 32 bits. Thus, a function with 10 instructions occupies 10 \(\times\) 32 bits = 320 bits = 40 bytes. Logically, an AArch64 processor reads bits 0–31, decodes the instruction, and executes the corresponding operation. Next, the processor reads and decodes bits 32–63, and executes the respective operation. After that, bits 64–95 are processed, and so on. Branches behave differently. Here, potentially based on the result of a previous instruction, the processor may not proceed with the next 32 bits, but instead branch to another program position.
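The size and offset arithmetic above can be written down in a few lines. The following is a minimal sketch; the names `kInstrBytes`, `code_size_bytes`, and `instr_offset_bytes` are hypothetical and only serve this illustration.

```cpp
#include <cstddef>

// Every A64 instruction occupies 32 bits, i.e., 4 bytes.
constexpr std::size_t kInstrBytes = 4;

// Size in bytes of a code sequence consisting of n_instr instructions.
constexpr std::size_t code_size_bytes(std::size_t n_instr) {
  return n_instr * kInstrBytes;
}

// Byte offset of the instruction with zero-based index idx,
// measured from the start of the code sequence.
constexpr std::size_t instr_offset_bytes(std::size_t idx) {
  return idx * kInstrBytes;
}

// A function with 10 instructions occupies 40 bytes.
static_assert(code_size_bytes(10) == 40, "10 instructions are 40 bytes");
// The third instruction (index 2) starts at bit 64.
static_assert(instr_offset_bytes(2) * 8 == 64, "index 2 starts at bit 64");
```

Without branches, the processor simply walks through these offsets in increasing order.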
Optional Note
Certain recent extensions of the Arm architecture deviate from the principle that every operation is encoded in 32 bits. Examples of such more “complex” operations are the prefix instructions of the Scalable Vector Extension (SVE). MOVPRFX (unpredicated), for example, allows us to perform a four-operand Fused-Multiply-Add (FMA4) operation. Effectively, one would use two instructions, i.e., 2 \(\times\) 32 bits = 64 bits, to realize FMA4. Some details on this are available from Fujitsu’s 2018 Hot Chips (HC30) presentation on A64FX.
Listing 10.1.1, Listing 10.1.2, and Listing 10.1.3 contain our usual structure: a C++ driver, a C reference implementation, and the corresponding part in assembly language.
This time, however, the assembly code is already provided in the function machine_code_asm_0 of Listing 10.1.3.
Our goal is the manual translation of the human-readable instructions into machine code.
This allows us to re-implement the function in machine_code_asm_1 using only machine code.
We will do this by looking up the instructions of machine_code_asm_0 in the ISA and writing down the respective machine code in Table 10.1.2.
Good news: you are in luck!
Table 10.1.1 already contains most of the relevant links to the ISA and a short form of the respective instruction encodings.
Only SUBS (immediate) is missing.
Listing 10.1.1: C++ driver.

```cpp
#include <cstdint>
#include <cstdlib>
#include <iostream>

extern "C" {
  uint64_t machine_code_c();
  uint64_t machine_code_asm_0();
  uint64_t machine_code_asm_1();
}

int main() {
  uint64_t l_result_c = 0;
  uint64_t l_result_asm_0 = 0;
  uint64_t l_result_asm_1 = 0;

  // machine_code_c
  std::cout << "### calling machine_code_c ###" << std::endl;
  l_result_c = machine_code_c();
  std::cout << l_result_c << std::endl;

  // machine_code_asm_0
  std::cout << "### calling machine_code_asm_0 ###" << std::endl;
  l_result_asm_0 = machine_code_asm_0();
  std::cout << l_result_asm_0 << std::endl;

  // machine_code_asm_1
  std::cout << "### calling machine_code_asm_1 ###" << std::endl;
  l_result_asm_1 = machine_code_asm_1();
  std::cout << l_result_asm_1 << std::endl;

  return EXIT_SUCCESS;
}
```

Listing 10.1.2: C reference implementation.

```c
#include <stdint.h>

uint64_t machine_code_c() {
  uint64_t l_tmp_0 = 3;
  uint64_t l_tmp_1 = 0;
  uint64_t l_tmp_2 = 0;

  while( 1 ) {
    l_tmp_0--;
    l_tmp_1 = l_tmp_1 + 3;
    l_tmp_2 = l_tmp_2 + 7;
    if( l_tmp_0 == 0 ) break;
  }

  l_tmp_0 = l_tmp_1 + l_tmp_2;
  return l_tmp_0;
}
```

Listing 10.1.3: Assembly implementation.

```asm
        .text
        .align 4
        .type machine_code_asm_0, %function
        .global machine_code_asm_0
machine_code_asm_0:
        orr x0, xzr, #3
        and x1, x1, xzr
        and x2, x2, xzr
my_loop:
        subs x0, x0, #1
        add x1, x1, #3
        add x2, x2, #7
        b.ne my_loop

        add x0, x1, x2
        ret
        .size machine_code_asm_0, (. - machine_code_asm_0)

        .text
        .align 4
        .type machine_code_asm_1, %function
        .global machine_code_asm_1
machine_code_asm_1:
        // TODO: implement machine code version
        .size machine_code_asm_1, (. - machine_code_asm_1)
```
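Table 10.1.1: Instruction encodings (short form).

| Instruction | Instruction Encoding |
|---|---|
| ORR (immediate) | |
| AND (shifted register) | |
| SUBS (immediate) | |
| ADD (immediate) | |
| B.cond | |
| ADD (shifted register) | |
| RET | |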
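To illustrate how an instruction encoding turns into machine code, the following sketch packs the fields of the 64-bit ADD (immediate) instruction into a 32-bit word. The field layout (sf, op, S, the fixed bits 100010, sh, imm12, Rn, Rd) follows the ADD (immediate) description in the Arm Architecture Reference Manual. The helper name `encode_add_imm64` is hypothetical, and the sketch assumes an unshifted immediate (sh = 0).

```cpp
#include <cstdint>

// Hypothetical helper: encodes "add Xd, Xn, #imm12" (64-bit, sh = 0).
constexpr uint32_t encode_add_imm64(uint32_t rd, uint32_t rn, uint32_t imm12) {
  return (1u << 31)                  // sf = 1: 64-bit variant
       | (0u << 30)                  // op = 0: addition
       | (0u << 29)                  // S  = 0: condition flags are not set
       | (0x22u << 23)               // fixed bits [28:23] = 100010
       | (0u << 22)                  // sh = 0: immediate is not shifted
       | ((imm12 & 0xfffu) << 10)    // imm12 in bits [21:10]
       | ((rn & 0x1fu) << 5)         // Rn in bits [9:5]
       | (rd & 0x1fu);               // Rd in bits [4:0]
}

// add x1, x1, #3 has the machine code 0x91000c21.
static_assert(encode_add_imm64(1, 1, 3) == 0x91000c21u,
              "encoding of add x1, x1, #3");
```

The same approach works for the other instructions: write the fixed bits, then shift every variable field to its bit position and combine the parts with a bitwise OR.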
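Table 10.1.2: Machine code of machine_code_asm_0.

| Assembly Language | Machine Code (binary) | Machine Code (hex) |
|---|---|---|
| orr x0, xzr, #3 | | |
| and x1, x1, xzr | | |
| and x2, x2, xzr | | |
| subs x0, x0, #1 | | |
| add x1, x1, #3 | | |
| add x2, x2, #7 | | |
| b.ne my_loop | | |
| add x0, x1, x2 | | |
| ret | | |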
Tasks
Complete Table 10.1.1 by looking up the instruction encoding of SUBS (immediate). Use a short format similar to the already provided instructions.
Look up the instruction encoding of SUB (immediate). How does it differ from SUBS (immediate)?
Complete Table 10.1.2 by providing the binary and hexadecimal machine code of all instructions.
Hint
The immediate value imm19 of B.cond is in two’s complement representation. This means that 1111 1111 1111 1111 111 represents the decimal value -1.
A description of the condition codes is available from the Arm Architecture Reference Manual Armv8, for A-profile architecture in Chapter C1.2.4 / Table C1-1.
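The two’s complement interpretation of a 19-bit field can be sketched as follows. The helper names `sign_extend_19`, `to_imm19`, and `branch_offset_bytes` are hypothetical. The sketch also uses the fact that B.cond multiplies imm19 by 4 to obtain the byte offset relative to the branch instruction itself.

```cpp
#include <cstdint>

// Interprets the lower 19 bits of field as a two's complement number.
constexpr int32_t sign_extend_19(uint32_t field) {
  // If bit 18 (the sign bit) is set, subtract 2^19.
  return (field & 0x40000u)
             ? static_cast<int32_t>(field & 0x7ffffu) - (1 << 19)
             : static_cast<int32_t>(field & 0x7ffffu);
}

// Converts a signed value to its 19-bit two's complement bit pattern.
constexpr uint32_t to_imm19(int32_t value) {
  return static_cast<uint32_t>(value) & 0x7ffffu;
}

// Byte offset of the branch target relative to the branch instruction.
constexpr int32_t branch_offset_bytes(uint32_t imm19) {
  return sign_extend_19(imm19) * 4;
}

// 1111 1111 1111 1111 111 represents -1.
static_assert(sign_extend_19(0x7ffffu) == -1, "all ones is -1");
static_assert(to_imm19(-1) == 0x7ffffu, "-1 is all ones");
// An imm19 of -1 targets the instruction directly before the branch.
static_assert(branch_offset_bytes(0x7ffffu) == -4, "one instruction back");
```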
Implement the function machine_code_asm_1 in Listing 10.1.3 using only your hex version of the machine code.
Hint
The directive .inst allows you to provide machine code for an instruction in your assembly files. For example, for the first instruction orr x0, xzr, #3 one may alternatively write .inst 0xb24007e0.
Hint
You may assemble a single instruction using the tool llvm-mc.
For example, you could use the following command to assemble the instruction add x1, x1, #3:
```shell
echo "add x1, x1, #3" | llvm-mc -triple=aarch64 --show-encoding
```
Conversely, you may also use llvm-mc to disassemble a single instruction.
For example, to disassemble 0x91000c21 you could use:
```shell
echo "0x21 0x0c 0x00 0x91" | llvm-mc -triple=aarch64 -disassemble --show-encoding
```
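Note that the disassembler input lists the bytes of 0x91000c21 in reverse order. A64 instructions are stored in little-endian byte order, so the least significant byte comes first in memory. The following sketch converts between the two views; the helper names `word_to_bytes` and `bytes_to_word` are hypothetical.

```cpp
#include <array>
#include <cstdint>

// Splits a 32-bit instruction word into its bytes in memory order
// (little-endian: least significant byte first).
inline std::array<uint8_t, 4> word_to_bytes(uint32_t word) {
  return {{static_cast<uint8_t>(word & 0xffu),
           static_cast<uint8_t>((word >> 8) & 0xffu),
           static_cast<uint8_t>((word >> 16) & 0xffu),
           static_cast<uint8_t>((word >> 24) & 0xffu)}};
}

// Reassembles the 32-bit instruction word from its bytes in memory order.
inline uint32_t bytes_to_word(const std::array<uint8_t, 4>& bytes) {
  return static_cast<uint32_t>(bytes[0])
       | (static_cast<uint32_t>(bytes[1]) << 8)
       | (static_cast<uint32_t>(bytes[2]) << 16)
       | (static_cast<uint32_t>(bytes[3]) << 24);
}
```

For example, `word_to_bytes(0x91000c21)` yields the bytes 0x21, 0x0c, 0x00, 0x91, which is exactly the order passed to llvm-mc above.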
10.2. Simulation
This section simulates a simple program using the datapath of the single-cycle core developed in the lectures. “Single-cycle” means that within one cycle, the core fetches an instruction from instruction memory, decodes it, and executes it. The simulation allows us to study the behavior of the datapath down to individual gates.
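The fetch, decode, and execute steps of one cycle can be mimicked in software. The following is a heavily simplified sketch and not the lecture’s datapath: the names `Core` and `step` are hypothetical, and the model assumes that only the 64-bit ADD (immediate) and SUB (immediate) instructions with an unshifted immediate and registers X0 to X30 are supported.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical architectural state: registers X0..X30 and the program counter.
struct Core {
  std::array<uint64_t, 31> x{};  // all registers start at zero
  uint64_t pc = 0;               // byte address of the next instruction
};

// Performs one "cycle": fetch, decode, execute, and write back.
// Returns false if the PC leaves the program or the instruction is unsupported.
inline bool step(Core& core, const std::vector<uint32_t>& imem) {
  // Fetch: every instruction occupies 4 bytes.
  const std::size_t idx = static_cast<std::size_t>(core.pc / 4);
  if (core.pc % 4 != 0 || idx >= imem.size()) return false;
  const uint32_t instr = imem[idx];

  // Decode: extract the fields of ADD/SUB (immediate).
  const uint32_t rd = instr & 0x1fu;
  const uint32_t rn = (instr >> 5) & 0x1fu;
  const uint64_t imm12 = (instr >> 10) & 0xfffu;
  const bool is_sub = ((instr >> 30) & 1u) != 0;  // op bit: 0 = add, 1 = sub

  // Require sf = 1, S = 0, bits [28:23] = 100010, and sh = 0; op is free.
  const bool supported = (instr & 0xbfc00000u) == 0x91000000u;
  // Register number 31 denotes SP for these instructions; not modeled here.
  if (!supported || rd == 31 || rn == 31) return false;

  // Execute and write back.
  core.x[rd] = is_sub ? core.x[rn] - imm12 : core.x[rn] + imm12;
  core.pc += 4;
  return true;
}
```

Calling `step` once on a program that consists only of 0x91000c21 (add x1, x1, #3) with zeroed registers sets X1 to 3 and advances the program counter by 4 bytes.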
Optional Note
Modern processors work on many instructions in parallel and often require more than a single cycle to complete a single instruction. One heavily used mechanism for Instruction-Level Parallelism (ILP) is pipelining.
Listing 10.2.1: Example program.

```asm
add x0, x0, #32
add x1, x0, #8
add x2, x2, #16
str x0, [x1]
ldr x1, [x3, #40]
sub x3, x1, #1
```
Tasks
For the code in Listing 10.2.1, provide the values of registers X0, X1, X2, and X3 after every executed instruction. Assume that all registers have been zeroed initially. Use base 16 for values in your description. Example:
Initially we have X0: 0, X1: 0, X2: 0, X3: 0.
After executing add x0, x0, #32, we obtain X0: 0x20, X1: 0, X2: 0, X3: 0.
Use the (incomplete) datapath part_2_addi_subi.dig of the lecture’s single-cycle core to execute the instructions in Listing 10.2.1. Export the core’s configurations as SVGs while the clock signal is low, that is, before the respective instruction completes. Include your control signals (RegW, ImmType, AluCtrl, MemW, Mem2Reg) in the exports.