8. AArch64
In this lab we’ll shift our attention to the AArch64 Instruction Set Architecture (ISA). An AArch64 instruction can be written in two ways:
Through human-readable assembly code; or
Through machine code which is understood directly by the processor.
We’ll first get a feeling for the ISA by writing a few simple functions in assembly code. Once this is accomplished, Section 9 introduces vector instructions in the context of floating point arithmetic. Then, Section 10 brings us again closer to the hardware by discussing machine code and its execution by the control and datapath of a processor.
8.1. Copying Data
AArch64 is a load-store architecture (see Fig. 8.1.1). This means that instructions either perform memory accesses or operate on data in registers. Note that an instruction may not do both, i.e., access memory and process data, at the same time.
A memory access instruction transfers data from memory to the registers (load) or transfers data from the registers to memory (store). In this task we’ll copy data which is located at one memory location to another memory location. Since we cannot directly move data between two memory locations, we first load the data from the first location to the registers and then write it back to the target memory location. For this task we’ll use AArch64’s general purpose registers which are shown in Fig. 8.1.1.
Optional Note
Certain recent extensions of the Arm architecture violate the concept of a strict “load-store architecture” 🙄. One such example is the LDADD instruction which loads data from memory, adds a value in a register to it, and writes the result back to memory.
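To get a feeling for such a read-modify-write operation without writing assembly, here is a minimal sketch in C11. The function name fetch_add_sketch is made up for illustration. It is an assumption, not a guarantee, that a compiler targeting AArch64 with the Armv8.1-A atomics enabled translates the call into an instruction of the LDADD family; without them, compilers typically fall back to a loop of exclusive loads and stores.

```c
#include <stdatomic.h>
#include <stdint.h>

// Hypothetical helper (not part of the lab material).
// Atomically adds i_inc to the value stored at io_value and returns the
// value that was in memory before the addition.
// Assumption: with the Armv8.1-A atomics available, compilers typically
// map this call to a single LDADD-family instruction.
uint64_t fetch_add_sketch( _Atomic uint64_t * io_value,
                           uint64_t           i_inc ) {
  return atomic_fetch_add( io_value, i_inc );
}
```

Disassembling the compiled function is a quick way to check which instructions your compiler actually picked.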
#include <cstdint>
#include <cstdlib>
#include <iostream>

extern "C" {
  void copy_c( uint64_t const * i_a,
               uint64_t       * o_b );
  // TODO: uncomment
  //void copy_asm( uint64_t const * i_a,
  //               uint64_t       * o_b );
}

int main() {
  uint64_t l_a[7] = { 1, 21, 43, 78, 89, 91, 93 };
  uint64_t l_b_0[7] = { 0 };
  uint64_t l_b_1[7] = { 0 };

  // copy_c
  std::cout << "### calling copy_c ###" << std::endl;
  copy_c( l_a,
          l_b_0 );

  for( unsigned short l_va = 0; l_va < 7; l_va++ ) {
    std::cout << l_a[l_va] << " / " << l_b_0[l_va] << std::endl;
  }

  // copy_asm
  std::cout << "### calling copy_asm ###" << std::endl;
  // TODO: uncomment
  // copy_asm( l_a,
  //           l_b_1 );

  for( unsigned short l_va = 0; l_va < 7; l_va++ ) {
    std::cout << l_a[l_va] << " / " << l_b_1[l_va] << std::endl;
  }

  return EXIT_SUCCESS;
}
#include <stdint.h>

void copy_c( uint64_t const * i_a,
             uint64_t       * o_b ) {
  // load the seven values from the source array
  uint64_t l_tmp_0 = i_a[0];
  uint64_t l_tmp_1 = i_a[1];
  uint64_t l_tmp_2 = i_a[2];
  uint64_t l_tmp_3 = i_a[3];
  uint64_t l_tmp_4 = i_a[4];
  uint64_t l_tmp_5 = i_a[5];
  uint64_t l_tmp_6 = i_a[6];

  // store the seven values to the target array
  o_b[0] = l_tmp_0;
  o_b[1] = l_tmp_1;
  o_b[2] = l_tmp_2;
  o_b[3] = l_tmp_3;
  o_b[4] = l_tmp_4;
  o_b[5] = l_tmp_5;
  o_b[6] = l_tmp_6;
}
        .text
        .align 4
        .type copy_asm, %function
        .global copy_asm
copy_asm:
        // TODO: Implement copy_asm
        ret
        .size copy_asm, (. - copy_asm)
The code in Listing 8.1.1 and Listing 8.1.3 provides the required boilerplate for your kernel.
Further, a reference implementation of the copy function in C is given in Listing 8.1.2.
Your task is to copy Listing 8.1.1’s seven 64-bit unsigned integers in array l_a to array l_b_1 by implementing the function copy_asm in assembly language.
Note
Use the instructions LDR (immediate) and STR (immediate) for the loads and stores in your implementation.
Do not implement any stack transfers and only use the first 18 general purpose registers, i.e., R0-R17, to adhere to the procedure call standard.
Use the flags -pedantic -Wall -Wextra -Werror whenever invoking gcc or g++. Do this not only here but in all tasks.
Tasks
Implement the function copy_asm in the file copy_asm.s. Use the template in Listing 8.1.3 for your implementation. Follow the ideas of the C implementation in Listing 8.1.2, i.e., do not use any loops in your code.
Compile the C kernel copy_c given in Listing 8.1.2 using the optimization flag -O2. Disassemble the compiler-generated machine code. Briefly explain the obtained assembly code.
Implement a new function copy_asm_loop in the file copy_asm.s. In this implementation use a loop to copy the seven values.
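Before writing copy_asm_loop, it can help to sketch the loop in C first. The following is one possible sketch and not part of the lab material: the name copy_c_loop and the counter l_n_left are assumptions made for illustration. The loop counts down to zero and advances both pointers, which is close to what a compare-and-branch loop with incremented address registers would do in assembly.

```c
#include <stdint.h>

// Hypothetical loop-based counterpart of copy_c (illustration only).
// Copies seven 64-bit values from i_a to o_b.
void copy_c_loop( uint64_t const * i_a,
                  uint64_t       * o_b ) {
  uint64_t l_n_left = 7;  // number of values which still have to be copied

  do {
    uint64_t l_tmp = *i_a;  // load: memory -> register
    *o_b = l_tmp;           // store: register -> memory
    i_a++;                  // advance source address by 8 bytes
    o_b++;                  // advance target address by 8 bytes
    l_n_left--;             // one value less to copy
  } while( l_n_left != 0 );
}
```

Every statement in the loop body corresponds to roughly one instruction, so the sketch can serve as a line-by-line recipe.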
8.2. Adding Two Arrays
Great, we are able to move data from A to B. Even better if we could process our data, don’t you think? Let’s do another simple example for this!
Assume that you have two memory addresses which are stored in the pointers l_a and l_b.
Each address is the start of some 64-bit unsigned integer values consecutively stored in memory.
For example, if you have 10 values, each array is 10 \(\times\) 64 bits = 640 bits large.
This is the same as 80 bytes per array or 160 bytes for all values together.
Now, our goal is to add the values in the two arrays l_a and l_b, and store the result at a third location in memory.
Getting the data into the general purpose registers and back to memory is simple: we just programmed a kernel for this in Section 8.1.
The only missing piece of the puzzle is an instruction which processes the data and effectively adds the values in two general purpose registers.
For this, we once again have a look at the base instructions of the ISA.
ADD (shifted register) is a suitable instruction.
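As a quick check of the size arithmetic above, the following small sketch computes the memory footprint of such an array. The helper name array_bytes is an assumption made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper (illustration only).
// Returns the size in bytes of an array which holds i_n_values
// 64-bit unsigned integers, i.e., 8 bytes per value.
size_t array_bytes( size_t i_n_values ) {
  return i_n_values * sizeof( uint64_t );
}
```

For 10 values this gives 80 bytes per array, and thus 160 bytes for the two input arrays together.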
#include <cstdint>
#include <cstdlib>
#include <iostream>

extern "C" {
  void add_c( uint64_t         i_n_values,
              uint64_t const * i_a,
              uint64_t const * i_b,
              uint64_t       * o_c );
  void add_asm( uint64_t         i_n_values,
                uint64_t const * i_a,
                uint64_t const * i_b,
                uint64_t       * o_c );
}

int main() {
  uint64_t l_n_values = 10;

  // init pointers
  uint64_t * l_a = nullptr;
  uint64_t * l_b = nullptr;
  uint64_t * l_c_0 = nullptr;
  uint64_t * l_c_1 = nullptr;

  // allocate memory
  l_a = (uint64_t *) new uint64_t[ l_n_values ];
  l_b = (uint64_t *) new uint64_t[ l_n_values ];
  l_c_0 = (uint64_t *) new uint64_t[ l_n_values ];
  l_c_1 = (uint64_t *) new uint64_t[ l_n_values ];

  // init arrays
  for( std::size_t l_va = 0; l_va < l_n_values; l_va++ ) {
    l_a[l_va] = l_va;
    l_b[l_va] = l_va*2;
    l_c_0[l_va] = 0;
    l_c_1[l_va] = 0;
  }

  // add_c
  std::cout << "### calling add_c ###" << std::endl;
  add_c( l_n_values,
         l_a,
         l_b,
         l_c_0 );

  for( std::size_t l_va = 0; l_va < l_n_values; l_va++ ) {
    std::cout << l_a[l_va] << " / " << l_b[l_va] << " / " << l_c_0[l_va] << std::endl;
  }

  // add_asm
  std::cout << "### calling add_asm ###" << std::endl;
  add_asm( l_n_values,
           l_a,
           l_b,
           l_c_1 );

  for( std::size_t l_va = 0; l_va < l_n_values; l_va++ ) {
    std::cout << l_a[l_va] << " / " << l_b[l_va] << " / " << l_c_1[l_va] << std::endl;
  }

  // free memory
  delete [] l_a;
  delete [] l_b;
  delete [] l_c_0;
  delete [] l_c_1;

  return EXIT_SUCCESS;
}
#include <stdint.h>

void add_c( uint64_t         i_n_values,
            uint64_t const * i_a,
            uint64_t const * i_b,
            uint64_t       * o_c ) {
  for( uint64_t l_va = 0; l_va < i_n_values; l_va++ ) {
    uint64_t l_tmp_a = i_a[l_va];
    uint64_t l_tmp_b = i_b[l_va];
    uint64_t l_tmp_c = l_tmp_a + l_tmp_b;
    o_c[l_va] = l_tmp_c;
  }
}
Once again, to supercharge your coding, a template for the required C++ driver is given in Listing 8.2.1. Further, a reference C implementation of the addition kernel is given in Listing 8.2.2. Thus, the only missing part is the assembly kernel: Time to get to work!
Tasks
Implement the function add_asm in assembly language and use the file add_asm.s for your implementation. Follow the ideas of the C implementation in Listing 8.2.2.
Compile the C kernel add_c given in Listing 8.2.2 using the optimization flag -O2. Disassemble the compiler-generated machine code. Briefly explain the obtained assembly code!
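If the indexed loop of the C reference feels too far away from assembly, a pointer-based formulation might be a useful intermediate step. The sketch below is not part of the lab material. The name add_c_ptr is an assumption, and so is the design choice to check for zero values up front and then count the remaining values down to zero.

```c
#include <stdint.h>

// Hypothetical pointer-based variant of add_c (illustration only).
// Adds the i_n_values entries of i_a and i_b and stores the sums in o_c.
void add_c_ptr( uint64_t         i_n_values,
                uint64_t const * i_a,
                uint64_t const * i_b,
                uint64_t       * o_c ) {
  // nothing to do for empty arrays: skip the loop entirely
  if( i_n_values == 0 ) return;

  do {
    uint64_t l_tmp_a = *i_a++;   // load from first array, then advance pointer
    uint64_t l_tmp_b = *i_b++;   // load from second array, then advance pointer
    *o_c++ = l_tmp_a + l_tmp_b;  // add in registers, store, then advance pointer
    i_n_values--;                // one value less to process
  } while( i_n_values != 0 );
}
```

The three pointers and the counter would each live in one general purpose register, and the loop condition becomes a single compare-and-branch on the counter.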
8.3. Computing Fibonacci Numbers
Let’s program something useful for a change 😂. The Fibonacci numbers are given by the following sequence:
\[ F_0 = 0, \quad F_1 = 1, \quad F_n = F_{n-1} + F_{n-2} \;\text{ for } n \geq 2. \]
Listing 8.3.1 provides the usual C++ driver. As shown in lines 6 and 7, the C and assembly functions take the id \(n\) as input and return the respective Fibonacci number, i.e., \(F_n\). Once again, we’ll get started by implementing a C function which is somewhat close to assembly code. This will then be our recipe for the assembly variant.
 1  #include <cstdint>
 2  #include <cstdlib>
 3  #include <iostream>
 4
 5  extern "C" {
 6    uint64_t fibonacci_c( uint64_t i_id );
 7    uint64_t fibonacci_asm( uint64_t i_id );
 8  }
 9
10  int main() {
11    uint64_t l_id = 5;
12    uint64_t l_number_0 = 0;
13    uint64_t l_number_1 = 0;
14
15    // fibonacci_c
16    std::cout << "### fibonacci_c ###" << std::endl;
17    l_number_0 = fibonacci_c( l_id );
18
19    std::cout << "id / number: " << l_id << " / " << l_number_0 << std::endl;
20
21    // fibonacci_asm
22    std::cout << "### fibonacci_asm ###" << std::endl;
23    l_number_1 = fibonacci_asm( l_id );
24
25    std::cout << "id / number: " << l_id << " / " << l_number_1 << std::endl;
26
27    return EXIT_SUCCESS;
28  }
Tasks
Implement the reference version fibonacci_c in the file fibonacci_c.c. Try to keep your implementation close to what you would do in assembly language.
Implement the assembly version fibonacci_asm in the file fibonacci_asm.s. Keep your implementation dynamic, i.e., the function should accept \(n\) as input argument. This is also underlined by the function declaration’s argument uint64_t i_id in line 7 of Listing 8.3.1.
Hint
Keep in mind the procedure call standard, i.e., the compiler will make the input i_id available in X0. You have to return the uint64_t result in X0 as well.
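One possible shape of such an assembly-like C function is sketched below. It is not the reference solution of the lab. It assumes the common indexing convention \(F_0 = 0\) and \(F_1 = 1\), and the local variable names are made up for illustration. The function only uses a handful of scalar variables and a countdown loop, so each variable can be mapped to one general purpose register.

```c
#include <stdint.h>

// Sketch of an iterative Fibonacci function (illustration only).
// Assumption: F_0 = 0 and F_1 = 1.
uint64_t fibonacci_c( uint64_t i_id ) {
  uint64_t l_prev = 0;  // holds F_0 initially
  uint64_t l_curr = 1;  // holds F_1 initially

  // F_0 is returned directly
  if( i_id == 0 ) return l_prev;

  // l_curr already holds F_1, so i_id - 1 steps remain
  i_id--;
  while( i_id != 0 ) {
    uint64_t l_next = l_prev + l_curr;  // next number of the sequence
    l_prev = l_curr;
    l_curr = l_next;
    i_id--;
  }

  return l_curr;
}
```

Under the assumed convention, results fit into 64 bits up to \(n = 93\); for larger ids the unsigned addition wraps around.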
8.4. Assembly Building Blocks
In this part of the lab, we’ll have a look at some common building blocks appearing in C/C++ code and formulate them in assembly language. For this, three source files and respective output are given:
#include <cstdint>

int32_t high_lvl_0( int32_t i_value );

uint64_t high_lvl_1( uint64_t );

int32_t high_lvl_2( int32_t i_option );

void high_lvl_3( int32_t * i_option,
                 int32_t * o_result );

uint32_t high_lvl_4( uint32_t i_x,
                     uint32_t i_y,
                     uint32_t i_z );

void high_lvl_5( uint32_t   i_nIters,
                 int32_t  * io_value );

void high_lvl_6( uint64_t   i_nIters,
                 int64_t    i_inc,
                 int64_t  * io_value );

void high_lvl_7( uint64_t   i_nValues,
                 int64_t  * i_valuesIn,
                 int64_t  * i_valuesOut );
#include "high_level.h"

int32_t high_lvl_0( int32_t i_value ) {
  return i_value;
}

uint64_t high_lvl_1( uint64_t ) {
  return 0;
}

int32_t high_lvl_2( int32_t i_option ) {
  int32_t l_result = 0;

  if( i_option < 32 ) {
    l_result = 1;
  }

  return l_result;
}

void high_lvl_3( int32_t * i_option,
                 int32_t * o_result ) {
  if( *i_option < 25 ) {
    *o_result = 1;
  }
  else {
    *o_result = 0;
  }
}

uint32_t high_lvl_4( uint32_t i_x,
                     uint32_t i_y,
                     uint32_t i_z ) {
  uint32_t l_ret = 0;

  if( i_x < i_y && i_x < i_z ) {
    l_ret = 1;
  }
  else if( i_y < i_z ) {
    l_ret = 2;
  }
  else {
    l_ret = 3;
  }

  return l_ret;
}

void high_lvl_5( uint32_t   i_nIters,
                 int32_t  * io_value ) {
  for( uint32_t l_i = 0; l_i < i_nIters; l_i++ ) {
    *io_value += 1;
  }
}

void high_lvl_6( uint64_t   i_nIters,
                 int64_t    i_inc,
                 int64_t  * io_value ) {
  uint64_t l_va = i_nIters;
  do {
    *io_value += i_inc;
    l_va--;
  } while( l_va != 0 );
}

void high_lvl_7( uint64_t   i_nValues,
                 int64_t  * i_valuesIn,
                 int64_t  * i_valuesOut ) {
  for( uint64_t l_va = 0; l_va < i_nValues; l_va++ ) {
    i_valuesOut[l_va] = i_valuesIn[l_va];
  }
}
#include <cstdlib>   // EXIT_SUCCESS
#include <iostream>
#include "high_level.h"

int main() {
  std::cout << "running driver" << std::endl;

  std::cout << "high_lvl_0(10): "
            << high_lvl_0( 10 )
            << std::endl;
  std::cout << "high_lvl_1(10): "
            << high_lvl_1( 10 ) << std::endl;
  std::cout << "high_lvl_2(32): "
            << high_lvl_2( 32 ) << std::endl;
  std::cout << "high_lvl_2( 5): "
            << high_lvl_2( 5 ) << std::endl;

  int32_t l_highLvlOpt3 = 17;
  int32_t l_highLvlRes3 = -1;
  high_lvl_3( &l_highLvlOpt3,
              &l_highLvlRes3 );
  std::cout << "high_lvl_3 #1: "
            << l_highLvlRes3 << std::endl;

  l_highLvlOpt3 = 43;
  high_lvl_3( &l_highLvlOpt3,
              &l_highLvlRes3 );
  std::cout << "high_lvl_3 #2: "
            << l_highLvlRes3 << std::endl;
  std::cout << "high_lvl_4(1,2,3): "
            << high_lvl_4( 1, 2, 3 ) << std::endl;
  std::cout << "high_lvl_4(4,2,3): "
            << high_lvl_4( 4, 2, 3 ) << std::endl;
  std::cout << "high_lvl_4(4,3,3): "
            << high_lvl_4( 4, 3, 3 ) << std::endl;

  int32_t l_highLvlValue5 = 500;
  high_lvl_5( 17,
              &l_highLvlValue5 );
  std::cout << "high_lvl_5: " << l_highLvlValue5 << std::endl;

  int64_t l_highLvlValue6 = 23;
  high_lvl_6( 5,
              13,
              &l_highLvlValue6 );
  std::cout << "high_lvl_6: "
            << l_highLvlValue6 << std::endl;

  int64_t l_highLvlVasIn7[10] = { 0, 7, 7, 4, 3,\
                                  -10, -50, 40, 2, 3 };
  int64_t l_highLvlVasOut7[10] = { 0 };
  high_lvl_7( 10,
              l_highLvlVasIn7,
              l_highLvlVasOut7 );

  std::cout << "high_lvl_7: "
            << l_highLvlVasOut7[0] << " / "
            << l_highLvlVasOut7[1] << " / "
            << l_highLvlVasOut7[2] << " / "
            << l_highLvlVasOut7[3] << " / "
            << l_highLvlVasOut7[4] << " / "
            << l_highLvlVasOut7[5] << " / "
            << l_highLvlVasOut7[6] << " / "
            << l_highLvlVasOut7[7] << " / "
            << l_highLvlVasOut7[8] << " / "
            << l_highLvlVasOut7[9] << std::endl;

  // low-level part goes here

  std::cout << "finished, exiting" << std::endl;
  return EXIT_SUCCESS;
}
running driver
high_lvl_0(10): 10
high_lvl_1(10): 0
high_lvl_2(32): 0
high_lvl_2( 5): 1
high_lvl_3 #1: 1
high_lvl_3 #2: 0
high_lvl_4(1,2,3): 1
high_lvl_4(4,2,3): 2
high_lvl_4(4,3,3): 3
high_lvl_5: 517
high_lvl_6: 88
high_lvl_7: 0 / 7 / 7 / 4 / 3 / -10 / -50 / 40 / 2 / 3
finished, exiting
Tasks
Explain in 1-2 sentences what each of the eight functions does.
Implement the functions in assembly language. Use the file names low_level.h and low_level.s and matching names for the functions, i.e., low_lvl_0, low_lvl_1, …, low_lvl_7. Document your source code extensively!
Verify your low-level versions by extending the driver.