6. Assembly Building Blocks
Several code constructs appear frequently in source code, for example conditional execution based on if-then-else statements, or loops. In Section 6.1 we’ll implement some of these building blocks in assembly code. This not only extends our toolbox but also helps us understand how their high-level equivalents behave on the hardware.
Section 6.2 interprets “code constructs” in the sense of BLAS3. Here, we’ll develop a microkernel for small GEMMs using the Advanced SIMD (ASIMD) extension. This microkernel may serve as a building block for more general GEMMs harnessing ASIMD. However, in this class, we’ll only generalize the kernel in the contraction dimension \(K\). For a more general GEMM implementation we’ll move to the Scalable Vector Extension (SVE) in Section 7. This allows us to get a solid understanding of the differences between ASIMD and SVE.
6.1. Conditions and Loops
First, we’ll have a look at some small C/C++ functions and formulate them in assembly language. For this, three source files and the corresponding output are given:
// high_level.h
#include <cstdint>

int32_t high_lvl_0( int32_t i_value );

uint64_t high_lvl_1( uint64_t );

int32_t high_lvl_2( int32_t i_option );

void high_lvl_3( int32_t * i_option,
                 int32_t * o_result );

uint32_t high_lvl_4( uint32_t i_x,
                     uint32_t i_y,
                     uint32_t i_z );

void high_lvl_5( uint32_t i_nIters,
                 int32_t * io_value );

void high_lvl_6( uint64_t i_nIters,
                 int64_t i_inc,
                 int64_t * io_value );

void high_lvl_7( uint64_t i_nValues,
                 int64_t * i_valuesIn,
                 int64_t * i_valuesOut );
// high_level.cpp
#include "high_level.h"

int32_t high_lvl_0( int32_t i_value ) {
  return i_value;
}

uint64_t high_lvl_1( uint64_t ) {
  return 0;
}

int32_t high_lvl_2( int32_t i_option ) {
  int32_t l_result = 0;

  if( i_option < 32 ) {
    l_result = 1;
  }

  return l_result;
}

void high_lvl_3( int32_t * i_option,
                 int32_t * o_result ) {
  if( *i_option < 25 ) {
    *o_result = 1;
  }
  else {
    *o_result = 0;
  }
}

uint32_t high_lvl_4( uint32_t i_x,
                     uint32_t i_y,
                     uint32_t i_z ) {
  uint32_t l_ret = 0;

  if( i_x < i_y && i_x < i_z ) {
    l_ret = 1;
  }
  else if( i_y < i_z ) {
    l_ret = 2;
  }
  else {
    l_ret = 3;
  }

  return l_ret;
}

void high_lvl_5( uint32_t i_nIters,
                 int32_t * io_value ) {
  for( uint32_t l_i = 0; l_i < i_nIters; l_i++ ) {
    *io_value += 1;
  }
}

void high_lvl_6( uint64_t i_nIters,
                 int64_t i_inc,
                 int64_t * io_value ) {
  uint64_t l_va = i_nIters;
  do {
    *io_value += i_inc;
    l_va--;
  } while( l_va != 0 );
}

void high_lvl_7( uint64_t i_nValues,
                 int64_t * i_valuesIn,
                 int64_t * i_valuesOut ) {
  for( uint64_t l_va = 0; l_va < i_nValues; l_va++ ) {
    i_valuesOut[l_va] = i_valuesIn[l_va];
  }
}
// driver
#include <cstdlib>
#include <iostream>
#include "high_level.h"

int main() {
  std::cout << "running driver" << std::endl;

  std::cout << "high_lvl_0(10): "
            << high_lvl_0( 10 )
            << std::endl;
  std::cout << "high_lvl_1(10): "
            << high_lvl_1( 10 ) << std::endl;
  std::cout << "high_lvl_2(32): "
            << high_lvl_2( 32 ) << std::endl;
  std::cout << "high_lvl_2( 5): "
            << high_lvl_2( 5 ) << std::endl;

  int32_t l_highLvlOpt3 = 17;
  int32_t l_highLvlRes3 = -1;
  high_lvl_3( &l_highLvlOpt3,
              &l_highLvlRes3 );
  std::cout << "high_lvl_3 #1: "
            << l_highLvlRes3 << std::endl;

  l_highLvlOpt3 = 43;
  high_lvl_3( &l_highLvlOpt3,
              &l_highLvlRes3 );
  std::cout << "high_lvl_3 #2: "
            << l_highLvlRes3 << std::endl;
  std::cout << "high_lvl_4(1,2,3): "
            << high_lvl_4( 1, 2, 3 ) << std::endl;
  std::cout << "high_lvl_4(4,2,3): "
            << high_lvl_4( 4, 2, 3 ) << std::endl;
  std::cout << "high_lvl_4(4,3,3): "
            << high_lvl_4( 4, 3, 3 ) << std::endl;

  int32_t l_highLvlValue5 = 500;
  high_lvl_5( 17,
              &l_highLvlValue5 );
  std::cout << "high_lvl_5: " << l_highLvlValue5 << std::endl;

  int64_t l_highLvlValue6 = 23;
  high_lvl_6( 5,
              13,
              &l_highLvlValue6 );
  std::cout << "high_lvl_6: "
            << l_highLvlValue6 << std::endl;

  int64_t l_highLvlVasIn7[10] = { 0, 7, 7, 4, 3,
                                  -10, -50, 40, 2, 3 };
  int64_t l_highLvlVasOut7[10] = { 0 };
  high_lvl_7( 10,
              l_highLvlVasIn7,
              l_highLvlVasOut7 );

  std::cout << "high_lvl_7: "
            << l_highLvlVasOut7[0] << " / "
            << l_highLvlVasOut7[1] << " / "
            << l_highLvlVasOut7[2] << " / "
            << l_highLvlVasOut7[3] << " / "
            << l_highLvlVasOut7[4] << " / "
            << l_highLvlVasOut7[5] << " / "
            << l_highLvlVasOut7[6] << " / "
            << l_highLvlVasOut7[7] << " / "
            << l_highLvlVasOut7[8] << " / "
            << l_highLvlVasOut7[9] << std::endl;

  // low-level part goes here

  std::cout << "finished, exiting" << std::endl;
  return EXIT_SUCCESS;
}
running driver
high_lvl_0(10): 10
high_lvl_1(10): 0
high_lvl_2(32): 0
high_lvl_2( 5): 1
high_lvl_3 #1: 1
high_lvl_3 #2: 0
high_lvl_4(1,2,3): 1
high_lvl_4(4,2,3): 2
high_lvl_4(4,3,3): 3
high_lvl_5: 517
high_lvl_6: 88
high_lvl_7: 0 / 7 / 7 / 4 / 3 / -10 / -50 / 40 / 2 / 3
finished, exiting
Tasks
Explain in 1-2 sentences what each of the eight functions does.
Implement the functions in assembly language. Use the file names low_level.h and low_level.cpp, and matching names for the functions, i.e., low_lvl_0, low_lvl_1, …, low_lvl_7.
Verify your low-level versions by extending the driver.
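To give an idea of what such a translation looks like, one possible hand-written AArch64 version of the conditional in high_lvl_2 is sketched below. This is only a sketch, one of many valid translations; the label names and register choices are our own:

```
    .text
    .type low_lvl_2, %function
    .global low_lvl_2
low_lvl_2:
    mov w1, #0           // l_result = 0
    cmp w0, #32          // compare i_option with 32
    b.ge low_lvl_2_end   // signed comparison: skip if i_option >= 32
    mov w1, #1           // l_result = 1
low_lvl_2_end:
    mov w0, w1           // return l_result in w0
    ret
    .size low_lvl_2, (. - low_lvl_2)
```

Alternatively, a conditional-set instruction avoids the branch entirely: cmp w0, #32 followed by cset w0, lt writes 1 to w0 if the signed comparison yielded “less than” and 0 otherwise.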
6.2. Small GEMM: ASIMD
In this part of the lab we’ll write our first high-performance matrix kernel relying on floating point math.
Our targeted kernel gemm_asm_asimd_16_6_1 has the following signature:
void gemm_asm_asimd_16_6_1( float const * i_a,
float const * i_b,
float * io_c );
and performs the operation \(C \mathrel{+}= A B\) on 32-bit floating-point data with \(C \in \mathbb{R}^{16 \times 6}\), \(A \in \mathbb{R}^{16 \times 1}\) and \(B \in \mathbb{R}^{1 \times 6}\).
In our implementation we’ll completely unroll the kernel, i.e., write every instruction explicitly without adding any control structures such as loops. Once done, we’ll add a \(K\) loop to increase the size of the kernel’s contraction dimension.
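Before moving to assembly, it helps to pin down the kernel’s exact semantics in plain C++. The following reference is our own sketch, assuming column-major storage with ldA=16, ldB=1, ldC=16; the name gemm_ref_16_6_1 is not part of the assignment, but such a function is handy later for verifying the assembly version:

```cpp
#include <cstdint>

// Reference for C += A * B with the shapes (16x6) = (16x1) * (1x6),
// column-major storage, ldA=16, ldB=1, ldC=16.
void gemm_ref_16_6_1( float const * i_a,
                      float const * i_b,
                      float       * io_c ) {
  for( uint64_t l_n = 0; l_n < 6; l_n++ ) {
    for( uint64_t l_m = 0; l_m < 16; l_m++ ) {
      for( uint64_t l_k = 0; l_k < 1; l_k++ ) {
        io_c[ l_n * 16 + l_m ] += i_a[ l_k * 16 + l_m ]
                                * i_b[ l_n * 1  + l_k ];
      }
    }
  }
}
```

Raising the K-loop bound to 48 and indexing B with a leading dimension of 48, i.e. i_b[ l_n * 48 + l_k ], yields the reference for the looped kernel added later.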
As with the general-purpose registers, we have to preserve some SIMD registers to adhere to the procedure call standard AAPCS64. The template in Listing 6.2.1 extends the one we used before when working solely on general-purpose registers: now, we also write the lowest 64 bits of registers V8-V15 to the stack and restore them at the end of the function.
Listing 6.2.1:
        .text
        .type gemm_asm_asimd_16_6_1, %function
        .global gemm_asm_asimd_16_6_1
        /*
         * Performs the matrix-multiplication C+=A*B
         * with the shapes (16x6) = (16x1) * (1x6).
         * The input-data is of type float.
         *
         * @param x0 pointer to A.
         * @param x1 pointer to B.
         * @param x2 pointer to C.
         */
gemm_asm_asimd_16_6_1:
        // store
        stp x19, x20, [sp, #-16]!
        stp x21, x22, [sp, #-16]!
        stp x23, x24, [sp, #-16]!
        stp x25, x26, [sp, #-16]!
        stp x27, x28, [sp, #-16]!
        stp x29, x30, [sp, #-16]!

        stp  d8,  d9, [sp, #-16]!
        stp d10, d11, [sp, #-16]!
        stp d12, d13, [sp, #-16]!
        stp d14, d15, [sp, #-16]!

        // your matrix kernel goes here!

        // restore
        ldp d14, d15, [sp], #16
        ldp d12, d13, [sp], #16
        ldp d10, d11, [sp], #16
        ldp  d8,  d9, [sp], #16

        ldp x29, x30, [sp], #16
        ldp x27, x28, [sp], #16
        ldp x25, x26, [sp], #16
        ldp x23, x24, [sp], #16
        ldp x21, x22, [sp], #16
        ldp x19, x20, [sp], #16

        ret
        .size gemm_asm_asimd_16_6_1, (. - gemm_asm_asimd_16_6_1)
Since we are hunting for performance, we’ll do a little competition. Prize? The three best-performing teams will receive an honorable mention on the Performance Board of this homepage 😉.
Rules
Respect the procedure call standard. But: if you don’t modify a register, you don’t have to save it to the stack.
Verify your kernels.
Run your kernels for at least 1 second in your performance measurements by repeating their execution.
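One way to satisfy the one-second rule is to double the number of repetitions until the measured wall-clock time exceeds the threshold. The following harness is a sketch of our own; gemm_kernel is a hypothetical C++ stand-in for the assembly kernel under test, and the names benchmark and BenchResult are not part of the assignment:

```cpp
#include <chrono>
#include <cstdint>

// Stand-in for the assembly kernel under test (hypothetical placeholder;
// in the lab this would call gemm_asm_asimd_16_6_1).
void gemm_kernel( float const * i_a,
                  float const * i_b,
                  float       * io_c ) {
  for( int l_n = 0; l_n < 6; l_n++ )
    for( int l_m = 0; l_m < 16; l_m++ )
      io_c[l_n*16 + l_m] += i_a[l_m] * i_b[l_n];
}

struct BenchResult {
  uint64_t reps;    // number of kernel executions
  double   seconds; // measured wall-clock time
  double   gflops;  // achieved GFLOPS
};

// Doubles the number of repetitions until a timed run
// takes at least i_minSeconds of wall-clock time.
BenchResult benchmark( double i_minSeconds ) {
  float l_a[16] = {};
  float l_b[6]  = {};
  float l_c[96] = {};

  BenchResult l_res{ 1, 0.0, 0.0 };
  while( l_res.seconds < i_minSeconds ) {
    l_res.reps *= 2;
    auto l_start = std::chrono::steady_clock::now();
    for( uint64_t l_re = 0; l_re < l_res.reps; l_re++ ) {
      gemm_kernel( l_a, l_b, l_c );
    }
    auto l_end = std::chrono::steady_clock::now();
    l_res.seconds = std::chrono::duration<double>( l_end - l_start ).count();
  }

  // keep the result alive so the compiler cannot discard the kernel calls
  volatile float l_sink = l_c[0];
  (void) l_sink;

  // 2*M*N*K floating point operations per call; here M=16, N=6, K=1
  l_res.gflops = 2.0 * 16 * 6 * 1 * l_res.reps / l_res.seconds * 1e-9;
  return l_res;
}
```

For the competition measurements, i_minSeconds would be set to 1.0; the reported "#executions" then corresponds to the final repetition count of the timed run.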
Currently, the best-performing teams are Alex’s ASM and LIBXSMM:
Table 6.2.1 (unrolled kernel, K=1):

| Team | Time (s) | #executions | GFLOPS | %peak |
|---|---|---|---|---|
| Alex’s ASM | 1.287 | 100000000 | 14.92 | 23.3 |
| LIBXSMM, 59410c81 (ASIMD) | 1.811 | 100000000 | 10.60 | 16.6 |

Table 6.2.2 (kernel with K loop, K=48):

| Team | Time (s) | #executions | GFLOPS | %peak |
|---|---|---|---|---|
| Alex’s ASM | 18.08 | 100000000 | 50.97 | 79.6 |
| LIBXSMM, 59410c81 (ASIMD) | 18.74 | 100000000 | 49.17 | 76.8 |
Tasks
Implement and verify the unrolled matrix kernel C += AB for M=16, N=6, K=1, ldA=16, ldB=1, ldC=16.
Tune your kernel to squeeze more performance out of a core. You may change everything, e.g., the type of the used instructions or the order of the used instructions but have to follow the rules above. Report and document your optimizations.
Add a loop over K to realize C += AB for M=16, N=6, K=48, ldA=16, ldB=48, ldC=16.
Come up with a creative team name and submit it together with your entries for “Time (s)”, “#executions”, “GFLOPS” and “%peak” in Table 6.2.1 and Table 6.2.2. Assume a theoretical single-core peak of 64 GFLOPS for the used c7g.xlarge instance.
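With 2*M*N*K floating point operations per kernel call, the "GFLOPS" and "%peak" entries follow directly from the measured time and execution count. The helper functions below are our own illustrative names, not part of the assignment:

```cpp
#include <cstdint>

// achieved GFLOPS for a GEMM kernel of shape (M, N, K) that was
// executed i_nExecutions times in i_timeS seconds
double gflops( uint64_t i_m, uint64_t i_n, uint64_t i_k,
               uint64_t i_nExecutions, double i_timeS ) {
  // 2*M*N*K flops per call (one multiply and one add per entry update)
  return 2.0 * i_m * i_n * i_k * i_nExecutions / i_timeS * 1e-9;
}

// share of the theoretical peak, in percent
double percent_peak( double i_gflops, double i_peakGflops ) {
  return i_gflops / i_peakGflops * 100.0;
}
```

As a cross-check against the tables: gflops(16, 6, 48, 100000000, 18.08) gives about 50.97 GFLOPS, and percent_peak of that value w.r.t. the 64 GFLOPS peak gives about 79.6%, matching the Alex’s ASM row of Table 6.2.2.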