6. Assembly Building Blocks

Several code constructs appear frequently in source code, for example conditional execution via if-then-else statements, or loops. In Section 6.1 we’ll implement some of these building blocks in assembly code. This not only extends our toolbox but also helps us understand how their high-level equivalents map to hardware.

Section 6.2 interprets “code constructs” in the sense of BLAS3. Here, we’ll develop a microkernel for small GEMMs using the Advanced SIMD (ASIMD) extension. This microkernel may serve as a building block for more general GEMMs harnessing ASIMD. However, in this class we’ll only generalize the kernel in the contraction dimension \(K\). We’ll move to the Scalable Vector Extension (SVE) for a more general GEMM implementation in Section 7. This allows us to get a solid understanding of the differences between ASIMD and SVE.

6.1. Conditions and Loops

First, we’ll have a look at some small C/C++ functions and formulate them in assembly language. For this, three source files and the respective output are given:

Listing 6.1.1 File high_level.h which defines the C/C++ functions’ signatures.
#include <cstdint>

int32_t high_lvl_0( int32_t i_value );

uint64_t high_lvl_1( uint64_t );

int32_t high_lvl_2( int32_t i_option );

void high_lvl_3( int32_t * i_option,
                 int32_t * o_result );

uint32_t high_lvl_4( uint32_t i_x,
                     uint32_t i_y,
                     uint32_t i_z );

void high_lvl_5( uint32_t   i_nIters,
                 int32_t  * io_value );

void high_lvl_6( uint64_t   i_nIters,
                 int64_t    i_inc,
                 int64_t  * io_value );

void high_lvl_7( uint64_t   i_nValues,
                 int64_t  * i_valuesIn,
                 int64_t  * i_valuesOut );
Listing 6.1.2 File high_level.cpp which implements the C/C++ functions.
#include "high_level.h"

int32_t high_lvl_0( int32_t i_value ) {
  return i_value;
}

uint64_t high_lvl_1( uint64_t ) {
  return 0;
}

int32_t high_lvl_2( int32_t i_option ) {
  int32_t l_result = 0;

  if( i_option < 32 ) {
    l_result = 1;
  }

  return l_result;
}

void high_lvl_3( int32_t * i_option,
                 int32_t * o_result ) {
  if( *i_option < 25 ) {
    *o_result = 1;
  }
  else {
    *o_result = 0;
  }
}

uint32_t high_lvl_4( uint32_t i_x,
                     uint32_t i_y,
                     uint32_t i_z ) {
  uint32_t l_ret = 0;

  if( i_x < i_y && i_x < i_z ) {
    l_ret = 1;
  }
  else if( i_y < i_z ) {
    l_ret = 2;
  }
  else {
    l_ret = 3;
  }

  return l_ret;
}

void high_lvl_5( uint32_t   i_nIters,
                 int32_t  * io_value ) {
  for( uint32_t l_i = 0; l_i < i_nIters; l_i++ ) {
    *io_value += 1;
  }
}

void high_lvl_6( uint64_t   i_nIters,
                 int64_t    i_inc,
                 int64_t  * io_value ) {
  uint64_t l_va = i_nIters;
  do {
    *io_value += i_inc;
    l_va--;
  } while( l_va != 0 );
}

void high_lvl_7( uint64_t   i_nValues,
                 int64_t  * i_valuesIn,
                 int64_t  * i_valuesOut ) {
  for( uint64_t l_va = 0; l_va < i_nValues; l_va++ ) {
    i_valuesOut[l_va] = i_valuesIn[l_va];
  }
}
Listing 6.1.3 File driver.cpp which calls the given C/C++ functions.
#include <cstdlib>
#include <iostream>
#include "high_level.h"

int main() {
  std::cout << "running driver" << std::endl;

  std::cout << "high_lvl_0(10): "
            << high_lvl_0( 10 )
            << std::endl;
  std::cout << "high_lvl_1(10): "
            << high_lvl_1( 10 ) << std::endl;
  std::cout << "high_lvl_2(32): "
            << high_lvl_2( 32 ) << std::endl;
  std::cout << "high_lvl_2( 5): "
            << high_lvl_2(  5 ) << std::endl;

  int32_t l_highLvlOpt3 = 17;
  int32_t l_highLvlRes3 = -1;
  high_lvl_3( &l_highLvlOpt3,
              &l_highLvlRes3 );
  std::cout << "high_lvl_3 #1: "
            << l_highLvlRes3 << std::endl;

  l_highLvlOpt3 = 43;
  high_lvl_3( &l_highLvlOpt3,
              &l_highLvlRes3 );
  std::cout << "high_lvl_3 #2: "
            << l_highLvlRes3 << std::endl;
  std::cout << "high_lvl_4(1,2,3): "
            << high_lvl_4( 1, 2, 3 ) << std::endl;
  std::cout << "high_lvl_4(4,2,3): "
            << high_lvl_4( 4, 2, 3 ) << std::endl;
  std::cout << "high_lvl_4(4,3,3): "
            << high_lvl_4( 4, 3, 3 ) << std::endl;

  int32_t l_highLvlValue5 = 500;
  high_lvl_5(  17,
              &l_highLvlValue5 );
  std::cout << "high_lvl_5: " << l_highLvlValue5 << std::endl;

  int64_t l_highLvlValue6 = 23;
  high_lvl_6( 5,
              13,
              &l_highLvlValue6 );
  std::cout << "high_lvl_6: "
            << l_highLvlValue6 << std::endl;

  int64_t l_highLvlVasIn7[10] = { 0, 7, 7, 4, 3,
                                 -10, -50, 40, 2, 3 };
  int64_t l_highLvlVasOut7[10] = { 0 };
  high_lvl_7( 10,
              l_highLvlVasIn7,
              l_highLvlVasOut7 );

  std::cout << "high_lvl_7: "
            << l_highLvlVasOut7[0] << " / "
            << l_highLvlVasOut7[1] << " / "
            << l_highLvlVasOut7[2] << " / "
            << l_highLvlVasOut7[3] << " / "
            << l_highLvlVasOut7[4] << " / "
            << l_highLvlVasOut7[5] << " / "
            << l_highLvlVasOut7[6] << " / "
            << l_highLvlVasOut7[7] << " / "
            << l_highLvlVasOut7[8] << " / "
            << l_highLvlVasOut7[9] << std::endl;

  // low-level part goes here

  std::cout << "finished, exiting" << std::endl;
  return EXIT_SUCCESS;
}
Listing 6.1.4 Output when running the high-level implementation.
running driver
high_lvl_0(10): 10
high_lvl_1(10): 0
high_lvl_2(32): 0
high_lvl_2( 5): 1
high_lvl_3 #1: 1
high_lvl_3 #2: 0
high_lvl_4(1,2,3): 1
high_lvl_4(4,2,3): 2
high_lvl_4(4,3,3): 3
high_lvl_5: 517
high_lvl_6: 88
high_lvl_7: 0 / 7 / 7 / 4 / 3 / -10 / -50 / 40 / 2 / 3
finished, exiting

Tasks

  1. Explain in 1-2 sentences what each of the eight functions does.

  2. Implement the functions in assembly language. Use the file names low_level.h and low_level.cpp and matching names for the functions, i.e., low_lvl_0, low_lvl_1, …, low_lvl_7.

  3. Verify your low-level versions by extending the driver.

6.2. Small GEMM: ASIMD

In this part of the lab we’ll write our first high-performance matrix kernel relying on floating-point math. Our targeted kernel gemm_asm_asimd_16_6_1 has the following signature:

void gemm_asm_asimd_16_6_1( float const * i_a,
                            float const * i_b,
                            float       * io_c );

and performs the operation

\[C \mathrel{+}= A B\]

on 32-bit floating-point data with

\[M = 16, \; N = 6, \; K = 1, \quad ldA = 16, \; ldB = 1, \; ldC = 16.\]

In our implementation we’ll completely unroll the kernel, i.e., write every instruction explicitly without adding any control structures such as loops. Once done, we’ll add a \(K\) loop to increase the size of the kernel’s contraction dimension.

As with the general-purpose registers, we have to preserve some SIMD registers to adhere to the procedure call standard AAPCS64. The template in Listing 6.2.1 extends the one we had before when working solely on general-purpose registers. Now, we also write the lowest 64 bits of registers V8-V15 to the stack and restore them at the end of the function.

Listing 6.2.1 Template for the \((16\times6)=(16\times1)(1\times6)\) ASIMD matrix kernel. The template temporarily saves the general purpose registers X19-X30 and the lowest 64 bits of the SIMD registers V8-V15 on the stack.
        .text
        .type gemm_asm_asimd_16_6_1, %function
        .global gemm_asm_asimd_16_6_1
        /*
         * Performs the matrix-multiplication C+=A*B
         * with the shapes (16x6) = (16x1) * (1x6).
         * The input-data is of type float.
         *
         * @param x0 pointer to A.
         * @param x1 pointer to B.
         * @param x2 pointer to C.
         */
gemm_asm_asimd_16_6_1:
        // store
        stp x19, x20, [sp, #-16]!
        stp x21, x22, [sp, #-16]!
        stp x23, x24, [sp, #-16]!
        stp x25, x26, [sp, #-16]!
        stp x27, x28, [sp, #-16]!
        stp x29, x30, [sp, #-16]!

        stp  d8,  d9, [sp, #-16]!
        stp d10, d11, [sp, #-16]!
        stp d12, d13, [sp, #-16]!
        stp d14, d15, [sp, #-16]!

        // your matrix kernel goes here!

        // restore
        ldp d14, d15, [sp], #16
        ldp d12, d13, [sp], #16
        ldp d10, d11, [sp], #16
        ldp  d8,  d9, [sp], #16

        ldp x29, x30, [sp], #16
        ldp x27, x28, [sp], #16
        ldp x25, x26, [sp], #16
        ldp x23, x24, [sp], #16
        ldp x21, x22, [sp], #16
        ldp x19, x20, [sp], #16

        ret
        .size gemm_asm_asimd_16_6_1, (. - gemm_asm_asimd_16_6_1)

Since we are hunting for performance, we’ll do a little competition. Prize? The three best-performing teams will receive an honorable mention on the Performance Board of this homepage 😉.

Rules

  • Respect the procedure call standard. But: if you don’t modify a register, you don’t have to save it to the stack.

  • Verify your kernels.

  • Run your kernels for at least 1 second in your performance measurements by executing them repeatedly.

Currently, the best-performing teams are Alex’s ASM and LIBXSMM:

Table 6.2.1 Sustained performance on the Graviton3 processor for the single precision matrix kernel C+=AB with M=16, N=6, K=1, ldA=16, ldB=1, ldC=16.

  Team                       | Time (s) | #executions | GFLOPS | %peak
  ---------------------------|----------|-------------|--------|------
  Alex’s ASM                 | 1.287    | 100000000   | 14.92  | 23.3
  LIBXSMM, 59410c81 (ASIMD)  | 1.811    | 100000000   | 10.60  | 16.6

Table 6.2.2 Sustained performance on the Graviton3 processor for the single precision matrix kernel C+=AB with M=16, N=6, K=48, ldA=16, ldB=48, ldC=16.

  Team                       | Time (s) | #executions | GFLOPS | %peak
  ---------------------------|----------|-------------|--------|------
  Alex’s ASM                 | 18.08    | 100000000   | 50.97  | 79.6
  LIBXSMM, 59410c81 (ASIMD)  | 18.74    | 100000000   | 49.17  | 76.8

Tasks

  1. Implement and verify the unrolled matrix kernel C += AB for M=16, N=6, K=1, ldA=16, ldB=1, ldC=16.

  2. Tune your kernel to squeeze more performance out of a core. You may change everything, e.g., the type or order of the instructions used, but you have to follow the rules above. Report and document your optimizations.

  3. Add a loop over K to realize C += AB for M=16, N=6, K=48, ldA=16, ldB=48, ldC=16.

  4. Come up with a creative team name and submit it together with your entries for “Time (s)”, “#executions”, “GFLOPS” and “%peak” in Table 6.2.1 and Table 6.2.2. Assume a theoretical single-core peak of 64 GFLOPS for the c7g.xlarge instance used.