6. Assembly Building Blocks

Several code constructs appear frequently in source code, for example conditional execution via if-then-else statements, or loops. In Section 6.1 we’ll implement some of these building blocks in assembly code. This not only extends our toolbox but also helps us understand how their high-level equivalents map to hardware.

Section 6.2 interprets “code constructs” in the sense of BLAS3. Here, we’ll develop a microkernel for small GEMMs using the Advanced SIMD (ASIMD) extension. This microkernel may serve as a building block for more general GEMMs harnessing ASIMD. However, in this class we’ll only generalize the kernel in the contraction dimension \(K\). We’ll move to the Scalable Vector Extension (SVE) for a more general GEMM implementation in Section 7. This allows us to get a solid understanding of the differences between ASIMD and SVE.

6.1. Conditions and Loops

First, we’ll have a look at some small C/C++ functions and formulate them in assembly language. For this, three source files and the respective output are given:

Listing 6.1.1 File high_level.h which defines the C/C++ functions’ signatures.
#include <cstdint>

int32_t high_lvl_0( int32_t i_value );

uint64_t high_lvl_1( uint64_t );

int32_t high_lvl_2( int32_t i_option );

void high_lvl_3( int32_t * i_option,
                 int32_t * o_result );

uint32_t high_lvl_4( uint32_t i_x,
                     uint32_t i_y,
                     uint32_t i_z );

void high_lvl_5( uint32_t   i_nIters,
                 int32_t  * io_value );

void high_lvl_6( uint64_t   i_nIters,
                 int64_t    i_inc,
                 int64_t  * io_value );

void high_lvl_7( uint64_t   i_nValues,
                 int64_t  * i_valuesIn,
                 int64_t  * i_valuesOut );
Listing 6.1.2 File high_level.cpp which implements the C/C++ functions.
#include "high_level.h"

int32_t high_lvl_0( int32_t i_value ) {
  return i_value;
}

uint64_t high_lvl_1( uint64_t ) {
  return 0;
}

int32_t high_lvl_2( int32_t i_option ) {
  int32_t l_result = 0;

  if( i_option < 32 ) {
    l_result = 1;
  }

  return l_result;
}

void high_lvl_3( int32_t * i_option,
                 int32_t * o_result ) {
  if( *i_option < 25 ) {
    *o_result = 1;
  }
  else {
    *o_result = 0;
  }
}

uint32_t high_lvl_4( uint32_t i_x,
                     uint32_t i_y,
                     uint32_t i_z ) {
  uint32_t l_ret = 0;

  if( i_x < i_y && i_x < i_z ) {
    l_ret = 1;
  }
  else if( i_y < i_z ) {
    l_ret = 2;
  }
  else {
    l_ret = 3;
  }

  return l_ret;
}

void high_lvl_5( uint32_t   i_nIters,
                 int32_t  * io_value ) {
  for( uint32_t l_i = 0; l_i < i_nIters; l_i++ ) {
    *io_value += 1;
  }
}

void high_lvl_6( uint64_t   i_nIters,
                 int64_t    i_inc,
                 int64_t  * io_value ) {
  uint64_t l_va = i_nIters;
  do {
    *io_value += i_inc;
    l_va--;
  } while( l_va != 0 );
}

void high_lvl_7( uint64_t   i_nValues,
                 int64_t  * i_valuesIn,
                 int64_t  * i_valuesOut ) {
  for( uint64_t l_va = 0; l_va < i_nValues; l_va++ ) {
    i_valuesOut[l_va] = i_valuesIn[l_va];
  }
}
Listing 6.1.3 File driver.cpp which calls the given C/C++ functions.
#include <cstdlib>
#include <iostream>
#include "high_level.h"

int main() {
  std::cout << "running driver" << std::endl;

  std::cout << "high_lvl_0(10): "
            << high_lvl_0( 10 )
            << std::endl;
  std::cout << "high_lvl_1(10): "
            << high_lvl_1( 10 ) << std::endl;
  std::cout << "high_lvl_2(32): "
            << high_lvl_2( 32 ) << std::endl;
  std::cout << "high_lvl_2( 5): "
            << high_lvl_2(  5 ) << std::endl;

  int32_t l_highLvlOpt3 = 17;
  int32_t l_highLvlRes3 = -1;
  high_lvl_3( &l_highLvlOpt3,
              &l_highLvlRes3 );
  std::cout << "high_lvl_3 #1: "
            << l_highLvlRes3 << std::endl;

  l_highLvlOpt3 = 43;
  high_lvl_3( &l_highLvlOpt3,
              &l_highLvlRes3 );
  std::cout << "high_lvl_3 #2: "
            << l_highLvlRes3 << std::endl;
  std::cout << "high_lvl_4(1,2,3): "
            << high_lvl_4( 1, 2, 3 ) << std::endl;
  std::cout << "high_lvl_4(4,2,3): "
            << high_lvl_4( 4, 2, 3 ) << std::endl;
  std::cout << "high_lvl_4(4,3,3): "
            << high_lvl_4( 4, 3, 3 ) << std::endl;

  int32_t l_highLvlValue5 = 500;
  high_lvl_5(  17,
              &l_highLvlValue5 );
  std::cout << "high_lvl_5: " << l_highLvlValue5 << std::endl;

  int64_t l_highLvlValue6 = 23;
  high_lvl_6( 5,
              13,
              &l_highLvlValue6 );
  std::cout << "high_lvl_6: "
            << l_highLvlValue6 << std::endl;

  int64_t l_highLvlVasIn7[10] = { 0, 7, 7, 4, 3,
                                 -10, -50, 40, 2, 3 };
  int64_t l_highLvlVasOut7[10] = { 0 };
  high_lvl_7( 10,
              l_highLvlVasIn7,
              l_highLvlVasOut7 );

  std::cout << "high_lvl_7: "
            << l_highLvlVasOut7[0] << " / "
            << l_highLvlVasOut7[1] << " / "
            << l_highLvlVasOut7[2] << " / "
            << l_highLvlVasOut7[3] << " / "
            << l_highLvlVasOut7[4] << " / "
            << l_highLvlVasOut7[5] << " / "
            << l_highLvlVasOut7[6] << " / "
            << l_highLvlVasOut7[7] << " / "
            << l_highLvlVasOut7[8] << " / "
            << l_highLvlVasOut7[9] << std::endl;

  // low-level part goes here

  std::cout << "finished, exiting" << std::endl;
  return EXIT_SUCCESS;
}
Listing 6.1.4 Output when running the high-level implementation.
running driver
high_lvl_0(10): 10
high_lvl_1(10): 0
high_lvl_2(32): 0
high_lvl_2( 5): 1
high_lvl_3 #1: 1
high_lvl_3 #2: 0
high_lvl_4(1,2,3): 1
high_lvl_4(4,2,3): 2
high_lvl_4(4,3,3): 3
high_lvl_5: 517
high_lvl_6: 88
high_lvl_7: 0 / 7 / 7 / 4 / 3 / -10 / -50 / 40 / 2 / 3
finished, exiting

Tasks

  1. Explain in 1-2 sentences what each of the eight functions does.

  2. Implement the functions in assembly language. Use the file names low_level.h and low_level.cpp and matching names for the functions, i.e., low_lvl_0, low_lvl_1, …, low_lvl_7.

  3. Verify your low-level versions by extending the driver.

6.2. Small GEMM: ASIMD

In this part of the lab we’ll write our first high-performance matrix kernel relying on floating-point math. Our targeted kernel gemm_asm_asimd_16_6_1 has the following signature:

void gemm_asm_asimd_16_6_1( float const * i_a,
                            float const * i_b,
                            float       * io_c );

and performs the operation

\[C \mathrel{+}= A B\]

on 32-bit floating-point data with

\[M = 16, \; N = 6, \; K = 1, \quad ldA = 16, \; ldB = 1, \; ldC = 16.\]

In our implementation we’ll completely unroll the kernel, i.e., write every instruction explicitly without adding any control structures such as loops. Once done, we’ll add a \(K\) loop to increase the size of the kernel’s contraction dimension.

As with the general-purpose registers, we have to preserve some SIMD registers to adhere to the procedure call standard AAPCS64. The template in Listing 6.2.1 extends the one we had before when working solely on general-purpose registers. Now, we also write the lowest 64 bits of registers V8-V15 to the stack and restore them at the end of the function.

Listing 6.2.1 Template for the \((16\times6)=(16\times1)(1\times6)\) ASIMD matrix kernel. The template temporarily saves the general purpose registers X19-X30 and the lowest 64 bits of the SIMD registers V8-V15 on the stack.
        .text
        .type gemm_asm_asimd_16_6_1, %function
        .global gemm_asm_asimd_16_6_1
        /*
         * Performs the matrix-multiplication C+=A*B
         * with the shapes (16x6) = (16x1) * (1x6).
         * The input-data is of type float.
         *
         * @param x0 pointer to A.
         * @param x1 pointer to B.
         * @param x2 pointer to C.
         */
gemm_asm_asimd_16_6_1:
        // store
        stp x19, x20, [sp, #-16]!
        stp x21, x22, [sp, #-16]!
        stp x23, x24, [sp, #-16]!
        stp x25, x26, [sp, #-16]!
        stp x27, x28, [sp, #-16]!
        stp x29, x30, [sp, #-16]!

        stp  d8,  d9, [sp, #-16]!
        stp d10, d11, [sp, #-16]!
        stp d12, d13, [sp, #-16]!
        stp d14, d15, [sp, #-16]!

        // your matrix kernel goes here!

        // restore
        ldp d14, d15, [sp], #16
        ldp d12, d13, [sp], #16
        ldp d10, d11, [sp], #16
        ldp  d8,  d9, [sp], #16

        ldp x29, x30, [sp], #16
        ldp x27, x28, [sp], #16
        ldp x25, x26, [sp], #16
        ldp x23, x24, [sp], #16
        ldp x21, x22, [sp], #16
        ldp x19, x20, [sp], #16

        ret
        .size gemm_asm_asimd_16_6_1, (. - gemm_asm_asimd_16_6_1)

Since we are hunting for performance, we’ll do a little competition. Prize? The three best-performing teams will receive an honorable mention on the Performance Board of this homepage 😉.

Rules

  • Respect the procedure call standard. But: if you don’t modify a register, you don’t have to save it to the stack.

  • Verify your kernels.

  • Run your kernels for at least 1 second in your performance measurements by executing them repeatedly.

Currently, the best-performing teams are Alex’s ASM and LIBXSMM:

Table 6.2.1 Sustained performance on the Graviton3 processor for the single precision matrix kernel C+=AB with M=16, N=6, K=1, ldA=16, ldB=1, ldC=16.

  Team                       | Time (s) | #executions | GFLOPS | %peak
  ---------------------------|----------|-------------|--------|------
  Alex’s ASM                 | 1.287    | 100000000   | 14.92  | 23.3
  LIBXSMM, 59410c81 (ASIMD)  | 1.811    | 100000000   | 10.60  | 16.6

Table 6.2.2 Sustained performance on the Graviton3 processor for the single precision matrix kernel C+=AB with M=16, N=6, K=48, ldA=16, ldB=48, ldC=16.

  Team                       | Time (s) | #executions | GFLOPS | %peak
  ---------------------------|----------|-------------|--------|------
  Alex’s ASM                 | 18.08    | 100000000   | 50.97  | 79.6
  LIBXSMM, 59410c81 (ASIMD)  | 18.74    | 100000000   | 49.17  | 76.8

Tasks

  1. Implement and verify the unrolled matrix kernel C += AB for M=16, N=6, K=1, ldA=16, ldB=1, ldC=16.

  2. Tune your kernel to squeeze more performance out of a core. You may change everything, e.g., the type or order of the instructions used, but you have to follow the rules above. Report and document your optimizations.

  3. Add a loop over K to realize C += AB for M=16, N=6, K=48, ldA=16, ldB=48, ldC=16.

  4. Come up with a creative team name and submit it together with your entries for “Time (s)”, “#executions”, “GFLOPS” and “%peak” in Table 6.2.1 and Table 6.2.2. Assume a theoretical single-core peak of 64 GFLOPS for the c7g.xlarge instance used.