4. Small GEMMs: Getting Started


Fig. 4.1 Illustration of the matrix sizes \(M, N, K\) and the leading dimensions \(lda, ldb, ldc\) in the operation \(C \mathrel{+}= A B\).

Small matrix-matrix multiplications form the backbone of many high-performance applications. A typical upper bound for the dimensions \(M, N, K\) of a small matrix-matrix multiplication might be:

\[\sqrt[3]{ M \cdot N \cdot K } \le 64.\]

As part of this class, we’ll discuss and implement matrix-matrix multiplication kernels. A commonly used kernel performs the operation \(C \mathrel{+}= A B\) with \(C \in \mathbb{R}^{M \times N}\), \(A \in \mathbb{R}^{M \times K}\) and \(B \in \mathbb{R}^{K \times N}\).


Fig. 4.2 Illustration of column-major storage for a matrix. The red line shows the linear storage in memory.

Many different storage formats exist for matrices. Use column-major storage in all your implementations!
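
With column-major storage and a leading dimension \(ld\), the element in row \(i\) and column \(j\) of a matrix is found at the linear index \(i + j \cdot ld\). A minimal sketch of this addressing follows; the helper name at is purely illustrative and not part of any required interface:

    #include <cstdint>

    // Column-major access: element (i_row, i_col) of a matrix stored with
    // leading dimension i_ld lives at the linear index i_row + i_col * i_ld.
    float at( float const * i_mat,
              int64_t       i_row,
              int64_t       i_col,
              int64_t       i_ld ) {
      return i_mat[ i_row + i_col * i_ld ];
    }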

4.1. Matrix Kernels: Reference

Before looking into the performance of matrix kernels, we’ll write a reference implementation. This implementation is our baseline and its results will be the gold standard for our verification efforts.

Tasks

  1. Write a driver in the file driver.cpp which acts as the point of entry for your future work.

  2. Implement a C/C++ reference kernel for the operation \(C \mathrel{+}= A B\). Put the function in the file gemm_ref.cpp and use the following signature; a possible implementation is sketched after this task list:

    void gemm_ref( float        const * i_a,
                   float        const * i_b,
                   float              * io_c,
                   int64_t              i_m,
                   int64_t              i_n,
                   int64_t              i_k,
                   int64_t              i_lda,
                   int64_t              i_ldb,
                   int64_t              i_ldc );
    
  3. Implement unit tests for gemm_ref. These should cover a representative set of sizes and leading dimensions; about five variants are sufficient. Consider using Catch2 for your tests.
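
A possible shape of the reference kernel, assuming the column-major addressing sketched above; treat it as one correct variant rather than the definitive solution:

    #include <cstdint>

    // Reference kernel for C += A * B (single precision, column-major).
    // A is M x K with leading dimension i_lda, B is K x N with i_ldb,
    // C is M x N with i_ldc.
    void gemm_ref( float        const * i_a,
                   float        const * i_b,
                   float              * io_c,
                   int64_t              i_m,
                   int64_t              i_n,
                   int64_t              i_k,
                   int64_t              i_lda,
                   int64_t              i_ldb,
                   int64_t              i_ldc ) {
      for( int64_t l_n = 0; l_n < i_n; l_n++ ) {
        for( int64_t l_m = 0; l_m < i_m; l_m++ ) {
          for( int64_t l_k = 0; l_k < i_k; l_k++ ) {
            io_c[ l_m + l_n * i_ldc ] +=   i_a[ l_m + l_k * i_lda ]
                                         * i_b[ l_k + l_n * i_ldb ];
          }
        }
      }
    }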

4.2. Matrix Kernels: Measuring Performance

Achieving high performance is the #1 goal of this class! The performance of matrix kernels is typically given in terms of Giga Floating Point Operations Per Second (GFLOPS). Thus, we have to derive two numbers for our kernels:

  1. How many floating point operations are done per kernel call?

  2. How many seconds are needed per kernel call?
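
Combining the two numbers gives the sustained performance, i.e.,

\[\text{GFLOPS} = \frac{\text{floating point operations per call} \cdot \text{number of calls}}{\text{elapsed time in seconds} \cdot 10^{9}}.\]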

Tasks

  1. Given the matrix dimensions \(M\), \(N\), \(K\), how many floating point operations are required to perform the (dense) matrix-matrix multiplication?

  2. Add a simple loop in your driver which repeatedly calls your reference kernel implemented in gemm_ref. The kernel should be executed often enough such that the loop takes about one second to complete.

  3. Use std::chrono outside of your repetition loop to measure the average time per kernel call:

    • Use std::chrono::steady_clock::now() to get the current time before and after the loop.

    • Use std::chrono::duration_cast< std::chrono::duration< double > >() to get the duration between your two measurements.

  4. Print the average duration and the sustained GFLOPS of the repeated kernel calls; a timing sketch is given after this task list.

  5. Try optimization flags, e.g., -O2 or -O3, and report performance numbers for the following matrix sizes and leading dimensions:

    \[M = N = K = ldA = ldB = ldC = \lambda,\]

    with

    \[\lambda \in \{ 4, 8, 12, 16, 24, 32, 48, 64 \}.\]
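
A minimal timing sketch along the lines of tasks 2 to 4, using \(2 \cdot M \cdot N \cdot K\) as the operation count per call (one multiply and one add per \((m, n, k)\) triple); the matrix initialization, variable names and repetition count are purely illustrative:

    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // reference kernel from gemm_ref.cpp
    void gemm_ref( float const * i_a, float const * i_b, float * io_c,
                   int64_t i_m, int64_t i_n, int64_t i_k,
                   int64_t i_lda, int64_t i_ldb, int64_t i_ldc );

    int main() {
      int64_t l_m = 32, l_n = 32, l_k = 32;
      int64_t l_lda = l_m, l_ldb = l_k, l_ldc = l_m;

      std::vector< float > l_a( l_lda * l_k, 1.0f );
      std::vector< float > l_b( l_ldb * l_n, 1.0f );
      std::vector< float > l_c( l_ldc * l_n, 0.0f );

      // illustrative value: in practice choose the repetition count such that
      // the loop below runs for about one second
      int64_t l_n_repetitions = 1000000;

      auto l_tp_start = std::chrono::steady_clock::now();
      for( int64_t l_re = 0; l_re < l_n_repetitions; l_re++ ) {
        gemm_ref( l_a.data(), l_b.data(), l_c.data(),
                  l_m, l_n, l_k,
                  l_lda, l_ldb, l_ldc );
      }
      auto l_tp_end = std::chrono::steady_clock::now();

      double l_seconds = std::chrono::duration_cast< std::chrono::duration< double > >( l_tp_end - l_tp_start ).count();
      double l_time_per_call  = l_seconds / l_n_repetitions;
      double l_flops_per_call = 2.0 * l_m * l_n * l_k; // one multiply and one add per (m, n, k) triple
      double l_gflops = l_flops_per_call / ( l_time_per_call * 1.0e9 );

      std::printf( "avg. time per call: %e s\n", l_time_per_call );
      std::printf( "sustained GFLOPS:   %f\n",   l_gflops );

      return 0;
    }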

4.3. Matrix Kernels: Helping the Compiler

When writing C/C++ code, we translate a high-level construct, e.g., \(C \mathrel{+}= A B\), into a few lines of source code. A compiler then turns these lines into (possibly) optimized machine code. Oftentimes our high-level knowledge is lost on the way into source code. This restricts the compiler to a narrow set of provably valid optimizations and is a frequent cause of low performance. In this class we’ll explore two pathways to improve the situation:

  • Rewrite and/or annotate the code such that we obtain better machine code.

  • Share the high-level description of the kernel with a library which exploits its structure to generate better machine code.

We’ll get started by writing our source code in a (slightly) different way and, for now, simply observing some “magic” in the resulting performance. Don’t worry about the details; we’ll get to the bottom of this later in the class.

Tasks

  1. Implement a C/C++ matrix-matrix multiplication kernel through three nested loops over \(M\), \(N\) and \(K\) (sketches of both loop orderings are given after this task list):

    • The loop over \(M\) should be the outer loop.

    • The loop over \(N\) should be the middle loop.

    • The loop over \(K\) should be the inner loop.

    The kernel should assume fixed matrix sizes and leading dimensions:

    \[M = N = K = 32\]
    \[ldA = ldB = ldC = 32.\]

    Implement the kernel in the file gemm_compiler_32_32_32_32_32_32.cpp and use the following signature:

    void gemm_compiler_32_32_32_32_32_32_mnk( float const * i_a,
                                              float const * i_b,
                                              float       * io_c );
    
  2. Implement a C/C++ matrix-matrix multiplication kernel through three nested loops over \(M\), \(N\) and \(K\). Make the same assumptions as before, but use a different loop ordering:

    • The loop over \(N\) should be the outer loop.

    • The loop over \(K\) should be the middle loop.

    • The loop over \(M\) should be the inner loop.

    As before, implement the kernel in the file gemm_compiler_32_32_32_32_32_32.cpp but use the following signature:

    void gemm_compiler_32_32_32_32_32_32_nkm( float const * i_a,
                                              float const * i_b,
                                              float       * io_c );
    
  3. Make sure that the two kernels produce the same results (up to a small epsilon) as your reference kernel gemm_ref.

  4. Try optimization flags, e.g., -O2 or -O3, and report performance numbers for gemm_compiler_32_32_32_32_32_32_mnk and gemm_compiler_32_32_32_32_32_32_nkm. What other ideas do you have to help the compiler?
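
A possible shape of the two fixed-size kernels, again assuming column-major storage; consider the code a sketch, not the only valid formulation:

    #include <cstdint>

    // C += A * B for M = N = K = 32 and ldA = ldB = ldC = 32,
    // loop order: M (outer), N (middle), K (inner)
    void gemm_compiler_32_32_32_32_32_32_mnk( float const * i_a,
                                              float const * i_b,
                                              float       * io_c ) {
      for( int64_t l_m = 0; l_m < 32; l_m++ ) {
        for( int64_t l_n = 0; l_n < 32; l_n++ ) {
          for( int64_t l_k = 0; l_k < 32; l_k++ ) {
            io_c[ l_m + l_n * 32 ] += i_a[ l_m + l_k * 32 ] * i_b[ l_k + l_n * 32 ];
          }
        }
      }
    }

    // same operation, loop order: N (outer), K (middle), M (inner)
    void gemm_compiler_32_32_32_32_32_32_nkm( float const * i_a,
                                              float const * i_b,
                                              float       * io_c ) {
      for( int64_t l_n = 0; l_n < 32; l_n++ ) {
        for( int64_t l_k = 0; l_k < 32; l_k++ ) {
          for( int64_t l_m = 0; l_m < 32; l_m++ ) {
            io_c[ l_m + l_n * 32 ] += i_a[ l_m + l_k * 32 ] * i_b[ l_k + l_n * 32 ];
          }
        }
      }
    }

Both variants perform exactly the same arithmetic; only the memory-access pattern of the innermost loop differs.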

4.4. Code Generation

As discussed in the previous section, another way of getting higher performance is to share a high-level description of our kernels with a library. The library then combines this knowledge on the structure of our kernels with knowledge on the targeted microarchitecture to accelerate the kernel execution. One such library is LIBXSMM, which is capable of accelerating small matrix-matrix multiplications through the generation of machine code. LIBXSMM supports a large variety of computer architectures including ASIMD and SVE instructions in the context of AArch64. In this exercise, we’ll be users of LIBXSMM and benchmark the performance of the library. Later, we’ll shift gears and move on to the backend. This will allow us to harness our own library for Just-In-Time (JIT) code generation.

Tasks

  1. Make yourself familiar with LIBXSMM. Read the piece “LIBXSMM: An Open Source-Based Inspiration for Hardware and Software Development at Intel” in The Parallel Universe, Issue 34. Install the library from source on the Graviton3 instance.

  2. Navigate to LIBXSMM’s samples/hello directory and explain the call to libxsmm_dmmdispatch in the C implementation hello.c. Build and run the “Hello LIBXSMM” example.

  3. Add support for LIBXSMM kernels to your matrix-multiplication driver. Verify the library’s results and benchmark its performance! This should include at least single-precision tests for \(\lambda \in \{ 4, 8, 12, 16, 24, 32, 48, 64 \}\) in the formula above; a dispatch sketch is given below.
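
A sketch of how a single-precision kernel might be dispatched and called through LIBXSMM’s C API. The function run_libxsmm_example is purely illustrative, and the dispatch call follows the classic libxsmm_smmdispatch interface used alongside libxsmm_dmmdispatch in samples/hello/hello.c; check it against the headers of the LIBXSMM version you installed, since newer releases expose a different dispatch API:

    #include <libxsmm.h>

    // Dispatch and run a kernel for C += A * B with
    // M = N = K = lda = ldb = ldc = 32 (single precision, column-major).
    void run_libxsmm_example( float const * i_a,
                              float const * i_b,
                              float       * io_c ) {
      libxsmm_blasint l_m = 32, l_n = 32, l_k = 32;

      // NULL leading dimensions request tight storage (lda = m, ldb = k, ldc = m);
      // NULL alpha/beta/flags/prefetch select the library defaults
      // (assumed here: alpha = 1, beta = 1, i.e., C += A * B).
      libxsmm_smmfunction l_kernel = libxsmm_smmdispatch( l_m, l_n, l_k,
                                                          NULL, NULL, NULL,
                                                          NULL, NULL,
                                                          NULL, NULL );
      if( l_kernel != NULL ) {
        l_kernel( i_a, i_b, io_c );
      }
    }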