4. Small GEMMs: Getting Started
Small matrix-matrix multiplications form the backbone of many high-performance applications. A typical upper bound for the dimensions \(M\), \(N\), \(K\) of a small matrix-matrix multiplication might be:
\[\sqrt[3]{M \cdot N \cdot K} \le 64.\]
As part of this class, we’ll discuss and implement matrix-matrix multiplication kernels. A frequently used kernel performs the operation \(C \mathrel{+}= A B\) with \(C \in \mathbb{R}^{M \times N}\), \(A \in \mathbb{R}^{M \times K}\) and \(B \in \mathbb{R}^{K \times N}\).
Many different storage formats exist for matrices. Use column-major storage in all your implementations!
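With column-major storage, the elements of a column are contiguous in memory, and the leading dimension is the stride between consecutive columns. A minimal sketch of the resulting index computation (the helper name at is illustrative, not part of any required interface):

    #include <cstdint>

    // Column-major access: element (i_row, i_col) of a matrix whose consecutive
    // columns are i_ld entries apart is stored at offset i_row + i_col * i_ld.
    float at( float const * i_mat,
              int64_t       i_row,
              int64_t       i_col,
              int64_t       i_ld ) {
      return i_mat[ i_row + i_col * i_ld ];
    }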
4.1. Matrix Kernels: Reference
Before looking into the performance of matrix kernels, we’ll write a reference implementation. This implementation is our baseline and its results will be the gold standard for our verification efforts.
Tasks
- Write a driver in the file driver.cpp which acts as the point of entry for your future work.
- Implement a C/C++ reference kernel for the operation \(C \mathrel{+}= A B\). Implement the respective function in the file gemm_ref.cpp and use the following signature (a sketch of a possible implementation follows this list):

    void gemm_ref( float const * i_a,
                   float const * i_b,
                   float       * io_c,
                   int64_t       i_m,
                   int64_t       i_n,
                   int64_t       i_k,
                   int64_t       i_lda,
                   int64_t       i_ldb,
                   int64_t       i_ldc );

- Implement unit tests for gemm_ref. These should cover a representative selection of sizes and leading dimensions; about five variants are sufficient. Consider using Catch2 for your tests.
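A minimal sketch of such a reference kernel, assuming the column-major storage required above (any loop ordering is valid for the baseline):

    #include <cstdint>

    // Reference kernel: C += A * B with column-major A (M x K), B (K x N) and
    // C (M x N); i_lda, i_ldb and i_ldc are the respective leading dimensions.
    void gemm_ref( float const * i_a,
                   float const * i_b,
                   float       * io_c,
                   int64_t       i_m,
                   int64_t       i_n,
                   int64_t       i_k,
                   int64_t       i_lda,
                   int64_t       i_ldb,
                   int64_t       i_ldc ) {
      for( int64_t l_n = 0; l_n < i_n; l_n++ ) {
        for( int64_t l_m = 0; l_m < i_m; l_m++ ) {
          for( int64_t l_k = 0; l_k < i_k; l_k++ ) {
            io_c[ l_m + l_n * i_ldc ] += i_a[ l_m + l_k * i_lda ]
                                       * i_b[ l_k + l_n * i_ldb ];
          }
        }
      }
    }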
4.2. Matrix Kernels: Measuring Performance
Achieving high performance is the #1 goal of this class! The performance of matrix kernels is typically given in terms of Giga Floating Point Operations Per Second (GFLOPS). Thus, we have to derive two numbers for our kernels:
How many floating point operations are done per kernel call?
How many seconds are needed per kernel call?
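For the dense operation \(C \mathrel{+}= A B\), each of the \(M \cdot N \cdot K\) loop iterations performs one multiplication and one addition, which leads to the usual counting:
\[\text{\#FLOPs} = 2 \cdot M \cdot N \cdot K, \qquad \text{GFLOPS} = \frac{2 \cdot M \cdot N \cdot K}{t \cdot 10^9},\]
where \(t\) is the measured average time per kernel call in seconds.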
Tasks
- Given the matrix dimensions \(M\), \(N\), \(K\), how many floating point operations are required to perform the (dense) matrix-matrix multiplication?
- Add a simple loop in your driver which repeatedly calls your reference kernel implemented in gemm_ref. The kernel should be executed often enough that the loop takes about one second to complete.
- Use std::chrono outside of your repetition loop to measure the average time per kernel call (a timing sketch follows this list):
  - Use std::chrono::steady_clock::now() to get the current time before and after the loop.
  - Use std::chrono::duration_cast< std::chrono::duration< double > >() to get the duration between your two measurements.
- Print the average duration and the sustained GFLOPS of the repeated kernel calls.
- Try optimization flags, e.g., -O2 or -O3, and report performance numbers for the following matrix sizes and leading dimensions:
  \[M = N = K = ldA = ldB = ldC = \lambda,\]
  with
  \[\lambda \in \{ 4, 8, 12, 16, 24, 32, 48, 64 \}.\]
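A minimal timing sketch for the driver, assuming the gemm_ref signature from Section 4.1; the size and the repetition count are illustrative and should be tuned so that the loop runs for roughly one second:

    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // reference kernel from gemm_ref.cpp (Section 4.1)
    void gemm_ref( float const * i_a, float const * i_b, float * io_c,
                   int64_t i_m, int64_t i_n, int64_t i_k,
                   int64_t i_lda, int64_t i_ldb, int64_t i_ldc );

    int main() {
      int64_t const l_size = 32;     // M = N = K = ldA = ldB = ldC
      int64_t const l_reps = 100000; // adjust until the loop takes ~1s

      std::vector< float > l_a( l_size * l_size, 1.0f );
      std::vector< float > l_b( l_size * l_size, 1.0f );
      std::vector< float > l_c( l_size * l_size, 0.0f );

      auto l_tp0 = std::chrono::steady_clock::now();
      for( int64_t l_re = 0; l_re < l_reps; l_re++ ) {
        gemm_ref( l_a.data(), l_b.data(), l_c.data(),
                  l_size, l_size, l_size,
                  l_size, l_size, l_size );
      }
      auto l_tp1 = std::chrono::steady_clock::now();

      // average time per call and sustained GFLOPS (2 * M * N * K FLOPs per call)
      double l_time   = std::chrono::duration_cast< std::chrono::duration< double > >( l_tp1 - l_tp0 ).count();
      double l_avg    = l_time / l_reps;
      double l_gflops = 2.0 * l_size * l_size * l_size / l_avg * 1.0E-9;

      std::printf( "avg duration: %e s, sustained: %f GFLOPS\n", l_avg, l_gflops );
      return 0;
    }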
4.3. Matrix Kernels: Helping the Compiler
When writing C/C++ code, we translate a high-level construct, e.g., \(C \mathrel{+}= A B\), into a few lines of source code. The code is then translated into (possibly) optimized machine code by a compiler. Oftentimes our high-level knowledge gets lost in the source code. This forces the compiler to operate within a tight regime of valid optimizations and is often the reason for low performance. In this class we’ll explore two pathways to improve the situation:
Rewrite and/or annotate the code such that we obtain better machine code.
Share the high-level description of the kernel with a library which exploits its structure to generate better machine code.
We’ll get started by writing our source code in a (slightly) different way and, for now, by observing some “magic” when looking at the performance. Don’t worry about the details; we’ll get to the bottom of this later in the class.
Tasks
- Implement a C/C++ matrix-matrix multiplication kernel through three nested loops over \(M\), \(N\) and \(K\):
  - The loop over \(M\) should be the outer loop.
  - The loop over \(N\) should be in the middle.
  - The loop over \(K\) should be the inner loop.

  The kernel should assume fixed matrix sizes and leading dimensions:
  \[M = N = K = 32,\]
  \[ldA = ldB = ldC = 32.\]
  Implement the kernel in the file gemm_compiler_32_32_32_32_32_32.cpp and use the following signature:

    void gemm_compiler_32_32_32_32_32_32_mnk( float const * i_a,
                                              float const * i_b,
                                              float       * io_c );
- Implement a C/C++ matrix-matrix multiplication kernel through three nested loops over \(M\), \(N\) and \(K\). Make the same assumptions as before, but use a different loop ordering:
  - The loop over \(N\) should be the outer loop.
  - The loop over \(K\) should be in the middle.
  - The loop over \(M\) should be the inner loop.

  As before, implement the kernel in the file gemm_compiler_32_32_32_32_32_32.cpp but use the following signature (a sketch of both loop orderings follows this list):

    void gemm_compiler_32_32_32_32_32_32_nkm( float const * i_a,
                                              float const * i_b,
                                              float       * io_c );
- Make sure that the two kernels produce the same results (up to a small epsilon) as your reference kernel gemm_ref.
- Try optimization flags, e.g., -O2 or -O3, and report performance numbers for gemm_compiler_32_32_32_32_32_32_mnk and gemm_compiler_32_32_32_32_32_32_nkm. What other ideas do you have to help the compiler?
4.4. Code Generation
As discussed in the previous section, another way of getting higher performance is to share a high-level description of our kernels with a library. The library then combines this knowledge of the structure of our kernels with knowledge of the targeted microarchitecture to accelerate the kernel execution. One such library is LIBXSMM, which is capable of accelerating small matrix-matrix multiplications through the generation of machine code. LIBXSMM supports a large variety of computer architectures, including ASIMD and SVE instructions in the context of AArch64. In this exercise, we’ll be users of LIBXSMM and benchmark the performance of the library. Later, we’ll shift gears and move on to the backend. This will allow us to harness our own library for Just-In-Time (JIT) code generation.
Tasks
- Make yourself familiar with LIBXSMM. Read the piece “LIBXSMM: An Open Source-Based Inspiration for Hardware and Software Development at Intel” in The Parallel Universe, Issue 34. Install the library from source on the Graviton3 instance.
- Navigate to LIBXSMM’s samples/hello directory and explain the call to libxsmm_dmmdispatch in the C implementation hello.c. Build and run the “Hello LIBXSMM” example.
- Add support for LIBXSMM kernels to your matrix-multiplication driver (a dispatch sketch follows this list). Verify the results of the library! Benchmark the performance of the library! This should include at least single-precision tests for \(\lambda \in \{ 4, 8, 12, 16, 24, 32, 48, 64 \}\) in the formula above.
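A minimal single-precision sketch, assuming LIBXSMM’s classic dispatch interface: libxsmm_smmdispatch is the float counterpart of the libxsmm_dmmdispatch call seen in hello.c, and passing NULL selects the library defaults (to our knowledge \(\alpha = \beta = 1\), i.e., \(C \mathrel{+}= A B\); verify against your installed version). The wrapper name run_libxsmm_32 is illustrative:

    #include <libxsmm.h>

    // Dispatch (JIT-generate or look up) a single-precision kernel for
    // C += A * B with M = N = K = lda = ldb = ldc = 32. NULL arguments
    // select LIBXSMM's defaults for the leading dimensions, alpha, beta,
    // flags and prefetch strategy.
    void run_libxsmm_32( float const * i_a,
                         float const * i_b,
                         float       * io_c ) {
      libxsmm_blasint l_m = 32, l_n = 32, l_k = 32;

      libxsmm_smmfunction l_kernel = libxsmm_smmdispatch( l_m, l_n, l_k,
                                                          NULL /*lda*/, NULL /*ldb*/, NULL /*ldc*/,
                                                          NULL /*alpha*/, NULL /*beta*/,
                                                          NULL /*flags*/, NULL /*prefetch*/ );
      if( l_kernel != NULL ) {
        l_kernel( i_a, i_b, io_c ); // execute the generated kernel
      }
    }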