11. Data Locality

Data locality is an important concept in computing. Within a compute node, accesses to main memory can drastically slow down the execution of a program. It is therefore advisable to keep the required data in the fast cache levels by exploiting temporal and spatial data locality.
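As a small illustration (not part of the tasks below): for a row-major matrix, traversing along rows touches consecutive addresses and reuses entire cache lines (spatial locality), whereas traversing along columns jumps a full row between accesses. The function names in this sketch are illustrative.

#include <cstddef>

// Row-major i_n x i_n matrix: element (i, j) is stored at i*i_n + j.

// Row-wise traversal: consecutive addresses, good spatial locality.
double sumRowWise( double const * i_data,
                   std::size_t    i_n ) {
  double l_sum = 0.0;
  for( std::size_t l_i = 0; l_i < i_n; l_i++ )
    for( std::size_t l_j = 0; l_j < i_n; l_j++ )
      l_sum += i_data[l_i*i_n + l_j];
  return l_sum;
}

// Column-wise traversal: successive accesses are i_n elements apart,
// so for large i_n every access may touch a different cache line.
double sumColWise( double const * i_data,
                   std::size_t    i_n ) {
  double l_sum = 0.0;
  for( std::size_t l_j = 0; l_j < i_n; l_j++ )
    for( std::size_t l_i = 0; l_i < i_n; l_i++ )
      l_sum += i_data[l_i*i_n + l_j];
  return l_sum;
}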

11.1. Matrix-Matrix Multiplication

This chapter explores and benchmarks the effects of data locality for matrix-matrix multiplication workloads. A simple implementation has three nested loops over the matrix sizes M, N and K. The loops can be interchanged, which results in different access patterns to the underlying memory. This influences cache utilization through data locality and can lead to vastly different runtimes of the implementations.

Listing 11.1.1 Matrix-matrix multiplication with nested loops over M, N and K
#include <cstddef>

// Computes C += A * B for row-major matrices A (M x K), B (K x N), C (M x N).
void matmul_mnk(double* i_A,
                double* i_B,
                double* io_C,
                std::size_t i_m,
                std::size_t i_n,
                std::size_t i_k) {

  for( std::size_t l_m = 0; l_m < i_m; l_m++ )
    for( std::size_t l_n = 0; l_n < i_n; l_n++ )
      for( std::size_t l_k = 0; l_k < i_k; l_k++ )
        io_C[l_m*i_n + l_n] += i_A[l_m*i_k + l_k] * i_B[l_k*i_n + l_n];
}
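For illustration, one of the five other permutations, MKN, differs only in the order of the loop headers while the loop body stays unchanged:

void matmul_mkn(double* i_A,
                double* i_B,
                double* io_C,
                std::size_t i_m,
                std::size_t i_n,
                std::size_t i_k) {

  // identical loop body; only the order of the loop headers changed
  for( std::size_t l_m = 0; l_m < i_m; l_m++ )
    for( std::size_t l_k = 0; l_k < i_k; l_k++ )
      for( std::size_t l_n = 0; l_n < i_n; l_n++ )
        io_C[l_m*i_n + l_n] += i_A[l_m*i_k + l_k] * i_B[l_k*i_n + l_n];
}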

Task

Part 1: Benchmarking matrix-matrix multiplication performance.

  1. Develop a C/C++ program that measures the performance of matrix-matrix multiplications.

  2. Execute your program on a full node of the Ara cluster.

  • Your sbatch job should compute the performance for increasing matrix sizes M = N = K from 2 to 1024.

  • Your sbatch job should measure the performance of every possible permutation of the three loops.

  • Store the results in a CSV file named performance.csv with Implementation, M=N=K and GFLOPS as column names.

  • Use several iterations per measurement to obtain reliable results (a minimal timing sketch follows this list).

  • Use the compile flags -march=native and -O2.
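The following is a minimal timing sketch for one loop order, assuming matmul_mnk from Listing 11.1.1 is defined in the same file. The helper name benchGflops, the power-of-two size stepping and the fixed iteration count are illustrative choices, and the CSV lines are written to standard output rather than directly to performance.csv:

#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

// assumes matmul_mnk from Listing 11.1.1 is declared above

// Times i_nIters executions for square matrices of size i_size and derives
// GFLOPS from the 2 * M * N * K floating-point operations per product.
double benchGflops( std::size_t i_size,
                    std::size_t i_nIters ) {
  std::vector< double > l_a( i_size * i_size, 1.0 );
  std::vector< double > l_b( i_size * i_size, 1.0 );
  std::vector< double > l_c( i_size * i_size, 0.0 );

  auto l_start = std::chrono::steady_clock::now();
  for( std::size_t l_it = 0; l_it < i_nIters; l_it++ )
    matmul_mnk( l_a.data(), l_b.data(), l_c.data(), i_size, i_size, i_size );
  auto l_end = std::chrono::steady_clock::now();

  double l_secs  = std::chrono::duration< double >( l_end - l_start ).count();
  double l_flops = 2.0 * i_size * i_size * i_size * i_nIters;
  return l_flops / l_secs * 1.0E-9;
}

int main() {
  // CSV header as required by the task
  std::cout << "Implementation,M=N=K,GFLOPS\n";
  for( std::size_t l_size = 2; l_size <= 1024; l_size *= 2 )
    std::cout << "mnk," << l_size << "," << benchGflops( l_size, 10 ) << "\n";
  return 0;
}

Redirect the output to performance.csv or write the file with std::ofstream; for small matrices a much higher iteration count is needed to obtain stable timings.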

Part 2: Visualizing and evaluating the results.

  1. Develop a script to read the CSV file, plot your measurements and save the plot as a PNG image.

  2. Visualize the mkn access pattern. How does it perform in your measurements for large matrices? Explain briefly.

11.2. Blocking

Loop blocking is a technique to increase data locality. The general idea of blocking is to partition the data into blocks that fit into the different CPU cache levels. Once a block is loaded, as many operations as possible are performed on it. Then the block is written back to memory and the next one is loaded. This chapter explores the effectiveness of blocking for matrix-vector multiplications.

Listing 11.2.1 Blocked matrix-vector multiplication
#include <cstddef>

// Computes c += A * b for a row-major M x K matrix A, blocked along K.
// Assumes i_k is a multiple of i_blockSize; a remainder would be skipped.
void impl_blocked_k(double* i_A,
                    double* i_b,
                    double* io_c,
                    std::size_t i_k,
                    std::size_t i_m,
                    std::size_t i_blockSize) {

  std::size_t l_nBlocks = i_k / i_blockSize;
  for( std::size_t l_bk = 0; l_bk < l_nBlocks; l_bk++ )
    for( std::size_t l_m = 0; l_m < i_m; l_m++ )
      for( std::size_t l_k = l_bk*i_blockSize; l_k < (l_bk+1)*i_blockSize; l_k++ )
        io_c[l_m] += i_A[l_m*i_k + l_k] * i_b[l_k];
}

Task

Part 1: Benchmarking matrix-vector multiplication performance.

  1. Develop a C/C++ program that measures the performance of matrix-vector multiplications with and without blocking.

  2. Execute your program on a full node of the Ara cluster.

  • Your sbatch job should compute the performance for increasing matrix sizes M = K from 32 to 8192.

  • Benchmark the performance of blocking in the K and M dimensions with a block size of 4 (a possible M-blocked variant is sketched after this list).

  • Store your results in a CSV file.

  • Use the compile flags -march=native and -O2.
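Listing 11.2.1 blocks along the K dimension. A possible counterpart for blocking along M, under the same assumption that the blocked dimension is a multiple of the block size (the name impl_blocked_m is a placeholder), could look like this:

#include <cstddef>

// Computes c += A * b for a row-major M x K matrix A, blocked along M:
// for a fixed k, the vector element b[l_k] is reused across the block
// and the i_blockSize entries of c stay hot in cache or registers.
void impl_blocked_m(double* i_A,
                    double* i_b,
                    double* io_c,
                    std::size_t i_k,
                    std::size_t i_m,
                    std::size_t i_blockSize) {

  std::size_t l_nBlocks = i_m / i_blockSize;
  for( std::size_t l_bm = 0; l_bm < l_nBlocks; l_bm++ )
    for( std::size_t l_k = 0; l_k < i_k; l_k++ )
      for( std::size_t l_m = l_bm*i_blockSize; l_m < (l_bm+1)*i_blockSize; l_m++ )
        io_c[l_m] += i_A[l_m*i_k + l_k] * i_b[l_k];
}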

Part 2: Visualizing and evaluating the results.

  1. Plot your measurements and save the plot as a PNG image.

  2. Explain your results. What is the influence of the different cache levels?

11.3. BLAS

This chapter benchmarks OpenBLAS, an optimized implementation of the BLAS (Basic Linear Algebra Subprograms) interface. Level 1 BLAS routines perform scalar, vector and vector-vector operations, Level 2 routines perform matrix-vector operations, and Level 3 routines perform matrix-matrix operations. In this task we benchmark matrix-matrix multiplication, a Level 3 operation, which OpenBLAS exposes through the following interface:

xGEMM (ORDER, TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)

which computes:

\[C \leftarrow \alpha A B + \beta C\]
  • ORDER defines how the matrices are stored; it can be set to CblasColMajor or CblasRowMajor.

  • TRANSA and TRANSB specify whether the input matrices A and B are transposed. In OpenBLAS the constant CblasNoTrans disables transposition.

  • M, N and K are integer values, which define the size of the matrices.

  • A, B and C are pointers to the respective matrices.

  • ALPHA and BETA are scalar values. To obtain a matrix-matrix multiplication matching the previous implementations (C += A B), both should be set to 1.

  • LDA, LDB and LDC are the strides of the respective matrices. In case of compact storage these variables should be set to the size of the leading dimension of the matrices.
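For illustration, a call computing C += A B for row-major square matrices could look as follows. This is a minimal sketch: the size and the surrounding main are illustrative, and OpenBLAS is assumed to be installed as described in the hint below.

#include <cblas.h>
#include <vector>

int main() {
  int l_size = 512; // illustrative size

  std::vector< double > l_a( l_size * l_size, 1.0 );
  std::vector< double > l_b( l_size * l_size, 1.0 );
  std::vector< double > l_c( l_size * l_size, 0.0 );

  // C <- 1.0 * A * B + 1.0 * C, i.e. C += A * B as in the previous tasks.
  // Row-major, no transposition; for compact storage the leading
  // dimensions equal the number of columns: LDA = K, LDB = N, LDC = N.
  cblas_dgemm( CblasRowMajor, CblasNoTrans, CblasNoTrans,
               l_size, l_size, l_size,
               1.0, l_a.data(), l_size,
                    l_b.data(), l_size,
               1.0, l_c.data(), l_size );

  return 0;
}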

Optional Task

Benchmark the performance of the matrix-matrix multiplication from OpenBLAS.

  • Use the OpenBLAS function cblas_dgemm to perform a double-precision matrix-matrix multiplication.

  • Benchmark the performance for square matrices of size M = N = K from 2 to 1024.

  • Compare the performance of OpenBLAS with your own implementations.

Hint

To install OpenBLAS you can execute the following commands:

wget https://github.com/OpenMathLib/OpenBLAS/archive/refs/tags/v0.3.26.tar.gz
tar -xvf v0.3.26.tar.gz
cd OpenBLAS-0.3.26
make
make PREFIX=$(pwd)/../openblas_x86 install
cd ..

You can link against OpenBLAS by using the flags -I path/to/openblas_x86/include, -L path/to/openblas_x86/lib and -lopenblas.
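For example, assuming your source file is named bench.cpp (the file name is a placeholder):

g++ -O2 -march=native bench.cpp -I path/to/openblas_x86/include -L path/to/openblas_x86/lib -lopenblas -o bench

Depending on your setup, you may additionally need to set LD_LIBRARY_PATH so the shared library is found at runtime.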