11. Data Locality
Data locality is an important concept in computing. Within a compute node, accesses to main memory can drastically slow down program execution. It is therefore advisable to keep the required data in the fast cache layers by exploiting temporal and spatial data locality.
11.1. Matrix-Matrix Multiplication
This chapter explores and benchmarks the effects of data locality with respect to matrix-matrix multiplication workloads. A simple implementation has three nested loops over the matrix sizes M, N and K. The order of the loops can be interchanged which results in different access patterns to the underlying memory. This influences the utilization of caches through data locality properties and can result in vastly different runtimes of the implementations.
void matmul_mnk( double      * i_A,
                 double      * i_B,
                 double      * io_C,
                 std::size_t   i_m,
                 std::size_t   i_n,
                 std::size_t   i_k ) {
  for( std::size_t l_m = 0; l_m < i_m; l_m++ )
    for( std::size_t l_n = 0; l_n < i_n; l_n++ )
      for( std::size_t l_k = 0; l_k < i_k; l_k++ )
        io_C[l_m*i_n + l_n] += i_A[l_m*i_k + l_k] * i_B[l_k*i_n + l_n];
}
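The remaining five loop orders are obtained purely by reordering the three loops; the arithmetic is unchanged, only the memory access pattern differs. As one example, the mkn order referenced in the task below could be sketched as follows (the function name matmul_mkn is an assumption, chosen to mirror the naming above):

```cpp
#include <cstddef>

// mkn loop order: the k loop moves to the middle, so a row of C is
// updated repeatedly while B is streamed row by row.
void matmul_mkn( double      * i_A,
                 double      * i_B,
                 double      * io_C,
                 std::size_t   i_m,
                 std::size_t   i_n,
                 std::size_t   i_k ) {
  for( std::size_t l_m = 0; l_m < i_m; l_m++ )
    for( std::size_t l_k = 0; l_k < i_k; l_k++ )
      for( std::size_t l_n = 0; l_n < i_n; l_n++ )
        io_C[l_m*i_n + l_n] += i_A[l_m*i_k + l_k] * i_B[l_k*i_n + l_n];
}
```

Since floating-point addition order per output element is preserved within each dot product only up to reordering, results may differ from the mnk version in the last bits, but for exact inputs such as small integers they match exactly.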
Task
Part 1: Benchmarking matrix-matrix multiplication performance.
Develop a C/C++ program that measures the performance of matrix-matrix multiplications.
Execute your program on a full node of the Ara-Cluster.
Your sbatch job should compute the performance for increasing matrix sizes M = N = K from 2 to 1024.
Your sbatch job should measure the performance of every possible permutation of the three loops.
Store the results in a CSV file named performance.csv with Implementation, M=N=K and GFLOPS as column names.
Use several iterations for your measurements to obtain reliable results.
Use the compile flags -march=native and -O2.
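A minimal timing harness for a single size and kernel could look as follows; the helper name benchmark_gflops and the use of all-ones input data are assumptions, and a full benchmark would additionally loop over sizes and loop permutations and write the CSV rows:

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Times i_nReps executions of a matmul kernel and returns sustained GFLOPS.
// One matrix-matrix multiplication performs 2*M*N*K floating-point operations.
template< typename Kernel >
double benchmark_gflops( Kernel      i_kernel,
                         std::size_t i_m,
                         std::size_t i_n,
                         std::size_t i_k,
                         std::size_t i_nReps ) {
  std::vector< double > l_A( i_m * i_k, 1.0 );
  std::vector< double > l_B( i_k * i_n, 1.0 );
  std::vector< double > l_C( i_m * i_n, 0.0 );

  auto l_start = std::chrono::steady_clock::now();
  for( std::size_t l_rep = 0; l_rep < i_nReps; l_rep++ )
    i_kernel( l_A.data(), l_B.data(), l_C.data(), i_m, i_n, i_k );
  auto l_end = std::chrono::steady_clock::now();

  double l_seconds = std::chrono::duration< double >( l_end - l_start ).count();
  double l_flops   = 2.0 * i_m * i_n * i_k * i_nReps;
  return l_flops / l_seconds * 1.0E-9;
}
```

Using std::chrono::steady_clock avoids wall-clock adjustments during the measurement; averaging over many repetitions amortizes its resolution limit.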
Part 2: Visualizing and evaluating the results.
Develop a script to read the CSV file, plot your measurements and save the plot as a PNG image.
Visualize the mkn access pattern. How does it perform in your measurements for large matrices? Explain briefly.
11.2. Blocking
Loop blocking is a technique to increase data locality. The general idea of blocking is to chunk data into blocks which fit into the different CPU cache levels. Once a block is loaded, as many operations as possible are performed on it. Then the block is written back to memory and the next one is loaded. This chapter explores the effectiveness of blocking for matrix-vector multiplications.
void impl_blocked_k( double      * i_A,
                     double      * i_b,
                     double      * io_c,
                     std::size_t   i_k,
                     std::size_t   i_m,
                     std::size_t   i_blockSize ) {
  std::size_t l_nBlocks = i_k / i_blockSize;
  for( std::size_t l_bk = 0; l_bk < l_nBlocks; l_bk++ )
    for( std::size_t l_m = 0; l_m < i_m; l_m++ )
      for( std::size_t l_k = l_bk*i_blockSize; l_k < (l_bk+1)*i_blockSize; l_k++ )
        io_c[l_m] += i_A[l_m*i_k + l_k] * i_b[l_k];
}
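The m-blocked counterpart requested in the task below can be sketched analogously; the name impl_blocked_m is an assumption, and like impl_blocked_k above it silently assumes the blocked dimension is a multiple of the block size:

```cpp
#include <cstddef>

// Blocking over m: the rows of A are processed in chunks of i_blockSize,
// so a small slice of c stays cache-resident while b is streamed.
void impl_blocked_m( double      * i_A,
                     double      * i_b,
                     double      * io_c,
                     std::size_t   i_k,
                     std::size_t   i_m,
                     std::size_t   i_blockSize ) {
  std::size_t l_nBlocks = i_m / i_blockSize;
  for( std::size_t l_bm = 0; l_bm < l_nBlocks; l_bm++ )
    for( std::size_t l_k = 0; l_k < i_k; l_k++ )
      for( std::size_t l_m = l_bm*i_blockSize; l_m < (l_bm+1)*i_blockSize; l_m++ )
        io_c[l_m] += i_A[l_m*i_k + l_k] * i_b[l_k];
}
```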
Task
Part 1: Benchmarking matrix-vector multiplication performance.
Develop a C/C++ program that measures the performance of matrix-vector multiplications with and without blocking.
Execute your program on a full node of the Ara-Cluster.
Your sbatch job should compute the performance for increasing matrix sizes M = K from 32 to 8192.
Benchmark the performance for blocking in the k and m dimensions, with a block size of 4.
Store your results in a CSV file.
Use the compile flags -march=native and -O2.
Part 2: Visualizing and evaluating the results.
Plot your measurements and save the plot as a PNG image.
Explain your results. What is the influence of the different cache levels?
11.3. BLAS
This chapter benchmarks the performance of the optimized BLAS (Basic Linear Algebra Subprograms) library OpenBLAS. Level 1 BLAS perform scalar, vector and vector-vector operations, Level 2 BLAS perform matrix-vector operations, and Level 3 BLAS perform matrix-matrix operations. In this task we benchmark the matrix-matrix multiplication, which is a Level 3 BLAS operation. The matrix-matrix multiplications are accessed with the following interface in OpenBLAS:
xGEMM (ORDER, TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
and calculates:

C := ALPHA * op(A) * op(B) + BETA * C,   where op(X) is X or its transpose X^T

ORDER defines how the matrices are stored. It can be set to CblasColMajor or CblasRowMajor.
TRANSA and TRANSB allow transposing the input matrices A and B. In OpenBLAS the constant CblasNoTrans prevents a transposition.
M, N and K are integer values which define the sizes of the matrices.
A, B and C are pointers to the respective matrices.
ALPHA and BETA are scalar values. For matrix-matrix multiplications similar to the previous implementations they should be set to 1.
LDA, LDB and LDC are the strides of the respective matrices. In case of compact storage these variables should be set to the size of the leading dimension of the matrices.
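Ignoring transposition and strides, the operation xGEMM performs boils down to the following plain loops. This reference version (the name gemm_reference and the row-major, compact layout are assumptions) is handy for validating the library call against small inputs:

```cpp
#include <cstddef>

// Reference GEMM semantics: C := alpha * A * B + beta * C,
// for row-major, compactly stored matrices (LDA = K, LDB = N, LDC = N).
void gemm_reference( std::size_t    i_m,
                     std::size_t    i_n,
                     std::size_t    i_k,
                     double         i_alpha,
                     double const * i_A,
                     double const * i_B,
                     double         i_beta,
                     double       * io_C ) {
  for( std::size_t l_m = 0; l_m < i_m; l_m++ )
    for( std::size_t l_n = 0; l_n < i_n; l_n++ ) {
      double l_dot = 0.0;
      for( std::size_t l_k = 0; l_k < i_k; l_k++ )
        l_dot += i_A[l_m*i_k + l_k] * i_B[l_k*i_n + l_n];
      io_C[l_m*i_n + l_n] = i_alpha * l_dot + i_beta * io_C[l_m*i_n + l_n];
    }
}
```

With ALPHA = BETA = 1 this accumulates into C exactly like the implementations of the previous sections.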
Optional Task
Benchmark the performance of the matrix-matrix multiplication from OpenBLAS.
Use the OpenBLAS function cblas_dgemm to perform double-precision matrix-matrix multiplications.
Benchmark the performance for square matrices of size M = N = K from 2 to 1024.
Compare the performance of OpenBLAS with your own implementations.
Hint
To install OpenBLAS you can execute the following commands:
wget https://github.com/OpenMathLib/OpenBLAS/archive/refs/tags/v0.3.26.tar.gz
tar -xvf v0.3.26.tar.gz
cd OpenBLAS-0.3.26
make
make PREFIX=$(pwd)/../openblas_x86 install
cd ..
You can link against OpenBLAS by using the flags -I path/to/openblas_x86/include, -L path/to/openblas_x86/lib and -lopenblas.