4. Benchmarking
Benchmarking is a crucial practice for assessing and improving the performance of systems and processes. There are different types of benchmarking, each tailored to specific objectives. First, clearly define what you want to benchmark, whether it is a specific algorithm, a library, or a piece of code. Then decide which metrics to measure, such as execution time, memory usage, or CPU utilization. This process helps you identify bottlenecks and optimize your code for better performance.
4.1. Memory Bandwidth
STREAM Benchmark
The STREAM benchmark is a common way to measure the main memory bandwidth of a system. Its triad kernel performs a simple mathematical operation on three arrays. In C++, you can use this benchmark to assess the memory performance of your system, especially in applications where memory access speed is critical, such as scientific computing or other data-intensive tasks.
The triad operation is defined as C = A + s * B, where A, B, and C are arrays and s is a scalar. It is often used to measure the memory bandwidth of a system because it reads data from two arrays (A and B), performs a scalar multiply-add, and writes the result to a third array (C). The following example implements a simple triad benchmark in C++ and uses the <chrono> library for time measurement.
#include <iostream>
#include <chrono>
#include <cstdlib>

int main() {
  const int l_s = 1000000;   // number of elements per array
  double l_scalar = 2.0;

  double* l_A = new double[l_s];
  double* l_B = new double[l_s];
  double* l_C = new double[l_s];

  // initialize the input arrays with random values in [0, 1]
  for (int i = 0; i < l_s; ++i) {
    l_A[i] = static_cast<double>(std::rand()) / RAND_MAX;
    l_B[i] = static_cast<double>(std::rand()) / RAND_MAX;
    l_C[i] = 0.0;
  }

  // time the triad kernel: C = A + s * B
  auto l_start_time = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < l_s; ++i) {
    l_C[i] = l_A[i] + l_scalar * l_B[i];
  }
  auto l_end_time = std::chrono::high_resolution_clock::now();
  std::chrono::duration<double> l_duration = l_end_time - l_start_time;

  // three arrays of doubles are touched per element (two loads, one store)
  double l_data_access_speed = 3.0 * l_s * sizeof(double) / l_duration.count() / (1024 * 1024 * 1024);
  std::cout << "STREAM Benchmark: " << l_data_access_speed << " GB/s" << std::endl;

  delete[] l_A;
  delete[] l_B;
  delete[] l_C;

  return 0;
}
Task
Part 1: Implementing the Memory Benchmark
Read about various memory bandwidth benchmarking tools and methods and give a short report on them.
Develop a C/C++ program that measures memory bandwidth using one of the benchmarks, for example STREAM Triad.
The program should compute the memory bandwidth by iterating over different array sizes (at least 30 varied sizes, spanning the L1, L2, and L3 cache levels).
Store the results in a CSV file named “memory_bandwidth.csv” with the headers “Array Size (bytes)” and “Bandwidth (GB/s)”. (If you prefer to report GB/cycle, don’t forget to state the CPU frequency.)
Use at least 10000 iterations per measurement. (A minimal sketch of the size sweep and CSV output follows this list.)
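The following sketch shows one possible way to structure the size sweep and the CSV output around the triad kernel from above. It is only an illustration under assumptions: the starting size, the growth factor, the fixed repeat count, and the checksum trick are choices made for this sketch, not a reference solution.

#include <chrono>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <vector>

int main() {
  std::ofstream l_csv("memory_bandwidth.csv");
  l_csv << "Array Size (bytes),Bandwidth (GB/s)\n";

  const double l_scalar = 2.0;
  double l_sink = 0.0;                 // prevents the kernel from being optimized away

  // sweep roughly 30 array sizes from a few KiB (fits into L1) to well beyond L3
  std::size_t l_n = 512;               // 512 doubles = 4 KiB per array (assumed starting size)
  for (int l_step = 0; l_step < 30; ++l_step) {
    std::vector<double> l_A(l_n, 1.0), l_B(l_n, 2.0), l_C(l_n, 0.0);

    const int l_iters = 10000;         // for very large arrays you may lower this to keep the runtime manageable
    auto l_start = std::chrono::high_resolution_clock::now();
    for (int l_it = 0; l_it < l_iters; ++l_it) {
      for (std::size_t l_i = 0; l_i < l_n; ++l_i) {
        l_C[l_i] = l_A[l_i] + l_scalar * l_B[l_i];
      }
      l_sink += l_C[l_n / 2];
    }
    auto l_end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> l_time = l_end - l_start;

    // three arrays of doubles are touched per element and iteration (two loads, one store)
    double l_bytes = 3.0 * l_n * sizeof(double) * l_iters;
    double l_gbs   = l_bytes / l_time.count() / (1024.0 * 1024.0 * 1024.0);

    l_csv << l_n * sizeof(double) << "," << l_gbs << "\n";

    l_n += l_n / 2;                    // grow the array size by roughly 1.5x per step
  }

  std::cout << "checksum: " << l_sink << std::endl;
  return 0;
}

Geometrically growing sizes make the transitions between the cache levels easy to see on a logarithmic axis in the later plot.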
Part 2: Visualizing the Results
Develop a script (e.g., in Python) to read the CSV file and create a plot of the memory bandwidth data.
Identify the role of the different cache levels in your results.
Save the plot as an image file (e.g., “memory_bandwidth_plot.png”).
Part 3: Slurm Script
Write a Slurm script to run the C++ and Python programs on a full node.
Choose your desired compiler and flags and report them.
Load the necessary modules and use the required sbatch options.
Find the maximum memory bandwidth on your assigned node, share your results, and explain the changes you observe.
Optional: Multi-threaded
If you have experience with OpenMP, you may compare the results obtained with a single thread to those achieved with multiple threads as an optional task (see the sketch below).
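As a hedged illustration only, the triad loop can be parallelized with a single OpenMP directive and compiled with your compiler's OpenMP flag (e.g. -fopenmp for GCC); the function name and the static scheduling are assumptions made for this sketch.

#include <cstddef>

// parallel triad kernel: the iteration space is split statically across the
// OpenMP threads, so every thread streams a contiguous chunk of the arrays
void triad_omp(const double* l_A, const double* l_B, double* l_C,
               double l_scalar, std::size_t l_n) {
#pragma omp parallel for schedule(static)
  for (std::size_t l_i = 0; l_i < l_n; ++l_i) {
    l_C[l_i] = l_A[l_i] + l_scalar * l_B[l_i];
  }
}

Running the same size sweep once with OMP_NUM_THREADS=1 and once with all cores of the node then gives the single- versus multi-threaded comparison.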
4.2. likwid
Likwid is a user-friendly benchmarking tool and framework for creating multi-threaded assembly kernels, aimed at performance-oriented developers. It supports Intel, AMD, ARMv8, and POWER9 processors on Linux, with additional Nvidia GPU support.
It includes:
likwid-topology: Displays thread, cache, and NUMA topology information.
likwid-bench: A micro-benchmarking platform for various CPU architectures.
and more…
likwid-topology
To get more information and help, you can use the following command:
likwid-topology -h
For basic hardware thread information:
likwid-topology
HWThread: Numbers the processors.
Thread: SMT thread number inside a core.
Core: Physical CPU core number.
Die: The die IDs. In modern architectures, one socket might contain one or more dies assembled together.
Socket: Socket numbers of the hardware threads.
In addition, the output provides information such as the cache and NUMA topology.
Get more information about the caches:
likwid-topology -c
Get graphical output:
likwid-topology -g
Storing the outputs:
Store the output in CSV format:
likwid-topology -O
Store the output in a file of your choice:
likwid-topology -o <file_name>
likwid-bench
To get more information and help, you can use the following command:
likwid-bench -h
For a list of all available benchmark kernels, use:
likwid-bench -a
The simplest form to run a benchmark is:
likwid-bench -t stream -w S0:50kB
A workgroup is defined using the format <domain>:<size>:<nrThreads (optional)>. You can specify multiple workgroups.
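For example, following this format, a stream run on socket 0 with a 1 GB working set and two threads could look like the command below (the domain and thread count are illustrative; adjust them to your node):
likwid-bench -t stream -w S0:1GB:2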
The result section contains the following output:
CPU Clock: The CPU frequency at the starting point.
Time: Benchmark runtime determined using the Cycles and Cycle clock values.
Iterations: The sum of iterations performed by all threads.
Iterations per thread: The number of inner loop executions per thread.
Inner loop executions: The number of iterations performed in the internal loop, varying with the working set size and the data processed in each inner loop iteration.
Size (Byte): Total working set size.
Size per thread: Equal distribution of the working set per thread.
Number of Flops: Number of floating-point operations.
MFlops/s: Floating-point operations per second (Number of Flops / Time).
Data volume (Byte): Processed data volume (Size (Byte) * Iterations per thread).
MByte/s: Bandwidth during the benchmark, considering only the application’s perspective.
Loads per update: Number of data items loaded to update one item in the output vector.
Stores per update: Number of stores performed for one update.
Load bytes per element: The volume of data loaded with each update.
Store bytes per element: The volume of data stored with each update.
Load/store ratio: Ratio of loaded and stored data items (Loads per update / Stores per update = Load bytes per element / Store bytes per element).
Instructions: Number of instructions executed during the benchmark, including only assembly kernel instructions.
UOPs: Number of micro-ops executed during the benchmark, including only assembly kernel instructions.
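As a rough worked example (an illustration based on the triad kernel introduced earlier, not the output of a specific run): a kernel that reads two vectors and writes one has Loads per update = 2 and Stores per update = 1, so the Load/store ratio is 2; with double-precision elements, Load bytes per element = 16 and Store bytes per element = 8.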