4. Benchmarking
Benchmarking is a crucial practice for assessing and improving the performance of systems and processes. There are different types of benchmarking, each tailored to specific objectives. First, clearly define what you want to benchmark, whether it is a specific algorithm, a library, or a piece of code. Then decide which metrics to measure, such as execution time, memory usage, or CPU utilization. This process helps you identify bottlenecks and optimize your code for better performance.
4.1. Memory Bandwidth
STREAM Benchmark
The STREAM benchmark is a common way to measure the main memory bandwidth of a system. Its triad kernel performs a simple mathematical operation on three arrays. In C++, you can use this benchmark to assess the memory performance of your system, especially in applications where memory access speed is critical, such as scientific computing or other data-intensive tasks.
The triad operation is defined as C = A + s * B, where A, B, and C are arrays and s is a scalar. It is often used to measure the memory bandwidth of a system because it reads data from two arrays (A and B), performs a scalar multiply-add, and writes the result to a third array (C). The following example implements a simple triad benchmark in C++ and uses the <chrono> library for time measurement.
#include <iostream>
#include <chrono>
#include <cstdlib>

int main() {
  const int l_s = 1000000;   // number of elements per array
  double l_scalar = 2.0;

  double* l_A = new double[l_s];
  double* l_B = new double[l_s];
  double* l_C = new double[l_s];

  // initialize the input arrays with random values in [0, 1]
  for (int i = 0; i < l_s; ++i) {
    l_A[i] = static_cast<double>(std::rand()) / RAND_MAX;
    l_B[i] = static_cast<double>(std::rand()) / RAND_MAX;
    l_C[i] = 0.0;
  }

  // time the triad kernel: C = A + s * B
  auto l_start_time = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < l_s; ++i) {
    l_C[i] = l_A[i] + l_scalar * l_B[i];
  }
  auto l_end_time = std::chrono::high_resolution_clock::now();
  std::chrono::duration<double> l_duration = l_end_time - l_start_time;

  // three arrays of doubles are touched per element (two loads, one store)
  double l_data_access_speed = 3.0 * l_s * sizeof(double) / l_duration.count() / (1024 * 1024 * 1024);
  std::cout << "STREAM Benchmark: " << l_data_access_speed << " GB/s" << std::endl;

  delete[] l_A;
  delete[] l_B;
  delete[] l_C;

  return 0;
}
Task
Part 1: Implementing the Memory Benchmark
Read about various memory bandwidth benchmarking tools and methods and give a short report on them.
Develop a C/C++ program that measures memory bandwidth using one of the benchmarks, for example STREAM Triad.
The program should compute the memory bandwidth by iterating over different array sizes (at least 30 varied sizes, spanning the L1, L2, and L3 cache levels).
Store the results in a CSV file named “memory_bandwidth.csv” with the headers “Array Size (bytes)” and “Bandwidth (GB/s)”. (If you prefer to report GB/cycle, don’t forget to state the CPU frequency.)
Use at least 10000 iterations per measurement. (A minimal sketch of the size sweep and CSV output follows this list.)
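The following sketch shows one possible way to structure the size sweep and the CSV output around the triad kernel from above. It is only an illustration under assumptions: the starting size, the growth factor, the fixed repeat count, and the checksum trick are choices made for this sketch, not a reference solution.

#include <chrono>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <vector>

int main() {
  std::ofstream l_csv("memory_bandwidth.csv");
  l_csv << "Array Size (bytes),Bandwidth (GB/s)\n";

  const double l_scalar = 2.0;
  double l_sink = 0.0;                 // prevents the kernel from being optimized away

  // sweep roughly 30 array sizes from a few KiB (fits into L1) to well beyond L3
  std::size_t l_n = 512;               // 512 doubles = 4 KiB per array (assumed starting size)
  for (int l_step = 0; l_step < 30; ++l_step) {
    std::vector<double> l_A(l_n, 1.0), l_B(l_n, 2.0), l_C(l_n, 0.0);

    const int l_iters = 10000;         // for very large arrays you may lower this to keep the runtime manageable
    auto l_start = std::chrono::high_resolution_clock::now();
    for (int l_it = 0; l_it < l_iters; ++l_it) {
      for (std::size_t l_i = 0; l_i < l_n; ++l_i) {
        l_C[l_i] = l_A[l_i] + l_scalar * l_B[l_i];
      }
      l_sink += l_C[l_n / 2];
    }
    auto l_end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> l_time = l_end - l_start;

    // three arrays of doubles are touched per element and iteration (two loads, one store)
    double l_bytes = 3.0 * l_n * sizeof(double) * l_iters;
    double l_gbs   = l_bytes / l_time.count() / (1024.0 * 1024.0 * 1024.0);

    l_csv << l_n * sizeof(double) << "," << l_gbs << "\n";

    l_n += l_n / 2;                    // grow the array size by roughly 1.5x per step
  }

  std::cout << "checksum: " << l_sink << std::endl;
  return 0;
}

Geometrically growing sizes make the transitions between the cache levels easy to see on a logarithmic axis in the later plot.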
Part 2: Visualizing the Results
Develop a script (e.g., in Python) to read the CSV file and create a plot of the memory bandwidth data.
Identify the role of the different cache levels in your results.
Save the plot as an image file (e.g., “memory_bandwidth_plot.png”).
Part 3: Slurm Script
Write a Slurm script to run the C++ and Python programs on a full node.
Choose your desired compiler and flags and report them.
Load the necessary modules and use the required sbatch options.
Find the maximum memory bandwidth on your assigned node, share your results, and explain the changes you observe.
Optional: Multi-threaded
If you have experience with OpenMP, you may compare the results obtained with a single thread to those achieved with multiple threads as an optional task (see the sketch below).
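As a hedged illustration only, the triad loop can be parallelized with a single OpenMP directive and compiled with your compiler's OpenMP flag (e.g. -fopenmp for GCC); the function name and the static scheduling are assumptions made for this sketch.

#include <cstddef>

// parallel triad kernel: the iteration space is split statically across the
// OpenMP threads, so every thread streams a contiguous chunk of the arrays
void triad_omp(const double* l_A, const double* l_B, double* l_C,
               double l_scalar, std::size_t l_n) {
#pragma omp parallel for schedule(static)
  for (std::size_t l_i = 0; l_i < l_n; ++l_i) {
    l_C[l_i] = l_A[l_i] + l_scalar * l_B[l_i];
  }
}

Running the same size sweep once with OMP_NUM_THREADS=1 and once with all cores of the node then gives the single- versus multi-threaded comparison.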
4.2. likwid
Likwid is a user-friendly benchmarking tool and framework for creating multi-threaded assembly kernels, aimed at performance-oriented developers. It supports Intel, AMD, ARMv8, and POWER9 processors on Linux, with additional Nvidia GPU support.
It includes:
likwid-topology: Displays thread, cache, and NUMA topology information.
likwid-bench: A micro-benchmarking platform for various CPU architectures.
and more…
likwid-topology
To get more information and help, you can use the following command:
likwid-topology -h
For basic hardware thread information:
likwid-topology
HWThread: Numbers the processors.
Thread: SMT thread number inside a core.
Core: Physical CPU core number.
Die: The die IDs. In modern architectures, one socket might contain one or more dies assembled together.
Socket: Socket numbers of the hardware threads.
In addition, the output provides information such as the cache and NUMA topology.
Get more information about the caches:
likwid-topology -c
Get graphical output:
likwid-topology -g
Storing the outputs:
Store the output in CSV format:
likwid-topology -O
Store the output in a file of your choice:
likwid-topology -o <file_name>
likwid-bench
To get more information and help, you can use the following command:
likwid-bench -h
For a list of all available benchmark kernels, use:
likwid-bench -a
The simplest form to run a benchmark is:
likwid-bench -t stream -w S0:50kB
A workgroup is defined using the format <domain>:<size>:<nrThreads (optional)>. You can specify multiple workgroups.
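For example, following this format, a stream run on socket 0 with a 1 GB working set and two threads could look like the command below (the domain and thread count are illustrative; adjust them to your node):
likwid-bench -t stream -w S0:1GB:2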
The result section contains the following output:
CPU Clock: The CPU frequency at the starting point.
Time: Benchmark runtime determined using the Cycles and Cycle clock values.
Iterations: The sum of iterations performed by all threads.
Iterations per thread: The number of inner loop executions per thread.
Inner loop executions: The number of iterations performed in the internal loop, varying with the working set size and the data processed in each inner loop iteration.
Size (Byte): Total working set size.
Size per thread: Equal distribution of the working set per thread.
Number of Flops: Number of floating-point operations.
MFlops/s: Floating-point operations per second (Number of Flops / Time).
Data volume (Byte): Processed data volume (Size (Byte) * Iterations per thread).
MByte/s: Bandwidth during the benchmark, considering only the application’s perspective.
Loads per update: Number of data items loaded to update one item in the output vector.
Stores per update: Number of stores performed for one update.
Load bytes per element: The volume of data loaded with each update.
Store bytes per element: The volume of data stored with each update.
Load/store ratio: Ratio of loaded and stored data items (Loads per update / Stores per update = Load bytes per element / Store bytes per element).
Instructions: Number of instructions executed during the benchmark, including only assembly kernel instructions.
UOPs: Number of micro-ops executed during the benchmark, including only assembly kernel instructions.
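As a rough worked example (an illustration based on the triad kernel introduced earlier, not the output of a specific run): a kernel that reads two vectors and writes one has Loads per update = 2 and Stores per update = 1, so the Load/store ratio is 2; with double-precision elements, Load bytes per element = 16 and Store bytes per element = 8.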