14. Optimizing Data Transfers

When optimizing small matrix multiplications, we obtained clean measurements of our low-level kernels by repeatedly calling our functions on the same data. For the considered matrix sizes, we effectively operated on a hot L1 cache. Transfers from main memory or higher cache levels, as they occur in production codes, were not part of our studies.

In contrast, this section studies the memory subsystem itself. We’ll see that its performance is highly heterogeneous. This observation guides the development and optimization of virtually all HPC software: “Maximize temporal and spatial data locality”.

14.1. Graviton2’s Memory Subsystem

A modern processor has different cache levels, typically an L1, an L2 and a last-level (LL) cache. Lower cache levels have higher performance but are smaller in size. In this part we’ll have a look at the theoretical numbers of Graviton2’s memory subsystem. Once this is done, we’ll, at least partially, benchmark the processor’s memory subsystem in the next part.

Tasks

  1. Look up the sizes, latencies and bandwidths of N1’s per-core L1 and L2. Details are provided in the HC31 slides Arm Neoverse N1 Cloud-to-Edge Infrastructure SoCs and the white paper The Arm Neoverse N1 Platform: Building Blocks for the Next-Gen Cloud-to-Edge Infrastructure SoC.

  2. Assume that you have data in either L1 or L2. What is the theoretical bandwidth you can expect when transferring the data to and from registers? A small sketch of this back-of-the-envelope calculation is given after the task list.

  3. Read about N1’s mesh interconnect CMN-600 and System Level Cache (SLC). How much data can you fit into the SLC of a Graviton2 processor? What is its aggregate bandwidth?

  4. Look up the main memory configuration of a full Graviton2 node. How much DRAM is available? What are expected latencies and bandwidths when accessing main memory? The article “Amazon’s Arm-based Graviton2 against AMD and Intel: Comparing Cloud Compute” at AnandTech might provide additional hints and measured numbers.
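
The back-of-the-envelope calculation in the second task boils down to multiplying the number of bytes a core can load or store per cycle by the clock frequency. The following C snippet only illustrates this arithmetic; the per-cycle widths and the clock frequency are placeholders that you should replace with the values you looked up for the Neoverse N1 core and Graviton2.

    /*
     * Sketch of the theoretical-bandwidth calculation:
     *   bandwidth = bytes per cycle * clock frequency.
     * All values below are placeholders -- replace them with the numbers you
     * looked up (e.g. how many 128-bit loads and stores the L1 can serve per
     * cycle, and Graviton2's clock frequency).
     */
    #include <stdio.h>

    int main() {
      double freq_ghz        = 2.5;  /* placeholder: clock frequency in GHz */
      double load_bytes_cyc  = 32.0; /* placeholder: bytes loaded per cycle */
      double store_bytes_cyc = 16.0; /* placeholder: bytes stored per cycle */

      printf( "load  bandwidth: %.1f GB/s\n", load_bytes_cyc  * freq_ghz );
      printf( "store bandwidth: %.1f GB/s\n", store_bytes_cyc * freq_ghz );

      return 0;
    }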

14.2. Benchmarking the Memory Subsystem

In this part we’ll benchmark the bandwidth of the caches and main memory. The triad example of the previous sections, also called Stream-triad, is the most commonly used benchmark for this purpose in the literature. Remember that the triad operates on three arrays: it reads two of them and writes to the third one (a minimal C version is sketched right after the list below). Therefore, we can benchmark the bandwidths of our memory layers by choosing the size of the data set such that all three arrays fit into the respective cache level or only into main memory. Additionally, we have to remember that L1 and L2 are private resources, duplicated in every core. In contrast, the SLC and main memory are shared among all cores. In practice this means:

  • To measure the full aggregate bandwidth of a cache-level or main memory, we have to use all cores.

  • When running on a single core, we can already fully harness the L1 and L2 of that core. The aggregate L1 and L2 bandwidth is expected to scale linearly with the number of used cores.

  • A single core won’t obtain the peak bandwidth of the SLC or main memory. For this, we have to use a “sufficient” number of cores. Especially for main memory, a subset of all available cores is typically enough to reach peak bandwidth.
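
The following minimal serial C sketch shows the Stream-triad benchmark referred to above: the kernel a[i] = b[i] + s * c[i] reads two arrays and writes one. The array length N, the repetition count and the simple timing approach are choices of this sketch, not prescribed values; N has to be adapted to the memory level under test.

    /*
     * Minimal sketch of a serial Stream-triad benchmark: a[i] = b[i] + s * c[i].
     * N and N_REPEATS are placeholders: pick N such that 3 * N * sizeof(double)
     * fits into the memory level under test (L1, L2, SLC), or exceeds all
     * caches for the main-memory measurement.
     */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N         2048     /* number of values per array (placeholder)    */
    #define N_REPEATS 100000   /* repetitions for a stable timing (placeholder) */

    int main() {
      double *a = malloc( N * sizeof(double) );
      double *b = malloc( N * sizeof(double) );
      double *c = malloc( N * sizeof(double) );
      double  s = 0.5;

      for( long i = 0; i < N; i++ ) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

      struct timespec t0, t1;
      clock_gettime( CLOCK_MONOTONIC, &t0 );
      for( long r = 0; r < N_REPEATS; r++ ) {
        for( long i = 0; i < N; i++ ) {
          a[i] = b[i] + s * c[i];
        }
      }
      clock_gettime( CLOCK_MONOTONIC, &t1 );

      double seconds = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
      /* 2 loads + 1 store per element, i.e. 3 * 8 bytes per triad iteration
       * (write-allocate traffic is ignored, as is common for STREAM). */
      double gib = 3.0 * sizeof(double) * (double)N * (double)N_REPEATS
                   / (1024.0 * 1024.0 * 1024.0);
      printf( "triad bandwidth: %.2f GiB/s\n", gib / seconds );

      free( a ); free( b ); free( c );
      return 0;
    }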

Tasks

  1. Use a single Graviton2 core and measure the bandwidth of L1, L2, the SLC and main memory using the Stream-triad. For your runs, use two implementations:

    • A compiler-optimized C/C++ variant.

    • A manually tuned version written in assembly language.

    What do you observe? Do your measurements confirm the numbers on paper?

    Hints

    • You have to choose the number of values in the three arrays such that all three arrays together fit into the memory level you are benchmarking.

    • Try using LD1 (multiple structures) without post-index offsets for maximum L1-performance.

  2. Now, parallelize the Stream-triad through OpenMP. Use a fixed problem size such that the data only fits into main memory, i.e., is too large for the caches. Study the scaling behavior w.r.t. the number of used cores! A minimal OpenMP sketch is given below.
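
    The sketch below shows one possible OpenMP parallelization of the triad; it is a starting point, not a tuned implementation. The array length is a placeholder, and only a single sweep is timed for simplicity. Compile with -fopenmp and select the number of used cores via OMP_NUM_THREADS.

      /*
       * Minimal sketch of an OpenMP-parallel Stream-triad: a[i] = b[i] + s * c[i].
       * Build e.g. with: gcc -O3 -fopenmp triad_omp.c -o triad_omp
       * Run with:        OMP_NUM_THREADS=<cores> ./triad_omp
       */
      #include <stdio.h>
      #include <stdlib.h>
      #include <omp.h>

      #define N (1L << 27)  /* placeholder: chosen so the three arrays exceed all caches */

      int main() {
        double *a = malloc( N * sizeof(double) );
        double *b = malloc( N * sizeof(double) );
        double *c = malloc( N * sizeof(double) );
        double  s = 0.5;

        /* First-touch initialization in parallel, so pages are touched by the
         * threads that later work on them. */
        #pragma omp parallel for
        for( long i = 0; i < N; i++ ) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for( long i = 0; i < N; i++ ) {
          a[i] = b[i] + s * c[i];
        }
        double t1 = omp_get_wtime();

        /* 2 loads + 1 store per element for a single sweep over the arrays. */
        double gib = 3.0 * sizeof(double) * (double)N / (1024.0 * 1024.0 * 1024.0);
        printf( "threads: %d, triad bandwidth: %.2f GiB/s\n",
                omp_get_max_threads(), gib / (t1 - t0) );

        free( a ); free( b ); free( c );
        return 0;
      }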