13. Memory Hierarchy

This lab studies the memory hierarchy of the Raspberry Pi 4 Model B boards used in class. Section 13.1 benchmarks the bandwidth of the respective memory levels by repeatedly reading the same data from cache or main memory. Next, Section 13.2 follows the ideas of the lectures to highlight issues which arise if spatial locality is not kept in mind. The design and implementation of a cache-aware algorithm which transposes a matrix concludes the lab in Section 13.3.

13.1. Temporal Locality

In the lectures, we repeatedly computed the sum \(\sum_{k=0}^{n-1} a_k\) of an array \(\vec{a} = (a_0, \ldots, a_{n-1})\) to benchmark the read bandwidth of M2’s cache levels and main memory. Since the benchmark computed the sum repeatedly, the data remained in the respective memory level. Thus, we harnessed temporal locality: arrays which are small enough fit into the faster cache levels and are served from there on every repetition.
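
A minimal version of this kernel, in C++, might look as follows. The function name sum is our own choice; in the actual benchmark, the returned value should be consumed (e.g., printed) so that the compiler cannot remove the repeated summations.

```cpp
#include <cstddef>
#include <cstdint>

// Sums the n elements of a. Calling this repeatedly on the same array keeps
// the data in the memory level it fits into (temporal locality).
uint64_t sum( uint64_t const * a,
              std::size_t      n ) {
  uint64_t result = 0;
  for( std::size_t k = 0; k < n; k++ ) {
    result += a[k];
  }
  return result;
}
```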

M2 is a heterogeneous System on Chip (SoC) featuring four Avalanche performance cores and four Blizzard efficiency cores. We have seen that the two core types’ cache specifications differ. In this part of the lab, we repeat the performance benchmarking using the class’s Raspberry Pis. In contrast to M2, the Pis are homogeneous with respect to their cores since all of them use the same Cortex-A72 microarchitecture.

Tasks

  1. Look up the L1 and L2 cache specifications of the cores used in the Raspberry Pi 4 Model B. What cache line size is used?

  2. Benchmark the read bandwidth of the Raspberry Pis for different data sizes in the range from 4,096 bytes to 2 GiB. Run each benchmark configuration for at least one second (a possible structure is sketched after this list).

  3. Visualize the obtained results.
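
The following sketch shows one possible structure for the bandwidth benchmark of task 2. The use of std::chrono, the doubling of the data size in every step, and the initialization of the array are our own assumptions; note that the 2 GiB configuration requires a board with more than 2 GiB of main memory.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Reads the array once and returns its sum (see the kernel in Section 13.1).
uint64_t sum( uint64_t const * a,
              std::size_t      n ) {
  uint64_t result = 0;
  for( std::size_t k = 0; k < n; k++ ) {
    result += a[k];
  }
  return result;
}

int main() {
  // sweep the data size from 4 KiB to 2 GiB, doubling in every step
  uint64_t const max_size = 2ULL * 1024 * 1024 * 1024;

  for( uint64_t size = 4096; size <= max_size; size *= 2 ) {
    std::size_t n = size / sizeof( uint64_t );
    std::vector< uint64_t > data( n, 1 );

    uint64_t checksum = 0;
    uint64_t n_reps = 0;
    double elapsed = 0.0;

    auto start = std::chrono::steady_clock::now();
    // repeat the read until this configuration has run for at least one second
    while( elapsed < 1.0 ) {
      checksum += sum( data.data(), n );
      n_reps++;
      elapsed = std::chrono::duration< double >( std::chrono::steady_clock::now() - start ).count();
    }

    // bandwidth: bytes read per second, reported in GiB/s
    double gib_per_s = (double) n_reps * (double) size / elapsed / (1024.0 * 1024.0 * 1024.0);
    std::cout << size << " bytes: " << gib_per_s << " GiB/s (checksum " << checksum << ")\n";
  }
  return 0;
}
```

Printing the checksum prevents the compiler from discarding the summations; the measured bandwidths for the different sizes are the data to visualize in task 3.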

13.2. Spatial Locality

This section uses the dot product \(\vec{a} \cdot \vec{b} = \sum_{k=0}^{n-1} a_k b_k\) of the two vectors \(\vec{a} = (a_0, \ldots, a_{n-1})\) and \(\vec{b} = (b_0, \ldots, b_{n-1})\) as running example. We choose the size \(n\) of the two vectors sufficiently large such that they do not fit into any cache level. Specifically, we use \(n=134,217,728\) and uint64_t elements such that each vector requires 1 GiB of memory. Similar to the lectures’ example which incremented the values of an array, we study the performance of different implementations (a sketch covering all of them is given after the list):

  1. \(\sum_{k=0}^{n-1} a_k b_k\)

  2. \(\left( \sum_{k=0}^{n/2-1} a_{2k} b_{2k} \right) + \left( \sum_{k=0}^{n/2-1} a_{2k+1} b_{2k+1} \right)\)

  3. \(\sum_{l=0}^{3} \left( \sum_{k=0}^{n/4-1} a_{4k+l} b_{4k+l} \right)\)

  4. \(\sum_{l=0}^{7} \left( \sum_{k=0}^{n/8-1} a_{8k+l} b_{8k+l} \right)\)

  5. \(\sum_{l=0}^{15} \left( \sum_{k=0}^{n/16-1} a_{16k+l} b_{16k+l} \right)\)
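
All five variants can be expressed through a single strided kernel. The sketch below is one possible parameterization (the function name dot_strided is our own); stride values of 1, 2, 4, 8, and 16 reproduce implementations 1 to 5.

```cpp
#include <cstddef>
#include <cstdint>

// Computes the dot product of a and b in 'stride' passes over the data:
// pass l touches the elements l, l + stride, l + 2*stride, ...
// stride = 1 corresponds to implementation 1, stride = 16 to implementation 5.
uint64_t dot_strided( uint64_t const * a,
                      uint64_t const * b,
                      std::size_t      n,
                      std::size_t      stride ) {
  uint64_t result = 0;
  for( std::size_t l = 0; l < stride; l++ ) {
    for( std::size_t k = l; k < n; k += stride ) {
      result += a[k] * b[k];
    }
  }
  return result;
}
```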

Tasks

  1. Assume that \(\vec{a}\) and \(\vec{b}\) are aligned to the cache line size and that the first implementation requires 1s per dot product. What performance do you expect for implementations 2–5 when running on a core of a Raspberry Pi 4 Model B? Why?

  2. Implement the five dot product variants. Use arrays with 134,217,728 elements of type uint64_t. Run your implementation on one of the class’s Pis.

13.3. Matrix Transpose

Section 13.2’s examples are a bit artificial: presumably all of us would have opted for the first, and fastest, implementation right away. However, in general, the most straightforward implementation is not always the fastest one. One such example is the transposition of a large matrix. In this part of the lab, we are given a vanilla implementation which sets \(B=A^T\) for a \(16,384 \times 16,384\) matrix \(A\) with uint64_t elements. Our goal is to implement a faster version which better exploits spatial locality.
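
We do not reproduce the handed-out code here, but a vanilla transpose typically boils down to two nested loops similar to the following sketch; the row-major storage, with element \(A_{ij}\) at position a[i*n + j], is an assumption about the given code.

```cpp
#include <cstddef>
#include <cstdint>

// Vanilla transpose: B = A^T for an n x n matrix in row-major storage,
// i.e., A_ij is stored at a[i*n + j].
void transpose( uint64_t const * a,
                uint64_t       * b,
                std::size_t      n ) {
  for( std::size_t i = 0; i < n; i++ ) {
    for( std::size_t j = 0; j < n; j++ ) {
      b[j*n + i] = a[i*n + j];
    }
  }
}
```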

Tasks

  1. Have a look at the vanilla implementation. Which operation is troublesome? Reading the data of \(A\) or writing \(B\)? Why?

  2. Design and implement a new version which is cache-aware. Benchmark your improved implementation!
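
One common cache-aware strategy is loop blocking (tiling): the matrices are processed in small square tiles such that the touched parts of \(A\) and \(B\) stay in cache while one tile is transposed. The sketch below illustrates the idea; the tile size of 64 elements is a placeholder to be tuned on the Pis, and the code assumes that n is a multiple of the tile size.

```cpp
#include <cstddef>
#include <cstdint>

// Blocked transpose: B = A^T, processed in block_size x block_size tiles.
// Assumes row-major storage and that n is a multiple of block_size.
void transpose_blocked( uint64_t const * a,
                        uint64_t       * b,
                        std::size_t      n,
                        std::size_t      block_size = 64 ) {
  for( std::size_t ib = 0; ib < n; ib += block_size ) {
    for( std::size_t jb = 0; jb < n; jb += block_size ) {
      // transpose one tile: both the reads from A and the writes to B
      // stay within a small working set while the tile is processed
      for( std::size_t i = ib; i < ib + block_size; i++ ) {
        for( std::size_t j = jb; j < jb + block_size; j++ ) {
          b[j*n + i] = a[i*n + j];
        }
      }
    }
  }
}
```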