13. Memory System

This lab covers the memory system. We begin with cache fundamentals in Section 13.1 and then study the memory hierarchy of the Raspberry Pi 5: Section 13.2 benchmarks the bandwidth of the individual memory levels by repeatedly reading the same data from cache or main memory, and Section 13.3 follows the ideas from the lectures to highlight the issues that arise when spatial locality is ignored. The design and implementation of a cache-aware matrix transpose concludes the lab in Section 13.4.

13.1. Cache Fundamentals

Table 13.1.1 L1 Cache Specifications of two CPU cores.

|                 | Core A              | Core B                       |
|-----------------|---------------------|------------------------------|
| Address width   | 12 bits             | 33 bits                      |
| Cache size      | 128 bytes           | 64 KiB                       |
| Cache line size | 8 bytes             | 64 bytes                     |
| Associativity   | 4-way               | 4-way                        |
| Write policy    | Write-back          | Write-back                   |
| Write miss      | Write-allocate      | Write-allocate               |
| Replacement     | Least Recently Used | (Pseudo) Least Recently Used |

Consider two different CPU cores with the L1 cache specifications given in Table 13.1.1. For Core A, initially the cache is empty, that is, the valid bit of all cache lines is set to 0. Now, we perform the following set of 8-bit loads and stores:

Table 13.1.2 Instructions executed on Core A.

| # | Instruction     | Tag | Index | Offset | Hit/Miss |
|---|-----------------|-----|-------|--------|----------|
| 1 | LD(0x000)       |     |       |        |          |
| 2 | LD(0x020)       |     |       |        |          |
| 3 | ST(0x024, 0xaa) |     |       |        |          |
| 4 | LD(0x040)       |     |       |        |          |
| 5 | LD(0x060)       |     |       |        |          |
| 6 | LD(0x003)       |     |       |        |          |
| 7 | ST(0x080, 0xbb) |     |       |        |          |
| 8 | LD(0x040)       |     |       |        |          |

Tasks

  1. Answer the following questions for Core A and Core B:

    1. What is the size of the physical memory?

    2. How many sets does the cache have?

    3. How many offset bits are there?

    4. How many index bits are there?

    5. How many tag bits are there?

  2. Fill in the missing cells of Table 13.1.2; a decomposition sketch follows this task list.

  3. Run the instructions using the Cache Simulator and verify your results.
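
For Core A, filling in Table 13.1.2 amounts to slicing each 12-bit address into tag, index, and offset fields. As a cross-check before running the Cache Simulator, the following minimal C++ sketch performs this decomposition; note that it hard-codes the Core A field widths that Task 1 asks you to derive.

```cpp
#include <cstdio>

// Core A: 8-byte lines -> 3 offset bits; 128 B / (8 B x 4 ways) = 4 sets
// -> 2 index bits; the remaining 12 - 3 - 2 = 7 bits form the tag.
int main() {
  const unsigned addrs[] = {0x000, 0x020, 0x024, 0x040,
                            0x060, 0x003, 0x080, 0x040};
  for (unsigned addr : addrs) {
    unsigned offset = addr & 0x7;         // bits [2:0]
    unsigned index  = (addr >> 3) & 0x3;  // bits [4:3]
    unsigned tag    = addr >> 5;          // bits [11:5]
    std::printf("0x%03x -> tag 0x%02x, index %u, offset %u\n",
                addr, tag, index, offset);
  }
  return 0;
}
```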

13.2. Temporal Locality

In this part, we repeatedly compute the sum \(\sum_{k=0}^{n-1} a_k\) of an array \(\vec{a} = (a_0, \ldots, a_{n-1})\) to benchmark the read bandwidth of the cache levels and of main memory. Since the benchmark computes the sum repeatedly, the data remains in the respective memory level. We thus exploit temporal locality: sufficiently small arrays fit entirely into the lower, faster cache levels.
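
A minimal sketch of such a benchmark follows; the helper name readBandwidth is an assumption, and a tuned benchmark would additionally pin the thread to a core and unroll the summation loop. Compile with optimizations (e.g., -O2).

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Sums the same array over and over; for small sizes the data stays
// resident in a low cache level. Returns the read bandwidth in GiB/s.
double readBandwidth(std::size_t n_bytes) {
  std::vector<uint64_t> a(n_bytes / sizeof(uint64_t), 1);

  volatile uint64_t sink = 0;  // keeps the compiler from dropping the sum
  std::size_t reps = 0;

  auto start = std::chrono::steady_clock::now();
  std::chrono::duration<double> elapsed{0.0};
  do {  // run each configuration for at least one second
    sink = std::accumulate(a.begin(), a.end(), uint64_t{0});
    ++reps;
    elapsed = std::chrono::steady_clock::now() - start;
  } while (elapsed.count() < 1.0);

  double bytes_read = static_cast<double>(n_bytes) * reps;
  return bytes_read / elapsed.count() / (1024.0 * 1024.0 * 1024.0);
}
```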

Tasks

  1. Look up the L1 and L2 cache specifications of the cores used in the Raspberry Pi 5. What cache line size is used?

  2. Benchmark the read bandwidth of one of the course’s Pis for data sizes in the range from 4,096 bytes to 2 GiB. Run each benchmark configuration for at least one second.

  3. Visualize the obtained results.

13.3. Spatial Locality

This section uses the dot product \(\vec{a} \cdot \vec{b} = \sum_{k=0}^{n-1} a_k b_k\) of the two vectors \(\vec{a} = (a_0, \ldots, a_{n-1})\) and \(\vec{b} = (b_0, \ldots, b_{n-1})\) as a running example. We choose the size \(n\) of the two vectors sufficiently large that they do not fit into any cache level: with \(n=134,217,728\) elements of type uint64_t, each vector requires 1 GiB of memory. We study the performance of the following implementations, which all compute the same sum but in different orders (a sketch of these variants follows the list):

  1. \(\sum_{k=0}^{n-1} a_k b_k\)

  2. \(\left( \sum_{k=0}^{n/2-1} a_{2k} b_{2k} \right) + \left( \sum_{k=0}^{n/2-1} a_{2k+1} b_{2k+1} \right)\)

  3. \(\sum_{l=0}^{3} \left( \sum_{k=0}^{n/4-1} a_{4k+l} b_{4k+l} \right)\)

  4. \(\sum_{l=0}^{7} \left( \sum_{k=0}^{n/8-1} a_{8k+l} b_{8k+l} \right)\)

  5. \(\sum_{l=0}^{15} \left( \sum_{k=0}^{n/16-1} a_{16k+l} b_{16k+l} \right)\)
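
A minimal sketch of these variants is given below; dot1 and dotStrided are assumed names, with implementation 1 corresponding to dot1 and implementations 2–5 to dotStrided with stride 2, 4, 8, and 16.

```cpp
#include <cstddef>
#include <cstdint>

// Implementation 1: one pass over both arrays with unit stride.
uint64_t dot1(const uint64_t *a, const uint64_t *b, std::size_t n) {
  uint64_t sum = 0;
  for (std::size_t k = 0; k < n; ++k)
    sum += a[k] * b[k];
  return sum;
}

// Implementations 2-5: `stride` passes over the arrays, where pass l
// touches only the elements a[stride*k + l] and b[stride*k + l].
uint64_t dotStrided(const uint64_t *a, const uint64_t *b,
                    std::size_t n, std::size_t stride) {
  uint64_t sum = 0;
  for (std::size_t l = 0; l < stride; ++l)
    for (std::size_t k = 0; k < n / stride; ++k)
      sum += a[stride * k + l] * b[stride * k + l];
  return sum;
}
```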

Tasks

  1. Assume that \(\vec{a}\) and \(\vec{b}\) are aligned to the cache line size and that the first implementation requires 1 s per dot product. What performance do you expect for implementations 2–5 when running on one of the Raspberry Pi 5’s cores? Why?

  2. Implement the five dot product variants. Use arrays with 134,217,728 elements of type uint64_t. Run your implementation on one of the course’s Pis.

13.4. Matrix Transpose

The examples in Section 13.3 are somewhat artificial: most of us would presumably have chosen the first, and fastest, implementation anyway. In general, however, the most straightforward implementation is not always the fastest one. One such example is the transposition of a large matrix. In this part of the lab, we are given a vanilla implementation that sets \(B=A^T\) for a \(16,384 \times 16,384\) matrix \(A\) with uint64_t elements. Our goal is to implement a faster version which better exploits spatial locality.
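
The handed-out code itself is not reproduced here, but a vanilla transpose in this spirit might look as follows; transposeVanilla and the row-major layout are assumptions.

```cpp
#include <cstddef>
#include <cstdint>

// B = A^T for an n x n row-major matrix: B[j][i] = A[i][j].
// With i as the outer loop, A is traversed row by row while B is
// traversed column by column, i.e., with a stride of n elements.
void transposeVanilla(const uint64_t *A, uint64_t *B, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      B[j * n + i] = A[i * n + j];
}
```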

Tasks

  1. Have a look at the vanilla implementation. Which operation is troublesome: reading the data of \(A\) or writing \(B\)? Why?

  2. Design and implement a new version which is cache-aware; one possible direction is sketched after this list. Benchmark your improved implementation!
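
One common cache-aware approach, not necessarily the intended solution, is loop tiling: transpose the matrix block by block so that the rows of \(A\) and the columns of \(B\) touched by one block stay cached. A sketch, where transposeTiled is an assumed name and the block size of 64 is merely a starting point to be tuned against the Pi’s cache sizes:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Transposes one T x T block at a time (loop tiling). T = 64 is an
// assumption; benchmark different block sizes on the target core.
void transposeTiled(const uint64_t *A, uint64_t *B, std::size_t n) {
  const std::size_t T = 64;
  for (std::size_t ii = 0; ii < n; ii += T)
    for (std::size_t jj = 0; jj < n; jj += T)
      for (std::size_t i = ii; i < std::min(ii + T, n); ++i)
        for (std::size_t j = jj; j < std::min(jj + T, n); ++j)
          B[j * n + i] = A[i * n + j];
}
```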