12. Pipelining

Virtually every core in a modern processor is pipelined. Examples include the execution pipeline of the U7 RISC-V core studied in the lectures and the Cortex-A72 core used in the labs. Studying the cores’ technical manuals, we see that many of their pipeline stages are similar to those of the simple pipelined core discussed in the lectures.

Section 12.1 of this lab starts by simulating the pipelined execution of a code snippet with data hazards. Next, we’ll proceed similarly for a control hazard in Section 12.2. Section 12.3 studies data hazards from a different angle. The Cortex-A72 core relies heavily on pipelining to increase performance. Thus, data hazards might have a large impact on the achievable instruction throughput. Here, we introduce a set of microbenchmarks and illustrate the characteristics of the microarchitecture’s multi-cycle integer pipeline.

12.1. Data Hazards

Listing 12.1.1 Code snippet with data hazards.
orr x0, xzr, #3
sub x1, x0, #5
orr x2, xzr, #2
add x3, x2, x0
and x4, xzr, x4
add x5, x0, #7

Let’s get started by simulating the code snippet in Listing 12.1.1 using our class’s pipelined core. The snippet has read-after-write dependences between some instructions. Since our core has neither a hazard unit nor forwarding support, we’ll run into trouble if we run the code snippet as is. We’ll study two approaches to resolve the data hazards in software:

  1. We might introduce artificial NOPs which don’t change the state of the core. The resulting pipeline bubbles prevent faulty code execution but reduce performance.

  2. One might simply reorder the instructions. This is possible only if the new order does not violate any dependences. Reordering instructions is more involved, and an order which resolves all hazards might not exist. If successful, however, we can fully exploit the core’s pipelined performance.
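The NOP-insertion approach can be sketched in a few lines of Python. The model below is a rough sketch, not the lecture core: it assumes, hypothetically, that a consumer must issue at least HAZARD_DISTANCE slots after its producer. The actual distance depends on the core’s stage count and register-file timing, and the function name nops_needed is our own invention.

```python
# Minimal model of RAW-hazard resolution via NOP insertion in an in-order
# pipeline without forwarding. HAZARD_DISTANCE is an assumption for
# illustration; derive the real value from the lecture core's stages.
HAZARD_DISTANCE = 3

def nops_needed(program):
    """program: list of (dest, [sources]); returns NOPs required before
    each instruction so that no read-after-write hazard remains."""
    out = []
    last_write = {}   # register -> slot of its most recent producer
    pos = 0           # current slot in the stream, counting inserted NOPs
    for dest, srcs in program:
        stall = 0
        for s in srcs:
            if s in last_write:
                gap = pos - last_write[s]
                stall = max(stall, HAZARD_DISTANCE - gap)
        out.append(stall)
        pos += stall
        if dest != "xzr":          # writes to the zero register are discarded
            last_write[dest] = pos
        pos += 1
    return out

# A generic producer-consumer pair, e.g. orr x0, ... followed by sub x1, x0, ...
print(nops_needed([("x0", []), ("x1", ["x0"])]))
```

Under the assumed distance of three slots, the dependent second instruction needs two bubbles in front of it; an instruction whose sources were written long enough ago needs none.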

Tasks

  1. What are the values of registers x0 - x5 after an AArch64-compliant microarchitecture has executed the code in Listing 12.1.1?

  2. Run the code snippet as is on the pipelined core (pipelined_core.dig) designed in the lectures. Provide a screenshot of the core together with the contents of the register file after executing the last instruction.

  3. Adjust Listing 12.1.1’s code snippet by adding a minimal number of NOPs to resolve all data hazards. Run your adjusted snippet on the pipelined core. Once again, provide a screenshot of the results including the contents of the register file directly after the last executed instruction.

  4. Now, instead of introducing pipeline bubbles through NOPs, simply reorder the instructions to resolve the data hazards. Provide a screenshot illustrating the results of your reordered code snippet when using the lecture’s pipelined core.

12.2. Control Hazards

In addition to data hazards, control hazards pose a second challenge for pipelined microarchitectures. They are also called branch hazards since they occur in the presence of branches. From a microarchitectural point of view, we might follow different approaches to resolve control hazards:

  1. As done in Section 12.1, we might stall the pipeline until the branch has had the chance to modify the program counter. In software, we could once again use NOPs to generate the respective pipeline bubbles. As before, this approach comes at the cost of reducing the microarchitecture’s instruction throughput.

  2. Alternatively, we could guess the outcome of the branch instruction and start executing subsequent instructions while the branch still progresses through the pipelined core. If we guessed right, performance remains high. If we guessed wrong, we have to flush the pipeline to get rid of the wrongly fetched instructions.

Listing 12.2.1 Code snippet with a control hazard.
orr x0, xzr, #7
nop
nop
subs xzr, x0, #5
b.ne #16
orr x1, xzr, #1
orr x2, xzr, #2
orr x3, xzr, #3
orr x4, xzr, #4

Since the speculative execution of instructions would require changes to our microarchitecture, we’ll simply add NOPs in software to resolve the control hazard in Listing 12.2.1.
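A back-of-the-envelope model may clarify the trade-off between stalling and speculating. The numbers below are assumptions for illustration only: BRANCH_DELAY and the one-instruction-per-cycle baseline are placeholders, not values taken from the lecture core.

```python
# Rough cycle model comparing NOP-filled stalls against branch prediction.
# BRANCH_DELAY models how many cycles pass until a branch has updated the
# program counter; it is a placeholder, not the lecture core's real value.
BRANCH_DELAY = 2

def cycles(n_instr, n_branches, mispredict_rate, speculate):
    base = n_instr  # idealized: one instruction completes per cycle
    if not speculate:
        # stall / NOP-fill: every branch pays the full delay
        return base + n_branches * BRANCH_DELAY
    # speculation: only mispredicted branches flush the pipeline
    return base + n_branches * mispredict_rate * BRANCH_DELAY

# 1000 instructions, 100 branches, 10% misprediction rate
print(cycles(1000, 100, 0.1, speculate=False))  # every branch stalls
print(cycles(1000, 100, 0.1, speculate=True))   # only mispredictions flush
```

Even this crude model shows why speculation pays off: the penalty shrinks with the misprediction rate, whereas unconditional stalling charges the full delay for every branch.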

Tasks

  1. What are the values of registers x0 - x4 after an AArch64-compliant microarchitecture has executed the code in Listing 12.2.1? Use - for undefined values.

  2. Run the code snippet as is on the pipelined core (pipelined_core.dig) designed in the lectures.

  3. Adjust Listing 12.2.1’s code snippet by adding a minimal number of NOPs to resolve the control hazard. Run your adjusted snippet on the pipelined core.

12.3. Microbenchmarking

Until this point, we studied data and control hazards in the context of our simple pipelined core. The key advantage of this approach is that we can simulate every cycle of our core in isolation and study all internal signals. Yet, our core’s sophistication is still far from that of the processors deployed in billions of devices today.

This part of the lab switches gears and investigates the Cortex-A72 core of the class’s Raspberry Pis w.r.t. pipelining. Our goal is to observe and predict the performance of the microarchitecture in the presence (or absence) of data hazards. For this, we proceed very similarly to the U7 RISC-V core discussed in the lectures: We develop a small collection of microbenchmarks to evaluate the performance and limitations of the available pipelines. As done for the U7 RISC-V core, we reduce the complexity of the task by limiting our efforts to a single instruction. Specifically, we’ll only study MADD in the W-form.

To accelerate your microbenchmarking work, a template for the first microbenchmark and a code frame are available. You may build the code through the provided Makefile by typing make.
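Before measuring anything, it can help to sketch the two performance bounds the microbenchmarks should approach. The script below is only a model; THROUGHPUT and LATENCY are placeholders which you should replace with the MADD (W-form) values from the Cortex®-A72 Software Optimization Guide.

```python
# Model of the two bounds a pipelined execution unit imposes.
# THROUGHPUT and LATENCY are deliberate placeholders, NOT the values from
# the optimization guide; looking those up is part of the tasks below.
FREQ_HZ = 1.5e9          # stable clock frequency assumed in the tasks
THROUGHPUT = 1.0         # placeholder: instructions issued per cycle
LATENCY = 4              # placeholder: cycles until a result is usable

def peak_instr_per_sec():
    """Independent instructions: limited by issue throughput only."""
    return FREQ_HZ * THROUGHPUT

def latency_bound_instr_per_sec():
    """A read-after-write chain: one result every LATENCY cycles."""
    return FREQ_HZ / LATENCY

print(peak_instr_per_sec(), latency_bound_instr_per_sec())
```

Comparing your measured rates against these two bounds tells you whether a kernel is throughput-bound (independent instructions) or latency-bound (dependence chains).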

Tasks

  1. Look up the instruction processing pipeline of the Cortex-A72 microarchitecture in the Cortex®-A72 Software Optimization Guide. Briefly explain a situation where the out-of-order portion of the microarchitecture is advantageous compared to a strict in-order core.

  2. Look up the throughput of the AArch64 MADD instruction in the W-form. Which pipelines are used by MADD? Assume a stable clock frequency of 1.5 GHz. How many independent MADD instructions per second may a single Cortex-A72 core execute?

  3. Follow the approach of the lectures to develop a microbenchmark which measures the peak MADD throughput. Loop over a set of independent MADD instructions and name your kernel micro_aarch64_madd_w_independent. Report the obtained performance!

  4. Look up the effective execution latency of MADD. What is the instruction’s latency when read-after-write dependences exist w.r.t. Rn or Rm? Assume a stable clock frequency of 1.5 GHz. How many MADD instructions per second may a single Cortex-A72 core perform if it executes such latency-bound code?

  5. Follow the approach of the lectures to develop a microbenchmark which measures the execution latency of MADD in the W-form. Loop over a set of MADD instructions which are subject to read-after-write dependences w.r.t. Rn and name your kernel micro_aarch64_madd_w_raw_rn. Run your benchmark!

  6. Design a microbenchmark to demonstrate the core’s out-of-order capabilities. For this use a loop to repeatedly execute two blocks of MADD instructions:

    • Use MADD instructions with read-after-write dependences w.r.t. Rn or Rm in the first block.

    • Use independent MADD instructions in the second block.

    Name your kernel micro_aarch64_madd_w_raw_and_independent.