3. Pipelining

Modern processor cores implement pipelining, which splits the execution of an instruction into simpler stages. This offers great potential for speeding up computations since the stages of different instructions can overlap, i.e., run in parallel. However, each stage of an instruction depends on the preceding stages, so we have to generate sufficient pressure, in the form of independent instructions, to keep the pipeline filled and reach maximum throughput. When looking at a single instruction, we call the number of cycles required from issue to completion the instruction’s latency.
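
As a rule of thumb, and using purely hypothetical numbers rather than values from an optimization guide: if an instruction has a latency of L cycles and a core can start T such instructions per cycle, roughly L × T independent instructions must be in flight to keep the pipelines busy. With, say, L = 4 cycles and T = 4 instructions per cycle, that is about 16 independent instructions.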

So far, we have studied the sustainable floating-point performance by running a set of microbenchmarks. These benchmarks share a very similar structure; e.g., we used single-precision FMA ops in the following way:

fmla v0.4s, v30.4s, v31.4s
fmla v1.4s, v30.4s, v31.4s
fmla v2.4s, v30.4s, v31.4s
fmla v3.4s, v30.4s, v31.4s

fmla v4.4s, v30.4s, v31.4s
fmla v5.4s, v30.4s, v31.4s
fmla v6.4s, v30.4s, v31.4s
fmla v7.4s, v30.4s, v31.4s

We see that all FMA ops in this block use separate destination registers v0, v1, …, v7. Further, we use v30 and v31 as source registers. This means that the destination register of an FMA op is never read by any other op in the unrolled part of the microkernel. Thus, no data dependences exist between the ops and all instructions can be executed in parallel. This puts very high pressure on the FP/SIMD pipelines of the V1 core and keeps them filled at all times. Experimentally, we observe close to the pipelines’ theoretical throughput of four ASIMD FMA ops per cycle.
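
As a sanity check of this number (the clock frequency is an assumption for illustration only): four FP/ASIMD pipelines × 4 single-precision lanes × 2 floating-point operations per FMA yields 32 FP operations per cycle; at an assumed 2.6 GHz this corresponds to roughly 83 GFLOPS for a single core.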

Now, let’s have a look at the pipeline latencies. How? We introduce read-after-write data dependences. Assume two instructions I1 and I2. A read-after-write dependence between I1 and I2 exists if I1 writes a result that is read by I2. This forces the hardware to wait until I1 has completed before I2 can be issued. Returning to our example, suppose it looked as follows:

fmla v0.4s, v0.4s, v1.4s
fmla v0.4s, v0.4s, v1.4s
fmla v0.4s, v0.4s, v1.4s
fmla v0.4s, v0.4s, v1.4s

fmla v0.4s, v0.4s, v1.4s
fmla v0.4s, v0.4s, v1.4s
fmla v0.4s, v0.4s, v1.4s
fmla v0.4s, v0.4s, v1.4s

Here, all FMA ops write to the same destination register v0, but they also read the source registers v0 and v1. v1 is not a problem since it is never modified; the issue is v0: each fmla can only be issued once the preceding one has completed, so the whole chain is serialized.
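
In this latency-bound regime, a new fmla of the chain can only start every L cycles, where L is the fmla latency listed in the optimization guide (see task 1 below). Each instruction still performs 4 lanes × 2 = 8 single-precision floating-point operations, so the sustained rate drops to roughly 8 / L FP operations per cycle, regardless of how many FP/ASIMD pipelines the core has.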

Tasks

  1. Locate the AArch64 ASIMD instructions fmul and fmla in the Arm Neoverse V1 Software Optimization Guide. What are their key metrics (used pipelines, latency, throughput)?

  2. Assume a workload which is bound by fmul’s latency, meaning that a pipeline only works on a single instruction at a time. In that case, how many GFLOPS can you get out of a single FP/ASIMD pipeline? How does the situation change if you are bound by fmla’s latency?

  3. Rewrite the peak performance microbenchmark peak_asimd_fmla_sp.s of Section 1. Replace all FMA ops by fmla v0.4s, v0.4s, v1.4s as described above. Name your kernel latency_src_asimd_fmla_sp.s and benchmark it! What do you observe?

  4. Repeat the previous task, but use fmul v0.4s, v0.4s, v1.4s. Name the kernel latency_src_asimd_fmul_sp.s.

  5. Consider the benchmark latency_src_asimd_fmla_sp.s: how does the situation change if you increase the distance of the read-after-write dependences? (A sketch of what a larger dependence distance could look like follows this list.)

  6. Now introduce a data dependence only by means of the destination register, i.e., repeat the line fmla v0.4s, v30.4s, v31.4s for all FMA ops. What do you observe? What happens if you use fmul v0.4s, v30.4s, v31.4s? Name the kernels latency_dst_asimd_fmla_sp.s and latency_dst_asimd_fmul_sp.s, respectively.
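
To illustrate what a larger read-after-write distance in task 5 could look like, here is a minimal sketch; the register choices are an assumption and not the required solution. The block rotates over four independent accumulators, so each fmla depends on the instruction four positions earlier instead of on its immediate predecessor:

fmla v0.4s, v0.4s, v1.4s    // chain 0
fmla v2.4s, v2.4s, v1.4s    // chain 1
fmla v3.4s, v3.4s, v1.4s    // chain 2
fmla v4.4s, v4.4s, v1.4s    // chain 3

fmla v0.4s, v0.4s, v1.4s    // reads the v0 written four instructions earlier
fmla v2.4s, v2.4s, v1.4s
fmla v3.4s, v3.4s, v1.4s
fmla v4.4s, v4.4s, v1.4s

Varying the number of such rotating accumulators varies the dependence distance.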