Pipelining
==========

Modern computer cores implement pipelining, which splits instructions into simpler stages.
This offers great potential for speeding up computations since the stages can run in parallel.
However, each stage in a pipeline depends on the preceding stages.
This means that we have to generate sufficient pressure to keep the pipeline filled and reach the maximum *throughput*.
When looking at a single instruction, we call the number of required cycles from issue to completion the instruction's *latency*.

So far, we have studied the sustainable floating point performance by running a set of microbenchmarks.
These benchmarks had a very similar structure, e.g., we used single precision FMA ops in the following way:

.. code-block:: asm

    fmla v0.4s, v30.4s, v31.4s
    fmla v1.4s, v30.4s, v31.4s
    fmla v2.4s, v30.4s, v31.4s
    fmla v3.4s, v30.4s, v31.4s
    fmla v4.4s, v30.4s, v31.4s
    fmla v5.4s, v30.4s, v31.4s
    fmla v6.4s, v30.4s, v31.4s
    fmla v7.4s, v30.4s, v31.4s

We see that all FMA ops in this block use separate destination registers ``v0``, ``v1``, ..., ``v7``.
Further, we use ``v30`` and ``v31`` as source registers.
This means that the destination register of an FMA op is never read by any other op in the unrolled part of the microkernel.
Thus, no data dependences exist between any of the ops and all instructions can be executed in parallel.
This puts very high pressure on the FP/SIMD pipelines of the V1 core and keeps them filled at all times.
Experimentally, we get close to the pipelines' theoretical floating point performance of four ASIMD FMA ops per cycle.

Now, let's have a look at the pipeline latencies.
How? We introduce read-after-write data dependences.
Assume two instructions I1 and I2.
A read-after-write dependence between I1 and I2 exists if I1 writes a result which is read by I2.
This forces the hardware to wait until instruction I1 has completed before I2 can be issued.
Back to the example: let's say it looked as follows:

.. code-block:: asm

    fmla v0.4s, v0.4s, v1.4s
    fmla v0.4s, v0.4s, v1.4s
    fmla v0.4s, v0.4s, v1.4s
    fmla v0.4s, v0.4s, v1.4s
    fmla v0.4s, v0.4s, v1.4s
    fmla v0.4s, v0.4s, v1.4s
    fmla v0.4s, v0.4s, v1.4s
    fmla v0.4s, v0.4s, v1.4s

Here, all FMA ops write to the same destination register ``v0`` but also read the source registers ``v0`` and ``v1``.
``v1`` isn't a problem since it is never modified; the issue is ``v0``: We can only execute one of the ``fmla`` instructions once all previous ones have completed.
A sketch of how such a dependence chain can be embedded in a benchmarkable kernel is given after the task list below.

.. admonition:: Tasks

   #. Locate the AArch64 ASIMD instructions ``fmul`` and ``fmla`` in the `Arm Neoverse V1 Software Optimization Guide `_.
      What are their key metrics (used pipelines, latency, throughput)?

   #. Assume a workload which is bound by ``fmul``'s latency, meaning that a pipeline only works on a single instruction at a time.
      In that case, how many GFLOPS can you get out of a single FP/ASIMD pipeline?
      How does the situation change if you are bound by ``fmla``'s latency?

   #. Rewrite the peak performance microbenchmark ``peak_asimd_fmla_sp.s`` of :numref:`ch:neoverse_v1`.
      Replace all FMA ops by ``fmla v0.4s, v0.4s, v1.4s`` as described above.
      Name your kernel ``latency_src_asimd_fmla_sp.s`` and benchmark it!
      What do you observe?

   #. Repeat the previous task, but use ``fmul v0.4s, v0.4s, v1.4s``.
      Name the kernel ``latency_src_asimd_fmul_sp.s``.

   #. Given the benchmark ``latency_src_asimd_fmla_sp.s``: How does the situation change if you increase the distance of the read-after-write dependences?
   #. Now introduce a data dependence only by means of the destination register, i.e., repeat the line ``fmla v0.4s, v30.4s, v31.4s`` for the FMA ops.
      What do you observe?
      What happens if you use ``fmul v0.4s, v30.4s, v31.4s``?
      Name the kernels ``latency_dst_asimd_fmla_sp.s`` and ``latency_dst_asimd_fmul_sp.s``, respectively.
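
The listing below is a minimal sketch of what the loop body of ``latency_src_asimd_fmla_sp.s`` could look like.
The function name, the use of ``x0`` as the iteration counter, and the surrounding loop scaffolding are assumptions about the benchmark driver's interface; ``peak_asimd_fmla_sp.s`` is not reproduced here, so adapt the skeleton to the calling convention of your driver.

.. code-block:: asm

        .text
        .type latency_src_asimd_fmla_sp, %function
        .global latency_src_asimd_fmla_sp
    latency_src_asimd_fmla_sp:
        // x0: assumed to hold the number of loop iterations passed by the driver
    loop:
        // single read-after-write chain: every fmla reads the v0 written by its predecessor
        fmla v0.4s, v0.4s, v1.4s
        fmla v0.4s, v0.4s, v1.4s
        fmla v0.4s, v0.4s, v1.4s
        fmla v0.4s, v0.4s, v1.4s
        fmla v0.4s, v0.4s, v1.4s
        fmla v0.4s, v0.4s, v1.4s
        fmla v0.4s, v0.4s, v1.4s
        fmla v0.4s, v0.4s, v1.4s

        // decrement the iteration counter and repeat until it reaches zero
        subs x0, x0, #1
        b.ne loop

        ret

One way to increase the distance of the read-after-write dependences is, for example, to alternate between two independent accumulators such as ``v0`` and ``v2``, so that consecutive instructions no longer depend on each other directly.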