6. Microbenchmarks#

Modern computer cores are based on inherently parallel microarchitectures. This means that we already have a high degree of parallelism within a single core. In this chapter, we examine microbenchmarks that allow us to measure core-level parallelism. A detailed understanding of the requirements for core-level parallelism will guide our implementation of low-level kernels in later chapters.

6.1. Execution Throughput#

A typical microarchitecture decomposes instructions into microoperations. The microoperations are issued to pipelines that execute them. Depending on the microoperation, it may be executed by one of several pipelines. Typically, an execution pipeline can accept one new microoperation per cycle.

Table 6.1.1 Theoretical execution throughput of various floating-point instructions on a core of NVIDIA’s Grace CPU assuming a 3.3 GHz clock frequency.#

Instruction          Inst / Cycle   Ops / Inst   Ops / Cycle   GFLOPS
FMLA (vector), 4S    4              8            32            105.6
FMLA (vector), 2S    4              4            16            52.8
FMUL (scalar)        4              1            4             13.2

Execution throughput is the maximum number of instructions in a given instruction group that a microarchitecture can execute per cycle. This assumes ideal conditions where there are no resource conflicts, data dependencies, or memory delays. For example, assume that a microarchitecture has an execution throughput of 4 for the Neon FMLA (vector) instruction. This means that under ideal conditions, we obtain the results of four FMLA (vector) instructions per cycle.

Table 6.1.1 shows the theoretical floating-point throughput of an example core. The 3.3 GHz clock is the nominal frequency of the NVIDIA Grace Superchip system microbenchmarked below; it is slightly higher than the 3.1 GHz listed in the corresponding datasheet. The execution throughput of the three example instructions is expressed in instructions per cycle and matches the microarchitecture specifications given in Table 3-16 of the Neoverse V2 Software Optimization Guide.

The number of operations (ops) for a given instruction that a microarchitecture can execute per cycle is given as:

ops / cycle = (instructions / cycle) × (ops / instruction)

In the case of FMLA (vector), we may be interested in FP32 arithmetic instructions with arrangement specifier 4S. In this case, each instruction performs a fused multiply-add on four-element vectors, resulting in 8 floating-point operations per instruction. This gives us a throughput of 32 = 4 × 8 operations per cycle for Neoverse V2.

We get the operations per second by also considering the clock frequency:

ops / second = (ops / cycle) × (cycles / second)

For example, in the case of a Grace core and FMLA (vector) with arrangement specifier 4S, we get 105.6 GFLOPS = 3.3 GHz × 32 ops/cycle. This means that a core can perform 105.6 FP32 giga (10⁹) operations per second using FMLA (vector) instructions under ideal conditions. Of all the instructions available, this is also the highest theoretical throughput for FP32 arithmetic. We refer to this number as the core's theoretical FP32 peak performance.
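The entries of Table 6.1.1 can be reproduced from these two formulas. A quick sanity check in Python, using the 3.3 GHz clock and the per-instruction figures quoted above:

```python
# Theoretical throughput numbers from Table 6.1.1 (Neoverse V2 / Grace).
FREQ_GHZ = 3.3  # nominal clock frequency of the benchmarked system

def throughput(inst_per_cycle, ops_per_inst, freq_ghz=FREQ_GHZ):
    """Return (ops/cycle, GFLOPS) from the two formulas above."""
    ops_per_cycle = inst_per_cycle * ops_per_inst
    return ops_per_cycle, ops_per_cycle * freq_ghz

# FMLA (vector), 4S: 4 inst/cycle, 4 lanes x 2 ops (mul + add) = 8 ops/inst
print(throughput(4, 8))   # (32, 105.6)
# FMLA (vector), 2S: 2 lanes x 2 ops = 4 ops/inst
print(throughput(4, 4))   # (16, 52.8)
# FMUL (scalar): 1 op/inst
print(throughput(4, 1))   # (4, 13.2)
```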

Note

On Neoverse V2, the 128-bit SVE FMLA instruction also has a throughput of four instructions per cycle. So, in theory, we can achieve FP32 peak performance using different instructions. This may seem intuitive since the Neoverse V2 core uses the same pipelines to execute Neon and SVE instructions. However, there are microarchitectures where the situation is different. An example is the A64FX, where we need to use 512-bit SVE instructions for the highest performance.

So far we have discussed instruction throughput from a theoretical point of view. We will now discuss how to write microbenchmarks that allow us to measure execution throughput. To maximize instruction throughput, we must 1) utilize all available pipelines and 2) keep the pipelines full. In summary, we design the throughput microbenchmarks to execute many independent instructions.

Listing 6.1.1 FMLA (vector), 4S microbenchmark that measures execution throughput.#
 1  loop_bench:
 2      sub x0, x0, #1
 3
 4      .rept 100
 5      fmla  v0.4s,  v8.4s, v16.4s
 6      fmla  v1.4s,  v9.4s, v17.4s
 7      fmla  v2.4s, v10.4s, v18.4s
 8      fmla  v3.4s, v11.4s, v19.4s
 9
10      fmla  v4.4s, v12.4s, v20.4s
11      fmla  v5.4s, v13.4s, v21.4s
12      fmla  v6.4s, v14.4s, v22.4s
13      fmla  v7.4s, v15.4s, v23.4s
14
15      fmla  v8.4s, v16.4s, v24.4s
16      fmla  v9.4s, v17.4s, v25.4s
17      fmla v10.4s, v18.4s, v26.4s
18      fmla v11.4s, v19.4s, v27.4s
19
20      fmla v12.4s, v20.4s, v28.4s
21      fmla v13.4s, v21.4s, v29.4s
22      fmla v14.4s, v22.4s, v30.4s
23      fmla v15.4s, v23.4s, v31.4s
24
25      fmla v16.4s, v24.4s,  v0.4s
26      fmla v17.4s, v25.4s,  v1.4s
27      fmla v18.4s, v26.4s,  v2.4s
28      fmla v19.4s, v27.4s,  v3.4s
29
30      fmla v20.4s, v28.4s,  v4.4s
31      fmla v21.4s, v29.4s,  v5.4s
32      fmla v22.4s, v30.4s,  v6.4s
33      fmla v23.4s, v31.4s,  v7.4s
34
35      fmla v24.4s,  v0.4s,  v8.4s
36      fmla v25.4s,  v1.4s,  v9.4s
37      fmla v26.4s,  v2.4s, v10.4s
38      fmla v27.4s,  v3.4s, v11.4s
39
40      fmla v28.4s,  v4.4s, v12.4s
41      fmla v29.4s,  v5.4s, v13.4s
42      fmla v30.4s,  v6.4s, v14.4s
43      fmla v31.4s,  v7.4s, v15.4s
44      .endr
45
46      cbnz x0, loop_bench

Listing 6.1.1 shows the loop of a microbenchmark that measures the execution throughput of FMLA (vector) using the 4S arrangement specifier. The benchmark consists of two parts: a loop and a block of code that is executed at each iteration of the loop. The loop decrements the value of X0 at each iteration (line 2) and continues iterating only if the value in X0 is not zero (line 46). The value in X0 is the only function parameter of the microbenchmark, allowing the surrounding scope to determine how many times the code block is executed.

The code block itself consists of 3200 instructions, realized by a subblock of 32 instructions (lines 5-43), repeated 100 times by the .rept and .endr directives in lines 4 and 44. Looking at the subblock, we see that the first eight instructions in lines 5-13 are completely independent. The following eight instructions in lines 15-23 have write-after-read dependencies with respect to the previous eight instructions in lines 5-13. For example, the ninth instruction fmla v8.4s, v16.4s, v24.4s in line 15 writes to register V8.4s, which is read by the instruction fmla v0.4s, v8.4s, v16.4s in line 5. Write-after-read dependencies are typically resolved by the renaming capabilities of the microarchitecture.

The first true dependencies (read-after-write) occur in the second half of the 32-instruction subblock. For example, the fmla v16.4s, v24.4s, v0.4s instruction in line 25 reads from source register V0.4s, which is written by the fmla v0.4s, v8.4s, v16.4s instruction in line 5. Read-after-write dependencies cannot be resolved by the microarchitecture, meaning that the corresponding result must be available before the instruction with the dependent read can be executed. Thus, this situation can become a bottleneck if we run out of independent instructions to keep the pipelines busy. The number of cycles required before the result of an instruction is available to other instructions is called latency and is discussed in Section 6.2.
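These two numbers suggest a rule of thumb: to keep the pipelines full, we need at least latency × throughput independent instructions in flight. A small sketch, assuming the Neoverse V2 figures of 4 cycles latency and 4 instructions per cycle for FMLA (vector):

```python
# Minimum number of independent FMLA instructions needed to saturate
# the pipelines (a Little's-law-style estimate): latency * throughput.
LATENCY_CYCLES  = 4   # cycles until an FMLA (vector) result is available
THROUGHPUT_INST = 4   # FMLA (vector) instructions issued per cycle

min_independent = LATENCY_CYCLES * THROUGHPUT_INST
print(min_independent)   # 16

# In Listing 6.1.1 each result is first consumed 16 instructions later
# (e.g., V0 is written by the 1st instruction and read by the 17th),
# so the dependency distance meets the bound.
DEPENDENCY_DISTANCE = 16
assert DEPENDENCY_DISTANCE >= min_independent
```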

Table 6.1.2 Microbenchmarked throughput of various floating-point instructions on a core of NVIDIA’s Grace CPU.#

Instruction          Inst / Cycle   GFLOPS
FMLA (vector), 4S    4.0            105.7
FMLA (vector), 2S    4.0            52.9
FMUL (scalar)        TODO           TODO

Using the microbenchmark structure in Listing 6.1.1 to measure the execution throughput of FMLA (vector) and FMUL (scalar) on Grace (Neoverse V2), we obtain the performance shown in Table 6.1.2. In summary, our microbenchmarks are able to achieve the execution throughput we expected from theoretical considerations.
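The instructions-per-cycle figures in Table 6.1.2 can be derived from the measured wall-clock time of a run. A hypothetical worked example in Python, where the iteration count and elapsed time are invented values for illustration:

```python
# Hypothetical worked example: deriving instructions per cycle from a
# timed run of the microbenchmark. The measurement values are invented.
ITERATIONS  = 1_000_000   # loop count passed to the benchmark in X0
BLOCK_INSTS = 3200        # FMLA instructions per loop iteration
FREQ_HZ     = 3.3e9       # nominal clock frequency of the Grace core
ELAPSED_S   = 0.2424      # invented wall-clock time of the run

total_insts    = ITERATIONS * BLOCK_INSTS
inst_per_cycle = total_insts / (ELAPSED_S * FREQ_HZ)
print(round(inst_per_cycle, 1))   # 4.0
```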

Note

Another approach to writing the throughput benchmark could be to use the same source registers for the multiplicands in each instruction. For example, we can use V30 and V31 as source registers and accumulate in V0-V29:

fmla  v0.4s, v30.4s, v31.4s
fmla  v1.4s, v30.4s, v31.4s
// [...]
fmla v29.4s, v30.4s, v31.4s

However, on Neoverse V2 (Grace) this leads to an unstable setup where the performance is 2x lower than expected every few runs. Possibly only half of the available SIMD pipelines are used in the underperforming runs. Details are unclear at the time of writing, so the structure in Listing 6.1.1 is recommended.

6.2. Execution Latency#

An instruction’s execution latency is the minimum number of processor cycles between the time an instruction’s microoperations are issued into the execution pipeline and the time its result is available – via write-back or forwarding – to a dependent instruction. Again, as with execution throughput, we assume ideal conditions.

As discussed in Section 6.1, we need independent instructions to fully utilize all available execution pipelines. On the other hand, if the instructions are interdependent, our workload can become latency-bound. Suppose we have n instructions I1 to In, where each instruction depends on the previous one. This means that we have the dependency chain I1 → I2 → … → In-1 → In. In this case, we lose core-level parallelism in two ways. First, we can only use one execution pipeline at a time. Second, pipeline parallelism is either completely inhibited or significantly reduced. More specifically, if the instruction Ik requires Ik-1 to write the result back to the register file, we cannot use pipeline parallelism. However, if Ik-1 can share the required results with Ik via late forwarding, then at least a limited amount of pipeline parallelism is possible. Late forwarding lets an earlier microoperation forward its result directly to a dependent microoperation without waiting for register-file write-back. We can now write microbenchmarks for both cases.
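The cost of such a chain can be estimated with a simple model. The sketch below contrasts a fully serial dependency chain with fully independent instructions, assuming a latency of 4 cycles and a throughput of 4 instructions per cycle, and ignoring loop overhead:

```python
# Idealized cycle estimates for n back-to-back FMLA instructions under
# the two extremes: a serial dependency chain vs. independent instructions.
N          = 1000
LATENCY    = 4   # cycles until a result is available (write-back case)
THROUGHPUT = 4   # independent instructions issued per cycle

cycles_dependent   = N * LATENCY      # chain: one result at a time
cycles_independent = N / THROUGHPUT   # all pipelines kept busy

print(cycles_dependent, cycles_independent)   # 4000 250.0
print(cycles_dependent / cycles_independent)  # 16.0x slowdown
```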

Listing 6.2.1 FMLA (vector), 4S microbenchmark with dependencies on the first source register and destination register.#
loop_bench:
    sub x0, x0, #1

    .rept 100
    fmla v0.4s, v0.4s,  v1.4s
    fmla v0.4s, v0.4s,  v2.4s
    fmla v0.4s, v0.4s,  v3.4s
    fmla v0.4s, v0.4s,  v4.4s

    fmla v0.4s, v0.4s,  v5.4s
    fmla v0.4s, v0.4s,  v6.4s
    fmla v0.4s, v0.4s,  v7.4s
    fmla v0.4s, v0.4s,  v8.4s

    fmla v0.4s, v0.4s,  v9.4s
    fmla v0.4s, v0.4s, v10.4s
    fmla v0.4s, v0.4s, v11.4s
    fmla v0.4s, v0.4s, v12.4s

    fmla v0.4s, v0.4s, v13.4s
    fmla v0.4s, v0.4s, v14.4s
    fmla v0.4s, v0.4s, v15.4s
    fmla v0.4s, v0.4s, v16.4s

    fmla v0.4s, v0.4s, v17.4s
    fmla v0.4s, v0.4s, v18.4s
    fmla v0.4s, v0.4s, v19.4s
    fmla v0.4s, v0.4s, v20.4s

    fmla v0.4s, v0.4s, v21.4s
    fmla v0.4s, v0.4s, v22.4s
    fmla v0.4s, v0.4s, v23.4s
    fmla v0.4s, v0.4s, v24.4s

    fmla v0.4s, v0.4s, v25.4s
    fmla v0.4s, v0.4s, v26.4s
    fmla v0.4s, v0.4s, v27.4s
    fmla v0.4s, v0.4s, v28.4s

    fmla v0.4s, v0.4s, v29.4s
    fmla v0.4s, v0.4s, v30.4s
    fmla v0.4s, v0.4s, v31.4s
    .endr

    cbnz x0, loop_bench

Listing 6.2.1 shows a microbenchmark loop for the write-back case, using FMLA (vector) with arrangement specifier 4S as an example. The dependency chain is formed through Neon register V0: all instructions use V0 as their first source and destination register. In the case of Neoverse V2, Table 3-16 of the Software Optimization Guide specifies an execution latency of 4 cycles for this case. This means that we can expect to execute 0.25 instructions per cycle in the microbenchmark.

Listing 6.2.2 FMLA (vector), 4S microbenchmark with destination register dependency.#
loop_bench:
    sub x0, x0, #1

    .rept 100
    fmla v0.4s,  v1.4s,  v9.4s
    fmla v0.4s,  v2.4s, v10.4s
    fmla v0.4s,  v3.4s, v11.4s
    fmla v0.4s,  v4.4s, v12.4s

    fmla v0.4s,  v5.4s, v13.4s
    fmla v0.4s,  v6.4s, v14.4s
    fmla v0.4s,  v7.4s, v15.4s
    fmla v0.4s,  v8.4s, v16.4s

    fmla v0.4s,  v9.4s, v17.4s
    fmla v0.4s, v10.4s, v18.4s
    fmla v0.4s, v11.4s, v19.4s
    fmla v0.4s, v12.4s, v20.4s

    fmla v0.4s, v13.4s, v21.4s
    fmla v0.4s, v14.4s, v22.4s
    fmla v0.4s, v15.4s, v23.4s
    fmla v0.4s, v16.4s, v24.4s

    fmla v0.4s, v17.4s, v25.4s
    fmla v0.4s, v18.4s, v26.4s
    fmla v0.4s, v19.4s, v27.4s
    fmla v0.4s, v20.4s, v28.4s

    fmla v0.4s, v21.4s, v29.4s
    fmla v0.4s, v22.4s, v30.4s
    fmla v0.4s, v23.4s, v31.4s
    fmla v0.4s, v24.4s,  v1.4s

    fmla v0.4s, v25.4s,  v2.4s
    fmla v0.4s, v26.4s,  v3.4s
    fmla v0.4s, v27.4s,  v4.4s
    fmla v0.4s, v28.4s,  v5.4s

    fmla v0.4s, v29.4s,  v6.4s
    fmla v0.4s, v30.4s,  v7.4s
    fmla v0.4s, v31.4s,  v8.4s
    .endr

    cbnz x0, loop_bench

Listing 6.2.2 shows a microbenchmark loop for the forwarding case. The dependency chain now exists only on the destination register, not on any of the source registers. On Neoverse V2, the microarchitecture supports late forwarding of accumulate operands, and an execution latency of 2 cycles is specified for this case. This means that we can expect to execute 0.5 instructions per cycle in the microbenchmark.
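Under these assumptions, the expected performance of both latency microbenchmarks follows directly from the quoted latencies. A small Python sketch, assuming 8 FP32 operations per FMLA (vector), 4S instruction and a 3.3 GHz clock:

```python
# Predicted throughput for the two latency microbenchmarks, using the
# Neoverse V2 latencies quoted above: 4 cycles for the write-back case
# (Listing 6.2.1) and 2 cycles for the forwarding case (Listing 6.2.2).
FREQ_GHZ     = 3.3
OPS_PER_INST = 8   # FMLA (vector), 4S: 4 lanes x 2 ops (mul + add)

def predicted(latency_cycles):
    """Return (inst/cycle, GFLOPS) for a fully serial dependency chain."""
    inst_per_cycle = 1.0 / latency_cycles
    return inst_per_cycle, inst_per_cycle * OPS_PER_INST * FREQ_GHZ

print(predicted(4))   # (0.25, 6.6)  -> write-back case
print(predicted(2))   # (0.5, 13.2)  -> forwarding case
```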

Table 6.2.1 Measured execution latencies of various floating-point instructions on a core of NVIDIA’s Grace CPU.#

Instruction          Dependencies                 Inst / Cycle   GFLOPS
FMLA (vector), 4S    destination & first source   0.25           6.6
FMLA (vector), 4S    destination                  0.5            13.2

We can confirm the theoretical Neoverse V2 execution latencies by running the microbenchmarks. Table 6.2.1 shows the measured latencies for the two cases. We see that we can reproduce the theoretical execution latencies given in the Software Optimization Guide.

In particular, the 6.6 GFLOPS measured for source register dependencies represents a 16-fold slowdown compared to the 105.6 GFLOPS throughput achieved when all pipelines and pipeline parallelism are fully utilized. In summary, we need to design our compute kernels very carefully to fully exploit the available parallelism of the target microarchitecture.