8. Recent Features: A Sneak Peek

This section looks into recent features and the future of AArch64 by studying the idea of vector length agnostic programming and the Scalable Vector Extension version two.

8.1. Arm Instruction Emulator

The A64FX was the first processor to support SVE and has 512-bit vector registers. Graviton3 (Neoverse V1) also supports SVE but has 256-bit vector registers. In this section we’ll have a look at SVE instructions using different vector widths, some of which are not supported by A64FX or V1. The Arm Instruction Emulator (ArmIE) allows us to emulate these instructions and, in particular, to run SVE2 code which we could not run otherwise. Because of the emulation, we cannot expect performance anywhere near what hardware with native support would deliver. However, in addition to “just” emulating instructions, ArmIE is able to instrument binaries and can, for example, count the number of executed instructions.

If installed, ArmIE is available through the module system. You may show available modules by running:

module avail

and load ArmIE by running:

module load armie22/22.0

Tasks

Make yourself familiar with the emulator and browse through ArmIE’s documentation.
Compile at least two SVE examples from the lecture slides and execute them! Run the code with different SVE vector lengths. Try at least 128, 256 and 512 bits.
In your examples, count the number of AArch64 and SVE instructions by using libinscount_emulated.so.
Now, examine the memory access behavior of an example with load and/or store instructions by using libmemtrace_emulated.so.

Hint

ArmIE is available from Arm’s developer resources. You may install ArmIE in user space and make it available through the module system by running the following lines:

./arm-instruction-emulator_22.0_RHEL-8/arm-instruction-emulator_22.0_RHEL-8.sh --install-to ${HOME}/armie
export MODULEPATH=$MODULEPATH:${HOME}/armie/modulefiles
module load armie22/22.0

8.2. Vector Length Agnostic Programming

In this part we’ll use SVE to write a Vector Length Agnostic (VLA) function. For this we’ll vectorize a simple loop with an unknown number of iterations. The number of entries in the arrays i_a, i_b and o_c, and thus the number of loop iterations, is an input parameter to the function triad_high:

Listing 8.2.1 File triad_high.cpp which implements the triad-function in C/C++.

#include "triad_high.h"

void triad_high( uint64_t         i_n_values,
                 float    const * i_a,
                 float    const * i_b,
                 float          * o_c ) {
  for( uint64_t l_va = 0; l_va < i_n_values; l_va++ ) {
    o_c[l_va] = i_a[l_va] + 2.0f * i_b[l_va];
  }
}

Not working with multiples of the vector length makes our life complicated when writing vectorized code. For our SVE-based small GEMMs we first assumed a fixed vector length of 256 bits when writing Section 7.2’s $(32 \times 6) = (32 \times 1) \times (1 \times 6)$ microkernel. We could then generalize the scope of our GEMMs to multiples of the microkernel’s sizes through loops over $M$, $N$ and $K$. A similar approach is feasible for most other instruction sets, e.g., ASIMD, AVX512 or OpenPower. Only Section 7.5’s uncommon $M = 31$ situation gave a glimpse into the power of VLA programming. Through predication, we were able to simply shorten the vector length of a single instruction. If programming ASIMD code, as one would have to when targeting Neoverse N1, we would have to issue multiple instructions to do the same.

In the case of the triad_high function the situation is even more complex since the number of iterations is parameter-dependent. If writing ASIMD code, one would typically implement two loops: a main loop which executes full vector instructions and a drain loop which takes care of the remaining iterations. Instead, we’ll use SVE’s predicated instructions to express the same functionality with fewer instructions.
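The difference between the two loop structures can be sketched in plain C++. This is a scalar model only, not SVE code. The function names triad_two_loops and triad_vla_model and the parameter i_n_lanes (number of 32-bit lanes per vector, e.g., 4, 8 or 16 for 128, 256 or 512 bits) are hypothetical and chosen for illustration. The inner loops stand in for a single vector instruction, and the lane-wise check in triad_vla_model mimics the kind of predicate that a whilelt-style comparison would generate.

```cpp
#include <cassert>
#include <cstdint>

// Two-loop structure as typically used for fixed-width SIMD code.
// i_n_lanes is a hypothetical parameter modeling the lanes per vector.
void triad_two_loops( uint64_t         i_n_values,
                      uint64_t         i_n_lanes,
                      float    const * i_a,
                      float    const * i_b,
                      float          * o_c ) {
  assert( i_n_lanes > 0 );
  // largest multiple of the vector length not exceeding i_n_values
  uint64_t l_n_full = i_n_values - ( i_n_values % i_n_lanes );

  // main loop: each pass models one full, unpredicated vector operation
  for( uint64_t l_va = 0; l_va < l_n_full; l_va += i_n_lanes ) {
    for( uint64_t l_la = 0; l_la < i_n_lanes; l_la++ ) {
      o_c[l_va + l_la] = i_a[l_va + l_la] + 2.0f * i_b[l_va + l_la];
    }
  }

  // drain loop: remaining iterations are processed one by one
  for( uint64_t l_va = l_n_full; l_va < i_n_values; l_va++ ) {
    o_c[l_va] = i_a[l_va] + 2.0f * i_b[l_va];
  }
}

// Single-loop structure: a per-lane predicate replaces the drain loop.
void triad_vla_model( uint64_t         i_n_values,
                      uint64_t         i_n_lanes,
                      float    const * i_a,
                      float    const * i_b,
                      float          * o_c ) {
  assert( i_n_lanes > 0 );

  for( uint64_t l_va = 0; l_va < i_n_values; l_va += i_n_lanes ) {
    for( uint64_t l_la = 0; l_la < i_n_lanes; l_la++ ) {
      // a lane is active only if its element index is within bounds
      bool l_active = ( l_va + l_la ) < i_n_values;
      if( l_active ) {
        o_c[l_va + l_la] = i_a[l_va + l_la] + 2.0f * i_b[l_va + l_la];
      }
    }
  }
}
```

Both functions compute the same result for any combination of i_n_values and i_n_lanes. In the model, inactive lanes neither read nor write memory, which is the property that lets a predicated implementation handle the last, partially filled vector inside the same loop.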

Tasks

Implement a VLA function triad_low in the file triad_low.s with the following signature:

void triad_low( uint64_t         i_n_values,
                float    const * i_a,
                float    const * i_b,
                float          * o_c )

Exploit SVE’s predicated vector instructions and use only a single loop to implement the functionality of triad_high.

Test and verify your function triad_low. Use different array sizes and emulated vector lengths in your tests!
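One possible way to organize such a verification is sketched below. The names triad_fn, triad_ref and verify_triad, the guard value and the number of guard entries are assumptions made for this sketch and not part of the lab’s code. The reference loop from Listing 8.2.1 is repeated as triad_ref to keep the example self-contained. The inputs are chosen such that all results are exactly representable in single precision, so a fused multiply-add in the candidate and a separate multiply and add in the reference give bitwise identical values.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// function pointer type matching the signature of triad_high and triad_low
using triad_fn = void (*)( uint64_t, float const *, float const *, float * );

// reference implementation (same loop as in Listing 8.2.1)
void triad_ref( uint64_t         i_n_values,
                float    const * i_a,
                float    const * i_b,
                float          * o_c ) {
  for( uint64_t l_va = 0; l_va < i_n_values; l_va++ ) {
    o_c[l_va] = i_a[l_va] + 2.0f * i_b[l_va];
  }
}

// Compares a candidate against the reference for all sizes in [0, i_max_size].
// Guard entries behind the output detect writes past the end of the array.
bool verify_triad( triad_fn i_candidate,
                   uint64_t i_max_size ) {
  assert( i_candidate != nullptr );
  uint64_t const l_n_guard = 64;    // hypothetical number of guard entries
  float    const l_guard   = -1.0f; // hypothetical guard value

  for( uint64_t l_si = 0; l_si <= i_max_size; l_si++ ) {
    std::vector< float > l_a( l_si ), l_b( l_si ), l_ref( l_si );
    std::vector< float > l_c( l_si + l_n_guard, l_guard );

    // small multiples of 0.5 and 1.0 keep all results exact in FP32
    for( uint64_t l_va = 0; l_va < l_si; l_va++ ) {
      l_a[l_va] = 0.5f * static_cast< float >( l_va );
      l_b[l_va] = 1.0f + static_cast< float >( l_va );
    }

    triad_ref(   l_si, l_a.data(), l_b.data(), l_ref.data() );
    i_candidate( l_si, l_a.data(), l_b.data(), l_c.data()   );

    // results have to match the reference
    for( uint64_t l_va = 0; l_va < l_si; l_va++ ) {
      if( l_c[l_va] != l_ref[l_va] ) return false;
    }
    // guard entries have to be untouched
    for( uint64_t l_gu = 0; l_gu < l_n_guard; l_gu++ ) {
      if( l_c[l_si + l_gu] != l_guard ) return false;
    }
  }
  return true;
}
```

A driver could call verify_triad( triad_low, ... ) and report the result. Since the vector length is a property of the emulated machine, the same binary would be started once per emulated vector length. Sizes which are not multiples of the number of lanes, and sizes smaller than a single vector, are the cases most likely to reveal mistakes in the predicate handling.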

Hint

You can implement triad_low by using the SVE instructions whilelt, b.none, fmov, ld1w, fmla, st1w, incw and b.any.

8.3. SVE2

The Scalable Vector Extension version two (SVE2) is a superset of SVE. SVE2 is an optional feature in Armv9 and extends SVE by introducing instructions tailored to diverse workloads, e.g., machine learning, genomics or databases.

Tasks

Have a look at the tutorial Introduction to SVE2.
Write a small kernel to illustrate the behavior of the SVE2 instructions FMLALB and FMLALT.
Write a small kernel to illustrate the behavior of the SVE2 instruction EOR3.
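When checking the output of such kernels, a scalar reference model can be handy. The following sketch models the element-wise behavior of the three instructions in portable C++. The function names eor3_model and fmlal_model are hypothetical. As a simplifying assumption, the half-precision inputs are stored as float values which are expected to be exactly representable in FP16. The vectors’ sizes stand in for the emulated vector length.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Model of EOR3: bitwise exclusive OR of three vectors.
// io_dn serves as first source and as destination.
void eor3_model( std::vector< uint64_t >       & io_dn,
                 std::vector< uint64_t > const & i_m,
                 std::vector< uint64_t > const & i_k ) {
  assert( i_m.size() == io_dn.size() && i_k.size() == io_dn.size() );
  for( std::size_t l_el = 0; l_el < io_dn.size(); l_el++ ) {
    io_dn[l_el] = io_dn[l_el] ^ i_m[l_el] ^ i_k[l_el];
  }
}

// Model of FMLALB (i_top == false) and FMLALT (i_top == true).
// Every FP32 accumulator overlaps two FP16 source elements:
// the bottom variant uses the even-numbered one, the top variant the odd one.
void fmlal_model( bool                        i_top,
                  std::vector< float >       & io_acc,
                  std::vector< float > const & i_n,
                  std::vector< float > const & i_m ) {
  assert( i_n.size() == 2 * io_acc.size() && i_m.size() == i_n.size() );
  for( std::size_t l_el = 0; l_el < io_acc.size(); l_el++ ) {
    std::size_t l_src = 2 * l_el + ( i_top ? 1 : 0 );
    // widening multiply-add with a single rounding step
    io_acc[l_el] = std::fma( i_n[l_src], i_m[l_src], io_acc[l_el] );
  }
}
```

Applying fmlal_model once with i_top set to false and once with i_top set to true accumulates the products of all source elements. This pairing shows why the two instructions typically appear together.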

Hint

You’ll have to compile your code with SVE2 support enabled, e.g., by passing the flag -march=armv8-a+sve2 to Clang or GCC.

Hint

Use #include <arm_fp16.h> in your driver to use the data type float16_t. Details are available in the documentation of the Arm C Language Extensions.