Recent Features: A Sneak Peek
=============================
This section looks into recent features and the future of AArch64 by studying the idea of vector length agnostic programming and the Scalable Vector Extension version two.

Arm Instruction Emulator
------------------------
The A64FX processor was the first processor to support SVE and has 512-bit vector registers.
Graviton3 (Neoverse V1) also supports SVE but has 256-bit vector registers.
In this section we'll look at SVE instructions using different vector lengths, some of which are not supported by A64FX or V1.
The Arm Instruction Emulator (`ArmIE <https://developer.arm.com/Tools%20and%20Software/Arm%20Instruction%20Emulator>`__) allows us to emulate these instructions and, in particular, to run SVE2 code which we could not run otherwise.
Because of the emulation, we cannot expect performance anywhere near what hardware with native support would deliver.
However, in addition to "just" emulating instructions, ArmIE is able to instrument binaries and can, for example, count the number of executed instructions.
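
When comparing instruction counts across vector lengths, a rough expectation is handy.
The following sketch assumes 32-bit floating point data and a loop which consumes one vector per trip; the function names are illustrative and not part of ArmIE:

.. code-block:: c++

  #include <cassert>
  #include <cstdint>

  // Number of 32-bit lanes in one vector register of the given width.
  // SVE vector lengths are multiples of 128 bits, up to 2048 bits.
  inline std::uint64_t lanes_fp32( std::uint64_t i_vector_bits ) {
    assert( i_vector_bits >= 128 && i_vector_bits <= 2048 && i_vector_bits % 128 == 0 );
    return i_vector_bits / 32;
  }

  // Trips of a single predicated loop which processes one vector per trip
  // (hypothetical helper; rounds up since the last trip may be partial).
  inline std::uint64_t vla_loop_trips( std::uint64_t i_n_values,
                                       std::uint64_t i_vector_bits ) {
    std::uint64_t l_lanes = lanes_fp32( i_vector_bits );
    return ( i_n_values + l_lanes - 1 ) / l_lanes;
  }

For example, 100 values need 25 trips at 128 bits, 13 trips at 256 bits and 7 trips at 512 bits.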

If installed, ArmIE is available through the module system.
You may show available modules by running:

.. code-block:: bash

  module avail

and load ArmIE by running:

.. code-block:: bash

  module load armie22/22.0

.. admonition:: Tasks

   #. Make yourself familiar with the emulator and browse through `ArmIE's documentation <https://developer.arm.com/documentation/102190/latest>`__.
   #. Compile at least two SVE examples from the lecture slides and execute them!
      Run the code with different SVE vector lengths. Try at least 128, 256 and 512 bits.
   #. In your examples, count the number of AArch64 and SVE instructions by using ``libinscount_emulated.so``.
   #. Now, examine the memory access behavior of an example with load and/or store instructions by using ``libmemtrace_emulated.so``.

.. hint::

   ArmIE is `available <https://developer.arm.com/downloads/-/arm-instruction-emulator>`__ from Arm's developer resources.
   You may install ArmIE in user space and make it available through the module system by running the following lines:
   
   .. code-block:: bash

      ./arm-instruction-emulator_22.0_RHEL-8/arm-instruction-emulator_22.0_RHEL-8.sh --install-to ${HOME}/armie
      export MODULEPATH=$MODULEPATH:${HOME}/armie/modulefiles
      module load armie22/22.0

Vector Length Agnostic Programming
----------------------------------
In this part we'll use SVE to write a Vector Length Agnostic (VLA) function.
For this we'll vectorize a simple loop with an unknown number of iterations.
The number of entries in the arrays ``i_a``, ``i_b`` and ``i_c``, and thus the number of loop iterations, is an input parameter to the function ``triad_high``:

.. literalinclude:: data_sneak_peek/vla/triad_high.cpp
    :linenos:
    :language: cpp
    :caption: File ``triad_high.cpp`` which implements the triad-function in C/C++.

Not working with multiples of the vector length makes our life complicated when writing vectorized code.
For our SVE-based small GEMMs we first assumed a fixed vector length of 256 bits when writing :numref:`ch:gemms_sve_unrolled`'s :math:`(32 \times 6) = (32 \times 1) \times (1 \times 6)` microkernel.
We could then generalize the scope of our GEMMs to multiples of the microkernel's sizes through loops over :math:`M`, :math:`N` and :math:`K`.
A similar approach is feasible for most other instruction sets, e.g., ASIMD, AVX512 or OpenPower.
Only :numref:`ch:gemms_sve_arbitrary_m`'s uncommon :math:`M = 31` situation gave a glimpse into the power of VLA programming.
Through predication, we were able to simply shorten the vector length of a single instruction.
When programming ASIMD code, as one would when targeting Neoverse N1, we would have to issue multiple instructions to do the same.

In the case of the ``triad_high`` function the situation is even more complex since the number of iterations is parameter-dependent.
Now, one would typically implement two loops if writing ASIMD code: one which executes full vector instructions and a drain loop which takes care of the remaining iterations.
Instead, we'll use SVE's predicated instructions to express the same functionality with fewer instructions.
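
The difference between the two loop structures can be sketched in plain C++ before writing any assembly.
The following scalar model is not SVE code: the parameter ``i_lanes`` stands in for the number of 32-bit lanes per vector, and we assume that the triad computes ``c = a + 2 * b``, which may differ from ``triad_high.cpp``.
All function names are hypothetical:

.. code-block:: c++

  #include <cassert>
  #include <cstdint>
  #include <vector>

  // ASSUMPTION: the triad computes c = a + 2 * b; the real kernel may differ.
  inline float triad_op( float i_a, float i_b ) { return i_a + 2.0f * i_b; }

  // ASIMD-style structure: full vectors first, then a scalar drain loop.
  void triad_two_loops( std::uint64_t i_n_values, std::uint64_t i_lanes,
                        float const * i_a, float const * i_b, float * o_c ) {
    assert( i_lanes > 0 );
    std::uint64_t l_va = 0;
    // main loop: only runs while a full vector fits
    for( ; l_va + i_lanes <= i_n_values; l_va += i_lanes ) {
      for( std::uint64_t l_la = 0; l_la < i_lanes; l_la++ ) {
        o_c[l_va + l_la] = triad_op( i_a[l_va + l_la], i_b[l_va + l_la] );
      }
    }
    // drain loop: remaining values, one at a time
    for( ; l_va < i_n_values; l_va++ ) {
      o_c[l_va] = triad_op( i_a[l_va], i_b[l_va] );
    }
  }

  // SVE-style structure: a single loop, a predicate masks lanes past the end.
  void triad_predicated( std::uint64_t i_n_values, std::uint64_t i_lanes,
                         float const * i_a, float const * i_b, float * o_c ) {
    assert( i_lanes > 0 );
    std::vector< bool > l_pred( i_lanes );
    for( std::uint64_t l_va = 0; l_va < i_n_values; l_va += i_lanes ) {
      // models whilelt: lane l_la is active iff l_va + l_la < i_n_values
      for( std::uint64_t l_la = 0; l_la < i_lanes; l_la++ ) {
        l_pred[l_la] = ( l_va + l_la < i_n_values );
      }
      // models predicated load, multiply-add and store:
      // inactive lanes neither read nor write memory
      for( std::uint64_t l_la = 0; l_la < i_lanes; l_la++ ) {
        if( l_pred[l_la] ) {
          o_c[l_va + l_la] = triad_op( i_a[l_va + l_la], i_b[l_va + l_la] );
        }
      }
    }
  }

Both functions write exactly ``i_n_values`` results for any lane count, but only the predicated version gets by with a single loop and no special case for the remainder.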

.. admonition:: Tasks

   #. Implement a VLA function ``triad_low`` in the file ``triad_low.s`` with the following signature:

      .. code-block:: c++

         void triad_low( uint64_t         i_n_values,
                         float    const * i_a,
                         float    const * i_b,
                         float          * o_c )

      Exploit SVE's predicated vector instructions and use only a single loop to implement the functionality of ``triad_high``.

   #. Test and verify your function ``triad_low``. Use different array sizes and emulated vector lengths in your tests!

.. hint::

   You can implement ``triad_low`` by using the SVE instructions ``whilelt``, ``b.none``, ``fmov``, ``ld1w``, ``fmla``, ``st1w``, ``incw`` and ``b.any``.

SVE2
----
The Scalable Vector Extension version two (SVE2) is a superset of SVE.
SVE2 is an optional feature in `Armv9 <https://www.arm.com/architecture/cpu/a-profile>`__ and extends SVE by introducing instructions tailored to diverse workloads, e.g., Machine Learning, genomics or databases.

.. admonition:: Tasks

   #. Have a look at the tutorial `Introduction to SVE2 <https://developer.arm.com/tools-and-software/server-and-hpc/compile/arm-instruction-emulator/resources/tutorials/sve/sve-vs-sve2>`__.
   #. Write a small kernel to illustrate the behavior of the SVE2 instructions `FMLALB <https://developer.arm.com/documentation/ddi0602/2022-03/SVE-Instructions/FMLALB--vectors---Half-precision-floating-point-multiply-add-long-to-single-precision--bottom-->`__ and `FMLALT <https://developer.arm.com/documentation/ddi0602/2022-03/SVE-Instructions/FMLALT--vectors---Half-precision-floating-point-multiply-add-long-to-single-precision--top-->`__.
   #. Write a small kernel to illustrate the behavior of the SVE2 instruction `EOR3 <https://developer.arm.com/documentation/ddi0602/2022-03/SVE-Instructions/EOR3--Bitwise-exclusive-OR-of-three-vectors->`__.
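
Scalar reference implementations are useful for verifying such kernels.
The sketch below models the documented behavior of ``EOR3``, ``FMLALB`` and ``FMLALT`` in portable C++.
As an assumption made for portability, ``float`` stands in for ``float16_t``, so the inputs must be exactly representable in half precision; the function names are hypothetical:

.. code-block:: c++

  #include <cassert>
  #include <cmath>
  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Reference for EOR3: every result element is the XOR of the three inputs.
  std::vector< std::uint64_t > eor3_ref( std::vector< std::uint64_t > const & i_zdn,
                                         std::vector< std::uint64_t > const & i_zm,
                                         std::vector< std::uint64_t > const & i_zk ) {
    assert( i_zdn.size() == i_zm.size() && i_zm.size() == i_zk.size() );
    std::vector< std::uint64_t > l_res( i_zdn.size() );
    for( std::size_t l_el = 0; l_el < l_res.size(); l_el++ ) {
      l_res[l_el] = i_zdn[l_el] ^ i_zm[l_el] ^ i_zk[l_el];
    }
    return l_res;
  }

  // Reference for FMLALB (i_top == false) and FMLALT (i_top == true).
  // Every single-precision accumulator overlaps two half-precision inputs:
  // the bottom variant uses the even-numbered one, the top variant the odd one.
  // ASSUMPTION: float stands in for float16_t in i_zn and i_zm.
  void fmlal_ref( bool i_top,
                  std::vector< float > & io_acc,
                  std::vector< float > const & i_zn,
                  std::vector< float > const & i_zm ) {
    assert( i_zn.size() == 2 * io_acc.size() );
    assert( i_zm.size() == 2 * io_acc.size() );
    for( std::size_t l_el = 0; l_el < io_acc.size(); l_el++ ) {
      std::size_t l_src = 2 * l_el + ( i_top ? 1 : 0 );
      // fused multiply-add: no rounding between multiplication and addition
      io_acc[l_el] = std::fma( i_zn[l_src], i_zm[l_src], io_acc[l_el] );
    }
  }

Comparing the output of an SVE2 kernel against these references for a few vector lengths makes the difference between the bottom and top variants visible.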

.. hint::

   You'll have to compile your code with SVE2 support enabled, e.g., by providing the flag ``-march=armv8-a+sve2`` to Clang or GCC.

.. hint::

   Use ``#include <arm_fp16.h>`` in your driver to use the data type ``float16_t``.
   Details are available in the `documentation <https://developer.arm.com/documentation/101028/latest>`__ of the Arm C Language Extensions.