.. _ch:gemms_sve:

Small GEMMs: SVE
================

We wrote our first high-performance matrix kernel using the Advanced SIMD (ASIMD) extension in :numref:`ch:assembly_building_blocks`.
ASIMD is available on all AArch64 processors, including those that support the more advanced Scalable Vector Extension (SVE).
Since SVE is key for high performance on Arm HPC processors, we'll switch gears in this lab.
After getting familiar with SVE in :numref:`ch:gemms_sve_getting_started`, we will develop an SVE-based microkernel for small GEMMs in :numref:`ch:gemms_sve_unrolled`.
Then, we generalize this approach by adding nested :math:`K`, :math:`M` and :math:`N` loops around the microkernel in :numref:`ch:gemms_sve_unrolled`, :numref:`ch:gemms_sve_loop_m` and :numref:`ch:gemms_sve_loop_n`.
In :numref:`ch:gemms_sve_arbitrary_m`, we study the generalization of our approach such that we can handle values of :math:`M` which are not multiples of the used vector length.

.. _ch:gemms_sve_getting_started:

Getting Started
---------------

We'll get started by making ourselves familiar with SVE.
We require SVE instructions for our GEMM microkernel in :numref:`ch:gemms_sve_unrolled`.
SVE -- together with its superset SVE2 -- represents the future of vector processing on Arm processors.
The `A64FX processor `__, which powers the `Fugaku `__ supercomputer, and the `Graviton3 `__ processor already support SVE.

Once again we are on the lookout for information on the World Wide Web.
As before, the information is available through different channels, e.g., official documentation, tutorials, or recent announcements and news articles.
The links provided below are meant as a starting point so that we can get going quickly.
Ultimately, it's important to be able to locate resources independently.

.. admonition:: Tasks

   * Browse the Hot Chips 28 presentation `ARMv8-A Next-Generation Vector Architecture for HPC `__.
   * Check out the presentation `Introduction to ARM SVE by ARM Software Developers `__.
   * Have a look at the paper `The ARM Scalable Vector Extension `__.
   * Have a look at the `SVE entry point `__ on the official homepages.
   * Browse through the `Arm Architecture Reference Manual Supplement, The Scalable Vector Extension `__.
   * Browse through the `SVE part `__ of the ISA.
   * Browse through the `SVE Programming Examples Whitepaper `__.
   * Have a look at the SVE examples of the `Arm SVE Tools Tutorial `__.
   * Read about the V1 and N2 microarchitectures, e.g., on `AnandTech `__ or `The Next Platform `__.
   * Search for additional information on SVE and its superset SVE2.

.. _ch:gemms_sve_unrolled:

The Unrolled Part
-----------------

In this part we'll write an SVE-based microkernel for small GEMMs.
Our targeted kernel ``gemm_asm_sve_32_6_1`` has the following signature:

.. code-block:: c++

   void gemm_asm_sve_32_6_1( float const * i_a,
                             float const * i_b,
                             float       * io_c );

and performs the operation

.. math::

   C \mathrel{+}= A B

on 32-bit floating point data with

.. math::

   M = 32, \; N = 6, \; K = 1, \quad ldA = 32, \; ldB = 1, \; ldC = 32.

We'll follow the ASIMD approach taken in :numref:`ch:assembly_building_blocks_unrolled` and completely unroll the kernel, i.e., write every instruction explicitly without adding any control structures such as loops.
Once done, we'll add loops to increase the matrix dimensions.
As before, this is particularly simple for :math:`K`, which we'll already do at the end of this part.
Once again we have to preserve some SIMD registers in order to follow the procedure call standard `AAPCS64`_.
The template in :numref:`lst:gemm_asm_sve_32_6_1` may serve as a starting point for the implementation of your SVE microkernel.

.. literalinclude:: data_small_gemms_sve/template.s
   :name: lst:gemm_asm_sve_32_6_1
   :linenos:
   :language: asm
   :caption: Template for the :math:`(32\times6) \mathrel{+}= (32\times1)(1\times6)` matrix kernel. The template temporarily saves the general purpose registers X19-X30 and the lower 64 bits of the SIMD registers V8-V15 on the stack.
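For the verification task, a scalar reference is helpful. The following is a minimal sketch of such a reference (column-major storage, matching the sizes and leading dimensions above); the name ``gemm_ref_32_6_1`` is ours and not part of the lab code:

```cpp
#include <cstddef>

// Scalar reference for C += A B with M=32, N=6, K=1 and
// ldA=32, ldB=1, ldC=32 (column-major). Sketch for verification only;
// it mirrors the operation the assembly kernel is supposed to perform.
void gemm_ref_32_6_1( float const * i_a,
                      float const * i_b,
                      float       * io_c ) {
  for( std::size_t n = 0; n < 6; n++ )
    for( std::size_t m = 0; m < 32; m++ )
      // with K=1, column m of A is just a[m], row n of B is b[n]
      io_c[ n*32 + m ] += i_a[ m ] * i_b[ n ];
}
```

Running both the assembly kernel and the reference on identical inputs and comparing ``io_c`` element-wise then verifies the implementation.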
Additionally, the instruction ``ptrue p0.b`` sets all bits of the predicate register P0 to 1.

.. admonition:: Tasks

   #. Implement and verify the unrolled matrix kernel C += AB for M=32, N=6, K=1, ldA=32, ldB=1, ldC=32.
   #. Tune your kernel to squeeze more performance out of the core. You may change everything, e.g., the type or the order of the used instructions, but you have to follow the rules introduced in :numref:`ch:assembly_building_blocks_unrolled`. Report and document your optimizations.
   #. Add a loop over K to realize C += AB for M=32, N=6, K=48, ldA=32, ldB=48, ldC=32.
   #. Submit your team name together with your entries for "time (s)", "#executions", "GFLOPS" and "%peak" for the two kernels.

.. _AAPCS64: https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#simd-and-floating-point-registers

.. _ch:gemms_sve_loop_m:

Loop over M
-----------

First, we looked at a completely unrolled implementation with shapes :math:`(32 \times 6) \mathrel{+}= (32 \times 1) (1 \times 6)`.
Next, we extended our kernel by adding a loop over :math:`K` and implemented :math:`(32 \times 6) \mathrel{+}= (32 \times 48) (48 \times 6)`.
The remaining parts of the lab study two directions of generalization:

* Larger sizes :math:`M` and :math:`N` for our matrices.
* Values for :math:`M` which are not divisible by the vector length.

In this part, we'll write a kernel which performs the operation

.. math::

   C \mathrel{+}= A B

on 32-bit floating point data with

.. math::

   M = 128, \; N = 6, \; K = 48, \quad ldA = 128, \; ldB = 48, \; ldC = 128.

Due to the sizes of the involved matrices, it is not advisable to completely unroll the kernel.
In the general case one writes a microkernel and adds three respective loops over :math:`M`, :math:`N` and :math:`K`.
We already wrote the microkernel in :numref:`ch:gemms_sve_unrolled` and added a loop over :math:`K`.
Since we also used :math:`N=6` in our microkernel, only the additional loop over :math:`M` has to be added.
As in :numref:`ch:gemms_sve_unrolled`, the new loop requires us to slightly change our code to account for the changing data locations in the loop's iterations.
For now, we are stuck with performing these steps manually.
On the bright side: once understood, we can easily abstract them when generating code at runtime in :numref:`ch:jit`.

.. admonition:: Tasks

   * Implement and verify the matrix kernel C += AB for M=128, N=6, K=48, ldA=128, ldB=48, ldC=128. Re-use the code of your microkernel, implemented in :numref:`ch:gemms_sve_unrolled`.
   * Optimize your matrix kernel. Respect the rules of :numref:`ch:assembly_building_blocks_unrolled`. Report and document your optimizations.
   * Submit the metrics "time (s)", "#executions", "GFLOPS" and "%peak" together with your team name for your best-performing variant.

.. _ch:gemms_sve_loop_n:

Loop over N
-----------

Let's increase the complexity of our matrix kernel further.
Compared to :numref:`ch:gemms_sve_loop_m`, we increase the size of dimension :math:`N` from 6 to 48.
Specifically, we implement a kernel which performs the operation

.. math::

   C \mathrel{+}= A B

on 32-bit floating point data with

.. math::

   M = 128, \; N = 48, \; K = 48, \quad ldA = 128, \; ldB = 48, \; ldC = 128.

Once again, we simply have to add another loop and adjust the data locations accordingly.

.. admonition:: Tasks

   * Implement and verify the matrix kernel C += AB for M=128, N=48, K=48, ldA=128, ldB=48, ldC=128. Re-use the code of your kernel, implemented in :numref:`ch:gemms_sve_loop_m`.
   * Optimize your matrix kernel. Respect the rules of :numref:`ch:assembly_building_blocks_unrolled`. Report and document your optimizations.
   * Submit the metrics "time (s)", "#executions", "GFLOPS" and "%peak" together with your team name for your best-performing variant.

.. _ch:gemms_sve_arbitrary_m:

Arbitrary Values for M
----------------------

Supporting arbitrary values for :math:`K` is simple: we only have to change the number of iterations of the loop over :math:`K`.
Arbitrary values for :math:`N` are slightly more difficult since we might have to rethink our blocking and wrap a new microkernel.
The "difficult" case are values of :math:`M` which are not multiples of the vector length.
That's the challenge we'll tackle now by implementing a kernel which performs the operation

.. math::

   C \mathrel{+}= A B

on 32-bit floating point data with

.. math::

   M = 31, \; N = 6, \; K = 48, \quad ldA = 31, \; ldB = 48, \; ldC = 31.

.. admonition:: Tasks

   * Implement and verify the matrix kernel C += AB for M=31, N=6, K=48, ldA=31, ldB=48, ldC=31. Use predicated SVE instructions to tackle :math:`M=31`, which is not a multiple of 8. Re-use the code of your microkernel, implemented in :numref:`ch:gemms_sve_unrolled`.
   * Optimize your matrix kernel. Respect the rules of :numref:`ch:assembly_building_blocks_unrolled`. Report and document your optimizations.
   * Submit the metrics "time (s)", "#executions", "GFLOPS" and "%peak" together with your team name for your best-performing variant.
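To see what the predication has to accomplish, the tail handling can be illustrated in scalar C++. In this sketch, ``VL`` models the number of 32-bit lanes per vector (8 lanes on a 256-bit SVE implementation, matching the "multiple of 8" in the task above), and the lane count ``active`` plays the role of a ``whilelt``-generated predicate; the function name and structure are ours, for illustration only:

```cpp
#include <algorithm>
#include <cstddef>

// Number of 32-bit lanes per vector; assumed 256-bit SVE for this sketch.
constexpr std::size_t VL = 8;

// Scalar illustration of predicated M handling for C += A B with
// M=31, N=6, K=48, ldA=31, ldB=48, ldC=31 (column-major).
void gemm_ref_31_6_48( float const * i_a,
                       float const * i_b,
                       float       * io_c ) {
  constexpr std::size_t M = 31, N = 6, K = 48;
  for( std::size_t m0 = 0; m0 < M; m0 += VL ) {
    // active lanes in this block, i.e., the predicate whilelt(m0, M):
    // all VL lanes for the full blocks, 31 % 8 = 7 lanes for the tail
    std::size_t active = std::min( VL, M - m0 );
    for( std::size_t n = 0; n < N; n++ )
      for( std::size_t k = 0; k < K; k++ )
        for( std::size_t l = 0; l < active; l++ )  // only active lanes touch C
          io_c[ n*M + m0 + l ] += i_a[ k*M + m0 + l ] * i_b[ n*K + k ];
  }
}
```

In the assembly kernel, the same effect is obtained by using the predicate in the loads, FMLAs and stores of the final block, so that the lanes beyond row 31 neither read nor write memory.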