.. _ch:neoverse_v1:

Neoverse V1
===========

Accessing the Pis
-----------------

In class we will use a set of Raspberry Pis which act as our desktops.
As a bonus, the Pis may build and run your AArch64 code (incl. ASIMD) as well.
To access them you need a KSZ account, which you might still have to apply for.
For further information, please visit KSZ's `homepage `__.

The KSZ account allows you to log into the Raspberry Pis in the lab room:

#. You should see an Ubuntu login screen. **Don't enter your credentials here.** First, press **Ctrl + Alt + F2**.
#. Now a shell interface opens. Here, you can provide your username, press enter, enter your KSZ password and press enter again.
#. A few lines of text will pop up. You can ignore them. **Press enter one more time.**
#. Next, type the command ``startx`` and press enter.

Now you are all set up. Have fun! 😀

After finishing your work, you need to log out of the device:

#. In the bottom right corner of the screen, press the **red power button**.
#. A pop-up will open. Press **Logout**.
#. Now you are back in the shell. Just type ``exit``, press enter and you're done!

.. attention::

   Please don't shut the Pis down!

.. admonition:: Tasks

   #. Log into one of the lab room's Raspberry Pis.
   #. Open a terminal and run the two commands ``hostname`` and ``lscpu``.
   #. Log out from the Pi.

Accessing the Cloud Instance
----------------------------

HPC largely happens on dedicated machines.
Before anything productive can be done, we have to access these machines.
For the time being we'll be using a cloud instance.
The cloud instance features recent hardware and has a modern software stack.

.. admonition:: Tasks

   #. Generate a public/private SSH key pair, e.g., by using ``ssh-keygen -t ed25519 -C "your_email@example.com"``.
   #. Rename the *public* key using the following scheme: ``sshkey_surname_firstname.pub``. Do not share your private key with anyone! Be paranoid about it!
   #. Upload the renamed public key to |cloud_upload|. The password for the file drop was shared in the lectures.
   #. Email |email_ssh| stating that you successfully uploaded the key.
   #. Wait for an answer. The login details will be provided in the reply. This may take a day or two.
   #. Test your account by logging into the machine with the provided info. Remember to be a good citizen as discussed in class!
   #. Mount your home directory by using ``sshfs``. Remember to back up your files periodically, but be aware of your outbound/egress traffic (data leaving the cloud)!

Obtaining Information
---------------------

In the next few weeks we'll be using a `Graviton3 instance `_ which uses the `Neoverse V1 CPU `_.
This means that we are targeting Arm's AArch64.
We'll address floating point workloads using Advanced SIMD (ASIMD) instructions at first and then move to the Scalable Vector Extension (SVE).
All basics will be covered in the lectures; however, it's key to be able to locate accurate information independently.

.. admonition:: Tasks

   #. Watch `AWS re:Invent 2021 - {New Launch} Deep dive into AWS Graviton3 and Amazon EC2 C7g instances `__ (minutes 14 to 22 contain the important info) and `Announcing Amazon EC2 C7g instances powered by AWS Graviton3 `__.
   #. Read about Neoverse V1 in news outlets. Examples are `Arm Announces Neoverse V1, N2 Platforms & CPUs, CMN-700 Mesh: More Performance, More Cores, More Flexibility `__ by AnandTech and `Arm Details Neoverse V1 and N2 Platforms, New Mesh Design `__ by Tom's Hardware.
   #. Read about Graviton3 in news outlets.
      Examples are `Inside Amazon’s Graviton3 Arm Server Processor `__ by TheNextPlatform and `AWS Graviton3 Hits GA with 3 Sockets Per Motherboard Designs `__ by ServeTheHome.
   #. Browse the `AWS Graviton Technical Guide `__.

.. _ch:n1_impressions:

Getting a First Impression
--------------------------

Reading hardware documentation and tech news, watching presentations, and having a look at the respective slides are crucial for getting the most recent information.
However, given hardware access, we can also obtain information directly from the machine.

Many HPC workloads do floating point arithmetic.
One metric for a core or processor is the "theoretical peak performance".
It gives the number of floating point operations the core or processor might *theoretically* perform per second.
This number only exists on paper.
Can we actually obtain it?
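On paper, the peak value follows from a handful of machine parameters: the clock frequency :math:`f`, the number of pipelines :math:`n_\text{pipes}` that can execute a fused multiply-add (FMA) every cycle, the number of lanes :math:`n_\text{lanes}` processed per instruction, and the factor of two since each FMA contributes a multiply and an add:

.. math::

   P_\text{peak} = f \cdot n_\text{pipes} \cdot n_\text{lanes} \cdot 2.

As a back-of-the-envelope example only, assuming a single core running at 2.6 GHz with four 128-bit FMA-capable ASIMD pipes (two double-precision lanes each) would give :math:`2.6\,\text{GHz} \cdot 4 \cdot 2 \cdot 2 = 41.6` GFLOPS in double precision; treat these inputs as assumptions to verify against the documentation and against your measurements below.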
We'll study the achievable floating point performance by running microbenchmarks.
A microbenchmark is simple and neglects many aspects which complex software has to consider.
To benchmark the sustainable floating point performance, we'll assume a perfect world without data movements.
All data is assumed to be in vector registers and ready to be processed by floating point ops.
In four microbenchmarks we'll investigate the difference between single and double precision arithmetic, and the advantage of SIMD ops over scalar ops.
Here are the crucial parts of our microbenchmarks doing floating point math:

* Single-precision scalar:

  .. code-block:: asm

     fmadd s0, s30, s31, s0
     fmadd s1, s30, s31, s1
     fmadd s2, s30, s31, s2
     fmadd s3, s30, s31, s3
     fmadd s4, s30, s31, s4
     fmadd s5, s30, s31, s5
     fmadd s6, s30, s31, s6
     fmadd s7, s30, s31, s7
     fmadd s8, s30, s31, s8
     fmadd s9, s30, s31, s9
     fmadd s10, s30, s31, s10
     fmadd s11, s30, s31, s11
     fmadd s12, s30, s31, s12
     fmadd s13, s30, s31, s13
     fmadd s14, s30, s31, s14
     fmadd s15, s30, s31, s15
     fmadd s16, s30, s31, s16
     fmadd s17, s30, s31, s17
     fmadd s18, s30, s31, s18
     fmadd s19, s30, s31, s19
     fmadd s20, s30, s31, s20
     fmadd s21, s30, s31, s21
     fmadd s22, s30, s31, s22
     fmadd s23, s30, s31, s23
     fmadd s24, s30, s31, s24
     fmadd s25, s30, s31, s25
     fmadd s26, s30, s31, s26
     fmadd s27, s30, s31, s27
     fmadd s28, s30, s31, s28
     fmadd s29, s30, s31, s29

* Double-precision scalar:

  .. code-block:: asm

     fmadd d0, d30, d31, d0
     fmadd d1, d30, d31, d1
     fmadd d2, d30, d31, d2
     fmadd d3, d30, d31, d3
     fmadd d4, d30, d31, d4
     fmadd d5, d30, d31, d5
     fmadd d6, d30, d31, d6
     fmadd d7, d30, d31, d7
     fmadd d8, d30, d31, d8
     fmadd d9, d30, d31, d9
     fmadd d10, d30, d31, d10
     fmadd d11, d30, d31, d11
     fmadd d12, d30, d31, d12
     fmadd d13, d30, d31, d13
     fmadd d14, d30, d31, d14
     fmadd d15, d30, d31, d15
     fmadd d16, d30, d31, d16
     fmadd d17, d30, d31, d17
     fmadd d18, d30, d31, d18
     fmadd d19, d30, d31, d19
     fmadd d20, d30, d31, d20
     fmadd d21, d30, d31, d21
     fmadd d22, d30, d31, d22
     fmadd d23, d30, d31, d23
     fmadd d24, d30, d31, d24
     fmadd d25, d30, d31, d25
     fmadd d26, d30, d31, d26
     fmadd d27, d30, d31, d27
     fmadd d28, d30, d31, d28
     fmadd d29, d30, d31, d29

* Single-precision SIMD:

  .. code-block:: asm

     fmla v0.4s, v30.4s, v31.4s
     fmla v1.4s, v30.4s, v31.4s
     fmla v2.4s, v30.4s, v31.4s
     fmla v3.4s, v30.4s, v31.4s
     fmla v4.4s, v30.4s, v31.4s
     fmla v5.4s, v30.4s, v31.4s
     fmla v6.4s, v30.4s, v31.4s
     fmla v7.4s, v30.4s, v31.4s
     fmla v8.4s, v30.4s, v31.4s
     fmla v9.4s, v30.4s, v31.4s
     fmla v10.4s, v30.4s, v31.4s
     fmla v11.4s, v30.4s, v31.4s
     fmla v12.4s, v30.4s, v31.4s
     fmla v13.4s, v30.4s, v31.4s
     fmla v14.4s, v30.4s, v31.4s
     fmla v15.4s, v30.4s, v31.4s
     fmla v16.4s, v30.4s, v31.4s
     fmla v17.4s, v30.4s, v31.4s
     fmla v18.4s, v30.4s, v31.4s
     fmla v19.4s, v30.4s, v31.4s
     fmla v20.4s, v30.4s, v31.4s
     fmla v21.4s, v30.4s, v31.4s
     fmla v22.4s, v30.4s, v31.4s
     fmla v23.4s, v30.4s, v31.4s
     fmla v24.4s, v30.4s, v31.4s
     fmla v25.4s, v30.4s, v31.4s
     fmla v26.4s, v30.4s, v31.4s
     fmla v27.4s, v30.4s, v31.4s
     fmla v28.4s, v30.4s, v31.4s
     fmla v29.4s, v30.4s, v31.4s

* Double-precision SIMD:

  .. code-block:: asm

     fmla v0.2d, v30.2d, v31.2d
     fmla v1.2d, v30.2d, v31.2d
     fmla v2.2d, v30.2d, v31.2d
     fmla v3.2d, v30.2d, v31.2d
     fmla v4.2d, v30.2d, v31.2d
     fmla v5.2d, v30.2d, v31.2d
     fmla v6.2d, v30.2d, v31.2d
     fmla v7.2d, v30.2d, v31.2d
     fmla v8.2d, v30.2d, v31.2d
     fmla v9.2d, v30.2d, v31.2d
     fmla v10.2d, v30.2d, v31.2d
     fmla v11.2d, v30.2d, v31.2d
     fmla v12.2d, v30.2d, v31.2d
     fmla v13.2d, v30.2d, v31.2d
     fmla v14.2d, v30.2d, v31.2d
     fmla v15.2d, v30.2d, v31.2d
     fmla v16.2d, v30.2d, v31.2d
     fmla v17.2d, v30.2d, v31.2d
     fmla v18.2d, v30.2d, v31.2d
     fmla v19.2d, v30.2d, v31.2d
     fmla v20.2d, v30.2d, v31.2d
     fmla v21.2d, v30.2d, v31.2d
     fmla v22.2d, v30.2d, v31.2d
     fmla v23.2d, v30.2d, v31.2d
     fmla v24.2d, v30.2d, v31.2d
     fmla v25.2d, v30.2d, v31.2d
     fmla v26.2d, v30.2d, v31.2d
     fmla v27.2d, v30.2d, v31.2d
     fmla v28.2d, v30.2d, v31.2d
     fmla v29.2d, v30.2d, v31.2d

We'll learn how to write such kernels (including data movement) soon.
For now, it's sufficient to know that each of the ``fmadd`` lines does a scalar FMA operation and each of the ``fmla`` lines a SIMD FMA operation.
For example, ``fmla v21.4s, v30.4s, v31.4s`` describes the following:

* Operate on four single-precision values in parallel (``.4s``);
* Multiply the data in the SIMD and floating point source registers ``v30`` and ``v31``; and
* Add the result to the destination register ``v21``.

We have to put some boilerplate code around the inner parts of the microbenchmarks to execute them and measure performance.
Good news: You are in luck!
Somebody did the work for the single-precision SIMD case for you 😉.
The provided example code `aarch64_micro <../_static/aarch64_micro.tar.xz>`_ already contains a wrapping function ``uint64_t peak_asimd_fmla_sp( uint64_t i_n_repetitions )`` in ``kernels/peak_asimd_fmla_sp.s``.
The function repeatedly executes the inner part and returns the number of floating point operations per iteration.
Further, the driver in ``driver_asimd.cpp`` supports microbenchmarking multiple cores through OpenMP and reports the required time and sustained GFLOPS.
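The sketch below illustrates one way such a wrapper can be structured; the label names, the handling of the callee-saved SIMD registers and the elided instructions are illustrative only, so consult the provided file for the actual implementation.

.. code-block:: asm

           .text
           .type peak_asimd_fmla_sp, %function
           .global peak_asimd_fmla_sp
   /*
    * Illustrative wrapper around the single-precision SIMD inner part.
    *
    * @param x0 number of repetitions (assumed to be greater than zero).
    * @return number of floating point operations per iteration.
    */
   peak_asimd_fmla_sp:
           // save the callee-saved lower halves of v8-v15 (AAPCS64)
           stp d8,  d9,  [sp, #-64]!
           stp d10, d11, [sp, #16]
           stp d12, d13, [sp, #32]
           stp d14, d15, [sp, #48]

   loop_repetitions:
           // inner part: 30 independent FMLA instructions, cf. the listing above
           fmla v0.4s, v30.4s, v31.4s
           // [...] v1 through v28 omitted in this sketch
           fmla v29.4s, v30.4s, v31.4s

           subs x0, x0, #1
           b.ne loop_repetitions

           // restore the callee-saved registers
           ldp d10, d11, [sp, #16]
           ldp d12, d13, [sp, #32]
           ldp d14, d15, [sp, #48]
           ldp d8,  d9,  [sp], #64

           // 30 FMLA x 4 lanes x 2 ops (multiply + add) = 240 floating point operations
           mov x0, #240
           ret

Given the FLOP count returned this way, the driver can convert the measured time and the number of repetitions into sustained GFLOPS.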
.. admonition:: Tasks

   #. Print basic information on the machine, located in ``/proc/cpuinfo`` and ``/proc/meminfo``. Try the tools ``lscpu`` and ``lstopo-no-graphics``.
   #. Build and test the provided example code.
   #. Add new kernels for the remaining benchmarks, i.e., single-precision scalar, double-precision scalar and double-precision SIMD, in the files ``kernels/peak_asimd_fmadd_sp.s``, ``kernels/peak_asimd_fmadd_dp.s`` and ``kernels/peak_asimd_fmla_dp.s``. Extend the driver accordingly.
   #. Benchmark the sustainable floating point performance of the Graviton3 instance. Perform the following studies:

      * Run our microbenchmarks on 1-4 cores. Plot the sustained floating point performance as a function of the number of cores used.
      * Run our microbenchmarks with 1-4 threads, but pin all of them to a single core. Plot the sustained floating point performance as a function of the number of threads used.

      .. hint::

         * Use the environment variable ``OMP_NUM_THREADS`` to set the number of threads. For example, setting ``OMP_NUM_THREADS=2`` would use two threads.
         * Use the environment variable ``OMP_PLACES`` to pin your threads to cores. For example, ``OMP_PLACES={0}`` pins all threads to the first core, whereas ``OMP_PLACES={0}:4:2`` defines four places on every other core, i.e., cores 0, 2, 4 and 6.

   #. Write new kernels which replace the fused multiply-add instructions with floating-point multiplies: `FMUL (vector) `_, `FMUL (scalar) `_. What do you observe when comparing the results to the FMA kernels? A sketch of the required inner lines is given after this list.
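For the last task, only the arithmetic instruction itself changes. As a rough sketch, the single-precision inner lines might look as follows; note that ``fmul`` performs one floating point operation per lane (or per scalar register), whereas an FMA counts as two, so the FLOP count returned by your kernels has to be adjusted accordingly.

.. code-block:: asm

   // SIMD single precision: one multiply per lane, no accumulation
   fmul v0.4s, v30.4s, v31.4s
   fmul v1.4s, v30.4s, v31.4s
   // [...] continue analogously up to v29.4s

   // scalar single precision
   fmul s0, s30, s31
   fmul s1, s30, s31
   // [...] continue analogously up to s29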