1. Neoverse V1

1.1. Accessing the Pis

In class we will use a set of Raspberry Pis which act as our desktops. As a bonus, the Pis can also build and run your AArch64 code (incl. ASIMD). To access them you need a KSZ account, which you might still have to apply for. For further information, please visit KSZ’s homepage. The KSZ account allows you to log into the Raspberry Pis in the lab room:

  1. You should see an Ubuntu login screen. Don’t enter your credentials here. First, press Ctrl + Alt + F2.

  2. Now a shell interface opens. Enter your username, press enter, then type your KSZ password and press enter again.

  3. A few lines of text will pop up. You can ignore them. Press enter one more time.

  4. Next, type the command startx and press enter. Now you are all set up. Have fun! 😀

After finishing your work, you need to log out of the device:

  1. In the bottom right corner of the screen, press the red power button.

  2. A pop-up will open. Press Logout.

  3. Now you are back in the shell. Just type exit, press enter and you’re done!

Attention

Please don’t shut the Pis down!

Tasks

  1. Log into one of the lab room’s Raspberry Pis.

  2. Open a terminal and run the two commands hostname and lscpu.

  3. Log out from the Pi.

1.2. Accessing the Cloud Instance

HPC largely happens on dedicated machines. Before anything productive can be done, we have to gain access to them. For the time being we’ll be using a cloud instance, which features recent hardware and a modern software stack.

Tasks

  1. Generate a public/private SSH key pair, e.g., by using ssh-keygen -t ed25519 -C "your_email@example.com".

  2. Rename the public key using the following scheme sshkey_surname_firstname.pub. Do not share your private key with anyone! Be paranoid about it!

  3. Upload the renamed public key to FSU-Cloud. The password for the file drop was shared in the lectures.

  4. Email shima.bani@uni-jena.de stating that you successfully uploaded the key.

  5. Wait for an answer. The login details will be provided in the reply. This may take a day or two.

  6. Test your account by logging into the machine with the provided info. Remember to be a good citizen as discussed in class!

  7. Mount your home directory using sshfs. Remember to back up your files periodically, but be aware of your outbound/egress traffic (data leaving the cloud)!

1.3. Obtaining Information

In the next few weeks we’ll be using a Graviton3 instance, which is based on the Neoverse V1 CPU. This means that we are targeting Arm’s AArch64 architecture. We’ll address floating point workloads using Advanced SIMD (ASIMD) instructions at first and then move to the Scalable Vector Extension (SVE). All basics will be covered in the lectures; however, it’s key to be able to locate accurate information independently.

1.4. Getting a First Impression

Reading hardware documentation and tech news, watching presentations, and having a look at the respective slides is crucial for getting the most recent information. However, given hardware access, we can also obtain information directly from the machine.

Many HPC workloads do floating point arithmetic. One metric for a core or processor is the “theoretical peak performance”. It gives the number of floating point operations the core or processor might theoretically perform per second. This number only exists on paper. Can we actually obtain it?
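
Before measuring anything, it helps to know what number to compare against. As a hedged sketch (the pipe count, vector width, and clock frequency below are purely illustrative assumptions which you should check against the Neoverse V1 and Graviton3 documentation), the theoretical peak follows from:

    P_{\text{peak}} = n_{\text{cores}} \cdot f \cdot n_{\text{FP pipes}} \cdot n_{\text{lanes per pipe}} \cdot 2

The trailing factor of 2 accounts for an FMA counting as two floating point operations (one multiply and one add). Purely for illustration, a single hypothetical core with four 128-bit FMA pipes (four single-precision lanes each) clocked at 2.6 GHz would reach 2.6 GHz · 4 · 4 · 2 ≈ 83.2 GFLOPS in single precision, and half of that in double precision since only two 64-bit lanes fit into 128 bits.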

We’ll study the achievable floating point performance by running microbenchmarks. A microbenchmark is simple and neglects many aspects which complex software has to consider. To benchmark the sustainable floating point performance, we’ll assume a perfect world without data movement: all data is assumed to already reside in vector registers, ready to be processed by floating point ops. In four microbenchmarks we’ll investigate the difference between single- and double-precision arithmetic, and the impact of SIMD ops over scalar ops. Here are the crucial parts of our microbenchmarks doing the floating point math:

  • Single-precision scalar:

    fmadd s0, s30, s31, s0
    fmadd s1, s30, s31, s1
    fmadd s2, s30, s31, s2
    fmadd s3, s30, s31, s3
    
    fmadd s4, s30, s31, s4
    fmadd s5, s30, s31, s5
    fmadd s6, s30, s31, s6
    fmadd s7, s30, s31, s7
    
    fmadd s8, s30, s31, s8
    fmadd s9, s30, s31, s9
    fmadd s10, s30, s31, s10
    fmadd s11, s30, s31, s11
    
    fmadd s12, s30, s31, s12
    fmadd s13, s30, s31, s13
    fmadd s14, s30, s31, s14
    fmadd s15, s30, s31, s15
    
    fmadd s16, s30, s31, s16
    fmadd s17, s30, s31, s17
    fmadd s18, s30, s31, s18
    fmadd s19, s30, s31, s19
    
    fmadd s20, s30, s31, s20
    fmadd s21, s30, s31, s21
    fmadd s22, s30, s31, s22
    fmadd s23, s30, s31, s23
    
    fmadd s24, s30, s31, s24
    fmadd s25, s30, s31, s25
    fmadd s26, s30, s31, s26
    fmadd s27, s30, s31, s27
    
    fmadd s28, s30, s31, s28
    fmadd s29, s30, s31, s29
    
  • Double-precision scalar:

    fmadd d0, d30, d31, d0
    fmadd d1, d30, d31, d1
    fmadd d2, d30, d31, d2
    fmadd d3, d30, d31, d3
    
    fmadd d4, d30, d31, d4
    fmadd d5, d30, d31, d5
    fmadd d6, d30, d31, d6
    fmadd d7, d30, d31, d7
    
    fmadd d8, d30, d31, d8
    fmadd d9, d30, d31, d9
    fmadd d10, d30, d31, d10
    fmadd d11, d30, d31, d11
    
    fmadd d12, d30, d31, d12
    fmadd d13, d30, d31, d13
    fmadd d14, d30, d31, d14
    fmadd d15, d30, d31, d15
    
    fmadd d16, d30, d31, d16
    fmadd d17, d30, d31, d17
    fmadd d18, d30, d31, d18
    fmadd d19, d30, d31, d19
    
    fmadd d20, d30, d31, d20
    fmadd d21, d30, d31, d21
    fmadd d22, d30, d31, d22
    fmadd d23, d30, d31, d23
    
    fmadd d24, d30, d31, d24
    fmadd d25, d30, d31, d25
    fmadd d26, d30, d31, d26
    fmadd d27, d30, d31, d27
    
    fmadd d28, d30, d31, d28
    fmadd d29, d30, d31, d29
    
  • Single-precision SIMD:

    fmla v0.4s, v30.4s, v31.4s
    fmla v1.4s, v30.4s, v31.4s
    fmla v2.4s, v30.4s, v31.4s
    fmla v3.4s, v30.4s, v31.4s
    
    fmla v4.4s, v30.4s, v31.4s
    fmla v5.4s, v30.4s, v31.4s
    fmla v6.4s, v30.4s, v31.4s
    fmla v7.4s, v30.4s, v31.4s
    
    fmla v8.4s, v30.4s, v31.4s
    fmla v9.4s, v30.4s, v31.4s
    fmla v10.4s, v30.4s, v31.4s
    fmla v11.4s, v30.4s, v31.4s
    
    fmla v12.4s, v30.4s, v31.4s
    fmla v13.4s, v30.4s, v31.4s
    fmla v14.4s, v30.4s, v31.4s
    fmla v15.4s, v30.4s, v31.4s
    
    fmla v16.4s, v30.4s, v31.4s
    fmla v17.4s, v30.4s, v31.4s
    fmla v18.4s, v30.4s, v31.4s
    fmla v19.4s, v30.4s, v31.4s
    
    fmla v20.4s, v30.4s, v31.4s
    fmla v21.4s, v30.4s, v31.4s
    fmla v22.4s, v30.4s, v31.4s
    fmla v23.4s, v30.4s, v31.4s
    
    fmla v24.4s, v30.4s, v31.4s
    fmla v25.4s, v30.4s, v31.4s
    fmla v26.4s, v30.4s, v31.4s
    fmla v27.4s, v30.4s, v31.4s
    
    fmla v28.4s, v30.4s, v31.4s
    fmla v29.4s, v30.4s, v31.4s
    
  • Double-precision SIMD:

    fmla v0.2d, v30.2d, v31.2d
    fmla v1.2d, v30.2d, v31.2d
    fmla v2.2d, v30.2d, v31.2d
    fmla v3.2d, v30.2d, v31.2d
    
    fmla v4.2d, v30.2d, v31.2d
    fmla v5.2d, v30.2d, v31.2d
    fmla v6.2d, v30.2d, v31.2d
    fmla v7.2d, v30.2d, v31.2d
    
    fmla v8.2d, v30.2d, v31.2d
    fmla v9.2d, v30.2d, v31.2d
    fmla v10.2d, v30.2d, v31.2d
    fmla v11.2d, v30.2d, v31.2d
    
    fmla v12.2d, v30.2d, v31.2d
    fmla v13.2d, v30.2d, v31.2d
    fmla v14.2d, v30.2d, v31.2d
    fmla v15.2d, v30.2d, v31.2d
    
    fmla v16.2d, v30.2d, v31.2d
    fmla v17.2d, v30.2d, v31.2d
    fmla v18.2d, v30.2d, v31.2d
    fmla v19.2d, v30.2d, v31.2d
    
    fmla v20.2d, v30.2d, v31.2d
    fmla v21.2d, v30.2d, v31.2d
    fmla v22.2d, v30.2d, v31.2d
    fmla v23.2d, v30.2d, v31.2d
    
    fmla v24.2d, v30.2d, v31.2d
    fmla v25.2d, v30.2d, v31.2d
    fmla v26.2d, v30.2d, v31.2d
    fmla v27.2d, v30.2d, v31.2d
    
    fmla v28.2d, v30.2d, v31.2d
    fmla v29.2d, v30.2d, v31.2d
    

We’ll learn how to write such kernels (including data movement) soon. For now, it’s sufficient to know that each of the fmadd lines performs a scalar FMA operation and each of the fmla lines a SIMD FMA operation; both forms are written out as arithmetic right after this list. For example, fmla v21.4s, v30.4s, v31.4s describes the following:

  • Operate on four single-precision values in parallel (.4s);

  • Multiply, element-wise, the data in the SIMD and floating-point source registers v30 and v31; and

  • Add the results to the corresponding elements of the destination register v21.
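
Using the same register numbers as in the listings above, the scalar and SIMD FMAs compute the following; note that a scalar FMA counts as two floating point operations and a 128-bit SIMD FMA as two per lane:

    fmadd s0, s30, s31, s0          // s0 = s30 * s31 + s0                    (2 FLOPs)
    fmadd d0, d30, d31, d0          // d0 = d30 * d31 + d0                    (2 FLOPs)
    fmla v21.4s, v30.4s, v31.4s     // v21[i] += v30[i] * v31[i],  i = 0..3   (8 FLOPs)
    fmla v21.2d, v30.2d, v31.2d     // v21[i] += v30[i] * v31[i],  i = 0..1   (4 FLOPs)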

We have to put some boilerplate code around the inner parts of the microbenchmarks to execute them and measure performance. Good news: somebody already did the work for the single-precision SIMD case for you 😉. The provided example code aarch64_micro already contains a wrapping function uint64_t peak_asimd_fmla_sp( uint64_t i_n_repetitions ) in kernels/peak_asimd_fmla_sp.s. The function repeatedly executes the inner part and returns the number of floating point operations performed per iteration. Further, the driver in driver_asimd.cpp supports microbenchmarking multiple cores through OpenMP and reports the required time and the sustained GFLOPS.
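
To give a first idea of what such a wrapper looks like, here is a minimal sketch under stated assumptions: it uses a shortened inner part of only four FMAs and a hypothetical function name, whereas the provided kernels/peak_asimd_fmla_sp.s uses the full block of 30 FMAs shown above and may be organized differently.

        .text
        .type peak_asimd_fmla_sp_sketch, %function
        .global peak_asimd_fmla_sp_sketch

    /*
     * Minimal sketch of a peak-performance wrapper (hypothetical name).
     * param x0: number of repetitions.
     * return:   number of floating point operations per repetition.
     */
    peak_asimd_fmla_sp_sketch:
        // give the source registers well-defined values
        movi v30.4s, #0
        movi v31.4s, #0

    loop_repetitions:
        // shortened inner part: four independent SIMD FMAs
        fmla v0.4s, v30.4s, v31.4s
        fmla v1.4s, v30.4s, v31.4s
        fmla v2.4s, v30.4s, v31.4s
        fmla v3.4s, v30.4s, v31.4s

        // decrement the repetition counter and loop until it reaches zero
        sub x0, x0, #1
        cbnz x0, loop_repetitions

        // 4 FMAs x 4 single-precision lanes x 2 ops (mul + add) = 32 FLOPs per repetition
        mov x0, #32
        ret

From this return value, the number of repetitions and the measured time, the driver can derive the sustained GFLOPS.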

Tasks

  1. Print basic information on the machine from /proc/cpuinfo and /proc/meminfo. Also try the tools lscpu and lstopo-no-graphics.

  2. Build and test the provided example code.

  3. Add new kernels for the remaining benchmarks (single-precision scalar, double-precision scalar and double-precision SIMD) in the files kernels/peak_asimd_fmadd_sp.s, kernels/peak_asimd_fmadd_dp.s and kernels/peak_asimd_fmla_dp.s. Extend the driver accordingly.

  4. Benchmark the sustainable floating point performance of the Graviton3 instance. Perform the following studies:

    • Run our microbenchmarks on 1-4 cores. Plot the sustained floating point performance as a function of the number of cores used.

    • Run our microbenchmarks with 1-4 threads but pin all of them to a single core. Plot the sustained floating point performance as a function of the number of threads used.

    Hint

    • Use the environment variable OMP_NUM_THREADS to set the number of threads. For example, setting OMP_NUM_THREADS=2 would use two threads.

    • Use the environment variable OMP_PLACES to pin your threads to cores. For example, OMP_PLACES={0} only uses the first core, whereas OMP_PLACES={0}:4:2 creates four places covering every other core, i.e., cores 0, 2, 4 and 6.

  5. Write new kernels which replace the fused multiply-add instructions with floating-point multiplies: FMUL (vector), FMUL (scalar); the plain-multiply instruction forms are sketched below. What do you observe when comparing the results to the FMA kernels?
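
As a starting point (a sketch only, with register numbers chosen to match the listings above), the multiply-only counterparts have the same operand structure as the FMAs but no accumulator:

    fmul s0, s30, s31             // scalar single precision: s0 = s30 * s31           (1 FLOP)
    fmul d0, d30, d31             // scalar double precision: d0 = d30 * d31           (1 FLOP)
    fmul v0.4s, v30.4s, v31.4s    // SIMD single precision:   v0[i] = v30[i] * v31[i]  (4 FLOPs)
    fmul v0.2d, v30.2d, v31.2d    // SIMD double precision:   v0[i] = v30[i] * v31[i]  (2 FLOPs)

Keep in mind that a multiply counts as only one floating point operation per lane, i.e., half of an FMA, so the FLOP count returned by these kernels has to be adjusted accordingly.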