1. Neoverse V1
1.1. Accessing the Pis
In class we will use a set of Raspberry Pis which act as our desktops. As a bonus, the Pis can also build and run your AArch64 code (incl. ASIMD). To access them you need a KSZ account, which you might still have to apply for; for further information, please visit KSZ’s homepage. The KSZ account allows you to log into the Raspberry Pis in the lab room:
You should see an Ubuntu login screen. Don’t enter your credentials here. First, press Ctrl + Alt + F2.
Now a shell interface opens. Enter your username, press enter, then enter your KSZ password and press enter again.
A few lines of text will pop up. You can ignore them. Press enter one more time.
Next, type the command
startx
and press enter. Now you are all set up. Have fun! 😀
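Summarized as a console session, the login sequence might look as follows (the shown hostname is just a placeholder):
# after pressing Ctrl + Alt + F2 on the login screen:
pi-lab login: your_ksz_username
Password:
# a few lines of text pop up; press enter once more, then:
startx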
After finishing your work, you need to log out of the device:
In the bottom right corner of the screen, press the red power button.
A pop-up will open. Press Logout.
Now you are back in the shell. Just type
exit
and press enter. You’re done!
Attention
Please don’t shut the Pis down!
Tasks
Log into one of the lab room’s Raspberry Pis.
Open a terminal and run the two commands hostname and lscpu.
Log out from the Pi.
1.2. Accessing the Cloud Instance
HPC largely happens on dedicated machines. Before anything productive can be done, we have to access these machines. For the time being we’ll be using a cloud instance, which features recent hardware and a modern software stack.
Tasks
Generate a public/private SSH key pair, e.g., by using
ssh-keygen -t ed25519 -C "your_email@example.com"
Rename the public key using the following scheme: sshkey_surname_firstname.pub. Do not share your private key with anyone! Be paranoid about it!
Upload the renamed public key to FSU-Cloud. The password for the file drop was shared in the lectures.
Email shima.bani@uni-jena.de stating that you successfully uploaded the key.
Wait for an answer. The login details will be provided in the reply. This may take a day or two.
Test your account by logging into the machine with the provided info. Remember to be a good citizen as discussed in class!
Mount your home directory by using sshfs; a minimal example is sketched below. Remember to back up your files periodically, but be aware of your outbound/egress traffic (data leaving the cloud)!
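For example, a minimal sshfs session could look like this; username, hostname, and mount point are placeholders for the details from the reply mail:
mkdir -p ~/cloud_home
sshfs your_user@your_cloud_host:/home/your_user ~/cloud_home
# ... work with the files as if they were local ...
fusermount -u ~/cloud_home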
1.3. Obtaining Information
In the next few weeks we’ll be using a Graviton3 instance which uses the Neoverse V1 CPU. This means that we are targeting Arm’s AArch64 architecture. We’ll address floating point workloads using Advanced SIMD (ASIMD) instructions at first and then move to the Scalable Vector Extension (SVE). All basics will be covered in the lectures; however, it’s key to be able to locate accurate information independently.
Tasks
Watch AWS re:Invent 2021 - {New Launch} Deep dive into AWS Graviton3 and Amazon EC2 C7g instances (minutes 14 to 22 contain the important info) and Announcing Amazon EC2 C7g instances powered by AWS Graviton3.
Read about Neoverse V1 in news outlets. Examples are Arm Announces Neoverse V1, N2 Platforms & CPUs, CMN-700 Mesh: More Performance, More Cores, More Flexibility by AnandTech and Arm Details Neoverse V1 and N2 Platforms, New Mesh Design by Tom’s Hardware.
Read about Graviton3 in news outlets. Examples are Inside Amazon’s Graviton3 Arm Server Processor by TheNextPlatform and AWS Graviton3 Hits GA with 3 Sockets Per Motherboard Designs by ServeTheHome.
Browse the AWS Graviton Technical Guide.
1.4. Getting a First Impression
Reading hardware documentation and tech news, watching presentations, and having a look at the respective slides is crucial to get the most recent info. However, given hardware access, we can also obtain information directly from the machine.
Many HPC workloads do floating point arithmetic. One metric for a core or processor is the “theoretical peak performance”. It gives the number of floating point operations the core or processor might theoretically perform per second. This number only exists on paper. Can we actually obtain it?
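As a back-of-the-envelope sketch: the theoretical peak follows from clock frequency × FMA pipes per core × lanes per pipe × 2, since a fused multiply-add counts as both a multiply and an add. Assuming, for illustration only, a 2.6 GHz clock and four 128-bit pipes with two FP64 lanes each, a single core would reach 2.6 × 4 × 2 × 2 ≈ 41.6 FP64 GFLOPS. Verify these assumed numbers against the Neoverse V1 and Graviton3 material from the previous section.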
We’ll study the achievable floating point performance by running microbenchmarks. A microbenchmark is simple and neglects many aspects which complex software has to consider. To benchmark the sustainable floating point performance, we’ll assume a perfect world without data movements: all data is assumed to be in vector registers and ready to be processed by floating point ops. In four microbenchmarks we’ll investigate the difference between single and double precision arithmetic, and the impact of SIMD ops over scalar ops. Here are the crucial parts of our microbenchmarks doing floating point math:
Single-precision scalar:
fmadd s0, s30, s31, s0
fmadd s1, s30, s31, s1
fmadd s2, s30, s31, s2
fmadd s3, s30, s31, s3
fmadd s4, s30, s31, s4
fmadd s5, s30, s31, s5
fmadd s6, s30, s31, s6
fmadd s7, s30, s31, s7
fmadd s8, s30, s31, s8
fmadd s9, s30, s31, s9
fmadd s10, s30, s31, s10
fmadd s11, s30, s31, s11
fmadd s12, s30, s31, s12
fmadd s13, s30, s31, s13
fmadd s14, s30, s31, s14
fmadd s15, s30, s31, s15
fmadd s16, s30, s31, s16
fmadd s17, s30, s31, s17
fmadd s18, s30, s31, s18
fmadd s19, s30, s31, s19
fmadd s20, s30, s31, s20
fmadd s21, s30, s31, s21
fmadd s22, s30, s31, s22
fmadd s23, s30, s31, s23
fmadd s24, s30, s31, s24
fmadd s25, s30, s31, s25
fmadd s26, s30, s31, s26
fmadd s27, s30, s31, s27
fmadd s28, s30, s31, s28
fmadd s29, s30, s31, s29
Double-precision scalar:
fmadd d0, d30, d31, d0
fmadd d1, d30, d31, d1
fmadd d2, d30, d31, d2
fmadd d3, d30, d31, d3
fmadd d4, d30, d31, d4
fmadd d5, d30, d31, d5
fmadd d6, d30, d31, d6
fmadd d7, d30, d31, d7
fmadd d8, d30, d31, d8
fmadd d9, d30, d31, d9
fmadd d10, d30, d31, d10
fmadd d11, d30, d31, d11
fmadd d12, d30, d31, d12
fmadd d13, d30, d31, d13
fmadd d14, d30, d31, d14
fmadd d15, d30, d31, d15
fmadd d16, d30, d31, d16
fmadd d17, d30, d31, d17
fmadd d18, d30, d31, d18
fmadd d19, d30, d31, d19
fmadd d20, d30, d31, d20
fmadd d21, d30, d31, d21
fmadd d22, d30, d31, d22
fmadd d23, d30, d31, d23
fmadd d24, d30, d31, d24
fmadd d25, d30, d31, d25
fmadd d26, d30, d31, d26
fmadd d27, d30, d31, d27
fmadd d28, d30, d31, d28
fmadd d29, d30, d31, d29
Single-precision SIMD:
fmla v0.4s, v30.4s, v31.4s
fmla v1.4s, v30.4s, v31.4s
fmla v2.4s, v30.4s, v31.4s
fmla v3.4s, v30.4s, v31.4s
fmla v4.4s, v30.4s, v31.4s
fmla v5.4s, v30.4s, v31.4s
fmla v6.4s, v30.4s, v31.4s
fmla v7.4s, v30.4s, v31.4s
fmla v8.4s, v30.4s, v31.4s
fmla v9.4s, v30.4s, v31.4s
fmla v10.4s, v30.4s, v31.4s
fmla v11.4s, v30.4s, v31.4s
fmla v12.4s, v30.4s, v31.4s
fmla v13.4s, v30.4s, v31.4s
fmla v14.4s, v30.4s, v31.4s
fmla v15.4s, v30.4s, v31.4s
fmla v16.4s, v30.4s, v31.4s
fmla v17.4s, v30.4s, v31.4s
fmla v18.4s, v30.4s, v31.4s
fmla v19.4s, v30.4s, v31.4s
fmla v20.4s, v30.4s, v31.4s
fmla v21.4s, v30.4s, v31.4s
fmla v22.4s, v30.4s, v31.4s
fmla v23.4s, v30.4s, v31.4s
fmla v24.4s, v30.4s, v31.4s
fmla v25.4s, v30.4s, v31.4s
fmla v26.4s, v30.4s, v31.4s
fmla v27.4s, v30.4s, v31.4s
fmla v28.4s, v30.4s, v31.4s
fmla v29.4s, v30.4s, v31.4s
Double-precision SIMD:
fmla v0.2d, v30.2d, v31.2d
fmla v1.2d, v30.2d, v31.2d
fmla v2.2d, v30.2d, v31.2d
fmla v3.2d, v30.2d, v31.2d
fmla v4.2d, v30.2d, v31.2d
fmla v5.2d, v30.2d, v31.2d
fmla v6.2d, v30.2d, v31.2d
fmla v7.2d, v30.2d, v31.2d
fmla v8.2d, v30.2d, v31.2d
fmla v9.2d, v30.2d, v31.2d
fmla v10.2d, v30.2d, v31.2d
fmla v11.2d, v30.2d, v31.2d
fmla v12.2d, v30.2d, v31.2d
fmla v13.2d, v30.2d, v31.2d
fmla v14.2d, v30.2d, v31.2d
fmla v15.2d, v30.2d, v31.2d
fmla v16.2d, v30.2d, v31.2d
fmla v17.2d, v30.2d, v31.2d
fmla v18.2d, v30.2d, v31.2d
fmla v19.2d, v30.2d, v31.2d
fmla v20.2d, v30.2d, v31.2d
fmla v21.2d, v30.2d, v31.2d
fmla v22.2d, v30.2d, v31.2d
fmla v23.2d, v30.2d, v31.2d
fmla v24.2d, v30.2d, v31.2d
fmla v25.2d, v30.2d, v31.2d
fmla v26.2d, v30.2d, v31.2d
fmla v27.2d, v30.2d, v31.2d
fmla v28.2d, v30.2d, v31.2d
fmla v29.2d, v30.2d, v31.2d
We’ll learn how to write such kernels (including data movement) soon.
For now, it’s sufficient to know that each of the fmadd lines does a scalar FMA operation and each of the fmla lines a SIMD FMA operation. For example,
fmla v21.4s, v30.4s, v31.4s
describes the following:
Operate on four single-precision values in parallel (.4s);
Multiply the data in the SIMD and floating point source registers v30 and v31; and
Add the result to the destination register v21.
We have to put some boilerplate code around the inner parts of the microbenchmarks to execute them and measure performance.
Good news: You are in luck! Somebody did the work for the single-precision SIMD case for you 😉.
The provided example code aarch64_micro already contains a wrapping function
uint64_t peak_asimd_fmla_sp( uint64_t i_n_repetitions )
in kernels/peak_asimd_fmla_sp.s. The function repeatedly executes the inner part and returns the number of floating point operations per iteration. Further, the driver in driver_asimd.cpp supports microbenchmarking multiple cores through OpenMP and reports the required time and sustained GFLOPS.
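To give an idea of what such a driver does, here is a minimal sketch; it assumes only the kernel signature given above and is not the provided driver_asimd.cpp:
// minimal driver sketch, assuming the kernel signature above
#include <cstdint>
#include <iostream>
#include <omp.h>

extern "C" uint64_t peak_asimd_fmla_sp( uint64_t i_n_repetitions );

int main() {
  uint64_t l_n_repetitions = 100000000; // hypothetical repetition count
  uint64_t l_ops_per_rep   = 0;

  double l_time = omp_get_wtime();

  // every thread executes the kernel; the return value is the
  // number of floating point operations per repetition
#pragma omp parallel reduction(max: l_ops_per_rep)
  l_ops_per_rep = peak_asimd_fmla_sp( l_n_repetitions );

  l_time = omp_get_wtime() - l_time;

  double l_gflops = double( omp_get_max_threads() )
                  * double( l_n_repetitions )
                  * double( l_ops_per_rep )
                  / ( l_time * 1.0E9 );

  std::cout << "duration:  " << l_time   << " seconds" << std::endl;
  std::cout << "sustained: " << l_gflops << " GFLOPS"  << std::endl;

  return 0;
}
Such a driver could be built, e.g., via g++ -fopenmp driver.cpp kernels/peak_asimd_fmla_sp.s.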
Tasks
Print basic information on the machine located in /proc/cpuinfo and /proc/meminfo. Try the tools lscpu and lstopo-no-graphics.
Build and test the provided example code.
Add new kernels for the remaining benchmarks (single-precision scalar, double-precision scalar and double-precision SIMD) in the files kernels/peak_asimd_fmadd_sp.s, kernels/peak_asimd_fmadd_dp.s and kernels/peak_asimd_fmla_dp.s. Extend the driver accordingly; a structural sketch of such a kernel is given at the end of this section.
Benchmark the sustainable floating point performance of the Graviton3 instance. Perform the following studies:
Run our microbenchmarks on 1-4 cores. Plot the sustained floating point performance as a function of the number of used cores.
Run our microbenchmarks on 1-4 threads but pin all of them to a single core. Plot the sustained floating point performance as a function of the number of used threads.
Hint
Use the environment variable OMP_NUM_THREADS to set the number of threads. For example, setting OMP_NUM_THREADS=2 would use two threads.
Use the environment variable OMP_PLACES to pin your threads to cores. For example, OMP_PLACES={0} only uses the first core, whereas OMP_PLACES={0}:4:2 creates four places using every other core (cores 0, 2, 4 and 6).
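Putting both together, a pinned run could be launched as follows (the binary name driver_asimd is an assumption):
# four threads, all pinned to the first core
OMP_NUM_THREADS=4 OMP_PLACES={0} ./driver_asimd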
Write new kernels which replace the fused multiply-add instructions with floating-point multiplies: FMUL (vector) and FMUL (scalar). What do you observe when comparing the results to the FMA kernels?
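As referenced in the kernel-writing task above, the following is a structural sketch of the double-precision SIMD kernel. The labels and loop layout are assumptions modeled after the provided single-precision kernel, not a reference solution:
    .text
    .type peak_asimd_fmla_dp, %function
    .global peak_asimd_fmla_dp
// uint64_t peak_asimd_fmla_dp( uint64_t i_n_repetitions );
peak_asimd_fmla_dp:
    // save callee-saved SIMD registers (AAPCS64: d8-d15)
    stp d8,  d9,  [sp, #-64]!
    stp d10, d11, [sp, #16]
    stp d12, d13, [sp, #32]
    stp d14, d15, [sp, #48]

    mov x1, x0                     // x0: number of repetitions
loop_repetitions:
    sub x1, x1, #1

    // inner part: 30 independent double-precision SIMD FMAs
    fmla v0.2d, v30.2d, v31.2d
    fmla v1.2d, v30.2d, v31.2d
    // ... v2.2d through v28.2d analogously ...
    fmla v29.2d, v30.2d, v31.2d

    cbnz x1, loop_repetitions

    // restore callee-saved SIMD registers
    ldp d14, d15, [sp, #48]
    ldp d12, d13, [sp, #32]
    ldp d10, d11, [sp, #16]
    ldp d8,  d9,  [sp], #64

    // FP ops per repetition: 30 FMAs x 2 lanes x 2 ops = 120
    mov x0, #120
    ret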