A64FX

The V1 cloud instances introduced in Section 1 were a quick way to get started: A public SSH key is sufficient. However, we’ll work on A64FX processors for most of the class. A64FX also implements AArch64 and supports ASIMD instructions. This means that we could run programs on A64FX which have been optimized for V1. In general, however, this leads to low performance due to the large architectural differences. In particular, A64FX’s support for the Scalable Vector Extension (SVE) has to be exploited to reach a high utilization of the processor.

Accessing the Machine

For working on A64FX we’ll harness the Future Technologies Platform (FTP). The cluster is hosted by NHR@KIT and follows a stricter access procedure. We’ll start the process early so that the individual accounts are ready when needed.

Tasks

  1. Fill out and sign the Access Form for NHR-System HoreKa.

  2. Scan the form and convert it to PDF if necessary. Use the naming scheme ftp_scalablefsu_surname_firstname.pdf.

  3. Upload the document to FSU-Cloud. The password for the file drop is shared in class.

  4. Log into the Federated Login Service (FELS):

    • Select Helmholtz AAI

    • Select Friedrich Schiller University Jena

    • Obtain the information under Index -> Personal Data.

  5. Send an e-mail to alex.breuer@uni-jena.de that you successfully uploaded the completed form. Also put the information from FELS into your e-mail.

  6. Wait for a confirmation that everything looks good. This may take a day or two.

  7. Wait some more… 🥱

  8. Once you have the green light that you should proceed, self-register for the Future Technologies Partition.

    • Set up two-factor authentication (e.g., using Authy on your smartphone)

    • Add your public SSH key under “Index -> My SSH Pubkeys”

    • Set the SSH key under “Registered Services -> Set SSH Key”

    • Have a peek at “Registered Services -> Registry Info -> More details for interested”

First Steps

Time to get our feet wet! Our first goal is simple: Connect to a compute node and print some information about it.

Tasks

  1. Log into the FTP by using the X86 login node:

    ssh <username>@ftp-x86-login.scc.kit.edu
    

    Your username is available in the Federated Login Service under “Registered services / Future Technologies Partition / Registry info”.

    Hint

    Only selected IP ranges are able to reach the login nodes of the FTP. Friedrich Schiller University is whitelisted, i.e., you have to be in the university’s network to access the FTP. From home you may use a VPN to achieve this.

  2. Log back out from the X86 partition of the FTP. We’ll work on the A64FX partition from now on but needed a single X86 login to set up our home directories.

  3. Log into the FTP by using the A64FX login node:

    ssh <username>@ftp-a64-head.scc.kit.edu
    
  4. Allocate an A64FX compute node for three hours:

    salloc -N 1 -p a64fx -t 03:00:00
    

    Get the id of your job and the name of your node through the squeue command.

  5. Connect to the compute node and, as already done in Section 1.4, print information on the CPU.

    Hint

    FTP’s Slurm configuration automatically connects you to the compute node in the terminal which used salloc. If you’d like to connect to the node from a second session, you may simply ssh to it.

  6. Try the module system on a compute node as described in FTP’s documentation.

  7. Log out from the compute node and release the compute resources. Double check that your job was canceled.

    Hint

    The command scancel allows you to cancel any of your jobs at any time. For example, to cancel your job with id 171 you’d type scancel 171.

Obtaining Information

As already done in Section 1 for Neoverse V1, we’ll first make ourselves familiar with A64FX by studying available information.

Tasks

  1. Watch the Hot Chips 30 presentation Fujitsu High Performance CPU for the Post-K computer (mins 32-60).

  2. Read about A64FX on the vendor’s homepage and in HPC news, e.g., at The Next Platform or at The Register.

  3. Read about Fugaku, e.g., in Fujitsu’s report Supercomputer Fugaku CPU A64FX Realizing High Performance, High-Density Packaging, and Low Power Consumption.

  4. Browse through recent events covering A64FX. Examples are events of the Arm HPC User Group or the Ookami User Group Meeting.

Microbenchmarks

Let’s microbenchmark A64FX! Some observations won’t change, e.g., we accepted that we should not pin all threads on a single core 😁. Instead, let’s have a look at how the achievable performance increases when using SVE instead of ASIMD instructions.

Once again, the important parts of the SVE kernels are provided. We’ll learn about the details soon, stay tuned! Before doing any floating point operations, we have to set a predicate register to enable 512-bit vector instructions on A64FX. In our microbenchmarks we’ll use predicate register p0 and set all of its bits to true just after entering a wrapping function:

ptrue p0.b

Now, the crucial parts doing floating point math are as follows:

  • Single-precision SIMD:

    fmla z0.s, p0/m, z30.s, z31.s
    fmla z1.s, p0/m, z30.s, z31.s
    fmla z2.s, p0/m, z30.s, z31.s
    fmla z3.s, p0/m, z30.s, z31.s
    
    fmla z4.s, p0/m, z30.s, z31.s
    fmla z5.s, p0/m, z30.s, z31.s
    fmla z6.s, p0/m, z30.s, z31.s
    fmla z7.s, p0/m, z30.s, z31.s
    
    fmla z8.s, p0/m, z30.s, z31.s
    fmla z9.s, p0/m, z30.s, z31.s
    fmla z10.s, p0/m, z30.s, z31.s
    fmla z11.s, p0/m, z30.s, z31.s
    
    fmla z12.s, p0/m, z30.s, z31.s
    fmla z13.s, p0/m, z30.s, z31.s
    fmla z14.s, p0/m, z30.s, z31.s
    fmla z15.s, p0/m, z30.s, z31.s
    
    fmla z16.s, p0/m, z30.s, z31.s
    fmla z17.s, p0/m, z30.s, z31.s
    fmla z18.s, p0/m, z30.s, z31.s
    fmla z19.s, p0/m, z30.s, z31.s
    
    fmla z20.s, p0/m, z30.s, z31.s
    fmla z21.s, p0/m, z30.s, z31.s
    fmla z22.s, p0/m, z30.s, z31.s
    fmla z23.s, p0/m, z30.s, z31.s
    
    fmla z24.s, p0/m, z30.s, z31.s
    fmla z25.s, p0/m, z30.s, z31.s
    fmla z26.s, p0/m, z30.s, z31.s
    fmla z27.s, p0/m, z30.s, z31.s
    
    fmla z28.s, p0/m, z30.s, z31.s
    fmla z29.s, p0/m, z30.s, z31.s
    
  • Double-precision SIMD:

    fmla z0.d, p0/m, z30.d, z31.d
    fmla z1.d, p0/m, z30.d, z31.d
    fmla z2.d, p0/m, z30.d, z31.d
    fmla z3.d, p0/m, z30.d, z31.d
    
    fmla z4.d, p0/m, z30.d, z31.d
    fmla z5.d, p0/m, z30.d, z31.d
    fmla z6.d, p0/m, z30.d, z31.d
    fmla z7.d, p0/m, z30.d, z31.d
    
    fmla z8.d, p0/m, z30.d, z31.d
    fmla z9.d, p0/m, z30.d, z31.d
    fmla z10.d, p0/m, z30.d, z31.d
    fmla z11.d, p0/m, z30.d, z31.d
    
    fmla z12.d, p0/m, z30.d, z31.d
    fmla z13.d, p0/m, z30.d, z31.d
    fmla z14.d, p0/m, z30.d, z31.d
    fmla z15.d, p0/m, z30.d, z31.d
    
    fmla z16.d, p0/m, z30.d, z31.d
    fmla z17.d, p0/m, z30.d, z31.d
    fmla z18.d, p0/m, z30.d, z31.d
    fmla z19.d, p0/m, z30.d, z31.d
    
    fmla z20.d, p0/m, z30.d, z31.d
    fmla z21.d, p0/m, z30.d, z31.d
    fmla z22.d, p0/m, z30.d, z31.d
    fmla z23.d, p0/m, z30.d, z31.d
    
    fmla z24.d, p0/m, z30.d, z31.d
    fmla z25.d, p0/m, z30.d, z31.d
    fmla z26.d, p0/m, z30.d, z31.d
    fmla z27.d, p0/m, z30.d, z31.d
    
    fmla z28.d, p0/m, z30.d, z31.d
    fmla z29.d, p0/m, z30.d, z31.d
    

Before getting to it, let’s have a look at one of the instructions. Knowing that all bits in predicate register p0 are set to 1 and that A64FX has a vector length of 512 bits, the instruction fmla z21.s, p0/m, z30.s, z31.s does the following:

  • Operates on single-precision values (.s), i.e., 16 lanes per 512-bit vector;

  • Multiplies the 16 values in SVE register z30 element-wise with those in z31; and

  • Adds the products to the corresponding elements of the destination register z21.

Tasks

  1. Build Section 1.4’s ASIMD microbenchmarks on A64FX. Benchmark the achievable performance when using the ASIMD kernels on 1-48 cores.

  2. Add two new kernels in kernels/peak_sve_fmla_sp.s and kernels/peak_sve_fmla_dp.s which use single- and double-precision SVE FMLA instructions. Add a new driver in driver_sve.cpp which benchmarks the SVE-based kernels. Benchmark the achievable performance on 1-48 cores.

  3. Plot the sustained floating point performance of your two studies w.r.t. the number of used cores. What do you observe?