1. Neoverse V1

1.1. Accessing the Pis

In class we will use a set of Raspberry Pis which act as our desktops. As a bonus, the Pis can also build and run your AArch64 code (incl. ASIMD). To access them you need a KSZ account, which you might still have to apply for. For further information, please visit KSZ’s homepage. The KSZ account allows you to log into the Raspberry Pis in the lab room:

  1. You should see an Ubuntu login screen. Don’t enter your credentials here. First, press Ctrl + Alt + F2.

  2. Now a shell interface opens. Enter your username, press enter, then type your KSZ password and press enter again.

  3. A few lines of text will pop up. You can ignore them. Press enter one more time.

  4. Next, type the command startx and press enter. Now you are all set up. Have fun! 😀

After finishing your work, you need to log out of the device:

  1. In the bottom right corner of the screen, press the red power button.

  2. A pop-up will open. Press Logout.

  3. Now you are back in the shell. Just type exit, press enter and you’re done!

Attention

Please don’t shut the Pis down!

Tasks

  1. Log into one of the lab room’s Raspberry Pis.

  2. Open a terminal and run the two commands hostname and lscpu.

  3. Log out from the Pi.

1.2. Accessing the Cloud Instance

HPC largely happens on dedicated machines. Before anything productive can be done, we have to gain access to them. For the time being we’ll be using a cloud instance, which features recent hardware and a modern software stack.

Tasks

  1. Generate a public/private SSH key pair, e.g., by using ssh-keygen -t ed25519 -C "your_email@example.com".

  2. Rename the public key using the following scheme sshkey_surname_firstname.pub. Do not share your private key with anyone! Be paranoid about it!

  3. Upload the renamed public key to FSU-Cloud. The password for the file drop was shared in the lectures.

  4. Email shima.bani@uni-jena.de stating that you successfully uploaded the key.

  5. Wait for an answer. The login details will be provided in the reply. This may take a day or two.

  6. Test your account by logging into the machine with the provided info. Remember to be a good citizen as discussed in class!

  7. Mount your home directory using sshfs. Remember to back up your files periodically, but be aware of your outbound/egress traffic (data leaving the cloud)!

1.3. Obtaining Information

In the next few weeks we’ll be using a Graviton3 instance, which is based on the Neoverse V1 CPU. This means that we are targeting Arm’s AArch64 architecture. We’ll address floating point workloads using Advanced SIMD (ASIMD) instructions at first and then move to the Scalable Vector Extension (SVE). All basics will be covered in the lectures; however, it’s key to be able to locate accurate information independently.

1.4. Getting a First Impression

Reading hardware documentation and tech news, watching presentations, and having a look at the respective slides is crucial for getting the most recent information. However, given hardware access, we can also obtain information directly from the machine.

Many HPC workloads do floating point arithmetic. One metric for a core or processor is the “theoretical peak performance”. It gives the number of floating point operations the core or processor might theoretically perform per second. This number only exists on paper. Can we actually obtain it?
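
Before measuring anything, it helps to know what number to compare against. As a hedged sketch (the pipe count, vector width, and clock frequency below are purely illustrative assumptions which you should check against the Neoverse V1 and Graviton3 documentation), the theoretical peak follows from:

    P_{\text{peak}} = n_{\text{cores}} \cdot f \cdot n_{\text{FP pipes}} \cdot n_{\text{lanes per pipe}} \cdot 2

The trailing factor of 2 accounts for an FMA counting as two floating point operations (one multiply and one add). Purely for illustration, a single hypothetical core with four 128-bit FMA pipes (four single-precision lanes each) clocked at 2.6 GHz would reach 2.6 GHz · 4 · 4 · 2 ≈ 83.2 GFLOPS in single precision, and half of that in double precision since only two 64-bit lanes fit into 128 bits.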

We’ll study the achievable floating point performance by running microbenchmarks. A microbenchmark is simple and neglects many aspects which complex software has to consider. To benchmark the sustainable floating point performance, we’ll assume a perfect world without data movement: all data is assumed to already reside in vector registers, ready to be processed by floating point ops. In four microbenchmarks we’ll investigate the difference between single- and double-precision arithmetic, and the impact of SIMD ops over scalar ops. Here are the crucial parts of our microbenchmarks doing the floating point math:

  • Single-precision scalar:

    fmadd s0, s30, s31, s0
    fmadd s1, s30, s31, s1
    fmadd s2, s30, s31, s2
    fmadd s3, s30, s31, s3
    
    fmadd s4, s30, s31, s4
    fmadd s5, s30, s31, s5
    fmadd s6, s30, s31, s6
    fmadd s7, s30, s31, s7
    
    fmadd s8, s30, s31, s8
    fmadd s9, s30, s31, s9
    fmadd s10, s30, s31, s10
    fmadd s11, s30, s31, s11
    
    fmadd s12, s30, s31, s12
    fmadd s13, s30, s31, s13
    fmadd s14, s30, s31, s14
    fmadd s15, s30, s31, s15
    
    fmadd s16, s30, s31, s16
    fmadd s17, s30, s31, s17
    fmadd s18, s30, s31, s18
    fmadd s19, s30, s31, s19
    
    fmadd s20, s30, s31, s20
    fmadd s21, s30, s31, s21
    fmadd s22, s30, s31, s22
    fmadd s23, s30, s31, s23
    
    fmadd s24, s30, s31, s24
    fmadd s25, s30, s31, s25
    fmadd s26, s30, s31, s26
    fmadd s27, s30, s31, s27
    
    fmadd s28, s30, s31, s28
    fmadd s29, s30, s31, s29
    
  • Double-precision scalar:

    fmadd d0, d30, d31, d0
    fmadd d1, d30, d31, d1
    fmadd d2, d30, d31, d2
    fmadd d3, d30, d31, d3
    
    fmadd d4, d30, d31, d4
    fmadd d5, d30, d31, d5
    fmadd d6, d30, d31, d6
    fmadd d7, d30, d31, d7
    
    fmadd d8, d30, d31, d8
    fmadd d9, d30, d31, d9
    fmadd d10, d30, d31, d10
    fmadd d11, d30, d31, d11
    
    fmadd d12, d30, d31, d12
    fmadd d13, d30, d31, d13
    fmadd d14, d30, d31, d14
    fmadd d15, d30, d31, d15
    
    fmadd d16, d30, d31, d16
    fmadd d17, d30, d31, d17
    fmadd d18, d30, d31, d18
    fmadd d19, d30, d31, d19
    
    fmadd d20, d30, d31, d20
    fmadd d21, d30, d31, d21
    fmadd d22, d30, d31, d22
    fmadd d23, d30, d31, d23
    
    fmadd d24, d30, d31, d24
    fmadd d25, d30, d31, d25
    fmadd d26, d30, d31, d26
    fmadd d27, d30, d31, d27
    
    fmadd d28, d30, d31, d28
    fmadd d29, d30, d31, d29
    
  • Single-precision SIMD:

    fmla v0.4s, v30.4s, v31.4s
    fmla v1.4s, v30.4s, v31.4s
    fmla v2.4s, v30.4s, v31.4s
    fmla v3.4s, v30.4s, v31.4s
    
    fmla v4.4s, v30.4s, v31.4s
    fmla v5.4s, v30.4s, v31.4s
    fmla v6.4s, v30.4s, v31.4s
    fmla v7.4s, v30.4s, v31.4s
    
    fmla v8.4s, v30.4s, v31.4s
    fmla v9.4s, v30.4s, v31.4s
    fmla v10.4s, v30.4s, v31.4s
    fmla v11.4s, v30.4s, v31.4s
    
    fmla v12.4s, v30.4s, v31.4s
    fmla v13.4s, v30.4s, v31.4s
    fmla v14.4s, v30.4s, v31.4s
    fmla v15.4s, v30.4s, v31.4s
    
    fmla v16.4s, v30.4s, v31.4s
    fmla v17.4s, v30.4s, v31.4s
    fmla v18.4s, v30.4s, v31.4s
    fmla v19.4s, v30.4s, v31.4s
    
    fmla v20.4s, v30.4s, v31.4s
    fmla v21.4s, v30.4s, v31.4s
    fmla v22.4s, v30.4s, v31.4s
    fmla v23.4s, v30.4s, v31.4s
    
    fmla v24.4s, v30.4s, v31.4s
    fmla v25.4s, v30.4s, v31.4s
    fmla v26.4s, v30.4s, v31.4s
    fmla v27.4s, v30.4s, v31.4s
    
    fmla v28.4s, v30.4s, v31.4s
    fmla v29.4s, v30.4s, v31.4s
    
  • Double-precision SIMD:

    fmla v0.2d, v30.2d, v31.2d
    fmla v1.2d, v30.2d, v31.2d
    fmla v2.2d, v30.2d, v31.2d
    fmla v3.2d, v30.2d, v31.2d
    
    fmla v4.2d, v30.2d, v31.2d
    fmla v5.2d, v30.2d, v31.2d
    fmla v6.2d, v30.2d, v31.2d
    fmla v7.2d, v30.2d, v31.2d
    
    fmla v8.2d, v30.2d, v31.2d
    fmla v9.2d, v30.2d, v31.2d
    fmla v10.2d, v30.2d, v31.2d
    fmla v11.2d, v30.2d, v31.2d
    
    fmla v12.2d, v30.2d, v31.2d
    fmla v13.2d, v30.2d, v31.2d
    fmla v14.2d, v30.2d, v31.2d
    fmla v15.2d, v30.2d, v31.2d
    
    fmla v16.2d, v30.2d, v31.2d
    fmla v17.2d, v30.2d, v31.2d
    fmla v18.2d, v30.2d, v31.2d
    fmla v19.2d, v30.2d, v31.2d
    
    fmla v20.2d, v30.2d, v31.2d
    fmla v21.2d, v30.2d, v31.2d
    fmla v22.2d, v30.2d, v31.2d
    fmla v23.2d, v30.2d, v31.2d
    
    fmla v24.2d, v30.2d, v31.2d
    fmla v25.2d, v30.2d, v31.2d
    fmla v26.2d, v30.2d, v31.2d
    fmla v27.2d, v30.2d, v31.2d
    
    fmla v28.2d, v30.2d, v31.2d
    fmla v29.2d, v30.2d, v31.2d
    

We’ll learn how to write such kernels (including data movement) soon. For now, it’s sufficient to know that each of the fmadd lines performs a scalar FMA operation and each of the fmla lines a SIMD FMA operation; both forms are written out as arithmetic right after this list. For example, fmla v21.4s, v30.4s, v31.4s describes the following:

  • Operate on four single-precision values in parallel (.4s);

  • Multiply, element-wise, the data in the SIMD and floating-point source registers v30 and v31; and

  • Add the results to the corresponding elements of the destination register v21.
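
Using the same register numbers as in the listings above, the scalar and SIMD FMAs compute the following; note that a scalar FMA counts as two floating point operations and a 128-bit SIMD FMA as two per lane:

    fmadd s0, s30, s31, s0          // s0 = s30 * s31 + s0                    (2 FLOPs)
    fmadd d0, d30, d31, d0          // d0 = d30 * d31 + d0                    (2 FLOPs)
    fmla v21.4s, v30.4s, v31.4s     // v21[i] += v30[i] * v31[i],  i = 0..3   (8 FLOPs)
    fmla v21.2d, v30.2d, v31.2d     // v21[i] += v30[i] * v31[i],  i = 0..1   (4 FLOPs)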

We have to put some boilerplate code around the inner parts of the microbenchmarks to execute them and measure performance. Good news: somebody already did the work for the single-precision SIMD case for you 😉. The provided example code aarch64_micro already contains a wrapping function uint64_t peak_asimd_fmla_sp( uint64_t i_n_repetitions ) in kernels/peak_asimd_fmla_sp.s. The function repeatedly executes the inner part and returns the number of floating point operations performed per iteration. Further, the driver in driver_asimd.cpp supports microbenchmarking multiple cores through OpenMP and reports the required time and the sustained GFLOPS.
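
To give a first idea of what such a wrapper looks like, here is a minimal sketch under stated assumptions: it uses a shortened inner part of only four FMAs and a hypothetical function name, whereas the provided kernels/peak_asimd_fmla_sp.s uses the full block of 30 FMAs shown above and may be organized differently.

        .text
        .type peak_asimd_fmla_sp_sketch, %function
        .global peak_asimd_fmla_sp_sketch

    /*
     * Minimal sketch of a peak-performance wrapper (hypothetical name).
     * param x0: number of repetitions.
     * return:   number of floating point operations per repetition.
     */
    peak_asimd_fmla_sp_sketch:
        // give the source registers well-defined values
        movi v30.4s, #0
        movi v31.4s, #0

    loop_repetitions:
        // shortened inner part: four independent SIMD FMAs
        fmla v0.4s, v30.4s, v31.4s
        fmla v1.4s, v30.4s, v31.4s
        fmla v2.4s, v30.4s, v31.4s
        fmla v3.4s, v30.4s, v31.4s

        // decrement the repetition counter and loop until it reaches zero
        sub x0, x0, #1
        cbnz x0, loop_repetitions

        // 4 FMAs x 4 single-precision lanes x 2 ops (mul + add) = 32 FLOPs per repetition
        mov x0, #32
        ret

From this return value, the number of repetitions and the measured time, the driver can derive the sustained GFLOPS.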

Tasks

  1. Print basic information on the machine from /proc/cpuinfo and /proc/meminfo. Also try the tools lscpu and lstopo-no-graphics.

  2. Build and test the provided example code.

  3. Add new kernels for the remaining benchmarks (single-precision scalar, double-precision scalar and double-precision SIMD) in the files kernels/peak_asimd_fmadd_sp.s, kernels/peak_asimd_fmadd_dp.s and kernels/peak_asimd_fmla_dp.s. Extend the driver accordingly.

  4. Benchmark the sustainable floating point performance of the Graviton3 instance. Perform the following studies:

    • Run our microbenchmarks on 1-4 cores. Plot the sustained floating point performance as a function of the number of cores used.

    • Run our microbenchmarks with 1-4 threads but pin all of them to a single core. Plot the sustained floating point performance as a function of the number of threads used.

    Hint

    • Use the environment variable OMP_NUM_THREADS to set the number of threads. For example, setting OMP_NUM_THREADS=2 would use two threads.

    • Use the environment variable OMP_PLACES to pin your threads to cores. For example, OMP_PLACES={0} only uses the first core, whereas OMP_PLACES={0}:4:2 creates four places covering every other core, i.e., cores 0, 2, 4 and 6.

  5. Write new kernels which replace the fused multiply-add instructions with floating-point multiplies: FMUL (vector), FMUL (scalar); the plain-multiply instruction forms are sketched below. What do you observe when comparing the results to the FMA kernels?
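
As a starting point (a sketch only, with register numbers chosen to match the listings above), the multiply-only counterparts have the same operand structure as the FMAs but no accumulator:

    fmul s0, s30, s31             // scalar single precision: s0 = s30 * s31           (1 FLOP)
    fmul d0, d30, d31             // scalar double precision: d0 = d30 * d31           (1 FLOP)
    fmul v0.4s, v30.4s, v31.4s    // SIMD single precision:   v0[i] = v30[i] * v31[i]  (4 FLOPs)
    fmul v0.2d, v30.2d, v31.2d    // SIMD double precision:   v0[i] = v30[i] * v31[i]  (2 FLOPs)

Keep in mind that a multiply counts as only one floating point operation per lane, i.e., half of an FMA, so the FLOP count returned by these kernels has to be adjusted accordingly.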