8. Optimization

In this part of the project, we run our solver on ARA, a dedicated compute cluster. Before using the ARA cluster, consider the following rules:

Important

The cluster is divided into login and compute nodes. Do not use the login nodes to run your application! Only use the login nodes for the configuration of your runs, e.g., compiling your solver. To use the compute nodes, you have to submit batch jobs or use interactive jobs.
You may submit multiple jobs!
The space available in your home directory is limited and shared with all other users. Do not store large input and output files in your home directory! A parallel file system is mounted under /beegfs. Use your user-local directory /beegfs/<user> for large input and output files.

8.1. ARA

ARA has several partitions. Each partition has different hardware and/or different configurations. Use the s_hadoop partition for your future work. Your URZ user and password allow you to ssh to one of the login nodes given on the cluster’s homepage.

We have to allocate compute resources to run our code. The command salloc --partition=s_hadoop gets us an interactive shell on a compute node. Once a node has been assigned to run our applications interactively, we can either ssh to the node or start our solver via the command srun ./<solver> <parameters>. The command hostname helps to identify the node on which our shell is running. Once done with interactive testing, we cancel the job allocation by typing exit on the login node or by using scancel.

Interactive shells are a fast way to test software. However, they are not meant for production simulations and might have a short time limit. We’ll submit batch jobs to start simulations running for a long time. A batch job consists of a script that describes the configuration of the job, e.g., the number of tasks or the required time, and the commands that are used to start our application. Since we do not use multiple nodes, we’ll allocate a single node by setting the ntasks parameter to 1 and the cpus-per-task parameter to 72. The maximum runtime should be set large enough so that our application can finish in time. Note that jobs with a shorter maximum runtime may be scheduled faster. The command sbatch <script> allows us to submit our job scripts. Don’t forget to specify a location for the stdout and stderr files.

Tasks

Upload your code and Ch. 6’s input data to the cluster. Compile your code on the cluster.
Run different scenarios using interactive and batch jobs. Verify that you can reproduce the results of Ch. 6 when running on the cluster.
Is the cluster faster than your own computer? Add a timer to your solver which allows you to measure the duration of the time stepping loop. Derive the normalized metric time per cell and iteration for your local machine and the ARA cluster. Exclude file I/O overheads and the setup time from the measurements.
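The timing task above could be approached along the following lines. This is only a minimal sketch: the names timeLoopSeconds and timePerCellAndIteration are hypothetical and not part of the project code, and the callable passed in stands for whatever your solver does in a single time step. Setup and file I/O are assumed to happen outside of the timed call.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: calls `timeStep` exactly `nSteps` times and returns the
// elapsed wall-clock time in seconds. Setup and file I/O should be done
// outside of this call so that they are excluded from the measurement.
template <typename Step>
double timeLoopSeconds(std::size_t nSteps, Step timeStep) {
  auto const start = std::chrono::steady_clock::now();
  for (std::size_t step = 0; step < nSteps; ++step) {
    timeStep();
  }
  auto const end = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(end - start).count();
}

// Normalized metric: seconds per cell and iteration.
inline double timePerCellAndIteration(double seconds,
                                      std::size_t nCells,
                                      std::size_t nSteps) {
  assert(nCells > 0 && nSteps > 0);
  return seconds / (static_cast<double>(nCells) * static_cast<double>(nSteps));
}
```

std::chrono::steady_clock is used here because it is monotonic, i.e., it is not affected by adjustments of the system time. Dividing by the number of cells and iterations makes runs with different resolutions and simulation lengths comparable between your local machine and the cluster.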

Hint

The command line tools scp or sftp allow us to copy data to and from the cluster. Instead of copying code using scp/sftp we can also clone git repositories directly.
You may mount a remote file system through sshfs.
The tool wget allows us to download data from the world wide web. You can obtain Ch. 6’s input data via: wget https://cloud.uni-jena.de/s/CqrDBqiMyKComPc/download/data_in.tar.xz -O tsunami_lab_data_in.tar.xz.
Available modules (installed software) may be listed with module avail. You can load a module via module load <some_module>.
SCons is not available by default. To get SCons, load a recent Python module through module load tools/python/3.8 and use pip to install SCons: pip install --user scons.

8.2. Compilers

Different compilers optimize code differently. This chapter compares the GNU C++ compiler (g++) and the Intel C++ compiler (icpc). Recent versions of both compilers are available through the cluster’s module system.

Tasks

Add support for generic compilers to your build script. Use the environment variable CXX for this purpose. For example, a build with the Intel compiler would then be invoked via CXX=icpc scons.

Hint: The surrounding environment can be accessed as os.environ in SCons. We can forward this environment to our local SCons environment by setting the value of the key ENV. To use a custom C++ compiler, we additionally have to overwrite the value of the key CXX in our local SCons environment.
Recompile your code using recent versions of the GNU and Intel compilers. Which compiler generates faster code? Remember not to run the solver on the login nodes!
Compile your code using both compilers and try different optimization switches, e.g., -O2, -fast. Compare the runtime of the resulting software! Research potential implications of the optimization flags on the numerical accuracy.
Compilers have additional options to generate optimization reports. For example, use the Intel compiler’s option -qopt-report. Make yourself familiar with optimization reports and add an option for them in your build script. Analyze the time-consuming parts of your code! Is the compiler able to vectorize your code? Is your f-wave solver inlined?
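One reason why optimization flags can affect numerical accuracy is that floating-point addition is not associative. Flags that relax floating-point semantics, for example GCC's -ffast-math (which is implied by -Ofast), allow the compiler to reorder such operations; the Intel compiler has comparable settings under -fp-model. The following sketch only illustrates the underlying effect and does not reproduce a specific compiler transformation. The function names are hypothetical.

```cpp
#include <cassert>

// Sums three values left to right: (a + b) + c.
inline double sumLeftToRight(double a, double b, double c) {
  // volatile forces the intermediate result to be stored as a double and
  // keeps the compiler from reordering the two additions.
  volatile double partial = a + b;
  return partial + c;
}

// Sums the same three values right to left: a + (b + c).
inline double sumRightToLeft(double a, double b, double c) {
  volatile double partial = b + c;
  return a + partial;
}

// Demonstrates the effect with values of very different magnitude.
inline void demoReassociation() {
  double const big = 1.0e20;
  // (big - big) + 1: the large values cancel first, the 1 survives.
  assert(sumLeftToRight(big, -big, 1.0) == 1.0);
  // big + (-big + 1): the 1 is absorbed by rounding before the cancellation.
  assert(sumRightToLeft(big, -big, 1.0) == 0.0);
}
```

Both functions are mathematically equivalent, yet they return different values. When comparing builds with different optimization switches, it is therefore worth checking the solver's output against a reference solution and not only its runtime.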

8.3. Instrumentation and Performance Counters

Compiler reports may give first hints to guide the optimization of your code. More accurate reports can be obtained through manual or automatic instrumentation. To gain additional insight, these instrumentations are often combined with data obtained from performance counters of the underlying hardware.

We’ll use Intel VTune Profiler to get a better understanding of our code. VTune is available on the ARA cluster as part of the Intel tools. You may load a recent version through the module system, e.g., the 2020 Update 2 version via module load compiler/intel/2020-Update2. VTune ships with a GUI, invocable via vtune-gui, and a command line interface, accessible through vtune.

Tasks

Log in to the cluster with X forwarding enabled and start the VTune GUI. Create a new project for your application. Add an analysis to the project but do not start the analysis on the login node. Instead, allocate a node and run the analysis on this node.

Note: Not all analysis types are usable since respective kernel drivers are not installed on the compute nodes. However, Hotspots and Threading are a good starting point.
Now, run the same analysis through the command line tool in a batch job! You can copy & paste the required command from the GUI. Remember to load the Intel tools in the job script first.
Once your batch job has finished, use the GUI to visualize the results. Add debug symbols to the code, i.e., use -g. Consider disabling function inlining to get finer-grained results: -fno-inline.
Which parts are compute-intensive? Did you expect this?
Think about how you could improve the performance of your code, especially the compute-intensive parts. Consider features of C++, e.g., inlining or templates, as well as algorithmic features, e.g., the reuse of calculated values, avoidance of unnecessary square roots or divisions!
(optional) Instrument your code manually using Score-P. Use Cube for the visualization of your measurements. Use Score-P’s PAPI interface to access hardware counters.
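The reuse of calculated values and the avoidance of unnecessary square roots and divisions can be illustrated with a small sketch. It assumes that your solver computes Roe wave speeds for the shallow water equations from the water heights and momenta of the left and right cell; if your solver is structured differently, the idea still carries over. All names (kGravity, WaveSpeeds, waveSpeedsNaive, waveSpeedsReuse) are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Assumed gravity constant in m/s^2 (hypothetical name).
constexpr double kGravity = 9.80665;

struct WaveSpeeds {
  double left;
  double right;
};

// Straightforward version: six square roots and three divisions as written.
inline WaveSpeeds waveSpeedsNaive(double hL, double hR, double huL, double huR) {
  assert(hL > 0.0 && hR > 0.0);
  double const uRoe = (huL / hL * std::sqrt(hL) + huR / hR * std::sqrt(hR)) /
                      (std::sqrt(hL) + std::sqrt(hR));
  double const hRoe = 0.5 * (hL + hR);
  return {uRoe - std::sqrt(kGravity * hRoe), uRoe + std::sqrt(kGravity * hRoe)};
}

// Same result up to rounding, but with reused intermediate values:
// three square roots and a single division.
inline WaveSpeeds waveSpeedsReuse(double hL, double hR, double huL, double huR) {
  assert(hL > 0.0 && hR > 0.0);
  double const sqrtHL = std::sqrt(hL);
  double const sqrtHR = std::sqrt(hR);
  // Uses hu / h * sqrt(h) == hu / sqrt(h) and a common denominator.
  double const uRoe = (huL * sqrtHR + huR * sqrtHL) /
                      (sqrtHL * sqrtHR * (sqrtHL + sqrtHR));
  double const cRoe = std::sqrt(kGravity * 0.5 * (hL + hR));
  return {uRoe - cRoe, uRoe + cRoe};
}
```

A compiler may already eliminate some of the repeated square roots on its own, so it is worth confirming the benefit of such a rewrite with the profiler and the timer instead of assuming it. Since the rewritten expression rounds differently, the results should also be compared against the original version within a small tolerance.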