8. Optimization

This part of our project uses Draco, a dedicated compute cluster, to run our solver. Before using the Draco cluster, read the following rules:

Important

  • The cluster is divided into login and compute nodes. Do not use the login nodes to run your application! Only use the login nodes for the configuration of your runs, e.g., compiling your solver. To use the compute nodes, you have to submit batch jobs or use interactive jobs.

  • You may submit multiple jobs!

8.1. Draco

Draco has several partitions. Each partition has different hardware and/or a different configuration. Your URZ username and password allow you to ssh to one of the login nodes listed on the cluster’s homepage.

We have to allocate compute resources to run our code. The command salloc --partition=short requests an interactive allocation on a compute node. In contrast to some other clusters, Draco automatically forwards your shell to the allocated compute node once the resources are granted, i.e., the ssh to the node happens in the background. The command hostname helps to confirm that your shell is now running on a compute node and not on the login node. From there, you can start your solver directly, e.g., via ./<solver> <parameters> or srun ./<solver> <parameters>. Once done with interactive testing, we release the allocation by typing exit on the compute node (which returns us to the login node) or through scancel.

Interactive shells are a fast way to test software. However, they are not meant for production simulations and might have a short time limit. For long-running simulations, we’ll submit batch jobs instead. A batch job consists of a script that describes the configuration of the job, e.g., the number of tasks or the required time, and the commands that are used to start our application. Since we do not use multiple nodes, we’ll allocate a single node by setting the ntasks parameter to 1 and the cpus-per-task parameter to 96. The maximum runtime should be set large enough so that our application can finish in time. Note that jobs with a shorter maximum runtime may be scheduled faster. The command sbatch <script> allows us to submit our job scripts. Don’t forget to specify a location for the stdout and stderr files.
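The settings described above can be collected in a small generator for job scripts, which may be handy when several scenarios are submitted as separate jobs. The following is a minimal sketch in Python, not a prescribed workflow. The function name render_job_script, the job name, the partition, the time limit, and the output file names are placeholders chosen for illustration; only ntasks=1 and cpus-per-task=96 are taken from the text above.

```python
def render_job_script(command, job_name="tsunami", partition="short",
                      time_limit="01:00:00", cpus_per_task=96):
    """Return the text of a Slurm batch script for a single-node run.

    All default values except cpus_per_task are illustrative assumptions;
    check the cluster's homepage for the actual partitions and time limits.
    """
    lines = [
        "#!/usr/bin/env bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        "#SBATCH --ntasks=1",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
        f"#SBATCH --time={time_limit}",
        # %j is replaced by the job ID, so runs do not overwrite each other
        f"#SBATCH --output={job_name}_%j.out",
        f"#SBATCH --error={job_name}_%j.err",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # placeholder command line; replace it with your own solver invocation
    print(render_job_script("./<solver> <parameters>"))
```

Writing the returned text to a file, e.g., job.sh, and calling sbatch job.sh would then submit the job.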

Tasks

  1. Upload your code and Ch. 6’s input data to the cluster. Compile your code on the cluster.

  2. Run different scenarios using interactive and batch jobs. Verify that you can reproduce the results of Ch. 6 when running on the cluster.

  3. Is the cluster faster than your own computer? Add a timer to your solver which allows you to measure the duration of the time stepping loop. Derive the normalized metric time per cell and iteration, i.e., the duration of the time stepping loop divided by the number of cells and by the number of time steps, for your local machine and the Draco cluster. Exclude file I/O overheads and the setup time from the measurements.
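The normalization in Task 3 can be sketched as a small helper that post-processes the measured loop duration. The function name and the numbers in the usage example are hypothetical placeholders, not measurements; the timer itself still has to be added to the solver.

```python
def time_per_cell_and_iteration(loop_seconds, n_cells, n_iterations):
    """Normalize the duration of the time stepping loop.

    loop_seconds -- wall-clock time of the time stepping loop only
                    (no setup, no file I/O)
    n_cells      -- number of cells in the computational domain
    n_iterations -- number of time steps performed in the loop
    """
    if n_cells <= 0 or n_iterations <= 0:
        raise ValueError("n_cells and n_iterations must be positive")
    return loop_seconds / (n_cells * n_iterations)


if __name__ == "__main__":
    # made-up placeholder values, not measurements
    metric = time_per_cell_and_iteration(12.0, 1000 * 1000, 400)
    print(f"{metric * 1e9:.1f} ns per cell and iteration")
```

Because the metric is independent of the grid size and the number of time steps, it allows a comparison of runs with different settings on both machines.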

Hint

  • The command line tools scp and sftp allow us to copy data to and from the cluster. Instead of copying code using scp/sftp, we can also clone git repositories directly.

  • You may mount a remote file system through sshfs.

  • The tool wget allows us to download data from the world wide web. You can obtain Ch. 6’s input data via: wget https://cloud.uni-jena.de/s/CqrDBqiMyKComPc/download/data_in.tar.xz -O tsunami_lab_data_in.tar.xz.

  • Available modules (installed software) may be listed with module avail. You can load a module via module load <some_module>.

  • SCons is not available by default. To get SCons, load a recent Python module through module load tools/python/3.8 and use pip to install SCons: pip install --user scons.

8.2. Compilers

Different compilers optimize code differently. This section compares GNU’s C++ compiler (g++) and the LLVM project’s Clang compiler (clang++). Recent versions of both compilers are available through the cluster’s module system.

Tasks

  1. Add support for generic compilers to your build script. Use the environment variable CXX for this purpose. For example, a build with the Clang compiler would then be invoked via CXX=clang++ scons.

    Hint: The surrounding environment can be accessed as os.environ in SCons. We can forward this environment to our local SCons environment by setting the value stored under the key ENV. To use a custom C++ compiler, we additionally have to overwrite the value stored under the key CXX in our local SCons environment.

  2. Recompile your code using recent versions of the GNU and Clang compilers. Which compiler generates faster code? Remember not to run the solver on the login nodes!

  3. Compile your code using both compilers and try different optimization switches, e.g., -O2, -Ofast. Compare the runtime of the resulting software! Research potential implications of the optimization flags on the numerical accuracy.

  4. Compilers have additional options to generate optimization reports. For example, use Clang’s optimization-remark options such as -Rpass=.*, -Rpass-missed=.*, and -Rpass-analysis=.*. Make yourself familiar with optimization reports and add an option for them in your build script. Analyze the time-consuming parts of your code! Is the compiler able to vectorize your code? Is your f-wave solver inlined?
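The hint of Task 1 can be sketched as a helper that prepares the two keys before the SCons environment is created. This is only one possible approach: the function name scons_settings is hypothetical, and the block runs as plain Python so that it can be tried without SCons.

```python
import os


def scons_settings(environ=None):
    """Build keyword arguments for SCons' Environment() call.

    ENV forwards the surrounding shell environment (e.g., PATH as set by
    loaded modules) to the commands SCons runs. CXX selects the C++
    compiler if the user has set it; otherwise SCons keeps its default.
    """
    if environ is None:
        environ = os.environ
    settings = {"ENV": dict(environ)}
    if "CXX" in environ:
        settings["CXX"] = environ["CXX"]
    return settings


if __name__ == "__main__":
    # Outside of SCons we only inspect the result. Inside an SConstruct,
    # where SCons provides Environment, one would write instead:
    #   env = Environment(**scons_settings())
    print(scons_settings({"PATH": "/usr/bin", "CXX": "clang++"}))
```

With such a helper in the build script, CXX=clang++ scons would pick up the Clang compiler, while a plain scons call would fall back to the default compiler chosen by SCons.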

8.3. Instrumentation and Performance Counters

Compiler reports may give first hints to guide the optimization of your code. More accurate reports can be obtained through manual or automatic instrumentation. To gain additional insight, instrumentation is often combined with data obtained from the performance counters of the underlying hardware.

We’ll use Intel VTune Profiler to get a better understanding of our code. VTune is available on the Draco cluster as part of Intel’s oneAPI tools. VTune can be made available by adding its binary directory to your PATH variable, i.e.: export PATH=/cluster/intel/oneapi/2025.0.0/vtune/2025.0/bin64:$PATH. VTune ships with a GUI, invocable via vtune-gui, and a command line interface, accessible through vtune.

Tasks

  1. Log in to the cluster with X forwarding enabled and start the VTune GUI. Create a new project for your application. Add an analysis to the project but do NOT start the analysis on the login node. Instead, allocate a node and run the analysis on this node.

    Note: Not all analysis types are usable since the required kernel drivers are not installed on the compute nodes. However, Hotspots is a good starting point.

  2. Now, run the same analysis through the command line tool in a batch job! You can copy & paste the required command from the GUI. Remember to add the VTune binary directory to your PATH variable in the job script first.

  3. Once your batch job has finished, use the GUI to visualize the results. Add debug symbols to the code, i.e., use -g. Consider disabling function inlining to get finer-grained results: -fno-inline.

  4. Which parts are compute-intensive? Did you expect this?

  5. Think about how you could improve the performance of your code, especially the compute-intensive parts. Consider features of C++, e.g., inlining or templates, as well as algorithmic features, e.g., the reuse of calculated values, avoidance of unnecessary square roots or divisions!