.. _ch:openmp:

OpenMP
======

Until this point we covered parallelism within a single core and have seen that the SIMD pipelines of a core offer significant parallelism.
However, modern high-performance processors and accelerators may be built of dozens, hundreds, or even thousands of cores.
For example, the A64FX processor has 48 compute cores.
A `V100 <https://www.nvidia.com/en-us/data-center/v100/>`_ GPU features 5,120 CUDA cores, whereas an Instinct MI100 GPU has 7,680 cores.
In this section, we'll study the `OpenMP API <https://www.openmp.org/>`__ which allows us to efficiently exploit parallelism in the shared memory domain.
While earlier versions of OpenMP were limited to CPU-based parallelism, newer versions support offloading of compute to accelerators.

We'll start simple by parallelizing our standard triad example in :numref:`ch:openmp_triad`.
Next, in :numref:`ch:openmp_monte_carlo`, we'll discuss thread safety by using the Monte Carlo method for the computation of :math:`\pi`.
Then, in :numref:`ch:openmp_tasking`, we are ready for task-based parallelism which allows us to parallelize the application of a function to data stored in a binary tree.
Lastly, OpenMP's offloading support will allow us to target GPUs in :numref:`ch:openmp_offloading`.

.. _ch:openmp_triad:

Triad
-----

In this part we'll parallelize the triad of :numref:`ch:compiler` using OpenMP.
While the actual parallelization is very simple, we'll learn about some environment variables which allow us to change the runtime behavior of our OpenMP parallelization.

.. admonition:: Tasks

   #. Implement a version of the triad which uses OpenMP for shared memory parallelization (a possible starting point is sketched after this list).

   #. Benchmark your OpenMP-parallel implementation of the triad for different numbers of values per array.
      Report results at least for the following sizes: :math:`1024`, :math:`1024^2`, :math:`512 \cdot 1024^2` and :math:`2 \cdot 1024^3`.
      Explain the results!

   #. Illustrate the effects of the environment variables ``OMP_NUM_THREADS``, ``OMP_PLACES``, and ``OMP_DISPLAY_ENV`` (GNU) or ``KMP_AFFINITY`` (Intel).
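A minimal sketch of such a parallelization is given below.
The function name, its signature, and the exact form of the triad (:math:`c_i = a_i + s \cdot b_i`) are illustrative assumptions; adapt them to your triad of :numref:`ch:compiler`.

.. code-block:: c++

   #include <cstddef>

   //! Sketch of an OpenMP-parallel triad: o_c[i] = i_a[i] + i_s * i_b[i].
   //! All iterations are independent, so the loop may simply be split
   //! among the threads of a single parallel region.
   void triad( std::size_t   i_nValues,
               float         i_s,
               float const * i_a,
               float const * i_b,
               float       * o_c ) {
     #pragma omp parallel for
     for( std::size_t l_va = 0; l_va < i_nValues; l_va++ ) {
       o_c[l_va] = i_a[l_va] + i_s * i_b[l_va];
     }
   }

When using the GNU compilers, OpenMP is enabled through the flag ``-fopenmp``.
The runtime behavior may then be steered through the environment, e.g., ``OMP_NUM_THREADS=4 OMP_PLACES=cores ./triad``.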
.. _ch:openmp_monte_carlo:

Monte Carlo Method
------------------

In this section we'll use the Monte Carlo method to estimate :math:`\pi`.
For this we randomly generate points inside the unit square :math:`[0,1)^2`.
Now, a point :math:`(x,y)` lies within the unit circle if :math:`x^2 + y^2 < 1`.

.. _fig:openmp_monte_carlo:

.. figure:: data_openmp/monte_carlo.svg
   :align: center
   :width: 30 %

   Illustration of the Monte Carlo method for the computation of :math:`\pi`.
   Shown is the unit square and the respective quadrant of the unit circle.
   Fifteen randomly sampled points (crosses) lie within the circle.
   Six samples (circles) lie outside of the unit circle.

This situation is illustrated in :numref:`fig:openmp_monte_carlo`.
If we randomly generate some points inside the unit square, we can simply count how many of those lie inside of the unit circle.
Since the quadrant of the unit circle covers the area :math:`\pi/4` of the unit square, the fraction of points inside the circle approximates :math:`\pi/4`.
Thus, if :math:`N` is the total number of generated points and :math:`C` the number of those inside of the circle, we obtain:

.. math::

   \pi \approx \frac{4C}{N}

In :numref:`fig:openmp_monte_carlo`'s example we have :math:`N=21` and :math:`C=15`, and thus would obtain :math:`\pi \approx 2.86`.
This is still a pretty bad approximation, but we'll boost accuracy by simply generating many more random samples.

.. admonition:: Tasks

   #. Use the `std::mt19937_64 <https://en.cppreference.com/w/cpp/numeric/random/mersenne_twister_engine>`__ random number generator and `std::uniform_real_distribution <https://en.cppreference.com/w/cpp/numeric/random/uniform_real_distribution>`_ to generate a sequence of random numbers in :math:`[0,1)`.

   #. Parallelize the random number generation using OpenMP.
      Be careful to have independent instances of the random number generator on every thread.

   #. Implement the described Monte Carlo method for the computation of :math:`\pi` sequentially.
      Compute the error and required time-to-solution for 50,000,000 random samples.

   #. Parallelize your implementation using OpenMP.
      Compare an implementation relying on critical sections for accessing the counter for :math:`C` to a version using a reduction (a sketch of the reduction-based variant follows this list).
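The sketch below illustrates the reduction-based variant together with independent per-thread instances of the random number generator.
The function name and the seeding scheme, i.e., offsetting a fixed base seed by the thread ID, are illustrative assumptions.

.. code-block:: c++

   #include <cstdint>
   #include <random>
   #include <omp.h>

   //! Sketch: estimates pi through the Monte Carlo method.
   //! Every thread draws samples from its own generator and counts hits
   //! in a private copy of l_c; the reduction combines the private
   //! copies at the end of the parallel region.
   double estimatePi( uint64_t i_nSamples ) {
     uint64_t l_c = 0;

     #pragma omp parallel reduction(+:l_c)
     {
       // independent generator instance on every thread
       std::mt19937_64 l_gen( 1337 + omp_get_thread_num() );
       std::uniform_real_distribution< double > l_dist( 0.0, 1.0 );

       #pragma omp for
       for( uint64_t l_sa = 0; l_sa < i_nSamples; l_sa++ ) {
         double l_x = l_dist( l_gen );
         double l_y = l_dist( l_gen );
         if( l_x * l_x + l_y * l_y < 1.0 ) l_c++;
       }
     }

     return ( 4.0 * l_c ) / i_nSamples;
   }

For the critical-section variant, one would instead share a single counter among the threads and guard its updates with ``#pragma omp critical``; contrasting the two directly exposes the cost of serializing the updates.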
.. _ch:openmp_tasking:

Tasking
-------

.. _fig:openmp_tree:

.. figure:: data_openmp/tree.svg
   :align: center
   :width: 50 %

   Illustration of a binary tree with four levels and ten nodes.
   Node 0 has two children: node 1 is the left child and node 2 the right child.
   Node 1 has only node 3 as left child.
   Nodes 6 and 7 have node 3 as parent and no children.
   Node 2 has node 4 as left child and node 5 as right child.
   Node 4's children are nodes 8 and 9.
   Nodes 5, 8, and 9 have no children.

This section parallelizes the application of a function to a binary tree through OpenMP.
We'll implement the nodes of our binary tree through a minimal representation.
This means that every node only stores a data item and pointers to its left and right child.
If a child doesn't exist, we simply store a ``nullptr``.
Using ``double`` as an illustrative data type for our nodes, we may (partially) implement such a node through the C++ class ``Node``:

.. code-block:: c++

   class Node {
     private:
       //! data item of the node
       double m_data = 0.0;

       //! pointer to left child if any
       Node * m_left = nullptr;

       //! pointer to right child if any
       Node * m_right = nullptr;

     public:
       /**
        * Destructor of the node.
        * Frees all allocated memory.
        **/
       ~Node() {
         if( m_left  != nullptr ) delete m_left;
         if( m_right != nullptr ) delete m_right;
       }

       // TODO: finish implementation
   };

We'll use the class ``Tree`` as our entry point.
``Tree`` simply holds the root node and should have a function ``initRandom`` which randomly initializes the tree.
Further, the method ``applyFunction`` should apply a function to the tree's data:

.. code-block:: c++

   class Tree {
     private:
       //! root node of the tree
       Node * m_root = nullptr;

     public:
       /**
        * Destructor which frees all allocated memory.
        **/
       ~Tree() {
         if( m_root != nullptr ) delete m_root;
       }

       /**
        * Randomly initializes the tree.
        *
        * @param i_nLevels number of levels in the tree.
        * @param i_probLeft probability that a node has a left child.
        * @param i_probRight probability that a node has a right child.
        * @param i_generator random number generator.
        **/
       void initRandom( uint64_t       i_nLevels,
                        double         i_probLeft,
                        double         i_probRight,
                        std::mt19937 & i_generator );

       /**
        * Applies a function to all data items of the tree's nodes.
        *
        * @param i_func function which is applied.
        *               has the current level, parent node's data and node's data as args.
        **/
       void applyFunction( double (* i_func)( uint64_t i_lvl,
                                              double   i_dataParent,
                                              double   i_dataNode ) );

       // TODO: finish implementation
   };

.. admonition:: Tasks

   #. Finish the implementation of the binary tree.
      Add functions to the classes ``Node`` and ``Tree`` as needed.

   #. Parallelize the ``applyFunction`` method of the class ``Tree`` through OpenMP's tasking.
      Note that you'll also have to insert respective OpenMP tasking pragmas into the class ``Node``.

   #. Benchmark your OpenMP parallelization.
      Include tests with 32 levels and a left and right child probability of 0.85 respectively when calling ``initRandom``.

.. _ch:openmp_offloading:

Offloading
----------

.. _fig:openmp_offloading:

.. figure:: data_openmp/offloading_2.svg
   :align: center
   :width: 70 %

   Illustration of the behavior of the OpenMP construct ``#pragma omp target teams`` followed by ``#pragma omp parallel``.
   In the illustrated example a league of three teams is created where each team consists of four threads.

This section studies offloading.
The term "offloading" means that we offload our workloads, or at least parts of them, to accelerators, e.g., GPUs.
Oftentimes, software using accelerators goes through specific languages or language extensions.
Examples are `CUDA <https://developer.nvidia.com/cuda-zone>`__ or HIP.
Recent versions of the OpenMP API support offloading workloads to accelerators directly.
When compared to platform-specific options, going through OpenMP has the potential of providing better performance portability.

.. admonition:: Tasks

   #. Illustrate the behavior of offloading-related OpenMP runtime library routines.
      Include at least ``omp_get_num_devices()``, ``omp_is_initial_device()``, ``omp_get_device_num()``, ``omp_get_initial_device()``, ``omp_get_default_device()`` and ``omp_get_num_teams()`` in your considerations.
      Do this on the MI100- and V100-accelerated nodes of the FTP.

   #. Illustrate the OpenMP constructs ``#pragma omp target teams`` and ``#pragma omp parallel`` in the context of offloading (see also :numref:`fig:openmp_offloading`).
      Manually adjust the number of teams and threads in your experiments.

   #. Offload the triad of :numref:`ch:compiler` and highlight the impact of respective OpenMP constructs and routines along the way.
      Try different approaches for the host-device data transfers.
      Try different approaches for the parallelization of the workload (a sketch of one possible variant follows this list).
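As a possible starting point for the last task, the sketch below offloads the triad through a single combined construct and realizes the host-device data transfers through explicit ``map`` clauses.
The function name and the triad's form are illustrative assumptions; alternatives worth exploring are, e.g., enclosing ``target data`` regions for the transfers, or separate ``target teams distribute`` and ``parallel for`` constructs for the parallelization.

.. code-block:: c++

   #include <cstddef>

   //! Sketch of an offloaded triad: o_c[i] = i_a[i] + i_s * i_b[i].
   //! The map clauses copy the two inputs to the device before the
   //! kernel runs and copy the result back to the host afterwards.
   //! The combined construct distributes the loop's iterations across
   //! the teams of the league and the threads within each team.
   void triadOffload( std::size_t   i_nValues,
                      float         i_s,
                      float const * i_a,
                      float const * i_b,
                      float       * o_c ) {
     #pragma omp target teams distribute parallel for \
       map(to: i_a[0:i_nValues], i_b[0:i_nValues])    \
       map(from: o_c[0:i_nValues])
     for( std::size_t l_va = 0; l_va < i_nValues; l_va++ ) {
       o_c[l_va] = i_a[l_va] + i_s * i_b[l_va];
     }
   }

Note that the required compiler flags for offloading are compiler- and target-specific, e.g., Clang expects ``-fopenmp-targets=<triple>`` in addition to ``-fopenmp``.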