13. OpenMP

Up to this point we have covered parallelism within a single core and have seen that a core's SIMD pipelines offer significant parallelism. However, modern high-performance processors and accelerators may be built of dozens, hundreds, or even thousands of cores. For example, the A64FX processor has 48 compute cores, a V100 GPU features 5,120 CUDA cores, and an Instinct MI100 GPU has 7,680 stream processors.

In this section, we’ll study the OpenMP API, which allows us to efficiently exploit parallelism in the shared memory domain. While earlier versions of OpenMP were limited to CPU-based parallelism, recent versions also support offloading computations to accelerators. We’ll start simple by parallelizing our standard triad example in Section 13.1. Next, in Section 13.2, we’ll discuss thread safety by using the Monte Carlo method for the computation of \(\pi\). Then, in Section 13.3, we are ready for task-based parallelism, which allows us to parallelize the application of a function to data stored in a binary tree. Lastly, OpenMP’s offloading support will allow us to target GPUs in Section 13.4.

13.1. Triad

In this part we’ll parallelize the triad of Section 11 using OpenMP. While the actual parallelization is very simple, we’ll also study some environment variables that allow us to change the runtime behavior of our OpenMP parallelization.
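A minimal sketch of such a parallelization is shown below. The STREAM-style update a[i] = b[i] + s * c[i] and all names are illustrative assumptions; Section 11 defines the exact kernel.

#include <cstdint>

// OpenMP-parallel triad sketch: a[i] = b[i] + s * c[i].
// The loop's iterations are distributed among the threads of the team.
void triad( uint64_t       i_n,
            double         i_s,
            double const * i_b,
            double const * i_c,
            double       * o_a ) {
#pragma omp parallel for
  for( uint64_t l_i = 0; l_i < i_n; l_i++ ) {
    o_a[l_i] = i_b[l_i] + i_s * i_c[l_i];
  }
}

The runtime behavior of such a kernel can be steered without recompiling, e.g., by running OMP_NUM_THREADS=4 OMP_PLACES=cores ./triad.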

Tasks

  1. Implement a version of the triad which uses OpenMP for shared memory parallelization.

  2. Benchmark your OpenMP-parallel implementation of the triad for different array sizes. Report results for at least \(1024\), \(1024^2\), \(512 \cdot 1024^2\) and \(2 \cdot 1024^3\) values per array. Explain the results!

  3. Illustrate the effects of the environment variables OMP_NUM_THREADS, OMP_PLACES, and OMP_DISPLAY_ENV (GNU) or KMP_AFFINITY (Intel).

13.2. Monte Carlo Method

In this section we’ll use the Monte Carlo method to estimate \(\pi\). For this we randomly generate points inside the unit square \([0,1)^2\). Now, a point \((x,y)\) lies within the unit circle if \(x^2 + y^2 < 1\).

../_images/monte_carlo.svg

Fig. 13.2.1 Illustration of the Monte Carlo method for the computation of \(\pi\). Shown is the unit square and the respective quadrant of the unit circle. Fifteen randomly sampled points (crosses) lie within the circle. Six samples (circles) lie outside of the unit circle.

This situation is illustrated in Fig. 13.2.1. If we randomly generate points inside the unit square, we can simply count how many of them lie inside the unit circle. Since the quarter of the unit circle has area \(\pi/4\) and the unit square has area 1, the fraction of points inside the circle approximates \(\pi/4\). If \(N\) is the total number of generated points and \(C\) the number of those inside the circle, we obtain:

\[\pi \approx \frac{4C}{N}\]

In Fig. 13.2.1’s example we have \(N=21\) and \(C=15\), and would thus obtain \(\pi \approx 2.86\). This is still a rather poor approximation, but we’ll boost the accuracy simply by generating many more random samples.
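For illustration, a sequential version of the estimator might be sketched as follows; the function name and the fixed seed are illustrative choices:

#include <cstdint>
#include <random>

// Sequential Monte Carlo sketch for the estimation of pi.
double estimatePi( uint64_t i_nSamples ) {
  std::mt19937_64 l_gen( 1337 );
  std::uniform_real_distribution< double > l_dist( 0.0, 1.0 );

  uint64_t l_c = 0;
  for( uint64_t l_sa = 0; l_sa < i_nSamples; l_sa++ ) {
    double l_x = l_dist( l_gen );
    double l_y = l_dist( l_gen );
    // count the sample if it lies within the unit circle
    if( l_x * l_x + l_y * l_y < 1.0 ) l_c++;
  }

  return ( 4.0 * l_c ) / i_nSamples;
}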

Tasks

  1. Use the std::mt19937_64 random number generator and std::uniform_real_distribution to generate a sequence of random numbers in \([0,1)\).

  2. Parallelize the random number generation using OpenMP. Be careful to have independent instances of the random number generator on every thread.

  3. Implement the described Monte Carlo method for the computation of \(\pi\) sequentially. Compute the error and required time-to-solution for 50,000,000 random samples.

  4. Parallelize your implementation using OpenMP. Compare an implementation relying on a critical section for accessing the counter \(C\) to a version using a reduction; a sketch of the reduction variant is given after this list.
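The following sketches the reduction-based variant of the last task. Seeding each thread’s generator with its thread number is an illustrative choice and not a statistically rigorous seeding strategy:

#include <cstdint>
#include <random>
#include <omp.h>

// OpenMP-parallel Monte Carlo sketch: one generator per thread,
// reduction on the counter.
double estimatePiParallel( uint64_t i_nSamples ) {
  uint64_t l_c = 0;

#pragma omp parallel reduction(+:l_c)
  {
    // independent generator per thread
    std::mt19937_64 l_gen( 1337 + omp_get_thread_num() );
    std::uniform_real_distribution< double > l_dist( 0.0, 1.0 );

#pragma omp for
    for( uint64_t l_sa = 0; l_sa < i_nSamples; l_sa++ ) {
      double l_x = l_dist( l_gen );
      double l_y = l_dist( l_gen );
      if( l_x * l_x + l_y * l_y < 1.0 ) l_c++;
    }
  }

  return ( 4.0 * l_c ) / i_nSamples;
}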

13.3. Tasking

../_images/tree.svg

Fig. 13.3.1 Illustration of a binary tree with four levels and ten nodes. Node 0 has two children: node 1 is the left child and node 2 the right child. Node 1 has only node 3 as left child. Nodes 6 and 7 have node 3 as parent and no children. Node 2 has node 4 as left child and node 5 as right child. Node 4’s children are nodes 8 and 9. Nodes 5, 8, and 9 have no children.

In this section we’ll parallelize the application of a function to a binary tree through OpenMP. We’ll implement the nodes of our binary tree through a minimal representation: every node stores only a data item and pointers to its left and right child. If a child doesn’t exist, we simply store a nullptr.

Using double as an illustrative data type for our nodes’ data items, we may (partially) implement such a node through the C++ class Node:

class Node {
  private:
    //! data item of the node
    double m_data = 0.0;
    //! pointer to left child if any
    Node * m_left = nullptr;
    //! pointer to right child if any
    Node * m_right = nullptr;

  public:
    /**
     * Destructor of the node.
     * Frees all allocated memory.
     **/
    ~Node() {
      if( m_left  != nullptr ) delete m_left;
      if( m_right != nullptr ) delete m_right;
    }

    // TODO: finish implementation
};

We’ll use the class Tree as our entry point. Tree simply holds the root node and should have a function initRandom which randomly initializes the tree. Further, the method applyFunction should apply a function to the tree’s data:

class Tree {
  private:
    //! root node of the tree
    Node * m_root = nullptr;

  public:
    /**
     * Destructor which frees all allocated memory.
     **/
    ~Tree() {
      if( m_root != nullptr ) delete m_root;
    }

    /**
     * Randomly initializes the tree.
     *
     * @param i_nLevels number of levels in the tree.
     * @param i_probLeft probability that a node has a left child.
     * @param i_probRight probability that a node has a right child.
     * @param i_generator random number generator.
     **/
    void initRandom( uint64_t       i_nLevels,
                     double         i_probLeft,
                     double         i_probRight,
                     std::mt19937 & i_generator );

    /**
     * Applies a function to all data items of the tree's nodes.
     *
     * @param i_func function which is applied.
     *               has the current level, parent node's data and node's data as args.
     **/
    void applyFunction( double(* i_func)( uint64_t i_lvl,
                                          double   i_dataParent,
                                          double   i_dataNode ) );

    // TODO: finish implementation
};
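As a hedged starting point for the initialization, a recursive helper on the class Node might look as follows. The helper itself, its delegation from Tree::initRandom to a newly allocated root, and the use of std::bernoulli_distribution are assumptions, not part of the given interface:

// Sketch of a recursive random initialization (requires <random>).
void Node::initRandom( uint64_t       i_nLevels,
                       double         i_probLeft,
                       double         i_probRight,
                       std::mt19937 & i_generator ) {
  // randomly initialize the node's data item (illustrative choice)
  m_data = std::uniform_real_distribution< double >( 0.0, 1.0 )( i_generator );

  // this node occupies one level; stop if no further levels remain
  if( i_nLevels <= 1 ) return;

  std::bernoulli_distribution l_left(  i_probLeft  );
  std::bernoulli_distribution l_right( i_probRight );

  if( l_left( i_generator ) ) {
    m_left = new Node();
    m_left->initRandom( i_nLevels-1, i_probLeft, i_probRight, i_generator );
  }
  if( l_right( i_generator ) ) {
    m_right = new Node();
    m_right->initRandom( i_nLevels-1, i_probLeft, i_probRight, i_generator );
  }
}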

Tasks

  1. Finish the implementation of the binary tree. Add functions to the classes Node and Tree as needed.

  2. Parallelize the applyFunction method of the class Tree through OpenMP’s tasking. Note that you’ll also have to insert respective OpenMP tasking pragmas into the class Node; a sketch is given after this list.

  3. Benchmark your OpenMP parallelization. Include tests with 32 levels and left and right child probabilities of 0.85 each when calling initRandom.
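A minimal tasking sketch for applyFunction is given below. The recursive helper on the class Node, its signature, and passing level 0 and parent data 0.0 for the root are illustrative assumptions:

void Tree::applyFunction( double(* i_func)( uint64_t i_lvl,
                                            double   i_dataParent,
                                            double   i_dataNode ) ) {
  // spawn a team of threads; a single thread creates the root task
#pragma omp parallel
#pragma omp single
  {
    if( m_root != nullptr ) m_root->applyFunction( 0, 0.0, i_func );
  }
}

// assumed recursive helper on the class Node
void Node::applyFunction( uint64_t i_lvl,
                          double   i_dataParent,
                          double(* i_func)( uint64_t,
                                            double,
                                            double ) ) {
  m_data = i_func( i_lvl, i_dataParent, m_data );

  // the children are independent and may be processed as separate tasks
  if( m_left != nullptr ) {
#pragma omp task
    m_left->applyFunction( i_lvl+1, m_data, i_func );
  }
  if( m_right != nullptr ) {
#pragma omp task
    m_right->applyFunction( i_lvl+1, m_data, i_func );
  }
}

For deep trees it may pay off to stop creating tasks below a cutoff level, e.g., through the task construct’s if or final clauses.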

13.4. Offloading

../_images/offloading_2.svg

Fig. 13.4.1 Illustration of the behavior of the OpenMP construct #pragma omp target teams followed by #pragma omp parallel. In the illustrated example a league of three teams is created where each team consists of four threads.

This section studies offloading. The term “offloading” means that we move our workloads, or at least parts of them, to accelerators, e.g., GPUs. Oftentimes, software using accelerators goes through specific languages or language extensions; examples are CUDA and HIP. Recent versions of the OpenMP API support offloading workloads to accelerators directly. Compared to platform-specific options, going through OpenMP has the potential of providing better performance portability.
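The following sketch makes the teams-and-threads structure visible by printing team and thread numbers. The requested sizes mirror Fig. 13.4.1 but are otherwise arbitrary; the runtime may grant fewer teams or threads than requested, and printf support inside target regions depends on the toolchain:

#include <cstdio>
#include <omp.h>

int main() {
  // request a league of 3 teams with at most 4 threads each (illustrative sizes)
#pragma omp target teams num_teams(3) thread_limit(4)
#pragma omp parallel
  printf( "team %d of %d, thread %d of %d\n",
          omp_get_team_num(),   omp_get_num_teams(),
          omp_get_thread_num(), omp_get_num_threads() );

  return 0;
}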

Tasks

  1. Illustrate the behavior of offloading-related OpenMP runtime library routines. Include at least omp_get_num_devices(), omp_is_initial_device(), omp_get_device_num(), omp_get_initial_device(), omp_get_default_device() and omp_get_num_teams() in your considerations. Do this on the MI100- and V100-accelerated nodes of the FTP.

  2. Illustrate the OpenMP constructs #pragma omp target teams and #pragma omp parallel in the context of offloading (see also Fig. 13.4.1). Manually adjust the number of teams and threads in your experiments.

  3. Offload the triad of Section 11 and highlight the impact of the respective OpenMP constructs and routines along the way. Try different approaches for the host-device data transfers. Try different approaches for the parallelization of the workload. One possible starting point is sketched after this list.
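In the sketch below, a combined construct offloads the triad and map clauses handle the data transfers. As before, the STREAM-style update and all names are illustrative assumptions:

#include <cstdint>

// Offloaded triad sketch: b and c are copied to the device, a is copied back.
void triadOffload( uint64_t       i_n,
                   double         i_s,
                   double const * i_b,
                   double const * i_c,
                   double       * o_a ) {
#pragma omp target teams distribute parallel for \
  map(to: i_b[0:i_n], i_c[0:i_n]) map(from: o_a[0:i_n])
  for( uint64_t l_i = 0; l_i < i_n; l_i++ ) {
    o_a[l_i] = i_b[l_i] + i_s * i_c[l_i];
  }
}

Variants worth comparing include a surrounding target data region, which keeps the buffers on the device across repeated kernel launches, and manual transfers through omp_target_alloc and omp_target_memcpy.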