13. OpenMP
Up to this point we have covered parallelism within a single core and seen that a core's SIMD pipelines offer significant parallelism. However, modern high-performance processors and accelerators may be built from dozens, hundreds, or even thousands of cores. For example, the A64FX processor has 48 compute cores, a V100 GPU features 5,120 CUDA cores, and an Instinct MI100 GPU has 7,680 cores.
In this section, we’ll study the OpenMP API which allows us to efficiently exploit parallelism in the shared memory domain. While earlier versions of OpenMP were limited to CPU-based parallelism, new versions support offloading of compute to accelerators. We’ll start simple by parallelizing our standard triad example in Section 13.1. Next, in Section 13.2 we’ll discuss thread safety by using the Monte Carlo method for the computation of \(\pi\). Then, in Section 13.3, we are ready for task-based parallelism which allows us to parallelize the application of a function to data stored in a binary tree. Lastly, OpenMP’s offloading support will allow us to target GPUs in Section 13.4.
13.1. Triad
In this part we’ll parallelize the triad of Section 11 using OpenMP. While the actual parallelization is very simple, we’ll study in detail some environment variables which allow us to change the runtime behavior of our OpenMP parallelization.
Tasks
Implement a version of the triad which uses OpenMP for shared memory parallelization (a possible starting point is sketched after this list).
Benchmark your OpenMP-parallel implementation of the triad as a function of the number of values per array. Report results for at least the following sizes: \(1024\), \(1024^2\), \(512 \cdot 1024^2\) and \(2 \cdot 1024^3\). Explain the results!
Illustrate the effects of the environment variables OMP_NUM_THREADS, OMP_PLACES, and OMP_DISPLAY_ENV (GNU) or KMP_AFFINITY (Intel).
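A possible starting point for the first task is the following minimal sketch; the function name triad and the i_/o_ parameter names are illustrative choices, not prescribed by the task:

#include <cstddef>

// Sketch of an OpenMP-parallel triad: a = b + scalar * c.
void triad( std::size_t i_nValues,
            double i_scalar,
            double const * i_b,
            double const * i_c,
            double * o_a ) {
  // distribute the loop iterations over the threads of the team
#pragma omp parallel for
  for( std::size_t l_va = 0; l_va < i_nValues; l_va++ ) {
    o_a[l_va] = i_b[l_va] + i_scalar * i_c[l_va];
  }
}

Compiling with OpenMP support enabled (e.g., -fopenmp for GNU) and varying OMP_NUM_THREADS and OMP_PLACES then directly changes how this loop is executed.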
13.2. Monte Carlo Method
In this section we’ll use the Monte Carlo method to estimate \(\pi\). For this we randomly generate points inside the unit square \([0,1)^2\). Now, a point \((x,y)\) lies within the unit circle if \(x^2 + y^2 < 1\).
This situation is illustrated in Fig. 13.2.1. If we randomly generate some points inside the unit square, we can simply count how many of those lie inside of the unit circle. Since the quarter disk given by \(x^2 + y^2 < 1\) has area \(\pi/4\) while the unit square has area \(1\), the ratio of hits to samples approximates \(\pi/4\). If \(N\) is the total number of generated points and \(C\) the number of those inside of the circle, we obtain:

\(\pi \approx \frac{4C}{N}\)
In Fig. 13.2.1’s example we have \(N=21\) and \(C=15\), and thus would obtain \(\pi \approx 2.86\). This is still a pretty bad approximation, but we’ll boost accuracy by simply generating many more random samples.
Tasks
Use the std::mt19937_64 random number generator and std::uniform_real_distribution to generate a sequence of random numbers in \([0,1)\).
Parallelize the random number generation using OpenMP. Be careful to have independent instances of the random number generator on every thread.
Implement the described Monte Carlo method for the computation of \(\pi\) sequentially. Compute the error and required time-to-solution for 50,000,000 random samples.
Parallelize your implementation using OpenMP. Compare an implementation relying on critical sections for accessing the counter \(C\) to a version using a reduction (a reduction-based sketch follows this list).
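For the reduction-based variant, one possible structure is sketched below. Seeding every thread's generator with its thread number is a simple way to get independent streams (an illustrative assumption; other seeding strategies work as well):

#include <omp.h>
#include <cstdint>
#include <random>

// Sketch: estimate pi via Monte Carlo with an OpenMP reduction.
double estimatePi( uint64_t i_nSamples ) {
  uint64_t l_c = 0;

#pragma omp parallel reduction(+:l_c)
  {
    // independent generator per thread, seeded by the thread number
    std::mt19937_64 l_gen( omp_get_thread_num() );
    std::uniform_real_distribution< double > l_dist( 0.0, 1.0 );

#pragma omp for
    for( uint64_t l_sa = 0; l_sa < i_nSamples; l_sa++ ) {
      double l_x = l_dist( l_gen );
      double l_y = l_dist( l_gen );
      // count hits inside the unit circle
      if( l_x*l_x + l_y*l_y < 1.0 ) l_c++;
    }
  }

  return 4.0 * l_c / i_nSamples;
}

A critical-section variant would instead increment a shared counter inside #pragma omp critical; comparing the two directly exposes the serialization cost of the critical section.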
13.3. Tasking
This section parallelizes the application of a function to a binary tree through OpenMP.
We’ll implement the nodes of our binary tree through a minimal representation: every node only stores a data item and pointers to its left child and right child. If a child doesn’t exist, we simply store a nullptr. Using double as an illustrative data type for our nodes, we may (partially) implement such a node through the C++ class Node:
class Node {
  private:
    //! data item of the node
    double m_data = 0.0;

    //! pointer to left child if any
    Node * m_left = nullptr;

    //! pointer to right child if any
    Node * m_right = nullptr;

  public:
    /**
     * Destructor of the node.
     * Frees all allocated memory.
     **/
    ~Node() {
      if( m_left  != nullptr ) delete m_left;
      if( m_right != nullptr ) delete m_right;
    }

    // TODO: finish implementation
};
We’ll use the class Tree as our entry point. Tree simply holds the root node and should have a function initRandom which randomly initializes the tree. Further, the method applyFunction should apply a function to the tree’s data:
class Tree {
  private:
    //! root node of the tree
    Node * m_root = nullptr;

  public:
    /**
     * Destructor which frees all allocated memory.
     **/
    ~Tree() {
      if( m_root != nullptr ) delete m_root;
    }

    /**
     * Randomly initializes the tree.
     *
     * @param i_nLevels number of levels in the tree.
     * @param i_probLeft probability that a node has a left child.
     * @param i_probRight probability that a node has a right child.
     * @param i_generator random number generator.
     **/
    void initRandom( uint64_t       i_nLevels,
                     double         i_probLeft,
                     double         i_probRight,
                     std::mt19937 & i_generator );

    /**
     * Applies a function to all data items of the tree's nodes.
     *
     * @param i_func function which is applied.
     *               has the current level, parent node's data and node's data as args.
     **/
    void applyFunction( double (* i_func)( uint64_t i_lvl,
                                           double   i_dataParent,
                                           double   i_dataNode ) );

    // TODO: finish implementation
};
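Putting the pieces together, a small driver could look as follows; the seed, the number of levels, the probabilities, and the applied function are illustrative choices:

#include <cstdint>
#include <random>

int main() {
  std::mt19937 l_gen( 12345 );

  Tree l_tree;
  l_tree.initRandom( 10,    // number of levels
                     0.85,  // probability of a left child
                     0.85,  // probability of a right child
                     l_gen );

  // a capture-less lambda converts to the required function pointer
  l_tree.applyFunction( []( uint64_t i_lvl,
                            double   i_dataParent,
                            double   i_dataNode ) {
    return 0.5 * (i_dataParent + i_dataNode) + i_lvl;
  } );

  return 0;
}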
Tasks
Finish the implementation of the binary tree. Add functions to the classes Node and Tree as needed.
Parallelize the applyFunction method of the class Tree through OpenMP’s tasking (a possible structure is sketched after this list). Note that you’ll also have to insert respective OpenMP tasking pragmas to the class Node.
Benchmark your OpenMP parallelization. Include tests with 32 levels and a left and right child probability of 0.85, respectively, when calling initRandom.
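One possible shape for the task-based traversal is sketched below. It assumes a recursive member function Node::applyFunction which is not part of the given skeleton; its name, signature, and the value 0.0 passed as the root's "parent data" are illustrative assumptions. A single thread enters the traversal and spawns one task per child subtree:

// Sketch only: assumes Node declares this recursive member function.
void Node::applyFunction( uint64_t i_lvl,
                          double   i_dataParent,
                          double (* i_func)( uint64_t,
                                             double,
                                             double ) ) {
  // update this node's data before the children read it
  m_data = i_func( i_lvl, i_dataParent, m_data );

  // each child subtree is processed in its own task
  if( m_left != nullptr ) {
#pragma omp task
    m_left->applyFunction( i_lvl+1, m_data, i_func );
  }
  if( m_right != nullptr ) {
#pragma omp task
    m_right->applyFunction( i_lvl+1, m_data, i_func );
  }
}

void Tree::applyFunction( double (* i_func)( uint64_t i_lvl,
                                             double   i_dataParent,
                                             double   i_dataNode ) ) {
  // one thread starts the traversal, the team executes the tasks
#pragma omp parallel
#pragma omp single
  {
    if( m_root != nullptr ) m_root->applyFunction( 0, 0.0, i_func );
  }
}

In practice, spawning a task for every node can be too fine-grained; a common refinement is to stop creating tasks below a certain level and recurse serially instead.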
13.4. Offloading
This section studies offloading. The term “offloading” means that we move our workloads, or at least parts of them, to accelerators, e.g., GPUs. Oftentimes software using accelerators goes through specific languages or language extensions; examples are CUDA or HIP. Recent versions of the OpenMP API support offloading workloads to accelerators directly. When compared to platform-specific options, going through OpenMP has the potential to provide better performance portability.
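As a preview, a minimal offloaded triad could look like the following sketch; the combined target teams distribute parallel for construct and the explicit map clauses are one of several possible approaches explored in the tasks below:

#include <cstddef>

// Sketch: offloaded triad using a combined OpenMP target construct.
// The map clauses copy b and c to the device and a back to the host.
void triadOffload( std::size_t i_nValues,
                   double i_scalar,
                   double const * i_b,
                   double const * i_c,
                   double * o_a ) {
#pragma omp target teams distribute parallel for \
  map(to: i_b[0:i_nValues], i_c[0:i_nValues]) \
  map(from: o_a[0:i_nValues])
  for( std::size_t l_va = 0; l_va < i_nValues; l_va++ ) {
    o_a[l_va] = i_b[l_va] + i_scalar * i_c[l_va];
  }
}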
Tasks
Illustrate the behavior of offloading-related OpenMP runtime library routines (a small probe program is sketched after this list). Include at least omp_get_num_devices(), omp_is_initial_device(), omp_get_device_num(), omp_get_initial_device(), omp_get_default_device() and omp_get_num_teams() in your considerations. Do this on the MI100- and V100-accelerated nodes of the FTP.
Illustrate the OpenMP constructs #pragma omp target teams and #pragma omp parallel in the context of offloading (see also Fig. 13.4.1). Manually adjust the number of teams and threads in your experiments.
Offload the triad of Section 11 and highlight the impact of respective OpenMP constructs and routines on the way. Try different approaches for the host-device data transfers and for the parallelization of the workload.
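For the first task, a small probe program along the following lines can help. All routines used are standard OpenMP device routines (omp_get_device_num() requires OpenMP 5.0); note that printf inside target regions works with common GPU offloading toolchains but is not guaranteed everywhere:

#include <cstdio>
#include <omp.h>

int main() {
  // host-side view of the available devices
  std::printf( "num devices:    %d\n", omp_get_num_devices() );
  std::printf( "initial device: %d\n", omp_get_initial_device() );
  std::printf( "default device: %d\n", omp_get_default_device() );

  // execute a target region and query the device-side view
#pragma omp target
  {
    std::printf( "on initial device: %d\n", omp_is_initial_device() );
    std::printf( "device num:        %d\n", omp_get_device_num() );
  }

  // query the number of teams from within a teams region
#pragma omp target teams
  {
    if( omp_get_team_num() == 0 ) {
      std::printf( "num teams: %d\n", omp_get_num_teams() );
    }
  }

  return 0;
}

Keep in mind that without offload-enabled compiler flags the target regions silently fall back to the host, which omp_is_initial_device() will reveal.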