5. OpenMP
In this section, we’ll study the OpenMP API, which allows us to efficiently exploit parallelism in the shared-memory domain. While earlier versions of OpenMP were limited to CPU-based parallelism, newer versions also support offloading computation to accelerators.
OpenMP consists of a set of compiler directives; compilers without OpenMP support can simply ignore them, resulting in non-threaded execution. OpenMP is a standard and is supported, in some form, by all major compilers, e.g., GNU, Clang, and Intel.
5.1. Fork-Join Model
At program startup, a single thread, the so-called “master thread”, executes the program. This thread opens a parallel region. Inside a parallel region, multiple threads execute instruction streams concurrently. The number of concurrent threads may differ from one parallel region to another. OpenMP provides settings to precisely map threads to hardware; typically, these settings are configured through environment variables.
5.2. Synchronization
Synchronization allows us to orchestrate the parallel execution by imposing a certain order. This typically comes with a performance penalty. In many cases synchronization is implicit, e.g., “process x waits until it receives certain data from process y”.
Possible use cases:
Fork-join model: Wait for all threads to reach the end of a parallel section before master thread continues
Data dependencies or producer-consumer mechanism
Exclusive use of a resource
Barrier
A process or thread has to wait at the barrier until all other processes or threads (sometimes only those within a group) have reached it.
#pragma omp parallel
{
  doWork();
  #pragma omp barrier
  doMoreWork();
}
Critical Region
A critical region only allows one thread at a time to execute the respective code.
#pragma omp parallel
{
  doWork();
  #pragma omp critical
  {
    doSomethingCritical();
  }
  doMoreWork();
}
Reductions
Assume a set of threads/processes, each holding a value. A reduction operator combines these values into a single value, e.g., by summing them up. In OpenMP, the reduction clause creates a private copy of each variable in the list for every thread. At the end of the region, the private copies are combined using the reduction operator, and the result is written back to the shared variable.
int a[] = {1,2,3,4,5,6,7,8,9,10};
int sum = 0;
#pragma omp parallel for shared(a) reduction(+: sum)
for (int i = 0; i < 10; i++)
{
  sum += a[i];
}
5.3. Matrix-Multiplication
A matrix multiplication, matmul for short, is an essential primitive of many workloads, e.g., neural networks. Achieving peak performance in these workloads requires parallelized code.
Tasks
Implement a serial matrix multiplication using only one-dimensional arrays. The matrices should be square with N=1000 and N=2000.
Measure the execution time.
Parallelize the MatMul using OpenMP.
Measure the execution time and compare the results with your serial implementation.
5.4. Monte Carlo Method
In this section we’ll use the Monte Carlo method to estimate \(\pi\). For this we randomly generate points inside the unit square \([0,1)^2\). Now, a point \((x,y)\) lies within the unit circle if \(x^2 + y^2 < 1\).
This situation is illustrated in Fig. 5.4.1. If we randomly generate some points inside the unit square, we can simply count how many of those lie inside of the unit circle. Since the quarter circle covers a fraction \(\pi/4\) of the unit square’s area, the ratio \(C/N\) approximates \(\pi/4\). If \(N\) is the total number of generated points and \(C\) the number of those inside of the circle, we obtain:

\[\pi \approx \frac{4C}{N}\]

In the example of Fig. 5.4.1 we have \(N=21\) and \(C=15\), and would thus obtain \(\pi \approx 2.86\). This is still a pretty bad approximation, but we’ll boost accuracy by simply generating many more random samples.
Tasks
Use the std::mt19937_64 random number generator and std::uniform_real_distribution to generate a sequence of random numbers in \([0,1)\).
Parallelize the random number generation using OpenMP. Be careful to have independent instances of the random number generator on every thread.
Implement the described Monte Carlo method for the computation of \(\pi\) sequentially. Compute the error and required time-to-solution for 50,000,000 random samples.
Parallelize your implementation using OpenMP. Compare an implementation relying on critical sections for accessing the counter for \(C\) to a version using a reduction.