9. Parallelization
This chapter introduces shared-memory parallelism through OpenMP.
OpenMP offers “simple” pragmas to parallelize C/C++ code.
For example, we could parallelize a for loop by putting the pragma #pragma omp parallel for
in front of the loop.
A compiler with OpenMP support will generate code that executes the loop in parallel using multiple threads.
You can enable OpenMP support using the flag -fopenmp
for the GNU compilers and the flag -qopenmp
for the Intel compiler.
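A minimal sketch, assuming a hypothetical function scale which multiplies every entry of an array by a constant; the names are only for illustration. With the GNU compilers the code is compiled, e.g., via gcc -fopenmp:
// scales every entry of the array a by the factor f;
// the loop iterations are distributed among the available threads
void scale( int n, float f, float * a ) {
  #pragma omp parallel for
  for( int l_i = 0; l_i < n; l_i++ ) {
    a[l_i] = f * a[l_i];
  }
}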
With OpenMP the order in which loop iterations are executed is no longer guaranteed.
To get correct results, we have to ensure that no dependences exist between loop iterations, i.e., one iteration must not read or write a memory location that another iteration writes.
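As an illustration of such a dependence, consider the following (hypothetical) prefix-sum-style loop: iteration l_i reads the value written by iteration l_i - 1, so the result depends on the execution order and the loop must not simply be parallelized with the pragma above.
float l_data[N];
// [...]
// loop-carried dependence: iteration l_i reads l_data[l_i-1],
// which was written by the previous iteration
for( int l_i = 1; l_i < N; l_i++ ) {
  l_data[l_i] += l_data[l_i-1];
}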
Chapter 5.4 (Data-Sharing Attribute Clauses) of the OpenMP Application Programming Interface describes the clauses that control how data is shared between threads.
For example, we may add a private
clause to the pragma to declare a variable thread-local, i.e., each thread gets its own copy of the variable:
float l_a[N];
float l_b[N];
float l_result[N];
float l_sum;
// [...]
// additional code that initializes l_a and l_b
// [...]

// l_sum is declared private: every thread works on its own copy;
// the arrays are shared among all threads by default
#pragma omp parallel for private( l_sum )
for( int l_i = 0; l_i < N; l_i++ ) {
  l_sum = l_a[l_i] + l_b[l_i];
  l_result[l_i] = 2 * l_sum;
}
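If a variable accumulates a result over all iterations, a private copy alone is not sufficient since the partial results of the threads have to be combined. The reduction clause handles this; a brief sketch, assuming we want the sum of all entries of l_result:
float l_total = 0;
// every thread accumulates into its own copy of l_total;
// the per-thread partial sums are added up after the loop
#pragma omp parallel for reduction( +: l_total )
for( int l_i = 0; l_i < N; l_i++ ) {
  l_total += l_result[l_i];
}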
For parallel applications we are interested in the speedup. The speedup \(S_p\) on \(p\) cores is defined as
\[ S_p = \frac{T_1}{T_p}, \]
where \(T_1\) is the observed runtime on one core and \(T_p\) is the observed runtime on \(p\) cores. In the ideal case \(S_p=p\) holds, which means that the application runs \(p\)-times faster on \(p\) cores than on one core. However, largely due to serial parts and communication or synchronization between threads, the achieved speedup is usually smaller: \(S_p < p\).
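In practice we can measure \(T_1\) and \(T_p\) with OpenMP's wall-clock timer omp_get_wtime. A small sketch, where solve() stands for a hypothetical call to the (parallelized) solver:
#include <omp.h>
#include <stdio.h>
// [...]
double l_start = omp_get_wtime();
solve(); // hypothetical: run the solver
double l_end = omp_get_wtime();
// elapsed wall-clock time in seconds, i.e., T_p for the current thread count
printf( "runtime: %f s\n", l_end - l_start );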
Tasks
Parallelize your solver using OpenMP!
Compare the run time of your parallelized solver to the serial version on a single node of ARA! What speedup do you get for up to 72 threads when using a node of the hadoop partition? You can specify the number of OpenMP threads using the environment variable OMP_NUM_THREADS. Is it useful to spawn more threads than cores?
For the two-dimensional version of the solver, is it better to parallelize the outer or the inner loops?
Try different scheduling and pinning strategies. What performs best? Study NUMA effects and exploit the first-touch policy to perform NUMA-aware initializations.
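A minimal sketch of a first-touch, NUMA-aware initialization, reusing the arrays l_a, l_b and l_result from above: the data is initialized in parallel with the same static schedule as the compute loop, so every thread first touches (and thereby places on its NUMA domain) the pages it later works on. Thread pinning can be controlled, e.g., through the environment variables OMP_PROC_BIND and OMP_PLACES.
// first touch: every thread initializes the part of the arrays it later works on
#pragma omp parallel for schedule( static )
for( int l_i = 0; l_i < N; l_i++ ) {
  l_a[l_i] = 0;
  l_b[l_i] = 0;
  l_result[l_i] = 0;
}
// compute loop with a matching schedule, so the threads reuse "their" pages
#pragma omp parallel for schedule( static ) private( l_sum )
for( int l_i = 0; l_i < N; l_i++ ) {
  l_sum = l_a[l_i] + l_b[l_i];
  l_result[l_i] = 2 * l_sum;
}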