9. Parallelization
This chapter introduces parallelism in the shared-memory domain through OpenMP.
OpenMP offers "simple" pragmas to parallelize C/C++ code.
For example, we can parallelize a for loop by putting the pragma #pragma omp parallel for in front of the loop.
A compiler with OpenMP support will then generate code that executes the loop iterations in parallel using multiple threads.
You can enable OpenMP support using the flag -fopenmp for the GNU compilers and the flag -qopenmp for the Intel compiler.
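As a small illustration, the following sketch parallelizes a simple vector addition; the array names, the problem size and the file name example.c are chosen for illustration only. Compiling it with, e.g., gcc -fopenmp example.c -o example produces a multithreaded executable:

#include <stdio.h>
#include <omp.h>

#define N 1048576 // illustrative problem size

static float l_a[N];
static float l_b[N];
static float l_result[N];

int main() {
  // serial initialization of the input arrays
  for( int l_i = 0; l_i < N; l_i++ ) {
    l_a[l_i] = 1.0f;
    l_b[l_i] = 2.0f;
  }

  // the pragma distributes the iterations of the following loop among the threads
  #pragma omp parallel for
  for( int l_i = 0; l_i < N; l_i++ ) {
    l_result[l_i] = l_a[l_i] + l_b[l_i];
  }

  printf( "l_result[0] = %f, max threads: %d\n", l_result[0], omp_get_max_threads() );
  return 0;
}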
With OpenMP the order in which loop iterations are executed is no longer guaranteed.
To get correct results, we have to ensure that no dependences between loop iterations exist, i.e., two different loop iterations should typically not write to and read from the same location in memory.
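As an illustration (the names l_data and l_total are chosen for this sketch only, and N is assumed to be defined as in the snippet above), the first loop below carries a dependence between iterations and would produce wrong results if naively parallelized, whereas the sum in the second loop can be parallelized safely with OpenMP's reduction clause:

float l_data[N];
float l_total = 0.0f;

// loop-carried dependence: iteration l_i reads the value written by iteration l_i-1,
// so the iterations must not be reordered or executed concurrently
for( int l_i = 1; l_i < N; l_i++ ) {
  l_data[l_i] = l_data[l_i-1] + l_data[l_i];
}

// no dependence between iterations: each thread accumulates a private partial sum,
// and the reduction clause combines the partial sums into l_total at the end
#pragma omp parallel for reduction(+:l_total)
for( int l_i = 0; l_i < N; l_i++ ) {
  l_total += l_data[l_i];
}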
Chapter 5.4 (Data-Sharing Attribute Clauses) of the OpenMP Application Programming Interface specification explains clauses that control how data is shared between threads.
For example, we may add a private clause to the pragma to declare a variable thread-local, i.e., each thread gets its own copy of the variable. Without the clause, all threads would read and write the same shared l_sum, leading to a race condition:
float l_a[N];
float l_b[N];
float l_result[N];
float l_sum;
// [...]
// additional code that initializes l_a and l_b
// [...]
#pragma omp parallel for private( l_sum )
for( int l_i = 0; l_i < N; l_i++ ) {
  l_sum = l_a[l_i] + l_b[l_i];
  l_result[l_i] = 2 * l_sum;
}
For parallel applications we are interested in the speedup.
The speedup S(n) is the ratio of the run time of the serial version to the run time of the parallel version using n threads: S(n) = T_serial / T(n). Ideally, the speedup equals the number of threads.
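One way to obtain the run times entering the speedup is OpenMP's wall-clock timer. The following fragment follows the snippet style used above; the function solve() is a hypothetical placeholder for a call that runs your serial or parallelized solver:

// requires #include <omp.h> and #include <stdio.h>
double l_start = omp_get_wtime();
solve(); // hypothetical placeholder: run the serial or parallelized solver here
double l_end = omp_get_wtime();
printf( "run time: %f s using up to %d threads\n", l_end - l_start, omp_get_max_threads() );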
Tasks
1. Parallelize your solver using OpenMP!
2. Compare the run time of your parallelized solver to the serial version on a single node of ARA! What speedup do you get for up to 72 threads when using a node of the hadoop partition? You can specify the number of OpenMP threads using the environment variable OMP_NUM_THREADS. Is it useful to spawn more threads than cores?
3. For the two-dimensional version of the solver, is it better to parallelize the outer or the inner loops?
4. Try different scheduling and pinning strategies. What performs best? Study NUMA effects and use the first touch policy to perform NUMA-aware initializations (see the sketch below).
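As a starting point for the NUMA experiments, the following sketch reuses the arrays l_a, l_b and l_result from the example above (the static schedule and the environment variable values in the comment are example choices, not prescribed settings). Under the operating system's first touch policy, the thread that first writes a memory page determines the NUMA domain the page is placed in, so the data is initialized with the same loop schedule that the compute loop uses:

// NUMA-aware initialization: each thread first touches the part of the arrays
// it will later work on, so the pages end up in that thread's NUMA domain
#pragma omp parallel for schedule(static)
for( int l_i = 0; l_i < N; l_i++ ) {
  l_a[l_i] = 0.0f;
  l_b[l_i] = 0.0f;
  l_result[l_i] = 0.0f;
}

// compute loop with the same static schedule: each thread mostly accesses
// memory located in its own NUMA domain
#pragma omp parallel for schedule(static)
for( int l_i = 0; l_i < N; l_i++ ) {
  l_result[l_i] = 2 * ( l_a[l_i] + l_b[l_i] );
}

// possible run configuration for the pinning experiments (example values only):
//   OMP_NUM_THREADS=72 OMP_PROC_BIND=close OMP_PLACES=cores ./solver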