Programming shared memory systems:
OpenMP
Various methods to program shared memory systems
Introduction to OpenMP
Programming shared memory systems
Create multiple threads to use multiple cores
o Use a specialized programming language designed for shared memory multi-
thread programming, e.g., Cilk Plus. These are not mainstream yet.
o Use library support for shared memory multi-threading. E.g. Python
multiprocessing shared memory module
o Use a parallelizing compiler
o Use compiler directives that supplement a sequential programming language,
e.g., OpenMP (with C/C++ or Fortran)
o Use processes
o Use threads
Programming shared memory systems
Although there are different ways to program shared memory systems, the
issues involved are similar. Any shared memory program (or parallel
program) needs to address two issues.
o How to create multiple execution threads, mostly the fork-join model.
o How to share data and coordinate – e.g., avoiding race conditions
We will review processes and threads, but focus on OpenMP
o OpenMP is widely supported: GCC supports OpenMP
o OpenMP is relatively easy to understand and use
Use processes
A topic in the UG OS class.
Heavyweight
o Process creation, termination, and coordination are expensive.
UNIX system calls: fork(), exit(), wait().
o Inter-process communication is somewhat awkward because processes are designed
for isolation, not sharing.
Explicit shared memory operations: shmget(), shmat(), and shmctl()
Although processes can be used for shared memory programming, their overhead
means they are rarely used in practice.
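A minimal sketch of this style, assuming a UNIX/Linux system (variable names are illustrative): two processes share one integer through a System V shared memory segment.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>

int main() {
    /* create a shared memory segment large enough for one int, and attach it */
    int shmid = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
    int *counter = (int *) shmat(shmid, NULL, 0);
    *counter = 0;

    if (fork() == 0) {            /* child process */
        *counter += 1;            /* both processes see the same memory */
        exit(0);
    }
    *counter += 1;                /* parent process */
    wait(NULL);                   /* wait for the child to terminate */

    /* the two unsynchronized increments can race: this may print 1 or 2 */
    printf("counter = %d\n", *counter);

    shmdt(counter);
    shmctl(shmid, IPC_RMID, NULL);   /* remove the segment */
    return 0;
}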
Use threads
A topic in the UG OS class.
A thread is a lightweight process: creation, termination, and communication
are all much cheaper than for a process.
Pthreads: POSIX-compliant threads
o Thread creation, termination, and coordination: pthread_create(), pthread_exit(),
and pthread_join().
o Communication is through shared memory. Pthread routines for coordinating memory
access: pthread_mutex_lock(), pthread_cond_wait(), pthread_cond_signal().
It is very easy to write incorrect programs. Some legacy system routines are still not “thread-safe.”
o Programmer needs to explicitly manage the threads.
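A minimal Pthreads sketch using the routines above (the thread function and variable names are illustrative); compile with gcc -pthread.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

long sum = 0;                                   /* shared by all threads */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *work(void *arg) {
    long myid = (long) arg;
    pthread_mutex_lock(&lock);                  /* protect the shared update */
    sum += myid;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main() {
    pthread_t tid[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, work, (void *) i);   /* create (fork) */
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);                        /* join */
    printf("sum = %ld\n", sum);
    return 0;
}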
OpenMP
What does OpenMP stand for?
o Open specifications for Multi Processing via collaborative work between
interested parties from the hardware and software industry, government
and academia.
OpenMP is an Application Program Interface (API) that may be used
to explicitly direct multi-threaded, shared memory parallelism.
o API components: Compiler Directives, Runtime Library Routines, and
Environment Variables
OpenMP is a directive-based method to invoke parallel computations on shared-
memory multiprocessors
OpenMP
OpenMP API is specified for C/C++ and Fortran.
OpenMP is not intrusive to the original serial code: directives
appear as comment statements in Fortran and as pragmas in C/C++.
OpenMP website: http://www.openmp.org
o Materials in this lecture are taken from various OpenMP tutorials on the
website and elsewhere.
Why OpenMP
OpenMP is portable: It is jointly defined and endorsed by a group of
major hardware and software vendors.
o Hoping for an ANSI standard
OpenMP can be implemented incrementally, one function or even
one loop at a time.
o A nice way to get a parallel program from a sequential program.
How to compile and run OpenMP programs?
To compile: use gcc with the flag -fopenmp
o Example: gcc -fopenmp a.c
o Try lect11/helloworld.cpp, lect11/helloworld1.cpp – see how easy it is to write wrong
shared memory program.
To run: ‘a.out’
o To change the number of threads (the default is the number of cores on the machine):
setenv OMP_NUM_THREADS 4 (tcsh) or export OMP_NUM_THREADS=4(bash)
Putting it together, on linprog:
gcc -fopenmp helloworld1.cpp
export OMP_NUM_THREADS=4
./a.out -- run the program with 4 threads
OpenMP basic syntax
Most of the constructs in OpenMP are compiler directives
#pragma omp construct [clause [clause] …]
o Example
#pragma omp parallel num_threads(8)
OpenMP constructs usually apply to a structured block (one
statement).
Function prototypes are in
#include <omp.h>
OpenMP “Hello world” program
#include <omp.h>        // OpenMP include file
#include <iostream>
#include <sstream>
using namespace std;

int main()
{
  stringstream ss[100];
  int nprocs;

  // OpenMP directive: parallel region with the default number of threads
  #pragma omp parallel
  {
    // OpenMP library routines that return the number of threads
    // and the thread ID, respectively
    int total_threads = omp_get_num_threads();
    int myid = omp_get_thread_num();
    if (myid == 0) nprocs = total_threads;
    ss[myid] << "Hello world, I am No." << myid << " out of "
             << total_threads << " Members\n";
  }
  for (int i = 0; i < nprocs; i++) {
    string a; while (ss[i] >> a) cout << a << " "; cout << "\n";
  }
  return 0;
}
OpenMP execution model
OpenMP uses the fork-join model of parallel execution.
o All OpenMP programs begin with a single master thread.
o The master thread executes sequentially until a parallel region is encountered, when it
creates a team of parallel threads (FORK).
o When the team threads complete the parallel region, they synchronize and terminate,
leaving only the master thread that executes sequentially (JOIN).
OpenMP general code structure
#include <omp.h>
main () {
int var1, var2, var3;
Serial code
...
/* Beginning of parallel section. Fork a team of threads. Specify variable scoping*/
#pragma omp parallel private(var1, var2) shared(var3)
{
/* Parallel section executed by all threads */
...
/* All threads join master thread and disband*/
}
Resume serial code
...
}
An example lect11/simple_multithread.cpp
If variables are not specified as shared or private, all variables declared outside the
parallel region are shared by all threads by default (see the sketch at the end of this slide).
What about variables declared inside the parallel block?
What if the ‘nprocs = total_threads;’ statement in the program is moved before the if
statement?
Easy for OpenMP to create threads
o Deterministic multi-threaded program: no race conditions – no two threads access the
same memory location with at least one of them writing to it!
o Non-deterministic multi-threaded programs are mostly incorrect programs.
o Need to be mindful when making system/library routine calls.
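As a small illustration of the scoping rules above (a minimal sketch, not the code in lect11/simple_multithread.cpp): a variable declared outside the parallel region is shared, while a variable declared inside the region is private to each thread.

#include <omp.h>
#include <stdio.h>

int main() {
    int outside = 0;                 /* declared outside: shared by all threads */
    #pragma omp parallel
    {
        int inside = omp_get_thread_num();   /* declared inside: private to each thread */
        #pragma omp critical
        outside += inside;           /* without the critical, updates to the
                                        shared variable would race */
    }
    printf("outside = %d\n", outside);
    return 0;
}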
Data model
• Private and shared variables
• Variables in the global data space are accessed
by all parallel threads (shared variables).
• Variables in a thread’s private space can only be
accessed by the thread (private variables)
• several variations, depending on the initial values and
whether the results are copied outside the region.
• OpenMP allows for specification of private and
shared variable before entering each parallel region.
for (i = 0; i < arraysize; i++) {
  for (privindx = 0; privindx < 16; privindx++) {
    privdbl = ((double) privindx) / 16;
    y[i] = sin(exp(cos(-exp(sin(x[i]))))) + cos(privdbl);
  }
}

Can this loop run in parallel? On what condition?
If one has a choice, should one parallelize the inner or the outer loop?

#pragma omp parallel for private(privindx, privdbl)
for (i = 0; i < arraysize; i++) {
  for (privindx = 0; privindx < 16; privindx++) {
    privdbl = ((double) privindx) / 16;
    y[i] = sin(exp(cos(-exp(sin(x[i]))))) + cos(privdbl);
  }
}

The parallel for loop index variable is private by default.
OpenMP directives
Format:
#pragma omp directive-name [clause, ...] newline
(use ‘\’ for multiple lines)
Example:
#pragma omp parallel default(shared) private(beta,pi)
Scope of a directive is one block of statements { …}
More detail in the specification at http://www.openmp.org/ (OpenMP
5.0 at https://www.openmp.org/spec-html/5.0/openmp.html)
Parallel region construct
A block of code that will be executed by multiple threads.
#pragma omp parallel [clause …]
{
……
} (implied barrier)
Clauses: if([parallel :] expression), num_threads(integer-expression), private(list),
shared(list), default(shared | none), reduction([reduction-modifier,] reduction-
identifier : list), firstprivate(list), copyin(list), proc_bind(master | close | spread),
allocate([allocator :] list)
o private(list): everything private and local (no relation with variables outside the
block).
o shared(list): data accessed by all threads
o default (none| shared)
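A small sketch of a parallel region with explicit scoping clauses (variable names are illustrative; default(none) forces every variable used in the region to be listed):

#include <omp.h>
#include <stdio.h>

int main() {
    int n = 100, tid;
    double total = 0.0;

    #pragma omp parallel default(none) shared(n, total) private(tid) num_threads(4)
    {
        tid = omp_get_thread_num();   /* each thread works on its own copy of tid */
        #pragma omp critical
        total += n * 0.01;            /* all threads update the same shared total */
    }
    printf("total = %g\n", total);    /* 4 threads -> 4.0 */
    return 0;
}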
The reduction clause (see
lect11/reduction.cpp)
sum = 0.0;
#pragma omp parallel for default(none) shared(n, x) private(i) reduction(+ : sum)
for (i = 0; i < n; i++) sum = sum + x[i];

Updating sum must avoid a race condition.
With the reduction clause, OpenMP generates code such that the
race condition is avoided.
o See also lect11/reduction1.cpp and lect11/reduction2.cpp.
Clauses
firstprivate(list): variables are private, but each copy is initialized with the value
the variable had before entering the block (see the sketch below).
copyin(list): The copyin clause provides a mechanism to copy the value of
a threadprivate variable of the master thread to the threadprivate variable
of each other member of the team that is executing the parallel region.
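A minimal sketch of firstprivate (variable names are illustrative):

#include <omp.h>
#include <stdio.h>

int main() {
    int offset = 10;   /* value set before the parallel region */

    /* private(offset) would give each thread an uninitialized copy;
       firstprivate(offset) gives each thread a copy initialized to 10 */
    #pragma omp parallel firstprivate(offset) num_threads(2)
    {
        offset += omp_get_thread_num();
        printf("thread %d: offset = %d\n", omp_get_thread_num(), offset);
    }
    printf("after the region: offset = %d\n", offset);   /* still 10 */
    return 0;
}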
proc_bind(master | close | spread): thread affinity related,
to talk more in the next class (OpenMP for NUMA
architectures)
allocate([allocator :] list): memory allocation related, to talk
more in the next class. (OpenMP for NUMA architectures)
Work-sharing constructs
#pragma omp for [clause …]
#pragma omp sections [clause …]
#pragma omp single [clause …]
These must be enclosed in parallel region
No implied barrier on entry, implied barrier on exit (unless specified otherwise)
“omp for”: loop iterations are distributed over the number of threads.
“omp sections”: each section is assigned to one thread
“omp single”: only one thread runs the block
The omp for directive: example
omp for shares loop iterations among different threads. It can have a schedule clause
schedule (static | dynamic | guided [, chunk])
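A minimal sketch of omp for with a schedule clause (the arrays and sizes are illustrative):

#include <omp.h>
#include <stdio.h>
#define N 16

int main() {
    double x[N], y[N];
    for (int i = 0; i < N; i++) x[i] = i;

    #pragma omp parallel
    {
        /* iterations 0..N-1 are divided among the threads of the team;
           schedule(static, 4) gives each thread fixed chunks of 4 iterations,
           schedule(dynamic, 4) would hand out chunks of 4 on demand */
        #pragma omp for schedule(static, 4)
        for (int i = 0; i < N; i++)
            y[i] = 2.0 * x[i];
    }   /* implied barrier at the end of the for construct */

    printf("y[N-1] = %g\n", y[N-1]);
    return 0;
}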
The omp sections construct – example (see also
lect11/simple_multithread1.cpp)
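The example file is not reproduced here; a minimal sections sketch of the same flavor (the printed text is illustrative):

#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel num_threads(2)
    {
        #pragma omp sections
        {
            #pragma omp section
            printf("section 1 run by thread %d\n", omp_get_thread_num());

            #pragma omp section
            printf("section 2 run by thread %d\n", omp_get_thread_num());
        }   /* implied barrier: both sections are done here */
    }
    return 0;
}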
Omp single construct
The single construct specifies that the associated structured block is
executed by only one of the threads in the team (not necessarily the
master thread), in the context of its implicit task. The other threads
in the team, which do not execute the block, wait at an implicit
barrier at the end of the single construct unless a nowait clause is
specified.
See lect11/omp_single.cpp
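The example file is not reproduced here; a minimal single sketch:

#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel
    {
        #pragma omp single
        printf("only thread %d prints this\n", omp_get_thread_num());
        /* the other threads wait at the implicit barrier (no nowait given) */

        printf("thread %d continues after the single\n", omp_get_thread_num());
    }
    return 0;
}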
Some combined (merged) constructs
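For example, parallel and for can be merged into a single parallel for directive (and similarly parallel sections). A minimal sketch, with illustrative arrays:

#include <omp.h>
#define N 100

int main() {
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; }

    /* "parallel for" combines a parallel region and an omp for in one directive */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    return 0;
}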
Synchronization: barrier
Both loops are in a parallel region with no synchronization in between.
What is the problem? The second loop should not start before all of the
first loop finishes.

for (i = 0; i < N; i++)
  a[i] = b[i] + c[i];
for (i = 0; i < N; i++)
  d[i] = a[i] + b[i];

With an explicit barrier between the loops:

for (i = 0; i < N; i++)
  a[i] = b[i] + c[i];
#pragma omp barrier
for (i = 0; i < N; i++)
  d[i] = a[i] + b[i];
Critical section
The loop cannot be parallelized as is if sum is shared:

for (i = 0; i < N; i++) {
  ……
  sum += A[i];
  ……
}

Protecting the update with a critical section (see lect11/reduction1.cpp;
what if we comment out ‘#pragma omp critical’ in the program?):

for (i = 0; i < N; i++) {
  ……
  #pragma omp critical
  {
    sum += A[i];
  }
  ……
}
OpenMP environment variables
OMP_NUM_THREADS
OMP_SCHEDULE
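Typical usage from a bash shell (OMP_SCHEDULE only affects loops declared with schedule(runtime)):

export OMP_NUM_THREADS=8          # run with 8 threads
export OMP_SCHEDULE="dynamic,4"   # schedule used by loops with schedule(runtime)
./a.out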
OpenMP runtime environment
omp_get_num_threads()
omp_get_thread_num()
omp_in_parallel()
Routines related to locks
……
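A minimal sketch using some of these routines (the explicit lock here plays the same role as a critical section):

#include <omp.h>
#include <stdio.h>

int main() {
    omp_lock_t lock;
    omp_init_lock(&lock);
    int count = 0;

    printf("in parallel? %d\n", omp_in_parallel());   /* 0: still sequential */

    #pragma omp parallel
    {
        omp_set_lock(&lock);        /* lock routines give explicit mutual exclusion */
        count++;
        omp_unset_lock(&lock);
    }

    omp_destroy_lock(&lock);
    printf("count = %d\n", count);  /* equals the number of threads */
    return 0;
}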
OpenMP examples
lect11/piomp.c
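The file itself is not reproduced here; a typical pi program of this kind integrates 4/(1+x^2) over [0,1] with a parallel for and a reduction, roughly along these lines (the number of intervals is illustrative):

#include <omp.h>
#include <stdio.h>

int main() {
    const long n = 100000000;          /* number of rectangles (illustrative) */
    const double h = 1.0 / n;
    double sum = 0.0;

    /* midpoint rule; the reduction avoids a race on sum */
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; i++) {
        double x = (i + 0.5) * h;
        sum += 4.0 / (1.0 + x * x);
    }

    printf("pi is approximately %.15f\n", sum * h);
    return 0;
}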
Sequential Matrix Multiply
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++) {
    c[i][j] = 0;
    for (k = 0; k < n; k++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];
  }
OpenMP Matrix Multiply (lect11/mm_omp.c)
#pragma omp parallel for private(j, k)
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++) {
    c[i][j] = 0;
    for (k = 0; k < n; k++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];
  }
OpenMP Summary
o OpenMP provides a compact, yet powerful programming model for shared memory
programming
It is very easy to use OpenMP to create parallel programs.
o OpenMP preserves the sequential version of the program
o Developing an OpenMP program:
Start from a sequential program
Identify the code segment that takes most of the time.
Determine whether the important loops can be parallelized
The loops may have critical sections, reduction variables, etc
Determine the shared and private variables.
Add directives