OpenMP Tutorial
Seung-Jai Min
smin@purdue.edu
School of Electrical and Computer Engineering
Purdue University, West Lafayette, IN
ECE 563 Programming Parallel Machines
Parallel Programming Standards
Thread Libraries - Win32 API / POSIX threads
Compiler Directives - OpenMP (shared memory programming) <-- OUR FOCUS
Message Passing Libraries - MPI (distributed memory programming)
Shared Memory Parallel Programming in the Multi-Core Era
Desktop and Laptop
2, 4, 8 cores and ?
A single node in distributed memory clusters
Steele cluster node: 2 x 8 (16) cores
Shared memory hardware
Accelerators
Cell processors: 1 PPE and 8 SPEs
Nvidia Quadro GPUs: 128 processing units
OpenMP: Some syntax details to get us started
Most of the constructs in OpenMP are compiler directives or pragmas.
For C and C++, the pragmas take the form:
#pragma omp construct [clause [clause]]
For Fortran, the directives take one of the forms:
C$OMP construct [clause [clause]]
!$OMP construct [clause [clause]]
*$OMP construct [clause [clause]]
Include files
#include <omp.h>
How is OpenMP typically used?
OpenMP is usually used to parallelize loops:
Find your most time-consuming loops and split them up between threads.
Sequential Program

  void main() {
    int i, k, N=1000;
    double A[N], B[N], C[N];
    for (i=0; i<N; i++) {
      A[i] = B[i] + k*C[i];
    }
  }

Parallel Program

  #include <omp.h>
  void main() {
    int i, k, N=1000;
    double A[N], B[N], C[N];
    #pragma omp parallel for
    for (i=0; i<N; i++) {
      A[i] = B[i] + k*C[i];
    }
  }
How is OpenMP typically used?
(Cont.)
Single Program Multiple Data (SPMD)
Parallel Program

  #include <omp.h>
  void main() {
    int i, k, N=1000;
    double A[N], B[N], C[N];
    #pragma omp parallel for
    for (i=0; i<N; i++) {
      A[i] = B[i] + k*C[i];
    }
  }
How is OpenMP typically used?
(Cont.)
Single Program Multiple Data (SPMD)
Each of the four threads executes the same program with its own loop bounds:

  Thread 0: lb = 0,   ub = 250
  Thread 1: lb = 250, ub = 500
  Thread 2: lb = 500, ub = 750
  Thread 3: lb = 750, ub = 1000

  void main() {
    int i, k, N=1000;
    double A[N], B[N], C[N];
    for (i=lb; i<ub; i++) {
      A[i] = B[i] + k*C[i];
    }
  }
OpenMP Fork-and-Join model
  printf("program begin\n");           /* serial */
  N = 1000;
  #pragma omp parallel for
  for (i=0; i<N; i++)
    A[i] = B[i] + C[i];                /* parallel */
  M = 500;                             /* serial */
  #pragma omp parallel for
  for (j=0; j<M; j++)
    p[j] = q[j] + r[j];                /* parallel */
  printf("program done\n");            /* serial */
OpenMP Constructs
1. Parallel Regions
   #pragma omp parallel
2. Worksharing
   #pragma omp for, #pragma omp sections
3. Data Environment
   #pragma omp parallel shared/private (...)
4. Synchronization
   #pragma omp barrier
5. Runtime functions/environment variables
   int my_thread_id = omp_get_thread_num();
   omp_set_num_threads(8);
OpenMP: Structured blocks
Most OpenMP constructs apply to structured blocks.
Structured block: one point of entry at the top and one point of exit at the bottom. The only branches allowed are STOP statements in Fortran and exit() in C/C++.
OpenMP: Structured blocks
A Structured Block
  #pragma omp parallel
  {
  more:
    do_big_job(id);
    if (++count > 1) goto more;
  }
  printf("All done\n");
Not A Structured Block
  if (count == 1) goto more;
  #pragma omp parallel
  {
  more:
    do_big_job(id);
    if (++count > 1) goto done;
  }
done:
  if (!really_done()) goto more;
Structured Block Boundaries
In C/C++: a block is a single statement or a group of statements between braces {}
  #pragma omp parallel
  {
    id = omp_get_thread_num();
    A[id] = big_compute(id);
  }

  #pragma omp for
  for (I=0; I<N; I++) {
    res[I] = big_calc(I);
    A[I] = B[I] + res[I];
  }
Structured Block Boundaries
In Fortran: a block is a single statement or a group of statements between directive/end-directive pairs.
C$OMP PARALLEL
 10   W(id) = garbage(id)
      res(id) = W(id)**2
      if (.not. conv(res(id))) goto 10
C$OMP END PARALLEL

C$OMP PARALLEL DO
      do I=1,N
        res(I) = bigComp(I)
      end do
C$OMP END PARALLEL DO
OpenMP Parallel Regions
  double A[1000];
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    int ID = omp_get_thread_num();
    pooh(ID, A);
  }
  printf("all done\n");
A single copy of A is shared between all threads; each thread calls pooh with its own ID:

  pooh(0,A)   pooh(1,A)   pooh(2,A)   pooh(3,A)

  printf("all done\n");
Implicit barrier: threads wait here for all threads to finish before proceeding
The OpenMP API: Combined parallel work-share
OpenMP shortcut: Put the parallel and the work-share on the same line
  int i;
  double res[MAX];
  #pragma omp parallel
  {
    #pragma omp for
    for (i=0; i<MAX; i++) {
      res[i] = huge();
    }
  }

  int i;
  double res[MAX];
  #pragma omp parallel for
  for (i=0; i<MAX; i++) {
    res[i] = huge();
  }
The two versions are the same OpenMP program.
Shared Memory Model
[Figure: threads 1-5, each with its own private memory, all connected to a common shared memory]
Data can be shared or private
Shared data is accessible by all threads
Private data can be accessed only by the thread that owns it
Data transfer is transparent to the programmer
Data Environment: Default storage attributes

Shared memory programming model
  Variables are shared by default
Distributed memory programming model
  All variables are private
Data Environment: Default storage attributes

Global variables are SHARED among threads
  Fortran: COMMON blocks, SAVE variables, MODULE variables
  C: file-scope variables, static variables
But not everything is shared...
  Stack variables in sub-programs called from parallel regions are PRIVATE
  Automatic variables within a statement block are PRIVATE
Data Environment
  #define N 100
  int A[N][N];                 /* (global) SHARED */
  int main() {
    int ii, jj;                /* PRIVATE (ii as loop index, jj via clause) */
    int B[N][N];               /* SHARED */
    #pragma omp parallel private(jj)
    {
      int kk = 1;              /* PRIVATE (declared inside the region) */
      #pragma omp for
      for (ii=0; ii<N; ii++)
        for (jj=0; jj<N; jj++)
          A[ii][jj] = foo(B[ii][jj]);
    }
  }
  int foo(int x) {
    int count = 0;             /* PRIVATE (local to each call) */
    return x * count;
  }
Work Sharing Construct
Loop Construct
  #pragma omp for [clause[[,] clause] ...] new-line
    for-loops

Where clause is one of the following:
  private / firstprivate / lastprivate(list)
  reduction(operator: list)
  schedule(kind[, chunk_size])
  collapse(n)
  ordered
  nowait
Schedule
  for (i=0; i<1100; i++)
    A[i] = ...;

#pragma omp parallel for schedule(static, 250)  or  schedule(static)
  Chunks of 250 iterations: 250 / 250 / 250 / 250 to p0-p3, then the last 100 to p0
  (or, with plain static, 275 iterations each to p0-p3)

#pragma omp parallel for schedule(dynamic, 200)
  Chunks of 200 iterations handed out on demand, e.g. to p3, p0, p2, p3, p1,
  then the last 100 to p0

#pragma omp parallel for schedule(guided, 100)
  Chunk sizes shrink toward the minimum, e.g. 137, 120, 105, 100, 100, 100,
  100, 100, 100, 100, 38, handed out on demand

#pragma omp parallel for schedule(auto)
  The schedule is left to the compiler/runtime
Critical Construct
  sum = 0;
  #pragma omp parallel private(lsum)
  {
    lsum = 0;
    #pragma omp for
    for (i=0; i<N; i++) {
      lsum = lsum + A[i];
    }
    #pragma omp critical
    {
      sum += lsum;   /* threads wait their turn; only one thread at a
                        time executes the critical section */
    }
  }
Reduction Clause
  sum = 0;           /* shared variable */
  #pragma omp parallel for reduction(+:sum)
  for (i=0; i<N; i++) {
    sum = sum + A[i];
  }
Performance Evaluation
How do we measure performance? (or how do we remove noise?)
  #define N 24000
  for (k=0; k<10; k++) {
    #pragma omp parallel for private(i, j)
    for (i=1; i<N-1; i++)
      for (j=1; j<N-1; j++)
        a[i][j] = (b[i][j-1] + b[i][j+1]) / 2.0;
  }
Performance Issues
What if you see a speedup saturation?
[Plot: speedup vs. number of CPUs (1-8), saturating as CPUs increase]
  #define N 12000
  #pragma omp parallel for private(j)
  for (i=1; i<N-1; i++)
    for (j=1; j<N-1; j++)
      a[i][j] = (b[i][j-1] + b[i][j] + b[i][j+1] +
                 b[i-1][j] + b[i+1][j]) / 5.0;
Performance Issues
What if you see a speedup saturation?

[Plot: speedup vs. number of CPUs (1-8), saturating as CPUs increase]
  #define N 12000
  #pragma omp parallel for private(j)
  for (i=1; i<N-1; i++)
    for (j=1; j<N-1; j++)
      a[i][j] = b[i][j];
Loop Scheduling
Any guideline for a chunk size?
  #define N <big-number>
  chunk = ???;
  #pragma omp parallel for schedule(static, chunk)
  for (i=2; i<N-2; i++)
    a[i] = (b[i-2] + b[i-1] + b[i] +
            b[i+1] + b[i+2]) / 5.0;
Performance Issues
Load imbalance: triangular access pattern
  #define N 12000
  #pragma omp parallel for private(j)
  for (i=1; i<N-1; i++)
    for (j=i; j<N-1; j++)
      a[i][j] = (b[i][j-1] + b[i][j] + b[i][j+1] +
                 b[i-1][j] + b[i+1][j]) / 5.0;
Summary
OpenMP has advantages:
  Incremental parallelization
  Compared to MPI:
    No data partitioning
    No communication scheduling
Resources
http://www.openmp.org
http://openmp.org/wp/resources