Lecture on Parallelization
I. Basic Parallelization
II. Data dependence analysis
III. Interprocedural parallelization
Chapter 11.1-11.1.4
Carnegie Mellon
M. Lam & S. Liao CS243: Parallelization 1
Parallelization of Numerical Applications
• DoAll loop parallelism
– Find loops whose iterations are independent
– Number of iterations typically scales with the problem
– Usually much larger than the number of processors in a machine
– Divide up iterations across machines
Carnegie Mellon
CS243: Parallelization 2 M. Lam & S. Liao
Basic Parallelism
Examples:
FOR i = 1 to 100
A[i] = B[i] + C[i]
FOR i = 11 TO 20
a[i] = a[i-1] + 3
FOR i = 11 TO 20
a[i] = a[i-10] + 3
• Does there exist a data dependence edge between two different
iterations?
• A data dependence edge is loop-carried if
it crosses iteration boundaries
• DoAll loops: loops without loop-carried dependences
Carnegie Mellon
CS243: Parallelization 3 M. Lam & S. Liao
Recall: Data Dependences
• True dependence:
a =
= a
• Anti-dependence:
= a
a =
• Output dependence
a =
a =
Carnegie Mellon
CS243: Parallelization 4 M. Lam & S. Liao
Affine Array Accesses
• Common patterns of data accesses: (i, j, k are loop indexes)
A[i], A[j], A[i-1], A[0], A[i+j], A[2*i], A[2*i+1]
A[i,j], A[i-1, j+1]
• Array indexes are affine expressions of surrounding loop indexes
– Loop indexes: in, in-1, ... , i1
– Integer constants: cn, cn-1, ... , c0
– Array index: cnin + cn-1in-1+ ... + c1i1+ c0
– Affine expression: linear expression + a constant term (c0)
Carnegie Mellon
CS243: Parallelization 5 M. Lam & S. Liao
II. Formulating Data Dependence Analysis
FOR i := 2 to 5 do
A[i-2] = A[i]+1;
• Between read access A[i] and write access A[i-2]there is a
dependence if:
– there exist two iterations ir and iw within the loop bounds, s.t.
– iterations ir & iw read & write the same array element, respectively
∃integers i w, i r 2 ≤ i w, i r ≤ 5 i r = iw - 2
• Between write access A[i-2] and write access A[i-2] there is a
dependence if:
∃integers i w, i v 2 ≤ i w, i v ≤ 5 i w – 2 = iv - 2
– To rule out the case when the same instance depends on itself:
• add constraint iw ≠ iv
Carnegie Mellon
CS243: Parallelization 6 M. Lam & S. Liao
Memory Disambiguation
is
Undecidable at Compile Time
read(n)
For i =
a[i] = a[n]
Carnegie Mellon
CS243: Parallelization 7 M. Lam & S. Liao
Domain of Data Dependence Analysis
• Only use loop bounds and array indexes that are affine functions of loop
variables
for i = 1 to n
for j = 2i to 100
a[i+2j+3][4i+2j][i*i] = …
… = a[1][2i+1][j]
• Assume a data dependence between the read & write operation if there
exists:
– a read instance with indexes ir, jr and
– a write instance with indexes iw, jw
∃integers ir, jr, iw, jw 1 ≤ iw, ir ≤ n 2iw ≤ jw ≤ 100
2ir ≤ jr ≤ 100
iw + 2jw + 3 = 1 4iw + 2jw = 2ir + 1
• Equate each dimension of array access; ignore non-affine ones
– No solution → No data dependence
– Solution → there may be a dependence
Carnegie Mellon
CS243: Parallelization 8 M. Lam & S. Liao
Complexity of Data Dependence Analysis
For every pair of accesses not necessarily distinct (F1 , f1 ) and (F2, f2)
one must be a write operation
Let B1i1+b1 ≥ 0, B2i2+b2 ≥ 0 be the corresponding loop bound
constraints,
∃ integers i1, i2 B1i1 + b1 ≥ 0, B2i2 + b2 ≥ 0
F1 i1+ f1 = F2 i2+f2
• If the accesses
Equivalent are not
to integer distinct,
linear then add the constraint i 1 ≠ i2
programming
∃ integer i A 1i ≤ b 1 A 2i = b 2
• Integer linear programming is NP-complete
– O(size of the coefficients) or O(nn)
Carnegie Mellon
CS243: Parallelization 9 M. Lam & S. Liao
Data Dependence Analysis Algorithm
• Typically solving many tiny, repeated problems
– Integer linear programming packages optimize for large problems
– Use memoization to remember the results of simple tests
• Apply a series of relatively simple tests
– GCD: 2*i, 2*i+1; GCD for simultaneous equations
– Test if the ranges overlap
• Backed up by a more expensive algorithm
– Use Fourier-Motzkin Elimination to test if there is a real solution
• Keep eliminating variables to see if a solution remains
• If there is no solution, then there is no integer solution
Carnegie Mellon
10 M. Lam & S. Liao
Fourier-Motzkin Elimination
• To eliminate a variable from a set of linear inequalities.
• To eliminate a variable x1
– Rewrite all expressions in terms of lower or upper bounds of x1
– Create a transitive constraint for each pair of lower and upper bounds.
• Example: Let L, U be lower bounds and upper bounds resp
– To eliminate x1:
L1(x2, …, xn) ≤ U1 (x2, …, xn)
L1(x2, …, xn) ≤ x1 ≤ U1 (x2, …, xn) L1(x2, …, xn) ≤ U2 (x2, …, xn)
L2(x2, …, xn) ≤ x1 ≤ U2 (x2, …, xn) L2(x2, …, xn) ≤ U1 (x2, …, xn)
L2(x2, …, xn) ≤ U2 (x2, …, xn)
Carnegie Mellon
11 M. Lam & S. Liao
Example
FOR i = 1 to 5
FOR j = i+1 to 5
A[i,j] = f(A[i,i], A[i-1,j])
1≤i 1 ≤ i’
i≤5 i’ ≤ 5
i+1 ≤ j i’ + 1 ≤ j’
j≤5 j’ ≤ 5
Carnegie Mellon
CS243: Parallelization 12 M. Lam & S. Liao
Data Dependence Analysis Algorithm
• Typically solving many tiny, repeated problems
– Integer linear programming packages optimize for large problems
– Use memoization to remember the results of simple tests
• Apply a series of relatively simple tests
– GCD: 2*i, 2*i+1; GCD for simultaneous equations
– Test if the ranges overlap
• Backed up by a more expensive algorithm
– Use Fourier-Motzkin Elimination to test if there is a real solution
• Keep eliminating variables to see if a solution remains
• Add heuristics to encourage finding an integer solution.
– Create 2 subproblems if a real, but not integer, solution is found.
• For example, if x = .5 is a solution,
create two problems,
by adding x ≤ 0 and x ≥ 1 respectively to original constraint.
Carnegie Mellon
13 M. Lam & S. Liao
Relaxing Dependences
Privatization:
• Scalar
for i = 1 to n
t = (A[i] + B[i]) / 2;
C[i] = t * t;
• Array
for i = 1 to n
for j = 1 to n
t[j] = (A[i,j] + B[i,j]) / 2;
for j = 1 to n
C[i,j] = t[j] * t[j];
Reduction:
for i = 1 to n
sum = sum + A[i];
Carnegie Mellon
CS243: Parallelization 14 M. Lam & S. Liao
Carnegie Mellon
CS243: Parallelization 15 M. Lam & S. Liao
Interprocedural Parallelization
• Why? Amdahl’s Law
• Interprocedural symbolic analysis
– Find interprocedural array indexes
which are affine expressions of outer loop indices
• Interprocedural parallelization analysis
– Data dependence based on summaries of array regions accessed
• If the regions do not intersect, there is parallelism
– Find privatizable scalar variables and arrays
– Find scalar and array reductions
Carnegie Mellon
CS243: Parallelization 16 M. Lam & S. Liao
Conclusions
• Basic parallelization
– Doall loop: loops with no loop-carried data dependences
– Data dependence for affine loop indexes = integer linear programming
• Coarse-grain parallelism because of Amdahl’s Law
– Interprocedural analysis is useful for affine indices
– Ask users for help on unresolved dependences
Carnegie Mellon
CS243: Parallelization 17 M. Lam & S. Liao