CS-402 Parallel and Distributed Systems
Fall 2024
Quick Review Lecture No. 04
Modern CPU core architecture
o Superscalar (pipeline, multiple issues)
o Branch prediction
o Out of order execution
o Many execution units
o Memory hierarchy
Implication on software performance
Locality – temporal locality and spatial locality
Dependence
Dependence refers to the relationship between different parts of a program where one part relies on the results of
another. In parallel computing, managing dependencies is essential to ensure that tasks are executed in the correct
order. There are several types of dependencies:
1. Data Dependence: Occurs when one task needs data produced by another task.
This can be further classified into:
i. True Dependence (Read After Write): Task B needs data that Task A writes.
ii. Anti-Dependence (Write After Read): Task B writes data that Task A reads.
iii. Output Dependence (Write After Write): Both Task A and Task B write to the same data location.
2. Control Dependence: Occurs when the execution of one task depends on the control flow of
another task, such as conditional statements.
Managing these dependencies is crucial to avoid issues like race conditions, where the outcome depends
on the sequence of task execution.
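As a small illustration (not from the slides; the variable names are made up), all three data-dependence types can appear in a few lines of C:

  void dependence_demo(void) {
      int a = 0, b = 1, c;

      a = b + 1;   /* S1: writes a                                      */
      c = a * 2;   /* S2: reads a  -> true dependence on S1 (RAW)       */
      a = 5;       /* S3: writes a -> anti-dependence on S2 (WAR) and
                                      output dependence on S1 (WAW)     */
      (void)c;     /* avoid an unused-variable warning                  */
  }

None of S1, S2, S3 can be reordered without changing the value S2 reads or the final value of a.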
Semantically Equivalent Programs
A semantically equivalent program is one that produces the same output and has the same
effect as another program, even if the internal structure or execution order is different. In
parallel and distributed computing, this concept is important for optimization.
For example, a compiler might transform a sequential program into a parallel one, ensuring
that the parallel version is semantically equivalent to the original. This means the parallel
program must respect the same dependencies and produce the same results as the sequential
one.
Dependence
A semantically equivalent program maintains the dependences in
the original program.
o As shown in the previous slide, semantically equivalent programs can
have very different data locality
  Segment 1:    Segment 2:    Segment 3:    Segment 4:
  A=1           B=2           A=1           B=A+1
  B=2           D=4           B=A+1         A=1
  C=3           A=1           C=B+1         C=B+1
  D=4           C=3           D=C+1         D=C+1
Which two code segments are semantically equivalent?
Dependence
Two instructions (tasks) have a dependence when they operate on the same
data and at least one of the operations is a write.
The order of two instructions can be swapped when they do not have
dependence.
Three types of data dependence:
• True dependence: Write X-Read X (RAW)
• Output dependence: Write X – Write X (WAW)
• Anti dependence: Read X – Write X (WAR)
• What about RAR? (Read after Read)
Loop optimizations – making higher performance loops
Loop optimization is crucial for enhancing the performance of your code, especially in high-performance applications where loops
often consume the majority of execution time. Here are some common loop optimization techniques:
1.Loop Unrolling: This technique involves expanding the loop body to decrease the overhead of loop control. For
example, instead of iterating a loop 100 times, you can unroll it to perform multiple iterations worth of work in a
single loop iteration.
2.Loop Fusion: Combining adjacent loops that iterate over the same range can reduce loop overhead and improve
cache performance.
3.Loop Fission: Also known as loop distribution, this technique splits a loop into multiple loops, each handling a
part of the original loop’s body. This can improve data locality and parallelism.
4.Loop Interchange: Swapping the order of nested loops can enhance cache performance by improving data
locality.
5.Loop Invariant Code Motion: Moving computations that produce the same result in every iteration outside the
loop can significantly reduce the amount of work done inside the loop (a short sketch follows this list).
6.Blocking (or Tiling): This technique breaks down a loop into smaller blocks or tiles to improve cache
performance by ensuring that data used in the loop fits into the cache.
7.Parallelization: Utilizing multi-threading or vectorization to execute loop iterations in parallel can greatly
improve performance on multi-core processors.
8.Memory Reference Optimization: Reducing the number of memory accesses within a loop can enhance
performance, especially for loops that involve large data sets.
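A quick sketch of technique 5 (loop-invariant code motion), using made-up function and variable names:

  /* Before: the product x*y is recomputed in every iteration. */
  void scale_before(double *a, int n, double x, double y) {
      for (int i = 0; i < n; i++)
          a[i] = a[i] * x * y;
  }

  /* After: the loop-invariant product is hoisted out of the loop. */
  void scale_after(double *a, int n, double x, double y) {
      double xy = x * y;               /* computed once */
      for (int i = 0; i < n; i++)
          a[i] = a[i] * xy;
  }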
Loop optimizations – making higher performance loops
Why loops?
o Most of the program time is spent in loops.
o Most parallelism in a program is inside loops.
We focus only on affine loops with affine array accesses.
o Non-affine loops are much harder to analyze and manipulate.
for (i =0; i<n; i++)
for (j=0; j<n; j++)
m[i][j] = 0;
An example affine loop
Affine expression and affine array accesses
Let v_1, v_2, …, v_n be a set of variables. An affine expression is a linear expression of the variables:

    c_1*v_1 + c_2*v_2 + ⋯ + c_n*v_n + c_0

Here, c_0, c_1, …, c_n are (integer) constants.
Affine array accesses: array indexes are affine expressions of loop indices
o Let i_1, i_2, …, i_n be loop index variables. Array indexes are affine expressions of the loop index
  variables:

    c_1*i_1 + c_2*i_2 + ⋯ + c_n*i_n + c_0

  Here, c_0, c_1, …, c_n are integer constants.
o Cover common array access patterns: a[i], a[j], a[i+1], a[i+j], a[2*i], etc.
o No a[b[j]] or a[i*i] -- No good static techniques for loops with such array references (irregular
problems).
Affine loops
All loop bounds have to be affine expressions of the loop index variables of the
containing loops.
All array references are affine array accesses.
No pointers, no aliases of the arrays.
  /* affine */            /* affine */            /* not affine */        /* not affine */
  for (i=0; i<n; i++)     for (i=0; i<n; i++)     for (i=0; i<n; i++)     for (i=0; i<n; i++)
    for (j=0; j<n; j++)     for (j=0; j<n; j++)     for (j=0; j<n; j++)     for (j=0; j<n; j++)
      a[i+j*10] = 0;          a[i][j+1] = 0;          a[i*j] = 0;             a[b[i][j]] = 0;
Affine loops
Each loop has three components: a lower bound L, an upper bound U, and a step S.

  for (int i = L; i < U; i += S) {
    // loop body
  }

All loop bounds are affine expressions of the loop index variables of the containing loops.
For a nest of n loops, an iteration of the innermost loop can be represented by an
iteration vector i⃗ = (i_1, i_2, …, i_n), where i_n is the index of the innermost loop.
Iteration Number and Vector
For an n-level loop nest: i⃗ = (i_1, i_2, …, i_n)

  for (int i_1 = L_1; i_1 < U_1; i_1 += S_1) {
    for (int i_2 = L_2; i_2 < U_2; i_2 += S_2) {
      ……
      for (int i_n = L_n; i_n < U_n; i_n += S_n) {
        // innermost loop body
      }
    }
  }
Iteration Number and Vector
If the statements in iteration i⃗ = (i_1, i_2, …, i_n) are executed before the statements in
iteration j⃗ = (j_1, j_2, …, j_n), we say i⃗ < j⃗.
To compare i⃗ = (i_1, i_2, …, i_n) and j⃗ = (j_1, j_2, …, j_n):
o if i_1 < j_1, then i⃗ < j⃗
o if i_1 > j_1, then i⃗ > j⃗
o if i_1 = j_1, then recursively compare (i_2, …, i_n) and (j_2, …, j_n)
Example:
o compare i⃗ = [10] and j⃗ = [20]
o compare i⃗ = [4, 10] and j⃗ = [0, 20]
o compare i⃗ = [10, 30, 20] and j⃗ = [10, 20, 30]
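This lexicographic comparison can be written directly as code; a small helper (an illustrative sketch, not part of the lecture code):

  /* Compare two iteration vectors of length n lexicographically.
     Returns -1 if i < j, 0 if they are equal, and 1 if i > j. */
  int iter_cmp(const int *i, const int *j, int n) {
      for (int k = 0; k < n; k++) {
          if (i[k] < j[k]) return -1;
          if (i[k] > j[k]) return 1;
          /* i[k] == j[k]: move on to the next position */
      }
      return 0;
  }

For example, iter_cmp applied to [4, 10] and [0, 20] returns 1 (so [4, 10] > [0, 20]), and applied to [10, 30, 20] and [10, 20, 30] it also returns 1.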
Dependence in the loop and loop parallelism
for (i=0; i<n; i++)
c[i] = a[i] + b[i];
For some loops like the one above, executing loop iterations in any order yields the same
results (semantically equivalent).
o Different iterations access different data (no dependence across iterations)
Such a loop is said to be parallelizable. One can execute the loop on n processors by
giving each processor an ID = 0, 1, …, n-1; each processor executes the same code
c[ID] = a[ID] + b[ID].
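One way to express this (a sketch using an OpenMP parallel-for, which the slides do not require; compile with -fopenmp):

  /* Every iteration touches different elements, so the iterations can run
     in any order, or in parallel across the available processors. */
  void vec_add(double *c, const double *a, const double *b, int n) {
      #pragma omp parallel for
      for (int i = 0; i < n; i++)
          c[i] = a[i] + b[i];
  }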
Dependence in a loop
for (i=1; i<n; i++)
a[i] = a[i-1] + b[i];
For some loops like the one above, executing loop iterations in different order will yield
different results.
o Iteration [1]: a[1] = a[0] + b[1]
o Iteration [2]: a[2] = a[1] + b[2]
o Iteration [3]: a[3] = a[2] + b[3]
There is a true dependence across loop iterations. The iterations must be executed in the
original order.
Dependence in a loop
for (i=1; i<n; i++)
S: a[i] = a[i-1] + b[i];
Statement S1 in iteration i⃗ and S2 in iteration j⃗ have a dependence when
1. S1 in iteration i⃗ and S2 in iteration j⃗ access a common memory location, and
2. one of the accesses is a write.
In the code above, statement S in iteration i⃗ = [i] writes to a[i], and S in iteration j⃗ = [i+1]
reads from a[i]. Thus, statement S in iteration i⃗ = [i] and S in iteration j⃗ = [i+1] have a
dependence.
Loop carried dependence
• A loop-carried dependence is a dependence between statements in different iterations of
a loop.
• Otherwise (the dependence is between statements in the same iteration), we call it a
loop-independent dependence.
• Loop-carried dependence must be preserved when we restructure the loop (change the
order in the iteration space).
• Loop-carried dependence is also what prevents loops from being parallelized.
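A small illustrative example (the statements are made up) showing both kinds of dependence in one loop:

  void carried_vs_independent(double *a, double *b, double *c, int n) {
      for (int i = 1; i < n; i++) {
          a[i] = b[i] + 1.0;   /* S1: writes a[i]                                */
          c[i] = a[i] * 2.0;   /* S2: reads the a[i] written by S1 in the same
                                  iteration -> loop-independent dependence       */
          b[i] = a[i - 1];     /* S3: reads the a[i-1] written by S1 in the
                                  previous iteration -> loop-carried dependence  */
      }
  }

The S1-to-S2 dependence is preserved by any reordering of the iterations; the loop-carried S1-to-S3 dependence is what forbids reordering them arbitrarily.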
Data dependence analysis
• Data dependence analysis (by the compiler or human) is to compute the set of statement
instances that are dependent.
• This is an important step for the compiler (or human) to restructure or parallelize the
loops.
• The dependence is often represented as dependence distance vectors or dependence
direction vectors.
Distance vector and direction vector
• Let S1 in iteration i⃗ and S2 in iteration j⃗ be two statements, and let there be a dependence
between them. The dependence distance vector d⃗(i⃗, j⃗) of length n is defined as:

    d⃗(i⃗, j⃗) = j⃗ − i⃗

Let v⃗ be a vector and v⃗(k) be the kth element of the vector; then

    d⃗(i⃗, j⃗)(k) = j⃗(k) − i⃗(k)

Distance vector and direction vector
• Let D⃗(i⃗, j⃗) be the direction vector:

    D⃗(i⃗, j⃗)(k) =  "<"  if d⃗(i⃗, j⃗)(k) > 0
                   "="  if d⃗(i⃗, j⃗)(k) = 0
                   ">"  if d⃗(i⃗, j⃗)(k) < 0
Distance vector and direction vector example
for (int i = 0; i<n; i++)
for (int j=0; j<n; j++)
for (int k=0; k<n; k++) {
S1: a[i+1][j][k-2] = a[i][j][k] + 1;
S2: a[i+1][j-1][k+2] = c[i][j][k] + d[i][j][k];
}
Get each pair of array references for potential dependence
Five array references in the loop. Arrays c and d are used only once – no dependence.
Array a is used three times. We need to analyze the dependence between three pairs in the program:
a[i+1][j][k-2] and a[i][j][k], a[i+1][j][k-2] and a[i+1][j-1][k+2], and a[i+1][j-1][k+2] and a[i][j][k]. Each pair results
in a distance vector (or direction vector).
Distance vector and direction vector example
a[i+1][j][k-2] and a[i][j][k]:
o a[i_1+1][j_1][k_1-2] and a[i_2][j_2][k_2] name the same memory location when i_1+1 = i_2, j_1 = j_2, and k_1-2 = k_2.
o Hence, distance vector d⃗ = j⃗ − i⃗ = [i_2, j_2, k_2] − [i_1, j_1, k_1] = [i_2−i_1, j_2−j_1, k_2−k_1] = [1, 0, -2].
o Direction vector = [<, =, >].
a[i+1][j][k-2] and a[i+1][j-1][k+2]?
a[i+1][j-1][k+2] and a[i][j][k]?
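For subscripts of the form loop-index-plus-constant, the distance vector is just the difference of the constant offsets of the two references, so the remaining two pairs can be checked with a tiny helper (an illustrative sketch, not part of the lecture code):

  /* d = offsets of the first reference minus offsets of the second one. */
  void distance_vector(const int off_a[3], const int off_b[3], int d[3]) {
      for (int k = 0; k < 3; k++)
          d[k] = off_a[k] - off_b[k];
  }

  /* a[i+1][j][k-2] vs a[i+1][j-1][k+2]: {1,0,-2} - {1,-1,2} -> {0, 1, -4}
     a[i+1][j-1][k+2] vs a[i][j][k]:     {1,-1,2} - {0,0,0}  -> {1, -1, 2} */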
Reordering transformation
A reordering transformation preserves a dependence if it preserves the relative execution
order of the source and sink of the dependence.
Any reordering transformation that preserves every dependence in a program preserves
the semantics of the program.
Loop optimization: find a reordering transformation that preserves all dependences in
the program while achieving better locality!
o Find all dependence distance/direction vectors in the loops and make sure that the
transformation preserves all of them.
Scalar replacement of array elements
Scalar replacement allows registers to be allocated to the scalar, which reduces memory references.
o Registers are almost never allocated to array elements.

  for (i=0; i<N; i++)
    for (j=0; j<N; j++)
      for (k=0; k<N; k++)
        c(i, j) = c(i, j) + a(i, k) * b(k, j);

After scalar replacement of c(i, j):

  for (i=0; i<N; i++)
    for (j=0; j<N; j++) {
      ct = c(i, j);
      for (k=0; k<N; k++)
        ct = ct + a(i, k) * b(k, j);
      c(i, j) = ct;
    }
Loop normalization
Make the loop index variable start from 1 (or 0) with a step of 1.
Loop normalization does not do much by itself, but it simplifies the iteration space and makes
it easier to manipulate, which enables other optimizations.

  for (i=a; i<b; i+=c) {
    // loop body
  }

becomes

  for (ii=1; ii<=(b-a+c-1)/c; ii++) {
    i = a + (ii-1) * c;
    // loop body
  }
Loop transformations
Change the shape and direction of loop iterations
o Change the access pattern
Increase data reuse (locality)
Reduce overheads
o Valid transformations need to maintain all loop carried dependence.
o Unimodular transformations
Loop interchange, loop permutation, loop reversal, loop skewing, and many others
o Loop fusion and distribution
o Loop tiling
o Loop unrolling
Uni-modular transformations: loop interchange
for (i=0; i<n; i++) for (j=0; j<n; j++)
for (j=0; j < n; j++) for (i=0; i < n; i++)
a(i, j) = a(i-1, j) + 1; a(i,j) = a(i-1, j) + 1;
Distance vector d⃗ = [1, 0]

    U = | 0  1 |      d = | 1 |      Ud = | 0 |
        | 1  0 |          | 0 |           | 1 |

Ud ≥ 0 (lexicographically non-negative), so the transformation for this loop nest is legal.
Uni-modular transformations: loop permutation
for (i=0; i<n; i++) for (k=0; k<n; k++)
for (j=0; j < n; j++) for (l=0; l < n; l++)
for (k=0; k < n; k++) for (i=0; i < n; i++)
for (l=0; l<n; l++) for (j=0; j<n; j++)
// loop body // loop body
    U = | 0 0 1 0 |          | i_1 |   | i_3 |
        | 0 0 0 1 |      U * | i_2 | = | i_4 |
        | 1 0 0 0 |          | i_3 |   | i_1 |
        | 0 1 0 0 |          | i_4 |   | i_2 |
Uni-modular transformations: loop permutation

  for (i=0; i<n; i++)              for (k=0; k<n; k++)
    for (j=0; j < n; j++)            for (l=0; l < n; l++)
      for (k=0; k < n; k++)            for (i=0; i < n; i++)
        for (l=0; l<n; l++)              for (j=0; j<n; j++)
          // loop body                     // loop body

How to derive U? We need a U such that

    | u_11 u_12 u_13 u_14 |   | i_1 |   | i_3 |
    | u_21 u_22 u_23 u_24 | * | i_2 | = | i_4 |
    | u_31 u_32 u_33 u_34 |   | i_3 |   | i_1 |
    | u_41 u_42 u_43 u_44 |   | i_4 |   | i_2 |

Hence, u_11*i_1 + u_12*i_2 + u_13*i_3 + u_14*i_4 = i_3 ⇒ u_13 = 1 and u_11 = u_12 = u_14 = 0.
Repeating the argument for the other rows gives

    U = | 0 0 1 0 |
        | 0 0 0 1 |
        | 1 0 0 0 |
        | 0 1 0 0 |
Is this legal? The distance vectors for the loop are [1, 0, -2], [1, -1, 2], and [0, 1, -4].

  for (int i = 0; i<n; i++)
    for (int j=0; j<n; j++)
      for (int k=0; k<n; k++) {
  S1:     a[i+1][j][k-2] = a[i][j][k] + 1;
  S2:     a[i+1][j-1][k+2] = c[i][j][k] + d[i][j][k];
      }

  for (int k = 0; k<n; k++)
    for (int i=0; i<n; i++)
      for (int j=0; j<n; j++) {
  S1:     a[i+1][j][k-2] = a[i][j][k] + 1;
  S2:     a[i+1][j-1][k+2] = c[i][j][k] + d[i][j][k];
      }
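A mechanical way to decide (an illustrative sketch, not part of the lecture code): reorder the components of each distance vector the same way the loops are reordered, (i, j, k) → (k, i, j), and test whether every transformed vector is still lexicographically non-negative. Running the sketch shows that [1, 0, -2] and [0, 1, -4] become lexicographically negative, so this interchange is not legal.

  #include <stdio.h>

  /* Returns 1 if the vector d of length n is lexicographically non-negative. */
  static int lex_nonneg(const int *d, int n) {
      for (int k = 0; k < n; k++) {
          if (d[k] > 0) return 1;
          if (d[k] < 0) return 0;
      }
      return 1;   /* all zeros */
  }

  int main(void) {
      int dist[3][3] = { {1, 0, -2}, {1, -1, 2}, {0, 1, -4} };
      for (int v = 0; v < 3; v++) {
          /* permutation (i, j, k) -> (k, i, j) applied to the distance vector */
          int t[3] = { dist[v][2], dist[v][0], dist[v][1] };
          printf("[%d, %d, %d] -> %s\n", t[0], t[1], t[2],
                 lex_nonneg(t, 3) ? "preserved" : "violated");
      }
      return 0;
  }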
Uni-modular transformations: loop reversal

  for (i=0; i<n; i++)              for (i=0; i<n; i++)
    for (j=0; j < n; j++)            for (j=n-1; j >= 0; j--)
      a(i,j) = a(i-1, j) + 1.0;        a(i,j) = a(i-1, j) + 1.0;

    U = | 1  0 |      d = | 1 |      Ud = | 1 |
        | 0 -1 |          | 0 |           | 0 |

Ud ≥ 0, so reversing the inner loop is legal for this nest.
Uni-modular transformations: loop skewing

  for (i=0; i<n; i++)              for (i=0; i<n; i++)
    for (j=0; j < n; j++)            for (j=i; j < i+n; j++)
      a(i) = a(i+j) + 1.0;             a(i) = a(j) + 1.0;

    U = | 1  0 |
        | 1  1 |
Loop fusion
Takes two adjacent loops that have the same iteration space and combines the bodies.
o Legal when there are no flow, anti-, or output dependences in the fused loop.
o Why?
    Increases the loop body and reduces loop overheads
    Increases the chance of instruction scheduling
    May improve locality

  for (i=0; i<n; i++)
    a(i) = 1.0;
  for (j=0; j<n; j++)
    b(j) = 1.0;

becomes

  for (i=0; i<n; i++) {
    a(i) = 1.0;
    b(i) = 1.0;
  }
Loop distribution
Takes one loop and partitions it into two loops.
o Why?
    Reduce the memory trace
    Improve locality
    Increase the chance of instruction scheduling

  for (i=0; i<n; i++) {
    a(i) = 1.0;
    b(i) = a(i);
  }

becomes

  for (i=0; i<n; i++)
    a(i) = 1.0;
  for (i=0; i<n; i++)
    b(i) = a(i);
Loop tiling
Replaces a single loop with two loops.

  for (i=0; i<n; i++) …   becomes   for (i=0; i<n; i+=t) for (ii=i; ii < min(i+t, n); ii++) …

t is called the tile size.
An n-deep loop nest can be changed into an (n+1)-deep up to 2n-deep nest.

  for (i=0; i<n; i++)              for (i=0; i<n; i+=t)
    for (j=0; j<n; j++)              for (ii=i; ii<min(i+t, n); ii++)
      for (k=0; k<n; k++)              for (j=0; j<n; j+=t)
                                         for (jj=j; jj < min(j+t, n); jj++)
                                           for (k=0; k<n; k+=t)
                                             for (kk=k; kk<min(k+t, n); kk++)
Loop tiling
When used with loop interchange, loop tiling creates inner loops with a smaller memory
trace – great for locality.
Loop tiling is one of the most important techniques for optimizing locality.
o Reduces the size of the working set and changes the memory reference pattern.

  for (i=0; i<n; i+=t)                        for (i=0; i<n; i+=t)
    for (ii=i; ii<min(i+t, n); ii++)            for (j=0; j<n; j+=t)
      for (j=0; j<n; j+=t)                        for (k=0; k<n; k+=t)
        for (jj=j; jj < min(j+t, n); jj++)          for (ii=i; ii<min(i+t, n); ii++)
          for (k=0; k<n; k+=t)                        for (jj=j; jj < min(j+t, n); jj++)
            for (kk=k; kk<min(k+t, n); kk++)            for (kk=k; kk<min(k+t, n); kk++)
Loop unrolling
• Reduce control overheads.
• Increase the chance for instruction scheduling.
• A large body may require more resources (registers).
• This can be very effective!

  for (i=0; i<100; i++)
    a(i) = 1.0;

becomes

  for (i=0; i<100; i+=4) {
    a(i) = 1.0;
    a(i+1) = 1.0;
    a(i+2) = 1.0;
    a(i+3) = 1.0;
  }
Loop optimization in action
Optimizing matrix multiply:
for (i=1; i<=N; i++)
for (j=1; j<=N; j++)
for(k=1; k<=N; k++)
c(i, j) = c(i, j) + A(i, k)*B(k, j)
o First look: memory references -- c(i, j), A(i, 1..N), B(1..N, j)
Spatial locality: memory reference stride = 1 is the best
Temporal locality: hard to reuse cache data since the memory trace is too large.
Loop optimization in action
• Initial improvement: increase spatial locality in the inner loop so that references to both A and B
have stride 1.
  – Transpose A before going into this operation (assuming column-major storage).
  – Demonstrate my_mm.c method 1

  Transpose A   /* for all i, j: A'(i, j) = A(j, i) */
  for (i=1; i<=N; i++)
    for (j=1; j<=N; j++)
      for (k=1; k<=N; k++)
        c(i, j) = c(i, j) + A'(k, i)*B(k, j)
Loop optimization in action
c(i, j) is repeatedly referenced in the inner loop: scalar replacement (method 2)

  Transpose A                              Transpose A
  for (i=1; i<=N; i++)                     for (i=1; i<=N; i++)
    for (j=1; j<=N; j++)                     for (j=1; j<=N; j++) {
      for (k=1; k<=N; k++)                     t = c(i, j);
        c(i, j) = c(i, j) + A(k, i)*B(k, j)    for (k=1; k<=N; k++)
                                                 t = t + A(k, i)*B(k, j);
                                               c(i, j) = t;
                                             }
Loop optimization in action
• The inner loop's memory footprint is too large:
  – A(1..N, i), B(1..N, j)
  – Loop tiling + loop interchange
• Memory footprint in the inner loop: A(1..t, i), B(1..t, j)
• Using blocking, one can tune the performance for the memory hierarchy:
  – Innermost loop fits in registers; second innermost loop fits in L2 cache, …
• Method 4

  for (j=1; j<=N; j+=t)
    for (k=1; k<=N; k+=t)
      for (i=1; i<=N; i+=t)
        for (ii=i; ii<=min(i+t-1, N); ii++)
          for (jj=j; jj<=min(j+t-1, N); jj++) {
            ct = c(ii, jj);
            for (kk=k; kk<=min(k+t-1, N); kk++)
              ct = ct + A(kk, ii)*B(kk, jj);
            c(ii, jj) = ct;
          }
Loop optimization in action
• Loop unrolling (method 5), with tile size t = 16 so the kk loop unrolls completely.
• This assumes the loop unrolls evenly; otherwise you need to take care of the boundary condition.

  for (j=1; j<=N; j+=t)
    for (k=1; k<=N; k+=t)
      for (i=1; i<=N; i+=t)
        for (ii=i; ii<=min(i+t-1, N); ii++)
          for (jj=j; jj<=min(j+t-1, N); jj++) {
            ct = c(ii, jj);
            ct = ct + A(k, ii) * B(k, jj);
            ct = ct + A(k+1, ii) * B(k+1, jj);
            ……
            ct = ct + A(k+15, ii) * B(k+15, jj);
            c(ii, jj) = ct;
          }
Loop optimization in action
Instruction scheduling (method 6)
• '+' would have to wait on the result of '*' in a typical processor.
• '*' is often deeply pipelined: feed the pipeline with many '*' operations.
• Effective on older machines; no longer effective on the current linprog.

  for (j=1; j<=N; j+=t)
    for (k=1; k<=N; k+=t)
      for (i=1; i<=N; i+=t)
        for (ii=i; ii<=min(i+t-1, N); ii++)
          for (jj=j; jj<=min(j+t-1, N); jj++) {
            t0 = A(k, ii) * B(k, jj);
            t1 = A(k+1, ii) * B(k+1, jj);
            ……
            t15 = A(k+15, ii) * B(k+15, jj);
            c(ii, jj) = c(ii, jj) + t0 + t1 + … + t15;
          }
Loop optimization in action
• Further locality improvement: block-order storage of A, B, and C (method 7).
• Each block is a 16x16 sub-matrix; blocks are stored contiguously in block row-major order:

    0 1 2
    3 4 5
    6 7 8

  for (j=1; j<=N; j+=t)
    for (k=1; k<=N; k+=t)
      for (i=1; i<=N; i+=t)
        for (ii=i; ii<=min(i+t-1, N); ii++)
          for (jj=j; jj<=min(j+t-1, N); jj++) {
            t0 = A(k, ii) * B(k, jj);
            t1 = A(k+1, ii) * B(k+1, jj);
            ……
            t15 = A(k+15, ii) * B(k+15, jj);
            c(ii, jj) = c(ii, jj) + t0 + t1 + … + t15;
          }
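One possible way to lay a matrix out in block order before running method 7 (a sketch that assumes N is a multiple of the block size and uses a made-up function name):

  #define T 16   /* block (tile) size; each block is a 16x16 sub-matrix */

  /* Copy a row-major N x N matrix src into dst so that every T x T block is
     stored contiguously, with the blocks themselves laid out in row-major
     block order (0, 1, 2, ... as in the diagram above). */
  void to_block_order(int N, const double *src, double *dst) {
      int blocks_per_row = N / T;
      for (int bi = 0; bi < blocks_per_row; bi++)
          for (int bj = 0; bj < blocks_per_row; bj++) {
              double *blk = dst + (bi * blocks_per_row + bj) * T * T;
              for (int ii = 0; ii < T; ii++)
                  for (int jj = 0; jj < T; jj++)
                      blk[ii * T + jj] = src[(bi * T + ii) * N + (bj * T + jj)];
          }
  }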
Summary
Dependence
Loop Optimization
Affine Loops