Algorithms for Parallel Machines
1) Questions that will be answered in the following sections
1 In what way is the design of parallel programs different from the
design of sequential programs?
2 Can we always write a sequential program first and then get it
to run on a parallel machine?
3 Is superlinear speedup achievable?
2) Speedup, Complexity and Cost
Speedup measures the degree of improvement in speed when a
problem is solved on a parallel machine as compared to a
sequential machine: speedup = (sequential execution time) /
(parallel execution time).
Complexity can be considered of two types:
1. Worst-case time complexity: the maximum time taken by a
program to execute over all inputs.
2. Expected time complexity: the average of execution times
over all inputs.
Analogously, we can define worst-case space complexity and
expected space complexity.
3) Histogram Computation
Given an image as an array Image[m][n] of integers in the range
[0, 255], find the histogram of the distribution of pixels over gray
scales in the image.
Sequential algorithm to solve the above problem:
For(i = 0; i < 256; i++)
    Histogram[i] = 0;
For(i = 0; i < m; i++)
    For(j = 0; j < n; j++){
        color = Image[i][j];
        Histogram[color]++;
    }
The complexity of the above program is O(mn).
Let us parallelize the above program in the following way.
Let there be p processes numbered 0 to (p-1) and m rows (m > p) in
the image. Each process is assigned (m/p) rows over which to
compute the histogram. The histogram is shared across the
processes for concurrent update. Hence the kth process executes
the following:
For(i = k*(m/p); i < (k+1)*(m/p); i++)
    For(j = 0; j < n; j++){
        color = Image[i][j];
        Mutex_begin();
        Histogram[color]++;
        Mutex_end();
    }
Still, the above program takes longer than the sequential one to
execute, because every update of the shared histogram serializes
on the mutex.
The better approach is for each process to independently compute
its own histogram, which can be done fully in parallel. After all
processes have done their work, a single process accumulates the
counts in the p separate histograms to produce the final result.
For(i = k*(m/p); i < (k+1)*(m/p); i++)
    For(j = 0; j < n; j++){
        color = Image[i][j];
        Hist[color][k]++;
    }
After all workers finish, a single process merges:
For(i = 0; i < 256; i++)
    For(j = 0; j < p; j++)
        Histogram[i] = Histogram[i] + Hist[i][j];
4) Parallel Reduction
Given a set of n values a0, a1, ..., a(n-1) and an associative
operator ⊕, reduction is the process of computing
a0 ⊕ a1 ⊕ ... ⊕ a(n-1).
Examples of associative operators: addition, multiplication, etc.
The sequential reduction algorithm (for addition) is:
a) Initialise sum = 0. b) Iterate over the n input values, adding
each to sum.
For parallel processing we view reduction as a tree-structured
operation. The input elements are placed at the leaf nodes of a
binary tree. The reduction operator is applied to the children of
each parent node and the result is propagated towards the root of
the tree. When the result at the root node is complete, the work of
the algorithm is over.
Analysis of parallel reduction:
There is a dependency across levels of the reduction: a higher
level cannot proceed until reduction at the lower level is complete.
Hence, after each step, the algorithm needs the equivalent of a
barrier operation. With enough processes, n values are reduced in
O(log n) steps.
5) Quadrature Problems
For a function y = f(x), we want to find the area under the curve.
The range of methods for numerically computing this integral is
referred to as quadrature methods.
E.g.: the trapezoidal rule.
The trapezoidal rule statically subdivides the domain into
uniformly spaced partitions. The integral is estimated by summing
the areas of trapezia approximating the area under the curve.
It is parallelised by allotting non-overlapping ranges of the domain
to different processes. Each process finds the local sum of areas in
its allotted range of the domain, and finally the global sum is
updated.
Problems:
Accuracy depends on the function: since the domain is divided
statically, regions where the function varies sharply are not
sampled any more finely than the rest, so accuracy suffers there.
Refining the partition everywhere to compensate increases the
complexity.
Solution 1: Adaptive Quadrature Algorithm
It uses the divide and conquer method.
To find the integral of a function y = f(x) over the range [a, b],
we take the following steps:
1. Let A = area of the trapezium over the end points [a, b],
B = area of the trapezium over [a, (a+b)/2],
C = area of the trapezium over [(a+b)/2, b].
2. Sub-divide the range [a, b] if |A - (B+C)| >= ε, for some chosen
tolerance ε > 0.
3. Assign the responsibility of one sub-domain to another process
and take the other part for local processing.
4. If the condition in step 2 does not hold, compute the quadrature
in the same process and update the global result.
Solution 2: Self-Scheduling Implementation
This scheme creates a group of worker processes; the workers
schedule themselves dynamically, picking up available work from a
shared global stack. When there is no more work to be done, the
workers terminate. This implementation strategy is called
self-scheduling.
The algorithm is:
The parent process:
1) Pushes the initial task on the shared stack.
2) Creates the worker processes.
Each worker process:
1) Pops a task from the stack and starts working on it.
2) If the task cannot be solved immediately, partitions it further,
pushes the subtasks on the stack, and goes back to step 1.
3) If there is no space on the stack, the process must solve the
task sequentially and update the global result.
4) All processes terminate when there is no more work to be done.
6) Matrix Multiplication
Sequential algorithm to multiply an l×m matrix A by an m×n
matrix B into an l×n matrix C:
For(i = 0; i < l; i++)
    For(j = 0; j < n; j++){
        C[i][j] = 0;
        For(k = 0; k < m; k++)
            C[i][j] = C[i][j] + A[i][k]*B[k][j];
    }
Its complexity is O(lmn).