Parallel Algorithms
Parallel algorithms
 TOPICS:
 Sorting
 Merging
  List ranking in PRAMs and
 Applications.
What is a parallel approach?
  Imagine you needed to find a lost
child in the woods. Even in a small
area, searching by yourself would be
very time consuming.     Now if you
gathered some friends and family to
help you, you could cover the woods
in much faster manner…
 Parallel Architectures
Single Instruction Stream , Multiple
 Data Stream (SIMD)
 One global control unit connected
  to each processor
Multiple     Instruction   Stream,
 Multiple Data Stream (MIMD)
 Each processor has a local control
  unit
 Parallel Architectures
Shared-Address-Space
 Each processor has access to main memory
 Processors may be given a small private memory for local
   variables
Message-Passing
 Each processor is given its own block of memory
 Processors communicate by passing messages directly
   instead of modifying memory locations
 Interconnection Networks
Static
 Each processor is hard-wired to every
  other processor
  Completely
  Connected     Star-       Bounded-Degree
                Connected     (Degree 4)
Dynamic
 Processors are connected to a series of
  switches
Parallel Algorithms
 A parallel algorithm is an algorithm that has been
 specifically written for execution on a computer with
 two or more processing units.
 Can be run on computers with single
 processor
  (multiple functional units, pipelined
 functional units, pipelined memory
 systems).
 When designing algorithm, take into
 account the cost of communication, the
 number of processors (efficiency)
Parallel Algorithms
Parallel: perform more than one operation at a time.
PRAM model: Parallel Random Access Model.
   p0               Multiple processors connected to a shared
   p1     Shared    memory. Each processor access any
          memory    location in unit time. All processors can
                    access memory in parallel. All processors
  pn-1              can perform operations in parallel.
The PRAM Model
Parallel Random Access Machine
Number of processors is not limited
All processors have local memory
One global memory accessible to all
 processors
Processors must read and write global
 memory
 The PRAM Model
Parallel Random Access Machine
 Theoretical model for parallel
  machines
 P processors with uniform access to a
  large memory bank
 MIMD
 UMA (uniform memory access) – Equal
  memory access time for any processor
  to any address
Memory access protocols
Exclusive-Read    Exclusive-Write
Exclusive-Read    Concurrent-
 Write
Concurrent-Read   Exclusive-Write
Concurrent-Read   Concurrent-Write
Example: Finding the Largest key in an array
     In order to find the largest key in an
       array of size n, at least               n-1
       comparisons must be done.
     A parallel version of this algorithm
       will still perform the same amount
       of compares, but by doing them in
       parallel it will finish sooner.
Example: Finding the Largest key in an array
Assume that n is a power of 2 and
 we have n/2 processors executing
 the algorithm in parallel.
Each Processor reads two array
 elements into local variables called
 first and second
It then writes the larger value into
 the first of the array slots that it
 has to read.
Takes log (n) steps for the largest
            2
 key to be placed in to the first slot
 of the array.
Example: Finding the Largest key in an array
                    3     1         7   6      8       1      1      1
                          2                            1      9      3
             P1               P2              P3             P4
Read
             1            3     6       7     1         8     1       1
             2
            First       Second First Second   1
                                              First           3
                                                      Second First    9
                                                                     Second
Write
                1         1         7   6      1       1      1      1
                2         2                    1       1      9      3
Example: Finding the Largest key in an array
              1       1       7       6       1        1       1        1
              2       2                       1        1       9        3
           P1                                 P3
Read
          7               12                  19                   11
Write
          1       1       7       6       1        1       1       1
          2       2                       9        1       9       3
Example: Finding the Largest key in an array
                  1    1   7   6   1       1   1   1
                  2    2           9       1   9   3
Read              P1
              1                        1
              9                        2
Write
                  1    1   7   6   1       1   1   1
                  9    2           9       1   9   3
Example: Merge Sort
      8                1            4                5                2                7            3                6
              P1                            P2                                P3                            P4
          1        8                    4        5                        2        7                    3        6
                               P1                                                          P2
               1           4        5       8                                 2        3        6       7
                                                             P1
                           1        2        3           4        5           6        7        8
Merge Sort Analysis
Number of comparisons
 = 1 + 3 + … + (2i-1) + … + (n-1)
 = ∑i=1.. log2(n)(2i-1) < 2n- log2(n) =
  Θ(n)
We have improved from n × loge(n)
 to n simply by applying the old
 algorithm to parallel computing,
 by altering the algorithm we can
 further improve merge sort to
 (log2(n) )2
O(1) Sorting Algorithm
We assume a CRCW PRAM where concurrent
write is handled with addition :
for(int i=1; i<=n; i++)
{
 for(int j=1; j<=n; j++)
 {
     if(X[i] > X[j])
      Processor Pij stores 1 in memory location
mi
     else
      Processor Pij stores 0 in memory location
mi
 }}
O(1) Sorting Algorithm
                     Unsorted
                     List {4, 2,
                        1, 3}
      P11(4,4)   P12(4,2)        P13(4,1)
      P14(4,3)    0+1+1+1 = 3
      P21(2,4)   P22(2,2)    P (2,1)
                  0+0+1+0 = 1 23
      P24(2,3)
                 0+0+0+0 = 0
      P31(1,4)   P32(1,2)    P33(1,1)
      P34(1,3)
                 0+1+1+0 = 2
    Applications
Computer Graphics Processing
Video Encoding
Accurate weather forecasting
                Potential Speedup
O(n logn) optimal sequential sorting algorithm
Best we can expect based upon a sequential sorting algorithm
using n processors is:
                                   O( n log n)
optimal parallel time complexity              O(log n)
                                        n
   Odd-Even Transposition Sort - example
Parallel time complexity:   Tpar = O(n)   (for P=n)
          Odd-Even Transposition Sort – Example (N >> P)
Each PE gets n/p numbers. First, PEs sort n/p locally, then they run
odd-even trans. algorithm each time doing a merge-split for 2n/p
numbers.         P            P                    P
                   0                1                     2
                    P3
               13 7 12         8 5 4      6 1 3       9 2 10
         Local sort
             7 12 13           4 5 8      1 3 6       2 9 10
         O-E
               4 5 7           8 12 13    1 2 3       6 9 10
         E-O
               4 5 7           1 2 3      8 12 13     6 9 10
         O-E
               1 2 3           4 5 7       6 8 9      10 12 13
         E-O
         SORTED:       1 2 3     4 5 6    7 8 9      10 12 13
Time complexity: Tpar = (Local Sort) + (p merge-splits) +(p exchanges)
                  Tpar = (n/p)log(n/p) + p*(n/p) + p*(n/p) = (n/p)log(n/p) + 2n
Parallelizing Merge sort
        Merge sort - Time complexity
Sequential :
                   n  2  n        log n   n
Tseq   1* n  2 *  2 * 2    2 * log n
                   2    2               2
Tseq   O ( n log n)
Parallel :
         n n n                   n         
T par 2 0  1  2    k   2  1
        2 2 2                   2          
                                    
     2n 20  2  1  2  2    2  log n
T par O( 4n)
               Bitonic Merge sort
                   Bitonic Sequence
A bitonic sequence is defined as a list with no more than one
LOCAL MAXIMUM and no more than one LOCAL
MINIMUM.(Endpoints must be considered - wraparound )
                           Binary Split
1.   Divide the bitonic list into two equal halves.
2.   Compare-Exchange each item on the first half
     with the corresponding item in the second half.
Result:
Two bitonic sequences where the numbers in one sequence are all less
than the numbers in the other sequence.
      Repeated application of binary split
Bitonic list:
     24   20    15   9   4   2   5   8   |   10   11   12   13   22   30   32   45
Result after Binary-split:
     10   11    12   9   4   2   5   8   |   24   20   15   13   22   30   32
45
If you keep applying the BINARY-SPLIT to each half repeatedly, you
will get a SORTED LIST !
  10 11 12 9 . 4 2 5 8 | 24 20 15 13 . 22 30 32
45
   4   2 . 5 8 10 11 . 12 9 | 22 20 . 15 13    24 30 . 32
45
   4 . 2   5 . 8 10 . 9 12 .11 15 . 13 22 . 20 24 . 30 32 .
45
   2   4   5 8    9 10 11 12   13 15 20 22     24 30 32
          Sorting a bitonic sequence
Compare-and-exchange moves smaller numbers of each pair to left
and larger numbers of pair to right.
Given a bitonic sequence,
recursively performing ‘binary split’ will sort the list.
           Sorting an arbitrary sequence
To sort an unordered sequence, sequences are merged into larger
bitonic sequences, starting with pairs of adjacent numbers.
By a compare-and-exchange operation, pairs of adjacent numbers
formed into increasing sequences and decreasing sequences. Pairs
form a bitonic sequence of twice the size of each original
sequences.
By repeating this process, bitonic sequences of larger and larger
lengths obtained.
In the final step, a single bitonic sequence sorted into a single
increasing sequence.
               Number of steps (P=n)
In general, with n = 2k, there are k phases, each of 1, 2, 3,
…, k steps. Hence the total number of steps is:
               i log n
                     log n(log n  1)
T   bitonic
    par        i                         2
                                      O(log n)
                i 1        2
         Here Bitonic sort (for N >> P)
x x x   x x x   x x x   x x x   x x x   x x x   x x x   x x x
  x       x       x       x       x       x       x       x
         Parallel sorting - summary
Computational time complexity using P=n processors
• Odd-even transposition sort -   O(n)
• Parallel merge sort -   O(n)
  unbalanced processor load and Communication
• Bitonic Merge sort -   O(log2n)    (** BEST! **)
  Pointer Jumping –list ranking
 Given a single linked list L with n objects, compute, for each
  object in L, its distance from the end of the list.
 Formally: suppose next is the pointer field
          d[i]= 0      if next[i]=nil
          d[next[i]]+1 if next[i]nil
Serial algorithm: (n).
List ranking –EREW algorithm
      LIST-RANK(L)                   (in O(log2 n) time)
1.   for each processor i, in parallel
2.       do if next[i]=nil
3.          then d[i]0
4.          else d[i]1
5.   while there exists an object i such that next[i]nil
6.       do for each processor i, in parallel
7.          do if next[i]nil
8.              then d[i] d[i]+ d[next[i]]
9.                    next[i] next[next[i]]
      List-ranking –EREW algorithm
          3     4    6     1    0     5
(a)      1     1    1     1    1     0
         3      4    6     1    0        5
(b)      2     2    2     2     1    0
          3     4    6     1    0        5
(c)      4     4    3     2     1    0
          3     4    6     1    0        5
(d)      5     4    3     2     1    0
 List ranking –correctness of EREW algorithm
Loop invariant: for each i, the sum of d
 values in the sub list headed by i is the
 correct distance from i to the end of the
 original list L.
Parallel memory must be synchronized:
 the reads on the right must occur before
 the writes on the left. Moreover, read d[i]
 and then read d[next[i]].
An EREW algorithm: every read and write
 is exclusive.     For an object i,      its
 processor reads d[i], and then its
 precedent processor reads its d[i].
 Writes are all in distinct locations.
LIST ranking EREW algorithm running time
O(log2 n):
 The initialization for loop runs in O(1).
 Each iteration of while loop runs in O(1).
 There are exactly log2 n  iterations:
   Each iteration transforms each list into two
   interleaved lists: one consisting of objects
   in even positions,      and the other odd
   positions. Thus, each iteration double the
   number of lists but halves their lengths.
 The termination test in line 5 runs in O(1).
 Define work = #processors running time.
  O(n log2 n).
             Sorting on Specific Networks
     • Two network structures have received special attention:
                    mesh and hypercube
       Parallel computers have been built with these networks.
     • However, it is of less interest nowadays because networks got
       faster and clusters became a viable option.
     • Besides, network architecture is often hidden from the user.
     • MPI provides libraries for mapping algorithms onto meshes,
       and one can always use a mesh or hypercube algorithm even
       if the underlying architecture is not one of them.
40
                             Shear-sort
Alternate row and column sorting until list is fully sorted.
Alternate row directions to get snake-like sorting:
                                                               41