Parallel Algorithm Main Single
Parallel Algorithm Main Single
Parallel Algorithms
Harald Räcke
http://www14.in.tum.de/lehre/2014WS/pa/
PA
© Harald Räcke                                        1
                         Part I
Organizational Matters
PA
© Harald Räcke                            2
                            Part I
Organizational Matters
ñ   Modul: IN2011
ñ   Name: “Parallel Algorithms”
          “Parallele Algorithmen”
ñ   ECTS: 8 Credit points
ñ   Lectures:
      ñ   4 SWS
          Mon 14:00–16:00 (Room 00.08.038)
          Fri 8:30–10:00 (Room 00.08.038)
ñ   Webpage: http://www14.in.tum.de/lehre/2014WS/pa/
ñ   Required knowledge:
      ñ   IN0001, IN0003
          “Introduction to Informatics 1/2”
          “Einführung in die Informatik 1/2”
      ñ   IN0007
          “Fundamentals of Algorithms and Data Structures”
          “Grundlagen: Algorithmen und Datenstrukturen” (GAD)
      ñ   IN0011
          “Basic Theoretic Informatics”
          “Einführung in die Theoretische Informatik” (THEO)
      ñ   IN0015
          “Discrete Structures”
          “Diskrete Strukturen” (DS)
      ñ   IN0018
          “Discrete Probability Theory”
          “Diskrete Wahrscheinlichkeitstheorie” (DWT)
      ñ   IN2003
          “Efficient Algorithms and Data Structures”
          “Effiziente Algorithmen und Datenstrukturen”
The Lecturer
    ñ    Harald Räcke
    ñ    Email: raecke@in.tum.de
    ñ    Room: 03.09.044
    ñ    Office hours: (per appointment)
    PA
    © Harald Räcke                         5
Tutorials
    ñ    Tutors:
           ñ   Chris Pinkau
           ñ   pinkau@in.tum.de
           ñ   Room: 03.09.037
           ñ   Office hours: Tue 13:00–14:00
    ñ    Room: 03.11.018
    ñ    Time: Tue 14:00–16:00
    PA
    © Harald Räcke                             6
Assignment sheets
    PA
    © Harald Räcke                                      7
Assessment
   ñ    Assignment Sheets:
          ñ   An assignment sheet is usually made available on Monday
              on the module webpage.
          ñ   Solutions have to be handed in in the following week before
              the lecture on Monday.
          ñ   You can hand in your solutions by putting them in the right
              folder in front of room 03.09.019A.
          ñ   Solutions will be discussed in the subsequent tutorial on
              Tuesday.
   PA
   © Harald Räcke                                                           8
1 Contents
    ñ    PRAM algorithms
           ñ   Parallel Models
           ñ   PRAM Model
           ñ   Basic PRAM Algorithms
           ñ   Sorting
           ñ   Lower Bounds
    ñ    Networks of Workstations
           ñ   Offline Permutation Routing on the Mesh
           ñ   Oblivious Routing in the Butterfly
           ñ   Greedy Routing
           ñ   Sorting on the Mesh
           ñ   ASCEND/DESCEND Programs
           ñ   Embeddings between Networks
    PA                             1 Contents
    © Harald Räcke                                       9
2 Literatur
         Tom Leighton:
         Introduction to Parallel Algorithms and Architecture:
         Arrays, Trees, Hypercubes,
         Morgan Kaufmann: San Mateo, CA, 1992
         Joseph JaJa:
         An Introduction to Parallel Algorithms,
         Addison-Wesley: Reading, MA, 1997
         Jeffrey D. Ullman:
         Computational Aspects of VLSI,
         Computer Science Press: Rockville, USA, 1984
         Selim G. Akl.:
         The Design and Analysis of Parallel Algorithms,
         Prentice Hall: Englewood Cliffs, NJ, 1989
    PA                            2 Literatur
    © Harald Räcke                                               10
                   Part II
Foundations
PA
© Harald Räcke                 11
3 Introduction
   Parallel Computing
   A parallel computer is a collection of processors usually of the
   same type, interconnected to allow coordination and exchange
   of data.
   Distributed Systems
   A set of possibly many different types of processors are
   distributed over a larger geographic area.
     PA                        3 Introduction
     © Harald Räcke                                                   12
Cost measures
    ñ    time efficiency
    ñ    space utilization
    ñ    energy consumption
    ñ    programmability
    ñ    ...
    PA                        3 Introduction
    © Harald Räcke                                                13
Cost measures
  How do we evaluate parallel algorithms?
    ñ   time efficiency
    ñ   space utilization
    ñ   energy consumption
    ñ   programmability
    ñ   communication requirement
    ñ   ...
  Problems
    ñ   performance (e.g. runtime) depends on problem size n and
        on number of processors p
    ñ   statements usually only hold for restricted types of parallel
        machine as parallel computers may have vastly different
        characteristics (in particular w.r.t. communication)
Speedup
  Suppose a problem P has sequential complexity T ∗ (n), i.e.,
  there is no algorithm that solves P in time o(T ∗ (n)).
  Definition 1
  The speedup Sp (n) of a parallel algorithm A that requires time
  Tp (n) for solving P with p processors is defined as
                                     T ∗ (n)
                         Sp (n) =            .
                                     Tp (n)
    PA                        3 Introduction
    © Harald Räcke                                                  15
Efficiency
   Definition 2
   The efficiency of a parallel algorithm A that requires time Tp (n)
   when using p processors on a problem of size n is
                                       T1 (n)
                           Ep (n) =           .
                                      pTp (n)
                        T1 (n)
   Note that Ep (n) ≤ pT  ∞ (n)
                                . Hence, the efficiency goes down
   rapidly if p ≥ T1 (n)/T∞ (n).
     PA                         3 Introduction
     © Harald Räcke                                                     16
Parallel Models — Requirements
  Simplicity
  A model should allow to easily analyze various performance
  measures (speed, communication, memory utilization etc.).
  Implementability
  Parallel algorithms developed in a model should be easily
  implementable on a parallel machine.
    PA                        3 Introduction
    © Harald Räcke                                             17
DAG model — computation graph
   PA                           3 Introduction
   © Harald Räcke                                                   18
Example: Addition
+ +
+ + + +
A1 A2 A3 A4 A5 A6 A7 A8
A1 + + + + + + +
A2 A3 A4 A5 A6 A7 A8
    PA                                           3 Introduction
    © Harald Räcke                                                                                 19
DAG model — computation graph
  Definition 3
  A scheduling of a DAG G = (V , E) on p processors is an
  assignment of pairs (tv , pv ) to every internal node v ∈ V , s.t.,
    ñ    pv ∈ {1, . . . , p}; tv ∈ {1, . . . , T }
    ñ    tu = tv ⇒ pu ≠ pv
    ñ    (u, v) ∈ E ⇒ tv ≥ tu + 1
  where a non-internal node x (an input node) has tx = 0.
  T is the length of the schedule.
    PA                                  3 Introduction
    © Harald Räcke                                                      20
DAG model — computation graph
  The parallel complexity of a DAG is defined as
  Clearly,
                         Tp (n) ≥ T∞ (n)
                         Tp (n) ≥ T1 (n)/p
  Lemma 4
  A schedule with length O(T1 (n)/p + T∞ (n)) can be found easily.
  Lemma 5
  Finding an optimal schedule is in general NP-complete.
Note that the DAG model as defined is a non-uniform model of
computation.
An algorithm (e.g. for a RAM) must work for every input size and
must be of finite description length.
  PA                         3 Introduction
  © Harald Räcke                                                    22
PRAM Model
P1 P2 P3 P4 P5 P6 P7 P8
    PA                           3 Introduction
    © Harald Räcke                                               23
PRAM Model
                 Algorithm 1 copy
                  1: if id = 1 then round ← 1
                  2: while round ≤ p and id = round do
                  3:       x[id + 1] ← x[id]
                  4:       round ← round + 1
    PA                             3 Introduction
    © Harald Räcke                                                  24
PRAM Model
    ñ   processors can effectively execute different code because of
        branching according to id
    ñ   however, not arbitrarily; still uniform model of computation
    Algorithm 2 sum
     1: // computes sum of x[1] . . . x[p]
     2: // red part is executed only by processor 1
     3: r ← 1
     4: while 2r ≤ p do
     5:      for id mod 2r = 1 pardo
     6:      // only executed by processors whose id matches
     7:           x[id] = x[id] + x[id + 2r −1 ]
     8:      r ←r +1
     9: return x[1]
Different Types of PRAMs
  Simultaneous Access to Shared Memory:
    ñ    EREW PRAM:
         simultaneous access is not allowed
    ñ    CREW PRAM:
         concurrent read accesses to the same location are allowed;
         write accesses have to be exclusive
    ñ    CRCW PRAM:
         concurrent read and write accesses allowed
           ñ   commom CRCW PRAM
               all processors writing to x[i] must write same value
           ñ   arbitrary CRCW PRAM
               values may be different; an arbitrary processor succeeds
           ñ   priority CRCW PRAM
               values may be different; processor with smallest id succeeds
    PA                             3 Introduction
    © Harald Räcke                                                            26
  Algorithm 3 sum
   1: // computes sum of x[1] . . . x[p]
   2: r ← 1
   3: while 2r ≤ p do
   4:      for id mod 2r = 1 pardo
   5:           x[id] = x[id] + x[id + 2r −1 ]
   6:      r ←r +1
   7: return x[1]
 PA                          3 Introduction
 © Harald Räcke                                  27
PRAM Model — remarks
   ñ    similar to a RAM we either need to restrict the size of values
        that can be stored in registers, or we need to have a
        non-uniform cost model for doing a register manipulation
        (cost for manipulating x[i] is proportional to the bit-length
        of the largest number that is ever being stored in x[i])
          ñ   in this lecture: uniform cost model but we are not
              exploiting the model
   ñ    global shared memory is very unrealistic in practise as
        uniform access to all memory locations does not exist
   ñ    global synchronziation is very unrealistic; in real parallel
        machines a global synchronization is very costly
   ñ    model is good for understanding basic parallel
        mechanisms/techniques but not for algorithm development
   ñ    model is good for lower bounds
   PA                             3 Introduction
   © Harald Räcke                                                        28
Network of Workstations — NOWs
    PA                          3 Introduction
    © Harald Räcke                                                    29
Typical Topologies
                                       111                                           3
             110
                                                 101                     4                      2
                         100
                                       011
             010                                                         6                      8
                                                 001
                         000                                                         7
1, 4 2, 4 3, 4 4, 4
1, 3 2, 3 3, 3 4, 3
      1     2        3         4   5         6     7     8
                          linear array                                 1, 2   2, 2       3, 2   4, 2
1, 1 2, 1 3, 1 4, 1
mesh/grid
    PA                                            3 Introduction
    © Harald Räcke                                                                                         30
Network of Workstations — NOWs
  Computing the sum on a d-dimensional hypercube. Note that
  x[0] . . . x[2d − 1] are stored at the individual nodes.
    Algorithm 4 sum
     1: // computes sum of x[0] . . . x[2d − 1]
     2: r ← 1
     3: while 2r ≤ 2d do // p = 2d
     4:        if id mod 2r = 0 then
     5:              temp ← receive(id + 2r −1 )
     6:              x[id] = x[id] + temp
     7:        if id mod 2r = 2r −1 then
     8:              send(x[id], id − 2r −1 )
     9:        r ←r +1
    10:   if id = 0 then return x[id]
Network of Workstations — NOWs
  Remarks
    ñ    One has to ensure that at any point in time there is at most
         one active communication along a link
    ñ    There also exist synchronized versions of the model, where
         in every round each link can be used once for
         communication
    ñ    In particular the asynchronous model is quite realistic
    ñ    Difficult to develop and analyze algorithms as a lot of low
         level communication has to be dealt with
    ñ    Results only hold for one specific topology and cannot be
         generalized easily
    PA                           3 Introduction
    © Harald Räcke                                                      32
Performance of PRAM algorithms
    PA                         3 Introduction
    © Harald Räcke                                                 33
Performance of PRAM algorithms
  Suppose we have a PRAM algorithm that takes time T (n) and
  work W (n), where work is the total number of operations.
  Idea:
    ñ   Wi (n) denotes operations in parallel step i, 1 ≤ i ≤ T (n)
    ñ   simulate each step in dWi (n)/pe parallel steps
    ñ   then we have
          X             X
           dWi (n)/pe ≤   bWi (n)/pc + 1 ≤ bW (n)/pc + T (n)
                                        
           i               i
Performance of PRAM algorithms
    PA                           3 Introduction
    © Harald Räcke                                                   35
Optimal PRAM algorithms
                      T ∗ (n)                pT ∗ (n)
                                  !                      !
   Sp (n) = Ω                       =Ω                     = Ω(p)
                T ∗ (n)/p + T (n)       T ∗ (n) + pT (n)
    PA                         3 Introduction
    © Harald Räcke                                                  36
This means by improving the time T (n), (while using same
work) we improve the range of p, for which we obtain optimal
speedup.
  PA                       3 Introduction
  © Harald Räcke                                                37
Example
  One can show that any CREW PRAM requires Ω(log n) time to
  compute the sum.
   PA                        3 Introduction
   © Harald Räcke                                              38
Communication Cost
    PA                       3 Introduction
    © Harald Räcke                                                 39
Communication Cost
    Algorithm 5 MatrixMult(A, B, n)
     1: Input: n × n matrix A and B; n = 2k
     2: Output: C = AB
     3: for 1 ≤ i, j, ` ≤ n pardo
     4:          X[i, j, `] ← A[i, `] · B[`, j]
     5: for r ← 1 to log n
     6:      for 1 ≤ i, j ≤ n; ` mod 2r = 1 pardo
     7:           X[i, j, `] ← X[i, j, `] + X[i, j, ` + 2r −1 ]
     8: for 1 ≤ i, j ≤ n pardo
     9:      C[i, j] ← X[i, j, 1]
    PA                              3 Introduction
    © Harald Räcke                                                40
What happens if we have n processors?
Phase 1
pi computes X[i, j, `] = A[i, `] · B[`, j] for all 1 ≤ j, ` ≤ n
n2 time; n2 communication for every processor
Phase 2 (round r)
pi updates X[i, j, `] for all 1 ≤ j ≤ n; 1 ≤ ` mod 2r = 1
O(n · n/2r ) time; no communication
Phase 3
pi writes i-th row into C[i, j]’s.
n time; n communication
  PA                          3 Introduction
  © Harald Räcke                                                  41
Alternative Algorithm
  Split matrix into blocks of size n2/3 × n2/3 .
A1,1 A1,2 A1,3 A1,4 B1,1 B1,2 B1,3 B1,4 C1,1 C1,2 C1,3 C1,4
    A2,1 A2,2 A2,3 A2,4           B2,1   B2,2       B2,3   B2,4       C2,1   C2,2       C2,3   C2,4
            A                 ·                 B                 =                 C
    A3,1 A3,2 A3,3 A3,4           B3,1   B3,2       B3,3   B3,4       C3,1   C3,2       C3,3   C3,4
A4,1 A4,2 A4,3 A4,4 B4,1 B4,2 B4,3 B4,4 C4,1 C4,2 C4,3 C4,4
    PA                        3 Introduction
    © Harald Räcke                                                    43
                     Part III
PRAM Algorithms
PA
© Harald Räcke                     44
Prefix Sum
  input: x[1] . . . x[n]
                                     Pi
  output: s[1] . . . s[n] with s[i] = j=1 x[i] (w.r.t. operator ∗)
z1 z2 z3 z4 z5 z6 z7 z8
                                              ẑ1                                       ẑ2
 time steps
z10
a10
â1 â2
a1 a2 a3 a4 a5 a6 a7 a8
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
                                              x-values
Prefix Sum
  The algorithm uses work O(n) and time O(log n) for solving
  Prefix Sum on an EREW-PRAM with n processors.
It is clearly work-optimal.
  Theorem 6
  On a CREW PRAM a Prefix Sum requires running time Ω(log n)
  regardless of the number of processors.
4 3 7 8 2 1 6 5
          Algorithm 7 ParallelPrefix
           1: for 1 ≤ i ≤ n pardo
           2:      P [i] ← S[i]
           3:      while S[i] ≠ S[S[i]] do
           4:            x[i] ← x[i] ∗ x[S[i]]
           5:            S[i] ← S[S[i]]
           6:      if P [i] ≠ i then x[i] ← x[i] ∗ x[S(i)]
  Definition 7
  Let X = (x1 , . . . , xt ) be a sequence. The rank rank(y : X) of y in
  X is
                        rank(y : X) = |{x ∈ X | x ≤ y}|
  Observation:
  We can assume wlog. that elements in A and B are different.
  Lemma 8
  On a CREW PRAM, Merging can be done in O(log n) time and
  O(n log n) work.
not optimal
     Algorithm 8 GenerateSubproblems
      1: j0 ← 0
      2: jk ← n
      3: for 1 ≤ i ≤ k − 1 pardo
      4:      ji ← rank(bi log n : A)
      5: for 0 ≤ i ≤ k − 1 pardo
      6:      Bi ← (bi log n+1 , . . . , b(i+1) log n )
      7:      Ai ← (aji +1 , . . . , aji+1 )
  Parallelizing the last step gives total work O(n) and time
  O(log n).
  Lemma 9
  On a CRCW PRAM the maximum of n numbers can be computed
  in time O(1) with n2 processors.
proof on board...
  Lemma 10
  On a CRCW PRAM the maximum of n numbers can be computed
  in time O(log log n) with n processors and work O(n log log n).
proof on board...
  Lemma 11
  On a CRCW PRAM the maximum of n numbers can be computed
  in time O(log log n) with n processors and work O(n).
proof on board...
a1 a4
a0 a2 a3 a5 a6
a0 a1 a2 a3 a4 a5 a6 ∞
a1 a4
x5
                          a0             x
                                         a23        a23        x5     aa335 xx
                                                                             a969        aa55 aa66
                     a0        a1   x
                                    a2
                                     3         a2
                                                3         a4
                                                          x5        aa5
                                                                     33   xx
                                                                           a
                                                                           969   aa
                                                                                  ∞4
                                                                                  4    aa
                                                                                        55   aa
                                                                                              66   ∞∞
                                    x3     x5         x9
     ñ    Step 3, works in phases; one phase for every level of the tree
     ñ    Step 4, works in rounds; in each round a different set of
          elements is inserted
   Observation
   We can start with phase i of round r as long as phase i of round
   r − 1 and (of course), phase i − 1 of round r has finished.
   Algorithm 9 BasicColoring
    1: for 1 ≤ i ≤ n pardo
    2:      col(i) ← i
    3:      ki ← smallest bitpos where col(i) and col(S(i)) differ
    4:      col0 (i) ← 2ki + col(i)ki
                 5    4                 v    col    k       col0
             6                           1   0001       1      2
                          15
                                         3   0011       2      4
                                         7   0111       0      1
        8
14 1110 2 5
                               2
                                         2   0010       0      0
   10
                                        15   1111       0      1
                                         4   0100       0      0
                                   14
                                         5   0101       0      1
   11
                                         6   0110       1      3
                                         8   1000       1      2
                               7
                                        10   1010       0      0
        12
                                        11   1011       0      1
                          3
                                        12   1100       0      0
                                         9   1001       2      4
             9
                      1
                 13
                                        13   1101       2      5
4.6 Symmetry Breaking
2(t − 1) + 1
    Algorithm 10 ReColor
     1: for ` ← 5 to 3
     2:          for 1 ≤ i ≤ n pardo
     3:               if col(i) = ` then
     4:               col(i) ← min{{0, 1, 2} \ {col(P [i]), col(S[i])}}
  Lemma 12
  We can color vertices in a ring with three colors in O(log∗ n)
  time and with O(n log∗ n) work.
  Lemma 13
  Given n integers in the range 0, . . . , O(log n), there is an
  algorithm that sorts these numbers in O(log n) time using a
  linear number of operations.
Proof: Exercise!
       Algorithm 11 OptColor
        1: for 1 ≤ i ≤ n pardo
        2:      col(i) ← i
        3: apply BasicColoring once
        4: sort vertices by colors
        5: for ` = 2dlog ne to 3 do
        6:      for all vertices i of color ` pardo
        7:           col(i) ← min{{0, 1, 2} \ {col(P [i]), col(S[i])}}
We can perform Lines 6 and 7 in time O(n` ) only because we sorted before. In general a state-
ment like “for constraint pardo” should only contain a contraint on the id’s of the processors
but not something complicated (like the color) which has to be checked and, hence, induces
work. Because of the sorting we can transform this complicated constraint into a constraint on
just the processor id’s.
  Input:
  A list given by successor pointers;
4 5 7 3 1 2 6 8 9
  Output:
  For every node number of hops to end of the list;
   4            5       7   3       1            2   6   8   9
   8            7       6   5       4            3   2   1   0
  Observation:
  Special case of parallel prefix
       PA                       5 List Ranking
       © Harald Räcke                                            70
List Ranking
   4            5       7   3       1              2   6   8     9
   1            3       1   1       1              2   2   1     0
       PA                         5 List Ranking
       © Harald Räcke                                                71
List Ranking
   4            5       7    3       1              2   6   8    9
   1            3       1    1       1              2   2   1    0
       PA                          5 List Ranking
       © Harald Räcke                                                71
List Ranking
   4            5       7     3       1              2   6   8     9
   4            3       1     2       1              4   2   1     0
       PA                           5 List Ranking
       © Harald Räcke                                                  71
List Ranking
   4            5           7       3         1              2       6       8   9
   4            3           1       2         1              4       2       1   0
                        3       4       2           1            5       6
                        4       1       2           4            1       0
       PA                                   5 List Ranking
       © Harald Räcke                                                                71
List Ranking
   4            5           7       3         1              2       6       8   9
   4            3           1       2         1              4       2       1   0
                        3       4       2           1            5       6
                    12          8       7           5            1       0
       PA                                   5 List Ranking
       © Harald Räcke                                                                71
List Ranking
   4            5           7       3         1              2       6       8   9
   12           3           8       7         1              5       2       1   0
                        3       4       2           1            5       6
                    12          8       7           5            1       0
       6. Map the values back into the larger list. Time: O(1);
          Work: O(n)
       PA                                   5 List Ranking
       © Harald Räcke                                                                71
List Ranking
   4            5           7       3         1              2       6       8   9
   12          11           8       7         6              5       3       1   0
                        3       4       2           1            5       6
                    12          8       7           5            1       0
       PA                                   5 List Ranking
       © Harald Räcke                                                                71
We need O(log log n) shrinking iterations until the size of the
remaining list reaches O(n/ log n).
The work for all shrinking operations is just O(n), as the size of
the list goes down by a constant factor in each round.
List Ranking can be solved in time O(log n log log n) and work
O(n) on an EREW-PRAM.
  PA                         5 List Ranking
  © Harald Räcke                                                     72
Optimal List Ranking
    PA                       5 List Ranking
    © Harald Räcke                                                  73
1   2     3   4   5   6    7   8   9    10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
p1 p1 p2 p2 p3 p4 p4 p5 p5 p3
B1 B2 B3 B4 B5 B6
  Each iteration requires constant time and work O(n/ log n),
  because we just work on one node in every block.
    PA                        5 List Ranking
    © Harald Räcke                                               75
Observations/Remarks:
 ñ    If the p-pointer of a block cannot be advanced without
      leaving the block, the processor responsible for this block
      simply stops working; all other blocks continue.
 ñ    The p-node of a block (the node pi is pointing to) at the
      beginning of a round is either a ruler with a living subject or
      the node will become active during the round.
 ñ    The subject nodes always lie to the left of the p-node of the
      respective block (if it exists).
Measure of Progress:
 ñ    a ruler will delete a subject
 ñ    an active node either
        ñ   becomes a ruler (with a subject)
        ñ   becomes a subject
        ñ   is isolated and therefore gets deleted
 PA                             5 List Ranking
 © Harald Räcke                                                         76
Analysis
  Definition 15
  The weight of the i-th node in a block is
(1 − q)i
                     1
  with q = log log n , where the node-numbering starts from 0.
  Hence, a block has nodes {0, . . . , log n − 1}.
    PA                        5 List Ranking
    © Harald Räcke                                                   77
Definition of Rulers
   Properties:
     ñ   A ruler should have at most log log n subjects.
     ñ   The weight of a ruler should be at most the weight of any of
         its subjects.
     ñ   Each ruler must have at least one subject.
     ñ   We must be able to remove the next subject in constant
         time.
     ñ   We need to make the ruler/subject decision in constant
         time.
    PA                           5 List Ranking
    © Harald Räcke                                                      78
Given a sublist of active nodes.
Color the sublist with O(log log n) colors. Take the local minima
w.r.t. this coloring.
  PA                           5 List Ranking
  © Harald Räcke                                                    79
Now we change the ruler definition.
We make all local minima w.r.t. the weight function into a ruler;
ties are broken according to block-id (so that comparing weights
gives a strict inequality).
A ruler gets as subjects the nodes left of it until the next local
maximum (or the start of the chain) (including the local
maximum) and the nodes right of it until the next local
maximum (or the end of the chain) (excluding the local
maximum).
In case the first node is a ruler the above definition could leave it
without a subject. We use constant time to fix this in some
arbitrary manner
  PA                          5 List Ranking
  © Harald Räcke                                                        80
               1
Set q =    log log n .
The total weight of a block is at most 1/q and the total weight of
                        n
all items is at most q log n .
to show:
After O(log n) iterations the weight is at most
(n/ log n)(1 − q)log n
  PA                         5 List Ranking
  © Harald Räcke                                                     81
In every iteration the weight drops by a factor of
(1 − q/4) .
  PA                         5 List Ranking
  © Harald Räcke                                     82
We consider subject nodes to just have half their weight.
  PA                         5 List Ranking
  © Harald Räcke                                                   83
The weight is reduced because
  ñ   An isolated node is removed.
  ñ   A node is labelled as ruler, and the corresponding subjects
      reduce their weight by a factor of 1/2.
  ñ   A node is a ruler and deletes one of its subjects.
 PA                           5 List Ranking
 © Harald Räcke                                                     84
Each p-node is responsible for some other nodes; it has to
generate a weight reduction large enough so that the weight of
all nodes it is responsible for decreases by the desired factor.
A ruler is responsible for all nodes that come after it in its block
and for all its subjects.
  PA                          5 List Ranking
  © Harald Räcke                                                       85
Case 1: Isolated Node
  Suppose we delete an isolated node v that is the i-th node in its
  block.
                                            (1 − q)j
                                     X
i≤j<log n
                             (1 − q)j ≤ (1 − q)                (1 − q)j
                        X                                 X
i<j<log n i≤j<log n
    PA                                5 List Ranking
    © Harald Räcke                                                        86
Case 2: Creating Subjects
   Suppose we generate a ruler with at least one subject.
Initial weight:
            k                             k                 k
                                       1 X               2 X
                          (1 − q)` ≤         (1 − q)ij ≤      (1 − q)ij
            X        X
      Q=
           j=1 ij ≤`<log n
                                       q j=1             q j=2
New weight:
                              k
                           1 X                  q
                      0
                     Q =Q−      (1 − q)ij ≤ (1 − )Q
                           2 j=2                4
Case 3: Removing Subjects
  weight of ruler: (1 − q)i1 ; weight of subjects: (1 − q)ij , 2 ≤ j ≤ k
  Initial weight:
                                                   k
                                                1 X
                                   (1 − q)` +        (1 − q)ij
                             X
                    Q=
                                                2 j=2
                         i1 ≤`<log n
                      1            k
                     ≤  (1 − q)i1 + (1 − q)i2
                      q            2
                      1             1
                     ≤ (1 − q)i2 +    (1 − q)i2
                      q            2q
  New weight:
                                   1                 q
                     Q0 = Q −        (1 − q)i2 ≤ (1 − )Q
                                   2                 3
After s iterations the weight is at most
                      n         q s !   n
                                 
                             1−     ≤       (1 − q)log n
                   q log n      4     log n
  PA                             5 List Ranking
  © Harald Räcke                                                      89
Tree Algorithms
1 2 3 4
2 1 5 6 7
                           3   1
           5           3
                           4   1
6 2 1 5 2
                           6   2
   8       7           4
7 2 8 9
       9
                           8   7
                           9   7
Euler Circuits
     PA                          6 Tree Algorithms
     © Harald Räcke                                              91
Euler Circuits
   Lemma 16
   An Euler circuit can be computed in constant time O(1) with
   O(n) operations.
     PA                      6 Tree Algorithms
     © Harald Räcke                                              92
Euler Circuits — Applications
   Rooting a tree
     ñ   split the Euler tour at node r
     ñ   this gives a list on the set of directed edges (Euler path)
     ñ   assign x[e] = 1 for every edge;
     ñ   perform parallel prefix; let s[·] be the result array
     ñ   if s[(u, v)] < s[(v, u)] then u is parent of v;
    PA                          6 Tree Algorithms
    © Harald Räcke                                                     93
Euler Circuits — Applications
   Postorder Numbering
    ñ    split the Euler tour at node r
    ñ    this gives a list on the set of directed edges (Euler path)
    ñ    assign x[e] = 1 for every edge (v, parent(v))
    ñ    assign x[e] = 0 for every edge (parent(v), v)
    ñ    perform parallel prefix
    ñ    post(v) = s[(v, parent(v))]; post(r ) = n
    PA                          6 Tree Algorithms
    © Harald Räcke                                                     94
Euler Circuits — Applications
   Level of nodes
     ñ   split the Euler tour at node r
     ñ   this gives a list on the set of directed edges (Euler path)
     ñ   assign x[e] = −1 for every edge (v, parent(v))
     ñ   assign x[e] = 1 for every edge (parent(v), v)
     ñ   perform parallel prefix
     ñ   level(v) = s[(parent(v), v)]; level(r ) = 0
    PA                          6 Tree Algorithms
    © Harald Räcke                                                     95
Euler Circuits — Applications
   Number of descendants
    ñ    split the Euler tour at node r
    ñ    this gives a list on the set of directed edges (Euler path)
    ñ    assign x[e] = 0 for every edge (parent(v), v)
    ñ    assign x[e] = 1 for every edge (v, parent(v)), v ≠ r
    ñ    perform parallel prefix
    ñ    size(v) = s[(v, parent(v))] − s[(parent(v), v)]
    PA                          6 Tree Algorithms
    © Harald Räcke                                                     96
Rake Operation
                             3
                             2                               5
1 8 9 6 8
    PA                               6 Tree Algorithms
    © Harald Räcke                                                   97
We want to apply rake operations to a binary tree T until T just
consists of the root with two children.
Possible Problems:
 1. we could concurrently apply the rake-operation to two
    siblings
 2. we could concurrently apply the rake-operation to two
    leaves u and v such that p(u) and p(v) are connected
By choosing leaves carefully we ensure that none of the above
cases occurs
  PA                       6 Tree Algorithms
  © Harald Räcke                                                   98
Algorithm:
  ñ   label leaves consecutively from left to right (excluding
      left-most and right-most leaf), and store them in an array A
  ñ   for dlog(n + 1)e iterations
        ñ   apply rake to all odd leaves that are left children
        ñ   apply rake operation to remaining odd leaves (odd at start
            of round!!!)
        ñ   A=even leaves
 PA                            6 Tree Algorithms
 © Harald Räcke                                                          99
Observations
 ñ    the rake operation does not change the order of leaves
 ñ    two leaves that are siblings do not perform a rake operation
      in the same round because one is even and one odd at the
      start of the round
 ñ    two leaves that have adjacent parents either have different
      parity (even/odd) or they differ in the type of child
      (left/right)
 PA                         6 Tree Algorithms
 © Harald Räcke                                                      100
Cases, when the left edge btw. p(u) and p(v) is a left-child
edge.
v v
u u
1 2 2
3 4
  PA                       6 Tree Algorithms
  © Harald Räcke                                                   101
Example
17
12 14
9 2 3 15
8 1 13 16
10 11
4 5 6 7
   PA                                6 Tree Algorithms
   © Harald Räcke                                                                            102
ñ    one iteration can be performed in constant time with O(|A|)
     processors, where A is the array of leaves;
ñ    hence, all iterations can be performed in O(log n) time and
     O(n) work;
ñ    the intial parallel prefix also requires time O(log n) and
     work O(n)
PA                          6 Tree Algorithms
© Harald Räcke                                                     103
Evaluating Expressions
  Suppose that we want to evaluate an expression tree, containing
  additions and multiplications.
                                                          +
∗ +
∗ + + ∗
A1 A2 A3 A4 A5 A6 A7 A8
                     A1        +         +            +
                                                      ∗       +            +        +
                                                                                    ∗         +
A2 A3 A4 A5 A6 A7 A8
    PA                                           6 Tree Algorithms
    © Harald Räcke                                                                                 104
We can use the rake-operation to do this quickly.
Invariant:
Let u be internal node with children v and w. Then
  PA                          6 Tree Algorithms
  © Harald Räcke                                                105
Rake Operation
                                                r
                                                ∗
                               w
                               u
                               +                              ∗
                      v
                      x1 x          x3                   x4            x5
                          2
Lemma 17
We can evaluate an arithmetic expression tree in time O(log n)
and work O(n) regardless of the height or depth of the tree.
  PA                       6 Tree Algorithms
  © Harald Räcke                                                    107
Lemma 18
We compute tree functions for arbitrary trees in time O(log n)
and a linear number of operations.
proof on board...
  PA                      6 Tree Algorithms
  © Harald Räcke                                                 108
In the LCA (least common ancestor) problem we are given a tree
and the goal is to design a data-structure that answers
LCA-queries in constant time.
 PA                       6 Tree Algorithms
 © Harald Räcke                                                  109
Least Common Ancestor
  LCAs on complete binary trees (inorder numbering):
                                                        1000
                                                         8
                                   0100                                      1100
                                    4                                        12
                   1          3             5      7            9      11            13          15
                 0001       0011          0101   0111          1001   1011          1101        1111
z1 z2 . . . zi 1 0 . . . 0
    PA                                           6 Tree Algorithms
    © Harald Räcke                                                                                     110
Least Common Ancestor
2 8 9
3 4
5 6 7
nodes 1 2 3 2 4 5 4 6 4 7 4 2 1 8 1 9 1
levels 0 1 2 1 2 3 2 3 2 3 2 1 0 1 0 1 0
   PA                                     6 Tree Algorithms
   © Harald Räcke                                                                                 111
`(v) is index of first appearance of v in node-sequence.
  PA                       6 Tree Algorithms
  © Harald Räcke                                             112
Least Common Ancestor
Lemma 19
   PA                        6 Tree Algorithms
   © Harald Räcke                                          113
Range Minima Problem
    PA                          6 Tree Algorithms
    © Harald Räcke                                                           114
Observation
Given an algorithm for solving the range minima problem in time
T (n) and work W (n) we can obtain an algorithm that solves the
LCA-problem in time O(T (n) + log n) and work O(n + W (n)).
Remark
In the sequential setting the LCA-problem and the range minima
problem are equivalent. This is not necessarily true in the
parallel setting.
  PA                        6 Tree Algorithms
  © Harald Räcke                                                       115
Prefix and Suffix Minima
   Tree with prefix-minima and suffix-minima:
                                                0       0   0       0   0       0   0       0   0       1   1       3   3       3   3       3
                                                6       4   2       2   2       2   1       1   0       0   0       0   0       0   0       0
                        1       1   1       1   1       1   1       6                                                   0       1   1       3   3       3   3       3
                        6       4   2       2   2       2   1       1                                                   0       0   0       0   0       0   0       0
            2       2   2       3                           1       1   1       6                           0       1   1       6                           3       3   3       3
            6       4   2       2                           4       4   1       1                           0       0   0       0                           3       3   3       3
       4    4                   2   3                   4   5                   1   6                   0   5                   1   6                   3   4                   3   3
       6    4                   2   2                   4   4                   1   1                   0   0                   1   1                   3   3                   5   3
6 4 2 3 4 5 1 6 0 5 1 6 3 4 5 3
       PA                                                                       6 Tree Algorithms
       © Harald Räcke                                                                                                                                                                       116
  ñ   Suppose we have an array A of length n = 2k
  ñ   We compute a complete binary tree T with n leaves.
  ñ   Each internal node corresponds to a subsequence of A. It
      contains an array with the prefix and suffix minima of this
      subsequence.
 PA                         6 Tree Algorithms
 © Harald Räcke                                                     117
Lemma 20
We can solve the range minima problem in time O(log n) and
work O(n log n).
 PA                      6 Tree Algorithms
 © Harald Räcke                                              118
Reducing the Work
  For each block Bi compute the minimum xi and its prefix and
  suffix minima.
    PA                           6 Tree Algorithms
    © Harald Räcke                                                     119
Answering a query (`, r):
  ñ   if ` and r are from the same block the data-structure for
      this block gives us the result in constant time
  ñ   if ` and r are from different blocks the result is a minimum
      of three elements:
 PA                           6 Tree Algorithms
 © Harald Räcke                                                      120
Searching
                                           log n
                      logp+1 (n) =
                                        log(p + 1)
  Definition 21
  Let X = (x1 , . . . , xt ) be a sequence. The rank rank(y : X) of y in
  X is
                        rank(y : X) = |{x ∈ X | x ≤ y}|
         j(i) := rank(bi√m : A)
    3. Let Bi = (bi√m+1 , . . . , b(i+1)√m−1 ); and
       Ai = (aj(i)+1 , . . . , aj(i+1) ).
proof on board...
  Lemma 22
  A straightforward parallelization of Mergesort can be
  implemented in time O(log n log log n) and with work O(n log n).
    Algorithm 11 ColeSort()
     1: initialize L0 [v] = Av for leaf nodes; L0 [v] =  otw.
     2: for s ← 1 to 3 · height(T ) do
     3:       for all active nodes v do
     4:            // u and w children of v
     5:            L0s [u] ← sample(Ls−1 [u])
     6:            L0s [w] ← sample(Ls−1 [w])
     7:            Ls [v] ← merge(L0s [u], L0s [w])
                      
                       sample4 (Ls [v])              s ≤ 3 height(v)
                      
                      
     sample(Ls [v]) =   sample2 (Ls [v])              s = 3 height(v) + 1
                       sample (L [v])                s = 3 height(v) + 2
                      
                      
                              1   s
                  0 0 0 1 1 2 2 2 3
                                  0 1
                                    3 3
                                      2 2 3
                                        4 4 2 5
                                          1 4 2 595
                                              4   9 6
                                                  6
                                                  7 9 6
                                                      9 6
                                                        9 6
                                                          8 9
                                                            6 6
                                                              9 9
                                                                6 7 8 8 9 9 9 9 9
                  0 0 0 1 1 2 2 2 3
                                  0 3
                                    1 3
                                      2 4 3
                                        2 4 2 5
                                          1 4 4 5
                                                9 5
                                                  9 6
                                                  6 9 6
                                                    8 9 6
                                                        9 6
                                                        7 8 6
                                                            9 6
                                                              9 6
                                                                9 7 8 8 9 9 9 9 9
         0 2 2 2 3 4 5
                 2 2 2 596
                     4   9 8
                         8 6 6
                           9 9 9
                               8 9 9 9 9                                         0 0 1 1 3
                                                                                         0 1
                                                                                           3 4
                                                                                             3 595
                                                                                             1 4 9 6
                                                                                                 6 9 6
                                                                                                     7 9
                                                                                                       6 6 7 8 9
         0 2 2 2 3 2 5
                 2 4 4 5
                       8 6
                         9 6
                         8 9 6
                           8 9 8
                               9 9 9 9 9                                         0 0 1 1 3
                                                                                         1 3 3 4
                                                                                           1 4 6 5
                                                                                               5 9 6
                                                                                                 6 9 6
                                                                                                     7 6
                                                                                                       9 6 7 8 9
                                                               ss == 17
                                                                     75
                                                                     10
                                                                     13
                                                                     15
                                                                     16
                                                                     18
                                                                      0
                                                                      3
                                                                      6
                                                                      8
                                                                      9
                                                                     11
                                                                     12
                                                                     14
                                                                      1
                                                                      2
                                                                      4
     0 2 3
         2 499 9 9 9                     2 2 5
                                             2 586
                                                 8 8
                                                   6 6 8                     0 1 4
                                                                                 1 696
                                                                                     9 9
                                                                                     7 7 8 9                     0 1 3
                                                                                                                     1 364
                                                                                                                         6 6
                                                                                                                         5 5 6 6
     0 2 3 9 9 9 9 9
         2 4                             2 2 5 5
                                               6 6
                                                 8 6
                                                   8 6 8                     0 1 4 6
                                                                                   8 6
                                                                                     9 7
                                                                                     8 9 8 9                     0 1 3
                                                                                                                     1 3
                                                                                                                       5 4
                                                                                                                         6 5
                                                                                                                         5 6 6 6
499      399       299     099      688      266      566      255      477      099      688      166      155      044      333      666
4 9      3 9       2 9     0 9      6 8      2 6      5 6      2 5      4 7      0 9      6 8      1 6      1 5      0 4      3 3      6 6
4    9   9    3   9    2   0    9   6    8   6    2   6    5   5    2   7    4   0    9   6    8   1    6   5    1   4    0   3    3   6    6
4    9   9    3   9    2   0    9   6    8   6    2   6    5   5    2   7    4   0    9   6    8   1    6   5    1   4    0   3    3   6    6
  Lemma 23
  After round s = 3 height(v), the list Ls [v] is complete.
  Proof:
    ñ    clearly true for leaf nodes
    ñ    suppose it is true for all nodes up to height h;
    ñ    fix a node v on level h + 1 with children u and w
    ñ    L3h [u] and L3h [w] are complete by induction hypothesis
    ñ    further sample(L3h+2 [u]) = L[u] and
         sample(L3h+2 [v]) = L[v]
    ñ    hence in round 3h + 3 node v will merge the complete list
         of its children; after the round L[v] will be complete
  Lemma 24
  The number of elements in lists Ls [v] for active nodes v is at
  most O(n).
proof on board...
  Lemma 26
  L0s [v] is a 4-cover of L0s+1 [v].
  Lemma 27
  If [a, b] with a, b ∈ L0s [v] ∪ {−∞, ∞} intersects (−∞, L0s [v], ∞) in
  k ≥ 2 items, then [a, b] intersects (−∞, L0s+1 , ∞) in at most 2k
  items.
=k
Ls−1 [v] ≤ 4k − 3
p + q ≤ 4k − 1
Ls [v] ≤ 2p + 2q ≤ 8k − 2
Note that the last step holds as long L0s+1 [v] = sample4 (Ls [v]). Otw. Ls−1 [v] has already been
full, and hence, L0s [v], L0s+1 [v], L0s+2 [v] are 4-covers of the complete list L[v], and also 4-covers
of each other.
Merging with a Cover
  Lemma 28
  Given two sorted sequences A and B. Let X be a c-cover of A and
  B for constant c, and let rank(X : A) and rank(X : B) be known.
  Lemma 29
  Given two sorted sequences A and B. Let X be a c-cover of B for
  constant c, and let rank(A : X) and rank(X : B) be known.
  Lemma 30
  Given two sorted sequences A and B. Let X be a c-cover of B for
  constant c, and let rank(A : X) and rank(X : B) be known.
  We can compute
    ñ    rank(L0s+1 [v] : L0s+2 [v])
    ñ    rank(L0s+1 [u] : L0s+1 [w])
    ñ    rank(L0s+1 [w] : L0s+1 [u])
  in constant time and O(|Ls+1 [v]|) operations, where v is the
  parent of u and w.
  PA                        8 Sorting Networks
  © Harald Räcke                                              145
Bitonic Merger
  If we feed a bitonic 0-1 sequence S into the
  network on the right we obtain two bitonic
  sequences ST and SB s.t.
    1. SB ≤ ST (element-wise)                          ST
    2. SB and ST are bitonic
  Proof:
    ñ   assume wlog. S more 1’s than 0’s.          S
  Bitonic Merger Bd
  The bitonic merger Bd
  of dimension d is con-
                               Bd−1
  structed by combining
  two bitonic mergers of
  dimension d − 1.
              0
             Sd−1
             Sd−1
Bitonic Merger: (n = 2d )
  ñ    comparators: C(n) = 2C(n/2) + n/2 ⇒ C(n) = O(n log n).
  ñ    depth: D(n) = D(n/2) + 1 ⇒ D(d) = O(log n).
Bitonic Sorter: (n = 2d )
  ñ    comparators: C(n) = 2C(n/2) + O(n log n) ⇒
       C(n) = O(n log2 n).
  ñ    depth: D(n) = D(n/2) + log n ⇒ D(n) = Θ(log2 n).
  PA                        8 Sorting Networks
  © Harald Räcke                                                149
Odd-Even Merge
  How to merge two sorted sequences?
  A = (a1 , a2 , . . . , an ), B = (b1 , b2 , . . . , bn ), n even.
Let
Then
    PA                                8 Sorting Networks
    © Harald Räcke                                                              150
Odd-Even Merge
Md−1
                 Md−1
Theorem 34
There exists a sorting network with depth O(log n) and
O(n log n) comparators.
 PA                       8 Sorting Networks
 © Harald Räcke                                          152
Parallel Comparison Tree Model
    PA                          9 Lower Bounds
    © Harald Räcke                                                     153
Comparison PRAM
    PA                         9 Lower Bounds
    © Harald Räcke                                                  154
A Lower Bound for Searching
  Theorem 35
  Given a sorted table X of n elements and an element y.
                                     log n
  Searching for y in X requires Ω( log(p+1) ) steps in the parallel
  comparsion tree with parallelism p < n.
    PA                         9 Lower Bounds
    © Harald Räcke                                                    155
A Lower Bound for Maximum
  Theorem 36
  A graph G with m edges and n vertices has an independent set
               n2
  on at least 2m+n vertices.
  base case (n = 1)
    ñ   The only graph with one vertex has m = 0, and an
        independent set of size 1.
   PA                         9 Lower Bounds
   © Harald Räcke                                                156
induction step (1, . . . , n → n + 1)
  ñ   Let G be a graph with n + 1 vertices, and v a node with
      minimum degree (d).
  ñ   Let G0 be the graph after deleting v and its adjacent
      vertices in G.
  ñ   n0 = n − (d + 1)
                 d
  ñ   m0 ≤ m − 2 (d + 1) as we remove d + 1 vertices, each with
      degree at least d
  ñ   In G0 there is an independent set of size ((n0 )2 /(2m0 + n0 )).
  ñ   By adding v we obtain an indepent set of size
                               (n0 )2       n2
                         1+             ≥
                              2m + n
                                0     0   2m + n
A Lower Bound for Maximum
  Theorem 37
  Computing the maximum of n elements in the comparison tree
  requires Ω(log log n) steps whenever the degree of parallelism is
  p ≤ n.
  Theorem 38
  Computing the maximum of n elements requires Ω(log log n)
  steps on the comparison PRAM with n processors.
    PA                       9 Lower Bounds
    © Harald Räcke                                                    158
An adversary can specify the input such that at the end of the
(i + 1)-st step the maximum lies in a set Ci+1 of size si+1 such
that
  ñ    no two elements of Ci+1 have been compared
                     si2
  ñ    si+1 ≥      2p+ci
  PA                        9 Lower Bounds
  © Harald Räcke                                                   159
Theorem 39
The selection problem requires Ω(log n/ log log n) steps on a
comparison PRAM.
  PA                       9 Lower Bounds
  © Harald Räcke                                                160
A Lower Bound for Merging
    PA                            9 Lower Bounds
    © Harald Räcke                                                            161
A Lower Bound for Merging
  Lemma 40
  Suppose we are given a parallel comparison tree with
  parallelism p to solve the (k, s) merging problem. After the first
  step an adversary can specify the input such that an arbitrary
  (k0 , s 0 ) merging problem has to be solved, where
                                    3q
                             k0 =     pk
                                    4s
                                    s     k
                             s0 =
                                    4     p
    PA                        9 Lower Bounds
    © Harald Räcke                                                     162
A Lower Bound for Merging
    PA                         9 Lower Bounds
    © Harald Räcke                                                      163
Choose for every i the diagonal of M i that has most zeros.
We can choose value s.t. elements for the j-th pair along the
diagonal are all smaller than for the (j + 1)-th pair.
  PA                            9 Lower Bounds
  © Harald Räcke                                                              164
How many pairs do we have?
 ñ    there are k` blocks in total
 ñ    there are k · `2 matrix entries in total
 ñ    there are at least k · `2 − p zeros.
 ñ    choosing a random diagonal (same for every matrix M i ) hits
      at least
                         k`2 − p    k`   p
                                 ≥     −
                          2` − 1    2    2`
      zeroes.
 ñ    Choosing ` = d2 p/ke gives
                     p
                                                      s
                   0  3q              s      s    s       k
                  k ≥   pk and s 0 = b c ≥ p    =
                      4               `   4 p/k   4       p
 PA                           9 Lower Bounds
 © Harald Räcke                                                      165
Lemma 41
Let T (k, s, p) be the number of parallel steps required on a
comparison tree to solve the (k, s) merging problem. Then
                                               p
                                     1     log k
                     T (k, p, s) ≥     log     p
                                     4     log ks
  PA                         9 Lower Bounds
  © Harald Räcke                                                166
Induction Step:
Assume that
                                                     p
                                   1      log k0
                   T (k0 , s 0 , p) ≥log      p
                                   4     log k0 s 0
                                             4 p
                                               q
                                   1     log 3 k
                                  ≥ log      16 p
                                   4     log 3 ks
                                                 1       p
                                   1       log k
                                  ≥ log 2      p
                                   4    7 log ks
                                                     p
                                   1    log k
                                  ≥ log     p −1
                                   4    log ks
  PA                            9 Lower Bounds
  © Harald Räcke                                             167
Theorem 42
Merging requires at least Ω(log log n) time on a CRCW PRAM
with n processors.
 PA                       9 Lower Bounds
 © Harald Räcke                                              168
Simulations between PRAMs
  Theorem 43
  We can simulate a p-processor priority CRCW PRAM on a
  p-processor EREW PRAM with slowdown O(log p).
  Theorem 44
  We can simulate a p-processor priority CRCW PRAM on a
  p log p-processor common CRCW PRAM with slowdown O(1).
  Theorem 45
  We can simulate a p-processor priority CRCW PRAM on a
                                                    log p
  p-processor common CRCW PRAM with slowdown O( log log p ).
  Theorem 46
  We can simulate a p-processor priority CRCW PRAM on a
  p-processor arbitrary CRCW PRAM with slowdown O(log log p).
  Ideal PRAM:
    ñ    every processor has unbounded local memory
    ñ    in each step a processor reads a global variable
    ñ    then it does some (unbounded) computation on its local
         memory
    ñ    then it writes a global variable
  Definition 47
  An input index i affects a memory location M at time t on some
  input I if the content of M at time t differs between inputs I and
  I(i) (i-th bit flipped).
  Definition 48
  An input index i affects a processor P at time t on some input I
  if the state of P at time t differs between inputs I and I(i) (i-th
  bit flipped).
  Lemma 49
  If i ∈ K(P , t, I) with t > 1 then either
    ñ    i ∈ K(P , t − 1, I), or
    ñ    P reads a global memory location M on input I at time t,
         and i ∈ L(M, t − 1, I).
  Lemma 50
  If i ∈ L(M, t, I) with t > 1 then either
    ñ    A processor writes into M at time t on input I and
         i ∈ K(P , t, I), or
    ñ    No processor writes into M at time t on input I and
           ñ   either i ∈ L(M, t − 1, I)
           ñ   or a processor P writes into M at time t on input I(i).
Lemma 51
|K(P , t, I)| ≤ kt and |L(M, t, I)| ≤ `t for any t ≥ 0
Hence,
Case 1:
A processor P writes into location M at time t + 1 on input I.
Then,
Fact:
For all pairs us , ut with Pws ≠ Pwt either
us ∈ K(Pwt , t + 1, I(ut )) or ut ∈ K(Pws , t + 1, I(us )).
Otherwise, Pwt and Pws would both write into M at the same
time on input I(us )(ut ).
                                 1
Hence, there must be at least    2 r (r   − kt+1 ) pairs ui , uj with
Pwi ≠ Pwj .
Hence,
                               1
                       |E| ≥     r (r − kt+1 )
                               2
Eigenvalues:
                      1     √            1    √
               λ1 =     (5 + 21) and λ2 = (5 − 21)
                      2                  2
Eigenvectors:
                                       !                            !
                          1                                   1
           v1 =                                and v2 =
                      −(1 − λ1 )                          −(1 − λ2 )
                               1                                   1
                                       !                                  !
           v1 =        3       1√              and v2 =   3        1√
                       2   +   2 21                       2    −   2 21
                1                           1
                       !                            !
v1 =   3        1√         and v2 =   3       √
       2   +    2    21               2   − 12 21
       k0
            !           !
                       0     1
                =         = √ (v1 − v2 )
       `0              1     21
           kt
                !
                       1  t
                          λ1 v1 − λt2 v2
                                         
                    = √
           `t          21
Solving the recurrence gives
                                 λt    λt
                           kt = √ 1 − √ 2
                                  21    21
                         √                √
                      3 + 21 t       −3 + 21 t
                   `t = √     λ1 +      √   λ2
                       2 21           2 21
                 √                    √
with λ1 = 21 (5 + 21) and λ2 = 12 (5 − 21).
C(T ) ≥ n, C(0) = 0
Claim:
C(t) ≤ 6C(t − 1) + 3|P |
                       6T −1
This gives C(T ) ≤       5 3|P |     and hence T = Ω(log n − log |P |).
This means that the number of new indices in any set L(M, t)
(over all M) is at most  X
                        2 |K(P , t)|
                             P
X                                  X                                       X
     max{0, |L(M, t)| − 1} ≤           max{0, |L(M, t − 1)| − 1} + 2           |K(P , t)|
M                                  M                                       P
           X                   X                         X
                |K(P , t)| ≤       |K(P , t − 1)| +           |L(M, t − 1)|
            P                  P                       M∈Jt
Recall
X                                       X                                    X
     max{0, |L(M, t)| − 1} ≤                 max{0, |L(M, t − 1)| − 1} + 2        |K(P , t)|
M                                        M                                    P
Hence,
                           C(t) ≤ 6C(t − 1) + 3|P |
  Theorem 56
  Let f : {0, 1}n → {0, 1} be an arbitrary Boolean function. f can
  be computed in O(1) time on a common CRCW PRAM with ≤ n2n
  processors.
Definition 58
A family {Cn } of circuits is logspace uniform if there exists a
deterministic Turing machine M s.t
  ñ    M runs in logarithmic space.
  ñ    For all n, M outputs Cn on input 1n .
   0
    0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
111
110
101
100
011
010
001
                    000
                          -3   -2    -1     0      1      2        3
 0
     000 001 002 010 011 012 020 021 022 100 101 102 110 111 112 120 121 122 200 201 202 210 211 212 220 221 222
111
110
101
100
011
010
001
                  000
                        -3   -2   -1    0     1     2     3
    PA                        11 Some Networks
    © Harald Räcke                                   210
Permutation Routing
  Lemma 59
  On the linear array M(n, 1) any permutation can be routed
  online in 2n steps with buffersize 3.
    PA                     11 Some Networks
    © Harald Räcke                                            211
Permutation Routing
  Lemma 60
  On the Beneš network any permutation can be routed offline in
  2d steps between the sources level (+d) and target level (−d).
    PA                      11 Some Networks
    © Harald Räcke                                                 212
Recursive Beneš Network
B(d − 1)
                      B(d − 1)
Permutation Routing
  base case d = 0
  trivial
  induction step d → d + 1
    ñ   The packets that start at (ā, d) and (ā(d), d) have to be
        sent into different sub-networks.
    ñ   The packets that end at (ā, −d) and (ā(d), −d) have to
        come out of different sub-networks.
  We can color such a graph with n colors such that no two nodes
  in a hyperedge share a color.
 PA                      11 Some Networks
 © Harald Räcke                                              216
We can simulate the algorithm for the n-ary Beneš Network.
  PA                       11 Some Networks
  © Harald Räcke                                                   217
We simulate the behaviour of the Beneš network on the
n-dimensional mesh.
  PA                       11 Some Networks
  © Harald Räcke                                            219
The nodes are of the form (`, x̄), x̄ ∈ [n]d , ` ∈ −d, . . . , d.
We route in 3 phases:
 1. Permute packets along the rows such that afterwards no
    column contains packets that have the same target row.
    O(d) steps.
 2. We can use pipeling to permute every column, so that
    afterwards every packet is in its target row. O(2d + 2d)
    steps.
 3. Every packet is in its target row. Permute packets to their
    right destinations. O(d) steps.
  PA                          11 Some Networks
  © Harald Räcke                                                    220
Lemma 63
We can do offline permutation routing of (partial) permutations
in 2d steps on the hypercube.
Lemma 64
We can sort on the hypercube M(2, d) in O(d2 ) steps.
Lemma 65
We can do online permutation routing of permutations in O(d2 )
steps on the hypercube.
  PA                      11 Some Networks
  © Harald Räcke                                                  221
Bitonic Sorter Sd
              0
             Sd−1
             Sd−1
ASCEND/DESCEND Programs
    PA                         11 Some Networks
    © Harald Räcke                                                       223
              Algorithm 11 oper(a, a0 , dim, Ta , Ta0 )
                  1: if adim , . . . , a0 = 0dim+1 then
                  2:      Ta = min{Ta , Ta0 }
 PA                                 11 Some Networks
 © Harald Räcke                                                224
We can perform an ASCEND/DESCEND run on a linear array
M(2d , 1) in O(2d ) steps.
 PA                     11 Some Networks
 © Harald Räcke                                          225
The CCC network is obtained from a hypercube by replacing
every node by a cycle of degree d.
constand degree
 PA                           11 Some Networks
 © Harald Räcke                                             226
Lemma 66
Let d = 2k . An ASCEND run of a hypercube M(2, d + k) can be
simulated on CCC(d) in O(d) steps.
 PA                      11 Some Networks
 © Harald Räcke                                                227
The shuffle exchange network SE(d) is defined as follows
  ñ    nodes: V = [2]d
  ñ    edges:
          n                                  o n                         o
       E = {x ᾱ, ᾱx} | x ∈ [2], ᾱ ∈ [2]d−1 ∪ {ᾱ0, ᾱ1} | ᾱ ∈ [2]d−1
constand degree
Edges of the first type are called shuffle edges. Edges of the
second type are called exchange edges
  PA                            11 Some Networks
  © Harald Räcke                                                             228
Shuffle Exchange Networks
100 101
010 011
    PA                                     11 Some Networks
    © Harald Räcke                                                                              229
Lemma 67
We can perform an ASCEND run of M(2, d) on SE(d) in O(d)
steps.
 PA                      11 Some Networks
 © Harald Räcke                                            230
Simulations between Networks
    PA                         11 Some Networks
    © Harald Räcke                                                        231
Simulations between Networks
  Definition 68
  A configuration Ci of processor Pi is the complete description of
  the state of Pi including local memory, program counter,
  read-register, write-register, etc.
    PA                        11 Some Networks
    © Harald Räcke                                                       232
Simulations between Networks
  Definition 69
  Let C = (C0 , . . . , Cp−1 ) a configuration of M. A machine M 0 with
  q ≥ p processors weakly simulates t steps of M with slowdown k
  if
    ñ    in the beginning there are p non-empty processors sets
         A0 , . . . , Ap−1 ⊆ M 0 so that all processors in Ai know Ci ;
    ñ    after at most k · t steps of M 0 there is a processor Q(i) that
         knows the t-th successors configuration of C for processor
         Pi .
    PA                           11 Some Networks
    © Harald Räcke                                                         233
Simulations between Networks
  Definition 70
  M 0 simulates M with slowdown k if
    ñ    M 0 weakly simulates machine M with slowdown k
    ñ    and every processor in Ai knows the t-th successor
         configuration of C for processor Pi .
    PA                        11 Some Networks
    © Harald Räcke                                            234
We have seen how to simulate an ASCEND/DESCEND run of the
hypercube M(2, d + k) on CCC(d) with d = 2k in O(d) steps.
 PA                      11 Some Networks
 © Harald Räcke                                              235
Lemma 71
Suppose a network S with n processors can route any
permutation in time O(t(n)). Then S can simulate any constant
degree network M with at most n vertices with slowdown
O(t(n)).
 PA                      11 Some Networks
 © Harald Räcke                                                 236
Map the vertices of M to vertices of S in an arbitrary way.
  PA                       11 Some Networks
  © Harald Räcke                                                 237
Lemma 72
Suppose a network S with n processors can sort n numbers in
time O(t(n)). Then S can simulate any network M with at most
n vertices with slowdown O(t(n)).
 PA                      11 Some Networks
 © Harald Räcke                                                238
Lemma 73
There is a constant degree network on O(n1+ ) nodes that can
simulate any constant degree network with slowdown O(1).
 PA                       11 Some Networks
 © Harald Räcke                                                 239
Suppose we allow concurrent reads, this means in every step all
neighbours of a processor Pi can read Pi ’s read register.
Lemma 74
A constant degree network M that can simulate any n-node
network has slowdown Ω(log n) (independent of the size of M).
  PA                      11 Some Networks
  © Harald Räcke                                                  240
We show the lemma for the following type of simulation.
  ñ    There are representative sets Ati for every step t that specify
       which processors of M simulate processor Pi in step t
       (know the configuration of Pi after the t-th step).
  ñ    The representative sets for different processors are disjoint.
  ñ    for all i ∈ {1, . . . , n} and steps t, Ati ≠ .
  PA                             11 Some Networks
  © Harald Räcke                                                         241
Suppose processor Pi reads from processor Pji in step t.
                                    `
                                 1 X
                            k=         kt
                                 ` t=1
  PA                        11 Some Networks
  © Harald Räcke                                              242
We show
 ñ    The simulation of a step takes at least time γ log n, or
 ñ    the size of the representative sets shrinks by a lot
                                          1 X t
                              |At+1
                          X
                                i   |≤         |Ai |
                          i
                                          n i
 PA                           11 Some Networks
 © Harald Räcke                                                  243
Suppose there is no pair (i, j) such that i reading from j
requires time γ log n.
  ñ    For every i the set Γ2k (Ai ) contains a node from Aj .
  ñ    Hence, there must exist a ji such that Γ2k (Ai ) contains at
       most
                               |Ai | · c 2k   |Ai | · c 3k
                     |Cji | :=              ≤              .
                                 n−1               n
       processors from |Aji |
  PA                            11 Some Networks
  © Harald Räcke                                                      244
If we choose that i reads from ji we get
                       |A0i | ≤ |Cji | · c k
                                      |Ai | · c 3k
                             ≤ ck ·
                                           n
                                 1
                             =     |Ai | · c 4k
                                 n
  PA                         11 Some Networks
  © Harald Räcke                                              245
Let ` be the total number of steps and s be the number of short
steps when kt < γ log n.
  PA                       11 Some Networks
  © Harald Räcke                                                    246
                                 1 s Y
                                             kt +1    n      `+ t kt
                                                               P
                  n ≤ h` ≤ h0              c       ≤     · c
                                 n t∈long           ns
                                                        
                                    n ≤ n1−s+` 2
     PA                              11 Some Networks
     © Harald Räcke                                                    247
Deterministic Online Routing
  Lemma 75
  A permutation on an n × n-mesh can be routed online in O(n)
  steps.
    PA                     11 Some Networks
    © Harald Räcke                                              248
Deterministic Online Routing
    PA                      11 Some Networks
    © Harald Räcke                                                   249
Deterministic Online Routing
    PA                     11 Some Networks
    © Harald Räcke                                               250
Deterministic Online Routing
  Definition 80 (dilation)
  For a given path system the dilation is the maximum length of a
  path.
    PA                       11 Some Networks
    © Harald Räcke                                                  251
Lemma 81
Any oblivious routing protocol requires at least max{Cf , Df }
steps, where Cf and Df , are the congestion and dilation,
respectively, of the path-system used. (node congestion or edge
congestion depending on the communication model)
Lemma 82
Any reasonable oblivious routing protocol requires at most
O(Df · Cf ) steps (unbounded buffers).
  PA                      11 Some Networks
  © Harald Räcke                                                  252
Theorem 83 (Borodin, Hopcroft)
For any path system W there exists a permutation π : V → V
                                       √
and an edge e ∈ E such that at least Ω( n/∆) of the paths go
through e.
 PA                      11 Some Networks
 © Harald Räcke                                                253
Let Wv = {Pv,u | u ∈ V }.
  PA                        11 Some Networks
  © Harald Räcke                                                    254
For any node v there are many edges that are are quite popular
for v.
|V | × |E|-matrix A(z):
                                        e is z-popular for v
                                (
                                    1
                   Av,e (z) =
                                    0   otherwise
Define
  ñ
                                              X
                                Av (z) =           Av,e (z)
                                               e
  ñ
                                              X
                                Ae (z) =           Av,e (z)
                                               v
  PA                                11 Some Networks
  © Harald Räcke                                                 255
Lemma 84
       n−1
Let z ≤ ∆ .
                                               n
For every node v ∈ V there exist at least     2∆z   edges that are z
popular for v. This means
                                       n
                         Av (z) ≥
                                      2∆z
  PA                       11 Some Networks
  © Harald Räcke                                                       256
Lemma 85
There exists an edge e0 that is z-popular for at least z nodes
            √
with z = Ω( n∆).
                     X               X                     n2
                          Ae (z) =       Av (z) ≥
                      e              v                    2∆z
                                  n2                       n
                        &                   '                    
                   Ae0 (z) ≥                      ≥
                               |E| · 2∆z                  2∆2 z
  PA                           11 Some Networks
  © Harald Räcke                                                      257
                             n                   √   √
We choose z such that z =   2∆2 z
                                    (i.e., z =    n/( 2∆)).
  PA                        11 Some Networks
  © Harald Räcke                                                   258
Deterministic oblivious routing may perform very poorly.
  PA                      11 Some Networks
  © Harald Räcke                                           259
Suppose every source on level 0 has p packets, that are routed
to random destinations.
Hence,
                                                 2d−i   1
                   Pr[packet goes over v] ≤           = i
                                                  2d   2
  PA                          11 Some Networks
  © Harald Räcke                                                 260
Expected number of packets:
                                                  1
                   E[packets over v] = p · 2i ·      =p
                                                  2i
  PA                           11 Some Networks
  © Harald Räcke                                          261
What is the probability that at least r packets go through v.
                                               p · 2i  1 r
                                                 !  
          Pr[at least r path through v] ≤           ·
                                                 r    2i
                                                    !r  
                                            p2i · e      1
                                        ≤             ·
                                              r          2i
                                            pe r
                                             
                                        =
                                            r
  PA                        11 Some Networks
  © Harald Räcke                                                  262
 Pr[there exists a node v sucht that at least r path through v]
                          pe r
                           
                     d
                 ≤ d2 ·
                          r
                                                          1
      Pr[exists node v with more than r paths over v] ≤
                                                          N`
 PA                        11 Some Networks
 © Harald Räcke                                                   263
Scheduling Packets
    PA                      11 Some Networks
    © Harald Räcke                                                 264
Definition 86 (Delay Sequence of length s)
  ñ   delay path W
  ñ   lengths `0 , `1 , . . . , `s , with `0 ≥ 1, `1 , . . . , `s ≥ 0 lengths of
      delay-free sub-paths
  ñ   collision nodes v0 , v1 , . . . , vs , vs+1
  ñ   collision packets P0 , . . . , Ps
 PA                               11 Some Networks
 © Harald Räcke                                                                    265
Properties
  ñ   rank(P0 ) ≥ rank(P1 ) ≥ · · · ≥ rank(Ps )
      Ps
        i=0 `i = d
  ñ
 PA                           11 Some Networks
 © Harald Räcke                                                      266
Definition 87 (Formal Delay Sequence)
 PA                            11 Some Networks
 © Harald Räcke                                                        267
We say a formal delay sequence is active if rank(Pi ) = ki holds
for all i.
  PA                           11 Some Networks
  © Harald Räcke                                                   268
Lemma 88
                                                 s+1
                                  2eC(s + k)
                              
                       Ns ≤
                                    s+1
 PA                           11 Some Networks
 © Harald Räcke                                                      269
Hence the probability that the routing takes more than d + s
steps is at most
                                                 s+1
                              2e · C · (s + k)
                          
                   N3 ·
                                 (s + 1)k
  PA                          11 Some Networks
  © Harald Räcke                                                  270
  ñ    With probability 1 − 1`1 the random routing problem has
                           N
       congestion at most O(p + `1 d).
  ñ    With probability 1 − 1`2 the packet scheduling finishes in at
                           N
       most O(C + `2 d) steps.
  PA                         11 Some Networks
  © Harald Räcke                                                       271
Valiants Trick
   We only used
     ñ    all routing paths are of the same length d
     ñ    there are a polynomial number of delay paths
     PA                          11 Some Networks
     © Harald Räcke                                                 272
Valiants Trick
     PA                               11 Some Networks
     © Harald Räcke                                                     273
Valiants Trick
    PA                     11 Some Networks
    © Harald Räcke                                              274
Valiants Trick
     PA                       11 Some Networks
     © Harald Räcke                                              275
For a network G = (V , E, c) we define the characteristic flow
problem via
                        c(u)c(v)
  ñ    demands du,v =     c(V )
  PA                         11 Some Networks
  © Harald Räcke                                                 276
Definition 89
A (randomized) oblivious routing scheme is given by a path
system P and a weight function w such that
                         X
                                 w(p) = 1
                        p∈Ps,t
 PA                       11 Some Networks
 © Harald Räcke                                              277
Construct an oblivious routing scheme from S as follows:
  ñ    let fx,y be the flow between x and y in S
  ñ
                                                   1 c(x)c(y)
                   fx,y ≥ dx,y /C(S) ≥ dx,y /F =
                                                   F   c(V )
  ñ    for p ∈ Px,y set w(p) = fp /fx,y
  PA                           11 Some Networks
  © Harald Räcke                                                278
Valiants Trick
    PA                         11 Some Networks
    © Harald Räcke                                                     279
Example: hypercube.
 PA                   11 Some Networks
 © Harald Räcke                          280
Oblivious Routing for the Mesh
    PA                        11 Some Networks
    © Harald Räcke                                                 281
Let for a multicommodity flow problem P Copt (P ) be the
optimum congestion, and Dopt (P ) be the optimum dilation (by
perhaps different flow solutions).
Lemma 90
There is an oblivious routing scheme for the mesh that obtains a
flow solution S with C(S) = O(Copt (P ) log n) and
D(S) = O(Dopt (P )).
  PA                      11 Some Networks
  © Harald Räcke                                                   282
Lemma 91
For any oblivious routing scheme on the mesh there is a demand
P such that routing P will give congestion Ω(log n · Copt ).
 PA                      11 Some Networks
 © Harald Räcke                                                  283