SOLUTIONS MANUAL TO ACCOMPANY HWANG
ADVANCED
COMPUTER
ARCHITECTURE
PARALLELISM
SCALABILITY
PROGRAMMABILITY
HWANG-CHENG WANG
University of Southern California
JUNG-GEN WU
National Taiwan Normal University

McGraw-Hill, Inc.
New York  St. Louis  San Francisco  Auckland  Bogotá
Caracas  Lisbon  London  Madrid  Mexico City  Milan  Montreal
New Delhi  San Juan  Singapore  Sydney  Tokyo  Toronto

Solutions Manual to Accompany Hwang
ADVANCED COMPUTER ARCHITECTURE
Parallelism, Scalability, Programmability

Copyright © 1993 by McGraw-Hill, Inc. All rights reserved.
Printed in the United States of America. The contents, or
parts thereof, may be reproduced for use with
ADVANCED COMPUTER ARCHITECTURE
Parallelism, Scalability, Programmability
by Kai Hwang
provided such reproductions bear copyright notice, but may not
be reproduced in any form for any other purpose without
permission of the publisher.

ISBN 0-07-091623-6
Contents

Foreword
Preface
Chapter 1   Parallel Computer Models
Chapter 2   Program and Network Properties
Chapter 3   Principles of Scalable Performance
Chapter 4   Processors and Memory Hierarchy
Chapter 5   Bus, Cache, and Shared Memory
Chapter 6   Pipelining and Superscalar Techniques
Chapter 7   Multiprocessors and Multicomputers
Chapter 8   Multivector and SIMD Computers
Chapter 9   Scalable, Multithreaded, and Dataflow Architectures
Chapter 10  Parallel Models, Languages, and Compilers
Chapter 11  Parallel Program Development and Environments
Chapter 12  UNIX, Mach, and OSF/1 for Parallel Computers
Bibliography

Foreword
Dr. Hwang-Cheng Wang and Dr. Jung-Gen Wu have timely produced this Solutions Manual. I believe it will benefit many instructors using Advanced Computer Architecture: Parallelism, Scalability, Programmability (ISBN 0-07-031622-8) as a required textbook.

Drs. Wang and Wu have provided solutions to all the problems in the text. Some of the solutions are unique and have been carefully worked out. Others contain just a sketch of the underlying principles or computations involved. For such problems, they have provided references which should help instructors find more information in relevant sources.

The authors have done an excellent job in putting together the solutions. However, as with any scholarly work, there is always room for improvement. Therefore, instructors are encouraged to communicate with us regarding possible refinements to the solutions. Comments or errata can be sent to Kai Hwang at the University of Southern California. They will be incorporated in future printings of this Solutions Manual. Sample test questions and solutions will also be included in the future to make it more comprehensive.

Finally, I want to thank Dr. Wang and Dr. Wu and congratulate them for a difficult job well done within such a short time period.

Kai Hwang
Preface
This Solutions Manual is intended for the exclusive use of instructors only. Reproduction without permission is prohibited by copyright laws.

The solutions in this Manual fall roughly into three categories:

• For problem-solving questions, detailed solutions have been provided. In some cases alternative solutions are also discussed. More complete answers can be found in the text for definition-type questions.

• For research-oriented questions, a summary of the ideas in key papers is presented. Instructors are urged to consult the original and more recent publications in the literature.

• For questions that require computer programming, algorithms or basic computation steps are specified where appropriate. Example programs can often be obtained from on-line archives or libraries available at many research sites.

Equations and figures in the solutions are numbered separately from those in the text. When an equation or a figure in the text is referenced, it is clearly indicated. Code segments have been written in assembly and high-level languages. Most codes should be self-explanatory. Comments have been added in some places to help in understanding the function performed by each instruction.

We have made a tremendous effort to improve the correctness of the answers. But a few errors might have gone undetected, and some factors might have been overlooked in our analysis. Moreover, several questions are likely to have more than one valid solution; solutions for research-oriented problems are especially sensitive to progress in related areas. In light of this, we welcome suggestions and corrections from instructors.
Acknowledgments
We have received a great deal of help from our colleagues and experts during the preparation of this Manual. Dr. Chi-Yuan Chin, Myungho Lee, Weihua Mao, Fong Pong, Dr. Viktor Prasanna, and Shisheng Shang have contributed solutions to a number of the problems. Chien-Ming Cheng, Cho-Chin Lin, Myungho Lee, Jih-Cheng Lin, Weihua Mao, Fong Pong, Stanley Wang, and Namhoon Yoo have generously shared their ideas through stimulating discussions. We are indebted to Dr. Bill Nitzberg and Dr. David Black for providing useful information and pointing to additional references. Finally, our foremost thanks go to Professor Kai Hwang for many insightful suggestions and judicious guidance.

H. C. Wang
J. G. Wu

Chapter 1
Parallel Computer Models
Problem 1.1

CPI = (45 × 1 + 32 × 2 + 15 × 2 + 8 × 2)/100 = 155/100 = 1.55 cycles/instruction.

MIPS rate = (40 × 10^6 cycles/s)/(1.55 cycles/instruction) = 25.8 × 10^6 instructions/s = 25.8 MIPS.

Execution time = (45000 × 1 + 32000 × 2 + 15000 × 2 + 8000 × 2) cycles / (40 × 10^6 cycles/s) = 3.875 ms.

The execution time can also be obtained by dividing the total number of instructions by the MIPS rate:

Execution time = (45000 + 32000 + 15000 + 8000) instructions / (25.8 × 10^6 instructions/s) = 3.875 ms.
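The arithmetic above can be checked mechanically with a short Python sketch (the instruction counts, CPIs, and 40-MHz clock are the problem data used in the solution; the variable names are ours):

```python
# Instruction mix from Problem 1.1: (instruction count, cycles per instruction).
mix = [(45000, 1), (32000, 2), (15000, 2), (8000, 2)]
clock_hz = 40e6  # 40-MHz processor

instructions = sum(n for n, _ in mix)      # 100000 instructions
cycles = sum(n * cpi for n, cpi in mix)    # 155000 cycles
cpi = cycles / instructions                # effective CPI
mips = clock_hz / cpi / 1e6                # MIPS rate
t_exec = cycles / clock_hz                 # execution time in seconds
```

Running this reproduces CPI = 1.55, about 25.8 MIPS, and an execution time of 3.875 ms.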
Problem 1.2  Instruction set and compiler technology affect the length of the executable code and the memory access frequency. CPU implementation and control determine the clock rate. Memory hierarchy impacts the effective memory access time. These factors together determine the effective CPI, as explained in Section 1.1.4.
Problem 1.3

(a) The effective CPI of the processor is calculated as

CPI = (15 × 10^6 cycles/s)/(10 × 10^6 instructions/s) = 1.5 cycles/instruction.

(b) The effective CPI of the new processor is

(1 + 0.3 × 2 + 0.05 × 4) = 1.8 cycles/instruction.

Therefore, the MIPS rate is

(30 × 10^6 cycles/s)/(1.8 cycles/instruction) = 16.7 MIPS.
Problem 1.4

(a) Average CPI = 1 × 0.6 + 2 × 0.18 + 4 × 0.12 + 8 × 0.1 = 2.24 cycles/instruction.

(b) MIPS rate = 40/2.24 = 17.86 MIPS.
Problem 1.5

(a) False. The fundamental idea of multiprogramming is to overlap the computations of some programs with the I/O operations of other programs.

(b) True. In an SIMD machine, all processors execute the same instruction at the same time. Hence it is easy to implement synchronization in hardware. In an MIMD machine, different processors may execute different instructions at the same time, and it is difficult to support synchronization in hardware.

(c) True. Interprocessor communication is facilitated by sharing variables on a multiprocessor and by passing messages among nodes of a multicomputer. The multicomputer approach is usually more difficult to program since the programmer must pay attention to the actual distribution of data among the processors.

(d) False. In general, an MIMD machine executes different instruction streams on different processors.

(e) True. Contention among processors to access the shared memory may create hot spots, making multiprocessors less scalable than multicomputers.
Problem 1.6  The MIPS rates for different machine-program combinations are shown in the following table:

              Computer A   Computer B   Computer C
Program 1       100          10           5
Program 2         0.1         1           5
Program 3         0.2         0.1         2
Program 4         1           0.125       1

Various means of these values can be used to compare the relative performance of the computers. Definitions of the means for a sequence of positive numbers a1, a2, ..., an are summarized below. (See also the discussion in Section 3.1.2.)

(a) Arithmetic mean: AM = (Σ_{i=1}^{n} a_i)/n.

(b) Geometric mean: GM = (Π_{i=1}^{n} a_i)^{1/n}.

(c) Harmonic mean: HM = n/[Σ_{i=1}^{n} (1/a_i)].

In general,

AM ≥ GM ≥ HM.    (1.1)

Based on the definitions, the following table of mean MIPS rates is obtained:

                  Computer A   Computer B   Computer C
Arithmetic mean     25.3         2.81         3.25
Geometric mean       1.19        0.59         2.66
Harmonic mean        0.25        0.21         2.1

Note that the arithmetic mean of MIPS rates is proportional to the inverse of the harmonic mean of the execution times. Likewise, the harmonic mean of the MIPS rates is proportional to the inverse of the arithmetic mean of the execution times. The two observations are consistent with Eq. 1.1.

If we use the harmonic mean of MIPS rates as the performance criterion (i.e., each program is executed the same number of times on each computer), computer C has the best performance. On the other hand, if the arithmetic mean of MIPS rates is used, which is equivalent to allotting an equal amount of time for the execution of each program on each computer (i.e., fast-running programs are executed more frequently), then computer A is the best choice.
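The three means can be verified with a short Python sketch (the ratings are those in the first table of this solution; the function names are ours):

```python
# MIPS ratings from the table in Problem 1.6, one list per machine.
ratings = {
    "A": [100, 0.1, 0.2, 1],
    "B": [10, 1, 0.1, 0.125],
    "C": [5, 5, 2, 1],
}

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    prod = 1.0
    for x in xs:
        prod *= x
    return prod ** (1 / len(xs))

def harmonic_mean(xs):
    return len(xs) / sum(1 / x for x in xs)

# (AM, GM, HM) per machine; AM >= GM >= HM should hold for each column.
means = {m: (arithmetic_mean(v), geometric_mean(v), harmonic_mean(v))
         for m, v in ratings.items()}
```

Computer C wins under the harmonic mean, computer A under the arithmetic mean, matching the discussion above.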
Problem 1.7

• An SIMD computer has a single control unit. The other processors are simple slave processors which accept instructions from the control unit and perform an identical operation at the same time on different data. Each processor in an MIMD computer has its own control unit and execution unit. At any moment, a processor can execute an instruction different from the other processors.

• Multiprocessors have a shared memory structure. The degree of resource sharing is high, and interprocessor communication is carried out via shared variables in the shared memory. In multicomputers, each node typically consists of a processor and local memory. The nodes are connected by communication channels which provide the mechanism for message interchange among processors. Resource sharing is light among processors.

• In the UMA architecture, each memory location in the system is equally accessible to all processors, and the access time is uniform. In the NUMA architecture, the access time to a memory location depends on the proximity of a processor to the memory location; therefore, the access time is nonuniform. In the NORMA architecture, each processor has its own private memory; no memory is shared among processors, and each processor is allowed to access its private memory only. In the COMA architecture, such as that adopted by the KSR-1, each processor has its private cache, and the caches together constitute the global address space of the system. It is like a NUMA with cache in place of memory. A page of data can be migrated to a processor upon demand or be replicated on more than one processor.
Problem 1.8

(a) The total number of cycles needed on a sequential processor is (4 + 4 + 8 + 4 + 2 + 4) × 64 = 1664 cycles.

(b) Each PE executes the same instruction on the corresponding elements of the vectors involved. There is no communication among the processors. Hence the total number of cycles on each PE is 4 + 4 + 8 + 4 + 2 + 4 = 26.

(c) The speedup is 1664/26 = 64 with a perfectly parallel execution of the code.
Problem 1.9

Because the processing power of a CRCW-PRAM and an EREW-PRAM is the same, we need only focus on memory accesses. Below, we prove that the time complexity of simulating a concurrent write or a concurrent read on an EREW-PRAM is O(log n). Before the proof, we assume it is known that an EREW-PRAM can sort n numbers or write a number to n memory locations in O(log n) time.

(a) We present the proof for simulating concurrent writes below.

1. Create an auxiliary array A of length n. When CRCW processor P_i, for i = 0, 1, ..., n−1, desires to write a datum x_i to a location l_i, each corresponding EREW processor P_i writes the ordered pair (l_i, x_i) to location A[i]. These writes are exclusive, since each processor writes to a distinct memory location.

2. Sort the array by the first coordinate of the ordered pairs in O(log n) time, which causes all data written to the same location to be brought together in the output.

3. Each EREW processor P_i, for i = 1, 2, ..., n−1, now inspects A[i] = (l_j, x_j) and A[i−1] = (l_k, x_k), where j and k are values in the range 0 ≤ j, k ≤ n−1. If l_j ≠ l_k, or i = 0, then processor P_i writes the datum x_j to location l_j in global memory. Otherwise, the processor does nothing. Since the array A is sorted by the first coordinate, only one of the processors writing to any given location actually succeeds, and thus the write is exclusive.

This process implements each step of concurrent writing in the common-CRCW model in O(log n) time.

(b) We present the proof for simulating concurrent reads as follows:

1. Create an auxiliary array A of length n. When CRCW processor P_i, for i = 0, 1, ..., n−1, desires to read a datum from a location l_i, each corresponding EREW processor P_i writes the ordered three-tuple (i, l_i, x_i) to location A[i], in which x_i is an arbitrary number. These writes are exclusive, since each processor writes to a distinct memory location.

2. Sort the array A by the second coordinate of the three-tuples in O(log n) time, which causes all requests to read from the same location to be brought together in the output.

3. Each EREW processor P_i, for i = 1, 2, ..., n−1, now inspects A[i] = (j, l_j, x_j) and A[i−1] = (k, l_k, x_k), where j and k are values in the range 0 ≤ j, k ≤ n−1. If l_j ≠ l_k, or i = 0, then processor P_i reads the datum from location l_j in global memory. Otherwise, the processor does nothing. Since the array A is sorted by the second coordinate, only one of the processors reading from any given location actually succeeds, and thus the read is exclusive.

4. Each EREW processor P_i that read a datum stores the datum in the third coordinate of A[i], and then broadcasts it to the third coordinates of A[j] for j = i+1, i+2, ..., as long as l_j = l_i. This takes O(log n) time.

5. Sort the array A by the first coordinate of the ordered three-tuples in O(log n) time.

6. Each EREW processor P_i reads the datum in the third coordinate from A[i]. These reads are exclusive, since each processor reads from a distinct memory location.

This process thus implements each step of concurrent reading in the CRCW model in O(log n) time.
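The write-simulation argument can be traced sequentially with a short Python sketch. This is only a serial trace of the sorting-based resolution, not a parallel PRAM implementation; the function name and the tie-breaking by processor index (which matches a priority-CRCW convention, since a stable sort keeps the lowest-indexed processor first) are our choices:

```python
def simulate_concurrent_write(requests):
    """Resolve CRCW write requests (location, datum) the way the EREW
    simulation does: sort requests by location, then let only the first
    request for each location perform the actual (exclusive) write."""
    n = len(requests)
    # Steps 1-2: each processor i posts its pair into A[i]; sort by location.
    A = sorted(range(n), key=lambda i: requests[i][0])
    memory = {}
    # Step 3: a processor succeeds only if its location differs from its
    # predecessor's in the sorted order, so each location is written once.
    for rank, i in enumerate(A):
        loc, datum = requests[i]
        if rank == 0 or requests[A[rank - 1]][0] != loc:
            memory[loc] = datum  # exclusive write
    return memory

# Processors 0 and 2 contend for location 5; only one write survives.
result = simulate_concurrent_write([(5, 'x'), (7, 'y'), (5, 'z'), (2, 'w')])
```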
Problem 1.10  For multiplying two n-bit binary integers, there are 2n bits of input and 2n bits of output.

Suppose the circuit in question, in the grid model, is a rectangle of height h and width w as shown in the original diagram. [The diagram is not reproduced here.]

Assume without loss of generality that h ≤ w, and that there is at most one wire along each grid line. It is possible to divide the circuit by a line as shown in the figure. This line runs between the grid lines and runs vertically, except possibly for a single jog of one grid unit. Most importantly, we can select the line so that at least 2n/3 of the output bits (i.e., 1/3 of the output bits) are emitted on each side. We select the line by sliding it from left to right, until the first point at which at least 2n/3 of the output bits are output to the left of the line.

If no more than 4n/3 of these bits are output to the left, we are done. If not, start from the top, considering places to jog the line back one unit to the left. We know that if the line jogs at the very top, fewer than 4n/3 of the bits are emitted to the left, and if the line jogs at the very bottom, more than 2n/3 are. Thus, as no single grid point can be the place where as many as n/3 of the bits are emitted, we can find a suitable place in the middle to jog the line. There, we shall have between 2n/3 and 4n/3 of the output bits on each side.

Now assume without loss of generality that at least half of the input bits are read on the left of the line, and let us, by renumbering bits if necessary, assume that these are x_k, x_{2k}, and so on. Suppose also that output bits y_{i_1}, y_{i_2}, ..., y_{i_{2n/3}} are output on the right. We can pick values so that y_{i_1} = x_k, y_{i_2} = x_{2k}, and so on. Thus information regarding the 2n/3 input bits x_k, x_{2k}, ..., x_{2kn/3} must cross the line.

We may assume at most one wire or circuit element along any grid line, so the number of bits crossing the line in one time unit is at most h + 1 (h horizontal and one vertical, at the jog). It follows that (h + 1)T ≥ 2n/3, or else the required 2n/3 bits cannot cross the line in time T. Since we assume w ≥ h, we have both hT = Ω(n) and wT = Ω(n). Since wh = A, we have AT² = Ω(n²) by taking the product. That is, AT² ≥ kn² for some constant k.
Problem 1.11

(a) Since the processing elements of an SIMD machine read and write data from different memory modules synchronously, no access conflicts should arise. Thus any PRAM variant can be used to model SIMD machines.

(b) The processors in an MIMD machine can read the same memory location simultaneously. However, writing to the same memory location simultaneously is prohibited. Thus the CREW-PRAM can best model an MIMD machine.
Problem 1.12

(a) The memory organization changed from the UMA model (global shared memory) to the NUMA model (distributed shared memory).

(b) The medium-grain multicomputers use hypercubes as their interconnection networks, while the fine-grain multicomputers use lower-dimensional k-ary n-cubes (e.g., 2-D or 3-D tori) as their interconnection networks.

(c) In the register-to-register architecture, vector registers are used to hold vector operands and intermediate and final vector results. In the memory-to-memory architecture, vector operands and results are retrieved directly from the main memory by using a vector stream unit.

(d) In a single-threaded architecture, each processor maintains a single thread of control with limited hardware resources. In a multithreaded architecture, each processor can execute multiple contexts by switching among threads.
Problem 1.13  Assume the input is A(i) for 0 ≤ i ≤ n − 1. [The remainder of this solution is a diagram of the computation, which is not reproduced here.]
Chapter 2

Program and Network Properties

Problem 2.11
(a) To design a direct network for a 64-node multicomputer, we can use:

• A 3-D torus with 4 nodes along each dimension. The relevant parameters are: d = 6, D = 3⌊4/2⌋ = 6, and l = 3N = 192. Also, d × D × l = 6912.

• A 6-dimensional hypercube. The relevant parameters are: d = n = 6, D = n = 6, and l = n × N/2 = 6 × 64/2 = 192. We have d × D × l = 6912.

• A CCC with dimension k = 4. The relevant parameters are: d = 3, D = 2k − 1 + ⌊k/2⌋ = 2 × 4 − 1 + ⌊4/2⌋ = 9, and l = 3N/2 = 96. The value of d × D × l is 2592.
(b)

• If the quality of a network is measured by (d × D × l)^(−1), then a CCC is better than a 3-D torus or a 6-cube. A 3-D torus and a 6-cube have the same quality.

• The torus and hypercube have similar network properties and are treated together. We have Σ_{i=1}^{6} i = (1 + 6) × 6/2 = 21. Denote by P(i) the probability of information exchange between nodes at a distance i. Then we have P(2) = 2P(1), P(3) = 3P(1), ..., P(6) = 6P(1), with P(i) = i/21. Therefore, the mean internode distance is

Σ_{i=1}^{6} i × P(i) = (1 × 1 + 2 × 2 + ··· + 6 × 6)/21 = 91/21 ≈ 4.33.

• For the CCC, we have Σ_{i=1}^{9} i = (1 + 9) × 9/2 = 45. The probabilities of internode communication for distance i are P(i) = i/45, i.e., P(2) = 2P(1), ..., P(9) = 9P(1). Hence, the mean internode distance is

Σ_{i=1}^{9} i × P(i) = (1 + 4 + 9 + ··· + 81)/45 = 285/45 ≈ 6.33.

In conclusion, the mean internode distance of the 4-CCC is greater than that of the 6-cube and the 3-D torus; the 6-cube and the 3-D torus have identical mean internode distances. The similarity of the 6-cube and the 3-D torus in the above is more than incidental. In fact, it has been shown [Wang88] that when k = 4 (as is the case for this problem), a k-ary n-cube is exactly a 2n-dimensional binary hypercube.
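The figures of merit can be tabulated with a short Python sketch (the (d, D, l) values and the distance-weighted probability model P(i) = i/Σi are the ones used in this solution; the names are ours):

```python
from math import floor

N = 64

# (degree d, diameter D, link count l) for each candidate network.
torus = (6, 3 * floor(4 / 2), 3 * N)             # 3-D torus, 4 nodes/dimension
hypercube = (6, 6, 6 * N // 2)                   # binary 6-cube
k = 4                                            # CCC dimension
ccc = (3, 2 * k - 1 + floor(k / 2), 3 * N // 2)  # cube-connected cycles

products = {name: d * D * l
            for name, (d, D, l) in
            [("torus", torus), ("6-cube", hypercube), ("CCC", ccc)]}

def mean_distance(D):
    """Mean internode distance under the assumption P(i) = i / sum(1..D)."""
    total = sum(range(1, D + 1))
    return sum(i * i / total for i in range(1, D + 1))
```

The products come out to 6912, 6912, and 2592, and the mean distances to 91/21 ≈ 4.33 and 285/45 ≈ 6.33, as in the text.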
Problem 2.12

(a) It should be noted that we are looking for nodes that can be reached from N0 in exactly three steps. Therefore, nodes that can be reached in 1 or 2 steps have to be excluded.

• For an 8 × 8 Illiac mesh, the reachable nodes can be calculated by the expression (a + b + c) mod 64, where a, b, and c can each be +1, −1, +8, or −8. There are 20 combinations (4 if a, b, and c are all different; 12 if two of them are equal; 4 if a = b = c). However, 8 of the combinations contain the pair +1 and −1 or the pair +8 and −8, making them reachable in one step. Such nodes have to be eliminated from the list. Hence, 12 nodes can be reached from N0 in three steps. The addresses of these nodes are 3, 6, 10, 15, 17, 24, 40, 47, 49, 54, 58, and 61.

• For a binary 6-cube, the binary address a5...a1a0 of a node reachable in three steps from N0 has exactly three 1s. There are 20 (= C(6,3)) possible combinations. The addresses of these nodes are 7, 11, 13, 14, 19, 21, 22, 25, 26, 28, 35, 37, 38, 41, 42, 44, 49, 50, 52, and 56.

• For a 64-node barrel shifter, the nodes reachable in exactly three steps can be determined as follows. List all 6-bit numbers which contain three 1s; there are 20 such numbers. First take the 1's complement of each number and then add 1 to each of the resulting numbers. (Equivalently, the new numbers are obtained by subtracting each of the original numbers from 64.) If a new number has three or four 1s in its binary representation and the 1s are separated by at least one 0, then both nodes whose addresses are the original number and the new number can be reached in exactly three steps. (The last point of the rule is due to the fact that clustered 1s can always be replaced by two 1s.) The addresses of these nodes are 11, 13, 19, 21, 22, 28, 25, 26, 27, 29, 35, 37, 38, 39, 41, 42, 43, 45, 51, and 53.

(b) The upper bound on the minimum number of routing steps needed to send data from any node to another is 7 (= √64 − 1) for an 8 × 8 Illiac mesh, 6 for a 6-cube, and 3 (= log2(64)/2) for a 64-node barrel shifter.

(c) The upper bound on the minimum number of routing steps needed to send data from any node to another is 31 for a 32 × 32 Illiac mesh, 10 for a 10-cube, and 5 for a 1024-node barrel shifter.
Problem 2.13  Part of Table 2.4 in the text is duplicated below:

Network characteristic       Bus system         Multistage network     Crossbar switch
Minimum latency for unit     Constant           O(log_k n)             Constant
  data transfer
Bandwidth per processor      O(w/n) to O(w)     O(w) to O(nw)          O(w) to O(nw)
Wiring complexity            O(w)               O(nw log_k n)          O(n^2 w)
Switching complexity         O(n)               O(n log_k n)           O(n^2)
Connectivity and routing     Only one to one    Some permutations      All permutations,
  capability                 at a time          and broadcast, if      one at a time
                                                network unblocked
Remarks                      Assume n proces-   n × n MIN using        Assume n × n
                             sors on the bus;   k × k switches with    crossbar with
                             bus width w bits   line width of w bits   line width of
                                                                      w bits
Problem 2.14

(a) For each output terminal, there are 4 possible connections (one from each of the input terminals), so there are 4 × 4 × 4 × 4 = 256 legitimate states.

(b) 48 (= 16 × 3) 4 × 4 switch modules are needed to construct a 64-input Omega network. There are 24 (= 4 × 3 × 2 × 1) permutation connections in a 4 × 4 switch module. Therefore a total of 24^48 permutations can be implemented in a single pass through the network without blocking.

(c) The total number of permutations of 64 inputs is 64!. So the fraction is 24^48/64! ≈ 1.4 × 10^−23.
Problem 2.15

(a) We label the switch modules of a 16 × 16 Baseline network as in the original figure. [Diagram not reproduced here.] Then, if we change the positions of some switch modules, the Baseline network becomes an Omega network. [Diagram not reproduced here.]

(b) If we change the positions of some switch modules in the Baseline network differently, it becomes the Flip network. [Diagram not reproduced here.]

(c) Since both the Omega network and the Flip network are topologically equivalent to the Baseline network, they are topologically equivalent to each other.
Problem 2.16

(a) N = k^n nodes.

(b) Diameter: n⌊k/2⌋.

(c) Bisection width: 2k^(n−1) channels.

(d) Node degree: 2n.

(e)

• A k-ary 1-cube is a ring with k nodes.
• A k-ary 2-cube is a 2-D k × k torus.
• A mesh is a torus without end-around connections.
• A 2-ary n-cube is a binary n-cube.
• An Omega network is the multistage network implementation of the shuffle-exchange network. Its switch modules can be repositioned to have the same interconnection topology as a binary n-cube.

(f) The conventional torus has long end-around connections, but the folded torus has equal-length connections. (See Figure 2.21 in the text.)

(g)

• The relation B = 2wN/k will be shown in the solution of Problem 2.18. Therefore, if both the number of nodes N and the wire bisection width B are constants, the channel width w will be proportional to k:

w = B/b = Bk/(2N),

where b = 2N/k is the number of channels crossing the bisection.

• The latency of a wormhole-routed network is dominated by the term L/w (L being the packet length), which is inversely proportional to w, hence also inversely proportional to k. This means a network with a higher k will have lower latency. For two k-ary n-cube networks with the same number of nodes, the one with a lower dimension has a larger k, and hence a lower latency.

• It will be shown in the solution of Problem 2.18 that the hot-spot throughput is equal to the bandwidth of a single channel:

θ_HS = w = kB/(2N).

Low-dimensional networks have a larger k, hence a higher hot-spot throughput.
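The tradeoff in (g) can be illustrated numerically with a short Python sketch. The 256-node comparison and the 16-bit base channel width are our example values, not from the problem; the formulas are the ones above:

```python
from math import floor

def kary_ncube(k, n, B):
    """Basic k-ary n-cube parameters for a fixed wire bisection width B."""
    N = k ** n
    return {
        "nodes": N,
        "diameter": n * floor(k / 2),
        "bisection_channels": 2 * k ** (n - 1),
        "degree": 2 * n,
        "channel_width": B * k // (2 * N),   # w = Bk/(2N)
    }

# Two 256-node networks under the same wire budget B (here B is fixed by
# giving the 4-ary 4-cube 16-bit channels: B = 2wN/k = 2*16*256/4).
B = 2 * 16 * 256 // 4
low = kary_ncube(16, 2, B)    # low-dimensional: 16-ary 2-cube
high = kary_ncube(4, 4, B)    # high-dimensional: 4-ary 4-cube
```

With the same wire bisection, the low-dimensional network gets 64-bit channels versus 16-bit channels for the 4-ary 4-cube, matching the claim that larger k yields wider channels and higher hot-spot throughput.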
Problem 2.17

(a) In a tree network, a message going from processor i to processor j goes up the tree to their least common ancestor and then back down according to the least significant bits of j. Message traffic through lower-level (closer to the root) nodes is heavier than that through higher-level nodes. The lower-level channels in a fat tree have a greater number of wires, and hence a higher bandwidth. This prevents congestion in the lower-level channels.

(b) The capacity of a universal fat tree at level k is

c_k = min(⌈n/2^k⌉, ⌈w/2^(2k/3)⌉).

• If k > 3 log(n/w), then ⌈n/2^k⌉ < ⌈w/2^(2k/3)⌉. Therefore, c_k = ⌈n/2^k⌉, which is 1, 2, 4, ..., for k = log(n + 1), log(n + 1) − 1, log(n + 1) − 2, ....

• If k ≤ 3 log(n/w), then ⌈n/2^k⌉ ≥ ⌈w/2^(2k/3)⌉. Hence c_k = ⌈w/2^(2k/3)⌉, which is ..., w/4, w/2^(4/3), w/2^(2/3), for k = ..., 3, 2, 1.

• Initially, the capacities double from one level to the next toward the root, but at levels less than 3 log(n/w) away from the root, the channel capacities grow at the rate of 4^(1/3) per level.
Problem 2.18

(a) A k-ary n-cube network has N nodes, where N = k^n. Assume k is even. If the network is partitioned along one dimension into two parts of equal size, the "cross section" separating the two parts contains N/k nodes. Corresponding to each node in the cross section, there are two wires, one being the nearest-neighbor link and the other the wraparound link in the original network. Therefore, the cross section contains b = 2N/k channels, each w bits wide, giving a wire bisection width B = bw = 2wN/k. The argument also holds for k odd, although the partitioning is slightly more complex.

(b) The hot-spot throughput of a network is the maximum rate at which messages can be sent from one specific node P_i to another specific node P_j. For a k-ary n-cube with deterministic routing, the hot-spot throughput θ_HS is equal to the bandwidth of a single channel, w. From (a), w = kB/(2N). Therefore,

θ_HS = kB/(2N),

which is proportional to k for a fixed B.
Problem 2.19

(a) Embedding of an r × r torus in a hypercube is shown in the original diagrams for r = 2 and 4, respectively ((a) and (c)). [Diagrams not reproduced here.] As can be seen, if the nodes of a torus are numbered properly, we obtain internode connections identical to those of a hypercube (nodes whose numbers differ by a power of 2 are linked directly).

A 2r × 2r torus can be constructed from r × r tori in two steps. In step one, a 2r × r torus is built by combining an r × r torus with its "mirror" image (in the sense of node numbering) and connecting the corresponding nodes, as shown in diagram (b). In step two, the 2r × r torus is combined with its mirror image to form a 2r × 2r torus. In this manner, a torus can be fully embedded in a hypercube of dimension d with 2^d = r² nodes.

In general, it has been shown that any m1 × m2 × ··· × ml torus, where mi = 2^(pi), can be embedded in a hypercube of dimension d = p1 + p2 + ··· + pl with the proximity property preserved, using binary reflected Gray code for the mapping [Chan86].

(b) Embedding a ring on a CCC is equivalent to finding a Hamiltonian cycle on the CCC. The embedding of rings on CCCs for k = 3 and 4, respectively, is shown in the original figure. [Figure not reproduced here.] It is easy to first consider the embedding of a ring on a binary hypercube by treating the cycle at each vertex of the hypercube as a supernode. This step can be carried out easily, and there are several possible ways to embed a ring on a hypercube. Then, whenever a supernode on the embedded ring is visited, all the nodes in the corresponding cycle are linked.

(c) Embedding of a complete balanced tree in a mesh is shown in the original diagram for trees of different heights. [Diagram not reproduced here.] In general, the root of a tree is mapped to the center node of a mesh, and leaf nodes are mapped to outlying mesh nodes. The process is recursive. Suppose a tree of height l ≥ 3 has been embedded in an r × r mesh. When embedding a tree of height l + 1, an (r + 2) × (r + 2) mesh is needed, with the new leaf nodes mapped to the boundary nodes of the augmented mesh. This is illustrated for l = 3.
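The Gray-code mapping mentioned in (a) can be checked with a short Python sketch (the 4 × 4 torus and the function names are our example):

```python
def gray(i):
    """Binary-reflected Gray code of i."""
    return i ^ (i >> 1)

def torus_node(x, y, r=4):
    """Map torus coordinate (x, y) to a hypercube address by concatenating
    the Gray codes of the two coordinates (log2(r) bits per dimension)."""
    bits = r.bit_length() - 1
    return (gray(x) << bits) | gray(y)

# Every torus edge (including the wraparound edges) must join hypercube
# addresses that differ in exactly one bit, i.e., be a hypercube edge.
r = 4
proximity_preserved = all(
    bin(torus_node(x, y) ^ torus_node((x + dx) % r, (y + dy) % r)).count("1") == 1
    for x in range(r) for y in range(r)
    for dx, dy in ((1, 0), (0, 1))
)
```

The check succeeds because consecutive Gray codes, including the wraparound pair gray(r−1) and gray(0), differ in a single bit.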
Problem 2.20

(a) A hypernet combines the hierarchical structure of a tree in the overall architecture and the uniform connectivity of a hypercube in its building blocks.

(b) By construction, a hypernet built with identical modules (buslets, treelets, cubelets, etc.) has a constant node degree. This is achieved by a systematic use of the external links of each cubelet when building larger and larger systems.

(c) The hypernet architecture was proposed to take advantage of localized communication patterns present in some applications such as connectionist neural networks. The connection structure of hypernets gives effective support for communication between adjacent lower-level clusters. Global communication is also supported, but the bandwidth provided is lower. Algorithms with commensurate nonuniform communication requirements among different components are suitable candidates for implementation on hypernets.

I/O capability of a hypernet is furnished by the external links of each building block. As a result, I/O devices can be spread throughout the hierarchy to meet I/O demand. Fault tolerance is built into the hypernet architecture to allow graceful degradation. Execution of a program can be switched to a subnet in case of node or link failures. The modular construction also facilitates isolation and subsequent replacement of faulty nodes or subnets.

Chapter 3
Principles of Scalable Performance
Problem 3.1

(a) CPI = 1 × 0.6 + 2 × 0.18 + 4 × 0.12 + 12 × 0.1 = 2.64 cycles/instruction.

(b) MIPS rate = 4 × (40 × 10^6 cycles/s)/(2.64 cycles/instruction) = 60.60 MIPS for the four-processor system.

(c) When a single processor is used, the execution time is t1 = 200000/17.86 = 1.12 × 10^4 µs. When four processors are used, the time is reduced to t4 = 220000/60.60 = 3.63 × 10^3 µs. Hence the speedup is 1.12 × 10^4/(3.63 × 10^3) = 3.08, and the efficiency is 3.08/4 = 0.77.
Problem 3.2

(a) If the vector mode is not used at all, the execution time will be

0.75T + 9 × 0.25T = 3T.

Therefore, effective speedup = 3T/T = 3. Let the fraction of vectorized code be α. Then α = 9 × 0.25T/3T = 0.75.

(b) Suppose the speed ratio between the vector mode and the scalar mode is doubled. The execution time becomes

0.75T + 0.25T/2 = 0.875T.

The effective speedup is 3T/0.875T = 24/7 ≈ 3.43.

(c) Suppose the speed for vector mode computation is still nine times as fast as that for scalar mode. To maintain the effective speedup of 3.43, the vectorization ratio α must satisfy the following relation:

1/((1 − α) + α/9) = 24/7.

Solving the equation, we obtain α = 51/64 ≈ 0.8.
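The algebra can be verified with exact rational arithmetic in Python (the variable names are ours; T is normalized to 1):

```python
from fractions import Fraction

T = 1                                                        # vector-mode time
scalar_only = Fraction(3, 4) * T + 9 * Fraction(1, 4) * T    # 3T without vectors
speedup_a = scalar_only / T                                  # part (a): 3
alpha = 9 * Fraction(1, 4) / scalar_only                     # vectorized fraction

# (b) doubling the vector/scalar speed ratio
t_b = Fraction(3, 4) + Fraction(1, 4) / 2                    # 0.875T
speedup_b = scalar_only / t_b                                # 24/7

# (c) solve 1 / (1 - a + a/9) = 24/7 for the required vectorization ratio
a = (1 - Fraction(7, 24)) * Fraction(9, 8)                   # 51/64
```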
Problem 3.3

(a) Suppose the total workload is W million instructions, each processor is rated at x MIPS, and a fraction α of the workload can be executed in parallel on all n processors (the remainder runs on a single processor). Then the execution time in seconds is

T = αW/(nx) + (1 − α)W/x.

Therefore, the effective MIPS rate is

W/T = nx/(α + n(1 − α)) = nx/(n − (n − 1)α).

(b) Substituting the given data (n = 16, x = 4 MIPS, and an effective rate of 40 MIPS) into the expression in (a), we have

16 × 4/(16 − 15α) = 40,

which can be solved to give α = 24/25 = 0.96.
Problem 3.4  Assume the speed in enhanced mode is n times as fast as that in regular mode. The execution time as a function of α is

T(α) = α/R + (1 − α)/(nR),

where R is the execution rate in regular mode.

(a) If α is uniformly distributed between a and b, the average execution time is

T_avg = (∫_a^b T(α) dα)/(b − a) = [(n − 1)(b + a) + 2]/(2nR).

The average execution rate is

R_avg = 1/T_avg = 2nR/[(n − 1)(b + a) + 2],

and the average speedup factor is

S_avg = R_avg/R = 2n/[(n − 1)(b + a) + 2].

(b) If a → 0 and b → 1, then S_avg = 2n/(n + 1).
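The closed form in (a) can be checked numerically with a short Python sketch (n, R, a, and b are arbitrary example values of ours):

```python
# Average T(alpha) = (1 + (n - 1) * alpha) / (n * R) over [a, b] by the
# midpoint rule and compare the speedup with 2n / ((n - 1)(a + b) + 2).
n, R = 8, 1.0
a, b = 0.2, 0.6

steps = 1000
width = (b - a) / steps
t_avg = sum((1 + (n - 1) * (a + (i + 0.5) * width)) / (n * R)
            for i in range(steps)) / steps
s_avg = (1 / t_avg) / R

closed_form = 2 * n / ((n - 1) * (a + b) + 2)
```

Because T(α) is linear in α, the midpoint rule is exact here, so the two values agree to floating-point precision.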
Problem 3.5

(a) The harmonic mean execution rate is

R_h = 1/(Σ_{i=1}^{n} f_i/R_i).

The arithmetic mean execution time is

T_a = Σ_{i=1}^{n} f_i/R_i = 1/R_h.

(b) Given f1 = 0.4, f2 = 0.3, f3 = 0.2, f4 = 0.1, and R1 = 4 MIPS, R2 = 8 MIPS, R3 = 11 MIPS, R4 = 15 MIPS, the arithmetic mean execution time is T = 0.4/4 + 0.3/8 + 0.2/11 + 0.1/15 = 0.162 µs per instruction.

Several factors cause R_i to be smaller than i × R_1. First, there might be memory access operations which take extra machine cycles. Second, when the number of processors is increased, more memory access conflicts arise, which increase the execution time and lower the effective MIPS rate. Third, part of the program may have to be executed sequentially or can be executed by only a limited number of processors simultaneously. Finally, there is an overhead for processors to synchronize with each other. Because of these overheads, R_i/i typically decreases with i.

(c) Given a new distribution f1 = 0.1, f2 = 0.2, f3 = 0.3, and f4 = 0.4 due to the use of an intelligent compiler, the arithmetic mean execution time becomes T = 0.1/4 + 0.2/8 + 0.3/11 + 0.4/15 = 0.104 µs per instruction.
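The two mean execution times can be reproduced with a short Python sketch (rates and distributions are the problem data above):

```python
# Arithmetic mean execution time T = sum(f_i / R_i) for the two
# instruction-mix distributions of Problem 3.5.
rates = [4, 8, 11, 15]                 # R_i in MIPS

def mean_time(weights):
    """Mean time in microseconds per instruction for a given distribution."""
    return sum(f / r for f, r in zip(weights, rates))

t_original = mean_time([0.4, 0.3, 0.2, 0.1])
t_compiled = mean_time([0.1, 0.2, 0.3, 0.4])
```

Shifting weight toward the faster modes lowers the mean time from 0.162 µs to 0.104 µs per instruction.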
Problem 3.6  Amdahl's law is based on a fixed workload, where the problem size is fixed regardless of the machine size. Gustafson's law is based on a scaled workload, where the problem size is increased with the machine size so that the solution time is the same for sequential and parallel executions. Sun and Ni's law also applies to scaled problems, where the problem size is increased to match the maximum memory capacity.
Problem 3.7

(a) The total number of clock cycles needed is

Σ_{i=1}^{1024} (2i + 2) = 1024 × 1025 + 2 × 1024 = 1,051,648.

(b) If consecutive outer loops are assigned to a single processor, the workload is not balanced, and the parallel execution time is dominated by that on processor 32. The number of clock cycles needed is

Σ_{i=993}^{1024} (2i + 2) = (993 + 1024) × 32 + 2 × 32 = 64,608.
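The cycle counts can be reproduced with a short Python sketch (the per-iteration cost 2i + 2 is the one used in the sums above):

```python
# Total cycles when the i-th outer iteration costs 2i + 2 cycles.
total = sum(2 * i + 2 for i in range(1, 1025))

# Processor 32 gets the last 32 consecutive outer iterations (i = 993..1024),
# which dominates the unbalanced parallel execution time.
worst = sum(2 * i + 2 for i in range(993, 1025))
```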