Solution | PDF | Central Processing Unit | Parallel Computing
Solution

Uploaded by

dimansego213
1.4 [2] <§1.4> Assume a color display using 8 bits for each of the
primary colors (red, green, blue) per pixel and a frame size of
1280 × 1024.
a. What is the minimum size in bytes of the frame buffer to
store a frame?
b. How long would it take, at a minimum, for the frame to be
sent over a 100 Mbit/s network?
1.5 [4] <§1.6> Consider three different processors P1, P2, and P3
executing the same instruction set. P1 has a 3 GHz clock rate and
a CPI of 1.5. P2 has a 2.5 GHz clock rate and a CPI of 1.0. P3 has a
4.0 GHz clock rate and has a CPI of 2.2.
a. Which processor has the highest performance expressed in
instructions per second?
b. If the processors each execute a program in 10 seconds, find
the number of cycles and the number of instructions.
c. We are trying to reduce the execution time by 30%, but this
leads to an increase of 20% in the CPI. What clock rate
should we have to get this time reduction?
1.6 [20] <§1.6> Consider two different implementations of the same
instruction set architecture. The instructions can be divided into
four classes according to their CPI (classes A, B, C, and D). P1
with a clock rate of 2.5 GHz and CPIs of 1, 2, 3, and 3, and P2
with a clock rate of 3 GHz and CPIs of 2, 2, 2, and 2.
Given a program with a dynamic instruction count of 1.0E6
instructions divided into classes as follows: 10% class A, 20%
class B, 50% class C, and 20% class D, which is faster: P1 or P2?
a. What is the global CPI for each implementation?
b. Find the clock cycles required in both cases.
1.7 [15] <§1.6> Compilers can have a profound impact on the
performance of an application. Assume that for a program,
compiler A results in a dynamic instruction count of 1.0E9 and
has an execution time of 1.1 s, while compiler B results in a
dynamic instruction count of 1.2E9 and an execution time of 1.5 s.
a. Find the average CPI for each program given that the
processor has a clock cycle time of 1 ns.
b. Assume the compiled programs run on two different
processors. If the execution times on the two processors are
the same, how much faster is the clock of the processor
running compiler A’s code versus the clock of the processor
running compiler B’s code?
c. A new compiler is developed that uses only 6.0E8
instructions and has an average CPI of 1.1. What is the
speedup of using this new compiler versus using compiler A
or B on the original processor?
1.8 The Pentium 4 Prescott processor, released in 2004, had a clock
rate of 3.6 GHz and voltage of 1.25 V. Assume that, on average, it
consumed 10 W of static power and 90 W of dynamic power.
The Core i5 Ivy Bridge, released in 2012, has a clock rate of
3.4 GHz and voltage of 0.9 V. Assume that, on average, it
consumed 30 W of static power and 40 W of dynamic power.
1.8.1 [5] <§1.7> For each processor find the average capacitive
loads.
1.8.2 [5] <§1.7> Find the percentage of the total dissipated power
comprised by static power and the ratio of static power to
dynamic power for each technology.
1.8.3 [15] <§1.7> If the total dissipated power is to be reduced by
10%, how much should the voltage be reduced to maintain the
same leakage current? Note: power is defined as the product of
voltage and current.
1.9 Assume for arithmetic, load/store, and branch instructions, a
processor has CPIs of 1, 12, and 5, respectively. Also assume that
on a single processor a program requires the execution of 2.56E9
arithmetic instructions, 1.28E9 load/store instructions, and 256
million branch instructions. Assume that each processor has a
2 GHz clock frequency.
Assume that, as the program is parallelized to run over multiple
cores, the number of arithmetic and load/store instructions per
processor is divided by 0.7 × p (where p is the number of
processors) but the number of branch instructions per processor
remains the same.
1.9.1 [5] <§1.7> Find the total execution time for this program on
1, 2, 4, and 8 processors, and show the relative speedup of the
2, 4, and 8 processors result relative to the single processor
result.
1.9.2 [10] <§§1.6, 1.8> If the CPI of the arithmetic instructions was
doubled, what would the impact be on the execution time of
the program on 1, 2, 4, or 8 processors?
1.9.3 [10] <§§1.6, 1.8> To what should the CPI of load/store
instructions be reduced in order for a single processor to match
the performance of four processors using the original CPI
values?
1.10 Assume a 15 cm diameter wafer has a cost of 12, contains 84
dies, and has 0.020 defects/cm^2. Assume a 20 cm diameter wafer
has a cost of 15, contains 100 dies, and has 0.031 defects/cm^2.
1.10.1 [10] <§1.5> Find the yield for both wafers.
1.10.2 [5] <§1.5> Find the cost per die for both wafers.
1.10.3 [5] <§1.5> If the number of dies per wafer is increased by
10% and the defects per area unit increases by 15%, find the die
area and yield.
1.10.4 [5] <§1.5> Assume a fabrication process improves the yield
from 0.92 to 0.95. Find the defects per area unit for each version
of the technology given a die area of 200 mm^2.
1.11 The results of the SPEC CPU2006 bzip2 benchmark running on
an AMD Barcelona has an instruction count of 2.389E12, an
execution time of 750 s, and a reference time of 9650 s.
1.11.1 [5] <§§1.6, 1.9> Find the CPI if the clock cycle time is
0.333 ns.
1.11.2 [5] <§1.9> Find the SPECratio.
1.11.3 [5] <§§1.6, 1.9> Find the increase in CPU time if the number
of instructions of the benchmark is increased by 10% without
affecting the CPI.
1.11.4 [5] <§§1.6, 1.9> Find the increase in CPU time if the number
of instructions of the benchmark is increased by 10% and the
CPI is increased by 5%.
1.11.5 [5] <§§1.6, 1.9> Find the change in the SPECratio for this
change.
1.11.6 [10] <§1.6> Suppose that we are developing a new version
of the AMD Barcelona processor with a 4 GHz clock rate. We
have added some additional instructions to the instruction set
in such a way that the number of instructions has been reduced
by 15%. The execution time is reduced to 700 s and the new
SPECratio is 13.7. Find the new CPI.
1.11.7 [10] <§1.6> This CPI value is larger than obtained in 1.11.1
as the clock rate was increased from 3 GHz to 4 GHz.
Determine whether the increase in the CPI is similar to that of
the clock rate. If they are dissimilar, why?
1.11.8 [5] <§1.6> By how much has the CPU time been reduced?
1.11.9 [10] <§1.6> For a second benchmark, libquantum, assume
an execution time of 960 ns, CPI of 1.61, and clock rate of
3 GHz. If the execution time is reduced by an additional 10%
without affecting the CPI and with a clock rate of 4 GHz,
determine the number of instructions.
1.11.10 [10] <§1.6> Determine the clock rate required to give a
further 10% reduction in CPU time while maintaining the
number of instructions and with the CPI unchanged.
1.11.11 [10] <§1.6> Determine the clock rate if the CPI is reduced
by 15% and the CPU time by 20% while the number of
instructions is unchanged.
1.12 Section 1.10 cites as a pitfall the utilization of a subset of the
performance equation as a performance metric. To illustrate this,
consider the following two processors. P1 has a clock rate of
4 GHz, average CPI of 0.9, and requires the execution of 5.0E9
instructions. P2 has a clock rate of 3 GHz, an average CPI of 0.75,
and requires the execution of 1.0E9 instructions.
1.12.1 [5] <§§1.6, 1.10> One usual fallacy is to consider the
computer with the largest clock rate as having the highest
performance. Check if this is true for P1 and P2.
1.12.2 [10] <§§1.6, 1.10> Another fallacy is to consider that the
processor executing the largest number of instructions will
need a larger CPU time. Considering that processor P1 is
executing a sequence of 1.0E9 instructions and that the CPI of
processors P1 and P2 do not change, determine the number of
instructions that P2 can execute in the same time that P1 needs
to execute 1.0E9 instructions.
1.12.3 [10] <§§1.6, 1.10> A common fallacy is to use MIPS (millions
of instructions per second) to compare the performance of two
different processors, and consider that the processor with the
largest MIPS has the largest performance. Check if this is true
for P1 and P2.
1.12.4 [10] <§1.10> Another common performance figure is
MFLOPS (millions of floating-point operations per second),
defined as
MFLOPS = No. FP operations/(execution time × 1E6)
but this figure has the same problems as MIPS. Assume that 40% of the
instructions executed on both P1 and P2 are floating-point instructions. Find
the MFLOPS figures for the processors.
1.13 Another pitfall cited in Section 1.10 is expecting to improve the
overall performance of a computer by improving only one aspect
of the computer. Consider a computer running a program that
requires 250 s, with 70 s spent executing FP instructions, 85 s
spent executing L/S instructions, and 40 s spent executing branch
instructions.
1.13.1 [5] <§1.10> By how much is the total time reduced if the
time for FP operations is reduced by 20%?
1.13.2 [5] <§1.10> By how much is the time for INT operations
reduced if the total time is reduced by 20%?
1.13.3 [5] <§1.10> Can the total time be reduced by 20% by
reducing only the time for branch instructions?
1.14 Assume a program requires the execution of 50 × 10^6 FP
instructions, 110 × 10^6 INT instructions, 80 × 10^6 L/S instructions,
and 16 × 10^6 branch instructions. The CPI for each type of
instruction is 1, 1, 4, and 2, respectively. Assume that the
processor has a 2 GHz clock rate.
1.14.1 [10] <§1.10> By how much must we improve the CPI of FP
instructions if we want the program to run two times faster?
1.14.2 [10] <§1.10> By how much must we improve the CPI of L/S
instructions if we want the program to run two times faster?
1.14.3 [5] <§1.10> By how much is the execution time of the
program improved if the CPI of INT and FP instructions is
reduced by 40% and the CPI of L/S and Branch is reduced by
30%?
1.15 [5] <§1.8> When a program is adapted to run on multiple
processors in a multiprocessor system, the execution time on each
processor is comprised of computing time and the overhead time
required for locked critical sections and/or to send data from one
processor to another.
Assume a program requires t = 100 s of execution time on one
processor. When run on p processors, each processor requires t/p s,
as well as an additional 4 s of overhead, irrespective of the
number of processors. Compute the per-processor execution time
for 2, 4, 8, 16, 32, 64, and 128 processors. For each case, list the
corresponding speedup relative to a single processor and the
ratio between actual speedup versus ideal speedup (speedup if
there was no overhead).
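
The model described in 1.15 (compute time t/p plus a fixed 4 s of overhead) can be tabulated with a short Python sketch. The function names are illustrative and not from the text, and the ideal speedup is assumed to be p, as the exercise defines it.

```python
# Fixed-overhead parallel model: per-processor time, speedup, and
# the ratio of actual speedup to ideal speedup.
SERIAL_TIME_S = 100.0   # t, execution time on one processor
OVERHEAD_S = 4.0        # fixed overhead per processor, independent of p

def per_processor_time(p):
    return SERIAL_TIME_S / p + OVERHEAD_S

def speedup(p):
    return SERIAL_TIME_S / per_processor_time(p)

def speedup_vs_ideal(p):
    # With no overhead the speedup on p processors would be p.
    return speedup(p) / p

for p in (2, 4, 8, 16, 32, 64, 128):
    print(p, per_processor_time(p), round(speedup(p), 2), round(speedup_vs_ideal(p), 3))
```

Because the overhead does not shrink with p, the per-processor time can never drop below 4 s, so the speedup in this model is bounded by 100/4 = 25.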

Personal computer: Computer that emphasizes delivery of good performance
to a single user at low cost and usually executes third-party software.
Server: Computer used for large workloads and usually accessed via a network.
Embedded computer: Computer designed to run one application or one set
of related applications and integrated into a single system.
1.2
a. Performance via Pipelining
b. Dependability via Redundancy
c. Performance via Prediction
d. Make the Common Case Fast
e. Hierarchy of Memories
f. Performance via Parallelism
g. Use Abstraction to Simplify Design
1.3 The program is compiled into an assembly language program, which is itself
assembled into a machine-language program.
1.4
a. 1280 × 1024 pixels = 1,310,720 pixels => 1,310,720 × 3 = 3,932,160
bytes/frame.
b. 3,932,160 bytes × (8 bits/byte) /100E6 bits/second = 0.31 seconds
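
The arithmetic in 1.4 can be checked with a short Python sketch. The variable names are illustrative and not from the text; it assumes 3 bytes per pixel (8 bits per primary color) and a link that carries exactly 100E6 bits/second with no protocol overhead.

```python
# Frame-buffer size and minimum transfer time for a 1280 x 1024 display.
WIDTH, HEIGHT = 1280, 1024        # frame size in pixels
BYTES_PER_PIXEL = 3               # 8 bits each for red, green, blue
LINK_BITS_PER_SECOND = 100e6      # 100 Mbit/s network, no overhead assumed

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
transfer_seconds = frame_bytes * 8 / LINK_BITS_PER_SECOND

print(f"{frame_bytes} bytes, {transfer_seconds:.2f} s")
```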
1.5
a. performance of P1 (instructions/sec) = 3 × 10^9/1.5 = 2 × 10^9
performance of P2 (instructions/sec) = 2.5 × 10^9/1.0 = 2.5 × 10^9
performance of P3 (instructions/sec) = 4 × 10^9/2.2 = 1.8 × 10^9
P2 has the highest performance.
b. cycles(P1) = 10 × 3 × 10^9 = 30 × 10^9
cycles(P2) = 10 × 2.5 × 10^9 = 25 × 10^9
cycles(P3) = 10 × 4 × 10^9 = 40 × 10^9
No. instructions(P1) = 30 × 10^9/1.5 = 20 × 10^9
No. instructions(P2) = 25 × 10^9/1 = 25 × 10^9
No. instructions(P3) = 40 × 10^9/2.2 = 18.18 × 10^9
c. CPI_new = CPI_old × 1.2, then CPI(P1) = 1.8, CPI(P2) = 1.2, CPI(P3) = 2.64
time_new = 10 × 0.7 = 7 s
f = No. instr. × CPI/time, then
f(P1) = 20 × 10^9 × 1.8/7 = 5.14 GHz
f(P2) = 25 × 10^9 × 1.2/7 = 4.29 GHz
f(P3) = 18.18 × 10^9 × 2.64/7 = 6.86 GHz
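
A minimal Python sketch of the three steps in 1.5 follows. The function names and the dictionary layout are assumptions made for illustration; the clock rates and CPIs are the ones given in the exercise.

```python
# Instruction rate, cycle and instruction counts, and required clock rate
# for three processors that share an instruction set.
PROCESSORS = {"P1": (3.0e9, 1.5), "P2": (2.5e9, 1.0), "P3": (4.0e9, 2.2)}  # (clock Hz, CPI)

def instructions_per_second(clock_hz, cpi):
    return clock_hz / cpi

def cycles_and_instructions(clock_hz, cpi, seconds=10.0):
    cycles = clock_hz * seconds
    return cycles, cycles / cpi

def required_clock_hz(clock_hz, cpi, seconds=10.0, time_factor=0.7, cpi_factor=1.2):
    """Clock rate needed to run the same instructions in time_factor * seconds
    when the CPI grows by cpi_factor."""
    _, instructions = cycles_and_instructions(clock_hz, cpi, seconds)
    return instructions * (cpi * cpi_factor) / (seconds * time_factor)

for name, (clock_hz, cpi) in PROCESSORS.items():
    print(name, instructions_per_second(clock_hz, cpi), required_clock_hz(clock_hz, cpi) / 1e9)
```

Since instruction count × CPI equals the original cycle count, the required clock rate reduces to the old clock rate × 1.2/0.7 for every processor.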

1.6
a. Class A: 10^5 instr. Class B: 2 × 10^5 instr. Class C: 5 × 10^5 instr.
Class D: 2 × 10^5 instr.
Time = No. instr. × CPI/clock rate
Total time P1 = (10^5 + 2 × 10^5 × 2 + 5 × 10^5 × 3 + 2 × 10^5 × 3)/(2.5 × 10^9) = 10.4 × 10^-4 s
Total time P2 = (10^5 × 2 + 2 × 10^5 × 2 + 5 × 10^5 × 2 + 2 × 10^5 × 2)/(3 × 10^9) = 6.66 × 10^-4 s
CPI(P1) = 10.4 × 10^-4 × 2.5 × 10^9/10^6 = 2.6
CPI(P2) = 6.66 × 10^-4 × 3 × 10^9/10^6 = 2.0
b. clock cycles(P1) = 10^5 × 1 + 2 × 10^5 × 2 + 5 × 10^5 × 3 + 2 × 10^5 × 3 = 26 × 10^5
clock cycles(P2) = 10^5 × 2 + 2 × 10^5 × 2 + 5 × 10^5 × 2 + 2 × 10^5 × 2 = 20 × 10^5
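
The class-weighted calculation above (classes A to D with their own CPIs) can be sketched in Python as a weighted average. The names below are illustrative, not from the text; the instruction mix, CPIs, and clock rates are those stated for P1 and P2.

```python
# Weighted-average CPI and run time for two implementations of one ISA.
INSTRUCTION_COUNT = 1.0e6
CLASS_MIX = (0.10, 0.20, 0.50, 0.20)   # fractions for classes A, B, C, D

def global_cpi(class_cpis, mix=CLASS_MIX):
    return sum(fraction * cpi for fraction, cpi in zip(mix, class_cpis))

def run_time_seconds(class_cpis, clock_hz, count=INSTRUCTION_COUNT):
    return count * global_cpi(class_cpis) / clock_hz

p1_cpi = global_cpi((1, 2, 3, 3))
p2_cpi = global_cpi((2, 2, 2, 2))
p1_time = run_time_seconds((1, 2, 3, 3), 2.5e9)
p2_time = run_time_seconds((2, 2, 2, 2), 3.0e9)
print(p1_cpi, p2_cpi, "P2 faster" if p2_time < p1_time else "P1 faster")
```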

1.7
a. CPI = T_exec × f/No. instr.
Compiler A CPI = 1.1
Compiler B CPI = 1.25
b. f_B/f_A = (No. instr.(B) × CPI(B))/(No. instr.(A) × CPI(A)) = 1.36
c. T_A/T_new = 1.67
T_B/T_new = 2.27
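
The compiler comparison above can be reproduced with a few lines of Python. This is a sketch with illustrative names; it assumes the 1 ns cycle time from the exercise, that is, a 1 GHz clock.

```python
# CPI per compiler, clock ratio for equal run time, and speedup of a new compiler.
CLOCK_HZ = 1.0e9   # 1 ns clock cycle time

def cpi(exec_time_s, instruction_count, clock_hz=CLOCK_HZ):
    return exec_time_s * clock_hz / instruction_count

cpi_a = cpi(1.1, 1.0e9)
cpi_b = cpi(1.5, 1.2e9)

# Equal execution times imply f_B / f_A = (instr_B * CPI_B) / (instr_A * CPI_A).
clock_ratio_b_over_a = (1.2e9 * cpi_b) / (1.0e9 * cpi_a)

# New compiler: 6.0E8 instructions at an average CPI of 1.1 on the original processor.
new_time_s = 6.0e8 * 1.1 / CLOCK_HZ
speedup_vs_a = 1.1 / new_time_s
speedup_vs_b = 1.5 / new_time_s
print(cpi_a, cpi_b, clock_ratio_b_over_a, speedup_vs_a, speedup_vs_b)
```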
1.8
1.8.1 C = 2 × DP/(V^2 × F)
Pentium 4: C = 3.2E-8 F
Core i5 Ivy Bridge: C = 2.9E-8 F
1.8.2 Pentium 4: static/total = 10/100 = 10%; static/dynamic = 10/90 = 11.1%
Core i5 Ivy Bridge: static/total = 30/70 = 42.9%; static/dynamic = 30/40 = 75%
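
A small Python sketch of the capacitive-load and static-power calculations follows. The function names are hypothetical; the formula is the rearrangement C = 2 × DP/(V^2 × F) used above, with DP the dynamic power.

```python
# Capacitive load from dynamic power, and the static share of total power.
def capacitive_load_farads(dynamic_w, voltage_v, clock_hz):
    # Rearranged from dynamic power = 1/2 * C * V^2 * F.
    return 2 * dynamic_w / (voltage_v ** 2 * clock_hz)

def static_share(static_w, dynamic_w):
    return static_w / (static_w + dynamic_w)

pentium4_c = capacitive_load_farads(90, 1.25, 3.6e9)
core_i5_c = capacitive_load_farads(40, 0.9, 3.4e9)
print(pentium4_c, core_i5_c, static_share(10, 90), static_share(30, 40))
```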
1.8.3 (S_new + D_new)/(S_old + D_old) = 0.90
D_new = 1/2 × C × V_new^2 × F
S_old = V_old × I
S_new = V_new × I
Therefore:
V_new = [2 × D_new/(C × F)]^(1/2)
D_new = 0.90 × (S_old + D_old) - S_new
S_new = V_new × (S_old/V_old)
Pentium 4:
S_new = V_new × (10/1.25) = V_new × 8
D_new = 0.90 × 100 - V_new × 8 = 90 - V_new × 8
V_new = [2 × (90 - V_new × 8)/(3.2E-8 × 3.6E9)]^(1/2)
V_new = 1.18 V
Core i5:
S_new = V_new × (30/0.9) = V_new × 33.3
D_new = 0.90 × 70 - V_new × 33.3 = 63 - V_new × 33.3
V_new = [2 × (63 - V_new × 33.3)/(2.9E-8 × 3.4E9)]^(1/2)
V_new = 0.84 V
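
The implicit equation for V_new is a quadratic, so it can be solved in closed form. The Python sketch below is an illustration under stated assumptions, not the textbook's method: static power is taken to scale with V (constant leakage current), and dynamic power with V^2 (capacitive load and frequency unchanged, following dynamic power = 1/2 × C × V^2 × F). The function name is hypothetical.

```python
import math

def reduced_voltage(static_w, dynamic_w, voltage_v, power_factor=0.90):
    """Voltage at which static + dynamic power equals power_factor * original total.

    Assumes static power scales with V (constant leakage current) and dynamic
    power scales with V^2 (capacitive load and frequency unchanged).
    """
    target_w = power_factor * (static_w + dynamic_w)
    # With x = V_new / V_old: dynamic_w * x^2 + static_w * x - target_w = 0.
    x = (-static_w + math.sqrt(static_w ** 2 + 4 * dynamic_w * target_w)) / (2 * dynamic_w)
    return x * voltage_v

print(round(reduced_voltage(10, 90, 1.25), 2))   # Pentium 4 Prescott
print(round(reduced_voltage(30, 40, 0.9), 2))    # Core i5 Ivy Bridge
```

Working with the ratio V_new/V_old avoids recomputing the capacitive load, because C and F cancel when the new dynamic power is expressed relative to the old one.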

1.10
1.10.1 die area_15cm = wafer area/dies per wafer = π × 7.5^2/84 = 2.10 cm^2
yield_15cm = 1/(1 + (0.020 × 2.10/2))^2 = 0.9593
die area_20cm = wafer area/dies per wafer = π × 10^2/100 = 3.14 cm^2
yield_20cm = 1/(1 + (0.031 × 3.14/2))^2 = 0.9093
1.10.2 cost/die_15cm = 12/(84 × 0.9593) = 0.1489
cost/die_20cm = 15/(100 × 0.9093) = 0.1650
1.10.3 die area_15cm = wafer area/dies per wafer = π × 7.5^2/(84 × 1.1) = 1.91 cm^2
yield_15cm = 1/(1 + (0.020 × 1.15 × 1.91/2))^2 = 0.9575
die area_20cm = wafer area/dies per wafer = π × 10^2/(100 × 1.1) = 2.86 cm^2
yield_20cm = 1/(1 + (0.031 × 1.15 × 2.86/2))^2 = 0.9053
1.10.4 defects per area_0.92 = (1 - y^0.5)/(y^0.5 × die_area/2) = (1 - 0.92^0.5)/(0.92^0.5 × 2/2) = 0.043 defects/cm^2
defects per area_0.95 = (1 - y^0.5)/(y^0.5 × die_area/2) = (1 - 0.95^0.5)/(0.95^0.5 × 2/2) = 0.026 defects/cm^2
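
The wafer calculations above can be collected into a few Python helpers. This is a sketch with hypothetical function names; it uses the yield model from the solution, yield = 1/(1 + defects × die area/2)^2, and approximates die area as wafer area divided by dies per wafer.

```python
import math

def die_area_cm2(wafer_diameter_cm, dies_per_wafer):
    return math.pi * (wafer_diameter_cm / 2) ** 2 / dies_per_wafer

def die_yield(defects_per_cm2, area_cm2):
    return 1 / (1 + defects_per_cm2 * area_cm2 / 2) ** 2

def cost_per_die(wafer_cost, dies_per_wafer, yield_fraction):
    return wafer_cost / (dies_per_wafer * yield_fraction)

def defects_per_cm2(yield_fraction, area_cm2):
    # Inverse of die_yield: solve yield = 1/(1 + d * A/2)^2 for d.
    return (1 / math.sqrt(yield_fraction) - 1) / (area_cm2 / 2)

area_15 = die_area_cm2(15, 84)
yield_15 = die_yield(0.020, area_15)
area_20 = die_area_cm2(20, 100)
yield_20 = die_yield(0.031, area_20)
print(yield_15, cost_per_die(12, 84, yield_15))
print(yield_20, cost_per_die(15, 100, yield_20))
```

The helpers carry full precision between steps, so the last digit can differ slightly from hand calculations that round the die area first.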
1.11
1.11.1 CPI = clock rate × CPU time/instr. count
clock rate = 1/cycle time = 3 GHz
CPI(bzip2) = 3 × 10^9 × 750/(2389 × 10^9) = 0.94
1.11.2 SPECratio = ref. time/execution time
SPECratio(bzip2) = 9650/750 = 12.86
1.11.3 CPU time = No. instr. × CPI/clock rate
If CPI and clock rate do not change, the CPU time increase is equal to the
increase in the number of instructions, that is 10%.
1.11.4 CPU time(before) = No. instr. × CPI/clock rate
CPU time(after) = 1.1 × No. instr. × 1.05 × CPI/clock rate
CPU time(after)/CPU time(before) = 1.1 × 1.05 = 1.155. Thus, CPU time is
increased by 15.5%.
1.11.5 SPECratio = reference time/CPU time
SPECratio(after)/SPECratio(before) = CPU time(before)/CPU time(after)
= 1/1.155 = 0.866. The SPECratio is decreased by about 13.4%.
1.11.6 CPI = (CPU time × clock rate)/No. instr.
CPI = 700 × 4 × 10^9/(0.85 × 2389 × 10^9) = 1.37
1.11.7 Clock rate ratio = 4 GHz/3 GHz = 1.33
CPI @ 4 GHz = 1.37, CPI @ 3 GHz = 0.94, ratio = 1.45
They are different because, although the number of instructions has been
reduced by 15%, the CPU time has been reduced by a lower percentage.
1.11.8 700/750 = 0.933. CPU time reduction: 6.7%
1.11.9 No. instr. = CPU time × clock rate/CPI
No. instr. = 960 × 0.9 × 4 × 10^9/1.61 = 2146 × 10^9
1.11.10 Clock rate = No. instr. × CPI/CPU time.
Clock rate_new = No. instr. × CPI/(0.9 × CPU time) = (1/0.9) × clock rate_old = 3.33 GHz
1.11.11 Clock rate = No. instr. × CPI/CPU time
Clock rate_new = No. instr. × 0.85 × CPI/(0.80 × CPU time) = (0.85/0.80) × clock rate_old = 3.18 GHz
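
The bzip2 figures above follow from two formulas, CPI = CPU time × clock rate/instruction count and SPECratio = reference time/execution time. The Python sketch below applies them at full precision; the names are illustrative, and the 3 GHz clock is the rounded value of 1/0.333 ns used in the solution.

```python
# CPI, SPECratio, and the effect of scaling instruction count and CPI.
INSTRUCTIONS = 2.389e12
EXEC_TIME_S = 750.0
REFERENCE_TIME_S = 9650.0
CLOCK_HZ = 3.0e9   # from the 0.333 ns cycle time, rounded to 3 GHz

def cpi(exec_time_s, clock_hz, instruction_count):
    return exec_time_s * clock_hz / instruction_count

def spec_ratio(reference_time_s, exec_time_s):
    return reference_time_s / exec_time_s

base_cpi = cpi(EXEC_TIME_S, CLOCK_HZ, INSTRUCTIONS)
base_ratio = spec_ratio(REFERENCE_TIME_S, EXEC_TIME_S)

time_growth = 1.10 * 1.05          # 10% more instructions, 5% higher CPI
ratio_change = 1 / time_growth     # SPECratio scales inversely with CPU time

# 4 GHz version: 15% fewer instructions, 700 s execution time.
new_cpi = cpi(700.0, 4.0e9, 0.85 * INSTRUCTIONS)
print(base_cpi, base_ratio, time_growth, ratio_change, new_cpi)
```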
1.12
1.12.1 T(P1) = 5 × 10^9 × 0.9/(4 × 10^9) = 1.125 s
T(P2) = 10^9 × 0.75/(3 × 10^9) = 0.25 s
clock rate(P1) > clock rate(P2), performance(P1) < performance(P2)
1.12.2 T(P1) = No. instr. × CPI/clock rate
T(P1) = 2.25 × 10^-1 s
T(P2) = N × 0.75/(3 × 10^9), then N = 9 × 10^8
1.12.3 MIPS = Clock rate × 10^-6/CPI
MIPS(P1) = 4 × 10^9 × 10^-6/0.9 = 4.44 × 10^3
MIPS(P2) = 3 × 10^9 × 10^-6/0.75 = 4.0 × 10^3
MIPS(P1) > MIPS(P2), performance(P1) < performance(P2) (from 1.12.1)
1.12.4 MFLOPS = No. FP operations × 10^-6/T
MFLOPS(P1) = .4 × 5E9 × 1E-6/1.125 = 1.78E3
MFLOPS(P2) = .4 × 1E9 × 1E-6/.25 = 1.60E3
MFLOPS(P1) > MFLOPS(P2), performance(P1) < performance(P2) (from 1.12.1)
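
The comparison of P1 and P2 above can be sketched in Python to show how a higher MIPS or MFLOPS figure can coexist with a longer execution time. The function names are illustrative; the 40% floating-point fraction is the one assumed in the exercise.

```python
# Execution time, MIPS, and MFLOPS for two processors running different
# instruction counts.
def exec_time_s(instruction_count, cpi, clock_hz):
    return instruction_count * cpi / clock_hz

def mips(clock_hz, cpi):
    return clock_hz / (cpi * 1e6)

def mflops(instruction_count, fp_fraction, time_s):
    return fp_fraction * instruction_count / (time_s * 1e6)

t_p1 = exec_time_s(5.0e9, 0.9, 4.0e9)
t_p2 = exec_time_s(1.0e9, 0.75, 3.0e9)
print(t_p1, mips(4.0e9, 0.9), mflops(5.0e9, 0.4, t_p1))
print(t_p2, mips(3.0e9, 0.75), mflops(1.0e9, 0.4, t_p2))
```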
1.13
1.13.1 T_fp = 70 × 0.8 = 56 s. T_new = 56 + 85 + 55 + 40 = 236 s. Reduction: 5.6%
1.13.2 T_new = 250 × 0.8 = 200 s, T_fp + T_l/s + T_branch = 195 s, T_int = 5 s. Reduction in INT time: 50/55 = 90.9%
1.13.3 T_new = 250 × 0.8 = 200 s, T_fp + T_int + T_l/s = 210 s. No.
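
The reasoning above, that only one component of the total time changes while the rest stays fixed, can be sketched in Python. The names are hypothetical; the times are the ones stated in the exercise (250 s total, 70 s FP, 85 s L/S, 40 s branch), which leaves 55 s for INT instructions.

```python
# Effect of speeding up one instruction class on the total execution time.
TOTAL_S = 250.0
TIMES_S = {"fp": 70.0, "ls": 85.0, "branch": 40.0}
TIMES_S["int"] = TOTAL_S - sum(TIMES_S.values())   # remaining 55 s

def total_after_reduction(kind, fraction):
    """Total time after cutting one class's time by the given fraction."""
    return TOTAL_S - TIMES_S[kind] * fraction

def class_time_needed(kind, target_total_s):
    """Time the class may take for the total to reach the target.
    A negative result means the target cannot be met by this class alone."""
    return target_total_s - (TOTAL_S - TIMES_S[kind])

print(total_after_reduction("fp", 0.20))
print(class_time_needed("int", 0.8 * TOTAL_S))
print(class_time_needed("branch", 0.8 * TOTAL_S))
```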
1.14
1.14.1 Clock cycles = CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr.
T_CPU = clock cycles/clock rate = clock cycles/(2 × 10^9)
clock cycles = 512 × 10^6; T_CPU = 0.256 s
To halve the number of clock cycles by improving the CPI of FP instructions:
CPI_improved fp × No. FP instr. + CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr. = clock cycles/2
CPI_improved fp = (clock cycles/2 - (CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr.))/No. FP instr.
CPI_improved fp = (256 - 462)/50 < 0 ==> not possible
1.14.2 Using the clock cycle data from 1.14.1.
To halve the number of clock cycles by improving the CPI of L/S instructions:
CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_improved l/s × No. L/S instr. + CPI_branch × No. branch instr. = clock cycles/2
CPI_improved l/s = (clock cycles/2 - (CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_branch × No. branch instr.))/No. L/S instr.
CPI_improved l/s = (256 - 192)/80 = 0.8
1.14.3 Clock cycles = CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr.
T_CPU = clock cycles/clock rate = clock cycles/(2 × 10^9)
CPI_int = 0.6 × 1 = 0.6; CPI_fp = 0.6 × 1 = 0.6; CPI_l/s = 0.7 × 4 = 2.8; CPI_branch = 0.7 × 2 = 1.4
T_CPU (before improv.) = 0.256 s; T_CPU (after improv.) = 0.171 s
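
The cycle-count bookkeeping above can be sketched in Python. The dictionary keys and function names are illustrative; the instruction counts, CPIs, and 2 GHz clock are those stated in the exercise (50, 110, 80, and 16 million instructions with CPIs of 1, 1, 4, and 2).

```python
# Clock cycles by instruction class, and the CPI one class would need
# for the whole program to run a given factor faster.
COUNTS = {"fp": 50e6, "int": 110e6, "ls": 80e6, "branch": 16e6}
CPIS = {"fp": 1.0, "int": 1.0, "ls": 4.0, "branch": 2.0}
CLOCK_HZ = 2.0e9

def total_cycles(cpis):
    return sum(COUNTS[kind] * cpis[kind] for kind in COUNTS)

def cpi_needed_for_speedup(kind, speedup=2.0):
    """CPI required for one class so that total cycles shrink by `speedup`.
    A negative result means that class alone cannot deliver the speedup."""
    target = total_cycles(CPIS) / speedup
    others = total_cycles(CPIS) - COUNTS[kind] * CPIS[kind]
    return (target - others) / COUNTS[kind]

improved = {"fp": 0.6 * 1.0, "int": 0.6 * 1.0, "ls": 0.7 * 4.0, "branch": 0.7 * 2.0}
time_before_s = total_cycles(CPIS) / CLOCK_HZ
time_after_s = total_cycles(improved) / CLOCK_HZ
print(cpi_needed_for_speedup("fp"), cpi_needed_for_speedup("ls"), time_before_s, time_after_s)
```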
