William Stallings
Computer Organization
and Architecture
8th Edition
Chapter 2
Computer Evolution and
Performance
Performance
• Why is some hardware better than others
for different programs?
• What factors of system performance are
hardware related? (e.g., Do we need a
new machine, or a new operating
system?)
• How does the machine's instruction set
affect performance?
What determines performance of a
machine?
Performance evaluation of the airplane?
Airplane Passengers Range (mi) Speed (mph)
Boeing 777 375 4630 610
Boeing 747 470 4150 610
BAC/Sud Concorde 132 4000 1350
Douglas DC-8-50 146 8720 544
Performance evaluation of the airplane?
Airplane Passengers Range (mi) Speed (mph)
Boeing 777 375 4630 610
Boeing 747 470 4150 610
BAC/Sud Concorde 132 4000 1350
Douglas DC-8-50 146 8720 544
How much faster is the Concorde compared to the 747?
How much bigger is the 747 than the Douglas DC-8?
Example of Performance Measure
What is “Performance”
• Processor frequency?
—Is it better an Intel Pentium IV at 3.2GHz or an
AMD at 2.2GHz? And how much better the best
one is?
• Memory speed? Cache efficiency?
—Is it better to have 8M of cache or 16M of cache?
Is it better to have a large single-level cache or a
very large slower L2 cache and a very fast small
L1 cache?
Time Time Time Time!!
• Time can be defined in different ways, depending
on what we are measuring:
—Response time : Total time to complete a
task, including time spent executing on the
CPU, accessing disk and memory, waiting for
I/O and other processes, and operating
system overhead.
—CPU execution time : Total time a CPU
spends computing on a given task (excludes
time for I/O or running other programs). This
is also referred to as simply CPU time.
—User CPU time : Total time CPU spends in the
program
—System CPU execution time : Total time
operating system spends executing tasks for
the program.
Performance Metrics
• Computer Designers use clock ticks to
measure how fast the hardware can
perform basic functions
• Clock ticks running at constant rate
—Determines when events take place
—Synchronize activities
—Discrete time intervals are called Clock ticks,
ticks, clock periods, clocks, cycles
Clock Cycles
• Instead of using seconds to measure execution
time, often we use clock cycles, clock ticks, clock
periods, clocks, or cycles.
—Clock rate (frequency) = cycles per second.
—Measured in Hertz (1 Hz = 1 cycle/s).
• Clock period is the time between ticks of the
clock and is measured in seconds per cycle.
—Period = 1/frequency
• Example: A 200 MHz (MegaHertz) clock has a
clock period of 5nanoseconds
• Warning: Some people refer to the clock period
as the clock rate; they are not the same thing!
Relating the Time & Clock Metrics
• Determine effect of design change on
performance
• CPU execution time = CPU clock cycles X
clock cycle time
• CPU execution time = CPU clock
cycles/clock rate
Definition of Performance
• For some program running on machine X,
PerformanceX = 1 / Execution timeX
• "X is n times faster than Y"
PerformanceX / PerformanceY = n
Problem:
—machine A runs a program in 20 seconds
—machine B runs the same program in 25
seconds
How to Improve Performance
seconds cycles seconds
program program cycle
So, to improve performance (everything else
being equal) you can either
Decrease the # of required cycles for a
program, or
Decrease the clock cycle time or, said
another way,
Increase the clock rate.
Different number of cycles for different
instructions
• Multiplication takes more time than
addition
• Floating point operations take longer than
integer ones
• Accessing memory takes more time than
accessing registers
How many cycles are required for a
program?
• Could assume that # of cycles = # of
instructions
2nd instruction
3rd instruction
1st instruction
time
4th
5th
6th
...
This assumption is incorrect, different instructions take different
amounts of time on different machines.
Why? hint: remember that these are machine instructions, not lines of C code
Example
• Our favorite program runs in 10 seconds on computer A,
which has a 400 Mhz. clock. We are trying to help a computer
designer build a new machine B, that will run this program in
6 seconds. The designer can use new (or perhaps more
expensive) technology to substantially increase the clock rate,
but has informed us that this increase will affect the rest of the
CPU design, causing machine B to require 1.2 times as many
clock cycles as machine A for the same program. What clock
rate should we tell the designer to target?"
CPU Performance Equation
• An alternative to "number of clock
cycles" is "number of instructions
executed" or Instruction Count ( IC ).
• Given both the "number of clock
cycles" and IC of a program, the
average Clocks Per Instruction ( CPI )
is given by:
CPU Clock Cycles of a Program
CPI
IC
IC CPI
CPU time IC CPI Clock cycle time
Clock rate
CPI Example
• Suppose we have two implementations of
the same instruction set architecture.
Machine A has a clock cycle time of 1 ns
(nanoseconds) and a CPI of 2.0 for some
program, and machine B has a clock cycle
time of 2 ns and a CPI of 1.2 for the same
program.
• Which machine is faster for this program,
and how much faster is it?
Reduce the CPU Time
• Reduce instruction count
—Which depends on instruction set architecture
+ compiler technology.
• Reduce CPI
—Which depends on system organization and
instruction set architecture.
• Reduce clock cycle time (increase rate)
—Which depends on technology and system
organization
Reduce the CPU Time
• Notice that reducing one measure may
increase another.
• For example, one can reduce instruction
count by the introduction of complex
instructions which carry out several basic
operations.
# of Instructions Example
• A compiler designer is trying to decide between
two code sequences for a particular machine.
Based on the hardware implementation, there
are three different classes of instructions: Class
A, Class B, and Class C, and they require one,
two, and three cycles (respectively).
The first code sequence has 5 instructions: 2 of
A, 1 of B, and 2 of C. The second sequence has 6
instructions: 4 of A, 1 of B, and 1 of C.
Which sequence will be faster? How much?
What is the CPI for each sequence?
Amdahl’s law
• The performance improvement to be
gained from using some faster mode of
execution is limited by the fraction of the
time the faster mode can be used.
• This implies that the time consumed by
events whose performance is not
improved limits the effect of any
improvement.
—Lowest performer restricts all others.
Amdahl’s law….
The parameter to use in measuring the effect
of Amdahl's Law is speedup:
Performanc e using enhancemen t
Speedup
Performanc e without using enhancemen t
or
Execution time without enhancemen t
Speedup
Execution time with enhancemen t
Speedup depends on two factors
• The fraction of the computation time in the
original machine that can be converted to
take advantage of the enhancement
Fraction enhanced 1
• The improvement gained by the enhanced
execution mode
Speedup enhanced 1
Example
• Trip from point A to point B in
two parts
A 20 C 50/20/4/1.7/0.3 B
A-C Trip C-B Trip Total Time C-B Speedup Overall Speedup
20 50 70 1 1
20 20 40 2.5 1.75
20 4 24 12.5 2.9
20 1.7 21.7 29.4 3.2
20 0.3 20.3 166.66 3.4
Amdahl’s law
Exec timenew = execution time after some
enhancement
Exec timeold = execution time before any
enhancement
Fractionenhanced = fraction of work using the
enhancement
Speedupenhanced = speedup of enhanced mode
Example
• Suppose that we are considering an
enhancement to the processor of a server system
used for web serving. The new CPU is 10 times
faster on computation in the Web serving
application that the original processor. Assuming
that the original CPU is busy with computation
40% of the time and is waiting for I/O 60% of
the time, what is the overall speedup gained by
incorporating the enhancement?
Amdahl’s law Example
• Consider an enhancement that takes 20ns
on a machine with enhancement and
100ns on a machine without. Assume
enhancement can only be used 30% of
the time.
• What is the overall speedup?
Example
A common transformation required in graphics processors is
square root. Implementations of floating-point (FP) square root
vary significantly in performance, especially among processors
designed for graphics. Suppose FP square root (FPSQR) is
responsible for 20% of the execution time of a critical graphics
benchmark. One proposal is to enhance the FPSQR hardware and
speed up this operation by a factor of 10. The other alternative is
just to try to make all FP instructions in the graphics processor run
faster by a factor of 1.6; FP instructions are responsible for half of
the execution time for the application. The design team believes
that they can make all FP instructions run 1.6 times faster with the
same effort as required for the fast square root. Compare these two
design alternatives.