CS M151B / EE M116C
Computer Systems Architecture
Performance
Instructor: Prof. Lei He
<LHE@ee.ucla.edu>
Some notes adapted from Glenn Reinman
Time vs Throughput
Vehicle      Time to San Diego*   Speed     Passengers   Throughput (pmph = passenger-miles per hour)
Ferrari      0.75 hours           160 mph        2        320
Greyhound    2 hours               65 mph       60       3900
* obviously this does not include LA traffic!
Time: time to do the task from start to finish
(execution time, response time, latency)
Throughput: tasks per unit time
(also called bandwidth)
Time vs Throughput
Time is measured in time units/job.
Throughput is measured in jobs/time unit.
But time = 1/throughput may be false.
It takes 4 months to grow a tomato.
Can you only grow 3 tomatoes a year?
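(No: by overlapping, e.g. starting a new plant each month, you can harvest roughly one tomato per month once the first batch is ready, so throughput can exceed 1/latency.)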
If you run only one job at a time,
time = 1/throughput
How To Measure Execution Time?
% time program
... program's results ...
90.7u 12.9s 2:39 65%
%
90.7u + 12.9s = user + kernel (system) CPU time
2:39 = wallclock (elapsed) time; 65% = (user + kernel) / wallclock
user CPU time? (time CPU spends running your code)
total CPU time (user + kernel)? (includes op. sys. code)
Wallclock time? (total elapsed time)
Includes time spent waiting for I/O, other users, ...
Answer depends ...
For measuring processor speed, we can use total CPU time.
If there is no I/O and there are no interrupts, wallclock may be better:
more precise (microseconds rather than 1/100 sec)
can measure individual sections of code
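Where timing individual sections of code is wanted, a library timer can be read directly. Below is a minimal C sketch (POSIX assumed; work() and its loop count are placeholders) that reports both wallclock and CPU time for one section:

  /* Minimal sketch of timing a section of code on a POSIX system.
   * work() is a placeholder workload, not part of the lecture. */
  #include <stdio.h>
  #include <time.h>

  static void work(void) {
      volatile double x = 0.0;                    /* placeholder computation */
      for (int i = 0; i < 1000000; i++) x += i * 0.5;
  }

  int main(void) {
      struct timespec t0, t1;
      clock_t c0, c1;

      clock_gettime(CLOCK_MONOTONIC, &t0);        /* wallclock (elapsed) time */
      c0 = clock();                               /* process CPU time */
      work();
      c1 = clock();
      clock_gettime(CLOCK_MONOTONIC, &t1);

      double wall = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
      double cpu  = (double)(c1 - c0) / CLOCKS_PER_SEC;
      printf("wallclock = %.6f s, CPU = %.6f s\n", wall, cpu);
      return 0;
  }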
Performance
For performance, larger should be better.
Time is backwards - larger execution time is worse.
CPU performance = 1 / total CPU time
System performance = 1 / wallclock time
These terms only make sense if you know what program is being measured,
e.g. "The performance on Linpack was 200 MFLOPS."
They also assume the CPU or system works on only one program at a time,
which is no longer true in general!
Performance's units, inverse seconds, can be awkward,
so "What was the performance?" is often answered with "It took 15 seconds."
Cycles
Every conventional processor has a clock with a
fixed cycle time or clock rate
Rate often measured in MHz = millions of cycles/second
Time often measured in ns (nanoseconds)
X MHz corresponds to 1000/X ns (e.g. 500 MHz gives a 2 ns clock)
How many cycles are required for a given program?
# cycles = # instructions?
Does a multiply take as long as an add?
Floating point ops versus integer ops?
Memory Latency?
# cycles depends on
architecture (i.e. how many cycles a given instruction type
will take)
the instruction makeup of the program being evaluated
Definitions
CPU Time = CPU cycles executed * cycle time
CPU cycles = Instructions executed * CPI
CPI = average clock cycles per instruction
Putting It All Together
One of P&H's "big pictures":

CPU Execution Time = Instruction Count x CPI x Clock Cycle Time
(seconds = instructions x cycles/instruction x seconds/cycle)
Note: Instruction count
Use dynamic instruction count (# instructions executed),
NOT static instruction count (# instructions in the compiled code).
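As a quick illustration (a hypothetical loop, not from the text): the static count of a loop is fixed by the compiled code, while the dynamic count grows with the number of iterations.

  /* Static vs dynamic instruction count: a sketch with rough numbers. */
  long sum_to(long n) {
      long sum = 0;
      for (long i = 0; i < n; i++)   /* loop body: a handful of machine instructions (static count) */
          sum += i;
      return sum;                    /* dynamic count grows roughly in proportion to n */
  }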
Who Impacts Performance?
CPU Execution Time = Instruction Count x CPI x Clock Cycle Time

Each of the following affects one or more of these factors:
Programmer
Compiler Writer
ISA Architect
Machine Architect
Hardware Designer
Materials Scientist
Physicist
Silicon Engineer
Explaining Performance Variation
CPU Execution Time = Instruction Count x CPI x Clock Cycle Time

Same machine, different programs
Same program, different machines, but same ISA
Same program, different ISAs

Which factors in the equation can vary in each case?
Comparing Performance
The fundamental question:
Will computer A run program P
faster than computer B?
Compare clock rates?
Compare CPI?
MIPS?
Millions of Instructions Per Second
MIPS = (Instruction Count) / (Execution Time x 10^6)
     = (Clock Rate) / (CPI x 10^6)
MFLOPS?
Example from the Text
Execution Time (in seconds):

              Computer A   Computer B
Program 1           1           10
Program 2        1000          100
Total Time       1001          110
Which is faster?
Performance_B / Performance_A = Execution Time_A / Execution Time_B = 1001 / 110 = 9.1
But this assumes each program has equal weight
Program 1 is executed 30% of the time
Program 2 is executed 70% of the time
How does this change the above calculation?
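Under one reading of those weights (a weighted average of the per-run times):
Weighted time_A = 0.3 x 1 + 0.7 x 1000 = 700.3 s
Weighted time_B = 0.3 x 10 + 0.7 x 100 = 73 s
Performance_B / Performance_A = 700.3 / 73 ≈ 9.6, so B still comes out faster.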
Comparing Speeds ...
"Computer X is 3 times faster than Y"
"times faster than" (or "times as fast as") means
there's a multiplicative factor relating the quantities:
"X was 3 times faster than Y" means speed(X) = 3 x speed(Y)
"percent faster than" implies an additive relationship:
"X was 25% faster than Y" means speed(X) = (1 + 25/100) x speed(Y)
"percent slower than" implies subtraction:
"X was 5% slower than Y" means speed(X) = (1 - 5/100) x speed(Y)
100% slower means it doesn't move at all!
"times slower than" or "times as slow as" is awkward:
"X was 3 times slower than Y" means speed(X) = (1/3) x speed(Y)
If X is 5% faster than Y, is Y 5% slower than X?
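(One way to check: if speed(X) = 1.05 x speed(Y), then speed(Y) = speed(X)/1.05 ≈ 0.952 x speed(X), i.e. Y is about 4.8% slower, not 5%.)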
CPI as a Weighted Average
Suppose a 1 GHz computer ran this short program:
Load (4 cycles), Shift (1), Add (1), Store (4).
Half the instructions have CPI = 4 and half have CPI = 1,
so the weighted average CPI = (1/2) x 4 + (1/2) x 1 = 2.5.
Time = 4 instructions x 2.5 CPI x 1 ns = 10 ns
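A tiny C sketch of the same calculation (the instruction mix and clock are the made-up values above):

  /* Weighted-average CPI and execution time for the four-instruction example. */
  #include <stdio.h>

  int main(void) {
      double cpi[]    = {4.0, 1.0, 1.0, 4.0};   /* load, shift, add, store */
      int    n        = 4;                      /* dynamic instruction count */
      double cycle_ns = 1.0;                    /* 1 GHz clock -> 1 ns cycle */

      double total_cycles = 0.0;
      for (int i = 0; i < n; i++)
          total_cycles += cpi[i];

      double avg_cpi = total_cycles / n;        /* 2.5 */
      double time_ns = n * avg_cpi * cycle_ns;  /* 10 ns */
      printf("average CPI = %.2f, time = %.1f ns\n", avg_cpi, time_ns);
      return 0;
  }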
Benchmarks
A benchmark is a set of programs that are
representative of a class of problems.
We want reproducible results!
Microbenchmarks: measure one feature of the system
e.g. memory accesses or communication speed
Kernels: the most compute-intensive part of an application
e.g. Linpack and the NAS kernel benchmarks (for supercomputers)
Full applications:
e.g. SPEC (int and float)
The SPEC benchmarks
SPEC = System Performance Evaluation Cooperative
(see www.specbench.org)
A set of real applications along with strict guidelines for
how to run them.
Relatively unbiased means to compare machines.
Very often used to evaluate architectural ideas
New versions in '89, '92, '95, 2000, 2004, ...
SPEC 95 didn't really use enough memory
Results are speedups compared to a reference machine
SPEC 95: Sun SPARCstation 10/40 performance = 1
SPEC 2000: Sun Ultra 5 performance = 100
Geometric mean used to average results
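A small sketch of combining per-benchmark ratios with a geometric mean (the ratios below are made-up numbers, not real SPEC results):

  /* Geometric mean of per-benchmark speedup ratios (hypothetical values). */
  #include <stdio.h>
  #include <math.h>                             /* link with -lm */

  int main(void) {
      double ratios[] = {12.0, 8.5, 15.2, 9.8}; /* made-up per-benchmark ratios */
      int n = sizeof(ratios) / sizeof(ratios[0]);

      double log_sum = 0.0;
      for (int i = 0; i < n; i++)
          log_sum += log(ratios[i]);            /* sum of logs, then exponentiate */

      printf("geometric mean = %.2f\n", exp(log_sum / n));
      return 0;
  }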
Don't Forget Compiler Performance
[Figure: darker bars show performance with compiler improvements, same machine as the lighter bars]
The SPEC CPU 2000 Suite
SPECint2000: 12 C programs (Unix or NT)
gzip and bzip2 - compression
gcc - compiler; 205K lines of messy code!
crafty - chess program
parser - word processing
vortex - object-oriented database
perlbmk - PERL interpreter
eon - computer visualization
vpr, twolf - CAD tools for VLSI
mcf, gap - combinatorial programs
SPECfp2000: 10 Fortran and 3 C programs
scientific application programs (physics, chemistry, image processing, number theory, ...)
SPEC on Pentium III and Pentium 4
[Figure not reproduced]
Amdahl's Law
Suppose:
total program time = time on part A + time on part B,
and you improve part A to run p times faster, then:
improved time = (time on part A)/p + time on part B.
The impact of a performance improvement is
limited by the percent of execution time affected
by the improvement.
Execution time after improvement =
    (Execution Time Affected / Amount of Improvement) + Execution Time Unaffected
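For example (made-up numbers): if 40 s of a 100 s run is affected and that part is improved by a factor of 4, the new time is 40/4 + 60 = 70 s, an overall speedup of only about 1.43x even though one part got 4x faster.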
Make the common case fast!!
Improving Latency
Latency is (ultimately) limited by physics.
e.g. speed of light
Some improvements are incremental
smaller transistors shorten distances
to reduce disk access time, make disks rotate faster
Some improvements can trade latency for CPI
e.g. reducing the size of the data cache (faster hits, but more misses)
Improvements can require new technology
e.g. copper interconnect
Improving Bandwidth
You can improve bandwidth or throughput by
throwing money at the problem.
Use wider buses, more disks, multiple processors,
more functional units ...
Two basic strategies:
Parallelism: duplicate resources.
Run multiple tasks simultaneously on separate hardware
Pipelining: break process up into multiple stages
Reduces the time needed for a single stage
Build separate resources for each stage.
Start a new task down the pipe every (shorter) timestep
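As a rough illustration with made-up numbers: if a task takes 5 ns as a single unit, an unpipelined design finishes one task every 5 ns; split into five 1 ns stages, each task still takes 5 ns (latency unchanged), but once the pipe is full a task completes every 1 ns, roughly 5x the throughput.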
Key Points
Be careful how you specify performance
Execution time = instructions * CPI * cycle time
Use real applications to measure performance
Throughput and latency are different
Make the common case fast!