Performance
Topics: performance metrics, the CPU performance equation, benchmarks and benchmarking, reporting averages, Amdahl's Law, Little's Law
Concepts: balance, tradeoffs, bursty behavior (average vs. peak performance)
Performance Metrics
latency: response time, execution time
  good metric for a fixed amount of work (minimize time)
throughput: bandwidth, work per unit time
  = (1 / latency) when there is NO OVERLAP
  > (1 / latency) when there is overlap
  in real processors, there is always overlap (e.g., pipelining)
  good metric for a fixed amount of time (maximize work)
comparing performance
  A is N times faster than B iff
perf(A)/perf(B) = time(B)/time(A) = N
A is X% faster than B iff
perf(A)/perf(B) = time(B)/time(A) = 1 + X/100
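A minimal sketch of these comparison rules in Python (the two execution times are invented for illustration):

```python
# Comparing performance: A is N times faster than B iff
# perf(A)/perf(B) = time(B)/time(A) = N, and N = 1 + X/100 for "X% faster".
time_a, time_b = 10.0, 25.0      # invented execution times (seconds)

n = time_b / time_a              # A is N times faster than B
x = (n - 1.0) * 100.0            # A is X% faster than B

print(f"A is {n:.2f} times faster than B")   # 2.50 times
print(f"A is {x:.0f}% faster than B")        # 150%
```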
Performance Metric I: MIPS
MIPS (millions of instructions per second)
  MIPS = (instruction count / execution time in seconds) x 10^-6
  but instruction count is not a reliable indicator of work
    problem #1: work per instruction varies (FP multiply >> register move)
    problem #2: instruction sets aren't equal (3 Pentium instrs != 3 Alpha instrs)
  may vary inversely with actual performance
  a particularly bad metric for multicore chips
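A toy illustration of how MIPS can rank machines backwards (all numbers invented): the machine that executes more, simpler instructions posts the higher MIPS yet finishes the same program later.

```python
# One program, two ISAs: more instructions can mean higher MIPS but a
# slower machine. All numbers are invented for illustration.
instrs_a, time_a = 3.0e9, 2.0    # many simple instructions
instrs_b, time_b = 1.0e9, 1.5    # fewer, heavier instructions

mips_a = instrs_a / time_a / 1e6     # 1500 MIPS
mips_b = instrs_b / time_b / 1e6     # ~667 MIPS

print(mips_a > mips_b)   # True: A "wins" on MIPS...
print(time_a > time_b)   # True: ...but B finishes the job sooner
```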
Performance Metric II: MFLOPS
MFLOPS (millions of floating-point operations per second)
(FP ops / execution time in seconds) x 10^-6
like MIPS, but counts only FP operations
  FP ops have the longest latencies anyway (addresses problem #1)
  FP ops are the same across machines (addresses problem #2)
may have been valid in 1980 (most programs were FP)
most programs today are integer, i.e., light on FP
a load from memory can take longer than an FP divide (problem #1 again)
Cray doesn't implement divide; Motorola has SQRT, SIN, COS (problem #2 again)
CPU Performance Equation
processor performance = seconds / program
separate into three components (for a single core):
  seconds / program = (instructions / program) x (cycles / instruction) x (seconds / cycle)
  instructions / program: dynamic instruction count
    mostly determined by program, compiler, ISA
  cycles / instruction: CPI
    mostly determined by ISA and CPU/memory organization
  seconds / cycle: cycle time, clock period, 1 / clock frequency
    mostly determined by technology and CPU organization
uses of the CPU performance equation
  high-level performance comparisons
  back-of-the-envelope calculations
  helping architects think about compilers and technology
CPU Performance Comparison
famous example: RISC Wars (RISC vs. CISC), assume:
  instructions / program: CISC = P, RISC = 2P
  CPI: CISC = 8, RISC = 2
  T = clock period, assumed equal for CISC and RISC
then:
  CISC time = P x 8 x T = 8PT
  RISC time = 2P x 2 x T = 4PT = CISC time / 2
the truth is much, much, much more complex
actual data from the IBM AS/400 (CISC -> RISC in 1995):
  CISC time = P x 7 x T = 7PT
  RISC time = 3.1P x 3 x (T/3.1) = 3PT (+1 technology generation)
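The same arithmetic as a Python sketch (P and T are symbolic, so both are set to 1.0 and the results read in units of PT):

```python
# seconds/program = (insns/program) x (cycles/insn) x (seconds/cycle)
def exec_time(insns, cpi, cycle_time):
    return insns * cpi * cycle_time

P, T = 1.0, 1.0   # symbolic placeholders; results are in units of P*T

print(exec_time(P, 8, T))              # hypothetical CISC: 8 PT
print(exec_time(2 * P, 2, T))          # hypothetical RISC: 4 PT
print(exec_time(P, 7, T))              # AS/400 CISC: 7 PT
print(exec_time(3.1 * P, 3, T / 3.1))  # AS/400 RISC: 3 PT (+1 tech generation)
```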
Actually Measuring Performance
how are execution time and CPI actually measured?
  execution time: the Unix time command reports wall-clock, CPU, and system time
  CPI = (CPU time x clock frequency) / instruction count
  more useful: a CPI breakdown (compute, memory stall, etc.)
    so we know what the performance problems are (what to fix)
measuring the CPI breakdown
  hardware event counters (built into the core)
    calculate CPI using instruction frequencies and event costs
  cycle-level microarchitecture simulator (e.g., SimpleScalar)
    + measure exactly what you want
    + model the microarchitecture faithfully (at least the parts of interest)
    the method of choice for many architects (yours, too!)
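A sketch of the counter arithmetic; all event counts below are invented for illustration:

```python
# Deriving CPI and a simple breakdown from hardware event counters.
cycles       = 12.0e9   # total cycles (cycle counter)
instructions =  8.0e9   # retired instructions
stall_cycles =  3.0e9   # e.g., memory-stall cycles (event counter)

cpi         = cycles / instructions          # 1.5 overall
stall_cpi   = stall_cycles / instructions    # 0.375 stalled on memory
compute_cpi = cpi - stall_cpi                # 1.125 doing useful work

# cross-check vs. wall clock: CPI = (CPU time x frequency) / instructions
freq = 3.0e9                                 # 3 GHz clock
cpu_time = cycles / freq                     # 4.0 seconds
assert abs((cpu_time * freq) / instructions - cpi) < 1e-9
```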
Benchmarks and Benchmarking
program as the unit of work
  millions of them, many different kinds; which to use?
benchmarks: standard programs for measuring/comparing performance
  + represent programs people care about
  + repeatable!!
benchmarking process
  define a workload
  extract benchmarks from the workload
  execute the benchmarks on candidate machines
  project performance on the new machine
  run the workload on the new machine and compare
  not close enough -> repeat
Benchmarks: Toys, Kernels, Synthetics
toy benchmarks: little programs that no one really runs
  e.g., fibonacci, 8 queens
  little value: what real programs do these represent?
  scary fact: used to prove the value of RISC in the early 1980s
kernels: important (frequently executed) pieces of real programs
  e.g., Livermore loops, Linpack (inner product)
  + good for focusing on individual features, but not the big picture
  over-emphasize the target feature (for better or worse)
synthetic benchmarks: programs made up for benchmarking
  e.g., Whetstone, Dhrystone
  essentially toys/kernels++: again, which real programs do these represent?
Benchmarks: Real Programs
real programs
  + the only accurate way to characterize performance
  requires considerable work (porting)
Standard Performance Evaluation Corporation (SPEC): http://www.spec.org
  collects, standardizes, and distributes benchmark suites
  consortium made up of industry leaders
  SPEC CPU (CPU-intensive benchmarks)
    SPEC89, SPEC92, SPEC95, SPEC2000, SPEC2006
  other SPEC suites: SPECjvm, SPECmail, SPECweb, SPEComp
other benchmark suite examples: TPC-C, TPC-H for databases
SPEC CPU2006
12 integer programs (C, C++)
  gcc (compiler), perl (interpreter), hmmer (hidden Markov models)
  bzip2 (compression), go (AI), sjeng (AI)
  libquantum (physics), h264ref (video)
  omnetpp (simulation), astar (path-finding algorithms)
  xalanc (XML processing), mcf (network optimization)
17 floating-point programs (C, C++, Fortran)
  fluid dynamics: bwaves, leslie3d, lbm
  quantum chemistry: gamess, tonto
  physics: milc, zeusmp, cactusADM
  gromacs (biochemistry), namd (molecular dynamics)
  dealII (finite element analysis), soplex (linear programming)
  povray (ray tracing), calculix (mechanics)
  GemsFDTD (computational E&M), wrf (weather), sphinx3 (speech recognition)
Benchmarking Pitfalls
benchmark properties mismatched with the features studied
  e.g., using SPEC for large cache studies
careless scaling
  using only the first few million instructions (the initialization phase)
  reducing program data size
choosing benchmarks from the wrong application space
  e.g., in a real-time environment, choosing gcc
using old benchmarks
  "benchmark specials": benchmark-specific optimizations
Benchmarks must be continuously maintained and updated!
Reporting Average Performance
averages: one of the things architects frequently get wrong
  + pay attention now and you won't get them wrong
important things about averages (i.e., means)
  ideally proportional to execution time (the ultimate metric)
  Arithmetic Mean (AM) for times
  Harmonic Mean (HM) for rates (IPCs)
  Geometric Mean (GM) for ratios (speedups)
there is no such thing as "the average program"
use averages only when absolutely necessary
What Does The Mean Mean?
arithmetic mean (AM): average execution times of N programs
  AM = [Σ(i=1..N) time(i)] / N
harmonic mean (HM): average IPCs of N programs
  arithmetic mean cannot be used for rates (e.g., IPCs)
    30 MPH for 1 mile + 90 MPH for 1 mile != 60 MPH average
    (2 miles / [(1/30 + 1/90) hours] = 45 MPH, the harmonic mean)
  HM = N / [Σ(i=1..N) 1/rate(i)]
geometric mean (GM): average speedups of N programs
  GM = [Π(i=1..N) speedup(i)]^(1/N)
what if programs run at different frequencies within the workload?
  weighting: weighted AM = Σ(i=1..N) w(i) x time(i), with the w(i) summing to 1
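The three means as a minimal Python sketch (the inputs are invented):

```python
# AM for times, HM for rates, GM for ratios; weighted AM for workloads
# where programs run with different frequencies.
from math import prod

def am(times):    return sum(times) / len(times)
def hm(rates):    return len(rates) / sum(1.0 / r for r in rates)
def gm(speedups): return prod(speedups) ** (1.0 / len(speedups))

def weighted_am(weights, times):          # weights should sum to 1
    return sum(w * t for w, t in zip(weights, times))

print(am([2.0, 4.0]))      # 3.0 seconds
print(hm([30.0, 90.0]))    # 45.0 MPH -- the example above, not 60!
print(gm([2.0, 8.0]))      # 4.0x speedup
```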
GM Weirdness
what about averaging ratios (speedups)?
  AM and HM change depending on which machine is the base

            machine A   machine B   ratio B/A   ratio A/B
program 1        1          10         10          0.1
program 2     1000         100          0.1       10
total         1001         110

AM: (10 + 0.1)/2 = 5.05 either way
  "B is 5.05 times faster!" vs. "A is 5.05 times faster!"
HM: 2/(1/10 + 1/0.1) = 0.198 either way
  inverted, each base again claims a 5.05x advantage for itself
GM: sqrt(10 x 0.1) = 1 either way
  "A and B perform the same"

geometric mean of ratios is not proportional to total time!
by total execution time, B is 1001/110 = 9.1 times faster, yet GM says they are equal
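A quick check of the table in Python:

```python
# Averaging speedup ratios both ways: AM and HM flip with the base
# machine; GM is base-independent but ignores total time.
from math import prod

a = [1.0, 1000.0]    # machine A times (programs 1 and 2)
b = [10.0, 100.0]    # machine B times

b_over_a = [tb / ta for ta, tb in zip(a, b)]   # [10.0, 0.1]
a_over_b = [ta / tb for ta, tb in zip(a, b)]   # [0.1, 10.0]

am = lambda r: sum(r) / len(r)
gm = lambda r: prod(r) ** (1.0 / len(r))

print(am(b_over_a), am(a_over_b))   # 5.05 5.05 -- contradictory claims
print(gm(b_over_a), gm(a_over_b))   # ~1.0 ~1.0 -- "equal" either way
print(sum(a) / sum(b))              # 9.1 -- by total time, B really is faster
```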
Amdahls Law
"Validity of the Single-Processor Approach to Achieving Large-Scale Computing Capabilities", G. Amdahl, AFIPS, 1967
let an optimization speed up fraction f of the program by factor s
speedup = old / ([(1-f) x old] + f/s x old) = 1 / (1 - f + f/s)
f = 95%, s = 1.1: speedup = 1/[(1-0.95) + 0.95/1.1] = 1.094
f = 5%,  s = 10:  speedup = 1/[(1-0.05) + 0.05/10] = 1.047
f = 5%,  s = ∞:   speedup = 1/[(1-0.05) + 0.05/∞] = 1.052
f = 95%, s = ∞:   speedup = 1/[(1-0.95) + 0.95/∞] = 20
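The formula as a one-line Python function, reproducing the four cases (float('inf') models s = ∞):

```python
# Amdahl's Law: overall speedup when fraction f is sped up by factor s.
def amdahl(f, s):
    return 1.0 / ((1.0 - f) + f / s)

print(amdahl(0.95, 1.1))           # ~1.094
print(amdahl(0.05, 10))            # ~1.047
print(amdahl(0.05, float('inf')))  # ~1.052
print(amdahl(0.95, float('inf')))  # 20.0
```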
make the common case fast, but... the uncommon case eventually limits performance
Littles Law
key relationship between latency and bandwidth:
  average number in system = arrival rate x mean holding time
"possibly the most useful equation I know"
useful in the design of computers, software, industrial processes, etc.
example: how big a wine cellar should we build?
  we drink (and buy) an average of 2 bottles per week
  on average, we want to age the wine for 5 years
  bottles in cellar = 2 bottles/week x 52 weeks/year x 5 years = 520 bottles
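The wine-cellar example in code form:

```python
# Little's Law: average number in system = arrival rate x mean holding time.
def number_in_system(arrival_rate, holding_time):
    return arrival_rate * holding_time

# 2 bottles/week arrive; each is "held" (aged) for 5 years = 260 weeks.
print(number_in_system(2, 5 * 52))   # 520 bottles
```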
System Balance
each system component produces and consumes data
make sure data supply and demand are balanced
  X demand >= X supply: computation is X-bound
  e.g., memory-bound, CPU-bound, I/O-bound
goal: be bound everywhere at once (why?)
X can be bandwidth or latency
  X is bandwidth: buy more bandwidth
  X is latency: much tougher problem
Tradeoffs
"Bandwidth problems can be solved with money. Latency problems are harder, because the speed of light is fixed and you can't bribe God." - David Clark
well... one can convert some latency problems into bandwidth problems
  and solve those with money
the famous bandwidth/latency tradeoff
architecture is the art of making tradeoffs
Bursty Behavior
Q: to sustain 2 IPC, how many instructions must the processor be able to
  fetch per cycle? execute per cycle? complete per cycle?
A: NOT 2 (more than 2)
  dependences will cause stalls (under-utilization)
  if the desired performance is X, peak performance must be > X
programs don't always obey average behavior
  can't design a processor to handle only average behavior
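A back-of-the-envelope sketch of why peak must exceed sustained; the 20% stall fraction is an invented number for illustration:

```python
# If some fraction of cycles is lost to stalls, sustained IPC is only
# peak IPC x utilization, so the peak must exceed the target.
target_ipc = 2.0
stall_fraction = 0.20   # invented: fraction of cycles lost to stalls

peak_ipc_needed = target_ipc / (1.0 - stall_fraction)
print(peak_ipc_needed)  # 2.5 -- must fetch/execute/complete > 2 per cycle
```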