CENG 3420
Computer Organization & Design
Lecture 10: Performance
Bei Yu
CSE Department, CUHK
byu@cse.cuhk.edu.hk
(Textbook: Chapters 1.6 & 1.7)
Spring 2022
Performance Metrics
Purchasing perspective
Given a collection of machines, which has the
• best performance ?
• least cost ?
• best cost/performance?
Design perspective
Faced with design options, which has the
• best performance improvement ?
• least cost ?
• best cost/performance?
2/21
Performance Metrics
Both (purchasing & design) require
• basis for comparison
• metric for evaluation
Our Goal
Is to understand what factors in the architecture contribute to overall system performance
and the relative importance (and cost) of these factors
3/21
Throughput v.s. Response Time
Response time (execution time)
• The time between the start and the completion of a task.
• Important to individual users
Throughput (bandwidth)
• The total amount of work done in a given time
• Important to data center managers
Will need different performance metrics as well as a different set of applications to
benchmark embedded and desktop computers, which are more focused on
response time, versus servers, which are more focused on throughput
4/21
Response Time Matters
Justin Rattner’s ISCA-08 Keynote (VP and CTO of Intel) 5/21
Defining (Speed) Performance
• To maximize performance, need to minimize execution time
1
performanceX =
execution_timeX
• If X is n times faster than Y, then
performanceX execution_timeY
= =n
performanceY execution_timeX
• Decreasing response time almost always improves throughput.
6/21
EX-1
If computer A runs a program in 10 seconds and computer B runs the same program in 15
seconds, how much faster is A than B?
7/21
Performance Factors
• CPU execution time (CPU time): time the CPU spends working on a task
• Does not include time waiting for I/O or running other programs
CPU execution time = # CPU clock cycles × clock cycle time
# CPU clock cycles
=
clock rate
Can improve performance by reducing
• Length of the clock cycle
• Number of clock cycles required for a program
8/21
Review: Machine Clock Rate
Clock rate (clock cycles per second in MHz or GHz) is inverse of clock cycle time
(clock period)
1
CC =
CR
one clock p eriod
10 nsec clock cycle => 100 MHz clock rate
5 nsec clock cycle => 200 MHz clock rate
2 nsec clock cycle => 500 MHz clock rate
1 nsec (10-9) clock cycle => 1 GHz (109) clock rate
500 psec clock cycle => 2 GHz clock rate
250 psec clock cycle => 4 GHz clock rate
200 psec clock cycle => 5 GHz clock rate
9/21
EX-2: Improving Performance Example
A program runs on computer A with a 2 GHz clock in 10 seconds. What clock rate must a
computer B has to run this program in 6 seconds? Unfortunately, to accomplish this,
computer B will require 1.2 times as many clock cycles as computer A to run the program.
10/21
Clock Cycles per Instruction
• Not all instructions take the same amount of time to execute
• One way to think about execution time is that it equals the number of instructions
executed multiplied by the average time per instruction
CPU clock cycles = # instruction × clock cycle per instruction
Clock cycles per instruction (CPI)
• The average number of clock cycles each instruction takes to execute
• A way to compare two different implementations of the same ISA
11/21
Effective (Average) CPI
n
X
CPIi × ICi
i=1
ICi : percentage of the number of instructions of class i executed
CPIi : (average) number of clock cycles per instruction for that instruction class
n: number of instruction classes
• Computing the overall effective CPI is done by looking at the different types of
instructions and their individual cycle counts and averaging
• The overall effective CPI varies by instruction mix
• A measure of the dynamic frequency of instructions across one or many programs
12/21
Basic Performance Equation
CPU time = Instruction count × CPI × clock cycle
Instruction count × CPI
CPU time =
clock rate
Three key factors that affect performance
• Can measure the CPU execution time by running the program
• The clock rate is usually given
• Can measure overall instruction count by using profilers/ simulators without
knowing all of the implementation details
CPI varies by instruction type and ISA implementation for which we must know
the implementation details
13/21
EX-3: Using the Performance Equation
Computers A and B implement the same ISA. Computer A has a clock cycle time of 250 ps
and an effective CPI of 2.0 for some program and computer B has a clock cycle time of 500
ps and an effective CPI of 1.2 for the same program. Which computer is faster and by how
much?
14/21
Determinates of CPU Performance
CPU time = Instruction count × CPI × clock cycle
Instruction_ CPI clock_cycle
count
Algorithm
Programming
language
Compiler
ISA
Core
organization
Technology
15/21
Determinates of CPU Performance
CPU time = Instruction count × CPI × clock cycle
Instruction_ CPI clock_cycle
count
Algorithm
X X
Programming X X
language
Compiler X X
ISA
X X X
Core
X X
organization
Technology
X
15/21
EX-4
Op Freq CPIi Freq x CPIi
ALU 50% 1
Load 20% 5
Store 10% 3
Branch 20% 2
∑=
1 How much faster would the machine be if a better data cache reduced the average
load time to 2 cycles?
2 How does this compare with using branch prediction to shave a cycle off the branch
time?
3 What if two ALU instructions could be executed at once?
16/21
Workloads and Benchmarks
Benchmarks
A set of programs that form a “workload” specifically chosen to measure performance
• SPEC (System Performance Evaluation Cooperative) creates standard sets of
benchmarks starting with SPEC89.
• The latest is SPEC CPU2006 which consists of 12 integer benchmarks (CINT2006) and
17 floating-point benchmarks (CFP2006).
• www.spec.org
• There are also benchmark collections for power workloads (SPECpower_ssj2008), for
mail workloads (SPECmail2008), for multimedia workloads (mediabench) ...
17/21
SPEC CINT2006 on Barcelona (CC = 0.4 × 109 )
Name ICx109 CPI ExTime RefTime SPEC
ratio
perl 2,1118 0.75 637 9,770 15.3
bzip2 2,389 0.85 817 9,650 11.8
gcc 1,050 1.72 724 8,050 11.1
mcf 336 10.00 1,345 9,120 6.8
go 1,658 1.09 721 10,490 14.6
hmmer 2,783 0.80 890 9,330 10.5
sjeng 2,176 0.96 837 12,100 14.5
libquantum 1,623 1.61 1,047 20,720 19.8
h264avc 3,102 0.80 993 22,130 22.3
omnetpp 587 2.94 690 6,250 9.1
astar 1,082 1.79 773 7,020 9.1
xalancbmk 1,058 2.70 1,143 6,900 6.0
Geometric Mean 11.7
18/21
Comparing and Summarizing Performance
How to summarize performance with a single number?
• First the execution times are normalized given the “SPEC ratio” (bigger is faster, i.e.,
SPEC ratio is the inverse of execution time)
• SPEC ratios are “averaged” using the geometric mean (GM)
v
u n
uX
GM = n · t SPEC ratioi
i=1
Guiding principle – reproducibility
List everything another experimenter would need to duplicate the experiment: version of
the operating system, compiler settings, input set used, specific computer configuration
(clock rate, cache sizes and speed, memory size and speed, etc.)
19/21
Other Performance Metrics
Power Consumption
• Especially in the embedded market where battery life is important
• For power-limited applications, the most important metric is energy efficiency
20/21
Highest Clock Rate of Intel Processors
What if the exponential increase had kept up? Why not?
• Due to process improvements
• Deeper pipeline
• Circuit design techniques
21/21