1.3.
Some Popular Performance Metrics
A number of popular measures have been devised in attempts to
create a standard and easy-to-use measure of computer performance.
One result has been that simple metrics, valid in a limited context, have
been heavily misused. All proposed alternatives to the use of a time
as the performance metric have eventually to misleading claims,
distorted results, or incorrect interpretations. Before introducing the
main Quantitative Principles of measuring Computer performance
(execution time), we are introducing at first some of the most popular
performance measures; clock rate, MIPS, and MFLOPS. The three are
measuring the performance by calculating the rate of occurrence of an
event.
1.3.2.1 The clock rate
In many advertisements for computer systems, the most prominent
indication of performance is often the frequency of the processor's
central clock. The implication to the buyer is that a 250 MHz system
must always be faster at solving the user's problem than a 200 MHz
system, for instance. However, this performance metric completely
ignores how much computation is actually accomplished in each
clock cycle, it ignores the complex interactions of the processor with
the memory subsystem and the input/output subsystem, and it ignores
the not at all unlikely fact that the processor may not be the performance
bottleneck.
Some of the characteristics of the clock rate metric are:
it is very repeatable since it is a constant for a given system,
it is easy to measure since it is most likely stamped on the box,
it is consistent since the value of MHz is precisely defined across
all systems, and
it is independent of any sort of manufacturers' games.
The above characteristics look as advantages for using the clock rate as a
measure of performance, but as a matter of fact it is nonlinear measure
(doubling the clock rate does not mean doubling the resulting
performance) and is an unreliable metric. As many owners of personal
computer systems can attest, buying a system with a faster clock in no
way assures that their programs will run correspondingly faster. This
point will be more clear when considering, latter, the execution time
equation. Thus, we conclude that the processor's clock rate is not a good
metric of performance.
1.3.2.2
MIPS (Million Instructions Per Second)
The MIPS metric is an attempt to develop a rate metric for computer
systems that allows a direct comparison of their speeds. It is throughput
or execution-rate performance metric and is a measure of the
amount of computation performed per unit time.
For a specific program running on a specific computer, MIPS is a measure
of millions of instructions executed per second: MIPS is an instruction
execution rate, accordingly, it specifies performance inversely to
execution time; faster machines have a higher MIPS rating. Thus, MIPS,
which is an acronym for millions of instructions executed per second,
is defined to be:
MIPS = Instruction count / (Execution time * 106 )
= Instruction count / (CPU clocks * Cycle time * 106)
= (Instruction count * Clock rate) / (Instruction count * CPI *
106)
= Clock rate / (CPI * 106)
(1.1)
Where: CPI = Clock Per Instruction
Which is the average number of clocks per instruction
calculated using a benchmark programme
Instruction Count =
The number of instructions included while
executing the programme
Execution time in terms of MIPS:
Execution time = (Instruction count * CPI) / Clock rate
= Instruction count / { (clock rate / (CPI * 106)) *106}
= Instruction count / (MIPS * 106)
(1.2)
Problems that may arise of using MIPS as measure of
performance:
Taking the instruction as the unit of counting makes MIPS easy to
measure, repeatable, and independent, otherwise MIPS sufers from many
problems that make it not a good way to measure performance. For
example, it is not linear since, like the clock rate, a doubling of
the MIPS rate does not necessarily cause a doubling of the
resulting performance. It also is neither reliable nor consistent since it
really does not correlate well to performance at all.
The problem with MIPS as a performance metric is that diferent
processors can do substantially diferent amounts of computation
with a single instruction. For instance, one processor may have a
branch instruction that branches after checking the state of a specified
condition code bit. Another processor, on the other hand, may have a
branch instruction that first decrements a specified count register, and
then branches after comparing the resulting value in the register with
zero. In the first case, a single instruction does one simple
operation, whereas in the second case, one instruction actually
performs several operations. The failing of the MIPS metric is that
each instruction is count as one unit even though in this example the
second instruction actually performs more real computation. These
diferences in the amount of computation performed by an instruction are
at the heart of the diferences between RISC and CISC processors and
render MIPS essentially useless as a performance metric.
As a matter of fact, sometimes MIPS can vary inversely with
performance. A very clear example of this case is when we
calculate MIPS for two machines with the same instruction set
but one of them has special hardware to execute floating-point
operations and another machine using software routines to
execute the floating-point operations. The first machine (with
optional hardware) takes more clock cycles per foating-point instruction
than per integer instruction. Then foating-point programs using optional
hardware instead of software foating-point routines have a lower MIPS
rating (less number of instructions with more number of cycles per
instruction) while it is practically take less time. Software foating point
executes simpler instructions, resulting in a higher MIPS rating, but it
executes so many of them compared with the case of hardwired
execution such that overall execution time is longer. Example1.4
shows such drawback of MIPS.
Relative MIPS
One attempt to retain the use of MIPS, but to make it useful among
diferent instruction sets, was to choose a definition of MIPS that is relative
to some agreed-upon reference machine. The term used is Relative MIPS
and defines as follows:
Relative MIPS = (Time-reference / Time-unrated) * MIPS-reference
(1.3)
Where
:
Time-reference = Execution time of a program on the
reference machine
Time-unrated = Execution time of the same program on the
machine to be related
MIPS-reference = Agreed-upon MIPS rating of the reference
machine
Relative MIPS is proportional to execution time only for a given program
and a given input. Even when these are identified, it becomes harder to
find a reference machine on which to run programs as the machine ages.
Moreover, should the older machine be run with the newest release of
the compiler and operating system, or should the software be fixed so
the reference machine does not become faster over time? There is also
the temptation to generalize from a relative MIPS rating obtained using
one benchmark to a general statement about relative performance, even
though there can be wide variations in performance of two machines
across a complete set of benchmarks.
In the 1980s the dominant reference machine was the VAX-11/780, which
was called a 1-MIPS machine. If a machine is five times faster than VAX11/789, its rating for that benchmark is 5 relative MIPS.
The 1970s and 1980s
which was defined by
programs. MIPS was
hence the invention
second).
marked the growth of the supercomputer industry,
high performance on foating-point-intensive
clearly inappropriate metric for this industryof MFLOPS (millions of floating-point operations per
1.3.2.3
MFOLPS (Million FLOating-Point Operations
Per Second)
Most performance measures look at the processors ability to process
integer data: whole numbers, text and the like. This is perfectly valid,
because most processing is done on this sort of data. However, intensive
foating point applications, such as spreadsheets and some graphics
programs and games, make heavy use of the processors foating point
unit. As a matter of fact, processors that have similar performance on
integer tasks can have very diferent performance on foating point
operations. MFOLPS is another measure of performance (speed) that
takes into consideration this type of processing.
The MFLOPS as a performance metric tries, also to correct the primary
shortcoming of the MIPS metric by more precisely defining the type of
the operation that will be taken as unit of counting. It defines an
arithmetic operation on two foating-point (i.e. fractional) quantities to be
the basic unit.
MFLOPS, for a specific program running on a specific computer, is a
measure of millions of foating point-operation (megafops) per second:
MFLOPS = Number of floating-point operations / (Execution time
x 106 )
(1.4)
7
A foating-point operation is an addition,
subtraction, multiplication,
or division operation applied to numbers represented by a single or
double precision foating-point representation. Such data items are
heavily used in scientific calculations and are specified in programming
languages using key words like float, real, double, or double precision.
MFLOPS depends on the program. Diferent programs require the
execution of diferent numbers of foating point operations. Since
MFLOPS were intended to measure foating-point performance, they are
not applicable outside that range. Compiler, as an example, have a
MFLOPS rating near 0 no matter how fast the machine is, because
compilers rarely use foating-point arithmetic.
Because it is based on operations in the program rather than on the
instructions, MFLOPS has stronger claim than MIPS to being a fair
comparison between diferent machines. The reason of that is the fact
that the same program running on diferent computers may execute a
diferent number of instructions but will always execute the same
number of foating-point operations.
MFLOPS is not dependable. It depends on the type of floating-point
operations present in the program (availability of foating point
instructions). For example some computers have no sine instruction
while others have. In the first group of computers, the calculation of a
sine function needs to call the sine routine, which would require
performing several foating-point operations, while in the second group
this would require only one operation. Another potential problem is that
the MFLOPS rating changes according not only to mixture of integer
and foating-point operations but to the mixture of fast and slow
floating-point operations. For example, a program with 100% floatingpoint adds will have a higher rating than a program with 100% floatingpoint divides. The solution to both these problems is to define a
method of counting the number of foating-point operations in a highlevel language program. This counting process can also weight the
operations, giving more complex operations larger weights, allowing a
machine to achieve a higher MFLOPS rating even if the program
contains many foating-point divides. This type of MFLOPS is called
normalized or weighted MFLOPS.
1.3.3
The Processor Performance Equation
The performance of any processor is measured, as mentioned above, by the time
required to implement a given task, in our case a given program. The CPU time
for a program can be expressed as follows:
CPU time = CPU clock cycles for a program x Clock cycle time
For any processor the clock cycle time (or clock rate) is known and in addition to
that we can also count the number of instruction executed the instruction
count IC. If we know the number of clock cycles and the instruction count we can
calculate the average number of clock cycles per instruction (CPI):
CPI = (CPU clock cycles for a program)/ IC
This allows us to express CPU time using the following equation:
CPU time = IC x CPI x Clock cycle time
(1.5)
Which can be written as:
CPU time =
seconds
Instructio ns
Cycles
Seconds
=
x
x
program
Pr ogram
Instructio n Cycles
(1.6)
In the general case, executing the program means the use of diferent instruction
types each has its own frequency of occurrence and its CPI.
Instruction Frequency & CPI
Given a program with n types or classes of instructions with the following
characteristics:
Ci
= Count of instructions of type i
CPIi
= Average cycles per instruction of type i
Fi
= Frequency of instruction type i
= Ci / total instruction count
Then:
CPU =
(CP xF )
i
i =1
In such cases the number of total CPU clock cycles will be expressed as:
n
CPU clock cycles = CPI i xIC
(1.7)
i =1
where ICi represents number of times instruction I
is executed in a program and CPIi represents the
9
av
er
age number of clock cycles for instruction i.
Equation 1.7 can be used to express CPU time as:
CPU time = [ CPI i xICi ]x Clock cycle time
i =1
EXAMPLE 1.1:
A RISC Machine has the following instruction types and frequencies.
Base Machine (Reg / Reg)
(1.8)
Operation
ALU
Load
Store
Branch
Freq, Fi
50%
20%
10%
20%
CPIi
1
5
3
2
CPIi x Fi
.5
1.0
.3
.4
% Time
23%
45%
14%
18%
Calculate the value of CPI
SOLUTION:
In this case, we use the following equation to calculate the
weighted average of CPI (See section 1.5 for details)
n
CPI = (CPI i xFi )
(1.9)
i =1
CPI = 0.5 x 1 + 0.2 x 5 + 0.1 x 3 + 0.2 x 2 = 2.2
EXAMPLE 1.2:
CPU Execution Time
A Program is running on a specific machine with the
following parameters:
Total instruction count: 10,000,000 instructions
Average CPI for the program: 2.5 cycles/instruction.
CPU clock rate: 200 MHz.
What is the execution time for this program?
SOLUTION:
CPU time = (Seconds/ Program) = (Instructions/ Program) x
(Cycles/ Instruction) x (Seconds/ Cycle) CPU time = Instruction
count x CPI x Clock cycle
= 10,000,000 x 2.5 x 1 / clock rate
= 10,000,000 x 2.5 x 5x10-9
= .125 seconds
EXAMPLE 1.3:
Compiler Variations, MIPS, Performance:
For the machine with instruction classes:
Instruction class
A
B
C
CPI
1
2
3
For a given program two compilers produced the following instruction
counts:
Code from:
Compiler 1
Compiler 2
Instruction counts (in millions)
for each instruction class
A
B
C
5
1
1
10
1
1
The machine is assumed to run at a clock rate of 100 MHz
MIPS = Clock rate / (CPI x 106) = 100 MHz / (CPI x 106)
CPI = CPU execution cycles / Instructions count
CPU time = Instruction count x CPI / Clock rate
For compiler 1:
CPI1 = (5 x 1 + 1 x 2 + 1 x 3) / (5 + 1 + 1) = 10 / 7 = 1.43
MIP1 = 100 / (1.428 x 106) = 70.0
CPU time1 = ((5 + 1 + 1) x 106 x 1.43) / (100 x 106) = 0.10
seconds
For compiler 2:
CPI2 = (10 x 1 + 1 x 2 + 1 x 3) / (10 + 1 + 1) = 15 / 12 = 1.25
MIP2 = 100 / (1.25 x 106) = 80.0
CPU time2 = ((10 + 1 + 1) x 106 x 1.25) / (100 x 106) = 0.15
seconds
EXAMPLE 1.4:
Effect of Compiler Variations on MIPS Rating and Performance:
Assume we optimized the compiler of the load-store machine for which the measurements given in
Table 1.1 have been made. The compiler discards 50% of the arithmetic logic unit (ALU) instructions,
although it cannot reduce loads, stores, or branches. Ignoring systems issues and assuming a 500 MHz
clock rate and 1.57 unoptimized CPI, what is the MIPS rating for optimized code versus unoptimized code?
Does the rating of MIPS agree with the ranking of execution time?
Instruction type
ALU
Loads
Stores
Branches
Frequency
43%
21%
12%
24%
Clock Cycle Count
1
2
2
2
SOLUTION: Given the value CPIunoptimized = 1.57, then from equation (1.1), we get:
6
MIPSunoptimized = 500 MHz/ 1.57 x 10 = 318.5
Use equation 1.5 to calculate the performance of unoptimized code, we get
-9
CPU timeunoptimized = ICunoptimized x 1.57 x (2 x 10 )
-9
= 3.14 x 10 x ICunoptimized
For the optimized case:
Since the optimized compiler will discard 50% of the ALU operations, then:
ICoptimized = (1 0.43/2) ICunoptimized = (1 - 0.215) ICunoptmized = 0.785 ICunoptimized
The frequency given in table 1.1 must be changed now to fraction of the ICoptimized. The
table will take the form:
Instruction type
ALU
Loads
Stores
Branches
Frequency
27.4%
26.8%
15.3%
30.5%
Clock Cycle Count
1
2
2
2
From this table we get:
CPIoptimized = 0.274 x 1 + 0.268 x 2 + 0.153 x 2 + 0.305 x 2
= 1.73
Using equation 1.1, we get:
6
MIPS optimized = 500MHz/1.73 x 10 = 289.0
The performance of the optimized compiler is:
-9
CPU timeoptimized = ICoptimized x 1.73 x (2 x 10 )
-9
= 0.785 ICunoptimized x 1.73 x 2 x 10
-9
= 2.72 x 10 IC unoptimized.
The optimized code is 3.14/2.72 = 1.15 times faster, but its MIPS rating is lower: 289
versus 318.
11