
CSM 2231: Computer Architecture

02: Performance

Presented By:
Professor Dr. Md. Rakib Hassan
Department of Computer Science and Mathematics
Email: rakib@bau.edu.bd
Performance
 The cost of computer systems continues to drop dramatically, while the performance and
capacity of those systems continue to rise equally dramatically
 Today’s laptops have the computing power of an IBM mainframe from 10 or 15 years ago
 Processors are so inexpensive that we now have microprocessors we throw away
 Desktop applications that require the great power of today’s microprocessor-based
systems include:
o Image processing
o Three-dimensional rendering
o Speech recognition
o Videoconferencing
o Multimedia authoring
o Voice and video annotation of files
o Simulation modeling
o Machine learning

 Businesses are relying on increasingly powerful servers to handle transaction and
database processing and to support massive client/server networks that have replaced
the huge mainframe computer centers of yesteryear
 Cloud service providers use massive high-performance banks of servers to satisfy high-
volume, high-transaction-rate applications for a broad spectrum of clients

PROF. DR. MD. RAKIB HASSAN 2


IAS vs Modern Computers
The basic building blocks for today’s computers are virtually
the same as those of the IAS computer from over 50 years ago.
On the other hand, the techniques for squeezing the maximum
performance out of the materials at hand have become
increasingly sophisticated.



Microprocessor Speed
 Chipmakers can unleash a new generation of chips every three
years—with four times as many transistors.
 In microprocessors, the addition of new circuits, and the speed
boost that comes from reducing the distances between them, has
improved performance four- or fivefold every three years or so since
Intel launched its x86 family in 1978.
 But the raw speed of the microprocessor will not achieve its
potential unless it is fed a constant stream of work to do in the form
of computer instructions.
o Anything that gets in the way of that smooth flow undermines the power of
the processor.
 While the chipmakers have been busy learning how to fabricate
chips of greater and greater density, the processor designers must
come up with ever more elaborate techniques for feeding the
monster.



Microprocessor Speed
Techniques built into contemporary processors include:
o Pipelining
o Branch prediction
o Superscalar execution
o Data flow analysis
o Speculative execution

These and other sophisticated techniques are made necessary by the sheer
power of the processor.
Collectively they make it possible to execute many instructions
per processor cycle, rather than to take many cycles per
instruction.



Pipelining
The execution of an instruction involves multiple stages of
operation, including fetching the instruction, decoding the
opcode, fetching operands, performing a calculation, and so
on.
Pipelining enables a processor to work simultaneously on
multiple instructions by performing a different phase for each
of the multiple instructions at the same time. The processor
overlaps operations by moving data or instructions into a
conceptual pipe with all stages of the pipe processing
simultaneously.
For example, while one instruction is being executed, the
computer is decoding the next instruction. This is the same
principle as seen in an assembly line.
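The assembly-line overlap described above can be sketched with a back-of-the-envelope cycle count. The five stage names and the helper functions below are illustrative assumptions, not something defined on the slide:

```python
# Minimal sketch of pipelined vs. purely sequential instruction timing.
# The 5-stage split is a common textbook assumption, used here only
# to illustrate the overlap principle.

STAGES = ["fetch", "decode", "operand fetch", "execute", "write back"]

def sequential_cycles(n_instructions, n_stages=len(STAGES)):
    # Without pipelining, each instruction passes through every stage
    # before the next instruction may begin.
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages=len(STAGES)):
    # With pipelining, a new instruction enters the pipe every cycle
    # once the first instruction has filled all stages.
    return n_stages + (n_instructions - 1)

print(sequential_cycles(100))  # 500 cycles
print(pipelined_cycles(100))   # 104 cycles
```

For long instruction streams the pipelined machine approaches one instruction completed per cycle, which is the point of the technique.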



Branch prediction
Processor looks ahead in the instruction code fetched from
memory and predicts which branches, or groups of
instructions, are likely to be processed next.
If the processor guesses right most of the time, it can prefetch
the correct instructions and buffer them so that the processor
is kept busy.
The more sophisticated examples of this strategy predict not
just the next branch but multiple branches ahead. Thus,
branch prediction potentially increases the amount of work
available for the processor to execute.
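One common textbook realization of this idea (not named on the slide, so treat it as an illustrative assumption) is a 2-bit saturating counter, which tolerates a single mispredicted loop exit without flipping its prediction:

```python
# Illustrative 2-bit saturating-counter branch predictor. This specific
# scheme is a standard textbook example, not one the slide prescribes.

class TwoBitPredictor:
    def __init__(self):
        self.state = 0  # 0,1 = predict not taken; 2,3 = predict taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Saturate at the extremes so one anomaly can't flip a strong prediction.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
# A loop branch: taken 8 times, one exit, then taken 8 times again.
outcomes = [True] * 8 + [False] + [True] * 8
correct = 0
for taken in outcomes:
    correct += (p.predict() == taken)
    p.update(taken)
print(correct, len(outcomes))  # 14 of 17 predicted correctly
```

After warm-up, the single loop exit costs only one misprediction instead of two, which is exactly why the prediction buffer keeps the processor busy.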



Superscalar execution
This is the ability to issue more than one instruction in every
processor clock cycle.
In effect, multiple parallel pipelines are used.



Data flow analysis
The processor analyzes which instructions are dependent on
each other’s results, or data, to create an optimized schedule
of instructions.
In fact, instructions are scheduled to be executed when ready,
independent of the original program order.
This prevents unnecessary delay.
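The idea can be sketched as a ready-list scheduler: an instruction issues as soon as the registers it reads have been produced, regardless of program order. The tiny four-instruction program and register names below are invented for illustration:

```python
# Sketch of dependence-aware scheduling: instructions issue when their
# inputs are ready, independent of original program order.

program = [  # (name, destination register, set of source registers)
    ("i1", "r1", set()),          # r1 = load A
    ("i2", "r2", {"r1"}),         # r2 = r1 + 1   (depends on i1)
    ("i3", "r3", set()),          # r3 = load B   (independent of i1, i2)
    ("i4", "r4", {"r2", "r3"}),   # r4 = r2 * r3
]

produced = set()   # registers whose producing instruction has issued
schedule = []      # one list of instruction names per issue cycle
pending = list(program)
while pending:
    ready = [ins for ins in pending if ins[2] <= produced]
    schedule.append([name for name, _, _ in ready])
    produced |= {dst for _, dst, _ in ready}
    pending = [ins for ins in pending if ins not in ready]

print(schedule)  # [['i1', 'i3'], ['i2'], ['i4']]
```

Note that i3 issues alongside i1 even though it appears later in program order; that reordering is the delay the slide says is prevented.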



Speculative execution
Using branch prediction and data flow analysis, some
processors speculatively execute instructions ahead of their
actual appearance in the program execution, holding the
results in temporary locations.
This enables the processor to keep its execution engines as
busy as possible by executing instructions that are likely to be
needed.



Performance Balance
Adjust the organization and architecture to compensate for the
mismatch among the capabilities of the various components
Architectural examples include:
o Increase the number of bits that are retrieved at one time by making
DRAMs “wider” rather than “deeper” and by using wide bus data paths
o Reduce the frequency of memory access by incorporating increasingly
complex and efficient cache structures between the processor and main
memory
o Change the DRAM interface to make it more efficient by including a cache
or other buffering scheme on the DRAM chip
o Increase the interconnect bandwidth between processors and memory by
using higher speed buses and a hierarchy of buses to buffer and structure
data flow



Typical I/O Device Data Rates



Improvements in Chip Org. and Arch.
As designers wrestle with the challenge of balancing processor
performance with that of main memory and other computer
components, the need to increase processor speed remains.
There are three approaches to achieving increased processor
speed:
1. Increase hardware speed of processor
 Fundamentally due to shrinking logic gate size
 More gates, packed more tightly, increasing clock rate
• An increase in clock rate means that individual operations are executed more rapidly
 Propagation time for signals reduced
2. Increase size and speed of caches
 Dedicating part of processor chip
 Cache access times drop significantly
3. Change processor organization and architecture
 Increase effective speed of instruction execution
 This involves using parallelism in one form or another



Problems with Clock Speed and Logic Density
Power
o Power density increases (watts/cm2) with density of logic and clock speed
o Difficulty in dissipating heat

RC delay
o The speed at which electrons can flow on a chip between transistors is
limited by the resistance and capacitance of the metal wires connecting
them
o Specifically, delay increases as the RC product increases.
 As components on the chip decrease in size, the wire interconnects become
thinner, increasing resistance.
 Also, the wires are closer together, increasing capacitance.

Memory latency and throughput
o Memory access speed (latency) and transfer speed (throughput) lag
processor speeds



Old Strategies to Increase Performance
Beginning in the late 1980s, and continuing for about 15 years,
two main strategies have been used to increase performance
beyond what can be achieved simply by increasing clock
speed:
o First, there has been an increase in cache capacity.
 There are now typically two or three levels of cache between the processor and
main memory.
 As chip density has increased, more of the cache memory has been incorporated
on the chip, enabling faster cache access.
 For example, the original Pentium chip devoted about 10% of on-chip area to a
cache. Contemporary chips devote over half of the chip area to caches. And,
typically, about three-quarters of the other half is for pipeline-related control
and buffering.
o Second, the instruction execution logic within a processor has become
increasingly complex to enable parallel execution of instructions within
the processor.
 Two noteworthy design approaches have been pipelining and superscalar.



Problems
 By the mid to late 90s, both of these approaches were reaching a
point of diminishing returns.
 The internal organization of contemporary processors is exceedingly
complex and is able to squeeze a great deal of parallelism out of the
instruction stream.
 It seems likely that further significant increases in this direction will
be relatively modest.
 With three levels of cache on the processor chip, each level
providing substantial capacity, it also seems that the benefits from
the cache are reaching a limit.
 However, simply relying on increasing clock rate for increased
performance runs into the power dissipation problem.
o The faster the clock rate, the greater the amount of power to be dissipated,
and some fundamental physical limits are being reached.



Processor Trends



Multicore
 The use of multiple processors on the same chip provides the
potential to increase performance without increasing the clock rate
 Studies indicate that, within a processor, the increase in
performance is roughly proportional to the square root of the
increase in complexity.
o But if the software can support the effective use of multiple processors, then
doubling the number of processors almost doubles performance.
 Thus, the strategy is to use two simpler processors on the chip rather
than one more complex processor
o In addition, with two processors, larger caches are justified
 This is important because the power consumption of memory logic on a chip is much
less than that of processing logic

 As caches became larger, it made performance sense to create two and
then three levels of cache on a chip
o It is now common for the second-level cache to also be private to each core



MIC vs GPU

MIC (Many Integrated Core)
o The multicore and MIC strategy involves a homogeneous collection of
general purpose processors on a single chip
o Offers a leap in performance, as well as challenges in developing
software to exploit such a large number of cores

GPU (Graphics Processing Unit)
o Core designed to perform parallel operations on graphics data
o Traditionally found on a plug-in graphics card; used to encode and
render 2D and 3D graphics as well as process video
o Used as vector processors for a variety of applications that require
repetitive computations
o When a broad range of applications are supported by such a processor,
the term General-Purpose computing on GPUs (GPGPU) is used


Amdahl’s Law
Gene Amdahl in 1967
Deals with the potential speedup of a program using multiple
processors compared to a single processor
Illustrates the problems faced by the industry in the
development of multi-core machines
o Software must be adapted to a highly parallel execution environment to
exploit the power of parallel processing

Can be generalized to evaluate any design or technical improvement in a
computer system



Amdahl’s Law (Cont.)
 Consider a program running on a single processor such that a fraction (1 −
𝑓) of the execution time involves code that is inherently sequential, and a
fraction 𝑓 that involves code that is infinitely parallelizable with no
scheduling overhead.
 Let 𝑇 be the total execution time of the program using a single processor.
 Then the speedup using a parallel processor with N processors that fully
exploits the parallel portion of the program is as follows:

Speedup = (Time to execute program on a single processor)
          / (Time to execute program on N parallel processors)

        = [𝑇(1 − 𝑓) + 𝑇𝑓] / [𝑇(1 − 𝑓) + 𝑇𝑓/𝑁]

        = 1 / [(1 − 𝑓) + 𝑓/𝑁]

        = 1 / [1 − 𝑓(1 − 1/𝑁)]



Amdahl’s Law (Cont.)
Speedup = 1 / [1 − 𝑓(1 − 1/𝑁)]

When 𝑓 is small, the use of parallel processors has little effect.

As 𝑁 approaches infinity, speedup is bound by 1/(1 − 𝑓), so that there
are diminishing returns for using more processors.
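Both observations are easy to check numerically. In the sketch below, the 95% parallel fraction and the processor counts are invented for illustration:

```python
def amdahl_speedup(f, n):
    # Speedup = 1 / ((1 - f) + f/n), where f is the parallelizable
    # fraction of execution time and n the number of processors.
    return 1.0 / ((1 - f) + f / n)

# With f = 0.95, eight processors give a speedup of about 5.93 ...
print(round(amdahl_speedup(0.95, 8), 2))      # 5.93
# ... while even a vast number of processors cannot beat 1/(1-f) = 20.
print(round(amdahl_speedup(0.95, 10**9), 2))  # 20.0
```

The gap between 5.93 and the theoretical bound of 20 is exactly the diminishing-returns effect the slide describes.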



Illustration of Amdahl’s Law



Amdahl’s Law for Multiprocessors



Generalized Amdahl’s Law
Speedup = (Execution time before enhancement) / (Execution time after enhancement)



Example
Suppose that a task makes extensive use of floating-point
operations, with 40% of the time consumed by floating-point
operations. With a new hardware design, the floating-point
module is sped up by a factor of 𝐾. Then the overall speedup is
as follows:

𝑆𝑝𝑒𝑒𝑑𝑢𝑝 = 1 / (0.6 + 0.4/𝐾)

Regardless of how large 𝐾 becomes, the speedup is bounded above by
1/0.6 ≈ 1.67.



Little’s Law
Fundamental and simple relation with broad applications
Can be applied to almost any system that is statistically in
steady state, and in which there is no leakage
Little’s Law:
o 𝐿 = 𝜆𝑊
 𝜆 = average arrival rate of items per unit time
 𝑊 = average units of time an item stays in the system
 𝐿 = average number of items in the system at any one time
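A quick numerical sketch of 𝐿 = 𝜆𝑊, using invented values and a deterministic arrival stream as a cross-check:

```python
# Little's Law sketch, L = λW. The rate and residence time are invented.
arrival_rate = 10.0      # λ: items arriving per unit time
time_in_system = 0.5     # W: time each item spends in the system
L = arrival_rate * time_in_system
print(L)  # 5.0 items in the system on average

# Cross-check by direct counting: items arrive every 1/λ time units and
# each stays exactly W time units; count how many are present at t = 50.
arrivals = [i / arrival_rate for i in range(1000)]
t = 50.0
in_system = sum(1 for a in arrivals if a <= t < a + time_in_system)
print(in_system)  # 5
```

The direct count matches λW, illustrating why the relationship needs so few assumptions about the system's internals.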



Little’s Law (Cont.)
Queuing system
o If server is idle, an item is served immediately.
o Otherwise, an arriving item joins a queue.
o There can be a single queue for a single server or for multiple servers, or
multiple queues with one being for each of multiple servers

Average number of items in a queuing system equals the average rate at
which items arrive multiplied by the time that an item spends in the
system
o Relationship requires very few assumptions
o Because of its simplicity and generality, it is extremely useful



Application Performance
The performance of an application depends on:
o Raw speed of the processor
o Instruction set
o Choice of implementation language
o Efficiency of the compiler
o Skill of the programmer (choice of algorithm)



Measuring Processor Speed
 Clock speed
o Operations performed by a processor, such as fetching an instruction,
decoding the instruction, performing an arithmetic operation, and so on, are
governed by a system clock.
o Typically, all operations begin with the pulse of the clock.
o Thus, at the most fundamental level, the speed of a processor is dictated by
the pulse frequency produced by the clock, measured in cycles per second, or
Hertz (Hz).
o Typically, clock signals are generated by a quartz crystal, which generates a
constant sine wave while power is applied. This wave is converted into a digital
voltage pulse stream that is provided in a constant flow to the processor
circuitry.



Clock Speed
 The rate of pulses is known as the clock rate, or clock speed.
o For example, a 1-GHz processor receives 1 billion pulses per second.
 One increment, or pulse, of the clock is referred to as a clock cycle, or a
clock tick.
 The time between pulses is the cycle time.
 The clock rate is not arbitrary, but must be appropriate for the physical
layout of the processor.
 Actions in the processor require signals to be sent from one processor
element to another.
 When a signal is placed on a line inside the processor, it takes some finite
amount of time for the voltage levels to settle down so that an accurate
value (logical 1 or 0) is available.
 Furthermore, depending on the physical layout of the processor circuits,
some signals may change more rapidly than others.
 Thus, operations must be synchronized and paced so that the proper
electrical signal (voltage) values are available for each operation.



Clock Speed (Cont.)
Some instructions may take only a few cycles, while others
require dozens.
In addition, when pipelining is used, multiple instructions are
being executed simultaneously.
Thus, a straight comparison of clock speeds on different
processors does not tell the whole story about performance.



Instruction Execution Rate
A processor is driven by a clock with a constant frequency 𝑓 or,
equivalently, a constant cycle time 𝜏, where 𝜏 = 1/𝑓.

𝐼𝑐 : the instruction count for a program, defined as the number of
machine instructions executed for that program until it runs to
completion or for some defined time interval.
o Note that this is the number of instruction executions, not the number of
instructions in the object code of the program.



CPI
Cycles per instruction
If all instructions required the same number of clock cycles,
then CPI would be a constant value for a processor.
o However, on any given processor, the number of clock cycles required
varies for different types of instructions, such as load, store, branch, and
so on.
o 𝐶𝑃𝐼 = Σᵢ₌₁ⁿ (𝐶𝑃𝐼𝑖 × 𝐼𝑖) / 𝐼𝑐
o 𝐶𝑃𝐼𝑖 is the number of cycles required for instruction type 𝑖
o 𝐼𝑖 is the number of executed instructions of type i for a given program



Processor Time
 The processor time 𝑇 needed to execute a given program can be
expressed as:
o 𝑇 = 𝐼𝐶 × 𝐶𝑃𝐼 × 𝜏
 We can refine this formulation by recognizing that during the
execution of an instruction, part of the work is done by the
processor, and part of the time a word is being transferred to or
from memory.
o In this latter case, the time to transfer depends on the memory cycle time,
which may be greater than the processor cycle time.
 We can rewrite the preceding equation as:
o 𝑇 = 𝐼𝑐 × [𝑝 + (𝑚 × 𝑘)] × 𝜏
 where,
 p: number of processor cycles needed to decode and execute the instruction,
 m: number of memory references needed, and
 k: the ratio between memory cycle time and processor cycle time.
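A small numeric sketch of 𝑇 = 𝐼𝑐 × [𝑝 + (𝑚 × 𝑘)] × 𝜏; every parameter value here is invented purely for illustration:

```python
# Numeric sketch of T = Ic * [p + (m * k)] * tau (all values invented).
clock_rate = 400e6       # f, in Hz
tau = 1 / clock_rate     # processor cycle time
Ic = 2_000_000           # machine instructions executed
p = 2                    # processor cycles to decode/execute (assumed average)
m = 1                    # memory references per instruction (assumed)
k = 4                    # memory cycle time / processor cycle time (assumed)

T = Ic * (p + m * k) * tau
print(round(T, 6))  # 0.03 seconds
```

Note how a slow memory (large 𝑘) inflates 𝑇 even when the processor itself needs only a couple of cycles per instruction; this is the memory-latency problem from the earlier slides in quantitative form.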



5 Performance Factors
The five performance factors in the preceding equation
(𝐼𝑐 , 𝑝, 𝑚, 𝑘, 𝜏) are influenced by four system attributes:
o The design of the instruction set (known as instruction set architecture)
o Compiler technology (how effective the compiler is in producing an
efficient machine language program from a high-level language program)
o Processor implementation
o Cache and memory hierarchy

Performance Factors and System Attributes



MIPS
Millions of Instructions per Second (MIPS)
MIPS is a common measure of performance for a processor.
It is the rate at which instructions are executed.
We can express the MIPS rate in terms of the clock rate and CPI
as follows:
o MIPS rate = 𝐼𝑐 / (𝑇 × 10⁶) = 𝑓 / (𝐶𝑃𝐼 × 10⁶)



Example
 Consider the execution of a program that results in the execution of
2 million instructions on a 400-MHz processor. The program consists
of four major types of instructions. The instruction mix and the CPI
for each instruction type are given below, based on the result of a
program trace experiment:

Instruction type                     CPI    Instruction mix
Arithmetic and logic                  1          60%
Load/store with cache hit             2          18%
Branch                                4          12%
Memory reference with cache miss      8          10%

 The average CPI when the program is executed on a uniprocessor with
the above trace results is CPI = (1 × 0.6) + (2 × 0.18) + (4 × 0.12) +
(8 × 0.1) = 2.24.
 The corresponding MIPS rate is (400 × 10⁶)/(2.24 × 10⁶) ≈ 178.
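The arithmetic in this example can be checked directly; the CPI and mix values below are the ones used in the calculation above:

```python
# Re-computing the worked example: 2 million instructions executed on a
# 400-MHz processor with the traced instruction mix.
mix = [  # (CPI_i, fraction of executed instructions)
    (1, 0.60),
    (2, 0.18),
    (4, 0.12),
    (8, 0.10),
]
cpi = sum(c * frac for c, frac in mix)   # weighted average CPI
clock_rate = 400e6                        # f, in Hz
mips = clock_rate / (cpi * 1e6)           # MIPS rate = f / (CPI * 10^6)

print(round(cpi, 2))  # 2.24
print(int(mips))      # 178
```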



MFLOPS
Millions of Floating-Point operations per Second (MFLOPS)



Calculating the Mean
The use of benchmarks to compare systems involves
calculating the mean value of a set of data points related to
execution time

The three common formulas used for calculating a mean are:
o Arithmetic
o Geometric
o Harmonic
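As a concrete sketch of the three formulas on a small data set (values invented), note that for positive data they always order as AM ≥ GM ≥ HM:

```python
import math

data = [2.0, 4.0, 8.0]  # invented measurements

arithmetic = sum(data) / len(data)               # (Σx) / n
geometric = math.prod(data) ** (1 / len(data))   # (Πx)^(1/n)
harmonic = len(data) / sum(1 / x for x in data)  # n / Σ(1/x)

print(round(arithmetic, 3), round(geometric, 3), round(harmonic, 3))
# 4.667 4.0 3.429
```

This ordering is why the choice of mean matters when comparing benchmark results: the same data set yields different summary values under each formula.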



Comparison of Means on Various Data Sets

Each set has a maximum data point value of 11



Arithmetic Mean
 An Arithmetic Mean (AM) is an appropriate measure if the sum of all
the measurements is a meaningful and interesting value
 The AM is a good candidate for comparing the execution time
performance of several systems
o For example, suppose we were interested in using a system for large-scale
simulation studies and wanted to evaluate several alternative products. On
each system we could run the simulation multiple times with different input
values for each run, and then take the average execution time across all runs.
The use of multiple runs with different inputs should ensure that the results
are not heavily biased by some unusual feature of a given input set. The AM of
all the runs is a good measure of the system’s performance on simulations,
and a good number to use for system comparison.
 The AM used for a time-based variable, such as program execution
time, has the important property that it is directly proportional to
the total time
o If the total time doubles, the mean value doubles



Comparison of Arithmetic and Harmonic Means for
Rates



A Comparison of Arithmetic and Geometric Means
Results normalized to Computer A

Results normalized to Computer B



Another Comparison of Arithmetic and Geometric
Means
Results normalized to Computer A

Results normalized to Computer B



Benchmark Principles
Desirable characteristics of a benchmark program:
o It is written in a high-level language, making it portable across different
machines
o It is representative of a particular kind of programming domain or
paradigm, such as systems programming, numerical programming, or
commercial programming
o It can be measured easily
o It has wide distribution



SPEC
System Performance Evaluation Corporation (SPEC)
Benchmark suite
o A collection of programs, defined in a high-level language
o Together attempt to provide a representative test of a computer in a
particular application or system programming area

SPEC
o An industry consortium
o Defines and maintains the best known collection of benchmark suites
aimed at evaluating computer systems
o Performance measurements are widely used for comparison and research
purposes



SPEC CPU2006
Best known SPEC benchmark suite
Industry standard suite for processor intensive applications
Appropriate for measuring performance for applications that
spend most of their time doing computation rather than I/O
Consists of 17 floating point programs written in C, C++, and
Fortran and 12 integer programs written in C and C++
Suite contains over 3 million lines of code
Fifth generation of processor intensive suites from SPEC



SPEC CPU2006
Integer Benchmarks



SPEC CPU2006
Floating-Point
Benchmarks



Terms Used in SPEC Documentation
Benchmark
o A program written in a high-level language that can be compiled and
executed on any computer that implements the compiler

System under test
o This is the system to be evaluated

Reference machine
o This is a system used by SPEC to establish a baseline performance for all
benchmarks
 Each benchmark is run and measured on this machine to establish a reference
time for that benchmark

Base metric
o These are required for all reported results and have strict guidelines for
compilation



Terms Used in SPEC Documentation (Cont.)
Peak metric
o This enables users to attempt to optimize system performance by
optimizing the compiler output

Speed metric
o This is simply a measurement of the time it takes to execute a compiled
benchmark
 Used for comparing the ability of a computer to complete single tasks

Rate metric
o This is a measurement of how many tasks a computer can accomplish in a
certain amount of time
 This is called a throughput, capacity, or rate measure
 Allows the system under test to execute simultaneous tasks to take advantage of
multiple processors



SPEC Evaluation Flowchart



Some SPEC CINT2006 Results
Sun blade 1000



Some SPEC CINT2006 Results
Sun blade X6250



GeekBench
Set of cross-platform multicore benchmarks
o Can run on iPhone, Android, laptop, desktop, etc

Tests integer, floating point, and memory bandwidth performance

GeekBench stores all results online
o Easy to check scores for many different systems and processors

Pitfall: Workloads are simple and may not be a completely accurate
representation of performance
o Scores are evaluated relative to a baseline benchmark machine


