+
CIT 1103: Computer System Organization and Architecture
+ Chapter 2
Performance Issues
+
Designing for Performance
The cost of computer systems continues to drop dramatically, while the performance and capacity of those systems continue to rise equally dramatically
Today’s laptops have the computing power of an IBM mainframe from 10 or 15 years ago
Processors are so inexpensive that we now have microprocessors we throw away
Desktop applications that require the great power of today’s microprocessor-based systems include:
• Image processing
• Three-dimensional rendering
• Speech recognition
• Videoconferencing
• Multimedia authoring
• Voice and video annotation of files
• Simulation modeling
Businesses are relying on increasingly powerful servers to handle transaction and database processing and to support massive client/server networks that have replaced the huge mainframe computer centers of yesteryear
Cloud service providers use massive high-performance banks of servers to satisfy high-volume, high-transaction-rate applications for a broad spectrum of clients
+
Microprocessor Speed
Techniques built into contemporary processors include:
• Pipelining: Processor moves data or instructions into a conceptual pipe with all stages of the pipe processing simultaneously
• Branch prediction: Processor looks ahead in the instruction code fetched from memory and predicts which branches, or groups of instructions, are likely to be processed next
• Superscalar execution: This is the ability to issue more than one instruction in every processor clock cycle. (In effect, multiple parallel pipelines are used.)
• Data flow analysis: Processor analyzes which instructions are dependent on each other’s results, or data, to create an optimized schedule of instructions
• Speculative execution: Using branch prediction and data flow analysis, some processors speculatively execute instructions ahead of their actual appearance in the program execution, holding the results in temporary locations, keeping execution engines as busy as possible
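The benefit of having all pipeline stages busy at once can be illustrated with a simple cycle count. This is a minimal sketch under idealized assumptions that are not from the slides: every stage takes exactly one clock cycle and there are no stalls or mispredicted branches. The function names and the sample workload are hypothetical.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles needed when each instruction finishes all stages before the next starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles for an ideal pipeline: fill the pipe once, then complete one instruction per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


# Hypothetical workload: 100 instructions on a 5-stage pipe.
n, k = 100, 5
print(unpipelined_cycles(n, k))  # 500
print(pipelined_cycles(n, k))    # 104
```

Real processors fall short of this ideal count because of dependencies and branches, which is what the other techniques in the list try to mitigate.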
+
Performance Balance
Adjust the organization and architecture to compensate for the mismatch among the capabilities of the various components
Architectural examples include:
• Increase the number of bits that are retrieved at one time by making DRAMs “wider” rather than “deeper” and by using wide bus data paths
• Reduce the frequency of memory access by incorporating increasingly complex and efficient cache structures between the processor and main memory
• Change the DRAM interface to make it more efficient by including a cache or other buffering scheme on the DRAM chip
• Increase the interconnect bandwidth between processors and memory by using higher speed buses and a hierarchy of buses to buffer and structure data flow
+
Improvements in Chip
Organization and Architecture
Increase hardware speed of processor
Fundamentally due to shrinking logic gate size
More gates, packed more tightly, increasing clock rate
Propagation time for signals reduced
Increase size and speed of caches
Dedicating part of processor chip
Cache access times drop significantly
Change processor organization and architecture
Increase effective speed of instruction execution
Parallelism
+
Problems with Clock Speed and Logic Density
Power
Power density increases with density of logic and clock speed
Dissipating heat
RC delay
Speed at which electrons flow between transistors is limited by the resistance and capacitance of the metal wires connecting them
Delay increases as the RC product increases
As components on the chip decrease in size, the wire interconnects become thinner, increasing resistance
Also, the wires are closer together, increasing capacitance
Memory latency
Memory speeds lag processor speeds
+
Multicore
The use of multiple processors on the same chip provides the potential to increase performance without increasing the clock rate
Strategy is to use two simpler processors on the chip rather than one more complex processor
With two processors larger caches are justified
As caches became larger it made performance sense to create two and then three levels of cache on a chip
+
Many Integrated Core (MIC)
Graphics Processing Unit (GPU)
MIC
Leap in performance as well as the challenges in developing software to exploit such a large number of cores
The multicore and MIC strategy involves a homogeneous collection of general purpose processors on a single chip
GPU
Core designed to perform parallel operations on graphics data
Traditionally found on a plug-in graphics card, it is used to encode and render 2D and 3D graphics as well as process video
Used as vector processors for a variety of applications that require repetitive computations
+
Amdahl’s Law
Gene Amdahl
Deals with the potential speedup of a program using multiple processors compared to a single processor
Illustrates the problems facing industry in the development of multi-core machines
Software must be adapted to a highly parallel execution environment to exploit the power of parallel processing
Can be generalized to evaluate and design technical improvement in a computer system
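The slide states the law only in words. In its usual formulation, if a fraction f of a program's execution time can be spread perfectly across N processors and the rest is serial, then Speedup = 1 / ((1 - f) + f / N). A minimal Python sketch follows; the function and parameter names are chosen here for illustration and assume no scheduling or communication overhead.

```python
def amdahl_speedup(parallel_fraction: float, n_processors: int) -> float:
    """Speedup over one processor when `parallel_fraction` of the work scales perfectly."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


# Hypothetical program that is 90% parallelizable.
for n in (2, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

With 10% serial code the speedup can never reach 10, however many processors are added, which is why software must be adapted to parallel execution before extra cores pay off.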
+
Little’s Law
Fundamental and simple relation with broad applications
Can be applied to almost any system that is statistically
in steady state, and in which there is no leakage
Queuing system
If server is idle an item is served immediately, otherwise an
arriving item joins a queue
There can be a single queue for a single server or for multiple
servers, or multiple queues with one being for each of
multiple servers
Average number of items in a queuing system equals the average rate at which items arrive multiplied by the average time that an item spends in the system
Relationship requires very few assumptions
Because of its simplicity and generality it is extremely useful
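The relation above is commonly written L = λW, where L is the average number of items in the system, λ the average arrival rate, and W the average time an item spends in the system. A minimal sketch, with hypothetical function names and example numbers, assuming the system is in steady state:

```python
def items_in_system(arrival_rate: float, average_time: float) -> float:
    """Little's Law: L = lambda * W."""
    return arrival_rate * average_time


def average_time_in_system(items: float, arrival_rate: float) -> float:
    """Little's Law rearranged: W = L / lambda."""
    if arrival_rate <= 0:
        raise ValueError("arrival_rate must be positive")
    return items / arrival_rate


# Hypothetical server: 50 requests arrive per second, each spends 0.2 s in the system.
print(items_in_system(50.0, 0.2))          # about 10 requests in the system on average
print(average_time_in_system(10.0, 50.0))  # 0.2
```

Any two of the three quantities determine the third, so the law is often used to estimate the one that is hardest to measure directly.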
Table 2.1 Performance Factors and System Attributes
Calculating the Mean
The use of benchmarks to compare systems involves calculating the mean value of a set of data points related to execution time
The three common formulas used for calculating a mean are:
• Arithmetic
• Geometric
• Harmonic
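The three means can be sketched in a few lines of Python using their standard definitions. The helper names and sample values below are illustrative only, and the inputs are assumed to be positive numbers.

```python
import math
from typing import Sequence


def arithmetic_mean(xs: Sequence[float]) -> float:
    """Sum of the values divided by their count."""
    return sum(xs) / len(xs)


def geometric_mean(xs: Sequence[float]) -> float:
    """n-th root of the product, computed through logarithms to avoid overflow."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


def harmonic_mean(xs: Sequence[float]) -> float:
    """Count divided by the sum of reciprocals."""
    return len(xs) / sum(1.0 / x for x in xs)


samples = [1.0, 2.0, 4.0]  # hypothetical measurements
print(arithmetic_mean(samples), geometric_mean(samples), harmonic_mean(samples))
```

For positive values that are not all equal, the arithmetic mean is the largest of the three and the harmonic mean the smallest.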
+
Arithmetic Mean
An Arithmetic Mean (AM) is an appropriate measure if the sum of all the measurements is a meaningful and interesting value
The AM is a good candidate for comparing the execution time performance of several systems
For example, suppose we were interested in using a system for large-scale simulation studies and wanted to evaluate several alternative products. On each system we could run the simulation multiple times with different input values for each run, and then take the average execution time across all runs. The use of multiple runs with different inputs should ensure that the results are not heavily biased by some unusual feature of a given input set. The AM of all the runs is a good measure of the system’s performance on simulations, and a good number to use for system comparison.
The AM used for a time-based variable, such as program execution time, has the important property that it is directly proportional to the total time
If the total time doubles, the mean value doubles
Table 2.2 A Comparison of Arithmetic and Harmonic Means for Rates
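The table itself is not reproduced here, but the point of comparing the two means for rates can be sketched with made-up numbers. The example assumes every benchmark program executes the same number of instructions; under that assumption the harmonic mean of the per-program rates equals total instructions divided by total time, while the arithmetic mean of the rates overstates it. All names and values are hypothetical.

```python
from typing import Sequence, Tuple


def compare_rate_means(times_s: Sequence[float], instructions: float) -> Tuple[float, float, float]:
    """Return (arithmetic mean of rates, harmonic mean of rates, true overall rate)."""
    rates = [instructions / t for t in times_s]          # instructions per second, per program
    am = sum(rates) / len(rates)
    hm = len(rates) / sum(1.0 / r for r in rates)
    overall = len(times_s) * instructions / sum(times_s)  # total work / total time
    return am, hm, overall


# Hypothetical data: three programs of 1e8 instructions each.
am, hm, overall = compare_rate_means([2.0, 1.0, 0.5], 1e8)
print(am, hm, overall)
```

Here the harmonic mean matches the overall rate, so it remains inversely proportional to total execution time, which the arithmetic mean of rates does not.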
+
Benchmark Principles
Desirable characteristics of a benchmark program:
1. It is written in a high-level language, making it portable across different machines
2. It is representative of a particular kind of programming domain or paradigm, such as systems programming, numerical programming, or commercial programming
3. It can be measured easily
4. It has wide distribution
+
Standard Performance Evaluation Corporation (SPEC)
Benchmark suite
A collection of programs, defined in a high-level language
Together attempt to provide a representative test of a
computer in a particular application or system
programming area
SPEC
An industry consortium
Defines and maintains the best known collection of
benchmark suites aimed at evaluating computer systems
Performance measurements are widely used for comparison
and research purposes
+
SPEC CPU2006
Best known SPEC benchmark suite
Industry standard suite for processor intensive applications
Appropriate for measuring performance for applications that spend most of their time doing computation rather than I/O
Consists of 17 floating point programs written in C, C++, and Fortran and 12 integer programs written in C and C++
Suite contains over 3 million lines of code
Fifth generation of processor intensive suites from SPEC
Table 2.5 SPEC CPU2006 Integer Benchmarks
(Table can be found on page 69 in the textbook.)
Table 2.6 SPEC CPU2006 Floating-Point Benchmarks
(Table can be found on page 70 in the textbook.)
+
Terms Used in SPEC Documentation
Benchmark
A program written in a high-level language that can be compiled and executed on any computer that implements the compiler
System under test
This is the system to be evaluated
Reference machine
This is a system used by SPEC to establish a baseline performance for all benchmarks
Each benchmark is run and measured on this machine to establish a reference time for that benchmark
Base metric
These are required for all reported results and have strict guidelines for compilation
Peak metric
This enables users to attempt to optimize system performance by optimizing the compiler output
Speed metric
This is simply a measurement of the time it takes to execute a compiled benchmark
Used for comparing the ability of a computer to complete single tasks
Rate metric
This is a measurement of how many tasks a computer can accomplish in a certain amount of time
This is called a throughput, capacity, or rate measure
Allows the system under test to execute simultaneous tasks to take advantage of multiple processors
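The reference time and the speed metric combine as follows in SPEC CPU2006 reporting: each benchmark's ratio is the reference machine's time divided by the time on the system under test, and the overall speed metric is the geometric mean of those ratios. The sketch below uses hypothetical function names and made-up timings, not real SPEC results.

```python
import math
from typing import Sequence


def spec_ratio(reference_time_s: float, sut_time_s: float) -> float:
    """Ratio for one benchmark; values above 1 mean the system under test is faster."""
    return reference_time_s / sut_time_s


def overall_speed_metric(reference_times_s: Sequence[float], sut_times_s: Sequence[float]) -> float:
    """Geometric mean of the per-benchmark ratios."""
    if len(reference_times_s) != len(sut_times_s):
        raise ValueError("each benchmark needs a reference time and a measured time")
    ratios = [spec_ratio(r, s) for r, s in zip(reference_times_s, sut_times_s)]
    return math.exp(sum(math.log(x) for x in ratios) / len(ratios))


# Hypothetical timings in seconds for two benchmarks.
print(overall_speed_metric([100.0, 800.0], [50.0, 100.0]))  # ratios 2 and 8, geometric mean about 4
```

Using the geometric mean keeps the comparison between two systems independent of which machine was chosen as the reference.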
+
Summary
Performance Issues
Chapter 2
Designing for performance
• Microprocessor speed
• Performance balance
• Improvements in chip organization and architecture
Multicore
MICs
GPGPUs
Amdahl’s Law
Little’s Law
Basic measures of computer performance
• Clock speed
• Instruction execution rate
Calculating the mean
• Arithmetic mean
• Harmonic mean
• Geometric mean
Benchmark principles
SPEC benchmarks