Advance Computer Organization
By Kai Hwang.
Computer Generations
1. Computer evolved two stage of development,
2. Prior to 1945 they were Mechanical & Electromechanical
3. Earliest mechanical Computer in form of ABACUS from China ,500 BC
4.Blaise Pascal built a mechanical adder subtracter 1962.
5. Charles Babbage “Difference Engine” in England 1827.
6.Konard Zuse 1941, Germany mechanical Computer
7. Harward Mark1 by IBM 1944, Electromechanical decimal Computer
Computer Generations
1. Computer evolved two stage of development,
2. Prior to 1945 they were Mechanical & Electromechanical
3. Earliest mechanical Computer in form of ABACUS from China ,500 BC
4.Blaise Pascal built a mechanical adder subtracter 1962.
5. Charles Babbage “Difference Engine” in England 1827.
6.Konard Zuse 1941, Germany mechanical Computer
7. Harward Mark1 by IBM 1944, Electromechanical decimal Computer
Computer Generations
Progress in Hardware
1944-1954 Vaccum tubes
1955-1964 Transistors
1965-1974 IC's (SSI & MSI)
1975-1991 Large Scale or Very Large Scale
1991- till present- very high density & high
computing power
Elements of Modern Computers
Computing problems
Recognized that concept of computer architecture
is no longer limited to hardware
A modern computer consists of h/w, kernal, system
programs,OS, application programs, Networking
Interfaces etc.
Use of computers in real life problems
Sciencetific Applications demands floating point
strong computer.
Elements of a Computer System
Elements of Modern Computers
Computing problems
Hardware Resources
Operating System
System Software support
Compiler Support
Evolution of Computer Archietecture
Evolution of Computer Archietecture
Flynn's Classification
Development Layers
A layered development of parallel computers is
proposed by Lionel Ni (1990)
Hardware configurations differ from machine to
machine even from same model
Address space varies in different architectures
These features are up to the designer and
depends upon the type of application being
developed.
Six layers computer System
development
System Attributes to Performance
The ideal performance of a computer demands
a perfect match between machine capability
and program behavior.
Machine capability can be enhanced with
better hardware technology, innovative
architectural features and efficient resource
management.
Program behavior is difficult to predict due to
its heavy dependence on application and run
time conditions
System Attributes to Performance
The other factors which affect program
behavior includes Data Structures, Algorithm
design, language efficiency, programmers
capability, Compiler Technology etc.
It is impossible to achieve a perfect match
between hardware and software by merelt
improving few factors and without touching
other factors.
System Attributes to Performance
Some fundamental factors which affect
performance
Clock rate and CPI
Performance factors
System Attributes
MIPS Rate
Throughput Rate
Clock Rate and CPI
Performance Factors
Performance Factors
System Attributes
MIPS Rate
Performance Factors
Performance Factors
MIPS Rate of a given computer is directly
proportional to the Clock Rate and inversely
proportional to the CPI.
All four system attributes, Instruction Set,
Compiler, Processor Technology and Memory
Technology affect the MIPS rate which varies
from Program to Program.
Throughput Rate
Through Put is how many programs a system
can execute per unit time. The unit of
Throughput is programs per second.
Multiprocessors and Multicomputers
Two categories of parallel computers are
architecturally mentioned.
These physical models are physically distinguished
by having a shared common memory or unshared
distributed memory.
Shared Memory Multiprocessors:
1. The Uniform memory Access (UMA)model
2. The non Uniform memory Access (NUMA) model
3. The Cache Only Memory Architecture (COMA)
model
The Uniform memory Access (UMA)model
In a UMA multiprocessor model, the physical
memory is uniformly shared by all the
processors.
All processors have equal access time to all
memory words. This is why it is called Uniform
Memory Access.
Each processor may use a private cache. The
UMA model is suitable for general purpose time
sharing applications by multiple users.
The Uniform memory Access (UMA)model
Most Computer manufacturers have multiprocessor
extensions to their Uniprocessor product line.
To coordinate parallel events, synchronization and
communication among processors are done using
shared variables in the common memory.
When all processors have equal access to all
peripherals, the system is called symmetric
multiprocessor. In this case all processors are equally
capable of running the executive programs, such as
OS kernels and I/O service routines.
The Uniform memory Access (UMA)model
The NUMA Model
A NUMA multiprocessor is a shared memory
system in which the access time varies with the
location of the memory word.
The NUMA Model
The NUMA Model
The shared memory is physically distributed to
all processors, called local memories.
The collection of all local memories forms a
global address space accessible by all
processors.
It is faster to access a local memory with a local
processor.
The access to remote memory attached to other
processors takes longer due to the added
delay through the interconnection network.
The NUMA Model
In the hierarchical structured multiprocessor, the
processors are divided into clusters. Each
cluster can be a UMA or a NUMA model.
Clusters are connected to global shared memory
modules. The entire system is considered as
NUMA multiprocessor.
All processors belonging to same cluster are
allowed to uniformly access the clustered
shared memory modules.
The COMA model
Cache only memory architecture (COMA) is a
computer memory organization for use in
multiprocessors in which the local memories
(typically DRAM) at each node are used as
cache.
The COMA is a special case of NUMA machines
where distributed main memories are
converted into caches.
The COMA model
In NUMA, each address in the global address
space is typically assigned a fixed home node.
When processors access some data, a copy is
made in their local cache, but space remains
allocated in the home node. Instead, with
COMA, there is no home. An access from a
remote node may cause that data to migrate.
Compared to NUMA, this reduces the number
of redundant copies and may allow more
efficient use of the memory resources.
The COMA model
The COMA model
Tt raises problems of how to find a particular
data (there is no longer a home node) and
what to do if a local memory fills up.
Hardware memory coherence mechanisms are
typically used to implement the migration.
Distributed Memory Multicomputers
The system consists of multiple computers, often
called nodes, interconnected by a message
passing network.
Each node is an autonomous computer
consisting of processor, local memory and
sometimes attached disks or I/O peripherals.
The message passing network provides point to
point static connections among nodes. All local
memories are private and are accessible only
by local processors.
Distributed Memory Multicomputers
Vector Processor
A vector processor, or array processor, is a
central processing unit (CPU) that implements
an instruction set containing instructions that
operate on one-dimensional arrays of data
called vectors. This is in contrast to a scalar
processor, whose instructions operate on
single data items.
Vector Processor
Vector processors can greatly improve performance
on certain workloads, notably numerical
simulation and similar tasks.
Vector machines appeared in the early 1970s and
dominated supercomputer design through the
1970s into the 90s, notably the various Cray
platforms.
The rapid fall in the price-to-performance ratio of
conventional microprocessor designs led to the
vector supercomputer's demise in the later 1990s.
Vector Processor
Today, most commodity CPUs implement
architectures that feature instructions for a form
of vector processing on multiple (vectorized)
data sets, typically known as SIMD (Single
Instruction, Multiple Data). Common examples
include VIS, MMX, SSE, AltiVec and AVX.
Vector Processor
Vector processing techniques are also found in
video game console hardware and graphics
accelerators.
In 2000, IBM, Toshiba and Sony collaborated to
create the Cell processor, consisting of one
scalar processor and eight vector processors,
which found use in the Sony PlayStation 3
among other applications.
Vector Processor
Other CPU designs may include some multiple
instructions for vector processing on multiple
(vectorised) data sets, typically known as
MIMD (Multiple Instruction, Multiple Data) and
realized with VLIW.
Such designs are usually dedicated to a
particular application and not commonly
marketed for general purpose computing.
In the Fujitsu FR-V VLIW/vector processor both
technologies are combined.
Vector Processor-Cray J40 module
having four scalor/vector processors
Multivector and SIMD Computers
Super computers are classifieds either as
pipelined vector machines using a few powerful
processors equipped with vector hardware,
Or as SIMD computers emphasizing massive
data parallelism.
Vector Supercomputers
A vector supercomputer is often built on top of a
scalar processor as an optional feature.
Program and data are first loaded into the main
memory through a host computer.
All instructions are first decoded by scalar control
unit. If the decoded instruction is a scalar
operation or a program control operation it will
be directly executed by the scalar processor
using scalar functional pipelines.
Vector Supercomputers
If the Instruction is decoded as a vector
operation, it will be sent to the vector control
unit. This control unit will supervise the flow of
vector data between the main memory and
vector functional pipelines.
The vector data flow is coordinated by the control
unit. A number of vector functional pipelines
may be built into a vector processor.
Two pipe line Vector Supercomputer
model
NEC's Vector Super computer
NEC provides vector supercomputer with the
world's fastest core performance.
NEC Corporation has announced the worldwide
availability of the SX-ACE supercomputer, the
latest model in NEC's SX Series of vector
supercomputers, featuring the fastest
processor core performance in the world.
NEC's Vector Super computer
NEC receives order for SX-ACE vector supercomputer from
Tohoku University, Osaka University and the National Institute
for Environmental Studies. May 2014
NEC's Vector Super computer
The SX-ACE is the first SX Series
supercomputer equipped with a multi-core
vector CPU, which enables the world’s fastest
core performance and the world’s highest
memory bandwidth per core. Its performance
per rack has improved ten times over existing
SX-9 models, and it offers high sustained
performance and ease of use in scientific
computing applications.
Program and Network Properties
Levels of Parallelism
Computational Granularity
Time and Space Complexities
Communication Latencies
Scheduling Policies
Load Balancing
Data Dependencies
The ability to execute several programs
segments in parallel requires each segment to
be independent of the other segments.
In general, each code segment may contain one
or more statements.
Data Dependencies
Five types of data dependence are defined:
Flow Dependence
Anti Dependence
Output Dependence
I/O Dependence
Unknown Dependence
Data Dependencies
Data Dependencies
Data Dependencies
Data Dependencies
Consider above program.
S2 is flow dependent on S1 because A is
passed through R1.
S3 is anti dependent on S2 because of
potential conflicts in register content in R1.
S3 is output dependent on S1 because they
both modify the same register.
Control Dependencies
This refers to the situation where the order of
execution of statements cannot be determined
before run time.
For eg. Conditional If statements will not be
resolved until run time. Different paths taken
after a conditional branch may introduce or
eliminate data dependencies among
Instructions.
Resource Dependencies
This is different from data or control
dependence, which demands independence of
work to be done.
Resource dependence is concerned with the
conflicts in using shared resources, such as
integer Units, floating-point Units, Registers,
and memory areas, among parallel events.
When the conflicting resource is ALU we call it
ALU dependence.
Bernstein's Conditions
Bernstein gave a set of conditions based on
which two processes can execute in parallel.
Defining the input set Ii of a process Pi as the set
of input variables needed to execute the
process. (Input variables are operands which
can be fetched from memory or registers and
output variables may be stored in working
registered and output variables)
Bernstein's Conditions
Considering two process P1 and P2 with their
input sets I1 and I2 and the output sets O1 and
O2 respectively. The two processes can
execute in parallel and are denoted P1 || P2 if
they are independent and do not create
confusing results.
These conditions are as follows.
Bernstein's Conditions
I1 ∩ O2 = Φ
I2 ∩ O1 = Φ
O1 ∩ O2 = Φ
These three conditions are known as Bernstein's
Conditions.
In terms of data dependence Bernstein's
Conditions simply imply that two processes can
execute in parallel if they are flow in-
dependent, anti-independent and output-
independent.
Detection of parallelism using Bernstein's Conditions
We want to detect the parallelism embedded in
the following five instructions, P1,P2,P3,P4 and
P5 in the program order.
Detection of parallelism using Bernstein's Conditions
In sequential execution, five steps are needed.
Detection of parallelism using Bernstein's Conditions
Sequential Execution.
Detection of parallelism using Bernstein's Conditions
If two adders are available simultaneously, the
parallel execution requires only three steps.
Detection of parallelism using Bernstein's Conditions
Pair wise there are 10 pair of statements to
check against Bernstein's Conditions. Only five
pairs,P1 || P5 ,P2 || P3 ,P2 || P5 ,P5 || P3 ,and P4 ||
P5 , can execute in parallel as seen in
dependence graph.
Only, P2|| P3|| P5 is possible because P2|| P3,
P3|| P5 and P5|| P2 are all possible.
Detection of parallelism using Bernstein's Conditions
In general the paralleleism relation || is
commutative; that is, Pi|| Pj implies Pj|| Pi .But
the relation is not transitive; that is Pi|| Pj and
Pj|| Pk do not necessarily guarantee Pi|| Pk .
For example we have P1 || P5 and P5 || P2 , but
,P1 not parallel P2, which means P1 and P2 can
not execute in parallel.
In other words the order in which P1 and P2 are
executed will make a difference in the
computational results.
Program Partitioning and Scheduling
We Learn:
Computational Granularity or level of
parallelism in programs
Communication latency
Scheduling Issues
Program Partitioning and Scheduling
Grain Size or Granualarity:-
is a measure of Computation involved in a
software process.
Simplest measure is to count the number if
instructions in a grain (program segment).
Grain sizes are commonly described as
fine,medium, or coarse, depending upon the
processing levels involved.
Program Partitioning and Scheduling
Latency: is a time measure of the
communication overhead incurred between
machine subsystems.
E.g: the memory latency is the time required by a
processor to access memory.
The time required for two processes to
syncronize with each other is called
syncronisation latency.
Program Partitioning and Scheduling
Parallelism has been exploited at various
processing levels.
Program Partitioning and Scheduling
In general, the execution of a program may
involve a combination of these levels.
The actual combination depends on the
application, formulation, algorithm,
language,program, compilation support, and
hardware limitations.
Instruction Level: At this level, a typical grain
contains less than 20 Instructions, called fine
grain.
Program Partitioning and Scheduling
Loop Level: This corresponds to iterative loop
operations. Vector processing is mostly
exploited at the loop level by vectorizing
compiler. Considered fine Grain of
Computation.
Procedure Level: This corresponds to medium
size grain at the task, procedure, sub routine,
and coroutine level. A typical grain contains
less than 2000 instructions.
Program Partitioning and Scheduling
Subprogram Level: This corresponds to the
level of job steps and related subprograms.
The grain size may contain thousands of
Instructions. Multiprogramming on a
uniprocessor or on a multiprocessor is
conducted at this level. Medium Grain or
coarse-Grain.
Job(Program) Level: This corresponds to the
parallel execution of essentially independent
jobs (programs) on a parallel computer. The
grain size can be as high as tens thousands of
Instructions in a single program.
Grain Packing and Scheduling
Q. How can we partition a program into parallel
branches, program modules,microtasks, or
grains to yield the shortest possible execution
time?
Q. What is the optimal size of concurrent grains
in a computation?
Grain Packing and Scheduling
A Grain packing approach:
Var a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q
Grain Packing and Scheduling
The grain size is measured by the number of
basic processor and memory cycles needed to
execute all operations with in node.
Pair (n,s), n is the node name, s is the grain size
of the node.
Fine grains have smaller grain size and coarse
grain nodes have a larger grain size.
Edge label (v,d), specifies the out put variable v
from the source node or the input variable to
the destination node, and the communication
delay d. This includes all the path delays and
Grain Packing and Scheduling
Grain Packing and Scheduling: Fine
Grain
Grain Packing and Scheduling:
Course Grain
Grain Packing and Scheduling
The idea of grain packing is to apply fine grain
first to achieve a higher degree of parallelism.
Then Combine multiple fine grains into Coarse
grain node, if it can eliminate unnecessary
delays or reduce the over all scheduling
overhead.
Internal delays among fine grain operations with
in the same course grain node are negligible
because the communication delay is
contributed mainly by interprocessor delays
rather than the delays with in the same
processor.
Static multiprocessor scheduling
Grain Packing not always produce shorter
schedule.
Node Duplication:
In oder to eliminate the idle time and to further
reduce the communication delays among
processors, one can duplicate some of the
nodes in more than one processor.
Static multiprocessor scheduling
First Schedule without duplicating any of the
five nodes.(contains idle time as long
interprocessor time as 8 units)
Second Schedule has node A is duplicated as
A' and assigned to P2, similarly node C' is
copied into P1 besides original node C in P2.
Static multiprocessor scheduling
Static multiprocessor scheduling
Static multiprocessor scheduling
The schedule contains idle time as long
interprocessor delay (8 Units) between P1 and
P2.
In next schedule node A is duplicated into A' and
assigned to P2 besides retaining the original
copy A in node P1.
Similarly, a duplicated node C' is copied into P1
besides the original node in P2.
Static multiprocessor scheduling
The new schedule is almost 50% shorter. The
reduction is schedule time is caused by
elimination of (a,8) and (c,8) delays between
the two processors.
The purpose of Multiprocessor Scheduling is to
obtain a minimal time schedule for the
computations involved.
Ex: How two 2x2 matrices are multiplied to
compute the sum of four elements, resulting in
C=A X B.
Parallel decomposition for static
multiprocessor schedule
Parallel decomposition for static
multiprocessor schedule
Parallel decomposition for static
multiprocessor schedule
Parallel decomposition for static
multiprocessor schedule
The Eight computations are performed in eight nodes, each of
which has a grain size 101 CPU Cycles.
The interprocessor communication latency along all edges in the
the fine graph is eliminated as d=212 cycles.
Parallel decomposition for static
multiprocessor schedule
Parallel decomposition for static
multiprocessor schedule
Parallel decomposition for static
multiprocessor schedule
The Fine Grain on an Eight Processor (P1 to P8) system.
Speed Up= 864/741=1.16
Parallel decomposition for static
multiprocessor schedule
Using Grain Packing , top two levels into four Course grain
Nodes as V, W, X and Y.
Remaining three nodes (V,W,X and Z) form the fifth node as Z
Parallel decomposition for static
multiprocessor schedule
Using Grain Packing , top two levels into four Course grain
Nodes as V, W, X and Y.
Remaining three nodes (V,W,X and Z) form the fifth node as Z.
Speed UP : 864/446=1.94
Program Flow Mechanisms
Conventional Computers are based on control
flow ,mechanism as stated by Instruction
execution by the user program.
Data Flow computers emphasize the execution
of any instruction based by data (operand)
availability.
Data Flow Computers emphasize higher degree
of parallelism at the fine grain instructional
level.
Program Flow Mechanisms
Control Flow Computer Data Flow Computer
Program Counter Sequences the Uses shared memory to hold program
Program and Instruction
Control Driven Data availability driven
Contro Flow can be made parallel by Execution is driven by data availability
using parallel language constructs
Computational results (data tokens) are
passed directly between instructions
The data generated is duplicated into
many copies and forwarded directly to all
needy instructions
Data tokens once consumed by an
instruction, will no longer be available for
reuse
Requires no Program Counter, no shared
memory, no control sequencer
A data flow Architecture
Arvind at MIT developed tagged token
architecture.
A data flow Architecture
A data flow Architecture
The inter PE communications are done
through pipelined routing network
With in PE, the machine provides a low level
token matching mechanism which dispatches
only those instructions whose input data tokens
are already available
Each datum is tagged with the address of the
Instructions to which it belongs
Instruction are stored in program memory.
Tagged tokens enter the PE through a local
path.
Data flow Graph
A data flow graph is similar to dependence
graph, with a difference that data tokens are
passed around edges in a data flow graph.
Add: 1
Multiply: 2
Divide: 3 cycles to complete
Sequential execution on a uniprocessor
requires 48 cycles
Data flow Graph
Data driven execution on 4 processor requires
24 cycles.