21CS46T - COMPUTER ARCHITECTURE AND ORGANIZATION
UNIT - IV
PARALLELISM
Parallel processing challenges – Flynn's classification – SISD, MIMD, SIMD, SPMD -
Subword Parallelism and Vector Architectures – Hardware multithreading – Multi-core
processors and other Shared Memory Multiprocessors – Introduction to Graphics Processing
Units (GPU), Clusters, Warehouse Scale Computers and other Message-Passing
Multiprocessors – Case Study: Intel Core i7 versus NVIDIA Tesla GPU.
1. PARALLELISM
INTRODUCTION
o Pipelining exploits the parallelism among instructions. This parallelism is called
instruction-level parallelism (ILP).
o Instruction-level parallelism (ILP) is a measure of how many of
the instructions in a computer program can be executed simultaneously.
o There are two approaches to instruction level parallelism: Hardware and Software.
o Hardware level works upon dynamic parallelism, whereas the Software level
works on static parallelism.
o Dynamic parallelism means the processor decides at run time which instructions
to execute in parallel.
o Static parallelism means the compiler decides which instructions to execute in
parallel.
o There are two methods for increasing the amount of instruction-level parallelism.
1. First approach - to increase the depth of the pipeline to overlap more
instructions.
2. Second approach - to replicate the internal components of the computer so that
it can launch multiple instructions in every pipeline stage. This technique is
called multiple issue.
o There are two major ways to implement a multiple-issue processor :
Static Multiple Issue and Dynamic Multiple Issue
Static Multiple Issue : Decisions are made statically
(at compile time, before execution).
Dynamic Multiple Issue : Decisions are made dynamically
(at execution time).
SPECULATION
o Speculation is an approach that allows the compiler or the processor to "guess"
about the properties of an instruction, so as to enable execution to begin for
other instructions that may depend on the speculated instruction.
o We might speculate on the outcome of a branch, so that instructions after the
branch could be executed earlier.
o The difficulty with speculation is that it may be wrong. So, any speculation
mechanism must include both a method to check if the guess was right and a
method to unroll or back out the effects of the instructions that were executed
speculatively.
o Speculation introduces one problem - Speculating on certain instructions may
introduce exceptions that were formerly not present.
o Speculation may be done by the hardware (or) by the compiler.
1. HARDWARE BASED SPECULATION
o In hardware speculation, the speculated instructions are executed and their results are held in buffers.
o If the speculation was correct, the executed instructions are written to the registers
or memory.
o If the speculation was incorrect, the hardware flushes the buffers and reexecutes
the correct instruction sequence.
2. COMPILER BASED SPECULATION
o In compiler-based speculation, the compiler inserts special speculation support
so that needless exceptions can be ignored while the instructions execute.
o Additional instructions are inserted to check the accuracy of the speculation.
o A recovery routine is provided for use when the speculation is incorrect.
2. INTRODUCTION TO PARALLEL PROCESSING
o Processing data concurrently is called Parallel Processing.
o There are two ways of achieving parallelism :
1. Multiple Functional Units : System may have two or more ALUs so
that they can execute two or more instructions at the same time.
2. Multiple Processors : System may have two or more processors operating
concurrently.
Fig : Processor with Multiple Functional Units
DEFINITIONS
o Job-level Parallelism (or) Process-level Parallelism (or) Task-level Parallelism
- Utilizing multiple processors for executing independent programs
simultaneously.
o Parallel Processing Program - It is referred to a single program that runs on
multiple processors simultaneously.
o Multicore Processor
An architectural design that contains multiple processors (“cores”) on a
single IC.
Also called Chip Multiprocessors (CMP).
Processors are often called cores in a multicore chip. The number of cores
is expected to double every two years.
These multiprocessors are also called Shared Memory Processors
(SMP), as they share a single physical address space (memory).
LIMITATIONS OF SINGLE CORE ARCHITECTURE
One way to increase the system performance is to increase the clock frequency or
processor speed. But this approach has following limitations:
1. Higher frequency requires more power.
2. More power consumption makes the system harder and more expensive to
cool.
3. More power consumption also affects the sizing and packaging
considerations.
Thus, instead of trying to make the processor faster to gain performance, adding more
processors on a single chip is a better approach.
PARALLEL PROCESSING CHALLENGES
o The difficulty with parallelism is not the hardware.
o It is difficult to write software that uses multiple processors and the problem gets
worse as the number of processors increases.
o We must get better performance or better energy efficiency from a parallel
processing program.
o It is difficult to write parallel processing programs that are fast, as the number
of processors increases.
o In parallel programming, the other challenges include scheduling, load balancing,
time for synchronization and overhead for communication.
3. AMDAHL’S LAW
o Amdahl's law is used in parallel processing to predict the theoretical speedup
when using multiple processors.
o It is named after Gene Amdahl, a computer architect from IBM.
o It tells that even small parts of a program must be parallelized if the program is
to make good use of many cores.
o It is a formula which gives the theoretical speedup in latency of the execution of
a task at a fixed workload that can be expected of a system whose resources are
improved.
o It is a formula used to find the maximum improvement possible by just
improving a particular part of a system.
o Amdahl's law states that "Unless virtually all of a serial program is
parallelized, the possible speedup is going to be very limited, regardless of
the number of cores available."
(P = Number of Processors)
o Let Fe = Fraction of the sequential program that can be Parallelized (or)
Enhanced (or) Improved
o Let Pe = Speedup of the enhanced fraction (in terms of times faster) (or)
the number of concurrent processors
o Then, according to Amdahl's law,

Speedup = 1 / ((1 - Fe) + Fe / Pe)

(NOTE - Instead of Pe, S can also be used)
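As a quick sanity check, the formula can be evaluated with a few lines of C (a minimal sketch of ours; the function name amdahl_speedup is illustrative, not from any library):

#include <stdio.h>

/* Amdahl's law: speedup = 1 / ((1 - fe) + fe/pe), where fe is the
   enhanced (parallelizable) fraction and pe is its speedup. */
static double amdahl_speedup(double fe, double pe) {
    return 1.0 / ((1.0 - fe) + fe / pe);
}

int main(void) {
    /* 90% of the program sped up 10 times -> overall speedup of about 5.26 */
    printf("Speedup = %.2f\n", amdahl_speedup(0.90, 10.0));
    return 0;
}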
SPEEDUP
Speedup is defined as the ratio of the execution time for the entire task without using
the enhancement and execution time for the entire task using the enhancement.
If
Pe is the performance for the entire task using the enhancement, and
Pw is the performance for the entire task without using the enhancement, then
Speedup = Pe/Pw
If
Ew is the execution time for the entire task without using the enhancement, and
Ee is the execution time for the entire task after using the enhancement, then
Speedup = Ew/Ee
ACHIEVING SPEED UP / TYPES OF SCALING
In order to achieve speedup in a multiprocessor system, the following techniques are
followed:
o Strong scaling
o Weak scaling
1. Strong Scaling - Strong scaling means measuring speed-up while keeping the
problem size fixed.
2. Weak Scaling - Speed-up achieved on a multiprocessor while increasing the size
of the problem proportionally to the increase in the number of
processors.
Prepared by Dr.M.Malathy Professor/CSE
21CS46T - COMPUTER ARCHITECTURE AND ORGANIZATION
SPEEDUP CHALLENGES
1. Increase in Problem Size :
Getting good speed-up on a multiprocessor while keeping the problem size
fixed is harder than getting good speed-up by increasing the size of the problem.
2. Balancing Load :
There will be a greater impact on speed-up if one processor's load is higher than
all the rest, as the calculation below illustrates.
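A small illustrative calculation (ours, in the spirit of the text): suppose a perfectly parallelizable job runs on 100 processors. With a perfectly balanced load, each processor does 1% of the work and the speedup is 100. If one processor is instead assigned 2% of the work, the execution time can be no shorter than 2% of the sequential time, so Speedup ≤ 1/0.02 = 50. A 1% imbalance on a single processor thus cuts the attainable speedup in half.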
SOLVED PROBLEMS - AMDAHL’S LAW
1. What is the overall speedup achieved if we make 90% of a program run 10 times
faster?
Solution : Fe = 0.9, Pe = 10
Speedup = 1 / ((1 - 0.9) + 0.9/10) = 1 / (0.1 + 0.09) = 1 / 0.19 ≈ 5.26
2. What is the overall speedup achieved if we make 80% of a program run 20% faster?
Solution : Fe = 0.8, Pe = 1.2
Speedup = 1 / ((1 - 0.8) + 0.8/1.2) = 1 / (0.2 + 0.667) = 1 / 0.867 ≈ 1.15
3. If 90% of the computation can be parallelized, what is the maximum speedup
achievable using 8 processors?
Solution : Fe = 0.9, Pe = 8
Speedup = 1 / ((1 - 0.9) + 0.9/8) = 1 / (0.1 + 0.1125) = 1 / 0.2125 ≈ 4.71
4. We are considering an enhancement to the processor of a web server. The new
CPU is 20 times faster on search queries than the old processor. The old processor is
busy with search queries 70% of the time. What is the speedup gained by integrating
the enhanced CPU?
Solution : Fe = 0.7, Pe = 20
Speedup = 1 / ((1 - 0.7) + 0.7/20) = 1 / (0.3 + 0.035) = 1 / 0.335 ≈ 2.99
5. We are considering an enhancement to the processor of a server. The new CPU is
10 times faster. The server is I/O bound, spending 60% of its time waiting for I/O.
Solution : only the 40% of time spent computing is enhanced, so Fe = 0.4, Pe = 10
Speedup = 1 / ((1 - 0.4) + 0.4/10) = 1 / (0.6 + 0.04) = 1 / 0.64 ≈ 1.56
6. What percentage of original computation can be sequential to achieve a speed-up of
90 times faster with 100 processors?
Given :
Speedup = 90, Pe = 100
To find :
Fe = ?
We know
Speedup = 1 / ((1 - Fe) + Fe/Pe)
Substituting the given values we have
90 = 1 / ((1 - Fe) + Fe/100)
Solving for Fe we get
(1 - Fe) + Fe/100 = 1/90, i.e., 1 - 0.99 Fe = 0.0111, so
Fe ≈ 0.9989 = 99.89%
The percentage of the original computation that can be sequential = 100% - Fe ≈ 0.11%
SOLVED PROBLEMS ON PROTEIN STRING MATCHING CODE
1. Consider a Protein String Matching Code that runs for 200 hours on the current
machine, and spends 20% of its time doing integer instructions.
(a) How much faster must you make the integer unit to make the code run 10 hours
faster?
(b) How much faster must you make the integer unit to make the code run 50 hours
faster?
Solution
Integer instructions take 20% of 200 hours = 40 hours; the other 160 hours are
unaffected by the integer unit.
(a) 10 hours faster : target time = 190 hours, so the integer part must drop from
40 to 190 - 160 = 30 hours. Required integer-unit speedup = 40/30 ≈ 1.33 times faster.
(b) 50 hours faster : target time = 150 hours, so the integer part would have to
take 150 - 160 = -10 hours.
Negative Speedup not Possible.
Hence target is not achievable.
2. Consider a Protein String Matching Code that takes 4 days of execution time on the
current machine, spending 20% of its time doing integer instructions and 35% of its
time doing I/O instructions.
Which is the better economic tradeoff?
a) A compiler optimization that reduces the number of integer instructions by 25%
(assume each integer instruction takes the same amount of time), or
b) A hardware optimization that makes I/O run 20% faster?
Solution
a) The integer time shrinks to 75% of 20% = 15% of the total, so
Speedup = 1 / (0.80 + 0.15) = 1 / 0.95 ≈ 1.053
b) Fe = 0.35, Pe = 1.2, so
Speedup = 1 / (0.65 + 0.35/1.2) = 1 / 0.942 ≈ 1.062
The hardware optimization (b) yields the slightly larger speedup, so it is the
better economic tradeoff.
4. FLYNN’S CLASSIFICATION / TAXONOMY
M.J. Flynn proposed a classification for the organization of a computer system
by the number of instructions and data items that are manipulated
simultaneously.
The sequence of instructions read from memory constitutes an instruction
stream.
The operations performed on the data in the processor constitute a data stream.
It is based on the multiplicity of instruction and data streams observed by the
CPU during program execution.
o Flynn proposed the following categories of computer systems:
o Single Instruction, Single Data stream (SISD)
o Single Instruction, Multiple Data stream (SIMD)
o Multiple Instruction, Single Data stream (MISD)
o Multiple Instruction, Multiple Data stream (MIMD)
1. SINGLE INSTRUCTION SINGLE DATA STREAM (SISD)
o An SISD computing system is a uniprocessor machine which is capable of
executing a single instruction, operating on a single data stream.
o In SISD, machine instructions are processed in a sequential manner and computers
adopting this model are popularly called sequential computers.
o Most conventional computers have SISD architecture. All the instructions and data
to be processed have to be stored in primary memory.
o There is some sort of control unit that provides an instruction stream to a
processing unit.
o The processing unit operates on a single data stream from a data memory unit.
o E.g. - IBM PC, IBM 704, VAX 11/780, CRAY-1, older mainframe computers
2. SINGLE INSTRUCTION MULTIPLE DATA STREAM (SIMD)
o An SIMD system is a multiprocessor machine capable of executing the same
instruction on all the CPUs but operating on different data streams.
o In SIMD, there is a single control unit, feeding a single instruction stream to
multiple Processing Units.
o Each Processing Unit may have its own dedicated memory unit.
o These machines are used in applications such as Digital signal processing, image
processing and multimedia applications (Audio and Video).
o Machines based on an SIMD model are well suited to scientific computing since
they involve lots of vector and matrix operations.
o Vector and array processors fall into this category.
o E.g. - ILLIAC-IV, MPP, CM-2, STARAN
3. MULTIPLE INSTRUCTION SINGLE DATA STREAM (MISD)
o An MISD computing system is a multiprocessor machine capable of executing
different instructions but all of them operating on the same data stream.
o This is an uncommon architecture which is generally used for fault tolerance.
o Machines built using the MISD model are not useful in most applications; a few
machines have been built, but none has been commercially successful.
o E.g. - Systolic array computers
4. MULTIPLE INSTRUCTIONS MULTIPLE DATA STREAM (MIMD)
o MIMD machines are usually referred to as multiprocessors or multicomputers.
o An MIMD system is capable of executing multiple instructions on multiple data
streams.
o Each processor includes its own control unit, and each processor can be assigned
part of a task or a separate task.
o Machines built using this model are suited to any kind of application.
o An MIMD machine may be a shared-memory multiprocessor or a distributed-memory
multicomputer.
o E.g. - Cray X-MP, IBM 370/168 MP
5. SINGLE PROGRAM, MULTIPLE DATA (SPMD)
o SPMD (single program, multiple data) is a technique employed to
achieve parallelism; it is a subcategory of MIMD.
o Tasks are split up and run simultaneously on multiple processors with different
input in order to obtain results faster.
o SPMD is the most common style of parallel programming.
SPMD vs SIMD
1. In SPMD, multiple autonomous processors simultaneously execute the same
program at independent points, rather than in the lockstep that SIMD imposes
on different data. With SPMD, tasks can be executed on general-purpose CPUs,
as the sketch after this list shows.
2. SIMD requires vector processors to manipulate data streams. Note that the
two are not mutually exclusive.
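A minimal SPMD sketch in C using POSIX threads (our illustration, not from the text): every thread runs the same program, but the thread id selects a different slice of the data.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 16

static int data[N];

/* Same code for every thread (single program); the id picks the slice
   each thread works on (multiple data). */
static void *worker(void *arg) {
    long id = (long)arg;
    int chunk = N / NTHREADS;
    for (int i = (int)id * chunk; i < ((int)id + 1) * chunk; i++)
        data[i] = i * i;            /* each thread fills its own slice */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    for (int i = 0; i < N; i++)
        printf("%d ", data[i]);
    printf("\n");
    return 0;
}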
6. VECTOR PROCESSORS
o Vector processing is a technology used in modern computers and central
processing units.
o A vector processor is a central processing unit that can work on an entire vector
(array) in one instruction.
o The instruction to the processor is in the form of one complete vector instead of
its element.
o Vector processors are used because they reduce the instruction fetch and decode
bandwidth, since far fewer instructions must be fetched.
o A vector processor is also known as an array processor.
o They exploit data parallelism in large scientific and multimedia applications.
5. HARDWARE MULTITHREADING
o An approach which allows a high degree of instruction-level parallelism
without increasing complexity or power consumption is called multithreading.
o The instruction stream is divided into several smaller streams, known as threads,
such that the threads can be executed in parallel.
o Hardware multithreading is a well-known technique to increase the utilization of
processor resources. The idea is to start executing a different thread when the
current thread is stalled.
o Hardware Multithreading allows multiple threads to share the functional units
of a single processor in an overlapping fashion.
o To permit this sharing, the processor must duplicate the independent state of each
thread.
o There are three main approaches to hardware multithreading.
o Fine-Grained Multithreading
o Coarse-Grained Multithreading
o Simultaneous Multithreading
FINE-GRAINED MULTITHREADING
o Fine-grained multithreading switches between threads on
each instruction, resulting in interleaved execution of
multiple threads.
o Also called Interleaving.
o This interleaving is often done in a round robin fashion,
skipping any threads that are stalled at that time.
o To make fine-grained multithreading practical, the processor
must be able to switch threads on every clock cycle.
o If there is a sufficient number of threads, it is likely that at
least one is active (not stalled), and the CPU can be kept
running.
o With fine-grained multithreading in a pipelined Architecture,
if:
– the pipeline has k stages,
– there are at least k threads to be executed, and
– the CPU can execute a thread switch at each clock cycle
then
-there can never be more than a single instruction per thread in the pipeline at
any instant, so there cannot be hazards due to dependencies, and the pipeline
never stalls.
o Advantage :
Potential to avoid wasted machine time due to stalls.
It can hide throughput losses that arise from both short and long stalls.
o Disadvantage :
A thread that is ready to execute a long sequence of instructions may have
to wait between its own instructions, since other threads are interleaved.
It slows down the execution of individual threads.
COARSE-GRAINED MULTITHREADING
o Coarse-grained multithreading was invented as an alternative
to fine-grained multithreading.
o Coarse-grained multithreading switches threads only on
stalls, waiting for a time-consuming operation to complete.
o Also called Blocking.
o A switch is made to another thread. When this thread in turn
causes a stall, a third thread is scheduled and so on.
o Advantage :
Switching threads doesn't need to be nearly
instantaneous.
o Disadvantages :
The processor may sit idle on shorter stalls, since
thread switching itself causes delays.
Coarse-grained multithreading therefore suffers major
throughput losses, especially from shorter stalls.
SIMULTANEOUS MULTITHREADING (SMT)
o Simultaneous multithreading (SMT) is a variation on hardware multithreading to
exploit instruction-level parallelism (ILP) and thread-level parallelism(TLP) at the
same time.
Simultaneous Multi-Threading (SMT) = ILP + TLP
o Simultaneous multithreading (SMT) is a technique for improving the overall
efficiency of superscalar CPUs with hardware multithreading.
o SMT permits multiple independent threads of execution to
better utilize the resources provided by modern processor
architectures.
o In SMT, instructions belonging to different threads are
(almost certainly) independent, and issuing them
concurrently raises CPU resource utilization.
o SMT uses the resources of a multiple-issue,
dynamically scheduled processor.
o SMT is convenient since modern multiple-issue CPUs have
a number of functional units that cannot be kept busy with
instructions from a single thread.
o By applying register renaming and dynamic scheduling,
instructions belonging to different threads can be executed concurrently.
(Register Renaming - conceptually, an unbounded number of virtual registers
is available, hence all WAW and WAR hazards are avoided and an unbounded
number of instructions can begin execution simultaneously.
Dynamic Parallelism - the processor decides at run time which instructions to
execute in parallel.)
o Multiple-issue processors often have more functional-unit parallelism available
than a single thread can effectively use.
o SMT is always executing instructions from multiple threads, leaving it up to the
hardware to associate instruction slots with their proper threads.
APPROACHES TO EXECUTE MULTIPLE THREADS
o The figure given below illustrates the possible architecture that involve
multithreading and contrasts these with approaches that do not use multithreading.
o Each horizontal row represents the issue slot or slots for a single execution cycle;
that is, the width of each row corresponds to the maximum number of instructions
that can be issued in a single clock cycle.
o The vertical dimension represents the time sequence of clock cycles.
o An empty slot represents an unused execution slot in one pipeline.
The various approaches to execute Multiple Threads are
1. Approaches with a scalar (Single Issue) processor
2. Approaches with a superscalar (Multiple Issue) processor
Approaches with a Scalar Processor
o The different approaches with a scalar (i.e., single-issue) processor:
1. Single Threaded Scalar:
o This is the simple pipeline found in traditional RISC and CISC machines,
with no multithreading.
2. Interleaved Multithreaded Scalar:
o This is the easiest multithreading approach to implement.
o By switching from one thread to another at each clock cycle, the pipeline
stages can be kept fully occupied, or close to fully occupied.
o The hardware must be capable of switching from one thread context to
another between cycles.
3. Blocked Multithreaded Scalar:
o A single thread is executed until a latency event occurs that would stop the
pipeline, at which time the processor switches to another thread.
Approaches with a Superscalar Processor
1. Superscalar:
o This is the basic superscalar approach with no multithreading.
o It is the most powerful approach to providing parallelism within a
processor.
o During some cycles, not all of the available issue slots are used. During
these cycles, less than the maximum number of instructions is issued; this is
referred to as horizontal loss.
o During other instruction cycles, no issue slots are used; these are cycles
when no instructions can be issued; this is referred to as vertical loss.
2. Interleaved Multithreading Superscalar:
o During each cycle, as many instructions as possible are issued from a single
thread.
o With this technique, potential delays due to thread switches are eliminated.
o The number of instructions issued in any given cycle is still limited by
dependencies that exist within any given thread.
3. Blocked Multithreaded Superscalar:
o Instructions from only one thread may be issued during any cycle, and
blocked multithreading is used.
6. MULTICORE PROCESSORS
o A multicore computer, also known as a chip multiprocessor, combines
two or more processors (called cores) on a single IC.
o Each core consists of all of the components of an independent processor,
such as registers, ALU, pipeline hardware, and control unit, plus L1
instruction and data caches.
o In addition to the multiple cores, contemporary multicore chips also include
L2 cache and L3 cache.
Hardware Performance Issues :
o Increase in Parallelism
o Increase in Complexity
o Increase in Power Consumption
Software Performance Issues :
o Multi-threaded applications
o Multi-process applications
o Multi-instance applications
MULTICORE ORGANIZATION
The main variables in a multicore organization are as follows:
o The number of core processors on the chip
o The number of levels of cache memory
o The amount of cache memory that is shared
Each core has its own L1 and L2 cache.
L1 and L2 are the fastest memories that a CPU can access.
L1 is always dedicated per core whereas L2 can be shared.
If the processor can find the instruction sets or data for its next operation in the
L1 and L2 cache, then it does not need to access the slower L3 cache.
The four general organizations for multicore systems are :
Dedicated L1 Cache
Dedicated L2 Cache
Shared L2 Cache
Shared L3 Cache
DEDICATED L1 CACHE
o In this organization, the only on-chip cache is L1 cache,
with each core having its own dedicated L1 cache.
o The L1 cache is divided into instruction and data
caches.
o L1 (Level 1) cache is the fastest memory that is present
in a computer system.
o L1 cache has the data the CPU is most likely to need
while completing a certain task.
o Its size typically varies between 256KB and 1MB.
o Example : ARM11 MPCore
DEDICATED L2 CACHE
o In this organization, there is no on-chip cache sharing.
o L2 cache is slower than L1 cache, but bigger in size.
o Its size typically varies between 256KB and 8MB.
o A potential advantage to having only dedicated L2
caches on the chip is that each core enjoys more rapid
access to its private L2 cache.
o L2 cache holds data that is likely to be accessed by the
CPU next.
o Example : AMD Opteron.
SHARED L2 CACHE
o A similar allocation of chip space to memory, but with
the use of a shared L2 cache.
o As the amount of cache memory available on the chip
continues to grow, performance considerations dictate
splitting off a separate, shared L3 cache, with dedicated
L1 and L2 caches for each core processor.
o The use of a shared L2 cache confines the cache
coherency problem to the L1 cache level, which may
provide some additional performance advantage.
o Example : Intel Core Duo.
SHARED L3 CACHE
o L3 (Level 3) cache is the largest cache memory unit,
and also the slowest one.
o Its size typically varies between 4MB and 50MB.
o As both the amount of memory available and the
number of cores grow, the use of a shared L3 cache
combined with either a shared L2 cache or dedicated
per core L2 caches seems likely to provide better
performance than simply a massive shared L2 cache.
o Example : AMD K10.
7. SHARED MEMORY MULTIPROCESSORS
o A shared memory multiprocessor is a parallel
processor with a single address space across all
processors.
o All processes on the various CPUs share a single logical address space, which is
mapped onto a physical memory that can be distributed among the processors.
o Processors communicate through shared variables in memory.
o Each process can read and write a data item simply using load and store operations.
o These systems can still run independent jobs in their own virtual address spaces,
even if they all share a physical address space.
o This architectural model is simple and easy to program.
o It can be applied to a wide variety of problems that can be modeled as a set of
tasks to be executed in parallel.
o The basic issue in shared memory multiprocessor systems is memory itself: the
larger the number of processors involved, the more difficult it is to use
memory efficiently.
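A minimal sketch of this style in C with POSIX threads (our illustration): all threads live in one address space, read and write the same variable with ordinary loads and stores, and use a lock for the synchronization mentioned above.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;                 /* shared variable in the single address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);       /* synchronize access to shared data */
        counter++;                       /* an ordinary load + store on shared memory */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);  /* 400000: every thread saw the same memory */
    return 0;
}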
TYPES OF SHARED MEMORY MULTIPROCESSORS
There are three types of Shared memory multiprocessors.
They are
o Uniform Memory Access multiprocessors (UMA)
o Non-Uniform Memory Access multiprocessors (NUMA)
o Cache Only Memory Access (COMA)
UNIFORM MEMORY ACCESS MULTIPROCESSOR (UMA)
o A multiprocessor in which accesses to main memory take about the same amount
of time, no matter which processor requests the access and no matter which
word is asked for.
o These systems are also called Symmetric
Shared-memory Multiprocessors.
NON-UNIFORM ACCESS MULTIPROCESSOR (NUMA)
o A type of single address space
multiprocessor in which some memory
accesses are much faster than others
depending on which processor asks for
which word.
o These systems are also called Distributed
Shared Memory architectures.
CACHE ONLY MEMORY ACCESS (COMA)
o Data have no specific "permanent" location.
o The entire physical address space is considered a huge, single cache.
o Data can be read into the local caches and/or modified and then updated at
their "permanent" location.
o Data can migrate and/or be replicated in the various memory banks of the
central main memory.
TYPES OF SHARED MEMORY ARCHITECTURES
There are two types of Shared memory Architectures. They are
1. Symmetric Shared-Memory Architecture
2. Distributed Shared Memory Architecture
SYMMETRIC SHARED-MEMORY ARCHITECTURE
o The Symmetric Shared Memory Architecture consists of several processors with a
single physical memory shared by all processors through a shared bus.
o A multicore processor has multiple CPUs or cores on a single chip.
o The cores have private level-1 caches, while other caches may or may not be
shared between the cores.
DISTRIBUTED SHARED MEMORY ARCHITECTURE
o A distributed-memory system, often called a multicomputer, consists of multiple
independent processing nodes with local memory modules which are connected by
a general interconnection network.
o Here, physically separated memories can be addressed as one logically shared
address space.
o Here, the term "shared" does not mean that there is a single centralized memory,
but that the address space is "shared".
o Each node of a cluster has access to shared memory in addition to each node's non-
shared private memory.
o Distributed-memory systems are also called clusters or hybrid systems.
8. GRAPHICS PROCESSING UNIT (GPU)
GPU is also called a Visual Processing Unit.
A GPU is a graphics coprocessor or accelerator mounted on a computer's
graphics card or video card.
Traditional CPUs are structured with only a few cores whereas a modern GPU
chip is built with hundreds of processing cores.
GPUs are designed to handle large number of floating-point operations in
parallel.
GPUs are widely used in mobile phones, game consoles, embedded systems,
PCs, and servers.
GPUs do not rely on multilevel caches.
GPUs rely on hardware multithreading.
There are special graphics DRAM chips for GPUs that are wider and have
higher bandwidth.
GPUs have smaller main memories than conventional microprocessors.
GPUs typically have 4 to 6 GB or less, while CPUs have 32 to 256 GB.
Difference Between CPU and GPU
GPU ARCHITECTURE
The first GPU was NVIDIA's GeForce 256, released in 1999.
These GPU chips can process a minimum of 10 million polygons per second.
The GPU consists of up to 128 cores on a single chip.
Each core can handle 8 threads of instructions, so 1,024 (8 × 128) threads
can execute concurrently on a single GPU.
A GPU is a hardware device that contains multiple small hardware units called
Streaming Multiprocessors (SMs).
Multiple SMs can be built on single GPU chip.
Each SM can execute many threads concurrently.
Each SM is associated with a private L1 Data Cache.
Every Memory Controller (MC) is associated with a shared L2 cache for
faster access to the cached data. Both MC and L2 are on-chip.
Each SM has 32 CUDA cores (16 SMs × 32 = 512 CUDA cores in total).
Each SM has 16 load/store units, allowing source and destination addresses to be
calculated for 16 threads per clock.
There are four special function units (SFUs) for executing trigonometric and
logarithmic operations.
When a multiply followed by an add is performed on floating-point numbers,
the result may be rounded twice or only once.
When it is performed with a single rounding, the operation is called a fused
multiply-add (FMA) or fused multiply-accumulate (FMAC).
GPU PROGRAMMING MODEL - CUDA
GPU uses a programming model called CUDA (Compute Unified Device
Architecture)
CUDA is an extension of the C language.
It enables the programmer to write C programs to execute on GPUs.
It is used to control the device.
The programmer specifies CPU and GPU functions.
o The host code can be C++
o Device code may only be C
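A minimal CUDA sketch (ours, for illustration): the __global__ function below is the device code, and the <<<blocks, threads>>> launch syntax is CUDA's extension to C.

#include <stdio.h>
#include <cuda_runtime.h>

/* Device code: each GPU thread adds one element of the two vectors. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h_a[1024], h_b[1024], h_c[1024];
    for (int i = 0; i < n; i++) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    /* Host code: allocate GPU memory and copy the inputs over. */
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* Launch 4 blocks of 256 threads = 1024 threads, one per element. */
    vecAdd<<<4, 256>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[10] = %.1f\n", h_c[10]);   /* expect 30.0 */
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}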
GPU MEMORY SYSTEM
The figure shows the memory structures of an NVIDIA GPU.
It has a Multi-Banked Memory Structure.
GPU memory systems are designed for data throughput, using wide memory buses.
Their bandwidth is much larger than that of typical CPUs, typically 6 to 8
times higher.
The on-chip SMEM memory is local to each Streaming Multiprocessor.
The off-chip global memory is shared by the whole GPU and all thread blocks.
9. CLUSTERS
o Cluster is a set of loosely or tightly connected computers working together as a
unified computing resource that can create the illusion of being one machine.
o Computer clusters have each node set to perform the same task, controlled and
scheduled by software.
o Clusters are constructed from whole computers and independent, scalable networks
and connected through a local area network.
o Clusters are generally collections of commodity computers that are connected to
each other by I/O interconnect via standard network switches and cables.
o Each computer(node) has private memory and OS.
o Cluster software is a layer that runs on top of the local operating system on
each computer; because the nodes are independent, it is much easier to
disconnect and replace a broken computer.
o It is also easier to expand the system without bringing down the application
that runs on top of the cluster.
o It is also easier to replace a computer without bringing down the system in a
cluster than in a shared memory multiprocessor.
o It is also easy for clusters to scale down gracefully when a server fails,
thereby improving dependability.
TYPES OF CLUSTERED SYSTEMS
There are primarily two types of clustered systems
1. Asymmetric clustering system
2. Symmetric clustering system
1. Asymmetric Clustering System
In this system, one of the nodes in the clustered system is in standby mode and
all the others run the required applications.
The standby node continuously monitors the server and if it fails, the standby
node takes its place.
2. Symmetric Clustering System
In symmetric clustering system, all the nodes run applications as well as
monitor each other.
This is more efficient than asymmetric system as it uses all the hardware and
doesn't keep a node merely as a standby.
ATTRIBUTES OF CLUSTERED SYSTEMS
The clustering systems that embody some major attributes are:
1. Load Balancing Clusters
In this type of clusters, the nodes in the system share the workload to provide a
better performance.
For example: A web based cluster may assign different web queries to different
nodes so that the system performance is optimized.
Some clustered systems use a round robin mechanism to assign requests to
different nodes in the system.
2. High Availability Clusters
These clusters improve the availability of the clustered system.
They have extra nodes which are only used if some of the system components
fail.
So, high availability clusters remove single points of failure i.e. nodes whose
failure leads to the failure of the system.
These types of clusters are also known as failover clusters or HA clusters.
BENEFITS OF CLUSTERED SYSTEMS
The benefits of clustered systems are as follows:
Performance
Clustered systems result in high performance as they contain two or
more individual computer systems merged together. These work as a
parallel unit and result in much better performance for the system.
Fault Tolerance
Clustered systems are quite fault tolerant and the loss of one node does
not result in the loss of the system. They may even contain one or more
nodes in hot standby mode which allows them to take the place of failed
nodes.
Scalability
Clustered systems are quite scalable as it is easy to add a new node to
the system. There is no need to take the entire cluster down to add a new
node.
APPLICATIONS OF CLUSTERS
The search engines that hundreds of millions of us use every day depend upon
this technology.
Amazon, Facebook, Google, Microsoft have multiple datacenters each with
clusters of tens of thousands of servers.
10. WAREHOUSE SCALE COMPUTERS (WSC)
o Large clusters are called Warehouse-Scale Computers.
o A Warehouse-Scale Computer (WSC) is a cluster composed of tens of thousands
of servers.
o The WSC is a modern descendant of the 1970s supercomputer.
o WSCs play a more important societal role today than supercomputers did in the
past.
o Although they may be classified as just large clusters, their architecture and
operation are more sophisticated.
o They act as one giant computer.
o Internet services necessitated the construction of new buildings to house, power,
and cool 100,000 servers.
o The cost is on the order of $150M for the building, the electrical and cooling
infrastructure, the servers, and the networking equipment that connects and houses
50,000 to 100,000 servers.
o WSC often use a hierarchy of networks for interconnection.
Applications of WSC
A WSC can be used to provide internet services.
1. Search - Google
2. Social Networking - Facebook
3. Video Sharing - YouTube
4. Online Sales - Amazon
5. Cloud Computing Services – Rackspace
Goals / Design factors for WSC
1. Cost-performance
2. Energy efficiency
3. Dependability via redundancy
4. Network I/O
5. Interactive and batch processing workloads
6. Ample computational parallelism (finding enough parallelism is not a concern)
7. Operational costs count
i. Power consumption is a primary, not secondary, constraint when
designing system
8. Scale and its opportunities and problems
i. Customized systems become affordable, since WSCs purchase
hardware in volume
Programming Model of WSC
o WSC uses MapReduce programming model.
o MapReduce runtime environment schedules map and reduce task to WSC nodes.
o MapReduce programs work in two phases:
1. Map Phase 2. Reduce Phase.
o The input to each phase is a set of (key, value) pairs.
o In addition, two functions must be specified - a Map function and a Reduce function.
o Map first applies a programmer-supplied function to each logical input record.
o Map runs on thousands of servers to produce an intermediate result of key-value
pairs.
o Reduce collects the output of those distributed tasks and collapses them using
another programmer-defined function.
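The two phases can be illustrated with a toy, single-machine C sketch (ours; a real MapReduce runtime distributes the map and reduce tasks over thousands of servers): map labels each record with a key, the intermediate pairs are grouped by key, and reduce collapses each group.

#include <stdio.h>

#define NKEYS 2
#define N 8

/* Map: emit a (key, value) pair for one input record - here, key 0 for
   even numbers and key 1 for odd numbers, with the number as the value. */
static void map(int record, int *key, int *value) {
    *key = record % 2;
    *value = record;
}

/* Reduce: collapse all values that share a key - here, by summing them. */
static int reduce(const int *values, int count) {
    int sum = 0;
    for (int i = 0; i < count; i++)
        sum += values[i];
    return sum;
}

int main(void) {
    int input[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    int buckets[NKEYS][N];
    int counts[NKEYS] = {0, 0};

    /* Map phase (runs on thousands of servers in a real WSC). */
    for (int i = 0; i < N; i++) {
        int k, v;
        map(input[i], &k, &v);
        buckets[k][counts[k]++] = v;   /* shuffle: group pairs by key */
    }

    /* Reduce phase: one reduce task per key. */
    for (int k = 0; k < NKEYS; k++)
        printf("key %d -> %d\n", k, reduce(buckets[k], counts[k]));
    return 0;
}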
11. MESSAGE PASSING MULTIPROCESSORS (MPP)
o Message passing is an inherent element of all computer clusters.
o Message passing systems provide alternative methods for communication and
movement of data among multiprocessors.
o A message passing system typically combines local memory and processor at each
node of the interconnection network.
o Message passing is a way for communicating between multiple processors by
explicitly sending and receiving messages.
o The system has functions to send and receive messages.
o Coordination is built-in with message passing, since one processor knows when a
message is sent, and the receiving processor knows when a message arrives.
o If the sender needs confirmation that the message has arrived, the receiving
processor can then send an acknowledgment message back to the sender.
o Computers that rely on message passing for communication rather than cache
coherent shared memory are much easier for hardware designers to build.
o Message passing can be synchronous or asynchronous.
o Synchronous message passing systems require the sender and receiver to wait for
each other while transferring the message.
o In asynchronous communication the sender and receiver do not wait for each other
and can carry on their own computations while transfer of messages is being done.
o Two typical approaches of Message Passing:
1. PVM - Parallel Virtual Machine
2. MPI - Message Passing Interface
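A minimal MPI sketch in C (our illustration of the send/receive functions the text describes; it assumes an MPI installation, e.g. compile with mpicc and run with mpirun -np 2): process 0 explicitly sends a message and process 1 explicitly receives it.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which processor am I? */

    if (rank == 0) {
        value = 42;
        /* Explicit send: the sender knows when the message goes out. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking receive: waits until the message arrives. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}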
Advantages of Message Passing Model
Easier to build than scalable shared memory machines
Easy to scale (but topology is important)
Coherency and synchronization are the responsibility of the programmer, so the
system designer need not worry about them.
Disadvantage of Message Passing Model
Large overhead: copying of buffers requires large data transfers
Programming is more difficult.
Blocking nature of SEND/RECEIVE can cause increased latency and
deadlock issues.