21CS46T - COMPUTER ARCHITECTURE AND ORGANIZATION
UNIT - IV
PARALLELISM
Parallel processing challenges – Flynn's classification – SISD, MIMD, SIMD, SPMD -
Subword Parallelism and Vector Architectures – Hardware multithreading – Multi-core
processors and other Shared Memory Multiprocessors – Introduction to Graphics Processing
Units (GPU), Clusters, Warehouse Scale Computers and other Message-Passing
Multiprocessors – Case Study: Intel Core i7 versus NVIDIA Tesla GPU.
1. PARALLELISM
INTRODUCTION
o Pipelining exploits the parallelism among instructions. This parallelism is called
instruction-level parallelism (ILP).
o Instruction-level parallelism (ILP) is a measure of how many of
the instructions in a computer program can be executed simultaneously.
o There are two approaches to instruction level parallelism: Hardware and Software.
o Hardware level works upon dynamic parallelism, whereas the Software level
works on static parallelism.
o Dynamic parallelism means the processor decides at run time which instructions
to execute in parallel.
o Static parallelism means the compiler decides which instructions to execute in
parallel.
o There are two methods for increasing the amount of instruction-level parallelism.
1. First approach - to increase the depth of the pipeline to overlap more
instructions.
2. Second approach - to replicate the internal components of the computer so that
it can launch multiple instructions in every pipeline stage. This technique is
called multiple issue.
o There are two major ways to implement a multiple-issue processor :
Static Multiple Issue and Dynamic Multiple Issue
Static Multiple Issue : Decisions are made statically
(at compile time, before execution).
Dynamic Multiple Issue : Decisions are made dynamically
(at execution time).
SPECULATION
o Speculation is an approach that allows the compiler or the processor to "guess"
about the properties of an instruction, so as to enable execution to begin for
other instructions that may depend on the speculated instruction.
o We might speculate on the outcome of a branch, so that instructions after the
branch could be executed earlier.
o The difficulty with speculation is that it may be wrong. So, any speculation
mechanism must include both a method to check if the guess was right and a
method to unroll or back out the effects of the instructions that were executed
speculatively.
o Speculation introduces one problem - Speculating on certain instructions may
introduce exceptions that were formerly not present.
o Speculation may be done by the hardware (or) by the compiler.
1. HARDWARE BASED SPECULATION
o In hardware speculation, the speculated instructions are executed and their results are held in buffers.
o If the speculation was correct, the executed instructions are written to the registers
or memory.
o If the speculation was incorrect, the hardware flushes the buffers and reexecutes
the correct instruction sequence.
2. COMPILER BASED SPECULATION
o In compiler-based speculation, the compiler inserts special speculation support
so that needless exceptions can be ignored while the instructions execute.
o Additional instructions are inserted to check the accuracy of the speculation.
o A recovery routine is provided for use when the speculation is incorrect.
2. INTRODUCTION TO PARALLEL PROCESSING
o Processing data concurrently is called Parallel Processing.
o There are two ways of achieving parallelism :
1. Multiple Functional Units : System may have two or more ALUs so
that they can execute two or more instructions at the same time.
2. Multiple Processors : System may have two or more processors operating
concurrently.
Fig : Processor with Multiple Functional Units
DEFINITIONS
o Job-level Parallelism (or) Process-level Parallelism (or) Task-level Parallelism
- Utilizing multiple processors for executing independent programs
simultaneously.
o Parallel Processing Program - It is referred to a single program that runs on
multiple processors simultaneously.
o Multicore Processor
An architectural design that contains multiple processors (“cores”) on a
single IC.
Also called Chip Multiprocessors (CMP).
Processors are often called cores in a multicore chip. The number of cores
is expected to double every two years.
These multiprocessors are also called Shared Memory Processors
(SMP), as they share a single physical address space (memory).
LIMITATIONS OF SINGLE CORE ARCHITECTURE
One way to increase the system performance is to increase the clock frequency or
processor speed. But this approach has following limitations:
1. Higher frequency requires more power.
2. More power consumption makes the system harder and more expensive to
cool.
3. More power consumption also affects the sizing and packaging
considerations.
Thus, instead of trying to make the processor faster to gain performance, adding more
processors on a single chip is a better approach.
PARALLEL PROCESSING CHALLENGES
o The difficulty with parallelism is not the hardware.
o It is difficult to write software that uses multiple processors and the problem gets
worse as the number of processors increases.
o We must get better performance or better energy efficiency from a parallel
processing program.
o It is difficult to write parallel processing programs that are fast, as the number
of processors increases.
o In parallel programming, the other challenges include scheduling, load balancing,
time for synchronization and overhead for communication.
3. AMDAHL’S LAW
o Amdahl's law is used in parallel processing to predict the theoretical speedup
when using multiple processors.
o It is named after Gene Amdahl, a computer architect from IBM.
o It tells that even small parts of a program must be parallelized if the program is
to make good use of many cores.
o It is a formula which gives the theoretical speedup in latency of the execution of
a task at a fixed workload that can be expected of a system whose resources are
improved.
o It is a formula used to find the maximum improvement possible by just
improving a particular part of a system.
o Amdahl's law states that "Unless virtually all of a serial program is
parallelized, the possible speedup is going to be very limited, regardless of
the number of cores available."
(P = Number of Processors)
o Let Fe = Fraction of the sequential program that can be Parallelized (or)
Enhanced (or) Improved
o Let Pe = Speedup of the enhanced fraction (in terms of times faster) (or)
the number of concurrent processors
o Then, according to Amdahl's law,

Speedup = 1 / ((1 - Fe) + Fe / Pe)

(NOTE - Instead of Pe, S can also be used)
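As a quick sanity check, the formula can be evaluated with a few lines of C (a minimal sketch of ours; the function name amdahl_speedup is illustrative, not from any library):

#include <stdio.h>

/* Amdahl's law: speedup = 1 / ((1 - fe) + fe/pe), where fe is the
   enhanced (parallelizable) fraction and pe is its speedup. */
static double amdahl_speedup(double fe, double pe) {
    return 1.0 / ((1.0 - fe) + fe / pe);
}

int main(void) {
    /* 90% of the program sped up 10 times -> overall speedup of about 5.26 */
    printf("Speedup = %.2f\n", amdahl_speedup(0.90, 10.0));
    return 0;
}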
SPEEDUP
Speedup is defined as the ratio of the execution time for the entire task without using
the enhancement and execution time for the entire task using the enhancement.
If
Pe is the performance for the entire task using the enhancement, and
Pw is the performance for the entire task without using the enhancement, then
Speedup = Pe/Pw
If
Ew is the execution time for the entire task without using the enhancement, and
Ee is the execution time for the entire task after using the enhancement, then
Speedup = Ew/Ee
ACHIEVING SPEED UP / TYPES OF SCALING
In order to achieve speedup in a multiprocessor system, the following techniques are
followed:
o Strong scaling
o Weak scaling
1. Strong Scaling - Strong scaling means measuring speed-up while keeping the
problem size fixed.
2. Weak Scaling - Speed-up achieved on a multiprocessor while increasing the size
of the problem proportionally to the increase in the number of
processors.
Prepared by Dr.M.Malathy Professor/CSE
21CS46T - COMPUTER ARCHITECTURE AND ORGANIZATION
SPEEDUP CHALLENGES
1. Increase in Problem Size :
Getting good speed-up on a multiprocessor while keeping the problem size
fixed is harder than getting good speed-up by increasing the size of the problem.
2. Balancing Load :
There will be a greater impact on speed-up if one processor's load is higher than
all the rest, as the calculation below illustrates.
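A small illustrative calculation (ours, in the spirit of the text): suppose a perfectly parallelizable job runs on 100 processors. With a perfectly balanced load, each processor does 1% of the work and the speedup is 100. If one processor is instead assigned 2% of the work, the execution time can be no shorter than 2% of the sequential time, so Speedup ≤ 1/0.02 = 50. A 1% imbalance on a single processor thus cuts the attainable speedup in half.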
SOLVED PROBLEMS - AMDAHL’S LAW
1. What is the overall speedup achieved if we make 90% of a program run 10 times
faster?
Solution : Fe = 0.9, Pe = 10
Speedup = 1 / ((1 - 0.9) + 0.9/10) = 1 / (0.1 + 0.09) = 1 / 0.19 ≈ 5.26
2. What is the overall speedup achieved if we make 80% of a program run 20% faster?
Solution : Fe = 0.8, Pe = 1.2
Speedup = 1 / ((1 - 0.8) + 0.8/1.2) = 1 / (0.2 + 0.667) = 1 / 0.867 ≈ 1.15
3. If 90% of the computation can be parallelized, what is the maximum speedup
achievable using 8 processors?
Solution : Fe = 0.9, Pe = 8
Speedup = 1 / ((1 - 0.9) + 0.9/8) = 1 / (0.1 + 0.1125) = 1 / 0.2125 ≈ 4.71
4. We are considering an enhancement to the processor of a web server. The new
CPU is 20 times faster on search queries than the old processor. The old processor is
busy with search queries 70% of the time. What is the speedup gained by integrating
the enhanced CPU?
Solution : Fe = 0.7, Pe = 20
Speedup = 1 / ((1 - 0.7) + 0.7/20) = 1 / (0.3 + 0.035) = 1 / 0.335 ≈ 2.99
5. We are considering an enhancement to the processor of a server. The new CPU is
10 times faster. The server is I/O bound, spending 60% of its time waiting for I/O.
Solution : only the 40% of time spent computing is enhanced, so Fe = 0.4, Pe = 10
Speedup = 1 / ((1 - 0.4) + 0.4/10) = 1 / (0.6 + 0.04) = 1 / 0.64 ≈ 1.56
6. What percentage of original computation can be sequential to achieve a speed-up of
90 times faster with 100 processors?
Given :
Speedup = 90, Pe = 100
To find :
Fe = ?
We know
Speedup = 1 / ((1 - Fe) + Fe/Pe)
Substituting the given values we have
90 = 1 / ((1 - Fe) + Fe/100)
Solving for Fe we get
(1 - Fe) + Fe/100 = 1/90, i.e., 1 - 0.99 Fe = 0.0111, so
Fe ≈ 0.9989 = 99.89%
The percentage of the original computation that can be sequential = 100% - Fe ≈ 0.11%
SOLVED PROBLEMS ON PROTEIN STRING MATCHING CODE
1. Consider a Protein String Matching Code that runs for 200 hours on the current
machine, and spends 20% of its time doing integer instructions.
(a) How much faster must you make the integer unit to make the code run 10 hours
faster?
(b) How much faster must you make the integer unit to make the code run 50 hours
faster?
Solution
Integer instructions take 20% of 200 hours = 40 hours; the other 160 hours are
unaffected by the integer unit.
(a) 10 hours faster : target time = 190 hours, so the integer part must drop from
40 to 190 - 160 = 30 hours. Required integer-unit speedup = 40/30 ≈ 1.33 times faster.
(b) 50 hours faster : target time = 150 hours, so the integer part would have to
take 150 - 160 = -10 hours.
Negative Speedup not Possible.
Hence target is not achievable.
2. Consider a Protein String Matching Code that takes 4 days of execution time on the
current machine, spending 20% of its time doing integer instructions and 35% of its
time doing I/O instructions.
Which is the better economic tradeoff?
a) A compiler optimization that reduces the number of integer instructions by 25%
(assume each integer instruction takes the same amount of time), or
b) A hardware optimization that makes I/O run 20% faster?
Solution
a) The integer time shrinks to 75% of 20% = 15% of the total, so
Speedup = 1 / (0.80 + 0.15) = 1 / 0.95 ≈ 1.053
b) Fe = 0.35, Pe = 1.2, so
Speedup = 1 / (0.65 + 0.35/1.2) = 1 / 0.942 ≈ 1.062
The hardware optimization (b) yields the slightly larger speedup, so it is the
better economic tradeoff.
4. FLYNN’S CLASSIFICATION / TAXONOMY
M.J. Flynn proposed a classification for the organization of a computer system
by the number of instructions and data items that are manipulated
simultaneously.
The sequence of instructions read from memory constitutes an instruction
stream.
The operations performed on the data in the processor constitute a data stream.
It is based on the multiplicity of instruction and data streams observed by the
CPU during program execution.
o Flynn proposed the following categories of computer systems:
o Single Instruction, Single Data stream (SISD)
o Single Instruction, Multiple Data stream (SIMD)
o Multiple Instruction, Single Data stream (MISD)
o Multiple Instruction, Multiple Data stream (MIMD)
1. SINGLE INSTRUCTION SINGLE DATA STREAM (SISD)
o An SISD computing system is a uniprocessor machine which is capable of
executing a single instruction, operating on a single data stream.
o In SISD, machine instructions are processed in a sequential manner and computers
adopting this model are popularly called sequential computers.
o Most conventional computers have SISD architecture. All the instructions and data
to be processed have to be stored in primary memory.
o There is some sort of control unit that provides an instruction stream to a
processing unit.
o The processing unit operates on a single data stream from a data memory unit.
o E.g. - IBM PC, IBM 704, VAX 11/780, CRAY-1, older mainframe computers
2. SINGLE INSTRUCTION MULTIPLE DATA STREAM (SIMD)
o An SIMD system is a multiprocessor machine capable of executing the same
instruction on all the CPUs but operating on different data streams.
o In SIMD, there is a single control unit, feeding a single instruction stream to
multiple Processing Units.
o Each Processing Unit may have its own dedicated memory unit.
o These machines are used in applications such as Digital signal processing, image
processing and multimedia applications (Audio and Video).
o Machines based on an SIMD model are well suited to scientific computing since
they involve lots of vector and matrix operations.
o Vector and array processors fall into this category.
o E.g. - ILLIAC-IV, MPP, CM-2, STARAN
3. MULTIPLE INSTRUCTION SINGLE DATA STREAM (MISD)
o An MISD computing system is a multiprocessor machine capable of executing
different instructions but all of them operating on the same data stream.
o This is an uncommon architecture which is generally used for fault tolerance.
o Machines built using the MISD model are not useful in most applications; a few
machines have been built, but none has been commercially successful.
o E.g. - Systolic array computers
4. MULTIPLE INSTRUCTIONS MULTIPLE DATA STREAM (MIMD)
o MIMD machines are usually referred to as multiprocessors or multicomputers.
o An MIMD system is capable of executing multiple instructions on multiple data
streams.
o Each processor includes its own control unit, and each processor can be assigned
part of a task or a separate task.
o Machines built using this model are suited to any kind of application.
o An MIMD machine may be a shared-memory multiprocessor or a distributed-memory
multicomputer.
o E.g. - Cray X-MP, IBM 370/168 MP
5. SINGLE PROGRAM, MULTIPLE DATA (SPMD)
o SPMD (single program, multiple data) is a technique employed to
achieve parallelism; it is a subcategory of MIMD.
o Tasks are split up and run simultaneously on multiple processors with different
input in order to obtain results faster.
o SPMD is the most common style of parallel programming.
SPMD vs SIMD
1. In SPMD, multiple autonomous processors simultaneously execute the same
program at independent points, rather than in the lockstep that SIMD imposes
on different data. With SPMD, tasks can be executed on general-purpose CPUs,
as the sketch after this list shows.
2. SIMD requires vector processors to manipulate data streams. Note that the
two are not mutually exclusive.
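A minimal SPMD sketch in C using POSIX threads (our illustration, not from the text): every thread runs the same program, but the thread id selects a different slice of the data.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 16

static int data[N];

/* Same code for every thread (single program); the id picks the slice
   each thread works on (multiple data). */
static void *worker(void *arg) {
    long id = (long)arg;
    int chunk = N / NTHREADS;
    for (int i = (int)id * chunk; i < ((int)id + 1) * chunk; i++)
        data[i] = i * i;            /* each thread fills its own slice */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    for (int i = 0; i < N; i++)
        printf("%d ", data[i]);
    printf("\n");
    return 0;
}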
6. VECTOR PROCESSORS
o Vector processing is a technology used in modern computers and central
processing units.
o A vector processor is a central processing unit that can work on an entire vector
(array) in one instruction.
o The instruction to the processor is in the form of one complete vector instead of
its element.
o Vector processors are used because they reduce the instruction fetch and decode
bandwidth, since far fewer instructions must be fetched.
o A vector processor is also known as an array processor.
o They exploit data parallelism in large scientific and multimedia applications.
5. HARDWARE MULTITHREADING
o An approach which allows a high degree of instruction-level parallelism
without increasing complexity or power consumption is called multithreading.
o The instruction stream is divided into several smaller streams, known as threads,
such that the threads can be executed in parallel.
o Hardware multithreading is a well-known technique to increase the utilization of
processor resources. The idea is to start executing a different thread when the
current thread is stalled.
o Hardware Multithreading allows multiple threads to share the functional units
of a single processor in an overlapping fashion.
o To permit this sharing, the processor must duplicate the independent state of each
thread.
o There are three main approaches to hardware multithreading.
o Fine-Grained Multithreading
o Coarse-Grained Multithreading
o Simultaneous Multithreading
FINE-GRAINED MULTITHREADING
o Fine-grained multithreading switches between threads on
each instruction, resulting in interleaved execution of
multiple threads.
o Also called Interleaving.
o This interleaving is often done in a round robin fashion,
skipping any threads that are stalled at that time.
o To make fine-grained multithreading practical, the processor
must be able to switch threads on every clock cycle.
o If there is a sufficient number of threads, it is likely that at
least one is active (not stalled), and the CPU can be kept
running.
o With fine-grained multithreading in a pipelined Architecture,
if:
– the pipeline has k stages,
– there are at least k threads to be executed, and
– the CPU can execute a thread switch at each clock cycle
then
-there can never be more than a single instruction per thread in the pipeline at
any instant, so there cannot be hazards due to dependencies, and the pipeline
never stalls.
o Advantage :
Potential to avoid wasted machine time due to stalls.
It can hide throughput losses that arise from both short and long stalls.
o Disadvantage :
A thread that is ready to execute a long sequence of instructions may have
to wait between its own instructions, since other threads are interleaved.
It slows down the execution of individual threads.
COARSE-GRAINED MULTITHREADING
o Coarse-grained multithreading was invented as an alternative
to fine-grained multithreading.
o Coarse-grained multithreading switches threads only on
stalls, waiting for a time-consuming operation to complete.
o Also called Blocking.
o A switch is made to another thread. When this thread in turn
causes a stall, a third thread is scheduled and so on.
o Advantage :
Switching threads doesn't need to be nearly
instantaneous.
o Disadvantages :
The processor may sit idle on shorter stalls, since
thread switching itself causes delays.
Coarse-grained multithreading therefore suffers major
throughput losses, especially from shorter stalls.
SIMULTANEOUS MULTITHREADING (SMT)
o Simultaneous multithreading (SMT) is a variation on hardware multithreading to
exploit instruction-level parallelism (ILP) and thread-level parallelism(TLP) at the
same time.
Simultaneous Multi-Threading (SMT) = ILP + TLP
o Simultaneous multithreading (SMT) is a technique for improving the overall
efficiency of superscalar CPUs with hardware multithreading.
o SMT permits multiple independent threads of execution to
better utilize the resources provided by modern processor
architectures.
o In SMT, instructions belonging to different threads are
(almost certainly) independent, and issuing them
concurrently raises CPU resource utilization.
o SMT uses the resources of a multiple-issue,
dynamically scheduled processor.
o SMT is convenient since modern multiple-issue CPUs have
a number of functional units that cannot be kept busy with
instructions from a single thread.
o By applying register renaming and dynamic scheduling,
instructions belonging to different threads can be executed concurrently.
(Register Renaming - conceptually, an unbounded number of virtual registers
is available, hence all WAW and WAR hazards are avoided and an unbounded
number of instructions can begin execution simultaneously.
Dynamic Parallelism - the processor decides at run time which instructions to
execute in parallel.)
o Multiple-issue processors often have more functional-unit parallelism available
than a single thread can effectively use.
o SMT is always executing instructions from multiple threads, leaving it up to the
hardware to associate instruction slots with their proper threads.
APPROACHES TO EXECUTE MULTIPLE THREADS
o The figure given below illustrates the possible architecture that involve
multithreading and contrasts these with approaches that do not use multithreading.
o Each horizontal row represents the issue slot or slots for a single execution cycle;
that is, the width of each row corresponds to the maximum number of instructions
that can be issued in a single clock cycle.
o The vertical dimension represents the time sequence of clock cycles.
o An empty slot represents an unused execution slot in one pipeline.
The various approaches to execute Multiple Threads are
1. Approaches with a scalar (Single Issue) processor
2. Approaches with a superscalar (Multiple Issue) processor
Approaches with a Scalar Processor
o The different approaches with a scalar (i.e., single-issue) processor:
1. Single Threaded Scalar:
o This is the simple pipeline found in traditional RISC and CISC machines,
with no multithreading.
2. Interleaved Multithreaded Scalar:
o This is the easiest multithreading approach to implement.
o By switching from one thread to another at each clock cycle, the pipeline
stages can be kept fully occupied, or close to fully occupied.
o The hardware must be capable of switching from one thread context to
another between cycles.
3. Blocked Multithreaded Scalar:
o A single thread is executed until a latency event occurs that would stop the
pipeline, at which time the processor switches to another thread.
Approaches with a Superscalar Processor
1. Superscalar:
o This is the basic superscalar approach with no multithreading.
o It is the most powerful approach to providing parallelism within a
processor.
o During some cycles, not all of the available issue slots are used. During
these cycles, less than the maximum number of instructions is issued; this is
referred to as horizontal loss.
o During other instruction cycles, no issue slots are used; these are cycles
when no instructions can be issued; this is referred to as vertical loss.
2. Interleaved Multithreading Superscalar:
o During each cycle, as many instructions as possible are issued from a single
thread.
o With this technique, potential delays due to thread switches are eliminated.
o The number of instructions issued in any given cycle is still limited by
dependencies that exist within any given thread.
3. Blocked Multithreaded Superscalar:
o Instructions from only one thread may be issued during any cycle, and
blocked multithreading is used.
6. MULTICORE PROCESSORS
o A multicore computer, also known as a chip multiprocessor, combines
two or more processors (called cores) on a single IC.
o Each core consists of all of the components of an independent processor,
such as registers, ALU, pipeline hardware, and control unit, plus L1
instruction and data caches.
o In addition to the multiple cores, contemporary multicore chips also include
L2 cache and L3 cache.
Hardware Performance Issues :
o Increase in Parallelism
o Increase in Complexity
o Increase in Power Consumption
Software Performance Issues :
o Multi-threaded applications
o Multi-process applications
o Multi-instance applications
MULTICORE ORGANIZATION
The main variables in a multicore organization are as follows:
o The number of core processors on the chip
o The number of levels of cache memory
o The amount of cache memory that is shared
Each core has its own L1 and L2 cache.
L1 and L2 are the fastest memories that a CPU can access.
L1 is always dedicated per core whereas L2 can be shared.
If the processor can find the instruction sets or data for its next operation in the
L1 and L2 cache, then it does not need to access the slower L3 cache.
The four general organizations for multicore systems are :
Dedicated L1 Cache
Dedicated L2 Cache
Shared L2 Cache
Shared L3 Cache
DEDICATED L1 CACHE
o In this organization, the only on-chip cache is L1 cache,
with each core having its own dedicated L1 cache.
o The L1 cache is divided into instruction and data
caches.
o L1 (Level 1) cache is the fastest memory that is present
in a computer system.
o L1 cache has the data the CPU is most likely to need
while completing a certain task.
o Its size typically varies between 256KB and 1MB.
o Example : ARM11 MPCore
DEDICATED L2 CACHE
o In this organization, there is no on-chip cache sharing.
o L2 cache is slower than L1 cache, but bigger in size.
o Its size typically varies between 256KB and 8MB.
o A potential advantage to having only dedicated L2
caches on the chip is that each core enjoys more rapid
access to its private L2 cache.
o L2 cache holds data that is likely to be accessed by the
CPU next.
o Example : AMD Opteron.
SHARED L2 CACHE
o A similar allocation of chip space to memory, but with
the use of a shared L2 cache.
o As the amount of cache memory available on the chip
continues to grow, performance considerations dictate
splitting off a separate, shared L3 cache, with dedicated
L1 and L2 caches for each core processor.
o The use of a shared L2 cache confines the cache
coherency problem to the L1 cache level, which may
provide some additional performance advantage.
o Example : Intel Core Duo.
SHARED L3 CACHE
o L3 (Level 3) cache is the largest cache memory unit,
and also the slowest one.
o Its size typically varies between 4MB and 50MB.
o As both the amount of memory available and the
number of cores grow, the use of a shared L3 cache
combined with either a shared L2 cache or dedicated
per core L2 caches seems likely to provide better
performance than simply a massive shared L2 cache.
o Example : AMD K10.
7. SHARED MEMORY MULTIPROCESSORS
o A shared memory multiprocessor is a parallel
processor with a single address space across all
processors.
o All processes on the various CPUs share a single logical address space, which is
mapped onto a physical memory that can be distributed among the processors.
o Processors communicate through shared variables in memory.
o Each process can read and write a data item simply using load and store operations.
o These systems can still run independent jobs in their own virtual address spaces,
even if they all share a physical address space.
o This architectural model is simple and easy to program.
o It can be applied to a wide variety of problems that can be modeled as a set of
tasks to be executed in parallel.
o The basic issue in shared memory multiprocessor systems is memory itself: the
larger the number of processors involved, the more difficult it is to use
memory efficiently.
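A minimal sketch of this style in C with POSIX threads (our illustration): all threads live in one address space, read and write the same variable with ordinary loads and stores, and use a lock for the synchronization mentioned above.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;                 /* shared variable in the single address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);       /* synchronize access to shared data */
        counter++;                       /* an ordinary load + store on shared memory */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);  /* 400000: every thread saw the same memory */
    return 0;
}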
TYPES OF SHARED MEMORY MULTIPROCESSORS
There are three types of Shared memory multiprocessors.
They are
o Uniform Memory Access multiprocessors (UMA)
o Non-Uniform Memory Access multiprocessors (NUMA)
o Cache Only Memory Access (COMA)
UNIFORM MEMORY ACCESS MULTIPROCESSOR (UMA)
o A multiprocessor in which accesses to main memory take about the same amount
of time, no matter which processor requests the access and no matter which
word is asked for.
o These systems are also called Symmetric
Shared-memory Multiprocessors.
NON-UNIFORM ACCESS MULTIPROCESSOR (NUMA)
o A type of single address space
multiprocessor in which some memory
accesses are much faster than others
depending on which processor asks for
which word.
o These systems are also called Distributed
Shared Memory architectures.
CACHE ONLY MEMORY ACCESS (COMA)
o Data have no specific "permanent" location.
o The entire physical address space is considered a huge, single cache.
o Data can be read into the local caches and/or modified and then updated at
their "permanent" location.
o Data can migrate and/or be replicated in the various memory banks of the
central main memory.
TYPES OF SHARED MEMORY ARCHITECTURES
There are two types of Shared memory Architectures. They are
1. Symmetric Shared-Memory Architecture
2. Distributed Shared Memory Architecture
SYMMETRIC SHARED-MEMORY ARCHITECTURE
o The Symmetric Shared Memory Architecture consists of several processors with a
single physical memory shared by all processors through a shared bus.
o A multicore processor has multiple CPUs or cores on a single chip.
o The cores have private level-1 caches, while other caches may or may not be
shared between the cores.
DISTRIBUTED SHARED MEMORY ARCHITECTURE
o A distributed-memory system, often called a multicomputer, consists of multiple
independent processing nodes with local memory modules which are connected by
a general interconnection network.
o Here, physically separated memories can be addressed as one logically shared
address space.
o Here, the term "shared" does not mean that there is a single centralized memory,
but that the address space is "shared".
o Each node of a cluster has access to shared memory in addition to each node's non-
shared private memory.
o Distributed-memory systems are also called clusters or hybrid systems.
8. GRAPHICS PROCESSING UNIT (GPU)
GPU is also called a Visual Processing Unit.
A GPU is a graphics coprocessor or accelerator mounted on a computer's
graphics card or video card.
Traditional CPUs are structured with only a few cores whereas a modern GPU
chip is built with hundreds of processing cores.
GPUs are designed to handle large number of floating-point operations in
parallel.
GPUs are widely used in mobile phones, game consoles, embedded systems,
PCs, and servers.
GPUs do not rely on multilevel caches.
GPUs rely on hardware multithreading.
There are special graphics DRAM chips for GPUs that are wider and have
higher bandwidth.
GPUs have smaller main memories than conventional microprocessors.
GPUs typically have 4 to 6 GB or less, while CPUs have 32 to 256 GB.
Difference Between CPU and GPU
GPU ARCHITECTURE
The first GPU was NVIDIA's GeForce 256, released in 1999.
These GPU chips can process a minimum of 10 million polygons per second.
The GPU consists of up to 128 cores on a single chip.
Each core can handle 8 threads of instructions, so 1,024 (8 × 128) threads
can execute concurrently on a single GPU.
A GPU is a hardware device that contains multiple small hardware units called
Streaming Multiprocessors (SMs).
Multiple SMs can be built on single GPU chip.
Each SM can execute many threads concurrently.
Each SM is associated with a private L1 Data Cache.
Every Memory Controller (MC) is associated with a shared L2 cache for
faster access to the cached data. Both MC and L2 are on-chip.
Each SM has 32 CUDA cores (16 SMs × 32 = 512 CUDA cores in total).
Each SM has 16 load/store units, allowing source and destination addresses to be
calculated for 16 threads per clock.
There are four special function units (SFUs) for executing trigonometric and
logarithmic operations.
When a multiply followed by an add is performed on floating-point numbers,
the result may be rounded twice or only once.
When it is performed with a single rounding, the operation is called a fused
multiply-add (FMA) or fused multiply-accumulate (FMAC).
GPU PROGRAMMING MODEL - CUDA
GPU uses a programming model called CUDA (Compute Unified Device
Architecture)
CUDA is an extension of the C language.
It enables the programmer to write C programs to execute on GPUs.
It is used to control the device.
The programmer specifies CPU and GPU functions.
o The host code can be C++
o Device code may only be C
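A minimal CUDA sketch (ours, for illustration): the __global__ function below is the device code, and the <<<blocks, threads>>> launch syntax is CUDA's extension to C.

#include <stdio.h>
#include <cuda_runtime.h>

/* Device code: each GPU thread adds one element of the two vectors. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h_a[1024], h_b[1024], h_c[1024];
    for (int i = 0; i < n; i++) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    /* Host code: allocate GPU memory and copy the inputs over. */
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* Launch 4 blocks of 256 threads = 1024 threads, one per element. */
    vecAdd<<<4, 256>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[10] = %.1f\n", h_c[10]);   /* expect 30.0 */
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}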
GPU MEMORY SYSTEM
The figure shows the memory structures of an NVIDIA GPU.
It has a Multi-Banked Memory Structure.
GPU memory systems are designed for data throughput, using wide memory buses.
Their bandwidth is much larger than that of typical CPUs, typically 6 to 8
times higher.
The on-chip SMEM memory is local to each Streaming Multiprocessor.
The off-chip global memory is shared by the whole GPU and all thread blocks.
9. CLUSTERS
o Cluster is a set of loosely or tightly connected computers working together as a
unified computing resource that can create the illusion of being one machine.
o Computer clusters have each node set to perform the same task, controlled and
scheduled by software.
o Clusters are constructed from whole computers and independent, scalable networks
and connected through a local area network.
o Clusters are generally collections of commodity computers that are connected to
each other by I/O interconnect via standard network switches and cables.
o Each computer(node) has private memory and OS.
o Cluster software is a layer that runs on top of the local operating system on
each computer; because the nodes are independent, it is much easier to
disconnect and replace a broken computer.
o It is also easier to expand the system without bringing down the application
that runs on top of the cluster.
o It is also easier to replace a computer without bringing down the system in a
cluster than in a shared memory multiprocessor.
o It is also easy for clusters to scale down gracefully when a server fails,
thereby improving dependability.
TYPES OF CLUSTERED SYSTEMS
There are primarily two types of clustered systems
1. Asymmetric clustering system
2. Symmetric clustering system
1. Asymmetric Clustering System
In this system, one of the nodes in the clustered system is in standby mode and
all the others run the required applications.
The standby node continuously monitors the server and if it fails, the standby
node takes its place.
2. Symmetric Clustering System
In symmetric clustering system, all the nodes run applications as well as
monitor each other.
This is more efficient than asymmetric system as it uses all the hardware and
doesn't keep a node merely as a standby.
ATTRIBUTES OF CLUSTERED SYSTEMS
The clustering systems that embody some major attributes are:
1. Load Balancing Clusters
In this type of clusters, the nodes in the system share the workload to provide a
better performance.
For example: A web based cluster may assign different web queries to different
nodes so that the system performance is optimized.
Some clustered systems use a round robin mechanism to assign requests to
different nodes in the system.
2. High Availability Clusters
These clusters improve the availability of the clustered system.
They have extra nodes which are only used if some of the system components
fail.
So, high availability clusters remove single points of failure i.e. nodes whose
failure leads to the failure of the system.
These types of clusters are also known as failover clusters or HA clusters.
BENEFITS OF CLUSTERED SYSTEMS
The benefits of clustered systems are as follows:
Performance
Clustered systems result in high performance as they contain two or
more individual computer systems merged together. These work as a
parallel unit and result in much better performance for the system.
Fault Tolerance
Clustered systems are quite fault tolerant and the loss of one node does
not result in the loss of the system. They may even contain one or more
nodes in hot standby mode which allows them to take the place of failed
nodes.
Scalability
Clustered systems are quite scalable as it is easy to add a new node to
the system. There is no need to take the entire cluster down to add a new
node.
APPLICATIONS OF CLUSTERS
The search engines that hundreds of millions of us use every day depend upon
this technology.
Amazon, Facebook, Google, Microsoft have multiple datacenters each with
clusters of tens of thousands of servers.
10. WAREHOUSE SCALE COMPUTERS (WSC)
o Large clusters are called Warehouse-Scale Computers.
o A Warehouse-Scale Computer (WSC) is a cluster composed of tens of thousands
of servers.
o The WSC is a modern descendant of the 1970s supercomputer.
o WSCs play a more important societal role today than supercomputers did in the
past.
o Although they may be classified as just large clusters, their architecture and
operation are more sophisticated.
o They act as one giant computer.
o Internet services necessitated the construction of new buildings to house, power,
and cool 100,000 servers.
o The cost is on the order of $150M for the building, the electrical and cooling
infrastructure, the servers, and the networking equipment that connects and houses
50,000 to 100,000 servers.
o WSC often use a hierarchy of networks for interconnection.
Applications of WSC
A WSC can be used to provide internet services.
1. Search - Google
2. Social Networking - Facebook
3. Video Sharing - YouTube
4. Online Sales - Amazon
5. Cloud Computing Services – Rackspace
Goals / Design factors for WSC
1. Cost-performance
2. Energy efficiency
3. Dependability via redundancy
4. Network I/O
5. Interactive and batch processing workloads
6. Ample computational parallelism (finding enough parallelism is not a concern)
7. Operational costs count
i. Power consumption is a primary, not secondary, constraint when
designing system
8. Scale and its opportunities and problems
i. Customized systems become affordable, since WSCs purchase
hardware in volume
Programming Model of WSC
o WSC uses MapReduce programming model.
o MapReduce runtime environment schedules map and reduce task to WSC nodes.
o MapReduce programs work in two phases:
1. Map Phase 2. Reduce Phase.
o The input to each phase is a set of (key, value) pairs.
o In addition, two functions must be specified - a Map function and a Reduce function.
o Map first applies a programmer-supplied function to each logical input record.
o Map runs on thousands of servers to produce an intermediate result of key-value
pairs.
o Reduce collects the output of those distributed tasks and collapses them using
another programmer-defined function.
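The two phases can be illustrated with a toy, single-machine C sketch (ours; a real MapReduce runtime distributes the map and reduce tasks over thousands of servers): map labels each record with a key, the intermediate pairs are grouped by key, and reduce collapses each group.

#include <stdio.h>

#define NKEYS 2
#define N 8

/* Map: emit a (key, value) pair for one input record - here, key 0 for
   even numbers and key 1 for odd numbers, with the number as the value. */
static void map(int record, int *key, int *value) {
    *key = record % 2;
    *value = record;
}

/* Reduce: collapse all values that share a key - here, by summing them. */
static int reduce(const int *values, int count) {
    int sum = 0;
    for (int i = 0; i < count; i++)
        sum += values[i];
    return sum;
}

int main(void) {
    int input[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    int buckets[NKEYS][N];
    int counts[NKEYS] = {0, 0};

    /* Map phase (runs on thousands of servers in a real WSC). */
    for (int i = 0; i < N; i++) {
        int k, v;
        map(input[i], &k, &v);
        buckets[k][counts[k]++] = v;   /* shuffle: group pairs by key */
    }

    /* Reduce phase: one reduce task per key. */
    for (int k = 0; k < NKEYS; k++)
        printf("key %d -> %d\n", k, reduce(buckets[k], counts[k]));
    return 0;
}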
11. MESSAGE PASSING MULTIPROCESSORS (MPP)
o Message passing is an inherent element of all computer clusters.
o Message passing systems provide alternative methods for communication and
movement of data among multiprocessors.
o A message passing system typically combines local memory and processor at each
node of the interconnection network.
o Message passing is a way for communicating between multiple processors by
explicitly sending and receiving messages.
o The system has functions to send and receive messages.
o Coordination is built-in with message passing, since one processor knows when a
message is sent, and the receiving processor knows when a message arrives.
o If the sender needs confirmation that the message has arrived, the receiving
processor can then send an acknowledgment message back to the sender.
o Computers that rely on message passing for communication rather than cache
coherent shared memory are much easier for hardware designers to build.
o Message passing can be synchronous or asynchronous.
o Synchronous message passing systems require the sender and receiver to wait for
each other while transferring the message.
o In asynchronous communication the sender and receiver do not wait for each other
and can carry on their own computations while transfer of messages is being done.
o Two typical approaches of Message Passing:
1. PVM - Parallel Virtual Machine
2. MPI - Message Passing Interface
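A minimal MPI sketch in C (our illustration of the send/receive functions the text describes; it assumes an MPI installation, e.g. compile with mpicc and run with mpirun -np 2): process 0 explicitly sends a message and process 1 explicitly receives it.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which processor am I? */

    if (rank == 0) {
        value = 42;
        /* Explicit send: the sender knows when the message goes out. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking receive: waits until the message arrives. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}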
Advantages of Message Passing Model
Easier to build than scalable shared memory machines
Easy to scale (but topology is important)
Coherency and synchronization are the responsibility of the programmer, so the
system designer need not worry about them.
Disadvantage of Message Passing Model
Large overhead: copying of buffers requires large data transfers
Programming is more difficult.
Blocking nature of SEND/RECEIVE can cause increased latency and
deadlock issues.