Course Outline
Multiple simultaneous computations: (Goals of parallelism)
Throughput versus concurrency (e.g., controlling access to shared resources); Concurrency:
controlling access to shared resources
Parallelism, communication, and coordination Programming constructs for coordinating
multiple simultaneous computations.
Need for synchronization; Programming errors not found in sequential programming Data races
(simultaneous read/write or write/write of shared state);
level races (inter-leavings violating program intention, undesired non-determinism);
Lack of liveness/progress (deadlock, starvation)
Need for communication and coordination/synchronization:
Independence and partitioning; Shared Memory:
Consistency and its role in programming language guarantees for data-race-free programs.
Multicore processors: Shared vs. distributed memory
Introduction to Parallelism
Computer architects have always strived to increase the performance of their computer architectures.
High performance may come from fast dense circuitry, packaging technology, and parallelism.
Parallel processors are computer systems consisting of multiple processing units connected through
some interconnection network with the software needed to make the processing units work together.
There are two major factors used to categorize such systems:
The multiple processing units themselves
and the interconnection network that ties them together.
The processing units can communicate and interact with each other using either shared memory or
message passing methods.
The main argument for using multiprocessors is to create powerful computers by simply connecting
multiple processors.
A multiprocessor is expected to reach faster speed than the fastest single-processor system. In addition,
a multiprocessor consisting of a number of single processors is expected to be more cost-effective than
building a high-performance single processor.
Another advantage of a multiprocessor is fault tolerance. If a processor fails, the remaining processors
should be able to provide continued service, albeit with degraded performance.
Why Parallel Architecture?
Parallel computer architecture adds a new dimension in the development of computer system
by using more and more number of processors.
In principle, performance achieved by utilizing large number of processors is higher than the
performance of a single processor at a given point of time
Parallel Processing
Parallel processing can be described as a class of techniques which enables the system to
achieve simultaneous data-processing tasks to increase the computational speed of a computer
system.
A parallel processing system can carry out simultaneous data-processing to achieve faster
execution time, for instance, while an instruction is being processed in the ALU component of
the CPU, the next instruction can be read from memory.
The primary purpose of parallel processing is to enhance the computer processing capability
and increase its throughput.
A parallel processing system can be achieved by having a multiplicity of functional units that
perform identical or different operations simultaneously.
The data can be distributed among various multiple functional units.
The following diagram shows one possible way of separating the execution unit into eight
functional units operating in parallel.
The operation performed in each functional unit is indicated in each block of the diagram:
Figure 1: Computer systems consisting of multiple processing units
The adder and integer multiplier performs the arithmetic operation with integer numbers.
The floating-point operations are separated into three circuits operating in parallel.
The logic, shift, and increment operations can be performed concurrently on different data.
All units are independent of each other, so one number can be shifted while another number is
being incremented.
Parallel computers can be roughly classified according to the level at which the hardware
supports parallelism, with multi-core and multi-processor computers having multiple
processing elements within a single machine.
In some cases parallelism is transparent to the programmer, such as in bit-level or instruction-
level parallelism.
But explicitly parallel algorithms, particularly those that use concurrency, are more difficult to
write than sequential ones, because concurrency introduces several new Classes of potential
software bugs, of which race conditions are the most common.
Communication and synchronization between the different subtasks are typically some of the
greatest obstacles to getting optimal parallel program performance.
Advantages of Parallel Computing over Serial Computing are as follows:
1. It saves time and money as many resources working together will reduce the time and cut
potential costs.
2. It can be impractical to solve larger problems on Serial Computing.
3. It can take advantage of non-local resources when the local resources are finite.
4. Serial Computing ‘wastes the potential computing power, thus Parallel Computing makes
better work of hardware.
Types of Parallelism:
1. Bit-level parallelism (BLP): It is the form of parallel computing which is based on the increasing
processor’s size. It reduces the number of instructions that the system must execute in order to
perform a task on large-sized data.
Example: Consider a scenario where an 8-bit processor must compute the sum of two 16-bit
integers. It must first sum up the 8 lower-order bits, then add the 8 higher-order bits, thus requiring
two instructions to perform the operation. A 16-bit processor can perform the operation with just
one instruction.
2. Instruction-level parallelism (ILP): A processor can only address less than one Instruction for each
clock cycle phase. These instructions can be re-ordered and grouped which are later executed
concurrently without affecting the result of the program. This is called instruction-level parallelism.
3. Task Parallelism (TP): Task parallelism employs the decomposition of a task into subtasks and then
allocating each of the subtasks for execution. The processors perform execution of sub tasks
concurrently.
4. Data-level parallelism (DLP) – Instructions from a single stream operate concurrently on several
data – Limited by non-regular data manipulation patterns and by memory bandwidth
Architectural Trends
When multiple operations are executed in parallel, the number of cycles needed to execute the
program is reduced.
However, resources are needed to support each of the concurrent activities.
Resources are also needed to allocate local storage.
The best performance is achieved by an intermediate action plan that uses resources to utilize a
degree of parallelism and a degree of locality.
Generally, the history of computer architecture has been divided into four generations having
following basic technologies
1. Vacuum tubes
2. Transistors
3. Integrated circuits
4. VLSIC
Till 1985, the duration was dominated by the growth in bit-level parallelism.
4-bit microprocessors followed by 8-bit, 16-bit, and soon.
To reduce the number of cycles needed to perform a full 32-bit operation, the width of the data
path was doubled Later on, 64-bit operations were introduced.
The growth in instruction-level-parallelism dominated the mid-80s to mid-90s.
The RISC approach showed that it was simple to pipeline the steps of instruction processing so
that on an average an instruction is executed in almost every cycle.
Growth in compiler technology has made instruction pipelines more productive.
In mid-80s, microprocessor-based computers consisted of
An integer processing unit
A floating-point unit
A cache controller
SRAMs for the cache data
Tag storage
As chip capacity increased, all these components were merged into a single chip, thus, a single chip
consisted of separate hardware for integer arithmetic, floating point operations, memory operations
and branch operations, other than pipelining individual instructions, it fetches multiple instructions at a
time and sends them in parallel to different functional units whenever possible. This type of instruction
level parallelism is called super scalar execution.
General Classifications of Computer Architectures
Over the last two decades, parallel processing has drawn the attention of many research workers, and
several high-speed architectures have been proposed. To present these results in a concise manner,
different architectures or schemes are used for parallel computers. All computers may be categorized
into different groups using one of three classification methods:
1. Flynn
2. Feng
3. Handler
Depending on the parallelism it exhibits with its Instruction stream or Data stream, a sequence of
instructions (the instruction stream) manipulates a sequence of operands (the data stream).
FLYNN‘S CLASSIFICATION
Flynn's taxonomy is a specific classification of parallel computer architectures that are based
on the number of concurrent instruction (single or multiple) and data streams (single or
multiple) available in the architecture.
The four categories in Flynn's taxonomy are the following:
1. (SISD) single instruction, single data
2. (SIMD) single instruction, multiple data
3. (MISD) multiple instruction, single data
4. (MIMD) multiple instruction, multiple data
Instruction stream: is the sequence of instructions as executed by the machine.
Data Stream is a sequence of data including input, or partial or temporary result, called by the
instruction Stream.
Instructions are decoded by the control unit and then ctrl unit send the instructions to the
processing units for execution.•
Data Stream flows between the processors and memory is bi-directionally.
SISD
An SISD computing system is a uniprocessor machine which is capable of executing a single
instruction, operating on a single data stream.
In SISD, machine instructions are processed in a sequential manner and computers adopting
this model are popularly called sequential computers.
Most conventional computers have SISD architecture. All the instructions and data to be
processed have to be stored in primary memory.
The speed of the processing element in the SISD model is limited (dependent) by the rate at
which the computer can transfer information internally.
Dominant representative SISD systems are IBM PC, workstations.
SIMD
An SIMD system is a multiprocessor machine capable of executing the same instruction on all the
CPUs but operating on different data streams.
Machines based on an SIMD model are well suited to scientific computing since they
involve lots of vector and matrix operations.
So that the information can be passed to all the processing elements (PEs) organized data
elements of vectors can be divided into multiple sets (N-sets for N PE systems) and each PE
can process one dataset.
Dominant representative of SIMD systems is Cray’s vector processing machine.
MISD
An MISD computing system is a multiprocessor machine capable of executing different instructions on
different PEs but all of them operating on the same dataset.
The system performs different operations on the same data set. Machines built using the MISD model
are not useful in most of the application, a few machines are built, but none of them are available
commercially.
MIMD
An MIMD system is a multi processor machine which is capable of executing multiple instructions on
multiple datasets.
Each PE in the MIMD model has separate instruction and data streams, therefore machines built using
this model are capable to any kind of application.
Unlike SIMD and MISD machines, PEs in MIMD machines work asynchronously.
MIMD machines are broadly categorized into: shared-memory MIMD and distributed-memory
MIMD based on the way PEs are coupled to the main memory.
In the shared memory MIMD model (tightly coupled multiprocessor systems), all the PEs are
connected to a single global memory and they all have access to it. The communication between PEs
in this model takes place through the shared memory, modification of the data stored in the global
memory by one PE is visible to all other PEs.
Dominant representative of shared memory MIMD systems are Silicon Graphics machines and
Sun/IBM’s SMP (Symmetric Multi-Processing).
In Distributed memory MIMD machines (loosely coupled multiprocessor systems) all PEs have a
local memory. The communication between PEs in this model takes place through the interconnection
network (the inter process communication channel, or IPC).
The network connecting PEs can be configured to tree, mesh or in accordance with the requirement.
VECTOR ARCHITECTURES
In a traditional scalar processor, the basic data type is an n-bit word. The architecture often exposes a
register file of words, and the instruction set is composed of instructions that operate on individual
words. In a vector architecture, there is support of a vector data type, where a vector is a collection of
VL n-bit words (VL is the vector length). There may also be a vector register file, which was a key
innovation of the Cray architecture.
Previously, vector machines operated on vectors stored in main memory.
Figures 1 and 2 illustrate the difference between vector and scalar data types, and the operations that
can be performed on them.
Vector load/store instructions provide the ability to do strided and scatter/gather memory accesses,
which take data elements distributed throughout memory and pack them into sequential
vectors/streams placed in vector/stream registers. This promotes data locality.
It results in less data pollution, since only useful data is loaded from the memory system. It provides
latency tolerance because there can be many simultaneous outstanding memory accesses.
Vector instructions such as VLD and VST provide this capability.
MULTITHREADING
Multithreading is the ability of a central processing unit (CPU) (or a single core in a multi-core
processor) to provide multiple threads of execution concurrently, supported by the operating system.
This approach differs from multiprocessing. A mechanism by which the instruction streams is divided
into several smaller Streams (threads) and can be executed in parallel is called multithreading.
A multithreaded CPU is not a parallel architecture, strictly speaking; multithreading is obtained
through a single CPU, but it allows a programmer to design and develop applications as a set of
programs that can virtually execute in parallel: namely, threads.
Multithreading is solution to avoid waiting clock cycles as the missing data is fetched: making the
CPU manages more peer-threads concurrently; if a thread gets blocked, the CPU can execute
instructions of another thread, thus keeping functional units busy.
Each thread must have a private Program Counter and a set of private registers, separate from other
threads.
Hardware Multithreading
• Increasing utilization of a processor by switching to another thread when one thread is stalled
(delayed or stuck) is known as hardware multithreading.
Thread
A thread includes the program counter, the register state, and the stack. It is a lightweight
process; whereas threads commonly share a single address space, processes don't.
Thread Switch
The act of switching processor control from one thread to another within the same process. It is
much less costly than a processor switch.
Process
A process includes one or more threads, the address space, and the operating system state. Hence, a
process switch usually invokes the operating system, but not a thread switch.
Types of Multi-threading
1. Coarse-grained Multithreading
2. Fine-grained Multithreading
3. Simultaneous Multithreading
Coarse-grained Multithreading
A version of hardware multithreading that implies switching between threads only after significant
events, such as a last-level cache miss.
This change relieves the need to have thread switching be extremely fast and is much less likely to
slow down the execution of an individual thread, since instructions from other threads will only be
issued when a thread encounters a costly stall (delay)
Advantage
To have very fast thread switching.
Doesn't slow down thread.
Disadvantage
It is hard to overcome throughput losses from shorter stalls, due to pipeline start -up costs.
Since CPU issues instructions from 1 thread, when a stall occurs, the pipeline must be emptied.
New thread must fill pipeline before instructions can complete.
Due to this start-up overhead, coarse-grained multithreading is much more useful for reducing
the penalty of high-cost stalls, where pipeline refill is negligible compared to the stall time.
Fine-grained Multithreading (FGM)
FGM is a version of hardware multithreading that implies switching between threads after every
instruction, resulting in interleaved execution of multiple threads. It switches from one thread to
another at each clock cycle. This interleaving is often done in a round-robin fashion, skipping any
threads that are stalled at that clock cycle.
To make fine-grained multithreading practical, the processor must be able to switch threads on every
clock cycle.
Advantage
Vertical waste is eliminated.
Pipeline hazards can not arise.
Zero switching overhead
Ability to hide latency within a thread i.e., it can hide the throughput losses that arise from both
short and long stalls.
Instructions from other threads can be executed when one thread stalls.
High execution efficiency
Potentially less complex than alternative high performance processors.
Disadvantage
Clock cycles are wasted if a thread has little operation to execute.
Needs a lot of threads to execute.
It is expensive than coarse-grained multithreading.
It slows down the execution of the individual threads, since a thread that is ready to execute
without stalls will be delayed by instructions from other threads.
Simultaneous multithreading (SMT)
SMT is a variation on hardware multithreading that uses the resources of a multiple-issue, dynamically
scheduled pipelined processor to exploit thread level parallelism at the same time it exploits instruction
level parallelism.
The key insight that motivates SMT is that multiple-issue processors often have more functional unit
parallelism available than most single threads can effectively use. Since SMT relies on the existing
dynamic mechanisms, it does not switch resources every cycle, instead, SMT is always executing
instructions from multiple threads, to associate instruction slots and renamed registers with their proper
threads.
Advantage
It is ability to boost utilization by dynamically scheduling functional units among multiple
threads.
It increases hardware design facility.
It produces better performance and add resources to a fine grained manner
Disadvantage
It cannot improve performance if any of the shared resources are the limiting bottlenecks for the
performance.
Multiprocessing
A multiprocessing system is one which has more than two processors. The CPUs are added to the
system to increase the computing speed of the system. Each CPU has its own set of registers and main
memory. Just because CPUs are separate, it may happen that one CPU must not have anything to
process and may sit idle and the other may be overloaded with the processes. In such cases, the
processes and the resources are shared dynamically among the processors.
Multiprocessing can be classified as symmetric multiprocessing and asymmetric multiprocessing.
In symmetric multiprocessing, all processors are free to run any process in a system. In Asymmetric
multiprocessing, there is a master-slave relationship among the processors. The master processor is
responsible for allotting the process to slave processors.
If the processor has integrated memory controller then adding processor would increase the
amount of addressable memory in the system. Multiprocessing can change the memory access
model from uniform memory access (UMA) to non-uniform memory access (NUMA). The
uniform memory access amounts the same time for accessing any RAM from any Processor.
On the other hands, non-uniform memory access amounts longer time to access some part of
memory than the other parts.
Multiprocessing and Multithreading both adds performance to the system. Multiprocessing is
adding more number of or CPUs/processors to the system which increases the computing speed
of the system. Multithreading is allowing a process to create more threads which increase the
responsiveness of the system.
Thread VS Process
Multithreading is the execution of multiple threads of a single process concurrently within
the context of that process.
Now let us first discuss what is a thread? A thread of a process means a code segment of a
process, which has its own thread ID, program counter, registers and stack and can execute
independently, However, threads belonging to the same process has to share the belongings of that
process like code, data, and system resources.
Creating separate processes for each service request consumes time and exhaust system resources.
Instead of incurring this overhead, it is more efficient to create threads of a process.
To understand the multithreading concept let us take an example of a word processor. A word
processor, displays graphic, responds to keystrokes, and at the same time, it continues spelling and
grammar checking. You do not have to open different word processors to do this concurrently. It does
get happen in a single word processor with the help of multiple threads.
Now let us take into consideration the benefits of multithreading. Multithreading increases
the responsiveness as if one thread of a process is blocked or performing the lengthy operation, the
process still continues. The second benefit of multithreading is resource sharing as several threads of
a process share same code and data within the same address space.
Creating a thread is economical as it shares the code and data of the process to which they belong. So
the system does not have to allocate resources separately for each thread. Multithreading can
be increased on multiprocessing operating system. As multithreading on multiple CPUs
increases parallelism.
In conclusion, The benefits of multithreading can be gradually increased in multiprocessing
environment as multithreading on a multiprocessing system increases parallelism.
1. Creating a process can consume time and even exhaust the system resources. However
creating threads is economical as threads belonging to the same process share the belongings
of that process.
2. Multiprocessing system executes multiple processes simultaneously whereas, the
multithreading system let execute multiple threads of a process simultaneously.
3. The key difference between multiprocessing and multithreading is that multiprocessing
allows a system to have more than two CPUs added to the system whereas multithreading
lets a process generate multiple threads to increase the computing speed of a system.
Difference between multiprocessing and multithreading
1. In Multiprocessing, CPUs are added for increasing computing power, while In Multithreading,
many threads are created of a single process for increasing computing power.
2. In Multiprocessing, Many processes are executed simultaneously, While in multithreading, many
threads of a process are executed simultaneously.
3. In Multiprocessing, Process creation is a time-consuming process, while in Multithreading,
process creation is according to economical.
4. In Multiprocessing, every process owned a separate address space, while in Multithreading; a
common address space is shared by all the threads.