High Performance
Computing
LECTURE #2
1
Agenda
o What is parallel computing?
o Terminologies of parallel computing
o Performance Evaluation of parallel computing
o Challenges of parallel computing
o Parallel processing concepts:
o How is parallelism expressed in a program
o Architectural concepts related to parallelism (parallel processing)
2
What is parallel computing?
Multiple processors cooperating concurrently to solve one problem.
3
What is parallel computing?
“A parallel computer is a collection of processing elements that can
communicate and cooperate to solve large problems fast”
Almasi/Gottlieb
“communicate and cooperate”
• Nodes and interconnect architecture
• Problem partitioning (Co-ordination of events in a process)
“large problems fast”
• Programming model
• Match of model and architecture
4
What is parallel computing?
Some broad issues:
• Resource Allocation:
– How large a collection?
– How powerful are the elements?
• Data access, Communication and Synchronization
– How are data transmitted between processors?
– How do the elements cooperate and communicate?
– What are the abstractions and primitives for cooperation?
• Performance and Scalability
– How does it all translate into performance?
– How does it scale? A service is said to be scalable when ??
5
Terminologies
❑ Core a single computing unit with its own independent control
❑Multicore is a processor having several cores that can access the same memory
concurrently
❑A computation is decomposed into several parts called Tasks that can be computed
in parallel
❑ Finding enough parallelism is (one of the) critical steps for high performance
(Amdahl’s law).
6
Performance Metrics
❑ Execution time:
The time elapsed between the beginning and the end of its execution.
❑ Speedup:
The ration between serial and parallel time.
Speedup= Ts/Tp
❑ Efficiency:
Ratio of speedup to the number of processors.
Efficiency= Speedup/P
7
Performance Metrics
❑ Amdahl’s Law
Used to predict maximum speedup using multiple processors.
• Let f = fraction of work performed sequentially.
• (1 - f) = fraction of work that is parallelizable.
• P = number of processors
On 1 cpu: T1 = f + (1 – f ) = 1.
(1−𝑓 )
On P processors: Tp = f +
𝑝
• Speedup
𝑇1 1 1
= <
𝑇𝑝 𝑓+(1−𝑓)/𝑝 𝑓
Speedup limited by sequential part
9
Challenges
All parallel programs contain:
❑ Parallel sections
❑ Sequential sections
❑Sequential sections are with work is being duplicated or no useful work is being done,
(waiting for others)
1) Building efficient algorithms avoiding:
❑ Communication delay
❑ Idling
❑ Synchronization
2) Memory system challenges
10
Challenges
1) Sources of overhead in parallel programs
❑ Inter process interaction:
The time spent communicating data between processing elements is
usually the most significant source of parallel processing overhead.
❑ Idling:
Processes may become idle due to many reasons such as load
imbalance, synchronization, and presence of serial components in a
program.
❑ Excess Computation:
The fastest known sequential algorithm for a problem may be difficult
or impossible to parallelize, forcing us to use a parallel algorithm
based on a poorer but easily parallelizable sequential algorithm.
11
Challenges
❖ 2)Memory system challenges
❖ The effective performance of a program on computer relies :
1. Speed of processor (clock rates of processors increased from 40MHz (e.g MIPS R3000, 1988), to 2.0 GHz
(e.g pentium 4, 2002), to, 8.429GHz (AMD's Bulldozer based FX chips, 2012)
2. Ability of memory to feed data to processor
❖ Memory System Performance is mainly captured by two parameters, latency and bandwidth.
• Latency is the time from the issue of a memory request to the time the data is Memory
available at the processor.
Example: if memory has latency 100 ns (no caches). Assume that the processor has two Data Path
multiply-add units and is capable of executing four instructions in each cycle of 1 ns.
Then processor must wait 100 cycles before it can process the data
Processor
◦ Bandwidth is the rate at which data can be pumped to the processor by the memory system.
13
How is parallelism expressed in a program
IMPLICITLY EXPLICITLY
❑Define tasks only, rest implied; or ❑Define tasks, work decomposition,
define tasks and work decomposition data decomposition, communication,
rest implied; synchronization.
❑OpenMP is a high-level parallel ❑MPI is a library for fully explicit
programming model, which is mostly an parallelization.
implicit model.
Hidden from programmer Exposed to programmer
( implicit parallelism) ( explicit parallelism)
17
1- IMPLICITLY
❑It is a characteristic of a programming language that allows a compiler or interpreter to
automatically exploit the parallelism inherent to the computations expressed by some of the
language's constructs.
❑ A pure implicitly parallel language does not need special directives, operators or functions
to enable parallel execution.
❑ Programming languages with implicit parallelism include Axum, HPF, Id, LabVIEW, MATLAB
M-code,
❑Example: taking the sine or logarithm of a group of numbers, a language that provides
implicit parallelism might allow the programmer to write the instruction thus:
18
Advantages Disadvantages
❑ A programmer does not need to worry ❑ It reduce the control that the
about task division or process programmer has over the parallel
communication, execution of the program,
❑focusing instead in the problem that ❑resulting sometimes in less-than-optimal
his or her program is intended to solve. parallel efficiency.
❑It generally facilitates the design of ❑ Sometimes debugging is difficult.
parallel programs .
19
2- EXPLICITLY
How is parallelism expressed in a program
❑ it is the representation of concurrent computations by means of primitives in the form
of special-purpose directives or function calls.
❑ Most parallel primitives are related to process synchronization, communication or task
partitioning.
Advantages Disadvantages
❑The absolute programmer control ❑ programming with explicit parallelism
over the parallel execution. is often difficult, especially for non
computing specialists,
❑ A skilled parallel programmer takes
advantage of explicit parallelism to ❑ because of the extra work involved in
produce very efficient code. planning the task division and
synchronization of concurrent
26
processes.
Architectural concepts related to parallelism
❖ Important architectural concepts relate to parallel processing :
1- Implicit parallel platforms.
2- Explicit parallel platforms.
2
1
Parallel Processing Concepts
1- Implicit Parallelism
❖Concerning Memory- processor data path bottlenecks, microprocessor
designer invent (trend) alternate routs to cost effective performance.
❖Execution of multiple instruction in a single clock cycle.
Ex: microprocessors Itanium, Sparc ultra, MIPS and Power4
❖So, what is the mechanism used by various processors to support Execution
of multiple instruction in a single clock cycle
1.1- Pipelining and superscalar execution.
1.2- Very long instruction word processors.
2
2
Parallel Processing Concepts
Implicit Parallelism
1.1- Pipelining & super scalar execution
❖ Pentium4 operates at 2.0GHz has 20 stage pipeline
◦ Actually for increasing the speed more we have to dived more smaller and
smaller
❖Also, the speed of a single pipeline is limited by largest atomic task in
pipeline
◦ For more enhancement in speed a multiple pipelines is used
◦ During each cycle, multiple instruction are piped into processor in parallel
then these instruction are executed on multiple functional units.
2
3
Parallel Processing Concepts
Implicit Parallelism
Parallel Processing Concepts
Implicit Parallelism
Parallel Processing Concepts
Implicit Parallelism
Parallel Processing Concepts
Implicit Parallelism
❖Use the Idea of Pipelining in a
Computer
Fetch + Decode + Execution + Write
❖Pipelining overlaps various stages of
instruction execution to achieve
performance.
❖Superscalar execution : the ability of a
processor to issue multiple instructions
in the same cycle
28
Parallel Processing Concepts
Implicit Parallelism
❖Consider a processor with two pipelines & ability to simultaneously issue
two instructions (per clock cycle) (hence it is named dual issue execution)
- Consider the execution for adding 4 numbers
Three Different code fragments for adding a list of four numbers
29
Parallel Processing Concept
Implicit Parallelism
30
Parallel Processing Concepts
Implicit Parallelism
Limitation:
❖ True Data Dependency: The result of one operation is an input to the next.
❖ Resource Dependency: Two operations require the same resource.
❖Branch Dependency: in typical program traces, every 5-6th instruction is a
conditional jump! This requires very accurate branch prediction.
❖Scheduling instructions across conditional branch statements cannot be done
deterministically a-priori.
32
Parallel Processing Concepts
Implicit Parallelism
Superscalar Execution
❖ Scheduling of instructions is determined by a number of factors:
• True Data Dependency
• Resource Dependency
• Branch Dependency
• The scheduler, a piece of hardware looks at a large number of instructions
in an instruction queue and selects appropriate number of instructions to
execute concurrently based on these factors.
• The complexity of this hardware is an important constraint on superscalar
processors.
33
Parallel Processing Concepts
Superscalar Execution (cont.): Implicit Parallelism
Issue Mechanisms
❖In the simpler model, instructions can be issued only in the order in which they
are encountered. That is, if the second instruction cannot be issued because it
has a data dependency with the first, only one instruction is issued in the cycle.
This is called in-order issue.
❖In a more aggressive model, instructions can be issued out of order. In this case,
if the second instruction has data dependencies with the first, but the third
instruction does not, the first and third instructions can be co-scheduled. This is
also called dynamic issue.
❖Performance of in-order issue is generally limited.
Efficiency Considerations
❖Due to limited parallelism in typical instruction traces, dependencies, or the
inability of the scheduler to extract parallelism, the performance of superscalar
18
Parallel Processing Concepts
1.2 Very long instruction word processors VLIW Implicit Parallelism
❖The parallelism extracted by superscalar processor is often limited by the
instruction look ahead.
❖The hardware cost and complexity of the superscalar scheduler is a major
consideration in processor design.
❖ An alternative concept for exploiting instruction parallelism used in VLIW.
❖Where processors relies on the compiler to resolve dependencies & resource
availability at compile time, where :
◦ Instruction that can be executed concurrently are packed into groups given to
processor as a single long instruction word to be executed on multiple functional
units at the same time.
◦ Very sensitive to compilers ability to detect data & resource dependencies 19
2- Explicit Parallelism
❖ Elements of a Parallel Computer Hardware
*Hardware:
▪ Multiple Processors
▪ Multiple Memories
▪ Interconnection Network System Software
*System Software:
▪ Parallel Operating System
▪ Programming Constructs to Express/Orchestrate Concurrency Application Software
*Application Software:
▪ Parallel Algorithms
❖ Goal:
▪ Utilize the Hardware, System, & Application Software to either:
- Achieve Speedup.
- Solve problems requiring a large amount of memory.
36
What is Think - Different
How many people doing the work → (Degree of Parallelism)
What is needed to begin the work → (Initialization)
Who does what → (Work distribution)
Access to work part → (Data/IO access)
Whether they need info from each other to finish their own job → (Communication)
When are they all done → (Synchronization)
What needs to be done to collate the result
37