Introduction To Parallel Programming

Uploaded by

lafemmedargent95
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views29 pages

Introduction To Parallel Programming

Uploaded by

lafemmedargent95
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 29

Salah El Falou

salah.elfalou@gmail.com

References

This course mainly draws on, reuses, and adapts the following materials:


• Introduction to Parallel Computing by Cédric Bastoul, University of Strasbourg
• Introduction to Parallel Computing by Doug James, Texas Advanced Computing Center, 2013. Available under a Creative Commons Attribution Non-Commercial 3.0 Unported License
• Parallel Computing course by Vincent Loechner, University of Strasbourg
• Introduction to Parallel Computing by Blaise Barney, Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/parallel_comp


Parallel Computing

Definition [Wikipedia]
Parallel computing is a type of computation in which several calculations are carried out simultaneously.

Objectives:
• Complete computations faster
• Allocate more computing resources to a given task
• Solve larger problems
• Overcome single-machine limits (speed, memory)
• Address several problems simultaneously
• Enable concurrency
• Exploit hardware resources
• Use the parallel features of modern architectures

Why Do We Need Performance?

• Simulation
  - Natural phenomena: meteorology, seismology, etc.
  - Economics: market trends, fast trading, etc.
  - Life sciences: biology, medicine, etc.
  - Industrial applications: durability, aerodynamics, etc.
  - Military applications: ballistics, nuclear weapons, etc.
• Human-Computer Interaction
  - Voice recognition, authentication, sensors
  - Artificial intelligence
  - Responsiveness
• Data Processing
  - Signal, photo, video, etc.
  - Big data
• For everything ;-)

Computational Cost Unit: The flop

Definition
The computational cost of a problem is the number of floating-point operations (flop) necessary to solve it.

• Count them all!
• Some useful patterns:

for (i = 0; i < N; i++)
    one_flop();
→ $\sum_{i=1}^{N} 1 = N$ flop

for (i = 0; i < N; i++)
    for (j = i; j < N; j++)
        one_flop();
→ $\sum_{i=1}^{N} i \approx N^2/2$ flop

for (i = 0; i < N; i++)
    for (j = i; j < N; j++)
        for (k = i; k < N; k++)
            one_flop();
→ $\sum_{i=1}^{N} i^2 \approx N^3/3$ flop
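The patterns are easy to check empirically; a minimal sketch counting the iterations of the triangular double loop against its closed form:

#include <stdio.h>

int main(void)
{
    long N = 1000, count = 0;
    /* Same shape as the second pattern: each iteration stands for one flop */
    for (long i = 0; i < N; i++)
        for (long j = i; j < N; j++)
            count++;
    printf("%ld flop ~ N^2/2 = %ld\n", count, N * N / 2);  /* 500500 ~ 500000 */
    return 0;
}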



Performance Unit: The flop/s

Definition [Wikipedia]
Floating-point operations per second (flop/s) is a measure of computer performance.

• LINPACK is the de facto standard for comparing computing power (flop/s solving a system of linear equations), see [top500]

Year   Performance            Computer
1947   1 kflop/s    (10^3)    ENIAC
1984   50 kflop/s   (10^4)    Intel 8087 (math coprocessor)
       800 Mflop/s  (10^9)    Cray X-MP/48
1997   23 Mflop/s   (10^7)    Intel Pentium MMX (200 MHz)
       1 Tflop/s    (10^12)   Intel ASCI Red
2007   1.5 Gflop/s  (10^9)    Intel Core 2 Duo (2.4 GHz)
       500 Tflop/s  (10^14)   IBM BlueGene/L
2017   10 Tflop/s   (10^13)   Intel Core i7 + GPU
       93 Pflop/s   (10^17)   Sunway TaihuLight
2019   148 Pflop/s  (10^17)   IBM Summit


Performance Example

Matrix Multiplication
• C = A × B with matrix size 10000 × 10000
• $10^4 \times 10^4$ computations of $c_{i,j} = \sum_{k} a_{i,k} \times b_{k,j}$, each costing $2 \times 10^4$ flop
• Computational cost = $2 \times 10^4 \times 10^4 \times 10^4 = 2 \times 10^{12}$ flop

Execution time:
• 1947 – ENIAC: 63 years
• 1997 – PC: 24 hours
• 1997 – ASCI Red: 2 s
• 2019 – PC: 0.2 s → 5 per second
• 2019 – Summit: < 0.02 ms → 60,000 per second
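The cost is easy to read off a naive implementation; a minimal sketch (N kept small here, 10000 in the example above), where the innermost statement performs one multiplication and one addition:

#include <stdio.h>

#define N 4   /* 10000 in the example above */

double a[N][N], b[N][N], c[N][N];

int main(void)
{
    /* c = a * b: the innermost statement executes N^3 times and
       performs 2 flop (one multiply, one add): 2 * N^3 flop in total */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
    printf("c[0][0] = %g\n", c[0][0]);
    return 0;
}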


How Do We Get Performance?

• Understand the problem
• Choose the best data structures and algorithms
• Decompose the problem into (independent) tasks
• Analyze data dependences
• Understand the target architecture
• Choose the best tools (languages, libraries, compilers)
• Exploit available resources
• Optimize
• Parallelization is key

This course is an introduction to all these aspects.


Review of a Modern Architecture

My computer (the annotated diagram is summarized below):

• Samsung 830 SSD: 500 GB, 256 MB cache
• 16 GB DRAM
• Intel Core i7 3720QM (2.7 billion transistors)
  - 4 cores
  - AVX 256-bit vectors
  - ILP: 4-issue, 14-stage pipeline
  - Registers, 32 KB L1 I-cache, 32 KB L1 D-cache, and 256 KB L2 cache per core; 6 MB shared L3 cache
• nVidia GT 650M GPU, 1 GB VRAM
  - 384 cores
  - 64K registers (32 bits)
  - 48 KB/core shared memory
  - 8 KB constant cache memory, 8 KB texture cache memory
  - Local, texture, global, and constant memory
• Altogether: 17 types of memory, 5 types of parallelism, at least 4 programming models + APIs


Cluster of Modern Architectures

[Photo © Carlos Jones/ORNL, Wikimedia]

IBM Summit Supercomputer
• World's most powerful computer as of November 2019
• 4,608 nodes, 2,414,592 cores, 148 Pflop/s
• Dual-rail Mellanox EDR InfiniBand interconnect

Types of Parallelism



Bit-Level Parallelism

Definition [Wikipedia]
Bit-level parallelism performs basic operations on processor words in parallel.

• Increase processor word size (e.g., 32 bits to 64 bits)
• Reduce the number of instructions needed to process wide data
• Exploit elementary circuits in parallel
• Use additional logic for more performance
• Example: addition
  - A right-to-left sequential process because of the carry!
  - How to achieve a faster addition with several adders?
  - Carry-select adder: in parallel, one adder computes the right part, one computes the left part with carry-in, and one computes the left part without carry-in; the correct left part is then trivially selected (see the sketch below)
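A minimal software sketch of the idea (real carry-select adders are hardware circuits; here the two candidate left halves are simply computed before the low carry is known):

#include <stdint.h>
#include <stdio.h>

/* 16-bit carry-select addition built from two 8-bit halves */
uint16_t carry_select_add16(uint16_t a, uint16_t b)
{
    uint16_t lo  = (uint16_t)((a & 0xFF) + (b & 0xFF)); /* right part; carry-out in bit 8 */
    uint16_t hi0 = (uint16_t)((a >> 8) + (b >> 8));     /* left part, carry-in = 0 */
    uint16_t hi1 = (uint16_t)(hi0 + 1);                 /* left part, carry-in = 1 */
    uint16_t hi  = (lo >> 8) ? hi1 : hi0;               /* trivial selection step */
    return (uint16_t)((hi << 8) | (lo & 0xFF));
}

int main(void)
{
    printf("%u\n", carry_select_add16(1000, 2000));     /* prints 3000 */
    return 0;
}

In hardware, the three adders work simultaneously, so the latency of a wide addition drops to roughly that of a half-width addition plus a selection multiplexer.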

Instruction-Level Parallelism

Definition [Wikipedia]
Instruction-level parallelism performs multiple instructions from the same instruction stream concurrently.

• Exploit the multiple functional units of the processor
• Managed by hardware (superscalar) or by the compiler (VLIW)
• Limited by control or data dependences (speculation helps)
• Example: instruction pipeline
  - Decompose the instruction cycle into n stages to allow the simultaneous processing of up to n instructions
  - Start an instruction cycle before the previous one finishes

X1 = A1 + B1
X2 = A2 + B2
X3++
X4 = X1 + X2

The first three instructions are independent and can overlap in the pipeline; X4 depends on X1 and X2 and must wait for their results.

Data-Level Parallelism

Definition [Wikipedia]
Data-level parallelism performs the same processing on multiple data operands concurrently.

• Partition the data across different computation units
• Applies best to regular data structures like arrays
• Limited by the memory bandwidth
• Execute the same instructions synchronously
• Example: graphics processing
  - 4K display at 60 FPS: 500+ million pixels per second!

[5 × 5 pixel grid illustrating the pattern computed below]

/* Ideal case: compute each pixel independently */
if (row == 1 && is_even(column))
    pixel = BLACK;
else if (row == 3 && is_odd(column))
    pixel = BLACK;
else if (row > 3 && is_on_border(row, column))
    pixel = BLACK;
else
    pixel = WHITE;
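Wrapped in a loop over the whole grid, every iteration is independent of the others; a self-contained sketch (the helpers is_even, is_odd, and in particular the definition of is_on_border are assumptions about the surrounding slide code):

#include <stdio.h>

#define ROWS 5
#define COLS 5
#define BLACK '#'
#define WHITE '.'

static int is_even(int n) { return n % 2 == 0; }
static int is_odd(int n)  { return n % 2 != 0; }
static int is_on_border(int row, int column)   /* assumed definition */
{
    return row == ROWS || column == 1 || column == COLS;
}

int main(void)
{
    char pixel[ROWS + 1][COLS + 1];
    /* Every iteration is independent: ideal for data-level
       parallelism (each pixel could go to its own unit) */
    for (int row = 1; row <= ROWS; row++)
        for (int column = 1; column <= COLS; column++) {
            if (row == 1 && is_even(column))
                pixel[row][column] = BLACK;
            else if (row == 3 && is_odd(column))
                pixel[row][column] = BLACK;
            else if (row > 3 && is_on_border(row, column))
                pixel[row][column] = BLACK;
            else
                pixel[row][column] = WHITE;
        }
    for (int row = 1; row <= ROWS; row++) {
        for (int column = 1; column <= COLS; column++)
            putchar(pixel[row][column]);
        putchar('\n');
    }
    return 0;
}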


Task/Thread-Level Parallelism

Definition [Wikipedia]
Task/thread-level parallelism performs multiple instruction sequences from the same application concurrently.

• Partition the application into tasks to be run simultaneously
• Identify tasks manually (user) or automatically (compiler)
• Limited by communication and/or synchronization costs
• Example: compute the sum of n numbers (reduction)
  - How can computing units share work and communicate?
  - Activate n/2 workers, each with 2 numbers
  - Each worker computes its intermediate result
  - Half of the workers communicate their result and deactivate
  - Iterate the last two steps until only one worker is active

          Worker 0   Worker 1   Worker 2   Worker 3
Input:     2   3      4   6      5   8      9   5
Step 1:    2+3=5      4+6=10     5+8=13     9+5=14
Step 2:    5+10=15               13+14=27
Step 3:    15+27=42
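A minimal sequential sketch of this pairwise reduction (in a parallel version, each addition within a step would run on a different worker):

#include <stdio.h>

/* Tree reduction: one pass per step, pairs combined at doubling strides */
int tree_sum(int *x, int n)
{
    for (int stride = 1; stride < n; stride *= 2)   /* one stride per step */
        for (int i = 0; i + stride < n; i += 2 * stride)
            x[i] += x[i + stride];                  /* active workers combine pairs */
    return x[0];
}

int main(void)
{
    int data[8] = {2, 3, 4, 6, 5, 8, 9, 5};
    printf("%d\n", tree_sum(data, 8));              /* prints 42 */
    return 0;
}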


Accelerator-Level Parallelism

Definition [Wikipedia]
Accelerator-level parallelism performs application-specific processing concurrently on dedicated hardware.

• Exploit application-specific properties
• Implement specialized functional components
• Link components using specialized interconnects
• Example: sort strings in alphabetical order
  - How can more computation units be more efficient?
  - Specialized comparator units: two strings in; the lower string comes out on the left, the higher on the right
  - Specialized interconnection: the whole accelerator has n inputs and n outputs, interconnected so that the output is sorted (a sorting network; see the sketch below)
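A minimal software simulation of such a network: an odd-even transposition network in which all comparators of a given phase are independent and would operate in parallel in hardware:

#include <stdio.h>
#include <string.h>

/* Comparator unit: lower string to the left, higher to the right */
static void compare_exchange(const char **left, const char **right)
{
    if (strcmp(*left, *right) > 0) {
        const char *tmp = *left;
        *left = *right;
        *right = tmp;
    }
}

int main(void)
{
    const char *s[4] = {"pear", "apple", "plum", "fig"};
    int n = 4;
    for (int phase = 0; phase < n; phase++)        /* n phases suffice */
        for (int i = phase % 2; i + 1 < n; i += 2) /* independent comparators */
            compare_exchange(&s[i], &s[i + 1]);
    for (int i = 0; i < n; i++)
        printf("%s\n", s[i]);                      /* apple fig pear plum */
    return 0;
}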


Flynn’s Taxonomy


Flynn’s Taxonomy

Definition [Wikipedia]
Flynn’s taxonomy is a classification of parallel and sequential systems and programs proposed by Michael J. Flynn in 1966. It distinguishes single or multiple instruction streams and single or multiple data streams.

Classification:

                         Single instruction stream    Multiple instruction streams
Single data stream       SISD                         MISD
Multiple data streams    SIMD                         MIMD

• SISD: Single Instruction, Single Data
• SIMD: Single Instruction, Multiple Data
• MISD: Multiple Instruction, Single Data
• MIMD: Multiple Instruction, Multiple Data (further split into the shared memory and distributed memory models)


Single Instruction, Single Data (SISD)

Definition [Wikipedia]
SISD is a computer architecture class in which a single processing unit executes a single instruction stream to operate on a single data stream.

• Standard uniprocessor
• Sequential processing
• Von Neumann architecture
• E.g., ENIAC, Intel 8088

[Diagram: instruction memory → PU → data memory, with a single instruction stream and a single data stream]


Single Instruction, Multiple Data (SIMD)

Definition [Wikipedia]
SIMD is a computer architecture class in which multiple processing units execute the same instruction stream on multiple data streams simultaneously.

• Appropriate for regular problems such as scientific computing or image processing
• Benefits from loading mechanisms that fetch several data items at once
• E.g., vector instructions of modern processors, GPUs

[Diagram: a single instruction stream driving several PUs, each with its own data stream]
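A minimal illustration, assuming the GCC/Clang vector extensions: a single vector addition processes four floats in one go, one instruction stream acting on four data lanes:

#include <stdio.h>

typedef float v4f __attribute__((vector_size(16)));   /* 4 x 32-bit floats */

int main(void)
{
    v4f a = {1.0f, 2.0f, 3.0f, 4.0f};
    v4f b = {10.0f, 20.0f, 30.0f, 40.0f};
    v4f c = a + b;                 /* one instruction, four additions */
    for (int i = 0; i < 4; i++)
        printf("%g ", c[i]);       /* prints: 11 22 33 44 */
    printf("\n");
    return 0;
}

On a vector-capable processor (see the AVX units of the Core i7 above), the compiler maps such code to wide vector instructions.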

Multiple Instruction, Single Data (MISD)

Definition [Wikipedia]
MISD is a computer architecture class in which multiple processing units execute their own instruction streams on the same data stream.

• Not very useful in practice
• Fault-tolerant systems executing the same instructions redundantly, or the instruction pipeline, may be considered MISD implementations

[Diagram: several instruction streams, one per PU, all operating on a single data stream]


Multiple Instruction, Multiple Data (MIMD)

Definition [Wikipedia]
MIMD is a computer architecture class in which multiple processing units can perform different operations on different data.

• Very general and flexible
• Easier to build and to program
• Either shared memory or distributed memory
• E.g., multi-core processors, computer clusters

[Diagram: several PUs, each with its own instruction stream and its own data stream]


Shared Memory MIMD

Definition [Wikipedia]
Shared memory is a subclass of MIMD where processors share a common memory space to which they all have access.

• All processes can use the same virtual address space
• Communication through shared variables
• Hardware and system support for synchronization
• Easier to program (but beware of race conditions)
• E.g., multi-core processors, programmed with OpenMP

[Diagram: PUs connected through an interconnect to a single shared memory]
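A minimal OpenMP sketch of the shared-memory style (compile with -fopenmp): the threads all read the shared array directly, and the reduction clause handles the synchronization on sum:

#include <stdio.h>

int main(void)
{
    int x[8] = {2, 3, 4, 6, 5, 8, 9, 5};
    int sum = 0;
    /* Threads share x through the common address space; the reduction
       clause combines each thread's partial sum safely */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 8; i++)
        sum += x[i];
    printf("%d\n", sum);           /* prints 42 */
    return 0;
}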

Distributed Memory MIMD

Definition [Wikipedia]
Distributed memory is a subclass of MIMD where each processor has its own private memory and is connected to the other processors through an interconnection network.

• All processes use different virtual address spaces
• Explicit messages for communication and synchronization
• Harder to program; high communication latency
• E.g., computer clusters, programmed with MPI

[Diagram: each PU paired with its own private memory, all linked by an interconnect]
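A minimal MPI sketch of the message-passing style (compile with mpicc, run with mpirun): each process keeps its value in private memory, and an explicit collective message combines the results:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value, sum;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    value = rank + 1;                       /* data private to this process */
    /* Explicit communication: partial values travel over the network */
    MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %d\n", sum);          /* with 4 processes: 10 */
    MPI_Finalize();
    return 0;
}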