CS3210
Introduction
Lecture 1
Outline
Parallel Computing
Motivation
Serial vs Parallel Computing
Serial vs Parallel Processing Units
Basics of Parallel Computing
Decompose, Schedule and Map
Performance
Challenges
Landscape of Parallel Computing
Applications
Systems
Cheetah vs Hyena
Cheetah:
  Weighs 60 kg
  Fastest land animal: top speed of 120 km/h, 0-100 km/h in 3 seconds
  Hunts alone; the hunt is usually over in 1 minute
  50% chance of losing its catch to other predators

Hyena:
  Weighs 50 kg
  Top speed of 65 km/h
  Hunts in groups of 4-40
  Can take down large animals like buffalo, rhino, lion
Observations
Cheetah: optimized for speed only
Hyena: optimized to cooperate and work in parallel

Single Processor: performance has failed to keep up with Moore's Law since the early 2000s
Parallel Computer: a collection of processing elements that cooperate to solve problems quickly, offering a huge amount of computation power
Parallel Computing - Motivation
In the first issue of CACM (1958), Saul Gorn notes that:
“We know that the so-called parallel computers are somewhat faster
than the serial ones, and that no matter how fast we make access and
arithmetic serially, we can do correspondingly better in parallel.
However access and arithmetic speeds seem to be approaching a
definite limit ... Does this mean that digital speeds are reaching a limit,
that digital computation of multi-variate partial differential systems must
accept it, and that predicting tomorrow’s weather must require
more than one day? Not if we have truly-parallel machines, with say,
a thousand instruction registers.”
Hitting the Wall
‘70s: “Supercomputers” for Scientific Computing
Early ‘90s: Databases
Inflection point in 2004: Intel hits the Power Density Wall
“obtaining more computing power by stamping multiple processors
on a single chip rather than straining to increase the speed of a
single processor.”
John Markoff, New York Times, May 17, 2004
*Cooking aware computing
How do you make your program run faster?
Answer before 2004:
Just wait 6 months and buy a new machine!
Answer after 2004:
Write parallel software
Parallel Computing - Challenges
Gorn also recognized the challenges that would be posed to
the system designer by the new parallel computers:
“But visualize what it would be like to program for such a machine! If a
thousand subroutines were in progress simultaneously, one must program
the recognition that they have all finished, if one is to use their results
together. The programmer must not only synchronize his subroutines, but
schedule all his machine units, unless he is willing to have most of them
sitting idle most of the time. Not only would programming techniques
be vastly different from the serial ones in the serial languages we now
use, but they would be humanly impossible without machine intervention.”
Serial Computing
[Figure: a problem is turned into a single stream of instructions i1, i2, ..., in that feeds one processing unit]
Traditionally, a problem is divided into a discrete
series of instructions
Instructions are executed one after another
Only one instruction is executed at any moment in time
Example: Evaluate a = (b+1) * (b-c)
Program (single control flow):
    i1: + b 1 t1     Add 1 to b, store result in t1
    i2: - b c t2     Subtract c from b, store result in t2
    i3: * t1 t2 a    Multiply t1 and t2, store result in a

Memory locations: b (4), c (2), t1 (5), t2 (2), a (10)

Sequential Computation Model
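The same three instructions written as a minimal C fragment (an illustration only, not the slide's own code; the slide's "+ b 1 t1" notation is a pseudo-assembly form of the same steps):

    int b = 4, c = 2;      /* initial memory contents             */
    int t1 = b + 1;        /* i1: + b 1 t1   ->  t1 = 5           */
    int t2 = b - c;        /* i2: - b c t2   ->  t2 = 2           */
    int a  = t1 * t2;      /* i3: * t1 t2 a  ->  a  = 10          */

Each statement must finish before the next one starts: a single control flow.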
Processing Unit: von Neumann Model
[Figure: von Neumann model: a Processor, consisting of a Control Unit (instruction counter) and an Arithmetic Logic Unit (registers), connected to a Memory that holds both instructions and data, with input and output]
von Neumann Computation Model
Processor:
Performs instructions
Memory:
Stores both the instructions and data in cells with unique addresses
Control Scheme:
Fetches instructions one after another from memory
Shuttles data between memory and processor
Memory wall
disparity between processor speed (< 1 nanosecond or 10^-9 s) and memory speed (100-1,000 nanoseconds)
Ways to improve performance
1. Work hard: higher clock frequency
2. Work smart: pipelining, superscalar processor
3. Get help: replication – multicore, cluster, ...
Parallel Computing
Simultaneous use of multiple processing units to solve a
problem fast / solve a larger problem
Processing units could be:
i. A single processor with multiple cores
ii. A single computer with multiple processors
iii. A number of computers connected by a network
iv. Combinations of the above
Ideally, a problem (application) is partitioned into a sufficient number of independent parts for execution on parallel processing elements
Parallel Computing
[Figure: a problem is divided into tasks task0 ... taskm; each task is a stream of instructions i1, i2, ..., in, and the tasks execute on processing units unit0 ... unitp-1]
1. A problem is divided into m discrete parts (tasks) that can be solved concurrently
2. Each part is further broken down to a series of instructions (i)
3. Instructions from each part execute in parallel on different processing units (p)
Example: Evaluate a = (b+1) * (b-c)
fork creates a parallel executing "thread"
join blocks the executing threads until n threads reach the same point

    fork i2
    i1: + b 1 t1          i2: - b c t2
    join 2                (wait until join)
    i3: * t1 t2 a

multiple control flows

Parallel Computing (Shared-Memory)
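A minimal sketch of this fork/join flow using POSIX threads (an illustration of the shared-memory model, not code from the lecture; the function and variable names are my own):

    #include <pthread.h>
    #include <stdio.h>

    int b = 4, c = 2;   /* inputs, as in the sequential example */
    int t1, t2, a;      /* shared temporaries and result        */

    /* i2: - b c t2, executed by the forked thread */
    void *compute_t2(void *arg) {
        t2 = b - c;
        return NULL;
    }

    int main(void) {
        pthread_t tid;
        pthread_create(&tid, NULL, compute_t2, NULL);  /* fork i2                */
        t1 = b + 1;                                    /* i1 runs in main thread */
        pthread_join(tid, NULL);                       /* join: wait for i2      */
        a = t1 * t2;                                   /* i3 uses both results   */
        printf("a = %d\n", a);                         /* prints a = 10          */
        return 0;
    }

Compile with gcc -pthread. The join guarantees that t2 is ready before i3 uses it, mirroring the "wait until join" in the diagram.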
Why Parallel Computing
Primary reasons:
1. Overcome limits of serial computing
2. Save (wall-clock) time
3. Solve larger problems
Other reasons:
1. Take advantage of non-local resources
2. Cost saving – use multiple “cheaper” (commoditized) computing
resources
3. Overcome memory constraints
Performance - exploits a large collection of processing units that
can communicate and cooperate to solve large problems fast
BASICS OF
PARALLEL COMPUTING
Computational Model Attributes
[Operation Mechanism]
• The primitive units of computation or basic actions of the computer (data types and
operations defined by the instruction set)
[Data Mechanism]
• The definition of address spaces available to the computation (how data are
accessed and stored)
[Control Mechanism]
• The schedulable units of computation (rules for partitioning and scheduling the
problem for computation using the primitive units)
[Communication Mechanism]
• The modes and patterns of communication among processing elements working
in parallel, so that they can exchange needed information
[Synchronization Mechanism]
• Mechanism to ensure needed information arrives at the right time
Basics of Parallel Computing
Application / Problem
    ↓ decompose
Tasks
    ↓ schedule & map
Physical Cores & Processors
Decomposition
One problem can have many possible decompositions
Complicated and laborious process
Potential parallelism of an application dictates how it should be split
into tasks
Size of tasks is called granularity – can choose different task sizes
Defining the tasks is challenging and difficult to automate (why?)
Tasks
Tasks are coded in a parallel programming language
Scheduling: Assignment of tasks to processes or threads
Dictates the task execution order
Manually defined? Static? Dynamic?
Mapping: Assignment of processes/threads to physical
cores/processors for execution
Tasks may depend on each other, resulting in data or control dependencies
Dependencies impose an execution order on the parallel tasks
Dependences and Coordination
Dependences among tasks impose constraints on scheduling
To execute correctly, processes and threads need
synchronization and coordination
depends on information exchange between processes and threads,
which in turn depends on the hardware memory organization
Memory organizations: shared-memory (threads) and
distributed-memory (processes)
An illusion?
Concurrency vs. Parallelism
Concurrency:
  Two or more tasks can start, run, and complete in overlapping time periods
  They might not be running (executing on the CPU) at the same instant
  Two or more execution flows make progress at the same time, by interleaving their executions or by executing instructions (on the CPU) at exactly the same time

Parallelism:
  Two or more tasks can run (execute) simultaneously, at the exact same time
  Tasks do not only make progress, they also actually execute simultaneously
Example – Serial Solution
Find the sum of n numbers:
sum = 0;
for (i = 0; i < n; i++) {
x = Compute_next_value(. . .);
sum += x;
}
Serial Code
Example – Parallel Solution (1/3)
Suppose we have p cores ( p < n )
Each core can perform a partial sum of n/p values
my_sum = 0;
my_start = ……;
my_end = ……;
for (my_i = my_start; my_i < my_end; my_i++){
my_x = Compute_next_value(. . .);
my_sum += my_x;
}
Each core uses its own private variables and executes this
block of code independently
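One common way to fill in my_start and my_end is a block distribution. A minimal sketch, assuming each core also knows its own rank my_rank (0 to p-1) and that p divides n evenly (both the my_rank variable and the divisibility assumption are mine, not the slide's):

    int chunk    = n / p;             /* values handled per core                  */
    int my_start = my_rank * chunk;   /* first index this core is responsible for */
    int my_end   = my_start + chunk;  /* one past the last index for this core    */

If p does not divide n, the leftover n % p values would have to be spread over some of the cores.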
Example – Parallel Solution (2/3)
Upon completion, each core contains:
the partial sum in my_sum
E.g., if n = 24 and p = 8
Values generated:
1 4 3 9 2 8 5 1 1 5 3 7 2 5 0 4 1 8 6 5 1 2 3 9
Partial sum calculated in each core:
Core 0 1 2 3 4 5 6 7
my_sum 8 19 7 15 7 13 12 14
Example – Parallel Solution (3/3)
Next, a core designated as the master core adds up all values
of my_sum to form the global sum (8 + 19 + 7 + 15 + 7 + 13 +
12 + 14 = 95)
if (I'm the master core) {
    sum = my_sum;
    for each core other than myself {
        receive value from other core;
        sum += value;
    }
} else {  // not master core
    send my_sum to the master;
}
Parallel Code
Core      0        1    2    3    4    5    6    7
my_sum    8 → 95   19   7    15   7    13   12   14
(after the gather, the master core holds the global sum 95)
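One concrete way to realise the "send"/"receive" in the pseudocode is message passing, e.g. with MPI; the following is a hedged sketch (the lecture's pseudocode is not tied to any particular library, and the placeholder partial sum stands in for the loop on the previous slide):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int my_sum = rank + 1;   /* placeholder: in practice, the partial sum computed earlier */

        if (rank == 0) {                              /* master core */
            int sum = my_sum;
            for (int src = 1; src < p; src++) {
                int value;
                MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                sum += value;                         /* add each partial sum as it arrives */
            }
            printf("global sum = %d\n", sum);
        } else {                                      /* every other core */
            MPI_Send(&my_sum, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

Note that the master receives and adds the p-1 partial sums one after another, which is exactly why this version needs p-1 steps.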
Better Parallel Algorithm
Share the work of the global summing too:
Cores     0    1    2    3    4    5    6    7
my_sum    8    19   7    15   7    13   12   14
step 1    8+19=27    7+15=22    7+13=20    12+14=26
step 2    27+22=49              20+26=46
step 3    49+46=95
(time flows downward)
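A minimal sequential sketch of this pairwise (tree) reduction, assuming the partial sums already sit in an array my_sum[0..p-1] and that p is a power of two (the array name and the power-of-two restriction are my assumptions):

    for (int stride = 1; stride < p; stride *= 2) {
        for (int i = 0; i + stride < p; i += 2 * stride) {
            my_sum[i] += my_sum[i + stride];   /* stride 1: 8+19, 7+15, 7+13, 12+14 */
        }
        /* in the parallel version, the additions within one stride run on
           different cores, with synchronization between strides */
    }
    /* the global sum (95 in the example) ends up in my_sum[0] */

Each pass of the outer loop corresponds to one row of the diagram above.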
Comparison: Two Parallel Solutions
8 cores:     1st version = 7 steps,   2nd version = 3 steps   (factor of ~2 improvement)
1000 cores:  1st version = 999 steps, 2nd version = 10 steps  (factor of ~100 improvement)
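The step counts follow from the structure of the two algorithms: with p cores, the master core in the first version adds the p-1 other partial sums one after another, while the tree version halves the number of active partial sums at every step (the T notation below is mine, not the slide's):

    T_{\text{master}}(p) = p - 1, \qquad T_{\text{tree}}(p) = \lceil \log_2 p \rceil

    p = 8:\; 7 \text{ vs } 3 \text{ steps}, \qquad p = 1000:\; 999 \text{ vs } \lceil \log_2 1000 \rceil = 10 \text{ steps}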
Parallel Performance
Two perspectives:
Execution time versus Throughput
Parallel execution time = computation time + parallelization
overheads
Parallelization overheads:
distribution of work (tasks) to processors
information exchange or synchronization
idle time
etc
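The same statement in symbols (the notation is mine; the slide states it only in words):

    T_{\text{parallel}} = T_{\text{computation}} + T_{\text{overhead}}, \qquad
    T_{\text{overhead}} = T_{\text{work distribution}} + T_{\text{communication/synchronization}} + T_{\text{idle}} + \dots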
Parallel Computing is HARD
“…when we start talking about parallelism and ease of use of truly parallel
computers, we're talking about a problem that's as hard as any that computer
science has faced. …
I would be panicked if I were in industry.”
John Hennessy, President, Stanford University, 2007
Compiler Challenges
Heterogeneous processors
Increase in the design space for code optimization
Auto-tuners:
Optimizing code at runtime
Software-controlled memory management
Parallel Programming Challenges
Finding enough parallelism (Amdahl's Law! see the formula after this list)
Granularity
Locality
Load balance
Coordination and synchronization
Debugging
Performance modeling / Monitoring
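For the first item above, Amdahl's Law puts a number on "finding enough parallelism": if a fraction f of the work can be parallelized and the rest is serial, the speedup on p processors is at most

    S(p) = \frac{1}{(1 - f) + f/p} \;\le\; \frac{1}{1 - f}

so, for example, a program that is 90% parallelizable (f = 0.9) can never run more than 10 times faster, no matter how many cores are available.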
LANDSCAPE OF
PARALLEL COMPUTING
Uses of Parallel Computing
Data mining, web search engines, web-based business
services, pharmaceutical design, financial and economic
modeling, AI, etc.
http://computing.llnl.gov/tutorials/parallel_comp
Types of Computer Systems
Computer Systems
  Single System: PC/Workstation, SMP/NUMA, Vector, Mainframe
  Distributed Systems (multiple systems): Clusters, Client-Server, Grids, Peer-to-Peer, Cloud

Control and Management: Centralised or Decentralised
Parallel Computing and Parallel Machines
In your pocket:
  ARM big.LITTLE 64-bit (Cortex-A53 + Cortex-A73), 4-8 cores, >2 GHz
Intel Skylake architecture:
  Quad-core CPU + multi-core GPU integrated on one chip
Fastest Supercomputer
2022-24 Fastest Supercomputer
Supercomputer Frontier
1,206 petaflops
8,699,904 cores
AMD Optimized 3rd Gen EPYC 64C 2GHz
22 Mwatts
No. 7* in GREEN500
Video
2020-21 Fastest Supercomputer
Supercomputer
Fugaku
415 petaflops
~7,300,000 cores (ARM A64FX)
29 Mwatts
No. 33 in GREEN500
Video
Virtual tour
Usage of Frontier
Energy
Fusion energy studies
Combustion energy improvements
Innovative clean energy systems
Materials study
Social and scientific issues
Innovative drug discovery
Personalized and preventive medicine
Disaster prevention and global climate problems
Meteorological and global environmental predictions
Evolution of the universe
Deconstructing Frontier
Node:
1 HPC- and AI-optimized AMD EPYC CPU
4 purpose-built AMD Radeon Instinct GPUs
74 cabinets:
Over 9000 nodes
Interconnect:
Multiple Slingshot NICs
Slingshot dragonfly network
Deconstructing Fugaku
Processor (CPU):
• Fujitsu’s 48-core A64FX SoC (optional assistant cores)
• 158,976 nodes (CPUs):
396 racks of 384 nodes
36 racks of 192 nodes
• Fujitsu Tofu (Torus Fusion) interconnect
TOP500 Performance over Time
name   prefix   multiplier
exa    E        10^18
peta   P        10^15
tera   T        10^12
giga   G        10^9
mega   M        10^6
kilo   K        10^3
source: top500.org
Summary
Serial versus parallel computing
Parallel computing - what, why and how
Performance and challenges
Landscape of parallel computing
Evolution of Prominent Computers
year location name first
1834 Cambridge Difference engine Programmable computer
1943 Bletchley Colossus Electronic computer
1948 Manchester SSEM (Baby) Stored-program computer
1951 MIT Whirlwind 1 Real-time I/O computer
1953 Manchester Transistor computer Transistorized computer
1971 California Intel 4004 Mass-market CPU & IC
1979 Cambridge Sinclair ZX-79 Mass-market home computer
1981 New York IBM PC Personal computer
1987 Cambridge Acorn A400 series High-street RISC PC sales
1990 New York IBM RS6000 Superscalar RISC processor
1998 California Sun picoJava Computer based on a language
Progression of Computation Speed
year Floating point operations per second (FLOPS)
1941 1
1945 100
1949 1,000 (1 KiloFLOPS, kFLOPS)
1951 10,000
1961 100,000
1964 1,000,000 (1 MegaFLOPS, MFLOPS)
1968 10,000,000
1975 100,000,000
1987 1,000,000,000 (1 GigaFLOPS, GFLOPS)
1992 10,000,000,000
1993 100,000,000,000
1997 1,000,000,000,000 (1 TeraFLOPS, TFLOPS)
2000 10,000,000,000,000
2007 478,000,000,000,000 (478 TFLOPS)
2009 1,100,000,000,000,000 (1.1 PetaFLOPS)
Computing Units
International System of Units (SI):
  name    prefix   multiplier
  exa     E        10^18
  peta    P        10^15
  tera    T        10^12
  giga    G        10^9
  mega    M        10^6
  kilo    K        10^3
  milli   m        10^-3
  micro   µ        10^-6
  nano    n        10^-9
  pico    p        10^-12

International Electrotechnical Commission (IEC):
  name    prefix   multiplier
  exbi    Ei       2^60
  pebi    Pi       2^50
  tebi    Ti       2^40
  gibi    Gi       2^30
  mebi    Mi       2^20
  kibi    Ki       2^10

Example of disk capacity:
  20 Mibytes = 20 MiB = 20 mebibytes = 20 × 2^20 bytes = 20,971,520 bytes
Reference & Readings
Fugaku technical details:
https://www.r-ccs.riken.jp/en/fugaku/
Reading
Robert Robey, Yuliana Zamora, "Parallel and High-Performance Computing", Manning Publications – Chapter 1.
A View of the Parallel Computing Landscape, CACM Vol. 52 No. 10,
pp. 56-67, October 2009. [Technical Report: The Landscape of
Parallel Computing Research: A View from Berkeley, Dec 2006]