Introduction to Parallel Architecture
R. Govindarajan
Indian Institute of Science,
Bangalore, INDIA
govind@iisc.ac.in
Overview
• Introduction
• Pipelining, Instruction Level Parallelism
• Multicore Architectures
• Multiprocessor Architecture
– Shared Address Space
– Distributed Address Space
• Accelerators – GPUs
• Supercomputer Systems
Introduction
§ Parallelism everywhere
Ø Pipelining, Instruction-Level Parallelism
Ø Vector Processing
Ø Array processors/MPP
Ø Multiprocessor Systems
Ø Multicomputers/cluster computing
Ø Multicores
Ø Graphics Processing Units (GPUs) and other
Accelerators
Basic Computer Organization
[Figure: a CPU containing the ALU, registers, control unit, cache and MMU, connected over a bus to memory and I/O devices.]
Overview
• Introduction
• Pipelining, Instruction Level Parallelism
• Multicore Architectures
• Multiprocessor Architecture
– Shared Address Space
– Distributed Address Space
• Accelerators – GPUs
• Supercomputer Systems
Pipelined Processor
§ Pipelining instruction execution
ØInstrn. Fetch, Decode/Reg.Fetch, Execute,
Memory and WriteBack
§ Why pipelined execution?
ØImproves instruction throughput
ØIdeal : 1 instruction every cycle!
Processor Datapath

[Figure: the five-stage pipelined datapath: instruction fetch (IF) with PC and instruction memory; instruction decode / register fetch (ID) with register file and sign extension; execute (EX) with ALU and branch (zero) test; memory access (MEM) with data memory; and write-back (WB).]
Pipelined Execution

time →
i1:  IF  ID  EX  MEM WB
i2:      IF  ID  EX  MEM WB
i3:          IF  ID  EX  MEM WB
i4:              IF  ID  EX  MEM WB

• Execution time (latency) of each instruction is still 5 cycles, but throughput is now 1 instruction per cycle
• After the initial pipeline fill time (4 cycles), 1 instruction completes every cycle
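As a rough sketch of the ideal case (assuming no stalls or hazards), the time to run n instructions on a k-stage pipeline is:

```latex
T_{\text{pipe}}(n) = (k - 1) + n \ \text{cycles}
```

For the 5-stage pipeline above, 100 instructions take 4 + 100 = 104 cycles instead of 5 × 100 = 500 cycles unpipelined, a speedup of roughly 4.8x.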
Memory Hierarchy
• (Pipelined) Instruction execution assumes fetching an instruction and data from memory in a single cycle.
  – But a memory access takes several processor cycles!
• Instruction-level parallelism requires multiple instructions and data to be fetched in the same cycle.
• Memory hierarchy designed to address this!
• Memory hierarchy exploits locality of reference.
– Temporal Locality
– Spatial Locality
– Locality in instruction and data
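A minimal illustration of the two kinds of locality, using a simple C matrix sum (the array name and size are arbitrary assumptions for the example):

```c
#include <stdio.h>

#define N 1024

double a[N][N];   /* illustrative data; elements default to 0.0 */

int main(void) {
    double sum = 0.0;   /* 'sum' is reused on every iteration: temporal locality */

    /* Row-major traversal matches C's memory layout, so successive
       accesses fall in the same cache line: spatial locality.
       Swapping the two loops (column-major traversal) would touch a
       different cache line on almost every access. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    printf("sum = %f\n", sum);
    return 0;
}
```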
Memory Hierarchy

[Figure: the levels of the memory hierarchy (not reproduced).]
Memory Hierarchy : Caches

[Figure: CPU and MMU backed by split L1 I-cache and D-cache (1–2 cycles), a unified L2 cache (5–10 cycles), and main memory (100–300 cycles).]

• Avg. Memory Access Time (with one level of cache):
  AMAT = hit time of L1 + miss-rate at L1 * miss-penalty at L1
• Avg. Memory Access Time (with two levels of cache):
  AMAT = hit time of L1 + miss-rate at L1 * (hit time of L2 + miss-rate at L2 * miss-penalty at L2)
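A minimal sketch of the two-level AMAT formula in code; the hit times, miss rates and memory latency below are illustrative assumptions drawn from the cycle ranges in the figure, not measured values:

```c
#include <stdio.h>

int main(void) {
    double hit_l1       = 2.0;    /* L1 hit time (cycles)                  */
    double miss_rate_l1 = 0.05;   /* fraction of accesses that miss in L1  */
    double hit_l2       = 10.0;   /* L2 hit time (cycles)                  */
    double miss_rate_l2 = 0.20;   /* fraction of L1 misses that miss in L2 */
    double mem_latency  = 200.0;  /* miss penalty at L2: main memory       */

    /* AMAT = hitL1 + missL1 * (hitL2 + missL2 * missPenaltyL2) */
    double amat = hit_l1 + miss_rate_l1 * (hit_l2 + miss_rate_l2 * mem_latency);
    printf("AMAT = %.2f cycles\n", amat);  /* 2 + 0.05 * (10 + 0.2 * 200) = 4.5 */
    return 0;
}
```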
Instruction Level Parallelism
§ Multiple independent instructions issued/
executed together
§ Why?
ØImprove throughput (Instrns. Per Cycle or
IPC) beyond 1
§ How are independent instructions identified?
ØHardware – Superscalar processor
ØCompiler – VLIW processor
Superscalar Execution Model

[Figure: instructions from the static program are fetched and decoded into an instruction window, dispatched and issued to execution units as operands become available (limited only by true data dependences), and finally reordered and committed in program order.]
Superscalar Overview

[Figure: block diagram of a superscalar processor: memory interface, pre-decode logic and instruction cache feeding an instruction buffer; a decode/rename/dispatch stage filling instruction queues that issue to integer and floating-point functional units; a branch predictor steering fetch; and a register file and reorder buffer holding results until commit.]
Overview
• Introduction
• Pipelining, Instruction Level Parallelism
• Multicore Architectures
• Multiprocessor Architecture
– Shared Address Space
– Distributed Address Space
• Accelerators – GPUs
• Supercomputer Systems
Parallelism in Processor
• Pipelined processor
• Instruction-Level Parallelism
• What next? Multicore processors
– Multiple processors in a single chip
• Why?
– To improve performance of a single program
– To execute multiple processes on different cores
Multicore Processors

[Figure: examples of multicore processor chips (not reproduced).]
Multicore Processor

[Figure: a quad-core chip: cores C0–C3, each with private L1 and L2 caches, sharing an on-chip L3 cache; an integrated memory controller (IMC) connects the chip to memory, and a QPI link connects it to other sockets and I/O.]
Overview
• Introduction
• Pipelining, Instruction Level Parallelism
• Multicore Architectures
• Multiprocessor Architecture
– Shared Address Space
– Distributed Address Space
• Accelerators – GPUs
• Supercomputer Systems
Classification of Parallel
Machines
Flynn’s Classification: in terms of number of
Instruction streams and Data streams
ØSISD: Single Instruction Single Data
ØSIMD: Single Instruction Multiple Data
ØMISD: Multiple Instruction Single Data
ØMIMD: Multiple Instruction Multiple Data
SIMD Machines
• Vector Processors
– Single instruction on multiple data (elements of a
vector – temporal)
• Array Processors
– Single instruction on multiple data (elements of a
vector / array – spatial )
• Modern Processors
– AVX / MMX instructions
• Graphics Processing Units
  – Multiple SIMD cores in each streaming multiprocessor
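As a sketch of SIMD on a modern CPU (assuming an x86 processor with AVX and a compiler flag such as -mavx), a single AVX instruction operates on eight single-precision elements at once:

```c
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);     /* load 8 floats into one 256-bit register */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);  /* one instruction, eight additions */
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);          /* prints 11 22 33 44 55 66 77 88 */
    printf("\n");
    return 0;
}
```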
MIMD Machines

Parallel Architecture
§ Shared Memory
  Ø Centralized shared memory (UMA)
  Ø Distributed Shared Memory (NUMA)
§ Distributed Memory
  Ø A.k.a. message passing
  Ø E.g., clusters

Programming Models
• What the programmer uses in coding applications
• Specifies synchronization and communication
• Examples:
  – Shared address space, e.g., OpenMP
  – Message passing, e.g., MPI
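A minimal shared-address-space sketch in OpenMP (assumes a compiler with OpenMP support, e.g. gcc -fopenmp): all threads see the same arrays, and the loop iterations are split among them.

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(void) {
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* a, b and c live in the single shared address space;
       each thread works on a different chunk of iterations. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[42] = %.1f, max threads = %d\n", c[42], omp_get_max_threads());
    return 0;
}
```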
Shared Memory Architecture

[Figure: two organizations.
• Uniform Memory Access (UMA), centralized shared memory: processors with private caches connect through a network/bus to a common pool of memory modules, so every processor sees the same access time to all of memory.
• Non-Uniform Memory Access (NUMA), distributed shared memory: each processor (with its cache) has a local memory, and the processor–memory nodes are connected by a network; local memory is faster to reach than remote memory.]
UMA Architecture

[Figure: eight cores (C0–C7), each with a private L1 cache; pairs of cores share an L2 cache, and all cores access a single shared memory with uniform latency.]
NUMA Architecture

[Figure: two sockets, each with four cores (private L1 and L2 caches per core) sharing an L3 cache; each socket's integrated memory controller (IMC) attaches to its own local memory, and the sockets are linked by QPI, so remote memory accesses cross QPI and take longer.]
Caches in Shared Memory
• Reduce average latency
– automatic replication closer
to processor
• What happens when store &
load are executed on
different processors?
⇒ Cache Coherence Problem
Cache Coherence Problem

[Figure: X is initially 0 in memory, and both P1 and P2 read X, so each cache holds X = 0. P1 then writes X = 1 in its own cache. A later read of X on P2 hits in its cache and returns the stale value 0 (wrong data!); even a read that misses can return 0 from memory if the new value has not yet been written back (wrong data!).]
Cache Coherence Solutions
• Snoopy Protocol: shared bus interconnect where
all cache controllers monitor all bus activity
– Cache controllers to take corrective action based on
traffic in the interconnect network
– Corrective action: update or invalidate a cache block
• Directory Based Protocols: Cache controllers
maintain info. of shared copies of cache block
– Send invalidation/update message to copies
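To connect this to software, here is a hedged pthreads/C11-atomics sketch (variable names are illustrative; link with -pthread): one thread stores a value and raises a flag, and the coherence protocol, together with the acquire/release atomics, ensures the other thread observes the new value rather than a stale cached copy.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

int x = 0;              /* shared data, like X in the example above */
atomic_int ready = 0;   /* flag signalling that x has been written  */

void *writer(void *arg) {
    x = 1;                                                /* "Write X = 1" */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

void *reader(void *arg) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                                 /* spin until the write is visible */
    printf("reader sees x = %d\n", x);                    /* prints 1, not a stale 0 */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, reader, NULL);
    pthread_create(&t2, NULL, writer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```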
Distributed Memory Architecture

[Figure: a cluster: several nodes, each containing a processor, cache and private memory, connected by a network.]

• Message Passing Architecture
  – Memory is private to each node
  – Processes communicate by messages
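A minimal message-passing sketch in MPI (illustrative; build with mpicc and run with at least two ranks, e.g. mpirun -np 2): each process has its own private memory, so the value must be sent explicitly.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;   /* exists only in rank 0's private memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```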
NUMA Architecture

[Figure: the same two-socket NUMA node, with an I/O hub attached between the sockets; a network interface card (NIC) on the I/O hub connects the node to a cluster network.]
Distributed Memory Architecture

[Figure: four nodes (Node 0–Node 3), each with its own memory and NIC, connected through a network switch.]
Interconnection Network
§ Processors and Memory modules connected to
each other through Interconnect Network
§ Indirect interconnects: nodes are connected to
interconnection medium, not directly to each
other
Ø Shared bus, multiple bus, crossbar, MIN
§ Direct interconnects: nodes are connected
directly to each other
Ø Topology: linear, ring, star, mesh, torus, hypercube
Ø Routing techniques: how the route taken by the
message from source to destination is decided
Indirect Network Topology

[Figure: examples of indirect interconnects: shared bus, multiple buses, crossbar switch, and a multistage interconnection network built from 2x2 crossbar switches.]
Direct Interconnect Topology

[Figure: examples of direct interconnects: linear array, ring, star, 2D mesh, torus, and hypercube (binary n-cube, shown for n = 2 and n = 3).]
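As a quick worked comparison of these topologies (standard properties, stated here for orientation), a binary n-cube with p = 2^n nodes has:

```latex
\text{degree} = n, \qquad \text{diameter} = n = \log_2 p
```

For example, with n = 10 a hypercube connects 1024 nodes using only 10 links per node and a worst-case distance of 10 hops, whereas a 1024-node ring has a diameter of 512 hops.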
Overview
• Introduction
• Pipelining, Instruction Level Parallelism
• Multicore Architectures
• Multiprocessor Architecture
– Shared Address Space
– Distributed Address Space
• Accelerators – GPUs
• Supercomputer Systems
Accelerators and
Manycore Architectures
Accelerator – Fermi S2050

[Figure: the Fermi S2050 GPU accelerator (not reproduced).]
Combining CPU and GPU Arch.

• CPU: 8 cores @ 3 GHz, 0.38 TFLOPS
• GPU: 2880 CUDA cores @ 1.67 GHz, 1.5 TFLOPS (DP)

[Figure: a multicore CPU (cores C0–C7 with a shared last-level cache, an IMC to memory and a QPI link) attached to a discrete GPU over PCI-e.]
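As a rough sanity check of the CPU figure (assuming each core can complete 16 double-precision FLOPs per cycle, e.g. two 4-wide fused multiply-add units):

```latex
8\ \text{cores} \times 3\ \text{GHz} \times 16\ \tfrac{\text{FLOPs}}{\text{cycle}\cdot\text{core}} = 384\ \text{GFLOPS} \approx 0.38\ \text{TFLOPS}
```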
Heterogeneous Clusters with GPUs

[Figure: cluster nodes, each with a CPU, a GPU, its own memory and a NIC, connected through a network switch.]
Overview
• Introduction
• Pipelining, Instruction Level Parallelism
• Multicore Architectures
• Multiprocessor Architecture
– Shared Address Space
– Distributed Address Space
• Accelerators – GPUs
• Supercomputer Systems
What is a Supercomputer?
• A hardware and software system that provides close to the maximum performance that can currently be achieved.
• What was a supercomputer a few (about 5) years ago is probably an order of magnitude slower than today's supercomputer systems! We therefore also use the term "high performance computing" to refer to supercomputing.
Era of Supercomputing

• Introduction of the Cray-1 in 1976 ushered in the era of supercomputing
  – Shared memory, vector processing
  – Good software environment
  – A few hundred MFLOPS peak
  – Cost about $5 million

[Images: Cray-1 (1976), Apple I, iPhone (2007).]
Performance of Supercomputer
• What are the top 10 or top 500 computers?
– www.top500.org
– Updated every 6 months
– Measured using Rmax of Linpack (solving Ax = b )
• What is the trend?
Year | #1 Performance (GFLOPS) | #500 Performance (GFLOPS)
1993 | 59.7                    | 0.422
2018 | 143,500,000             | 874,800

That is roughly a 2,403,685x improvement at #1, achieved by a system with ~2,397,824 processor cores!
Components of a Supercomputer

[Figure: the building blocks of a terascale-to-exascale system:
• Processor: what gives the system its compute power
• Interconnect: the connection between the building blocks
• Architecture: how the building blocks are put together
• Software]
The TOP 500 (Nov. 2019)

Rank | Site | Manufacturer | Computer | Country | Cores | Rmax [Pflops] | Power [MW]
1 | Oak Ridge National Lab, DOE/SC/ORNL | IBM | Summit: IBM Power9 22C, NVIDIA V100, Mellanox EDR | USA | 2,414,592 | 148.60 | 10.1
2 | DOE/NNSA/LLNL | IBM | Sierra: IBM Power9 22C, NVIDIA V100, Mellanox EDR | USA | 1,572,480 | 94.64 | 7.43
3 | National Supercomputer Center in Wuxi | NRCPC | Sunway TaihuLight: Sunway SW26010 260C, 1.45 GHz | China | 10,649,600 | 93.01 | 15.37
4 | National Supercomputer Center in Tianjin | NUDT | Tianhe-2: NUDT TH MPP, Xeon E5-2692 and Xeon Phi 31S1 | China | 4,981,760 | 61.44 | 18.48
5 | Texas Advanced Computing Center (TACC) | Dell | Frontera: Dell C6420, Intel Xeon 8280 28C 2.7 GHz, InfiniBand HDR | USA | 448,448 | 23.52 | 2.38
6 | Swiss National Supercomputing Centre (CSCS) | Cray | Piz Daint: Cray XC50, Xeon E5-2690 12C 2.6 GHz + NVIDIA Tesla P100 | Switzerland | 387,872 | 21.23 | 2.30
7 | DOE/NNSA/LANL/SNL | Cray | Trinity: Cray XC40, Intel Xeon E5-2698 16C, Aries | USA | 979,072 | 20.16 | 7.58
8 | AIST, Japan | Fujitsu | ABCI: Intel Xeon 6148 20C, NVIDIA Tesla V100 SXM2, InfiniBand EDR | Japan | 391,680 | 19.88 | 1.65

Top 500 List: www.top500.org
Supercomputing Systems & Applications are Challenging!

[Images: example applications: cosmological (Millennium) simulations in astrophysics, social network analysis, climate and weather modeling, computational fluid dynamics.]