
Introduction to Parallel Architecture

R. Govindarajan
Indian Institute of Science,
Bangalore, INDIA
govind@iisc.ac.in

Overview
• Introduction
• Pipelining, Instruction Level Parallelism
• Multicore Architectures
• Multiprocessor Architecture
– Shared Address Space
– Distributed Address Space
• Accelerators – GPUs
• Supercomputer Systems

Introduction
• Parallelism everywhere
  – Pipelining, Instruction-Level Parallelism
  – Vector Processing
  – Array processors/MPP
  – Multiprocessor Systems
  – Multicomputers/cluster computing
  – Multicores
  – Graphics Processing Units (GPUs) and other Accelerators
Basic Computer Organization
[Figure: basic computer organization – a CPU (ALU, registers, control unit, cache, MMU) connected over a system bus to memory and I/O devices]
Overview
• Introduction
• Pipelining, Instruction Level Parallelism
• Multicore Architectures
• Multiprocessor Architecture
– Shared Address Space
– Distributed Address Space
• Accelerators – GPUs
• Supercomputer Systems

Pipelined Processor
• Pipelining instruction execution
  – Instrn. Fetch, Decode/Reg. Fetch, Execute, Memory, and WriteBack stages
• Why pipelined execution?
  – Improves instruction throughput
  – Ideal: 1 instruction every cycle!
Processor Datapath

[Figure: 5-stage pipelined datapath – PC and instruction memory, register file, ALU (with zero test), data memory, and sign-extend unit; the stages are Instrn. Fetch (IF), Instrn. Decode (ID), Execution (EX), Memory (MEM), and Write Back (WB)]
Pipelined Execution

  time →
  i1:  IF   ID   EX   MEM  WB
  i2:       IF   ID   EX   MEM  WB
  i3:            IF   ID   EX   MEM  WB
  i4:                 IF   ID   EX   MEM  WB

• Execution time of an instruction is still 5 cycles, but throughput is now 1 instruction per cycle
• Initial pipeline fill time (4 cycles), after which 1 instruction completes every cycle
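A quick back-of-the-envelope check (not on the original slide; it assumes one cycle per stage, as in the diagram):

  Time for N instructions on a k-stage pipeline = (k − 1) + N cycles
  Speedup over unpipelined execution = k·N / ((k − 1) + N), which approaches k for large N
  For the 4 instructions above (k = 5): (5 − 1) + 4 = 8 cycles vs. 5 × 4 = 20 cycles unpipelined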
Memory Hierarchy
• (Pipelined) Instruction execution assumes fetching instructions and data from memory in a single cycle
  – Memory access takes several processor cycles!
• Instruction-level parallelism requires multiple instrns. and data to be fetched in the same cycle
• The memory hierarchy is designed to address this!
• The memory hierarchy exploits locality of reference
  – Temporal locality
  – Spatial locality
  – Locality in instructions and data
Memory Hierarchy

[Figure: levels of the memory hierarchy]
Memory Hierarchy: Caches

[Figure: CPU and MMU backed by split L1 I-Cache and D-Cache (1–2 cycles), a unified L2 cache (5–10 cycles), and main memory (100–300 cycles)]

• Avg. Memory Access Time (with one level of cache):
  AMAT = hit time of L1 + miss-rate at L1 * miss-penalty at L1
• Avg. Memory Access Time (with two levels of cache):
  AMAT = hit time of L1 + miss-rate at L1 * (hit time of L2 + miss-rate at L2 * miss-penalty at L2)
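A quick numeric illustration (the rates and latencies here are assumed for the example, not taken from the slide): with a 1-cycle L1 hit time, 5% L1 miss rate, 10-cycle L2 hit time, 20% L2 miss rate, and 200-cycle memory access,

  AMAT = 1 + 0.05 × (10 + 0.20 × 200) = 1 + 0.05 × 50 = 3.5 cycles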
Instruction Level Parallelism
• Multiple independent instructions issued/executed together
• Why?
  – Improve throughput (Instrns. Per Cycle, or IPC) beyond 1
• How are independent instructions identified? (see the short example below)
  – Hardware – Superscalar processor
  – Compiler – VLIW processor
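A minimal C sketch (illustrative only, not from the slides) of the distinction the hardware or compiler must uncover: statements with no data dependences can be issued together, while a chain of true dependences cannot.

```c
#include <stdio.h>

int main(void) {
    int x = 4, y = 5, z = 6;

    /* Independent statements: a superscalar core (hardware scheduling)
     * or a VLIW compiler (static scheduling) can issue all three in the
     * same cycle, since no statement uses another's result. */
    int a = x * 2;
    int b = y + 3;
    int c = z - 1;

    /* True data dependencies: each statement consumes the previous
     * result, so they must be issued in successive cycles. */
    int d = x * 2;
    int e = d + 3;
    int f = e - 1;

    printf("%d %d %d %d\n", a, b, c, f);
    return 0;
}
```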
Superscalar Execution Model

[Figure: superscalar execution model – instructions from the static program are fetched & decoded, dispatched into an instruction window, issued to execution units when their operands are ready (subject to true data dependencies), and finally reordered & committed]
Superscalar Overview
[Figure: superscalar processor organization – memory interface, pre-decode logic, instruction cache, instruction buffer, decode/rename/dispatch stage, instruction queues feeding integer and floating-point functional units, branch predictor, register file, and reorder buffer]
Overview
• Introduction
• Pipelining, Instruction Level Parallelism
• Multicore Architectures
• Multiprocessor Architecture
– Shared Address Space
– Distributed Address Space
• Accelerators – GPUs
• Supercomputer Systems

Parallelism in Processor
• Pipelined processor
• Instruction-Level Parallelism
• What next? Multicore processors
– Multiple processors in a single chip
• Why?
– To improve performance of a single program
– To execute multiple processes on different cores

Multicore Processors

[Figure: examples of multicore processor chips]
Multicore Processor

[Figure: a quad-core chip – cores C0–C3, each with private L1 and L2 caches, sharing an L3 cache, with an integrated memory controller (IMC) to memory and a QPI link off-chip]
Overview
• Introduction
• Pipelining, Instruction Level Parallelism
• Multicore Architectures
• Multiprocessor Architecture
– Shared Address Space
– Distributed Address Space
• Accelerators – GPUs
• Supercomputer Systems

Classification of Parallel Machines

Flynn's Classification: in terms of the number of instruction streams and data streams
• SISD: Single Instruction Single Data
• SIMD: Single Instruction Multiple Data
• MISD: Multiple Instruction Single Data
• MIMD: Multiple Instruction Multiple Data


SIMD Machines

• Vector Processors
  – Single instruction on multiple data (elements of a vector – temporal)
• Array Processors
  – Single instruction on multiple data (elements of a vector / array – spatial)
• Modern Processors
  – AVX / MMX instructions
• Graphics Processing Units
  – Multiple SIMD cores in each streaming multiprocessor
MIMD Machines

Parallel Architecture:
• Shared Memory
  – Centralized shared memory (UMA)
  – Distributed Shared Memory (NUMA)
• Distributed Memory
  – a.k.a. Message passing
  – E.g., clusters

Programming Models:
• What the programmer uses when coding applns.
• Specify synchronization and communication
• Examples:
  – Shared address space, e.g., OpenMP (a small sketch follows below)
  – Message passing, e.g., MPI
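As a concrete taste of the shared-address-space model mentioned above, here is a minimal OpenMP sketch in C (an illustration, not taken from the slides; compile with OpenMP support, e.g. -fopenmp):

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    /* Three arrays in the single shared address space seen by all threads. */
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* The iterations are split among the threads; every thread reads and
     * writes a, b, c directly -- no explicit data movement is needed. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %.1f, max threads = %d\n", c[N - 1], omp_get_max_threads());
    return 0;
}
```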
Shared Memory Architecture

[Figure: two shared-memory organizations. Uniform Memory Access (UMA) / centralized shared memory: processors with private caches share memory modules through a common interconnection network. Non-Uniform Memory Access (NUMA) / distributed shared memory: each processor has a local memory, and remote memories are reached through the network.]
UMA Architecture

[Figure: UMA example – eight cores (C0–C7), each with a private L1 cache, pairs of cores sharing an L2 cache, and a single memory shared by all cores]
NUMA Architecture

[Figure: NUMA example – two sockets, each with four cores (private L1 and L2 caches per core), a shared L3 cache, and an integrated memory controller (IMC) to its local memory; the sockets are connected by QPI links]
Caches in Shared Memory
• Reduce average latency
  – automatic replication of data closer to the processor
• What happens when a store and a load to the same location are executed on different processors?
  ⇒ Cache Coherence Problem
Cache Coherence Problem
[Figure: processors P1 and P2 with private caches share location X, initially 0 in memory. P1 reads X, then writes X = 1 in its own cache. When P2 subsequently reads X, a hit on its stale cached copy – or a miss served from the unwritten memory copy – returns X = 0: wrong data!]
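A minimal C sketch (illustrative, not from the slides) of the sharing pattern above: one thread stores to X while another loads it. Hardware cache coherence is what propagates the update between the two processors' caches; the C11 atomics merely keep the program itself well defined. Compile with -pthread.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int X = 0;   /* shared location, initially 0, cached by both cores */

void *writer(void *arg) {
    (void)arg;
    atomic_store(&X, 1);          /* "P1": Write X = 1 */
    return NULL;
}

void *reader(void *arg) {
    (void)arg;
    while (atomic_load(&X) == 0)  /* "P2": Read X until the update is visible */
        ;
    printf("reader observed X = %d\n", atomic_load(&X));
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t2, NULL, reader, NULL);
    pthread_create(&t1, NULL, writer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```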
Cache Coherence Solutions

• Snoopy Protocol: shared-bus interconnect where all cache controllers monitor all bus activity
  – Cache controllers take corrective action based on traffic in the interconnection network
  – Corrective action: update or invalidate a cache block
• Directory-Based Protocols: cache controllers maintain info. on the shared copies of a cache block
  – Send invalidation/update messages to the copies
Distributed Memory Architecture (Cluster)

[Figure: nodes, each with a processor, cache, and local memory, connected through a network]

• Message Passing Architecture
  – Memory is private to each node
  – Processes communicate by messages (see the MPI sketch below)
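A minimal MPI sketch in C of the message-passing model just described (an illustration, not from the slides): each process owns its memory, and data moves only via explicit send/receive. Run with at least two ranks, e.g. mpirun -np 2 ./a.out.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0) printf("run with at least 2 ranks\n");
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {
        int value = 42;                      /* data private to rank 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("rank 0 sent %d\n", value);
    } else if (rank == 1) {
        int value;                           /* rank 1's own private copy */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```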
NUMA Architecture

[Figure: the same two-socket NUMA node as before, now with an I/O hub attached over QPI and a NIC through which the node connects to a cluster network]
Distributed Memory Architecture

[Figure: four nodes (Node 0–Node 3), each with its own memory and NIC, connected through a network switch]
Interconnection Network
• Processors and memory modules are connected to each other through an interconnection network
• Indirect interconnects: nodes are connected to the interconnection medium, not directly to each other
  – Shared bus, multiple bus, crossbar, MIN
• Direct interconnects: nodes are connected directly to each other
  – Topology: linear, ring, star, mesh, torus, hypercube
  – Routing techniques: how the route taken by a message from source to destination is decided (a small routing sketch follows below)
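As one concrete example of a routing technique for the hypercube topology listed above, here is a small C sketch of dimension-order (e-cube) routing, where the route is determined by the bit positions in which the source and destination node IDs differ (a standard technique, offered as an illustration rather than something taken from the slides):

```c
#include <stdio.h>

/* Print the hops of a dimension-order (e-cube) route in an n-cube:
 * at each step, flip the lowest-order bit in which the current
 * node ID and the destination ID still differ. */
void ecube_route(unsigned src, unsigned dst, unsigned dims) {
    unsigned cur = src;
    printf("%u", cur);
    for (unsigned d = 0; d < dims; d++) {
        unsigned bit = 1u << d;
        if ((cur ^ dst) & bit) {      /* do they differ in dimension d? */
            cur ^= bit;               /* traverse the link in dimension d */
            printf(" -> %u", cur);
        }
    }
    printf("\n");
}

int main(void) {
    ecube_route(1, 6, 3);   /* 3-cube: 001 -> 000 -> 010 -> 110 */
    return 0;
}
```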
Indirect Network Topology

[Figure: indirect network topologies – shared bus, multiple bus, crossbar switch, and a multistage interconnection network built from 2x2 crossbars]
Direct Interconnect Topology

[Figure: direct network topologies – linear array, ring, star, 2D mesh, torus, and hypercube (binary n-cube, shown for n = 2 and n = 3)]
Overview
• Introduction
• Pipelining, Instruction Level Parallelism
• Multicore Architectures
• Multiprocessor Architecture
– Shared Address Space
– Distributed Address Space
• Accelerators – GPUs
• Supercomputer Systems

Accelerators and Manycore Architectures
Accelerator – Fermi S2050

[Figure: NVIDIA Fermi S2050 GPU accelerator]
Combining CPU and GPU Arch.
• CPU: 8 cores @ 3 GHz, 0.38 TFLOPS
• GPU: 2880 CUDA cores @ 1.67 GHz, 1.5 TFLOPS (DP)

[Figure: CPU (cores C0–C7 with a shared last-level cache, IMC to memory, and QPI) connected to the GPU over PCI-e]
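A rough sanity check of the CPU number (the 16 FLOPs/cycle figure is an assumption, e.g. two 4-wide double-precision FMA units per core):

  8 cores × 3 GHz × 16 FLOPs/cycle = 384 GFLOPS ≈ 0.38 TFLOPS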
Heterogeneous Clusters with GPUs

[Figure: a cluster of GPU-equipped nodes – each node has its own memory and NIC, and the nodes are connected through a network switch]
Overview
• Introduction
• Pipelining, Instruction Level Parallelism
• Multicore Architectures
• Multiprocessor Architecture
– Shared Address Space
– Distributed Address Space
• Accelerators – GPUs
• Supercomputer Systems

What is a Supercomputer?

• A hardware and software system that provides close to the maximum performance that can currently be achieved
• What was a supercomputer a few (5) years ago is probably an order of magnitude slower than today's supercomputer systems!
• Therefore, the term "high performance computing" is also used to refer to Supercomputing!
Era of Supercomputing
• Introduction of the Cray-1 in 1976 ushered in the era of Supercomputing
  – Shared memory, vector processing
  – Good software environment
  – A few 100 MFLOPS peak
  – Cost about $5 million

[Figure: the Cray-1 (1976) compared with the Apple iPhone (2007)]
Performance of Supercomputer

• What are the top 10 or top 500 computers?


– www.top500.org
– Updated every 6 months
– Measured using Rmax of Linpack (solving Ax = b )
• What is the trend?
  Year | #1 system (GFLOPS) | #500 system (GFLOPS)
  1993 | 59.7               | 0.422
  2018 | 143,500,000        | 874,800

  An improvement of roughly 2.4 million times (~2,397,824x – 2,403,685x)!
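The headline improvement factor follows directly from the #1 entries in the table:

  143,500,000 GFLOPS / 59.7 GFLOPS ≈ 2.4 × 10^6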
Components of a Supercomputer

[Figure: the components of a supercomputer, scaling from terascale to exascale – the Processor (what gives the system its compute power), the Interconnect (the connection between building blocks), and the Architecture (how the building blocks are put together), held together by Software]
The TOP 500 (Nov. 2019)

Rank | Site                                     | Manufacturer | Computer                                                              | Country     | Cores      | Rmax [PFlops] | Power [MW]
1    | Oak Ridge National Labs, DOE/SC/ORNL     | IBM          | Summit: IBM Power9 22c, Nvidia V100, Mellanox EDR                     | USA         | 2,414,592  | 148.60        | 10.1
2    | DOE/NSA/LLNL                             | IBM          | Sierra: IBM Power9 22c, Nvidia V100, Mellanox EDR                     | USA         | 1,572,480  | 94.64         | 7.43
3    | National SuperComputer Center in Wuxi    | NRCPC        | Sunway TaihuLight, Sunway SW26010, 260C, 1.45 GHz                     | China       | 10,649,600 | 93.01         | 15.37
4    | National SuperComputer Center in Tianjin | NUDT         | Tianhe-2, NUDT TH MPP, Xeon E5 2691 and Xeon Phi 31S1                 | China       | 4,981,760  | 61.44         | 18.48
5    | Texas Advanced Computing Center (TACC)   | Dell         | Frontera, Dell C6420, Intel Xeon 8280 28c 2.7 GHz, Infiniband HDR     | USA         | 448,448    | 23.52         | 2.38
6    | Swiss National Supercomputing Centre     | Cray         | Piz Daint, Cray XC-50, Xeon E5-2690 12C (2.6 GHz) + NVIDIA Tesla P100 | Switzerland | 387,872    | 21.23         | 2.30
7    | DOE/NNSA/LANL/SNL                        | Cray         | Cori, Cray XC-40, Intel Xeon E5-2698, 16c, Aries                      | USA         | 979,072    | 20.16         | 7.58
8    | AIST, Japan                              | Fujitsu      | Intel Xeon 6148, 20c, Tesla V100 SXM2, Infiniband EDR                 | Japan       | 391,680    | 19.88         | 1.65

Top 500 List: www.top500.org
Supercomputing Systems & Applications are Challenging!

[Figure: example application domains – Astrophysics (Cosmic Millennium simulation), Social Network Analysis, Climate and Weather Modeling, Computational Fluid Dynamics]
