
Introduction to Parallel Architecture

R. Govindarajan
Indian Institute of Science,
Bangalore, INDIA
govind@iisc.ac.in

Overview
• Introduction
• Pipelining, Instruction Level Parallelism
• Multicore Architectures
• Multiprocessor Architecture
– Shared Address Space
– Distributed Address Space
• Accelerators – GPUs
• Supercomputer Systems

Introduction
• Parallelism everywhere
  – Pipelining, Instruction-Level Parallelism
  – Vector Processing
  – Array processors/MPP
  – Multiprocessor Systems
  – Multicomputers/cluster computing
  – Multicores
  – Graphics Processing Units (GPUs) and other Accelerators
Basic Computer Organization
[Figure: basic computer organization – a CPU (ALU, registers, control unit, cache, MMU) connected over a system bus to memory and I/O devices]
Overview
• Introduction
• Pipelining, Instruction Level Parallelism
• Multicore Architectures
• Multiprocessor Architecture
– Shared Address Space
– Distributed Address Space
• Accelerators – GPUs
• Supercomputer Systems

Pipelined Processor
• Pipelining instruction execution
  – Instrn. Fetch, Decode/Reg. Fetch, Execute, Memory, and WriteBack stages
• Why pipelined execution?
  – Improves instruction throughput
  – Ideal: 1 instruction every cycle!
Processor Datapath

[Figure: 5-stage pipelined datapath – PC and instruction memory, register file, ALU (with zero test), data memory, and sign-extend unit; the stages are Instrn. Fetch (IF), Instrn. Decode (ID), Execution (EX), Memory (MEM), and Write Back (WB)]
Pipelined Execution

  time →
  i1:  IF   ID   EX   MEM  WB
  i2:       IF   ID   EX   MEM  WB
  i3:            IF   ID   EX   MEM  WB
  i4:                 IF   ID   EX   MEM  WB

• Execution time of an instruction is still 5 cycles, but throughput is now 1 instruction per cycle
• Initial pipeline fill time (4 cycles), after which 1 instruction completes every cycle
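A quick back-of-the-envelope check (not on the original slide; it assumes one cycle per stage, as in the diagram):

  Time for N instructions on a k-stage pipeline = (k − 1) + N cycles
  Speedup over unpipelined execution = k·N / ((k − 1) + N), which approaches k for large N
  For the 4 instructions above (k = 5): (5 − 1) + 4 = 8 cycles vs. 5 × 4 = 20 cycles unpipelined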
Memory Hierarchy
• (Pipelined) Instruction execution assumes fetching instructions and data from memory in a single cycle
  – Memory access takes several processor cycles!
• Instruction-level parallelism requires multiple instrns. and data to be fetched in the same cycle
• The memory hierarchy is designed to address this!
• The memory hierarchy exploits locality of reference
  – Temporal locality
  – Spatial locality
  – Locality in instructions and data
Memory Hierarchy

[Figure: levels of the memory hierarchy]
Memory Hierarchy: Caches

[Figure: CPU and MMU backed by split L1 I-Cache and D-Cache (1–2 cycles), a unified L2 cache (5–10 cycles), and main memory (100–300 cycles)]

• Avg. Memory Access Time (with one level of cache):
  AMAT = hit time of L1 + miss-rate at L1 * miss-penalty at L1
• Avg. Memory Access Time (with two levels of cache):
  AMAT = hit time of L1 + miss-rate at L1 * (hit time of L2 + miss-rate at L2 * miss-penalty at L2)
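A quick numeric illustration (the rates and latencies here are assumed for the example, not taken from the slide): with a 1-cycle L1 hit time, 5% L1 miss rate, 10-cycle L2 hit time, 20% L2 miss rate, and 200-cycle memory access,

  AMAT = 1 + 0.05 × (10 + 0.20 × 200) = 1 + 0.05 × 50 = 3.5 cycles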
Instruction Level Parallelism
• Multiple independent instructions issued/executed together
• Why?
  – Improve throughput (Instrns. Per Cycle, or IPC) beyond 1
• How are independent instructions identified? (see the short example below)
  – Hardware – Superscalar processor
  – Compiler – VLIW processor
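A minimal C sketch (illustrative only, not from the slides) of the distinction the hardware or compiler must uncover: statements with no data dependences can be issued together, while a chain of true dependences cannot.

```c
#include <stdio.h>

int main(void) {
    int x = 4, y = 5, z = 6;

    /* Independent statements: a superscalar core (hardware scheduling)
     * or a VLIW compiler (static scheduling) can issue all three in the
     * same cycle, since no statement uses another's result. */
    int a = x * 2;
    int b = y + 3;
    int c = z - 1;

    /* True data dependencies: each statement consumes the previous
     * result, so they must be issued in successive cycles. */
    int d = x * 2;
    int e = d + 3;
    int f = e - 1;

    printf("%d %d %d %d\n", a, b, c, f);
    return 0;
}
```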
Superscalar Execution Model

[Figure: superscalar execution model – instructions from the static program are fetched & decoded, dispatched into an instruction window, issued to execution units when their operands are ready (subject to true data dependencies), and finally reordered & committed]
Superscalar Overview
[Figure: superscalar processor organization – memory interface, pre-decode logic, instruction cache, instruction buffer, decode/rename/dispatch stage, instruction queues feeding integer and floating-point functional units, branch predictor, register file, and reorder buffer]
Overview
• Introduction
• Pipelining, Instruction Level Parallelism
• Multicore Architectures
• Multiprocessor Architecture
– Shared Address Space
– Distributed Address Space
• Accelerators – GPUs
• Supercomputer Systems

Parallelism in Processor
• Pipelined processor
• Instruction-Level Parallelism
• What next? Multicore processors
– Multiple processors in a single chip
• Why?
– To improve performance of a single program
– To execute multiple processes on different cores

Multicore Processors

[Figure: examples of multicore processor chips]
Multicore Processor

[Figure: a quad-core chip – cores C0–C3, each with private L1 and L2 caches, sharing an L3 cache, with an integrated memory controller (IMC) to memory and a QPI link off-chip]
Overview
• Introduction
• Pipelining, Instruction Level Parallelism
• Multicore Architectures
• Multiprocessor Architecture
– Shared Address Space
– Distributed Address Space
• Accelerators – GPUs
• Supercomputer Systems

Classification of Parallel Machines

Flynn's Classification: in terms of the number of instruction streams and data streams
• SISD: Single Instruction Single Data
• SIMD: Single Instruction Multiple Data
• MISD: Multiple Instruction Single Data
• MIMD: Multiple Instruction Multiple Data


SIMD Machines

• Vector Processors
  – Single instruction on multiple data (elements of a vector – temporal)
• Array Processors
  – Single instruction on multiple data (elements of a vector / array – spatial)
• Modern Processors
  – AVX / MMX instructions
• Graphics Processing Units
  – Multiple SIMD cores in each streaming multiprocessor
MIMD Machines

Parallel Architecture:
• Shared Memory
  – Centralized shared memory (UMA)
  – Distributed Shared Memory (NUMA)
• Distributed Memory
  – a.k.a. Message passing
  – E.g., clusters

Programming Models:
• What the programmer uses when coding applns.
• Specify synchronization and communication
• Examples:
  – Shared address space, e.g., OpenMP (a small sketch follows below)
  – Message passing, e.g., MPI
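As a concrete taste of the shared-address-space model mentioned above, here is a minimal OpenMP sketch in C (an illustration, not taken from the slides; compile with OpenMP support, e.g. -fopenmp):

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    /* Three arrays in the single shared address space seen by all threads. */
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* The iterations are split among the threads; every thread reads and
     * writes a, b, c directly -- no explicit data movement is needed. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %.1f, max threads = %d\n", c[N - 1], omp_get_max_threads());
    return 0;
}
```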
Shared Memory Architecture

[Figure: two shared-memory organizations. Uniform Memory Access (UMA) / centralized shared memory: processors with private caches share memory modules through a common interconnection network. Non-Uniform Memory Access (NUMA) / distributed shared memory: each processor has a local memory, and remote memories are reached through the network.]
UMA Architecture

[Figure: UMA example – eight cores (C0–C7), each with a private L1 cache, pairs of cores sharing an L2 cache, and a single memory shared by all cores]
NUMA Architecture

[Figure: NUMA example – two sockets, each with four cores (private L1 and L2 caches per core), a shared L3 cache, and an integrated memory controller (IMC) to its local memory; the sockets are connected by QPI links]
Caches in Shared Memory
• Reduce average latency
  – automatic replication of data closer to the processor
• What happens when a store and a load to the same location are executed on different processors?
  ⇒ Cache Coherence Problem
Cache Coherence Problem
[Figure: processors P1 and P2 with private caches share location X, initially 0 in memory. P1 reads X, then writes X = 1 in its own cache. When P2 subsequently reads X, a hit on its stale cached copy – or a miss served from the unwritten memory copy – returns X = 0: wrong data!]
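A minimal C sketch (illustrative, not from the slides) of the sharing pattern above: one thread stores to X while another loads it. Hardware cache coherence is what propagates the update between the two processors' caches; the C11 atomics merely keep the program itself well defined. Compile with -pthread.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int X = 0;   /* shared location, initially 0, cached by both cores */

void *writer(void *arg) {
    (void)arg;
    atomic_store(&X, 1);          /* "P1": Write X = 1 */
    return NULL;
}

void *reader(void *arg) {
    (void)arg;
    while (atomic_load(&X) == 0)  /* "P2": Read X until the update is visible */
        ;
    printf("reader observed X = %d\n", atomic_load(&X));
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t2, NULL, reader, NULL);
    pthread_create(&t1, NULL, writer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```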
Cache Coherence Solutions

• Snoopy Protocol: shared-bus interconnect where all cache controllers monitor all bus activity
  – Cache controllers take corrective action based on traffic in the interconnection network
  – Corrective action: update or invalidate a cache block
• Directory-Based Protocols: cache controllers maintain info. on the shared copies of a cache block
  – Send invalidation/update messages to the copies
Distributed Memory Architecture (Cluster)

[Figure: nodes, each with a processor, cache, and local memory, connected through a network]

• Message Passing Architecture
  – Memory is private to each node
  – Processes communicate by messages (see the MPI sketch below)
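A minimal MPI sketch in C of the message-passing model just described (an illustration, not from the slides): each process owns its memory, and data moves only via explicit send/receive. Run with at least two ranks, e.g. mpirun -np 2 ./a.out.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0) printf("run with at least 2 ranks\n");
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {
        int value = 42;                      /* data private to rank 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("rank 0 sent %d\n", value);
    } else if (rank == 1) {
        int value;                           /* rank 1's own private copy */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```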
NUMA Architecture

[Figure: the same two-socket NUMA node as before, now with an I/O hub attached over QPI and a NIC through which the node connects to a cluster network]
Distributed Memory Architecture

[Figure: four nodes (Node 0–Node 3), each with its own memory and NIC, connected through a network switch]
Interconnection Network
• Processors and memory modules are connected to each other through an interconnection network
• Indirect interconnects: nodes are connected to the interconnection medium, not directly to each other
  – Shared bus, multiple bus, crossbar, MIN
• Direct interconnects: nodes are connected directly to each other
  – Topology: linear, ring, star, mesh, torus, hypercube
  – Routing techniques: how the route taken by a message from source to destination is decided (a small routing sketch follows below)
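As one concrete example of a routing technique for the hypercube topology listed above, here is a small C sketch of dimension-order (e-cube) routing, where the route is determined by the bit positions in which the source and destination node IDs differ (a standard technique, offered as an illustration rather than something taken from the slides):

```c
#include <stdio.h>

/* Print the hops of a dimension-order (e-cube) route in an n-cube:
 * at each step, flip the lowest-order bit in which the current
 * node ID and the destination ID still differ. */
void ecube_route(unsigned src, unsigned dst, unsigned dims) {
    unsigned cur = src;
    printf("%u", cur);
    for (unsigned d = 0; d < dims; d++) {
        unsigned bit = 1u << d;
        if ((cur ^ dst) & bit) {      /* do they differ in dimension d? */
            cur ^= bit;               /* traverse the link in dimension d */
            printf(" -> %u", cur);
        }
    }
    printf("\n");
}

int main(void) {
    ecube_route(1, 6, 3);   /* 3-cube: 001 -> 000 -> 010 -> 110 */
    return 0;
}
```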
Indirect Network Topology

[Figure: indirect network topologies – shared bus, multiple bus, crossbar switch, and a multistage interconnection network built from 2x2 crossbars]
Direct Interconnect Topology

[Figure: direct network topologies – linear array, ring, star, 2D mesh, torus, and hypercube (binary n-cube, shown for n = 2 and n = 3)]
Overview
• Introduction
• Pipelining, Instruction Level Parallelism
• Multicore Architectures
• Multiprocessor Architecture
– Shared Address Space
– Distributed Address Space
• Accelerators – GPUs
• Supercomputer Systems

Accelerators and Manycore Architectures
Accelerator – Fermi S2050

[Figure: NVIDIA Fermi S2050 GPU accelerator]
Combining CPU and GPU Arch.
• CPU: 8 cores @ 3 GHz, 0.38 TFLOPS
• GPU: 2880 CUDA cores @ 1.67 GHz, 1.5 TFLOPS (DP)

[Figure: CPU (cores C0–C7 with a shared last-level cache, IMC to memory, and QPI) connected to the GPU over PCI-e]
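A rough sanity check of the CPU number (the 16 FLOPs/cycle figure is an assumption, e.g. two 4-wide double-precision FMA units per core):

  8 cores × 3 GHz × 16 FLOPs/cycle = 384 GFLOPS ≈ 0.38 TFLOPS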
Heterogeneous Clusters with GPUs

[Figure: a cluster of GPU-equipped nodes – each node has its own memory and NIC, and the nodes are connected through a network switch]
Overview
• Introduction
• Pipelining, Instruction Level Parallelism
• Multicore Architectures
• Multiprocessor Architecture
– Shared Address Space
– Distributed Address Space
• Accelerators – GPUs
• Supercomputer Systems

What is a Supercomputer?

• A hardware and software system that provides close to the maximum performance that can currently be achieved
• What was a supercomputer a few (5) years ago is probably an order of magnitude slower than today's supercomputer systems!
• Therefore, the term "high performance computing" is also used to refer to Supercomputing!
Era of Supercomputing
• Introduction of the Cray-1 in 1976 ushered in the era of Supercomputing
  – Shared memory, vector processing
  – Good software environment
  – A few 100 MFLOPS peak
  – Cost about $5 million

[Figure: the Cray-1 (1976) compared with the Apple iPhone (2007)]
Performance of Supercomputer

• What are the top 10 or top 500 computers?


– www.top500.org
– Updated every 6 months
– Measured using Rmax of Linpack (solving Ax = b )
• What is the trend?
  Year | #1 system (GFLOPS) | #500 system (GFLOPS)
  1993 | 59.7               | 0.422
  2018 | 143,500,000        | 874,800

  An improvement of roughly 2.4 million times (~2,397,824x – 2,403,685x)!
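The headline improvement factor follows directly from the #1 entries in the table:

  143,500,000 GFLOPS / 59.7 GFLOPS ≈ 2.4 × 10^6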
Components of a Supercomputer

[Figure: the components of a supercomputer, scaling from terascale to exascale – the Processor (what gives the system its compute power), the Interconnect (the connection between building blocks), and the Architecture (how the building blocks are put together), held together by Software]
The TOP 500 (Nov. 2019)

Rank | Site                                     | Manufacturer | Computer                                                              | Country     | Cores      | Rmax [PFlops] | Power [MW]
1    | Oak Ridge National Labs, DOE/SC/ORNL     | IBM          | Summit: IBM Power9 22c, Nvidia V100, Mellanox EDR                     | USA         | 2,414,592  | 148.60        | 10.1
2    | DOE/NSA/LLNL                             | IBM          | Sierra: IBM Power9 22c, Nvidia V100, Mellanox EDR                     | USA         | 1,572,480  | 94.64         | 7.43
3    | National SuperComputer Center in Wuxi    | NRCPC        | Sunway TaihuLight, Sunway SW26010, 260C, 1.45 GHz                     | China       | 10,649,600 | 93.01         | 15.37
4    | National SuperComputer Center in Tianjin | NUDT         | Tianhe-2, NUDT TH MPP, Xeon E5 2691 and Xeon Phi 31S1                 | China       | 4,981,760  | 61.44         | 18.48
5    | Texas Advanced Computing Center (TACC)   | Dell         | Frontera, Dell C6420, Intel Xeon 8280 28c 2.7 GHz, Infiniband HDR     | USA         | 448,448    | 23.52         | 2.38
6    | Swiss National Supercomputing Centre     | Cray         | Piz Daint, Cray XC-50, Xeon E5-2690 12C (2.6 GHz) + NVIDIA Tesla P100 | Switzerland | 387,872    | 21.23         | 2.30
7    | DOE/NNSA/LANL/SNL                        | Cray         | Cori, Cray XC-40, Intel Xeon E5-2698, 16c, Aries                      | USA         | 979,072    | 20.16         | 7.58
8    | AIST, Japan                              | Fujitsu      | Intel Xeon 6148, 20c, Tesla V100 SXM2, Infiniband EDR                 | Japan       | 391,680    | 19.88         | 1.65

Top 500 List: www.top500.org
Supercomputing Systems & Applications are Challenging!

[Figure: example application domains – Astrophysics (Cosmic Millennium simulation), Social Network Analysis, Climate and Weather Modeling, Computational Fluid Dynamics]
