Parallel Computing Architectures
Lecture 03
Parallel Computing
[Diagram: an application/problem exposes parallelism, is decomposed into tasks, and the tasks are computed on hardware whose cores and processors provide the physical parallelism]
Computer architecture
Key concepts about how modern computers work
Concerns in parallel execution
Challenges of accessing memory
Understanding these architecture basics helps you:
Understand and optimize the performance of your parallel programs
Gain intuition about what workloads might benefit from fast parallel machines
Outline
Processor Architecture and Technology Trends
Various forms of parallelism
Flynn’s Parallel Architecture Taxonomy
Architecture of Multicore Processors
Memory Organization
Distributed-memory Systems
Shared-memory Systems
Hybrid (Distributed-Shared Memory) Systems
Concurrency vs. Parallelism
Concurrency:
Two or more tasks can start, run, and complete in overlapping time periods
They might not be running (executing on CPU) at the same instant
Two or more execution flows make progress at the same time by interleaving their executions or by executing instructions (on CPU) at exactly the same time
Parallelism:
Two or more tasks can run (execute) simultaneously, at the exact same time
Tasks do not only make progress, they actually execute simultaneously
Source of Processor Performance Gain
Parallelism in various forms is the main source of performance gain
Let us understand parallelism at the:
Bit Level (single processor)
Instruction Level (single processor)
Thread Level (single processor)
Processor Level (multiple processors):
Shared Memory
Distributed Memory
Bit Level Parallelism
Parallelism via increasing the processor word size
Word size may mean:
Unit of transfer between processor and memory
Memory address space capacity
Integer size
Single precision floating point number size
Word size trend:
Varied in the 50s to 70s
For x86 processors:
16 bits (8086 – 1978)
32 bits (80386 – 1985)
64 bits (Pentium 4 / Opteron – 2003)
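To make the idea concrete, here is a minimal sketch (hypothetical example, not from the slides) of why a wider word helps: on a machine with a 32-bit word, a 64-bit addition must be emulated with two 32-bit adds plus carry handling, whereas a 64-bit machine does it in one instruction.

```c
#include <stdint.h>

/* Hypothetical sketch: emulating a 64-bit add using 32-bit words. */
uint64_t add64_on_32bit_words(uint32_t a_lo, uint32_t a_hi,
                              uint32_t b_lo, uint32_t b_hi) {
    uint32_t lo    = a_lo + b_lo;            /* add the low halves          */
    uint32_t carry = (lo < a_lo) ? 1u : 0u;  /* detect unsigned overflow    */
    uint32_t hi    = a_hi + b_hi + carry;    /* add the high halves + carry */
    return ((uint64_t)hi << 32) | lo;
}

/* With a 64-bit word size, the same operation is a single add instruction. */
uint64_t add64_on_64bit_word(uint64_t a, uint64_t b) {
    return a + b;
}
```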
Instruction Level Parallelism
Execute instructions in parallel:
Pipelining (parallelism across time)
Superscalar (parallelism across space)
Pipelining:
Split instruction execution in multiple stages, e.g.
Fetch (IF), Decode (ID), Execute (EX), Write-Back (WB)
Allow multiple instructions to occupy different stages in the same
clock cycle
Provided there are no data / control dependencies
Number of pipeline stages == Maximum achievable speedup
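To see why the number of stages is the maximum achievable speedup, compare execution times for n instructions on a k-stage pipeline, assuming one cycle per stage and no stalls (a standard idealized derivation, not a claim about real hardware):

```latex
\[
T_{\text{seq}} = n \cdot k \text{ cycles}, \qquad
T_{\text{pipe}} = k + (n - 1) \text{ cycles}
\]
\[
S = \frac{T_{\text{seq}}}{T_{\text{pipe}}} = \frac{n k}{k + n - 1}
\;\longrightarrow\; k \quad \text{as } n \to \infty
\]
```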
Pipelined Execution: Illustration
[Diagram: non-pipelined vs. pipelined execution — instruction stages IF, ID, EX, WB over time; in the pipelined case, consecutive instructions of the program flow occupy different stages in the same cycle]
Disadvantages and related issues: independence requirements, bubbles, hazards (data and control flow), speculation, out-of-order execution, read-after-write dependencies
The end of ILP
Processor clock rate
stops increasing
No further benefit from ILP
Instruction Level Parallelism: Superscalar
Duplicate the pipelines:
Allow multiple instructions to pass through the same stage
Scheduling is challenging (decide which instructions can be
executed together):
Dynamic (Hardware decision)
Static (Compiler decision)
Most modern processors are superscalar
e.g. each Intel Core i7 core has 14 pipeline stages and can execute 6 micro-ops in
the same cycle
Superscalar Execution: Illustration
[Diagram: 2-wide superscalar execution — two instructions of the program flow enter the IF, ID, EX, WB stages in each cycle]
Disadvantage: structural hazards
Metrics: cycles per instruction (CPI) and instructions per cycle (IPC)
Pipelined Processor:
Determines what instruction to run next
Execution unit: performs the operation described by an instruction
Registers: store values of variables used as inputs and outputs to operations
Superscalar Processor:
Automatically finds independent instructions in an instruction sequence and can execute them in parallel on execution units
Instructions come from the same execution flow (thread)
Instruction Level Parallelism: SIMD
Superscalar: decode and execute multiple (e.g. two) instructions per clock
Issue: not enough ILP, but math operations are long
SIMD: add ALUs to increase compute capability
Single instruction, multiple data: the same instruction is broadcast to and executed by all ALUs
Thread Level Parallelism: Motivation
Instruction-level parallelism is limited
For typical programs only 2-3 instructions can be executed in parallel
(either pipelined / superscalar )
Due to data/control dependencies
Multithreading was originally a software mechanism:
Allows multiple parts of the same program to execute concurrently
Key idea:
The processor can execute the threads in parallel
Superscalar:
Registers store values of variables used as inputs and outputs to operations from one execution flow
SMT – 2 hardware threads:
Can run one scalar instruction per clock from either of the hardware threads
Two logical cores
Thread Level Parallelism in the Processor
Processor can provide hardware support for multiple "thread contexts"
Known as simultaneous multithreading (SMT)
Information specific to each thread, e.g. Program Counter, Registers,
etc
Software threads can then execute in parallel
Many implementation approaches
Example:
Intel processors with hyper-threading technology, e.g. each i7 core
can execute 2 threads at the same time
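As a quick way to observe this from software, the following minimal sketch (an assumption: a POSIX system such as Linux) queries the number of logical CPUs the OS exposes; on a hyper-threaded machine this counts hardware threads, e.g. 2 per physical core.

```c
#include <stdio.h>
#include <unistd.h>

/* Sketch: report the number of logical CPUs (hardware threads) visible to the OS. */
int main(void) {
    long logical_cpus = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical CPUs (hardware threads): %ld\n", logical_cpus);
    return 0;
}
```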
Processor Level Parallelism (Multiprocessing)
Add more cores to the processor
The application should have multiple execution flows
Each process/thread has an independent context that can be
mapped to multiple processor cores
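A minimal sketch of what "multiple execution flows" means in code (hypothetical example using POSIX threads, compiled with -pthread): the OS is free to map each thread to a different core so they run in parallel.

```c
#include <pthread.h>
#include <stdio.h>

/* Each thread is an independent execution flow with its own context. */
static void *task(void *arg) {
    long id = (long)arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, task, (void *)i);  /* create 4 execution flows */
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);                      /* wait for them to finish  */
    return 0;
}
```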
Flynn's Parallel Architecture Taxonomy
One commonly used taxonomy of parallel architecture:
Based on the parallelism of instructions and data streams in the most
constrained component of the processor
Proposed by M. Flynn in 1972(!)
Instruction stream:
A single execution flow
i.e. a single Program Counter (PC)
Data stream:
Data being manipulated by the instruction stream
Single Instruction Single Data (SISD)
A single instruction stream is executed
Each instruction works on a single data item
Most uniprocessors fall into this category
[Diagram: a single instruction stream and a single data stream feeding one processing unit]
Single Instruction Multiple Data (SIMD)
A single stream of instructions
For example: one program counter
Each instruction works on multiple data items
Popular model for supercomputers during the 1980s:
Exploits data parallelism; commonly known as vector processors
Modern processors have some form of SIMD:
E.g. the SSE, AVX instructions in Intel x86 processors
[Diagram: one instruction shared by many PUs, each processing its own data]
SIMD nowadays
Data parallel architectures
AVX instructions
GPGPUs
Same instruction broadcasted to all ALUs
AVX: Intrinsic functions operate on vectors of four
64-bit values (e.g., vector of 4 doubles)
Not great for divergent execution
Original program
Processes one array element using scalar instructions on
scalar registers (e.g., 32-bit floats)
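The actual code is not reproduced in this handout; a minimal sketch of such a scalar loop (hypothetical example, scaling an array of 32-bit floats one element per iteration) might look like:

```c
/* Scalar version: one 32-bit float is processed per loop iteration. */
void scale(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i];
    }
}
```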
Vector program (using AVX intrinsics)
Intrinsic functions operate on vectors of
eight 32-bit values (e.g., vector of floats)
Processes eight array elements
simultaneously using vector instructions on
256-bit vector registers
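Again as a hedged sketch (hypothetical example, same loop as above, assuming n is a multiple of 8 and AVX is enabled at compile time), the AVX-intrinsics version processes eight floats per instruction:

```c
#include <immintrin.h>

/* AVX version: eight 32-bit floats are processed per iteration using 256-bit registers. */
void scale_avx(int n, float a, const float *x, float *y) {
    __m256 va = _mm256_set1_ps(a);               /* broadcast a into all 8 lanes    */
    for (int i = 0; i < n; i += 8) {
        __m256 vx = _mm256_loadu_ps(&x[i]);      /* load 8 floats (256 bits)        */
        __m256 vy = _mm256_mul_ps(va, vx);       /* 8 multiplies in one instruction */
        _mm256_storeu_ps(&y[i], vy);             /* store 8 results                 */
    }
}
```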
Multiple Instruction Single Data (MISD)
Multiple instruction streams
All instruction streams work on the same data at any time
No actual implementation except for the systolic array
Multiple Instruction Multiple Data (MIMD)
Each PU fetches its own instructions
Each PU operates on its own data
Currently the most popular model for multiprocessors
Variant – SIMD + MIMD
Stream processors (NVIDIA GPUs)
A set of threads executes the same code (effectively SIMD)
Multiple sets of threads execute in parallel (effectively MIMD at this level)
MULTICORE ARCHITECTURE
Architecture of Multicore Processors
Hierarchical design
Pipelined design
Network-based design
Hierarchical Design
Multiple cores share multiple caches
Cache size increases from the leaves to the root
Each core can have a separate L1 cache and share the L2 cache with other cores
All cores share the common external memory
[Diagram: cache hierarchy — per-core L1 caches, shared L2, shared L3 in front of external memory]
Usages:
Standard desktop and server processors
Graphics processing units
Hierarchical Design - Examples
Each core is a sophisticated, out-of-order processor to maximize ILP
Quad-Core AMD Opteron Intel Quad-Core Xeon
Pipelined Design
Data elements are processed by multiple execution cores in a
pipelined way
Useful if the same computation steps have to be applied to a long
sequence of data elements
E.g. processors used in routers
and graphics processors
Example: Pipelined Design
Xelerator X11 network processor
Network-Based Design
Cores and their local caches and memories are connected via
an interconnection network
SUN Niagara 2 (UltraSPARC T2)
Future Trends
Efficient on-chip interconnection
Enough bandwidth for data transfers between the cores
Scalable
Robust to tolerate failures
Efficient energy management
Reduce memory access time
Key word: Network on Chip (NoC)
MEMORY ORGANIZATION
Parallel Computer Component
Typical uniprocessor components:
Processor
One or more levels of caches
Memory module
Other (e.g. I/O)
[Diagram: uniprocessor consisting of a core, cache, and memory]
These components are similarly present in a parallel computer setup
Processors in a parallel computer system are also commonly known as processing elements
Recap: why do modern processors have cache?
Processors run efficiently when data is resident in caches
Caches reduce memory access latency and provide high-bandwidth data transfer to the CPU
Recap: memory latency and bandwidth
Memory latency: the amount of time for a memory request
(e.g., load, store) from a processor to be serviced by the
memory system
Example: 100 cycles, 100 nsec
Memory bandwidth: the rate at which the memory system can
provide data to a processor
Example: 20 GB/s
Processor “stalls” when it cannot run the next instruction in an
instruction stream because of a dependency on a previous
instruction
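As a rough worked example using the figures above (illustrative numbers only): transferring one 64-byte cache line at 20 GB/s takes a few nanoseconds, so the 100 ns access latency dominates a single request unless many requests are kept in flight.

```latex
\[
t_{\text{transfer}} = \frac{64~\text{B}}{20~\text{GB/s}} = 3.2~\text{ns}
\qquad \ll \qquad
t_{\text{latency}} \approx 100~\text{ns}
\]
```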
Execution on a Processor (one add per clock)
[Diagram: execution timeline with one add per clock; a memory request is in flight in every cycle]
The memory system is occupied 100% of the time!
In Modern Computing, Bandwidth is the Critical Resource
Performant parallel programs should:
Organize computation to fetch data from memory less often
Reuse data previously loaded by the same thread (temporal locality optimizations; see the sketch after this list)
Share data across threads (inter-thread cooperation)
Favor performing additional arithmetic to storing/reloading values
(the math is “free”)
Main point: programs must access memory infrequently to utilize
modern processors efficiently
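A minimal sketch of the "fetch data from memory less often" idea (hypothetical example; the function and array names are made up): fusing two passes over an array so each input element is loaded once and reused instead of being reloaded.

```c
/* Two separate passes: x[i] is read from memory in both loops,
 * and y[i] is written in pass 1 and read back in pass 2. */
void two_passes(int n, const float *x, float *y) {
    for (int i = 0; i < n; i++) y[i] = 2.0f * x[i];   /* pass 1 */
    for (int i = 0; i < n; i++) y[i] = y[i] + x[i];   /* pass 2 reloads x[i] and y[i] */
}

/* Fused pass: each x[i] is loaded once and reused from a register;
 * the extra arithmetic is cheap compared to the saved memory traffic. */
void fused_pass(int n, const float *x, float *y) {
    for (int i = 0; i < n; i++) {
        float xi = x[i];
        y[i] = 2.0f * xi + xi;
    }
}
```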
Memory Organization of Parallel Computers
Parallel Computers:
Distributed-Memory Multicomputers
Shared-Memory Multiprocessors:
Uniform Memory Access (UMA)
Non-Uniform Memory Access (NUMA)
Cache-Only Memory Access (COMA)
Hybrid (Distributed-Shared Memory)
Distributed-Memory Systems
[Diagram: nodes, each containing a processor, cache, and local memory, connected by an interconnection network]
Each node is an independent unit:
With processor, memory and, sometimes, peripheral elements
Physically distributed memory modules:
Memory in a node is private
Shared Memory System
[Diagram: processing units PU1, PU2, PU3 and I/O access memory through a shared memory provider]
Parallel programs / threads access memory through the shared memory provider
The program is unaware of the actual hardware memory architecture
Key issues: cache coherence and memory consistency
Cache Coherence
Multiple copies of the same data may exist in different caches
A local update by one processing unit means the other PUs should not see the stale (unchanged) data
[Diagram: PU1 and PU3 read u = 5 from memory into their caches (steps 1, 2); PU3 then writes u = 7 in its cache (step 3); subsequent reads of u by PU1 and PU2 (steps 4, 5) must not return the stale value 5]
Further Classification – Shared Memory
Two factors can further differentiate shared memory systems:
Processor to Memory Delay (UMA / NUMA)
Whether delay to memory is uniform
Presence of a local cache with cache coherence protocol (CC/NCC):
Same shared variable may exist in multiple caches
Hardware ensures correctness via cache coherence protocol
Uniform Memory Access (Time) (UMA)
[Diagram: UMA organization for a multiprocessor — every PU (with its cache) accesses a single shared memory through the shared memory provider]
Latency of accessing the main memory is the same for every processor:
Uniform access time, hence the name
Suitable only for a small number of processing units, due to contention
Non-Uniform Memory Access (NUMA)
[Diagram: NUMA organization — each PU (with its cache) has a local memory; the memories are linked by the interconnection (shared memory provider)]
The physically distributed memories of all processing units are combined to form a global shared-memory address space:
Also called distributed shared memory
Accessing local memory is faster than accessing remote memory for a processor:
Non-uniform access time
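One common NUMA-aware technique is "first touch" placement: a minimal sketch (an assumption about typical OS behavior, using OpenMP in C) that initializes an array with the same static thread layout that later computes on it, so each thread's pages end up in its local memory.

```c
#include <omp.h>

/* First-touch initialization: most OSes place a page on the NUMA node of the
 * core that first writes it, so each thread "touches" the chunk it will use later. */
void init_first_touch(double *a, long n) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++) {
        a[i] = 0.0;
    }
}
```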
AMD Ryzen
Example: Multicore NUMA
[Diagram: NUMA system built from multicore processors — each node is a multicore processor P (cores 1 to n) with its own cache and local memory, and the nodes are connected by the interconnection (shared memory provider)]
ccNUMA
[Diagram: ccNUMA — each processor has a cache and a local memory; all nodes are connected by the interconnection (shared memory provider)]
Cache Coherent Non-Uniform Memory Access
Each node has cache memory to reduce contention
COMA
[Diagram: COMA — each processor has only cache memory (no separate main memory); the caches are connected by the interconnection network]
Cache Only Memory Architecture
Each memory block works as cache memory
Data migrates dynamically and continuously according to the cache
coherence scheme
Summary: Shared Memory Systems
Advantages:
No need to partition code or data
No need to physically move data among processors:
Communication is efficient
Disadvantages:
Special synchronization constructs are required (see the sketch after this list)
Lack of scalability due to contention
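As an example of the "special synchronization constructs" disadvantage, here is a minimal sketch (hypothetical example using a POSIX mutex; the names worker and counter are made up) that protects a shared counter so concurrent increments do not race:

```c
#include <pthread.h>

long counter = 0;                                    /* shared data            */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;    /* synchronization object */

void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        pthread_mutex_lock(&lock);    /* only one thread enters at a time */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}
```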
Summary
Goal of parallel architecture is to reduce the average time to
execute an instruction
Various forms of parallelism
Different types of multicore processors
Different types of parallel systems and different memory
systems
Reading
CS149 - PARALLEL COMPUTING class at Stanford
Kayvon Fatahalian and Kunle Olukotun
Platform 2015: Intel Processor and Platform Evolution for the
Next Decade
Intel white paper, 2005
https://www.brendangregg.com/blog/2023-03-01/computer-performance-future-2022.html