Parallel Computing Architectures
Lecture 03
Parallel Computing
[Diagram: an application/problem exposes parallelism, is decomposed into tasks, and the tasks are computed on hardware whose cores and processors provide the physical parallelism]
Computer architecture
Key concepts about how modern computers work
Concerns in parallel execution
Challenges of accessing memory
Understanding these architecture basics helps you:
Understand and optimize the performance of your parallel programs
Gain intuition about what workloads might benefit from fast parallel machines
Outline
Processor Architecture and Technology Trends
Various forms of parallelism
Flynn’s Parallel Architecture Taxonomy
Architecture of Multicore Processors
Memory Organization
Distributed-memory Systems
Shared-memory Systems
Hybrid (Distributed-Shared Memory) Systems
Concurrency vs. Parallelism
Concurrency:
Two or more tasks can start, run, and complete in overlapping time periods
They might not be running (executing on CPU) at the same instant
Two or more execution flows make progress at the same time by interleaving their executions or by executing instructions (on CPU) at exactly the same time
Parallelism:
Two or more tasks can run (execute) simultaneously, at the exact same time
Tasks do not only make progress, they actually execute simultaneously
Source of Processor Performance Gain
Parallelism in various forms is the main source of performance gain
Let us understand parallelism at the:
Bit Level (single processor)
Instruction Level (single processor)
Thread Level (single processor)
Processor Level (multiple processors):
Shared Memory
Distributed Memory
Bit Level Parallelism
Parallelism via increasing the processor word size
Word size may mean:
Unit of transfer between processor and memory
Memory address space capacity
Integer size
Single precision floating point number size
Word size trend:
Varied in the 50s to 70s
For x86 processors:
16 bits (8086 – 1978)
32 bits (80386 – 1985)
64 bits (Pentium 4 / Opteron – 2003)
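To make the idea concrete, here is a minimal sketch (hypothetical example, not from the slides) of why a wider word helps: on a machine with a 32-bit word, a 64-bit addition must be emulated with two 32-bit adds plus carry handling, whereas a 64-bit machine does it in one instruction.

```c
#include <stdint.h>

/* Hypothetical sketch: emulating a 64-bit add using 32-bit words. */
uint64_t add64_on_32bit_words(uint32_t a_lo, uint32_t a_hi,
                              uint32_t b_lo, uint32_t b_hi) {
    uint32_t lo    = a_lo + b_lo;            /* add the low halves          */
    uint32_t carry = (lo < a_lo) ? 1u : 0u;  /* detect unsigned overflow    */
    uint32_t hi    = a_hi + b_hi + carry;    /* add the high halves + carry */
    return ((uint64_t)hi << 32) | lo;
}

/* With a 64-bit word size, the same operation is a single add instruction. */
uint64_t add64_on_64bit_word(uint64_t a, uint64_t b) {
    return a + b;
}
```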
Instruction Level Parallelism
Execute instructions in parallel:
Pipelining (parallelism across time)
Superscalar (parallelism across space)
Pipelining:
Split instruction execution in multiple stages, e.g.
Fetch (IF), Decode (ID), Execute (EX), Write-Back (WB)
Allow multiple instructions to occupy different stages in the same
clock cycle
Provided there are no data / control dependencies
Number of pipeline stages == Maximum achievable speedup
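To see why the number of stages is the maximum achievable speedup, compare execution times for n instructions on a k-stage pipeline, assuming one cycle per stage and no stalls (a standard idealized derivation, not a claim about real hardware):

```latex
\[
T_{\text{seq}} = n \cdot k \text{ cycles}, \qquad
T_{\text{pipe}} = k + (n - 1) \text{ cycles}
\]
\[
S = \frac{T_{\text{seq}}}{T_{\text{pipe}}} = \frac{n k}{k + n - 1}
\;\longrightarrow\; k \quad \text{as } n \to \infty
\]
```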
Pipelined Execution: Illustration
[Diagram: non-pipelined vs. pipelined execution — instruction stages IF, ID, EX, WB over time; in the pipelined case, consecutive instructions of the program flow occupy different stages in the same cycle]
Disadvantages and related issues: independence requirements, bubbles, hazards (data and control flow), speculation, out-of-order execution, read-after-write dependencies
The end of ILP
Processor clock rate
stops increasing
No further benefit from ILP
Instruction Level Parallelism: Superscalar
Duplicate the pipelines:
Allow multiple instructions to pass through the same stage
Scheduling is challenging (decide which instructions can be
executed together):
Dynamic (Hardware decision)
Static (Compiler decision)
Most modern processors are superscalar
e.g. each Intel Core i7 core has 14 pipeline stages and can execute 6 micro-ops in
the same cycle
Superscalar Execution: Illustration
[Diagram: 2-wide superscalar execution — two instructions of the program flow enter the IF, ID, EX, WB stages in each cycle]
Disadvantage: structural hazards
Metrics: cycles per instruction (CPI) and instructions per cycle (IPC)
Pipelined Processor:
Determines what instruction to run next
Execution unit: performs the operation described by an instruction
Registers: store values of variables used as inputs and outputs to operations
Superscalar Processor:
Automatically finds independent instructions in an instruction sequence and can execute them in parallel on execution units
Instructions come from the same execution flow (thread)
Instruction Level Parallelism: SIMD
Superscalar: decode and execute multiple (e.g. two) instructions per clock
Issue: not enough ILP, but math operations are long
SIMD: add ALUs to increase compute capability
Single instruction, multiple data: the same instruction is broadcast to and executed by all ALUs
Thread Level Parallelism: Motivation
Instruction-level parallelism is limited
For typical programs only 2-3 instructions can be executed in parallel
(either pipelined / superscalar )
Due to data/control dependencies
Multithreading was originally a software mechanism:
Allows multiple parts of the same program to execute concurrently
Key idea:
The processor can execute the threads in parallel
Superscalar:
Registers store values of variables used as inputs and outputs to operations from one execution flow
SMT – 2 hardware threads:
Can run one scalar instruction per clock from either of the hardware threads
Two logical cores
Thread Level Parallelism in the Processor
Processor can provide hardware support for multiple "thread contexts"
Known as simultaneous multithreading (SMT)
Information specific to each thread, e.g. Program Counter, Registers,
etc
Software threads can then execute in parallel
Many implementation approaches
Example:
Intel processors with hyper-threading technology, e.g. each i7 core
can execute 2 threads at the same time
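As a quick way to observe this from software, the following minimal sketch (an assumption: a POSIX system such as Linux) queries the number of logical CPUs the OS exposes; on a hyper-threaded machine this counts hardware threads, e.g. 2 per physical core.

```c
#include <stdio.h>
#include <unistd.h>

/* Sketch: report the number of logical CPUs (hardware threads) visible to the OS. */
int main(void) {
    long logical_cpus = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical CPUs (hardware threads): %ld\n", logical_cpus);
    return 0;
}
```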
Processor Level Parallelism (Multiprocessing)
Add more cores to the processor
The application should have multiple execution flows
Each process/thread has an independent context that can be
mapped to multiple processor cores
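A minimal sketch of what "multiple execution flows" means in code (hypothetical example using POSIX threads, compiled with -pthread): the OS is free to map each thread to a different core so they run in parallel.

```c
#include <pthread.h>
#include <stdio.h>

/* Each thread is an independent execution flow with its own context. */
static void *task(void *arg) {
    long id = (long)arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, task, (void *)i);  /* create 4 execution flows */
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);                      /* wait for them to finish  */
    return 0;
}
```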
Flynn's Parallel Architecture Taxonomy
One commonly used taxonomy of parallel architecture:
Based on the parallelism of instructions and data streams in the most
constrained component of the processor
Proposed by M. Flynn in 1972(!)
Instruction stream:
A single execution flow
i.e. a single Program Counter (PC)
Data stream:
Data being manipulated by the instruction stream
Single Instruction Single Data (SISD)
A single instruction stream is executed
Each instruction works on a single data item
Most uniprocessors fall into this category
[Diagram: a single instruction stream and a single data stream feeding one processing unit]
Single Instruction Multiple Data (SIMD)
A single stream of instructions
For example: one program counter
Each instruction works on multiple data items
Popular model for supercomputers during the 1980s:
Exploits data parallelism; commonly known as vector processors
Modern processors have some form of SIMD:
E.g. the SSE, AVX instructions in Intel x86 processors
[Diagram: one instruction shared by many PUs, each processing its own data]
SIMD nowadays
Data parallel architectures
AVX instructions
GPGPUs
Same instruction broadcasted to all ALUs
AVX: Intrinsic functions operate on vectors of four
64-bit values (e.g., vector of 4 doubles)
Not great for divergent execution
Original program
Processes one array element using scalar instructions on
scalar registers (e.g., 32-bit floats)
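The actual code is not reproduced in this handout; a minimal sketch of such a scalar loop (hypothetical example, scaling an array of 32-bit floats one element per iteration) might look like:

```c
/* Scalar version: one 32-bit float is processed per loop iteration. */
void scale(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i];
    }
}
```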
Vector program (using AVX intrinsics)
Intrinsic functions operate on vectors of
eight 32-bit values (e.g., vector of floats)
Processes eight array elements
simultaneously using vector instructions on
256-bit vector registers
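Again as a hedged sketch (hypothetical example, same loop as above, assuming n is a multiple of 8 and AVX is enabled at compile time), the AVX-intrinsics version processes eight floats per instruction:

```c
#include <immintrin.h>

/* AVX version: eight 32-bit floats are processed per iteration using 256-bit registers. */
void scale_avx(int n, float a, const float *x, float *y) {
    __m256 va = _mm256_set1_ps(a);               /* broadcast a into all 8 lanes    */
    for (int i = 0; i < n; i += 8) {
        __m256 vx = _mm256_loadu_ps(&x[i]);      /* load 8 floats (256 bits)        */
        __m256 vy = _mm256_mul_ps(va, vx);       /* 8 multiplies in one instruction */
        _mm256_storeu_ps(&y[i], vy);             /* store 8 results                 */
    }
}
```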
Multiple Instruction Single Data (MISD)
Multiple instruction streams
All instruction streams work on the same data at any time
No actual implementation except for the systolic array
Multiple Instruction Multiple Data (MIMD)
Each PU fetches its own instructions
Each PU operates on its own data
Currently the most popular model for multiprocessors
Variant – SIMD + MIMD
Stream processors (NVIDIA GPUs)
A set of threads executes the same code (effectively SIMD)
Multiple sets of threads execute in parallel (effectively MIMD at this level)
MULTICORE ARCHITECTURE
Architecture of Multicore Processors
Hierarchical design
Pipelined design
Network-based design
Hierarchical Design
Multiple cores share multiple caches
Cache size increases from the leaves to the root
Each core can have a separate L1 cache and share the L2 cache with other cores
All cores share the common external memory
[Diagram: cache hierarchy — per-core L1 caches, shared L2, shared L3 in front of external memory]
Usages:
Standard desktop and server processors
Graphics processing units
Hierarchical Design - Examples
Each core is a sophisticated, out-of-order processor to maximize ILP
Quad-Core AMD Opteron Intel Quad-Core Xeon
Pipelined Design
Data elements are processed by multiple execution cores in a
pipelined way
Useful if the same computation steps have to be applied to a long
sequence of data elements
E.g. processors used in routers
and graphics processors
Example: Pipelined Design
Xelerator X11 network processor
Network-Based Design
Cores and their local caches and memories are connected via
an interconnection network
SUN Niagara 2 (UltraSPARC T2)
Future Trends
Efficient on-chip interconnection
Enough bandwidth for data transfers between the cores
Scalable
Robust to tolerate failures
Efficient energy management
Reduce memory access time
Key word: Network on Chip (NoC)
MEMORY ORGANIZATION
Parallel Computer Component
Typical uniprocessor components:
Processor
One or more levels of caches
Memory module
Other (e.g. I/O)
[Diagram: uniprocessor consisting of a core, cache, and memory]
These components are similarly present in a parallel computer setup
Processors in a parallel computer system are also commonly known as processing elements
Recap: why do modern processors have cache?
Processors run efficiently when data is resident in caches
Caches reduce memory access latency and provide high-bandwidth data transfer to the CPU
Recap: memory latency and bandwidth
Memory latency: the amount of time for a memory request
(e.g., load, store) from a processor to be serviced by the
memory system
Example: 100 cycles, 100 nsec
Memory bandwidth: the rate at which the memory system can
provide data to a processor
Example: 20 GB/s
Processor “stalls” when it cannot run the next instruction in an
instruction stream because of a dependency on a previous
instruction
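As a rough worked example using the figures above (illustrative numbers only): transferring one 64-byte cache line at 20 GB/s takes a few nanoseconds, so the 100 ns access latency dominates a single request unless many requests are kept in flight.

```latex
\[
t_{\text{transfer}} = \frac{64~\text{B}}{20~\text{GB/s}} = 3.2~\text{ns}
\qquad \ll \qquad
t_{\text{latency}} \approx 100~\text{ns}
\]
```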
Execution on a Processor (one add per clock)
[Diagram: execution timeline with one add per clock; a memory request is in flight in every cycle]
The memory system is occupied 100% of the time!
In Modern Computing, Bandwidth is the Critical Resource
Performant parallel programs should:
Organize computation to fetch data from memory less often
Reuse data previously loaded by the same thread (temporal locality optimizations; see the sketch after this list)
Share data across threads (inter-thread cooperation)
Favor performing additional arithmetic to storing/reloading values
(the math is “free”)
Main point: programs must access memory infrequently to utilize
modern processors efficiently
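A minimal sketch of the "fetch data from memory less often" idea (hypothetical example; the function and array names are made up): fusing two passes over an array so each input element is loaded once and reused instead of being reloaded.

```c
/* Two separate passes: x[i] is read from memory in both loops,
 * and y[i] is written in pass 1 and read back in pass 2. */
void two_passes(int n, const float *x, float *y) {
    for (int i = 0; i < n; i++) y[i] = 2.0f * x[i];   /* pass 1 */
    for (int i = 0; i < n; i++) y[i] = y[i] + x[i];   /* pass 2 reloads x[i] and y[i] */
}

/* Fused pass: each x[i] is loaded once and reused from a register;
 * the extra arithmetic is cheap compared to the saved memory traffic. */
void fused_pass(int n, const float *x, float *y) {
    for (int i = 0; i < n; i++) {
        float xi = x[i];
        y[i] = 2.0f * xi + xi;
    }
}
```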
Memory Organization of Parallel Computers
Parallel Computers:
Distributed-Memory Multicomputers
Shared-Memory Multiprocessors:
Uniform Memory Access (UMA)
Non-Uniform Memory Access (NUMA)
Cache-Only Memory Access (COMA)
Hybrid (Distributed-Shared Memory)
Distributed-Memory Systems
[Diagram: nodes, each containing a processor, cache, and local memory, connected by an interconnection network]
Each node is an independent unit:
With processor, memory and, sometimes, peripheral elements
Physically distributed memory modules:
Memory in a node is private
Shared Memory System
[Diagram: processing units PU1, PU2, PU3 and I/O access memory through a shared memory provider]
Parallel programs / threads access memory through the shared memory provider
The program is unaware of the actual hardware memory architecture
Key issues: cache coherence and memory consistency
Cache Coherence
Multiple copies of the same data may exist in different caches
A local update by one processing unit means the other PUs should not see the stale (unchanged) data
[Diagram: PU1 and PU3 read u = 5 from memory into their caches (steps 1, 2); PU3 then writes u = 7 in its cache (step 3); subsequent reads of u by PU1 and PU2 (steps 4, 5) must not return the stale value 5]
Further Classification – Shared Memory
Two factors can further differentiate shared memory systems:
Processor to Memory Delay (UMA / NUMA)
Whether delay to memory is uniform
Presence of a local cache with cache coherence protocol (CC/NCC):
Same shared variable may exist in multiple caches
Hardware ensures correctness via cache coherence protocol
Uniform Memory Access (Time) (UMA)
[Diagram: UMA organization for a multiprocessor — every PU (with its cache) accesses a single shared memory through the shared memory provider]
Latency of accessing the main memory is the same for every processor:
Uniform access time, hence the name
Suitable only for a small number of processing units, due to contention
Non-Uniform Memory Access (NUMA)
[Diagram: NUMA organization — each PU (with its cache) has a local memory; the memories are linked by the interconnection (shared memory provider)]
The physically distributed memories of all processing units are combined to form a global shared-memory address space:
Also called distributed shared memory
Accessing local memory is faster than accessing remote memory for a processor:
Non-uniform access time
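One common NUMA-aware technique is "first touch" placement: a minimal sketch (an assumption about typical OS behavior, using OpenMP in C) that initializes an array with the same static thread layout that later computes on it, so each thread's pages end up in its local memory.

```c
#include <omp.h>

/* First-touch initialization: most OSes place a page on the NUMA node of the
 * core that first writes it, so each thread "touches" the chunk it will use later. */
void init_first_touch(double *a, long n) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++) {
        a[i] = 0.0;
    }
}
```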
AMD Ryzen
Example: Multicore NUMA
[Diagram: NUMA system built from multicore processors — each node is a multicore processor P (cores 1 to n) with its own cache and local memory, and the nodes are connected by the interconnection (shared memory provider)]
ccNUMA
[Diagram: ccNUMA — each processor has a cache and a local memory; all nodes are connected by the interconnection (shared memory provider)]
Cache Coherent Non-Uniform Memory Access
Each node has cache memory to reduce contention
COMA
[Diagram: COMA — each processor has only cache memory (no separate main memory); the caches are connected by the interconnection network]
Cache Only Memory Architecture
Each memory block works as cache memory
Data migrates dynamically and continuously according to the cache
coherence scheme
Summary: Shared Memory Systems
Advantages:
No need to partition code or data
No need to physically move data among processors:
Communication is efficient
Disadvantages:
Special synchronization constructs are required (see the sketch after this list)
Lack of scalability due to contention
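As an example of the "special synchronization constructs" disadvantage, here is a minimal sketch (hypothetical example using a POSIX mutex; the names worker and counter are made up) that protects a shared counter so concurrent increments do not race:

```c
#include <pthread.h>

long counter = 0;                                    /* shared data            */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;    /* synchronization object */

void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        pthread_mutex_lock(&lock);    /* only one thread enters at a time */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}
```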
Summary
Goal of parallel architecture is to reduce the average time to
execute an instruction
Various forms of parallelism
Different types of multicore processors
Different types of parallel systems and different memory
systems
Reading
CS149 - PARALLEL COMPUTING class at Stanford
Kayvon Fatahalian and Kunle Olukotun
Platform 2015: Intel Processor and Platform Evolution for the
Next Decade
Intel white paper, 2005
https://www.brendangregg.com/blog/2023-03-01/computer-performance-future-2022.html