Lecture #3 - Introduction to Data-Level Parallelism
What is Data-Level Parallelism?
Data-Level Parallelism (DLP) exploits the fact that many data operations can be
performed independently and simultaneously, particularly when the same operation
is applied to multiple data items.
Figure 1: Block diagram of a vector processor showing scalar unit, vector registers,
vector load/store unit, and vector functional units.
Why DLP?
• Increased performance without increasing clock speed.
• Improves energy efficiency.
• Perfect for workloads with regular, structured data (e.g., matrix ops, signal
processing, image manipulation).
SIMD Architectures and Applications
SIMD = Single Instruction, Multiple Data
Executes the same instruction across multiple data elements.
Use Cases:
• Matrix-oriented scientific computing
• Image and audio processing
• Mobile and embedded systems (due to power efficiency)
• Linear Algebra
Exploits data independence, not instruction independence
Figure 2: Instruction formats in RV64V showing operand flexibility.
Benefits of DLP:
• One instruction per many data operations
• Reduces instruction fetch and decode
• Energy efficient
• Compiler-friendly parallelism
• Keeps the sequential programming model
Advantages Over MIMD:
• SIMD fetches one instruction for all operations
• Less instruction bandwidth
• Higher energy efficiency
Architecture     | Example        | Focus
Vector Machines  | Cray-1, RV64V  | Scientific
SIMD Extensions  | MMX, SSE, AVX  | Multimedia
GPUs (SIMT)      | NVIDIA CUDA    | Large-scale data tasks
Table 1: Variants of SIMD Architectures
Vector Architectures
Figure 3: A vector processor architecture where each lane contains functional units
and access to the vector register bank.
Key Concepts:
• Use vector registers to load and process sets of data
• Operate on data using vector functional units
– Vector instructions: operate across registers
• Output results back to memory
• Dynamic vector length: using Vector Length Register (VL)
• Deep pipelining (e.g., Cray-1: 6-20 cycles)
Example: VMIPS DAXPY, a Basic Linear Algebra Subprograms (BLAS) routine:
; DAXPY: constant a times vector X plus vector Y
fld  f0, a         ; floating-point load of the scalar a
vld  v0, X         ; vector load of X
vmul v1, v0, f0    ; vector-scalar multiply: v1[i] = f0 * v0[i]
vld  v2, Y         ; vector load of Y
vadd v3, v1, v2    ; vector add: v3[i] = v1[i] + v2[i]
vst  v3, Y         ; vector store: Y[i] = a * X[i] + Y[i]
This is the essence of DLP. Each instruction here may operate on 32 or 64 elements
in a single go, depending on vector length.
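For comparison, the same computation as a plain scalar loop in C (a minimal sketch; the function name daxpy_scalar is just illustrative):

// Scalar DAXPY: one element per loop iteration.
// A vector machine replaces this whole loop with a handful of
// vector instructions that each cover VL elements.
void daxpy_scalar(int n, double a, const double *X, double *Y) {
    for (int i = 0; i < n; i++)
        Y[i] = a * X[i] + Y[i];
}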
Vector Benefits
• Compact representation
• High throughput
• Chaining enables overlap
• Unit-stride or strided access
• Predicate-based conditional operations
Figure 4: Vector Processors with Single and Multiple Add Pipelines
Execution Time Concepts
• Convoy: Set of vector instructions with no structural hazards that can execute together
• Chime: Unit of time to execute one convoy (roughly one clock cycle per vector element)
• Total cycles ≈ chimes × vector length (see the example below)
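As a rough illustration (assuming chaining lets each load and its dependent arithmetic share a convoy, and ignoring start-up latency): the VMIPS DAXPY above forms 3 convoys, (vld, vmul), (vld, vadd), and (vst), so for a vector length of 64 it takes about 3 × 64 = 192 clock cycles, i.e., 3 chimes.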
Figure 5: Breaking a vectorizable loop into multiple strips compatible with Maximum
Vector Length (MVL).
Strip Mining
Strip Mining is a compiler or programmer technique used to:
• Adapt loops to match hardware vector lengths (VL).
• Break a large loop into smaller “chunks” or “strips” whose size is limited by
the available vector register length.
while (n > 0) {
    VL = min(n, MVL)
    execute vector op on VL elements
    n -= VL
}
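The same pattern written out as a runnable C sketch (the inner loop stands in for a single vector operation of length vl; MVL = 64 is an assumed hardware limit):

#define MVL 64   /* assumed maximum vector length */

// Strip-mined DAXPY: the outer loop walks the array in strips of at
// most MVL elements; only the final strip may be shorter.
void daxpy_stripmined(int n, double a, const double *X, double *Y) {
    for (int low = 0; low < n; low += MVL) {
        int vl = (n - low < MVL) ? (n - low) : MVL;  /* VL = min(remaining, MVL) */
        for (int i = low; i < low + vl; i++)         /* stands in for one vector op */
            Y[i] = a * X[i] + Y[i];
    }
}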
Why even use Strip Mining?
1. Vector Length Limit
• Real hardware has a maximum vector length (MVL) determined by the
number of vector registers and lanes.
2. Dynamic Problem Sizes
• Program data structures often exceed MVL:
– Imagine processing 10,000 elements with a vector length of 64.
∗ You can’t load 10,000 elements into vector registers at once.
3. Solution is to process data in strips
• Each strip handles up to MVL elements.
• Process multiple strips to cover the whole array.
Figure 6: Breaking a vectorizable loop into multiple strips compatible with MVL.
The main ideas behind Strip Mining
1. Generality → Allows you to handle arbitrary N with hardware-limited MVL.
2. Maximizes Parallelism → Uses the full vector width whenever possible; only the
last strip may have fewer than MVL elements.
3. Simple Control Flow → Adds an outer loop to iterate over the strips; the inner
vectorized body remains the same.
4. Automatic in Compilers → Many compilers perform strip mining automatically
when vectorizing loops; it is explicit in hand-written assembly or intrinsics-based
programming.
Applied in:
• Vector Processors (VMIPS, RV64V)
• SIMD Extensions (implicitly when needed)
• GPUs (equivalent handled via thread block sizes)
• Compilers (LLVM, GCC, ICC) apply strip mining during automatic vectorization
SIMD Extensions Evolution
• SIMD added to x86 over time.
• Packed operations across fixed-width registers.
Name    | Width   | Introduced
MMX     | 64-bit  | 1996
SSE     | 128-bit | 1999
AVX     | 256-bit | 2010
AVX-512 | 512-bit | 2017
Table 2: Each new SIMD generation increased register width, enabling more DLP.
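For example, a 256-bit AVX register packs 256/64 = 4 double-precision or 256/32 = 8 single-precision elements, while a 512-bit AVX-512 register holds 8 doubles or 16 singles.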
But these are packed SIMD, not true vector processors.
Figure 7: Typical multimedia SIMD instruction capabilities.
Figure 8: Examples of AVX instructions for double-precision packed operations.
VMIPS → x86 Mapping
VMIPS               | x86 SIMD
vadd v3, v1, v2     | vaddpd ymm3, ymm1, ymm2
VL dynamic          | VL fixed
Predicate registers | Mask registers (AVX-512)
Strided loads       | Limited to AVX-512
Table 3: AVX-512 comes closest to offering vector-like features like masking and
gather/scatter support.
SIMD DAXPY in AVX:
vbroadcastsd ymm0, [a]     ; broadcast scalar double: ymm0 = {a, a, a, a}
vmovapd ymm1, [X]          ; aligned packed double load: ymm1 = {X[0], X[1], X[2], X[3]}
vmulpd  ymm2, ymm0, ymm1   ; packed double-precision multiply: ymm2[i] = a * X[i]
vmovapd ymm3, [Y]          ; aligned packed double load: ymm3 = {Y[0], Y[1], Y[2], Y[3]}
vaddpd  ymm4, ymm2, ymm3   ; packed double-precision add: ymm4[i] = a * X[i] + Y[i]
vmovapd [Y], ymm4          ; aligned packed double store: Y[0..3] <- ymm4
; DAXPY for i = 0 to 3 (in parallel) using packed SIMD instructions:
; Y[i] = a * X[i] + Y[i]
Each AVX instruction handles 4-8 doubles depending on register width. Compilers
can auto-vectorize such loops.
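The same DAXPY written with AVX intrinsics rather than raw assembly (a sketch assuming n is a multiple of 4; any remainder would need a scalar clean-up loop, i.e., manual strip mining):

#include <immintrin.h>

// AVX DAXPY with intrinsics: each iteration processes 4 doubles.
void daxpy_avx(int n, double a, const double *X, double *Y) {
    __m256d va = _mm256_set1_pd(a);              // {a, a, a, a}
    for (int i = 0; i < n; i += 4) {
        __m256d vx = _mm256_loadu_pd(&X[i]);     // X[i..i+3]
        __m256d vy = _mm256_loadu_pd(&Y[i]);     // Y[i..i+3]
        __m256d r  = _mm256_add_pd(_mm256_mul_pd(va, vx), vy);
        _mm256_storeu_pd(&Y[i], r);              // Y[i..i+3] = a*X[i..i+3] + Y[i..i+3]
    }
}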
SIMD Limitations
• Fixed width = less flexibility
• Limited masking
• No dynamic VL
• No hardware strip mining (loop remainders must be handled manually)
• No true vector masking (before AVX-512)
• Scatter/gather only available recently (gather since AVX2, scatter since AVX-512)
• Alignment constraints (SSE)
GPU as SIMD (SIMT) → Single Instruction, Multiple Threads
Feature       | GPU
Registers     | 256 per thread
Threads       | Thousands (SIMT)
Vectorization | Done via threads
Masking       | Warp divergence
Memory        | Shared and global memories
CUDA DAXPY Example:
__global__ void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
This looks like scalar C, but each thread runs in parallel. CUDA manages all the
vectorization, synchronization, and execution.
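A minimal host-side launch sketch for the kernel above (it assumes d_x and d_y already point to device arrays, e.g., allocated with cudaMalloc and filled with cudaMemcpy; the block size of 256 is an arbitrary but common choice):

// Round the grid size up so every element gets a thread; the
// 'if (i < n)' guard in the kernel masks off the excess threads,
// which is the GPU analogue of a final, shorter strip.
int threadsPerBlock = 256;
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
daxpy<<<numBlocks, threadsPerBlock>>>(n, a, d_x, d_y);
cudaDeviceSynchronize();   // wait for the kernel to finish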
Figure 9: CUDA thread hierarchy showing grids, thread blocks, and threads.
Figure 10: Diagram of a Pascal GPU’s SIMD processor.
Figure 10 shows the internal block diagram of a Pascal Streaming Multiprocessor
(SM), where we see how multiple warps map onto SIMD lanes and share resources
like registers and shared memory.
Figure 11: High-level architecture of a full Pascal GPU chip.
Hardware = Strip-Mined Loops
• Thread blocks = tiles
• Threads = SIMD lanes
• Shared memory = software-managed cache
• Warps scheduled dynamically
Memory & Control Flow
Memory System Challenges
• Memory bank conflicts stall vector loads
• Alignment constraints in SIMD
• Shared memory conflicts in GPUs
• Gather/Scatter:
– Native: Vector, GPU
– Limited: SIMD (AVX-512)
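For context, a gather is simply an indexed load; in scalar C it looks like the loop below (illustrative sketch). A vector machine or GPU can issue all of the indexed loads as one gather instruction, while packed SIMD gained hardware support for this only with AVX2/AVX-512.

// Gather: read the elements of x selected by an index array idx.
for (int i = 0; i < n; i++)
    y[i] = x[idx[i]];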
Conditional Execution
• Vector: Predicate registers
• SIMD: AVX-512 mask (k) registers
• GPU: Warp divergence, reconvergence stack
Conditional execution can degrade DLP. Using masks helps avoid branches.
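A small illustration of how a mask replaces a branch (a C sketch of if-conversion; in hardware the per-element condition would live in a predicate or mask register):

// Branch form: some elements skip the body, breaking uniform
// SIMD execution across lanes.
for (int i = 0; i < n; i++)
    if (x[i] != 0.0)
        x[i] = x[i] - y[i];

// Masked form: every lane computes, and the mask decides whether
// the result is kept -- no branch inside the loop body.
for (int i = 0; i < n; i++) {
    int keep = (x[i] != 0.0);            // per-element mask
    x[i] = keep ? x[i] - y[i] : x[i];
}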
Thread Scheduling & Divergence
• Warps are scheduled onto SIMD lanes.
• Divergent branches handled with masking + reconvergence.
Figure 12: Hardware scheduling of SIMD threads in GPUs.
Memory and Predicated Execution
Figure 13: Basic PTX assembly showing thread-level operations.
Figure 14: GPU memory structure including global, local, and private memories.
Figure 15: Dual-issue SIMT thread scheduler used in modern GPUs.
Roofline Model
This model helps reason about whether your performance is limited by computation
or memory bandwidth.
Figure 16: Relationship between FLOPs and memory bytes accessed.
Figure 17: Roofline comparison between a vector supercomputer and a modern SIMD-
capable processor.
• X-axis: Arithmetic Intensity (FLOPs/Byte)
• Y-axis: Attainable GFLOPs/sec
• Two ceilings:
– Memory bound
– Compute bound
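The two ceilings combine into the standard roofline bound; a worked example follows (assuming DAXPY streams X and Y with no cache reuse):

Attainable GFLOPs/sec = min(Peak FP performance, Peak memory bandwidth × Arithmetic intensity)

DAXPY performs 2 FLOPs per element (one multiply, one add) while moving 24 bytes (load X[i], load Y[i], store Y[i], 8 bytes each), so its arithmetic intensity is 2/24 ≈ 0.083 FLOPs/byte: it sits far to the left of the plot and is memory-bound on essentially any machine.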
What does it mean to be memory-bound or compute-bound?
Why do GPUs often reach higher attainable GFLOPs/sec?
Comparative Models
Figure 18: Comparison of a traditional vector processor with a GPU’s multithreaded
SIMD processor.
Figure 19: Comparison of SIMD extensions (MMX, SSE, AVX) to GPU SIMT model.
Figure 20: Roofline performance comparison of a CPU with SIMD extensions and a
GPU.
Quick Summary
Feature        | Vector          | SIMD            | GPU
ISA            | Full vector     | Packed          | SIMT
Gather/Scatter | Yes             | AVX-512 only    | Yes
Masking        | Predicate regs  | AVX-512 k masks | Warp divergence
Strip-Mining   | Compiler        | Hidden          | Grid/block sizes
Best For       | Scientific apps | Media workloads | ML, DL, HPC
Key Takeaways:
• DLP is more scalable than ILP
• Vector machines offer clarity and flexibility
• SIMD extensions are pragmatic but constrained
• GPUs provide massive DLP, but require careful management of memory and
threads