Lecture #3 - Introduction to Data-Level Parallelism
What is Data-Level Parallelism?
Data-Level Parallelism (DLP) exploits the fact that many data operations can be
performed independently and simultaneously, particularly when the same operation
is applied to multiple data items.
Figure 1: Block diagram of a vector processor showing scalar unit, vector registers,
vector load/store unit, and vector functional units.
Why DLP?
• Increased performance without increasing clock speed.
• Improves energy efficiency.
• Perfect for workloads with regular, structured data (e.g., matrix ops, signal
processing, image manipulation).
SIMD Architectures and Applications
SIMD = Single Instruction, Multiple Data
Executes the same instruction across multiple data elements.
Use Cases:
• Matrix-oriented scientific computing
• Image and audio processing
• Mobile and embedded systems (due to power efficiency)
• Linear Algebra
Exploits data independence, not instruction independence
Figure 2: Instruction formats in RV64V showing operand flexibility.
Benefits of DLP:
• One instruction per many data operations
• Reduces instruction fetch and decode
• Energy efficient
• Compiler-friendly parallelism
• Keeps the sequential programming model
Advantages Over MIMD:
• SIMD fetches one instruction for all operations
• Less instruction bandwidth
• Higher energy efficiency
Architecture     | Example        | Focus
Vector Machines  | Cray-1, RV64V  | Scientific
SIMD Extensions  | MMX, SSE, AVX  | Multimedia
GPUs (SIMT)      | NVIDIA CUDA    | Large-scale data tasks
Table 1: Variants of SIMD Architectures
Vector Architectures
Figure 3: A vector processor architecture where each lane contains functional units
and access to the vector register bank.
Key Concepts:
• Use vector registers to load and process sets of data
• Operate on data using vector functional units
– Vector instructions: operate across registers
• Output results back to memory
• Dynamic vector length: using Vector Length Register (VL)
• Deep pipelining (e.g., Cray-1: 6-20 cycles)
Example: VMIPS DAXPY, a Basic Linear Algebra Subprograms (BLAS) routine:
; DAXPY: constant a times vector X plus vector Y
fld  f0, a         ; floating-point load of the scalar a
vld  v0, X         ; vector load of X
vmul v1, v0, f0    ; vector-scalar multiply: v1[i] = f0 * v0[i]
vld  v2, Y         ; vector load of Y
vadd v3, v1, v2    ; vector add: v3[i] = v1[i] + v2[i]
vst  v3, Y         ; vector store: Y[i] = a * X[i] + Y[i]
This is the essence of DLP. Each instruction here may operate on 32 or 64 elements
in a single go, depending on vector length.
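For comparison, the same computation as a plain scalar loop in C (a minimal sketch; the function name daxpy_scalar is just illustrative):

// Scalar DAXPY: one element per loop iteration.
// A vector machine replaces this whole loop with a handful of
// vector instructions that each cover VL elements.
void daxpy_scalar(int n, double a, const double *X, double *Y) {
    for (int i = 0; i < n; i++)
        Y[i] = a * X[i] + Y[i];
}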
Vector Benefits
• Compact representation
• High throughput
• Chaining enables overlap
• Unit-stride or strided access
• Predicate-based conditional operations
Figure 4: Vector Processors with Single and Multiple Add Pipelines
Execution Time Concepts
• Convoy: Set of vector instructions with no structural hazards that can execute together
• Chime: Unit of time to execute one convoy (roughly one clock cycle per vector element)
• Total cycles ≈ chimes × vector length (see the example below)
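As a rough illustration (assuming chaining lets each load and its dependent arithmetic share a convoy, and ignoring start-up latency): the VMIPS DAXPY above forms 3 convoys, (vld, vmul), (vld, vadd), and (vst), so for a vector length of 64 it takes about 3 × 64 = 192 clock cycles, i.e., 3 chimes.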
Figure 5: Breaking a vectorizable loop into multiple strips compatible with Maximum
Vector Length (MVL).
Strip Mining
Strip Mining is a compiler or programmer technique used to:
• Adapt loops to match hardware vector lengths (VL).
• Break a large loop into smaller “chunks” or “strips” whose size is limited by
the available vector register length.
while (n > 0) {
    VL = min(n, MVL)
    execute vector op on VL elements
    n -= VL
}
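The same pattern written out as a runnable C sketch (the inner loop stands in for a single vector operation of length vl; MVL = 64 is an assumed hardware limit):

#define MVL 64   /* assumed maximum vector length */

// Strip-mined DAXPY: the outer loop walks the array in strips of at
// most MVL elements; only the final strip may be shorter.
void daxpy_stripmined(int n, double a, const double *X, double *Y) {
    for (int low = 0; low < n; low += MVL) {
        int vl = (n - low < MVL) ? (n - low) : MVL;  /* VL = min(remaining, MVL) */
        for (int i = low; i < low + vl; i++)         /* stands in for one vector op */
            Y[i] = a * X[i] + Y[i];
    }
}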
Why even use Strip Mining?
1. Vector Length Limit
• Real hardware has a maximum vector length (MVL) determined by the
number of vector registers and lanes.
2. Dynamic Problem Sizes
• Program data structures often exceed MVL:
– Imagine processing 10,000 elements with a vector length of 64.
∗ You can’t load 10,000 elements into vector registers at once.
3. Solution is to process data in strips
• Each strip handles up to MVL elements.
• Process multiple strips to cover the whole array.
Figure 6: Breaking a vectorizable loop into multiple strips compatible with MVL.
The main ideas behind Strip Mining
1. Generality → Allows you to handle arbitrary N with hardware-limited MVL.
2. Maximizes Parallelism → Uses the full vector width whenever possible; only the
last strip may have fewer than MVL elements.
3. Simple Control Flow → Adds an outer loop to iterate over the strips; the inner
vectorized body remains the same.
4. Automatic in Compilers → Many compilers perform strip mining automatically
when vectorizing loops; it is explicit in hand-written assembly or intrinsics-based
programming.
Applied in:
• Vector Processors (VMIPS, RV64V)
• SIMD Extensions (implicitly when needed)
• GPUs (equivalent handled via thread block sizes)
• Compilers (LLVM, GCC, ICC) apply strip mining during automatic vectorization
SIMD Extensions Evolution
• SIMD added to x86 over time.
• Packed operations across fixed-width registers.
Name    | Width   | Introduced
MMX     | 64-bit  | 1996
SSE     | 128-bit | 1999
AVX     | 256-bit | 2010
AVX-512 | 512-bit | 2017
Table 2: Each new SIMD generation increased register width, enabling more DLP.
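For example, a 256-bit AVX register packs 256/64 = 4 double-precision or 256/32 = 8 single-precision elements, while a 512-bit AVX-512 register holds 8 doubles or 16 singles.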
But these are packed SIMD, not true vector processors.
Figure 7: Typical multimedia SIMD instruction capabilities.
Figure 8: Examples of AVX instructions for double-precision packed operations.
VMIPS → x86 Mapping
VMIPS               | x86 SIMD
vadd v3, v1, v2     | vaddpd ymm3, ymm1, ymm2
VL dynamic          | VL fixed
Predicate registers | Mask registers (AVX-512)
Strided loads       | Limited to AVX-512
Table 3: AVX-512 comes closest to offering vector-like features like masking and
gather/scatter support.
SIMD DAXPY in AVX:
vbroadcastsd ymm0, [a]     ; broadcast scalar double: ymm0 = {a, a, a, a}
vmovapd ymm1, [X]          ; aligned packed double load: ymm1 = {X[0], X[1], X[2], X[3]}
vmulpd  ymm2, ymm0, ymm1   ; packed double-precision multiply: ymm2[i] = a * X[i]
vmovapd ymm3, [Y]          ; aligned packed double load: ymm3 = {Y[0], Y[1], Y[2], Y[3]}
vaddpd  ymm4, ymm2, ymm3   ; packed double-precision add: ymm4[i] = a * X[i] + Y[i]
vmovapd [Y], ymm4          ; aligned packed double store: Y[0..3] <- ymm4
; DAXPY for i = 0 to 3 (in parallel) using packed SIMD instructions:
; Y[i] = a * X[i] + Y[i]
Each AVX instruction handles 4-8 doubles depending on register width. Compilers
can auto-vectorize such loops.
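The same DAXPY written with AVX intrinsics rather than raw assembly (a sketch assuming n is a multiple of 4; any remainder would need a scalar clean-up loop, i.e., manual strip mining):

#include <immintrin.h>

// AVX DAXPY with intrinsics: each iteration processes 4 doubles.
void daxpy_avx(int n, double a, const double *X, double *Y) {
    __m256d va = _mm256_set1_pd(a);              // {a, a, a, a}
    for (int i = 0; i < n; i += 4) {
        __m256d vx = _mm256_loadu_pd(&X[i]);     // X[i..i+3]
        __m256d vy = _mm256_loadu_pd(&Y[i]);     // Y[i..i+3]
        __m256d r  = _mm256_add_pd(_mm256_mul_pd(va, vx), vy);
        _mm256_storeu_pd(&Y[i], r);              // Y[i..i+3] = a*X[i..i+3] + Y[i..i+3]
    }
}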
SIMD Limitations
• Fixed width = less flexibility
• Limited masking
• No dynamic VL
• No hardware strip mining (loop remainders must be handled manually)
• No true vector masking (before AVX-512)
• Scatter/gather only available recently (gather since AVX2, scatter since AVX-512)
• Alignment constraints (SSE)
GPU as SIMD (SIMT) → Single Instruction, Multiple Threads
Feature       | GPU
Registers     | 256 per thread
Threads       | Thousands (SIMT)
Vectorization | Done via threads
Masking       | Warp divergence
Memory        | Shared and global memories
CUDA DAXPY Example:
__global__ void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
This looks like scalar C, but each thread runs in parallel. CUDA manages all the
vectorization, synchronization, and execution.
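A minimal host-side launch sketch for the kernel above (it assumes d_x and d_y already point to device arrays, e.g., allocated with cudaMalloc and filled with cudaMemcpy; the block size of 256 is an arbitrary but common choice):

// Round the grid size up so every element gets a thread; the
// 'if (i < n)' guard in the kernel masks off the excess threads,
// which is the GPU analogue of a final, shorter strip.
int threadsPerBlock = 256;
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
daxpy<<<numBlocks, threadsPerBlock>>>(n, a, d_x, d_y);
cudaDeviceSynchronize();   // wait for the kernel to finish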
Figure 9: CUDA thread hierarchy showing grids, thread blocks, and threads.
Figure 10: Diagram of a Pascal GPU’s SIMD processor.
Figure 10 shows the internal block diagram of a Pascal Streaming Multiprocessor
(SM), where we see how multiple warps map onto SIMD lanes and share resources
like registers and shared memory.
Figure 11: High-level architecture of a full Pascal GPU chip.
Hardware = Strip-Mined Loops
• Thread blocks = tiles
• Threads = SIMD lanes
• Shared memory = software-managed cache
• Warps scheduled dynamically
Memory & Control Flow
Memory System Challenges
• Memory bank conflicts stall vector loads
• Alignment constraints in SIMD
• Shared memory conflicts in GPUs
• Gather/Scatter:
– Native: Vector, GPU
– Limited: SIMD (AVX-512)
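For context, a gather is simply an indexed load; in scalar C it looks like the loop below (illustrative sketch). A vector machine or GPU can issue all of the indexed loads as one gather instruction, while packed SIMD gained hardware support for this only with AVX2/AVX-512.

// Gather: read the elements of x selected by an index array idx.
for (int i = 0; i < n; i++)
    y[i] = x[idx[i]];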
Conditional Execution
• Vector: Predicate registers
• SIMD: AVX-512 mask (k) registers
• GPU: Warp divergence, reconvergence stack
Conditional execution can degrade DLP. Using masks helps avoid branches.
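A small illustration of how a mask replaces a branch (a C sketch of if-conversion; in hardware the per-element condition would live in a predicate or mask register):

// Branch form: some elements skip the body, breaking uniform
// SIMD execution across lanes.
for (int i = 0; i < n; i++)
    if (x[i] != 0.0)
        x[i] = x[i] - y[i];

// Masked form: every lane computes, and the mask decides whether
// the result is kept -- no branch inside the loop body.
for (int i = 0; i < n; i++) {
    int keep = (x[i] != 0.0);            // per-element mask
    x[i] = keep ? x[i] - y[i] : x[i];
}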
Thread Scheduling & Divergence
• Warps are scheduled onto SIMD lanes.
• Divergent branches handled with masking + reconvergence.
Figure 12: Hardware scheduling of SIMD threads in GPUs.
Memory and Predicated Execution
Figure 13: Basic PTX assembly showing thread-level operations.
Figure 14: GPU memory structure including global, local, and private memories.
Figure 15: Dual-issue SIMT thread scheduler used in modern GPUs.
Roofline Model
This model helps reason about whether your performance is limited by computation
or memory bandwidth.
Figure 16: Relationship between FLOPs and memory bytes accessed.
Figure 17: Roofline comparison between a vector supercomputer and a modern SIMD-
capable processor.
• X-axis: Arithmetic Intensity (FLOPs/Byte)
• Y-axis: Attainable GFLOPs/sec
• Two ceilings:
– Memory bound
– Compute bound
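The two ceilings combine into the standard roofline bound; a worked example follows (assuming DAXPY streams X and Y with no cache reuse):

Attainable GFLOPs/sec = min(Peak FP performance, Peak memory bandwidth × Arithmetic intensity)

DAXPY performs 2 FLOPs per element (one multiply, one add) while moving 24 bytes (load X[i], load Y[i], store Y[i], 8 bytes each), so its arithmetic intensity is 2/24 ≈ 0.083 FLOPs/byte: it sits far to the left of the plot and is memory-bound on essentially any machine.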
What does it mean to be memory-bound or compute-bound?
Why do GPUs often reach higher attainable GFLOPs/sec?
Comparative Models
Figure 18: Comparison of a traditional vector processor with a GPU’s multithreaded
SIMD processor.
Figure 19: Comparison of SIMD extensions (MMX, SSE, AVX) to GPU SIMT model.
Figure 20: Roofline performance comparison of a CPU with SIMD extensions and a
GPU.
Quick Summary
Feature        | Vector          | SIMD            | GPU
ISA            | Full vector     | Packed          | SIMT
Gather/Scatter | Yes             | AVX-512 only    | Yes
Masking        | Predicate regs  | AVX-512 k masks | Warp divergence
Strip-Mining   | Compiler        | Hidden          | Grid/block sizes
Best For       | Scientific apps | Media workloads | ML, DL, HPC
Key Takeaways:
• DLP is more scalable than ILP
• Vector machines offer clarity and flexibility
• SIMD extensions are pragmatic but constrained
• GPUs provide massive DLP, but require careful management of memory and
threads