
Lecture #3 - Introduction to Data-Level Parallelism

What is Data-Level Parallelism?


Data-Level Parallelism (DLP) exploits the fact that many data operations can be
performed independently and simultaneously, particularly when the same operation
is applied to multiple data items.
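
For example, every iteration of the loop below is independent of the others, so all of its element operations could in principle be performed at the same time (a minimal C sketch; the function and array names are illustrative):

/* Element-wise add: no iteration depends on any other, so the same add
 * can be applied to many data elements simultaneously. */
void vadd(int n, const double *a, const double *b, double *c) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}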

Figure 1: Block diagram of a vector processor showing scalar unit, vector registers,
vector load/store unit, and vector functional units.

Why DLP?

• Increased performance without increasing clock speed.

• Improves energy efficiency.

• Perfect for workloads with regular, structured data (e.g., matrix ops, signal
processing, image manipulation).

SIMD Architectures and Applications


SIMD = Single Instruction, Multiple Data
Executes the same instruction across multiple data elements.

Use Cases:

• Matrix-oriented scientific computing
• Image and audio processing
• Mobile and embedded systems (due to power efficiency)
• Linear Algebra
SIMD exploits data independence, not instruction independence.

Figure 2: Instruction formats in RV64V showing operand flexibility.

Benefits of DLP:
• One instruction drives many data operations
• Reduces instruction fetch and decode overhead
• Energy efficient
• Compiler-friendly parallelism
• Keeps the sequential programming model

Advantages Over MIMD:
• SIMD fetches one instruction for all operations

• Less instruction bandwidth

• Higher energy efficiency

Architecture       Example          Focus

Vector Machines    Cray-1, RV64V    Scientific
SIMD Extensions    MMX, SSE, AVX    Multimedia
GPUs (SIMT)        NVIDIA CUDA      Large-scale data tasks

Table 1: Variants of SIMD Architectures

Vector Architectures

Figure 3: A vector processor architecture where each lane contains functional units
and access to the vector register bank.

Key Concepts:
• Use vector registers to load and process sets of data

• Operate on data using vector functional units

– Vector instructions: operate across registers

• Write results back to memory

• Dynamic vector length: using Vector Length Register (VL)

• Deep pipelining (e.g., Cray-1: 6-20 cycles)

Example: DAXPY (double-precision a times X plus Y, a Basic Linear Algebra Subprograms (BLAS) routine) in VMIPS vector code:


; DAXPY: constant times a vector plus a vector
fld  f0, a         ; floating-point load of scalar a
vld  v0, X         ; vector load of X
vmul v1, v0, f0    ; vector-scalar multiply: v1[i] = f0 * v0[i]
vld  v2, Y         ; vector load of Y
vadd v3, v1, v2    ; vector add:             v3[i] = v1[i] + v2[i]
vst  v3, Y         ; vector store:           Y[i]  = a * X[i] + Y[i]

This is the essence of DLP. Each instruction here may operate on 32 or 64 elements
in a single go, depending on vector length.
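
For reference, the same computation written as a scalar C loop (a minimal sketch; the names are illustrative) is:

/* Scalar DAXPY: what the vector sequence above computes, one element at a time. */
void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

The vector version replaces this entire loop, for a whole strip of elements, with a handful of vector instructions.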

Vector Benefits
• Compact representation

• High throughput

• Chaining enables overlap

• Unit-stride or strided access

• Predicate-based conditional operations

Figure 4: Vector Processors with Single and Multiple Add Pipelines

Execution Time Concepts
• Convoy: set of vector instructions that can begin execution together (no structural hazards)
• Chime: the unit of time to execute one convoy (roughly the vector length in cycles)
• Total cycles ≈ number of chimes × vector length (see the worked estimate below)
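
As a rough, hedged estimate for the DAXPY sequence above: with a single memory pipeline, the five vector instructions fall into three convoys (vld chained with vmul, vld chained with vadd, then vst), i.e., three chimes. For a vector length of 64 that is roughly 3 × 64 = 192 cycles, ignoring start-up latency.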

Figure 5: Breaking a vectorizable loop into multiple strips compatible with Maximum
Vector Length (MVL).

Strip Mining
Strip Mining is a compiler or programmer technique used to:
• Adapt loops to match hardware vector lengths (VL).
• Break a large loop into smaller “chunks” or “strips” whose size is limited by
the available vector register length.

while (n > 0) {
    VL = min(n, MVL)            // recomputed each strip; only the last strip is shorter
    execute vector op on VL elements
    n -= VL
}
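
A concrete C sketch of the same pattern, assuming a hypothetical vector_daxpy() primitive that stands in for the hardware vector operation and handles at most MVL elements per call:

/* Strip-mined DAXPY (illustrative sketch). */
enum { MVL = 64 };                        /* assumed maximum vector length */

void vector_daxpy(int vl, double a, const double *x, double *y);  /* hypothetical */

void daxpy_strips(int n, double a, const double *x, double *y) {
    int i = 0;
    while (n > 0) {
        int vl = (n < MVL) ? n : MVL;     /* last strip may be shorter */
        vector_daxpy(vl, a, &x[i], &y[i]);
        i += vl;
        n -= vl;
    }
}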

Why even use Strip Mining?


1. Vector Length Limit
• Real hardware has a maximum vector length (MVL), set by how many elements each vector register can hold.
2. Dynamic Problem Sizes
• Program data structures often exceed MVL:
– Imagine processing 10,000 elements with a vector length of 64.
∗ You can’t load 10,000 elements into vector registers at once.
3. Solution is to process data in strips
• Each strip handles up to MVL elements.
• Process multiple strips to cover the whole array.

Figure 6: Breaking a vectorizable loop into multiple strips compatible with MVL.

The main ideas behind Strip Mining

1. Generality → handles arbitrary N on hardware with a limited MVL.

2. Maximizes Parallelism → always uses the full vector width when possible; only the last strip may have fewer than MVL elements.

3. Simple Control Flow → adds an outer loop to iterate over the strips; the inner vectorized body stays the same.

4. Automatic in Compilers → many compilers perform strip mining automatically when vectorizing loops; it is explicit in hand-written assembly or intrinsics-based programming.

Applied in:

• Vector Processors (VMIPS, RV64V)

• SIMD Extensions (implicitly when needed)

• GPUs (equivalent handled via thread block sizes)

• Compilers (LLVM, GCC, ICC) apply strip mining during automatic vectorization

Evolution of SIMD Extensions
• SIMD added to x86 over time.

• Packed operations across fixed-width registers.

Name      Width     Introduced

MMX       64-bit    1996
SSE       128-bit   1999
AVX       256-bit   2010
AVX-512   512-bit   2017

Table 2: Each new SIMD generation increased register width, enabling more DLP.
But these are packed SIMD, not true vector processors.
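
As a quick check on what these widths mean for DLP: a 256-bit AVX register holds 256/64 = 4 double-precision or 256/32 = 8 single-precision elements, and AVX-512 doubles both counts to 8 and 16.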

Figure 7: Typical multimedia SIMD instruction capabilities.

Figure 8: Examples of AVX instructions for double-precision packed operations.

VMIPS → x86 Mapping

VMIPS                  x86 SIMD

vadd v3,v1,v2          vaddpd ymm3, ymm1, ymm2
Dynamic VL             Fixed VL
Predicate registers    Mask registers (AVX-512)
Strided loads          Limited to AVX-512

Table 3: AVX-512 comes closest to offering vector-like features such as masking and gather/scatter support.

SIMD DAXPY in AVX:


vbroadcastsd ymm0, [a]           ; broadcast scalar double:  ymm0 = {a, a, a, a}
vmovapd      ymm1, [X]           ; aligned packed load:      ymm1 = {X[0], X[1], X[2], X[3]}
vmulpd       ymm2, ymm0, ymm1    ; packed double multiply:   ymm2[i] = a * X[i]
vmovapd      ymm3, [Y]           ; aligned packed load:      ymm3 = {Y[0], Y[1], Y[2], Y[3]}
vaddpd       ymm4, ymm2, ymm3    ; packed double add:        ymm4[i] = a * X[i] + Y[i]
vmovapd      [Y], ymm4           ; aligned packed store:     Y[0..3] = ymm4

; DAXPY for i = 0..3 (in parallel) using packed SIMD instructions: Y[i] = a * X[i] + Y[i]

Each AVX instruction handles 4-8 doubles depending on register width. Compilers
can auto-vectorize such loops.
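
Written in C with AVX intrinsics, a minimal sketch of the same loop (assuming n is a multiple of 4; a real version would strip-mine the remainder) might look like:

#include <immintrin.h>

/* DAXPY with 256-bit AVX intrinsics: 4 doubles per iteration. */
void daxpy_avx(int n, double a, const double *x, double *y) {
    __m256d va = _mm256_set1_pd(a);                            /* {a, a, a, a}     */
    for (int i = 0; i < n; i += 4) {
        __m256d vx = _mm256_loadu_pd(&x[i]);                   /* load 4 X values  */
        __m256d vy = _mm256_loadu_pd(&y[i]);                   /* load 4 Y values  */
        __m256d r  = _mm256_add_pd(_mm256_mul_pd(va, vx), vy); /* a*X[i] + Y[i]    */
        _mm256_storeu_pd(&y[i], r);                            /* store back to Y  */
    }
}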

SIMD Limitations
• Fixed width = less flexibility

• Limited masking

• No dynamic VL

• No strip mining

• No true vector masking (before AVX-512)

• Scatter/gather support arrived only recently

• Alignment constraints (SSE)

GPU as SIMD (SIMT) → Single Instruction, Multiple Threads

Feature        GPU
Registers      Up to 256 per thread
Threads        Thousands (SIMT)
Vectorization  Done via threads
Masking        Warp divergence
Memory         Shared and global memories

CUDA DAXPY Example:


__global__ void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

This looks like scalar C, but each thread runs in parallel. CUDA manages all the
vectorization, synchronization, and execution.
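
A minimal host-side launch sketch (assuming x and y are already device pointers, e.g., allocated with cudaMalloc and filled with cudaMemcpy):

// One thread per element, 256 threads per block (an assumed block size).
// The grid-size computation plays the role of strip mining: each thread
// block covers one "strip" of up to 256 elements.
int threadsPerBlock = 256;
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // ceil(n / 256)
daxpy<<<blocks, threadsPerBlock>>>(n, a, x, y);
cudaDeviceSynchronize();                                    // wait for the kernel to finish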

Figure 9: CUDA thread hierarchy showing grids, thread blocks, and threads.

Figure 10: Diagram of a Pascal GPU’s SIMD processor.

Figure 10 shows the internal block diagram of a Pascal Streaming Multiprocessor (SM), where we see how multiple warps map onto SIMD lanes and share resources like registers and shared memory.

Figure 11: High-level architecture of a full Pascal GPU chip.

Hardware = Strip-Mined Loops


• Thread blocks = tiles

• Threads = SIMD lanes

• Shared memory = software-managed cache

• Warps scheduled dynamically

Memory & Control Flow
Memory System Challenges

• Memory bank conflicts stall vector loads

• Alignment constraints in SIMD

• Shared memory conflicts in GPUs

• Gather/Scatter:

– Native: Vector, GPU


– Limited: SIMD (AVX-512)

Conditional Execution

• Vector: Predicate registers

• SIMD: AVX-512 mask (k) registers

• GPU: Warp divergence, reconvergence stack

Conditional execution can degrade DLP. Using masks helps avoid branches.
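
As a minimal C sketch (names illustrative), a data-dependent conditional inside an otherwise vectorizable loop is handled on these machines with a per-element mask rather than a per-element branch:

/* The if-condition becomes a per-element compare that sets a predicate/mask
 * register (or drives warp divergence on a GPU); the divide result is written
 * only where the mask bit is set, so no scalar branch is needed. */
void cond_divide(int n, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        if (x[i] != 0.0)
            y[i] = y[i] / x[i];
}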

Thread Scheduling & Divergence

• Warps are scheduled onto SIMD lanes.

• Divergent branches handled with masking + reconvergence.

Figure 12: Hardware scheduling of SIMD threads in GPUs.

Memory and Predicated Execution

Figure 13: Basic PTX assembly showing thread-level operations.

Figure 14: GPU memory structure including global, local, and private memories.

Figure 15: Dual-issue SIMT thread scheduler used in modern GPUs.

Roofline Model
This model helps reason about whether performance is limited by computation or by memory bandwidth: attainable performance is the minimum of the machine's peak compute rate and its peak memory bandwidth multiplied by the kernel's arithmetic intensity. A worked estimate follows the axis description below.

Figure 16: Relationship between FLOPs and memory bytes accessed.

Figure 17: Roofline comparison between a vector supercomputer and a modern SIMD-
capable processor.

• X-axis: Arithmetic Intensity (FLOPs/Byte)

• Y-axis: Attainable GFLOPs/sec

• Two ceilings:

– Memory bound
– Compute bound
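
As a rough worked estimate for DAXPY (assuming double precision and no cache reuse): each element takes 2 FLOPs (one multiply, one add) and moves 24 bytes (load X[i], load Y[i], store Y[i]), so its arithmetic intensity is 2/24 ≈ 0.083 FLOPs/byte. That sits far to the left of the ridge point on virtually any roofline, under the memory-bandwidth ceiling, so DAXPY is memory-bound; a kernel with high FLOPs/byte, such as dense matrix multiply, can instead reach the compute ceiling.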

What does it mean to be memory-bound or compute-bound?

Why do GPUs often reach higher attainable GFLOPs/sec?

Comparative Models

Figure 18: Comparison of a traditional vector processor with a GPU's multithreaded SIMD processor.

Figure 19: Comparison of SIMD extensions (MMX, SSE, AVX) to GPU SIMT model.

Figure 20: Roofline performance comparison of a CPU with SIMD extensions and a
GPU.

Quick Summary

Feature          Vector            SIMD               GPU

ISA              Full vector       Packed             SIMT
Gather/Scatter   Yes               AVX-512 only       Yes
Masking          Predicate regs    AVX-512 k masks    Warp divergence
Strip-Mining     Compiler          Hidden             Grid/block sizes
Best For         Scientific apps   Media workloads    ML, DL, HPC

Key Takeaways:

• DLP is more scalable than ILP

• Vector machines offer clarity and flexibility

• SIMD extensions are pragmatic but constrained

• GPUs provide massive DLP, but require careful management of memory and
threads
