KEMBAR78
Threads | PDF | Thread (Computing) | Pointer (Computer Programming)
0% found this document useful (0 votes)
32 views54 pages

Threads

The document discusses GPU programming and CUDA threads. It describes how CUDA kernels are executed by arrays of parallel threads and how threads are organized into blocks and grids. The document also covers CUDA memory models and how to allocate and transfer data between host and device memories.

Uploaded by

Cosmic02
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
32 views54 pages

Threads

The document discusses GPU programming and CUDA threads. It describes how CUDA kernels are executed by arrays of parallel threads and how threads are organized into blocks and grids. The document also covers CUDA memory models and how to allocate and transfer data between host and device memories.

Uploaded by

Cosmic02
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 54

CSE 591/CSE 392: GPU Programming

Threads

Klaus Mueller

Computer Science Department


Stony Brook University

Som e portions © David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009


ECE 498AL, University of Illinois, Urbana-Cham paign
CUDA Threads
• Differences between GPU and CPU threads
– GPU threads are extremely lightweight
• Very little creation overhead
– GPU needs 1000s of threads for full efficiency
• Multi-core CPU needs only a few

© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 2


ECE 498AL, University of Illinois, Urbana-Cham paign
Arrays of Parallel Threads
• A CUDA kernel is executed by an array of
threads
– All threads run the same code (SPMD)‫‏‬
– Each thread has an ID that it uses to compute
memory addresses and make control decisions
threadID 0 1 2 3 4 5 6 7


float x = input[threadID];
float y = func(x);
output[threadID] = y;

© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 3


ECE 498AL, University of Illinois, Urbana-Cham paign
Thread Blocks: Scalable Cooperation
• Divide monolithic thread array into multiple blocks
– Threads within a block cooperate via shared memory,
atomic operations and barrier synchronization
– Threads in different blocks cannot cooperate

Thread Block 0 Thread Block 0 Thread Block N - 1


threadID 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

… … …
float x = float x = float x =
input[threadID];
float y = func(x);
output[threadID] = y;
input[threadID];
float y = func(x);
output[threadID] = y;
… input[threadID];
float y = func(x);
output[threadID] = y;
… … …

© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 4


ECE 498AL, University of Illinois, Urbana-Cham paign
Block IDs and Thread IDs
Host Device
• Each thread uses IDs to
Grid 1
decide what data to work on
Kernel
– Block ID: 1D or 2D 1
Block
(0, 0)
Block
(1, 0)
– Thread ID: 1D, 2D, or 3D
Block Block
(0, 1) (1, 1)

• Simplifies memory Grid 2

addressing when Kernel

processing 2
Block (1, 1)
multidimensional data (0,0,1) (1,0,1) (2,0,1) (3,0,1)

– Image processing
Thread Thread Thread Thread
– Solving PDEs on volumes (0,0,0) (1,0,0) (2,0,0) (3,0,0)

– … Thread Thread Thread Thread


(0,1,0) (1,1,0) (2,1,0) (3,1,0)

Courtesy: NDVIA
© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 5
Figure 3.2. An Example of CUDA Thread Org
ECE 498AL, University of Illinois, Urbana-Cham paign
CUDA Memory Model Overview
• Global memory
– Main means of
communicating R/W
Grid
Data between host and
device Block (0, 0)‫‏‬ Block (1, 0)‫‏‬

– Contents visible to all Shared Memory Shared Memory


threads
Registers Registers Registers Registers
– Long latency access
• We will focus on Thread (0, 0)‫ ‏‬Thread (1, 0)‫‏‬ Thread (0, 0)‫ ‏‬Thread (1, 0)‫‏‬

global memory for


now Host Global Memory

– Constant and texture


memory will come later
© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 6
ECE 498AL, University of Illinois, Urbana-Cham paign
CUDA Device Memory Allocation
• cudaMalloc()
– Allocates object in the Grid

device Global Memory Block (0, 0)‫‏‬ Block (1, 0)‫‏‬

– Requires two parameters Shared Memory Shared Memory

• Address of a pointer to Registers Registers Registers Registers

the allocated object


Thread (0, 0)‫ ‏‬Thread (1, 0)‫‏‬ Thread (0, 0)‫ ‏‬Thread (1, 0)‫‏‬

• Size of of allocated object


• cudaFree() Host Global
Memory

– Frees object from device


Global Memory
• Pointer to freed object
© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 7
ECE 498AL, University of Illinois, Urbana-Cham paign
CUDA Device Memory Allocation (cont.)‫‏‬
• Code example:
– Allocate a 64 * 64 single precision float array
– Attach the allocated storage to Md
– “d”‫‏‬is‫‏‬often‫‏‬used‫‏‬to‫‏‬indicate‫‏‬a‫‏‬device‫‏‬data‫‏‬
structure
TILE_WIDTH = 64;
Float* Md
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

cudaMalloc((void**)&Md, size);
cudaFree(Md);
© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 8
ECE 498AL, University of Illinois, Urbana-Cham paign
CUDA Host-Device Data Transfer
• cudaMemcpy()‫‏‬
– memory data transfer Grid

– Requires four parameters Block (0, 0)‫‏‬ Block (1, 0)‫‏‬

• Pointer to destination Shared Memory Shared Memory

• Pointer to source
Registers Registers Registers Registers
• Number of bytes copied
• Type of transfer Thread (0, 0)‫ ‏‬Thread (1, 0)‫‏‬ Thread (0, 0)‫ ‏‬Thread (1, 0)‫‏‬

– Host to Host
– Host to Device Host Global
Memory
– Device to Host
– Device to Device
• Asynchronous transfer
© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 9
ECE 498AL, University of Illinois, Urbana-Cham paign
CUDA Host-Device Data Transfer
(cont.)
• Code example:
– Transfer a 64 * 64 single precision float array
– M is in host memory and Md is in device memory
– cudaMemcpyHostToDevice and
cudaMemcpyDeviceToHost are symbolic constants

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);

© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 10


ECE 498AL, University of Illinois, Urbana-Cham paign
CUDA Keywords

© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 11


ECE 498AL, University of Illinois, Urbana-Cham paign
CUDA Function Declarations
Executed Only callable
on the: from the:
__device__ float DeviceFunc()‫‏‬ device device
__global__ void KernelFunc()‫‏‬ device host
__host__ float HostFunc()‫‏‬ host host

• __global__ defines a kernel function


– Must return void
• __device__ and __host__ can be used
together
© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007- 12
2009
ECE 498AL, University of Illinois, Urbana-Cham paign
CUDA Function Declarations (cont.)‫‏‬

• __device__ functions cannot have their


address taken
• For functions executed on the device:
– No recursion
– No static variable declarations inside the
function
– No variable number of arguments

© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 13


ECE 498AL, University of Illinois, Urbana-Cham paign
Calling a Kernel Function – Thread
Creation
• A kernel function must be called with an
execution configuration:
__global__ void KernelFunc(...);
dim3 DimGrid(100, 50); // 5000 thread blocks
dim3 DimBlock(4, 8, 8); // 256 threads per
block
size_t SharedMemBytes = 64; // 64 bytes of shared
memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes
>>>(...);
• Any call to a kernel function is asynchronous from
CUDA 1.0 on, explicit synch needed for blocking
© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 14
ECE 498AL, University of Illinois, Urbana-Cham paign
A Simple Running Example
Matrix Multiplication
• A simple matrix multiplication example that
illustrates the basic features of memory and
thread management in CUDA programs
– Leave shared memory usage until later
– Local, register usage
– Thread ID usage
– Memory data transfer API between host and device
– Assume square matrix for simplicity

© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 15


ECE 498AL, University of Illinois, Urbana-Cham paign
Programming Model:
Square Matrix Multiplication Example
• P = M * N of size WIDTH x WIDTH N

• Without tiling:

WIDTH
– One thread calculates one element
of P
– M and N are loaded WIDTH times
from global memory
M P

WIDTH
WIDTH WIDTH
© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 16
ECE498AL, University of Illinois, Urbana-Cham paign
Memory Layout of a Matrix in C
M 0,0 M 1,0 M 2,0 M 3,0

M 0,1 M 1,1 M 2,1 M 3,1

M 0,2 M 1,2 M 2,2 M 3,2

M 0,3 M 1,3 M 2,3 M 3,3

M 0,0 M 1,0 M 2,0 M 3,0 M 0,1 M 1,1 M 2,1 M 3,1 M 0,2 M 1,2 M 2,2 M 3,2 M 0,3 M 1,3 M 2,3 M 3,3

© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 17


ECE498AL, University of Illinois, Urbana-Cham paign
Step 1: Matrix Multiplication
A Simple Host Version in C
/ / Matrix m u ltiplication on the (CPU) host in d ou ble
N
precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)‫‏‬ k
{
j

WIDTH
for (int i = 0; i < Width; ++i)‫‏‬
for (int j = 0; j < Width; ++j) {
double sum = 0;
for (int k = 0; k < Width; ++k) {
double a = M[i * width + k];
double b = N[k * widthM+ j]; P

sum += a * b; i
}

WIDTH
P[i * Width + j] = sum;
}
} k
WIDTH WIDTH
© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 18
ECE498AL, University of Illinois, Urbana-Cham paign
Step 2: Input Matrix Data Transfer
(Host-side Code)‫‏‬
void MatrixMulOnDevice(float* M, float* N, float* P, int Width)‫‏‬
{
int size = Width * Width * sizeof(float);
float* Md , N d , Pd ;

1. // Allocate and Load M, N to device memory
cudaMalloc(&Md, size);
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

cudaMalloc(&Nd, size);
cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

// Allocate P on the device


cudaMalloc(&Pd, size);
© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 19
ECE498AL, University of Illinois, Urbana-Cham paign
Step 3: Output Matrix Data Transfer
(Host-side Code)‫‏‬

2. / / Kernel invocation cod e – to be show n later


3. / / Read P from the d evice


cudaMemcpy(P, Pd, size, cudaMemcpyD eviceToHost);

/ / Free d evice m atrices


cu d aFree(Md ); cu d aFree(N d ); cu d aFree (Pd );
}

© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 20


ECE498AL, University of Illinois, Urbana-Cham paign
Step 4: Kernel Function

// Matrix multiplication kernel – per thread code

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)‫‏‬
{

// Pvalue is used to store the element of the matrix


// that is computed by the thread
float Pvalue = 0;

© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 21


ECE498AL, University of Illinois, Urbana-Cham paign
Step 4: Kernel Function (cont.)‫‏‬
for (int k = 0; k < Width; ++k)‫{‏‬ Nd

float Melement = Md[threadIdx.y*Width+k];


float Nelement = Nd[k*Width+threadIdx.x]; k

WIDTH
Pvalue += Melement * Nelement;
} tx

Pd[threadIdx.y*Width+threadIdx.x] = Pvalue;
} Md Pd

ty ty

WIDTH
k tx

© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 WIDTH WIDTH 22


ECE498AL, University of Illinois, Urbana-Cham paign
Step 5: Kernel Invocation
(Host-side Code)

// Setup the execution configuration


dim3 dimGrid(1, 1);
dim3 dimBlock(Width, Width);

// Launch the device computation threads!


MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 23


ECE498AL, University of Illinois, Urbana-Cham paign
Only One Thread Block Used
Grid 1 Nd
• One Block of threads computes Block 1
2
matrix Pd
4
– Each thread computes one
element of Pd Thread
(2, 2)‫‏‬
2
• Each thread 6
– Loads a row of matrix Md
– Loads a column of matrix Nd
– Performs one multiply and
addition for each pair of Md
and Nd elements
– Compute to off-chip memory 3 2 5 4 48
access ratio close to 1:1 (not
very high)‫‏‬
• Size of matrix limited by the WIDTH
number of threads allowed in a
thread block Pd
Md
© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 24
ECE498AL, University of Illinois, Urbana-Cham paign
Next: Handling Arbitrary Sized Square
Matrices
• Have each 2D thread block to Nd

compute a (TILE_WIDTH)2 sub-


matrix (tile) of the result matrix

WIDTH
– Each has (TILE_WIDTH)2 threads
• Generate a 2D Grid of
(WIDTH/TILE_WIDTH)2 blocks Pd
Md
You still need to pu t a loop by
arou nd the kernel call for TILE_WIDTH
cases w here ty

WIDTH
WIDTH / TILE_WIDTH is
greater than m ax grid size bx tx
(64K)!
© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 WIDTH WIDTH 25
ECE498AL, University of Illinois, Urbana-Cham paign
bx
0 1 2
Matrix Multiplication Using
tx
Multiple Blocks 0 1 2 TILE_WIDTH-1

Nd
• Break-up Pd into tiles
• Each block calculates one

WIDTH
tile
– Each thread calculates one
element
– Block size equal tile size
Md Pd

TILE_WIDTHE
0 Pdsub

WIDTH
1
2
by ty
1
TILE_WIDTH-1
TILE_WIDTH

2 WIDTH WIDTH

© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 26


ECE498AL, University of Illinois, Urbana-Cham paign
A Small Example

Block(0,0) Block(1,0)

P 0,0 P 1,0 P 2,0 P 3,0 TILE_WIDTH = 2


P 0,1 P 1,1 P 2,1 P 3,1

P 0,2 P 1,2 P 2,2 P 3,2

P 0,3 P 1,3 P 2,3 P 3,3

Block(0,1) Block(1,1)

© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 27


ECE498AL, University of Illinois, Urbana-Cham paign
A Small Example: Multiplication
N d 0,0N d 1,0

N d 0,1N d 1,1

N d 0,2N d 1,2

N d 0,3N d 1,3

Md 0,0Md 1,0Md 2,0Md 3,0 Pd 0,0 Pd 1,0 Pd 2,0 Pd 3,0

Md 0,1Md 1,1Md 2,1Md 3,1 Pd 0,1 Pd 1,1 Pd 2,1 Pd 3,1

Pd 0,2 Pd 1,2 Pd 2,2 Pd 3,2

Pd 0,3 Pd 1,3 Pd 2,3 Pd 3,3

© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 28


ECE498AL, University of Illinois, Urbana-Cham paign
Revised Matrix Multiplication
Kernel using Multiple Blocks
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
// Calculate the row index of the Pd element and M
int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
// Calculate the column idenx of Pd and N
int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

float Pvalue = 0;
// each thread computes one element of the block sub-matrix
for (int k = 0; k < Width; ++k)
Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];

Pd[Row*Width+Col] = Pvalue;
}
© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 29
ECE498AL, University of Illinois, Urbana-Cham paign
CUDA Thread Block
• All threads in a block execute the same
kernel program (SPMD)
• Programmer declares block: CUDA Thread Block
– Block size 1 to 512 concurrent threads
– Block shape 1D, 2D, or 3D Thread Id #:
– Block dimensions in threads 0123… m
• Threads have thread id numbers within block
– Thread program uses thread id to select
work and address shared data Thread program

• Threads in the same block share data and


synchronize while doing their share of the
work
• Threads in different blocks cannot cooperate
Cou rtesy: John N ickolls,
– Each block can execute in any order relative N VIDIA
to other blocs!

© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 30


ECE498AL, University of Illinois, Urbana-Cham paign
Transparent Scalability
• Hardware is free to assign blocks to any processor at any time, given
the resources
– A kernel scales across any number of parallel processors
– When less resources are available, hardware will reduce the
number of blocks run in parallel (compare right with left block
assignment below)
Device Kernel grid
Device
Block 0 Block 1

Block 2 Block 3

Block 0 Block 1 Block 4 Block 5


Block 0 Block 1 Block 2 Block 3
Block 6 Block 7
Block 2 Block 3
tim e
Block 4 Block 5 Block 6 Block 7

Block 4 Block 5
Each block can execu te in any ord er relative
Block 6 Block 7 to other blocks.
© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 31
ECE498AL, University of Illinois, Urbana-Cham paign
G80 Example: Executing Thread Blocks
t0 t1 t2 … tm SM 0 SM 1 t0 t1 t2 … tm

MT IU MT IU
Blocks
SP SP

Blocks • Threads are assigned to Streaming


Multiprocessors in block granularity
– Up to 8 blocks to each SM as
Shared Shared resource allows
Memory Memory
– SM in G80 can take up to 768 threads
• Could be 256 (threads/block) * 3
blocks
• Or 128 (threads/block) * 6 blocks, etc.
• Threads run concurrently
– SM maintains thread/block id #s
– SM manages/schedules thread
© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 execution 32
ECE498AL, University of Illinois, Urbana-Cham paign
G80 Example: Thread Scheduling
• Each Block is executed as …
Block 1 Warps
…Block 2 Warps …
Block 1 Warps

32-thread Warps t0 t1 t2 … t31 t0 t1 t2 … t31 t0 t1 t2 … t31

– An implementation decision, … … …
not part of the CUDA
programming model
– Warps are scheduling units Streaming Multiprocessor
in SM Instruction L1

• If 3 blocks are assigned to an Instruction Fetch/Dispatch

SM and each block has 256 Shared Memory

threads, how many Warps are SP SP


there in an SM? SP SP
– Each Block is divided into SP
SFU
SP
SFU

256/32 = 8 Warps
SP SP
– There are 8 * 3 = 24 Warps

© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 33


ECE498AL, University of Illinois, Urbana-Cham paign
G80 Example: Thread Scheduling
(Cont.)

• SM implements zero-overhead warp scheduling


– Effectively provides for latency hiding (memory waits, etc.)
– At any time, only one of the warps is executed by SM
– Warps whose next instruction has its operands ready for
consumption are eligible for execution
– Eligible Warps are selected for execution on a prioritized
scheduling policy
– All threads in a warp execute the same instruction when selected

TB1, W1 stall
TB2, W1 stall TB3, W2 stall

TB1 TB2 TB3 TB3 TB2 TB1 TB1 TB1 TB3


W1 W1 W1 W2 W1 W1 W2 W3 W2
Instruction: 1 2 3 4 5 6 1 2 1 2 1 2 3 4 7 8 1 2 1 2 3 4

Time TB = Thread Block, W = Warp

© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 34


ECE498AL, University of Illinois, Urbana-Cham paign
G80 Block Granularity Considerations
• For Matrix Multiplication using multiple blocks, should I
use 8X8, 16X16 or 32X32 blocks?

– For 8X8, we have 64 threads per Block. Since each SM can take
up to 768 threads, there are 12 Blocks. However, each SM can
only take up to 8 Blocks, only 512 threads will go into each SM!
This will lead to under-utilization (bad for latency hiding).

– For 16X16, we have 256 threads per Block. Since each SM can
take up to 768 threads, it can take up to 3 Blocks and achieve full
capacity unless other resource considerations overrule.

– For 32X32, we have 1024 threads per Block. Not even one can fit
into an SM!
© David Kirk/ N VIDIA and Wen -m ei W. H w u, 2007-2009 35
ECE498AL, University of Illinois, Urbana-Cham paign
Number of Threads
• All threads on an SM must run in lock-step
– if one thread is delayed because of memory load then
all threads in a warp must wait
– a new warp is scheduled
– so it is good to have more threads per block
– however, there is a limit on the number of threads per
SM
– 768, 1024, 1536, 2048 depending on compute
capability
– this is a function of the maximum number of warps
(24, 32, 48, 64) of 32 threads each
Number of Blocks
• Workload of threads is not always uniform
– all threads must complete before a new block can be
scheduled
– if the slow thread is part of a large block the idle time
is high
– so it is better to have smaller blocks
– however, there is a limit on the number of blocks per
SM (8 or less, depending on compute capability)
GPU Utilization
• Goal is to allocate as many threads per SM as
maximum limit
• Here take into account:
– max number of blocks
– modulus of warps
• So to achieve 100 % utilization depends on
compute capability and threads/block
GPU Utilization

• So 256 threads per block is safest across all


compute capabilities

Shane‫‏‬Cook‫“‏‬CUDA‫‏‬Program m ing”‫‏‬
Practical Example
• Histogram computation

for (unsigned int i=0; i< max; i++)


{
bin[array[i]]++;
}
CPU Algorithm
1. Read the value from the array into a register
2. Work out the base address and offset to the correct bin
element
3. Fetch the existing bin value
4. Increment the bin value by one
5. Write the new bin value back to the bin in memory

Steps 3, 4, 5 are not atomic


- OK for CPU since serial but not for GPU
- use atomicAdd(&value) on the GPU
GPU Algorithm 1

Shane‫‏‬Cook‫“‏‬CUDA‫‏‬Program m ing”‫‏‬
GPU Algorithm 1
• Not overly fast
• Why?
– each thread only fetches 1 byte
– half warp fetches 16 bytes
– maximal supported size is 128 bytes
– hence memory bandwidth is heavily underused
GPU Algorithm 2
GPU Algorithm 2
• In fact, achieves zero speedup
– no improvements in memory bandwidth
– advanced compute capability already does good
coalescing
– need to look for other culprit
GPU Algorithm 2
• In fact, achieves zero speedup
– no improvements in memory bandwidth
– advanced compute capability already does good
coalescing
– need to look for other culprit

– maybe reduce the amount of global number of atomic


writes?
GPU Algorithm 3

Shane‫‏‬Cook‫“‏‬CUDA‫‏‬Program m ing”‫‏‬
GPU Algorithm 3

Shane‫‏‬Cook‫“‏‬CUDA‫‏‬Program m ing”‫‏‬
GPU Algorithm 3
• results in a 6 fold speedup
• but could still reduce global memory traffic
• what can we do?
GPU Algorithm 3
• results in a 6 fold speedup
• but could still reduce global memory traffic
• what can we do?

• compute more than one histogram per thread


GPU Algorithm 3

Shane‫‏‬Cook‫“‏‬CUDA‫‏‬Program m ing”‫‏‬
GPU Algorithm 3

Shane‫‏‬Cook‫“‏‬CUDA‫‏‬Program m ing”‫‏‬
GPU Algorithm 3

Shane‫‏‬Cook‫“‏‬CUDA‫‏‬Program m ing”‫‏‬
GPU Algorithm 3

not m uch grow th in band w id th after N =32 d ue to other factors im ped ing grow th
(atom ics ad d s in this case)
Shane‫‏‬Cook‫“‏‬CUDA‫‏‬Program m ing”‫‏‬

You might also like