Chapter 6
GPU and CUDA programming
Dr. Aryaf Al-adwan
Autonomous Systems Department
Faculty of Artificial Intelligence
What is a GPU?
A GPU, or Graphics Processing Unit, is a specialized type of processor that is
designed to handle the complex calculations required for rendering images and
video. It is called a "graphics" processor because it was originally designed to
accelerate graphics rendering, but over time, the capabilities of GPUs have been
expanded to include other types of calculations, such as those needed for
machine learning and scientific simulations.
Computer Graphics
GPU VS CPU
Interconnection networks that can be used in a GPU
1. Bus-based interconnection networks: In this type of network, all the functional units of
the GPU are connected to a shared bus, which is used to transmit data between them.
2. Crossbar interconnection networks: In this type of network, each functional unit is
connected to a central crossbar switch, which routes data between the units.
3. Mesh interconnection networks: In this type of network, the functional units of the
GPU are organized into a two-dimensional mesh, with each unit connected to its
neighbors. Data is transmitted between units by sending it along the mesh.
4. Hierarchical interconnection networks: In this type of network, the functional units of
the GPU are organized into a hierarchy, with some units connected to others at
different levels. Data is transmitted between units by passing it up and down the
hierarchy.
GPU Architecture
• The main components of GPU architecture:
1. Cores: These are the computational units that perform the actual work of
executing instructions and performing calculations. The number and type of
cores can vary depending on the GPU and its intended use case.
2. Memory hierarchy: A GPU typically has multiple levels of memory, ranging
from the cache memory to the DRAM (Dynamic Random Access Memory).
3. Interconnects: These are the pathways that allow different parts of the GPU
to communicate with each other and with the rest of the system.
4. Input/output (I/O) interfaces: These are the interfaces that allow the GPU to
communicate with the rest of the computer system.
3D Block
Nvidia's GeForce RTX Graphics Cards
• The GeForce RTX 3090 is the most powerful GPU in the series, with 10496 CUDA cores and
24GB of GDDR6X memory. It is designed for use in high-end gaming PCs and workstations,
and is capable of delivering excellent performance in a wide range of graphics-intensive
tasks.
• RTX stands for Ray Tracing eXtreme
• The RTX 3070 is a more affordable option that is still capable of delivering excellent
performance in many games and applications. It has 5888 CUDA cores and 8GB of GDDR6
memory.
• The RTX 3060 is the most budget-friendly option in the series, with 3584 CUDA cores and
12GB of GDDR6 memory. It is designed for use in mid-range gaming PCs and is capable of
delivering good performance in many games and applications.
Block diagram of the GeForce RTX 3090
NVIDIA Ampere architecture
1. A large number of CUDA cores: The Ampere architecture features a large
number of CUDA cores, which are specialized processors that are
designed specifically for executing parallel computations.
2. RT cores: The Ampere architecture includes specialized RT cores that are
designed to accelerate ray tracing calculations. Ray tracing is a technique
for rendering realistic 3D graphics that involves simulating the path of
light as it bounces off of objects in a scene
3. Tensor cores: The Ampere architecture also includes specialized tensor
cores that are designed to accelerate machine learning workloads.
Tensor cores are capable of performing fast matrix multiplications, which
are a key building block of many machine learning algorithms.
Ampere is a microarchitecture developed by NVIDIA that is used in its GeForce line of GPUs.
PORTING TO CUDA
In the application code, the compute-intensive functions are offloaded to the GPU ("Use GPU to Parallelize"), while the rest of the sequential CPU code keeps running on the CPU.
SIMPLE PROCESSING FLOW
The CPU and GPU are connected by a PCIe or NVLink bus. The flow has three steps:
1. Copy input data from CPU memory to GPU memory.
2. Load the GPU program and execute it, caching data on chip for performance.
3. Copy results from GPU memory back to CPU memory.
CUDA Programming
• Compute Unified Device Architecture (CUDA) is a parallel computing
platform and application programming interface (API) created by Nvidia
in 2006, that gives direct access to the GPU’s virtual instruction set for
the execution of compute kernels.
• Kernels are functions that run on a GPU. When we launch a kernel, it is
executed as a set of Threads. Each thread is mapped to a single CUDA
core on a GPU and performs the same operation on a subset of data.
According to Flynn's taxonomy, it's a Single Instruction Multiple Data
(SIMD) computation.
Cont.
• You can use a number of programming languages to write
programs that will run on CUDA-enabled GPUs, including:
• C++
• Python
• Fortran
• C#
• Java
• Some of these languages have specialized libraries or interfaces
that are designed specifically for programming with CUDA, such
as PyCUDA for Python or JCuda for Java.
https://developer.nvidia.com/cuda-toolkit
To set up a CUDA program
1. Install a CUDA-enabled graphics card and the
CUDA Toolkit on your system.
2. Choose a programming language. You can write
CUDA programs in C, C++, or Python.
3. Choose an Integrated Development Environment
(IDE). There are several IDEs that you can use to
write and debug CUDA programs, such as
Microsoft Visual Studio, Eclipse, and Xcode.
CUDA toolkit
https://developer.nvidia.com/cuda-toolkit
https://www.youtube.com/watch?v=cL05xtTocmY
CUDA Keywords
• The CUDA programming model is a heterogeneous
model in which both the CPU and GPU are used.
• In CUDA, the host refers to the CPU and its memory,
while the device refers to the GPU and its memory.
• Code run on the host can manage memory on both the
host and device, and also launches kernels which are
functions executed on the device. These kernels are
executed by many GPU threads in parallel.
HETEROGENEOUS COMPUTING
Host: the CPU and its memory (host memory)
Device: the GPU and its memory (device memory)
CUDA Keywords and Functions
1. __global__: This keyword declares a global function, which is called from the CPU (host) and runs on the GPU. These functions are called "kernels," and they are the functions that execute the parallel work on the GPU.
2. __device__: This keyword declares a device function, which is called from the GPU and runs on the GPU.
3. __host__: This keyword declares a host function, which is executed on the CPU and can be called from the host (CPU) only.
4. __shared__: This keyword declares a variable in shared memory, which is a type of memory that is shared among the threads of a block.
5. __constant__: This keyword declares a variable in constant memory, which is a type of memory that is stored on the GPU and is accessed faster than global memory.
6. __syncthreads(): This function is used to synchronize the execution of threads within a thread block.
7. cudaMemcpy(): This function is used to copy data between the host and device memories.
8. cudaMalloc(): This function is used to allocate memory on the device.
9. cudaFree(): This function is used to free memory that was previously allocated on the device.
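To see how several of these keywords fit together, here is a minimal, hypothetical sketch (not taken from the lecture; the kernel and variable names are illustrative). It assumes a launch with 256 threads per block and that the constant variable scale has already been set from the host with cudaMemcpyToSymbol().

#include <cuda_runtime.h>

__constant__ float scale;                  // constant memory, set from the host

__device__ float square(float x)           // device function, callable from GPU code only
{
    return x * x;
}

__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float tile[256];            // shared among the threads of one block

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    tile[threadIdx.x] = (i < n) ? square(in[i]) * scale : 0.0f;

    __syncthreads();                       // wait until every thread has written its element

    // Tree reduction inside the block (assumes blockDim.x is a power of two, here 256)
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        out[blockIdx.x] = tile[0];         // one partial result per block
    }
}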
A typical sequence of operations for a
CUDA C program is:
1. Declare and allocate host and device memory.
2. Initialize host data.
3. Transfer data from the host to the device.
4. Execute one or more kernels.
5. Transfer results from the device to the host.
A quick comparison between CUDA and C
The __global__ specifier indicates a function that runs on the device (GPU).
Such a function can be called from host code, e.g. from the main() function in the example,
and is also known as a "kernel".
The triple angle brackets (<<<...>>>)
__global__ void mycode(float* data)
{
    // kernel code goes here
}

int main()
{
    float* data = nullptr;   // assume device memory allocated with cudaMalloc
    // launch the kernel with 256 threads per block, 1 block
    mycode<<<1, 256>>>(data);
    return 0;
}

In this example, the kernel function mycode is launched with a grid of 1 block and 256 threads per block. The total number of threads launched is 1 x 256 = 256.
Cont.
• CUDA makes four pieces of information available to
each thread:
1. The thread index (threadIdx)
2. The block index (blockIdx)
3. The size and shape of a block (blockDim)
4. The size and shape of a grid (gridDim)
How do the thread index and thread ID relate to each other in a 1-dimensional, 2-dimensional, and 3-dimensional block?
1-dimensional block: the thread index and the thread ID are the same.
2-dimensional block: the thread ID is computed from the thread index (x, y) and the block size (Dx, Dy) using the following formula:
thread ID = x + y * Dx
3-dimensional block: the thread ID is computed from the thread index (x, y, z) and the block size (Dx, Dy, Dz) using the following formula:
thread ID = x + y * Dx + z * Dx * Dy
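As an illustration, the same formulas can be computed inside a kernel from the built-in variables. The kernel below is a hypothetical sketch (the names are illustrative, not from the lecture) for a 2-dimensional grid of 2-dimensional blocks:

__global__ void indexDemo2D(float *data, int width, int height)
{
    // Global (x, y) coordinates of this thread within the whole grid
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Thread ID within its own block: x + y * Dx, where Dx = blockDim.x
    int threadIdInBlock = threadIdx.x + threadIdx.y * blockDim.x;

    // Flattened global index into a row-major 2-D array
    if (x < width && y < height) {
        data[y * width + x] = (float)threadIdInBlock;   // illustrative write
    }
}

// Possible launch: 16 x 16 threads per block, enough blocks to cover width x height
// dim3 block(16, 16);
// dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
// indexDemo2D<<<grid, block>>>(data, width, height);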
Thread Identification Example: x-direction
What is the global id for the third thread in the fourth block?
(With zero-based indexing, global id = blockIdx.x * blockDim.x + threadIdx.x; the answer depends on the block size shown in the figure.)
C++ program that utilizes CUDA for performing
vector addition on the GPU
1. This program performs vector addition of two input vectors "a" and "b" of
length N and stores the result in the output vector "c".
2. The kernel function "addVectors" is executed on the GPU using a large
number of threads, with each thread performing the addition on a different
element of the vectors.
#include <iostream>
#include <vector>
#include <cuda_runtime.h>

// Kernel: each thread adds one element of the input vectors
__global__ void addVectors(float *a, float *b, float *c, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int N = 100;

    // Initialize input vectors a and b on the host
    std::vector<float> h_a(N), h_b(N), h_c(N);
    for (int i = 0; i < N; i++) {
        h_a[i] = i;
        h_b[i] = 2 * i;
    }

    // Allocate device memory for a, b, and c
    float *a, *b, *c;
    cudaMalloc(&a, N * sizeof(float));
    cudaMalloc(&b, N * sizeof(float));
    cudaMalloc(&c, N * sizeof(float));

    // Copy input vectors from host to device memory
    cudaMemcpy(a, h_a.data(), N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(b, h_b.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    // Launch the kernel
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    addVectors<<<numBlocks, blockSize>>>(a, b, c, N);

    // Copy result vector from device to host memory
    cudaMemcpy(h_c.data(), c, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Print the result
    for (int i = 0; i < N; i++) {
        std::cout << h_c[i] << std::endl;
    }

    // Free device memory
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}
1. __global__ void addVectors: This is the kernel function that performs the vector addition. The __global__
keyword specifies that this function is executed on the GPU. The function takes four arguments: the input
vectors "a" and "b", the output vector "c", and the size "n" of the vectors. Inside the function, the thread
index and block index are used to calculate which element of the output vector "c" the thread computes.
2. cudaMalloc: This allocates memory on the GPU for the arrays "a", "b", and "c". The function takes the
address of a device pointer and the size of the memory to be allocated in bytes.
3. The loop initializes the input vectors on the host: "h_a" is initialized with values 0 to N-1, and "h_b" is
initialized with values 0 to 2*(N-1).
4. cudaMemcpy: This copies the input vectors from host memory ("h_a", "h_b") to GPU memory ("a", "b").
The function takes four arguments: the destination memory location, the source memory location, the size
of the memory to be copied in bytes, and the direction of the copy.
5. int numBlocks = (N + blockSize - 1) / blockSize;: This line calculates the number of blocks needed to
cover all elements of the input arrays.
addVectors<<<numBlocks, blockSize>>>(a, b, c, N);: This line launches the kernel. The triple angle
brackets <<< >>> are used to specify the grid and block dimensions. The first argument is the number of
blocks, and the second argument is the number of threads per block.
6. cudaMemcpy(h_c.data(), c, N * sizeof(float), cudaMemcpyDeviceToHost);
This line copies the result array from device memory back to host memory.
7. for (int i = 0; i < N; i++) { std::cout << h_c[i] << std::endl; }
This loop prints the results on the host.
8. cudaFree(a); cudaFree(b); cudaFree(c);
These lines free the memory allocated for the input and output arrays on the GPU.
Neural Networks Are Embarrassingly Parallel
The term "embarrassingly parallel" is often used to describe problems or
computations that can be easily divided into smaller independent parts that can
be executed simultaneously, with little or no communication between the parts.
In the context of neural networks, it typically refers to the process of training a
model on multiple machines, or on multiple cores of a single machine.
An "embarrasingly parallel" problem can be easily parallelized without the need
for complex algorithms, and no communication is required during the
parallelization process.
Parallel Neural Network
• To accelerate the DNN training process on multiple GPUs.
• Two types of parallelization strategies:
1. data parallelism
2. model parallelism
In data parallelism (shown in Figure 1(a)), the input data is split into
mini-batches and the mini-batches are distributed across GPUs.
http://xzt102.github.io/publications/2021_WWW.pdf
Parameter averaging is the simplest approach to data parallelism. The core idea is to split the training
data between the available workers, each of which holds a copy of the same model and trains it
independently. After they have trained for a controlled amount of time, the models are collected and their
weights are averaged across the workers.
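As a rough sketch of the averaging step only (a hypothetical kernel, not code from the lecture; it assumes the weights of all workers have already been gathered into one device buffer, stored worker-major), each GPU thread can average one parameter across the workers:

// Each thread averages one parameter across numWorkers model copies.
__global__ void averageParameters(const float *workerWeights, float *avgWeights,
                                  int numParams, int numWorkers)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < numParams) {
        float sum = 0.0f;
        for (int w = 0; w < numWorkers; w++) {
            sum += workerWeights[w * numParams + i];   // parameter i of worker w
        }
        avgWeights[i] = sum / numWorkers;              // averaged weight
    }
}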
Convolution Example
• For each position on the blue input channel, the 3 x 3
filter does a computation that maps the shaded part of
the blue input channel to the corresponding shaded
part of the green output channel.
• In the animation, these computations are happening
sequentially one after the other. However, each
computation is independent from the others, meaning
that none of the computations depend on the results of
any of the other computations.
• As a result of this, all of these independent
computations can happen in parallel on a GPU and the
overall output channel can be produced.
• This allows us to see that the convolution operation can
be accelerated by using a parallel programming
approach and GPUs.
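As a rough illustration of this idea, the sketch below is a hypothetical CUDA kernel (not code from the lecture) in which each thread computes one output pixel of a 3 x 3 convolution independently of all the others:

// input and output are width x height, row-major; filter is 3 x 3; zero padding at the borders
__global__ void conv3x3(const float *input, const float *filter, float *output,
                        int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int fy = -1; fy <= 1; fy++) {
        for (int fx = -1; fx <= 1; fx++) {
            int ix = x + fx;
            int iy = y + fy;
            if (ix >= 0 && ix < width && iy >= 0 && iy < height) {
                sum += input[iy * width + ix] * filter[(fy + 1) * 3 + (fx + 1)];
            }
        }
    }
    output[y * width + x] = sum;   // independent of every other thread's result
}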
PyTorch Comes With CUDA
The previous code moves the tensor t from the CPU to the GPU (specifically, to the device
at index 0 of the GPU, e.g. t = t.to('cuda:0')), which allows for faster computations on the
tensor. But this alone is not enough for parallelism in a neural network.
To achieve parallelism, you would need to program the neural network to run
on multiple GPUs, using a library such as
1. torch.nn.DataParallel or
2. torch.nn.parallel.DistributedDataParallel.
import torch
import torch.nn as nn
from torch.nn.parallel import DataParallel

# Define a simple neural network
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc = nn.Linear(10, 10)

    def forward(self, x):
        x = self.fc(x)
        return x

torch.set_num_threads(4)

# Create an instance of the network
net = SimpleNet()

# Wrap the network in DataParallel
net = DataParallel(net)

# Generate some random input data
inputs = torch.randn(64, 10)

# Perform a forward pass
outputs = net(inputs)

This code applies data parallelism. torch.nn.DataParallel is a wrapper class that is used to parallelize the forward pass of a model across multiple GPUs on the current machine. It works by splitting the input data across the available GPUs, running a copy of the model on each GPU, and then concatenating the results before returning them.
Thank You!