Gpu Computing

Uploaded by

Sudha kumar

0 GPU Computing - I give it

computer science engg (Anna University)

Downloaded by Sudha kumar (sudhakumar.k1996@gmail.com)

IF4093 GPU COMPUTING                                   L T P C
                                                       3 0 0 3
COURSE OBJECTIVES:
• To understand the basics of GPU architectures
• To understand CPU GPU Program Partitioning
• To write programs for massively parallel processors
• To understand the issues in mapping algorithms for GPUs
• To introduce different GPU programming models

UNIT I GPU ARCHITECTURE 9


Evolution of GPU architectures - Understanding Parallelism with GPU –Typical GPU Architecture -
CUDA Hardware Overview - Threads, Blocks, Grids, Warps, Scheduling - Memory Handling with
CUDA: Shared Memory, Global Memory, Constant Memory and Texture Memory.

UNIT II CUDA PROGRAMMING 9


Using CUDA - Multi GPU - Multi GPU Solutions - Optimizing CUDA Applications: Problem
Decomposition, Memory Considerations, Transfers, Thread Usage, Resource Contentions.

UNIT III PROGRAMMING ISSUES 9


Common Problems: CUDA Error Handling, Parallel Programming Issues, Synchronization,
Algorithmic Issues, Finding and Avoiding Errors.

UNIT IV OPENCL BASICS 9


OpenCL Standard – Kernels – Host Device Interaction – Execution Environment – Memory Model
– Basic OpenCL Examples.

UNIT V ALGORITHMS ON GPU 9


Parallel Patterns: Convolution, Prefix Sum, Sparse Matrix - Matrix Multiplication - Programming
Heterogeneous Cluster.

SUGGESTED ACTIVITIES:
1. Debugging Lab
2. Performance Lab
3. Launching Nsight
4. Running Performance Analysis
5. Understanding Metrics
6. NVIDIA Visual Profiler
7. Matrix Transpose Optimization
8. Reduction Optimization

UNIT 1 – GPU ARCHITECTURE


GPU:
A graphics processing unit (GPU) is a specialized processor originally designed
to accelerate graphics rendering. GPUs can process many pieces of data
simultaneously, making them useful for machine learning, video editing, and
gaming applications.
GPU Architecture:
A GPU is a processor made up of many smaller, more specialized cores.
Working together, the cores deliver massive performance when a processing
task can be divided up and processed across many cores.
There are two types of GPU:
• Integrated GPU: built into the same chip or package as the CPU and sharing
system memory with it.
• Discrete GPU: a separate chip, usually on its own card, with its own
dedicated memory.
Evolution of GPU architectures:
Modern GPU architecture is very efficient at manipulating graphics as well
as image processing. Its highly parallel structure makes it more effective than
general-purpose CPU (Central Processing Unit) architecture for algorithms
that process large blocks of data in parallel.

History of the GPU


1996: 3DFX Voodoo graphics card implements texture mapping, z-buffering,
and rasterization, but no vertex processing
1999: GPUs implement the full graphics pipeline in fixed-function
hardware (Nvidia GeForce 256, ATI Radeon 7500)
2001: Programmable shader pipelines (Nvidia GeForce 3)
2006: Unified shader architecture (ATI Radeon R600, Nvidia GeForce 8,
Intel GMA X3000, ATI Xenos for Xbox 360)
2010: General-purpose GPUs for non-graphical compute-intensive
applications, Nvidia CUDA parallel programming API
2014: Unprecedented compute power
    Nvidia GeForce GTX Titan Z - 8.2 TFLOPS
    AMD Radeon R9 295X2 (dual GPU card) - 11.5 TFLOPS

▪ Visible GPU memory types
    ▪ Registers (per thread)
    ▪ Local memory (per thread)
    ▪ Shared memory (per block), a software-controlled cache
    ▪ Global memory (per kernel)
    ▪ Constant memory (read-only)

Functional block diagram

A single GPU device consists of multiple Processor Clusters (PC), each of which
contains multiple Streaming Multiprocessors (SM). Each SM has a level-1 (L1)
instruction cache alongside its associated cores. Typically, an SM uses its
dedicated L1 cache and a shared level-2 (L2) cache before pulling data from
global GDDR5 (or GDDR6 in newer GPU models) memory. The architecture is
tolerant of memory latency.

Compared to a CPU, a GPU works with fewer, and relatively small, memory
cache layers. The reason is that a GPU dedicates more of its transistors to
computation, so it depends less on how long it takes to retrieve data from
memory. The potential memory access ‘latency’ is masked as long as the GPU
has enough computations at hand to keep it busy.

Modern GPU Hardware

CUDA:
CUDA stands for Compute Unified Device Architecture. CUDA is not a
programming language in itself: it is a parallel computing platform and an API
(Application Programming Interface) model, developed by Nvidia, that lets
programs written in languages such as C and C++ run code on the Graphics
Processing Unit (GPU). This allows computations to be performed in parallel,
which can give large speedups for suitable workloads.

COMPONENTS:
The CUDA Toolkit includes libraries, debugging and optimization tools, a
compiler, documentation, and a runtime library to deploy your applications. It
has components that support deep learning, linear algebra, signal processing,
and parallel algorithms.

CUDA Architecture:

Thread Organization

In the CUDA processing paradigm (as well as other paradigms similar to
stream processing) there is a notion of a ‘kernel’. A kernel is essentially a
mini-program or subroutine. Kernels are the parallel programs to be run on the
device (the NVIDIA graphics card inside the host system). A number of primitive
‘threads’ will simultaneously execute a kernel program. Batches of these
primitive threads are organized into ‘thread blocks’. A thread block contains a
specific number of primitive threads, chosen based on the amount of available
shared memory, as well as the memory access latency hiding characteristics
desired. The number of threads in a thread block is also limited by the
architecture: 512 threads per block on the earliest CUDA devices (compute
capability 1.x) and 1024 on later ones. The threads within a thread block can
communicate efficiently using the shared memory scoped to that thread block.
Using this shared memory, all threads within a thread block can also
synchronize with one another. Every thread within a thread block has its own
thread ID. Thread blocks are conceptually organized into 1D, 2D or 3D arrays of
threads for convenience.

A ‘grid’ is a collection of thread blocks of the same thread dimensionality
which all execute the same kernel. Grids are useful for computing a large
number of threads in parallel, since a single thread block is physically limited
in size (512 or 1024 threads per block, depending on the device). However,
thread blocks within a grid cannot communicate via shared memory, and
consequently, in the basic programming model, cannot synchronize with one
another.

The above diagram demonstrates the thread hierarchy described. Here, a
given kernel contains a 3x2 grid of thread blocks. Each thread block is a 4x3
block of threads, yielding a total of 72 threads executing said kernel.
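
The 3x2 grid of 4x3 blocks can be reproduced with a short kernel launch. The sketch below is only illustrative: the kernel name countThreads and the choice to treat "3x2" as 3 blocks in x and 2 in y are assumptions, and it uses managed memory (cudaMallocManaged) to keep the host code short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread bumps a global counter once, so the final value equals the
// total number of threads that executed the kernel.
__global__ void countThreads(unsigned int *counter)
{
    // Flatten the 2D block index and the 2D thread index into one global ID.
    int blockId  = blockIdx.y * gridDim.x + blockIdx.x;
    int threadId = threadIdx.y * blockDim.x + threadIdx.x;
    int globalId = blockId * (blockDim.x * blockDim.y) + threadId;

    atomicAdd(counter, 1u);

    if (globalId == 0)
        printf("grid %dx%d, block %dx%d\n",
               gridDim.x, gridDim.y, blockDim.x, blockDim.y);
}

int main()
{
    unsigned int *counter;
    cudaMallocManaged(&counter, sizeof(unsigned int));
    *counter = 0;

    dim3 grid(3, 2);   // 3x2 grid of thread blocks (assumed x-by-y order)
    dim3 block(4, 3);  // 4x3 threads in each block
    countThreads<<<grid, block>>>(counter);
    cudaDeviceSynchronize();   // wait for the kernel before reading the counter

    printf("threads executed: %u\n", *counter);  // 3 * 2 * 4 * 3 = 72
    cudaFree(counter);
    return 0;
}
```

The built-in variables gridDim, blockDim, blockIdx and threadIdx are what give every thread its own ID inside the hierarchy.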

Memory Hierarchy

There are several levels of memory on the GPU device, each with distinct
read and write characteristics. Every primitive thread has access to private
‘local’ memory as well as registers. The name ‘local’ memory is a misnomer:
the memory is private to the thread, but it is not stored close to the thread’s
registers; it resides off-chip in the global GDDR memory available on the
graphics card. Every thread in a thread block also has access to a unified ‘shared
memory’, shared among all threads for the life of that thread block. Finally, all
threads have read/write access to ‘global memory’, which is located off-chip in
the main GDDR memory module; it therefore has the largest capacity but is
the most costly to interact with. There are also read-only ‘constant’ and
‘texture’ memories, in the same location as the global memory.

The global, constant and texture memory are optimized for different memory
usage models. On the early devices described here, global memory is not cached
(later architectures add L1/L2 caching), though memory transactions may
be ‘coalesced’ to reduce the cost of the high memory access latency. These
coalescence rules and behaviors are dependent on the particular device used.
The read-only constant memory resides in the same location as global memory,
but this memory may be cached. On a cache hit, regardless of the number of
threads reading, the access time is that of a register access for each address
being read. The read-only texture memory also resides in the same location as
global memory, and is also cached. Texture memory differs from constant memory
in that its caching policy specifically exploits 2D spatial locality. This is due
to the use of ‘textures’ in 3D graphics: the 2D images used to ‘texture’ the
surface of 3D polygons are read frequently and benefit from being cached
spatially.

The above diagram shows the scope of each of the memory segments in the
CUDA memory hierarchy. Registers and local memory are unique to a thread,
shared memory is unique to a block, and global, constant, and texture memories
exist across all blocks.
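
As a rough illustration of these scopes, the sketch below touches registers, shared, constant and global memory in one small kernel. The names scaleAndReverse, scale and tile, and the problem size N, are made up for this example; texture memory is left out to keep it short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 256                        // illustrative size: one block of 256 threads

__constant__ float scale;            // constant memory: read-only inside kernels

__global__ void scaleAndReverse(const float *in, float *out)
{
    __shared__ float tile[N];        // shared memory: one copy per thread block

    int i = threadIdx.x;             // 'i' and 'v' are per-thread (registers)
    float v = in[i] * scale;         // reads global memory and constant memory
    tile[i] = v;
    __syncthreads();                 // make every thread's write visible to the block

    out[i] = tile[N - 1 - i];        // read a value written by another thread
}

int main()
{
    float h_in[N], h_out[N], h_scale = 2.0f;
    for (int i = 0; i < N; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;             // buffers in global memory
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(scale, &h_scale, sizeof(float));

    scaleAndReverse<<<1, N>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);

    printf("out[0] = %.1f\n", h_out[0]);   // 2 * 255 = 510.0
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Only the shared array lets threads exchange data cheaply, and only within one block, which matches the scoping described above.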

Multiprocessors

The first CUDA-capable GPUs were built on the “Tesla” architecture, which is
the generation described here; later architectures such as Fermi and Kepler
follow the same programming model. CUDA applications may be run on any card
which supports CUDA, but each GPU device may have different specifications, and
therefore a slightly different set of supported features and a different number
of available computational resources. When a kernel is invoked, each thread
block executes on a ‘multiprocessor’. This multiprocessor contains the resources
to support a certain number of threads. Specifically, each Tesla-generation
multiprocessor consists of:

• 8 Scalar Processor cores
• 2 special function units for transcendentals
• 1 multithreaded instruction unit
• On-chip shared memory

One or more thread blocks are assigned to a multiprocessor during the
execution of a kernel. The CUDA runtime handles the dynamic scheduling of
thread blocks on a group of multiprocessors. The scheduler will only assign a
thread block to a multiprocessor when enough resources are available to support
the thread block. Each block is split into SIMD (Single-Instruction
Multiple-Data) groups of threads called ‘warps’. The multiprocessor creates,
manages, schedules and executes threads in warps of 32 threads. The threads of
a warp execute in lockstep, and therefore care must be taken to ensure that
certain threads within a warp do not take much longer than other threads in
that same warp, because the warp will only execute as fast as its slowest
thread. Such differences typically arise when threads of one warp follow
different branches of the code, which is known as warp divergence. There are a
number of programming hints provided in the CUDA programming guide to help
prevent warp divergence.
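
To make warp divergence concrete, the two kernels below branch in different ways. This is a sketch under assumptions: the kernel names are invented, the two kernels deliberately do not produce the same output pattern, and for branches this small the compiler may use predication, so any timing difference could be negligible in practice.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Branch on the thread index parity: even and odd threads of the same warp
// take different paths, so the warp has to run both paths one after the other.
__global__ void divergent(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) out[i] = 1;
    else                      out[i] = 2;
}

// Branch on the warp index: all 32 threads of a warp agree on the condition,
// so no warp needs to execute both paths.
__global__ void warpUniform(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = threadIdx.x / warpSize;
    if (warpId % 2 == 0) out[i] = 1;
    else                 out[i] = 2;
}

int main()
{
    const int n = 128;               // 4 warps in a single block
    int *out;
    cudaMallocManaged(&out, n * sizeof(int));

    divergent<<<1, n>>>(out);
    cudaDeviceSynchronize();
    printf("divergent:   out[0]=%d out[1]=%d\n", out[0], out[1]);

    warpUniform<<<1, n>>>(out);
    cudaDeviceSynchronize();
    printf("warpUniform: out[0]=%d out[32]=%d\n", out[0], out[32]);

    cudaFree(out);
    return 0;
}
```

Arranging work so that a branch condition is uniform across each warp is one of the usual ways to avoid divergence.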

UNIT II

Multi GPU:

Multi GPU (multiple graphics processing units) means using two or more graphics
cards in the same PC, originally to support faster animation in video games.
NVIDIA's Scalable Link Interface (SLI) and ATI's CrossFire are examples.

Multi GPU Deep Learning Strategies


Once multiple GPUs are added to your systems, you need to build parallelism
into your deep learning processes. There are two main methods to add
parallelism: model parallelism and data parallelism.

Model parallelism
Model parallelism is a method you can use when your model's parameters are too
large to fit within your memory constraints. Using this method, you split your
model training processes across multiple GPUs and perform each process in
parallel or in series. Model parallelism uses the same dataset for each portion
of your model and requires synchronizing data between the splits.

Data parallelism
Data parallelism is a method that uses duplicates of your model across GPUs.
This method is useful when the batch size used by your model is too large to fit
on a single machine, or when you want to speed up the training process. With
data parallelism, each copy of your model is trained on a subset of your dataset
simultaneously. Once done, the results of the models are combined and training
continues as normal.
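
Outside any deep learning framework, the basic idea of data parallelism (every GPU runs the same code on its own slice of the data) can be sketched directly in CUDA. The kernel name scale, the array size and the scaling factor are assumptions for illustration; no gradients or model replicas are involved here.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount == 0) { printf("no CUDA device found\n"); return 1; }

    const int n = 1 << 20;
    std::vector<float> host(n, 1.0f);
    std::vector<float *> dev(deviceCount, nullptr);
    std::vector<int> count(deviceCount, 0);
    const int chunk = (n + deviceCount - 1) / deviceCount;

    // Give each GPU its own contiguous slice and start the same kernel on it.
    for (int d = 0; d < deviceCount; ++d) {
        int begin = d * chunk;
        count[d] = (begin + chunk <= n) ? chunk : n - begin;
        if (count[d] <= 0) { count[d] = 0; continue; }
        cudaSetDevice(d);
        cudaMalloc(&dev[d], count[d] * sizeof(float));
        cudaMemcpy(dev[d], host.data() + begin, count[d] * sizeof(float),
                   cudaMemcpyHostToDevice);
        scale<<<(count[d] + 255) / 256, 256>>>(dev[d], count[d], 2.0f);
    }

    // Collect the partial results from every GPU back into one host array.
    for (int d = 0; d < deviceCount; ++d) {
        if (count[d] == 0) continue;
        cudaSetDevice(d);
        cudaMemcpy(host.data() + d * chunk, dev[d], count[d] * sizeof(float),
                   cudaMemcpyDeviceToHost);
        cudaFree(dev[d]);
    }

    printf("host[0] = %.1f, host[n-1] = %.1f\n", host[0], host[n - 1]);
    return 0;
}
```

In framework-level data parallelism the "combine" step averages gradients instead of simply copying slices back, but the split-compute-combine shape is the same.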

How Does Multi GPU Work in Common Deep Learning Frameworks?

TensorFlow Multiple GPU


TensorFlow is an open source framework, created by Google, that you can use
to perform machine learning operations. The library includes a variety of
machine learning and deep learning algorithms and models that you can use as a
base for your training. It also includes built-in methods for distributed training
using GPUs.
Through the tf.distribute.Strategy API, you can distribute your operations
across GPUs, TPUs or machines. The API is designed to support multiple user
segments and to make switching between distribution strategies easy.
Two strategies built on this API are MirroredStrategy and TPUStrategy. Both of
these enable you to distribute your workloads, the former across multiple GPUs
and the latter across multiple Tensor Processing Units (TPUs). TPUs are units
available through Google Cloud Platform that are specifically optimized for
training with TensorFlow.
Both of these strategies use roughly the same data-parallel process, summarized
as follows:

• Your dataset is segmented so data is distributed as evenly as possible.
• Replicas of your model are created and each is assigned to a GPU. Then, a
subset of the dataset is assigned to each replica.
• The subset for each GPU is processed and gradients are produced.
• The gradients from all model replicas are averaged and the result is used
to update the original model.
• The process repeats until your model is fully trained.

PyTorch Multi GPU
PyTorch is an open source scientific computing framework based on Python.
You can use it to train machine learning models using tensor computations and
GPUs. This framework supports distributed training through the
torch.distributed package.
With PyTorch, there are three parallelism (or distribution) options that you can
use with GPUs. These include:

• DataParallel: enables you to distribute model replicas across multiple
GPUs in a single machine. You can then use these models to process
different subsets of your data set.
• DistributedDataParallel: extends the DataParallel approach to enable you
to distribute model replicas across machines in addition to GPUs. You
can also use this class in combination with model_parallel to perform
both model and data parallelism.
• model_parallel: enables you to split large models across multiple GPUs
with partial training happening on each. This requires syncing training
data between the GPUs since operations are performed sequentially.

Multi GPU Deployment Models


There are three main deployment models you can use when implementing
machine learning operations that use multiple GPUs. The model you use
depends on where your resources are hosted and the size of your operations.

GPU Server
GPU servers are servers that incorporate GPUs in combination with one or more
CPUs. When workloads are assigned to these servers, the CPUs act as a central
management hub for the GPUs, distributing tasks and collecting outputs as
available.

GPU Cluster
GPU clusters are computing clusters with nodes that contain one or more GPUs.
These clusters can be formed from duplicates of the same GPU (homogeneous)
or from different GPUs (heterogeneous). Each node in a cluster is connected via
an interconnect to enable the transmission of data.

Kubernetes with GPUs
Kubernetes is an open source platform you can use to orchestrate and automate
container deployments. This platform offers support for the use of GPUs in
clusters to enable workload acceleration, including for deep learning.

When using GPUs with Kubernetes, you can deploy heterogeneous clusters and
specify your resources, such as memory requirements. You can also monitor
these clusters to ensure reliable performance and optimize GPU utilization.

Multi GPU With Run:AI


Run:AI automates resource management and workload orchestration for
machine learning infrastructure. With Run:AI, you can automatically run as
many deep learning experiments as needed on multi-GPU infrastructure.
Here are some of the capabilities you gain when using Run:AI:

• Advanced visibility: create an efficient pipeline of resource sharing by
pooling GPU compute resources.
• No more bottlenecks: you can set up guaranteed quotas of GPU
resources, to avoid bottlenecks and optimize billing.
• A higher level of control: Run:AI enables you to dynamically change
resource allocation, ensuring each job gets the resources it needs at any
given time.

Run:AI simplifies machine learning infrastructure pipelines, helping data
scientists accelerate their productivity and the quality of their models.

CUDA APPLICATIONS:
1. Computational finance
2. Climate, weather, and ocean modeling
3. Data science and analytics
4. Deep learning and machine learning
5. Defense and intelligence
6. Manufacturing/AEC (Architecture, Engineering, and Construction): CAD
and CAE (including computational fluid dynamics, computational
structural mechanics, design and visualization, and electronic design
automation)
7. Media and entertainment (including animation, modeling, and rendering;
color correction and grain management; compositing; finishing and
effects; editing; encoding and digital distribution; on-air graphics; on-set,
review, and stereo tools; and weather graphics)
8. Medical imaging
9. Oil and gas
10. Research: Higher education and supercomputing (including
computational chemistry and biology, numerical analytics, physics, and
scientific visualization)
11. Safety and security
12. Tools and management
CUDA PROGRAMMING:

To understand CUDA programming, consider this simple C/C++ routine to add
two arrays:

void add(int n, float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}
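
One possible way to turn this routine into a CUDA program is sketched below. It is not the only approach: it assumes managed (unified) memory via cudaMallocManaged and a grid-stride loop, and the sizes and launch configuration are arbitrary choices for the example.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// The same element-wise addition, written as a kernel. Each thread starts at
// its own global index and strides by the total number of threads in the grid.
__global__ void add(int n, const float *x, float *y)
{
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));   // accessible from CPU and GPU
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;  // enough blocks to cover n
    add<<<blocks, threads>>>(n, x, y);
    cudaDeviceSynchronize();                    // wait before reading y on the CPU

    float maxError = 0.0f;
    for (int i = 0; i < n; ++i)
        maxError = fmaxf(maxError, fabsf(y[i] - 3.0f));
    printf("max error: %f\n", maxError);

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

The loop body is unchanged from the C version; what changes is that the iterations are divided among the threads of the grid.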

UNIT III

Common Problems/Challenges of CUDA

While the benefit of CUDA is clear (faster parallel programming), there are
several challenges for high performance computing (HPC) applications using
CUDA.

Lack of Abstraction

CUDA generally exposes quite a bit about the way that the hardware works to
the programmer. In this regard, it is often compared to assembly language for
parallel programming. Explicit concepts such as thread arrays map directly to
the way the hardware groups computational units and local memory.

The memory model requires explicit data movement between the host processor
and the device. CUDA makes explicit use of a hierarchy of memory address
spaces, each of which obeys different sharing rules. This is more low-level
detail than the typical application or scientific programmer has to deal with
when programming in C, C++, or Fortran. It also raises significant concerns
with regard to portability and maintainability of code.

Data vs. Task Parallelism

The second challenge is that the programming model for CUDA is one of data
parallelism rather than task parallelism. When divvying up work across the
nodes of a cluster, HPC programmers are used to looking for and exploiting
parallelism at a certain level. The amount of data to assign to each node depends
on a number of factors including computational speed of the node, the available
memory, the bandwidth and latency of the interconnect, and the frequency with
which results need to be shared with other processes.

Since processors are fast and network bandwidth is relatively scarce, the
balance is typically to put quite a bit of data on each compute node and to move
data as infrequently as possible. CUDA invites the programmer to think about
parallelism of a completely different order, encouraging the developer to break
the problem apart into units that are much smaller. This is often many orders of
magnitude more parallelism than was previously expressed in the code and
requires reasoning somewhat differently about the computations themselves.

Data Movement and Multiple Address Spaces

NVIDIA GPUs are external accelerators attached to the host system via the PCI
Express (PCIe) bus. Each GPU has its own onboard memory and also features
smaller amounts of memory attached to each of the compute elements. While there
are now mechanisms to address regions of host memory from device kernels, such
access is slower than accessing device memory.

CUDA programs therefore use a model in which code running on the host
processor prepares and explicitly dispatches work to the GPU, waits for the
GPU to complete that work, then reads the resulting data back from the device.
Both the units of code representing computational kernels and the associated
data on which these computational kernels will execute are dispatched to the
device. The data is moved, in whole or in part, to the device over the PCIe bus
for execution. As results are produced, they need to be moved, just as
explicitly, back from the device to the host computer’s main memory.

Understanding how Code is Running on the GPU

GPUs are complex in how they run code and manage data. Understanding how
your code runs across cores and streaming multiprocessors (SMs), and how the
tens to thousands of threads are organized into blocks, lanes, and warps, can
be confusing.
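
A practical first step is to ask the device what it actually provides. The sketch below uses the CUDA runtime calls cudaGetDeviceCount and cudaGetDeviceProperties; the selection of fields to print is just one reasonable choice.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("no CUDA device found\n");
        return 1;
    }

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);

        printf("device %d: %s\n", d, prop.name);
        printf("  compute capability:      %d.%d\n", prop.major, prop.minor);
        printf("  multiprocessors (SMs):   %d\n", prop.multiProcessorCount);
        printf("  warp size:               %d\n", prop.warpSize);
        printf("  max threads per block:   %d\n", prop.maxThreadsPerBlock);
        printf("  shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("  global memory:           %zu bytes\n", prop.totalGlobalMem);
    }
    return 0;
}
```

These numbers set the limits within which block sizes and shared memory usage have to be chosen for a given card.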

Parallel Programming with CUDA in C++ (Part 1)

Learn how to speed up compute-intensive applications with the power of
modern GPUs

The most common deep learning frameworks, such as TensorFlow and PyTorch,
often rely on kernel calls in order to use the GPU for parallel computations and
accelerate the computation of neural networks. The most famous interface that
allows developers to program using the GPU is CUDA, created by NVIDIA.

Parallel computing requires a completely different point of view from ordinary
programming, but before you start getting your hands dirty, there are
terminologies and concepts to learn.

Background

In the following figure, you see a classic set-up in which we have an input that is
processed by the CPU one instruction at a time to generate an output. But how
do we process multiple instructions at the same time? This is what we will try to
understand in this article.

[Figure: an input processed by the CPU one instruction at a time to generate an output]

Terminology

• Process: an instance of a computer program that is being executed.
A process runs on the CPU and is allocated memory in RAM.

Process (Image by Author)

• Context: the collection of data describing a process (memory addresses,
program state). It allows the processor to suspend the execution of a
process and resume it later.

• Thread: a component of a process. Every process has at least one
thread, called the main thread, which is the entry point of the program. A
thread executes a sequence of instructions. Within a process, multiple
threads can co-exist and share the memory allocated to that particular
process. Between processes there is, by default, no memory sharing.

• Round Robin: on a single-core processor, processes are executed under a
round-robin schedule: each process runs in turn for a short time slice,
with the highest-priority process executed first. Switching from the
execution of one process to another is called context switching.

Parallelism

In modern computing we do not have to rely on context switching alone: we can
run different threads on different cores (for now think of a core as a small
processor), which is why we have multi-core devices!

Multiple Cores (Image by Author)

But remember that in almost every process there are some instructions which
should be performed sequentially and some others that can be computed
simultaneously in parallel.

When you talk about parallelism you should remember that there are 2 types of
parallelism:

• Task Level: different tasks on the same or different data

• Data Level: same task on different data

At this point, be careful not to confuse parallelism with concurrency.

• Concurrency: a single processor switches rapidly between processes,
executing only one of them at any instant, so we just have the illusion of
parallelism because the processor is really fast.

• Parallelism: true parallelism with multiple processors.

CPU, GPU and GPGPU

Graphical processing units (GPUs) can perform a large amount of work in a short
period. Their power comes from the quantity of operations executed
simultaneously, but only as long as those operations remain simple and basically
similar. The game industry was the launch market for GPUs; Nvidia later opened
them up to general computation through the CUDA platform. The popularity of
GPUs has increased even more since then, especially among developers, who are
now able to run many computations in parallel using a few lines of code.

CUDA allows us to use parallel computing for so-called general-purpose
computing on graphics processing units (GPGPU), i.e. using GPUs for more
general purposes besides 3D graphics.

Let’s summarize some basic differences between CPUs and GPUs.

GPU:

• low clock speed
• thousands of cores
• context switching is done by hardware (really fast)
• can switch between threads if one thread stalls

CPU:

• high clock speed
• few cores
• context switching is done by software (slow)
• switching between threads is not that easy

Basic Steps of Cuda Programming

Before writing code that uses parallel programming, we must first know the
structure of CUDA-based code. There are a few simple steps to follow:

• Initialization of data on CPU
• Transfer data from CPU to GPU
• Kernel launch (instructions on GPU)
• Transfer results back to CPU from GPU
• Reclaim the memory from both CPU and GPU

In such an environment we will call Host Code the code that is going to run on
the CPU and Device Code the code that is going to run on the GPU.
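
The five steps can be sketched as a complete program. The kernel name square, the array size and the launch configuration are illustrative assumptions; error checking is omitted here for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Device code: runs on the GPU, one thread per array element.
__global__ void square(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

// Host code: runs on the CPU and drives the five steps.
int main()
{
    const int n = 1024;
    const size_t bytes = n * sizeof(float);

    // 1. Initialization of data on the CPU
    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    // 2. Transfer data from CPU to GPU
    float *d_data;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // 3. Kernel launch (instructions on the GPU)
    square<<<(n + 255) / 256, 256>>>(d_data, n);

    // 4. Transfer results back to the CPU (this copy waits for the kernel)
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    printf("h_data[3] = %.1f\n", h_data[3]);   // 3 squared = 9.0

    // 5. Reclaim the memory on both GPU and CPU
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```

The h_ and d_ prefixes are a common naming habit for telling host pointers from device pointers; they have no meaning to the compiler.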

CUDA Error Types


CUDA errors can be separated into synchronous and asynchronous errors, or
sticky and non-sticky errors.

Synchronous Error VS Asynchronous Error

A CUDA kernel launch is asynchronous: when the host thread reaches the code
for a kernel launch, say kernel<<<...>>>, it issues a request to execute the
kernel on the GPU and then continues without waiting for the kernel to
complete. The kernel might not begin to execute right away either.

A CUDA kernel launch can therefore produce two types of error: synchronous
errors and asynchronous errors.

A synchronous error happens when the host thread can tell that the kernel
launch is illegal or invalid. For example, when the thread block size or grid
size is too large, a synchronous error is raised immediately after the kernel
launch call, and this error can be captured by CUDA runtime error capturing
API calls, such as cudaGetLastError, right after the kernel launch call.

An asynchronous error happens on the GPU during kernel execution or during the
execution of an asynchronous CUDA runtime API call. It might take a while for
the error to occur and to be reported to the host thread. For example, a
kernel, or an asynchronous CUDA runtime API call such as cudaMemcpyAsync, might
access an invalid memory address late in its execution; the GPU then aborts the
execution and sends the error back to the host thread. Even if there are CUDA
runtime error capturing API calls, such as cudaGetLastError, right after the
kernel launch call, by the time the error reaches the host those calls have
already been executed and found no error. It is possible to capture the
asynchronous error by explicitly synchronizing with the kernel using CUDA
runtime API calls such as cudaDeviceSynchronize, cudaStreamSynchronize, or
cudaEventSynchronize, and then checking the error returned by those calls or
capturing the error using CUDA runtime error capturing API calls, such as
cudaGetLastError. However, explicit synchronization usually affects
performance and is therefore not recommended in production unless it is
strictly necessary.
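
The difference between the two kinds of error can be observed with a small experiment. This is a sketch under assumptions: the kernel writeTo is invented, 4096 threads per block is assumed to exceed the device limit (it does on devices limited to 1024), and the exact error strings depend on the CUDA version, so none are quoted here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void writeTo(int *p) { *p = 42; }

int main()
{
    int *good;
    cudaMalloc(&good, sizeof(int));

    // Synchronous error: the block size is larger than the device allows,
    // so the launch is rejected and the error is visible immediately.
    writeTo<<<1, 4096>>>(good);
    cudaError_t err = cudaGetLastError();   // also resets the stored error
    printf("after invalid launch:  %s\n", cudaGetErrorString(err));

    // Asynchronous error: the launch itself is valid, but the kernel
    // dereferences a null pointer while it runs on the GPU.
    writeTo<<<1, 1>>>(nullptr);
    err = cudaGetLastError();               // typically still reports success
    printf("right after launch:    %s\n", cudaGetErrorString(err));

    err = cudaDeviceSynchronize();          // waits for the kernel to finish
    printf("after synchronization: %s\n", cudaGetErrorString(err));

    // The context is unusable after the illegal access, so the program
    // simply exits here instead of calling cudaFree.
    return 0;
}
```

The second launch shows why a check placed right after the launch is not enough to catch errors that only appear once the kernel is running.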

Sticky VS Non-Sticky Error


CUDA runtime API calls return non-sticky errors, if any, whereas errors raised
during CUDA kernel execution are sticky.

A non-sticky error is recoverable, meaning subsequent CUDA runtime API
calls can behave normally; the CUDA context is not corrupted. For
example, when we allocate memory using cudaMalloc, it will return a non-sticky
error if the GPU memory is insufficient.

A sticky error is not recoverable, meaning subsequent CUDA runtime API calls
will always return the same error. The CUDA context is corrupted and stays
corrupted until the application host process is terminated. For example, when a
kernel tries to access an invalid memory address during execution, the result
is a sticky error which will be captured and returned by all subsequent CUDA
runtime API calls.
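
The recoverable case is easy to try out. In the sketch below the requested size (2^60 bytes) is an assumption chosen to be far beyond the memory of any current GPU, and a 64-bit build is assumed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    void *huge = nullptr, *small = nullptr;

    // Ask for far more memory than the device has: the call fails,
    // but the CUDA context stays usable.
    cudaError_t err = cudaMalloc(&huge, (size_t)1 << 60);
    printf("huge allocation:  %s\n", cudaGetErrorString(err));

    // cudaGetLastError returns the stored error and resets it to cudaSuccess.
    cudaGetLastError();

    // A later, reasonable request behaves normally again.
    err = cudaMalloc(&small, 1024);
    printf("small allocation: %s\n", cudaGetErrorString(err));

    cudaFree(small);
    return 0;
}
```

After a sticky error, by contrast, the second allocation would keep returning the earlier error until the process is restarted.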

CUDA Error Checking Best Practice


In a CUDA program, in both development and production code, always check the
return value of every CUDA runtime API call, synchronous or asynchronous, to
see if there is any CUDA synchronous error, and always run a CUDA runtime error
capturing API call, such as cudaGetLastError, after each kernel launch to see
if there is any CUDA synchronous error. Check for CUDA asynchronous errors in
development by synchronizing and checking for errors after kernel launch calls,
and disable this check in production.

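#include <stdio.h>
#include <cuda_runtime.h>

// Handle errors in CUDA: print a message if a runtime call failed.
void handleCudaError(cudaError_t cudaERR){
    if (cudaERR != cudaSuccess){
        printf("CUDA ERROR : %s\n", cudaGetErrorString(cudaERR));
    }
}

// Example usage:
// handleCudaError(cudaMalloc((void**)&dev_a, size));

// A kernel launch does not return an error code itself.
// Detect launch errors using the following code:
__global__ void add(const int *a, const int *b, int *c){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

int main(void){
    const int Nblocks = 4, NthreadsPerBlock = 64;
    const size_t size = Nblocks * NthreadsPerBlock * sizeof(int);
    int *dev_a, *dev_b, *dev_c;
    handleCudaError(cudaMalloc((void**)&dev_a, size));
    handleCudaError(cudaMalloc((void**)&dev_b, size));
    handleCudaError(cudaMalloc((void**)&dev_c, size));
    handleCudaError(cudaMemset(dev_a, 0, size));
    handleCudaError(cudaMemset(dev_b, 0, size));

    add<<< Nblocks, NthreadsPerBlock >>>( dev_a, dev_b, dev_c );
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess){
        printf("CUDA ERROR while executing the kernel: %s\n",
               cudaGetErrorString(err));
        return 103;
    }
    // Errors raised while the kernel runs only show up after synchronizing.
    handleCudaError(cudaDeviceSynchronize());

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}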

III. OVERVIEW OF SYNCHRONIZATION METHODS IN


NVIDIA GPUS
A. Primitive Synchronization Methods in Nvidia GPUs
Starting from CUDA 9.0, Nvidia added the Cooperative Groups (CG) feature. It is
intended to allow scalable cooperation among groups of threads and to provide
flexible parallel decomposition. Coalesced groups and tile groups can be used
to decompose thread blocks. Beyond the level of thread blocks, grid
synchronization is proposed for inter-block synchronization, and multi-grid
synchronization is proposed for inter-GPU synchronization.
In the current version of CUDA (10.0), tile groups and coalesced groups only
work correctly inside a warp. Analysis of the PTX code shows that the
synchronization calls of both group types are transformed to the warp.sync
instruction. Hence, as it stands, we consider the synchronization capability of
those methods to be applicable only at the warp level.

Figure 2 shows the granularity of cooperative groups and synchronization in the
current version of CUDA.
1) Warp-level Synchronization (Synchronization Inside a Single GPU):
Current CUDA supports two intra-warp synchronization methods, tile
synchronization and coalesced group synchronization, corresponding respectively
to the tile group and the coalesced group in Figure 2. Previous versions of
CUDA guarantee that all threads inside a warp process the same instruction at a
time. Yet the introduction of synchronization methods inside a warp, plus the
fact that each thread now has its own Program Counter (PC), implies that this
guarantee may be removed in the future.
2) Block-level Synchronization (Synchronization Inside a Single GPU):
Block-level synchronization corresponds to the thread block in the
programming model. According to CUDA’s programming guide [1], its function is
the same as that of the classical synchronization primitive __syncthreads().
3) Grid-level Synchronization (Single GPU Synchronization): Starting from
CUDA 9.0, Nvidia introduced the grid group for grid-level synchronization.
Grid-level synchronization is a method to synchronize a single GPU. To use a
grid group, the kernel must be launched with the cudaLaunchCooperativeKernel()
API call instead of the traditional kernel launch (<<<>>>).
4) Multi-Grid Level Synchronization (Multi-GPU Synchronization): CUDA
9.0 also introduced the concept of the multi-grid group. This group is
initialized by a kernel launch API: cudaLaunchCooperativeKernelMultiDevice().
Synchronizing this group can act as a way to do multi-GPU synchronization in a
single node.

B. Non-primitive Synchronization
1) Software Barrier for Synchronization: Li et al. [16] researched fine-grained
synchronization. Beyond it, Xiao et al. [5] introduced a software device-level
synchronization. The authors limit the number of blocks per SM to one in order
to avoid deadlocks. Sorensen et al. extended this work by adding an automatic
occupancy discovery protocol to discover active warps [4].
2) Implicit Barrier for Synchronization: Before the introduction of grid-level
synchronization, the typical way to introduce a device-wide barrier to a
program was to use several kernels in a single CUDA stream. A stream is a
logical queue that enforces an execution order on the CUDA kernels in the
stream, i.e. the kernels and data movement commands are executed in the order
in which they appear in the stream. For example, many DL frameworks, e.g.,
Chainer [3], use this method to enforce execution order.
3) Multi-GPU Synchronization: The common way to do multi-GPU
synchronization is to synchronize CPU threads orchestrating the GPUs. The
basic idea is to use one CPU thread per device (or one MPI rank per device).
Additionally, with the help of the GPUDirect CUDA technology, it is also
possible to implement multi-GPU software barriers using GPUDirect APIs.

Since this paper is concerned with general and intrinsic barrier methods, we do
not discuss manually implemented barriers, including software barriers and
GPUDirect-based manual implementations.

UNIT IV
OpenCL:

OpenCL provides a common language, programming interfaces, and hardware
abstractions enabling developers to accelerate applications with task-parallel or
data-parallel computations in a heterogeneous computing environment
consisting of the host CPU and any attached OpenCL “devices”. OpenCL
devices may or may not share memory with the host CPU, and typically have a
different machine instruction set, so the OpenCL programming interfaces
assume heterogeneity between the host and all attached devices. The key
programming interfaces provided by OpenCL include functions for enumerating
available target devices (CPUs, GPUs, and Accelerators of various types),
managing “contexts” containing the devices to be used, managing memory
allocations, performing host-device memory transfers, compiling OpenCL
programs and “kernel” functions to be executed on target devices, launching
kernels on target devices, querying execution progress, and checking for errors.

Although OpenCL programs can be compiled and linked into binary objects
using conventional off-line compilation methodology, OpenCL also supports
run-time compilation enabling OpenCL programs to run natively on the target
hardware, even on platforms unavailable to the original software developer.
Run-time compilation eliminates dependencies on instruction sets, allowing
hardware vendors to make significant changes to instruction sets, drivers, and
supporting libraries, from one hardware generation to the next. Applications that
make use of the run-time compilation features of OpenCL will automatically
take advantage of the latest hardware and software features of the target device
without any need for recompilation of the main application itself.

OpenCL targets a broad range of microprocessor designs, requiring that it
support a multiplicity of programming idioms that are matched to the target
architectures. Although OpenCL provides guarantees of portability and
correctness of kernels across a variety of hardware, it does not guarantee that a
particular kernel will achieve peak performance on different architectures. The
nature of the underlying hardware may make some programming strategies
more appropriate for particular platforms than for others. As an example, a
GPU-optimized kernel may achieve peak memory performance when the work-items
in a single work-group collectively perform loads and stores, whereas a
Cell-optimized kernel may perform better with the use of a double buffering
strategy combined with calls to async_work_group_copy(). Applications select
the most appropriate kernel for the target devices by querying the capabilities
and hardware attributes of the installed devices at runtime.

The OpenCL programming model abstracts CPUs, GPUs, and other accelerators
as “devices” that contain one or more “compute units” (e.g., cores) composed of
one or more SIMD “processing elements” (PEs) that execute instructions in
lock-step. OpenCL defines four types of memory systems that devices may
incorporate: a large high-latency “global” memory, a small low-latency read-only
“constant” memory, a shared “local” memory accessible from multiple PEs
within the same compute unit, and a “private” memory or device registers
accessible within each PE. Local memory may be implemented using either
high-latency global memory or fast on-chip SRAM or a shared register file.
Applications can query device attributes to determine the properties of the
available compute units and memory systems, and use them accordingly.

Before an application can compile OpenCL programs, allocate device memory,
or launch kernels, it must first create a “context” associated with one or more
devices. Memory allocations are associated with a context rather than a specific
device. Devices with inadequate memory capacity should be excluded when
creating a context, otherwise the maximum memory allocation will be limited
by the least-capable device. Similarly, it may be necessary to exclude some
devices from a context in the case that they do not support features required by
OpenCL programs to be run on the newly created context. Once a context is
created, OpenCL programs can be compiled at runtime by passing the source
code to OpenCL compilation functions as arrays of strings. After an OpenCL
program is compiled, handles can be obtained for specific “kernels” contained
by the program. The “kernel” functions can then be launched on devices within
the OpenCL context. OpenCL host-device memory I/O operations and kernels
are executed on a target device by enqueuing them into a command queue
associated with the target device.

OpenCL and Modern Processor Architectures

State-of-the-art microprocessors contain a number of architectural features that
have historically been poorly supported or are difficult to utilize in existing
programming languages. This has led vendors to create their own programming
tools, language extensions, vector intrinsics, and subroutine libraries to close the
gap in programmability created by these hardware features. To help clarify the
relationship between the OpenCL programming model and the diversity of potential
target hardware, we compare the architectural characteristics of three exemplary
microprocessor families and relate them to key OpenCL abstractions and
features of the OpenCL programming model.

Multi-core CPUs

Modern CPUs are typically composed of a small number of high-frequency
processor cores with advanced features such as out-of-order execution and
branch prediction. CPUs are generalists that perform well for a wide variety of
applications including latency-sensitive sequential workloads, and coarse-
grained task-parallel or data-parallel workloads. Since they are typically used
for latency sensitive workloads with minimal parallelism, CPUs make extensive
use of large caches to hide main memory latency. Many CPUs also incorporate
small scale use of single-instruction multiple-data (SIMD) arithmetic units to
boost the performance of dense arithmetic and multimedia workloads. These
SIMD units are not directly exposed by conventional programming languages
like C and Fortran, so their use requires calling vectorized subroutine libraries
or proprietary vector intrinsic functions, or trial-and-error source-level
restructuring combined with autovectorizing compilers. AMD, Apple, and IBM provide
OpenCL implementations that target multi-core CPUs, and support the use of
SIMD instruction set extensions such as x86 SSE and Power/VMX. The current
CPU implementations for x86 processors often make best use of SSE when
OpenCL kernels are written with explicit use of float4 types. CPU
implementations often map all memory spaces onto the same hardware cache,
so a kernel that makes explicit use of constant and local memory spaces may
actually incur more overhead than a simple kernel that only uses global memory
references.

The Cell Processor

The Cell Broadband Engine Architecture (CBEA) is a heterogeneous chip
architecture consisting of one 64-bit Power-compliant Processor Element (PPE),
multiple Synergistic Processor Elements (SPEs), a Memory Interface Controller,
and I/O units, connected by an internal high-speed bus [9]. The PPE is a
general-purpose processor based on the IBM Power architecture, designed to run
a conventional operating system and control-intensive code that coordinates
tasks running on the SPEs. The SPE, a SIMD streaming processor, provides most
of the computing power of Cell systems, with a design optimized for massive
data processing. An application’s task parallelism can be realized using
multiple SPEs, while data parallelism and instruction parallelism can be
realized using the SIMD instructions and dual execution pipelines in the SPEs.
Each SPE has a small, software-managed, cache-like fast memory called the local
store. Applications can load data from system memory to the local store, or the
other way around, using DMA requests, with the best bandwidth achieved when
both source and destination are aligned to 128 bytes. Data transfer and
instruction execution can happen simultaneously, enabling application
programmers to hide memory latency using techniques like double buffering. Shi
et al. have provided a detailed description of the architecture and a sample
application ported to the Cell processor [1].

IBM has released an OpenCL toolkit supporting both the Cell and Power
processors on the Linux platform. The IBM OpenCL implementation supports
the embedded profile for the Cell SPUs, and uses software techniques to smooth
over some of the architectural differences between the Cell SPUs and
conventional CPUs. On the Cell processor, global memory accesses perform
best when operands are a multiple of 16 bytes, e.g. an OpenCL float4 type. The
use of larger vector types such as float16 enables the compiler to unroll loops,
further increasing performance. The 256 kB Cell SPU local store is shared
among the program text and the OpenCL “local” and “private” variables. This
places practical limits on the size of work-groups, since private data storage
is required for each work-item. The Cell DMA engine performs most effectively
with the use of double buffering strategies combined with calls to
async_work_group_copy() to load data from global memory into local store.

Graphics Processing Units

Contemporary GPUs are composed of hundreds of processing units running at a
low to moderate frequency, designed for throughput-oriented, latency-insensitive
workloads. In order to hide global memory latency, GPUs contain small or
moderate sized on-chip caches, and they make extensive use of hardware
multithreading, executing tens of thousands of threads concurrently across the
pool of processing units. The GPU processing units are typically organized in
SIMD clusters controlled by a single instruction decoder, with shared access to
fast on-chip caches and shared memories. The SIMD clusters execute machine
instructions in lock-step, and branch divergence is handled by executing both
paths of the branch and masking off results from inactive processing units as
necessary. The use of SIMD architecture and in-order execution of instructions
allows GPUs to contain a larger number of arithmetic units in the same area as
compared to traditional CPUs.
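
The masking scheme described above can be sketched on the CPU. This is only an
emulation in plain C++, not GPU code: the lane count kLanes and the function
name simdBranch are hypothetical names chosen for illustration. Every lane
evaluates both sides of the branch, and a per-lane mask decides which result is
kept.

```cpp
#include <array>
#include <cstddef>

// Hypothetical SIMD width for this illustration; real hardware widths differ.
constexpr std::size_t kLanes = 8;

// Emulates "if (x < 0) y = -x; else y = 2 * x;" executed in lock-step:
// both paths run for all lanes, and a mask selects the result per lane.
std::array<int, kLanes> simdBranch(const std::array<int, kLanes>& x) {
    std::array<bool, kLanes> mask{};
    std::array<int, kLanes> thenResult{};
    std::array<int, kLanes> elseResult{};
    std::array<int, kLanes> y{};

    // The branch condition is evaluated once per lane.
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        mask[lane] = x[lane] < 0;
    }
    // Pass 1: the "then" path is computed for every lane.
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        thenResult[lane] = -x[lane];
    }
    // Pass 2: the "else" path is computed for every lane.
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        elseResult[lane] = 2 * x[lane];
    }
    // Masked write-back: each lane keeps only the result of its own path.
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        y[lane] = mask[lane] ? thenResult[lane] : elseResult[lane];
    }
    return y;
}
```

The cost of a divergent branch in this model is the sum of both passes, which
is why kernels with little divergence inside a SIMD cluster tend to run faster.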

Massively parallel arithmetic-heavy hardware design enables state-of-the-art
GPUs to achieve single-precision floating point arithmetic rates approaching 2
TFLOPS (trillions of floating point operations per second). Due to the demands
of graphics
workloads, GPUs are designed with global memory systems capable of
bandwidths approaching 200 GB/sec. GPU global memory is organized in
multiple banks, with peak performance attained when accesses are aligned on
appropriate address boundaries, and groups of SIMD units cooperatively load
data from a contiguous block of memory addresses, known as a “coalesced”
memory access. When memory accesses are not aligned on an appropriate address
boundary or are not in consecutive sequence, they must be split into multiple
transactions, resulting in a significant reduction in effective bandwidth and
increased latency.
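
The effect of alignment and stride can be illustrated with a simplified model.
The sketch below counts how many aligned memory segments a group of threads
touches. The group size of 32 threads, the 128-byte segment size, and the name
countTransactions are assumptions made for illustration; actual hardware rules
vary by vendor and generation, and each access is assumed to fit inside one
segment.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Counts the distinct aligned segments touched when thread t accesses the
// byte address baseAddress + t * strideBytes. Fewer segments means fewer
// memory transactions in this simplified model.
std::size_t countTransactions(std::uint64_t baseAddress,
                              std::uint64_t strideBytes,
                              std::size_t threadsInGroup = 32,
                              std::uint64_t segmentBytes = 128) {
    std::set<std::uint64_t> segments;
    for (std::size_t t = 0; t < threadsInGroup; ++t) {
        std::uint64_t address = baseAddress + t * strideBytes;
        segments.insert(address / segmentBytes);  // index of the segment hit
    }
    return segments.size();
}
```

In this model, 32 threads reading consecutive 4-byte elements from an aligned
base address touch a single segment, the same pattern shifted by 64 bytes
touches two, and a stride of 128 bytes touches 32 separate segments.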

Although GPUs are powerful computing devices in their own right, they must
currently be managed by the host CPUs. GPUs are typically attached to the host
by a PCI-Express bus, and in most cases have their own independent on-board
memory system. In order to exchange input and output data with the GPU, the
host CPU schedules DMA transfers between the host and GPU memory
systems. OpenCL provides APIs for CPU-directed data transfers between
independent host and GPU memory systems. Recent GPUs are capable of direct
access to host memory over PCI-e, and in some cases may allow their on-board
memory to be mapped into the host address space, providing the necessary
hardware support for zero-copy access to data that are read or written only
once during kernel execution. At the present time, OpenCL does not include
mechanisms for zero-copy memory access, though it could be provided as an
extension or as part of a future version.

Both AMD and NVIDIA have released OpenCL implementations supporting
their respective GPUs. These devices require a large number of OpenCL
work-items and work-groups to fully saturate the hardware and hide latency.
NVIDIA GPUs use a scalar processor architecture for the individual PEs seen by
OpenCL, enabling them to work with high efficiency on most OpenCL data
types. AMD GPUs use a vector architecture, and typically achieve best
performance when OpenCL work-items operate on 4-element vector types
such as float4. In many cases, a vectorized OpenCL kernel can be made to
perform well on x86 CPUs and on AMD and NVIDIA GPUs, though the
resulting kernel code may be less readable than the scalar equivalent.
Differences in low-level GPU architecture, including variations in what memory
is cached and what memory access patterns create bank conflicts, affect kernel
optimality. Vendor-provided OpenCL literature typically contains low-level
optimization guidelines. In the examples that follow we refrain from detail and
focus on general OpenCL programming concepts.

An Example OpenCL Kernel

To illustrate the process of moving serial code to OpenCL, we discuss an
example kernel from the Adaptive Poisson-Boltzmann Solver (APBS) [10].
APBS is a package for calculating biomolecular solvation through the use of the
Poisson-Boltzmann equation (PBE). The PBE is a popular continuum model
that describes electrostatic interactions between molecular solutes. As part of
the solution of the PBE, potentials are discretized onto a grid sized larger than
the bounding volume containing the molecule of interest. Under Dirichlet
boundary conditions, the potential contribution of grid points on the faces of the
grid can be solved using the Single Debye-Hückel (SDH) method or Multiple
Debye-Hückel (MDH) method. For the MDH method, the potential at a grid
point i located at position r_i is given by

with the sum taken over all atoms, where α is a prefactor that accounts for the
system of units and solution dielectric values, atom j is located at r_j and has
partial charge q_j and size σ_j, and the pairwise distance is r_ij = |r_j − r_i|. The
potential at each grid point is effectively the sum of all atomic potential
contributions in the molecule. The MDH method is inherently data-parallel
when decomposed over grid points since they are computed independently and
there are no output conflicts.
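
The data-parallel structure described above can be sketched serially in C++.
The summand used below, q_j · exp(−κ(r_ij − σ_j)) / (r_ij · (1 + κσ_j)), is an
assumed Debye-Hückel form consistent with the symbols defined in the text, and
κ (a screening parameter) is an assumption not defined in these notes. The
names Atom, Point, mdhPotential and mdhAllPoints are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Atom {
    double x, y, z;  // position r_j
    double charge;   // partial charge q_j
    double size;     // size sigma_j
};

struct Point {
    double x, y, z;  // grid point position r_i
};

// Potential at one grid point: alpha times a sum over all atoms.
// The summand is an assumed form; kappa is an assumed screening parameter.
double mdhPotential(const Point& p, const std::vector<Atom>& atoms,
                    double alpha, double kappa) {
    double sum = 0.0;
    for (const Atom& a : atoms) {
        double dx = a.x - p.x;
        double dy = a.y - p.y;
        double dz = a.z - p.z;
        double rij = std::sqrt(dx * dx + dy * dy + dz * dz);  // |r_j - r_i|
        sum += a.charge * std::exp(-kappa * (rij - a.size)) /
               (rij * (1.0 + kappa * a.size));
    }
    return alpha * sum;
}

// Each grid point is computed independently and writes only its own output,
// so this outer loop is what a kernel would spread across work-items.
std::vector<double> mdhAllPoints(const std::vector<Point>& grid,
                                 const std::vector<Atom>& atoms,
                                 double alpha, double kappa) {
    std::vector<double> potential(grid.size());
    for (std::size_t i = 0; i < grid.size(); ++i) {
        potential[i] = mdhPotential(grid[i], atoms, alpha, kappa);
    }
    return potential;
}
```

Because no iteration of the outer loop reads or writes data owned by another
iteration, mapping one grid point to one work-item needs no synchronization.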

Host Devices:

A host is a computer or other device that communicates with other hosts on a
network. Also known as network hosts, hosts include clients and servers that
send or receive data, services and applications.

A host is any hardware device that has the capability of permitting access to a
network via a user interface, specialized software, network address, protocol
stack, or any other means. Some examples include, but are not limited
to, computers, personal electronic devices, thin clients, and multi-functional
devices.

Types:

Host devices are devices that are physically plugged into the host,
including SCSI (for example tapes, disks, changers), PCI (for example
NICs, GPUs, and HBAs), and USB (for example mice, cameras, and disks).

Hosts typically do not include intermediary network devices
like switches and routers, which are instead often categorized as nodes.
A node is a broader term that includes anything connected to a network, while a
host requires an IP address. In other words, all hosts are nodes, but network
nodes are not hosts unless they require an IP address to function.

Hosts use various protocols to communicate, including TCP and User Datagram
Protocol (UDP). On a TCP/IP network, each host has a host number that,
together with a network identity, forms its unique IP address. In the Open
Systems Interconnection (OSI) model, protocols in the transport layer, also
known as Layer 4, are responsible for communication between hosts.

Types of IT hosts
The term host is used in several other areas within information technology (IT),
carrying a slightly different meaning depending on the context.

Web host
For companies or individuals with a website, a host is a web server that stores
and transmits the data for one or more websites. Host can also refer to the
service provider that leases this infrastructure, which is known as hosting.

Cloud host
A cloud host is based on cloud computing technologies that enable a number of
servers to act as one system in which website performance can be guaranteed by
multiple machines. It often includes a network of servers pulling from different
data centers in different locations.

Cloud hosts operate as a service that enables clients to buy as much of the
service as they need. Cloud hosting is an alternative to hosting a website on a
single server. Cloud hosting can be considered both infrastructure as a service
(IaaS) and platform as a service (PaaS). Using a public cloud model, a public
network transmits data that is physically stored on shared virtual servers that
make up the cloud resource.

Virtual host
The term virtual host has two uses. One refers to the technology used to run
multiple domains or applications on a single physical server. The second refers
to companies that sell virtual infrastructure services.

Remote host
In this context, users access a remote host in a different physical location using
a private network or the internet. This process provides users with remote
access. Examples include servers that users can log in to remotely or a host
computer for a remote desktop.

Hosts connect to other hosts and servers in this local network.

Host virtual machine
This refers to the hardware -- or the physical server -- that provides the
computing resources to support virtual machines (VMs). This process is also
known as server virtualization.

Mainframe computer environments
In this context, a mainframe computer can be the host provider of services for
the workstations attached to it. This does not mean that the host only has servers
and the workstations only have clients. The server-client relationship is a
programming model independent of this contextual usage of the term host.

Hostname
A hostname is a plaintext name identifying a host in a given domain. On a local
area network (LAN), a server's hostname might be a nickname like mailserver1.
On the internet, a hostname makes up part of a web address and has three parts:

1. subdomain
2. domain name
3. top-level domain

For example, the hostname subdomain.example.com consists of the subdomain
subdomain, the domain example and the top-level domain .com.

In other contexts, a host can also be a device or program that provides services
to some smaller or less-capable device or program.

UNIT V

Algorithms of GPU

Parallel Patterns:

Convolution:
Convolution is a mathematical operation which describes a rule of how to
combine two functions or pieces of information to form a third function. The
feature map (or input data) and the kernel are combined to form a transformed
feature map. The convolution algorithm is often interpreted as a filter, where the
kernel filters the feature map for certain information. A kernel, for example,
might filter for edges and discard other information. The inverse of the
convolution operation is called deconvolution.

Convolution is a mathematical operation on two signals f and g, defined as:
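
In its standard continuous form, which is assumed here to be the definition the
notes refer to, the convolution of f and g is written as in the first line
below. The second line gives the discrete 2D counterpart for an image I and a
kernel K of radius r; the symbols I, K and r are introduced here for
illustration only.

```latex
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau

(I * K)(x, y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} K(i, j)\, I(x - i,\, y - j)
```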

As we always process an image with a finite resolution, the convolution is
actually a scalar product of the filter weights and all pixels of the image within a
window that is defined by the extent of the filter and a center pixel. Figure 1
illustrates the convolution using a small 3 × 3 kernel. The filter is defined as a
matrix, where the central item weights the center pixel, and the other items
define the weights of the neighbor pixels. We can also say that the radius of the
3 × 3 kernel is 1, since only the one-ring neighborhood is considered during the
convolution. We also have to define the convolution’s behavior at the border of
the image, where the kernel maps to undefined values outside the image.
Generally, the values outside the image boundaries are either treated as zeros
(this is what we will do in this assignment) or clamped to the border pixels of
the image.
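
The windowed scalar product with zero-valued borders can be sketched serially
as follows. This is a minimal CPU reference in C++, not a GPU kernel; the name
convolve2D and the row-major float layout are assumptions. The kernel is
applied without flipping, which matches the window-and-weights description
above and gives the same result as the flipped form for symmetric kernels.

```cpp
#include <vector>

// Applies a (2 * radius + 1) x (2 * radius + 1) filter to a row-major
// grayscale image. Pixels outside the image are treated as zero.
std::vector<float> convolve2D(const std::vector<float>& image,
                              int width, int height,
                              const std::vector<float>& kernel, int radius) {
    const int kernelWidth = 2 * radius + 1;
    std::vector<float> output(image.size(), 0.0f);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int ky = -radius; ky <= radius; ++ky) {
                for (int kx = -radius; kx <= radius; ++kx) {
                    int px = x + kx;
                    int py = y + ky;
                    // Zero border: skip samples that fall outside the image.
                    if (px < 0 || px >= width || py < 0 || py >= height) {
                        continue;
                    }
                    float weight =
                        kernel[(ky + radius) * kernelWidth + (kx + radius)];
                    sum += weight * image[py * width + px];
                }
            }
            output[y * width + x] = sum;
        }
    }
    return output;
}
```

Each output pixel depends only on input pixels, so on a GPU the two outer loops
would typically be replaced by one thread per output pixel.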

The design of the convolution filter requires a careful selection of kernel
weights to achieve the desired effect. In the following, we introduce a few
examples to demonstrate basic filtering kernels often used in image processing.

Prefix Sum:

Prefix Sum – Compaction

Prefix Sum Application – Range histogram:

Prefix Sum Application – Summed Area Tables:

Sparse matrix multiplication:

Sparse matrix multiplication multiplies two matrices with lower time complexity
by working only on their non-zero elements. The resultant matrix typically also
consists mostly of zeros, so it is better to store it in the same way. Storing
just the non-zero elements saves a lot of space.

A sparse matrix is a matrix in which most of the elements are zero. Storing all
the elements of such a matrix wastes memory, so it is better to store only the
elements that are non-zero. For example, if most elements of a matrix A are 0,
only its non-zero elements need to be stored. Storing the matrix this way
reduces the space required and the time taken to perform operations. One of
these operations is the multiplication of two sparse matrices.

Syntax of Sparse Matrix Multiplication

To multiply two matrices in which most of the elements are zero, sparse matrix
multiplication is used for better time complexity and space complexity. It
limits processing time and space usage by storing and processing only the
non-zero elements. To store the non-zero elements, we use the two
representations below.

1. Array Representation: Used when the elements need to be accessed often,
since an array stores its elements by index and gives direct access to them.
The 2D matrix is converted to an array with 3 columns:

• Row: Row index of the non-zero element.

• Column: Column index of the non-zero element.

• Value: Value at the same row, column index in the 2D matrix.

2. Linked List Representation: Used when insertions and deletions in the
matrix are frequent, since it is easier to delete and insert elements in a
linked list than in an array.

Each node has four fields as given below:

• Row: Row index of the non-zero element in the matrix.

• Column: Column index of the non-zero element in the matrix.

• Value: Value of the non-zero element at (row, column) position in the matrix.

• Next Node: Reference to the next node.
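
For the array representation, the conversion from a dense matrix can be
sketched as follows. This is a minimal illustration that assumes integer
elements and uses std::vector rather than a fixed-size array; the names Triplet
and toTriplets are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// One stored entry: the Row / Column / Value fields of the array representation.
struct Triplet {
    int row;
    int col;
    int value;
};

// Scans a dense 2D matrix row by row and keeps only the non-zero elements.
// The result is ordered by row, then by column.
std::vector<Triplet> toTriplets(const std::vector<std::vector<int>>& dense) {
    std::vector<Triplet> triplets;
    for (std::size_t r = 0; r < dense.size(); ++r) {
        for (std::size_t c = 0; c < dense[r].size(); ++c) {
            if (dense[r][c] != 0) {
                triplets.push_back({static_cast<int>(r),
                                    static_cast<int>(c),
                                    dense[r][c]});
            }
        }
    }
    return triplets;
}
```

A matrix with n non-zero elements then needs storage for 3n integers instead of
rows × columns elements.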

How to Perform Sparse Matrix Multiplication?

• A sparse matrix is a 2-D matrix in which the number of zero elements is
large compared to the number of non-zero elements. Multiplying sparse matrices
in their full form takes a lot of memory space and time, and makes further
operations difficult. To multiply them with less time and storage, we can
directly multiply the two reduced sparse matrices.

• Suppose we have two matrices in which most elements are 0. Multiplying them
in full form has a high time complexity and needs a lot of space to store the
3 matrices. It is better to convert both matrices into the reduced sparse form,
multiply them, and obtain the resultant matrix in the reduced sparse form as
well. This multiplication is called Sparse Matrix Multiplication.

• The resultant matrix can also be stored in either of the representations
above.

Example of Sparse Matrix Multiplication

Let us illustrate the concept of sparse matrix multiplication with an example.

Consider the 2-D matrix below with 4 rows and 4 columns.

Here, only 5 of the 4 * 4 = 16 elements in the matrix are non-zero, so only
these 5 elements need to be stored in memory. We store just the locations of
the non-zero elements and their values.

For the multiplication, let us take another sparse matrix of size 4 * 4.

Here, too, 5 elements have non-zero values.

To multiply them, we have to follow these steps:

Step 1: First, take the transpose of the second matrix. Transposing a matrix
converts all its rows into columns and its columns into rows. Since we store
only the non-zero elements, this means swapping the row index and the column
index of every stored element.

Step 2: Let x be a row of the first matrix and y a row of the transpose of the
second matrix. The entry resultant[x][y] is the sum of the products of all
pairs of values, one from row x of the first matrix and one from row y of the
transposed second matrix, whose column indices are the same.

Step 3: Let x=0 and y=0. In the first matrix there are two values in row 0, and
in the second matrix there is one. If the columns of a pair are the same, we
multiply the two values and look for another pair. In this example, the value
at 0,2 in the first matrix is 18, and the value at 0,2 in the second matrix is
5. We multiply both and look for another pair, but there is no other pair in
which the column is the same. So, resultant[0][0] is 144.

Step 4: Repeat step 3 for every 0<=x<row and 0<=y<column, where
resultant[x][y] is an entry of the resultant matrix.

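#include <algorithm>  // sort
#include <iostream>   // cout, endl
using namespace std;

// One non-zero element of a sparse matrix
struct matrix {
    int row;    // row number
    int col;    // column number
    int value;  // element value
};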

// Print the sparse matrix: one "row col value" line per non-zero element
void printMatrix(matrix a[100], int k) {
    for (int i = 0; i < k; i++) {
        cout << a[i].row << " " << a[i].col << " " << a[i].value << endl;
    }
}

// Comparator used to sort the sparse matrix by row, then by column
bool compareMatrix(struct matrix a, struct matrix b) {
    if (a.row < b.row) {
        return true;
    }
    if (a.row > b.row) {
        return false;
    }
    if (a.col < b.col) {
        return true;
    }
    return false;
}

// Transpose of a sparse matrix: swap the row and column of each element,
// then sort so that the result is again ordered by row and column.
// (row and col, the matrix dimensions, are not needed by this function.)
void transpose(matrix a[100], matrix b[100], int row, int col, int n) {
    for (int i = 0; i < n; i++) {
        b[i].row = a[i].col;
        b[i].col = a[i].row;
        b[i].value = a[i].value;
    }
    sort(b, b + n, compareMatrix);
    cout << "Transpose of sparse matrix is: -" << endl;
    printMatrix(b, n);
}

// Multiply sparse matrix a (n1 non-zero elements) by the matrix whose sorted
// transpose is transposeB (n2 non-zero elements). Both arrays must be sorted
// by row and then by column. The loops are bounded by n1 and n2, the element
// counts; rows and cols (the dimensions of the result) are not needed.
void multiply(matrix a[100], matrix transposeB[100], matrix resultant[100],
              int n1, int n2, int rows, int cols) {
    int i = 0;  // first element of the current row of a
    int k = 0;  // number of elements written to resultant
    while (i < n1) {
        int a1 = a[i].row;
        int j = 0;  // first element of the current row of transposeB
        while (j < n2) {
            int b1 = transposeB[j].row;
            int temp = 0;
            int tempi = i;
            // Sum the products of all pairs with the same column index.
            while (tempi < n1 && a[tempi].row == a1) {
                int tempj = j;
                while (tempj < n2 && transposeB[tempj].row == b1) {
                    if (a[tempi].col == transposeB[tempj].col) {
                        temp = temp + (a[tempi].value * transposeB[tempj].value);
                    }
                    tempj++;
                }
                tempi++;
            }
            if (temp != 0) {
                resultant[k].row = a1;
                resultant[k].col = b1;
                resultant[k].value = temp;
                k++;
            }
            // Move to the first element of the next row of transposeB.
            while (j < n2 && transposeB[j].row == b1) {
                j++;
            }
        }
        // Move to the first element of the next row of a.
        while (i < n1 && a[i].row == a1) {
            i++;
        }
    }
    cout << "Multiplication of sparse matrix is: -" << endl;
    printMatrix(resultant, k);
}

int main() {
    int row1 = 4, col1 = 4;    // dimensions of the first matrix
    int row2 = 4, col2 = 4;    // dimensions of the second matrix
    const int n1 = 5, n2 = 5;  // number of non-zero elements in each matrix

    // Non-zero elements as {row, col, value}, sorted by row and then column
    matrix a[n1] = {
        {0, 1, 5},
        {0, 2, 18},
        {2, 1, 14},
        {3, 1, 15},
        {3, 3, 4}
    };
    matrix b[n2] = {
        {0, 1, 5},
        {1, 2, 20},
        {2, 0, 8},
        {3, 1, 15},
        {3, 3, 24}
    };
    matrix transposeb[n2];
    matrix resultant[100];

    transpose(b, transposeb, row2, col2, n2);
    multiply(a, transposeb, resultant, n1, n2, row1, col2);
    return 0;
}

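Output:

    Transpose of sparse matrix is: -
    0 2 8
    1 0 5
    1 3 15
    2 1 20
    3 3 24
    Multiplication of sparse matrix is: -
    0 0 144
    0 2 100
    2 2 280
    3 1 60
    3 2 300
    3 3 96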

Conclusion
Sparse matrix multiplication multiplies two matrices with lower time complexity than the dense algorithm, because only the non-zero elements are processed. The resultant matrix usually also consists mostly of zeros, so it is better to store it in the same way. Storing only the non-zero elements can thus save a lot of space.
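To put a number on the space saving, the sketch below compares the count of int values needed by a dense rows x cols array with the three ints (row, column, value) stored per non-zero element in the triplet format. The function names are illustrative assumptions, and the comparison assumes indices and values have the same size.

```cpp
#include <cassert>

// Number of ints stored by a dense rows x cols matrix.
long denseInts(long rows, long cols) {
    return rows * cols;
}

// Number of ints stored by the triplet format: row, col and value
// for each non-zero element.
long tripletInts(long nonZero) {
    return 3 * nonZero;
}

// The triplet format is smaller only when fewer than one third
// of the elements are non-zero.
bool tripletIsSmaller(long rows, long cols, long nonZero) {
    return tripletInts(nonZero) < denseInts(rows, cols);
}
```

For the 4 x 4 example with 5 non-zero elements the two formats are nearly equal (15 ints versus 16). The saving becomes large only for bigger and sparser matrices: a 1000 x 1000 matrix with 5000 non-zero elements needs 15,000 ints as triplets instead of 1,000,000.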
