PC Cuda Assignment-2

Experiment-2

MIHIR KATAKDHOND (231070025)


Prashansa Bhatia (231071045)
BRANCH: TY BTECH COMPUTER ENGINEERING
BATCH: AD

PARALLEL COMPUTING LAB

Aim: To implement CUDA programs for basic parallel operations (vector and matrix arithmetic, matrix transpose, prefix sum) and compare their CPU and GPU execution times.

Theory:

1. Introduction

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming
model created by NVIDIA. It enables programmers to leverage the GPU’s highly parallel structure
for general-purpose computing, beyond just graphics rendering. By using CUDA, computationally
intensive tasks such as matrix operations, image processing, and large-scale data analytics can be
executed much faster compared to traditional CPU-only implementations.

2. CPU vs GPU Computing

● CPU (Central Processing Unit)

○ Few cores, optimized for sequential processing.

○ High clock speeds and complex instruction handling.

○ Better suited for tasks requiring significant decision-making and control flow.

● GPU (Graphics Processing Unit)

○ Hundreds or thousands of smaller cores.

○ Optimized for executing the same instruction across large datasets in parallel.

○ Particularly efficient for numerical computations and data-parallel workloads.


3. CUDA Programming Model

CUDA adopts a hierarchical thread organization for efficient parallelism:

1. Thread – The smallest execution unit.

2. Block – A collection of threads that can share memory.

3. Grid – A collection of blocks.

Each thread derives a unique ID from its block and thread indices, so each thread can operate on a distinct data element, as in the sketch below.
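
For example, in a one-dimensional launch the global index is computed as follows (a minimal sketch; the guard handles launches with more threads than elements):

__global__ void indexDemo(int *out, int n) {
    // Unique global ID: offset of this block, plus position within it
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    if (globalId < n)             // threads past the end of the data do nothing
        out[globalId] = globalId; // each thread handles exactly one element
}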

4. Memory Hierarchy in CUDA

CUDA offers several memory types with different scopes and performance characteristics:

● Registers – Fastest; private to each thread.

● Shared Memory – Shared by threads within a block; very fast but limited in size.

● Global Memory – Accessible to all threads but relatively slow.

● Constant & Texture Memory – Specialized memory types for read-only access and specific use cases.

Efficient memory usage is critical for optimizing performance.
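
As an illustrative sketch (not one of the assignment programs), the kernel below stages data in shared memory before operating on it, a common pattern for reducing global-memory traffic. TILE is a hypothetical tile width chosen for this example, and the block size is assumed not to exceed it:

#define TILE 256
__global__ void scaleWithSharedMem(float *data, float s, int n) {
    __shared__ float tile[TILE];                    // visible to all threads in the block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) tile[threadIdx.x] = data[gid];     // one global read per thread
    __syncthreads();                                // wait until the whole tile is loaded
    if (gid < n) data[gid] = s * tile[threadIdx.x]; // operate on the fast copy
}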

5. CUDA Kernels

A kernel is a special function that runs on the GPU in parallel across multiple threads. Kernels are
launched from the CPU (host) but executed on the GPU (device). The launch configuration
specifies:

● The number of blocks in the grid.

● The number of threads per block.

This determines the total parallel workload.
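
A common idiom, used in all the programs below, rounds the grid size up so that every element is covered even when the problem size is not a multiple of the block size (someKernel is a placeholder name for this sketch):

int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock; // ceiling division
// Launches at least n threads; kernels guard with `if (tid < n)`
someKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);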


6. Performance Measurement

Execution time is measured to compare CPU and GPU implementations:

● CPU Execution Time – Typically measured using system clock functions such as clock().

● GPU Execution Time – Measured using CUDA events to accurately capture kernel runtime.

GPU acceleration benefits become more evident with larger datasets, as the parallelism outweighs
kernel launch overhead.
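
The CUDA-event timing pattern used throughout the programs below has the following shape (a sketch; kernel stands for any of the kernels in this report):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);                  // enqueue start marker on the stream
kernel<<<blocks, threads>>>(/* args */); // work to be timed
cudaEventRecord(stop);                   // enqueue stop marker
cudaEventSynchronize(stop);              // block the CPU until stop has completed
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);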
1. Hello World

Code:

%%writefile add.cu
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK_CUDA(call) do { \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error %s at %s:%d\n", cudaGetErrorString(err), __FILE__, __LINE__); \
        exit(1); \
    } \
} while (0)

__global__ void helloFromGPU() {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from GPU! block %d, thread %d (global %d)\n",
           blockIdx.x, threadIdx.x, globalId);
}

int main() {
    printf("Hello from CPU!\n");

    // Choose a simple launch config: 2 blocks of 8 threads = 16 prints
    int blocks = 2;
    int threadsPerBlock = 8;

    // Launch kernel and wait for all GPU printf output to flush
    helloFromGPU<<<blocks, threadsPerBlock>>>();
    CHECK_CUDA(cudaPeekAtLastError());
    CHECK_CUDA(cudaDeviceSynchronize());

    // Show device info
    int device = 0;
    cudaDeviceProp prop;
    CHECK_CUDA(cudaGetDevice(&device));
    CHECK_CUDA(cudaGetDeviceProperties(&prop, device));
    printf("\n--- Device Info ---\n");
    printf("Name: %s\n", prop.name);
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Global Memory: %.2f GB\n", prop.totalGlobalMem / (1024.0 * 1024 * 1024));
    printf("Max Threads/Block: %d\n", prop.maxThreadsPerBlock);
    printf("Compute Capability: %d.%d\n", prop.major, prop.minor);

    return 0;
}

Output:
2. Matrix Addition

Code 1:

%%writefile matrix_add.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Each thread adds one element; the guard handles N not divisible by the block size.
__global__ void matrixAddGPU(float *A, float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N)
        C[row * N + col] = A[row * N + col] + B[row * N + col];
}

void matrixAddCPU(float *A, float *B, float *C, int N) {
    for (int i = 0; i < N * N; i++)
        C[i] = A[i] + B[i];
}

int main() {
    int N = 1024; // change sizes for experiments
    size_t size = N * N * sizeof(float);

    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C_cpu = (float*)malloc(size);
    float *h_C_gpu = (float*)malloc(size);

    for (int i = 0; i < N * N; i++) {
        h_A[i] = rand() % 100;
        h_B[i] = rand() % 100;
    }

    // CPU timing
    clock_t start_cpu = clock();
    matrixAddCPU(h_A, h_B, h_C_cpu, N);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy inputs
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events (captures kernel time only, not transfers)
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 threadsPerBlock(16, 16);
    dim3 blocksPerGrid((N + 15) / 16, (N + 15) / 16); // ceiling division

    cudaEventRecord(start);
    matrixAddGPU<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_C_gpu, d_C, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Matrix Size: %d x %d\n", N, N);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C_cpu); free(h_C_gpu);
    return 0;
}

Output 1:
Code 2 (same program with a small matrix, N = 30):

%%writefile matrix_add.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Each thread adds one element; the guard handles N not divisible by the block size.
__global__ void matrixAddGPU(float *A, float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N)
        C[row * N + col] = A[row * N + col] + B[row * N + col];
}

void matrixAddCPU(float *A, float *B, float *C, int N) {
    for (int i = 0; i < N * N; i++)
        C[i] = A[i] + B[i];
}

int main() {
    int N = 30; // change sizes for experiments
    size_t size = N * N * sizeof(float);

    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C_cpu = (float*)malloc(size);
    float *h_C_gpu = (float*)malloc(size);

    for (int i = 0; i < N * N; i++) {
        h_A[i] = rand() % 100;
        h_B[i] = rand() % 100;
    }

    // CPU timing
    clock_t start_cpu = clock();
    matrixAddCPU(h_A, h_B, h_C_cpu, N);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy inputs
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events (captures kernel time only, not transfers)
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 threadsPerBlock(16, 16);
    dim3 blocksPerGrid((N + 15) / 16, (N + 15) / 16); // ceiling division

    cudaEventRecord(start);
    matrixAddGPU<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_C_gpu, d_C, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Matrix Size: %d x %d\n", N, N);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C_cpu); free(h_C_gpu);
    return 0;
}

Output 2:
3. Matrix Transpose

Code 1:

%%writefile matrix_transpose.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Each thread moves one element from (row, col) to (col, row).
__global__ void transposeGPU(float *in, float *out, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N)
        out[col * N + row] = in[row * N + col];
}

void transposeCPU(float *in, float *out, int N) {
    for (int row = 0; row < N; row++)
        for (int col = 0; col < N; col++)
            out[col * N + row] = in[row * N + col];
}

int main() {
    int N = 1024;
    size_t size = N * N * sizeof(float);
    float *h_in = (float*)malloc(size);
    float *h_out_cpu = (float*)malloc(size);
    float *h_out_gpu = (float*)malloc(size);

    for (int i = 0; i < N * N; i++)
        h_in[i] = rand() % 100;

    // CPU timing
    clock_t start_cpu = clock();
    transposeCPU(h_in, h_out_cpu, N);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy the input matrix
    float *d_in, *d_out;
    cudaMalloc(&d_in, size);
    cudaMalloc(&d_out, size);
    cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 threadsPerBlock(16, 16);
    dim3 blocksPerGrid((N + 15) / 16, (N + 15) / 16); // ceiling division

    cudaEventRecord(start);
    transposeGPU<<<blocksPerGrid, threadsPerBlock>>>(d_in, d_out, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_out_gpu, d_out, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Matrix Size: %d x %d\n", N, N);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out_cpu); free(h_out_gpu);
    return 0;
}
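
Note that the naive kernel above reads rows but writes columns, so its global-memory writes are uncoalesced. A standard optimization (a sketch, not required by the assignment) stages a tile in shared memory so that both reads and writes stay coalesced; the +1 padding avoids shared-memory bank conflicts:

#define TILE 16
__global__ void transposeTiled(float *in, float *out, int N) {
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < N && y < N)
        tile[threadIdx.y][threadIdx.x] = in[y * N + x];  // coalesced read
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;  // swap block indices for the output tile
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < N && y < N)
        out[y * N + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}
// Launched with dim3 threads(TILE, TILE) and the same grid as above.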

Output 1:
Code 2 (same program with a small matrix, N = 50):

%%writefile matrix_transpose.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Each thread moves one element from (row, col) to (col, row).
__global__ void transposeGPU(float *in, float *out, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N)
        out[col * N + row] = in[row * N + col];
}

void transposeCPU(float *in, float *out, int N) {
    for (int row = 0; row < N; row++)
        for (int col = 0; col < N; col++)
            out[col * N + row] = in[row * N + col];
}

int main() {
    int N = 50;
    size_t size = N * N * sizeof(float);
    float *h_in = (float*)malloc(size);
    float *h_out_cpu = (float*)malloc(size);
    float *h_out_gpu = (float*)malloc(size);

    for (int i = 0; i < N * N; i++)
        h_in[i] = rand() % 100;

    // CPU timing
    clock_t start_cpu = clock();
    transposeCPU(h_in, h_out_cpu, N);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy the input matrix
    float *d_in, *d_out;
    cudaMalloc(&d_in, size);
    cudaMalloc(&d_out, size);
    cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 threadsPerBlock(16, 16);
    dim3 blocksPerGrid((N + 15) / 16, (N + 15) / 16); // ceiling division

    cudaEventRecord(start);
    transposeGPU<<<blocksPerGrid, threadsPerBlock>>>(d_in, d_out, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_out_gpu, d_out, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Matrix Size: %d x %d\n", N, N);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out_cpu); free(h_out_gpu);
    return 0;
}

Output 2:
4. Prefix Sum

Code 1:

%%writefile prefix_sum.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Naive parallel prefix sum: thread tid re-sums elements 0..tid,
// so total work is O(n^2) even though the threads run in parallel.
__global__ void prefixSumGPU(int *in, int *out, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) {
        int sum = 0;
        for (int i = 0; i <= tid; i++)
            sum += in[i];
        out[tid] = sum;
    }
}

// Sequential prefix sum: O(n) total work.
void prefixSumCPU(int *in, int *out, int n) {
    out[0] = in[0];
    for (int i = 1; i < n; i++)
        out[i] = out[i - 1] + in[i];
}

int main() {
    int n = 10000;
    size_t size = n * sizeof(int);
    int *h_in = (int*)malloc(size);
    int *h_out_cpu = (int*)malloc(size);
    int *h_out_gpu = (int*)malloc(size);

    for (int i = 0; i < n; i++)
        h_in[i] = rand() % 10;

    // CPU timing
    clock_t start_cpu = clock();
    prefixSumCPU(h_in, h_out_cpu, n);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy the input
    int *d_in, *d_out;
    cudaMalloc(&d_in, size);
    cudaMalloc(&d_out, size);
    cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    prefixSumGPU<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_out_gpu, d_out, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Elements: %d\n", n);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out_cpu); free(h_out_gpu);
    return 0;
}
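
Because the kernel above makes each thread re-sum all elements up to its index, the GPU advantage fades as n grows. A work-efficient alternative is a shared-memory scan; the sketch below implements a Hillis–Steele inclusive scan for a single block, assuming n does not exceed the block size (a multi-block scan would also need a pass that combines per-block sums, omitted here):

__global__ void scanHillisSteele(int *in, int *out, int n) {
    extern __shared__ int temp[];          // sized at launch: n * sizeof(int)
    int tid = threadIdx.x;
    if (tid < n) temp[tid] = in[tid];
    __syncthreads();
    for (int offset = 1; offset < n; offset *= 2) {
        // read the partner value before anyone overwrites it
        int val = (tid >= offset && tid < n) ? temp[tid - offset] : 0;
        __syncthreads();
        if (tid < n) temp[tid] += val;
        __syncthreads();
    }
    if (tid < n) out[tid] = temp[tid];
}
// Launch (sketch): scanHillisSteele<<<1, n, n * sizeof(int)>>>(d_in, d_out, n);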

Output 1:
Code 2 (same program with n = 10):

%%writefile prefix_sum.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Naive parallel prefix sum: thread tid re-sums elements 0..tid,
// so total work is O(n^2) even though the threads run in parallel.
__global__ void prefixSumGPU(int *in, int *out, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) {
        int sum = 0;
        for (int i = 0; i <= tid; i++)
            sum += in[i];
        out[tid] = sum;
    }
}

// Sequential prefix sum: O(n) total work.
void prefixSumCPU(int *in, int *out, int n) {
    out[0] = in[0];
    for (int i = 1; i < n; i++)
        out[i] = out[i - 1] + in[i];
}

int main() {
    int n = 10;
    size_t size = n * sizeof(int);
    int *h_in = (int*)malloc(size);
    int *h_out_cpu = (int*)malloc(size);
    int *h_out_gpu = (int*)malloc(size);

    for (int i = 0; i < n; i++)
        h_in[i] = rand() % 10;

    // CPU timing
    clock_t start_cpu = clock();
    prefixSumCPU(h_in, h_out_cpu, n);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy the input
    int *d_in, *d_out;
    cudaMalloc(&d_in, size);
    cudaMalloc(&d_out, size);
    cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    prefixSumGPU<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_out_gpu, d_out, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Elements: %d\n", n);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out_cpu); free(h_out_gpu);
    return 0;
}

Output 2:
5. Square of Array Elements

Code 1:

%%writefile square_array.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Each thread squares one element in place.
__global__ void squareGPU(float *arr, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) arr[tid] *= arr[tid];
}

void squareCPU(float *arr, int n) {
    for (int i = 0; i < n; i++)
        arr[i] *= arr[i];
}

int main() {
    int n = 1000000;
    size_t size = n * sizeof(float);
    float *h_arr_cpu = (float*)malloc(size);
    float *h_arr_gpu = (float*)malloc(size);

    // Same random input for both versions so results can be compared
    for (int i = 0; i < n; i++)
        h_arr_cpu[i] = h_arr_gpu[i] = (float)(rand() % 100);

    // CPU timing
    clock_t start_cpu = clock();
    squareCPU(h_arr_cpu, n);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy the input
    float *d_arr;
    cudaMalloc(&d_arr, size);
    cudaMemcpy(d_arr, h_arr_gpu, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    squareGPU<<<(n + 255) / 256, 256>>>(d_arr, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_arr_gpu, d_arr, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Elements: %d\n", n);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_arr);
    free(h_arr_cpu); free(h_arr_gpu);
    return 0;
}

Output 1:
Code 2 (same program with n = 36):

%%writefile square_array.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Each thread squares one element in place.
__global__ void squareGPU(float *arr, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) arr[tid] *= arr[tid];
}

void squareCPU(float *arr, int n) {
    for (int i = 0; i < n; i++)
        arr[i] *= arr[i];
}

int main() {
    int n = 36;
    size_t size = n * sizeof(float);
    float *h_arr_cpu = (float*)malloc(size);
    float *h_arr_gpu = (float*)malloc(size);

    // Same random input for both versions so results can be compared
    for (int i = 0; i < n; i++)
        h_arr_cpu[i] = h_arr_gpu[i] = (float)(rand() % 100);

    // CPU timing
    clock_t start_cpu = clock();
    squareCPU(h_arr_cpu, n);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy the input
    float *d_arr;
    cudaMalloc(&d_arr, size);
    cudaMemcpy(d_arr, h_arr_gpu, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    squareGPU<<<(n + 255) / 256, 256>>>(d_arr, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_arr_gpu, d_arr, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Elements: %d\n", n);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_arr);
    free(h_arr_cpu); free(h_arr_gpu);
    return 0;
}

Output 2:
6. Vector Addition

Code 1:

%%writefile vector_add.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Each thread adds one pair of elements.
__global__ void vectorAddGPU(float *A, float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

void vectorAddCPU(float *A, float *B, float *C, int n) {
    for (int i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}

int main() {
    int n = 1000000;
    size_t size = n * sizeof(float);
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C_cpu = (float*)malloc(size);
    float *h_C_gpu = (float*)malloc(size);

    for (int i = 0; i < n; i++) {
        h_A[i] = rand() % 100;
        h_B[i] = rand() % 100;
    }

    // CPU timing
    clock_t start_cpu = clock();
    vectorAddCPU(h_A, h_B, h_C_cpu, n);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy inputs
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    vectorAddGPU<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_C_gpu, d_C, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Elements: %d\n", n);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C_cpu); free(h_C_gpu);
    return 0;
}

Output 1:
Code 2 (same program with n = 2000):

%%writefile vector_add.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Each thread adds one pair of elements.
__global__ void vectorAddGPU(float *A, float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

void vectorAddCPU(float *A, float *B, float *C, int n) {
    for (int i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}

int main() {
    int n = 2000;
    size_t size = n * sizeof(float);
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C_cpu = (float*)malloc(size);
    float *h_C_gpu = (float*)malloc(size);

    for (int i = 0; i < n; i++) {
        h_A[i] = rand() % 100;
        h_B[i] = rand() % 100;
    }

    // CPU timing
    clock_t start_cpu = clock();
    vectorAddCPU(h_A, h_B, h_C_cpu, n);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy inputs
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    vectorAddGPU<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_C_gpu, d_C, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Elements: %d\n", n);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C_cpu); free(h_C_gpu);
    return 0;
}

Output 2:
7. Vector Multiplication

Code 1:

%%writefile vector_mul.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Each thread multiplies one pair of elements (element-wise product).
__global__ void vectorMulGPU(float *A, float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] * B[i];
}

void vectorMulCPU(float *A, float *B, float *C, int n) {
    for (int i = 0; i < n; i++)
        C[i] = A[i] * B[i];
}

int main() {
    int n = 1000000;
    size_t size = n * sizeof(float);
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C_cpu = (float*)malloc(size);
    float *h_C_gpu = (float*)malloc(size);

    for (int i = 0; i < n; i++) {
        h_A[i] = rand() % 100;
        h_B[i] = rand() % 100;
    }

    // CPU timing
    clock_t start_cpu = clock();
    vectorMulCPU(h_A, h_B, h_C_cpu, n);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy inputs
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    vectorMulGPU<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_C_gpu, d_C, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Elements: %d\n", n);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C_cpu); free(h_C_gpu);
    return 0;
}

Output 1:
Code 2 (same program with n = 10):

%%writefile vector_mul.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Each thread multiplies one pair of elements (element-wise product).
__global__ void vectorMulGPU(float *A, float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] * B[i];
}

void vectorMulCPU(float *A, float *B, float *C, int n) {
    for (int i = 0; i < n; i++)
        C[i] = A[i] * B[i];
}

int main() {
    int n = 10;
    size_t size = n * sizeof(float);
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C_cpu = (float*)malloc(size);
    float *h_C_gpu = (float*)malloc(size);

    for (int i = 0; i < n; i++) {
        h_A[i] = rand() % 100;
        h_B[i] = rand() % 100;
    }

    // CPU timing
    clock_t start_cpu = clock();
    vectorMulCPU(h_A, h_B, h_C_cpu, n);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy inputs
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    vectorMulGPU<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_C_gpu, d_C, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Elements: %d\n", n);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C_cpu); free(h_C_gpu);
    return 0;
}

Output 2:

Conclusion:

This assignment demonstrated the performance benefits of parallel computing with CUDA over traditional serial CPU execution. Across experiments on matrix addition, matrix transpose, prefix sum, vector addition, and vector multiplication, GPU performance improved markedly with increasing data size, owing to the GPU's ability to execute thousands of threads in parallel.

While the CPU implementations are simpler and perform well for small datasets, where kernel launch and data-transfer overheads dominate the GPU's runtime, they become slower for large-scale computations because of the CPU's limited core count and sequential execution model. The GPU, with its massive parallelism and optimized memory hierarchy, achieved substantial reductions in execution time for large problem sizes.
