PC Cuda Assignment-2

Experiment-2

MIHIR KATAKDHOND (231070025)


Prashansa Bhatia (231071045)
BRANCH: TY BTECH COMPUTER ENGINEERING
BATCH: AD

PARALLEL COMPUTING LAB

Aim: To implement CUDA programs for basic parallel operations (vector and matrix arithmetic, matrix transpose, prefix sum) and compare their CPU and GPU execution times.

Theory:

1. Introduction

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming
model created by NVIDIA. It enables programmers to leverage the GPU’s highly parallel structure
for general-purpose computing, beyond just graphics rendering. By using CUDA, computationally
intensive tasks such as matrix operations, image processing, and large-scale data analytics can be
executed much faster compared to traditional CPU-only implementations.

2. CPU vs GPU Computing

● CPU (Central Processing Unit)

○ Few cores, optimized for sequential processing.

○ High clock speeds and complex instruction handling.

○ Better suited for tasks requiring significant decision-making and control flow.

● GPU (Graphics Processing Unit)

○ Hundreds or thousands of smaller cores.

○ Optimized for executing the same instruction across large datasets in parallel.

○ Particularly efficient for numerical computations and data-parallel workloads.


3. CUDA Programming Model

CUDA adopts a hierarchical thread organization for efficient parallelism:

1. Thread – The smallest execution unit.

2. Block – A collection of threads that can share memory.

3. Grid – A collection of blocks.

Each thread derives a unique ID from its block and thread indices, so each thread can operate on a distinct data element, as in the sketch below.
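
For example, in a one-dimensional launch the global index is computed as follows (a minimal sketch; the guard handles launches with more threads than elements):

__global__ void indexDemo(int *out, int n) {
    // Unique global ID: offset of this block, plus position within it
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    if (globalId < n)             // threads past the end of the data do nothing
        out[globalId] = globalId; // each thread handles exactly one element
}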

4. Memory Hierarchy in CUDA

CUDA offers several memory types with different scopes and performance characteristics:

● Registers – Fastest; private to each thread.

● Shared Memory – Shared by threads within a block; very fast but limited in size.

● Global Memory – Accessible to all threads but relatively slow.

● Constant & Texture Memory – Specialized memory types for read-only access and specific use cases.

Efficient memory usage is critical for optimizing performance.
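
As an illustrative sketch (not one of the assignment programs), the kernel below stages data in shared memory before operating on it, a common pattern for reducing global-memory traffic. TILE is a hypothetical tile width chosen for this example, and the block size is assumed not to exceed it:

#define TILE 256
__global__ void scaleWithSharedMem(float *data, float s, int n) {
    __shared__ float tile[TILE];                    // visible to all threads in the block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) tile[threadIdx.x] = data[gid];     // one global read per thread
    __syncthreads();                                // wait until the whole tile is loaded
    if (gid < n) data[gid] = s * tile[threadIdx.x]; // operate on the fast copy
}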

5. CUDA Kernels

A kernel is a special function that runs on the GPU in parallel across multiple threads. Kernels are
launched from the CPU (host) but executed on the GPU (device). The launch configuration
specifies:

● The number of blocks in the grid.

● The number of threads per block.

This determines the total parallel workload.
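
A common idiom, used in all the programs below, rounds the grid size up so that every element is covered even when the problem size is not a multiple of the block size (someKernel is a placeholder name for this sketch):

int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock; // ceiling division
// Launches at least n threads; kernels guard with `if (tid < n)`
someKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);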


6. Performance Measurement

Execution time is measured to compare CPU and GPU implementations:

● CPU Execution Time – Typically measured using system clock functions such as clock().

● GPU Execution Time – Measured using CUDA events to accurately capture kernel runtime.

GPU acceleration benefits become more evident with larger datasets, as the parallelism outweighs
kernel launch overhead.
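
The CUDA-event timing pattern used throughout the programs below has the following shape (a sketch; kernel stands for any of the kernels in this report):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);                  // enqueue start marker on the stream
kernel<<<blocks, threads>>>(/* args */); // work to be timed
cudaEventRecord(stop);                   // enqueue stop marker
cudaEventSynchronize(stop);              // block the CPU until stop has completed
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);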
1. Hello World

Code:

%%writefile add.cu
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK_CUDA(call) do { \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error %s at %s:%d\n", cudaGetErrorString(err), __FILE__, __LINE__); \
        exit(1); \
    } \
} while (0)

__global__ void helloFromGPU() {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from GPU! block %d, thread %d (global %d)\n",
           blockIdx.x, threadIdx.x, globalId);
}

int main() {
    printf("Hello from CPU!\n");

    // Choose a simple launch config: 2 blocks of 8 threads = 16 prints
    int blocks = 2;
    int threadsPerBlock = 8;

    // Launch kernel and wait for all GPU printf output to flush
    helloFromGPU<<<blocks, threadsPerBlock>>>();
    CHECK_CUDA(cudaPeekAtLastError());
    CHECK_CUDA(cudaDeviceSynchronize());

    // Show device info
    int device = 0;
    cudaDeviceProp prop;
    CHECK_CUDA(cudaGetDevice(&device));
    CHECK_CUDA(cudaGetDeviceProperties(&prop, device));
    printf("\n--- Device Info ---\n");
    printf("Name: %s\n", prop.name);
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Global Memory: %.2f GB\n", prop.totalGlobalMem / (1024.0 * 1024 * 1024));
    printf("Max Threads/Block: %d\n", prop.maxThreadsPerBlock);
    printf("Compute Capability: %d.%d\n", prop.major, prop.minor);

    return 0;
}

Output:
2. Matrix Addition

Code 1:

%%writefile matrix_add.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Each thread adds one element; the guard handles N not divisible by the block size.
__global__ void matrixAddGPU(float *A, float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N)
        C[row * N + col] = A[row * N + col] + B[row * N + col];
}

void matrixAddCPU(float *A, float *B, float *C, int N) {
    for (int i = 0; i < N * N; i++)
        C[i] = A[i] + B[i];
}

int main() {
    int N = 1024; // change sizes for experiments
    size_t size = N * N * sizeof(float);

    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C_cpu = (float*)malloc(size);
    float *h_C_gpu = (float*)malloc(size);

    for (int i = 0; i < N * N; i++) {
        h_A[i] = rand() % 100;
        h_B[i] = rand() % 100;
    }

    // CPU timing
    clock_t start_cpu = clock();
    matrixAddCPU(h_A, h_B, h_C_cpu, N);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy inputs
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events (captures kernel time only, not transfers)
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 threadsPerBlock(16, 16);
    dim3 blocksPerGrid((N + 15) / 16, (N + 15) / 16); // ceiling division

    cudaEventRecord(start);
    matrixAddGPU<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_C_gpu, d_C, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Matrix Size: %d x %d\n", N, N);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C_cpu); free(h_C_gpu);
    return 0;
}

Output 1:
Code 2 (same program with a small matrix, N = 30):

%%writefile matrix_add.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Each thread adds one element; the guard handles N not divisible by the block size.
__global__ void matrixAddGPU(float *A, float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N)
        C[row * N + col] = A[row * N + col] + B[row * N + col];
}

void matrixAddCPU(float *A, float *B, float *C, int N) {
    for (int i = 0; i < N * N; i++)
        C[i] = A[i] + B[i];
}

int main() {
    int N = 30; // change sizes for experiments
    size_t size = N * N * sizeof(float);

    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C_cpu = (float*)malloc(size);
    float *h_C_gpu = (float*)malloc(size);

    for (int i = 0; i < N * N; i++) {
        h_A[i] = rand() % 100;
        h_B[i] = rand() % 100;
    }

    // CPU timing
    clock_t start_cpu = clock();
    matrixAddCPU(h_A, h_B, h_C_cpu, N);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy inputs
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events (captures kernel time only, not transfers)
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 threadsPerBlock(16, 16);
    dim3 blocksPerGrid((N + 15) / 16, (N + 15) / 16); // ceiling division

    cudaEventRecord(start);
    matrixAddGPU<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_C_gpu, d_C, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Matrix Size: %d x %d\n", N, N);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C_cpu); free(h_C_gpu);
    return 0;
}

Output 2:
3. Matrix Transpose

Code 1:

%%writefile matrix_transpose.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Each thread moves one element from (row, col) to (col, row).
__global__ void transposeGPU(float *in, float *out, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N)
        out[col * N + row] = in[row * N + col];
}

void transposeCPU(float *in, float *out, int N) {
    for (int row = 0; row < N; row++)
        for (int col = 0; col < N; col++)
            out[col * N + row] = in[row * N + col];
}

int main() {
    int N = 1024;
    size_t size = N * N * sizeof(float);
    float *h_in = (float*)malloc(size);
    float *h_out_cpu = (float*)malloc(size);
    float *h_out_gpu = (float*)malloc(size);

    for (int i = 0; i < N * N; i++)
        h_in[i] = rand() % 100;

    // CPU timing
    clock_t start_cpu = clock();
    transposeCPU(h_in, h_out_cpu, N);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy the input matrix
    float *d_in, *d_out;
    cudaMalloc(&d_in, size);
    cudaMalloc(&d_out, size);
    cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 threadsPerBlock(16, 16);
    dim3 blocksPerGrid((N + 15) / 16, (N + 15) / 16); // ceiling division

    cudaEventRecord(start);
    transposeGPU<<<blocksPerGrid, threadsPerBlock>>>(d_in, d_out, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_out_gpu, d_out, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Matrix Size: %d x %d\n", N, N);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out_cpu); free(h_out_gpu);
    return 0;
}
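
Note that the naive kernel above reads rows but writes columns, so its global-memory writes are uncoalesced. A standard optimization (a sketch, not required by the assignment) stages a tile in shared memory so that both reads and writes stay coalesced; the +1 padding avoids shared-memory bank conflicts:

#define TILE 16
__global__ void transposeTiled(float *in, float *out, int N) {
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < N && y < N)
        tile[threadIdx.y][threadIdx.x] = in[y * N + x];  // coalesced read
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;  // swap block indices for the output tile
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < N && y < N)
        out[y * N + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}
// Launched with dim3 threads(TILE, TILE) and the same grid as above.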

Output 1:
Code 2 (same program with a small matrix, N = 50):

%%writefile matrix_transpose.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Each thread moves one element from (row, col) to (col, row).
__global__ void transposeGPU(float *in, float *out, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N)
        out[col * N + row] = in[row * N + col];
}

void transposeCPU(float *in, float *out, int N) {
    for (int row = 0; row < N; row++)
        for (int col = 0; col < N; col++)
            out[col * N + row] = in[row * N + col];
}

int main() {
    int N = 50;
    size_t size = N * N * sizeof(float);
    float *h_in = (float*)malloc(size);
    float *h_out_cpu = (float*)malloc(size);
    float *h_out_gpu = (float*)malloc(size);

    for (int i = 0; i < N * N; i++)
        h_in[i] = rand() % 100;

    // CPU timing
    clock_t start_cpu = clock();
    transposeCPU(h_in, h_out_cpu, N);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy the input matrix
    float *d_in, *d_out;
    cudaMalloc(&d_in, size);
    cudaMalloc(&d_out, size);
    cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 threadsPerBlock(16, 16);
    dim3 blocksPerGrid((N + 15) / 16, (N + 15) / 16); // ceiling division

    cudaEventRecord(start);
    transposeGPU<<<blocksPerGrid, threadsPerBlock>>>(d_in, d_out, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_out_gpu, d_out, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Matrix Size: %d x %d\n", N, N);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out_cpu); free(h_out_gpu);
    return 0;
}

Output 2:
4. Prefix Sum

Code 1:

%%writefile prefix_sum.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Naive parallel prefix sum: thread tid re-sums elements 0..tid,
// so total work is O(n^2) even though the threads run in parallel.
__global__ void prefixSumGPU(int *in, int *out, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) {
        int sum = 0;
        for (int i = 0; i <= tid; i++)
            sum += in[i];
        out[tid] = sum;
    }
}

// Sequential prefix sum: O(n) total work.
void prefixSumCPU(int *in, int *out, int n) {
    out[0] = in[0];
    for (int i = 1; i < n; i++)
        out[i] = out[i - 1] + in[i];
}

int main() {
    int n = 10000;
    size_t size = n * sizeof(int);
    int *h_in = (int*)malloc(size);
    int *h_out_cpu = (int*)malloc(size);
    int *h_out_gpu = (int*)malloc(size);

    for (int i = 0; i < n; i++)
        h_in[i] = rand() % 10;

    // CPU timing
    clock_t start_cpu = clock();
    prefixSumCPU(h_in, h_out_cpu, n);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy the input
    int *d_in, *d_out;
    cudaMalloc(&d_in, size);
    cudaMalloc(&d_out, size);
    cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    prefixSumGPU<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_out_gpu, d_out, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Elements: %d\n", n);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out_cpu); free(h_out_gpu);
    return 0;
}
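
Because the kernel above makes each thread re-sum all elements up to its index, the GPU advantage fades as n grows. A work-efficient alternative is a shared-memory scan; the sketch below implements a Hillis–Steele inclusive scan for a single block, assuming n does not exceed the block size (a multi-block scan would also need a pass that combines per-block sums, omitted here):

__global__ void scanHillisSteele(int *in, int *out, int n) {
    extern __shared__ int temp[];          // sized at launch: n * sizeof(int)
    int tid = threadIdx.x;
    if (tid < n) temp[tid] = in[tid];
    __syncthreads();
    for (int offset = 1; offset < n; offset *= 2) {
        // read the partner value before anyone overwrites it
        int val = (tid >= offset && tid < n) ? temp[tid - offset] : 0;
        __syncthreads();
        if (tid < n) temp[tid] += val;
        __syncthreads();
    }
    if (tid < n) out[tid] = temp[tid];
}
// Launch (sketch): scanHillisSteele<<<1, n, n * sizeof(int)>>>(d_in, d_out, n);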

Output 1:
Code 2 (same program with n = 10):

%%writefile prefix_sum.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Naive parallel prefix sum: thread tid re-sums elements 0..tid,
// so total work is O(n^2) even though the threads run in parallel.
__global__ void prefixSumGPU(int *in, int *out, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) {
        int sum = 0;
        for (int i = 0; i <= tid; i++)
            sum += in[i];
        out[tid] = sum;
    }
}

// Sequential prefix sum: O(n) total work.
void prefixSumCPU(int *in, int *out, int n) {
    out[0] = in[0];
    for (int i = 1; i < n; i++)
        out[i] = out[i - 1] + in[i];
}

int main() {
    int n = 10;
    size_t size = n * sizeof(int);
    int *h_in = (int*)malloc(size);
    int *h_out_cpu = (int*)malloc(size);
    int *h_out_gpu = (int*)malloc(size);

    for (int i = 0; i < n; i++)
        h_in[i] = rand() % 10;

    // CPU timing
    clock_t start_cpu = clock();
    prefixSumCPU(h_in, h_out_cpu, n);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy the input
    int *d_in, *d_out;
    cudaMalloc(&d_in, size);
    cudaMalloc(&d_out, size);
    cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    prefixSumGPU<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_out_gpu, d_out, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Elements: %d\n", n);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out_cpu); free(h_out_gpu);
    return 0;
}

Output 2:
5. Square of Array Elements

Code 1:

%%writefile square_array.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Each thread squares one element in place.
__global__ void squareGPU(float *arr, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) arr[tid] *= arr[tid];
}

void squareCPU(float *arr, int n) {
    for (int i = 0; i < n; i++)
        arr[i] *= arr[i];
}

int main() {
    int n = 1000000;
    size_t size = n * sizeof(float);
    float *h_arr_cpu = (float*)malloc(size);
    float *h_arr_gpu = (float*)malloc(size);

    // Same random input for both versions so results can be compared
    for (int i = 0; i < n; i++)
        h_arr_cpu[i] = h_arr_gpu[i] = (float)(rand() % 100);

    // CPU timing
    clock_t start_cpu = clock();
    squareCPU(h_arr_cpu, n);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy the input
    float *d_arr;
    cudaMalloc(&d_arr, size);
    cudaMemcpy(d_arr, h_arr_gpu, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    squareGPU<<<(n + 255) / 256, 256>>>(d_arr, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_arr_gpu, d_arr, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Elements: %d\n", n);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_arr);
    free(h_arr_cpu); free(h_arr_gpu);
    return 0;
}

Output 1:
Code 2 (same program with n = 36):

%%writefile square_array.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Each thread squares one element in place.
__global__ void squareGPU(float *arr, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) arr[tid] *= arr[tid];
}

void squareCPU(float *arr, int n) {
    for (int i = 0; i < n; i++)
        arr[i] *= arr[i];
}

int main() {
    int n = 36;
    size_t size = n * sizeof(float);
    float *h_arr_cpu = (float*)malloc(size);
    float *h_arr_gpu = (float*)malloc(size);

    // Same random input for both versions so results can be compared
    for (int i = 0; i < n; i++)
        h_arr_cpu[i] = h_arr_gpu[i] = (float)(rand() % 100);

    // CPU timing
    clock_t start_cpu = clock();
    squareCPU(h_arr_cpu, n);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy the input
    float *d_arr;
    cudaMalloc(&d_arr, size);
    cudaMemcpy(d_arr, h_arr_gpu, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    squareGPU<<<(n + 255) / 256, 256>>>(d_arr, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_arr_gpu, d_arr, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Elements: %d\n", n);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_arr);
    free(h_arr_cpu); free(h_arr_gpu);
    return 0;
}

Output 2:
6. Vector Addition

Code 1:

%%writefile vector_add.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Each thread adds one pair of elements.
__global__ void vectorAddGPU(float *A, float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

void vectorAddCPU(float *A, float *B, float *C, int n) {
    for (int i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}

int main() {
    int n = 1000000;
    size_t size = n * sizeof(float);
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C_cpu = (float*)malloc(size);
    float *h_C_gpu = (float*)malloc(size);

    for (int i = 0; i < n; i++) {
        h_A[i] = rand() % 100;
        h_B[i] = rand() % 100;
    }

    // CPU timing
    clock_t start_cpu = clock();
    vectorAddCPU(h_A, h_B, h_C_cpu, n);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy inputs
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    vectorAddGPU<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_C_gpu, d_C, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Elements: %d\n", n);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C_cpu); free(h_C_gpu);
    return 0;
}

Output 1:
Code 2 (same program with n = 2000):

%%writefile vector_add.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Each thread adds one pair of elements.
__global__ void vectorAddGPU(float *A, float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

void vectorAddCPU(float *A, float *B, float *C, int n) {
    for (int i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}

int main() {
    int n = 2000;
    size_t size = n * sizeof(float);
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C_cpu = (float*)malloc(size);
    float *h_C_gpu = (float*)malloc(size);

    for (int i = 0; i < n; i++) {
        h_A[i] = rand() % 100;
        h_B[i] = rand() % 100;
    }

    // CPU timing
    clock_t start_cpu = clock();
    vectorAddCPU(h_A, h_B, h_C_cpu, n);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy inputs
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    vectorAddGPU<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_C_gpu, d_C, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Elements: %d\n", n);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C_cpu); free(h_C_gpu);
    return 0;
}

Output 2:
7. Vector Multiplication

Code 1:

%%writefile vector_mul.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Each thread multiplies one pair of elements (element-wise product).
__global__ void vectorMulGPU(float *A, float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] * B[i];
}

void vectorMulCPU(float *A, float *B, float *C, int n) {
    for (int i = 0; i < n; i++)
        C[i] = A[i] * B[i];
}

int main() {
    int n = 1000000;
    size_t size = n * sizeof(float);
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C_cpu = (float*)malloc(size);
    float *h_C_gpu = (float*)malloc(size);

    for (int i = 0; i < n; i++) {
        h_A[i] = rand() % 100;
        h_B[i] = rand() % 100;
    }

    // CPU timing
    clock_t start_cpu = clock();
    vectorMulCPU(h_A, h_B, h_C_cpu, n);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy inputs
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    vectorMulGPU<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_C_gpu, d_C, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Elements: %d\n", n);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C_cpu); free(h_C_gpu);
    return 0;
}

Output 1:
Code 2 (same program with n = 10):

%%writefile vector_mul.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <time.h>

// Each thread multiplies one pair of elements (element-wise product).
__global__ void vectorMulGPU(float *A, float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] * B[i];
}

void vectorMulCPU(float *A, float *B, float *C, int n) {
    for (int i = 0; i < n; i++)
        C[i] = A[i] * B[i];
}

int main() {
    int n = 10;
    size_t size = n * sizeof(float);
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C_cpu = (float*)malloc(size);
    float *h_C_gpu = (float*)malloc(size);

    for (int i = 0; i < n; i++) {
        h_A[i] = rand() % 100;
        h_B[i] = rand() % 100;
    }

    // CPU timing
    clock_t start_cpu = clock();
    vectorMulCPU(h_A, h_B, h_C_cpu, n);
    clock_t end_cpu = clock();
    double cpu_time = (double)(end_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;

    // Allocate GPU memory and copy inputs
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // GPU timing with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    vectorMulGPU<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaMemcpy(h_C_gpu, d_C, size, cudaMemcpyDeviceToHost);

    float gpu_time = 0;
    cudaEventElapsedTime(&gpu_time, start, stop);

    printf("Elements: %d\n", n);
    printf("CPU Time: %f ms\n", cpu_time);
    printf("GPU Time: %f ms\n", gpu_time);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C_cpu); free(h_C_gpu);
    return 0;
}

Output 2:

Conclusion:

This assignment demonstrated the performance benefits of parallel computing with CUDA over traditional serial CPU execution. Across experiments on matrix addition, matrix transpose, prefix sum, vector addition, and vector multiplication, GPU performance improved markedly with increasing data size, owing to the GPU's ability to execute thousands of threads in parallel.

While the CPU implementations are simpler and perform well for small datasets, where kernel launch and data-transfer overheads dominate the GPU's runtime, they become slower for large-scale computations because of the CPU's limited core count and sequential execution model. The GPU, with its massive parallelism and optimized memory hierarchy, achieved substantial reductions in execution time for large problem sizes.
