SRI RAMAKRISHNA ENGINEERING COLLEGE
[Educational Service: SNR Sons Charitable Trust]
[Autonomous Institution, Reaccredited by NAAC with ‘A+’ Grade]
[Approved by AICTE and Permanently Affiliated to Anna University, Chennai]
[ISO 9001:2015 Certified and all eligible programmes Accredited by NBA]
VATTAMALAIPALAYAM, N.G.G.O. COLONY POST, COIMBATORE – 641 022.
Department of Computer Science & Engineering
Internal Test – II
Date: 25.04.2024          Department: CSE
Semester: VI              Class/Section: III B.E. CSE & III M.Tech CSE
Duration: 2:00 Hours      Maximum Marks: 50
Course Code & Title: 20CS257 & High Performance Computing - Answer Key
Course Outcomes Addressed:
On successful completion of the course, the students will be able to
CO3: Apply parallelism to extract maximum performance in multicore and shared memory processors.
Questions    Cognitive Level/CO
PART – A (Answer All Questions) (10*1 = 10 Marks)
1. In MPI, which function is commonly used to send a message from one process to another? R/CO3
a) MPI_Receive    b) MPI_Send    c) MPI_Comm_rank    d) MPI_Bcast
2. The purpose of the MPI_Comm_size function in MPI is R/CO3
a) To initialize MPI communication    b) To get the rank of the process    c) To determine the size of the communicator    d) To finalize MPI communication
3. In MPI, what does the term “rank” refer to? R/CO3
a) The size of the communicator    b) The process identifier    c) The message size    d) The type of data being sent
4. Data parallelism is performed as ____, while task parallelism is performed as _____. R/CO3
a) Synchronous, Asynchronous Computation    b) Synchronous, Synchronous Computation    c) Asynchronous, Synchronous Computation    d) Asynchronous, Asynchronous Computation
5. The style of parallelism supported on GPUs is best described as R/CO3
a) SISD - Single Instruction Single Data    b) MISD - Multiple Instruction Single Data    c) SIMT - Single Instruction Multiple Thread    d) SIMD - Single Instruction Multiple Data
6. Identify the limitations of a CUDA kernel. R/CO3
a) recursion, call stack, static variable declaration    b) no recursion, no call stack, no static variable declarations    c) recursion, no call stack, static variable declarations    d) no recursion, call stack, no static variable declaration
7. Which of the following correctly describes a GPU kernel? R/CO3
(A) All thread blocks involved in the same computation use the same kernel
(B) A kernel is part of the GPU's internal micro-operating system, allowing it to act as an independent host
(C) A kernel may contain a mix of host and GPU code
a) A    b) B    c) C    d) A, B, C
8. _________________ MPI function is used for blocking point-to-point communication to receive a message. R/CO3
(Answer: MPI_Recv)
9. A CUDA program is comprised of two primary components: a host and a ________. R/CO3
(Answer: GPU kernel)
10. ___________ is a technique which allows optimal usage of the global memory bandwidth. R/CO3
(Answer: Memory coalescing)
PART – B (Answer All Questions) (5*2 = 10 Marks)    Cognitive Level/CO
11. Develop a “Hello World” MPI program in C. Ap/CO3
(Program: 2 Marks)
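The key records only the mark split for this question; a minimal sketch of the expected answer, assuming the standard MPI C API, is given below. Such a program is typically compiled with mpicc and launched with mpirun -np <number of processes>.
#include<stdio.h>
#include<mpi.h>
int main(int argc, char *argv[])
{
int rank, size;
MPI_Init(&argc, &argv); /* initialize the MPI environment */
MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* rank (identifier) of this process */
MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes in the communicator */
printf("Hello World from process %d of %d\n", rank, size);
MPI_Finalize(); /* shut down the MPI environment */
return 0;
}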
12. Summarize collective communications. U/CO3
(Summary: 2 Marks)
Collective communication involves one or more senders and one or more receivers. Examples include broadcast of a single data item from one process to all other processes, broadcast of unique items from one process to all other processes, and the inverse operation: gathering data from a group of processes.
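As a hedged illustration of the broadcast case described above (not part of the original key), the sketch below has rank 0 send a single integer to every other process with MPI_Bcast:
#include<stdio.h>
#include<mpi.h>
int main(int argc, char *argv[])
{
int rank, value = 0;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if(rank == 0)
value = 100; /* only the root (rank 0) holds the data before the broadcast */
MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD); /* broadcast one int from rank 0 to all ranks */
printf("Process %d received value %d\n", rank, value);
MPI_Finalize();
return 0;
}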
13. Illustrate the modern GPU architecture. U/CO3
(Architecture Diagram: 2 Marks)
14. Differentiate between Task Parallelism and Data Parallelism. U/CO3
(Difference: 2 Marks)
15. Outline the compilation process of a CUDA program. U/CO3
(Compilation Process: 2 Marks)
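The key records only the mark split here. As an illustrative sketch (not part of the original key), nvcc splits a .cu file into device code, which it compiles to PTX/machine code for the GPU, and host code, which it passes to the host C/C++ compiler, and then links both into one executable. For the vector-addition file from Question 16, the compile-and-run steps would look like:
nvcc matrix1DADD.cu -o vecadd    # nvcc compiles device and host code separately, then links them
./vecadd                         # run on a machine with a CUDA-capable GPU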
PART – C (3*10 = 30 Marks)
16. Compulsory Question: Ap/CO3
Develop a CUDA program to perform vector addition and vector multiplication with
Blocks and Threads.
(Vector Addition Program: 5 Marks, Vector Multiplication Program: 5 Marks)
Vector Addition
%%writefile matrix1DADD.cu
#include<stdio.h>
#include<cuda.h>
__global__ void arradd(int *x,int *y, int *z) //kernel definition
{
int id=blockIdx.x;
/* blockIdx.x gives the respective block id which starts from 0 */
z[id]=x[id]+y[id];
}
int main()
{
int a[6];
int b[6];
int c[6];
int *d,*e,*f;
int i;
printf("\n Enter six elements of first array vector\n");
for(i=0;i<6;i++)
{
scanf("%d",&a[i]);
}
printf("\n Enter six elements of second array vector\n");
for(i=0;i<6;i++)
{
scanf("%d",&b[i]);
}
/* cudaMalloc() allocates memory from Global memory on GPU */
cudaMalloc((void **)&d,6*sizeof(int));
cudaMalloc((void **)&e,6*sizeof(int));
cudaMalloc((void **)&f,6*sizeof(int));
/* cudaMemcpy() copies the contents from the source to the destination. Here the destination is
the GPU (d, e) and the source is the CPU (a, b) */
cudaMemcpy(d,a,6*sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpy(e,b,6*sizeof(int),cudaMemcpyHostToDevice);
/* call to kernel. Here 6 is number of blocks, 1 is the number of threads per block and d,e,f
are the arguments */
arradd<<<6,1>>>(d,e,f);
/* Here we are copying content from GPU(Device) to CPU(Host) */
cudaMemcpy(c,f,6*sizeof(int),cudaMemcpyDeviceToHost);
printf("\nSum of two arrays:\n ");
for(i=0;i<6;i++)
{
printf("%d\t",c[i]);
}
/* Free the memory allocated to pointers d,e,f */
cudaFree(d);
cudaFree(e);
cudaFree(f);
return 0;
}
Vector Multiplication
%%writefile Matrixmul.cu
#include<stdio.h>
#include<cuda.h>
__global__ void VecMul(float* A, float* B, float* C, int N)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if(i < N)
C[i] = A[i]*B[i];
}
int main()
{
int i;
const int N = 10; // N is a compile-time constant so the host arrays below have a fixed size
size_t size = N * sizeof(float);
// Allocating host and initializing
float A[N],B[N],C[N];
for(i=0;i<N;i++) {
A[i] = B[i] = i;
}
// Allocating device and copying to device
float *d_A, *d_B, *d_C;
cudaMalloc((void **)&d_A, size);
cudaMalloc((void **)&d_B, size);
cudaMalloc((void **)&d_C, size);
cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
// Invoking kernel
int threadsPerBlock = 8;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
VecMul<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
// Copy result from device to host
cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
for(i=0;i<N;i++)
printf("%f\n", C[i]);
// Free the device memory allocated to d_A, d_B, d_C
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
return 0;
}
Answer Any Two Questions
17. Consider indexing an array containing one element per thread (8 threads per block). Ap/CO3
Report the thread which will handle the shaded element in the following array.
(Formula with explanation: 5 Marks, Calculation: 5 Marks)
0 1 2 3 4 5 6 7 | 8 9 10 11 12 13 14 15 | 16 17 18 [19] 20 21 22
int index = threadIdx.x + blockIdx.x * M, where M = blockDim.x = 8 threads per block
          = 3 + 2 * 8
          = 19
So the shaded element (index 19) is handled by thread 3 (threadIdx.x = 3) of block 2 (blockIdx.x = 2).
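A minimal CUDA sketch (not part of the original key) that verifies this mapping by launching 3 blocks of 8 threads and printing which thread owns the assumed shaded index 19:
#include<stdio.h>
#include<cuda.h>
__global__ void findOwner(int target)
{
int index = threadIdx.x + blockIdx.x * blockDim.x; // same formula as above, with M = blockDim.x
if(index == target)
printf("Element %d is handled by thread %d of block %d\n", index, threadIdx.x, blockIdx.x);
}
int main()
{
findOwner<<<3,8>>>(19); // 23 elements with 8 threads per block need ceil(23/8) = 3 blocks
cudaDeviceSynchronize(); // wait for the kernel so its printf output is flushed
return 0;
}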
18. Categorize the Memory Visibility in CUDA. Ap/ CO3
(Category with explanation: 10 Marks)
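No written answer is included in the key for this question. As a hedged sketch, the program below labels the usual CUDA memory spaces by their visibility: registers/local variables are private to a thread, __shared__ memory is visible to all threads of one block, and __device__ global and __constant__ memory are visible to every thread in the grid:
#include<stdio.h>
#include<cuda.h>
__device__ int globalValue = 5; // global memory: visible to every thread in every block
__constant__ float scale = 2.0f; // constant memory: read-only, visible to every thread
__global__ void visibilityDemo(float *out)
{
__shared__ float tile[8]; // shared memory: visible only to threads of the same block
float localVal = threadIdx.x * scale; // register/local variable: private to this thread
tile[threadIdx.x] = localVal;
__syncthreads(); // make the shared tile visible to all threads in the block
int idx = blockIdx.x * blockDim.x + threadIdx.x;
out[idx] = tile[threadIdx.x] + globalValue;
}
int main()
{
float h_out[8], *d_out;
cudaMalloc((void **)&d_out, 8*sizeof(float));
visibilityDemo<<<1,8>>>(d_out);
cudaMemcpy(h_out, d_out, 8*sizeof(float), cudaMemcpyDeviceToHost);
for(int i=0;i<8;i++)
printf("%f\n", h_out[i]);
cudaFree(d_out);
return 0;
}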
19. Examine messages and point-to-point communication in MPI. Ap/CO3
(Answer: 10 Marks)
Standard MPI data types for C