NPTEL Online Certification Courses
Indian Institute of Technology Kharagpur
GPU Architectures and Programming
Assignment - Week 3
TYPE OF QUESTION: Objective
Number of questions: 10
Total marks: 10 X 1 = 10
QUESTION 1:
How are CUDA threads invoked to execute a kernel from the host?
Options:
A) Using a loop structure
B) With the <<<...>>> execution configuration syntax
C) By specifying thread IDs in the main function
D) Automatically by the GPU scheduler
Answer:
B) With the <<<...>>> execution configuration syntax
QUESTION 2:
What is the purpose of the threadIdx built-in variable in a CUDA kernel?
Options:
A) Provides a random number
B) Identifies the current CUDA block
C) Gives the total number of threads
D) Provides a unique identifier for each thread within its block
Answer:
D) Provides a unique identifier for each thread within its block
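As an illustration of the answer above, the following is a minimal sketch (not part of the original assignment) that records both threadIdx.x and a grid-wide index for every thread. The names recordIds and runRecordIds are hypothetical. It assumes a CUDA-capable device is present and keeps error handling minimal.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread stores its block-local index and its grid-wide index.
__global__ void recordIds(int *localId, int *globalId, int n) {
    int g = blockIdx.x * blockDim.x + threadIdx.x;  // grid-wide index
    if (g < n) {
        localId[g] = threadIdx.x;  // the same values repeat in every block
        globalId[g] = g;           // distinct for every thread in the grid
    }
}

// Hypothetical host helper: returns n local ids followed by n global ids.
std::vector<int> runRecordIds(int blocks, int threadsPerBlock) {
    int n = blocks * threadsPerBlock;
    std::vector<int> out(2 * n, -1);
    int *d = nullptr;
    if (cudaMalloc((void **)&d, 2 * n * sizeof(int)) != cudaSuccess) return out;
    recordIds<<<blocks, threadsPerBlock>>>(d, d + n, n);
    cudaMemcpy(out.data(), d, 2 * n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return out;
}
```

Calling runRecordIds(2, 4) launches 8 threads. The local ids restart at 0 in the second block, which is why threadIdx is usually combined with blockIdx and blockDim when a grid-wide index is needed.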
QUESTION 3:
Any function that is launched by the host and executed on the GPU as a kernel must be qualified with
which keyword?
Options:
A) __device__
B) __host__
C) __kernel__
D) __global__
Answer:
D) __global__
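The qualifiers listed in the options can be contrasted in a short sketch. This is an illustrative example under the assumption that a CUDA-capable device is available; the function names square, squareAll and squareOnGpu are hypothetical.

```cuda
#include <cuda_runtime.h>

// __device__: runs on the GPU and is callable only from device code.
__device__ float square(float x) { return x * x; }

// __global__: a kernel. It is launched from the host with <<<...>>>,
// runs on the GPU, and must return void.
__global__ void squareAll(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = square(data[i]);
}

// __host__: runs on the CPU (this is also the default for unqualified functions).
__host__ float squareOnGpu(float x) {
    float *d = nullptr;
    float result = 0.0f;
    if (cudaMalloc((void **)&d, sizeof(float)) != cudaSuccess) return result;
    cudaMemcpy(d, &x, sizeof(float), cudaMemcpyHostToDevice);
    squareAll<<<1, 1>>>(d, 1);  // kernel launch from host code
    cudaMemcpy(&result, d, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return result;
}
```

There is no __kernel__ qualifier in CUDA C++; that spelling belongs to OpenCL.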
QUESTION 4:
What does the <<<1, N>>> syntax signify in the kernel invocation VecAdd<<<1, N>>>(A, B, C)?
Options:
A) 1 block of threads, N threads per block
B) N blocks of threads, 1 thread per block
C) N blocks with variable thread count
D) 1 thread per block, 1 block in total
Answer:
A) 1 block of threads, N threads per block
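A minimal sketch of the launch from the question, assuming a CUDA-capable device and vectors short enough to fit in one block (the per-block limit is typically 1024 threads). The host wrapper vecAddOneBlock is a hypothetical helper added for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// With a single block, threadIdx.x alone identifies the element to process.
__global__ void VecAdd(const float *A, const float *B, float *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

// Hypothetical host wrapper. Assumes a.size() == b.size() and at most 1024 elements.
std::vector<float> vecAddOneBlock(const std::vector<float> &a,
                                  const std::vector<float> &b) {
    int N = (int)a.size();
    size_t bytes = N * sizeof(float);
    std::vector<float> c(N, 0.0f);
    float *d = nullptr;  // one allocation holding A, B and C back to back
    if (cudaMalloc((void **)&d, 3 * bytes) != cudaSuccess) return c;
    float *A = d, *B = d + N, *C = d + 2 * N;
    cudaMemcpy(A, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(B, b.data(), bytes, cudaMemcpyHostToDevice);
    VecAdd<<<1, N>>>(A, B, C);  // first argument: 1 block; second: N threads per block
    cudaMemcpy(c.data(), C, bytes, cudaMemcpyDeviceToHost);  // ordered after the kernel
    cudaFree(d);
    return c;
}
```

Swapping the arguments to <<<N, 1>>> would launch N blocks of one thread each, and the kernel would then need blockIdx.x instead of threadIdx.x as the index.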
QUESTION 5:
Given a GPU with 10 streaming multiprocessors (SMs), each supporting a maximum of 1024 threads, and a
CUDA kernel launched with a block size of 128 threads, calculate the maximum number of active blocks
on the GPU.
Options:
A. 80
B. 100
C. 200
D. 1280
Answer:
A. 80
Detailed Solution:
Maximum active blocks per SM = Maximum threads per SM / Threads per block = 1024 / 128 = 8
Maximum active blocks on GPU = Maximum active blocks per SM * Number of SMs = 8 * 10 = 80
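The two formulas can be checked with a small host-side sketch. It models only the thread-count limit used in the question; the function names are hypothetical, and real hardware also limits resident blocks through registers, shared memory and a per-SM block cap, none of which is modeled here.

```cuda
// Thread-limit model only (an assumption matching the question's simplification).
constexpr int maxActiveBlocksPerSM(int maxThreadsPerSM, int threadsPerBlock) {
    return maxThreadsPerSM / threadsPerBlock;  // integer division: whole blocks only
}

constexpr int maxActiveBlocks(int numSMs, int maxThreadsPerSM, int threadsPerBlock) {
    return maxActiveBlocksPerSM(maxThreadsPerSM, threadsPerBlock) * numSMs;
}

// Values from the question: 10 SMs, 1024 threads per SM, 128 threads per block.
static_assert(maxActiveBlocksPerSM(1024, 128) == 8, "8 blocks fit on one SM");
static_assert(maxActiveBlocks(10, 1024, 128) == 80, "80 blocks across 10 SMs");
```

For an actual kernel on actual hardware, the CUDA runtime function cudaOccupancyMaxActiveBlocksPerMultiprocessor reports the per-SM figure with the other resource limits taken into account.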
QUESTION 6:
Calculate the execution time (in seconds) for a CUDA kernel that processes 8192 elements with a block size
of 128 threads and an average execution time of 2 milliseconds per block, considering that only one SM is
available on the target GPU for executing the blocks.
Options:
A. 0.512 seconds
B. 0.256 seconds
C. 1.024 seconds
D. 0.128 seconds
Answer:
D. 0.128 seconds
Detailed Solution:
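With a single SM, the blocks are taken to execute one after another, so the execution time is the
product of the number of blocks and the average execution time per block.
Number of blocks = Number of elements / Threads per block = 8192 / 128 = 64
Execution time = 64 * 2 ms = 128 ms = 0.128 seconds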
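The same calculation as a host-side sketch. It assumes the question's simplified model, in which the single SM runs one block at a time; a real SM can usually hold several blocks concurrently. The function names are hypothetical.

```cuda
// Number of blocks needed so that every element gets one thread.
constexpr int blocksNeeded(int elements, int threadsPerBlock) {
    return (elements + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
}

// Serial model (assumption): blocks run back to back on a single SM.
constexpr double serialKernelSeconds(int elements, int threadsPerBlock,
                                     double msPerBlock) {
    return blocksNeeded(elements, threadsPerBlock) * msPerBlock / 1000.0;
}

// Values from the question: 8192 elements, 128 threads per block.
static_assert(blocksNeeded(8192, 128) == 64, "8192 / 128 = 64 blocks");
```

Calling serialKernelSeconds(8192, 128, 2.0) multiplies the 64 blocks by 2 ms and converts the result to seconds.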
QUESTION 7:
Given a CUDA kernel with a grid size of 2 blocks and 256 threads per block, calculate the total number of
threads launched by the kernel.
Options:
A. 256
B. 512
C. 1024
D. 4096
Answer:
B. 512
Detailed Solution:
Total threads launched = Number of blocks (grid size) * Threads per block = 2 * 256 = 512
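One way to confirm the count empirically is to let every launched thread increment a shared counter. This is an illustrative sketch assuming a CUDA-capable device; countThreads and launchedThreads are hypothetical names.

```cuda
#include <cuda_runtime.h>

// Every launched thread adds one to the counter.
__global__ void countThreads(unsigned int *counter) {
    atomicAdd(counter, 1u);
}

// Hypothetical host helper: returns how many threads actually ran.
unsigned int launchedThreads(int blocks, int threadsPerBlock) {
    unsigned int *d = nullptr;
    unsigned int h = 0;
    if (cudaMalloc((void **)&d, sizeof(unsigned int)) != cudaSuccess) return 0;
    cudaMemset(d, 0, sizeof(unsigned int));
    countThreads<<<blocks, threadsPerBlock>>>(d);  // grid size, block size
    cudaMemcpy(&h, d, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

With the values from the question, launchedThreads(2, 256) counts the threads of a grid of 2 blocks with 256 threads each.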
QUESTION 8:
What is the CUDA function call required to copy an array h_A from the CPU memory to the GPU
memory, where it is known as d_A?
Options:
A. cudaMemcpy(h_A, d_A, size, cudaMemcpyHostToDevice);
B. cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
C. cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);
D. cudaMemcpy(d_A, h_A, size, cudaMemcpyDeviceToHost);
Answer:
B. cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
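The argument order can be seen in a short round-trip sketch, assuming a CUDA-capable device. The helper roundTrip and the buffer h_back are hypothetical names; h_A, d_A and size follow the question.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Copies h_A to the device as d_A, then copies it back into a new host buffer.
std::vector<float> roundTrip(const std::vector<float> &h_A) {
    size_t size = h_A.size() * sizeof(float);  // size is in bytes, not elements
    std::vector<float> h_back(h_A.size(), 0.0f);
    float *d_A = nullptr;
    if (cudaMalloc((void **)&d_A, size) != cudaSuccess) return h_back;
    // Argument order: destination, source, byte count, direction.
    cudaMemcpy(d_A, h_A.data(), size, cudaMemcpyHostToDevice);
    cudaMemcpy(h_back.data(), d_A, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    return h_back;
}
```

The destination pointer always comes first, as in memcpy, and the direction flag has to agree with where the two pointers actually live.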
QUESTION 9:
Which of the following options is true regarding the matrix multiplication kernel in the code shown
below:
// d_M and d_N are the N x N input matrices and d_P is the N x N product matrix,
// all stored in row-major order.
__global__ void MatrixMulKernel(float *d_M, float *d_N, float *d_P, int N) {
    // Each thread computes one element of d_P.
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    if ((i < N) && (j < N)) {
        float Pvalue = 0.0f;
        for (int k = 0; k < N; ++k) {
            Pvalue += d_M[i * N + k] * d_N[k * N + j];
        }
        d_P[i * N + j] = Pvalue;
    }
}
Options:
A. The kernel computes each element of the output matrix (d_P) in parallel and calculates its
value using a loop that iterates over the corresponding row of the first matrix (d_M)
and the corresponding column of the second matrix (d_N) sequentially.
B. The kernel iterates over each element of the output matrix (d_P) sequentially and calculates
its value using a loop that iterates over the corresponding row of the first matrix
(d_M) and the corresponding column of the second matrix (d_N) in parallel.
C. The computation of individual elements in the product matrix d_P can be carried out
in parallel using threads along a different dimension than the ones used for the parallel
computation of the entire product matrix.
D. The computation of individual elements in the product matrix d_P can be carried out
in parallel using threads along one of the same dimensions as the ones used for the parallel
computation of the entire product matrix.
Answer:
A. The kernel computes each element of the output matrix (d_P) in parallel and calculates its
value using a loop that iterates over the corresponding row of the first matrix (d_M) and the
corresponding column of the second matrix (d_N) sequentially.
Detailed Solution:
Each thread is mapped to one output element (i, j) through blockIdx, blockDim and threadIdx, so the
elements of d_P are computed in parallel. Within a thread, the for loop over k accumulates the dot
product of row i of d_M and column j of d_N sequentially.
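For context, the kernel above could be launched from the host roughly as follows. This is a sketch under assumptions: a CUDA-capable device, row-major N x N matrices, a 16 x 16 block shape chosen arbitrarily, and a hypothetical wrapper named matMul. The kernel is repeated (with const added to the inputs) so the example is self-contained.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per output element; the k loop runs sequentially inside each thread.
__global__ void MatrixMulKernel(const float *d_M, const float *d_N, float *d_P, int N) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    if ((i < N) && (j < N)) {
        float Pvalue = 0.0f;
        for (int k = 0; k < N; ++k) Pvalue += d_M[i * N + k] * d_N[k * N + j];
        d_P[i * N + j] = Pvalue;
    }
}

// Hypothetical host wrapper: M and Nmat hold N*N floats each in row-major order.
std::vector<float> matMul(const std::vector<float> &M,
                          const std::vector<float> &Nmat, int N) {
    size_t count = (size_t)N * N;
    size_t bytes = count * sizeof(float);
    std::vector<float> P(count, 0.0f);
    float *d = nullptr;  // one allocation holding d_M, d_N and d_P back to back
    if (cudaMalloc((void **)&d, 3 * bytes) != cudaSuccess) return P;
    float *d_M = d, *d_N = d + count, *d_P = d + 2 * count;
    cudaMemcpy(d_M, M.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_N, Nmat.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(16, 16);  // 256 threads per block (arbitrary choice)
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    MatrixMulKernel<<<grid, block>>>(d_M, d_N, d_P, N);
    cudaMemcpy(P.data(), d_P, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return P;
}
```

The grid is rounded up so that it covers all N x N elements, which is why the kernel needs the (i < N) && (j < N) guard for the surplus threads.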
QUESTION 10:
Which of the following statements regarding CUDA memory allocation is false?
Options:
A. It is possible to allocate memory in a CUDA device kernel for an integer array.
B. It is possible to allocate memory in a CUDA device kernel by passing the pointer to an
integer array.
C. An array created inside a CUDA device kernel cannot be directly dereferenced in the host
side.
D. An array created inside a CUDA device kernel can be copied to another CUDA device
kernel by calling the function cudaMemcpy using the flag cudaMemcpyDeviceToDevice.
Answer:
B. It is possible to allocate memory in a CUDA device kernel by passing the pointer to an
integer array.
Detailed Solution:
It is not possible to allocate memory by passing the pointer to an integer array directly.
cudaMalloc is declared with a void ** parameter, so the address of the pointer is typecast to
void ** before passing.
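The cast mentioned in the solution, together with the cudaMemcpyDeviceToDevice flag from option D, can be sketched as follows. Note the assumptions: this sketch allocates from the host with cudaMalloc rather than inside a kernel, it assumes a CUDA-capable device, and the helper name copyOnDevice is hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Allocates two device arrays, copies one into the other on the device,
// and returns the result to the host.
std::vector<int> copyOnDevice(const std::vector<int> &h_src) {
    size_t bytes = h_src.size() * sizeof(int);
    std::vector<int> h_dst(h_src.size(), 0);
    int *d_a = nullptr;
    int *d_b = nullptr;
    // cudaMalloc receives the address of the pointer, cast to void **.
    if (cudaMalloc((void **)&d_a, bytes) != cudaSuccess) return h_dst;
    if (cudaMalloc((void **)&d_b, bytes) != cudaSuccess) {
        cudaFree(d_a);
        return h_dst;
    }
    cudaMemcpy(d_a, h_src.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, d_a, bytes, cudaMemcpyDeviceToDevice);  // device to device
    // d_a and d_b are device pointers: reading d_b[0] here on the host is invalid,
    // so the data is copied back before it is used.
    cudaMemcpy(h_dst.data(), d_b, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return h_dst;
}
```

The copy back to the host at the end reflects option C: device memory has to be transferred with cudaMemcpy rather than dereferenced directly in host code.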