Parallel Computing With GPU
Rohit Khatana
4344
Seminar guide
Prof. Aparna Joshi
ARMY INSTITUTE OF TECHNOLOGY
Content
1. What is parallel computing?
2. GPU
3. CUDA
4. Applications
What is Parallel Computing?
Performing or executing a task/program
on more than one machine or processor.
Put simply: dividing a job among a group.
What kind of processors will we
build?
(major design constraint: power)
CPU: complex control hardware
- Flexibility + performance
- Expensive in terms of power
GPU: simpler control hardware
- More hardware for computation
- Potentially more power-efficient (ops/watt)
- More restrictive programming model
A modern GPU has many more ALUs than a CPU
Graphics Logical Pipeline
• The GPU receives geometry information
from the CPU as an input and provides
a picture as an output
• Let’s see how that happens
Host Interface
• The host interface is the communication bridge
between the CPU and the GPU
• It receives commands from the CPU and also
pulls geometry information from system
memory
• It outputs a stream of vertices in object space
with all their associated information (normals,
texture coordinates, per-vertex color, etc.)
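As an illustrative sketch (a hypothetical layout, not an actual driver structure), the per-vertex information in that stream might look like:

typedef struct {
    float position[3];  // object-space position
    float normal[3];    // surface normal
    float texcoord[2];  // texture coordinates
    float color[4];     // per-vertex RGBA color
} Vertex;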
Vertex Processing
• The vertex processing stage receives vertices from the
host interface in object space and outputs them in screen
space
• This may be a simple linear transformation, or a complex
operation involving morphing effects
• No new vertices are created in this stage, and no
vertices are discarded (input/output has 1:1 mapping)
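One common formulation of that object-space-to-screen-space mapping (standard graphics math, not specific to these slides; $M$, $V$, $P$ are the model, view, and projection matrices applied to column vectors):

$v_{\text{clip}} = P\,V\,M\,v_{\text{object}}, \qquad (x_{\text{ndc}}, y_{\text{ndc}}, z_{\text{ndc}}) = (x_{\text{clip}}, y_{\text{clip}}, z_{\text{clip}}) / w_{\text{clip}}$

followed by a viewport mapping from normalized device coordinates to pixel coordinates.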
Triangle Setup
• In this stage geometry information becomes raster
information (screen space geometry is the input,
pixels are the output)
• Prior to rasterization, triangles that are backfacing
or located outside the viewing frustum are
rejected
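As an aside (standard rasterization math, not taken from the slides), a backfacing triangle can be detected from the sign of its screen-space signed area:

$2A = (x_1 - x_0)(y_2 - y_0) - (x_2 - x_0)(y_1 - y_0)$

where a negative $A$ (under the chosen winding convention) marks the triangle as backfacing, so it is rejected before any fragments are generated.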
Triangle Setup
• A fragment is generated if and only if its center
is inside the triangle
• Every fragment generated has its attributes
computed as the perspective-correct
interpolation of the three vertices that make up
the triangle
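The standard perspective-correct interpolation formula (general rasterization math, not from these slides): for per-vertex attribute values $a_i$, clip-space $w_i$, and the fragment center's barycentric weights $\lambda_i$,

$a = \dfrac{\lambda_0 a_0 / w_0 + \lambda_1 a_1 / w_1 + \lambda_2 a_2 / w_2}{\lambda_0 / w_0 + \lambda_1 / w_1 + \lambda_2 / w_2}$

which reduces to plain linear interpolation when all three $w_i$ are equal.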
Fragment Processing
• Each fragment provided by triangle setup is fed
into fragment processing as a set of attributes
(position, normal, texture coordinates, etc.), which
are used to compute the final color for this pixel
• The computations taking place here include
texture mapping and math operations
Memory Interface
• Fragments provided by the last step are written to
the framebuffer
• Before the final write occurs, some fragments are
rejected by the z-buffer (depth), stencil, and alpha tests
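A minimal C-style sketch of the z-buffer (depth) test (illustrative only: on a real GPU this is fixed-function hardware, and the names below are hypothetical):

typedef struct { int x, y; float z; unsigned color; } Fragment;

void depth_test(Fragment f, float *zbuffer, unsigned *framebuffer, int width) {
    int p = f.y * width + f.x;     // linear pixel index
    if (f.z < zbuffer[p]) {        // closer than the stored depth?
        zbuffer[p] = f.z;          // update the depth buffer
        framebuffer[p] = f.color;  // write the final color
    }                              // otherwise the fragment is rejected
}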
Memory Model of GPU (diagram in the original slides)
Basic Architecture of GPU (diagram in the original slides)
CUDA (Compute Unified Device Architecture)
• CUDA is a parallel computing platform and
programming model.
• Created by NVIDIA and implemented by the
GPUs that they produce.
CUDA
• CUDA gives developers access to the
virtual instruction set and memory of the
parallel computational elements in CUDA
GPUs.
• CUDA supports standard programming
languages, including C++, Python, and Fortran.
Programming Model
• Threads are organized into blocks.
• Blocks are organized into a grid.
• A multiprocessor executes one block at a
time.
• A warp is the set of threads executed in
parallel (in lockstep).
• There are 32 threads in a warp (see the index sketch below).
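A minimal sketch of how these indices combine inside a kernel (the kernel name, block size, and launch below are illustrative, not from the slides):

__global__ void scale(float *data, float s, int n) {
    // Global index: which block we are in, times the block width,
    // plus this thread's position within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the last block may be partially full
        data[i] *= s;
}

A launch such as scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n); creates enough 256-thread blocks (8 warps each) to cover n elements.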
Typical CUDA/GPU Program
1. CPU allocates storage on the GPU (cudaMalloc).
2. CPU copies input data from CPU to GPU
(cudaMemcpy).
3. CPU launches a kernel on the GPU to process the data
(kernel<<<blocks, threads per block>>>(parameters)).
4. CPU copies results back from GPU to CPU
(cudaMemcpy).
Simply squaring the elements of an array:
__global__ void square(float *d_out, float *d_in) {
    int idx = threadIdx.x;   // this thread's index within its block
    float f = d_in[idx];     // read one input element
    d_out[idx] = f * f;      // write back its square
}
threadIdx.x gives the current thread's index within its block.
GPU/CUDA programming
Main program
#include <stdio.h>

int main(int argc, char **argv) {
……………………
…………………….
float h_out[ARRAY_SIZE];
// declare GPU pointers
float *d_in;
float *d_out;
// allocate GPU memory
cudaMalloc((void **) &d_in, ARRAY_BYTES);
cudaMalloc((void **) &d_out, ARRAY_BYTES);
Main program (cont.)
// transfer the input array to the GPU
cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);
// launch the kernel: one block of ARRAY_SIZE threads
square<<<1, ARRAY_SIZE>>>(d_out, d_in);
// copy the result array back to the CPU
cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);
// print the resulting array
for (int i = 0; i < ARRAY_SIZE; i++) {
    printf("%f\n", h_out[i]);
}
// free GPU memory
cudaFree(d_in);
cudaFree(d_out);
return 0;
}
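Assuming the two listings above are saved together as square.cu (with the elided declarations of ARRAY_SIZE, ARRAY_BYTES, and the input array h_in filled in), the program can be compiled and run with NVIDIA's nvcc:

nvcc square.cu -o square
./square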
Programming Model (diagram in the original slides)
GPU vs CPU Code (diagram in the original slides)
Conclusion
• GPU computing is a good choice for fine-
grained, data-parallel programs with limited
communication
• GPU computing is not a good choice for coarse-
grained programs with a lot of communication
• The GPU has become a co-processor to the
CPU.
References
• 1. [IEEE] Jason Dale (a), Gordon Cain (a), Brad Zell (b),
"Accelerating image processing capability using graphics
processors". (a) Vision4ce Ltd, Crowthorne Enterprise Center,
Crowthorne, Berkshire, UK, RG45 6AW; (b) Vision4ce LLC,
Severna Park, USA, MD2114
• 2. Udacity CS344, Intro to Parallel Programming with GPUs
• 3. Wikipedia
• 4. NVIDIA docs
