HPC MPI LAB 1 : Vector Addition

Name: Mridul Harish


Roll No: CED18I034
Programming Environment: MPI
Problem: Vector Addition
Date: 21st October

Hardware Configuration:
CPU NAME : Intel Core i5-8250U @ 8x 3.4GHz
Number of Sockets : 1
Cores per Socket : 4
Threads per core : 2
L1 Cache size : 32KB
L2 Cache size : 256KB
L3 Cache size(Shared): 6MB
RAM : 8 GB

Collective Communication MPI Code

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char *argv[])
{
    int myid, np;
    double startwtime, endwtime, totalTime;
    int namelen;
    int n = 100000;
    /* Extra padding (up to np - 1 elements) so the vectors can be padded
       to a multiple of np when n is not evenly divisible */
    double a[n + 200], b[n + 200], c[n + 200];
    int i;
    int s, s0;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    startwtime = MPI_Wtime();
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelen);

    // Vector 1 and Vector 2 initialisation (only rank 0's copies are
    // actually used as the scatter source)
    for (i = 0; i < n; i++)
    {
        a[i] = i;
        b[i] = i;
    }

    if (myid == 0)
    {
        // Chunk size per process; pad the vectors with dummy elements when
        // n is not a multiple of np, since MPI_Scatter sends the same
        // count to every rank
        s = n / np;
        s0 = n % np;
        if (s0 != 0)
        {
            s = s + 1;
            for (i = 0; i < (s * np) - n; i++)
            {
                a[n + i] = 1;
                b[n + i] = 1;
            }
        }
    }
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&s, 1, MPI_INT, 0, MPI_COMM_WORLD);

    // Scatter equal chunks of a and b, add them locally, and gather the
    // partial sums back into c on rank 0
    double a_recv[s], b_recv[s], c_recv[s];
    MPI_Scatter(a, s, MPI_DOUBLE, a_recv, s, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(b, s, MPI_DOUBLE, b_recv, s, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    for (i = 0; i < s; i++)
        c_recv[i] = a_recv[i] + b_recv[i];
    MPI_Gather(c_recv, s, MPI_DOUBLE, c, s, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (myid == 0)
    {
        endwtime = MPI_Wtime();
        totalTime = endwtime - startwtime;
        printf("%f\n", totalTime);
    }

    MPI_Finalize();
    return 0;
}
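
The padding above is needed only because MPI_Scatter sends the same count to every rank. As an aside (not part of the original lab code), the same distribution can be done without padding using MPI_Scatterv and MPI_Gatherv; the following is a minimal sketch with the same n and initialisation as above:

#include <mpi.h>
#include <stdio.h>

/* Sketch: uneven distribution with MPI_Scatterv / MPI_Gatherv, so no
   padding of the vectors is required. Names mirror the code above. */
int main(int argc, char *argv[])
{
    int myid, np, n = 100000;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    static double a[100000], b[100000], c[100000];
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = i; }

    int counts[np], displs[np];
    for (int r = 0; r < np; r++)
    {
        counts[r] = n / np + (r < n % np ? 1 : 0);  /* first n%np ranks get one extra element */
        displs[r] = (r == 0) ? 0 : displs[r - 1] + counts[r - 1];
    }

    int mycount = counts[myid];
    double a_part[mycount], b_part[mycount], c_part[mycount];
    MPI_Scatterv(a, counts, displs, MPI_DOUBLE, a_part, mycount, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatterv(b, counts, displs, MPI_DOUBLE, b_part, mycount, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    for (int i = 0; i < mycount; i++)
        c_part[i] = a_part[i] + b_part[i];
    MPI_Gatherv(c_part, mycount, MPI_DOUBLE, c, counts, displs, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (myid == 0)
        printf("c[n-1] = %f\n", c[n - 1]);
    MPI_Finalize();
    return 0;
}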

Point-to-Point Communication MPI Code


#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define n 100000
double a[n], b[n];
double c[n] = {0};

int main(int argc, char *argv[])
{
    int myid, np, elements_per_process, n_elements_recieved;
    MPI_Status status;
    double startwtime, endwtime, totalTime;

    MPI_Init(&argc, &argv);
    startwtime = MPI_Wtime();
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    if (myid == 0)
    {
        // Rank 0 initialises the vectors and distributes the work
        for (int i = 0; i < n; i++)
        {
            a[i] = i + 1;
            b[i] = i + 1;
        }
        int idx, i;
        if (np == 1)
        {
            // No workers available: rank 0 adds the vectors itself
            for (i = 0; i < n; i++)
                c[i] = a[i] + b[i];
        }
        else
        {
            elements_per_process = n / (np - 1);
            // Send equal chunks of a and b to workers 1 .. np-2 ...
            for (i = 1; i < np - 1; i++)
            {
                idx = (i - 1) * elements_per_process;
                MPI_Send(&elements_per_process, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
                MPI_Send(&a[idx], elements_per_process, MPI_DOUBLE, i, 0, MPI_COMM_WORLD);
                MPI_Send(&b[idx], elements_per_process, MPI_DOUBLE, i, 0, MPI_COMM_WORLD);
            }
            // ... and whatever is left over to the last worker (rank np-1)
            idx = (i - 1) * elements_per_process;
            int elements_left = n - idx;
            MPI_Send(&elements_left, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
            MPI_Send(&a[idx], elements_left, MPI_DOUBLE, i, 0, MPI_COMM_WORLD);
            MPI_Send(&b[idx], elements_left, MPI_DOUBLE, i, 0, MPI_COMM_WORLD);

            // Collect the partial results back into c
            for (i = 1; i < np; i++)
            {
                idx = (i - 1) * elements_per_process;
                MPI_Recv(&n_elements_recieved, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
                MPI_Recv(&c[idx], n_elements_recieved, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
            }
        }
        endwtime = MPI_Wtime();
        totalTime = endwtime - startwtime;
        printf("%f\n", totalTime);
    }
    else
    {
        // Worker: receive the chunk size and data, add, send the result back
        double a_recv[n], b_recv[n], c_recv[n];
        MPI_Recv(&n_elements_recieved, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(a_recv, n_elements_recieved, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(b_recv, n_elements_recieved, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        for (int i = 0; i < n_elements_recieved; i++)
            c_recv[i] = a_recv[i] + b_recv[i];
        MPI_Send(&n_elements_recieved, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        MPI_Send(c_recv, n_elements_recieved, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
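
Since the benchmark only prints the elapsed time, a quick correctness check on rank 0 can catch datatype or indexing mistakes like the ones fixed above. The following is a minimal sketch (not part of the original lab code), meant to be placed in the myid == 0 branch after all partial results have arrived in c:

/* Correctness check on rank 0, after the receive loop above.
   With a[i] = b[i] = i + 1, every c[i] should equal 2 * (i + 1). */
int errors = 0;
for (int i = 0; i < n; i++)
    if (c[i] != 2.0 * (i + 1))
        errors++;
printf("verification: %d mismatching elements\n", errors);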

Compilation and Execution

Compile using mpicc, which links the program against the MPI library:

mpicc vector_add.c -o vector_add

For execution on the cluster (here with 25 processes):

mpirun -n 25 -f machinefile ./vector_add
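
The -f machinefile option points mpirun (MPICH/Hydra) at a host file listing the machines of the cluster, one host per line with an optional process count. A minimal example with placeholder host names (the actual VM names are not listed in this report):

node1:4
node2:4
node3:4
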
Observations

Speedup is computed using the following formula:

S(n) = T(1) / T(n), where
S(n) = speedup for process count n
T(1) = execution time with 1 process (serial run)
T(n) = execution time with n processes (parallel run)

The parallelization fraction is obtained from Amdahl's law:

S(n) = 1 / ((1 - p) + p/n), where
n = number of processes
p = parallelization fraction

Solving for p gives p = (1 - 1/S(n)) / (1 - 1/n), which is how the parallel fraction column in the tables below is computed.
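
As a sanity check, the short C sketch below computes the speedup and parallelization fraction for P = 2 from the measured times in the collective-communication table below, using the rearranged Amdahl formula above:

#include <stdio.h>

/* Minimal sketch: derive speedup and Amdahl parallelization fraction
   from two measured timings (values from the collective table below). */
int main(void)
{
    double T1 = 0.00576;    /* time with 1 process  */
    double Tn = 0.002708;   /* time with n processes */
    int    nn = 2;

    double S = T1 / Tn;                               /* speedup            */
    double p = (1.0 - 1.0 / S) / (1.0 - 1.0 / nn);    /* solve Amdahl for p */
    printf("S(%d) = %f, p = %f\n", nn, S, p);         /* ~2.127 and ~1.06   */
    return 0;
}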

Collective Communication:

No. of Processes (P)   Execution Time T(P) (s)   Speedup S(P)      Parallel Fraction f(P)
1                      0.00576                   -                 -
2                      0.002708                  2.127031019       1.059722222
4                      0.002645                  2.177693762       0.7210648148
6                      0.089785                  0.064153255       -17.50520833
8                      0.13245                   0.04348810872     -25.13690476
12                     0.338068                  0.01703799236     -62.93712121
16                     0.428783                  0.01343336839     -78.33759259
20                     1.112917                  0.005175588117    -202.3313231
24                     1.935339                  0.00297622277     -349.561413
32                     6.079448                  0.0009474544399   -1088.474552
48                     6.102137                  0.0009439316095   -1080.917908
64                     3.791373                  0.001519238545    -667.6566138
128                    14.708826                 0.0003916016139   -2572.714961
Point-to-Point Communication:

No. of Processes (P)   Execution Time T(P) (s)   Speedup S(P)      Parallel Fraction f(P)
1                      0.001028                  -                 -
2                      0.002171                  0.4735145094      -2.223735409
4                      0.00299                   0.343812709       -2.544747082
6                      0.032904                  0.03124240214     -37.20933852
8                      0.170045                  0.006045458555    -187.9010561
12                     0.434316                  0.0023669402      -459.8033251
16                     0.061818                  0.01662946067     -63.07652399
20                     0.354108                  0.00290306912     -361.5400369
24                     0.111944                  0.00918316301     -112.5860261
32                     0.333079                  0.003086354889    -333.4263838
48                     2.674903                  0.0003843130013   -2656.387118
64                     4.826712                  0.000212981425    -4768.756964
128                    23.345798                 0.0000440336201   -22887.73063
Inference

The performance of the program worsens under cluster execution because the same physical resources are shared among the different virtual machines.
At any instant during the program's runtime there are four operating systems running (one Windows 11 host and three Ubuntu 20.10 virtual machines), all drawing on the laptop's common resources. This creates a performance bottleneck.
Along with the operating systems, multiple background and foreground processes are running and consuming system resources.
In addition, the complexity of the algorithm is low, O(n), so there is little computation to offset the communication cost.
The communication overhead outgrows the computation cost as the number of processes increases, and at 128 processes this causes a drastic reduction in performance.
At 256 processes the laptop stopped responding because the overhead of running that many processes was too high.
Collective communication offers better performance than point-to-point communication.
