FDSA Lab Manual
Name :
Reg No :
Branch :
Year :
Semester :
INDEX

S.NO  NAME OF EXPERIMENTS                                              DATE  PG.NO  SIGN

1.  Working with NumPy Arrays
    a.) Creating a NumPy ndarray object
    b.) Accessing an array element using array indexing
    c.) Accessing a subarray using the slicing technique
2.  Working with Pandas DataFrames
    a.) Create a DataFrame with the given dictionary and apply aggregations on the Age column
    b.) Create a DataFrame with the given table (create a dictionary using the data) and perform the groupby operations
    c.) Perform concatenation operations along an axis
3.  Basic Plots Using Matplotlib
    a.) Line plot
    b.) Multi-line plot
    c.) Bar chart
    d.) Histogram chart
    e.) Pie chart
    f.) Subplot
4.  a.) Frequency distribution
    b.) Describing data with averages
    c.) Measures of variability
5.  a.) Normal curves
    b.) Correlation coefficient and scatter plots
6.  a.) Implementation of one sample z-test
    b.) Implementation of two sample z-test
    c.) Implementation of z-test using Titanic case study
7.  a.) Implementation of one sample t-test
    b.) Implementation of two sample t-test
8.  a.) Implementation of variance analysis (ANOVA)
9.  a.) Demonstration of linear regression
10. a.) Demonstration of logistic regression
11. a.) Implementation of time series analysis
1a.) Display the dimensions, shape and size of arrays
Algorithm:
1. Import the NumPy library.
2. Create a 0-D array with the value 456.
3. Create a 1-D array containing the values 100,200,300.
4. Create a 2-D array containing two arrays with the values 11,22,33 and 44,55,66.
5. Create a 3-D array with two 2-D arrays, both containing two arrays with the values 11,22,33 and 44,55,66.
6. Display the dimension, shape and size of each array.
7. Stop.
PROGRAM:
import numpy as np
a=np.array(456)
b=np.array([100,200,300])
c=np.array([[11,22,33],[44,55,66]])
d=np.array([[[11,22,33],[44,55,66]],[[11,22,33],[44,55,66]]])
print('the array a is',a)
print('the array b is',b)
print('the array c is',c)
print('the array d is',d)
print('the dimension of an array a :' ,a.ndim)
print('the dimension of an array b :',b.ndim)
print('the dimension of an array c :' ,c.ndim)
print('the dimension of an array d :',d.ndim)
print('the shape of an array a :' ,a.shape)
print('the shape of an array b :' ,b.shape)
print('the shape of an array c :' ,c.shape)
print('the shape of an array d :', d.shape)
print('the size of an array a :', a.size)
print('the size of an array b :',b.size)
print('the size of an array c :',c.size)
print('the size of an array d :',d.size)
OUTPUT:
the array a is 456
the array b is [100 200 300]
the array c is [[11 22 33]
 [44 55 66]]
the array d is [[[11 22 33]
  [44 55 66]]

 [[11 22 33]
  [44 55 66]]]
the dimension of an array a : 0
the dimension of an array b : 1
the dimension of an array c : 2
the dimension of an array d : 3
the shape of an array a : ()
the shape of an array b : (3,)
the shape of an array c : (2, 3)
the shape of an array d : (2, 2, 3)
the size of an array a : 1
the size of an array b : 3
the size of an array c : 6
the size of an array d : 12
1b.) Access an array element using array indexing
Algorithm:
1. Import the NumPy library.
2. Create a 1-D array with the values 23,12,53,84.
3. Get the first and second elements from the array and display them.
4. Get the third and fourth elements from the above array and display them.
5. Create the 2-D array ([[11,22,33,44,55],[36,67,88,99,101]]).
6. Access the 2nd element on the 1st dim and display it.
7. Access the 5th element on the 2nd dim and display it.
8. Create an array ([[[19, 52, 73], [24, 65, 46]], [[17, 28, 89], [50, 41, 92]]]).
9. Access the third element of the second array of the first array.
10. Create the 2-D array ([[18,23,32,41,55],[63,74,86,98,30]]).
11. Print the last element from the 2nd dim using negative indexing.
12. Stop.
1b.) Access an array element using array indexing
PROGRAM:
import numpy as np
#one dimensional array
arr=np.array([23,12,53,84])
#get first element from the above array
print("the first element from the array:",arr[0])
#get the second element from the above array
print("the second element from the array:",arr[1])
#get the third and fourth element from the above array
print("the third element from the array:",arr[2])
print("the fourth element from the array:",arr[3])
#2D ARRAYS
arr1 = np.array([[11,22,33,44,55], [36,67,88,99,101]])
#access the 2nd element on 1st dim
print("the second element on 1st dim:",arr1[0][1])
#access the 5th element on 2nd dim
print("the fifth element on 2nd dim:",arr1[1][4])
#3D ARRAYS
arr2 = np.array([[[19, 52, 73], [24, 65, 46]], [[17,28, 89], [50, 41, 92]]])
#Access the third element of the second array of the first array
print("the third element of the second array of the first array:",arr2[0][1][2])
#Negative Indexing
arr3 = np.array([[18,23,32,41,55], [63,74,86,98,30]])
#Print the last element from the 2nd dim using negative indexing
print("the last element from the 2nd dim:",arr3[1][-1])
OUTPUT:
the first element from the array: 23
the second element from the array: 12
the third element from the array: 53
the fourth element from the array: 84
the second element on 1st dim: 22
the fifth element on 2nd dim: 101
the third element of the second array of the first array: 46
the last element from the 2nd dim: 30
1c.) Accessing the subarray using slicing technique
Algorithm:
1. Create the array ([13, 22, 37, 49, 56, 64, 72]).
2. Slice elements from index 4 to the end of the array.
3. Slice elements from the beginning to index 4 (not included).
4. Slice from index 3 from the end to index 1 from the end of the array ([12, 23, 53, 74, 15, 16, 87]).
5. Return every other element from index 1 to index 5 from the array ([14, 22, 36, 14, 15, 76, 97]).
6. Return every other element from the entire array ([14, 22, 13, 64, 65, 56, 47]).
7. Create the subarray [27 28 19] from the array ([[31, 22, 13, 54, 75], [16, 27, 28, 19, 70]]) and display it.
8. Stop.
1c.) Accessing the subarray using slicing technique
PROGRAM:
import numpy as np
arr = np.array([13, 22, 37, 49, 56, 64, 72])
#Slice elements from index 4 to the end of the array
print("the elements from index 4 to the end:",arr[4:])
#Slice elements from the beginning to index 4 (not included)
print("the elements from the beginning to index 3:",arr[0:4])
#negative slicing
#slice from index 3 from the end to index 1 from the end
arr1=np.array([12,23,53,74,15,16,87])
print("the element from index 3 from the end to index 1 from the end:",arr1[-3:-1])
#return every other element from index 1 to index 5 using step 2
arr2=np.array([14,22,36,14,15,76,97])
print("the every other element from index 1 to index 5:",arr2[1:5:2])
#return every other element from the entire array using step 2
arr3=np.array([14,22,13,64,65,56,47])
print("the every other element from the entire array:",arr3[::2])
#SLICING 2D ARRAYS
arr4 = np.array([[31, 22, 13, 54, 75], [16, 27, 28, 19, 70]])
subarray=(arr4[1][1:4])
print("the given array is ",arr4)
print("the subarray from given array :",subarray)
OUTPUT:
the elements from index 4 to the end: [56 64 72]
the elements from the beginning to index 3: [13 22 37 49]
the element from index 3 from the end to index 1 from the end: [15 16]
the every other element from index 1 to index 5: [22 14]
the every other element from the entire array: [14 13 65 47]
the given array is [[31 22 13 54 75]
[16 27 28 19 70]]
the subarray from given array : [27 28 19]
2a.) Create a DataFrame with the below given dictionary, applying multiple aggregations at once – get the sum, mean and standard deviation of the Age column
PROGRAM:
import pandas as pd
#create dataframe
Data = {'Name':['Sankar','Julius','Sriram','Rithu','Mitelesh','Juliet'],'Age':[20,19,20,20,19,20],'Dept':['AI','CSE','AI','CSE','AI','CSE']}
Index='20AD01','20CS02','20AD03','20CS04','20AD04','20AD10'
df=pd.DataFrame(Data,Index)
print(df)
#get sum, mean, standard deviations in age column
sum1=df['Age'].aggregate('sum')
print("sum of age column :",sum1)
mean1=df['Age'].mean()
print("mean of age column:",mean1)
standard_deviations=df['Age'].std()
print("standard deviation of age column:",standard_deviations)
OUTPUT:
Name Age Dept
20AD01 Sankar 20 AI
20CS02 Julius 19 CSE
20AD03 Sriram 20 AI
20CS04 Rithu 20 CSE
20AD04 Mitelesh 19 AI
20AD10 Juliet 20 CSE
sum of age column : 118
mean of age column: 19.666666666666668
standard deviation of age column: 0.5163977794943222
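Pandas can also compute all three statistics in a single call, since aggregate() accepts a list of function names; a one-line sketch on the df built above:
#apply multiple aggregations at once on the Age column
print(df['Age'].aggregate(['sum','mean','std']))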
2b.) Create a DataFrame with the below given table (create a dictionary using the data) and perform the groupby operations
PROGRAM:
import pandas as pd
#create a dataframe
data={'Outlook':['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny ','Rainy ','Sunny','Overcast','Overcast ','Rainy '],
'Temparature':[85,80,83,70,68,65,64,72,69,75,75,72,81,71],
'humidity':[85,90,86,96,80,70,65,95,70,80,70,90,75,91],
'windy':['false','true','false','false','false','true','true','false','false','false','true','true','false','true'],
'play':['no','no','yes','yes','yes','no','yes','no','yes','yes','yes','yes','yes','no']}
df=pd.DataFrame(data)
print(df)
#note: some Outlook values carry a trailing space, which is why e.g. 'Sunny' and 'Sunny ' group separately below
#group by a single column and print each group (these lines were missing from the listing)
for group in df.groupby('Outlook'):
    print(group)
#group by a combination of columns and print each group
for group in df.groupby(['Outlook','play']):
    print(group)
OUTPUT:
Outlook Temparature humidity windy play
0 Sunny 85 85 false no
1 Sunny 80 90 true no
2 Overcast 83 86 false yes
3 Rainy 70 96 false yes
4 Rainy 68 80 false yes
5 Rainy 65 70 true no
6 Overcast 64 65 true yes
7 Sunny 72 95 false no
8 Sunny 69 70 false yes
9 Rainy 75 80 false yes
10 Sunny 75 70 true yes
11 Overcast 72 90 true yes
12 Overcast 81 75 false yes
13 Rainy 71 91 true no
('Overcast', Outlook Temparature humidity windy play
2 Overcast 83 86 false yes
6 Overcast 64 65 true yes
11 Overcast 72 90 true yes)
('Overcast ', Outlook Temparature humidity windy play
12 Overcast 81 75 false yes)
('Rainy', Outlook Temparature humidity windy play
3 Rainy 70 96 false yes
4 Rainy 68 80 false yes
5 Rainy 65 70 true no)
('Rainy ', Outlook Temparature humidity windy play
9 Rainy 75 80 false yes
13 Rainy 71 91 true no)
('Sunny', Outlook Temparature humidity windy play
0 Sunny 85 85 false no
1 Sunny 80 90 true no
7 Sunny 72 95 false no
10 Sunny 75 70 true yes)
('Sunny ', Outlook Temparature humidity windy play
8 Sunny 69 70 false yes)
(('Overcast', 'yes'), Outlook Temparature humidity windy play
2 Overcast 83 86 false yes
6 Overcast 64 65 true yes
11 Overcast 72 90 true yes)
(('Overcast ', 'yes'), Outlook Temparature humidity windy play
12 Overcast 81 75 false yes)
(('Rainy', 'no'), Outlook Temparature humidity windy play
5 Rainy 65 70 true no)
(('Rainy', 'yes'), Outlook Temparature humidity windy play
3 Rainy 70 96 false yes
4 Rainy 68 80 false yes)
(('Rainy ', 'no'), Outlook Temparature humidity windy play
13 Rainy 71 91 true no)
(('Rainy ', 'yes'), Outlook Temparature humidity windy play
9 Rainy 75 80 false yes)
(('Sunny', 'no'), Outlook Temparature humidity windy play
0 Sunny 85 85 false no
1 Sunny 80 90 true no
7 Sunny 72 95 false no)
(('Sunny', 'yes'), Outlook Temparature humidity windy play
10 Sunny 75 70 true yes)
(('Sunny ', 'yes'), Outlook Temparature humidity windy play
8 Sunny 69 70 false yes)
2c.) Perform concatenation operations along an axis
PROGRAM:
#Concatenation
#Create two DataFrame
import pandas as pd
One= pd.DataFrame({ 'Name': ['Allen', 'Amutha', 'Ashwad', 'Avinash', 'Arun'],
'subject_id':['OOPS','DM','Physics','Statistics','FDS'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({ 'Name': ['Bathri', 'Barath', 'Banu', 'Balaji', 'Betty'],
'subject_id':[ 'OOPS','DM','Physics','Statistics','FDS'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
df=[One,two]
#Perform concatenation operations along an axis
#Set the ignore_index to True in concatenation and print the result.
a=pd.concat(df,ignore_index=True)
print(a)
OUTPUT:
Name subject_id Marks_scored
0 Allen OOPS 98
1 Amutha DM 90
2 Ashwad Physics 87
3 Avinash Statistics 69
4 Arun FDS 78
5 Bathri OOPS 89
6 Barath DM 80
7 Banu Physics 79
8 Balaji Statistics 97
9 Betty FDS 88
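pd.concat joins along axis=0 (rows) by default, as above. Passing axis=1 concatenates along the column axis instead, placing the two DataFrames side by side aligned on their index labels (1 to 5 here); a short sketch with the same One and two frames:
#concatenate along the column axis; each row then holds both students' records
b=pd.concat([One,two],axis=1)
print(b)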
3a.) Load the dataset, read the total sales of all months and show them using a line plot
Algorithm:
1. Import the pandas and matplotlib libraries.
2. Load the dataset using pd.read_csv().
3. Extract the year/month and total_sales columns.
4. Give a name to x-axis and y-axis using .xlabel() and .ylabel() functions.
5. Plot the sales with the desired style properties and add a legend.
6. Display the plot using plt.show().
7. Stop.
3a.) Load the dataset, read the total sales of all months and show them using a line plot
PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv("C:/mat/plot.csv") #load the dataset
print(df)
#style properties
plt.plot(year,ts,color='green',label='sales',linestyle='dotted',line
width=5,marker='o',markerfacecolor='red')
#labelling
plt.xlabel('month number')
plt.ylabel('sales per month')
plt.legend(loc='upper left')
#display
plt.show()
OUTPUT:
3b.) Load the dataset, read all product sales data and show it using a multiline plot
Algorithm:
1. Import the pandas and matplotlib libraries.
2. Load the dataset using pd.read_csv().
3. Extract the year/month column and each product column (fuel, veg, fruits, snacks, rice).
4. Give a name to x-axis and y-axis using .xlabel() and .ylabel() functions.
5. Plot one line per product with the desired style properties and add a legend.
6. Display the plot using plt.show().
7. Stop.
3b.) Load the dataset, read all product sales data and show it using a multiline plot
PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv('C:/mat/plot.csv')
print(df)
year=df['year/month'].tolist()
fuel=df['fuel'].tolist()
veg=df['veg'].tolist()
fruits=df['fruits'].tolist()
snacks=df['snacks'].tolist()
rice=df['rice'].tolist()
plt.plot(year,fuel,label='fuel',linewidth=5,color='green',marker='o',markerfacecolor='red')
plt.plot(year,veg,label='veg',linewidth=5,color='blue',marker='o',markerfacecolor='red')
plt.plot(year,fruits,label='fruits',linewidth=5,color='red',marker='o',markerfacecolor='black')
plt.plot(year,snacks,linewidth=5,label='snacks',color='orange',marker='o',markerfacecolor='blue')
plt.plot(year,rice,linewidth=5,color='black',label='rice',marker='o',markerfacecolor='red')
plt.xlabel("year/month")
plt.ylabel("sales unit in number")
plt.legend(loc='upper left')
plt.show()
OUTPUT:
3c.) Load the dataset, read veg and snacks sales data and show it using a bar chart
Algorithm:
1. Import the pandas, matplotlib and numpy libraries.
2. Load the dataset using pd.read_csv().
3. Extract the year/month, veg and snacks columns and compute the bar positions with np.arange() and an offset.
4. Give a name to x-axis and y-axis using .xlabel() and .ylabel() functions.
5. Draw the two bar series with plt.bar(), set the x ticks and add a legend.
6. Display the plot using plt.show().
7. Stop.
3c.) Load the dataset, read veg and snacks sales data and show it using a bar chart
PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#load the dataset
df=pd.read_csv('C:/mat/plot.csv')
print(df)
year=df['year/month']
v=df['veg']
s=df['snacks']
bar1=np.arange(len(year))
bar2=0.4+bar1
plt.bar(bar1,v,width=0.4,label='veg')
plt.bar(bar2,s,width=0.4,label='snacks')
plt.xticks(bar1,year)
plt.legend(loc='upper left')
plt.xlabel("year & month")
plt.ylabel("sales of veg and snacks")
plt.show()
OUTPUT:
3d.) Load the dataset, read the sales data of rice for all months and show it using a histogram chart
Algorithm:
1. Import the numpy, pandas and matplotlib libraries.
2. Load the dataset using pd.read_csv().
3. Extract and sort the rice column, and define the bin edges.
4. Give a name to x-axis and y-axis using .xlabel() and .ylabel() functions.
5. Draw the histogram with plt.hist() and set the x ticks.
6. Display the plot using plt.show().
7. Stop.
3d.) Load the dataset, read the sales data of rice for all months and show it using a histogram chart
PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv('C:/mat/plot.csv')
year=df['year/month'].tolist()
rice=df['rice'].tolist()
rice.sort()
a=[6000,6500,7000,7500,8000,8500,9000]
plt.hist(rice,a,bottom=4,ec='black')
plt.xticks(rice)
#axis labels as per the algorithm (label text assumed)
plt.xlabel('sales of rice')
plt.ylabel('frequency')
plt.show()
OUTPUT:
3e.) Load the dataset, read the total 2013 sales for each product and show them using a pie chart
Algorithm:
1. Import the pandas and matplotlib libraries.
2. Load the dataset using pd.read_csv().
3. Sum each product column to get its total sales.
4. Name the slices using the labels parameter and show each share with autopct.
5. Draw the pie with plt.pie() and add a legend.
6. Display the plot using plt.show().
7. Stop.
3e.) Load the dataset, read the total 2013 sales for each product and show them using a pie chart
PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv('C:/mat/plot.csv')
year=df['year/month']
labels=['fuel','veg','fruits','snacks','rice']
#adding the data of each columns
a=[df['fuel'].sum(),df['veg'].sum(),df['fruits'].sum(),df['snacks'].sum(),df['rice'].sum()]
plt.axis('equal')
#autopct for visualize the percentage of each data
plt.pie(a,labels=labels,autopct='%1.1f%%')
plt.legend(loc='upper left')
plt.show()
OUTPUT:
3f.) Load the dataset, read fuel and fruits data for all months and display them using subplots
Algorithm:
1. Import the pandas and matplotlib libraries.
2. Load the dataset using pd.read_csv().
3. Extract the year/month, fuel and fruits columns.
4. Create two axes with plt.subplots(2) and give each subplot a title.
5. Plot fuel on the first axes and fruits on the second with the desired style properties.
6. Display the plot using plt.show().
7. Stop.
3f.) Load the dataset, read fuel and fruits data for all months and display them using subplots
PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv('C:/mat/plot.csv')
year=df['year/month']
fuel=df['fuel']
fruits=df['fruits']
fig,ax=plt.subplots(2)
ax[0].plot(year,fuel,color='red',marker='o',label='fuel',linewidth=3)
ax[0].set_title('fuel')
ax[1].plot(year,fruits,color='green',marker='o',label='fruits',linewidth=3)
ax[1].set_title('fruits')
plt.xticks(year)
plt.show()
OUTPUT
THE CSV FILE :
year/month fuel veg fruits snacks rice total_sales total_profit
0 2013-10 160195 94189 107256 93654 7518 462812 4628120
1 2013-09 167767 95059 102224 99034 6925 471009 4710090
2 2013-08 147264 119002 113335 109712 8120 497433 4974330
3 2013-07 146792 117459 124317 113254 7910 509732 5097320
4 2013-06 141535 125406 117393 92277 7592 484203 4842030
5 2013-05 167146 142285 124644 96207 8516 538798 5387980
6 2013-04 141325 114726 103087 90990 7743 457871 4578710
4a.) Frequency distribution
Aim: To sort observations into classes and show their frequency (f) of occurrence in each class.
Algorithm
1. Download any of the freely available dataset and import the dataset using pandas library
2. Choose a Quantitative column. Find the range of the column
3. Find the frequency distribution for the considered column with the given range
4. Find the relative frequency distribution for the considered column with the given range and add it alongside the previous table
5. Find the cumulative frequency distribution for the considered column with the given range and add it alongside the previous table
6. Find the cumulative frequency distribution percentage for the considered column with the given range and add it alongside the previous table
7. Stop
4a.) Frequency distribution
PROGRAM:
import pandas as pd
df=pd.read_csv('C:/mat/fre.csv')
print(df)
#TO FIND THE RANGE OF A QUANTITATIVE COLUMN
max_value=df['AGE'].max()
min_value=df['AGE'].min()
a=(max_value)-(min_value) #range
print('the range of the column:',a)
#frequency distribution of the AGE column
df1=pd.DataFrame({}) #empty dataframe to hold the distribution table
df1['frequency_distribution_AGE']=df['AGE'].value_counts()
#relative frequency distribution, added alongside (the lines below were missing from the listing)
df1['relative_frequency_AGE']=df['AGE'].value_counts(normalize=True)
print(df1)
#cumulative totals of the AGE column and their percentage, as shown in the output
df2=pd.DataFrame({})
df2['cumulative_frequency_AGE']=df['AGE'].cumsum()
df2['cumulative_percentage_AGE']=df['AGE'].cumsum()/df['AGE'].sum()*100
print(df2)
OUTPUT:
NAME DEPT AGE MARK [CSV FILE]
0 Sankar AI 22 98
1 Julius IT 25 99
2 Sriram ECE 24 97
3 Rithu CSE 22 96
the range of the column: 3
frequency_distribution_AGE relative_frequency_AGE
22 2 0.50
25 1 0.25
24 1 0.25
cumulative_frequency_AGE cumulative_percentage_AGE
0 22 23.655914
1 47 50.537634
2 71 76.344086
3 93 100.000000
4b.) Describing with averages
Aim: To apply the measures of central tendency to describe the middle or typical value for a
distribution.
Algorithm:
1. Create an array using Numpy and find the mean value of the
array ([56,78,98,76,54,56,43,32,34,51])
2. Create a dictionary of series using the Weather dataset; apply mean and median on the Temperature and Windy columns, and apply mode on the Windy column.
3. Create an array using NumPy and find the median and mode of the array ([56,78,98,76,54,56,43,32,34,51]).
4. Install the scipy package, import the stats library and use stats.mode.
5. Create a dictionary of series using the Weather dataset and calculate the skewness on the Temperature column.
4b.) Describing with averages
PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statistics
from sklearn.preprocessing import LabelEncoder
from scipy.stats import mode,skew
a=np.array([56,78,98,76,54,56,43,32,34,51])
print('the mean of an array:',a.mean())
print('the median of the array:',statistics.median(a))
print('the mode of the array:',mode(a,keepdims=True))
#CREATE A DICTIONARY OF SERIES
dict_ser=pd.Series({'outlook':['sunny','sunny','overcast','rainy','rainy','rainy','overcast','sunny','sunny'],
'temperature':[85,80,83,70,68,65,64,72,69],
'humidity':[85,90,86,96,80,70,65,95,70],
'windy':['False','True','False','False','False','True','True','False','False']})
print("the data series:",dict_ser)
#to find the mean, convert True to 1 and False to 0
b=LabelEncoder().fit_transform(dict_ser['windy'])
dict_ser['windy']=b
print(dict_ser)
print('the mean of temperature:',statistics.mean(dict_ser['temperature']))
print('the mean of windy',dict_ser['windy'].mean())
print('the median of temperature:',statistics.median(dict_ser['temperature']))
print('the median of windy:',statistics.median(dict_ser['windy']))
print('the mode of windy:',mode(dict_ser['windy'],keepdims=True))
#skewness on the temperature column (algorithm step 5; this line was missing from the listing)
print('the skewness on temperature column:',skew(dict_ser['temperature']))
OUTPUT:
the mean of an array: 57.8
the median of the array: 55.0
the mode of the array: ModeResult(mode=array([56]),
count=array([2]))
the data series: outlook [sunny, sunny, overcast,
rainy, rainy, rainy, ...
temperature [85, 80, 83, 70, 68, 65, 64, 72, 69]
humidity [85, 90, 86, 96, 80, 70, 65, 95, 70]
windy [False, True, False, False, False, True, True,...
dtype: object
outlook [sunny, sunny, overcast, rainy, rainy, rainy, ...
temperature [85, 80, 83, 70, 68, 65, 64, 72, 69]
humidity [85, 90, 86, 96, 80, 70, 65, 95, 70]
windy [0, 1, 0, 0, 0, 1, 1, 0, 0]
dtype: object
the mean of temperature: 72.88888888888889
the mean of windy 0.3333333333333333
the median of temperature: 70
the median of windy: 0
the mode of windy: ModeResult(mode=array([0],
dtype=int64), count=array([6]))
the skewness on temperature column:
0.7071067811865479
4c.) Measures of variability
Aim: To measure the amount by which scores are dispersed or scattered in a distribution.
Algorithm:
1. Create a DataFrame using pandas for the given list [3319, 3654, 3881, 6335, 840, 4759, 5130, 863, 8070, 8830]
2. Find the range, variance and standard deviation of the column
3. Find the distance of each data value from the mean in standard-deviation units
4. Sort the data, find the interquartile range, and summarize the column with describe()
PROGRAM:
import pandas as pd
import numpy as np
import statistics
from scipy.stats import iqr
#create a dataframe
a=pd.DataFrame({'data':[3319,3654,3881,6335,840,4759,5130,863,8070,8830]})
print(a)
#RANGE
max_value=a['data'].max()
min_value=a['data'].min()
b=max_value-min_value
print('the range of the column:',b)
#VARIANCE
c=statistics.variance(a['data'])
#STANDARD DEVIATION
d=statistics.stdev(a['data'])
M=statistics.mean(a['data'])
#distance of each value from the mean, measured in standard deviations (the original loop computed nothing)
distance=[(x-M)/d for x in a['data']]
#IQR
q=iqr(a['data'])
r=a['data'].tolist()
r.sort()
print('the sorted list :',r)
j=a.describe()
print(j)
OUTPUT:
data
0 3319
1 3654
2 3881
3 6335
4 840
5 4759
6 5130
7 863
8 8070
9 8830
the sorted list : [840, 863, 3319, 3654, 3881, 4759, 5130, 6335, 8070,
8830]
count 10.000000
mean 4568.100000
std 2674.896403
min 840.000000
25% 3402.750000
50% 4320.000000
75% 6033.750000
max 8830.000000
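Note that the interquartile range can be read directly off the describe() output: 75% - 25% = 6033.75 - 3402.75 = 2631.0. The iqr function imported in the program returns the same value, assuming its default linear interpolation (which matches describe()):
#interquartile range of the data column
print('the interquartile range:',iqr(a['data'])) #2631.0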
5a.) Normal curves
Aim: To visualize the data by plotting the probability distribution of each value in the data
Algorithm:
1. Import the matplotlib, scipy and numpy libraries
2. Set min and max values for generating random numbers within the specified range
3. Compute the mean and standard deviation of the generated numbers
4. Plot the normal curve using scipy's norm.pdf and display it
5. Stop
PROGRAM:
import random
import statistics
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
#minimal completion of the truncated listing; min and max of 1 and 10 are assumed from the sample output
data=random.sample(range(1,11),10)
print(data)
print('the mean of the random number:',statistics.mean(data))
print('the standard deviation of the random number:',statistics.stdev(data))
#plot the normal curve for the generated numbers
x=np.sort(data)
plt.plot(x,norm.pdf(x,statistics.mean(data),statistics.stdev(data)))
plt.show()
OUTPUT:
[NOTE: THE OUTPUT CHANGES FOR EACH EXECUTION]
[3, 7, 10, 6, 8, 2, 5, 1, 4, 9]
the mean of the random number: 5.5
the standard deviation of the random number:3.0276503540974917
5b.) Correlation coefficient
Aim: To describe the relationship between a pair of variables and plot them to visualize
Algorithm:
1. Import the data file using pandas (pd.read_csv(file path)) and display the first 4 rows of the file to ensure it loaded
2. Rename columns for better understanding
3. Find the correlation between 'Cement' (Column name C3) and 'Compressive
Strength’ (Column name Strength)
4. Plot the correlation between two variables 'Cement' (Column name C3) and
'Compressive Strength’ (Column name Strength) using seaborn
5. Plot the correlation matrix comprising all the variables using heatmap
5b.) Correlation coefficient
PROGRAM:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
df=pd.read_csv('C:/fds/Concrete_Data.csv')
print(df.head(4)) #to display the first 4 rows
print(df.columns)
df.rename(columns={'Cement':'C3','Concrete compressive strength':'Strength'},inplace=True)
print(df.head(4))
a=df['C3'].corr(df['Strength']) #pearson correlation
print('THE CORRELATION BETWEEN CEMENT AND STRENGTH:',a)
#plot the correlation between the two variables using seaborn (algorithm step 4)
sns.regplot(x='C3',y='Strength',data=df)
plt.show()
#pairwise view of the two columns (reconstructed to match the output below)
df2=df[['C3','Strength']].rename(columns={'C3':'c3','Strength':'strength'})
print(df2.head(10))
print(df2.corr())
#correlation matrix comprising all the variables, shown as a heatmap
corr_matrix1=df.corr()
sns.heatmap(corr_matrix1)
plt.show()
OUTPUT:
Cement Blast Furnace Slag ... Age (day) Concrete compressive strength
0 540.0 0.0 ... 28 79.99
1 540.0 0.0 ... 28 61.89
2 332.5 142.5 ... 270 40.27
3 332.5 142.5 ... 365 41.05
[4 rows x 9 columns]
Index(['Cement', 'Blast Furnace Slag', 'Fly Ash', 'Water ', 'Superplasticizer ', 'Coarse Aggregate ',
'Fine Aggregate ',
'Age (day)', 'Concrete compressive strength'], dtype='object')
C3 Blast Furnace Slag Fly Ash ... Fine Aggregate Age (day) Strength
0 540.0 0.0 0.0 ... 676.0 28 79.99
1 540.0 0.0 0.0 ... 676.0 28 61.89
2 332.5 142.5 0.0 ... 594.0 270 40.27
3 332.5 142.5 0.0 ... 594.0 365 41.05
[4 rows x 9 columns]
THE CORRELATION BETWEEN CEMENT AND STRENGTH: 0.49783
c3 strength
0 540.0 79.99
1 540.0 61.89
2 332.5 40.27
3 332.5 41.05
4 198.6 44.30
5 266.0 47.03
6 380.0 43.70
7 380.0 36.45
8 266.0 45.85
9 475.0 39.29
c3 strength
c3 1.000000 0.497832
strength 0.497832 1.000000
6a.) Implementation of one sample z-test
PROGRAM:
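(The program body is missing from this listing; below is a minimal one-sample z-test sketch using statsmodels' ztest. The sample array and the hypothesized mean of 100 are assumptions for illustration, so the z and p values in the output that follows will not be reproduced exactly.)
import numpy as np
from statsmodels.stats.weightstats import ztest
alpha=0.05
#assumed sample data for illustration (not the original experiment's data)
sample=np.array([88,92,94,94,96,97,97,99,99,105,109,109,110,112,112,113,114,115,118,121])
#H0: the population mean equals 100
z_value,p_value=ztest(sample,value=100)
print('z value:',z_value)
print('p value:',p_value)
if p_value<alpha:
    print('CONCLUSION: Reject H0')
else:
    print('CONCLUSION: Accept H0')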
OUTPUT:
z value: 1.5976240527147705
p value: 0.11012667014384257
CONCLUSION: Accept H0
6b.) Implementation of two sample z-test
PROGRAM:
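(The program body is missing from this listing; below is a minimal two-sample z-test sketch using statsmodels' ztest. The two samples are assumptions for illustration, so the z and p values in the output that follows will not be reproduced exactly.)
import numpy as np
from statsmodels.stats.weightstats import ztest
alpha=0.05
#assumed samples for illustration (not the original experiment's data)
sample_a=np.array([23,25,28,30,31,33,34,36,38,40,41,43,44,46,48])
sample_b=np.array([27,29,32,34,35,37,38,40,42,44,45,47,48,50,52])
#H0: the two population means are equal
z_value,p_value=ztest(sample_a,sample_b,value=0)
print('z value:',z_value)
print('p value:',p_value)
if p_value<alpha:
    print('CONCLUSION: we reject H0')
else:
    print('CONCLUSION: we accept H0')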
OUTPUT:
z value: -1.9953236073282115
p value: 0.046007596761332065
CONCLUSION:
we reject H0
6c.) IMPLEMENTATION OF Z-TEST – USING TITANIC CASE STUDY
The following claims are tested:
(i) Some new survey/research claims that the average age of passengers in Titanic who survived is greater than 28.
(ii) There is a difference in average age between the two genders who survived.
(iii) Greater than 50% of passengers who survived in Titanic are in the age group of 20–40.
(iv) Greater than 50% of passengers in Titanic are in the age group of 20–40 (including both survived and non-survived passengers).
PROGRAM:
(i) Some new survey/research claims that the average
age of passengers in Titanic who survived is greater
than 28.
import pandas as pd
import numpy as np
import random
from statsmodels.stats.weightstats import ztest
from statsmodels.stats.weightstats import zconfint
alpha=float(0.05)
df=pd.read_csv('C:/fds/titanic.csv')
survived_passenger=(df[df['Survived']==1])
Age_valued=survived_passenger[survived_passenger['Age'].notna()].head()
Age_column=(df[df['Age'].notna()].Age)
survey=[]
#AS PER THE CENTRAL LIMIT THEOREM, TAKE 60 RANDOM SAMPLES
for i in range(60):
    mean_age=np.random.choice(Age_column).mean()
    survey.append(mean_age)
age=28
z_value,p_value=ztest(survey,value=age)
#the print and conclusion lines below were missing from the listing
print('z value:',z_value,'\np value:',p_value)
if p_value<alpha:
    print('CONCLUSION:reject H0')
else:
    print('CONCLUSION:accept H0')
lower,upper=zconfint(survey,value=0)
print('the average age of passengers in Titanic who survived is between',lower,'and',upper)
OUTPUT:
[NOTE: AT EACH EXECUTION VALUES DIFFER]
z value: 1.0036986169297506
p value: 0.3155239039221307
CONCLUSION:accept H0
the average age of passengers in Titanic who survived is between
26.372399866408035 and 33.04426680025863
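(ii) There is a difference in average age between the two genders who survived.
The listing for this part is missing; the sketch below runs a two-sample z-test on the ages of male and female survivors and reports zconfint intervals, matching the shape of the output that follows. The Sex column name and the use of the full age columns (rather than the random sampling that the varying output suggests) are assumptions.
PROGRAM:
import pandas as pd
from statsmodels.stats.weightstats import ztest
from statsmodels.stats.weightstats import zconfint
alpha=0.05
df=pd.read_csv('C:/fds/titanic.csv')
survived=df[(df['Survived']==1)&(df['Age'].notna())]
male_age=survived[survived['Sex']=='male'].Age
female_age=survived[survived['Sex']=='female'].Age
#H0: there is no difference in average age between the two genders who survived
z_value,p_value=ztest(male_age,female_age,value=0)
print('z_value:',z_value,'\np_value:',p_value)
lower_m,upper_m=zconfint(male_age)
print('the average age of male who is survived lies between',lower_m,'and',upper_m)
lower_f,upper_f=zconfint(female_age)
print('the average age of female who is survived lies between',lower_f,'and',upper_f)
if p_value<alpha:
    print('CONCLUSION:reject H0')
else:
    print('CONCLUSION:accept H0')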
OUTPUT:
[NOTE:AT EACH EXECUTION OUTPUT VALUES DIFFER]
z_value: 0.669303406337577
p_value: 0.5033019543375687
the average age of male who is survived lies between
22.807162910444323 and 30.91783708955568
the average age of female who is survived lies between
21.442984426168046 and 28.590348907165286
CONCLUSION:accept H0
(iii) Greater than 50% of passengers who survived in Titanic are in the age group of 20–40.
PROGRAM:
import pandas as pd
import numpy as np
import random
from statsmodels.stats.weightstats import ztest
from statsmodels.stats.weightstats import zconfint
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.proportion import proportion_confint
import math
alpha=0.05
df=pd.read_csv('C:/fds/titanic.csv')
survived=df[df['Survived']==1]
age_20_40=survived[(survived['Age']>=20)&(survived['Age']<=40)].Age
h=df[df['Survived']==1].Survived
len_survived=len(h)
nobs=len_survived
value=float(0.5)
count=len(age_20_40)
z_value,p_value=proportions_ztest(count,nobs,value)
#the print and conclusion lines below were missing from the listing; the manual
#z score denominator follows the pattern of the next listing and is an assumption
print('z value:',z_value,'\np_value:',p_value)
a=(count/nobs)-value
b=math.sqrt((value*(1-value))/len(df['Age']))
print('z score:',a/b)
if p_value<alpha:
    print('CONCLUSION:REJECT H0')
else:
    print('CONCLUSION:ACCEPT H0')
OUTPUT:
z value: -1.628491667667518
p_value: 0.10342067458876389
z score: -2.3439279326118236
CONCLUSION:ACCEPT H0
(iv) Greater than 50% of passengers in Titanic are in the age group of 20–40 (including both survived and non-survived passengers).
PROGRAM:
import pandas as pd
from statsmodels.stats.weightstats import ztest
from statsmodels.stats.proportion import proportions_ztest
import math
alpha=0.05
df=pd.read_csv('C:/fds/titanic.csv')
passengers=len(df['Survived'])
age=df[(df['Age']>=20)&(df['Age']<=40)].Age
count=(len(age))
value=0.5
nobs=passengers
z_value,p_value=proportions_ztest(count,nobs,value)
print('z value:',z_value,'\np value:',p_value)
a=(count/passengers)-value
b=math.sqrt((value*(1-value))/len(df['Age']))
print('z score:',a/b)
if p_value<alpha:
    print("CONCLUSION:REJECT H0")
else:
    print("CONCLUSION:ACCEPT H0")
OUTPUT:
z value: -3.064640290803521
p value: 0.0021793193662313385
z score: -3.048614706286276
CONCLUSION:REJECT H0
7a.) Implementation of T-Test – one sample t-test
Aim: To perform a one sample t-test to determine whether the mean of a population is equal to some specified value
Algorithm :
1: Create some dummy age data for the population of voters in the entire country
2: Create a sample of voters in Minnesota and test whether the average age of Minnesota voters differs from the population
3: Conduct a t-test at a 95% confidence level and see if it correctly rejects the null hypothesis that the sample comes from the same distribution as the population
4: If the t-statistic lies outside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we reject the null hypothesis
5: Calculate the chance of seeing a result as extreme as the one observed (the p-value) by passing the t-statistic as the quantile to the stats.t.cdf() function
7a.) Implementation of T-Test – one sample t-test
Program:
import numpy as np
import pandas as pd
import scipy.stats as stats
import math
alpha=(100-95)/100
df=50-1
np.random.seed(6)
population_age1=stats.poisson.rvs(loc=18,mu=35,size=150000)
population_age2=stats.poisson.rvs(loc=18,mu=10,size=100000)
population_age=np.concatenate((population_age1,population_age2))
popmean=population_age.mean()
minnesota_age1=stats.poisson.rvs(loc=18,mu=30,size=30)
minnesota_age2=stats.poisson.rvs(loc=18,mu=10,size=20)
minnesota_age=np.concatenate((minnesota_age1,minnesota_age2))
t_statistic,p_value=stats.ttest_1samp(a=minnesota_age,popmean=population_age.mean())
print('t_statistic value:',t_statistic,'p_value:',p_value)
quantile=alpha/2
quantiles=stats.t.ppf(quantile,df)
print('quantile value:',quantiles)
#two-sided p-value recovered from the t statistic via the cdf (algorithm step 5; this line was missing)
print('t statistic to quantile:',stats.t.cdf(t_statistic,df)*2)
sigma=minnesota_age.std()/math.sqrt(50)
a=stats.t.interval(0.95,df=49,loc=minnesota_age.mean(),scale=sigma)
b=stats.t.interval(0.99,df=49,loc=minnesota_age.mean(),scale=sigma)
if p_value<alpha:
    print("the p-value is lower than our significance level alpha",alpha,"so we should reject the null hypothesis.")
else:
    print("the p-value is greater than our significance level alpha",alpha,"so we should accept the null hypothesis.")
alpha=(100-99)/100
if p_value<alpha:
    print("the p-value is lower than our significance level alpha",alpha,"so we should reject the null hypothesis.")
else:
    print("the p-value is greater than our significance level alpha",alpha,"so we should accept the null hypothesis.")
OUTPUT:
t_statistic value: -2.5742714883655027 p_value: 0.013118685425061678
quantile value: -2.0095752344892093
t statistic to quantile: 0.013118685425061678
the p-value is lower than our significance level alpha 0.05 so we should reject the
null hypothesis.
the p-value is greater than our significance level alpha 0.01 so we should accept
the null hypothesis.
7b.) Implementation of T-Test – Two sample t-test and Paired t-test
Aim:
To perform a two sample t-test and a paired t-test to determine whether the means of two populations are equal or not
Algorithm
1: Create the data
2: Perform a two sample t-test with stats.ttest_ind and compare the p-value with the significance level
3: Create before/after data and perform a paired t-test with stats.ttest_rel
4: Print the results and conclude
Program:
import numpy as np
import pandas as pd
import math
import scipy.stats as stats
import matplotlib.pyplot as plt
alpha=(100-95)/100
np.random.seed(6)
population_age1=stats.poisson.rvs(loc=18,mu=35,size=150000)
population_age2=stats.poisson.rvs(loc=18,mu=10,size=100000)
population_age=np.concatenate((population_age1,population_age2))
#recreate the Minnesota sample from experiment 7a (these lines were missing; seed 6 keeps the sample identical)
minnesota_age1=stats.poisson.rvs(loc=18,mu=30,size=30)
minnesota_age2=stats.poisson.rvs(loc=18,mu=10,size=20)
minnesota_age=np.concatenate((minnesota_age1,minnesota_age2))
np.random.seed(12)
wisconsin_age1 = stats.poisson.rvs(loc=18, mu=33, size=30)
wisconsin_age2 = stats.poisson.rvs(loc=18, mu=13, size=20)
wisconsin_age= np.concatenate((wisconsin_age1, wisconsin_age2))
t_statistic, p_value=stats.ttest_ind(a=minnesota_age,b=wisconsin_age)
print('t statistic value:',t_statistic,'p value:',p_value)
if p_value<alpha:
    print("the p-value is lower than our significance level alpha",alpha,"so we should reject the null hypothesis.")
else:
    print("the p-value is greater than our significance level alpha",alpha,"so we should accept the null hypothesis.")
#PAIRED TEST
print("paired test ")
np.random.seed(11)
before= stats.norm.rvs(scale=30, loc=250, size=100)
after = before + stats.norm.rvs(scale=5, loc=-1.25, size=100)
t_statistic,p_value=stats.ttest_rel(a=before,b=after)
print('t statistic value:',t_statistic,'p value:',p_value)
plt.figure(figsize=(12,10))
OUTPUT:
t statistic value: -1.7083870793286842 p value: 0.09073015386514258
the p-value is greater than our significance level alpha 0.05 so we should accept
the null hypothesis.
paired test
t statistic value: 2.5720175998568284 p value: 0.011596444318439859
Type II error: -0.001142591013029215
8a.) IMPLEMENTATION OF VARIANCE ANALYSIS (ANOVA)
4. The difference between each value and the mean value for the group is calculated and squared.
5. The squared difference values are added. The result is a value that relates to the total deviation of the values from the mean of their respective groups. This value is referred to as the sum of squares within groups (SS within).
6. For each group, the difference between the total mean and the group mean is squared and multiplied by the number of values in the group. The results are added. The result is referred to as the sum of squares between groups (SS between).
7. The two sums of squares are used to obtain a statistic for testing the null hypothesis, the so-called F-statistic, calculated as F = (SS between / df between) / (SS within / df within), where df between (degrees of freedom between groups) equals the number of groups minus 1 and df within (degrees of freedom within groups) equals the total number of values minus the number of groups.
8a.) IMPLEMENTATION OF VARIANCE ANALYSIS (ANOVA)
PROGRAM:
from scipy.stats import f_oneway
import pandas as pd
import numpy as np
a=[25,25,27,30,23,20]
b=[30,30,21,24,26,28]
c=[18,30,29,29,24,26]
data=zip(a,b,c)
df=pd.DataFrame(data,columns=['a','b','c'])
print(df)
m1=np.mean(a)
m2=np.mean(b)
m3=np.mean(c)
m=(m1+m2+m3)/3
x1=((m1-m)**2)*6
x2=((m2-m)**2)*6
x3=((m3-m)**2)*6
s2btwn=x1+x2+x3
num=s2btwn/(3-1)
y1=list(np.array(a)-m1)
y2=list(np.array(b)-m2)
y3=list(np.array(c)-m3)
y=(y1+y2+y3)
wthn=[]
for i in y:
    wthn.append(i**2)
s2wthn=np.sum(wthn)
n=6
k=3
df1=(k*(n-1))
den=(s2wthn)/(df1)
f=num/den
print('f statistic value:',f)
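As a cross-check, the f_oneway function imported above runs the same one-way ANOVA directly and should report the same F statistic for these three groups:
#verify the hand computation against scipy's built-in one-way ANOVA
print('f statistic (scipy):',f_oneway(a,b,c))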
OUTPUT:
    a   b   c
0 25 30 18
1 25 30 30
2 27 21 29
3 30 24 29
4 23 26 24
5 20 28 26
f statistic value: 0.23489932885906037
9a.) DEMONSTRATION OF LINEAR REGRESSION
Algorithm
1: Consider a set of values x, y.
2: Estimate the regression coefficients by least squares: slope b_1 = SS_xy / SS_xx and intercept b_0 = mean(y) - b_1*mean(x), where SS_xy and SS_xx are the corrected sums of cross products and squares.
3: Plot the observed points together with the fitted line y = b_0 + b_1*x.
PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def estimate_coef(x, y):
    n = np.size(x)
    mean_x = np.mean(x)
    mean_y = np.mean(y)
    #corrected sums of cross products and squares
    ss_xy = np.sum(y*x) - n*mean_y*mean_x
    ss_xx = np.sum(x*x) - n*mean_x*mean_x
    c = ss_xy / ss_xx      #slope
    d = mean_y - c*mean_x  #intercept
    return (d, c)

def plot_regression_line(x, y, b):
    plt.scatter(x, y, color = "red", marker = "o")
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color = "g")
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

def main():
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    plot_regression_line(x, y, b)

main()
OUTPUT:
Estimated coefficients:
b_0 = 1.2363636363636363 b_1 = 1.1696969696969697
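The hand-computed coefficients can be cross-checked against NumPy's least-squares polynomial fit; a degree-1 np.polyfit on the same x and y returns the slope first and the intercept second:
#cross-check with numpy's least-squares fit (same data as in main())
import numpy as np
slope,intercept=np.polyfit(np.arange(10),np.array([1,3,2,5,7,8,8,9,10,12]),1)
print('b_0 =',intercept,'\nb_1 =',slope)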
10a.) DEMONSTRATION OF LOGISTIC REGRESSION
Aim: Write a python Application Program to perform classification using Logistic Regression
Algorithm:
Step 1: Import the LogisticRegression class from sklearn.linear_model.
Step 2: Prepare the training data (features and class labels).
Step 3: Create the logistic regression model.
Step 4: Fit the data into the logistic regression function.
Step 5: Predict the test data set.
Step 6: Print the results.
10a.) DEMONSTRATION OF LOGISTIC REGRESSION
PROGRAM:
#the imports, model creation and prediction were missing from this listing;
#the test applicant's feature values below are assumed for illustration
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
# Train the model on the admissions dataset (replace with your own dataset)
X_train = [[700, 3.5, 2], [680, 3.9, 4], [720, 3.8, 3], [690, 3.3, 6], [730, 3.7, 5], [690, 3.5, 2], [720, 3.7, 6], [740, 3.6, 8], [700, 3.3, 1], [690, 2.7, 4]]
y_train = [1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
logreg.fit(X_train, y_train)
# Predict admission for a new applicant (assumed feature values)
y_pred = logreg.predict([[710, 3.6, 3]])
if y_pred[0] == 1:
    print("Congratulations! You have been admitted.")
else:
    print("Sorry, you have not been admitted.")
OUTPUT:
11a.) IMPLEMENTATION OF TIME SERIES ANALYSIS
Aim: Implement a python Application Program to analyze the characteristics of a given time series data
Algorithm
1. Load the time series dataset and display its first and last rows.
2. Convert the month column to datetime and set it as the index.
3. Plot the series and test it for stationarity using the augmented Dickey-Fuller test.
4. Split the data into train and test sets, plot the split, and fit an ARIMA model using auto_arima.
5. Stop.
PROGRAM:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#minimal completion of the truncated listing; the file path is assumed
df=pd.read_csv('C:/fds/AirPassengers.csv')
print(df.head())
print(df.tail())
#convert the month column to datetime and set it as the index
df['month']=pd.to_datetime(df['month'])
print(df.head())
df=df.set_index('month')
print(df.head())
plt.plot(df['passengers'])
plt.show()
from statsmodels.tsa.stattools import adfuller
#adf = augmented Dickey-Fuller test for stationarity
adft=adfuller(df['passengers'],autolag='AIC')
output_df = pd.DataFrame({"Values":[adft[0],adft[1],adft[2],adft[3],adft[4]['1%'],adft[4]['5%'],adft[4]['10%']],
                          "Metric":["Test Statistics","p-value","No. of lags used","Number of observations used","critical value (1%)","critical value (5%)","critical value (10%)"]})
print(output_df)
train_data = df['passengers'][:int(len(df)*0.8)]
test_data = df['passengers'][int(len(df)*0.8):]
plt.plot(train_data, color = "black",label='train_data')
plt.plot(test_data, color = "red",label='test_data')
plt.title("Train/Test split for Passenger Data")
plt.ylabel("Passenger Number")
plt.xlabel('Year-Month')
sns.set()
plt.legend(loc='upper left')
plt.show()
from pmdarima.arima import auto_arima
#fit an ARIMA model by stepwise search (a minimal completion; the stepwise trace appears in the output below)
model=auto_arima(train_data,trace=True,suppress_warnings=True)
OUTPUT:
month passengers
0 1949-01 112
1 1949-02 118
2 1949-03 132
3 1949-04 129
4 1949-05 121
month passengers
139 1960-08 606
140 1960-09 508
141 1960-10 461
142 1960-11 390
143 1960-12 432
month passengers
0 1949-01-01 112
1 1949-02-01 118
2 1949-03-01 132
3 1949-04-01 129
4 1949-05-01 121
passengers
month
1949-01-01 112
1949-02-01 118
1949-03-01 132
1949-04-01 129
1949-05-01 121
Values Metric
0 0.815369 Test Statistics
1 0.991880 p-value
2 13.000000 No. of lags used
3 130.000000 Number of observations used
4 -3.481682 critical value (1%)
5 -2.884042 critical value (5%)
6 -2.578770 critical value (10%)
Performing stepwise search to minimize aic
ARIMA(2,1,2)(0,0,0)[0] intercept : AIC=inf, Time=0.16 sec
ARIMA(0,1,0)(0,0,0)[0] intercept : AIC=1076.519, Time=0.00 sec