
ANAND INSTITUTE OF HIGHER TECHNOLOGY

OLD MAHABALIPURAM ROAD, KALASALINGAM NAGAR,

KAZHIPATTUR, CHENNAI-603 103.

DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

AD3411 – DATA SCIENCE AND ANALYTICS LABORATORY


(2023-2024)

Name :

Reg No :

Branch :

Year :

Semester :

ANAND INSTITUTE OF HIGHER TECHNOLOGY

OLD MAHABALIPURAM ROAD, KALASALINGAM NAGAR,

KAZHIPATTUR, CHENNAI-603 103.


BONAFIDE CERTIFICATE

Certified that this is a Bonafide Record of work done by


Mr./Ms. ___________________________________ with Register No ________________________ of the
___________________________________ department in
DATA SCIENCE AND ANALYTICS LABORATORY – AD3411 during the Academic Year
2023-2024.

Staff in charge HOD

Submitted for University Examination held on ………………………..

Internal Examiner External Examiner

INDEX

S.NO   NAME OF EXPERIMENT                                            DATE   PG.NO   SIGN

1.     Working With Numpy Arrays
       a.) Creating a NumPy ndarray object
       b.) Access an array element using array indexing
       c.) Access the subarray using the slicing technique

2.     Working With Pandas DataFrames
       a.) Create a DataFrame with the given dictionary and apply aggregations on the Age column
       b.) Create a DataFrame with the given table (create a dictionary using the data) and perform the groupby operations
       c.) Perform concatenation operations along an axis

3.     Basic Plots Using Matplotlib
       a.) Line plot
       b.) Multi-line plot
       c.) Bar chart
       d.) Histogram chart
       e.) Pie chart
       f.) Subplot

4a.)   Frequency distribution
4b.)   Describing data with averages
4c.)   Measures of variability
5a.)   Normal curves
5b.)   Correlation coefficient and scatter plots
6a.)   Implementation of one sample z-test
6b.)   Implementation of two sample z-test
6c.)   Implementation of z-test using the Titanic case study
7a.)   Implementation of one sample t-test
7b.)   Implementation of two sample t-test
8a.)   Implementation of variance analysis (ANOVA)
9a.)   Demonstration of linear regression
10a.)  Demonstration of logistic regression
11a.)  Implementation of time series analysis
1a.) Display the dimensions, shape and size of arrays

Aim: To create NumPy ndarray objects using Python and display their dimensions, shape and size.

Algorithm:

1. Use a tuple or list to create a NumPy array (a tuple-based sketch follows this list).

2. Create a 0-D array with the value 456.

3. Create a 1-D array containing the values 100, 200, 300.

4. Create a 2-D array containing two arrays with the values 11, 22, 33 and 44, 55, 66.

5. Create a 3-D array with two 2-D arrays, both containing two arrays with the values 11, 22, 33 and 44, 55, 66.

6. Display the number of dimensions, shape and size of each array.
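Step 1 mentions building an array from a tuple; the program below uses lists, but np.array accepts either. A minimal sketch (the values are illustrative only):

import numpy as np
#np.array converts a tuple exactly as it converts a list
t=np.array((100,200,300))
print(type(t)) #<class 'numpy.ndarray'>
print(t)       #[100 200 300]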


1a.) Display the dimensions, shape and size of arrays
PROGRAM:

import numpy as np
a=np.array(456)
b=np.array([100,200,300])
c=np.array([[11,22,33],[44,55,66]])
d=np.array([[[11,22,33],[44,55,66]],[[11,22,33],[44,55,66]]])
print('the array a is',a)
print('the array b is',b)
print('the array c is',c)
print('the array d is',d)
print('the dimension of an array a :' ,a.ndim)
print('the dimension of an array b :',b.ndim)
print('the dimension of an array c :' ,c.ndim)
print('the dimension of an array d :',d.ndim)
print('the shape of an array a :' ,a.shape)
print('the shape of an array b :' ,b.shape)
print('the shape of an array c :' ,c.shape)
print('the shape of an array d :', d.shape)
print('the size of an array a :', a.size)
print('the size of an array b :',b.size)
print('the size of an array c :',c.size)
print('the size of an array d :',d.size)

OUTPUT:
the array a is 456
the array b is [100 200 300]

the array c is [[11 22 33]


[44 55 66]]
the array d is [[[11 22 33]
[44 55 66]]
[[11 22 33]
[44 55 66]]]
the dimension of an array a : 0
the dimension of an array b : 1
the dimension of an array c : 2
the dimension of an array d : 3
the shape of an array a : ()
the shape of an array b : (3,)
the shape of an array c : (2, 3)
the shape of an array d : (2, 2, 3)
the size of an array a : 1
the size of an array b : 3
the size of an array c : 6
the size of an array d : 12
1b.) Access an array element using array indexing

Aim: To Access an array element using array indexing

Algorithm:

1. Create a 1-D array using NumPy ([23, 12, 53, 84]).

2. Get the first element from the above array.

3. Get the second element from the above array.

4. Get the third and fourth elements from the above array and display them.

5. Create a 2-D array ([[11,22,33,44,55], [36,67,88,99,101]]).

6. Access the 2nd element of the 1st dimension.

7. Access the 5th element of the 2nd dimension.

8. Create an array ([[[19, 52, 73], [24, 65, 46]], [[17, 28, 89], [50, 41, 92]]]).

9. Access the third element of the second array of the first array.

10. Create an array ([[18,23,32,41,55], [63,74,86,98,30]]).

11. Print the last element of the 2nd dimension using negative indexing.
1b.) Access an array element using array indexing

PROGRAM:

import numpy as np
#one dimensional array
arr=np.array([23,12,53,84])
#get first element from the above array
print("the first element from the array:",arr[0])
#get the second element from the above array
print("the second element from the array:",arr[1])
#get the third and fourth element from the above array
print("the third element from the array:",arr[2])
print("the fourth element from the array:",arr[3])
#2D ARRAYS
arr1 = np.array([[11,22,33,44,55], [36,67,88,99,101]])
#access the 2nd element on 1st dim
print("the second element on 1st dim:",arr1[0][1])
#access the 5th element on 2nd dim
print("the fifth element on 2nd dim:",arr1[1][4])
#3D ARRAYS
arr2 = np.array([[[19, 52, 73], [24, 65, 46]], [[17,28, 89], [50, 41, 92]]])
#Access the third element of the second array of the first array
print("the third element of the second array of the first array:",arr2[0][1][2])
#Negative Indexing
arr3 = np.array([[18,23,32,41,55], [63,74,86,98,30]])
#Print the last element from the 2nd dim using negative indexing
print("the last element from the 2nd dim:",arr3[1][-1])

OUTPUT:
the first element from the array: 23
the second element from the array: 12
the third element from the array: 53
the fourth element from the array: 84
the second element on 1st dim: 22
the fifth element on 2nd dim: 101
the third element of the second array of the first array: 46
the last element from the 2nd dim: 30
1c.) Accessing the subarray using the slicing technique

Aim: To Access the subarray using slicing technique

Algorithm:

1. Create the array ([13, 22, 37, 49, 56, 64, 72]) and slice elements from index 4 to the end of the array.

2. Slice elements from the beginning to index 4 (not included).

3. Slice from index 3 from the end to index 1 from the end of the array ([12, 23, 53, 74, 15, 16, 87]).

4. Return every other element from index 1 to index 5 of the array ([14, 22, 36, 14, 15, 76, 97]).

5. Return every other element from the entire array ([14, 22, 13, 64, 65, 56, 47]).

6. Create a subarray arr1 [27 28 19] from the array ([[31, 22, 13, 54, 75], [16, 27, 28, 19, 70]]) and
display the array arr1.
1c.) Accessing the subarray using the slicing technique

PROGRAM:

import numpy as np
arr = np.array([13, 22, 37, 49, 56, 64, 72])
#Slice elements from index 4 to the end of the array
print("the elements from index 4 to the end:",arr[4:])
#Slice elements from the beginning to index 4 (not included)
print("the elements from the beginning to index 3:",arr[0:4])
#negative slicing
#slice from index 3 from the end to index 1 from the end
arr1=np.array([12,23,53,74,15,16,87])
print("the elements from index 3 from the end to index 1 from the end:",arr1[-3:-1])
#return every other element from index 1 to index 5
arr2=np.array([14,22,36,14,15,76,97])
print("the every other element from index 1 to index 5:",arr2[1:6:2])
#return every other element from the entire array
arr3=np.array([14,22,13,64,65,56,47])
print("the every other element from the entire array:",arr3[::2])
#SLICING 2D ARRAYS
arr4 = np.array([[31, 22, 13, 54, 75], [16, 27, 28, 19, 70]])
subarray=(arr4[1][1:4])
print("the given array is ",arr4)
print("the subarray from given array :",subarray)

OUTPUT:
the elements from index 4 to the end: [56 64 72]
the elements from the beginning to index 3: [13 22 37 49]
the elements from index 3 from the end to index 1 from the end: [15 16]
the every other element from index 1 to index 5: [22 14 76]
the every other element from the entire array: [14 13 65 47]
the given array is [[31 22 13 54 75]
[16 27 28 19 70]]
the subarray from given array : [27 28 19]
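The step slices above follow NumPy's general arr[start:stop:step] form; a small sketch of how the three fields interact (values illustrative only):

import numpy as np
arr=np.array([10,20,30,40,50,60,70])
print(arr[1:6:2]) #start 1, stop 6 (excluded), step 2 -> [20 40 60]
print(arr[::2])   #omitted start/stop cover the whole array -> [10 30 50 70]
print(arr[::-1])  #a negative step walks backwards -> [70 60 50 40 30 20 10]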
2a.) Create a DataFrame with the below given dictionary. Applying multiple aggregations at
once – get the sum, mean and standard deviation for the AGE column

Aim: To know and use more of the functionalities of pandas library


Algorithm:
1. Import the required packages
2. Create a DataFrame with the below given dictionary and apply aggregations on Age
column
3. Applying multiple aggregations at once – get the sum, mean and standard deviation for
the AGE column
4. Display the result
2a.) Create a DataFrame with the below given dictionary. Applying multiple aggregations at once
– get the sum, mean and standard deviation for the AGE column

PROGRAM:
import pandas as pd
#create dataframe
Data = {'Name':['Sankar', 'Julius', 'Sriram', 'Rithu',
'Mitelesh','Juliet'],'Age':[20,19,20,20,19,20], 'Dept' : ['AI','CSE','AI','CSE', 'AI','CSE']}
Index='20AD01','20CS02','20AD03','20CS04','20AD04','20AD10'
df=pd.DataFrame(Data,Index)
print(df)
#get sum, mean, standard deviations in age column
sum1=df['Age'].aggregate('sum')
print("sum of age column :",sum1)
mean1=df['Age'].mean()
print("mean of age column:",mean1)
standard_deviations=df['Age'].std()
print("standard deviation of age column:",standard_deviations)

OUTPUT:
Name Age Dept
20AD01 Sankar 20 AI
20CS02 Julius 19 CSE
20AD03 Sriram 20 AI
20CS04 Rithu 20 CSE
20AD04 Mitelesh 19 AI
20AD10 Juliet 20 CSE
sum of age column : 118
mean of age column: 19.666666666666668
standard deviation of age column: 0.5163977794943222
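The heading asks for multiple aggregations at once; the program applies them one by one, but pandas can compute all three in a single agg call. A minimal sketch, assuming the df built in the program above:

#one call returning sum, mean and std together
print(df['Age'].agg(['sum','mean','std']))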
2b.) Create a DataFrame with the below given table (create a dictionary using the data) and
perform the groupby operations

Aim: To know and use more of the functionalities of pandas library


Algorithm:
1. Create a DataFrame with the below given table (create a dictionary using the data)
and perform the groupby operations
2. Group the data based on single column – Outlook

3. Group the data based on multiple column – Outlook and Play


4. Iterate over the groups of the data frame using the groupby object on the Outlook column
5. Display the result
2b.) Create a DataFrame with the below given table (create a dictionary using the data) and
perform the groupby operations

PROGRAM:
import pandas as pd
#create a dataframe
data={'Outlook':['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny','Rainy','Sunny','Overcast','Overcast','Rainy'],
'Temperature':[85,80,83,70,68,65,64,72,69,75,75,72,81,71],
'humidity':[85,90,86,96,80,70,65,95,70,80,70,90,75,91],
'windy':['false','true','false','false','false','true','true','false','false','false','true','true','false','true'],
'play':['no','no','yes','yes','yes','no','yes','no','yes','yes','yes','yes','yes','no']}
df=pd.DataFrame(data)
print(df)

#group the data based on a single column - Outlook
outlook1=df.groupby('Outlook')
#to visualize the groups in the dataframe, use an iterative method
for outlook_objects in outlook1:
    print(outlook_objects)

#group the data based on multiple columns - Outlook and play
play1=df.groupby(['Outlook','play'])
#to visualize the groups in the dataframe, use an iterative method
for play_objects in play1:
    print(play_objects)

OUTPUT:
     Outlook  Temperature  humidity  windy play
0      Sunny           85        85  false   no
1      Sunny           80        90   true   no
2   Overcast           83        86  false  yes
3      Rainy           70        96  false  yes
4      Rainy           68        80  false  yes
5      Rainy           65        70   true   no
6   Overcast           64        65   true  yes
7      Sunny           72        95  false   no
8      Sunny           69        70  false  yes
9      Rainy           75        80  false  yes
10     Sunny           75        70   true  yes
11  Overcast           72        90   true  yes
12  Overcast           81        75  false  yes
13     Rainy           71        91   true   no
('Overcast',     Outlook  Temperature  humidity  windy play
2   Overcast           83        86  false  yes
6   Overcast           64        65   true  yes
11  Overcast           72        90   true  yes
12  Overcast           81        75  false  yes)
('Rainy',    Outlook  Temperature  humidity  windy play
3      Rainy           70        96  false  yes
4      Rainy           68        80  false  yes
5      Rainy           65        70   true   no
9      Rainy           75        80  false  yes
13     Rainy           71        91   true   no)
('Sunny',    Outlook  Temperature  humidity  windy play
0      Sunny           85        85  false   no
1      Sunny           80        90   true   no
7      Sunny           72        95  false   no
8      Sunny           69        70  false  yes
10     Sunny           75        70   true  yes)
(('Overcast', 'yes'),     Outlook  Temperature  humidity  windy play
2   Overcast           83        86  false  yes
6   Overcast           64        65   true  yes
11  Overcast           72        90   true  yes
12  Overcast           81        75  false  yes)
(('Rainy', 'no'),    Outlook  Temperature  humidity  windy play
5      Rainy           65        70   true   no
13     Rainy           71        91   true   no)
(('Rainy', 'yes'),    Outlook  Temperature  humidity  windy play
3      Rainy           70        96  false  yes
4      Rainy           68        80  false  yes
9      Rainy           75        80  false  yes)
(('Sunny', 'no'),    Outlook  Temperature  humidity  windy play
0      Sunny           85        85  false   no
1      Sunny           80        90   true   no
7      Sunny           72        95  false   no)
(('Sunny', 'yes'),    Outlook  Temperature  humidity  windy play
8      Sunny           69        70  false  yes
10     Sunny           75        70   true  yes)
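Besides iterating over the groups, a single group can be pulled out of the groupby object directly. A short sketch, assuming the outlook1 object from the program above:

#fetch one group without looping; size() counts the rows in each group
print(outlook1.get_group('Sunny'))
print(outlook1.size())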

2c.) To perform concatenation operations along an axis


Aim: To know and use more of the functionalities of pandas library
Algorithm:
1. Create two DataFrames.

2. Perform concatenation operations along an axis.

3. Set ignore_index to True in the concatenation and print the result.

4. Display the result.


2c.) To perform concatenation operations along an axis

PROGRAM:

#Concatenation
#Create two DataFrame
import pandas as pd
One= pd.DataFrame({ 'Name': ['Allen', 'Amutha', 'Ashwad', 'Avinash', 'Arun'],
'subject_id':['OOPS','DM','Physics','Statistics','FDS'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({ 'Name': ['Bathri', 'Barath', 'Banu', 'Balaji', 'Betty'],
'subject_id':[ 'OOPS','DM','Physics','Statistics','FDS'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
df=[One,two]
#Perform concatenation operations along an axis
#Set the ignore_index to True in concatenation and print the result.
a=pd.concat(df,ignore_index=True)
print(a)

OUTPUT:
Name subject_id Marks_scored
0 Allen OOPS 98
1 Amutha DM 90
2 Ashwad Physics 87
3 Avinash Statistics 69
4 Arun FDS 78
5 Bathri OOPS 89
6 Barath DM 80
7 Banu Physics 79
8 Balaji Statistics 97
9 Betty FDS 88
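The program concatenates row-wise (the default axis=0); passing axis=1 stacks the frames column-wise instead, aligning on the index labels. A minimal sketch, assuming the One and two frames from the program above:

#axis=1 places the two frames side by side
side_by_side=pd.concat([One,two],axis=1)
print(side_by_side)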
3a.) Loading the dataset and reading the total sales of all months, shown using a line plot

Aim: To write a Python program to create a line plot using Matplotlib's plot() function.
Algorithm:
1. Load the dataset and read the total sales of all months and show them using a line plot.

2.Define the x-axis and corresponding y-axis values as lists.

3.Plot them on canvas using .plot() function.

4.Give a name to x-axis and y-axis using .xlabel() and .ylabel() functions.

5.Give a title to your plot using .title() function.

6. Finally, to view your plot, we use .show() function.

7. Stop
3a.) Loading the dataset and reading the total sales of all months, shown using a line plot

PROGRAM:

import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv("C:/mat/plot.csv") #load the dataset
print(df)

#convert array to list


year=df['year/month'].tolist()
ts=df['total_sales'].tolist()

#style properties
plt.plot(year,ts,color='green',label='sales',linestyle='dotted',linewidth=5,marker='o',markerfacecolor='red')

#labelling
plt.xlabel('month number')
plt.ylabel('sales per month')
plt.legend(loc='upper left')

#display
plt.show()
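Step 5 of the algorithm calls for a title, which the program above omits; one extra line before plt.show() would supply it (the title text is an assumption):

plt.title('total sales per month')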

OUTPUT:
[Line plot of total monthly sales: dotted green line with red circular markers]
3b.) Loading the dataset and reading all product sales data, shown using a multi-line plot

Aim: To write a Python program to create a multi-line plot using Matplotlib's plot() function.
Algorithm:
1. Load the dataset and read all product sales data and show them using a multi-line plot.

2.Define the x-axis and corresponding y-axis values as lists.

3.Plot them on canvas using .plot() function.

4.Give a name to x-axis and y-axis using .xlabel() and .ylabel() functions.

5.Give a title to your plot using .title() function.

6. Finally, to view your plot, we use .show() function.

7. Stop
3b.) Loading the dataset and reading all product sales data, shown using a multi-line plot

PROGRAM:

import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv('C:/mat/plot.csv')
print(df)
year=df['year/month'].tolist()
fuel=df['fuel'].tolist()
veg=df['veg'].tolist()
fruits=df['fruits'].tolist()
snacks=df['snacks'].tolist()
rice=df['rice'].tolist()
plt.plot(year,fuel,label='fuel',linewidth=5,color='green',marker='o',markerfacecolor='red')
plt.plot(year,veg,label='veg',linewidth=5,color='blue',marker='o',markerfacecolor='red')
plt.plot(year,fruits,label='fruits',linewidth=5,color='red',marker='o',markerfacecolor='black')
plt.plot(year,snacks,linewidth=5,label='snacks',color='orange',marker='o',markerfacecolor='blue')
plt.plot(year,rice,linewidth=5,color='black',label='rice',marker='o',markerfacecolor='red')
plt.xlabel("year/month")
plt.ylabel("sales unit in number")
plt.legend(loc='upper left')
plt.show()

OUTPUT:
[Multi-line plot of fuel, veg, fruits, snacks and rice sales per month]
3c.) Loading the dataset and reading the veg and snacks sales data, shown using a bar chart

Aim: To write a Python program to create a bar chart using Matplotlib.
Algorithm:
1. Load the dataset and read the veg and snacks sales data and show them using a bar chart.

2.Define the x-axis and corresponding y-axis values as lists.

3.Plot them on canvas using .plot() function.

4.Give a name to x-axis and y-axis using .xlabel() and .ylabel() functions.

5.Give a title to your plot using .title() function.

6. Finally, to view your plot, we use .show() function.

7. Stop
3c.) Loading the dataset and reading the veg and snacks sales data, shown using a bar chart

PROGRAM:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#load the dataset
df=pd.read_csv('C:/mat/plot.csv')
print(df)
year=df['year/month']

v=df['veg']
s=df['snacks']

bar1=np.arange(len(year))
bar2=0.4+bar1
plt.bar(bar1,v,width=0.4,label='veg')
plt.bar(bar2,s,width=0.4,label='snacks')
plt.xticks(bar1,year)
plt.legend(loc='upper left')
plt.xlabel("year & month")
plt.ylabel("sales of veg and snacks")
plt.show()

OUTPUT:
[Grouped bar chart of veg and snacks sales per month]
3d.) Loading the dataset and reading the rice sales data of all months, shown using a histogram chart

Aim: To write a Python program to create a histogram using Matplotlib.
Algorithm:
1. Load the dataset and read the rice sales data of all months and show it using a histogram chart.

2.Define the x-axis and corresponding y-axis values as lists.

3.Plot them on canvas using .plot() function.

4.Give a name to x-axis and y-axis using .xlabel() and .ylabel() functions.

5.Give a title to your plot using .title() function.

6. Finally, to view your plot, we use .show() function.

7. Stop
3d.) Loading the dataset and reading the rice sales data of all months, shown using a histogram chart

PROGRAM:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv('C:/mat/plot.csv')
year=df['year/month'].tolist()

rice=df['rice'].tolist()
rice.sort()
a=[6000,6500,7000,7500,8000,8500,9000]
plt.hist(rice,a,bottom=4,ec='black')
plt.xticks(rice)
plt.show()

OUTPUT:
[Histogram of monthly rice sales over the bins 6000 to 9000]
3e.) Loading the dataset and reading the 2013 total sales data for each product, shown using a pie chart

Aim: To write a Python program to create a pie chart using Matplotlib.
Algorithm:
1.start
2. Define the x-axis and corresponding y-axis values as lists.

3. Plot them on canvas using .plot() function.

4. Give a name to x-axis and y-axis using .xlabel() and .ylabel() functions.

5. Give a title to your plot using .title() function.

6. Finally, to view your plot, we use .show() function.

7. Stop
3e.) Loading the dataset and reading the 2013 total sales data for each product, shown using a pie chart

PROGRAM:

import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv('C:/mat/plot.csv')
year=df['year/month']
labels=['fuel','veg','fruits','snacks','rice']
#adding the data of each column
a=[df['fuel'].sum(),df['veg'].sum(),df['fruits'].sum(),df['snacks'].sum(),df['rice'].sum()]
plt.axis('equal')
#autopct for visualize the percentage of each data
plt.pie(a,labels=labels,autopct='%1.1f%%')
plt.legend(loc='upper left')
plt.show()

OUTPUT:
[Pie chart of each product's share of total sales, with percentage labels]
3f.) Loading the dataset and reading fuel and fruits sales of all months, displayed using subplots

Aim: To write a Python program to create subplots using Matplotlib.
Algorithm:
1. Load the dataset and read fuel and fruits sales of all months and display them using subplots.

2.Define the x-axis and corresponding y-axis values as lists.

3.Plot them on canvas using .plot() function.

4.Give a name to x-axis and y-axis using .xlabel() and .ylabel() functions.

5.Give a title to your plot using .title() function.

6. Finally, to view your plot, we use .show() function.

7. Stop
3f.) Loading the dataset and reading fuel and fruits sales of all months, displayed using subplots

PROGRAM:

import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv('C:/mat/plot.csv')
year=df['year/month']
fuel=df['fuel']
fruits=df['fruits']
fig,ax=plt.subplots(2)
ax[0].plot(year,fuel,color='red',marker='o',label='fuel',linewidth=3)
ax[0].set_title('fuel')
ax[1].plot(year,fruits,color='green',marker='o',label='fruits',linewidth=3)
ax[1].set_title('fruits')
plt.xticks(year)
plt.show()

OUTPUT:
[Two stacked subplots: fuel sales (top) and fruits sales (bottom)]

THE CSV FILE :
year/month fuel veg fruits snacks rice total_sales total_profit
0 2013-10 160195 94189 107256 93654 7518 462812 4628120
1 2013-09 167767 95059 102224 99034 6925 471009 4710090
2 2013-08 147264 119002 113335 109712 8120 497433 4974330
3 2013-07 146792 117459 124317 113254 7910 509732 5097320
4 2013-06 141535 125406 117393 92277 7592 484203 4842030
5 2013-05 167146 142285 124644 96207 8516 538798 5387980
6 2013-04 141325 114726 103087 90990 7743 457871 4578710
4a.) Frequency distribution

Aim: To sort observations into classes and show their frequency (f) of occurrence in each class.

Algorithm

1. Download any freely available dataset and import it using the pandas library.

2. Choose a quantitative column and find the range of the column.

3. Find the frequency distribution for the considered column.

4. Find the relative frequency distribution for the considered column and add it to the previous table.

5. Find the cumulative frequency distribution for the considered column and add it to the previous table.

6. Find the cumulative frequency distribution percentage for the considered column and add it to the previous table.

7. Stop.
4a.) Frequency distribution

PROGRAM:

import pandas as pd
df=pd.read_csv('C:/mat/fre.csv')
print(df)
#TO FIND THE RANGE OF A QUANTITATIVE COLUMN
max_value=df['AGE'].max()
min_value=df['AGE'].min()
a=(max_value)-(min_value)#range
print('the range of the column:',a)
#frequency distribution of age column
df1=pd.DataFrame({})#EMPTY DATASET
frequency_distribution=df['AGE'].value_counts()
df1['frequency_distribution_AGE']=frequency_distribution

#relative frequency of age column


total_frequency=frequency_distribution.sum()
relative_frequency=((frequency_distribution)/total_frequency)
df1['relative_frequency_AGE']=relative_frequency
print(df1)
#cumulative frequency of age column
df2=pd.DataFrame({})
df2['cumulative_frequency_AGE']=df['AGE'].cumsum()
df2['cumulative_percentage_AGE']=100*(df['AGE'].cumsum()/df['AGE'].sum())
print(df2)

OUTPUT:
NAME DEPT AGE MARK [CSV FILE]
0 Sankar AI 22 98
1 Julius IT 25 99
2 Sriram ECE 24 97
3 Rithu CSE 22 96
the range of the column: 3
frequency_distribution_AGE relative_frequency_AGE
22 2 0.50
25 1 0.25
24 1 0.25
cumulative_frequency_AGE cumulative_percentage_AGE
0 22 23.655914
1 47 50.537634
2 71 76.344086
3 93 100.000000
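Note that the program accumulates the raw AGE values with cumsum(); a cumulative frequency distribution in the usual textbook sense accumulates the class counts instead. A minimal sketch of that variant, assuming the frequency_distribution series from the program above:

#cumulative frequency: running total of the counts, not of the ages
cum_freq=frequency_distribution.sort_index().cumsum()
print(cum_freq)
print(100*cum_freq/cum_freq.iloc[-1]) #cumulative percentage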
4b.) Describing data with averages

Aim: To apply the measures of central tendency to describe the middle or typical value for a
distribution.
Algorithm:

1. Create an array using NumPy and find the mean value of the array ([56,78,98,76,54,56,43,32,34,51]).

2. Find the median and mode of the same array. Install the scipy package, import its stats module and use stats.mode.

3. Create a dictionary of series using the Weather dataset and apply mean and median on the temperature and windy columns; apply mode on the windy column.

4. Calculate the skewness on the windy column and plot a curve.
4b.) Describing data with averages

PROGRAM:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statistics
from sklearn.preprocessing import LabelEncoder
from scipy.stats import mode,skew

a=np.array([56,78,98,76,54,56,43,32,34,51])
print('the mean of an array:',a.mean())
print('the median of the array:',statistics.median(a))
print('the mode of the array:',mode(a,keepdims=True))
#CREATE A DICTIONARY OF SERIES
dict_ser=pd.Series({'outlook':['sunny','sunny','overcast','rainy','rainy','rainy','overcast','sunny','sunny'],
'temperature':[85,80,83,70,68,65,64,72,69],
'humidity':[85,90,86,96,80,70,65,95,70],
'windy':['False','True','False','False','False','True','True','False','False']})
print("the data series:",dict_ser)
#to find the mean, convert True to 1 and False to 0
b=LabelEncoder().fit_transform(dict_ser['windy'])
dict_ser['windy']=b
print(dict_ser)

print('the mean of temperature:',statistics.mean(dict_ser['temperature']))
print('the mean of windy',dict_ser['windy'].mean())
print('the median of temperature:',statistics.median(dict_ser['temperature']))
print('the median of windy:',statistics.median(dict_ser['windy']))
print('the mode of windy:',mode(dict_ser['windy'],keepdims=True))

print("the skewness on windy column:",skew(dict_ser['windy']))
x=np.linspace(-2,2,500)
y=1/(np.sqrt(2*np.pi))*np.exp(-5*(x)**2)
plt.plot(x,y,'*')
plt.title('skewness curve')
plt.show()

OUTPUT:
the mean of an array: 57.8
the median of the array: 55.0
the mode of the array: ModeResult(mode=array([56]),
count=array([2]))
the data series: outlook [sunny, sunny, overcast,
rainy, rainy, rainy, ...
temperature [85, 80, 83, 70, 68, 65, 64, 72, 69]
humidity [85, 90, 86, 96, 80, 70, 65, 95, 70]
windy [False, True, False, False, False, True, True,...

dtype: object
outlook [sunny, sunny, overcast, rainy, rainy, rainy, ...
temperature [85, 80, 83, 70, 68, 65, 64, 72, 69]
humidity [85, 90, 86, 96, 80, 70, 65, 95, 70]
windy [0, 1, 0, 0, 0, 1, 1, 0, 0]
dtype: object
the mean of temperature: 72.88888888888889
the mean of windy 0.3333333333333333
the median of temperature: 70
the median of windy: 0
the mode of windy: ModeResult(mode=array([0],
dtype=int64), count=array([6]))
the skewness on windy column: 0.7071067811865479
4c.) Measures of variability

Aim: To measure the amount by which scores are dispersed or scattered in a distribution.
Algorithm:
1. Create a DataFrame using pandas for the given list [3319, 3654, 3881, 6335, 840, 4759, 5130, 863, 8070, 8830].

2. Find the range, variance and standard deviation of the list.

3. Find the deviation of each data object from the mean, scaled by the variance.

4. Describe the data and find the IQR, median, Q1, Q2, Q3.


4c.) Measures of variability

PROGRAM:

import pandas as pd
import numpy as np
import statistics
from scipy.stats import iqr
#create a dataframe
a=pd.DataFrame({'data':[3319,3654,3881,6335,840,4759,5130,863,8070,8830]})
print(a)
#RANGE
max_value=a['data'].max()
min_value=a['data'].min()
b=max_value-min_value
print('the range of the column:',b)
#VARIANCE
c=statistics.variance(a['data'])
print('the variance of the column:',c)
#STANDARD DEVIATION
d=statistics.stdev(a['data'])
print("the standard deviation of the column:",d)
#DEVIATION OF EACH ELEMENT FROM THE MEAN, SCALED BY THE VARIANCE
M=statistics.mean(a['data'])
i=0
while i<len(a['data']):
    x=a['data'][i]   #current element of the dataframe
    k=(x-M)/c        #c=variance, M=mean
    i+=1
    print("the distance of standard deviation from mean",i,'{:7f}\n'.format(k))
#IQR
#CONVERT INTO A LIST & SORT
r=a['data'].tolist()
r.sort()
print('the sorted list :',r)
print('the interquartilerange of list:',iqr(r))
#DESCRIBE THE DATAFRAME
j=a.describe()
print('the description :',j)

OUTPUT:

data

0 3319

1 3654

2 3881

3 6335

4 840

5 4759

6 5130

7 863

8 8070

9 8830

the range of the column: 7990

the variance of the column: 7155070.766666667

the standard deviation of the column: 2674.8964029783783

the distance of standard deviation from mean 1 -0.000175

the distance of standard deviation from mean 2 -0.000128

the distance of standard deviation from mean 3 -0.000096

the distance of standard deviation from mean 4 0.000247

the distance of standard deviation from mean 5 -0.000521

the distance of standard deviation from mean 6 0.000027

the distance of standard deviation from mean 7 0.000079


the distance of standard deviation from mean 8 -0.000518

the distance of standard deviation from mean 9 0.000489

the distance of standard deviation from mean 10 0.000596

the sorted list : [840, 863, 3319, 3654, 3881, 4759, 5130, 6335, 8070,
8830]

the interquartilerange of list: 2631.0

the description : data

count 10.000000

mean 4568.100000

std 2674.896403

min 840.000000

25% 3402.750000

50% 4320.000000

75% 6033.750000

max 8830.000000
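The quartiles reported by describe() can be reproduced directly with np.percentile; a small self-contained sketch using the same list:

import numpy as np
data=[3319,3654,3881,6335,840,4759,5130,863,8070,8830]
q1,q2,q3=np.percentile(data,[25,50,75])
print('Q1:',q1,'Q2 (median):',q2,'Q3:',q3)
print('IQR:',q3-q1) #6033.75-3402.75=2631.0, matching iqr(r) above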
5a.) Normal curves

Aim: To visualize the data by arranging the probability distribution of each value in the
data
Algorithm:
1. Import matplotlib, scipy and numpy library

2. Set min and max value for generating random numbers between the specified range

3. Specify the mean and standard deviation

4. Generate 100 random numbers between the min and max

5. Use scipy library and stats module to call norm function

6. Plot the graph using matplotlib


5a.) Normal curves

PROGRAM:

import random
import statistics
from scipy.stats import norm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#set min and max values for generating random numbers in the specified range
num=random.sample(range(1,11),10)
print(num)
mean_num=statistics.mean(num)
print('the mean of the random number:',mean_num)
std_num=statistics.stdev(num)
print('the standard deviation of the random number:',std_num)

#generate 100 random numbers from a standard normal distribution
x=np.random.randn(100)
x.sort()#sort the list
plt.title('normal curves')
plt.plot(x,norm.pdf(x))
plt.show()

OUTPUT:
[NOTE: THE OUTPUT CHANGES FOR EACH EXECUTION]
[3, 7, 10, 6, 8, 2, 5, 1, 4, 9]
the mean of the random number: 5.5
the standard deviation of the random number: 3.0276503540974917
[Plot of the normal curve]
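The program computes the sample mean and standard deviation but then plots a standard normal curve; norm.pdf also accepts loc and scale, so the curve can be centred on those statistics. A minimal sketch using the mean and standard deviation from the run above:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
mean,std=5.5,3.0276 #sample statistics from the run above
x=np.linspace(mean-3*std,mean+3*std,100)
plt.plot(x,norm.pdf(x,loc=mean,scale=std)) #curve centred on the sample statistics
plt.title('normal curve from sample mean and std')
plt.show()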
5b.) Correlation coefficient

Aim: To describe the relationship between a pair of variables and plot them to visualize
Algorithm:
1. Import the data file using pandas (pd.read_csv(file path)) and display the first 4 rows
of the file to ensure it loaded correctly
2. Rename columns for better understanding

3. Find the correlation between 'Cement' (Column name C3) and 'Compressive
Strength’ (Column name Strength)
4. Plot the correlation between two variables 'Cement' (Column name C3) and
'Compressive Strength’ (Column name Strength) using seaborn
5. Plot the correlation matrix comprising all the variables using heatmap
5b.) Correlation coefficient

PROGRAM:

import pandas as pd
from scipy.stats import pearsonr
df=pd.read_csv('C:/fds/Concrete_Data.csv')
print(df.head(4)) #to display first 4 elements
print(df.columns)
df.rename(columns={'Cement':'C3','Concrete compressive
strength':'Strength'},inplace=True)
print(df.head(4))
a=df['C3'].corr(df['Strength']) #pearson correlation
print('THE CORRELATION BETWEEN CEMENT AND STRENGTH:',round(a,5))

#new dataframe consists of c3['cement'],strength


df1=pd.DataFrame()
df1['c3']=df['C3']
df1['strength']=df['Strength']
print(df1.head(10))
corr_matrix=df1.corr()
print(corr_matrix)

#visulize the 2 columns[c3,strength]


import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(corr_matrix)
plt.show()

#visualize all columns

corr_matrix1=df.corr()
sns.heatmap(corr_matrix1)
plt.show()

OUTPUT:
Cement Blast Furnace Slag ... Age (day) Concrete compressive strength
0 540.0 0.0 ... 28 79.99
1 540.0 0.0 ... 28 61.89
2 332.5 142.5 ... 270 40.27
3 332.5 142.5 ... 365 41.05

[4 rows x 9 columns]
Index(['Cement', 'Blast Furnace Slag', 'Fly Ash', 'Water ', 'Superplasticizer ', 'Coarse Aggregate ',
'Fine Aggregate ',
'Age (day)', 'Concrete compressive strength'], dtype='object')
C3 Blast Furnace Slag Fly Ash ... Fine Aggregate Age (day) Strength
0 540.0 0.0 0.0 ... 676.0 28 79.99
1 540.0 0.0 0.0 ... 676.0 28 61.89
2 332.5 142.5 0.0 ... 594.0 270 40.27
3 332.5 142.5 0.0 ... 594.0 365 41.05
[4 rows x 9 columns]
THE CORRELATION BETWEEN CEMENT AND STRENGTH: 0.49783
c3 strength
0 540.0 79.99
1 540.0 61.89
2 332.5 40.27

3 332.5 41.05
4 198.6 44.30
5 266.0 47.03
6 380.0 43.70
7 380.0 36.45
8 266.0 45.85
9 475.0 39.29
c3 strength
c3 1.000000 0.497832
strength 0.497832 1.000000
[Heatmap of the c3/strength correlation matrix, followed by a heatmap of all columns]
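Step 4 of the algorithm asks for a scatter plot of the two variables with seaborn, which the program replaces with heatmaps; a minimal sketch of the scatter view, assuming the df1 frame built above:

import seaborn as sns
import matplotlib.pyplot as plt
#each point is one concrete mixture; the loose upward drift reflects r = 0.4978
sns.scatterplot(data=df1,x='c3',y='strength')
plt.show()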
6 a.) Implementation of one sample z-test

Aim: To Perform One Sample Z-Tests in Python


Algorithm:
1: Evaluate the data distribution given for the one sample z-test.
2: Formulate the hypothesis statement symbolically.
3: Define the level of significance (alpha).
4: Calculate the Z test statistic (Z score).
5: Derive the P-value for the calculated Z score.
6: Make a decision:
6.1: If P-value <= alpha, reject H0.
6.2: If P-value > alpha, fail to reject H0.


6 a.) Implementation of one sample z-test

PROGRAM:

from statsmodels.stats.weightstats import ztest


alpha=float(0.05)#level of significance
mean=100
standard_deviation=15
#enter 20 random IQ levels samples
data=[88,92,94,94,96,97,105,109,109,109,112,110,97,97,99,99,112,113,114,115]
a=ztest(data,value=mean)
p_value=a[1]
print('z value:',a[0],'\n p value:',p_value)
print('CONCLUSION:')
if p_value<= alpha:
    print('we reject H0')
else:
    print('accept H0')

OUTPUT:

z value: 1.5976240527147705
p value: 0.11012667014384257
CONCLUSION: Accept H0
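For a one-sample test, ztest's statistic is the sample mean's distance from the hypothesized mean in standard-error units, z = (x̄ − μ) / (s / √n); a sketch reproducing the z value above:

import numpy as np
data=[88,92,94,94,96,97,105,109,109,109,112,110,97,97,99,99,112,113,114,115]
z=(np.mean(data)-100)/(np.std(data,ddof=1)/np.sqrt(len(data)))
print('z value:',z) #~1.5976, matching ztest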

6b.)Implementation of two sample ztest

Aim: To Perform Two Sample Z-Tests in Python


Algorithm:
1: Evaluate the data distribution given for the two sample z-test.
2: Formulate the hypothesis statement symbolically.
3: Define the level of significance (alpha).
4: Calculate the Z test statistic (Z score).
5: Derive the P-value for the calculated Z score.
6: Make a decision:
6.1: If P-value <= alpha, reject H0.
6.2: If P-value > alpha, fail to reject H0.


6b.)Implementation of two sample ztest

PROGRAM:

from statsmodels.stats.weightstats import ztest as ztest


alpha=0.05
#random sample values for city A and city B
cityA = [82, 84, 85, 89, 91, 91, 92, 94, 99, 99,105, 109, 109, 109, 110, 112, 112,
113, 114, 114]
cityB = [90, 91, 91, 91, 95, 95, 99, 99, 108, 109,109, 114, 115, 116, 117, 117, 128,
129, 130, 133]
a=ztest(cityA,cityB,value=0)
p_value=a[1]
print('z value:',a[0],'\n p value:',p_value)
print('CONCLUSION:')
if p_value<= alpha:
    print('we reject H0')
else:
    print('Accept H0')

OUTPUT:

z value: -1.9953236073282115
p value: 0.046007596761332065

CONCLUSION:
we reject H0
6C.) IMPLEMENTATION OF Z-TEST – USING TITANIC CASE STUDY

Aim: To perform a Z-test on the Titanic case study.

Algorithm:
1. load the required datasets

2. We will implement hypothesis test on below cases

2.1 Some new survey/research claims that the average age of passengers in Titanic who
survived is greater than 28.
2.2 There is a difference in average age between the two genders who survived?

2.3 Greater than 50% of passengers who survived in Titanic are in the age group of 20–40.

2.4.Greater than 50% of passengers in Titanic are in the age group of 20–40 ( including
both survived and non-survived passengers)

3. print the result


6C.) IMPLEMENTATION OF Z-TEST – USING TITANIC CASE STUDY

PROGRAM:
(i) Some new survey/research claims that the average
age of passengers in Titanic who survived is greater
than 28.

import pandas as pd
import numpy as np
import random
from statsmodels.stats.weightstats import ztest
from statsmodels.stats.weightstats import zconfint
alpha=float(0.05)
df=pd.read_csv('C:/fds/titanic.csv')
survived_passenger=(df[df['Survived']==1])
Age_valued=survived_passenger[survived_passenger['Age'].notna()].head()
Age_column=(df[df['Age'].notna()].Age)
survey=[]
#AS PER THE CENTRAL LIMIT THEOREM, TAKE 60 RANDOM SAMPLES
for i in range(60):
    mean_age=np.random.choice(Age_column).mean()
    survey.append(mean_age)
age=28
value=ztest(survey,value=age)
lower,upper=zconfint(survey,value=0)

print('z value:',value[0],'\np value:',value[1])

if value[1]< alpha:
    print("CONCLUSION:reject H0")
else:
    print("CONCLUSION:accept H0")
print("the average age of passengers in Titanic who survived is between",lower,'and',upper)

OUTPUT:
[NOTE: AT EACH EXECUTION VALUES DIFFER]
z value: 1.0036986169297506
p value: 0.3155239039221307
CONCLUSION:accept H0
the average age of passengers in Titanic who survived is between
26.372399866408035 and 33.04426680025863

(ii) Is there a difference in average age between the two genders who survived?
import pandas as pd
import numpy as np
import random
from statsmodels.stats.weightstats import ztest
from statsmodels.stats.weightstats import zconfint
alpha=float(0.05)
df=pd.read_csv("C:/fds/titanic.csv")
survived=df[df['Survived']==1]
male_survived=survived[survived['Sex']=='male']
male_age=male_survived[male_survived['Age'].notna()].Age
male_list=[]
for i in range(60):
    mean_male=np.random.choice(male_age).mean()
    male_list.append(mean_male)
female_survived=survived[survived['Sex']=='female']
female_age=female_survived[female_survived['Age'].notna()].Age
female_list=[]
for j in range(60):
    mean_female=np.random.choice(female_age).mean()
    female_list.append(mean_female)
z_value,p_value=ztest(x1=male_list,x2=female_list,value=0)
lower1,upper1=zconfint(male_list,value=0)
lower2,upper2=zconfint(female_list,value=0)
print("z_value:",z_value,"\np_value:",p_value)
print("the average age of male who is survived lies between",lower1,'and',upper1)
print("the average age of female who is survived lies between",lower2,'and',upper2)
if p_value<alpha:
    print("CONCLUSION:reject H0")
else:
    print("CONCLUSION:accept H0")

OUTPUT:
[NOTE:AT EACH EXECUTION OUTPUT VALUES DIFFER]
z_value: 0.669303406337577
p_value: 0.5033019543375687
the average age of male who is survived lies between
22.807162910444323 and 30.91783708955568
the average age of female who is survived lies between
21.442984426168046 and 28.590348907165286
CONCLUSION:accept H0

(iii) Greater than 50% of passengers who survived in Titanic are in the age group of 20–40.

import pandas as pd
import numpy as np
import random
from statsmodels.stats.weightstats import ztest
from statsmodels.stats.weightstats import zconfint
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.proportion import proportion_confint
import math
alpha=0.05
df=pd.read_csv('C:/fds/titanic.csv')
survived=df[df['Survived']==1]
age_20_40=survived[(survived['Age']>=20)&(survived['Age']<=40)].Age
h=df[df['Survived']==1].Survived
len_survived=len(h)
nobs=len_survived
value=float(0.5)
count=len(age_20_40)
z_value,p_value=proportions_ztest(count,nobs,value)
print("z value:",z_value,"\n p_value:",p_value)
a=(count/nobs)-value
b=math.sqrt(((value*(1-value))/df['Age'].count()))
print('z score:',a/b)
if p_value<alpha:
    print('CONCLUSION:REJECT H0')
else:
    print('CONCLUSION:ACCEPT H0')

OUTPUT:
z value: -1.628491667667518
p_value: 0.10342067458876389
z score: -2.3439279326118236
CONCLUSION:ACCEPT H0

(iv) Greater than 50% of passengers in Titanic are in the age group of 20–40 (including both survived and non-survived passengers)

import pandas as pd
from statsmodels.stats.weightstats import ztest
from statsmodels.stats.proportion import proportions_ztest
import math
alpha=0.05
df=pd.read_csv('C:/fds/titanic.csv')
passengers=len(df['Survived'])
age=df[(df['Age']>=20)&(df['Age']<=40)].Age

count=(len(age))
value=0.5
nobs=passengers
z_value,p_value=proportions_ztest(count,nobs,value)
print('z value:',z_value,'\np value:',p_value)
a=(count/passengers)-value
b=math.sqrt((value*(1-value))/len(df['Age']))
print('z score:',a/b)
if p_value<alpha:
    print("CONCLUSION:REJECT H0")
else:
    print("CONCLUSION:ACCEPT H0")
OUTPUT:
z value: -3.064640290803521
p value: 0.0021793193662313385
z score: -3.048614706286276
CONCLUSION:REJECT H0
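The claims in (iii) and (iv) are one-sided ("greater than 50%"), while proportions_ztest defaults to a two-sided alternative; statsmodels accepts an alternative argument for the directional version. A minimal sketch, assuming count and nobs as computed above:

#alternative='larger' tests H1: proportion > value instead of the two-sided default
z_value,p_value=proportions_ztest(count,nobs,value=0.5,alternative='larger')
print('one-sided z value:',z_value,'\np value:',p_value)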
7a.) Implementation of T-Test – one sample t-test

Aim: To perform a one sample t-test to determine whether the mean of a population is equal to a specified value.
Algorithm :

1: Create some dummy age data for the population of voters in the entire country.
2: Create a sample of voters in Minnesota and test whether the average age of voters in Minnesota
differs from the population.
3: Conduct a t-test at a 95% confidence level and see if it correctly rejects the null hypothesis
that the sample comes from the same distribution as the population.
4: If the t-statistic lies outside the quantiles of the t-distribution corresponding to our confidence
level and degrees of freedom, we reject the null hypothesis.
5: Calculate the chances of seeing a result as extreme as the one being observed (known as the p-
value) by passing the t-statistic in as the quantile to the stats.t.cdf() function
7a.) Implementation of T-Test – one sample t-test

Program:
import numpy as np
import pandas as pd
import scipy.stats as stats
import math
alpha=(100-95)/100
df=50-1
np.random.seed(6)
population_age1=stats.poisson.rvs(loc=18,mu=35,size=150000)
population_age2=stats.poisson.rvs(loc=18,mu=10,size=100000)
population_age=np.concatenate((population_age1,population_age2))
popmean=population_age.mean()

minnesota_age1=stats.poisson.rvs(loc=18,mu=30,size=30)
minnesota_age2=stats.poisson.rvs(loc=18,mu=10,size=20)
minnesota_age=np.concatenate((minnesota_age1,minnesota_age2))
t_statistic,p_value=stats.ttest_1samp(a=minnesota_age,popmean=population_age.mean())
print('t_statistic value:',t_statistic,'p_value:',p_value)
quantile=alpha/2
quantiles=stats.t.ppf(quantile,df)

print('quantile value:',quantiles)

#passing the t-statistic value to quantile


quantiles1=stats.t.cdf(t_statistic,df)*2
print('t statistic to quantile:',quantiles1)
if p_value<alpha:
    print("the p-value is lower than our significance level alpha",alpha,"so we should reject the null hypothesis.")
else:
    print("the p-value is greater than our significance level alpha",alpha,"so we should accept the null hypothesis.")

sigma=minnesota_age.std()/math.sqrt(50)
a=stats.t.interval(0.95,df = 49,loc = minnesota_age.mean(),scale= sigma)
b=stats.t.interval(0.99,df=49,loc=minnesota_age.mean(),scale=sigma)

alpha=(100-99)/100
if p_value<alpha:
    print("the p-value is lower than our significance level alpha",alpha,"so we should reject the null hypothesis.")
else:
    print("the p-value is greater than our significance level alpha",alpha,"so we should accept the null hypothesis.")

OUTPUT:
t_statistic value: -2.5742714883655027 p_value: 0.013118685425061678
quantile value: -2.0095752344892093
t statistic to quantile: 0.013118685425061678
the p-value is lower than our significance level alpha 0.05 so we should reject the
null hypothesis.
the p-value is greater than our significance level alpha 0.01 so we should accept
the null hypothesis.
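The program builds 95% and 99% confidence intervals (a and b) but never prints them; a small self-contained sketch of the same stats.t.interval call on hypothetical data:

import numpy as np
import scipy.stats as stats
sample=np.array([42,47,51,39,45,50,44,48])    #hypothetical ages
sigma=sample.std(ddof=1)/np.sqrt(len(sample)) #standard error of the mean
ci=stats.t.interval(0.95,df=len(sample)-1,loc=sample.mean(),scale=sigma)
print('95% confidence interval for the mean:',ci)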
7 b.) Implementation of T-Test – Two sample t-test and Paired T-Test

Aim:

To perform a two sample t-test and a paired t-test to determine whether the means of two populations are equal.
Algorithm
1: Create the data

2: Conduct a two sample t-test.

3: Interpret the results


7 b.) Implementation of T-Test – Two sample t-test and Paired T-Test

Program:

import numpy as np
import pandas as pd
import math
import scipy.stats as stats
import matplotlib.pyplot as plt
alpha=(100-95)/100
np.random.seed(6)
population_age1=stats.poisson.rvs(loc=18,mu=35,size=150000)
population_age2=stats.poisson.rvs(loc=18,mu=10,size=100000)
population_age=np.concatenate((population_age1,population_age2))

minnesota_age1 = stats.poisson.rvs(loc=18, mu=30, size=30)


minnesota_age2 = stats.poisson.rvs(loc=18, mu=10, size=20)
minnesota_age= np.concatenate((minnesota_age1, minnesota_age2))

np.random.seed(12)
wisconsin_age1 = stats.poisson.rvs(loc=18, mu=33, size=30)
wisconsin_age2 = stats.poisson.rvs(loc=18, mu=13, size=20)
wisconsin_age= np.concatenate((wisconsin_age1, wisconsin_age2))

t_statistic, p_value=stats.ttest_ind(a=minnesota_age,b=wisconsin_age)
print('t statistic value:',t_statistic,'p value:',p_value)
if p_value<alpha:
    print("the p-value is lower than our significance level alpha",alpha,"so we should reject the null hypothesis.")
else:
    print("the p-value is greater than our significance level alpha",alpha,"so we should accept the null hypothesis.")

#PAIRED TEST
print("paired test ")
np.random.seed(11)
before= stats.norm.rvs(scale=30, loc=250, size=100)
after = before + stats.norm.rvs(scale=5, loc=-1.25, size=100)
t_statistic,p_value=stats.ttest_rel(a=before,b=after)
print('t statistic value:',t_statistic,'p value:',p_value)
plt.figure(figsize=(12,10))
plt.fill_between(x=np.arange(-4,-2,0.01),y1=stats.norm.pdf(np.arange(-4,-2,0.01)),facecolor='red',alpha=0.35)
plt.fill_between(x=np.arange(-2,2,0.01),y1=stats.norm.pdf(np.arange(-2,2,0.01)),facecolor='grey',alpha=0.35)
plt.fill_between(x=np.arange(2,4,0.01),y1=stats.norm.pdf(np.arange(2,4,0.01)),facecolor='red',alpha=0.5)
plt.fill_between(x=np.arange(-4,-2,0.01),y1=stats.norm.pdf(np.arange(-4,-2,0.01),loc=3,scale=2),facecolor='grey',alpha=0.35)
plt.fill_between(x=np.arange(-2,2,0.01),y1=stats.norm.pdf(np.arange(-2,2,0.01),loc=3,scale=2),facecolor='blue',alpha=0.35)
plt.fill_between(x=np.arange(2,10,0.01),y1=stats.norm.pdf(np.arange(2,10,0.01),loc=3,scale=2),facecolor='grey',alpha=0.35)
plt.text(x=-0.8,y=0.15,s="Null Hypothesis")
plt.text(x=2.5,y=0.13,s="Alternative")
plt.text(x=2.1,y=0.01,s="Type 1 Error")
plt.text(x=-3.2,y=0.01,s="Type 1 Error")
plt.text(x=0,y=0.02,s="Type 2 Error")
plt.show()
low_quantile=stats.norm.pdf((100-95)/2)
high_quantile=stats.norm.pdf((100+95)/2)
low = stats.norm.cdf(low_quantile,loc=3,scale=2)
high = stats.norm.cdf(high_quantile,loc=3,scale=2)
print('Type II error:',high-low)

OUTPUT:
t statistic value: -1.7083870793286842 p value: 0.09073015386514258
the p-value is greater than our significance level alpha 0.05 so we should accept
the null hypothesis.
paired test
t statistic value: 2.5720175998568284 p value: 0.011596444318439859
Type II error: -0.001142591013029215
8 a.) IMPLEMENTATION OF VARIANCE ANALYSIS (ANOVA)

Aim: To write a Python application program to demonstrate the analysis of variance (ANOVA)


Algorithm:

1. Rows are grouped according to their value in the category column.

2. The total mean value of the value column is computed.

3. The mean within each group is computed.

4. The difference between each value and the mean value for the group is calculated and
squared.

5. The squared difference values are added. The result relates to the total deviation of rows
from the mean of their respective groups. This value is referred to as the sum of squares
within groups (SS_within).

6. For each group, the difference between the total mean and the group mean is squared
and multiplied by the number of values in the group. The results are added. The result is
referred to as the sum of squares between groups (SS_between).

7. The two sums of squares are used to obtain a statistic for testing the null hypothesis, the so
called F-statistic: F = (SS_between / df_between) / (SS_within / df_within), where df_between
(degrees of freedom between groups) equals the number of groups minus 1, and df_within
(degrees of freedom within groups) equals the total number of values minus the number of groups.
8 a.) IMPLEMENTATION OF VARIANCE ANALYSIS (ANOVA)
PROGRAM:
from scipy.stats import f_oneway
import pandas as pd
import numpy as np
a=[25,25,27,30,23,20]
b=[30,30,21,24,26,28]
c=[18,30,29,29,24,26]
data=zip(a,b,c)
df=pd.DataFrame(data,columns=['a','b','c'])
print(df)
m1=np.mean(a)
m2=np.mean(b)
m3=np.mean(c)
m=(m1+m2+m3)/3
x1=((m1-m)**2)*6
x2=((m2-m)**2)*6
x3=((m3-m)**2)*6
s2btwn=x1+x2+x3
num=s2btwn/(3-1)
y1=list(np.array(a)-m1)
y2=list(np.array(b)-m2)
y3=list(np.array(c)-m3)
y=(y1+y2+y3)
wthn=[]
for i in y:
    wthn.append(i**2)
s2wthn=np.sum(wthn)
n=6
k=3
df1=(k*(n-1))
den=(s2wthn)/(df1)
f=num/den
print('f statistic value:',f)

OUTPUT:
    a   b   c
0 25 30 18
1 25 30 30
2 27 21 29
3 30 24 29
4 23 26 24
5 20 28 26
f statistic value: 0.23489932885906037
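The f_oneway import at the top of the program goes unused; it computes the same one-way ANOVA F-statistic in a single call and also returns the p-value. A minimal cross-check sketch:

from scipy.stats import f_oneway
a=[25,25,27,30,23,20]
b=[30,30,21,24,26,28]
c=[18,30,29,29,24,26]
f_stat,p_value=f_oneway(a,b,c) #each list is one group
print('f statistic value:',f_stat,'p value:',p_value) #F should match the manual value above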
9a.) DEMONSTRATION OF LINEAR REGRESSION

Aim: Write a python program Application Program Linear Regression

Algorithm
1: Consider a set of values x, y.

2: Take the linear equation y = a + bx.

3: Compute the values of a and b with respect to the given values: b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²), a = (Σy − bΣx) / n.

4: Substitute the values of a and b in the equation y = a + bx.
5: Regress the value of y for any x.
9a.) DEMONSTRATION OF LINEAR REGRESSION

PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    n = np.size(x)
    mean_x = np.mean(x)
    mean_y = np.mean(y)
    ss_xy = np.sum(y*x) - n*mean_y*mean_x
    ss_xx = np.sum(x*x) - n*mean_x*mean_x
    c = ss_xy / ss_xx
    d = mean_y - c*mean_x
    return (d, c)

def plot_regression_line(x, y, b):
    plt.scatter(x, y, color="red", marker="o")
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color="g")
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

def main():
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {} \nb_1 = {}".format(b[0], b[1]))
    plot_regression_line(x, y, b)

main()

OUTPUT:
Estimated coefficients:
b_0 = 1.2363636363636363 b_1 = 1.1696969696969697
[Scatter plot of the points with the fitted green regression line]
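The hand-derived coefficients can be cross-checked with np.polyfit, which fits the same least-squares line; a minimal sketch:

import numpy as np
x=np.array([0,1,2,3,4,5,6,7,8,9])
y=np.array([1,3,2,5,7,8,8,9,10,12])
b_1,b_0=np.polyfit(x,y,1) #degree-1 fit returns (slope, intercept)
print('b_0 =',b_0,'b_1 =',b_1) #should match the estimates above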
10a.) DEMONSTRATION OF LOGISTIC REGRESSION

Aim: To write a Python application program to perform classification using logistic regression.
Algorithm:

Step 1: Initialize the variables.
Step 2: Set the data frame.
Step 3: Split the dataset into training and testing.
Step 4: Fit the data with the logistic regression function.
Step 5: Predict the test data set.
Step 6: Print the results.
10a.) DEMONSTRATION OF LOGISTIC REGRESSION

PROGRAM:

from sklearn.linear_model import LogisticRegression

# Define input as GMAT score, GPA, and years of work experience


GMAT_score = float(input("Enter GMAT score: "))
GPA = float(input("Enter GPA: "))
years_of_work_experience = float(input("Enter years of work experience: "))
logreg = LogisticRegression()

# Train the model on the admissions dataset (replace with your own dataset)
X_train = [[700, 3.5, 2], [680, 3.9, 4], [720, 3.8, 3], [690, 3.3, 6], [730, 3.7, 5], [690,
3.5, 2], [720, 3.7, 6], [740, 3.6, 8], [700, 3.3, 1], [690, 2.7, 4]]
y_train = [1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
logreg.fit(X_train, y_train)

X_test = [[GMAT_score, GPA, years_of_work_experience]]


y_pred = logreg.predict(X_test)

if y_pred[0] == 1:
    print("Congratulations! You have been admitted.")
else:
    print("Sorry, you have not been admitted.")

OUTPUT:

Enter GMAT score: 780


Enter GPA: 9
Enter years of work experience: 1
Congratulations! You have been admitted.
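Beyond the hard admit/reject label, the fitted model also exposes class probabilities through predict_proba; a short sketch reusing the same hypothetical training data:

from sklearn.linear_model import LogisticRegression
X_train=[[700,3.5,2],[680,3.9,4],[720,3.8,3],[690,3.3,6],[730,3.7,5],[690,3.5,2],[720,3.7,6],[740,3.6,8],[700,3.3,1],[690,2.7,4]]
y_train=[1,1,1,0,1,0,1,1,0,0]
logreg=LogisticRegression().fit(X_train,y_train)
proba=logreg.predict_proba([[710,3.6,3]]) #hypothetical applicant
print('P(not admitted), P(admitted):',proba[0])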
11a.) IMPLEMENTATION OF TIME SERIES ANALYSIS

Aim: To implement a Python application program to analyze the characteristics of a given time series.
Algorithm

1: Loading the time series dataset correctly in Pandas
2: Indexing in time-series data
3: Time-resampling using Pandas
4: Rolling time series
5: Plotting time-series data using Pandas
11a.) IMPLEMENTATION OF TIME SERIES ANALYSIS
PROGRAM:
import pandas as pd
df=pd.read_csv('C:/fds/time.csv')
print(df.head(5))
print(df.tail(5))
df['month']=pd.to_datetime(df['month'])
print(df.head())
df.index=df['month']
del(df['month'])
print(df.head())
import matplotlib.pyplot as plt
import seaborn as sns
sns.lineplot(df)
plt.ylabel('No.of.passengers')
plt.show()
rolling_mean = df.rolling(7).mean()
rolling_std = df.rolling(7).std()
plt.plot(df,color='red',label='Total_number.of.passengers')
plt.plot(rolling_mean,color='blue',label='rolling_mean.of.passengers_number')
plt.plot(rolling_std,color='green',label='rolling_std.of.passengers_standarddeviation')
plt.title("Passenger Time Series, Rolling Mean, Standard Deviation")
plt.legend(loc='upper left')

plt.show()
from statsmodels.tsa.stattools import adfuller
#adf=augumented_dickey_fuller
adft=adfuller(df,autolag='AIC')
output_df = pd.DataFrame({"Values":[adft[0],adft[1],adft[2],adft[3], adft[4]['1%'],
adft[4]['5%'], adft[4]['10%']] , "Metric":["Test Statistics","p-value","No. of lags
used","Number of observations used", "critical value (1%)", "critical value (5%)",
"critical value (10%)"]})
print(output_df)

train_data = df['passengers'][:int(len(df)*0.8)]
test_data = df['passengers'][int(len(df)*0.8):]
plt.plot(train_data, color = "black",label='train_data')
plt.plot(test_data, color = "red",label='test_data')
plt.title("Train/Test split for Passenger Data")
plt.ylabel("Passenger Number")
plt.xlabel('Year-Month')
sns.set()
plt.legend(loc='upper left')
plt.show()
from pmdarima.arima import auto_arima

model = auto_arima(train_data, trace=True, error_action='ignore', suppress_warnings=True)
model.fit(train_data)
forecast = model.predict(n_periods=len(test_data))
prediction=pd.DataFrame(forecast,
index=test_data.index,columns=['predictions'])
plt.plot(train_data, color = "black",label='train_data')
plt.plot(test_data, color = "red",label='test_data')
plt.plot(prediction, color = "orange",label='prediction')
sns.set()
plt.show()

OUTPUT:
month passengers
0 1949-01 112
1 1949-02 118
2 1949-03 132
3 1949-04 129
4 1949-05 121
month passengers
139 1960-08 606
140 1960-09 508
141 1960-10 461
142 1960-11 390
143 1960-12 432
month passengers
0 1949-01-01 112
1 1949-02-01 118

2 1949-03-01 132
3 1949-04-01 129
4 1949-05-01 121
passengers
month
1949-01-01 112
1949-02-01 118
1949-03-01 132
1949-04-01 129
1949-05-01 121
Values Metric
0 0.815369 Test Statistics
1 0.991880 p-value
2 13.000000 No. of lags used
3 130.000000 Number of observations used
4 -3.481682 critical value (1%)
5 -2.884042 critical value (5%)
6 -2.578770 critical value (10%)
Performing stepwise search to minimize aic
ARIMA(2,1,2)(0,0,0)[0] intercept : AIC=inf, Time=0.16 sec
ARIMA(0,1,0)(0,0,0)[0] intercept : AIC=1076.519, Time=0.00 sec

ARIMA(1,1,0)(0,0,0)[0] intercept : AIC=1069.440, Time=0.03 sec


ARIMA(0,1,1)(0,0,0)[0] intercept : AIC=1064.624, Time=0.03 sec
ARIMA(0,1,0)(0,0,0)[0] : AIC=1076.271, Time=0.00 sec
ARIMA(1,1,1)(0,0,0)[0] intercept : AIC=1058.834, Time=0.08 sec
ARIMA(2,1,1)(0,0,0)[0] intercept : AIC=inf, Time=0.24 sec
ARIMA(1,1,2)(0,0,0)[0] intercept : AIC=inf, Time=0.49 sec
ARIMA(0,1,2)(0,0,0)[0] intercept : AIC=1061.078, Time=0.06 sec
ARIMA(2,1,0)(0,0,0)[0] intercept : AIC=1066.203, Time=0.06 sec
ARIMA(1,1,1)(0,0,0)[0] : AIC=1058.246, Time=0.08 sec
ARIMA(0,1,1)(0,0,0)[0] : AIC=1063.646, Time=0.03 sec
ARIMA(1,1,0)(0,0,0)[0] : AIC=1068.536, Time=0.02 sec
ARIMA(2,1,1)(0,0,0)[0] : AIC=1058.648, Time=0.06 sec
ARIMA(1,1,2)(0,0,0)[0] : AIC=1057.328, Time=0.06 sec
ARIMA(0,1,2)(0,0,0)[0] : AIC=1060.685, Time=0.13 sec
ARIMA(2,1,2)(0,0,0)[0] : AIC=1057.516, Time=0.06 sec
ARIMA(1,1,3)(0,0,0)[0] : AIC=1058.949, Time=0.08 sec
ARIMA(0,1,3)(0,0,0)[0] : AIC=1062.466, Time=0.05 sec
ARIMA(2,1,3)(0,0,0)[0] : AIC=1056.580, Time=0.13 sec
ARIMA(3,1,3)(0,0,0)[0] : AIC=inf, Time=0.27 sec
ARIMA(2,1,4)(0,0,0)[0] : AIC=inf, Time=0.27 sec
ARIMA(1,1,4)(0,0,0)[0] : AIC=1051.655, Time=0.11 sec
ARIMA(0,1,4)(0,0,0)[0] : AIC=1050.070, Time=0.08 sec
ARIMA(0,1,5)(0,0,0)[0] : AIC=1051.622, Time=0.17 sec
ARIMA(1,1,5)(0,0,0)[0] : AIC=1053.620, Time=0.22 sec

ARIMA(0,1,4)(0,0,0)[0] intercept : AIC=inf, Time=0.35 sec

Best model: ARIMA(0,1,4)(0,0,0)[0]


Total fit time: 4.096 seconds
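Step 3 of the algorithm mentions time-resampling, which the program never performs; with the month column set as a DatetimeIndex (as done above), resample aggregates to any coarser frequency. A minimal sketch, assuming the indexed df from the program:

#yearly mean of the monthly passenger counts ('Y' = calendar-year frequency)
yearly=df.resample('Y').mean()
print(yearly.head())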
