List of Experiments
1. The probability that it is Friday and that a student is absent is 3 %. Since there are 5 school
days in a week, the probability that it is Friday is 20 %. What is the probability that a student
is absent given that today is Friday? Apply Bayes' rule in Python to get the result. (Ans: 15%)
2. Extract data from a database using Python
3. Implement k-nearest neighbours classification using Python
4. Given the following data, which specify classifications for nine combinations of VAR1
and VAR2, predict a classification for a case where VAR1=0.906 and VAR2=0.606, using
the result of k-means clustering with 3 means (i.e., 3 centroids)
VAR1 VAR2 CLASS
1.713 1.586 0
0.180 1.786 1
0.353 1.240 1
0.940 1.566 0
1.486 0.759 1
1.266 1.106 0
1.540 0.419 1
0.459 1.799 1
0.773 0.186 1
5. The following training examples map descriptions of individuals onto high, medium and
low credit-worthiness.
medium skiing design single twenties no -> highRisk
high golf trading married forties yes -> lowRisk
low speedway transport married thirties yes -> medRisk
medium football banking single thirties yes -> lowRisk
high flying media married fifties yes -> highRisk
low football security single twenties no -> medRisk
medium golf media single thirties yes -> medRisk
medium golf transport married forties yes -> lowRisk
high skiing banking single thirties yes -> highRisk
low golf unemployed married forties yes -> highRisk
Input attributes are (from left to right) income, recreation, job, status, age-group,
home-owner. Find the unconditional probability of 'golf' and the conditional probability of
'single' given 'medRisk' in the dataset.
6. Implement linear regression using Python.
PROBLEM STATEMENT 1:
The probability that it is Friday and that a student is absent is 3 %. Since there are 5
school days in a week, the probability that it is Friday is 20 %. What is the probability
that a student is absent given that today is Friday? Apply Bayes' rule in Python to get
the result. (Ans: 15%)
PROCEDURE:
If A and B are two events in a sample space S, then the conditional probability
of A given B is defined as
P(A|B) = P(A∩B) / P(B)
Similarly, the formula for P(B|A) is:
P(B|A) = P(A∩B) / P(A)
Two events A and B are independent if and only if P(A∩B) = P(A)P(B).
For the given experiment, let A be the event that it is Friday and B the event
that a student is absent. We have
P(A∩B) = 0.03
P(A) = 0.2
P(B|A) = P(A∩B) / P(A)
= 0.03 / 0.2 = 0.15
Another example: Suppose a fair die is rolled. Let A be the event that the outcome is
an odd number, i.e., A={1,3,5}. Also let B be the event that the outcome is less
than or equal to 3, i.e., B={1,2,3}. The figure shows the Venn diagram of the
events. Here A∩B={1,3}, so P(B|A) = P(A∩B)/P(A) = (2/6)/(3/6) = 2/3 ≈ 0.667
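The die example can be checked by enumerating outcomes. The following is a minimal sketch, assuming equally likely outcomes; the helper name conditional_probability is illustrative and not part of the original exercise.

```python
from fractions import Fraction

def conditional_probability(event, given, sample_space):
    """Return P(event | given) when all outcomes are equally likely."""
    joint = Fraction(len(event & given), len(sample_space))   # P(event ∩ given)
    marginal = Fraction(len(given), len(sample_space))        # P(given)
    return joint / marginal

sample_space = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5}   # outcome is odd
B = {1, 2, 3}   # outcome is less than or equal to 3

p_b_given_a = conditional_probability(B, A, sample_space)
print("P(B|A) =", p_b_given_a)

# Independence check: P(A∩B) == P(A) * P(B)?
p_a = Fraction(len(A), len(sample_space))
p_b = Fraction(len(B), len(sample_space))
p_ab = Fraction(len(A & B), len(sample_space))
independent = (p_ab == p_a * p_b)
print("A and B independent:", independent)
```

Exact fractions are used so that the result is 2/3 rather than a rounded decimal. Since P(A∩B) = 1/3 while P(A)P(B) = 1/4, the two events are not independent.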
SOURCE CODE:
# P(Friday and absent), entered by the user (0.03 for this problem)
prob_friday_and_absent = float(input("Enter the probability of being Friday and student is absent: "))
# P(Friday): one of the five school days
prob_friday = 0.2
# P(absent | Friday) = P(Friday and absent) / P(Friday)
prob_absent_given_friday = prob_friday_and_absent / prob_friday
print("Probability that student is absent given it is Friday is:", prob_absent_given_friday)
OUTPUT:
Enter the probability of being Friday and student is absent: 0.03
Probability that student is absent given it is Friday is: 0.15
PROBLEM STATEMENT 2
Extract data from a database using Python
MySQL is an open-source relational database management system (RDBMS)
based on Structured Query Language (SQL). Download and install
MySQL from the official website, https://www.mysql.com/downloads/.
Next, install MySQL Connector for Python, which enables
Python programs to access a MySQL database. It can be downloaded and
installed from https://dev.mysql.com/downloads/connector/python/ or with the
following command:
python -m pip install mysql-connector-python
If the connector is installed correctly, the statement import mysql.connector
executes without error.
PROCEDURE:
The steps to connect a Python application to a MySQL database are:
1. Import the mysql.connector module.
2. Create the connection object.
3. Create the cursor object.
4. Execute the query and fetch the results.
SOURCE CODE:
import mysql.connector
# Establish the connection
conn = mysql.connector.connect(user='root', password='cmrec@1234',
                               host='localhost', database='cmrec')
# Create a cursor object using the cursor() method
cursor = conn.cursor()
# Query that selects every row of the AIML table
sql = '''SELECT * from AIML'''
# Execute the query
cursor.execute(sql)
# To fetch only the first row instead, use:
# result = cursor.fetchone()
# print(result)
# Fetch all rows from the table
result = cursor.fetchall()
print(result)
# Close the cursor and the connection
cursor.close()
conn.close()
OUTPUT:
+------------+-------------+
| FIRST_NAME | Country     |
+------------+-------------+
| Shikhar    | India       |
| Jonathan   | SouthAfrica |
| Kumara     | Srilanka    |
| Virat      | India       |
| Rohit      | India       |
+------------+-------------+
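The same connect, cursor, execute and fetch steps can be tried without a running MySQL server. The sketch below swaps in Python's built-in sqlite3 module with an in-memory database; this is a substitution for illustration, not the MySQL setup used in this experiment. The two-column schema of the AIML table is an assumption inferred from the output shown above.

```python
import sqlite3

# Assumption: AIML has two text columns, as suggested by the sample output
rows = [
    ("Shikhar", "India"),
    ("Jonathan", "SouthAfrica"),
    ("Kumara", "Srilanka"),
    ("Virat", "India"),
    ("Rohit", "India"),
]

# 1-2. Import the module and create the connection object (in-memory database)
conn = sqlite3.connect(":memory:")
# 3. Create the cursor object
cursor = conn.cursor()

# Build and populate a stand-in table so the query has something to return
cursor.execute("CREATE TABLE AIML (FIRST_NAME TEXT, Country TEXT)")
cursor.executemany("INSERT INTO AIML VALUES (?, ?)", rows)
conn.commit()

# 4. Execute the query and fetch every row as a list of tuples
cursor.execute("SELECT * FROM AIML")
result = cursor.fetchall()
for first_name, country in result:
    print(first_name, country)

conn.close()
```

As with the MySQL connector, fetchall() returns a list of tuples, one per row, so print(result) shows that list rather than a formatted table.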
PROBLEM STATEMENT 3
Implement k-nearest neighbours classification using Python
K-Nearest Neighbour (K-NN) is one of the simplest machine learning algorithms based
on the supervised learning technique. It is an instance-based, or lazy, learning
algorithm: no model is learned from the training data in advance, and the learning
process is postponed until a prediction is requested for a new instance.
K-NN Algorithm
• Load the training data.
• Choose K, the number of nearest neighbours to consider.
• Compute the test point's distance from each training point.
• Sort the distances in ascending order.
• Use the sorted distances to select the K nearest neighbours.
• Use majority rule (for classification) or averaging (for regression) to make the prediction.
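The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming Euclidean distance and a small made-up data set; the function name knn_predict and the toy points are hypothetical and are not the diabetes data used in the source code below.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify query by majority vote among its k nearest training points."""
    # Compute the query's distance from each training point and sort ascending
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Select the labels of the k nearest neighbours
    k_nearest_labels = [label for _, label in distances[:k]]
    # Majority rule
    return Counter(k_nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

pred_low = knn_predict(train_points, train_labels, (1.1, 1.0), k=3)
pred_high = knn_predict(train_points, train_labels, (2.9, 3.1), k=3)
print(pred_low, pred_high)
```

No work happens until knn_predict is called, which is what makes the method "lazy": every prediction scans the whole training set.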
Advantages of KNN
1. Easy to understand
2. Makes no assumptions about the data
3. Can be applied to both classification and regression
4. Works easily on multi-class problems
Disadvantages of KNN
1. Memory intensive and computationally expensive
2. Sensitive to the scale of the data
3. Does not work well on rare-event (skewed) target variables
4. Struggles when there is a large number of independent variables
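The sensitivity to scale can be seen on a small made-up example. The sketch below assumes Euclidean distance and uses hypothetical points in which the second feature has a much larger numeric range than the first; standardising each feature to zero mean and unit variance (what StandardScaler does in the source code below) changes which training point is nearest.

```python
import math
from statistics import mean, pstdev

def nearest_index(points, query):
    """Return the index of the training point closest to query."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], query))

def standardize(points, query):
    """Scale every feature to zero mean and unit (population) standard deviation."""
    columns = list(zip(*points))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]

    def scale(p):
        return tuple((v - m) / s for v, m, s in zip(p, means, stds))

    return [scale(p) for p in points], scale(query)

# Hypothetical data: feature 1 ranges over a few units, feature 2 over thousands
points = [(1.0, 2000.0), (5.0, 2100.0), (1.5, 6000.0), (4.5, 5900.0)]
query = (1.2, 2080.0)

raw_nearest = nearest_index(points, query)
scaled_points, scaled_query = standardize(points, query)
scaled_nearest = nearest_index(scaled_points, scaled_query)
print("Nearest on raw data:", raw_nearest)
print("Nearest after standardising:", scaled_nearest)
```

On the raw values the large-range feature dominates the distance, so point 1 is nearest; after standardising, point 0, which is very close in the first feature, becomes nearest.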
SOURCE CODE:
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import StrMethodFormatter
sns.set()
import warnings
warnings.filterwarnings('ignore')
#%matplotlib inline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
# Import the CSV file
df = pd.read_csv(r'C:\Users\AIMLJAVA4\Desktop\lab 3\diabetes.csv')
print(df)
df.info(verbose=True)
print(df.describe().T)
# Treat zeros in these columns as missing values
df_copy = df.copy(deep=True)
cols_with_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df_copy[cols_with_zeros] = df_copy[cols_with_zeros].replace(0, np.nan)
# Show the count of NaNs in each column
print(df_copy.isnull().sum())
p = df.hist(figsize=(20, 20))
# Impute missing values: mean for Glucose and BloodPressure, median for the rest
df_copy['Glucose'] = df_copy['Glucose'].fillna(df_copy['Glucose'].mean())
df_copy['BloodPressure'] = df_copy['BloodPressure'].fillna(df_copy['BloodPressure'].mean())
df_copy['SkinThickness'] = df_copy['SkinThickness'].fillna(df_copy['SkinThickness'].median())
df_copy['Insulin'] = df_copy['Insulin'].fillna(df_copy['Insulin'].median())
df_copy['BMI'] = df_copy['BMI'].fillna(df_copy['BMI'].median())
# Standardise the features to zero mean and unit variance
sc_X = StandardScaler()
X = pd.DataFrame(sc_X.fit_transform(df_copy.drop(["Outcome"], axis=1)),
                 columns=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                          'BMI', 'DiabetesPedigreeFunction', 'Age'])
print(X.head())
# Split the data into training and test sets
y = df_copy.Outcome
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=42, stratify=y)
test_scores = []
train_scores = []
# Try k = 1 to 14 and record the accuracy on the training and test sets
for i in range(1, 15):
    knn = KNeighborsClassifier(i)
    knn.fit(X_train, y_train)
    train_scores.append(knn.score(X_train, y_train))
    test_scores.append(knn.score(X_test, y_test))
# Score that comes from testing on the same data points that were used for training
max_train_score = max(train_scores)
train_scores_ind = [i for i, v in enumerate(train_scores) if v == max_train_score]
print('Max train score {} % and k = {}'.format(max_train_score*100,
                                               [i + 1 for i in train_scores_ind]))
# Score that comes from testing on the data points held out solely for testing
max_test_score = max(test_scores)
test_scores_ind = [i for i, v in enumerate(test_scores) if v == max_test_score]
print('Max test score {} % and k = {}'.format(max_test_score*100,
                                              [i + 1 for i in test_scores_ind]))
# Set up a K-NN classifier with k = 11 neighbours
knn = KNeighborsClassifier(11)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
y_pred = knn.predict(X_test)
# Plot the confusion matrix as a heat map
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
p = sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
# Print precision, recall and F1-score for each class
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
# ROC (Receiver Operating Characteristic) curve: shows how well the model
# can distinguish between the two classes
from sklearn.metrics import roc_curve
y_pred_proba = knn.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
# Area under the ROC curve
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, y_pred_proba))
# Tune n_neighbors, the K-NN parameter to be tuned, with a 5-fold grid search
from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors': np.arange(1, 50)}
knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn, param_grid, cv=5)
knn_cv.fit(X, y)
print("Best Score:" + str(knn_cv.best_score_))
print("Best Parameters: " + str(knn_cv.best_params_))
OUTPUT:
accuracy of 121/143 = 84.6%.
PROBLEM STATEMENT 4
Given the following data, which specify classifications for nine combinations of VAR1 and
VAR2, predict a classification for a case where VAR1=0.906 and VAR2=0.606, using the
result of k-means clustering with 3 means (i.e., 3 centroids)
VAR1 VAR2 CLASS
1.713 1.586 0
0.180 1.786 1
0.353 1.240 1
0.940 1.566 0
1.486 0.759 1
1.266 1.106 0
1.540 0.419 1
0.459 1.799 1
0.773 0.186 1
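k-means is unsupervised, so it only groups the points; the CLASS column is not used while clustering. One way to turn the clusters into a class prediction is to assign the new case to its nearest centroid and take the majority class of that cluster. The sketch below illustrates this in plain Python under stated assumptions: the first three points are used as initial centroids (an arbitrary choice made here for reproducibility), and the helper names nearest and kmeans are illustrative.

```python
import math
from collections import Counter

points = [(1.713, 1.586), (0.180, 1.786), (0.353, 1.240),
          (0.940, 1.566), (1.486, 0.759), (1.266, 1.106),
          (1.540, 0.419), (0.459, 1.799), (0.773, 0.186)]
classes = [0, 1, 1, 0, 1, 0, 1, 1, 1]

def nearest(centroids, p):
    """Return the index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: math.dist(centroids[j], p))

def kmeans(points, k, iterations=100):
    # Assumption: the first k points serve as the initial centroids
    centroids = list(points[:k])
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid
        assignment = [nearest(centroids, p) for p in points]
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, assignment

centroids, assignment = kmeans(points, 3)
query = (0.906, 0.606)
cluster = nearest(centroids, query)
# Majority class among the training points in the query's cluster
votes = Counter(c for c, a in zip(classes, assignment) if a == cluster)
predicted_class = votes.most_common(1)[0][0]
print("Cluster:", cluster, "Predicted class:", predicted_class)
```

With this particular initialisation the query falls in a cluster containing two class-1 points and one class-0 point, so the majority vote gives class 1. A different initialisation can produce different clusters and therefore a different answer.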
SOURCE CODE:
from sklearn.cluster import KMeans
import numpy as np
X = np.array([[1.713, 1.586], [0.180, 1.786], [0.353, 1.240],
              [0.940, 1.566], [1.486, 0.759], [1.266, 1.106],
              [1.540, 0.419], [0.459, 1.799], [0.773, 0.186]])
y = np.array([0, 1, 1, 0, 1, 0, 1, 1, 1])
# k-means is unsupervised: it clusters X only and does not use y
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("The input data is ")
print("VAR1 \t VAR2 \t CLASS")
for val, label in zip(X, y):
    print(val[0], "\t", val[1], "\t", label)
print("="*20)
# Get the test data from the user
print("The Test data to predict ")
VAR1 = float(input("Enter Value for VAR1 :"))
VAR2 = float(input("Enter Value for VAR2 :"))
test_data = [VAR1, VAR2]
print("="*20)
# predict() returns a cluster index, not a CLASS value, so take the
# majority CLASS among the training points in that cluster
cluster = kmeans.predict([test_data])[0]
members = y[kmeans.labels_ == cluster]
print("The predicted Class is : ", np.bincount(members).argmax())
OUTPUT:
PROBLEM STATEMENT 5
The following training examples map descriptions of individuals onto high, medium and
low credit-worthiness.
medium skiing design single twenties no -> highRisk
high golf trading married forties yes -> lowRisk
low speedway transport married thirties yes -> medRisk
medium football banking single thirties yes -> lowRisk
high flying media married fifties yes -> highRisk
low football security single twenties no -> medRisk
medium golf media single thirties yes -> medRisk
medium golf transport married forties yes -> lowRisk
high skiing banking single thirties yes -> highRisk
low golf unemployed married forties yes -> highRisk
Input attributes are (from left to right) income, recreation, job, status, age-group,
home-owner. Find the unconditional probability of 'golf' and the conditional probability of
'single' given 'medRisk' in the dataset.
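The counts needed for both probabilities can be derived from the records themselves instead of being counted by hand. The following is a minimal sketch, assuming the ten examples above are stored as tuples in the attribute order given, with the risk label last; the variable names are illustrative.

```python
from fractions import Fraction

# (income, recreation, job, status, age-group, home-owner, risk)
records = [
    ("medium", "skiing",   "design",     "single",  "twenties", "no",  "highRisk"),
    ("high",   "golf",     "trading",    "married", "forties",  "yes", "lowRisk"),
    ("low",    "speedway", "transport",  "married", "thirties", "yes", "medRisk"),
    ("medium", "football", "banking",    "single",  "thirties", "yes", "lowRisk"),
    ("high",   "flying",   "media",      "married", "fifties",  "yes", "highRisk"),
    ("low",    "football", "security",   "single",  "twenties", "no",  "medRisk"),
    ("medium", "golf",     "media",      "single",  "thirties", "yes", "medRisk"),
    ("medium", "golf",     "transport",  "married", "forties",  "yes", "lowRisk"),
    ("high",   "skiing",   "banking",    "single",  "thirties", "yes", "highRisk"),
    ("low",    "golf",     "unemployed", "married", "forties",  "yes", "highRisk"),
]

# Unconditional probability: P(golf) = (#records with golf) / (#records)
num_golf = sum(1 for r in records if r[1] == "golf")
p_golf = Fraction(num_golf, len(records))

# Conditional probability: P(single | medRisk) = #(single and medRisk) / #(medRisk)
med_risk = [r for r in records if r[6] == "medRisk"]
num_single_med = sum(1 for r in med_risk if r[3] == "single")
p_single_given_med = Fraction(num_single_med, len(med_risk))

print("P(golf) =", float(p_golf))
print("P(single | medRisk) =", float(p_single_given_med))
```

The counts obtained this way (4 golf records out of 10, and 2 single records among the 3 medRisk records) match the constants used in the source code that follows.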
SOURCE CODE:
total_Records=10
numGolfRecords=4
unConditionalprobGolf=numGolfRecords / total_Records
print("Unconditional probability of golf: ={}".format(unConditionalprobGolf))
#conditional probability of 'single' given 'medRisk'
numMedRiskSingle=2
numMedRisk=3
probMedRiskSingle=numMedRiskSingle/total_Records
probMedRisk=numMedRisk/total_Records
conditionalProb=(probMedRiskSingle/probMedRisk)
print("Conditional probability of single given medRisk: = {}".format(conditionalProb))
OUTPUT:
Unconditional probability of golf: =0.4
Conditional probability of single given medRisk: = 0.6666666666666667
PROBLEM STATEMENT 6
Implement linear regression using Python.
SOURCE CODE:
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color="g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show plot
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()
OUTPUT: