Case Study - Classifier
NAME - AJINKYA KSHIRSAGAR
PRN - 19030141005
COURSE - Applied Data Analytics with Python
SEM - 3
Assignment No: 3
Title of the Assignment:-
Case Study : Machine Learning : Classifiers
Instructions
1. Select any data set of your own choice
2. Use the following classifiers for data prediction:
a. SVM
b. K-NN
c. K-means clustering
d. Decision Tree classifier
3. Do a comparative study and discuss the predictions from these classifiers.
CASE STUDY
Choosing the right estimator for Machine Learning
INTRODUCTION
Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial
intelligence.
Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or
decisions without being explicitly programmed to do so.
Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or
infeasible to develop conventional algorithms to perform the needed tasks.
Example
A learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a
single number and, for instance, a multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.
Learning problems fall into a few categories:
1. Supervised learning
Learning in which the data comes with additional attributes that we want to predict.
This problem can be either:
Classification:
If the samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data. An
example of a classification problem would be handwritten digit recognition, in which the aim is to assign each input vector to one of a finite
number of discrete categories. Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning,
where one has a limited number of categories and, for each of the n samples provided, one tries to label them with the correct category or
class.
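For instance, a minimal classification sketch using scikit-learn's bundled digits dataset (illustrative only; this is not the dataset used in the case study below):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()  # 1797 images of handwritten digits, 64 pixel features each
X_tr, X_te, y_tr, y_te = train_test_split(digits.data, digits.target, test_size=0.2, random_state=0)
clf = SVC().fit(X_tr, y_tr)   # learn a mapping from pixels to digit labels
print(clf.score(X_te, y_te))  # accuracy on digits the model has never seen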
Regression:
If the desired output consists of one or more continuous variables, then the task is called regression. An example of a regression problem
would be the prediction of the length of a salmon as a function of its age and weight.
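A hedged sketch of the salmon example with synthetic numbers (the ranges and coefficients below are made up purely for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
age = rng.uniform(1, 5, 100)         # years (synthetic)
weight = rng.uniform(0.5, 6.0, 100)  # kilograms (synthetic)
length = 10 + 8 * age + 3 * weight + rng.normal(0, 1, 100)  # centimetres

X_fish = np.column_stack([age, weight])
reg = LinearRegression().fit(X_fish, length)
print(reg.predict([[3.0, 2.5]]))  # a continuous output, not a class label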
2. Unsupervised learning
In which the training data consists of a set of input vectors x without any corresponding target values.
The goal in such problems may be:
Clustering:
To discover groups of similar examples within the data; this task is called clustering.
Density estimation:
To determine the distribution of data within the input space; this is known as density estimation.
Visualization:
To project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization; a sketch covering clustering and visualization follows.
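A minimal unsupervised sketch covering clustering and visualization (the iris measurements stand in as unlabeled data; the choice of dataset and of three clusters is an assumption for illustration):

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X_unlabeled = load_iris().data  # the target values are never used
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X_unlabeled)   # clustering
X_2d = PCA(n_components=2).fit_transform(X_unlabeled)  # 4-D down to 2-D for plotting
print(labels[:10], X_2d.shape)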
CHOOSING THE RIGHT ESTIMATOR
[Figure: scikit-learn's 'Choosing the right estimator' flowchart]
IMPLEMENTATION PART -- DATASET USED: Bill Authentication
PREPROCESSING PART
1. Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
2. Importing Dataset
DF= pd.read_csv("/content/bill_authentication.csv")
DF.head()
Variance Skewness Curtosis Entropy V1 V2 Class
0 3.62160 8.6661 -2.8073 -0.44699 2.072345 -3.241693 0
1 4.54590 8.1674 -2.4586 -1.46210 17.936710 15.784810 0
2 3.86600 -2.6383 1.9242 0.10645 1.083576 7.319176 0
3 3.45660 9.5228 -4.0112 -3.59440 11.120670 14.406780 0
4 0.32924 -4.4552 4.5718 -0.98880 23.711550 2.557729 0
3. Target & Predictor Variables
# Features: drop the target column 'Class' and the two extra columns V1, V2
X = DF.drop(['Class', 'V1', 'V2'], axis=1)
# Target: the banknote class label
Y = DF['Class']
4. Splitting into Train & Test
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing
X_train, x_test, Y_train, y_test = train_test_split(X, Y, test_size = 0.20)
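The split above is unseeded, so every run draws a different test set and the metrics below vary slightly from run to run. A hedged variant (the seed value and the use of stratify are assumptions, not part of the original run):

# random_state fixes the shuffle so reported metrics are repeatable;
# stratify keeps the class ratio the same in train and test
X_train, x_test, Y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=42, stratify=Y)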
CLASSIFIER PART
1. Support Vector Machine (SVM) Classifier
from sklearn.svm import SVC
SVM_Classifier = SVC(kernel='linear')
SVM_Classifier.fit(X_train, Y_train)
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
y_pred = SVM_Classifier.predict(x_test)
from sklearn.metrics import classification_report, confusion_matrix
print('Support Vector Machine : \n', classification_report(y_test, y_pred))
Support Vector Machine :
precision recall f1-score support
0 0.97 1.00 0.99 149
1 1.00 0.97 0.98 126
accuracy 0.99 275
macro avg 0.99 0.98 0.99 275
weighted avg 0.99 0.99 0.99 275
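confusion_matrix is imported above but never displayed; a short sketch that prints it for the SVM predictions:

# Rows are true classes, columns are predicted classes; the off-diagonal
# entries are the few misclassified notes behind the 0.99 accuracy above.
print('Confusion matrix:\n', confusion_matrix(y_test, y_pred))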
2. K-Nearest Neighbour (KNN) Classifier
from sklearn.neighbors import KNeighborsClassifier
# k = 1: each test point takes the label of its single nearest neighbour
KNN_Classifier = KNeighborsClassifier(n_neighbors = 1)
KNN_Classifier.fit(X_train, Y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=1, p=2,
weights='uniform')
y_pred = KNN_Classifier.predict(x_test)
from sklearn.metrics import classification_report, confusion_matrix
print('K-Nearest Neighbour : \n',classification_report(y_test,y_pred))
K-Nearest Neighbour :
precision recall f1-score support
0 1.00 1.00 1.00 149
1 1.00 1.00 1.00 126
accuracy 1.00 275
macro avg 1.00 1.00 1.00 275
weighted avg 1.00 1.00 1.00 275
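With n_neighbors = 1 the model memorizes the training set and can overfit noisier data. A hedged sketch of choosing k by 5-fold cross-validation on the training split (the candidate values of k are an assumption):

from sklearn.model_selection import cross_val_score

# Mean cross-validated accuracy for a few odd values of k
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k)
    print(k, cross_val_score(knn, X_train, Y_train, cv=5).mean().round(4))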
3. K-Means Clustering
# Visualize the two extra columns V1 and V2 as a 2-D scatter
f1 = DF['V1'].values
f2 = DF['V2'].values
z = np.array(list(zip(f1, f2)))
plt.scatter(f1, f2, c='black', s=7)
[Scatter plot of V1 vs. V2]
# Euclidean distance helper (defined for a manual K-means walk-through;
# not used further below)
def dist(a, b, ax=1):
    return np.linalg.norm(a - b, axis=ax)

# Pick k = 2 random starting centroids inside the data range
k = 2
C_x = np.random.randint(0, np.max(z)-20, size=k)
C_y = np.random.randint(0, np.max(z)-20, size=k)
C = np.array(list(zip(C_x, C_y)), dtype=np.float32)
print("Initial Centroids")
print(C)
Initial Centroids
[[11. 34.]
[36. 47.]]
plt.scatter(f1, f2, c='#050505', s=7)
plt.scatter(C_x, C_y, marker='*', s=600, c='g')
[Scatter plot with the two initial centroids marked as green stars]
# Fit scikit-learn's K-means with two clusters on the full feature set
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
from sklearn.metrics import classification_report
# Predict cluster IDs for the held-out rows. K-means labels are arbitrary,
# so this report is only meaningful if cluster 0 happens to line up with class 0.
y_pred = kmeans.predict(x_test)
print('K-Means : \n', classification_report(y_test, y_pred))
K-Means :
precision recall f1-score support
0 1.00 1.00 1.00 149
1 1.00 1.00 1.00 126
accuracy 1.00 275
macro avg 1.00 1.00 1.00 275
weighted avg 1.00 1.00 1.00 275
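A caveat on the report above: K-means cluster IDs are arbitrary (cluster 0 need not correspond to class 0), so comparing raw cluster IDs against y_test can flip between runs. A hedged sketch that maps each cluster to the majority true class among its test members before scoring (the majority-vote mapping is an assumption, not part of the original notebook):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

km = KMeans(n_clusters=2, n_init=10).fit(X_train)
clusters = km.predict(x_test)
mapped = np.zeros_like(clusters)
for c in range(2):
    mask = clusters == c
    if mask.any():
        mapped[mask] = np.bincount(y_test[mask]).argmax()  # majority vote inside cluster c
print('K-Means accuracy after label alignment:', accuracy_score(y_test, mapped))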
4. Decision Tree Classifier
from sklearn import tree
Decision_Tree_Classifier = tree.DecisionTreeClassifier()
# Note: fitted on the full dataset (X, Y), so the test rows used below are seen during training
Decision_Tree_Classifier = Decision_Tree_Classifier.fit(X, Y)
tree.plot_tree(Decision_Tree_Classifier)
[Decision tree plot: root split on X[0] <= 0.32 (gini = 0.494, 1372 samples, value = [762, 610]); the tree grows until every leaf is pure (gini = 0.0)]
# Re-draw the tree and save the figure to disk
tree.plot_tree(Decision_Tree_Classifier)
plt.savefig('DTImage')
y_pred = Decision_Tree_Classifier.predict(x_test)
from sklearn.metrics import classification_report, confusion_matrix
print('Decision Tree : \n',classification_report(y_test,y_pred))
Decision Tree :
precision recall f1-score support
0 1.00 1.00 1.00 149
1 1.00 1.00 1.00 126
accuracy 1.00 275
macro avg 1.00 1.00 1.00 275
weighted avg 1.00 1.00 1.00 275
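Note that the tree above was fitted on the full X and Y, so x_test was already seen at training time and the perfect scores are optimistic. A hedged re-fit on the training split only, for a fairer comparison with SVM and KNN (random_state is an assumption, added for repeatability):

from sklearn import tree
from sklearn.metrics import classification_report

# Fit on the training rows only, then score on genuinely unseen rows
DT = tree.DecisionTreeClassifier(random_state=0)
DT.fit(X_train, Y_train)
print('Decision Tree (train-only fit):\n', classification_report(y_test, DT.predict(x_test)))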
COMPARISON ANALYSIS
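On this split, the reports above put the linear SVM at 0.99 accuracy (class 1 recall 0.97, i.e. a few class-1 notes missed out of 126) and KNN at 1.00. The decision tree's 1.00 should be read with the leakage caveat from the previous section, and the K-means figures with the label-alignment caveat: as an unsupervised method it never sees the class labels, so it is not directly comparable to the three classifiers. Overall, the banknote features separate the two classes almost perfectly, so all supervised models sit near the ceiling and differences of about a percentage point are within split-to-split noise. A hedged side-by-side re-run on the same split (model settings as used above; random_state is an assumption):

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Fit each supervised model on the same training split and score on the
# same test split; K-means is omitted because its cluster IDs are not
# class predictions.
models = {
    'SVM (linear kernel)': SVC(kernel='linear'),
    'KNN (k = 1)': KNeighborsClassifier(n_neighbors=1),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_train, Y_train)
    print(name, ':', round(accuracy_score(y_test, model.predict(x_test)), 4))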