Exercise 5 : Classification Tree
Essential Libraries
Let us begin by importing the essential Python Libraries.
NumPy : Library for Numeric Computations in Python
Pandas : Library for Data Acquisition and Preparation
Matplotlib : Low-level library for Data Visualization
Seaborn : Higher-level library for Data Visualization
# %matplotlib inline is a magic command that renders Matplotlib plots
# directly below the cell that produces them, inside the notebook,
# rather than in a separate output window.
# It can be omitted in recent versions of Jupyter Notebook,
# where "inline" is already the default backend.
%matplotlib inline
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics
Setup : Import the Dataset
Dataset from Kaggle : The "House Prices" competition
Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
The dataset is train.csv; hence we use the read_csv function from Pandas.
Immediately after importing, take a quick look at the data using the head function.
houseData = pd.read_csv('train.csv')
houseData.head()
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg
1   2          20       RL         80.0     9600   Pave   NaN      Reg
2   3          60       RL         68.0    11250   Pave   NaN      IR1
3   4          70       RL         60.0     9550   Pave   NaN      IR1
4   5          60       RL         84.0    14260   Pave   NaN      IR1

  LandContour Utilities  ...  PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...         0    NaN   NaN         NaN       0      2
1         Lvl    AllPub  ...         0    NaN   NaN         NaN       0      5
2         Lvl    AllPub  ...         0    NaN   NaN         NaN       0      9
3         Lvl    AllPub  ...         0    NaN   NaN         NaN       0      2
4         Lvl    AllPub  ...         0    NaN   NaN         NaN       0     12

   YrSold SaleType SaleCondition  SalePrice
0    2008       WD        Normal     208500
1    2007       WD        Normal     181500
2    2008       WD        Normal     223500
3    2006       WD       Abnorml     140000
4    2008       WD        Normal     250000

[5 rows x 81 columns]
Problem 1 : Predicting CentralAir using SalePrice
Explore the variable CentralAir from the dataset, as mentioned in the problem.
houseData['CentralAir'].describe()
count 1460
unique 2
top Y
freq 1365
Name: CentralAir, dtype: object
Check the catplot for CentralAir, to visually understand the distribution.
sb.catplot(y = 'CentralAir', data = houseData, kind = "count")
<seaborn.axisgrid.FacetGrid at 0x141fd7475e0>
Note that the two levels of CentralAir, namely Y and N, are drastically imbalanced: 1365 of the
1460 houses (about 93%) have CentralAir = Y. This is not a good situation for a classification
problem, because a model can reach a high accuracy simply by favoring the majority class.
Balanced classes are desirable, and there are several methods to rebalance imbalanced classes,
or to get useful classification results despite the imbalance. If you are interested, check out the
following article.
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
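To make the effect of the imbalance concrete, here is a minimal sketch (not part of the original exercise) that computes the accuracy of a trivial model which always predicts the majority class. It rebuilds the CentralAir column from the counts reported by describe() above (1365 Y out of 1460), so it runs without train.csv; the variable names are illustrative.

```python
import pandas as pd

# Rebuild the CentralAir column from the counts reported by describe():
# 1460 rows in total, of which 1365 are 'Y' (the remaining 95 are 'N').
central_air = pd.Series(['Y'] * 1365 + ['N'] * 95, name='CentralAir')

# Relative frequency of each class
class_share = central_air.value_counts(normalize=True)
print(class_share)

# A model that always predicts the most frequent class is right
# exactly as often as that class occurs.
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()
print("Majority class    :", majority_class)
print("Baseline accuracy :", round(baseline_accuracy, 4))
```

Any classifier for CentralAir is best judged against this baseline of roughly 0.935 rather than against 0.5.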
Plot CentralAir against SalePrice to visualize their mutual relationship.
f, axes = plt.subplots(1, 1, figsize=(16, 8))
sb.boxplot(x = 'SalePrice', y = 'CentralAir', data = houseData)
<AxesSubplot: xlabel='SalePrice', ylabel='CentralAir'>
Note that the two boxplots of SalePrice, for CentralAir = Y and CentralAir = N, differ
clearly in both their median value and their spread. This means that SalePrice and CentralAir
are related, and hence SalePrice will probably be a useful variable for predicting CentralAir.
Boxplots do not tell us where to make the cuts, though; that is easier to see in the following
swarmplot.
f, axes = plt.subplots(1, 1, figsize=(30, 15))
sb.swarmplot(x = 'SalePrice', y = 'CentralAir', data = houseData)
<AxesSubplot: xlabel='SalePrice', ylabel='CentralAir'>
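A decision tree chooses its cut by trying candidate thresholds and keeping the one that leaves the purest groups. The sketch below illustrates the idea with Gini impurity (the default criterion of DecisionTreeClassifier) on a tiny sample. The prices and labels are invented for illustration and are not taken from train.csv, and the helper names gini and weighted_gini are hypothetical.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_k^2)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(x, y, threshold):
    """Impurity after splitting on x <= threshold, weighted by group size."""
    left, right = y[x <= threshold], y[x > threshold]
    n = len(y)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical toy data: sale prices (in thousands) and CentralAir labels
price = np.array([60, 80, 95, 110, 130, 150, 180, 220])
air = np.array(['N', 'N', 'Y', 'N', 'Y', 'Y', 'Y', 'Y'])

# Candidate cuts: midpoints between consecutive (already sorted) prices
candidates = (price[:-1] + price[1:]) / 2
scores = [weighted_gini(price, air, t) for t in candidates]
best_cut = candidates[int(np.argmin(scores))]

for t, s in zip(candidates, scores):
    print(f"cut at {t:6.1f} -> weighted Gini {s:.3f}")
print("Best cut:", best_cut)
```

On this toy sample the lowest weighted impurity is reached at the cut that isolates the group containing only Y houses; the real tree performs the same search over SalePrice.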
Now it's time to build the Decision Tree classifier. Import the DecisionTreeClassifier
model from sklearn.tree.
# Import Decision Tree Classifier model from Scikit-Learn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
# Create a Decision Tree Classifier object
dectree = DecisionTreeClassifier(max_depth = 2)
Split the data into Train and Test sets:
a Train Set with 1100 samples and a Test Set with 360 samples.
# Extract Response and Predictors
y = pd.DataFrame(houseData['CentralAir'])
X = pd.DataFrame(houseData['SalePrice'])
# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)
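Because the split is random, every run of the cell above produces a different Train and Test set. As a hedged variation, the sketch below passes random_state (to make the split repeatable) and stratify (to keep the Y/N ratio the same in both parts) to train_test_split. Neither argument is used in the original exercise, and the data here is a stand-in with the same class counts as CentralAir, not the real file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data with the same class counts as CentralAir (1365 Y, 95 N);
# the SalePrice values are random placeholders, not the real prices.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'SalePrice': rng.integers(35000, 755000, size=1460),
    'CentralAir': ['Y'] * 1365 + ['N'] * 95,
})

X_demo = demo[['SalePrice']]
y_demo = demo['CentralAir']

# random_state fixes the shuffle; stratify keeps the Y/N ratio the same
# in both parts. Both are assumptions added for this illustration.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=360, random_state=42, stratify=y_demo)

print("Train size:", len(X_tr), " share of Y:", round((y_tr == 'Y').mean(), 3))
print("Test size :", len(X_te), " share of Y:", round((y_te == 'Y').mean(), 3))
```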
Train the Decision Tree Classifier model dectree using the Train Set.
# Train the Decision Tree Classifier model
dectree.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=2)
Visual Representation of the Decision Tree Model
Method 1 :
Export the Decision Tree as a dot file using export_graphviz, and visualize.
For Windows 10 and 11:
1. Download and install graphviz from here
2. In Anaconda prompt, run conda install graphviz and conda install python-graphviz.
3. In the following code, set the path 'C:/Program Files (x86)/Graphviz2.38/bin/' according
to where you installed graphviz.
For macOS:
1. In Anaconda prompt, run conda install graphviz and conda install python-graphviz.
# Import export_graphviz from sklearn.tree
from sklearn.tree import export_graphviz
# Export the Decision Tree as a dot object
treedot = export_graphviz(dectree,                                   # the model
                          feature_names = X_train.columns.tolist(),  # the features
                          out_file = None,                           # return the dot source as a string
                          filled = True,                             # node colors
                          rounded = True,                            # rounded node boxes
                          special_characters = True)                 # do not strip special characters
# Render using graphviz
#import os
#os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'
import graphviz
graphviz.Source(treedot)
Method 2:
Some of you may run into problems with the graphviz package. These can be hard to resolve, as
they may stem from the configuration of your OS or from other causes. As such, I have provided
an alternative method that uses plot_tree() from the module sklearn.tree to visualize the
decision tree. As sklearn comes with Anaconda, you do not need to install additional packages.
The function plot_tree is less flexible, but it suffices for our purposes. You may choose to use
either method.
from sklearn.tree import plot_tree
fig, ax = plt.subplots(figsize=(12, 12))
out = plot_tree(dectree,
                feature_names = X_train.columns.tolist(),
                class_names = [str(x) for x in dectree.classes_],
                filled=True)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(3)
plt.show()
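If neither plotting route works, a fitted tree can also be printed as plain text with export_text from sklearn.tree. The sketch below is a minimal illustration on made-up data (the prices and labels are hypothetical, generated so that cheaper houses tend to lack central air); with the real model you would pass dectree and X_train.columns.tolist() instead.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical stand-in data: cheaper houses tend to lack central air
rng = np.random.default_rng(1)
price = rng.integers(40000, 400000, size=300)
air = np.where(price + rng.normal(0, 30000, size=300) > 100000, 'Y', 'N')

X_demo = pd.DataFrame({'SalePrice': price})
y_demo = pd.Series(air, name='CentralAir')

tree_demo = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_demo.fit(X_demo, y_demo)

# export_text returns the split rules as an indented string
rules = export_text(tree_demo, feature_names=X_demo.columns.tolist())
print(rules)
```

Each indented line is one split on SalePrice, and each leaf line shows the predicted class.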
Goodness of Fit of the Model
Check how good the predictions are on the Train Set.
Metrics : Classification Accuracy and Confusion Matrix.
# Predict the Response corresponding to Predictors
y_train_pred = dectree.predict(X_train)
# Print the Classification Accuracy
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
# Plot the two-way Confusion Matrix
from sklearn.metrics import confusion_matrix
sb.heatmap(confusion_matrix(y_train, y_train_pred),
annot = True, fmt=".0f", annot_kws={"size": 18})
Classification Accuracy : 0.9381818181818182
<AxesSubplot: >
Prediction of Response based on the Predictor
Check how good the predictions are on the Test Set.
Metrics : Classification Accuracy and Confusion Matrix.
# Predict the Response corresponding to Predictors
y_test_pred = dectree.predict(X_test)
# Print the Classification Accuracy
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
# Plot the two-way Confusion Matrix
from sklearn.metrics import confusion_matrix
sb.heatmap(confusion_matrix(y_test, y_test_pred),
annot = True, fmt=".0f", annot_kws={"size": 18})
Classification Accuracy : 0.9527777777777777
<AxesSubplot: >
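Important : Note the huge imbalance between the False Positives and False Negatives in the
confusion matrix, taking CentralAir = Y as the positive class. In case of Training Data, False
Positives = 58 whereas False Negatives = 8. In case of Test Data, False Positives = 16 whereas
False Negatives = 3. These counts come from one particular random split; since
train_test_split is called without a fixed random_state, they (and the accuracies printed
above) will change slightly from run to run. The imbalance itself is not surprising: it is a direct
effect of the huge Y vs N imbalance in the CentralAir variable. As CentralAir = Y is far more
common in the data, the tree predicts Y for most houses, so most of its errors are N houses
wrongly predicted as Y, that is, False Positives.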
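The False Positive and False Negative counts can also be extracted from the confusion matrix in code rather than read off the heatmap. The sketch below uses a handful of made-up labels (not results from the exercise) and fixes the label order explicitly, so that N is the negative class and Y the positive class.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (8 houses with air, 2 without)
y_true = ['Y'] * 8 + ['N'] * 2
y_pred = ['Y'] * 7 + ['N'] + ['Y', 'N']

# Fix the label order so that 'N' is the negative and 'Y' the positive class.
# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_pred, labels=['N', 'Y'])
tn, fp, fn, tp = cm.ravel()

tpr = tp / (tp + fn)   # share of true Y houses predicted as Y
fpr = fp / (fp + tn)   # share of true N houses predicted as Y

print(cm)
print("TN, FP, FN, TP      :", tn, fp, fn, tp)
print("True Positive Rate  :", tpr)
print("False Positive Rate :", fpr)
```

With imbalanced classes, the two rates are usually more informative than the raw counts, because each rate is normalized by the size of its own class.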
Problem 2 : Predicting CentralAir using Other Variables
Use the other variables from the dataset to predict CentralAir, as mentioned in the problem.
# Import essential models and functions from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
# Extract Response and Predictors
y = pd.DataFrame(houseData['CentralAir'])
X = pd.DataFrame(houseData['GrLivArea'])
# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)
# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 2)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model
# Predict Response corresponding to Predictors
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)
# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()
# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()
# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])
# Plot the Decision Tree
from sklearn.tree import plot_tree
fig, ax = plt.subplots(figsize=(12, 12))
out = plot_tree(dectree,
                feature_names = X_train.columns.tolist(),
                class_names = [str(x) for x in dectree.classes_],
                filled=True)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(3)
plt.show()
Goodness of Fit of Model Train Dataset
Classification Accuracy : 0.9336363636363636
Goodness of Fit of Model Test Dataset
Classification Accuracy : 0.95
# Import essential models and functions from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
# Extract Response and Predictors
y = pd.DataFrame(houseData['CentralAir'])
X = pd.DataFrame(houseData['LotArea'])
# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)
# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 2)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model
# Predict Response corresponding to Predictors
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)
# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()
# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()
# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])
# Plot the Decision Tree
from sklearn.tree import plot_tree
fig, ax = plt.subplots(figsize=(12, 12))
out = plot_tree(dectree,
                feature_names = X_train.columns.tolist(),
                class_names = [str(x) for x in dectree.classes_],
                filled=True)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(3)
plt.show()
Goodness of Fit of Model Train Dataset
Classification Accuracy : 0.9336363636363636
Goodness of Fit of Model Test Dataset
Classification Accuracy : 0.9388888888888889
# Import essential models and functions from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
# Extract Response and Predictors
y = pd.DataFrame(houseData['CentralAir'])
X = pd.DataFrame(houseData['TotalBsmtSF'])
# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)
# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 2)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model
# Predict Response corresponding to Predictors
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)
# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()
# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()
# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])
# Plot the Decision Tree
from sklearn.tree import plot_tree
fig, ax = plt.subplots(figsize=(12, 12))
out = plot_tree(dectree,
                feature_names = X_train.columns.tolist(),
                class_names = [str(x) for x in dectree.classes_],
                filled=True)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(3)
plt.show()
Goodness of Fit of Model Train Dataset
Classification Accuracy : 0.9318181818181818
Goodness of Fit of Model Test Dataset
Classification Accuracy : 0.9444444444444444
Now that you have obtained a Decision Tree for CentralAir on each of the four variables
SalePrice, GrLivArea, LotArea, TotalBsmtSF, compare the Classification Accuracy (and
other accuracy measures, such as the False Positives and False Negatives) to determine which
model is the best for predicting CentralAir. What do you think?
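One way to organize this comparison is to wrap the repeated steps in a function and collect the metrics in a table. The sketch below is an illustration under stated assumptions: evaluate_predictor is a hypothetical helper, the data is a synthetic stand-in for houseData with only two columns, and the split uses test_size=0.25 with random_state and stratify, which differ from the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

def evaluate_predictor(data, predictor, response='CentralAir', seed=0):
    """Fit a depth-2 tree on one predictor and return test-set metrics."""
    X = data[[predictor]]
    y = data[response]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y)
    tree = DecisionTreeClassifier(max_depth=2, random_state=seed)
    tree.fit(X_train, y_train)
    # Rows are true labels, columns are predicted labels, 'Y' is positive
    cm = confusion_matrix(y_test, tree.predict(X_test), labels=['N', 'Y'])
    tn, fp, fn, tp = cm.ravel()
    return {'predictor': predictor,
            'accuracy': (tp + tn) / cm.sum(),
            'TPR': tp / (tp + fn),
            'FPR': fp / (fp + tn)}

# Synthetic stand-in for houseData: CentralAir depends on SalePrice only
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    'SalePrice': rng.normal(180000, 60000, n),
    'LotArea': rng.normal(10000, 3000, n),
})
demo['CentralAir'] = np.where(
    demo['SalePrice'] + rng.normal(0, 40000, n) > 100000, 'Y', 'N')

results = pd.DataFrame(
    [evaluate_predictor(demo, col) for col in ['SalePrice', 'LotArea']])
print(results)
```

With the real data, the same call would take houseData and loop over SalePrice, GrLivArea, LotArea and TotalBsmtSF. Using the same seed for every predictor keeps the comparison on identical Train and Test rows.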
Extra : Predicting CentralAir using All Variables
Use all four variables SalePrice, GrLivArea, LotArea and TotalBsmtSF together to predict
CentralAir, as mentioned in the problem.
# Import essential models and functions from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
# Extract Response and Predictors
y = pd.DataFrame(houseData['CentralAir'])
X = pd.DataFrame(houseData[['SalePrice', 'GrLivArea', 'LotArea',
'TotalBsmtSF']])
# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)
# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 4)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model
# Predict Response corresponding to Predictors
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)
# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()
# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()
# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])
# Plot the Decision Tree
from sklearn.tree import plot_tree
fig, ax = plt.subplots(figsize=(48, 6))
out = plot_tree(dectree,
                feature_names = X_train.columns.tolist(),
                class_names = [str(x) for x in dectree.classes_],
                filled=True)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(3)
plt.show()
Goodness of Fit of Model Train Dataset
Classification Accuracy : 0.9472727272727273
Goodness of Fit of Model Test Dataset
Classification Accuracy : 0.9444444444444444
Now that you have obtained a Decision Tree for CentralAir using all four variables
SalePrice, GrLivArea, LotArea, TotalBsmtSF together, compare the initial position of the
variables in the tree (the level at which each variable appears for the first time) and the
number of times each variable is used, to determine which variable is the most important for
predicting CentralAir. What do you think?
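Besides reading the plot, a fitted tree exposes this information numerically: feature_importances_ gives the impurity-based importance of each variable, and tree_.feature lists the variable used at every node. The sketch below demonstrates both on made-up data in which one column (signal) fully determines the class and the other (noise) is irrelevant; the column names and data are hypothetical. With the real model you would use dectree and X_train.columns.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: 'signal' fully determines the class, 'noise' is irrelevant
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'signal': rng.normal(size=500),
    'noise': rng.normal(size=500),
})
y_demo = np.where(X_demo['signal'] > 0, 'Y', 'N')

tree_demo = DecisionTreeClassifier(max_depth=4, random_state=0)
tree_demo.fit(X_demo, y_demo)

# Impurity-based importance of each feature (the values sum to 1)
importances = pd.Series(tree_demo.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))

# Number of internal nodes that split on each feature
# (leaf nodes carry a negative feature index and are filtered out)
split_features = tree_demo.tree_.feature
split_counts = pd.Series(split_features[split_features >= 0]).map(
    lambda i: X_demo.columns[i]).value_counts()
print(split_counts)
```

A variable that appears near the root and is reused often will typically also have a high importance value, so the numbers are a useful cross-check on what you read from the plotted tree.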