Decision Tree
A decision tree uses a tree-like model of decisions and their consequences to arrive at a prediction. The data is split recursively according to a
chosen feature and threshold until a decision is made at a leaf.
Supervised Learning
Non-parametric
For both classification and regression
Basis of Random Forests
Interpretable (they can be read as a series of questions / if-else statements, as sketched below)
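As a minimal sketch of that interpretability (the thresholds here are illustrative, not learned from data), a small iris tree can be written out as nested if-else statements:
In [ ]: # Illustrative only: a hand-written decision "tree" for iris as nested if-else rules.
# The thresholds are made up for the example, not taken from a fitted model.
def classify_iris(petal_length, petal_width):
    if petal_length <= 2.45:        # root node question
        return "setosa"             # leaf
    if petal_width <= 1.75:         # decision node
        return "versicolor"         # leaf
    return "virginica"              # leaf

print(classify_iris(1.4, 0.2))   # -> setosa
print(classify_iris(5.5, 2.1))   # -> virginica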
In [ ]: import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
In [ ]: iris = load_iris()
Iris Dataset
The Iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from
each of 3 species of iris. The species are Iris Setosa, Versicolor, and Virginica. The data set has 150 cases (rows) and 5 variables (columns) named
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
In [ ]: iris.feature_names
In [ ]: iris.data[0:5]
In [ ]: iris.target_names
In [ ]: iris.target
In [ ]: import matplotlib.pyplot as plt
plt.scatter(iris.data[0:50, 2], iris.data[0:50, 3], label="Setosa", marker="x", color="blue")
plt.scatter(iris.data[50:100, 2], iris.data[50:100, 3], label="Versicolor", marker="s", color="red")
plt.scatter(iris.data[100:150, 2], iris.data[100:150, 3], label="Virginica", color="green")
plt.xlabel('Petal length (cm)')
plt.ylabel('Petal width (cm)')
plt.legend()
plt.show()
In [ ]: from sklearn.model_selection import train_test_split
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)
In [ ]: np.shape(X_train)
Decision Tree Algorithm
TERMS :
Root Node: The topmost node in a tree.
Leaf Node: A node that does not split further is called a leaf or terminal node.
Splitting: The process of dividing a node into two or more sub-nodes.
Parent and Child Node: A node that is divided into sub-nodes is called the parent of those sub-nodes; the sub-nodes are its children.
Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
These terms are illustrated in the short sketch below.
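As a quick illustration (a minimal sketch using only the two petal features), scikit-learn's export_text prints a fitted tree as indented rules: the first condition is the root node, nested conditions that split again are decision nodes, and lines ending in class: are leaf nodes.
In [ ]: from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text
iris_demo = load_iris()
# fit a shallow tree on petal length and petal width only
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(iris_demo.data[:, 2:], iris_demo.target)
# root node = first condition, decision nodes = nested conditions, leaves = "class: ..." lines
print(export_text(demo_tree, feature_names=iris_demo.feature_names[2:]))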
Gini Impurity (the default splitting criterion in scikit-learn)
A measure of purity / variability of categorical data
Developed by Corrado Gini in 1912
Key Points:
A pure node (homogeneous contents or samples with the same class) will have a Gini coefficient of zero
As the variation increases (heterogeneous classes or greater diversity), the Gini coefficient increases and approaches 1.
$\text{Gini} = 1 - \sum_{j=1}^{r} p_j^2$
where $p_j$ is the probability of class $j$ (often estimated from the frequency table) and $r$ is the number of classes.
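As a worked check of the formula (the class counts here are made up for illustration):
In [ ]: import numpy as np
# a node holding 40 setosa, 35 versicolor and 25 virginica samples (made-up counts)
counts = np.array([40, 35, 25])
p = counts / counts.sum()
print(1 - np.sum(p ** 2))               # 0.655 -> fairly mixed node
print(1 - np.sum(np.array([1.0]) ** 2))  # a pure node (single class) has Gini = 0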
Information Gain
Expected reduction in entropy caused by splitting
Keep splitting until the resulting nodes are as close to homogeneous (single-class) as possible
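A small sketch of this idea with made-up counts: information gain is the entropy of the parent node minus the weighted entropy of the child nodes after a candidate split.
In [ ]: import numpy as np

def entropy(counts):
    """Shannon entropy (in bits) of a node, given its per-class sample counts."""
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

# hypothetical split: a 50/50 parent splits into 40/10 and 10/40 children
parent, left, right = [50, 50], [40, 10], [10, 40]
n = sum(parent)
children = (sum(left) / n) * entropy(left) + (sum(right) / n) * entropy(right)
print(entropy(parent) - children)   # ~0.278 bits of information gained by the split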
In [ ]:
Classification Using Decision Trees
Training a Decision Tree with Scikit-Learn Library
In [ ]: from sklearn.tree import DecisionTreeClassifier
dtclf = DecisionTreeClassifier()
In [ ]: dtclf= dtclf.fit(X_train, y_train)
In [ ]: #sudo apt-get install graphviz
#conda install -c conda-forge python-graphviz
#pip install graphviz
import graphviz
In [ ]: from sklearn.tree import export_graphviz
dot_data = export_graphviz(dtclf, out_file=None,
feature_names=iris.feature_names,
class_names=iris.target_names,
rounded=True,
filled=True)
In [ ]: graph = graphviz.Source(dot_data)
graph
In [ ]: #from sklearn.metrics import accuracy_score
#y_pred = dtclf.predict(X_test)
#accuracy_score(y_test, y_pred)
In [ ]:
Visualise the Decision Boundary
In [ ]: import seaborn as sns
sns.set_style('whitegrid')
In [ ]: df = sns.load_dataset('iris')
df.head()
In [ ]: col = ['petal_length', 'petal_width']
X = df.loc[:, col]
In [ ]: species_to_num = {'setosa': 0,
'versicolor': 1,
'virginica': 2}
df['tmp'] = df['species'].map(species_to_num)
y = df['tmp']
In [ ]: X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11)
dtclf2 = DecisionTreeClassifier()
dtclf2 = dtclf2.fit(X_train, y_train)
In [ ]: h = 0.02  # mesh step size
# build the mesh over the two features: petal length (x-axis) and petal width (y-axis)
x_min, x_max = X_train.values[:, 0].min() - 1, X_train.values[:, 0].max() + 1
y_min, y_max = X_train.values[:, 1].min() - 1, X_train.values[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
In [ ]: z = dtclf2.predict(np.c_[xx.ravel(), yy.ravel()])
z = z.reshape(xx.shape)
fig = plt.figure(figsize=(16,10))
ax = plt.contourf(xx, yy, z, cmap = 'afmhot', alpha=0.3);
plt.scatter(X_train.values[:, 0], X_train.values[:, 1], c=y_train, s=80,
alpha=0.9, edgecolors='g');
In [ ]:
Overfitting
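An unconstrained tree keeps splitting until its leaves are (nearly) pure, so it can effectively memorise the training set. A quick check, reusing dtclf2 and the train/test split from above, is to compare training and test accuracy; a large gap suggests overfitting, which the grid search below tries to reduce by constraining max_depth and min_samples_leaf.
In [ ]: from sklearn.metrics import accuracy_score
# training accuracy of an unconstrained tree is typically (near) perfect,
# while test accuracy is usually lower -- the gap is a sign of overfitting
print("train:", accuracy_score(y_train, dtclf2.predict(X_train)))
print("test :", accuracy_score(y_test, dtclf2.predict(X_test)))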
In [ ]: from sklearn.model_selection import GridSearchCV
params ={'min_samples_leaf': list(range(5, 20)),
'max_depth':list(range(3, 8))}
grid_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=11), params, n_jobs=-1, verbose=0)
grid_search_cv.fit(X_train, y_train)
In [ ]: grid_search_cv.best_estimator_
In [ ]: # refit a tree with the hyper-parameters found by the grid search
bestDT = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=11)
bestDT.fit(X_train, y_train)
In [ ]: dot_data2 = export_graphviz(bestDT, out_file=None,
feature_names=iris.feature_names[2:],
class_names=iris.target_names,
rounded=True,
filled=True)
graph = graphviz.Source(dot_data2)
graph
In [ ]:
In [ ]: z = bestDT.predict(np.c_[xx.ravel(), yy.ravel()])
z = z.reshape(xx.shape)
fig = plt.figure(figsize=(16,10))
ax = plt.contourf(xx, yy, z, cmap = 'afmhot', alpha=0.3);
plt.scatter(X_train.values[:, 0], X_train.values[:, 1], c=y_train, s=80,
alpha=0.9, edgecolors='g');
In [ ]: from sklearn.metrics import accuracy_score
y_pred2 = bestDT.predict(X_test)
accuracy_score(y_test, y_pred2)
In [ ]: