Decision Trees for Data Scientists

The document describes decision trees, including their use for classification and regression. It discusses key concepts like nodes, splits, impurity measures and information gain. It then demonstrates building a decision tree model on the iris dataset using Scikit-Learn, and visualizing the tree and decision boundaries. Grid search is used to tune hyperparameters and reduce overfitting.

Decision Tree

A decision tree uses a tree-like model of decisions and their consequences to arrive at a prediction. The data is split repeatedly according to a chosen feature and threshold until a decision is made at a leaf.

Supervised Learning
Non-parametric
For both classification and regression
Basis of Random Forests
Interpretable (a tree can be read as a series of questions / if-else statements; see the sketch below)
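To make the interpretability point concrete, here is a minimal, hypothetical sketch of a small iris tree written out as plain if-else statements (the thresholds 2.45 and 1.75 are typical of what a depth-2 tree learns on the petal features, but treat them as illustrative):

In [ ]: def classify_iris(petal_length, petal_width):
    # hand-written equivalent of a depth-2 decision tree (illustrative thresholds)
    if petal_length <= 2.45:       # root split isolates setosa
        return "setosa"
    elif petal_width <= 1.75:      # second split separates the remaining two
        return "versicolor"
    else:
        return "virginica"

classify_iris(1.4, 0.2)   # -> 'setosa'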

In [ ]: import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

In [ ]: iris = load_iris()

Iris Dataset
The Iris data set gives the measurements, in centimeters, of sepal length, sepal width, petal length, and petal width for 50 flowers from each of 3 species of iris: Iris setosa, Iris versicolor, and Iris virginica. The data set has 150 cases (rows) and 5 variables (columns): sepal length, sepal width, petal length, petal width, and species.

In [ ]: iris.feature_names

In [ ]: iris.data[0:5]

In [ ]: iris.target_names

In [ ]: iris.target

In [ ]: # petal length vs. petal width, one marker style per species
plt.scatter(iris.data[0:50, 2], iris.data[0:50, 3], label="Setosa", marker="x", facecolor="blue")
plt.scatter(iris.data[50:100, 2], iris.data[50:100, 3], label="Versicolor", marker="s", facecolor="red")
plt.scatter(iris.data[100:150, 2], iris.data[100:150, 3], label="Virginica", facecolor="green")

plt.xlabel('Petal length (cm)')
plt.ylabel('Petal width (cm)')
plt.legend()
plt.show()

In [ ]: from sklearn.model_selection import train_test_split

X = iris.data
y = iris.target

# default test_size is 0.25, so 112 of the 150 samples go to training
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)

In [ ]: np.shape(X_train)

Decision Tree Algorithm

TERMS:

Root node: The topmost node in a tree.

Leaf Node: Nodes that do not split are called leaf or terminal nodes.

Splitting: The process of dividing a node into two or more sub-nodes.

Parent and Child Node: A node that is divided into sub-nodes is called the parent of those sub-nodes; the sub-nodes are its children.

Decision Node: A sub-node that splits into further sub-nodes is called a decision node.
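To tie these terms together, here is a sketch of a tiny tree (the same hypothetical petal splits as the if-else example above), with each node's role labelled:

petal length <= 2.45?            <- root node (also a decision node)
├── yes: setosa                  <- leaf / terminal node
└── no: petal width <= 1.75?     <- decision node (child of the root, parent of two leaves)
    ├── yes: versicolor          <- leaf
    └── no: virginica            <- leaf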

Gini Impurity (scikit-learn's default criterion)

A measure of purity / variability of categorical data

Developed by Corrado Gini in 1912

Key Points:

A pure node (homogeneous contents, i.e. samples all of the same class) will have a Gini impurity of zero
As variation increases (more heterogeneous classes, greater diversity), the Gini impurity increases and approaches 1.
$$\mathrm{Gini} = 1 - \sum_{j} p_j^{2}$$

where $p_j$ is the probability (relative frequency, often read off a frequency table) of class $j$ in the node
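As a quick sanity check on the formula, a minimal sketch of computing Gini impurity from raw class counts (the gini helper below is hypothetical, not part of scikit-learn):

In [ ]: def gini(counts):
    # Gini impurity from raw class counts, e.g. [50, 50, 50]
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

gini([50, 50, 50]), gini([50, 0, 0])   # balanced node -> ~0.667, pure node -> 0.0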

Information Gain
Expected reduction in entropy caused by a split
Keep splitting until each node's classes are as close to homogeneous as possible
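A minimal sketch of information gain using entropy, again with hypothetical helpers rather than scikit-learn functions:

In [ ]: def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # avoid log2(0) for absent classes
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    # parent: class counts before the split; children: list of counts after
    n = sum(sum(c) for c in children)
    weighted = sum(sum(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# splitting a 50/50 node into two pure children yields the maximum gain
information_gain([50, 50], [[50, 0], [0, 50]])   # -> 1.0 (bits)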


Classification Using Decision Trees

Training a Decision Tree with the Scikit-Learn Library

In [ ]: from sklearn.tree import DecisionTreeClassifier

dtclf = DecisionTreeClassifier()

In [ ]: dtclf = dtclf.fit(X_train, y_train)
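Once the tree is fitted, its learned structure can be inspected directly; get_depth and get_n_leaves are standard DecisionTreeClassifier methods:

In [ ]: # how deep did the unconstrained tree grow, and how many leaves did it end with?
dtclf.get_depth(), dtclf.get_n_leaves()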

In [ ]: # system package (Debian/Ubuntu): sudo apt-get install graphviz
# conda: conda install -c conda-forge python-graphviz
# pip: pip install graphviz

import graphviz

In [ ]: from sklearn.tree import export_graphviz

dot_data = export_graphviz(dtclf, out_file=None,
                           feature_names=iris.feature_names,
                           class_names=iris.target_names,
                           rounded=True,
                           filled=True)

In [ ]: graph = graphviz.Source(dot_data)
graph
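If Graphviz is not available, scikit-learn's built-in sklearn.tree.plot_tree renders the same structure with matplotlib:

In [ ]: from sklearn.tree import plot_tree

fig = plt.figure(figsize=(12, 8))
plot_tree(dtclf, feature_names=iris.feature_names,
          class_names=list(iris.target_names), filled=True, rounded=True)
plt.show()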

In [ ]: from sklearn.metrics import accuracy_score

y_pred = dtclf.predict(X_test)
accuracy_score(y_test, y_pred)


Visualise the Decision Boundary


In [ ]: import seaborn as sns
sns.set_style('whitegrid')

In [ ]: df = sns.load_dataset('iris')
df.head()

In [ ]: col = ['petal_length', 'petal_width']
X = df.loc[:, col]

In [ ]: species_to_num = {'setosa': 0,
'versicolor': 1,
'virginica': 2}
df['tmp'] = df['species'].map(species_to_num)
y = df['tmp']

In [ ]: X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11)

dtclf2 = DecisionTreeClassifier()
dtclf2 = dtclf2.fit(X_train, y_train)

In [ ]: # build a grid over the two petal features; the grid must span the
# feature ranges, not the class labels in y_train
h = 0.02
x_min, x_max = X_train.values[:, 0].min() - 1, X_train.values[:, 0].max() + 1
y_min, y_max = X_train.values[:, 1].min() - 1, X_train.values[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

In [ ]: z = dtclf2.predict(np.c_[xx.ravel(), yy.ravel()])
z = z.reshape(xx.shape)
fig = plt.figure(figsize=(16,10))
ax = plt.contourf(xx, yy, z, cmap = 'afmhot', alpha=0.3);
plt.scatter(X_train.values[:, 0], X_train.values[:, 1], c=y_train, s=80,
alpha=0.9, edgecolors='g');


Overfitting

An unconstrained decision tree keeps splitting until every leaf is pure, so it can memorize noise in the training data. Constraining hyperparameters such as max_depth and min_samples_leaf regularizes the tree; GridSearchCV below searches candidate values with cross-validation.

In [ ]: from sklearn.model_selection import GridSearchCV

params = {'min_samples_leaf': list(range(5, 20)),
          'max_depth': list(range(3, 8))}

grid_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=11),
                              params, n_jobs=-1, verbose=0)

grid_search_cv.fit(X_train, y_train)

In [ ]: grid_search_cv.best_estimator_
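The chosen hyperparameters and the cross-validated score are also available directly via the standard GridSearchCV attributes best_params_ and best_score_:

In [ ]: grid_search_cv.best_params_, grid_search_cv.best_score_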

In [ ]: # hyperparameters taken from the grid-search result above
bestDT = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=11)
bestDT.fit(X_train, y_train)

In [ ]: dot_data2 = export_graphviz(bestDT, out_file=None,
                            feature_names=iris.feature_names[2:],
                            class_names=iris.target_names,
                            rounded=True,
                            filled=True)

graph = graphviz.Source(dot_data2)
graph


In [ ]: z = bestDT.predict(np.c_[xx.ravel(), yy.ravel()])
z = z.reshape(xx.shape)
fig = plt.figure(figsize=(16,10))
ax = plt.contourf(xx, yy, z, cmap = 'afmhot', alpha=0.3);
plt.scatter(X_train.values[:, 0], X_train.values[:, 1], c=y_train, s=80,
alpha=0.9, edgecolors='g');

In [ ]: from sklearn.metrics import accuracy_score

y_pred2 = bestDT.predict(X_test)
accuracy_score(y_test, y_pred2)
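To see what the regularization bought, one can compare against the unconstrained tree dtclf2 trained earlier on the same two-feature split (a quick sketch):

In [ ]: # unconstrained vs. regularized tree on the same held-out set
accuracy_score(y_test, dtclf2.predict(X_test)), accuracy_score(y_test, y_pred2)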

