Decision Tree Code Explanation
1. Importing Libraries
python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
What it does: These lines import all the tools we need:
numpy (as np): For working with arrays and numerical data
matplotlib.pyplot (as plt): For creating graphs and visualizations
load_breast_cancer : Loads the Breast Cancer Wisconsin (Diagnostic) dataset that ships with scikit-learn
train_test_split : Splits data into training and testing portions
DecisionTreeClassifier : The machine learning algorithm we'll use
accuracy_score : Computes the fraction of predictions that match the true labels
tree : The scikit-learn module whose plot_tree function draws the decision tree
2. Loading the Dataset
python
data = load_breast_cancer()
X = data.data
y = data.target
What it does:
data = load_breast_cancer() : Loads the breast cancer dataset (569 samples with 30 features each)
X = data.data : Gets the input features (tumor measurements such as radius, texture, perimeter, and area)
y = data.target : Gets the labels (0 = malignant, 1 = benign)
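Before going further, it can help to confirm what was actually loaded. The snippet below is a small sketch that prints the shape of the feature matrix, a few feature names, and how many samples belong to each class. The variable name class_counts is introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

print("Feature matrix shape:", X.shape)  # (samples, features)
print("First three features:", list(data.feature_names[:3]))
print("Class names:", list(data.target_names))  # position 0 is label 0, position 1 is label 1

# Count how many samples carry each label (class_counts is an illustrative name).
class_counts = {
    str(name): int(np.sum(y == idx)) for idx, name in enumerate(data.target_names)
}
print("Samples per class:", class_counts)
```

The class counts show that the dataset is somewhat imbalanced, which is worth keeping in mind when reading the accuracy figure later on.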
3. Splitting the Data
python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
What it does:
Splits the data into training (80%) and testing (20%) sets
Training data: Used to teach the model
Testing data: Used to evaluate how well the model learned
random_state=42 : Ensures we get the same split every time (reproducibility)
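To see the effect of the split, the sketch below prints the size of each portion and the share of benign samples in both. It also passes stratify=y, which is an optional addition and not part of the code above; it asks train_test_split to keep the class ratio roughly the same in both portions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y is an assumption added for this sketch; the walkthrough above
# splits without it. It keeps the malignant/benign ratio similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training samples:", X_train.shape[0])
print("Testing samples:", X_test.shape[0])
# Labels are 0/1, so the mean is the share of benign (label 1) samples.
print("Benign share (train):", round(float(y_train.mean()), 3))
print("Benign share (test):", round(float(y_test.mean()), 3))
```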
4. Creating and Training the Model
python
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
What it does:
clf = DecisionTreeClassifier(random_state=42) : Creates a decision tree classifier object; random_state=42 makes the tree-building reproducible
clf.fit(X_train, y_train) : Trains the model using the training data
During training, the model picks threshold questions such as "Is the mean radius ≤ 15?" that best separate the two classes, and chains them into a tree of decisions
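How does the model decide which question is "best"? By default, DecisionTreeClassifier scores candidate splits with Gini impurity. The sketch below is a simplified, pure-Python illustration of that idea on made-up data; it is not scikit-learn's actual implementation, and the helper names gini and split_impurity as well as the toy values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)  # fraction of class 1
    p0 = 1.0 - p1                   # fraction of class 0
    return 1.0 - (p0 ** 2 + p1 ** 2)


def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting on `value <= threshold`."""
    left = [lab for val, lab in zip(values, labels) if val <= threshold]
    right = [lab for val, lab in zip(values, labels) if val > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical toy data: one radius-like feature and a 0/1 label per sample.
radius = [10.0, 12.0, 13.0, 17.0, 19.0, 21.0]
label = [1, 1, 1, 0, 0, 1]

candidates = (12.5, 15.0, 20.0)
for threshold in candidates:
    print(threshold, round(split_impurity(radius, label, threshold), 3))

# The candidate with the lowest weighted impurity separates the classes best.
best_threshold = min(candidates, key=lambda t: split_impurity(radius, label, t))
print("Best threshold:", best_threshold)
```

A lower weighted impurity means the two groups produced by the question are more uniform, so the tree prefers that question.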
5. Making Predictions
python
y_pred = clf.predict(X_test)
What it does:
Uses the trained model to predict outcomes for the test data
y_pred contains the model's guesses (0 or 1) for each test sample
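Besides hard 0/1 guesses, the classifier can also report class probabilities through predict_proba. The sketch below compares both for the first five test samples. Setting max_depth=3 is an assumption made only for this example: a fully grown tree tends to end in pure leaves, which makes nearly every probability 0 or 1.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# max_depth=3 is an assumption for this sketch, not a setting from the walkthrough.
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test[:5])         # hard labels (0 or 1)
y_proba = clf.predict_proba(X_test[:5])  # one column per class, ordered as clf.classes_

for pred, proba in zip(y_pred, y_proba):
    print(pred, proba.round(3))
```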
6. Calculating Accuracy
python
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
What it does:
Compares the model's predictions ( y_pred ) with the actual answers ( y_test )
Calculates the fraction of test samples the model got right (a value between 0 and 1)
Prints the accuracy as a percentage (e.g., "Model Accuracy: 93.86%")
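To make the accuracy figure less of a black box, here is the same calculation done by hand on ten made-up labels. The lists and variable names are hypothetical and exist only for this illustration. The sketch also separates the two kinds of error, which a single accuracy number hides.

```python
# Hypothetical true labels and predictions (0 = malignant, 1 = benign).
y_true_toy = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred_toy = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

pairs = list(zip(y_true_toy, y_pred_toy))

# Accuracy is the number of matching pairs divided by the total.
correct = sum(t == p for t, p in pairs)
accuracy_toy = correct / len(pairs)

# Malignant cases predicted as benign, and benign cases predicted as malignant.
missed_malignant = sum(t == 0 and p == 1 for t, p in pairs)
false_alarms = sum(t == 1 and p == 0 for t, p in pairs)

print(f"Accuracy: {accuracy_toy * 100:.2f}%")
print("Malignant predicted as benign:", missed_malignant)
print("Benign predicted as malignant:", false_alarms)
```

In a medical setting the two error types usually do not carry the same cost, so it is worth looking at them separately.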
7. Testing on a Single Sample
python
new_sample = np.array([X_test[0]])
prediction = clf.predict(new_sample)
prediction_class = "Benign" if prediction[0] == 1 else "Malignant"
print(f"Predicted Class for the new sample: {prediction_class}")
What it does:
Takes the first sample from the test set
Makes a prediction for just this one sample
Converts the numerical prediction (0 or 1) to a readable label:
1 = "Benign" (not cancerous)
0 = "Malignant" (cancerous)
Prints the result
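As an alternative to the if/else mapping, the label can be looked up in data.target_names, so the text always matches the dataset's own encoding. The sketch below is one possible variant; it uses reshape(1, -1) instead of np.array([...]) to build the single-row input, and the names predicted_index, predicted_label, and actual_label are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# predict expects a 2-D array of shape (n_samples, n_features),
# so the single row is reshaped to (1, 30).
new_sample = X_test[0].reshape(1, -1)
predicted_index = int(clf.predict(new_sample)[0])

# Look the labels up in the dataset's own class names.
predicted_label = data.target_names[predicted_index]
actual_label = data.target_names[y_test[0]]
print(f"Predicted: {predicted_label}, actual: {actual_label}")
```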
8. Visualizing the Decision Tree
python
plt.figure(figsize=(12,8))
tree.plot_tree(clf, filled=True, feature_names=data.feature_names, class_names=data.target_names)
plt.title("Decision Tree - Breast Cancer Dataset")
plt.show()
What it does:
Creates a large figure (12x8 inches)
Draws the decision tree with:
filled=True : Colors the nodes based on the majority class
feature_names : Shows actual feature names instead of numbers
class_names : Shows "malignant" and "benign" instead of 0 and 1
Adds a title and displays the visualization
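A full tree can be hard to read as a picture. As a complement, scikit-learn's export_text prints the same rules as indented text. The sketch below makes two assumptions for brevity: it limits the tree with max_depth=2, and it fits on the whole dataset rather than on the training split. The names shallow_clf and rules are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()

# max_depth=2 and fitting on all samples are assumptions made to keep
# the printed output short; they differ from the walkthrough above.
shallow_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
shallow_clf.fit(data.data, data.target)

# feature_names is passed as a plain list of strings.
rules = export_text(shallow_clf, feature_names=list(data.feature_names))
print(rules)
```

Each indentation level in the printed output corresponds to one level of the tree, and each "class:" line is a leaf.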
How the Decision Tree Works
The decision tree makes predictions by asking a series of yes/no questions about the tumor
characteristics. For example:
1. "Is the mean radius ≤ 16.8?"
If yes → go left branch
If no → go right branch
2. Continue asking questions until reaching a final decision (leaf node)
Each path from the root to a leaf represents one classification rule, which makes the model
easy to interpret.
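The question-by-question process described above can be sketched in a few lines of plain Python. The tree below is written by hand with hypothetical thresholds and feature choices; it is not the tree the classifier learns, and the names toy_tree and classify are made up for this illustration.

```python
# A hand-written toy tree with hypothetical thresholds (not a trained model).
toy_tree = {
    "feature": "mean radius",
    "threshold": 16.8,
    "left": {  # taken when the answer is "yes" (value <= threshold)
        "feature": "mean texture",
        "threshold": 20.0,
        "left": {"label": "benign"},
        "right": {"label": "malignant"},
    },
    "right": {"label": "malignant"},  # taken when the answer is "no"
}


def classify(sample, node):
    """Follow yes/no questions from the root until a leaf is reached."""
    path = []
    while "label" not in node:
        goes_left = sample[node["feature"]] <= node["threshold"]
        answer = "yes" if goes_left else "no"
        path.append(f'{node["feature"]} <= {node["threshold"]}? {answer}')
        node = node["left"] if goes_left else node["right"]
    return node["label"], path


small_label, small_path = classify({"mean radius": 14.2, "mean texture": 18.5}, toy_tree)
large_label, large_path = classify({"mean radius": 20.1, "mean texture": 25.0}, toy_tree)

print(small_label, small_path)
print(large_label, large_path)
```

The returned path is the rule that produced the prediction, which is what makes a decision tree straightforward to explain.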