Murang’a University of Technology
Innovation for Prosperity
Lecture 2
Supervised Learning - Classification
Elements of a Learning Task
Three key elements define a learning task in machine learning. Together, they frame the task's scope, input, and evaluation:
1. Task (T)
• This defines what the machine learning model is expected to accomplish.
• Examples include classification, regression, clustering, or reinforcement learning
tasks.
2. Experience (E)
• This refers to the data or interaction the model uses to learn.
• For supervised learning, the experience involves labeled datasets with input-
output pairs. In unsupervised learning, the experience comes from unlabeled data
patterns. Reinforcement learning draws experience from agent-environment
interactions and feedback (rewards).
3. Performance Measure (P)
• This quantifies how well the model is achieving the task.
• Common metrics include accuracy, precision, recall, and F1-score for classification, mean squared error (MSE) for regression, or cumulative reward for reinforcement learning.
• The performance measure evaluates the model's output on unseen test data to ensure it generalizes well.
Introduction to Supervised Learning
• Supervised learning is a type of machine learning where the model
learns from labeled data.
• The goal is to map the input to the output and predict the labels of
unseen data accurately.
• Supervised learning presents two types of problems: classification and regression.
How It Works:
1. Input Data: Contains both features (independent variables) and labels
(dependent variables).
2. Learning Phase: The model identifies patterns in the data that map
inputs to outputs.
3. Prediction Phase: For new data, the model predicts the label using the
learned patterns.
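
A minimal sketch of this three-step workflow in Python with scikit-learn (the Iris dataset and the logistic regression classifier are illustrative assumptions, not prescribed by the lecture):

# Supervised learning workflow: learn from labeled data, then predict unseen labels.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 1. Input data: features X (independent variables) and labels y (dependent variable)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Learning phase: fit patterns that map inputs to outputs
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 3. Prediction phase: predict labels for data the model has not seen
y_pred = model.predict(X_test)
print("Test accuracy:", model.score(X_test, y_test))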
Introduction to Classification
• Classification is a supervised learning task where the model assigns
a category or label to an input based on its features.
• It deals with discrete outputs, such as "yes" or "no," "cat" or "dog,"
or multiple classes like "setosa," "versicolor," and "virginica" in the
Iris dataset.
Key Terms in Classification:
• Classes: Categories or labels (e.g., spam/not spam).
• Features: Attributes used to classify the input (e.g., word
frequencies in an email).
• Decision Boundary: The boundary that separates different classes
in the feature space.
Types of Classification Tasks
1. Binary Classification:
• Two possible classes (e.g., spam vs. not spam).
2. Multiclass Classification:
• More than two classes (e.g., classifying images as "cat," "dog," or
"bird").
3. Multilabel Classification:
• Each instance can belong to multiple classes simultaneously (e.g.,
tagging a movie with genres like "action," "comedy," and "thriller").
Types of Classification Algorithms
1. k-Nearest Neighbors (k-NN)
• The k-NN algorithm is one of the simplest yet most effective classification algorithms. It classifies a data point according to the majority class among its nearest neighbors.
How It Works:
• Compute the distance (e.g., Euclidean, Manhattan) between the input data point
and all other points in the training set.
• Identify the k nearest neighbors to the data point.
• Assign the class that is most common among these k neighbors.
Strengths:
• Simple to implement and understand.
• Works well with small datasets and non-linear decision boundaries.
Weaknesses:
• Computationally expensive for large datasets.
• Sensitive to irrelevant or redundant features.
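
A short k-NN sketch with scikit-learn (k=3 and the Euclidean distance are illustrative assumptions):

# k-NN: classify a point by majority vote among its k nearest training points.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# n_neighbors is k; metric selects the distance function (Euclidean here)
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)  # "training" just stores the data; distances are computed at prediction time
print("Test accuracy:", knn.score(X_test, y_test))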
2. Naïve Bayes
• Naïve Bayes is based on Bayes' Theorem, which calculates the probability of a class
given a set of features.
• It assumes that the features are conditionally independent of each other, which
may not always be true in practice but works surprisingly well for many problems.
How It Works:
• For each class, compute the likelihood of the observed features given that class by multiplying the per-feature probabilities.
• Apply Bayes' Theorem to combine this likelihood with the class prior, yielding the posterior probability of the class.
• Assign the class with the highest posterior probability.
Strengths:
• Extremely fast and efficient for high-dimensional data.
• Performs well on text classification problems.
Weaknesses:
• Assumes independence among features, which may not always hold true.
• Performs poorly if features are highly correlated.
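
A minimal Naïve Bayes sketch with scikit-learn (GaussianNB, which models each feature with a per-class Gaussian, is one of several variants; MultinomialNB is the usual choice for text):

# Naive Bayes: assign the class with the highest posterior probability.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

nb = GaussianNB()  # assumes features are conditionally independent given the class
nb.fit(X_train, y_train)  # estimates class priors and per-class feature likelihoods
print("Posterior probabilities for one sample:", nb.predict_proba(X_test[:1]))
print("Test accuracy:", nb.score(X_test, y_test))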
3. Support Vector Machines (SVM)
• SVM is a robust and versatile classification algorithm that works by finding the
hyperplane that best separates the data points of different classes.
• Depending on the type of data, there are two types of Support Vector
Machines:
Linear SVM (Simple SVM)
• Used for data that is linearly separable, i.e., data that can be divided into two classes by a single straight line (or, in higher dimensions, a flat hyperplane).
Nonlinear SVM (Kernel SVM)
• Used for data that is not linearly separable, i.e., data that cannot be divided by a straight line. It handles such data by mapping it into a higher-dimensional feature space where a separating hyperplane can be found.
Support Vector Machines (SVM)
How SVM Works
• Separate Classes: SVM finds the best hyperplane that divides
data into distinct classes.
• Maximize Margin: It ensures the margin (distance) between the
hyperplane and the nearest data points (support vectors) is as
large as possible.
• Kernel Trick: For non-linear data, SVM transforms the data into a
higher-dimensional space using kernel functions to make it
separable.
• Support Vectors: The data points closest to the hyperplane are
called support vectors, which define the decision boundary.
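
A sketch contrasting a linear and a kernel SVM with scikit-learn (the RBF kernel and default hyperparameters are illustrative assumptions):

# SVM: find the maximum-margin hyperplane; use a kernel for non-linear data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)  # linear (simple) SVM
kernel_svm = SVC(kernel="rbf").fit(X_train, y_train)     # kernel SVM via the kernel trick

print("Linear SVM accuracy:", linear_svm.score(X_test, y_test))
print("Kernel SVM accuracy:", kernel_svm.score(X_test, y_test))
print("Support vectors per class:", linear_svm.n_support_)  # the points that define the boundary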
4. Decision Trees
• Decision trees use a tree-like structure where internal nodes represent feature-
based decisions, and leaf nodes represent class labels.
How It Works:
• At each node, the algorithm selects the feature that best splits the data into
pure subsets (e.g., using metrics like Gini impurity or information gain).
• This process continues recursively until the subsets are pure or a stopping criterion (e.g., a maximum depth) is reached.
Strengths:
• Easy to interpret and visualize.
• Handles both numerical and categorical data.
Weaknesses:
• Prone to overfitting, especially for deep trees.
• Sensitive to small changes in data.
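
A decision tree sketch with scikit-learn (Gini impurity and max_depth=3, a cap that limits overfitting, are illustrative assumptions):

# Decision tree: recursive feature-based splits chosen by an impurity criterion.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))  # human-readable view of the learned splits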
5. Random Forests
• Random Forests address the limitations of decision trees by creating an ensemble
of trees and averaging their predictions.
How It Works:
• Generates multiple decision trees using bootstrap samples of the training
data.
• At each split, a random subset of features is considered to ensure diversity
among the trees.
• The final prediction is made by majority vote (for classification) or by averaging (for regression).
Strengths:
• Reduces overfitting compared to a single decision tree.
• Robust to noisy data and outliers.
Weaknesses:
• Can be computationally expensive for large datasets.
• Less interpretable than a single decision tree.
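
A random forest sketch with scikit-learn (100 trees and square-root feature sampling are illustrative assumptions):

# Random forest: an ensemble of trees grown on bootstrap samples with random feature subsets.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# n_estimators: number of trees; max_features: random feature subset tried at each split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
forest.fit(X_train, y_train)  # final prediction is a majority vote over the trees
print("Test accuracy:", forest.score(X_test, y_test))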
Evaluating Classification Performance
• Evaluating the performance of a classification model is crucial
for understanding how well it predicts the target variable.
• Evaluation metrics quantify different aspects of a model's performance, giving us the information we need to choose, improve, and deploy models effectively.
• The following are some common metrics and techniques
used:
➢ Confusion Matrix, Accuracy, Recall, Precision and F1 Score
Confusion Matrix
• A confusion matrix is a table that summarizes the classification results
and indicates the number of true positive, true negative, false positive,
and false negative results.
• It provides a clear summary of predictions versus actual class labels, which
offers insights into the model’s accuracy and misclassifications.
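
A sketch computing a binary confusion matrix with scikit-learn (the breast cancer dataset and the scaled logistic regression model are illustrative assumptions):

# Binary confusion matrix layout (scikit-learn convention: rows = actual, columns = predicted):
#                   Predicted negative   Predicted positive
# Actual negative          TN                   FP
# Actual positive          FN                   TP
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(confusion_matrix(y_test, model.predict(X_test)))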
Accuracy Metric
• The accuracy score represents the percentage of correct predictions
in the overall test data.
• A high accuracy score indicates that the model is making a large
proportion of correct predictions, while a low accuracy score
indicates that the model is making too many incorrect predictions.
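• In confusion-matrix terms: Accuracy = (TP + TN) / (TP + TN + FP + FN)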
Recall Metric
• Recall measures the proportion of true positives among all actual positive instances, i.e., how many of the real positive cases the model finds.
• It is especially important when missing a positive case is costly (e.g., disease screening).
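• In confusion-matrix terms: Recall = TP / (TP + FN)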
Precision Metric
• Precision measures the proportion of true positives (correctly
classified positive cases) out of all cases classified as positive.
• Precision tells us how often the model’s positive predictions
are correct, highlighting the accuracy of its relevant
predictions.
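• In confusion-matrix terms: Precision = TP / (TP + FP)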
F1-Score
• The F1-score is the harmonic mean of precision and recall, giving a single, balanced measure of model performance.
• It is high only when both precision and recall are high, so it penalizes models that trade one metric for the other.
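• Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)

A self-contained sketch computing all four metrics with scikit-learn (the breast cancer dataset and the scaled logistic regression model are illustrative assumptions):

# Compute accuracy, precision, recall, and F1 on a held-out test set.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
y_pred = model.fit(X_train, y_train).predict(X_test)

print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1-score: ", f1_score(y_test, y_pred))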
Non-Parametric Models
• Non-parametric models do not make strong assumptions
about the data distribution and can have a flexible number of
parameters that can grow with the data.
• They are often more flexible but can be computationally more
expensive.
Examples: k-NN, Support Vector Machines (SVM), Decision Trees, and Random Forests.
Strengths: Can capture complex relationships in data without
assuming a specific functional form.
Weaknesses: Require large amounts of data to generalize
effectively.
Non-Parametric Models
• Non-parametric methods make minimal assumptions about the
data compared to parametric methods. However, they still rely on
some key assumptions to function effectively.
• Here are three assumptions typically associated with non-
parametric methods:
– Independence: Data points are independent and not influenced
by others.
– Random Sampling: Data represents a random sample from the
population.
– Homogeneity of Measurement: Measurements are consistent
across all data points.
Applications of Classification
Classification models are widely used in diverse fields, offering solutions to
real-world problems:
i. Healthcare: Disease diagnosis (e.g., cancer detection using image
classification).
ii. Finance: Fraud detection in credit card transactions.
iii. Marketing: Customer segmentation (e.g., classifying customers based on
purchasing behavior).
iv. Natural Language Processing (NLP): Email spam detection.
v. Image Recognition: Object detection in autonomous vehicles.
vi. Cybersecurity: Intrusion detection in networks.
vii. Education: Plagiarism detection using text classification techniques.
Limitations of Classification
Data Dependency:
• Requires labeled data, which can be expensive and time-consuming to obtain.
Overfitting:
• Complex models may overfit the training data, leading to poor generalization.
Imbalanced Data:
• Models struggle when one class dominates the dataset (e.g., fraud detection).
Computational Cost:
• Some algorithms can be computationally expensive for large datasets.
Interpretability:
• Advanced models (e.g., Neural Networks) are "black boxes," making them
hard to explain.
Class Activity
1. Implement a Support Vector Machine (SVM) classifier on the
Iris dataset. Use a linear kernel and split the data into training
and testing sets with a test size of 0.2 and random_state=42.
Calculate and print the accuracy of your model on the test set.
(7 Marks)
2. Using the breast cancer dataset from scikit-learn, implement a
binary classification model using any classifier covered in this
lecture. Print the following evaluation metrics for your model's
performance on the test set: Confusion Matrix, Accuracy,
Precision, Recall and F1-score. (13 Marks)