ECS784U/P DATA ANALYTICS
(WEEK 4, 2024)
SUPERVISED LEARNING (CLASSIFICATION)
DR ANTHONY CONSTANTINOU 1
SCHOOL OF ELECTRONIC ENGINEERING AND COMPUTER SCIENCE
TIMETABLE
2
LECTURE OVERVIEW
Supervised Learning (Classification)
▪ Week 4 Lab.
▪ Logistic Regression.
▪ Coursework Q&A.
▪ Support Vector Classification (SVC).
▪ K-Nearest Neighbour (K-NN) Classification.
▪ Big-O complexity notation.
3
WEEK 4 LAB OVERVIEW
Scikit-Learn
What is Scikit-Learn?
▪ A Python library (an extension of SciPy) for machine learning.
▪ Built on SciPy and NumPy: provides access to classic machine
learning algorithms for classification, regression and clustering.
▪ Dimensionality reduction with Principal Component Analysis
(PCA).
▪ Linear and non-linear (including logistic) regression.
▪ K-nearest neighbour (K-NN).
▪ Support vector machines (SVMs).
▪ K-means clustering.
▪ Gaussian Mixture Models (GMMs).
4
WEEK 4 LAB OVERVIEW
5
WHAT IS CLASSIFICATION?
▪ Given data containing 𝑥 features and a target feature 𝑦, where
▪ 𝑦 is a categorical variable
▪ 𝑥 is a set of different features.
▪ Learn a model to predict 𝑦
ො given 𝑥.
𝑦3
𝑥1 (e.g. length)
𝑦1 𝑦𝑛
𝑦2
𝑥2 (e.g., weight)
6
CLASSIFICATION WITH
LOGISTIC REGRESSION
▪ Logistic regression is Linear Regression where the
dependent variable 𝑦 is categorical (e.g., car/train/bike)
instead of numerical (e.g., 10/15/20).
▪ Linear binary separation.
Feature 1
(e.g. length)
Feature 2 (e.g., weight)
7
CLASSIFICATION: PROCEDURE
𝑦3
𝑥1 (e.g. length) 𝑦𝑛
𝑦1
𝑦2
𝑥2 (e.g., weight)
Basic linear classification algorithm:
1. Search for straight lines.
2. For each line, return the classification accuracy.
3. Pick the line with the highest classification accuracy.
8
HOW TO SEARCH EFFECTIVELY?
Which line is more accurate and why?
A B
C D
9
HIDDEN SLIDE
10
HIDDEN SLIDE
11
HIDDEN SLIDE
12
HIDDEN SLIDE
13
HIDDEN SLIDE
14
VISUALISING FEATURE IMPORTANCE
▪ Which feature is more important for classification:
▪ Feature 1 or Feature 2?
Feature 1
Feature 2
15
VISUALISING FEATURE IMPORTANCE
▪ Which feature is more important for classification:
▪ Feature 1 or Feature 2?
Feature 1
Feature 2
16
HIDDEN SLIDE
17
HIDDEN SLIDE
18
LINEAR CLASSIFICATION
▪ Data may be linearly separable if the dimensionality of the data is big
relative to the sample size of the input data.
▪ The more dimensions we have and the less samples those
dimensions contain, the easier is to find a linear line that separates
the data.
▪ But high dimensionality may lead to overfitting.
▪ While having more samples makes it harder to find a linear
separation, the increased number of samples will help produce a
more accurate linear separation on previously unseen cases.
Feature 1
(e.g. length)
19
Feature 2 (e.g., weight)
HOW TO SEARCH EFFECTIVELY?
Find the line that is as distant as possible from each class.
▪ Logistic regression relies on the sigmoid function, also known as the
logistic function, for binary classification.
▪ It takes an input 𝑥 and outputs a class probability value.
𝑓 𝑥 = 𝑎𝑥1 + 𝑏𝑥2 + 𝑐 sigmoid 𝑥 = (𝑎𝑥1 + 𝑏𝑥2 + 𝑐)
▪ Default decision boundary is set to 0.5
▪ E.g., if sigmoid(𝑥) > 0.5 then class 1, otherwise class 0
CLASS 1
𝑒𝑥 1
sigmoid 𝑥 = 𝑥 = =
sigmoid 𝑥 𝑒 + 1 1 + 𝑒 −𝑥
1
= 0 to 1
CLASS 0 1 + 𝑒 −𝑎𝑥1+𝑏𝑥2+𝑐
20
𝑥
HOW TO SEARCH EFFECTIVELY?
▪ If a new data point 𝑥 is placed somewhere on the red line then:
▪ 𝑦ො = sigmoid 𝑥 = 0.5 = either class.
▪ This is because the red line is the decision boundary.
▪ If a new data point 𝑥 is placed somewhere on the blue line then:
▪ 𝑦ො = sigmoid 𝑥 = ~0.75 = class train.
▪ If a new data point 𝑥 is placed somewhere on the green line then:
▪ 𝑦ො = sigmoid 𝑥 = ~0.25 = class car.
Feature 1
(e.g. length)
Feature 2 (e.g., weight)
21
THERE ARE DIFFERENT SIGMOID
FUNCTIONS
▪ Different parameters or exponential functions produce different
sigmoid functions.
▪ Different sigmoid functions determine the probability of a particular
instance belonging to a particular class given its distance from the
separating line.
▪ The figure below compares some sigmoid functions.
22
Figure source: https://en.wikipedia.org/wiki/Sigmoid_function
REGRESSION VS CLASSIFICATION
Linear Regression Logistic Regression
▪ Given data 𝑥, 𝑦 ▪ Given data 𝑥, 𝑦
▪ Where 𝑦 is a set of numerical values. ▪ Where 𝑦 is a set of categorical values.
▪ Learn a function 𝑓(𝑥) to predict 𝑦
ො given 𝑥. ▪ Learn the sigmoid of a function 𝑓(𝑥) to
▪ E.g., in the case of linear regression predict 𝑦ො given new data 𝑥.
we have 𝑦ො = 𝑓(𝑥) = 𝑎𝑥1 + 𝑏𝑥2 + 𝑐 ▪ E.g., in the case of logistic regression we
have 𝑦ො = sigmoid 𝑎𝑥1 + 𝑏𝑥2 + 𝑐 > 0.5
𝑦3
𝑥1 (e.g. length)
𝑦1 𝑦𝑛
𝑦 𝑦2
𝑥2 (e.g., weight)
𝑥
ො given 𝑥𝑖
▪ E.g., predict carPrice 𝑦 ො given 𝑥𝑖
▪ E.g., predict category 𝑦
▪ 𝐜𝐚𝐫𝐏𝐫𝐢𝐜𝐞 = 𝑎 × carModel + 𝑏 × mileAge + 𝑐 ▪ if sigmoid(𝑥) > 0.5 then: car
23
▪ Else: train
MULTINOMIAL LOGISTIC CLASSIFICATION
▪ From bivariate/binary 𝑦 = {0,1} to multinomial 𝑦 = {0,1, … , 𝑘}
▪ Also referred to as Multiclass Logistic Regression
▪ We use the same process to generate additional
decision/classification boundaries.
24
10 MINUTTERS PAUSE
10分の休憩
10 MINUTEN PAUSE
دقائق استراحة10
10 MINUTI DI PAUSA
דקות10 הפסקה של
10 MINUTES DE PAUSE
10 मिनट का ब्रेक
10 MINUTES BREAK
10 МИНУТА ПАУЗЕ
10 মিমিটের মিরমি
ΔΙΑΛΕΙΜΜΑ 10 ΛΕΠΤΩΝ
ПЕРЕРЫВ 10 МИНУТ
休息10分钟
DESCANSO DE 10 MINUTOS
10 분 휴식
10 MINUTEN PAUZE 25
SUPPORT VECTOR CLASSIFICATION
Finds the support vectors that maximise margins
large margin small margin
Support Vectors
26
SUPPORT VECTOR CLASSIFICATION
▪ Unlike Logistic Regression which uses the Sigmoid function to find the line which is
as distant as possible from each class, Support Vector Classification (SVC) finds
the decision boundary that is as distant as possible from the boundary point
of both classes.
▪ Support Vector Classification (SVC) returns a decision boundary that is similar to
that returned by Logistic Regression.
▪ While the decision boundary will often not be identical between these two
algorithms, their classification accuracy will often be identical. Why is that?
one possible reason could be the amount of the dataset is not big enough
only if the testing data lie in between two lines will make the accuracy not
identical=> so less data, less chance it will happen
model 1
model 2 27
Logistic regression Support Vector Classification
HIDDEN SLIDE
28
K-NEAREST NEIGHBOUR
▪ K-NN is a well-known and simple machine learning algorithm for classification.
▪ Unlike Logistic Regression, which is a model-based classification, K-NN is an
instance-based classification.
▪ Model-based classification: learns a model and uses that model to
generate predictions.
▪ Instance-based classification: Does not learn a model. Instead, it takes
new data instances and compares them to historical data instances.
Classification is determined by closest match.
𝑦3 𝑦3
𝑥2 (e.g. length)
𝑦1 𝑦𝑛 𝑥2 (e.g. length) 𝑦1 𝑦𝑛
𝑦2 𝑦2
𝑥2 (e.g., weight) 𝑥2 (e.g., weight)
Model-based classification Instance-based classification 29
K-NN: MEASURING DISTANCE
How to measure similarity?
𝑦3
𝑥2 (e.g. length) 𝑦1 𝑦𝑛
𝑦2
𝑥2 (e.g., weight)
Distance-based similarity.
▪ Euclidean distance: 𝐷 𝐱, 𝐲 = 𝑥𝑖 − 𝑦𝑖 2
▪ Manhattan distance:
▪ Hamming distance:
30
▪ Measures the number of changes between 𝐴 and 𝐵 required to make 𝐴 = 𝐵.
K-NN: MEASURING DISTANCE
Distance-based similarity.
▪ Euclidean distance:
▪ In Euclidean distance the green line has
length 0 − 6 2 + 0 − 6 2 = 8.49, and it is
the unique shortest path.
▪ Manhattan distance: Euclidean (green) vs. Manhattan
(other colours) distance
▪ In Manhattan distance, the red, yellow and
blue paths all have the shortest length of 12.
▪ Hamming distance:
▪ Measures the number of changes between 𝐴 and 𝐵,
needed to make 𝐴 = 𝐵.
The Hamming distance between: Two example distances: 100→011 has
•"karolin" and "kathrin" is 3. distance 3; 010→111 has distance 2.
•"karolin" and "kerstin" is 3. The minimum distance between any two
•1011101 and 1001001 is 2. vertices is the Hamming distance
•2173896 and 2233796 is 3. between the two binary strings.
source of images: wikipedia 31
K-NN: DATA SCALING
▪ What if we measure weight in grams and length in meters.
▪ There is a risk that the classification will be driven by the weight parameter.
▪ The distances generated between weights would be significantly larger relative to the
distances generated between lengths.
▪ Potential solution: Normalisation
▪ Normalise for a particular scale.
▪ E.g., normalise everything between 0-1.
▪ Normalise for standard deviation.
▪ E.g., normalise everything to have 1 STD.
32
Source of images: https://i.stack.imgur.com/
K-NN: CLASSIFICATION BOUNDARY
▪ How does the classification boundary look like?
▪ Like a Voronoi diagram; i.e., partition of a plane/surface into regions close to each of
the given set of objects.
▪ The classification boundaries depend on the distance metric.
▪ e.g., Euclidean, Manhattan, etc.
Feature 2
Feature 2
33
Feature 1 Feature 1
K-NN: OUTLIERS
What if the division of the surface is unrealistic?
▪ Since we are taking the historical data as is, without generalising, the
classification boundaries often become too complex.
▪ How to smooth boundaries?
Feature 2
Feature 2
swapped
34
Feature 1 Feature 1
HIDDEN SLIDE
35
HIDDEN SLIDE
36
HIDDEN SLIDE
37
K-NN: PROCEDURE Reading
slide
▪ pseudocode:
1. Given data 𝐷 and new input 𝑥 to classify.
2. Measure the distance of input 𝑥 to every
known data point in 𝐷.
3. Search for the 𝐾 smaller distances.
4. Retrieve the majority class from the 𝐾
Feature 2
smaller distances.
5. Assign the majority class to 𝑦.
ො
▪ Note that we can also retrieve the probability
of each class by simply counting the
neighbours that associate to each class,
relative to 𝐾.
▪ i.e., K-NN could be used for regression.
Feature 1
38
LOGISTIC REGRESSION VS. Reading
K-NEAREST NEIGHBOUR CLASSIFICATION slide
Logistic Regression K-NN
▪ Model-based. ▪ Instance-based.
▪ Given data 𝑥, 𝑦 ▪ No model training.
▪ Where 𝑦 is a set of categorical values. ▪ Memorises all historical data.
▪ Learn the sigmoid of a function 𝑓(𝑥) to predict ▪ Use historical data to predict 𝑦
ො given new data 𝑥
𝑦ො given new data 𝑥. ▪ i.e., find similar 𝑦𝑖 from historical data given
▪ i.e., in the case of logistic regression we new 𝑥𝑖.
have 𝑦ො = sigmoid 𝑎𝑥1 + 𝑏𝑥2 + 𝑐 > 0.5
𝑦3 𝑦3
𝑥2 (e.g. length)
𝑥2 (e.g. length)
𝑦1 𝑦𝑛 𝑦1 𝑦𝑛
𝑦2 𝑦2
𝑥2 (e.g., weight) 𝑥2 (e.g., weight)
▪ E.g., predict category 𝑦ො given 𝑥𝑖
ො given 𝑥𝑖
▪ E.g., predict category 𝑦
▪ if number of class 0 neighbours >
▪ if sigmoid(𝑥) > 0.5 then: car
number of class 1 neighbours then: car 39
▪ Else: train
▪ Else: train
BIG-O NOTATION
The Big-O notation.
▪ A formalism for expressing the computational complexity of an
algorithm.
▪ Tells us something about the runtime of the algorithm.
▪ The same notation can be used to express how much memory the
function uses.
▪ The letter ‘𝑂’ is used because the growth of a function is also referred to
as the order of the function. Examples:
▪ 𝑂 1 indicates that computational complexity is independent of the
input data size.
▪ E.g., doubling the input size does not influence the runtime.
▪ 𝑂 𝑛 indicates that computational complexity is proportional (linear) to
the input size.
▪ E.g., doubling the input size doubles the runtime.
▪ 𝑂 𝑛2 indicates that computational complexity is proportional to the
square of the input size.
▪ E.g., doubling the input size quadruples the runtime. 40
COMPUTATIONAL COMPLEXITY Reading
slide
The Big-O notation.
41
COMPUTATIONAL COMPLEXITY
▪ Logistic Regression: 𝑁 = samples, 𝐷 = dimensions
▪ Train: 𝑶 𝑵𝑫 Perform optimisation on each data point.
▪ Memory: 𝑶 𝑫 Store weights 𝑤 for each variable.
▪ Test: 𝑶 𝑫 Multiply input data with associated weights 𝑤.
▪ K-Nearest Neighbour:
▪ Train: 𝑶 𝟏 Get the dataset (no training).
▪ Memory: 𝑶 𝑵𝑫 Store the entire dataset (K-NN does not produce a model).
▪ Test: 𝑶 𝑵𝑫 Compare input to each data point.
Which of the two methods is less computationally expensive for 42
repeated use of classification?
HIDDEN SLIDE
43