Classification
• As the name suggests, classification is the task of “classifying things” into categories. Classification is part of supervised machine learning, in which the model is trained on labeled data.
• Classification is a process of categorizing data or objects into
predefined classes or categories based on their features or attributes.
• The main objective of classification machine learning is to build a
model that can accurately assign a label or category to a new
observation based on its features.
Types of classification
1. Binary Classification:
• Definition: Involves two classes or categories. The model predicts one of two
possible outcomes.
• Examples:
• Email Classification: Spam or Not Spam
2. Multiclass Classification:
• Definition: Involves more than two classes. Each instance is classified into one of three or more categories.
• Examples:
• Handwritten Digit Recognition: Classifying digits from 0 to 9
3. Multilabel Classification:
• Definition: Each instance can belong to multiple classes simultaneously.
• Examples:
• Text Categorization: A news article can belong to multiple categories such as "Sports" and
"Politics."
MNIST
• The MNIST dataset is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents.
The sklearn.datasets package contains mostly three types of
functions:
• fetch_* functions such as fetch_openml() to download real-life datasets,
• load_* functions to load small toy datasets bundled with Scikit-Learn, and
• make_* functions to generate fake datasets, useful for tests.
Generated datasets are usually returned as an (X, y) tuple containing the input data and the targets, both as NumPy arrays.
There are 70,000 images, and each image has 784 features. This is because
each image is 28×28 pixels, and each feature simply represents one pixel’s
intensity, from 0 (white) to 255 (black)
• We should always create a test set and set it aside before inspecting
the data closely. The MNIST dataset is actually already split into a
training set (the first 60,000 images) and a test set (the last 10,000
images):
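A minimal sketch of loading MNIST with fetch_openml() and applying this split (the variable names here are our own):

from sklearn.datasets import fetch_openml

# Download MNIST; as_frame=False returns NumPy arrays rather than a DataFrame
mnist = fetch_openml('mnist_784', as_frame=False)
X, y = mnist.data, mnist.target

# The dataset is already shuffled and split for us
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]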
Training a Binary Classifier
• Let’s simplify the problem for now and only try to identify one digit—
for example, the number 5.
• This “5-detector” will be an example of a binary classifier, capable of
distinguishing between just two classes, 5 and not-5.
• Let’s create the target vectors for this classification task:
• now let’s pick a classifier and train it
Performance Measures
Measuring Accuracy Using Cross-Validation
• Let’s use the cross_val_score() function to evaluate your
SGDClassifier model using K-fold cross-validation, with three
folds
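For example (continuing the sketch above):

from sklearn.model_selection import cross_val_score

# 3-fold cross-validation, scored by accuracy; all folds come out
# impressively high (roughly 95%+), but read on before celebrating
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")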
let’s look at a very dumb classifier that just classifies every single image in the “not-5” class:
This demonstrates why accuracy is generally not the preferred performance measure for
classifiers, especially when you are dealing with skewed datasets (i.e., when some classes are
much more frequent than others)
Confusion Matrices
• A much better way to evaluate the performance of a classifier is to
look at the confusion matrix.
• The general idea is to count the number of times instances of class A
are classified as class B.
• For example, to know the number of times the classifier confused
images of 5s with 3s, you would look in the 5th row and 3rd column
of the confusion matrix.
• To compute the confusion matrix, you first need to have a set of
predictions, so they can be compared to the actual targets.
• Just like the cross_val_score() function, cross_val_predict() performs
K-fold cross-validation, but instead of returning the evaluation scores,
it returns the predictions made on each test fold
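A sketch, continuing with the SGDClassifier from above:

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

# Out-of-fold ("clean") predictions for every training instance
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
cm = confusion_matrix(y_train_5, y_train_pred)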
• Each row in a confusion matrix represents an actual class, while each
column represents a predicted class.
• The first row of this matrix considers non-5 images (the negative class):
53,892 of them were correctly classified as non-5s (they are called true
negatives),
• while the remaining 687 were wrongly classified as 5s (false positives, also
called type I errors).
• The second row considers the images of 5s (the positive class): 1,891 were
wrongly classified as non-5s (false negatives, also called type II errors),
• while the remaining 3,530 were correctly classified as 5s (true positives).
• The confusion matrix gives you a lot of information,
but sometimes you may prefer a more concise metric.
• An interesting one to look at is the accuracy of the positive predictions; this is called the precision of the classifier:
precision = TP / (TP + FP)
where TP is the number of true positives, and FP is the number of false positives.
• Precision is typically used along with another metric named recall, also called sensitivity or the true positive rate (TPR).
• Recall is the ratio of positive instances that are correctly detected by the classifier:
recall = TP / (TP + FN)
where FN is, of course, the number of false negatives.
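Scikit-Learn provides both metrics directly; a sketch using the cross-validated predictions from above:

from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred)  # TP / (TP + FP)
recall_score(y_train_5, y_train_pred)     # TP / (TP + FN)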
F1 Score
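The F1 score combines precision and recall into a single metric: it is their harmonic mean, F1 = 2 × precision × recall / (precision + recall), so it is high only when both precision and recall are high. A sketch:

from sklearn.metrics import f1_score

f1_score(y_train_5, y_train_pred)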
The Precision/Recall Trade-off
Instead of calling the classifier’s predict() method, you can
call its decision_function() method, which returns a score
for each instance, and then use any threshold you want to
make predictions based on those scores
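A sketch on a single instance (some_digit is our own example variable):

some_digit = X_train[0]
y_scores = sgd_clf.decision_function([some_digit])

threshold = 0  # the default threshold; this reproduces predict()
y_some_digit_pred = (y_scores > threshold)

threshold = 3000  # an arbitrary higher threshold: precision up, recall down
y_some_digit_pred = (y_scores > threshold)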
How do you decide which threshold to use?
• Suppose you decide to aim for 90% precision. You could use the first
plot to find the threshold you need to use, but that’s not very precise
• Alternatively, you can search for the lowest threshold that gives you at
least 90% precision. For this, you can use the NumPy array’s argmax()
method
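A sketch using precision_recall_curve(); argmax() returns the first index of the maximum value, which here is the first True:

from sklearn.metrics import precision_recall_curve

# Decision scores for every training instance, via cross-validation
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

# Lowest threshold that reaches at least 90% precision
idx_90 = (precisions >= 0.90).argmax()
threshold_90 = thresholds[idx_90]
y_train_pred_90 = (y_scores >= threshold_90)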
The ROC Curve
• The receiver operating characteristic (ROC) curve is another common tool
used with binary classifiers.
• It is very similar to the precision/recall curve, but instead of plotting
precision versus recall, the ROC curve plots the true positive rate (another
name for recall) against the false positive rate (FPR).
• The FPR (also called the fall-out) is the ratio of negative instances that are
incorrectly classified as positive.
• It is equal to 1 – the true negative rate (TNR), which is the ratio of negative
instances that are correctly classified as negative.
• The TNR is also called specificity.
• Hence, the ROC curve plots sensitivity (recall) versus 1 – specificity
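A sketch with roc_curve(), reusing the cross-validated scores from above; roc_auc_score() summarizes the curve as the area under it (1.0 is a perfect classifier, 0.5 is random):

from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
roc_auc_score(y_train_5, y_scores)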
The RandomForestClassifier class does not have a decision_function() method, due to the way
it works. We can call the cross_val_predict() function to train the RandomForestClassifier using
cross-validation and make it predict class probabilities for every image as follows:
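A sketch; predict_proba() returns one probability per class, and the positive-class column serves as the score:

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method="predict_proba")
y_scores_forest = y_probas_forest[:, 1]  # probability of the positive class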
Multiclass Classification
• Multiclass classifiers (also called multinomial classifiers) can distinguish between more than two classes.
• There are various strategies that you can use to perform multiclass classification with multiple binary classifiers.
• One way to create a system that can classify the digit images into 10
classes (from 0 to 9) is to train 10 binary classifiers, one for each digit (a
0-detector, a 1-detector, a 2-detector, and so on).
• Then when you want to classify an image, you get the decision score from each classifier for that image and you select the class whose classifier outputs the highest score. This is called the one-versus-the-rest (OvR) strategy, or sometimes one-versus-all (OvA).
• Another strategy is to train a binary classifier for every pair of digits:
one to distinguish 0s and 1s, another to distinguish 0s and 2s, another
for 1s and 2s, and so on. This is called the one-versus-one (OvO)
strategy.
• If there are N classes, you need to train N × (N – 1) / 2 classifiers. For
the MNIST problem, this means training 45 binary classifiers!
• When you want to classify an image, you have to run the image
through all 45 classifiers and see which class wins the most duels.
• The main advantage of OvO is that each classifier only needs to be
trained on the part of the training set containing the two classes that
it must distinguish
• Scikit-Learn detects when you try to use a binary classification
algorithm for a multiclass classification task, and it automatically runs
OvR or OvO, depending on the algorithm
• Let’s try this with a support vector machine classifier using the
sklearn.svm.SVC class
• Let’s only train on the first 2,000 images, or else it will take a very long
time:
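A sketch (note that we train on the original multiclass targets y_train, not the binary y_train_5):

from sklearn.svm import SVC

svm_clf = SVC(random_state=42)
svm_clf.fit(X_train[:2000], y_train[:2000])  # Scikit-Learn runs OvO under the hood
svm_clf.predict([some_digit])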
• This code actually made 45 predictions—one per pair of classes—and
it selected the class that won the most duels.
• If you call the decision_function() method, you will see that it returns
10 scores per instance: one per class.
• If you want to force Scikit-Learn to use one-versus-one or one-versus-the-rest, you can use the OneVsOneClassifier or OneVsRestClassifier classes.
• Simply create an instance and pass a classifier to its constructor.
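For example, to force OvR with the SVC above (a sketch):

from sklearn.multiclass import OneVsRestClassifier

ovr_clf = OneVsRestClassifier(SVC(random_state=42))
ovr_clf.fit(X_train[:2000], y_train[:2000])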
• Training an SGDClassifier on a multiclass dataset and using it to make predictions is
just as easy:
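A sketch; for SGDClassifier, Scikit-Learn automatically uses the OvR strategy:

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train)  # multiclass targets this time
sgd_clf.predict([some_digit])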
Error Analysis
• Let’s look at the confusion matrix. For this, you first need to make
predictions using the cross_val_predict() function; then you can pass
the labels and predictions to the confusion_matrix() function
• Since there are now 10 classes instead of 2, the confusion matrix will contain quite a lot of numbers, and it may be hard to read. A colored diagram of the confusion matrix is much easier to analyze. To plot such a diagram, use the ConfusionMatrixDisplay.from_predictions() function:
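A sketch of both steps:

from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Multiclass out-of-fold predictions this time
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train, cv=3)
ConfusionMatrixDisplay.from_predictions(y_train, y_train_pred)
plt.show()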
• This confusion matrix looks pretty good: most images are on the main diagonal, which means that they were classified correctly.
• Notice that the cell on the diagonal in row #5 and column #5 looks slightly darker than the other digits.
• This could be because the model made more errors on 5s, or because there are fewer 5s in the dataset than the other digits.
• It’s important to normalize the confusion matrix by dividing each
value by the total number of images in the corresponding (true)
class (i.e., divide by the row’s sum).
• This can be done simply by setting normalize="true". We can also
specify the values_format=".0%" argument to show percentages
with no decimals.
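For example:

ConfusionMatrixDisplay.from_predictions(y_train, y_train_pred,
                                        normalize="true", values_format=".0%")
plt.show()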
• Now we can easily see that only 82% of the images of 5s were classified correctly.
• The most common error the model made with images of 5s was to misclassify them as 8s: this happened for 10% of all 5s.
• But only 2% of 8s got misclassified as 5s; confusion matrices are generally not symmetrical.
• If you want to make the errors stand out more, you can try putting zero
weight on the correct predictions.
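A sketch using the sample_weight argument:

# Zero weight on correct predictions, so only the errors remain visible
sample_weight = (y_train_pred != y_train)
ConfusionMatrixDisplay.from_predictions(y_train, y_train_pred,
                                        sample_weight=sample_weight,
                                        normalize="true", values_format=".0%")
plt.show()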
Multilabel Classification
• Until now, each instance has always been assigned to just one class. But
in some cases you may want your classifier to output multiple classes for
each instance.
• Consider a face-recognition classifier: what should it do if it recognizes
several people in the same picture?
• It should attach one tag per person it recognizes. Say the classifier has
been trained to recognize three faces: Alice, Bob, and Charlie.
• Then when the classifier is shown a picture of Alice and Charlie, it should
output [True, False, True] (meaning “Alice yes, Bob no, Charlie yes”).
• Such a classification system that outputs multiple binary tags is called a
multilabel classification system
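A sketch of such a system on MNIST, building two binary labels per digit (this is the code that the next paragraph walks through):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= '7')                 # is the digit 7, 8, or 9?
y_train_odd = (y_train.astype('int8') % 2 == 1)  # is the digit odd?
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)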
This code creates a y_multilabel array containing two target labels for each digit
image: the first indicates whether or not the digit is large (7, 8, or 9), and the second
indicates whether or not it is odd. Then the code creates a KNeighborsClassifier
instance, which supports multilabel classification (not all classifiers do), and trains this
model using the multiple targets array.
• There are many ways to evaluate a multilabel classifier, and selecting the right metric really
depends on your project.
• One approach is to measure the F1 score for each individual label (or any other binary classifier metric discussed earlier), then simply compute the average score.
• The following code computes the average F1 score across all labels:
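A sketch (this assumes all labels are equally important, hence average="macro"):

y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")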
Multioutput Classification
• The last type of classification task we’ll discuss here is called multioutput–
multiclass classification (or just multioutput classification).
• It is a generalization of multilabel classification where each label can be
multiclass
• Unlike multiclass classification (where each instance is assigned to one class
from a set of classes), multioutput classification involves predicting multiple
labels for each instance
• Example Scenario:
Imagine a movie recommendation system that predicts:
• The genre(s) of the movie (Action, Comedy, Drama, etc.)
• The target audience (Kids, Teens, Adults)
• The expected rating (Low, Medium, High)
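A purely hypothetical sketch of this scenario; the feature matrix and target columns below are random placeholders, and we use KNeighborsClassifier only because it supports multioutput targets natively:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
X_movies = rng.random((100, 5))        # made-up movie features
# Two multiclass targets per movie, e.g. audience (0-2) and rating (0-2)
y_movies = rng.integers(0, 3, size=(100, 2))

multi_clf = KNeighborsClassifier()
multi_clf.fit(X_movies, y_movies)
multi_clf.predict(X_movies[:1])        # one prediction per output, e.g. [[2, 1]]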
A simple linear model
life_satisfaction = θ0 + θ1 × GDP_per_capita
This model is just a linear function of the input feature GDP_per_capita.
θ0 and θ1 are the model’s parameters
• More generally, a linear model makes a prediction by simply
computing a weighted sum of the input features, plus a constant
called the bias term (also called the intercept term)
Linear regression model prediction (vectorized form)
ŷ = θᵀ · x
where θᵀ is the transpose of θ (a row vector instead of a column vector) and θᵀ · x is the matrix multiplication of θᵀ and x.
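A tiny NumPy sketch of this prediction (the θ and feature values are arbitrary, with x0 = 1 so the bias term is included in the dot product):

import numpy as np

theta = np.array([4.85, 4.91e-5])  # [θ0, θ1], arbitrary example values
x = np.array([1.0, 22587.0])       # [x0 = 1, GDP_per_capita]
y_hat = theta @ x                  # θᵀ · x: weighted sum plus bias term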
Logistic Regression
• Logistic regression (also called logit regression) is commonly used to
estimate the probability that an instance belongs to a particular class
• If the estimated probability is greater than a given threshold, then the model predicts that the instance belongs to that class (called the positive class, labeled “1”); otherwise it predicts that it does not (i.e., it belongs to the negative class, labeled “0”).
• This makes it a binary classifier.
• Like a linear regression model, a logistic regression model computes a
weighted sum of the input features plus a bias term, but instead of
outputting the result directly like the linear regression model does, it
outputs the logistic of this result
The logistic, noted σ(·), is a sigmoid function (i.e., S-shaped) that outputs a number between 0 and 1:
σ(t) = 1 / (1 + exp(–t))
Once the logistic regression model has estimated the probability p̂ = hθ(x) that an instance x belongs to the positive class, it can make its prediction ŷ easily: it predicts 1 if p̂ ≥ 0.5, and 0 otherwise.
Notice that σ(t) < 0.5 when t < 0, and σ(t) ≥ 0.5 when t ≥ 0, so a logistic regression model using the default threshold of 50% probability predicts 1 if θᵀx is positive and 0 if it is negative.
Training and Cost Function
• The objective of training is to set the parameter vector θ so that the
model estimates high probabilities for positive instances (y = 1) and
low probabilities for negative instances (y = 0).
• This idea is captured by the cost function shown below for a single training instance x:
c(θ) = –log(p̂) if y = 1, and c(θ) = –log(1 – p̂) if y = 0
• The cost function over the whole training set is the average cost over all training instances. It can be written in a single expression called the log loss:
J(θ) = –(1/m) Σᵢ [ y⁽ⁱ⁾ log(p̂⁽ⁱ⁾) + (1 – y⁽ⁱ⁾) log(1 – p̂⁽ⁱ⁾) ]
Iris Dataset
• This is a famous dataset that contains the sepal and petal length and
width of 150 iris flowers of three different species:
Iris setosa, Iris versicolor, and Iris virginica
Let’s try to build a classifier to detect the Iris virginica type based only on
the petal width feature. The first step is to load the data
Next we’ll split the data and train a logistic regression model on the training
set:
Let’s look at the model’s estimated probabilities for flowers with petal
widths varying from 0 cm to 3 cm
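A sketch covering all three steps (loading, splitting and training, then estimating probabilities):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
X = iris.data[["petal width (cm)"]].values
y = iris.target_names[iris.target] == 'virginica'
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)

# Estimated probabilities for petal widths from 0 cm to 3 cm
X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)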
Softmax Regression
• The logistic regression model can be generalized to support multiple
classes directly, without having to train and combine multiple binary
classifiers. This is called softmax regression, or multinomial logistic
regression
• When given an instance x, the softmax regression model first computes a score s_k(x) for each class k, then estimates the probability of each class by applying the softmax function:
p̂_k = σ(s(x))_k = exp(s_k(x)) / Σⱼ exp(s_j(x))
where:
• K is the number of classes,
• s(x) is a vector containing the scores of each class for the instance x, and
• σ(s(x))_k is the estimated probability that the instance x belongs to class k, given the scores of each class for that instance.
• Just like the logistic regression classifier, by default the softmax regression classifier predicts the class with the highest estimated probability (which is simply the class with the highest score):
ŷ = argmax_k σ(s(x))_k = argmax_k s_k(x)
Minimizing the cost function shown below, called the cross entropy, should lead to this objective, because it penalizes the model when it estimates a low probability for a target class. Cross entropy is frequently used to measure how well a set of estimated class probabilities matches the target classes:
J(Θ) = –(1/m) Σᵢ Σₖ y_k⁽ⁱ⁾ log(p̂_k⁽ⁱ⁾)
where y_k⁽ⁱ⁾ is the target probability that the iᵗʰ instance belongs to class k (in general, either 1 or 0).
• Let’s use softmax regression to classify the iris plants into all three classes.
ScikitLearn’s LogisticRegression classifier uses softmax regression
automatically when you train it on more than two classes It also applies ℓ
regularization by default, which you can control using the hyperparameter
C, as mentioned earlier:
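A sketch, reusing the iris data loaded above and two features this time:

X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = iris["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

softmax_reg = LogisticRegression(C=30, random_state=42)
softmax_reg.fit(X_train, y_train)

# Predict the class and the per-class probabilities for one flower
softmax_reg.predict([[5, 2]])
softmax_reg.predict_proba([[5, 2]])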