Dimensionality reduction
by
Keerthan Kumar T G
• Dimensionality reduction is a technique used to reduce the
number of features in a dataset while retaining as much of the
important information as possible.
• In other words, it is a process of transforming high-dimensional
data into a lower-dimensional space that still preserves the essence
of the original data.
• In machine learning, high-dimensional data refers to data
with a large number of features or variables. A common problem
with such data is the curse of dimensionality.
• The curse of dimensionality refers to the challenges and
problems that arise when analyzing and organizing data
in high-dimensional spaces.
• In practice, model performance often deteriorates as the
number of features increases: the complexity of the model grows
with the number of features, and it becomes harder to find a
good solution.
Key Issues Associated with the Curse of
Dimensionality
1.Sparsity of Data:
In high-dimensional spaces, data points become spread out, making it difficult to find
meaningful patterns or clusters. For instance, the distance between points tends to
increase, reducing the notion of "closeness."
2.Increased Computational Complexity:
The computational cost of processing high-dimensional data increases exponentially,
making algorithms slower and more resource-intensive.
3.Overfitting in Machine Learning:
High-dimensional datasets often allow models to fit the training data very well, but this
can lead to overfitting, where the model performs poorly on new data.
4.Need for More Data:
As the number of dimensions increases, the amount of data needed to cover the feature
space adequately grows rapidly; without it, models cannot learn reliable patterns.
• Three views of the same four points. Left: As numbers, where the links are unclear. Centre: As four plotted
points. Right: As four points that lie on a circle.
It shows the x and y coordinates of 4 points. Looking at the numbers it is hard to see any
correlation between the points, and even when they are plotted it simply looks like they
might form corners of a rotated rectangle.
However, the plot on the right of the figure shows that they are simply a set of four points
from a circle (in fact, the points at angles 3π/6, 4π/6, 7π/6, and 11π/6), and using this one
coordinate, the angle, makes the data a lot easier to understand and analyse.
Example
•In 2D, the nearest neighbor might actually feel "near" in a
human-intuitive sense (e.g., close on a 2D plane).
•In 100D, most points are almost equidistant, making it difficult to
distinguish which point is the true nearest neighbor.
•In a high-dimensional dataset (e.g., images with 1,000 features), you
may need exponentially more data to effectively train a model.
Without sufficient data, the model may fail to generalize well,
overfitting to noise in the training data.
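The distance-concentration effect can be checked with a short simulation (an illustrative sketch, not taken from the slides; the sample sizes and dimensions are arbitrary choices):

# Compare nearest vs. farthest neighbour distances in 2D and in 100D.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 100):
    X = rng.random((1000, d))            # 1000 random points in the unit hypercube
    q = rng.random(d)                    # a query point
    dists = np.linalg.norm(X - q, axis=1)
    ratio = dists.min() / dists.max()    # approaches 1 as distances concentrate
    print(f"{d:>3}D: nearest/farthest distance ratio = {ratio:.3f}")

In 2D the ratio is small, so the nearest neighbour really is much closer than the farthest; in 100D the ratio is much closer to 1, which is exactly the "almost equidistant" problem described above.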
• An Example of dimensionality reduction can be discussed
through a simple e-mail classification problem, where we
need to classify whether the e-mail is spam or not.
• This can involve a large number of features, such as whether
or not the e-mail has a generic title, the content of the e-
mail, Keywords, whether the e-mail uses a template, etc.
• However, some of these features may overlap.
❑Generic Title and Presence of Certain Keywords
❑Feature 1: Generic Title: This feature flags e-mails with generic titles such as:
❑"Congratulations!"
❑"Special Offer!"
❑"Limited Time!"
❑Feature 2: Presence of Keywords: This feature counts occurrences of words in the e-mail body
like:
❑"offer"
❑"limited"
❑"free"
❑There is an overlap between them. For example:
❑If the title is "Special Offer!" (detected by Feature 1), it likely contains the keyword "offer", which
contributes to Feature 2.
❑If both "Generic Title" and "Presence of Keywords" are retained, they may double-count the same
information, misleading the model about the true importance of these features.
A 3-D classification problem can be hard to
visualize, whereas a 2-D one can be
mapped to a simple 2-dimensional space,
and a 1-D problem to a simple line.
The below figure illustrates this concept,
where a 3-D feature space is split into two
2-D feature spaces, and later, if found to be
correlated, the number of features can be
reduced even further.
Could we ignore the X feature as well?
How do we overcome the curse of dimensionality?
• There are three different ways to do dimensionality reduction. The first is feature
selection, which typically means looking through the features that are available
and seeing whether or not they are actually useful, i.e., correlated to the output
variables.
• The second method is feature derivation, which means deriving new features
from the old ones, generally by applying transforms to the dataset that simply
change the axes (coordinate system) of the graph by moving and rotating them.
• The third method is simply to use clustering in order to group together similar
datapoints, and to see whether this allows fewer features to be used.
• These three approaches work in quite different ways, but each reduces the number of
features that the learner has to deal with.
The first method we look at in detail to address the curse of dimensionality is Linear Discriminant Analysis.
• What is Linear Discriminant Analysis?
• Linear Discriminant Analysis (LDA), also known as Normal
Discriminant Analysis or Discriminant Function Analysis, is
a dimensionality reduction technique primarily utilized
in supervised classification problems.
• LDA assumes that the data has a Gaussian distribution and that
the covariance matrices of the different classes are equal. It also
assumes that the data is linearly separable, meaning that a
linear decision boundary can accurately classify the different
classes.
• Suppose we have two sets of data
points belonging to two different
classes that we want to classify. As
shown in the given 2D graph, when
the data points are plotted on the
2D plane, there’s no straight line
that can separate the two classes of
data points completely. Hence, in
this case, LDA (Linear Discriminant
Analysis) is used which reduces
the 2D graph into a 1D graph in
order to maximize the separability
between the two classes.
• Here, Linear Discriminant Analysis uses both
axes (X and Y) to create a new axis and
projects data onto a new axis in a way to
maximize the separation of the two
categories and hence, reduces the 2D graph
into a 1D graph.
• Two criteria are used by LDA to create a new
axis:
1.Maximize the distance between the means of
the two classes.
2.Minimize the variation within each class
The covariance matrix can tell us about the scatter within a dataset, which is the
amount of spread that there is within the data.
The way to find this scatter is to multiply the covariance of each class by p_c, the probability of that class (that is, the number
of datapoints in that class divided by the total number). Adding up these values for all of the
classes gives us a measure of the within-class scatter of the dataset:
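In the notation just described (with \Sigma_c denoting the covariance matrix of class c), a standard form of this within-class scatter is

    S_W = \sum_{c} p_c \, \Sigma_c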
If our dataset is easy to separate into classes,
then this within-class scatter should be
small, so that each class is tightly clustered
together. However, to be able to separate the
data, we also want the distance between the
classes to be large. This is known as the between-class
scatter and is a significantly simpler
computation, simply looking at the difference
in the means:
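For the two-class case a standard form (assuming μ_1 and μ_2 are the two class means) is

    S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T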
• Figure 6.4 shows two projections of the dataset onto a straight line. For the projection on the left it is clear
that we can’t separate out the two classes, while for the one on the right we can. So we just need to find a
way to compute a suitable projection.
Two different possible projection lines. The one on the left fails to separate the classes.
• The projection of the data can be written as z = w^T · x for datapoint x. This gives us a scalar that is the
distance along the w vector that we need to go to find the projection of point x.
• Since the mean can be treated as a datapoint, we can project it as well: μ'_c = w^T · μ_c.
• The within-class and between-class scatters are then computed in this projected space.
• The goal is to maximize the ratio of between-class scatter to within-
class scatter:
• To find the w that maximizes this ratio, differentiate it with respect to w and set the
derivative to zero. The derivative simplifies to:
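Written out (these are the standard forms of the two quantities referred to above), the ratio is Fisher's criterion

    J(w) = \frac{w^T S_B w}{w^T S_W w}

and setting its derivative to zero gives the generalized eigenvalue problem

    S_B w = \lambda S_W w

so w is an eigenvector of S_W^{-1} S_B; for two classes this reduces to w being proportional to S_W^{-1}(μ_1 - μ_2).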
• Basics
Covariance
Variance and Covariance:
◦ Measures of the "spread" of a set of points around their center of mass (mean)
Variance:
◦ Measures the deviation from the mean for points in one dimension
Covariance:
◦ Measures how much each of the dimensions varies from the mean with respect
to the others
• Covariance is measured between two dimensions
• Covariance indicates whether there is a relationship between two dimensions
• The covariance of a dimension with itself is simply its variance
Eigenvalues & Eigenvectors
An eigenvector of a square matrix A is a nonzero vector x such that for some number λ, we have
the following:
Ax = λx
We call λ an eigenvalue.
For example, if Ax = 3x for some nonzero vector x, then λ = 3 is an eigenvalue of A.
Notice that if x = cy, where c is some number, then
A(cy) = λcy
cAy = λcy
Ay = λy
Therefore, every nonzero constant multiple of an eigenvector is also an eigenvector, meaning there are infinitely
many eigenvectors,
while, as we'll see, there are only a finite number of eigenvalues (at most n for an n × n matrix). Each eigenvalue has its
own set of eigenvectors.
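This is easy to check numerically; the sketch below (the 2 × 2 matrix is an arbitrary illustrative choice) verifies Ax = λx with NumPy:

# Verify the eigenvalue equation A x = λ x for a small example matrix.
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)    # columns of eigvecs are the eigenvectors

for i, lam in enumerate(eigvals):
    x = eigvecs[:, i]
    # each line prints an eigenvalue (5.0 and 2.0 here) followed by True
    print(lam, np.allclose(A @ x, lam * x))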
1. Compute the mean vector for each class.
2. Compute the overall mean for the entire dataset.
3. (Optional) Compute the covariance matrix, which describes how the features relate to each other and vary together;
in practice we work with the scatter matrices instead.
4. Compute
1. The within-class scatter matrix: measures how spread out the features are within each class.
2. The between-class scatter matrix: measures how far apart the class means are relative to the overall mean.
5. Compute the eigenvalues and eigenvectors for the scatter matrices.
1. The eigenvectors define directions of maximum class separability, and the eigenvalues indicate how much
separation they provide.
6. Pick the top k−1 eigenvectors (where k is the number of classes; k = 3 for Iris). This means we select 2 eigenvectors for
projection.
• k = 3 (Setosa, Versicolor, Virginica)
• Feature space after LDA: k−1 = 2 dimensions.
• We select the 2 eigenvectors corresponding to the 2 largest eigenvalues for the projection.
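The same procedure is available off the shelf in scikit-learn; a minimal sketch on the Iris data (default solver and settings):

# LDA on the Iris dataset with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                   # 150 samples, 4 features, 3 classes

lda = LinearDiscriminantAnalysis(n_components=2)    # at most k-1 = 2 components
X_lda = lda.fit_transform(X, y)

print(X_lda.shape)                     # (150, 2)
print(lda.explained_variance_ratio_)   # share of between-class variance per component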
EXAMPLE
Two-dimensional data
Direct computation without the covariance matrix
where m is the overall mean, and m_i
and N_i are the sample mean and
size of the respective classes.
We then keep the eigenvectors with the
largest eigenvalues. These
eigenvectors define the
directions (or axes) in the
feature space that maximize the
separation between different
classes.
Figure: the original data and the new axes found by LDA.
Iris Dataset: Example 2 for LDA
Step 1: Computing the d-dimensional mean vectors
Step 2: Computing the Scatter Matrices
We will compute the two 4x4-dimensional matrices: The within-class and the
between class scatter matrix.
2.1 Within-class scatter matrix Sw
The within-class scatter matrix Sw is computed by the following equation:
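In standard form (the notation used in the Raschka tutorial referenced on these slides), with m_i the mean of class i and D_i its set of samples:

    S_W = \sum_{i=1}^{c} S_i, \qquad S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^T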
Scatter Matrix
A scatter matrix is an estimate of the covariance matrix (up to a scaling factor), used when the covariance cannot be computed or is
costly to compute. Scatter matrices also appear in many dimensionality-reduction methods.
If there are k variables, the scatter matrix has k rows and k columns, i.e. it is a k × k matrix.
https://sebastianraschka.com/Articles/2014_python_lda.html
2.2 Between-class scatter matrix SB
The between-class scatter matrix SB is computed by the following equation:
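A standard form (again following the Raschka tutorial), with N_i samples in class i, is

    S_B = \sum_{i=1}^{c} N_i (m_i - m)(m_i - m)^T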
where m is the overall mean, and m_i and N_i are the sample mean and size of
the respective classes.
Step 3: Solving the generalized eigenvalue problem for the matrix S_W^{-1} S_B
Checking the eigenvector-eigenvalue calculation
A quick check that the eigenvector-eigenvalue calculation is correct is that each pair satisfies S_W^{-1} S_B v = λ v
Step 4: Selecting linear discriminants for the new feature subspace
4.1. Sorting the eigenvectors by decreasing eigenvalues
--The common approach is to rank the eigenvectors from highest to lowest
corresponding eigenvalue and choose the top eigenvectors.
Eigenvalues in decreasing order:
32.2719577997
0.27756686384
5.71450476746e-15
5.71450476746e-15
In LDA, the number of linear discriminants is at most c − 1, where c is the number of class labels.
4.2. Choosing k eigenvectors with the largest eigenvalues
◦ It is now time to construct our d × k-dimensional eigenvector matrix W (here 4 × 2, based on the 2 most informative
eigenpairs)
Step 5: Transforming the samples onto the new subspace
In the last step, we use the d × k-dimensional matrix W that we just computed to transform
our samples onto the new subspace via the equation Y = X × W
(where X is an n × d-dimensional matrix representing the n samples, and Y contains the transformed
n × k-dimensional samples in the new subspace).
Useful website for LINEAR DISCRIMINANT ANALYSIS (LDA)
https://sebastianraschka.com/Articles/2014_python_lda.html#lda-via-scikit-learn
https://github.com/bot13956/linear-discriminant-analysis-iris-dataset/blob/master/LDA_irisdataset.ipynb
LDA – Example 3
Compute the average of each feature for each class.
• Advantages of using LDA
1.It is a simple and computationally efficient algorithm.
2.It can work well even when the number of features is much
larger than the number of training samples.
3.It can handle multicollinearity (correlation between features) in
the data.
Applications of LDA
• Medical: In this field, Linear Discriminant Analysis (LDA) is used
to classify a patient's disease state as mild, moderate, or
severe based on the patient's various parameters and the
treatment they are undergoing. This helps doctors
to intensify or reduce the pace of the treatment.
• Face Recognition: In the field of Computer Vision, face
recognition is a very popular application in which each face is
represented by a very large number of pixel values. Linear
discriminant analysis (LDA) is used here to reduce the number
of features to a more manageable number before the process
of classification.
Principal Component Analysis (PCA)
▪ It is an unsupervised, non-parametric feature-reduction technique.
▪ PCA doesn't require labels (class information). It works purely on the structure
of the data.
▪ It makes no assumptions about the data's distribution.
▪ It is a linear projection method.
▪ PCA finds new axes (principal components) that are linear combinations of the
original features.
▪ What is PCA actually used for?
▪ Pattern Recognition: Helps identify patterns and correlations in high-
dimensional data.
▪ Data Visualization: Reduces dimensions to 2D or 3D for easier visualization
▪ Compression: Reduces storage and computational costs by eliminating less
informative dimensions (e.g., in image compression)
▪ Noise Reduction: Removes redundant features and noise, retaining only the
most important information
▪ The main advantage of PCA is that once we have found these patterns in the data,
we can compress it, i.e. reduce the number of dimensions,
without much loss of information.
▪ Ex. Image compression
• The next few methods that we are going to look at are also involved in computing
transformations of the data in order to identify a lower-dimensional set of axes.
• However, unlike LDA, they are designed for unlabelled data.
• This does not stop them being used for labelled data, since the learning that takes
place in the lower dimensional space can still use the target data, although it does
mean that they miss out on any information that is contained in the targets.
• The idea is that by finding particular sets of coordinate axes, it will become clear
that some of the dimensions are not required.
• This is demonstrated in Figure 6.6, which shows two versions of the same dataset.
• In the first the data are arranged in an ellipse that runs at 45° to the axes, while in
the second the axes have been moved so that the data now runs along the x-axis
and is centred on the origin.
• The potential for dimensionality reduction is in the fact that the y dimension does
not now demonstrate much variability, and so it might be possible to ignore it and
use the x axis values alone without compromising the results of a learning
algorithm.
• PCA Measures how each variable is associated with one another using
a Covariance matrix
• Understands the directions of the spread of our data using Eigenvectors
• Brings out the relative importance of these directions using Eigenvalues
• Suppose we have plotted a scatter plot of random variables, and a line of best fit
is drawn between these points.
• This line of best fit, shows the direction of maximum variance in the dataset.
• The Eigenvector is the direction of that line, while the eigenvalue is a number
that tells us how the data set is spread out on the line which is an Eigenvector.
• We are going to rotate our data to fit these new axes. But what will the
coordinates of the rotated data be?
• The coordinates of the rotated data will be the projections of the original
data onto the new axes (principal components). These coordinates are
referred to as the principal component scores.
• For 2D Data
• A 2D dataset has 2 features.
• Hence, you can get 2 principal components.
• For 3D Data
• A 3D dataset has 3 features.
• Hence, you can get 3 principal components.
1.Variance and Dimensionality:
1. Each principal component captures a certain amount of the total variance in
the data.
2. The first principal component (PC1) always captures the maximum variance.
3. The second principal component (PC2) is orthogonal to PC1 and captures the
next highest variance, and so on.
• Dimensionality Reduction:
• We can retain fewer principal components if we want to reduce
dimensionality (e.g., 2D -> 1D, 3D -> 2D).
• The number of components retained depends on how much variance we
want to preserve.
PCA - Example
1. Get some data.
2. Subtract the mean.
3. Calculate the covariance matrix.
4. Calculate the eigenvectors and eigenvalues of the covariance matrix.
https://www.youtube.com/watch?v=5vgP05YpKdE
http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
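The four steps can be written directly in NumPy (a sketch; the toy data below is invented purely for illustration):

# PCA steps 1-4 on a small synthetic 2D dataset.
import numpy as np

# 1. Get some data: 10 samples, 2 correlated features.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 1)) @ np.array([[2.0, 1.0]]) + 0.1 * rng.normal(size=(10, 2))

# 2. Subtract the mean.
X_centered = X - X.mean(axis=0)

# 3. Calculate the covariance matrix (rowvar=False: columns are features).
cov = np.cov(X_centered, rowvar=False)

# 4. Eigenvectors and eigenvalues of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh: the covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]            # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the first principal component: the principal component scores.
scores = X_centered @ eigvecs[:, :1]
print(eigvals)        # variance captured along each principal axis
print(scores.shape)   # (10, 1)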
Relation with Multi-layer Perceptrons (MLPs)
• Particularly auto-associative neural networks.
• PCA is widely used for dimensionality reduction, which simplifies the input
data and improves training efficiency in neural networks.
• An auto-associative MLP is a type of neural network that learns to
reproduce its input at the output layer. The hidden layer of such a network
often learns representations that resemble the principal components of
the input data.
•PCA is inherently linear—it can only rotate and translate axes. It
cannot capture non-linear relationships in data.
•Neural networks, especially those with non-linear activation functions,
can model complex, non-linear relationships, which PCA cannot
• Linear Limitations of Basic MLPs:
• In a simple auto-associative MLP (with two layers), the hidden layer performs
a task similar to PCA, which is linear.
• This is analogous to a perceptron’s ability to perform only linear tasks.
• When we use a deeper MLP (e.g., four layers) for auto-association,
the network transforms the input data in a more complex way:
• First layer: Computes non-linear transformations of the input data.
• Second layer (bottleneck): Captures patterns akin to PCA but on the non-
linear transformations of the inputs.
• Third and fourth layers: Reconstruct the data to match the original input at
the output layer.
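As an illustration (not from the slides), such a bottleneck auto-associator can be sketched with scikit-learn's MLPRegressor; the layer sizes, activation, and iteration count here are arbitrary choices:

# An auto-associative MLP: the network is trained to reproduce its input.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

autoencoder = MLPRegressor(hidden_layer_sizes=(16, 2, 16),   # 2-unit bottleneck layer
                           activation="tanh",
                           max_iter=5000,
                           random_state=0)
autoencoder.fit(X, X)            # inputs are also the targets
print(autoencoder.score(X, X))   # R^2 of the reconstruction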
Kernel Principal Component Analysis
(KPCA)
• Kernel Principal Component Analysis (KPCA) is a technique used in
machine learning for nonlinear dimensionality reduction. It is an
extension of the classical Principal Component Analysis (PCA)
algorithm.
• Kernel PCA uses a kernel function to project the dataset into a higher-
dimensional feature space, where it can become linearly separable.
• It is similar to the idea behind Support Vector Machines. There are
various kernels, such as the linear, polynomial, and Gaussian (RBF) kernels.
Kernel PCA
• One problem with PCA is that it assumes that the directions of variation are all
straight lines.
• This is often not true. We can use the auto-associator with multiple hidden layers as
just discussed, but there is a very nice extension to PCA that uses the kernel trick
(which is described in Section 8.2) to get around this problem, just as the SVM got
around it for the Perceptron.
• Just as is done there, we apply a (possibly non-linear) function Φ(·) to each datapoint
x that transforms the data into the kernel space, and then perform normal linear PCA
in that space.
• The covariance matrix is defined in the kernel space and is:
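Assuming the transformed data have been centred in the kernel space, the standard form is

    C = \frac{1}{N} \sum_{n=1}^{N} \Phi(x_n) \Phi(x_n)^T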
• This is a computationally expensive algorithm, since it requires computing the kernel
matrix and then the eigenvalues and eigenvectors of that matrix.
• The naïve implementation of the algorithm on the website is O(n³), but with care it is
possible to take advantage of the fact that not all of the eigenvalues are needed, which
can lead to an O(n²) algorithm.
• Figure 6.9 shows the output of kernel PCA when applied to the iris dataset.
• The fact that it can separate this data well is not very surprising since the linear
methods that we have already seen can do it, but it is a useful check of the method.
• A rather more difficult example is shown in Figure 6.10.
• Data are sampled from three concentric circles.
• Clearly, linear PCA would not be able to separate this data, but applying kernel PCA to
this example separates the data using only one component.
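A similar experiment can be reproduced with scikit-learn (a sketch: make_circles gives two concentric circles rather than the three in the figure, and the RBF gamma value is a hand-picked assumption):

# Kernel PCA vs. linear PCA on concentric-circle data.
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA, PCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=1, kernel="rbf", gamma=10)   # gamma chosen by hand
X_kpca = kpca.fit_transform(X)

X_pca = PCA(n_components=1).fit_transform(X)               # linear PCA for comparison

# Along the single kernel-PCA component the two circles should separate;
# along the single linear-PCA component they do not.
print(X_kpca[y == 0].mean(), X_kpca[y == 1].mean())
print(X_pca[y == 0].mean(), X_pca[y == 1].mean())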
Decision Trees in Machine Learning
Decision trees are built on the binary tree data structure
Low computational cost to build the tree
Transparent and easy to understand.
Users can trace decisions step-by-step, increasing trust compared
to "black box" models like neural networks.
Widely used for classification tasks.
Common in real-world applications, such as guiding operators
in customer helplines based on user responses.
• Classification Process:
• Breaks down decisions into sequential questions about features.
• Progresses from the root (base) to leaves, where final classifications are
made.
• Decision trees can be translated into simple if-then rules, making
them versatile for rule-based systems.
• Utilizes a greedy heuristic: Makes locally optimal decisions at each
stage.
A simple decision tree to decide how you will spend the evening.
As a student, deciding what to do in the evening can be
simplified with an algorithm based on a decision tree.
The steps are:
1.Party Check: If friends know about a party, you go to it.
2.Deadline Check: If there's an urgent assignment due,
you study.
3. Pub check: if neither applies and you feel like going out, you might go to the pub.
4. Otherwise: if you're lazy, you'll likely watch TV; else
you study.
One of the reasons that decision trees are popular is that we can turn them into a set of logical disjunctions (if ... then
rules) that then go into program code very simply—the first part of the tree above can be turned into:
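A sketch of that first part in Python (the function and argument names are hypothetical), following the four steps listed above:

# The evening-plan tree written as a chain of if-then rules.
def choose_evening_activity(party, urgent_deadline, feel_like_pub, lazy):
    if party:                  # 1. Party check (root of the tree)
        return "go to the party"
    if urgent_deadline:        # 2. Deadline check
        return "study"
    if feel_like_pub:          # 3. Pub check
        return "go to the pub"
    return "watch TV" if lazy else "study"   # 4. Otherwise

print(choose_evening_activity(party=False, urgent_deadline=True,
                              feel_like_pub=False, lazy=False))   # -> study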
• Decision trees can model complex relationships between input features
and outputs without requiring feature scaling or transformation.
• Decision trees work for both classification (e.g., deciding if an email is
spam or not) and regression (e.g., predicting house prices) tasks.
• Training and prediction with decision trees are computationally efficient,
especially for small- to medium-sized datasets.
• Decision trees do not assume a specific data distribution
• Handles Missing and Noisy Data: They are less sensitive to noise compared
to linear models, though pruning techniques may be needed to prevent
overfitting.
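For instance, a small tree on the Iris data with scikit-learn (a sketch with hand-picked settings; criterion="entropy" matches the information-gain view discussed below):

# A shallow decision-tree classifier on Iris.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))                                   # accuracy on held-out data
print(export_text(tree, feature_names=load_iris().feature_names))   # readable if-then rules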
CONSTRUCTING DECISION TREES
•Decision tree algorithms build trees greedily, starting at the root. They
select the most informative feature at each step.
• Common Algorithms:
• ID3 (Iterative Dichotomiser 3): A widely used method for
constructing decision trees.
• C4.5: An extension of ID3 with improvements like handling
continuous data.
• CART (Classification and Regression Trees): Another popular method
for building decision trees.
Information Theory
• At each step of tree construction, the algorithm selects the feature that provides the most
information about the target variable.
•Information theory measures the amount of uncertainty reduced by knowing a feature.
•Key metrics include:
•Entropy: Measures the level of disorder or uncertainty in the data.
•Information Gain: The reduction in entropy when a dataset is split based on a feature.
•The feature with the highest information gain is chosen as the next node in the tree.
• Analogy with 20 Questions:
• Similar to asking the most informative question first (e.g., "Is it an animal?" before "Is it a
cat?"), the algorithm prioritizes features that yield the most significant classification
improvement.
• Understanding Disorder (Uncertainty)
• Disorder exists when the data points are spread across multiple
classes or outcomes without a clear majority.
• Example: A dataset where 50% of items are labeled "Yes" and 50% are "No" is
highly uncertain because no single class dominates.
• Certainty occurs when all data points belong to a single class.
• Example: A dataset where 100% of items are labeled "Yes" has no
uncertainty.
• Measuring Uncertainty with Entropy
• Reducing Uncertainty in Decision Trees
Entropy in Information Theory
• Shannon proposed information entropy to quantify impurity or
uncertainty in data.
• Entropy quantifies disorder mathematically: H(S) = − Σ_i p_i log2(p_i)
• Where:
• p_i is the proportion of data points in class i.
• The sum is over all possible classes.
• Logarithm base 2 reflects encoding in binary (bits).
• It is defined such that 0 log 0 = 0.
•Entropy lets us measure how much extra information knowing a feature provides.
•Features with a higher entropy reduction (information gain) are more useful for classification
tasks.
• High Entropy: Data is evenly distributed across classes (maximum
uncertainty).
• Low Entropy: Data is skewed toward one class (low uncertainty).
•Minimum Entropy (0):
•If all examples belong to a single class (e.g., all are positive or all are
negative), the entropy is 0. There is no additional information gained
because the outcome is certain.
•Maximum Entropy:
•When the feature perfectly splits the dataset into 50% positive and 50% negative
examples, the entropy reaches its maximum. In this case, knowing the feature is highly
informative.
•For any other proportion of positive and negative examples, the entropy lies between 0
and the maximum value.
•The x-axis represents the proportion of
positive examples (p), ranging from 0 to 1.
•The y-axis represents entropy (H(p))
•Entropy is 0 at p=0 and p=1: No
uncertainty as all examples belong to a
single class.
•Maximum entropy occurs at p=0.5:
Equal proportions of positive and negative
examples lead to maximum uncertainty.
ID3
• ID3 (Iterative Dichotomizer 3) is an algorithm for generating decision
trees.
• It is used for classification tasks by splitting the data into subsets
based on the feature that provides the most information gain.
• The goal is to create a tree where each leaf node corresponds to a
class label.
ID3 Steps
•Calculate Entropy:
•Measure the uncertainty in the dataset (target variable).
•Compute Information Gain:
•For each feature, compute how much uncertainty (entropy) is reduced when splitting on that
feature.
•Information Gain=Entropy (before split)−Weighted Entropy (after split)
•Select the Best Feature:
•The feature with the highest information gain is chosen as the root or internal node.
•Split the Dataset:
•Partition the dataset based on the values of the selected feature.
•Repeat the process recursively for each subset until:
•All instances in a subset belong to the same class.
•There are no features left to split on.
•Build the Tree:
•The process stops when all subsets are pure or no further splits are possible.
• To decide the best feature for classification, Information Gain (IG) is
used. It measures the reduction in entropy when a feature is chosen
to split the dataset: Gain(S, F) = H(S) − Σ_f (|S_f| / |S|) · H(S_f)
Where:
•S: The entire dataset of examples.
•F: A feature from the set of all possible features.
•∣S∣: Total number of examples in S.
•Sf: Subset of S where feature F has value f.
•∣Sf∣: Count of examples in Sf.
•H(S): Entropy of the entire set S.
•H(Sf): Entropy of the subset Sf.
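Both quantities are short to compute in code; the sketch below uses a tiny invented dataset purely for illustration:

# Entropy and information gain for a toy categorical dataset.
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_i p_i * log2(p_i) over the class proportions."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """Gain(S, F) = H(S) - sum_f |S_f|/|S| * H(S_f)."""
    total = len(labels)
    gain = entropy(labels)
    for v in set(row[feature_index] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature_index] == v]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Toy data: feature 0 = "Deadline?", feature 1 = "Party?", label = activity.
rows = [("urgent", "yes"), ("urgent", "no"), ("near", "yes"), ("none", "no")]
labels = ["party", "study", "party", "pub"]
print(information_gain(rows, labels, 0))   # gain from splitting on "Deadline?"
print(information_gain(rows, labels, 1))   # gain from splitting on "Party?"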
Algorithm
ID3(S, Features):
    if all examples in S have the same label:
        return Leaf(Label)

    if Features is empty:
        return Leaf(MostCommonLabel(S))

    F_best = feature in Features that maximizes Information Gain
    Tree = Node(F_best)

    for each value f of F_best:
        S_f = subset of S where F_best = f
        if S_f is empty:
            add Leaf(MostCommonLabel(S)) as the branch for value f
        else:
            Subtree = ID3(S_f, Features - {F_best})
            add Subtree as the branch for value f

    return Tree
EX: