Module 6
What is the curse of dimensionality? Explain the PCA dimensionality reduction
technique in detail.
● The "curse of dimensionality" refers to the challenges that arise when
dealing with high-dimensional data.
● As the number of features or dimensions increases, the amount of data
required to effectively cover the feature space grows exponentially.
● This can lead to various problems such as increased computational
complexity, sparsity of data, and difficulty in visualization and
interpretation.
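A small NumPy sketch (illustrative only, on randomly generated data) shows one symptom of the curse of dimensionality: as the number of dimensions grows, distances between points become nearly indistinguishable, which weakens distance-based methods.

```python
import numpy as np

rng = np.random.default_rng(0)

# Compare how distinguishable nearest/farthest neighbours are as dimensionality grows.
for d in (2, 10, 100, 1000):
    points = rng.random((500, d))   # 500 random points in the unit hypercube
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    # A ratio close to 1 means the "nearest" and "farthest" points are almost
    # equally far away, so the notion of a close neighbour loses meaning.
    print(f"d={d:4d}  min/max distance ratio = {dists.min() / dists.max():.3f}")
```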
Principal Component Analysis (PCA) is a dimensionality reduction technique
commonly used to address the curse of dimensionality.
1. Data Transformation:
- PCA transforms high-dimensional data into a new coordinate system, where
the axes (principal components) are orthogonal to each other and capture the
maximum variance in the data.
2. Principal Components:
- Principal components are linear combinations of the original features.
- They are ordered by the amount of variance they explain in the data.
- The first principal component captures the most variance, the second captures
the second most, and so on.
3. Dimensionality Reduction:
- PCA retains only a subset of the principal components that capture most of the
variance in the data.
- By selecting a smaller number of principal components, PCA effectively
reduces the dimensionality of the data.
4. Eigenvalue Decomposition:
- PCA uses eigenvalue decomposition to find the principal components.
- It calculates the covariance matrix of the data and then finds the eigenvectors
(principal components) corresponding to the largest eigenvalues.
5. Variance Retention:
- PCA allows users to specify the desired amount of variance to be retained in
the reduced-dimensional space.
- By selecting the appropriate number of principal components, users can
balance the trade-off between dimensionality reduction and information
preservation.
6. Applications:
- PCA is widely used in various fields, including image processing, signal
processing, finance, and genetics.
- It helps in data compression, visualization, noise reduction, and feature
extraction.
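As a rough illustration of variance retention (point 5 above), the following sketch runs scikit-learn's PCA on synthetic data; the dataset, variable names, and the 95% variance threshold are assumptions chosen only for demonstration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic 200 x 50 data: a few strong directions of variance plus noise.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(200, 50))

# Keep only enough principal components to retain ~95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print("original dimensions :", X.shape[1])
print("retained components :", pca.n_components_)
print("variance explained  :", pca.explained_variance_ratio_.sum().round(3))
```

Passing a fraction between 0 and 1 as n_components lets PCA pick the smallest number of components whose cumulative explained variance reaches that threshold, which is the trade-off described in point 5.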
Feature Selection and Feature Extraction
● Feature Selection:
1. Definition:
- Feature selection involves choosing a subset of the most relevant features from
the original set of features.
- The goal is to improve model performance, reduce overfitting, and enhance
interpretability.
2. Methods:
- Filter Methods: Evaluate the relevance of features independently of the
model.
- Wrapper Methods: Use a specific machine learning model to evaluate the
importance of features.
- Embedded Methods: Feature selection is integrated into the model training
process.
3. Techniques:
- Univariate Selection: Select features based on statistical tests such as
chi-square, ANOVA, or correlation.
- Recursive Feature Elimination (RFE): Iteratively removes the least
important features based on model performance.
- Feature Importance: Uses algorithms like decision trees or random forests to
measure feature importance.
4. Benefits:
- Reduces overfitting by removing irrelevant or redundant features.
- Improves model interpretability and reduces computational complexity.
- Can lead to faster model training and better generalization performance.
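A brief scikit-learn sketch of two of the techniques listed above: univariate (filter) selection with a chi-square test and wrapper-style Recursive Feature Elimination with a decision tree. The Iris dataset and the parameter values are assumptions used only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Filter method: keep the 2 features with the highest chi-square score.
X_filter = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)

# Wrapper method: recursively eliminate features using a decision tree's importances.
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=2)
X_wrapper = rfe.fit_transform(X, y)

print("original features:", X.shape[1])
print("after filter     :", X_filter.shape[1])
print("after RFE        :", X_wrapper.shape[1], "selected mask:", rfe.support_)
```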
● Feature Extraction:
1. Definition:
- Feature extraction involves transforming the original features into a new set of
features that captures the essential information.
- It aims to reduce dimensionality, remove noise, and enhance the representation
of the data.
2. Methods:
- Principal Component Analysis (PCA): Linear transformation to find
orthogonal principal components.
- Linear Discriminant Analysis (LDA): Supervised technique that maximizes
class separability.
- Non-linear Techniques: Kernel PCA, t-distributed Stochastic Neighbor
Embedding (t-SNE), autoencoders.
3. Techniques:
- PCA: Projects data onto a lower-dimensional space while maximizing
variance.
- LDA: Finds the linear combination of features that best separates different
classes.
- Non-linear Techniques: Capture complex relationships in the data that linear
methods cannot.
4. Benefits:
- Reduces dimensionality, which can lead to faster computation and improved
model performance.
- Enhances the representation of the data by capturing underlying patterns or
structures.
- Helps in visualizing high-dimensional data and understanding its underlying
characteristics.
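To contrast the extraction methods listed above, a minimal sketch applies PCA (unsupervised), LDA (supervised), and kernel PCA (non-linear) to the same data; the Iris dataset and the two-component setting are assumptions for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: unsupervised, maximizes the variance of the projected data.
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised, maximizes separation between the class labels.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

# Kernel PCA: a non-linear variant using an RBF kernel.
X_kpca = KernelPCA(n_components=2, kernel="rbf").fit_transform(X)

print(X_pca.shape, X_lda.shape, X_kpca.shape)   # each is (150, 2)
```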
Feature Selection vs. Feature Extraction:
1. Feature Selection selects a subset of relevant features from the original set of features, whereas Feature Extraction extracts a new set of features that are more informative and compact.
2. Feature Selection reduces the dimensionality of the feature space and simplifies the model, whereas Feature Extraction captures the essential information from the original features and represents it in a lower-dimensional feature space.
3. Feature Selection methods can be categorized into filter, wrapper, and embedded methods, whereas Feature Extraction methods can be categorized into linear and nonlinear methods.
4. Feature Selection requires domain knowledge and feature engineering, whereas Feature Extraction can be applied to raw data without feature engineering.
5. Feature Selection can improve the model's interpretability and reduce overfitting, whereas Feature Extraction can improve model performance and handle nonlinear relationships.
6. Feature Selection may lose some information and introduce bias if the wrong features are selected, whereas Feature Extraction may introduce some noise and redundancy if the extracted features are not informative.
Principal Component Analysis (PCA)
1. Purpose:
- PCA is a technique used for dimensionality reduction.
- It transforms high-dimensional data into a lower-dimensional space while
preserving the most important information.
2. Process:
- PCA identifies the directions (principal components) in which the data varies
the most.
- It projects the data onto these principal components, effectively reducing the
dimensionality.
3. Principal Components:
- Principal components are orthogonal vectors that capture the maximum
variance in the data.
- The first principal component explains the most variance, the second explains
the second most, and so on.
4. Mathematical Calculation:
- PCA calculates the covariance matrix of the data.
- It then finds the eigenvectors (principal components) corresponding to the
largest eigenvalues of the covariance matrix.
5. Dimensionality Reduction:
- PCA retains only the top k principal components that capture most of the
variance in the data.
- This reduces the dimensionality of the data from n dimensions to k
dimensions (k < n), as shown in the sketch after this list.
6. Applications:
- PCA is widely used in data preprocessing, feature extraction, and visualization.
- It helps in reducing noise, speeding up machine learning algorithms, and
improving interpretability.
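The covariance and eigenvector steps in points 4 and 5 can be written out directly in NumPy. This is a minimal sketch on random data, not an optimized implementation (library routines typically compute the components via SVD instead); the array shapes and k = 2 are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))      # 100 samples, n = 6 features
k = 2                              # target dimensionality, k < n

# 1. Center the data (PCA assumes zero-mean features).
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the features (n x n).
cov = np.cov(X_centered, rowvar=False)

# 3. Eigen-decomposition; eigh is used because the covariance matrix is symmetric.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort eigenvectors by decreasing eigenvalue and keep the top k as principal components.
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order[:k]]

# 5. Project the data onto the top-k principal components.
X_reduced = X_centered @ components
print(X_reduced.shape)             # (100, 2)
```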