Unit 1 Machine Learning

Machine learning

Machine learning is a subset of Artificial Intelligence (AI). Machine learning (ML) allows
computers to learn and make decisions without being explicitly programmed. It involves
feeding data into algorithms to identify patterns and make predictions on new
data. Machine learning is used in various applications, including image and speech
recognition, natural language processing, and recommender systems.
The Machine Learning algorithm's operation is depicted in the following block diagram:

A machine “learns” by recognizing patterns and improving its performance on a task based on
data, without being explicitly programmed.
The process involves:
1. Data Input: Machines require data (e.g., text, images, numbers) to analyze.
2. Algorithms: Algorithms process the data, finding patterns or relationships.
3. Model Training: Machines learn by adjusting their parameters based on the input data
using mathematical models.
4. Feedback Loop: The machine compares predictions to actual outcomes and corrects
errors (via optimization methods like gradient descent).
5. Experience and Iteration: Repeating this process with more data improves the
machine’s accuracy over time.
6. Evaluation and Generalization: The model is tested on unseen data to ensure it
performs well on real-world tasks.
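
To make steps 3 to 5 concrete, here is a minimal, hypothetical sketch (Python with NumPy) of a feedback loop that fits a single-parameter linear model with gradient descent; the data, learning rate, and number of iterations are invented purely for illustration.

    import numpy as np

    # Toy data: y is roughly 3 * x plus noise (made up for illustration)
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=100)
    y = 3.0 * x + rng.normal(0, 1, size=100)

    w = 0.0      # model parameter, initialised arbitrarily
    lr = 0.01    # learning rate

    for epoch in range(200):
        y_pred = w * x                    # model prediction
        error = y_pred - y                # compare prediction with actual outcome
        grad = 2 * np.mean(error * x)     # gradient of mean squared error w.r.t. w
        w -= lr * grad                    # correct the parameter (feedback loop)

    print(f"learned weight: {w:.3f}")     # should end up close to 3.0

Each pass through the loop is one round of "experience and iteration": the error feeds back into the parameter update until the model generalizes to the underlying trend.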
Features of Machine Learning:
o Machine learning uses data to detect various patterns in a given dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is similar to data mining, as both deal with huge amounts of data.
Basic components of learning process
The learning process, whether by a human or a machine, can be divided into four components,
namely, data storage, abstraction, generalization and evaluation.
Data storage
Facilities for storing and retrieving huge amounts of data are an important component of the
learning process. Humans and computers alike utilize data storage as a foundation for advanced
reasoning.
• In a human being, the data is stored in the brain and data is retrieved using electrochemical
signals.
• Computers use hard disk drives, flash memory, random access memory and similar devices
to store data and use cables and other technology to retrieve data.
Abstraction
The second component of the learning process is known as abstraction. Abstraction is the
process of extracting knowledge about stored data. This involves creating general concepts
about the data as a whole. The creation of knowledge involves application of known models
and creation of new models. The process of fitting a model to a dataset is known as training.
When the model has been trained, the data is transformed into an abstract form that summarizes
the original information.
Generalization
The third component of the learning process is known as generalization. The term
generalization describes the process of turning the knowledge about stored data into a form that
can be utilized for future action. These actions are to be carried out on tasks that are similar,
but not identical, to those that have been seen before. In generalization, the goal is to discover
those properties of the data that will be most relevant to future tasks.
Evaluation
Evaluation is the last component of the learning process. It is the process of giving feedback
to the user to measure the utility of the learned knowledge. This feedback is then utilised to
effect improvements in the whole learning process.
Classification of Machine Learning
At a broad level, machine learning can be classified into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
Supervised machine learning
Supervised machine learning is a fundamental approach for machine learning and artificial
intelligence. It involves training a model using labelled data, where each input comes with a
corresponding correct output. The process is like a teacher guiding a student—hence the term
“supervised” learning.
Supervised learning is a type of machine learning where a model is trained on labelled data,
meaning each input is paired with the correct output. The model learns by comparing its
predictions with the actual answers provided in the training data. Over time, it adjusts itself to
minimize errors and improve accuracy. The goal of supervised learning is to make accurate
predictions when given new, unseen data. For example, if a model is trained to recognize
handwritten digits, it will use what it learned to correctly identify new numbers it hasn’t seen
before.

How Supervised Machine Learning Works?


A supervised learning algorithm works with input features and corresponding output
labels. The process works through:
• Training Data: The model is provided with a training dataset that includes input data
(features) and corresponding output data (labels or target variables).
• Learning Process: The algorithm processes the training data, learning the relationships
between the input features and the output labels. This is achieved by adjusting the
model’s parameters to minimize the difference between its predictions and the actual
labels.
After training, the model is evaluated using a test dataset to measure its accuracy and
performance. Then the model’s performance is optimized by adjusting parameters and using
techniques like cross-validation to balance bias and variance. This ensures the model
generalizes well to new, unseen data.
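
As an illustrative sketch of this train/evaluate workflow, assuming scikit-learn is available, the snippet below fits a classifier on labelled data, checks it on a held-out test set, and uses cross-validation; the dataset and model here are placeholders, not prescribed choices.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.linear_model import LogisticRegression

    # Labelled data: input features X and corresponding output labels y
    X, y = load_iris(return_X_y=True)

    # Hold out a test set to measure performance on unseen data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)   # learning process: adjust parameters to the training data

    print("test accuracy:", model.score(X_test, y_test))

    # Cross-validation helps balance bias and variance when tuning the model
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print("cross-validation accuracy:", scores.mean())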
Categories of Supervised Machine Learning:
Supervised machine learning can be classified into two types of problems, which are given
below:
• Classification
• Regression
Classification: Classification algorithms are used to solve classification problems in which
the output variable is categorical, such as "Yes" or "No", "Male" or "Female", "Red" or "Blue", etc.
• The classification algorithms predict the categories present in the dataset.
Some real-world examples of classification algorithms are Spam Detection, Email filtering,
etc.
Some popular classification algorithms are given below:
• Random Forest Algorithm
• Decision Tree Algorithm
• Logistic Regression Algorithm
• Support Vector Machine Algorithm
Regression: Regression algorithms are used to solve regression problems in which the output
variable is continuous and there is a relationship between the input and output variables. These
are used to predict continuous output values, such as market trends, weather forecasts, etc.
Some popular Regression algorithms are given below:
• Simple Linear Regression Algorithm
• Multivariate Regression Algorithm
• Decision Tree Algorithm
• Lasso Regression
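
A minimal regression sketch (scikit-learn's simple linear regression on a small, made-up dataset), predicting a continuous output value:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Synthetic data: house size (m^2) vs. price, invented for illustration
    sizes = np.array([[50], [70], [90], [110], [130]])
    prices = np.array([150, 200, 260, 310, 360])   # continuous target variable

    reg = LinearRegression().fit(sizes, prices)
    print("predicted price for 100 m^2:", reg.predict([[100]])[0])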
Advantages of Supervised Learning:
• Since supervised learning works with a labelled dataset, we can have an exact idea about
the classes of objects.
• These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages of Supervised Learning:
• These algorithms are not able to solve complex tasks.
• It may predict the wrong output if the test data is different from the training data.
• It requires lots of computational time to train the algorithm.

Unsupervised Machine Learning:


Unsupervised learning is different from the supervised learning technique; as its name suggests,
there is no need for supervision. In unsupervised machine learning, the machine is trained on an
unlabelled dataset and must discover structure in the data on its own.
The main aim of an unsupervised learning algorithm is to group or categorize the unsorted
dataset according to similarities, patterns, and differences. Machines are instructed to find
the hidden patterns in the input dataset.
Categories of Unsupervised Machine Learning:
Unsupervised Learning can be further classified into two types, which are given below:
• Clustering
• Association
Clustering: The clustering technique is used when we want to find the inherent groups from
the data. It is a way to group the objects into a cluster such that the objects with the most
similarities remain in one group and have fewer or no similarities with the objects of other
groups.
An example of the clustering algorithm is grouping the customers by their purchasing
behaviour.
Some popular clustering algorithms are given below:
• K-Means Clustering algorithm
• Mean-shift algorithm
• DBSCAN Algorithm
Principal Component Analysis and Independent Component Analysis are often listed alongside
these, although they are strictly dimensionality-reduction techniques rather than clustering
algorithms.
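
A brief clustering sketch using scikit-learn's K-Means on made-up customer data, echoing the customer-grouping example above; the numbers are invented for illustration.

    import numpy as np
    from sklearn.cluster import KMeans

    # Unlabelled data: [annual spend, number of purchases], values are made up
    customers = np.array([
        [200,  5], [220,  6], [250,  7],     # low spenders
        [900, 40], [950, 42], [1000, 45],    # high spenders
    ])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
    print("cluster labels:", kmeans.labels_)
    print("cluster centres:", kmeans.cluster_centers_)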
Association: Association rule learning is an unsupervised learning technique, which finds
interesting relations among variables within a large dataset.
The main aim of this learning algorithm is to find the dependency of one data item on another
data item and map those variables accordingly so that it can generate maximum profit.
Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat, FP-growth
algorithm.
Advantages of Unsupervised Learning Algorithm:
• These algorithms can be used for more complicated tasks than supervised ones because they
work on unlabelled datasets.
• Unsupervised algorithms are preferable for various tasks, as obtaining an unlabelled dataset is
easier than obtaining a labelled one.
Disadvantages of Unsupervised Learning Algorithm:
• The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and
the algorithm is not trained with the exact output in advance.
• Working with unsupervised learning is more difficult, as it deals with unlabelled data that does
not map to a known output.

Reinforcement Learning:
Reinforcement learning works on a feedback-based process, in which an AI agent (a software
component) automatically explores its surroundings by hit and trial, taking actions, learning
from experiences, and improving its performance.
The agent gets rewarded for each good action and punished for each bad action; hence the goal
of a reinforcement learning agent is to maximize the rewards.
In reinforcement learning, there is no labelled data as in supervised learning; agents learn
from their experiences only.

The reinforcement learning process is similar to how a human being learns; for example, a child
learns various things through experiences in day-to-day life.
• An example of reinforcement learning is playing a game, where the game is the environment,
the moves of the agent at each step define the states, and the goal of the agent is to get a high score.
• The agent receives feedback in terms of punishments and rewards.
• Due to its way of working, reinforcement learning is employed in different fields such as
game theory, operations research, information theory, and multi-agent systems.
Categories of Reinforcement Learning:
Reinforcement learning is categorized mainly into two types of methods/algorithms:
Positive Reinforcement Learning: Positive reinforcement learning specifies increasing the
tendency that the required behaviour would occur again by adding something. It enhances the
strength of the behaviour of the agent and positively impacts it.
Negative Reinforcement Learning: Negative reinforcement learning works exactly opposite
to the positive RL. It increases the tendency that the specific behaviour would occur again by
avoiding the negative condition.
Real-world Use cases of Reinforcement Learning
• Video Games
• Robotics
• Text Mining
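
To make the reward-and-punishment loop concrete, below is a small, hypothetical tabular Q-learning sketch on a one-dimensional "walk to the goal" environment; the environment, rewards, and hyperparameters are all invented for illustration and are not part of the original material.

    import numpy as np

    n_states, n_actions = 5, 2           # states 0..4; actions: 0 = left, 1 = right
    GOAL = 4                             # reaching state 4 gives a reward
    q = np.zeros((n_states, n_actions))  # Q-table: expected reward for (state, action)
    alpha, gamma, eps = 0.1, 0.9, 0.2    # learning rate, discount, exploration rate
    rng = np.random.default_rng(0)

    for episode in range(500):
        state = 0
        while state != GOAL:
            # explore (random action) or exploit (best known action)
            if rng.random() < eps:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(q[state]))
            next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
            reward = 1.0 if next_state == GOAL else 0.0   # reward for a good action
            # nudge the Q-value towards reward + discounted future value
            q[state, action] += alpha * (reward + gamma * np.max(q[next_state])
                                         - q[state, action])
            state = next_state

    print(np.argmax(q, axis=1))   # learned policy: should favour action 1 (move right)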

Main Challenges of Machine Learning:


Lack Of Quality Data
One of the main issues in Machine Learning is the absence of good quality data. Developers
often end up spending most of their time on acquiring and cleaning data rather than on the
algorithms themselves.
Data can be noisy, which results in inaccurate predictions, and incorrect or incomplete
information can also lead to faulty models.
Fault In Credit Card Fraud Detection
Although ML-driven software helps to successfully detect credit card fraud, data and model
issues in Machine Learning can make the process unreliable.
Getting Bad Recommendations
Recommendation engines are quite common today. While some are dependable, others may not
provide the desired results, because Machine Learning algorithms tend only to reinforce what
these recommendation engines have already suggested.
Talent Deficit
Although many individuals are drawn to the ML industry, there are still very few
experts who can take complete control of this technology.
Implementation
Organizations often already have analytics engines in place when they decide to
move up to ML. Integrating newer ML strategies with existing procedures is a complicated
task.
Making The Wrong Assumptions
Many ML models cannot handle datasets containing missing data points. Thus, features that
contain a large proportion of missing data may have to be removed.
Deficient Infrastructure
ML requires a tremendous amount of data-processing capability. Legacy systems cannot handle
the workload and buckle under the pressure.
Having Algorithms Become Obsolete When Data Grows
ML algorithms consistently require a lot of data for training. Frequently, these ML
algorithms are trained on a specific dataset and afterwards used to predict future data,
a cycle that requires significant ongoing effort to keep the models from becoming obsolete
as the data grows.
Absence Of Skilled Resources
Another issue in Machine Learning is that deep analytics and ML, in their present form,
are still new technologies, so skilled practitioners are scarce.
Customer Segmentation
Consider data on a user's behaviour during a trial period along with relevant past behaviour. An
algorithm is needed to recognize those customers that will convert to the paid version of a
product and those that will not. Supervised learning algorithms used in ML for such problems
include:
• Neural Networks
• Naive Bayesian Model
• Classification
• Support Vector Machines
• Regression
• Random Forest Model
Complexity
Although Machine Learning and Artificial Intelligence are booming, a majority of these sectors
are still in their experimental phases, actively undergoing a trial-and-error method.
Slow Results
Another one of the most common issues in Machine Learning is slow results.
Machine Learning models can be highly efficient and accurate, but the results
take time to produce.
Maintenance
The required results for different actions are bound to change over time, and hence the data
needed for them changes as well, so models require regular maintenance and retraining.
Hypothesis
The hypothesis is a common term in Machine Learning and data science projects. As we know,
machine learning is one of the most powerful technologies across the world, which helps us to
predict results based on past experiences. Moreover, data scientists and ML professionals
conduct experiments that aim to solve a problem. These ML professionals and data scientists
make an initial assumption for the solution of the problem. This assumption in Machine
learning is known as Hypothesis.
The hypothesis is defined as the supposition or proposed explanation based on insufficient
evidence or assumptions. It is just a guess based on some known facts but has not yet been
proven. A good hypothesis is testable, which results in either true or false.
Example: Let's understand the hypothesis with a common example. A scientist claims that
ultraviolet (UV) light can damage the eyes, and we further assume that it may therefore also
cause blindness.
In this example, the scientist claims only that UV rays are harmful to the eyes; the additional
assumption that they may cause blindness may or may not turn out to be true. Hence, these types
of assumptions are called hypotheses.
How does a Hypothesis work?
In most supervised machine learning algorithms, our main goal is to find a possible hypothesis
from the hypothesis space that could map out the inputs to the proper outputs. The following
figure shows the common method to find out the possible hypothesis from the Hypothesis
space:

Hypothesis Space (H)


Hypothesis space is the set of all possible legal hypotheses. This is the set from which the
machine learning algorithm determines the single best hypothesis that describes the target
function or the outputs.
Hypothesis (h)
A hypothesis is a function that best describes the target in supervised machine learning. The
hypothesis that an algorithm comes up with depends upon the data and also upon the
restrictions and bias that we have imposed on the data.
For a simple linear model, the hypothesis can be written as y = mx + b, where m is the slope of
the line and b is the intercept.

Hypothesis Space and Representation in Machine Learning


The hypothesis space comprises all possible legal hypotheses that a machine learning algorithm
can consider. Hypotheses are formulated based on various algorithms and techniques, including
linear regression, decision trees, and neural networks. These hypotheses capture the mapping
function transforming input data into predictions.

Hypothesis Formulation and Representation in Machine Learning


Hypotheses in machine learning are formulated based on various algorithms and techniques,
each with its own representation. For example, in linear regression the hypothesis is a weighted
linear combination of the input features, while in a decision tree it is a set of if-then rules
learned from the data. In the case of complex models like neural networks, the hypothesis may
involve multiple layers of interconnected nodes, each performing a specific computation.
Hypothesis Evaluation:
The process of machine learning involves not only formulating hypotheses but also evaluating
their performance. This evaluation is typically done using a loss function or an evaluation
metric that quantifies the disparity between predicted outputs and ground truth labels. Common
evaluation metrics include mean squared error (MSE), accuracy, precision, recall, F1-score,
and others. By comparing the predictions of the hypothesis with the actual outcomes on a
validation or test dataset, one can assess the effectiveness of the model.
Hypothesis Testing and Generalization:
Once a hypothesis is formulated and evaluated, the next step is to test its generalization
capabilities. Generalization refers to the ability of a model to make accurate predictions on
unseen data. A hypothesis that performs well on the training dataset but fails to generalize to
new instances is said to suffer from overfitting. Conversely, a hypothesis that generalizes well
to unseen data is deemed robust and reliable.
The process of hypothesis formulation, evaluation, testing, and generalization is often iterative
in nature. It involves refining the hypothesis based on insights gained from model performance,
feature importance, and domain knowledge. Techniques such as hyperparameter tuning, feature
engineering, and model selection play a crucial role in this iterative refinement process.
Hypothesis in Statistics
In statistics, a hypothesis refers to a statement or assumption about a population parameter. It
is a proposition or educated guess that helps guide statistical analyses. There are two types of
hypotheses: the null hypothesis (H0) and the alternative hypothesis (H1 or Ha).
• Null Hypothesis(H0): This hypothesis suggests that there is no significant difference
or effect, and any observed results are due to chance. It often represents the status quo
or a baseline assumption.
• Alternative Hypothesis (H1 or Ha): This hypothesis contradicts the null hypothesis,
proposing that there is a significant difference or effect in the population. It is what
researchers aim to support with evidence.
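
A short, hypothetical illustration of testing a null hypothesis against an alternative with a two-sample t-test (SciPy); the two samples below are synthetic and the 0.05 significance level is just a conventional choice.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    # Two synthetic samples, e.g. model accuracies under two training setups (made-up numbers)
    group_a = rng.normal(loc=0.80, scale=0.02, size=30)
    group_b = rng.normal(loc=0.83, scale=0.02, size=30)

    # H0: the two groups have the same mean; H1: their means differ
    t_stat, p_value = stats.ttest_ind(group_a, group_b)

    if p_value < 0.05:
        print(f"p = {p_value:.4f}: reject H0 (significant difference)")
    else:
        print(f"p = {p_value:.4f}: fail to reject H0")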
Inductive Bias
Inductive bias can be defined as the set of assumptions or biases that a learning algorithm
employs to make predictions on unseen data based on its training data. These assumptions are
inherent in the algorithm's design and serve as a foundation for learning and generalization.
The inductive bias of an algorithm influences how it selects a hypothesis (a possible
explanation or model) from the hypothesis space (the set of all possible hypotheses) that best
fits the training data. It helps the algorithm navigate the trade-off between fitting the training
data perfectly (overfitting) and generalizing well to unseen data (underfitting).
Types of Inductive Bias
Inductive bias can manifest in various forms, depending on the algorithm and its underlying
assumptions. Some common types of inductive bias include:
1. Bias towards simpler explanations: Many machine learning algorithms, such as
decision trees and linear models, have a bias towards simpler hypotheses. They prefer
explanations that are more parsimonious and less complex, as these are often more
likely to generalize well to unseen data.
2. Bias towards smoother functions: Algorithms like kernel methods or Gaussian
processes have a bias towards smoother functions. They assume that neighbouring
points in the input space should have similar outputs, leading to smooth decision
boundaries.
3. Bias towards specific types of functions: Neural networks, for example, have a bias
towards learning complex, nonlinear functions. This bias allows them to capture
intricate patterns in the data but can also lead to overfitting if not regularized properly.
4. Bias towards sparsity: Some algorithms, like Lasso regression, have a bias towards
sparsity. They prefer solutions where only a few features are relevant, which can
improve interpretability and generalization.
Importance of Inductive Bias
Inductive bias is crucial in machine learning as it helps algorithms generalize from limited
training data to unseen data. Without a well-defined inductive bias, algorithms may struggle to
make accurate predictions or may overfit the training data, leading to poor performance on new
data.
Understanding the inductive bias of an algorithm is essential for model selection, as different
biases may be more suitable for different types of data or tasks. It also provides insights into
how the algorithm is learning and what assumptions it is making about the data, which can aid
in interpreting its predictions and results.
Bayes Classifiers
Naive Bayes classifiers are supervised machine learning algorithms used for classification
tasks, based on Bayes’ Theorem to find probabilities.
Key Features of Naive Bayes Classifiers
The main idea behind the Naive Bayes classifier is to use Bayes’ Theorem to classify data
based on the probabilities of different classes given the features of the data. It is used mostly
in high-dimensional text classification
• The Naive Bayes Classifier is a simple probabilistic classifier with a very small number
of parameters, which makes it possible to build ML models that can predict faster
than other classification algorithms.
• It is a probabilistic classifier that assumes that one feature in the model is
independent of the existence of another feature. In other words, each feature contributes to
the predictions with no relation to the others.
• Naïve Bayes Algorithm is used in spam filtration, Sentimental analysis, classifying
articles and many more.
Why is it Called Naive Bayes?
It is named "Naive" because it assumes that the presence of one feature does not affect other
features. The "Bayes" part of the name refers to its basis in Bayes' Theorem.
Consider a fictional dataset that describes the weather conditions for playing a game of golf.
Given the weather conditions, each tuple classifies the conditions as fit(“Yes”) or unfit(“No”)
for playing golf. Here is a tabular representation of our dataset.

    Outlook    Temperature  Humidity  Windy  Play Golf
0   Rainy      Hot          High      False  No
1   Rainy      Hot          High      True   No
2   Overcast   Hot          High      False  Yes
3   Sunny      Mild         High      False  Yes
4   Sunny      Cool         Normal    False  Yes
5   Sunny      Cool         Normal    True   No
6   Overcast   Cool         Normal    True   Yes
7   Rainy      Mild         High      False  No
8   Rainy      Cool         Normal    False  Yes
9   Sunny      Mild         Normal    False  Yes
10  Rainy      Mild         Normal    True   Yes
11  Overcast   Mild         High      True   Yes
12  Overcast   Hot          Normal    False  Yes
13  Sunny      Mild         High      True   No
The dataset is divided into two parts, namely, feature matrix and the response vector.
• The feature matrix contains all the vectors (rows) of the dataset, in which each vector consists
of the values of the independent features. In the above dataset, the features are 'Outlook',
'Temperature', 'Humidity' and 'Windy'.
• Response vector contains the value of class variable(prediction or output) for each row
of feature matrix. In above dataset, the class variable name is ‘Play golf’.
Assumption of Naive Bayes
The fundamental Naive Bayes assumption is that each feature makes an independent and equal
contribution to the outcome:
• Feature independence: This means that when we are trying to classify something, we
assume that each feature (or piece of information) in the data does not affect any other
feature.
• Continuous features are normally distributed: If a feature is continuous, then it is
assumed to be normally distributed within each class.
• Discrete features have multinomial distributions: If a feature is discrete, then it is
assumed to have a multinomial distribution within each class.
• Features are equally important: All features are assumed to contribute equally to the
prediction of the class label.
• No missing data: The data should not contain any missing values.
How it works?
• Assumption of Independence: The "naive" assumption in Naive Bayes is that the
presence of a particular feature in a class is independent of the presence of any other
feature, given the class. This is a strong assumption and may not hold true in real-world
data, but it simplifies the calculation and often works well in practice.
• Calculating Class Probabilities: Given a set of features x1, x2, ..., xn, the Naive Bayes
classifier calculates the probability of each class Ck given the features using Bayes'
theorem:

P(Ck | x1, x2, ..., xn) = [ P(Ck) · P(x1 | Ck) · P(x2 | Ck) · ... · P(xn | Ck) ] / P(x1, x2, ..., xn)

o The denominator P(x1, x2, ..., xn) is the same for all classes and can be ignored for
the purpose of comparison.
• Classification Decision: The classifier selects the class Ck with the highest probability
as the predicted class for the given set of features.
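
As a hedged sketch of these calculations, the snippet below fits scikit-learn's CategoricalNB (a Naive Bayes variant for categorical features) on a few rows of the golf-style data above and predicts the class with the highest posterior; the subset of rows and the encoding step are illustrative simplifications, not the full worked example.

    import numpy as np
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.naive_bayes import CategoricalNB

    # A few rows of the golf dataset: [Outlook, Temperature, Humidity, Windy]
    X = np.array([
        ["Rainy",    "Hot",  "High",   "False"],
        ["Rainy",    "Hot",  "High",   "True"],
        ["Overcast", "Hot",  "High",   "False"],
        ["Sunny",    "Mild", "High",   "False"],
        ["Sunny",    "Cool", "Normal", "False"],
        ["Sunny",    "Cool", "Normal", "True"],
        ["Overcast", "Cool", "Normal", "True"],
        ["Rainy",    "Mild", "High",   "False"],
    ])
    y = np.array(["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No"])

    enc = OrdinalEncoder()                              # map category strings to integer codes
    X_enc = enc.fit_transform(X).astype(int)

    clf = CategoricalNB()                               # Naive Bayes for categorical features
    clf.fit(X_enc, y)

    # Predict the class with the highest posterior probability for a new day
    new_day = enc.transform([["Sunny", "Mild", "Normal", "False"]]).astype(int)
    print(clf.predict(new_day)[0], clf.predict_proba(new_day))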
2. Bayes optimal classifier
The Bayes optimal classifier is a theoretical concept in machine learning that represents the
best possible classifier for a given problem. It is based on Bayes' theorem, which describes how
to update probabilities based on new evidence.
In the context of classification, the Bayes optimal classifier assigns the class label that has the
highest posterior probability given the input features. Mathematically, this can be expressed as:

y_pred = argmax over Ck of P(Ck | x) = argmax over Ck of P(x | Ck) · P(Ck)
Advantages of Naive Bayes Classifier


• Easy to implement and computationally efficient.
• Effective in cases with a large number of features.
• Performs well even with limited training data.
• It performs well in the presence of categorical features.
• For numerical features, the data is assumed to come from a normal distribution.
Disadvantages of Naive Bayes Classifier
• Assumes that features are independent, which may not always hold in real-world data.
• Can be influenced by irrelevant attributes.
• May assign zero probability to unseen events, leading to poor generalization.
Applications of Naive Bayes Classifier
• Spam Email Filtering: Classifies emails as spam or non-spam based on features.
• Text Classification: Used in sentiment analysis, document categorization, and topic
classification.
• Medical Diagnosis: Helps in predicting the likelihood of a disease based on symptoms.
• Credit Scoring: Evaluates creditworthiness of individuals for loan approval.
• Weather Prediction: Classifies weather conditions based on various factors.

Bayes error
In machine learning, "Bayes error" refers to the theoretical minimum error rate that any
classifier could achieve on a given dataset, representing the lowest possible classification error
given the inherent overlap and uncertainty between different classes in the data
distribution; essentially, it's the best possible performance a classifier can achieve under the
given data conditions, acting as a benchmark to compare the performance of different
classification algorithms.
Key points about Bayes error:
• Theoretical limit:
It is a theoretical concept because it assumes perfect knowledge of the true underlying
probability distributions of the data, which is usually not available in practice.
• Calculating Bayes error:
To calculate the Bayes error, you need to determine the conditional probabilities of each class
given the features, and then choose the class with the highest probability for each data point.

• Importance of Bayes error:


• Evaluating classifier performance: Comparing a classifier's error rate to the
Bayes error helps understand how well the classifier is performing relative to
the inherent difficulty of the classification task.
• Identifying overfitting: If a classifier has a significantly lower error rate on the
training data compared to the estimated Bayes error, it may be overfitting.
How to interpret Bayes error:
• Low Bayes error:
If the Bayes error is low, it means the classes in the data are well-separated and a classifier can
achieve high accuracy.
• High Bayes error:
If the Bayes error is high, it means the classes are highly overlapping, indicating a fundamental
limitation in how well any classifier can perform on that data.
Or
The Bayes Error of a dataset is the lowest possible error rate that any model can achieve. In
particular, if the Bayes Error is non-zero, then the two classes have some overlaps, and even
the best model will make some wrong predictions.
There are many possible reasons for a dataset to have a non-zero Bayes Error. For example:
• Poor data quality: Some images in a computer vision dataset are very blurry.
• Mislabelled data
• The labelling process is inconsistent: When deciding whether a job applicant should
proceed to the next round of interview, different interviewers might have different
opinions.
• The data generating process is inherently stochastic: Predicting heads or tails from
coin flipping.
• Information missing from the feature vectors: When predicting whether a baby has
certain genetic traits or not, the feature vector contains information about the father but
not the information about the mother.
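
As a hedged numerical sketch, the Bayes error can be computed exactly for a toy problem in which two equally likely classes have known, overlapping Gaussian class-conditional distributions; the means and variance below are invented for illustration.

    import numpy as np
    from scipy.stats import norm

    # Two equally likely classes with known class-conditional densities (made-up parameters)
    mu0, mu1, sigma = 0.0, 2.0, 1.0

    # Bayes classifier: pick the class with the higher posterior; with equal priors and
    # equal variances the decision boundary is the midpoint (mu0 + mu1) / 2.
    boundary = (mu0 + mu1) / 2

    # Bayes error = probability mass each class puts on the wrong side of the boundary
    bayes_error = 0.5 * norm.sf(boundary, loc=mu0, scale=sigma) \
                + 0.5 * norm.cdf(boundary, loc=mu1, scale=sigma)
    print(f"Bayes error: {bayes_error:.4f}")   # about 0.1587 for these parameters

Even a perfect model cannot beat this number for these overlapping distributions, which is exactly the "lowest possible error rate" described above.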

Occam's razor
Occam's razor, a principle named after the 14th-century English philosopher William of
Ockham, serves as a guiding tool in various fields of knowledge, from philosophy to science.
This principle suggests that among competing hypotheses or explanations, the simplest one is
often the most accurate. By advocating for simplicity, Occam's razor encourages us to prioritize
elegant and straightforward solutions over unnecessarily convoluted ones.
In other words, when faced with multiple explanations or hypotheses, Occam's razor encourages
us to choose the option with the fewest assumptions or complexities. It serves as a guide to
avoid unnecessary complications; by applying this principle, we can navigate through complex
problems and make decisions based on the simplest and most plausible explanation.

Relevance of Occam’s razor.


There are many findings that favour a simpler approach, either as an inductive bias or as a
constraint to begin with. Some of them are:
• Developmental studies have suggested that preschoolers are sensitive to
simpler explanations during their initial years of learning and development.
• A preference for simpler approaches and explanations to achieve the same goal is seen in
various branches of science; for instance, the parsimony principle is applied to
the understanding of evolution.
• In theology, ontology, epistemology, etc., this view of parsimony is used to derive various
conclusions.
• Variants of Occam's razor are used in knowledge discovery.
Uses of Occam’s Razor in Machine Learning
One example of how Occam's razor is used in machine learning is feature selection. Feature
selection involves choosing a subset of relevant features from a larger set of available features
to improve the model's performance and interpretability. Occam's razor can guide this process
by favouring simpler models with fewer features.
When faced with a high-dimensional dataset, selecting all available features may lead to
overfitting and increased computational complexity. Occam's razor suggests that a simpler
model with a reduced set of features can often achieve comparable or even better performance.
Various techniques can be employed to implement Occam's razor in feature selection. One
common approach is called "forward selection," where features are incrementally added to the
model based on their individual contribution to its performance. Starting with an empty set of
features, the algorithm iteratively selects the most informative feature at each step, considering
its impact on the model's performance. This process continues until a stopping criterion, such
as reaching a desired level of performance or a predetermined number of features, is met.
Another approach is "backward elimination," where all features are initially included in the
model, and features are gradually eliminated based on their contribution or lack thereof. The
algorithm removes the least informative feature at each step, re-evaluates the model's
performance, and continues eliminating features until the stopping criterion is satisfied.
By employing these feature selection techniques guided by Occam's razor, machine learning
models can achieve better generalization, reduce overfitting, improve interpretability, and
optimize computational efficiency. Occam's razor helps to uncover the most relevant features
that capture the essence of the problem at hand, simplifying the model without sacrificing its
predictive capabilities.
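
A hedged sketch of forward selection in this spirit, using scikit-learn's SequentialFeatureSelector (available in recent scikit-learn versions); the dataset, estimator, and number of features are placeholders rather than recommended choices.

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Forward selection: start with no features and greedily add the most useful ones
    selector = SequentialFeatureSelector(
        KNeighborsClassifier(n_neighbors=5),
        n_features_to_select=5,        # stopping criterion: keep 5 features
        direction="forward",           # use "backward" for backward elimination
        cv=5,
    )
    selector.fit(X, y)
    print("selected feature indices:", selector.get_support(indices=True))

Switching direction to "backward" implements backward elimination with the same stopping criterion, in line with the two techniques described above.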

The Curse of Dimensionality in Machine Learning arises when working with high-dimensional
data, leading to increased computational complexity, overfitting, and spurious correlations.
Techniques like dimensionality reduction, feature selection, and careful model design are
essential for mitigating its effects and improving algorithm performance. Navigating this
challenge is crucial for unlocking the potential of high-dimensional datasets and ensuring
robust machine-learning solutions.

What is Curse of Dimensionality?


• The Curse of Dimensionality refers to the phenomenon where the efficiency and
effectiveness of algorithms deteriorate as the dimensionality of the data increases, because
the amount of data needed to adequately sample the space grows exponentially with the
number of dimensions.
• In high-dimensional spaces, data points become sparse, making it challenging to
discern meaningful patterns or relationships due to the vast amount of data required to
adequately sample the space.
• Curse of Dimensionality significantly impacts machine learning algorithms in various
ways. It leads to increased computational complexity, longer training times, and higher
resource requirements. Moreover, it escalates the risk of overfitting and spurious
correlations, hindering the algorithms' ability to generalize well to unseen data.
How to Overcome the Curse of Dimensionality?
To overcome the curse of dimensionality, you can consider the following strategies:
1. Dimensionality Reduction Techniques:
• Feature Selection: Identify and select the most relevant features from the original
dataset while discarding irrelevant or redundant ones. This reduces the dimensionality
of the data, simplifying the model and improving its efficiency.
• Feature Extraction: Transform the original high-dimensional data into a lower-
dimensional space by creating new features that capture the essential information.
Techniques such as Principal Component Analysis (PCA) and t-distributed Stochastic
Neighbour Embedding (t-SNE) are commonly used for feature extraction.
2. Data Preprocessing:
• Normalization: Scale the features to a similar range to prevent certain features from
dominating others, especially in distance-based algorithms.
• Handling Missing Values: Address missing data appropriately through imputation or
deletion to ensure robustness in the model training process.
What is Dimensionality Reduction?
Dimensionality reduction is a technique used to reduce the number of features in a dataset while
retaining as much of the important information as possible. In other words, it is a process of
transforming high-dimensional data into a lower-dimensional space that still preserves the
essence of the original data.
In machine learning, high-dimensional data refers to data with a large number of features or
variables. The curse of dimensionality is a common problem in machine learning, where the
performance of the model deteriorates as the number of features increases. This is because the
complexity of the model increases with the number of features, and it becomes more difficult
to find a good solution. In addition, high-dimensional data can also lead to overfitting, where
the model fits the training data too closely and does not generalize well to new data.
Dimensionality reduction can help to mitigate these problems by reducing the complexity of
the model and improving its generalization performance.
There are two main approaches to dimensionality reduction: feature selection and feature
extraction.

Feature Selection:
Feature selection involves selecting a subset of the original features that are most relevant to
the problem at hand. The goal is to reduce the dimensionality of the dataset while retaining
the most important features. There are several methods for feature selection, including filter
methods, wrapper methods, and embedded methods. Filter methods rank the features based
on their relevance to the target variable, wrapper methods use the model performance as the
criteria for selecting features, and embedded methods combine feature selection with the
model training process.
Feature Extraction:
Feature extraction involves creating new features by combining or transforming the original
features. The goal is to create a set of features that captures the essence of the original data in
a lower-dimensional space. There are several methods for feature extraction, including
principal component analysis (PCA), linear discriminant analysis (LDA), and t-distributed
stochastic neighbour embedding (t-SNE). PCA is a popular technique that projects the
original features onto a lower-dimensional space while preserving as much of the variance as
possible.
Why is Dimensionality Reduction important in Machine Learning and Predictive
Modelling?
An intuitive example of dimensionality reduction can be discussed through a simple e-mail
classification problem, where we need to classify whether the e-mail is spam or not. This can
involve a large number of features, such as whether or not the e-mail has a generic title, the
content of the e-mail, whether the e-mail uses a template, etc. However, some of these features
may overlap. In another condition, a classification problem that relies on both humidity and
rainfall can be collapsed into just one underlying feature, since both of the aforementioned are
correlated to a high degree. Hence, we can reduce the number of features in such problems. A
3-D classification problem can be hard to visualize, whereas a 2-D one can be mapped to a
simple 2-dimensional space, and a 1-D problem to a simple line. The below figure illustrates
this concept, where a 3-D feature space is split into two 2-D feature spaces, and later, if found
to be correlated, the number of features can be reduced even further.

Components of Dimensionality Reduction


There are two components of dimensionality reduction:
• Feature selection: In this, we try to find a subset of the original set of variables, or
features, to get a smaller subset which can be used to model the problem. It usually
involves three ways:
1. Filter
2. Wrapper
3. Embedded
• Feature extraction: This reduces the data in a high dimensional space to a lower
dimension space, i.e. a space with lesser no. of dimensions.
Methods of Dimensionality Reduction
The various methods used for dimensionality reduction include:
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)
Dimensionality reduction may be both linear and non-linear, depending upon the method used.
The prime linear method, called Principal Component Analysis, or PCA, is discussed below.
Principal Component Analysis
This method was introduced by Karl Pearson. It works on the condition that while the data in
a higher dimensional space is mapped to data in a lower dimension space, the variance of the
data in the lower dimensional space should be maximum.

It involves the following steps:


• Construct the covariance matrix of the data.
• Compute the eigenvectors of this matrix.
• Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large
fraction of variance of the original data.
Hence, we are left with a lesser number of eigenvectors, and there might have been some data
loss in the process. But, the most important variances should be retained by the remaining
eigenvectors.
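
A minimal NumPy sketch following these steps (centre the data, build the covariance matrix, keep the eigenvectors with the largest eigenvalues); the synthetic data and the choice of two components are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    X[:, 1] = 2 * X[:, 0] + 0.1 * rng.normal(size=200)   # introduce a correlated feature

    # 1. Centre the data and construct the covariance matrix
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)

    # 2. Compute eigenvalues and eigenvectors of the covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: the covariance matrix is symmetric

    # 3. Keep the eigenvectors with the largest eigenvalues (here: top 2 components)
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:2]]

    X_reduced = Xc @ components                   # project onto the principal components
    explained = eigvals[order[:2]].sum() / eigvals.sum()
    print(X_reduced.shape, f"variance retained: {explained:.2%}")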
Advantages of Dimensionality Reduction
• It helps in data compression, and hence reduced storage space.
• It reduces computation time.
• It also helps remove redundant features, if any.
• Improved Visualization: High dimensional data is difficult to visualize, and
dimensionality reduction techniques can help in visualizing the data in 2D or 3D, which
can help in better understanding and analysis.
• Overfitting Prevention: High dimensional data may lead to overfitting in machine
learning models, which can lead to poor generalization performance. Dimensionality
reduction can help in reducing the complexity of the data, and hence prevent overfitting.
• Feature Extraction: Dimensionality reduction can help in extracting important features
from high dimensional data, which can be useful in feature selection for machine
learning models.
• Data Preprocessing: Dimensionality reduction can be used as a preprocessing step
before applying machine learning algorithms to reduce the dimensionality of the data
and hence improve the performance of the model.
• Improved Performance: Dimensionality reduction can help in improving the
performance of machine learning models by reducing the complexity of the data, and
hence reducing the noise and irrelevant information in the data.
Disadvantages of Dimensionality Reduction
• It may lead to some amount of data loss.
• PCA tends to find linear correlations between variables, which is sometimes
undesirable.
• PCA fails in cases where mean and covariance are not enough to define datasets.
• We may not know how many principal components to keep; in practice, rules of thumb
(such as retaining enough components to explain most of the variance) are applied.
• Interpretability: The reduced dimensions may not be easily interpretable, and it may be
difficult to understand the relationship between the original features and the reduced
dimensions.
• Overfitting: In some cases, dimensionality reduction may lead to overfitting, especially
when the number of components is chosen based on the training data.
• Sensitivity to outliers: Some dimensionality reduction techniques are sensitive to
outliers, which can result in a biased representation of the data.
• Computational complexity: Some dimensionality reduction techniques, such as
manifold learning, can be computationally intensive, especially when dealing with large
datasets.
Important points:
• Dimensionality reduction is the process of reducing the number of features in a dataset
while retaining as much information as possible.
This can be done to reduce the complexity of a model, improve the performance of a
learning algorithm, or make it easier to visualize the data.
• Techniques for dimensionality reduction include: principal component analysis (PCA),
singular value decomposition (SVD), and linear discriminant analysis (LDA).
• Each technique projects the data onto a lower-dimensional space while preserving
important information.
• Dimensionality reduction is performed during pre-processing stage before building a
model to improve the performance
• It is important to note that dimensionality reduction can also discard useful information,
so care must be taken when applying these techniques.
Feature Scaling
Feature Scaling is a technique to standardize the independent features present in the data. It is
performed during data pre-processing to handle highly varying values. If feature scaling is
not done, a machine learning algorithm tends to treat greater values as more important and
smaller values as less important, regardless of the units of the values. For example, it would
treat 10 m and 10 cm as the same magnitude (10), ignoring the units.
Feature scaling ensures that each feature contributes similarly to the model's learning
process, preventing features with large values from skewing the results. There are
several common methods to achieve feature scaling:
Absolute Maximum Scaling
This method of scaling requires two steps:
1. First, select the maximum absolute value out of all the entries of a particular feature.
2. Then divide each entry of the column by this maximum absolute value.

X_scaled = Xi / max(|X|)

After performing the above two steps, each entry of the column lies in the range of -1 to 1.
However, this method is not used very often, the reason being that it is too sensitive to outliers,
and outliers are very common when dealing with real-world data.

Min-Max Scaling
Min-Max Scaling is a feature scaling method that transforms the values of features to fit
within a specific range, generally between 0 and 1. This method is mainly useful when you
want to make certain that all features have the same scale, preventing any single feature from
dominating the model because of its large value range.
This method of scaling requires the two steps below:
1. First, find the minimum and the maximum value of the column.
2. Then subtract the minimum value from each entry and divide the result by the
difference between the maximum and the minimum value.

X_scaled = (Xi - Xmin) / (Xmax - Xmin)

Where:
• Xi is the original value of the feature.
• Xmin is the minimum value of the feature.
• Xmax is the maximum value of the feature.
• X_scaled is the scaled value of the feature.

As this method uses the maximum and the minimum value, it is also prone to outliers, but the
range of the data after performing the above two steps is between 0 and 1.
Normalization
This method (mean normalization) is more or less the same as the previous one, but instead of
subtracting the minimum value we subtract the mean of the whole feature from each entry and
then divide the result by the difference between the maximum and the minimum value:

X_scaled = (Xi - Xmean) / (Xmax - Xmin)

Standardization
Standardization, also referred to as Z-score normalization, is a feature scaling technique that
transforms the values of a feature so that they have a mean of 0 and a standard deviation
of 1. This technique is especially useful when you want to centre your data and ensure that
each feature contributes equally to the model's learning process.
This method of scaling is based on the central tendency and variance of the data:
1. First, calculate the mean and standard deviation of the feature we would like to
standardize.
2. Then subtract the mean value from each entry and divide the
result by the standard deviation.

X_scaled = (Xi - mean) / standard deviation

This gives data with a mean equal to zero and a standard deviation equal to 1.

Robust Scaling
In this method of scaling, we use two main statistical measures of the data:
• Median
• Inter-Quartile Range (IQR)
After calculating these two values, we subtract the median from each entry and
then divide the result by the interquartile range:

X_scaled = (Xi - Xmedian) / IQR

Where:
• Xi is the original value of the feature.
• Xmedian is the median of the feature.
• IQR is the interquartile range of the feature, that is, the difference between the 75th percentile
(Q3) and the 25th percentile (Q1).
• X_scaled is the robust scaled value of the feature.
This method rescales the feature by centring it around the median and scaling it according
to the IQR, which reduces the effect of outliers.
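
A hedged sketch comparing these scaling methods using scikit-learn's built-in transformers on a tiny made-up feature that contains an outlier:

    import numpy as np
    from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler, RobustScaler

    # One feature with very different magnitudes (values are made up); 1000 acts as an outlier
    X = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])

    for scaler in (MaxAbsScaler(), MinMaxScaler(), StandardScaler(), RobustScaler()):
        scaled = scaler.fit_transform(X)
        print(scaler.__class__.__name__, np.round(scaled.ravel(), 3))

Comparing the printed values shows how the outlier squashes the min-max and absolute-maximum results, while robust scaling keeps the regular values spread out.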
Why use Feature Scaling?
In machine learning feature scaling is used for number of purposes:
• Range: Scaling guarantees that all features are on a comparable scale and have
comparable ranges. This process is known as feature normalisation. This is significant
because the magnitude of the features has an impact on many machine learning
techniques. Larger scale features may dominate the learning process and have an
excessive impact on the outcomes.
• Algorithm performance improvement: When the features are scaled, several machine
learning methods, including gradient descent-based algorithms, distance-based
algorithms (such as k-nearest neighbours) and support vector machines, perform better or
converge more quickly. Scaling the features enhances the algorithm's performance by removing
scale differences that would otherwise hinder convergence to the optimal outcome.
• Preventing numerical instability: Numerical instability can be prevented by avoiding
significant scale disparities between features. For examples include distance
calculations where having features with differing scales can result in numerical
overflow or underflow problems. Stable computations are required to mitigate this issue
by scaling the features.
• Equal importance: Scaling features makes sure that each feature is given the
same consideration during the learning process. Without scaling, larger-scale features
could dominate the learning, producing skewed outcomes. This bias is removed through
scaling, and each feature contributes fairly to the model's predictions.
What is Feature Selection?

• Feature Selection is the method of reducing the input variables to your model by using
only relevant data and getting rid of noise in the data.
It is the process of automatically choosing relevant features for your machine learning
model based on the type of problem you are trying to solve. We do this by including or
excluding important features without changing them. It helps in cutting down the noise
in our data and reducing the size of our input data.

Or
A feature is an attribute that has an impact on a problem or is useful for the problem,
and choosing the important features for the model is known as feature selection. Each
machine learning process depends on feature engineering, which mainly contains two
processes; which are Feature Selection and Feature Extraction. Although feature
selection and extraction processes may have the same objective, both are completely
different from each other. The main difference between them is that feature selection is
about selecting the subset of the original feature set, whereas feature extraction creates
new features. Feature selection is a way of reducing the input variable for the model by
using only relevant data in order to reduce overfitting in the model.
So, we can define feature Selection as, "It is a process of automatically or manually
selecting the subset of most appropriate and relevant features to be used in model
building." Feature selection is performed by either including the important features or
excluding the irrelevant features in the dataset without changing them.

Need for Feature Selection


o It helps in avoiding the curse of dimensionality.
o It helps in the simplification of the model so that it can be easily interpreted by the
researchers.
o It reduces the training time.
o It reduces overfitting and hence enhances generalization.

Feature Selection Models

Feature selection models are of two types:


1. Supervised Models: Supervised feature selection refers to methods which use the
output label class for feature selection. They use the target variable to identify the
features which can increase the efficiency of the model.
2. Unsupervised Models: Unsupervised feature selection refers to methods which do
not need the output label class for feature selection. We use them for unlabelled data.
We can further divide the supervised models into three:

Filter Methods
Filter methods evaluate each feature independently with respect to the target variable. Features
with a high correlation with the target variable are selected, since such a feature carries
information that can help us in making predictions. These methods are used in the
preprocessing phase to remove irrelevant or redundant features based on statistical tests
(e.g. correlation) or other criteria.

Advantages:
• Fast and inexpensive: Can quickly evaluate features without training the model.
• Good for removing redundant or correlated features.

Limitations: These methods don’t consider feature interactions so they may miss
feature combinations that improve model performance.

Some techniques used are:


• Information Gain – It is defined as the amount of information provided by the feature
for identifying the target value and measures reduction in the entropy values.
Information gain of each attribute is calculated considering the target values for feature
selection.
• Chi-square test – The Chi-square (χ²) test is generally used to test the relationship
between categorical variables. It compares the observed values of different attributes
of the dataset with their expected values.

• Fisher's Score – Fisher's Score evaluates each feature independently according to its
score under the Fisher criterion, which can lead to a suboptimal set of features. The larger
the Fisher's score, the better the selected feature.
• Correlation Coefficient – Pearson’s Correlation Coefficient is a measure of
quantifying the association between the two continuous variables and the direction of
the relationship with its values ranging from -1 to 1.
• Variance Threshold – It is an approach where all features are removed whose variance
doesn’t meet the specific threshold. By default, this method removes features having
zero variance. The assumption made using this method is higher variance features are
likely to contain more information.
• Mean Absolute Difference (MAD) – This method is similar to variance threshold
method but the difference is there is no square in MAD. This method calculates the
mean absolute difference from the mean value.
• Dispersion Ratio – Dispersion ratio is defined as the ratio of the Arithmetic mean
(AM) to that of Geometric mean (GM) for a given feature. Its value ranges from +1 to
∞ as AM ≥ GM for a given feature. Higher dispersion ratio implies a more relevant
feature.
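
A brief sketch of filter-style selection with scikit-learn's statistical tools (a variance threshold followed by the chi-square test); the dataset, threshold, and number of features kept are placeholders for illustration.

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, chi2, VarianceThreshold

    X, y = load_iris(return_X_y=True)

    # Variance threshold: drop near-constant features (the threshold is illustrative)
    vt = VarianceThreshold(threshold=0.1)
    X_vt = vt.fit_transform(X)

    # Chi-square test: keep the 2 features most related to the target
    skb = SelectKBest(score_func=chi2, k=2)
    X_chi = skb.fit_transform(X_vt, y)

    print("chi-square scores:", skb.scores_)
    print("shape after selection:", X_chi.shape)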

2. Wrapper methods
Wrapper methods, also referred to as greedy algorithms, train a model using
different combinations of features, compute the relationship between these feature
subsets and the target variable, and, based on the results, add or remove
features. The stopping criteria for selecting the best subset are usually pre-
defined by the person training the model, such as when the performance of the model
decreases or a specific number of features is reached.
Advantages:
• Can lead to better model performance since they evaluate feature subsets in the context
of the model.
• They can capture feature dependencies and interactions.

Limitations: They are computationally more expensive than filter methods especially
for large datasets.

Some techniques used are:


• Forward selection – This method is an iterative approach where we initially start with
an empty set of features and keep adding a feature which best improves our model after
each iteration. The stopping criterion is till the addition of a new variable does not
improve the performance of the model.
• Backward elimination – This method is also an iterative approach where we initially
start with all features and after each iteration, we remove the least significant feature.
The stopping criterion is till no improvement in the performance of the model is
observed after the feature is removed.
• Recursive elimination – This greedy optimization method selects features by
recursively considering smaller and smaller sets of features. The estimator is trained
on an initial set of features and their importance is obtained from the estimator's
importance scores (for example, a feature_importances_ or coef_ attribute). The least
important features are then removed from the current set of features until we are left with
the required number of features.
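
A hedged sketch of recursive feature elimination using scikit-learn's RFE, where the estimator's importance scores drive the elimination; the dataset, estimator, and number of features retained are illustrative choices.

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import RFE
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Recursively drop the least important feature until 5 remain
    rfe = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=0),
              n_features_to_select=5, step=1)
    rfe.fit(X, y)

    print("selected feature mask:", rfe.support_)
    print("feature ranking (1 = selected):", rfe.ranking_)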

3. Embedded methods
Embedded methods perform feature selection during the model training process. They
combine the benefits of both filter and wrapper methods. Feature selection is integrated
into the model training allowing the model to select the most relevant features based on
the training process dynamically.
Advantages:
• More efficient than wrapper methods because the feature selection process is embedded
within model training.
• Often more scalable than wrapper methods.

Limitations: Works with a specific learning algorithm so the feature selection might
not work well with other models

Some techniques used are:

• L1 Regularization (Lasso): A regression method that applies L1 regularization to
encourage sparsity in the model. Features with non-zero coefficients are considered
important.
• Decision Trees and Random Forests: These algorithms naturally perform feature
selection by selecting the most important features for splitting nodes based on criteria
like Gini impurity or information gain.
• Gradient Boosting: Like random forests gradient boosting models select important
features while building trees by prioritizing features that reduce error the most.
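
A short embedded-selection sketch using L1 regularization: features whose Lasso coefficients shrink to zero are dropped via scikit-learn's SelectFromModel; the alpha value and dataset are illustrative, not prescribed.

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import Lasso
    from sklearn.preprocessing import StandardScaler

    X, y = load_diabetes(return_X_y=True)
    X = StandardScaler().fit_transform(X)          # L1 penalties are scale-sensitive

    # Lasso drives unimportant coefficients to exactly zero; SelectFromModel keeps the rest
    selector = SelectFromModel(Lasso(alpha=0.5))   # alpha controls sparsity (illustrative value)
    selector.fit(X, y)

    print("selected feature indices:", selector.get_support(indices=True))
    print("coefficients:", np.round(selector.estimator_.coef_, 2))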
4. Intrinsic Method: This method combines the qualities of both the filter and wrapper
methods to create the best subset.

It handles the iterative model-training process while keeping the
computational cost to a minimum. E.g.: Lasso and Ridge Regression.
