Unit 1 Machine Learning
Machine learning is a subset of Artificial Intelligence (AI). Machine learning (ML) allows
computers to learn and make decisions without being explicitly programmed. It involves
feeding data into algorithms to identify patterns and make predictions on new
data. Machine learning is used in various applications, including image and speech
recognition, natural language processing, and recommender systems.
A machine “learns” by recognizing patterns and improving its performance on a task based on
data, without being explicitly programmed.
The process involves:
1. Data Input: Machines require data (e.g., text, images, numbers) to analyze.
2. Algorithms: Algorithms process the data, finding patterns or relationships.
3. Model Training: Machines learn by adjusting their parameters based on the input data
using mathematical models.
4. Feedback Loop: The machine compares predictions to actual outcomes and corrects
errors (via optimization methods like gradient descent).
5. Experience and Iteration: Repeating this process with more data improves the
machine’s accuracy over time.
6. Evaluation and Generalization: The model is tested on unseen data to ensure it
performs well on real-world tasks.
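To make these steps concrete, the short Python sketch below fits a straight line to synthetic data with gradient descent; the data, the linear model and the learning rate are illustrative assumptions rather than part of any particular application.

```python
# A minimal sketch of the six-step learning loop, using plain NumPy.
import numpy as np

rng = np.random.default_rng(0)

# 1. Data input: synthetic examples where y is roughly 3*x + 2 plus noise.
X = rng.uniform(0, 10, size=200)
y = 3 * X + 2 + rng.normal(0, 1, size=200)

# Hold out unseen data for step 6 (evaluation and generalization).
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# 2-3. Algorithm / model: a line y = w*x + b with trainable parameters.
w, b = 0.0, 0.0
lr = 0.01  # learning rate (an illustrative choice)

# 4-5. Feedback loop and iteration: gradient descent on mean squared error.
for epoch in range(2000):
    pred = w * X_train + b
    error = pred - y_train                  # compare predictions to actual outcomes
    w -= lr * 2 * np.mean(error * X_train)  # correct parameters using gradients
    b -= lr * 2 * np.mean(error)

# 6. Evaluation: measure error on data the model has never seen.
test_mse = np.mean((w * X_test + b - y_test) ** 2)
print(f"learned w={w:.2f}, b={b:.2f}, test MSE={test_mse:.3f}")
```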
Features of Machine Learning:
o Machine learning uses data to detect various patterns in a given dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is similar to data mining in that it also deals with huge amounts of
data.
Basic components of the learning process
The learning process, whether by a human or a machine, can be divided into four components,
namely, data storage, abstraction, generalization and evaluation.
Data storage
Facilities for storing and retrieving huge amounts of data are an important component of the
learning process. Humans and computers alike utilize data storage as a foundation for advanced
reasoning.
• In a human being, the data is stored in the brain and data is retrieved using electrochemical
signals.
• Computers use hard disk drives, flash memory, random access memory and similar devices
to store data and use cables and other technology to retrieve data.
Abstraction
The second component of the learning process is known as abstraction. Abstraction is the
process of extracting knowledge about stored data. This involves creating general concepts
about the data as a whole. The creation of knowledge involves application of known models
and creation of new models. The process of fitting a model to a dataset is known as training.
When the model has been trained, the data is transformed into an abstract form that summarizes
the original information.
Generalization
The third component of the learning process is known as generalization. The term
generalization describes the process of turning the knowledge about stored data into a form that
can be utilized for future action. These actions are to be carried out on tasks that are similar,
but not identical, to those that have been seen before. In generalization, the goal is to discover
those properties of the data that will be most relevant to future tasks.
Evaluation
Evaluation is the last component of the learning process. It is the process of giving feedback
to the user to measure the utility of the learned knowledge. This feedback is then utilised to
effect improvements in the whole learning process.
Classification of Machine Learning
At a broad level, machine learning can be classified into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
Supervised machine learning
Supervised machine learning is a fundamental approach for machine learning and artificial
intelligence. It involves training a model using labelled data, where each input comes with a
corresponding correct output. The process is like a teacher guiding a student—hence the term
“supervised” learning.
Supervised learning is a type of machine learning where a model is trained on labelled data,
meaning each input is paired with the correct output. The model learns by comparing its
predictions with the actual answers provided in the training data. Over time, it adjusts itself to
minimize errors and improve accuracy. The goal of supervised learning is to make accurate
predictions when given new, unseen data. For example, if a model is trained to recognize
handwritten digits, it will use what it learned to correctly identify new numbers it hasn’t seen
before.
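As a minimal sketch of this idea (assuming scikit-learn and its bundled digits dataset are available), the following code trains a classifier on labelled handwritten digits and then predicts digits it has not seen:

```python
# Supervised learning sketch: labelled inputs, training, and prediction on unseen data.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)           # inputs paired with correct outputs (labels)

# Keep some labelled data unseen so we can check generalization.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=5000)     # the "student" being supervised
model.fit(X_train, y_train)                   # learn by comparing predictions to labels

pred = model.predict(X_test)                  # predict digits the model has never seen
print("accuracy on unseen digits:", accuracy_score(y_test, pred))
```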
Reinforcement Learning:
Reinforcement learning works on a feedback-based process, in which an AI agent (a software
component) automatically explores its surroundings by trial and error, taking actions, learning
from experience, and improving its performance.
The agent gets rewarded for each good action and punished for each bad action; hence, the goal
of a reinforcement learning agent is to maximize the rewards.
In reinforcement learning, there is no labelled data like supervised learning, and agents learn
from their experiences only.
The reinforcement learning process is similar to how a human being learns; for example, a child
learns various things through experience in day-to-day life.
• An example of reinforcement learning is to play a game, where the Game is the environment,
moves of an agent at each step define states, and the goal of the agent is to get a high score.
• Agent receives feedback in terms of punishment and rewards.
• Due to its way of working, reinforcement learning is employed in different fields such as
Game Theory, Operations Research, Information Theory and multi-agent systems.
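The toy Python sketch below illustrates this reward-and-punishment feedback loop with tabular Q-learning; the 5-state corridor environment, the reward of 1 at the goal state and all hyperparameters are made-up assumptions chosen purely for illustration.

```python
# Tabular Q-learning on a made-up 5-state corridor: the agent starts in the middle
# and is rewarded only for reaching the rightmost state.
import numpy as np

n_states, actions = 5, [0, 1]        # action 0 = move left, 1 = move right
Q = np.zeros((n_states, len(actions)))
alpha, gamma, epsilon = 0.1, 0.9, 0.2
rng = np.random.default_rng(1)

for episode in range(500):
    state = 2                                   # start in the middle of the corridor
    while True:
        # Explore by trial and error, otherwise exploit what has been learned so far.
        action = rng.choice(actions) if rng.random() < epsilon else int(np.argmax(Q[state]))
        next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0   # reward only at the goal
        # Feedback: update the estimate of this action's value.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state
        if state == n_states - 1:               # episode ends at the goal
            break

print("learned action values:\n", Q.round(2))   # "move right" should score higher
```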
Categories of Reinforcement Learning:
Reinforcement learning is categorized mainly into two types of methods/algorithms:
Positive Reinforcement Learning: Positive reinforcement learning specifies increasing the
tendency that the required behaviour would occur again by adding something. It enhances the
strength of the behaviour of the agent and positively impacts it.
Negative Reinforcement Learning: Negative reinforcement learning works exactly opposite
to the positive RL. It increases the tendency that the specific behaviour would occur again by
avoiding the negative condition.
Real-world Use cases of Reinforcement Learning
• Video Games
• Robotics
• Text Mining
Hypothesis in Machine Learning
In machine learning, a hypothesis is a candidate function or model, drawn from the hypothesis
space, that is proposed to explain the relationship between the inputs and outputs in the training
data. In simple models such as linear regression, the hypothesis is a single parameterized
function of the input features. In the case of complex models like neural networks, the
hypothesis may involve multiple layers of interconnected nodes, each performing a specific
computation.
Hypothesis Evaluation:
The process of machine learning involves not only formulating hypotheses but also evaluating
their performance. This evaluation is typically done using a loss function or an evaluation
metric that quantifies the disparity between predicted outputs and ground truth labels. Common
evaluation metrics include mean squared error (MSE), accuracy, precision, recall, F1-score,
and others. By comparing the predictions of the hypothesis with the actual outcomes on a
validation or test dataset, one can assess the effectiveness of the model.
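For example, these metrics can be computed with scikit-learn; the prediction and label vectors below are invented purely to demonstrate the calls:

```python
# Evaluation metrics sketch on made-up predictions and labels.
from sklearn.metrics import mean_squared_error, accuracy_score, precision_score, recall_score, f1_score

# Regression-style comparison: predicted vs. actual continuous values.
y_true_reg = [3.0, 2.5, 4.0, 5.1]
y_pred_reg = [2.8, 2.7, 3.6, 5.0]
print("MSE:", mean_squared_error(y_true_reg, y_pred_reg))

# Classification-style comparison: predicted vs. actual class labels.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
```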
Hypothesis Testing and Generalization:
Once a hypothesis is formulated and evaluated, the next step is to test its generalization
capabilities. Generalization refers to the ability of a model to make accurate predictions on
unseen data. A hypothesis that performs well on the training dataset but fails to generalize to
new instances is said to suffer from overfitting. Conversely, a hypothesis that generalizes well
to unseen data is deemed robust and reliable.
The process of hypothesis formulation, evaluation, testing, and generalization is often iterative
in nature. It involves refining the hypothesis based on insights gained from model performance,
feature importance, and domain knowledge. Techniques such as hyperparameter tuning, feature
engineering, and model selection play a crucial role in this iterative refinement process.
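One possible way to carry out such iterative refinement is cross-validated hyperparameter search; in the sketch below, the iris dataset and the SVC parameter grid are illustrative assumptions:

```python
# Hyperparameter tuning sketch with cross-validated grid search.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 0.01]}  # candidate settings
search = GridSearchCV(SVC(), param_grid, cv=5)   # evaluate each candidate by cross-validation
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print("cross-validated score:", round(search.best_score_, 3))
```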
Hypothesis in Statistics
In statistics, a hypothesis refers to a statement or assumption about a population parameter. It
is a proposition or educated guess that helps guide statistical analyses. There are two types of
hypotheses: the null hypothesis (H0) and the alternative hypothesis (H1 or Ha).
• Null Hypothesis (H0): This hypothesis suggests that there is no significant difference
or effect, and any observed results are due to chance. It often represents the status quo
or a baseline assumption.
• Alternative Hypothesis (H1 or Ha): This hypothesis contradicts the null hypothesis,
proposing that there is a significant difference or effect in the population. It is what
researchers aim to support with evidence.
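As a small illustration (with invented sample data), a two-sample t-test in SciPy can be used to decide between H0 and H1:

```python
# Hypothesis testing sketch: H0 says the two groups share the same mean, H1 says they differ.
from scipy import stats

group_a = [5.1, 4.9, 5.4, 5.0, 5.2, 4.8]
group_b = [5.6, 5.8, 5.5, 5.9, 5.7, 5.4]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# If p is below a chosen significance level (e.g. 0.05), we reject H0 in favour of H1;
# otherwise we fail to reject H0.
alpha = 0.05
print("reject H0" if p_value < alpha else "fail to reject H0")
```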
Inductive Bias
Inductive bias can be defined as the set of assumptions or biases that a learning algorithm
employs to make predictions on unseen data based on its training data. These assumptions are
inherent in the algorithm's design and serve as a foundation for learning and generalization.
The inductive bias of an algorithm influences how it selects a hypothesis (a possible
explanation or model) from the hypothesis space (the set of all possible hypotheses) that best
fits the training data. It helps the algorithm navigate the trade-off between fitting the training
data perfectly (overfitting) and generalizing well to unseen data (underfitting).
Types of Inductive Bias
Inductive bias can manifest in various forms, depending on the algorithm and its underlying
assumptions. Some common types of inductive bias include:
1. Bias towards simpler explanations: Many machine learning algorithms, such as
decision trees and linear models, have a bias towards simpler hypotheses. They prefer
explanations that are more parsimonious and less complex, as these are often more
likely to generalize well to unseen data.
2. Bias towards smoother functions: Algorithms like kernel methods or Gaussian
processes have a bias towards smoother functions. They assume that neighbouring
points in the input space should have similar outputs, leading to smooth decision
boundaries.
3. Bias towards specific types of functions: Neural networks, for example, have a bias
towards learning complex, nonlinear functions. This bias allows them to capture
intricate patterns in the data but can also lead to overfitting if not regularized properly.
4. Bias towards sparsity: Some algorithms, like Lasso regression, have a bias towards
sparsity. They prefer solutions where only a few features are relevant, which can
improve interpretability and generalization.
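The sparsity bias in point 4 can be seen directly in a small experiment; in the sketch below, the synthetic data-generating process (only two of ten features matter) is an assumption chosen for illustration:

```python
# Sparsity bias sketch: Lasso's L1 penalty drives irrelevant coefficients to exactly zero.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))                                     # 10 candidate features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)    # only 2 of them matter

ols = LinearRegression().fit(X, y)        # no sparsity bias
lasso = Lasso(alpha=0.1).fit(X, y)        # L1 penalty encodes a bias towards sparsity

print("OLS coefficients  :", ols.coef_.round(2))
print("Lasso coefficients:", lasso.coef_.round(2))   # most entries should be exactly 0.0
```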
Importance of Inductive Bias
Inductive bias is crucial in machine learning as it helps algorithms generalize from limited
training data to unseen data. Without a well-defined inductive bias, algorithms may struggle to
make accurate predictions or may overfit the training data, leading to poor performance on new
data.
Understanding the inductive bias of an algorithm is essential for model selection, as different
biases may be more suitable for different types of data or tasks. It also provides insights into
how the algorithm is learning and what assumptions it is making about the data, which can aid
in interpreting its predictions and results.
Bayes Classifiers
Naive Bayes classifiers are supervised machine learning algorithms used for classification
tasks, based on Bayes’ Theorem to find probabilities.
Key Features of Naive Bayes Classifiers
The main idea behind the Naive Bayes classifier is to use Bayes’ Theorem to classify data
based on the probabilities of different classes given the features of the data. It is used mostly
in high-dimensional text classification.
• The Naive Bayes Classifier is a simple probabilistic classifier with a very small number
of parameters, so the ML models it builds can make predictions faster than many other
classification algorithms.
• It is a probabilistic classifier that assumes one feature in the model is independent of
the existence of any other feature. In other words, each feature contributes to the
predictions with no relation to the other features.
• The Naïve Bayes algorithm is used in spam filtering, sentiment analysis, classifying
articles and many more applications.
Why it is Called Naive Bayes?
It is named “Naive” because it assumes that the presence of one feature does not affect other
features. The “Bayes” part of the name refers to its basis in Bayes’ Theorem.
Consider a fictional dataset that describes the weather conditions for playing a game of golf.
Given the weather conditions, each tuple classifies the conditions as fit (“Yes”) or unfit (“No”)
for playing golf.
The dataset is divided into two parts, namely, feature matrix and the response vector.
• Feature matrix contains all the vectors (rows) of the dataset, in which each vector
consists of the values of the independent features. In the above dataset, the features are
‘Outlook’, ‘Temperature’, ‘Humidity’ and ‘Windy’.
• Response vector contains the value of the class variable (prediction or output) for each
row of the feature matrix. In the above dataset, the class variable name is ‘Play golf’.
Assumption of Naive Bayes
The fundamental Naive Bayes assumption is that each feature makes an independent and equal
contribution to the outcome. More specifically:
• Feature independence: This means that when we are trying to classify something, we
assume that each feature (or piece of information) in the data does not affect any other
feature.
• Continuous features are normally distributed: If a feature is continuous, then it is
assumed to be normally distributed within each class.
• Discrete features have multinomial distributions: If a feature is discrete, then it is
assumed to have a multinomial distribution within each class.
• Features are equally important: All features are assumed to contribute equally to the
prediction of the class label.
• No missing data: The data should not contain any missing values.
How does it work?
• Assumption of Independence: The "naive" assumption in Naive Bayes is that the
presence of a particular feature in a class is independent of the presence of any other
feature, given the class. This is a strong assumption and may not hold true in real-world
data, but it simplifies the calculation and often works well in practice.
• Calculating Class Probabilities: Given a set of features x1, x2, ..., xn, the Naive Bayes
classifier calculates the probability of each class Ck given the features using Bayes'
theorem:
P(Ck | x1, x2, ..., xn) = P(Ck) · P(x1, x2, ..., xn | Ck) / P(x1, x2, ..., xn)
o Under the independence assumption, the numerator factorizes as
P(Ck) · P(x1 | Ck) · P(x2 | Ck) · ... · P(xn | Ck).
o The denominator P(x1, x2, ..., xn) is the same for all classes and can be ignored for
the purpose of comparison.
• Classification Decision: The classifier selects the class Ck with the highest probability
as the predicted class for the given set of features.
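A minimal sketch of this procedure, assuming scikit-learn's CategoricalNB and a few invented weather-style rows (not the table referred to earlier), looks like this:

```python
# Naive Bayes sketch on a tiny, invented weather-style dataset.
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Features: Outlook, Temperature, Humidity, Windy (rows are made up for illustration).
X_raw = [["Sunny", "Hot", "High", "False"],
         ["Rainy", "Cool", "Normal", "True"],
         ["Overcast", "Mild", "High", "False"],
         ["Sunny", "Mild", "Normal", "False"],
         ["Rainy", "Mild", "High", "True"],
         ["Overcast", "Hot", "Normal", "False"]]
y = ["No", "Yes", "Yes", "Yes", "No", "Yes"]    # Play golf?

encoder = OrdinalEncoder()                      # map category strings to integer codes
X = encoder.fit_transform(X_raw)

model = CategoricalNB()
model.fit(X, y)                                 # estimate P(Ck) and P(xi | Ck) from counts

new_day = encoder.transform([["Sunny", "Cool", "Normal", "False"]])
print("predicted class:", model.predict(new_day)[0])
print("class probabilities:", dict(zip(model.classes_, model.predict_proba(new_day)[0].round(2))))
```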
2. Bayes optimal classifier
The Bayes optimal classifier is a theoretical concept in machine learning that represents the
best possible classifier for a given problem. It is based on Bayes' theorem, which describes how
to update probabilities based on new evidence.
In the context of classification, the Bayes optimal classifier assigns the class label that has the
highest posterior probability given the input features. Mathematically, this can be expressed as:
y* = argmax over all classes Ck of P(Ck | x)
where P(Ck | x) is the true posterior probability of class Ck given the input x.
Bayes error
In machine learning, "Bayes error" refers to the theoretical minimum error rate that any
classifier could achieve on a given dataset. It represents the lowest possible classification error
given the inherent overlap and uncertainty between different classes in the data distribution.
Essentially, it is the best possible performance a classifier can achieve under the given data
conditions, and it acts as a benchmark for comparing the performance of different classification
algorithms.
Key points about Bayes error:
• Theoretical limit:
It is a theoretical concept because it assumes perfect knowledge of the true underlying
probability distributions of the data, which is usually not available in practice.
• Calculating Bayes error:
To calculate the Bayes error, you need to determine the conditional probabilities of each class
given the features, and then choose the class with the highest probability for each data point.
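Because the true distributions are unknown in practice, the Bayes error can only be illustrated on assumed distributions; the sketch below estimates it for two equally likely 1-D Gaussian classes (an assumption made for illustration):

```python
# Bayes error sketch: with known class densities, pick the class with the higher posterior;
# the remaining error comes from the overlap between the two densities.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 200_000

# Assumed true distributions: class 0 ~ N(0, 1), class 1 ~ N(2, 1), equal priors.
labels = rng.integers(0, 2, size=n)
x = np.where(labels == 0, rng.normal(0, 1, n), rng.normal(2, 1, n))

# Bayes optimal decision: choose the class with the larger (prior * likelihood).
pred = (norm.pdf(x, 2, 1) > norm.pdf(x, 0, 1)).astype(int)

bayes_error_estimate = np.mean(pred != labels)
print("estimated Bayes error:", round(bayes_error_estimate, 3))   # about 0.159 here
```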
Occam's razor
Occam's razor, a principle named after the 14th-century English philosopher William of
Ockham, serves as a guiding tool in various fields of knowledge, from philosophy to science.
This principle suggests that among competing hypotheses or explanations, the simplest one is
often the most accurate. By advocating for simplicity, Occam's razor encourages us to prioritize
elegant and straightforward solutions over unnecessarily convoluted ones.
Occam's razor is a principle that suggests that, when faced with multiple explanations or
hypotheses, the simplest one is usually the most accurate. In other words, it encourages us to
choose the option with the fewest assumptions or complexities.
Occam's razor serves as a guide to avoid unnecessary complications and to prioritize elegant
and straightforward solutions. By applying this principle, we can navigate through the
complexities of problems and make decisions based on the simplest and most plausible
explanation.
Feature Selection:
Feature selection involves selecting a subset of the original features that are most relevant to
the problem at hand. The goal is to reduce the dimensionality of the dataset while retaining
the most important features. There are several methods for feature selection, including filter
methods, wrapper methods, and embedded methods. Filter methods rank the features based
on their relevance to the target variable, wrapper methods use the model performance as the
criteria for selecting features, and embedded methods combine feature selection with the
model training process.
Feature Extraction:
Feature extraction involves creating new features by combining or transforming the original
features. The goal is to create a set of features that captures the essence of the original data in
a lower-dimensional space. There are several methods for feature extraction, including
principal component analysis (PCA), linear discriminant analysis (LDA), and t-distributed
stochastic neighbour embedding (t-SNE). PCA is a popular technique that projects the
original features onto a lower-dimensional space while preserving as much of the variance as
possible.
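A short sketch of feature extraction with PCA, assuming scikit-learn and using the iris dataset and two components purely as illustrative choices:

```python
# Feature extraction sketch: project 4 original features onto 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)          # 4 original features

X_std = StandardScaler().fit_transform(X)  # scale first so no feature dominates
pca = PCA(n_components=2)                  # extract 2 new features (components)
X_reduced = pca.fit_transform(X_std)

print("reduced shape:", X_reduced.shape)                     # (150, 2)
print("variance explained:", pca.explained_variance_ratio_.round(3))
```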
Why is Dimensionality Reduction important in Machine Learning and Predictive
Modelling?
An intuitive example of dimensionality reduction can be discussed through a simple e-mail
classification problem, where we need to classify whether the e-mail is spam or not. This can
involve a large number of features, such as whether or not the e-mail has a generic title, the
content of the e-mail, whether the e-mail uses a template, etc. However, some of these features
may overlap. In another condition, a classification problem that relies on both humidity and
rainfall can be collapsed into just one underlying feature, since both of the aforementioned are
correlated to a high degree. Hence, we can reduce the number of features in such problems. A
3-D classification problem can be hard to visualize, whereas a 2-D one can be mapped to a
simple 2-dimensional space, and a 1-D problem to a simple line. For example, a 3-D feature
space can be split into two 2-D feature spaces and, if the features in a pair are found to be
correlated, the number of features can be reduced even further.
Feature Scaling
Feature scaling brings all features onto a comparable numerical range so that no single feature
dominates the learning process simply because of its magnitude. Several common scaling
methods are described below.
Absolute Maximum Scaling
In this method we first find the maximum absolute value of each column and then divide every
entry of the column by that maximum absolute value.
After performing the above-mentioned two steps we will observe that each entry of the column
lies in the range of -1 to 1. However, this method is not used very often, the reason being that it
is too sensitive to outliers, and when dealing with real-world data the presence of outliers is
very common.
Min-Max Scaling
Min-Max Scaling is a feature scaling method that transforms the values of features to fit within
a specific range, generally between 0 and 1. This method is mainly useful when you want to
make sure that all features have the same scale, preventing any single feature from dominating
the model because of its large value range.
This method of scaling requires the below two steps:
1. First, we find the minimum and the maximum value of the column.
2. Then we subtract the minimum value from each entry and divide the result by the
difference between the maximum and the minimum value.
X_scaled = (Xi - Xmin) / (Xmax - Xmin)
Where:
• Xi is the original value of the feature.
• Xmin is the minimum value of the feature.
• Xmax is the maximum value of the feature.
• X_scaled is the scaled value of the feature.
As we are using the maximum and the minimum value, this method is also prone to outliers,
but the range in which the data will lie after performing the above two steps is between 0 and 1.
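A minimal sketch of these two steps, shown both by hand with NumPy and with scikit-learn's MinMaxScaler; the small column of values is invented:

```python
# Min-Max scaling sketch: subtract the minimum, divide by (max - min).
import numpy as np
from sklearn.preprocessing import MinMaxScaler

col = np.array([[20.0], [30.0], [50.0], [100.0]])

manual = (col - col.min()) / (col.max() - col.min())   # by hand
scaled = MinMaxScaler().fit_transform(col)             # with scikit-learn (range 0 to 1)

print(manual.ravel())   # [0.    0.125 0.375 1.   ]
print(scaled.ravel())   # same result
```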
Normalization
This method is more or less the same as the previous one, but here, instead of the minimum
value, we subtract the mean value of the whole column from each entry and then divide the
result by the difference between the maximum and the minimum value:
X_scaled = (Xi - Xmean) / (Xmax - Xmin)
Standardization
It is also referred to as Z-score normalization, and is a feature scaling technique that
transforms the values of a feature so that they have a mean of 0 and a standard deviation
of 1. This technique is especially useful when you want to center your data and ensure that
each feature contributes equally to the model's learning process.
This method of scaling is based on the central tendency and variance of the data:
1. First we calculate the mean and standard deviation of the data we would like to
normalize.
2. Then we subtract the mean value from each entry and divide the result by the standard
deviation.
X_scaled = (Xi - mean) / (standard deviation)
This gives a distribution of the data with a mean equal to zero and a standard deviation equal
to 1.
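A brief sketch of standardization on an invented column, by hand and with scikit-learn's StandardScaler (which divides by the population standard deviation):

```python
# Standardization sketch: subtract the mean, divide by the standard deviation.
import numpy as np
from sklearn.preprocessing import StandardScaler

col = np.array([[10.0], [20.0], [30.0], [40.0]])

manual = (col - col.mean()) / col.std()          # by hand (population std, ddof=0)
scaled = StandardScaler().fit_transform(col)

print(manual.ravel())                            # approximately [-1.342 -0.447  0.447  1.342]
print(scaled.ravel())                            # same values
print(scaled.mean(), scaled.std())               # mean ~0, standard deviation ~1
```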
Robust Scaling
In this method of scaling, we use two main statistical measures of the data.
• Median
• Inter-Quartile Range
After calculating these two values, we subtract the median from each entry and then divide the
result by the interquartile range:
X_scaled = (Xi - Xmedian) / IQR
Where:
Xi is the original value of the feature.
Xmedian is the median of the feature.
IQR is the interquartile range of the feature, that is, the difference between the 75th percentile
(Q3) and the 25th percentile (Q1).
X_scaled is the robust-scaled value of the feature.
This method rescales the feature by centering it around the median and scaling it according to
the IQR, which reduces the effect of outliers.
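A sketch of robust scaling on an invented column containing an outlier, using NumPy and scikit-learn's RobustScaler (which by default centres on the median and scales by the IQR):

```python
# Robust scaling sketch: subtract the median, divide by the interquartile range.
import numpy as np
from sklearn.preprocessing import RobustScaler

col = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # 100 is an outlier

median = np.median(col)
q1, q3 = np.percentile(col, [25, 75])
manual = (col - median) / (q3 - q1)                     # (Xi - median) / IQR

scaled = RobustScaler().fit_transform(col)

print(manual.ravel())
print(scaled.ravel())   # matches the manual calculation; the outlier no longer dominates
```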
Why use Feature Scaling?
In machine learning, feature scaling is used for a number of purposes:
• Range: Scaling guarantees that all features are on a comparable scale and have
comparable ranges. This process is known as feature normalisation. This is significant
because the magnitude of the features has an impact on many machine learning
techniques. Larger scale features may dominate the learning process and have an
excessive impact on the outcomes.
• Algorithm performance improvement: When the features are scaled, several machine
learning methods, including gradient descent-based algorithms, distance-based
algorithms (such as k-nearest neighbours) and support vector machines, perform better
or converge more quickly. Scaling the features can therefore enhance an algorithm's
performance and help it converge to the optimal outcome.
• Preventing numerical instability: Numerical instability can be prevented by avoiding
significant scale disparities between features. Examples include distance calculations,
where features with differing scales can result in numerical overflow or underflow
problems. Scaling the features helps keep these computations stable.
• Equal importance: Scaling features makes sure that each feature is given the same
consideration during the learning process. Without scaling, larger-scale features could
dominate the learning, producing skewed outcomes. Scaling removes this bias so that
each feature contributes fairly to the model's predictions.
What is Feature Selection?
• Feature Selection is the method of reducing the number of input variables to your model
by using only relevant data and getting rid of noise in the data.
It is the process of automatically choosing relevant features for your machine learning
model based on the type of problem you are trying to solve. We do this by including or
excluding important features without changing them. It helps in cutting down the noise
in our data and reducing the size of our input data.
Or
A feature is an attribute that has an impact on a problem or is useful for the problem,
and choosing the important features for the model is known as feature selection. Each
machine learning process depends on feature engineering, which mainly contains two
processes; which are Feature Selection and Feature Extraction. Although feature
selection and extraction processes may have the same objective, both are completely
different from each other. The main difference between them is that feature selection is
about selecting the subset of the original feature set, whereas feature extraction creates
new features. Feature selection is a way of reducing the number of input variables for
the model by using only relevant data in order to reduce overfitting in the model.
So, we can define feature Selection as, "It is a process of automatically or manually
selecting the subset of most appropriate and relevant features to be used in model
building." Feature selection is performed by either including the important features or
excluding the irrelevant features in the dataset without changing them.
1. Filter methods
Filter methods evaluate each feature independently with respect to the target variable.
Features with high correlation with the target variable are selected, as this means the
feature has some relation to the target and can help us in making predictions. These
methods are used in the preprocessing phase to remove irrelevant or redundant features
based on statistical tests (such as correlation) or other criteria.
Advantages:
• Fast and inexpensive: Can quickly evaluate features without training the model.
• Good for removing redundant or correlated features.
Limitations: These methods don’t consider feature interactions so they may miss
feature combinations that improve model performance.
• Fisher’s Score – Fisher’s Score evaluates each feature independently according to its
score under the Fisher criterion, which can lead to a suboptimal set of features. The
larger the Fisher’s score, the better the selected feature.
• Correlation Coefficient – Pearson’s Correlation Coefficient is a measure of
quantifying the association between the two continuous variables and the direction of
the relationship with its values ranging from -1 to 1.
• Variance Threshold – It is an approach where all features are removed whose variance
doesn’t meet the specific threshold. By default, this method removes features having
zero variance. The assumption made using this method is higher variance features are
likely to contain more information.
• Mean Absolute Difference (MAD) – This method is similar to the variance threshold
method, but the difference is that there is no squaring in MAD. This method calculates
the mean absolute difference from the mean value.
• Dispersion Ratio – Dispersion ratio is defined as the ratio of the Arithmetic mean
(AM) to that of Geometric mean (GM) for a given feature. Its value ranges from +1 to
∞ as AM ≥ GM for a given feature. Higher dispersion ratio implies a more relevant
feature.
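Two of the filter criteria above (the variance threshold and the correlation coefficient) are sketched below on an invented dataset with one constant, one noisy and one informative feature:

```python
# Filter-method sketch: variance threshold and Pearson correlation with the target.
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "constant": np.ones(100),             # zero variance: carries no information
    "noise":    rng.normal(size=100),     # unrelated to the target
    "signal":   rng.normal(size=100),
})
target = 2 * df["signal"] + rng.normal(scale=0.1, size=100)

# Variance threshold: drop features whose variance is (near) zero.
selector = VarianceThreshold(threshold=0.0)
kept = df.columns[selector.fit(df).get_support()]
print("kept by variance threshold:", list(kept))          # drops "constant"

# Correlation coefficient: rank features by |correlation| with the target
# (the zero-variance column yields NaN, since correlation is undefined for it).
print(df.corrwith(target).abs().sort_values(ascending=False))
```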
2. Wrapper methods
Wrapper methods are also referred to as greedy algorithms because they train the model
iteratively. They use different combinations of features, compute the relationship
between these feature subsets and the target variable, and based on the results add or
remove features. The stopping criteria for selecting the best subset are usually pre-
defined by the person training the model, such as when the performance of the model
decreases or a specific number of features is reached.
Advantages:
• Can lead to better model performance since they evaluate feature subsets in the context
of the model.
• They can capture feature dependencies and interactions.
Limitations: They are computationally more expensive than filter methods especially
for large datasets.
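One common wrapper-style procedure is recursive feature elimination (RFE); in the sketch below, the synthetic dataset and the choice of keeping 5 features are illustrative assumptions:

```python
# Wrapper-method sketch: RFE repeatedly trains the model, drops the weakest feature,
# and retrains, stopping once the requested number of features is reached.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)

print("selected feature indices:", [i for i, keep in enumerate(rfe.support_) if keep])
print("feature ranking:", rfe.ranking_)   # 1 marks the selected features
```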
3. Embedded methods
Embedded methods perform feature selection during the model training process. They
combine the benefits of both filter and wrapper methods. Feature selection is integrated
into the model training allowing the model to select the most relevant features based on
the training process dynamically.
Advantages:
• More efficient than wrapper methods because the feature selection process is embedded
within model training.
• Often more scalable than wrapper methods.
Limitations: Works with a specific learning algorithm, so the feature selection might
not work well with other models.
This method handles the iterative model training process while keeping the computation
cost to a minimum. Examples: Lasso and Ridge Regression.
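A sketch of embedded selection, using Lasso inside scikit-learn's SelectFromModel on synthetic regression data (an illustrative assumption): the L1 penalty zeroes out weak coefficients during training, and the corresponding features are dropped.

```python
# Embedded-method sketch: feature selection happens as part of model training.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=300, n_features=10, n_informative=3, noise=5.0, random_state=0)

selector = SelectFromModel(Lasso(alpha=1.0))   # the L1-penalized model drives selection
X_selected = selector.fit_transform(X, y)

print("original number of features:", X.shape[1])
print("features kept by Lasso:", X_selected.shape[1])
print("kept feature indices:", list(selector.get_support(indices=True)))
```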