DL Unit 1 Notes
Prepared by
Dr.M.Narayanan
Professor
Department of CSE
Malla Reddy University, Hyderabad
DEEP LEARNING AND APPLICATIONS
MR20-1CS0158
UNIT I
Machine Learning Basics: Learning Algorithms, Capacity, Overfitting, and Underfitting,
Hyperparameters and Validation Sets, Estimators, Bias and Variance, Maximum
Likelihood Estimation, Bayesian Statistics, Supervised and Unsupervised Learning
Algorithms, Stochastic Gradient Descent, Building a ML Algorithm, Challenges and
Motivation to Deep Learning.
Text Book
1. Goodfellow, I., Bengio, Y., and Courville, A., Deep Learning, MIT Press, 2016
Machine Learning
Machine learning is a branch of artificial intelligence that empowers computers to learn
from data and improve their performance over time without explicit programming.
It involves creating algorithms that can recognize patterns in data, make predictions, and
solve complex tasks, leading to applications in areas such as image recognition, language
processing, and autonomous systems.
Application:
Develop a machine learning model that can analyze medical imaging data to accurately
detect and diagnose specific diseases, such as identifying early signs of diabetic
retinopathy in retinal scans.
Construct a sentiment analysis model that can automatically determine the sentiment
(positive, negative, neutral) expressed in a given text, which could be used for
analyzing social media posts or customer reviews.
Create a model that predicts stock market trends and prices by analyzing historical
market data and incorporating relevant economic indicators, assisting investors in
making informed decisions.
Deep Learning
Deep learning is a subset of machine learning focused on neural networks with multiple
layers, allowing automated extraction of complex data features.
Deep learning is a method in artificial intelligence (AI) that teaches computers to process
data in a way that is inspired by the human brain.
Deep learning models can recognize complex patterns in pictures, text, sounds, and
other data to produce accurate insights and predictions.
It excels in tasks like image recognition, natural language processing, and autonomous
systems, by learning hierarchical representations from data and eliminating the need for
extensive manual feature engineering.
Application:
Design a deep learning algorithm that enables a drone to navigate through complex
environments and avoid obstacles, using onboard cameras for visual input.
Create a deep learning model to segment and identify specific structures or anomalies in
medical images, aiding in tasks like tumor detection in MRI scans.
Machine Learning vs. Deep Learning
Definition: ML is a subset of AI that learns patterns from data; DL is a subset of ML that uses deep neural networks.
Data Dependency: ML works well with smaller, structured datasets; DL requires large amounts of data and works with unstructured data.
Feature Engineering: ML requires manual feature extraction and selection; DL automatically extracts features from raw data.
Performance with Scale: ML may not improve significantly with more data; DL performance improves with more data and deeper models.
Interpretability: ML is generally more interpretable and explainable; DL is often considered a "black box" with less interpretability.
Learning Algorithms
A machine learning algorithm is an algorithm that is able to learn from data.
What is meant by learning?
Mitchell (1997) provides the definition:
“A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks in T, as measured by
P, improves with experience E.”
One can imagine a very wide variety of experiences E, tasks T, and performance measures
P.
Learning refers to the process by which a computer program (or system) becomes better at
performing a certain task T as it gains more experience E.
The improvement in performance is measured by a performance measure P. This process
involves the program adjusting its internal parameters based on the data it encounters,
allowing it to make better predictions or decisions over time.
The goal of learning is to enable the program to generalize from the provided data
and perform well on new, unseen data.
The Task, T
Machine learning allows us to tackle tasks that are too difficult to solve with fixed
programs written and designed by human beings.
From a scientific and philosophical point of view, machine learning is interesting
because developing it deepens our understanding of the principles that underlie
intelligence.
In this relatively formal definition of the word “task,”
“The process of learning itself is not the task. Learning is our means of attaining
the ability to perform the task”.
For example, if we want a robot to be able to walk, then walking is the task.
We could program the robot to learn to walk, or we could attempt to directly write a
program that specifies how to walk manually.
Machine learning tasks are usually described in terms of how the machine learning
system should process an example.
An example is a collection of features that have been quantitatively measured from
some object or event that we want the machine learning system to process.
We typically represent an example as a vector x ∈ R^n, where each entry x_i of the vector
is another feature.
For example, the features of an image are usually the values of the pixels in the image.
Many kinds of tasks can be solved with machine learning. Some of the most
common machine learning tasks include the following:
1.Classification:
Classification is a supervised machine learning method where the model tries to
predict the correct label of a given input data.
In classification, the model is fully trained using the training data, and then it is
evaluated on test data before being used to perform prediction on new unseen data.
In this type of task, the computer program is asked to specify which of k categories some
input belongs to.
To solve this task, the learning algorithm is usually asked to produce a function
f : R^n → {1, . . . , k}.
When y = f (x), the model assigns an input described by vector x to a category identified
by numeric code y.
There are other variants of the classification task, for example, where f outputs a
probability distribution over classes.
An example of a classification task is object recognition, where the input is an image
(usually described as a set of pixel brightness values), and the output is a numeric code
identifying the object in the image.
2. Classification with missing inputs:
Classification becomes more challenging if the computer program is not guaranteed that every
measurement in its input vector will always be provided.
In order to solve the classification task, the learning algorithm only has to define a single
function mapping from a vector input to a categorical output.
When some of the inputs may be missing, rather than providing a single classification function,
the learning algorithm must learn a set of functions. Each function corresponds to classifying x
with a different subset of its inputs missing.
This kind of situation arises frequently in medical diagnosis, because many kinds of medical
tests are expensive or invasive.
One way to efficiently define such a large set of functions is to learn a probability distribution
over all of the relevant variables, then solve the classification task by marginalizing out the
missing variables.
3. Regression:
Regression is a supervised machine learning technique which is used to predict
continuous values. The ultimate goal of the regression algorithm is to plot a best-fit
line or a curve between the data. The three main metrics that are used for evaluating
the trained regression model are variance, bias and error.
In this type of task, the computer program is asked to predict a numerical value given some
input.
To solve this task, the learning algorithm is asked to output a function f : R^n → R.
This type of task is similar to classification, except that the format of output is different.
An example of a regression task is the prediction of the expected claim amount that an
insured person will make (used to set insurance premiums), or the prediction of future
prices of securities.
These kinds of predictions are also used for algorithmic trading.
4. Transcription:
In this type of task, the machine learning system is asked to observe a relatively
unstructured representation of some kind of data and transcribe it into discrete, textual
form.
For example, in optical character recognition, the computer program is shown a
photograph containing an image of text and is asked to return this text in the form of a
sequence of characters (e.g., in ASCII or Unicode format).
5. Machine translation:
In a machine translation task, the input already consists of a sequence of symbols in some
language, and the computer program must convert this into a sequence of symbols in
another language.
This is commonly applied to natural languages, such as to translate from English to
French.
Deep learning has recently begun to have an important impact on this kind of task
(Sutskever et al., 2014; Bahdanau et al., 2015).
6. Structured output:
Structured output tasks involve any task where the output is a vector (or other data
structure containing multiple values) with important relationships between the different
elements.
This is a broad category, and incorporates the transcription and translation tasks
described above, but also many other tasks.
One example is parsing—mapping a natural language sentence into a tree that
describes its grammatical structure and tagging nodes of the trees as being verbs,
nouns, or adverbs, and so on.
7. Anomaly detection:
In this type of task, the computer program sifts through a set of events or objects, and
flags some of them as being unusual or uncharacteristic.
An example of an anomaly detection task is credit card fraud detection.
By modeling your purchasing habits, a credit card company can detect misuse of your
cards.
If a thief steals your credit card or credit card information, the thief’s purchases will
often come from a different probability distribution over purchase types than your own.
The credit card company can prevent fraud by placing a hold on an account as soon as
that card has been used for an uncharacteristic purchase.
8. Denoising:
In this type of task, the machine learning algorithm is given as input a corrupted
example x̃ ∈ R^n obtained by an unknown corruption process from a clean example x ∈ R^n.
The learner must predict the clean example x from its corrupted version x̃, or more
generally predict the conditional probability distribution p(x | x̃).
9. Density estimation or probability mass function estimation:
In the density estimation problem, the machine learning algorithm is asked to learn a
function p_model : R^n → R, where p_model(x) can be interpreted as a probability density
function (if x is continuous) or a probability mass function (if x is discrete) on the
space that the examples were drawn from.
To do such a task well (we will specify exactly what that means when we discuss
performance measures P ), the algorithm needs to learn the structure of the data it has
seen. It must know where examples cluster tightly and where they are unlikely to
occur.
Most of the tasks described above require that the learning algorithm has at least
implicitly captured the structure of the probability distribution.
Of course, many other tasks and types of tasks are possible. The types of tasks we list
here are intended only to provide examples of what machine learning can do, not to
define a rigid taxonomy (classification) of tasks.
The Performance Measure, P
In order to evaluate the abilities of a machine learning algorithm, we must design a
quantitative measure of its performance.
Usually this performance measure P is specific to the task T being carried out by the
system.
For tasks such as classification, classification with missing inputs, and
transcription, we often measure the accuracy of the model.
Accuracy is just the proportion of examples for which the model produces the
correct output. We can also obtain equivalent information by measuring the error rate,
the proportion of examples for which the model produces an incorrect output.
We often refer to the error rate as the expected 0-1 loss. The 0-1 loss on a particular
example is 0 if it is correctly classified and 1 if it is not.
For tasks such as density estimation, it does not make sense to measure accuracy, error
rate, or any other kind of 0-1 loss.
Instead, we must use a different performance metric that gives the model a continuous-
valued score for each example. The most common approach is to report the average
log-probability the model assigns to some examples.
The choice of performance measure may seem straightforward and objective, but it is
often difficult to choose a performance measure that corresponds well to the desired
behavior of the system.
In some cases, this is because it is difficult to decide what should be measured.
For example, when performing a transcription task, should we measure the accuracy of
the system at transcribing entire sequences, or should we use a more fine-grained
performance measure that gives partial credit for getting some elements of the sequence
correct?
When performing a regression task, should we correct the system more if it frequently
makes medium-sized mistakes or if it rarely makes very large mistakes?
These kinds of design choices depend on the application.
The Experience, E
Machine learning algorithms can be broadly categorized as unsupervised or supervised
by what kind of experience they are allowed to have during the learning process.
Most of the learning algorithms can be understood as being allowed to experience an
entire dataset.
One of the oldest datasets studied by statisticians and machine learning researchers is
the Iris dataset (Fisher, 1936).
It is a collection of measurements of different parts of 150 iris plants. Each individual
plant corresponds to one example.
The features within each example are the measurements of each of the parts of the
plant: the sepal length, sepal width, petal length and petal width.
The dataset also records which species each plant belonged to. Three different species
are represented in the dataset.
Unsupervised learning algorithms experience a dataset containing many features, then
learn useful properties of the structure of this dataset.
In the context of deep learning, we usually want to learn the entire probability
distribution that generated a dataset, whether explicitly as in density estimation or
implicitly for tasks like synthesis or denoising.
Some other unsupervised learning algorithms perform other roles, like clustering,
which consists of dividing the dataset into clusters of similar examples.
Supervised learning algorithms experience a dataset containing features, but each
example is also associated with a label or target.
For example, the Iris dataset is annotated with the species of each iris plant.
A supervised learning algorithm can study the Iris dataset and learn to classify iris
plants into three different species based on their measurements.
Some machine learning algorithms do not just experience a fixed dataset. For
example, reinforcement learning algorithms interact with an environment, so
there is a feedback loop between the learning system and its experiences.
Most machine learning algorithms simply experience a dataset. A dataset can be
described in many ways. In all cases, a dataset is a collection of examples, which are in
turn collections of features.
There is no formal definition of supervised and unsupervised learning, there is no
rigid taxonomy of datasets or experiences. The structures described here cover
most cases, but it is always possible to design new ones for new applications.
Capacity, Overfitting and Underfitting
The central challenge in machine learning is that we must perform well on new, previously
unseen inputs—not just those on which our model was trained.
The ability to perform well on previously unobserved inputs is called generalization.
Typically, when training a machine learning model, we have access to a training set, we
can compute some error measure on the training set called the training error, and we
reduce this training error.
So far, what we have described is simply an optimization problem.
What separates machine learning from optimization is that we want the generalization
error, also called the test error, to be low as well.
The generalization error is defined as the expected value of the error on a new
input.
Here the expectation is taken across different possible inputs, drawn from the
distribution of inputs we expect the system to encounter in practice.
We typically estimate the generalization error of a machine learning model by
measuring its performance on a test set of examples that were collected separately from
the training set.
If the training and the test set are collected arbitrarily, there is indeed little we can do.
If we are allowed to make some assumptions about how the training and test set are collected,
then we can make some progress.
The train and test data are generated by a probability distribution over datasets called
the data generating process.
We typically make a set of assumptions known collectively as the i.i.d. assumptions
(independent and identically distributed random variables): the examples in each dataset
are independent from each other, and the training set and test set are identically
distributed, drawn from the same probability distribution as each other.
This assumption allows us to describe the data generating process with a probability
distribution over a single example.
The same distribution is then used to generate every train example and every test example.
We call that shared underlying distribution the data generating distribution, denoted
pdata.
This probabilistic framework and the i.i.d. assumptions (Independent and identically
distributed random variables) allow us to mathematically study the relationship between
training error and test error.
One immediate connection we can observe between the training and test error is that
the expected training error of a randomly selected model is equal to the expected test
error of that model.
Suppose we have a probability distribution p( x, y) and we sample from it repeatedly to
generate the train set and the test set.
For some fixed value w, the expected training set error is exactly the same as the expected
test set error, because both expectations are formed using the same dataset sampling
process.
The only difference between the two conditions is the name we assign to the dataset we
sample.
Of course, when we use a machine learning algorithm, we do not fix the parameters ahead of
time and then sample both datasets. We sample the training set, then use it to choose the
parameters to reduce training set error, and only then sample the test set.
Under this process, the expected test error is greater than or equal to the expected value of
training error.
The factors determining how well a machine learning algorithm will perform are its ability
to:
1. Make the training error small.
2. Make the gap between training and test error small.
These two factors correspond to the two central challenges in machine learning:
underfitting and overfitting.
Underfitting occurs when the model is not able to obtain a sufficiently low error value on
the training set.
Overfitting occurs when the gap between the training error and test error is too large.
We can control whether a model is more likely to overfit or underfit by altering its
capacity. Informally, a model’s capacity is its ability to fit a wide variety of functions.
Models with low capacity may struggle to fit the training set.
Models with high capacity can overfit by memorizing properties of the training set that do
not serve them well on the test set.
One way to control the capacity of a learning algorithm is by choosing its hypothesis
space, the set of functions that the learning algorithm is allowed to select as being the
solution.
For example, the linear regression algorithm has the set of all linear functions of its input
as its hypothesis space.
We can generalize linear regression to include polynomials, rather than just linear
functions, in its hypothesis space.
Doing so increases the model’s capacity.
A polynomial of degree one gives us the linear regression model with which we are already
familiar, with prediction
yˆ = b + wx.
By introducing x^2 as another feature provided to the linear regression model, we can learn a
model that is quadratic as a function of x:
yˆ = b + w1 x + w2 x^2
Though this model implements a quadratic function of its input, the output is still a linear
function of the parameters, so we can still use the normal equations to train the model in closed
form. We can continue to add more powers of x as additional features, for example to obtain a
polynomial of degree 9:
yˆ = b + Σ_{i=1}^{9} wi x^i
Machine learning algorithms will generally perform best when their capacity is
appropriate in regard to the true complexity of the task they need to perform and the
amount of training data they are provided with.
Models with insufficient capacity are unable to solve complex tasks. Models with high
capacity can solve complex tasks, but when their capacity is higher than needed to solve the
present task they may overfit.
Fig. 5.2 shows this principle in action. We compare a linear, quadratic and degree-9
predictor attempting to fit a problem where the true underlying function is quadratic.
The linear function is unable to capture the curvature in the true underlying problem,
so it underfits.
The degree-9 predictor is capable of representing the correct function, but it is also capable
of representing infinitely many other functions that pass exactly through the training
points, because we have more parameters than training examples.
We have little chance of choosing a solution that generalizes well when so many wildly
different solutions exist.
In this example, the quadratic model is perfectly matched to the true structure of the task
so it generalizes well to new data.
Figure 5.2: We fit three models to this example training set. The training data was
generated synthetically, by randomly sampling x values and choosing y deterministically
by evaluating a quadratic function.
(Left) A linear function fit to the data suffers from underfitting—it cannot capture
the curvature that is present in the data.
(Center) A quadratic function fit to the data generalizes well to unseen points. It does
not suffer from a significant amount of overfitting or underfitting.
(Right) A polynomial of degree 9 fit to the data suffers from overfitting. Here we
used the Moore-Penrose pseudoinverse to solve the underdetermined normal
equations.
The solution passes through all of the training points exactly, but we have not been
lucky enough for it to extract the correct structure.
It now has a deep valley in between two training points that does not appear in the true
underlying function.
It also increases sharply on the left side of the data, while the true function decreases in
this area.
So far we have only described changing a model’s capacity by changing the number of
input features it has (and simultaneously adding new parameters associated with those
features).
There are in fact many ways of changing a model’s capacity. Capacity is not
determined only by the choice of model.
The model specifies which family of functions the learning algorithm can choose from
when varying the parameters in order to reduce a training objective.
This is called the representational capacity of the model. In many cases, finding the best
function within this family is a very difficult optimization problem.
In practice, the learning algorithm does not actually find the best function, but merely one
that significantly reduces the training error.
These additional limitations, such as the imperfection of the optimization algorithm, mean
that the learning algorithm’s effective capacity may be less than the representational
capacity of the model family
Figure 5.3: Typical relationship between capacity and error. Training and test error
behave differently. At the left end of the graph, training error and generalization
error are both high. This is the underfitting regime.
As we increase capacity, training error decreases, but the gap between training and
generalization error increases. Eventually, the size of this gap outweighs the decrease
in training error, and we enter the overfitting regime, where capacity is too large, above the
optimal capacity.
Regularization
The no free lunch theorem implies that we must design our machine learning algorithms
to perform well on a specific task. We do so by building a set of preferences into the
learning algorithm. When these preferences are aligned with the learning problems we ask
the algorithm to solve, it performs better.
Regularization is a technique used in machine learning and deep learning to reduce
overfitting by discouraging the model from becoming too complex.
It does this by adding a penalty to the loss function, which limits how much the model can
rely on large weights or complicated structures.
In simple terms, regularization helps the model focus on the most important patterns in the
training data and ignore noise or random fluctuations, leading to better performance on
new, unseen data.
There are different types of regularization:
L1 regularization (Lasso): Makes some weights exactly zero, encouraging sparsity.
L2 regularization (Ridge or weight decay): Keeps weights small, encouraging smoother
models.
The "No Free Lunch" (NFL) theorem in optimization and machine learning states
that no single optimization algorithm or machine learning model can universally
outperform all others across all possible problems. Essentially, if an algorithm
performs well on a specific set of problems, it must perform correspondingly worse
on others when averaged over all possible problems. This implies that there is no
"magic bullet" algorithm that works best in all situations.
Regularization is any modification we make to a learning algorithm that is intended
to reduce its generalization error but not its training error.
Regularization is one of the central concerns of the field of machine learning, rivaled
in its importance only by optimization.
The no free lunch theorem has made it clear that there is no best machine learning
algorithm, and, in particular, no best form of regularization. Instead we must choose a form
of regularization that is well-suited to the particular task we want to solve.
Hyperparameters and Validation Sets
Most machine learning algorithms have several settings that we can use to control the
behavior of the learning algorithm. These settings are called hyperparameters.
The values of hyperparameters are not adapted by the learning algorithm itself (though we can
design a nested learning procedure where one learning algorithm learns the best
hyperparameters for another learning algorithm).
Hyperparameters are parameters whose values control the learning process and
determine the values of model parameters that a learning algorithm ends up learning.
The prefix 'hyper' suggests that they are 'top-level' parameters that control the learning
process and the model parameters that result from it.
What is the difference between parameters and hyperparameters?
Parameters are the internal values (like weights and biases in a neural network) that are
automatically learned from the training data and directly determine the model’s output.
Hyperparameters are the external settings (like learning rate, batch size, number of
layers) manually defined before training that guide the learning process but are not
updated during it.
Sometimes a setting is chosen to be a hyperparameter that the learning algorithm does not
learn because it is too difficult to optimize.
More frequently, we do not learn the hyperparameter because it is not appropriate to learn
that hyperparameter on the training set.
This applies to all hyperparameters that control model capacity. If learned on the
training set, such hyperparameters would always choose the maximum possible model
capacity, resulting in overfitting (refer to Fig. 5.3).
For example, we can always fit the training set better with a higher degree polynomial and
a weight decay setting of λ = 0 than we could with a lower degree polynomial and a
positive weight decay setting.
To solve this problem, we need a validation set of examples that the training algorithm
does not observe.
Earlier we discussed how a held-out test set, composed of examples coming from the same
distribution as the training set, can be used to estimate the generalization error of a learner,
after the learning process has completed.
It is important that the test examples are not used in any way to make choices about the
model, including its hyperparameters.
For this reason, no example from the test set can be used in the validation set. Therefore,
we always construct the validation set from the training data.
Specifically, we split the training data into two disjoint subsets.
One of these subsets is used to learn the parameters. The other subset is our validation
set, used to estimate the generalization error during or after training, allowing for the
hyperparameters to be updated accordingly.
The subset of data used to learn the parameters is still typically called the training set,
even though this may be confused with the larger pool of data used for the entire
training process.
The subset of data used to guide the selection of hyperparameters is called the
validation set.
Typically, one uses about 80% of the training data for training and 20% for validation.
Since the validation set is used to “train” the hyperparameters, the validation set error
will underestimate the generalization error, though typically by a smaller amount than
the training error.
After all hyperparameter optimization is complete, the generalization error may be
estimated using the test set.
Cross-Validation in ML
Cross validation is a technique used in machine learning to evaluate the performance
of a model on unseen data.
It involves dividing the available data into multiple folds or subsets, using one of these
folds as a validation set, and training the model on the remaining folds.
This process is repeated multiple times, each time using a different fold as the
validation set.
Finally, the results from each validation step are averaged to produce a more robust
estimate of the model’s performance.
Cross validation is an important step in the machine learning process and helps to
ensure that the model selected for deployment is robust and generalizes well to new
data.
Cross-validation is a technique for evaluating ML models by training several ML
models on subsets of the available input data and evaluating them on the
complementary subset of the data. Use cross-validation to detect overfitting, i.e., failing
to generalize a pattern.
Application of Maximum Likelihood Estimation (MLE):
A medical researcher is studying the relationship between a patient's age and whether
they have a particular disease (yes/no). Using a dataset of patients, the researcher uses
MLE to estimate the coefficients in a logistic regression model that predicts the
probability of disease based on age.
What is an EM algorithm?
The Expectation-Maximization (EM) algorithm is defined as the combination of
various unsupervised machine learning algorithms, which is used to determine the
local maximum likelihood estimates (MLE) or maximum a posteriori estimates
(MAP) for unobservable variables in statistical models.
Further, it is a technique to find maximum likelihood estimation when the latent
variables are present. It is also referred to as the latent variable model.
A latent variable model consists of both observable and unobservable variables
where observable can be predicted while unobserved are inferred from the
observed variable. These unobservable variables are known as latent variables.
Note:
“A priori” and “a posteriori” refer primarily to how, or on what basis, a proposition
might be known.
In general terms, a proposition is knowable a priori if it is knowable independently of
experience, while a proposition knowable a posteriori is knowable on the basis of
experience.
EM Algorithm
The EM algorithm is the combination of various unsupervised ML algorithms, such as
the k-means clustering algorithm. Being an iterative approach, it consists of two
modes. In the first mode, we estimate the missing or latent variables. Hence it is
referred to as the Expectation/estimation step (E-step). Further, the other mode is
used to optimize the parameters of the models so that it can explain the data more
clearly. The second mode is known as the maximization-step or M-step.
Expectation step (E - step): It involves the estimation (guess) of all missing values in the
dataset so that after completing this step, there should not be any missing value.
Maximization step (M - step): This step involves the use of estimated data in the E-step
and updating the parameters.
Repeat E-step and M-step until the convergence of the values occurs.
The primary goal of the EM algorithm is to use the available observed data of the
dataset to estimate the missing data of the latent variables and then use that data to
update the values of the parameters in the M-step.
What is Convergence in the EM algorithm?
Intuitively, convergence describes the situation where successive estimates become
essentially the same; for example, if two random variables have only a very small
difference in their probability, they are said to have converged. In other words, whenever
the values of the given variables stop changing appreciably, we call it convergence.
Convergence in the Expectation-Maximization (EM) algorithm refers to the point at
which the algorithm has reached a stable set of parameter estimates, meaning that
further iterations will not significantly change these estimates.
Steps in EM Algorithm
The EM algorithm is completed mainly in 4 steps, which include Initialization Step,
Expectation Step, Maximization Step, and Convergence Step. These steps are explained as
follows:
1st Step (Initialization): The very first step is to initialize the parameter values.
Further, the system is provided with incomplete observed data, with the assumption that
the data is obtained from a specific model.
2nd Step (Expectation or E-step): This step is used to estimate or guess the values of the
missing or incomplete data using the observed data. The E-step primarily updates the
latent variables.
3rd Step (Maximization or M-step): In this step, we use the complete data obtained from
the 2nd step to update the parameter values. The M-step primarily updates the hypothesis.
4th Step (Convergence): The last step is to check whether the values of the latent
variables are converging or not. If yes, stop the process; otherwise, repeat from step 2
until convergence occurs.
Applications of EM algorithm
The primary aim of the EM algorithm is to estimate the missing data in the latent variables
through observed data in datasets. The EM algorithm or Latent Variable Model has a
broad range of real-life applications in machine learning. These are as follows:
The EM algorithm is applicable in data clustering in machine learning.
It is often used in computer vision and NLP (Natural language processing).
It is used to estimate the values of the parameters in mixture models such as the Gaussian
Mixture Model, and in quantitative genetics.
It is also used in psychometrics for estimating item parameters and latent abilities of item
response theory models.
It is also applicable in the medical and healthcare industry, such as in image reconstruction
and structural engineering.
It is used to determine the Gaussian density of a function.
Explore how MLE is used in classification tasks, such as object recognition or
sentiment analysis, by deriving and implementing the MLE-based classifier
Maximum Likelihood Estimation (MLE) is a statistical method used to find the parameters
of a model that maximize the probability of observing the given data. In the context of
classification tasks, MLE is used to determine the optimal model parameters that best fit
the training data, allowing for accurate predictions on new, unseen data.
The MLE Process in Classification
Model Selection: Choose a suitable probability distribution for the data. This choice often
depends on the nature of the classification problem (e.g., Bernoulli distribution for binary
classification, multinomial distribution for multi-class classification).
Parameter Estimation: Define the model parameters. These parameters represent the
characteristics of the distribution that influence the classification decision.
Likelihood Function: Construct the likelihood function, which expresses the probability
of observing the given training data under the chosen model and parameters.
Maximization: Find the values of the model parameters that maximize the likelihood
function. This is typically done using optimization techniques like gradient ascent or
Newton-Raphson.
Classification: Once the optimal parameters are determined, the model can be used to
classify new data points by calculating their likelihood under the learned model and
assigning them to the class with the highest probability.
MLE in Object Recognition (Deep Learning)
In more complex tasks like object recognition, MLE is used in models like convolutional
neural networks (CNNs). The process is conceptually similar:
A CNN outputs a probability distribution over different classes (e.g., cat, dog, etc.) for a
given input image.
The probability distribution is based on the model's parameters (weights and biases).
The likelihood is the probability that the CNN assigns to the correct label for each image.
The log-likelihood is maximized during training using backpropagation and stochastic
gradient descent to adjust the weights in the network.
Implementing the MLE-based classifier.
In classification tasks, MLE is used to estimate model parameters that maximize the
probability of correctly classifying the data.
We construct a likelihood function based on the probability of the class labels given the
data and model parameters.
MLE maximizes the log-likelihood of observing the data by adjusting the parameters
through optimization methods like gradient descent.
The Basics of Bayesian Statistics
Bayesian statistics is a branch of statistics that deals with uncertainty by incorporating
prior knowledge or beliefs into the analysis of data.
It is named after Thomas Bayes, an 18th-century mathematician and theologian. The
fundamental concept in Bayesian statistics is the Bayes' theorem, which describes how to
update our beliefs in light of new evidence. Here are the basics of Bayesian statistics:
Bayesian statistics is a powerful approach for making statistical inferences, especially
when dealing with small or complex data sets and when incorporating prior information is
valuable or necessary.
However, the choice of prior can be subjective and influence results, so it's essential to
carefully consider and justify your prior assumptions in Bayesian analysis.
Bayesian statistics is a statistical theory and approach to data analysis that uses Bayes'
theorem to describe the probability of an event based on previous knowledge and
observed and unobserved parameters.
Bayes' Theorem
Bayes' Theorem, named after 18th-century British mathematician Thomas Bayes, is a
mathematical formula for determining conditional probability. Conditional probability is
the likelihood of an outcome occurring based on a previous outcome in similar
circumstances.
The Bayes theorem is frequently referred to as the Bayes rule or Bayes Law. One of the
most well-known theories in machine learning, the Bayes theorem helps determine the
likelihood that one event will occur with unclear information while another has already
happened.
Bayes' theorem describes the probability of occurrence of an event related to any
condition; it is a statement of conditional probability. Bayes' theorem is also known as
the formula for the probability of "causes".
Bayes' Theorem:
At its core, Bayesian statistics is built upon Bayes' theorem, which relates conditional
probabilities. The theorem can be expressed as follows:
P(A | B) = [P(B | A) * P(A)] / P(B)
P(A | B) represents the probability of event A occurring given that event B has
occurred.
P(B | A) is the probability of event B occurring given that event A has occurred.
P(A) and P(B) are the marginal probabilities of events A and B, respectively.
In the context of Bayesian statistics:
P(A | B) is the posterior probability, which represents our updated belief in A after
observing B.
P(B | A) is the likelihood, describing the probability of observing B if A is true.
P(A) is the prior probability, representing our initial belief in A before considering B.
P(B) is the marginal likelihood or evidence, the probability of observing B without
considering A.
Prior Probability:
In Bayesian statistics, prior knowledge or beliefs about the probability of an event play a
crucial role. This is known as the prior probability (P(A)), which represents your initial
beliefs about the likelihood of an event occurring before observing any data. The choice of
the prior can significantly influence the final results.
Likelihood:
The likelihood (P(B | A)) represents the probability of observing the data (B) given a
specific hypothesis or model (A). It quantifies how well the hypothesis or model explains
the observed data. It's essential to choose an appropriate likelihood function that accurately
reflects the relationship between your data and the parameters you want to estimate.
Posterior Probability:
The posterior probability (P(A | B)) is the updated belief in the hypothesis or model A after
observing the data B. It is calculated using Bayes' theorem and combines the prior
knowledge and the likelihood of the data. The posterior probability provides a more
informed and updated estimate of the event's probability based on the observed data.
Bayesian Inference:
Bayesian inference involves using Bayes' theorem to estimate or update the parameters of
a statistical model based on observed data. This process typically includes:
Choosing a prior distribution that reflects your initial beliefs.
Defining a likelihood function that describes the data-generation process.
Applying Bayes' theorem to calculate the posterior distribution, which represents updated
parameter estimates based on the data.
Supervised Learning and Unsupervised Learning Algorithms
Learning Process: In supervised learning, the algorithm learns by comparing predictions to
actual labels and adjusting model parameters to minimize errors. In unsupervised learning,
the algorithm explores the data's inherent structure, often without predefined categories or
labels.
Building a Machine Learning Algorithm
Nearly all machine learning algorithms can be described as particular instances of a fairly
simple recipe: combine a specification of a dataset, a cost function, a model, and an
optimization procedure.
In some cases, the cost function may be a function that we cannot actually evaluate, for
computational reasons.
In these cases, we can still approximately minimize it using iterative numerical
optimization so long as we have some way of approximating its gradients.
Most machine learning algorithms make use of this recipe, though it may not immediately
be obvious.
If a machine learning algorithm seems especially unique or hand-designed, it can usually
be understood as using a special-case optimizer.
Some models such as decision trees or k-means require special-case optimizers because
their cost functions have flat regions that make them inappropriate for minimization by
gradient-based optimizers.
Recognizing that most machine learning algorithms can be described using this recipe
helps us to see the different algorithms as part of a taxonomy of methods for doing related
tasks that work for similar reasons, rather than as a long list of algorithms that each have
separate justifications.
Challenges and Motivation to Deep learning.
During the development phase, our focus is on selecting a learning algorithm and training
it on some data; the two things that can go wrong are a bad algorithm and bad data, or
perhaps both.
Many machine learning problems become exceedingly difficult when the number of
dimensions in the data is high. This phenomenon is known as the curse of dimensionality.
Of particular concern is that the number of possible distinct configurations of a set of
variables increases exponentially as the number of variables increases.
1. Not enough training data:
Consider a child: to teach them what an apple is, all it takes is to point to an apple and
say "apple" a few times.
The child can then recognize all sorts of apples.
Machine learning is not at that level yet; it takes a lot of data for most algorithms to
function properly.
Even for a simple task, thousands of examples may be needed, and for advanced tasks like
image or speech recognition, lakhs (hundreds of thousands) or even millions of examples
may be required.
2. Poor Quality of data:
Obviously, if your training data has lots of errors, outliers, and noise, it will make it
impossible for your machine learning model to detect a proper underlying pattern. Hence,
it will not perform well.
So put in every ounce of effort in cleaning up your training data. No matter how good you
are in selecting and hyper tuning the model, this part plays a major role in helping us make
an accurate machine learning model.
“Most Data Scientists spend a significant part of their time in cleaning data”.
There are a couple of cases in which you would want to clean up the data:
If you see that some instances are clear outliers, discard them or fix them manually.
If some instances are missing a feature (e.g., 2% of users did not specify their age), you
can either ignore these instances, fill in the missing values with the median age, or train
one model with the feature and one without it and compare the results.
3. Irrelevant Features:
“Garbage in, garbage out (GIGO).”
Features are relevant if they are either strongly or weakly relevant, and are irrelevant
otherwise.
Irrelevant features can never contribute to prediction accuracy, by definition.
The credit for a successful machine learning project goes to coming up with a good set of
features on which it has been trained (often referred to as feature engineering ), which
includes feature selection, extraction, and creating new features which are other interesting
topics to be covered in upcoming blogs.
4. Non-representative training data:
To make sure that our model generalizes well, we have to make sure that our training data
should be representative of the new cases that we want to generalize to.
If we train our model using a non-representative training set, its predictions will not be
accurate; it will be biased towards or against one class or group.
For example, suppose you are trying to build a model that recognizes the genre of music.
One way to build your training set is to search on YouTube and use the resulting data.
Here we assume that YouTube's search engine provides representative data, but in reality
the search will be biased towards popular artists, and maybe even towards the artists that
are popular in your location (if you live in India, you will be getting the music of SBP, etc.).
So use representative data during training, so your model won’t be biased among one or
two classes when it works on testing data.
5. Overfitting and Underfitting:
Overfitting is an undesirable machine learning behavior that occurs when the machine
learning model gives accurate predictions for training data but not for new data. When data
scientists use machine learning models for making predictions, they first train the model on
a known data set.
Overfitting happens when our model is too complex.
Things which we can do to overcome this problem:
Simplify the model by selecting one with fewer parameters.
By reducing the number of attributes in training data.
Constraining the model.
Gather more training data.
Reduce the noise.
Underfitting
When a model has not learned the patterns in the training data well and is unable to
generalize well on the new data, it is known as underfitting.
An underfit model has poor performance on the training data and will result in unreliable
predictions.
Things which we can do to overcome this problem:
Select a more advanced model, one with more parameters.
Train on better and relevant features.
Reduce the constraints.
All of us do not have equal talent. But all of us have an equal opportunity to develop our talents.
A. P. J. Abdul Kalam