DL Unit 1 Notes
Prepared by
Dr.M.Narayanan
Professor
Department of CSE
Malla Reddy University, Hyderabad
DEEP LEARNING AND APPLICATIONS
MR20-1CS0158
UNIT I
Machine Learning Basics: Learning Algorithms, Capacity, Overfitting, and Underfitting,
Hyperparameters and Validation Sets, Estimators, Bias and Variance, Maximum
Likelihood Estimation, Bayesian Statistics, Supervised and Unsupervised Learning
Algorithms, Stochastic Gradient Descent, Building a ML Algorithm, Challenges and
Motivation to Deep Learning.
Text Book
1. Goodfellow, I., Bengio, Y., and Courville, A., Deep Learning, MIT Press, 2016
Machine Learning
Machine learning is a branch of artificial intelligence that empowers computers to learn
from data and improve their performance over time without explicit programming.
It involves creating algorithms that can recognize patterns in data, make predictions, and
solve complex tasks, leading to applications in areas such as image recognition, language
processing, and autonomous systems.
Application:
Develop a machine learning model that can analyze medical imaging data to accurately
detect and diagnose specific diseases, such as identifying early signs of diabetic
retinopathy in retinal scans.
Construct a sentiment analysis model that can automatically determine the sentiment
(positive, negative, neutral) expressed in a given text, which could be used for
analyzing social media posts or customer reviews.
Create a model that predicts stock market trends and prices by analyzing historical
market data and incorporating relevant economic indicators, assisting investors in
making informed decisions.
Deep Learning
Deep learning is a subset of machine learning focused on neural networks with multiple
layers, allowing automated extraction of complex data features.
Deep learning is a method in artificial intelligence (AI) that teaches computers to process
data in a way that is inspired by the human brain.
Deep learning models can recognize complex patterns in pictures, text, sounds, and
other data to produce accurate insights and predictions.
It excels in tasks like image recognition, natural language processing, and autonomous
systems, by learning hierarchical representations from data and eliminating the need for
extensive manual feature engineering.
Application:
Design a deep learning algorithm that enables a drone to navigate through complex
environments and avoid obstacles, using onboard cameras for visual input.
Create a deep learning model to segment and identify specific structures or anomalies in
medical images, aiding in tasks like tumor detection in MRI scans.
Machine Learning vs. Deep Learning
Definition: ML is a subset of AI that learns patterns from data; DL is a subset of ML that uses deep neural networks.
Data Dependency: ML works well with smaller, structured datasets; DL requires large amounts of data and works with unstructured data.
Feature Engineering: ML requires manual feature extraction and selection; DL automatically extracts features from raw data.
Performance with Scale: ML may not improve significantly with more data; DL performance improves with more data and deeper models.
Interpretability: ML is generally more interpretable and explainable; DL is often considered a "black box" with less interpretability.
Learning Algorithms
A machine learning algorithm is an algorithm that is able to learn from data.
What is meant by learning?
Mitchell (1997) provides the definition:
“A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks in T, as measured by
P, improves with experience E.”
One can imagine a very wide variety of experiences E, tasks T, and performance measures
P.
Learning refers to the process by which a computer program (or system) becomes better at
performing a certain task T as it gains more experience E.
The improvement in performance is measured by a performance measure P. This process
involves the program adjusting its internal parameters based on the data it encounters,
allowing it to make better predictions or decisions over time.
The goal of learning is to enable the program to generalize from the provided data
and perform well on new, unseen data.
The Task, T
Machine learning allows us to tackle tasks that are too difficult to solve with fixed
programs written and designed by human beings.
From a scientific and philosophical point of view, machine learning is interesting
because developing it deepens our understanding of the principles that underlie
intelligence.
In this relatively formal definition of the word “task,”
“The process of learning itself is not the task. Learning is our means of attaining
the ability to perform the task”.
For example, if we want a robot to be able to walk, then walking is the task.
We could program the robot to learn to walk, or we could attempt to directly write a
program that specifies how to walk manually.
Machine learning tasks are usually described in terms of how the machine learning
system should process an example.
An example is a collection of features that have been quantitatively measured from
some object or event that we want the machine learning system to process.
We typically represent an example as a vector x ∈ R^n, where each entry x_i of the vector
is another feature.
For example, the features of an image are usually the values of the pixels in the image.
Many kinds of tasks can be solved with machine learning. Some of the most
common machine learning tasks include the following:
1.Classification:
Classification is a supervised machine learning method where the model tries to
predict the correct label of a given input data.
In classification, the model is fully trained using the training data, and then it is
evaluated on test data before being used to perform prediction on new unseen data.
In this type of task, the computer program is asked to specify which of k categories some
input belongs to.
To solve this task, the learning algorithm is usually asked to produce a function
f : R^n → {1, . . . , k}.
When y = f (x), the model assigns an input described by vector x to a category identified
by numeric code y.
There are other variants of the classification task, for example, where f outputs a
probability distribution over classes.
An example of a classification task is object recognition, where the input is an image
(usually described as a set of pixel brightness values), and the output is a numeric code
identifying the object in the image.
2. Classification with missing inputs:
Classification becomes more challenging if the computer program is not guaranteed that every
measurement in its input vector will always be provided.
In order to solve the classification task, the learning algorithm only has to define a single
function mapping from a vector input to a categorical output.
When some of the inputs may be missing, rather than providing a single classification function,
the learning algorithm must learn a set of functions. Each function corresponds to classifying x
with a different subset of its inputs missing.
This kind of situation arises frequently in medical diagnosis, because many kinds of medical
tests are expensive or invasive.
One way to efficiently define such a large set of functions is to learn a probability distribution
over all of the relevant variables, then solve the classification task by marginalizing out the
missing variables.
3. Regression:
Regression is a supervised machine learning technique which is used to predict
continuous values. The ultimate goal of the regression algorithm is to plot a best-fit
line or a curve between the data. The three main metrics that are used for evaluating
the trained regression model are variance, bias and error.
In this type of task, the computer program is asked to predict a numerical value given some
input.
To solve this task, the learning algorithm is asked to output a function f : R^n → R.
This type of task is similar to classification, except that the format of output is different.
An example of a regression task is the prediction of the expected claim amount that an
insured person will make (used to set insurance premiums), or the prediction of future
prices of securities.
These kinds of predictions are also used for algorithmic trading.
4. Transcription:
In this type of task, the machine learning system is asked to observe a relatively
unstructured representation of some kind of data and transcribe it into discrete, textual
form.
For example, in optical character recognition, the computer program is shown a
photograph containing an image of text and is asked to return this text in the form of a
sequence of characters (e.g., in ASCII or Unicode format).
5. Machine translation:
In a machine translation task, the input already consists of a sequence of symbols in some
language, and the computer program must convert this into a sequence of symbols in
another language.
This is commonly applied to natural languages, such as to translate from English to
French.
Deep learning has recently begun to have an important impact on this kind of task
(Sutskever et al., 2014; Bahdanau et al., 2015).
6. Structured output:
Structured output tasks involve any task where the output is a vector (or other data
structure containing multiple values) with important relationships between the different
elements.
This is a broad category, and incorporates the transcription and translation tasks
described above, but also many other tasks.
One example is parsing—mapping a natural language sentence into a tree that
describes its grammatical structure and tagging nodes of the trees as being verbs,
nouns, or adverbs, and so on.
7. Anomaly detection:
In this type of task, the computer program sifts through a set of events or objects, and
flags some of them as being unusual or uncharacteristic.
An example of an anomaly detection task is credit card fraud detection.
By modeling your purchasing habits, a credit card company can detect misuse of your
cards.
If a thief steals your credit card or credit card information, the thief’s purchases will
often come from a different probability distribution over purchase types than your own.
The credit card company can prevent fraud by placing a hold on an account as soon as
that card has been used for an uncharacteristic purchase.
8. Denoising:
In this type of task, the machine learning algorithm is given as input a corrupted
example x̃ ∈ R^n obtained by an unknown corruption process from a clean example x ∈ R^n.
The learner must predict the clean example x from its corrupted version x̃, or more
generally predict the conditional probability distribution p(x | x̃).
9. Density estimation or probability mass function estimation:
In the density estimation problem, the machine learning algorithm is asked to learn a
function p_model : R^n → R, where p_model(x) can be interpreted as a probability density
function (if x is continuous) or a probability mass function (if x is discrete) on the
space that the examples were drawn from.
To do such a task well (we will specify exactly what that means when we discuss
performance measures P ), the algorithm needs to learn the structure of the data it has
seen. It must know where examples cluster tightly and where they are unlikely to
occur.
Most of the tasks described above require that the learning algorithm has at least
implicitly captured the structure of the probability distribution.
Of course, many other tasks and types of tasks are possible. The types of tasks we list
here are intended only to provide examples of what machine learning can do, not to
define a rigid taxonomy (classification) of tasks.
The Performance Measure, P
In order to evaluate the abilities of a machine learning algorithm, we must design a
quantitative measure of its performance.
Usually this performance measure P is specific to the task T being carried out by the
system.
For tasks such as classification, classification with missing inputs, and
transcription, we often measure the accuracy of the model.
Accuracy is just the proportion of examples for which the model produces the
correct output. We can also obtain equivalent information by measuring the error rate,
the proportion of examples for which the model produces an incorrect output.
We often refer to the error rate as the expected 0-1 loss. The 0-1 loss on a particular
example is 0 if it is correctly classified and 1 if it is not.
For tasks such as density estimation, it does not make sense to measure accuracy, error
rate, or any other kind of 0-1 loss.
Instead, we must use a different performance metric that gives the model a continuous-
valued score for each example. The most common approach is to report the average
log-probability the model assigns to some examples.
The choice of performance measure may seem straightforward and objective, but it is
often difficult to choose a performance measure that corresponds well to the desired
behavior of the system.
In some cases, this is because it is difficult to decide what should be measured.
For example, when performing a transcription task, should we measure the accuracy of
the system at transcribing entire sequences, or should we use a more fine-grained
performance measure that gives partial credit for getting some elements of the sequence
correct?
When performing a regression task, should we correct the system more if it frequently
makes medium-sized mistakes or if it rarely makes very large mistakes?
These kinds of design choices depend on the application.
The Experience, E
Machine learning algorithms can be broadly categorized as unsupervised or supervised
by what kind of experience they are allowed to have during the learning process.
Most of the learning algorithms can be understood as being allowed to experience an
entire dataset.
One of the oldest datasets studied by statisticians and machine learning researchers is
the Iris dataset (Fisher, 1936).
It is a collection of measurements of different parts of 150 iris plants. Each individual
plant corresponds to one example.
The features within each example are the measurements of each of the parts of the
plant: the sepal length, sepal width, petal length and petal width.
The dataset also records which species each plant belonged to. Three different species
are represented in the dataset.
Unsupervised learning algorithms experience a dataset containing many features, then
learn useful properties of the structure of this dataset.
In the context of deep learning, we usually want to learn the entire probability
distribution that generated a dataset, whether explicitly as in density estimation or
implicitly for tasks like synthesis or denoising.
Some other unsupervised learning algorithms perform other roles, like clustering,
which consists of dividing the dataset into clusters of similar examples.
Supervised learning algorithms experience a dataset containing features, but each
example is also associated with a label or target.
For example, the Iris dataset is annotated with the species of each iris plant.
A supervised learning algorithm can study the Iris dataset and learn to classify iris
plants into three different species based on their measurements.
Some machine learning algorithms do not just experience a fixed dataset. For
example, reinforcement learning algorithms interact with an environment, so
there is a feedback loop between the learning system and its experiences.
Most machine learning algorithms simply experience a dataset. A dataset can be
described in many ways. In all cases, a dataset is a collection of examples, which are in
turn collections of features.
There is no formal definition of supervised and unsupervised learning, there is no
rigid taxonomy of datasets or experiences. The structures described here cover
most cases, but it is always possible to design new ones for new applications.
Capacity, Overfitting and Underfitting
The central challenge in machine learning is that we must perform well on new, previously
unseen inputs—not just those on which our model was trained.
The ability to perform well on previously unobserved inputs is called generalization.
Typically, when training a machine learning model, we have access to a training set, we
can compute some error measure on the training set called the training error, and we
reduce this training error.
So far, what we have described is simply an optimization problem.
What separates machine learning from optimization is that we want the generalization
error, also called the test error, to be low as well.
The generalization error is defined as the expected value of the error on a new
input.
Here the expectation is taken across different possible inputs, drawn from the
distribution of inputs we expect the system to encounter in practice.
We typically estimate the generalization error of a machine learning model by
measuring its performance on a test set of examples that were collected separately from
the training set.
If the training and the test set are collected arbitrarily, there is indeed little we can do.
If we are allowed to make some assumptions about how the training and test set are collected,
then we can make some progress.
The train and test data are generated by a probability distribution over datasets called
the data generating process.
We typically make a set of assumptions known collectively as the i.i.d. assumptions
(independent and identically distributed random variables): the examples in each dataset
are independent from each other, and the training set and test set are identically
distributed, drawn from the same probability distribution as each other.
This assumption allows us to describe the data generating process with a probability
distribution over a single example.
The same distribution is then used to generate every train example and every test example.
We call that shared underlying distribution the data generating distribution, denoted
pdata.
This probabilistic framework and the i.i.d. assumptions (Independent and identically
distributed random variables) allow us to mathematically study the relationship between
training error and test error.
One immediate connection we can observe between the training and test error is that
the expected training error of a randomly selected model is equal to the expected test
error of that model.
Suppose we have a probability distribution p( x, y) and we sample from it repeatedly to
generate the train set and the test set.
For some fixed value w, the expected training set error is exactly the same as the expected
test set error, because both expectations are formed using the same dataset sampling
process.
The only difference between the two conditions is the name we assign to the dataset we
sample.
Of course, when we use a machine learning algorithm, we do not fix the parameters ahead of
time and then sample both datasets. We sample the training set, then use it to choose the
parameters to reduce training set error, and only then sample the test set.
Under this process, the expected test error is greater than or equal to the expected value of
training error.
The factors determining how well a machine learning algorithm will perform are its ability
to:
1. Make the training error small.
2. Make the gap between training and test error small.
These two factors correspond to the two central challenges in machine learning:
underfitting and overfitting.
Underfitting occurs when the model is not able to obtain a sufficiently low error value on
the training set.
Overfitting occurs when the gap between the training error and test error is too large.
We can control whether a model is more likely to overfit or underfit by altering its
capacity. Informally, a model’s capacity is its ability to fit a wide variety of functions.
Models with low capacity may struggle to fit the training set.
Models with high capacity can overfit by memorizing properties of the training set that do
not serve them well on the test set.
One way to control the capacity of a learning algorithm is by choosing its hypothesis
space, the set of functions that the learning algorithm is allowed to select as being the
solution.
For example, the linear regression algorithm has the set of all linear functions of its input
as its hypothesis space.
We can generalize linear regression to include polynomials, rather than just linear
functions, in its hypothesis space.
Doing so increases the model’s capacity.
A polynomial of degree one gives us the linear regression model with which we are already
familiar, with prediction
yˆ = b + wx.
By introducing x^2 as another feature provided to the linear regression model, we can learn a
model that is quadratic as a function of x:
yˆ = b + w1 x + w2 x^2
Though this model implements a quadratic function of its input, the output is still a linear
function of the parameters, so we can still use the normal equations to train the model in closed
form. We can continue to add more powers of x as additional features, for example to obtain a
polynomial of degree 9:
yˆ = b + Σ_{i=1}^{9} wi x^i
Machine learning algorithms will generally perform best when their capacity is
appropriate in regard to the true complexity of the task they need to perform and the
amount of training data they are provided with.
Models with insufficient capacity are unable to solve complex tasks. Models with high
capacity can solve complex tasks, but when their capacity is higher than needed to solve the
present task they may overfit.
Fig. 5.2 shows this principle in action. We compare a linear, quadratic and degree-9
predictor attempting to fit a problem where the true underlying function is quadratic.
The linear function is unable to capture the curvature in the true underlying problem,
so it underfits.
The degree-9 predictor is capable of representing the correct function, but it is also capable
of representing infinitely many other functions that pass exactly through the training
points, because we have more parameters than training examples.
We have little chance of choosing a solution that generalizes well when so many wildly
different solutions exist.
In this example, the quadratic model is perfectly matched to the true structure of the task
so it generalizes well to new data.
Figure 5.2: We fit three models to this example training set. The training data was
generated synthetically, by randomly sampling x values and choosing y deterministically
by evaluating a quadratic function.
(Left) A linear function fit to the data suffers from underfitting—it cannot capture
the curvature that is present in the data.
(Center) A quadratic function fit to the data generalizes well to unseen points. It does
not suffer from a significant amount of overfitting or underfitting.
(Right) A polynomial of degree 9 fit to the data suffers from overfitting. Here we
used the Moore-Penrose pseudoinverse to solve the underdetermined normal
equations.
The solution passes through all of the training points exactly, but we have not been
lucky enough for it to extract the correct structure.
It now has a deep valley in between two training points that does not appear in the true
underlying function.
It also increases sharply on the left side of the data, while the true function decreases in
this area.
So far we have only described changing a model’s capacity by changing the number of
input features it has (and simultaneously adding new parameters associated with those
features).
There are in fact many ways of changing a model’s capacity. Capacity is not
determined only by the choice of model.
The model specifies which family of functions the learning algorithm can choose from
when varying the parameters in order to reduce a training objective.
This is called the representational capacity of the model. In many cases, finding the best
function within this family is a very difficult optimization problem.
In practice, the learning algorithm does not actually find the best function, but merely one
that significantly reduces the training error.
These additional limitations, such as the imperfection of the optimization algorithm, mean
that the learning algorithm’s effective capacity may be less than the representational
capacity of the model family
Figure 5.3: Typical relationship between capacity and error. Training and test error
behave differently. At the left end of the graph, training error and generalization
error are both high. This is the underfitting regime.
As we increase capacity, training error decreases, but the gap between training and
generalization error increases. Eventually, the size of this gap outweighs the decrease
in training error, and we enter the overfitting regime, where capacity is too large, above the
optimal capacity.
Regularization
The no free lunch theorem implies that we must design our machine learning algorithms
to perform well on a specific task. We do so by building a set of preferences into the
learning algorithm. When these preferences are aligned with the learning problems we ask
the algorithm to solve, it performs better.
Regularization is a technique used in machine learning and deep learning to reduce
overfitting by discouraging the model from becoming too complex.
It does this by adding a penalty to the loss function, which limits how much the model can
rely on large weights or complicated structures.
In simple terms, regularization helps the model focus on the most important patterns in the
training data and ignore noise or random fluctuations, leading to better performance on
new, unseen data.
There are different types of regularization:
L1 regularization (Lasso): Makes some weights exactly zero, encouraging sparsity.
L2 regularization (Ridge or weight decay): Keeps weights small, encouraging smoother
models.
The "No Free Lunch" (NFL) theorem in optimization and machine learning states
that no single optimization algorithm or machine learning model can universally
outperform all others across all possible problems. Essentially, if an algorithm
performs well on a specific set of problems, it must perform correspondingly worse
on others when averaged over all possible problems. This implies that there is no
"magic bullet" algorithm that works best in all situations.
Regularization is any modification we make to a learning algorithm that is intended
to reduce its generalization error but not its training error.
Regularization is one of the central concerns of the field of machine learning, rivaled
in its importance only by optimization.
The no free lunch theorem has made it clear that there is no best machine learning
algorithm, and, in particular, no best form of regularization. Instead we must choose a form
of regularization that is well-suited to the particular task we want to solve.
Hyperparameters and Validation Sets
Most machine learning algorithms have several settings that we can use to control the
behavior of the learning algorithm. These settings are called hyperparameters.
The values of hyperparameters are not adapted by the learning algorithm itself (though we can
design a nested learning procedure where one learning algorithm learns the best
hyperparameters for another learning algorithm).
Hyperparameters are parameters whose values control the learning process and
determine the values of model parameters that a learning algorithm ends up learning.
The prefix 'hyper' suggests that they are 'top-level' parameters that control the learning
process and the model parameters that result from it.
What is the difference between parameters and hyperparameters?
Parameters are the internal values (like weights and biases in a neural network) that are
automatically learned from the training data and directly determine the model’s output.
Hyperparameters are the external settings (like learning rate, batch size, number of
layers) manually defined before training that guide the learning process but are not
updated during it.
Sometimes a setting is chosen to be a hyperparameter that the learning algorithm does not
learn because it is too difficult to optimize.
More frequently, we do not learn the hyperparameter because it is not appropriate to learn
that hyperparameter on the training set.
This applies to all hyperparameters that control model capacity. If learned on the
training set, such hyperparameters would always choose the maximum possible model
capacity, resulting in overfitting (refer to Fig. 5.3).
For example, we can always fit the training set better with a higher degree polynomial and
a weight decay setting of λ = 0 than we could with a lower degree polynomial and a
positive weight decay setting.
To solve this problem, we need a validation set of examples that the training algorithm
does not observe.
Earlier we discussed how a held-out test set, composed of examples coming from the same
distribution as the training set, can be used to estimate the generalization error of a learner,
after the learning process has completed.
It is important that the test examples are not used in any way to make choices about the
model, including its hyperparameters.
For this reason, no example from the test set can be used in the validation set. Therefore,
we always construct the validation set from the training data.
Specifically, we split the training data into two disjoint subsets.
One of these subsets is used to learn the parameters. The other subset is our validation
set, used to estimate the generalization error during or after training, allowing for the
hyperparameters to be updated accordingly.
The subset of data used to learn the parameters is still typically called the training set,
even though this may be confused with the larger pool of data used for the entire
training process.
The subset of data used to guide the selection of hyperparameters is called the
validation set.
Typically, one uses about 80% of the training data for training and 20% for validation.
Since the validation set is used to “train” the hyperparameters, the validation set error
will underestimate the generalization error, though typically by a smaller amount than
the training error.
After all hyperparameter optimization is complete, the generalization error may be
estimated using the test set.
Cross-Validation in ML
Cross validation is a technique used in machine learning to evaluate the performance
of a model on unseen data.
It involves dividing the available data into multiple folds or subsets, using one of these
folds as a validation set, and training the model on the remaining folds.
This process is repeated multiple times, each time using a different fold as the
validation set.
Finally, the results from each validation step are averaged to produce a more robust
estimate of the model’s performance.
Cross validation is an important step in the machine learning process and helps to
ensure that the model selected for deployment is robust and generalizes well to new
data.
Cross-validation is a technique for evaluating ML models by training several ML
models on subsets of the available input data and evaluating them on the
complementary subset of the data. Use cross-validation to detect overfitting, i.e., failing
to generalize a pattern.
Application of Maximum Likelihood Estimation (MLE):
A medical researcher is studying the relationship between a patient's age and whether
they have a particular disease (yes/no). Using a dataset of patients, the researcher uses
MLE to estimate the coefficients in a logistic regression model that predicts the
probability of disease based on age.
What is an EM algorithm?
The Expectation-Maximization (EM) algorithm is defined as the combination of
various unsupervised machine learning algorithms, which is used to determine the
local maximum likelihood estimates (MLE) or maximum a posteriori estimates
(MAP) for unobservable variables in statistical models.
Further, it is a technique to find maximum likelihood estimation when the latent
variables are present. It is also referred to as the latent variable model.
A latent variable model consists of both observable and unobservable variables
where observable can be predicted while unobserved are inferred from the
observed variable. These unobservable variables are known as latent variables.
Note:
“A priori” and “a posteriori” refer primarily to how, or on what basis, a proposition
might be known.
In general terms, a proposition is knowable a priori if it is knowable independently of
experience, while a proposition knowable a posteriori is knowable on the basis of
experience.
EM Algorithm
The EM algorithm is the combination of various unsupervised ML algorithms, such as
the k-means clustering algorithm. Being an iterative approach, it consists of two
modes. In the first mode, we estimate the missing or latent variables. Hence it is
referred to as the Expectation/estimation step (E-step). Further, the other mode is
used to optimize the parameters of the models so that it can explain the data more
clearly. The second mode is known as the maximization-step or M-step.
Expectation step (E - step): It involves the estimation (guess) of all missing values in the
dataset so that after completing this step, there should not be any missing value.
Maximization step (M - step): This step involves the use of estimated data in the E-step
and updating the parameters.
Repeat E-step and M-step until the convergence of the values occurs.
The primary goal of the EM algorithm is to use the available observed data of the
dataset to estimate the missing data of the latent variables and then use that data to
update the values of the parameters in the M-step.
What is Convergence in the EM algorithm?
Intuitively, convergence describes the situation where successive estimates become
essentially the same; for example, if two random variables have only a very small
difference in their probability, they are said to have converged. In other words, whenever
the values of the given variables stop changing appreciably, we call it convergence.
Convergence in the Expectation-Maximization (EM) algorithm refers to the point at
which the algorithm has reached a stable set of parameter estimates, meaning that
further iterations will not significantly change these estimates.
Steps in EM Algorithm
The EM algorithm is completed mainly in 4 steps, which include Initialization Step,
Expectation Step, Maximization Step, and Convergence Step. These steps are explained as
follows:
1st Step (Initialization): The very first step is to initialize the parameter values.
Further, the system is provided with incomplete observed data, with the assumption that
the data is obtained from a specific model.
2nd Step (Expectation or E-step): This step is used to estimate or guess the values of the
missing or incomplete data using the observed data. The E-step primarily updates the
latent variables.
3rd Step (Maximization or M-step): In this step, we use the complete data obtained from
the 2nd step to update the parameter values. The M-step primarily updates the hypothesis.
4th Step (Convergence): The last step is to check whether the values of the latent
variables are converging or not. If yes, stop the process; otherwise, repeat from step 2
until convergence occurs.
Applications of EM algorithm
The primary aim of the EM algorithm is to estimate the missing data in the latent variables
through observed data in datasets. The EM algorithm or Latent Variable Model has a
broad range of real-life applications in machine learning. These are as follows:
The EM algorithm is applicable in data clustering in machine learning.
It is often used in computer vision and NLP (Natural language processing).
It is used to estimate the values of the parameters in mixture models such as the Gaussian
Mixture Model, and in quantitative genetics.
It is also used in psychometrics for estimating item parameters and latent abilities of item
response theory models.
It is also applicable in the medical and healthcare industry, such as in image reconstruction
and structural engineering.
It is used to determine the Gaussian density of a function.
Explore how MLE is used in classification tasks, such as object recognition or
sentiment analysis, by deriving and implementing the MLE-based classifier
Maximum Likelihood Estimation (MLE) is a statistical method used to find the parameters
of a model that maximize the probability of observing the given data. In the context of
classification tasks, MLE is used to determine the optimal model parameters that best fit
the training data, allowing for accurate predictions on new, unseen data.
The MLE Process in Classification
Model Selection: Choose a suitable probability distribution for the data. This choice often
depends on the nature of the classification problem (e.g., Bernoulli distribution for binary
classification, multinomial distribution for multi-class classification).
Parameter Estimation: Define the model parameters. These parameters represent the
characteristics of the distribution that influence the classification decision.
Likelihood Function: Construct the likelihood function, which expresses the probability
of observing the given training data under the chosen model and parameters.
Maximization: Find the values of the model parameters that maximize the likelihood
function. This is typically done using optimization techniques like gradient ascent or
Newton-Raphson.
Classification: Once the optimal parameters are determined, the model can be used to
classify new data points by calculating their likelihood under the learned model and
assigning them to the class with the highest probability.
MLE in Object Recognition (Deep Learning)
In more complex tasks like object recognition, MLE is used in models like convolutional
neural networks (CNNs). The process is conceptually similar:
A CNN outputs a probability distribution over different classes (e.g., cat, dog, etc.) for a
given input image.
The probability distribution is based on the model's parameters (weights and biases).
The likelihood is the probability that the CNN assigns to the correct label for each image.
The log-likelihood is maximized during training using backpropagation and stochastic
gradient descent to adjust the weights in the network.
Implementing the MLE-based classifier.
In classification tasks, MLE is used to estimate model parameters that maximize the
probability of correctly classifying the data.
We construct a likelihood function based on the probability of the class labels given the
data and model parameters.
MLE maximizes the log-likelihood of observing the data by adjusting the parameters
through optimization methods like gradient descent.
The Basics of Bayesian Statistics
Bayesian statistics is a branch of statistics that deals with uncertainty by incorporating
prior knowledge or beliefs into the analysis of data.
It is named after Thomas Bayes, an 18th-century mathematician and theologian. The
fundamental concept in Bayesian statistics is the Bayes' theorem, which describes how to
update our beliefs in light of new evidence. Here are the basics of Bayesian statistics:
Bayesian statistics is a powerful approach for making statistical inferences, especially
when dealing with small or complex data sets and when incorporating prior information is
valuable or necessary.
However, the choice of prior can be subjective and influence results, so it's essential to
carefully consider and justify your prior assumptions in Bayesian analysis.
Bayesian statistics is a statistical theory and approach to data analysis that uses Bayes'
theorem to describe the probability of an event based on previous knowledge and
observed and unobserved parameters.
Bayes' Theorem
Bayes' Theorem, named after 18th-century British mathematician Thomas Bayes, is a
mathematical formula for determining conditional probability. Conditional probability is
the likelihood of an outcome occurring based on a previous outcome in similar
circumstances.
The Bayes theorem is frequently referred to as the Bayes rule or Bayes Law. One of the
most well-known theories in machine learning, the Bayes theorem helps determine the
likelihood that one event will occur with unclear information while another has already
happened.
Bayes' theorem describes the probability of occurrence of an event related to any
condition; it is a statement of conditional probability. Bayes' theorem is also known as
the formula for the probability of "causes".
Bayes' Theorem:
At its core, Bayesian statistics is built upon Bayes' theorem, which relates conditional
probabilities. The theorem can be expressed as follows:
P(A | B) = [P(B | A) * P(A)] / P(B)
P(A | B) represents the probability of event A occurring given that event B has
occurred.
P(B | A) is the probability of event B occurring given that event A has occurred.
P(A) and P(B) are the marginal probabilities of events A and B, respectively.
In the context of Bayesian statistics:
P(A | B) is the posterior probability, which represents our updated belief in A after
observing B.
P(B | A) is the likelihood, describing the probability of observing B if A is true.
P(A) is the prior probability, representing our initial belief in A before considering B.
P(B) is the marginal likelihood or evidence, the probability of observing B without
considering A.
Prior Probability:
In Bayesian statistics, prior knowledge or beliefs about the probability of an event play a
crucial role. This is known as the prior probability (P(A)), which represents your initial
beliefs about the likelihood of an event occurring before observing any data. The choice of
the prior can significantly influence the final results.
Likelihood:
The likelihood (P(B | A)) represents the probability of observing the data (B) given a
specific hypothesis or model (A). It quantifies how well the hypothesis or model explains
the observed data. It's essential to choose an appropriate likelihood function that accurately
reflects the relationship between your data and the parameters you want to estimate.
Posterior Probability:
The posterior probability (P(A | B)) is the updated belief in the hypothesis or model A after
observing the data B. It is calculated using Bayes' theorem and combines the prior
knowledge and the likelihood of the data. The posterior probability provides a more
informed and updated estimate of the event's probability based on the observed data.
Bayesian Inference:
Bayesian inference involves using Bayes' theorem to estimate or update the parameters of
a statistical model based on observed data. This process typically includes:
Choosing a prior distribution that reflects your initial beliefs.
Defining a likelihood function that describes the data-generation process.
Applying Bayes' theorem to calculate the posterior distribution, which represents updated
parameter estimates based on the data.
Supervised Learning and Unsupervised Learning Algorithms
Learning Process: In supervised learning, the algorithm learns by comparing predictions to
actual labels and adjusting model parameters to minimize errors. In unsupervised learning,
the algorithm explores the data's inherent structure, often without predefined categories or
labels.
Building a Machine Learning Algorithm
Nearly all machine learning algorithms can be described as particular instances of a fairly
simple recipe: combine a specification of a dataset, a cost function, a model, and an
optimization procedure.
In some cases, the cost function may be a function that we cannot actually evaluate, for
computational reasons.
In these cases, we can still approximately minimize it using iterative numerical
optimization so long as we have some way of approximating its gradients.
Most machine learning algorithms make use of this recipe, though it may not immediately
be obvious.
If a machine learning algorithm seems especially unique or hand-designed, it can usually
be understood as using a special-case optimizer.
Some models such as decision trees or k-means require special-case optimizers because
their cost functions have flat regions that make them inappropriate for minimization by
gradient-based optimizers.
Recognizing that most machine learning algorithms can be described using this recipe
helps us to see the different algorithms as part of a taxonomy of methods for doing related
tasks that work for similar reasons, rather than as a long list of algorithms that each have
separate justifications.
Challenges and Motivation to Deep learning.
During the development phase, our focus is on selecting a learning algorithm and training
it on some data; the two things that can go wrong are a bad algorithm and bad data, or
perhaps both.
Many machine learning problems become exceedingly difficult when the number of
dimensions in the data is high. This phenomenon is known as the curse of dimensionality.
Of particular concern is that the number of possible distinct configurations of a set of
variables increases exponentially as the number of variables increases.
1. Not enough training data:
Consider a child: to teach them what an apple is, all it takes is to point to an apple and
say "apple" a few times.
The child can then recognize all sorts of apples.
Machine learning is not at that level yet; it takes a lot of data for most algorithms to
function properly.
Even for a simple task, thousands of examples may be needed, and for advanced tasks like
image or speech recognition, lakhs (hundreds of thousands) or even millions of examples
may be required.
2. Poor Quality of data:
Obviously, if your training data has lots of errors, outliers, and noise, it will make it
impossible for your machine learning model to detect a proper underlying pattern. Hence,
it will not perform well.
So put in every ounce of effort in cleaning up your training data. No matter how good you
are in selecting and hyper tuning the model, this part plays a major role in helping us make
an accurate machine learning model.
“Most Data Scientists spend a significant part of their time in cleaning data”.
There are a couple of cases in which you would want to clean up the data:
If you see that some instances are clear outliers, discard them or fix them manually.
If some instances are missing a feature (e.g., 2% of users did not specify their age), you
can either ignore these instances, fill in the missing values with the median age, or train
one model with the feature and one without it and compare the results.
3. Irrelevant Features:
“Garbage in, garbage out (GIGO).”
Features are relevant if they are either strongly or weakly relevant, and are irrelevant
otherwise.
Irrelevant features can never contribute to prediction accuracy, by definition.
The credit for a successful machine learning project goes to coming up with a good set of
features on which it has been trained (often referred to as feature engineering ), which
includes feature selection, extraction, and creating new features which are other interesting
topics to be covered in upcoming blogs.
4. Non-representative training data:
To make sure that our model generalizes well, we have to make sure that our training data
should be representative of the new cases that we want to generalize to.
If we train our model using a non-representative training set, its predictions will not be
accurate; it will be biased towards or against one class or group.
For example, suppose you are trying to build a model that recognizes the genre of music.
One way to build your training set is to search on YouTube and use the resulting data.
Here we assume that YouTube's search engine provides representative data, but in reality
the search will be biased towards popular artists, and maybe even towards the artists that
are popular in your location (if you live in India, you will be getting the music of SBP, etc.).
So use representative data during training, so your model won’t be biased among one or
two classes when it works on testing data.
5. Overfitting and Underfitting:
Overfitting is an undesirable machine learning behavior that occurs when the machine
learning model gives accurate predictions for training data but not for new data. When data
scientists use machine learning models for making predictions, they first train the model on
a known data set.
Overfitting happens when our model is too complex.
Things which we can do to overcome this problem:
Simplify the model by selecting one with fewer parameters.
By reducing the number of attributes in training data.
Constraining the model.
Gather more training data.
Reduce the noise.
Underfitting
When a model has not learned the patterns in the training data well and is unable to
generalize well on the new data, it is known as underfitting.
An underfit model has poor performance on the training data and will result in unreliable
predictions.
Things which we can do to overcome this problem:
Select a more advanced model, one with more parameters.
Train on better and relevant features.
Reduce the constraints.
All of us do not have equal talent. But all of us have an equal opportunity to develop our talents.
A. P. J. Abdul Kalam