Module 5.1

Uploaded by

nehal1103sharma
Copyright: © All Rights Reserved
INTRODUCTION TO

MACHINE LEARNING
Module 5.1
WHAT IS MACHINE LEARNING?
• To solve a problem on a computer, we need an algorithm.

• An algorithm is a sequence of instructions that should be carried out to transform input
to output.
• For example, one can devise an algorithm for sorting where the input is a set of numbers, and
the output is their ordered list.
• For the same task, there may be various algorithms and we may be interested in
finding the most efficient one, requiring the least number of instructions or memory or
both.
• For some tasks, however, we do not have an algorithm—for example, to tell
spam emails from legitimate emails.
• We know what the input is: an email document that in the simplest case is a file
of characters.
• We know what the output should be: a yes/no output indicating whether the
message is spam or not.
• We do not know how to transform the input to the output.
• What can be considered spam changes in time and from individual to individual.
• We can easily compile thousands of example messages, some of which we know to be spam, and what we
want is to “learn” what constitutes spam from them.
• In other words, we would like the computer (machine) to extract automatically the algorithm for
this task.
• There is no need to learn to sort numbers, we already have algorithms for that; but there
are many applications for which we do not have an algorithm but do have example data.

• Application of machine learning methods to large databases is called data
mining.
• The analogy is that a large volume of earth and raw material is extracted
from a mine, which when processed leads to a small amount of very
precious material; similarly, in data mining, a large volume of data is
processed to construct a simple model with valuable use.
• Machine learning is a subfield of artificial intelligence, which is broadly defined as the capability of a
machine to imitate intelligent human behavior.
• Machine Learning allows software applications to become more accurate at predicting outcomes without
being explicitly programmed to do so.
• The learning system of a machine learning algorithm is broken down into 3 main parts:

1. A Decision Process: In general, machine learning algorithms are used to make a prediction or
classification. Based on some input data, which can be labeled or unlabeled, your algorithm will
produce an estimate about a pattern in the data.
2. An Error Function: An error function evaluates the prediction of the model. If there are known
examples, an error function can make a comparison to assess the accuracy of the model.
3. A Model Optimization Process: If the model can fit better to the data points in the training set, then
weights are adjusted to reduce the discrepancy between the known example and the model estimate.
The algorithm will repeat this “evaluate and optimize” process, updating weights autonomously until
a threshold of accuracy has been met.
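
The three parts above can be sketched as one small loop. This is a minimal illustration, not a definitive implementation: it assumes a one-feature linear model, mean squared error as the error function, and gradient descent as the optimization process. The data, learning rate `lr`, and tolerance `tol` are made up for the example.

```python
def train(xs, ys, lr=0.05, tol=1e-6, max_steps=10_000):
    """Fit y ≈ w * x + b by repeating 'evaluate and optimize'."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(max_steps):
        # Decision process: predict from the current weights.
        preds = [w * x + b for x in xs]
        # Error function: mean squared error against the known examples.
        error = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
        if error < tol:  # stop once the error threshold is met
            break
        # Model optimization: move the weights against the error gradient.
        grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, error

# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b, error = train(xs, ys)
```

On this data the learned `w` and `b` end up close to 2 and 1, the values used to generate it.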

MACHINE LEARNING METHODS
1. Supervised Machine Learning
• Supervised Learning, also known as supervised machine learning, is defined by its use of
labeled datasets to train algorithms to classify data or predict outcomes accurately.
• As input data is fed into the model, the model adjusts its weights until it has been fitted
appropriately.
• This occurs as part of the cross-validation process to ensure that the model avoids overfitting
or underfitting.
• Supervised learning helps organizations solve a variety of real-world problems at scale, such as
classifying spam in a separate folder from your inbox.
• Some methods used in supervised learning include neural networks, naive Bayes, linear
regression, logistic regression, random forests, and support vector machines (SVMs).

2. Unsupervised Machine Learning
• Unsupervised Learning, also known as unsupervised machine learning, uses machine learning
algorithms to analyze and cluster unlabeled datasets.
• These algorithms discover hidden patterns or data groupings without the need for human intervention.
• This method’s ability to discover similarities and differences in information makes it ideal for
exploratory data analysis, cross-selling strategies, customer segmentation, and image and pattern
recognition.
• It’s also used to reduce the number of features in a model through the process of dimensionality
reduction.
• Principal component analysis (PCA) and singular value decomposition (SVD) are two common
approaches for this.
• Other algorithms used in unsupervised learning include neural networks, k-means clustering, and
probabilistic clustering methods.
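
As an illustration of how k-means clustering groups unlabeled data, here is a minimal one-dimensional sketch. The values and starting centers are invented for the example, and the function name `k_means_1d` is hypothetical; real uses typically work on multi-dimensional data and choose starting centers more carefully.

```python
def k_means_1d(values, centers, steps=20):
    """Cluster numbers around the given starting centers."""
    for _ in range(steps):
        # Assignment step: attach every value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # nothing moved, so the grouping is stable
            break
        centers = new_centers
    return centers, clusters

values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]
centers, clusters = k_means_1d(values, centers=[0.0, 5.0])
```

No labels are supplied anywhere: the two groups (values near 1 and values near 10) emerge from the distances alone.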

3. Semi-supervised Machine Learning
• Semi-supervised learning offers a happy medium between supervised and
unsupervised learning.
• During training, it uses a smaller labeled data set to guide classification and feature
extraction from a larger, unlabeled data set.
• Semi-supervised learning can solve the problem of not having enough labeled data
for a supervised learning algorithm.
• It also helps if it’s too costly to label enough data.

MACHINE LEARNING ALGORITHMS
1. Neural networks
• Neural networks simulate the way the human brain works, with a huge number of linked processing
nodes.
• Neural networks are good at recognizing patterns and play an important role in applications including
natural language translation, image recognition, speech recognition, and image creation.

2. Linear Regression
• This algorithm is used to predict numerical values, based on a linear relationship between different
values.
• For example, the technique could be used to predict house prices based on historical data for the area.

3. Logistic Regression
• This supervised learning algorithm makes predictions for categorical response variables, such as
“yes/no” answers to questions.
• It can be used for applications such as classifying spam and quality control on a production line.

4. Clustering
• Using unsupervised learning, clustering algorithms can identify patterns in data so that it can be grouped.
• Computers can help data scientists by identifying differences between data items that humans have
overlooked.

5. Random Forests
• In a random forest, the machine learning algorithm predicts a value or category by combining
the results from a number of decision trees.

6. Decision Trees

• Decision trees can be used for both predicting numerical values (regression) and classifying
data into categories.
• Decision trees use a branching sequence of linked decisions that can be represented with a tree
diagram.
• One of the advantages of decision trees is that they are easy to validate and audit, unlike the
black box of the neural network.
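
The branching structure of a decision tree, and the way a random forest combines several trees, can be sketched as follows. The trees here are written by hand purely for illustration; the feature names (`links`, `caps_ratio`), thresholds, and labels are hypothetical. A real random forest learns its trees from data rather than having them written out like this.

```python
from collections import Counter

# Each internal node tests one feature against a threshold; leaves are labels.
tree_a = {"feature": "links", "threshold": 3,
          "left": "ham",
          "right": {"feature": "caps_ratio", "threshold": 0.5,
                    "left": "ham", "right": "spam"}}
tree_b = {"feature": "caps_ratio", "threshold": 0.4, "left": "ham", "right": "spam"}
tree_c = {"feature": "links", "threshold": 5, "left": "ham", "right": "spam"}

def predict_tree(node, sample):
    """Follow the branches until a leaf (a plain label) is reached."""
    while isinstance(node, dict):
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node

def predict_forest(trees, sample):
    """Combine the individual tree predictions by majority vote."""
    votes = Counter(predict_tree(t, sample) for t in trees)
    return votes.most_common(1)[0][0]

email = {"links": 4, "caps_ratio": 0.7}
label = predict_forest([tree_a, tree_b, tree_c], email)
```

Because every prediction is just a path of threshold tests, each decision can be read off and audited, which is the advantage noted above.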

REAL WORLD MACHINE LEARNING USE CASES
1. Speech Recognition
• It is also known as automatic speech recognition (ASR), computer speech recognition, or
speech-to-text, and it is a capability which uses natural language processing (NLP) to translate
human speech into a written format.
• Many mobile devices incorporate speech recognition into their systems to conduct voice search
(e.g. Siri) or improve accessibility for texting.
2. Automated stock trading
• Designed to optimize stock portfolios, AI-driven high-frequency trading platforms make thousands
or even millions of trades per day without human intervention.

3. Customer service
• Online chatbots are replacing human agents along the customer journey, changing the way we
think about customer engagement across websites and social media platforms.
• Chatbots answer frequently asked questions (FAQs) about topics such as shipping, or provide
personalized advice, cross-selling products or suggesting sizes for users.
• Examples include virtual agents on e-commerce sites; messaging bots, using Slack and
Facebook Messenger; and tasks usually done by virtual assistants and voice assistants.

4. Recommendation engines
• Using past consumption behavior data, AI algorithms can help to discover data trends that can be used
to develop more effective cross-selling strategies.
• This approach is used by online retailers to make relevant product recommendations to customers
during the checkout process.

5. Computer vision
• This AI technology enables computers to derive meaningful information from digital images,
videos, and other visual inputs, and then take the appropriate action.
• Powered by convolutional neural networks, computer vision has applications in photo tagging
on social media, radiology imaging in healthcare, and self-driving cars in the automotive
industry.

6. Fraud detection
• Banks and other financial institutions can use machine learning to spot suspicious transactions.
• Supervised learning can train a model using information about known fraudulent transactions.
Anomaly detection can identify transactions that look atypical and deserve further investigation.

CHALLENGES OF MACHINE LEARNING
1. Technological singularity
2. AI impact on jobs
3. Privacy
4. Bias and discrimination
5. Accountability

EXAMPLES OF MACHINE LEARNING APPLICATIONS
1. Learning Associations
2. Classification
3. Regression
4. Unsupervised Learning
5. Reinforcement Learning

REINFORCEMENT LEARNING
• In some applications, the output of the system is a sequence of actions.
• In such a case, a single action is not important; what is important is the policy, that is, the
sequence of correct actions to reach the goal.
• There is no such thing as the best action in any intermediate state; an action is good if it is
part of a good policy.
• In such a case, the machine learning program should be able to assess the goodness of
policies and learn from past good action sequences to be able to generate a policy.
• Such learning methods are called reinforcement learning algorithms.
• A good example is game playing where a single move by itself is not that important; it is the
sequence of right moves that is good. A move is good if it is part of a good game playing
policy. Game playing is an important research area in both artificial intelligence and machine
learning. This is because games are easy to describe and at the same time, they are quite
difficult to play well.
• A game like chess has a small number of rules but it is very complex because of the
large number of possible moves at each state and the large number of moves that a
game contains. Once we have good algorithms that can learn to play games well,
we can also apply them to applications with more evident economic utility.
• One factor that makes reinforcement learning harder is when the system has
unreliable and partial sensory information.
• For example, a robot equipped with a video camera has incomplete information and
thus at any time is in a partially observable state and should decide taking into
account this uncertainty; for example, it may not know its exact location in a room
but only that there is a wall to its left. A task may also require a concurrent
operation of multiple agents that should interact and cooperate to accomplish a
common goal. An example is a team of robots playing soccer.
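
The idea that an action is only good as part of a good policy can be made concrete with a toy example. The sketch below assumes a made-up corridor of five positions with the goal at the right end, a cost of 1 per move, and a reward of 10 for reaching the goal; none of these numbers come from the slides. It only evaluates fixed policies; a reinforcement learning algorithm would go further and learn a policy from such evaluations.

```python
GOAL = 4  # states are positions 0..4 in a corridor; the goal is the right end

def run_policy(policy, start=0, max_steps=20):
    """Follow a policy (state -> step of -1 or +1) and return the total reward."""
    state, total = start, 0
    for _ in range(max_steps):
        if state == GOAL:
            break
        state = min(GOAL, max(0, state + policy[state]))
        # Each move costs 1; reaching the goal pays 10.
        total += 10 if state == GOAL else -1
    return total

always_right = {s: +1 for s in range(GOAL)}
hesitant = {0: +1, 1: +1, 2: -1, 3: +1}  # steps back at state 2 and never passes it

good_return = run_policy(always_right)
bad_return = run_policy(hesitant)
```

Both policies take the same action in states 0 and 1, yet only the first reaches the goal: the value of "step right from state 0" depends on what the rest of the policy does afterwards.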

SUPERVISED LEARNING

LEARNING A CLASS FROM EXAMPLES

VAPNIK-CHERVONENKIS (VC) DIMENSION
• VC dimension is a model capacity measurement used in statistics and machine
learning.
• It is frequently used to guide the model selection process while developing
machine learning applications.
• VC dimension is an essential metric in determining the capacity of a machine
learning algorithm.
• The capacity of a model is defined as its ability to learn from a given dataset
while accuracy is its ability to correctly identify labels for a given batch of data.
• VC dimension is useful in formal analysis of learnability.
• This is because VC dimension provides an upper bound on generalization error.

• Let us assume we have a dataset containing N points.
• These N points can be labeled in 2^N ways as positive and negative.
• Therefore, 2^N different learning problems can be defined by N data points.
• If for every one of these problems, we can find a hypothesis h ∈ H that separates the
positive examples from the negative, then we say H shatters N points.
• That is, any learning problem definable by N examples can be learned with no error
by a hypothesis drawn from H.
• The maximum number of points that can be shattered by H is called the Vapnik-
Chervonenkis (VC) dimension of H, is denoted as VC(H), and measures the
capacity of H.

• An axis-aligned rectangle can shatter four points in two
dimensions.
• Then VC(H), when H is the hypothesis class of axis-
aligned rectangles in two dimensions, is four.
• In calculating the VC dimension, it is enough that we
find four points that can be shattered; it is not necessary
that we be able to shatter any four points in two
dimensions.
• For example, four points placed on a line cannot be
shattered by rectangles.
• However, we cannot place five points in two
dimensions anywhere such that a rectangle can separate
the positive and negative examples for all possible
labelings.
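
Shattering can be checked by brute force for small point sets. The sketch below assumes the hypothesis class of axis-aligned rectangles that label points inside as positive, and it tries all 2^N labelings of a given set. The point sets are chosen for illustration. Testing one five-point arrangement does not prove that no five points can be shattered; it only shows one case failing.

```python
from itertools import product

def rectangle_can_realize(points, labels):
    """True if some axis-aligned rectangle contains exactly the positive points."""
    positives = [p for p, lab in zip(points, labels) if lab]
    if not positives:
        return True  # an empty rectangle labels everything negative
    # The tightest rectangle around the positives is the best candidate:
    # any rectangle containing all positives must contain it.
    x_lo, x_hi = min(x for x, _ in positives), max(x for x, _ in positives)
    y_lo, y_hi = min(y for _, y in positives), max(y for _, y in positives)
    return not any(
        x_lo <= x <= x_hi and y_lo <= y <= y_hi
        for (x, y), lab in zip(points, labels) if not lab
    )

def shatters(points):
    """Check every one of the 2^N labelings of the points."""
    return all(
        rectangle_can_realize(points, labels)
        for labels in product([False, True], repeat=len(points))
    )

diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]
on_a_line = [(0, 0), (1, 0), (2, 0), (3, 0)]
five = diamond + [(0, 0)]
```

Here `shatters(diamond)` is true, while `shatters(on_a_line)` and `shatters(five)` are false, matching the statements above about four well-placed points, four collinear points, and a fifth point.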

• VC dimension may seem pessimistic.
• It tells us that using a rectangle as our hypothesis class, we can learn only datasets
containing four points and not more.
• A learning algorithm that can learn datasets of four points is not very useful.
• However, this is because the VC dimension is independent of the probability
distribution from which instances are drawn.

PROBABLY APPROXIMATELY CORRECT (PAC)

NOISE
• Noise is any unwanted anomaly in the data. Because of noise, the class may be
more difficult to learn, and zero error may be infeasible with a simple hypothesis
class.
• The following are some interpretations of noise:
⮚ There may be imprecision in recording the input attributes, which may shift
the data points in the input space.
⮚ There may be errors in labeling the data points, which may relabel positive
instances as negative and vice versa. This is sometimes called teacher noise.
⮚ There may be additional attributes, which we have not considered, that affect
the label of an instance. Such attributes may be hidden or latent in that they
may be unobservable. The effect of these neglected attributes is thus modeled
as a random component and is included in “noise.”

• When there is noise, there is not a simple
boundary between the positive and negative
instances and to separate them, one needs a
complicated hypothesis that corresponds to a
hypothesis class with larger capacity.
• A rectangle can be defined by four numbers,
but to define a more complicated shape one
needs a more complex model with a much
larger number of parameters.
• With a complex model, one can make a perfect
fit to the data and attain zero error (the wiggly
figure).
• Another possibility is to keep the model simple
and allow some error (the rectangular figure).

Using the simple rectangle makes more sense because of the following:
1. It is a simple model to use. It is easy to check whether a point is inside or outside a
rectangle and we can easily check, for a future data instance, whether it is a positive or a
negative instance.
2. It is a simple model to train and has fewer parameters. It is easier to find the corner values
of a rectangle than the control points of an arbitrary shape. With a small training set when
the training instances differ a little bit, we expect the simpler model to change less than a
complex model: A simple model is thus said to have less variance. On the other hand, a too
simple model assumes more, is more rigid, and may fail if indeed the underlying class is not
that simple: A simpler model has more bias. Finding the optimal model corresponds to
minimizing both the bias and the variance.
3. It is a simple model to explain. A rectangle simply corresponds to defining intervals on the
two attributes. By learning a simple model, we can extract information from the raw data
given in the training set.
4. If indeed there is mislabeling or noise in input and the actual class is really a simple
model like the rectangle, then the simple rectangle, because it has less variance and is
less affected by single instances, will be a better discriminator than the wiggly shape,
although the simple one may make slightly more errors on the training set. Given
comparable empirical error, we say that a simple (but not too simple) model would
generalize better than a complex model. This principle is known as Occam’s razor,
which states that simpler explanations are more plausible, and any unnecessary
complexity should be shaved off.

LEARNING MULTIPLE CLASSES

REGRESSION
• Regression models are used to predict a continuous value.
• Predicting the price of a house from features of the house, such as its size, is one
of the common examples of regression.
• It is a supervised technique.
• The ultimate goal of a regression algorithm is to plot a best-fit line or curve
through the data.
• The three main metrics that are used for evaluating trained regression model are
variance, bias and error.
• If variance is high, it leads to overfitting and when bias is high, it leads to
underfitting.
• Regression helps establish a relationship among the variables by estimating how
one variable affects the other.

TYPES OF REGRESSION

• Simple Linear
• Polynomial
• Support Vector
• Decision Tree
• Random Forest

SIMPLE LINEAR REGRESSION
• This is one of the most common types of regression technique.
• Here we predict a target variable Y based on the input variable X.
• A linear relationship should exist between the target variable and the predictor, hence the name
Linear Regression.
• Consider predicting the salary of an employee based on his/her age.
• We can easily identify that there seems to be a correlation between an employee’s age and salary
(the higher the age, the higher the salary).
• The hypothesis of linear regression is Y = a + bX
• Y represents salary, X is employee’s age and a and b are the coefficients of the equation.
• So, in order to predict Y (salary) given X (age), we need to know the values of a and b (the
model’s coefficients).
• While training and building a regression model, it is these coefficients which are learned and
fitted to training data.
• The aim of training is to find the best-fit line, the one for which the cost function is minimized.
• The cost function measures the error.
• During the training process, we try to minimize the error between actual and predicted values,
thereby minimizing the cost function.
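
For the single-variable case, the coefficients a and b that minimize the squared error can be computed directly. The sketch below uses the standard least-squares formulas; the ages and salaries are invented for illustration, and the function name `fit_line` is hypothetical. Iterative training would be another way to reach the same coefficients.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a (intercept) and b (slope) in Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b

# Hypothetical ages and salaries (in thousands), invented for illustration.
ages = [25, 30, 35, 40, 45]
salaries = [30, 38, 46, 54, 62]
a, b = fit_line(ages, salaries)
predicted = a + b * 50  # predicted salary at age 50
```

Once a and b are known, predicting Y for a new X is a single evaluation of Y = a + bX, as in the last line.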

• In the figure, the red points are the data
points, and the blue line is the predicted
line for the training data.
• To get the predicted value for a data
point, project it vertically onto the line.
• The aim is to find such values of
coefficients which will minimize the cost
function.
• The most common cost function is Mean
Squared Error (MSE) which is equal to
the average squared difference between an
observation’s actual and predicted values.
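
The MSE definition above translates directly into code. The numbers below are made up to keep the arithmetic easy to follow.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)

actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
mse = mean_squared_error(actual, predicted)  # (1 + 0 + 4) / 3
```

Squaring makes every difference non-negative and penalizes large misses more heavily than small ones.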

POLYNOMIAL REGRESSION
• In polynomial regression, we
transform the original features into
polynomial features of a given degree
and then apply Linear Regression on
it.
• The linear model Y = a + bX is
transformed into Y = a + bX + cX^2.
• It is still a linear model, because it is
linear in the coefficients a, b, and c, but
the curve is now quadratic rather than a line.
• If we increase the degree to a very
high value, the curve becomes
overfitted as it learns the noise in the
data as well.
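
The two steps described above, expanding the features and then fitting a linear model, can be sketched in plain Python. This is an illustrative sketch under stated assumptions: it solves the least-squares normal equations with a small hand-written Gaussian elimination (the helper names `solve` and `fit_polynomial` are hypothetical), and the data is noise-free so the recovered coefficients are easy to check.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    rows = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    solution = [0.0] * n
    for i in reversed(range(n)):
        known = sum(rows[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (rows[i][n] - known) / rows[i][i]
    return solution

def fit_polynomial(xs, ys, degree=2):
    """Expand X into [1, X, ..., X^degree], then fit ordinary linear regression."""
    features = [[x ** d for d in range(degree + 1)] for x in xs]
    k = degree + 1
    # Normal equations: (F^T F) w = F^T y.
    ftf = [[sum(f[i] * f[j] for f in features) for j in range(k)] for i in range(k)]
    fty = [sum(f[i] * y for f, y in zip(features, ys)) for i in range(k)]
    return solve(ftf, fty)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]  # noise-free data from Y = 1 + 2X + 3X^2
a, b, c = fit_polynomial(xs, ys)
```

The fitting step never sees a curve, only extra feature columns, which is why the model remains linear in its coefficients.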

SUPPORT VECTOR REGRESSION
• In SVR, we identify a hyperplane with
maximum margin such that the
maximum number of data points are
within that margin.
• Instead of minimizing the error rate as in
simple linear regression, we try to fit the
error within a certain threshold.
• The objective here is to consider the
points that are within the margin.
• Our best fit line is the hyperplane that
has the maximum number of points.
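
The idea of tolerating errors within a threshold is usually expressed through the epsilon-insensitive loss. The sketch below shows only that loss, not a full SVR training procedure; the parameter name `epsilon`, its value, and the data are assumptions made for illustration.

```python
def epsilon_insensitive_loss(actual, predicted, epsilon=0.5):
    """Errors up to epsilon cost nothing; larger errors cost only the excess."""
    return [max(0.0, abs(y - p) - epsilon) for y, p in zip(actual, predicted)]

actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.2, 2.0, 4.0, 3.5]
losses = epsilon_insensitive_loss(actual, predicted)
inside_margin = sum(1 for loss in losses if loss == 0.0)
```

Three of the four points fall within the margin and contribute no loss; only the third point, which misses by 1.0, is penalized, and only for the 0.5 by which it exceeds the threshold.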

DECISION TREE REGRESSION

RANDOM FOREST REGRESSION

MODEL SELECTION AND GENERALIZATION
• An ill-posed problem is one in which data by itself is not sufficient to find a
unique solution.
• So, because learning is ill-posed, and data by itself is not sufficient to find the
solution, we should make some extra assumptions to have a unique solution with
the data we have.
• The set of assumptions we make to have learning possible is called the inductive
bias of the learning algorithm.
• Learning is not possible without inductive bias, so it is very important to choose
the right bias.
• This is called model selection, which is choosing between possible hypothesis
classes H.
• The aim of machine learning is rarely to replicate the training data but to predict
new cases.
• So, we would like to be able to generate the right output for an input instance
outside the training set, one for which the correct output is not given in the
training set.
• How well a model trained on the training set predicts the right output for new
instances is called generalization.
• For best generalization, we should match the complexity of the hypothesis class H
with the complexity of the function underlying the data.
• If H is less complex than the function, we have underfitting, for example, when
trying to fit a line to data sampled from a third-order polynomial.
• In such a case, as we increase the complexity, the training error decreases.
• If there is noise, an overcomplex hypothesis may learn not only the underlying
function but also the noise in the data and may make a bad fit; this is called
overfitting.

• In all learning algorithms that are trained from example data, there is a trade-off
between three factors:
⮚ the complexity of the hypothesis we fit to data, namely, the capacity of the
hypothesis class
⮚ the amount of training data
⮚ the generalization error on new examples
• As the amount of training data increases, the generalization error decreases.
• We divide the training set into two parts. We use one part for training (i.e., to fit a
hypothesis), and the remaining part is called the validation set and is used to test
the generalization ability.
• Assuming large enough training and validation sets, the hypothesis that is the
most accurate on the validation set is the best one (the one that has the best
inductive bias). This process is called cross-validation.
• We have used the validation set to choose the best model, and it has effectively
become a part of the training set. We need a third set, a test set, sometimes also
called the publication set, containing examples not used in training or validation.
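
The training, validation, and test roles described above can be sketched as a simple three-way split followed by a selection step. The split fractions, the seed, and the function names `three_way_split` and `select_model` are assumptions for illustration; the candidates passed to `select_model` are assumed to be already trained on the training part.

```python
import random

def three_way_split(examples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then cut into training, validation, and test sets."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    training = shuffled[n_test + n_val:]
    return training, validation, test

def select_model(candidates, validation, error_fn):
    """Pick the already-trained candidate with the lowest validation error."""
    return min(candidates, key=lambda model: error_fn(model, validation))

examples = list(range(100))  # stand-ins for (input, output) pairs
training, validation, test = three_way_split(examples)
```

The test set is touched only once, after `select_model` has made its choice, so that the reported error comes from examples that influenced neither the fit nor the selection.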
DIMENSIONS OF A SUPERVISED ML ALGO

THANK YOU
