
MACHINE LEARNING

21CSC305P

Unit-1 – Introduction

 Machine Learning: what and why?
 Supervised and Unsupervised Learning
 Polynomial Curve Fitting
 Probability Theory – Discrete Random Variables
 Fundamental Rules
 Bayes Rule
 Independence and Conditional Independence
 Continuous Random Variables
 Quantiles
 Mean and Variance
 Probability Densities
 Expectation and Covariance

MACHINE LEARNING
“We are drowning in information and starving for knowledge.” — John Naisbitt
 To solve a problem on a computer, we need an algorithm.
 An algorithm is a sequence of instructions that should be carried out to transform the input to
output.
 For example, one can devise an algorithm for sorting. The input is a set of numbers and the
output is their ordered list. For the same task there may be various algorithms, and we may be
interested in finding the most efficient one: the one requiring the fewest instructions, the least
memory, or both.
 For some tasks, however, we do not have an algorithm. Predicting customer behaviour is one;
another is to tell spam emails from legitimate ones. We know what the input is: an email
document that in the simplest case is a file of characters. We know what the output should be:
a yes/no output indicating whether the message is spam or not.
 Machine learning is programming computers to optimize a performance criterion using
example data or past experience. We have a model defined up to some parameters, and
learning is the execution of a computer program to optimize the parameters of the model using
the training data or past experience. The model may be predictive to make predictions in the
future, or descriptive to gain knowledge from data, or both.
 Machine learning is a subfield of artificial intelligence, which is broadly defined as “the
capability of a machine to imitate intelligent human behavior”.
 Arthur Samuel (1959) described it as: “the field of study that gives computers the ability to
learn without being explicitly programmed.”
 Mitchell (1997) provides the definition: “A computer program is said to learn from
experience E with respect to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with experience E.”

 Model: A model is a specific representation learned from data by applying some machine
learning algorithm. A model is also called a hypothesis.
 Feature: A feature is an individual measurable property of our data. A set of numeric
features can be conveniently described by a feature vector. Feature vectors are fed as input
to the model. For example, in order to predict a fruit, there may be features like color, smell,
taste, etc.
 Target (Label): A target variable or label is the value to be predicted by our model. For the
fruit example discussed in the features section, the label with each set of input would be the
name of the fruit like apple, orange, banana, etc.
 Training: The idea is to give a set of inputs (features) and their expected outputs (labels), so
that after training we will have a model (hypothesis) that maps new data to one of the
categories it was trained on.
 Prediction: Once our model is ready, it can be fed a set of inputs for which it will produce a
predicted output (label). Only if the model performs well on unseen data can we say that it
has learned well.
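To make these terms concrete, here is a minimal sketch (assuming scikit-learn is available; the fruit
data and numeric feature encodings are invented purely for illustration):

# A toy illustration of features, labels, training and prediction.
# The numeric encodings below are hypothetical, chosen only for this example.
from sklearn.tree import DecisionTreeClassifier

# Feature vectors: [color, smell, taste] encoded as small integers.
X_train = [
    [0, 1, 2],  # e.g. red, sweet smell, sweet taste
    [1, 0, 1],  # e.g. orange, citrus smell, tangy taste
    [2, 1, 2],  # e.g. yellow, sweet smell, sweet taste
]
y_train = ["apple", "orange", "banana"]  # labels (targets)

model = DecisionTreeClassifier()   # the hypothesis class
model.fit(X_train, y_train)        # training: fit parameters to the examples

print(model.predict([[0, 1, 2]]))  # prediction on a new feature vector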

APPLICATIONS FOR MACHINE LEARNING


 Adaptive Websites
 Affective Computing
 Bioinformatics
 Brain-Computer Interfaces (BCI)
 Cheminformatics
 Classifying DNA Sequences
 Computational Anatomy
 Computer Vision
 Detecting Credit Card Fraud
 Game Playing
 Information Retrieval
 Internet Fraud Detection
 Linguistics
 Marketing
 Machine Learning Control
 Machine Perception
 Medical Diagnosis
 Economics
 Natural Language Processing
 Optimization
 Online Advertising
 Recommender Systems
 Robot Locomotion
 Search Engines
 Sentiment Analysis
 Software Engineering
 Speech Recognition
 Handwriting Recognition
 Pattern Recognition
 User Behaviour Analytics
 Machine Translation
TYPES OF MACHINE LEARNING

1. Supervised Learning: A training set of examples with correct responses is provided; based
on this training set, the algorithm generalizes to respond correctly to all possible inputs. This is also
called learning from examples.
In the predictive or supervised learning approach, the goal is to learn a mapping from inputs x to
outputs y, given a labeled set of input-output pairs D = {(x_i, y_i)} for i = 1, ..., N. Here D is called the
training set, and N is the number of training examples.
In the simplest setting, each training input x_i is a D-dimensional vector of numbers, representing, say,
the height and weight of a person. These are called features, attributes or covariates. In general,
however, x_i could be a complex structured object, such as an image, a sentence, an email message, a
time series, a molecular shape, a graph, etc.
The form of the output or response variable can in principle be anything, but most methods assume
that y_i is a categorical or nominal variable from some finite set, y_i ∈ {1, ..., C} (such as male or
female), or that y_i is a real-valued scalar (such as income level).
When y_i is categorical, the problem is known as classification or pattern recognition, and when y_i is
real-valued, the problem is known as regression. Another variant, known as ordinal regression, occurs
when the label space Y has some natural ordering, such as grades A–F.
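As a concrete illustration of this notation (the numbers are invented), a tiny training set with N = 4
inputs and D = 2 features might look like this in code:

import numpy as np

# Design matrix X: N = 4 training inputs, each a D = 2 dimensional
# feature vector (height in cm, weight in kg); the values are invented.
X = np.array([
    [170.0, 65.0],
    [182.0, 80.0],
    [158.0, 52.0],
    [175.0, 72.0],
])

# Categorical labels y_i ∈ {0, 1}: a classification problem.
y_class = np.array([0, 1, 0, 1])

# Real-valued targets (say, income level): a regression problem.
y_reg = np.array([42.0, 55.5, 38.2, 51.0])

print(X.shape)  # (N, D) = (4, 2)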

2. Unsupervised Learning: Correct responses are not provided; instead, the algorithm tries to
identify similarities between the inputs, so that inputs which have something in common are grouped
together. The statistical approach to unsupervised learning is known as density estimation. In the
descriptive or unsupervised learning approach, we are only given inputs, D = {x_i} for i = 1, ..., N,
and the goal is to find “interesting patterns” in the data. This is sometimes called knowledge discovery.
We can formalize the task as one of density estimation; that is, we want to build models of the form
p(x_i|θ). There are two differences from the supervised case. First, we have written p(x_i|θ) instead of
p(y_i|x_i, θ); that is, supervised learning is conditional density estimation, whereas unsupervised
learning is unconditional density estimation. Second, x_i is a vector of features, so we need to create
multivariate probability models.

3. Reinforcement Learning: Learning how to act or behave when given occasional reward or
punishment signals. (For example, consider how a baby learns to walk.) This is learning with a
critic: a monitor scores the answer but does not suggest improvements.

1. SUPERVISED LEARNING
a. CLASSIFICATION
Here the goal is to learn a mapping from inputs x to outputs y, where y ∈{1,...,C}, with C being the
number of classes.
 If C = 2, this is called binary classification (in which case we often assume y ∈{0, 1});
 If C > 2, this is called multiclass classification.
 If the class labels are not mutually exclusive (e.g., somebody may be classified as tall and
strong), we call it multi-label classification, but this is best viewed as predicting multiple
related binary class labels (a so-called multiple output model).
 One way to formalize the problem is as function approximation. We assume y = f(x) for some
unknown function f, and the goal of learning is to estimate the function f given a labeled training
set, and then to make predictions using ŷ = f̂(x). (We use the hat symbol to denote an estimate.)
 Our main goal is to make predictions on novel inputs, meaning ones that we have not seen
before (this is called generalization), since predicting the response on the training set is easy.
The need for probabilistic predictions:
Given a probabilistic output, we can always compute our “best guess” as to the “true label” as the
most probable class label, using:

ŷ = f̂(x) = argmax_{c = 1, ..., C} p(y = c | x, D)
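As a minimal sketch of this rule (the class probabilities below are invented), the predicted label is
simply the index of the largest posterior probability:

import numpy as np

# Hypothetical posterior probabilities p(y = c | x, D) for C = 3 classes.
probs = np.array([0.15, 0.70, 0.15])

y_hat = np.argmax(probs)    # best guess: the mode of the posterior
confidence = probs[y_hat]   # how sure the model is

print(y_hat, confidence)    # -> 1 0.7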

Applications:
 Document classification: In document classification, the goal is to classify a document, such
as a web page or email message, into one of C classes, that is, to compute p(y = c|x, D), where
x is some representation of the text.
 Email spam filtering, where the classes are spam y = 1 or ham y = 0.
 Classifying Flowers: 4 useful features or characteristics: sepal length and width, and petal
length and width. (Such feature extraction is an important, but difficult, task.)

Figure: Three types of iris flowers: setosa, versicolor and virginica.


 Image Classification: Now consider the harder problem of classifying images directly, where
a human has not pre-processed the data. We might want to classify the image as a whole, e.g.,
is it an indoors or outdoors scene? is it a horizontal or vertical photo? does it contain a dog or
not? This is called image classification.
 Handwriting Recognition: In the special case that the images consist of isolated handwritten
letters and digits, for example, in a postal or ZIP code on a letter, we can use classification to
perform handwriting recognition. A standard dataset used in this area is known as MNIST,
which stands for “Modified National Institute of Standards and Technology”. (The term
“modified” is used because the images have been preprocessed to ensure the digits are mostly
in the center of the image.)
 Face Detection and Recognition: A harder problem is to find objects within an image; this is
called object detection or object localization. An important special case of this is face detection.
One approach to this problem is to divide the image into many small overlapping patches at
different locations, scales and orientations, and to classify each such patch based on whether it
contains face-like texture or not. This is called a sliding window detector.
b. REGRESSION
Regression is just like classification except that the response variable is continuous. In the simplest
case we have a single real-valued input x_i ∈ R and a single real-valued response y_i ∈ R.
We consider fitting two models to the data: a straight line and a quadratic function.
"Regression fits a line or curve through the data points on the target-predictor graph in such a way
that the vertical distance between the data points and the regression line is minimized."

Examples of real-world regression problems.


• Predict tomorrow’s stock market price given current market conditions and other possible side
information.
• Predict the age of a viewer watching a given video on YouTube.
• Predict the location in 3d space of a robot arm end effector, given control signals (torques) sent to
its various motors.
• Predict the amount of prostate specific antigen (PSA) in the body as a function of a number of
different clinical measurements.
• Predict the temperature at any location inside a building using weather data, time, door sensors, etc.

Figure: (a) Linear regression on some 1d data. (b) Same data with polynomial regression (degree
2).
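The two fits shown in the figure can be reproduced with a short sketch like the following (NumPy
only; the data are synthetic, generated here just for illustration):

import numpy as np

# Synthetic 1d regression data (invented for illustration).
rng = np.random.default_rng(0)
x = np.linspace(0, 20, 21)
y = -1.5 * x + 0.08 * x**2 + rng.normal(0, 2, size=x.shape)

# Fit a straight line (degree 1) and a quadratic (degree 2) by least squares.
line = np.polynomial.Polynomial.fit(x, y, deg=1)
quad = np.polynomial.Polynomial.fit(x, y, deg=2)

print(line(10.0), quad(10.0))  # predictions of each model at x = 10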
Terminologies Related to the Regression Analysis:
 Dependent Variable: The main factor in regression analysis, which we want to predict or
understand, is called the dependent variable. It is also called the target variable.
 Independent Variable: The factors which affect the dependent variable, or which are used to
predict its values, are called independent variables, also called predictors.
 Outliers: An outlier is an observation with either a very low or a very high value in
comparison to the other observed values. Outliers can distort the result, so they should be
handled carefully.
 Multicollinearity: If the independent variables are highly correlated with each other, the
condition is called multicollinearity. It should not be present in the dataset, because it creates
problems when ranking the most influential variables.
 Underfitting and Overfitting: If our algorithm works well on the training dataset but not
on the test dataset, the problem is called overfitting. If our algorithm does not perform well
even on the training dataset, the problem is called underfitting.
2. UNSUPERVISED LEARNING
Unsupervised learning is a type of machine learning in which models are trained on an unlabeled
dataset and are allowed to act on that data without any supervision.

Types of Unsupervised Learning Algorithm

 Clustering: Clustering is a method of grouping objects into clusters such that objects
with the most similarities remain in one group and have few or no similarities with the
objects of another group. Cluster analysis finds the commonalities between the data
objects and categorizes them according to the presence or absence of those
commonalities. (A minimal clustering sketch appears after this list.)

 Association: An association rule is an unsupervised learning method used for finding
relationships between variables in a large database. It determines the sets of items that
occur together in the dataset. Association rules can make a marketing strategy more
effective: for example, people who buy item X (say, bread) also tend to purchase item
Y (butter or jam). A typical example of association rules is Market Basket Analysis.
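As referenced above, here is a minimal clustering sketch (assuming scikit-learn; the 2-d points are
invented) that groups unlabeled inputs into K = 2 clusters:

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two loose groups of 2-d points (values invented).
X = np.array([
    [1.0, 2.0], [1.5, 1.8], [1.2, 2.2],   # roughly one group
    [8.0, 8.0], [8.5, 8.2], [7.8, 7.9],   # roughly another group
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assignment for each point
print(kmeans.cluster_centers_)   # the discovered group centres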

Examples:
 Discovering Clusters: Let K denote the number of clusters. Our first goal is to estimate
the distribution over the number of clusters, p(K|D); this tells us if there are
subpopulations within the data. For simplicity, we often approximate the distribution
p(K|D) by its mode, K* = argmax_K p(K|D).
◦ In astronomy, the autoclass system (Cheeseman et al. 1988) discovered a new type
of star, based on clustering astrophysical measurements.
◦ In e-commerce, it is common to cluster users into groups, based on their purchasing
or web-surfing behavior, and then to send customized targeted advertising to each
group.
◦ In biology, it is common to cluster flow-cytometry data into groups, to discover
different sub-populations of cells.
 Discovering Latent Factors: When dealing with high dimensional data, it is often
useful to reduce the dimensionality by projecting the data to a lower dimensional
subspace which captures the “essence” of the data. This is called dimensionality
reduction. (A short PCA sketch appears after this list.)
The motivation behind this technique is that although the data may appear high
dimensional, there may only be a small number of degrees of variability, corresponding
to latent factors. For example, when modeling the appearance of face images, there may
only be a few underlying latent factors which describe most of the variability, such as
lighting, pose, identity, etc.
◦ In biology, it is common to use PCA to interpret gene microarray data, to account
for the fact that each measurement is usually the result of many genes which are
correlated in their behaviour because they belong to different biological
pathways.
◦ In natural language processing, it is common to use a variant of PCA called latent
semantic analysis for document retrieval.
◦ In signal processing (e.g., of acoustic or neural signals), it is common to use ICA
(which is a variant of PCA) to separate signals into their different sources.
◦ In computer graphics, it is common to project motion capture data to a low
dimensional space, and use it to create animations.
 Discovering Graph Structure: Sometimes we measure a set of correlated variables,
and we would like to discover which ones are most correlated with which others. This
can be represented by a graph G, in which nodes represent variables, and edges
represent direct dependence between variables. We can then learn this graph structure
from data, i.e., we compute Ĝ = argmax_G p(G|D).
 Matrix Completion: Sometimes we have missing data, that is, variables whose values
are unknown. For example, we might have conducted a survey, and some people might
not have answered certain questions. Or we might have various sensors, some of which
fail. The corresponding design matrix will then have “holes” in it; these missing entries
are often represented by NaN, which stands for “not a number”. The goal of imputation
is to infer plausible values for the missing entries. This is sometimes called matrix
completion.
 Image Inpainting: An interesting example of an imputation-like task is known as
image inpainting. The goal is to “fill in” holes (e.g., due to scratches or occlusions) in
an image with realistic texture.
 Collaborative Filtering: Another interesting example of an imputation-like task is
known as collaborative filtering. A common example of this concerns predicting which
movies people will want to watch based on how they, and other people, have rated
movies which they have already seen. The key idea is that the prediction is not based
on features of the movie or user (although it could be), but merely on a ratings matrix.
More precisely, we have a matrix X where X(m, u) is the rating (say an integer between
1 and 5, where 1 is dislike and 5 is like) by user u of movie m.
Figure: Example of movie-rating data.
Training data is in red, test data is denoted by ?, empty cells are unknown.
 Market Basket Analysis: In commercial data mining, there is much interest in a task
called market basket analysis. The data consists of a (typically very large but sparse)
binary matrix, where each column represents an item or product, and each row
represents a transaction. We set x_ij = 1 if item j was purchased on the i'th transaction.
Many items are purchased together (e.g., bread and butter), so there will be correlations
amongst the bits.
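Here is the PCA sketch referenced under “Discovering Latent Factors” above (assuming scikit-learn;
the data are random stand-ins for real high dimensional measurements):

import numpy as np
from sklearn.decomposition import PCA

# Stand-in for high dimensional data: N = 100 samples with D = 20
# features that really vary along only a few latent directions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))    # 2 hidden factors
mixing = rng.normal(size=(2, 20))     # how the factors show up in the features
X = latent @ mixing + 0.1 * rng.normal(size=(100, 20))

# Project onto a 2-dimensional subspace capturing most of the variance.
pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)                  # the low dimensional codes

print(Z.shape)                        # (100, 2)
print(pca.explained_variance_ratio_)  # variance captured per component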

3. REINFORCEMENT LEARNING
Reinforcement learning (RL) is an area of machine learning concerned with how intelligent
agents ought to take actions in an environment in order to maximize the notion of cumulative
reward. The environment is typically stated in the form of a Markov decision process (MDP),
because many reinforcement learning algorithms for this context use dynamic programming
techniques.
Due to its generality, reinforcement learning is studied in many disciplines, such as game
theory, control theory, operations research, information theory, simulation-based optimization,
multi-agent systems, swarm intelligence, and statistics. In the operations research and control
literature, reinforcement learning is called approximate dynamic programming, or neuro-
dynamic programming.

Types of Reinforcement
1. Positive – Positive reinforcement is defined as an event that, occurring because of a particular
behaviour, increases the strength and frequency of that behaviour. In other words, it has a
positive effect on behaviour.
Advantages of positive reinforcement:
 Maximizes performance
 Sustains change for a long period of time
Disadvantage:
 Too much reinforcement can lead to an overload of states, which can diminish the
results
2. Negative – Negative reinforcement is defined as the strengthening of a behaviour because a
negative condition is stopped or avoided.
Advantages of negative reinforcement:
 Increases behaviour
 Provides defiance to a minimum standard of performance
Disadvantage:
 It only provides enough to meet the minimum behaviour

Various Practical applications of Reinforcement Learning –


 RL can be used in robotics for industrial automation.
 RL can be used in machine learning and data processing
 RL can be used to create training systems that provide custom instruction and materials
according to the requirement of students.
 RL can be used in large environments in the following situations:
 A model of the environment is known, but an analytic solution is not available;
 Only a simulation model of the environment is given (the subject of simulation-based
optimization)
 The only way to collect information about the environment is to interact with it.

Terms used in Reinforcement Learning


 Agent: An entity that can perceive/explore the environment and act upon it.
 Environment: The situation in which an agent is present or by which it is surrounded. In
RL, we assume a stochastic environment, which means it is random in nature.
 Action: Actions are the moves taken by an agent within the environment.
 State: The situation returned by the environment after each action taken by the agent.
 Reward: Feedback returned to the agent from the environment to evaluate the agent's
action.
 Policy: The strategy applied by the agent to decide the next action based on the current
state.
 Value: The expected long-term return with the discount factor, as opposed to the
short-term reward.
 Q-value: Mostly the same as the value, but it takes one additional parameter, the
current action (a).

Approaches to implement Reinforcement Learning


1. Value-based: The value-based approach aims to find the optimal value function, which gives
the maximum value attainable at a state under any policy. The agent then expects the long-term
return at any state s under policy π. (A minimal Q-learning sketch appears after this list.)
2. Policy-based: The policy-based approach aims to find the optimal policy for the maximum
future reward without using a value function. In this approach, the agent tries to apply a policy
such that the action performed at each step helps to maximize the future reward.
The policy-based approach has mainly two types of policy:
 Deterministic: The same action is produced by the policy (π) at any given state.
 Stochastic: Probability determines the action produced.
3. Model-based: In the model-based approach, a virtual model of the environment is created,
and the agent explores that environment to learn it. There is no particular solution or algorithm
for this approach, because the model representation is different for each environment.
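To make the value-based approach concrete, here is a minimal tabular Q-learning sketch on an
invented one-dimensional corridor environment (all states, rewards and hyperparameters are made up
for illustration):

import numpy as np

# Toy environment: states 0..4 on a line; reaching state 4 gives reward 1.
# Actions: 0 = move left, 1 = move right. Everything here is invented.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration

Q = np.zeros((N_STATES, N_ACTIONS))     # the Q-value table
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    while s != GOAL:
        # Epsilon-greedy action selection from the current policy.
        a = rng.integers(N_ACTIONS) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
        r = 1.0 if s_next == GOAL else 0.0
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(Q)  # learned values; the greedy policy should always move right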

POLYNOMIAL CURVE FITTING


 Consider a training data set of N = 10 points (shown as blue circles in the figure).
 The green curve shows the actual function sin(2πx) used to generate the data.
 Our goal is to predict the value of t for some new value of x, without knowledge of
the green curve.

 We try to fit the data using a polynomial function of the form

y(x, w) = w_0 + w_1 x + w_2 x² + … + w_M x^M = Σ_{j=0}^{M} w_j x^j

where M is the order of the polynomial.
 The values of the coefficients w will be determined by fitting the polynomial to the
training data.
 This can be done by minimizing an error function that measures the misfit between
the function y(x,w), for any given value of w, and the training set data points.
 Error Function: the sum of the squares of the errors between the predictions y(x_n, w)
for each data point x_n and the corresponding target values t_n:

E(w) = (1/2) Σ_{n=1}^{N} { y(x_n, w) − t_n }²

 We can solve the curve fitting problem by choosing the value of w for which E(w) is
as small as possible.
 Since the error function is a quadratic function of the coefficients w, its derivatives with
respect to the coefficients will be linear in the elements of w, and so the minimization
of the error function has a unique solution, denoted by w*.
 The resulting polynomial is given by the function y(x,w*).
 Choosing the order M of the polynomial is an example of model selection.
 Figure: plots of polynomials of order M = 0, 1, 3 and 9 fitted to the data.
 The 0th order (M=0) and first order (M=1) polynomials give rather poor fits to the
data and consequently rather poor representations of the function sin(2πx).
 The third order (M=3) polynomial seems to give the best fit to the function sin(2πx)
of the examples.
 When we go to a much higher order polynomial (M=9), we obtain an excellent fit to
the training data.
 In fact, the polynomial passes exactly through each data point and E(w*) = 0.
 However, the fitted curve oscillates wildly and gives a very poor representation of the
function sin(2πx). This behaviour is known as over-fitting.
 Over-fitting: To quantify this behaviour, we can evaluate the residual value of E(w*) on
the training data, and we can also evaluate E(w*) on a separate test data set.

Root-Mean-Square (RMS) Error:

E_RMS = √( 2 E(w*) / N )

in which the division by N allows us to compare different sizes of data sets, and the
square root ensures that E_RMS is measured on the same scale as the target variable t.
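A minimal sketch of the whole experiment (NumPy only; the noise level and random seed are
invented): generate sin(2πx) data as described above, fit polynomials of several orders, and compare
E_RMS on the training and test sets.

import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    # t = sin(2*pi*x) plus Gaussian noise (noise level invented).
    x = np.linspace(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=n)

x_train, t_train = make_data(10)   # N = 10 training points
x_test, t_test = make_data(100)    # a larger test set

def rms_error(poly, x, t):
    # E_RMS = sqrt(2 E(w*) / N) with E(w) = 0.5 * sum of squared errors.
    e = 0.5 * np.sum((poly(x) - t) ** 2)
    return np.sqrt(2 * e / len(x))

for M in (0, 1, 3, 9):
    poly = np.polynomial.Polynomial.fit(x_train, t_train, deg=M)
    print(M, rms_error(poly, x_train, t_train), rms_error(poly, x_test, t_test))
# Expect the training error to shrink as M grows (reaching ~0 at M = 9, which
# interpolates all 10 points) while the test error grows again: over-fitting.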
