Introduction to Machine
Learning
Dr. R. Rekha
Associate Professor
Department of IT,
PSG College of Technology,
Coimbatore
rekha.psgtech@gmail.com
rra.it@psgtech.ac.in
Mobile: 9842163683
21NN03 APPLIED MACHINE LEARNING
3 0 0 3
MATHEMATICAL BASICS AND LEARNING SYSTEM: Definition of learning systems, Goals and applications of machine learning, Probability theory, Statistical
decision theory, Learning versus design, Feasibility of learning, Training versus testing, Labeled versus unlabeled dataset, Error, Noise, Theory of
generalization, Hypothesis class, Vapnik-Chervonenkis (VC) dimension, Bias, Variance, Learning curve, Model selection, Under-fitting and over-fitting, Cross
validation, Concept representation, Function approximation.
(12)
SUPERVISED LEARNING: Learning a class from examples, Learning multiple classes, Dimensions of a supervised machine learning algorithm, Discriminant
functions, Probabilistic generative models, Probabilistic discriminative models, Logistic regression, Linear regression, Perceptron Learning Algorithm.
(12)
UNSUPERVISED AND ENSEMBLE METHODS: Clustering, Expectation maximization (EM) for soft clustering, Semi-supervised learning with EM using
labeled and unlabeled data, Ensemble learning: boosting, bagging, Sampling: Basic sampling methods - Markov Chain Monte Carlo.
(10)
REINFORCEMENT LEARNING: Model free reinforcement learning: Q Learning, Algorithm for learning Q, Convergence, Updating sequences strategies,
Model based learning: Value iteration - Policy iteration, K-Armed bandit - Elements.
(11)
Total L: 45
REFERENCES:
1. Tom Mitchell, “Machine Learning”, McGraw Hill, USA, 2017.
2. Christopher Bishop, “Pattern Recognition and Machine Learning”, Springer, USA, 2011.
3. Suresh Samudrala, "Machine Intelligence: Demystifying Machine Learning, Neural Networks
and Deep Learning", Notion Press, New Delhi, 2019
4. Abu-Mostafa Y S, Magdon-Ismail M and Lin H T, “Learning from Data”, AML Book Publishers,
USA, 2012.
5. Ethem Alpaydin, “Introduction to Machine Learning”, 3rd Edition, PHI Learning Private, USA,
2015.
6. Kevin P. Murphy, “Machine Learning: A Probabilistic Perspective”, MIT Press, USA, 2012.
Theory Courses with no Tutorial Component
(CA: 50% + FE: 50%)
• CA Distribution:
(i) Assignment Presentation 10 Marks
(ii) Objective Tests I (Surprise type) 05 Marks
(iii) Objective Tests II (Surprise type) 05 Marks
(iv) Internal Tests (Average of 2): 30 Marks
• Test I (conducted for 50 marks) 30 Marks
• Test II (conducted for 50 marks) 30 Marks
• Final Examination (FE) 50 Marks
Google classroom code
• q246tgu
• https://classroom.google.com/c/NTU2MDkwNzI2NDM5?cjc=q246tgu
Enormous Opportunities for
Machine Learning
• Virtual Personal Assistants: Siri, Alexa, Google Now
• Predictions while Commuting
• Video Surveillance
• Social Media Services
• Email Spam and Malware Filtering
• Online Customer Support
• Search Engine Result Refining
• Product Recommendations
A Bit of History
What is Machine Learning?
A computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P, if its performance at tasks in T, as
measured by P, improves with experience E.
How do machines learn?
We don’t want to code the logic for our program; instead, we want the machine to figure out the logic from the data on its own.
Payroll of employees in the company:
How do we predict the salary for a person with 8 or 16 years of experience?
if experience <= 10:
    salary = experience * 1.5 * 100000
else:
    salary = experience * 2 * 100000
We don’t do this here! In ML we don’t hand-write if/else rules, because we are not focused on writing the algorithm ourselves.
• Machines find the relation between experience, job level, rare skill and salary.
The factor of 1.5 or 2 used in the previous example is called a weight!
The columns in yellow are called features and the column in red is called the label.
So ML calculates the weights of the features that contribute to deciding the label, based on the algorithm we use!
Salary = Experience * Weight_1 + JobLevel * Weight_2 + Skill * Weight_3
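As a rough sketch of this idea (the data values and column names below are made up for illustration, not taken from the slides), a linear regression model can learn these weights directly from example rows:

# Minimal sketch: learn salary weights from hypothetical example data.
import numpy as np
from sklearn.linear_model import LinearRegression

# Features: [experience (years), job level, rare-skill flag]
X = np.array([
    [2, 1, 0],
    [5, 2, 0],
    [8, 2, 1],
    [12, 3, 1],
    [16, 4, 1],
])
y = np.array([300000, 750000, 1400000, 2400000, 3200000])  # salary (label)

model = LinearRegression().fit(X, y)
print("Learned weights:", model.coef_)      # Weight_1, Weight_2, Weight_3
print("Intercept:", model.intercept_)
print("Prediction for 8 years, level 2, rare skill:", model.predict([[8, 2, 1]]))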
How does Machine Learning work?
Machine Learning ≈ Looking for a
Function
Taxonomy of Machine Learning
STAGES OF MACHINE LEARNING
• Gathering data
• Kaggle and UCI Machine Learning Repository
• Data pre-processing
• 80/20 rule
• Missing data, Noisy data, Inconsistent data
• Numeric, Categorical, Ordinal
• Conversion of data, Ignoring the missing values,
Filling the missing values, Outlier detection
• Researching the model that will be best for the type
of data – Supervised, Unsupervised
• Training and testing the model
– ‘Training data’ ,‘Validation data’ and ‘Testing data’.
• Evaluation – Confusion matrix, Accuracy
True positives : cases in which we predicted TRUE and the actual output is also TRUE, so our prediction is correct.
True negatives : we predicted FALSE and the actual output is also FALSE, so our prediction is correct.
False positives : we predicted TRUE, but the actual output is FALSE.
False negatives : we predicted FALSE, but the actual output is TRUE.
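A quick way to obtain these four counts in practice (the labels below are hypothetical, just to illustrate the call) is scikit-learn's confusion_matrix:

# Sketch: counting TP, TN, FP, FN from hypothetical predictions.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (1 = TRUE, 0 = FALSE)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)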
Machine Learning Framework
Basic mathematics for machine learning
Sources of data:
What is data?
• Data is a collection of facts, such as numbers, words,
measurements, observations or just descriptions of things.
Why Is Data Dirty?
• Incomplete data may come from
• “Not applicable” data value when collected
• Different considerations between the time when the data was collected and when it is analyzed.
• e.g., occupation=“ ”
• Noisy data (incorrect values) may come from
• Faulty data collection instruments
• Human or computer error at data entry
• Errors in data transmission
• e.g., Salary=“-10”
• Inconsistent data may come from
• Different data sources
• e.g., Was rating “1,2,3”, now rating “A, B, C”
How to Handle Incomplete data ?
• Ignore the tuple:
• usually done when class label is missing
• Not effective when the percentage of missing values per attribute varies considerably.
• Fill in the missing value manually:
• tedious + infeasible
• Fill in it automatically with
• a global constant : e.g., “unknown”, a new class?!
• the attribute mean
• the most probable value:
• inference-based such as Bayesian formula or decision tree
(e.g., predict age based on the info)
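For instance, with pandas (the column names and values here are invented for illustration), the options above look like this:

# Sketch: a few ways to handle missing values with pandas (hypothetical data).
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 35],
                   "occupation": ["engineer", None, "teacher", None]})

dropped   = df.dropna()                           # ignore the tuple
constant  = df.fillna({"occupation": "unknown"})  # fill with a global constant
mean_fill = df.fillna({"age": df["age"].mean()})  # fill with the attribute mean
print(dropped, constant, mean_fill, sep="\n\n")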
How to Handle Noisy Data?
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
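A small sketch of equal-frequency binning followed by smoothing by bin means (the values are arbitrary sample data):

# Sketch: equal-frequency binning and smoothing by bin means.
import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(data, 4)                        # 4 equal-frequency bins
smoothed = [np.full(len(b), b.mean()) for b in bins]  # replace each value by its bin mean
print([b.tolist() for b in bins])
print([s.round(1).tolist() for s in smoothed])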
Data Transformation
• Normalization:
• scaled to fall within a small, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
Min max normalization
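The standard formulas behind these three methods (in the usual textbook notation, where A is the attribute being scaled and v is an original value):
• Min-max normalization (to a new range [new_min_A, new_max_A]):
  v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
• Z-score normalization: v' = (v − mean_A) / std_A
• Decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1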
Accuracy is a statistical measure defined as the number of correct predictions made by a classifier divided by the total number of predictions made by the classifier.
The classifier in our example correctly predicted 42 male instances and 32 female instances.
Therefore, the accuracy can be calculated by:
accuracy = (42 + 32) / (42 + 8 + 18 + 32)
which is 0.74
Suppose a spam recognition classifier is described by the following confusion matrix:
Accuracy: (TN + TP) / (TN + TP + FN + FP)
Precision is the ratio of the correctly identified positive cases to all the predicted positive cases
Precision: TP/ (TP + FP)
Recall, also known as sensitivity, is the ratio of the correctly identified positive cases to all the actual positive cases
Recall: TP/ (TP + FN)
Confusion matrix (rows = actual, columns = predicted):
              Predicted Spam   Predicted Ham
Actual Spam         12               14
Actual Ham           0              114

Treating "ham" as the positive class (TP = 114, FP = 14, FN = 0, TN = 12):
precision = 114 / (114 + 14) ≈ 0.89
recall = 114 / (114 + 0) = 1.00
When a spam mail is not recognized as "spam" and is instead presented to us as "ham":
- If the percentage is not too high, it is annoying but not a disaster.
In contrast, when a non-spam message is wrongly labeled as spam, the email will in many cases not be shown, or may even be deleted automatically.
- This carries a high risk of losing customers and friends.
There is a risk of making each type of error in every analysis, and how much of each risk you accept is under your control.
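A small check of these numbers in code, using the matrix above with "ham" as the positive class:

# Sketch: accuracy, precision and recall for the spam/ham matrix above.
TP, FP, FN, TN = 114, 14, 0, 12   # positive class = "ham"

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")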
Symmetric vs. Skewed Data
■ Median, mean and mode of symmetric, positively and negatively
skewed data
Positively and Negatively Correlated Data
Not Correlated Data
PROBABILITY
Probability
• How likely something is to happen.
• Many events can't be predicted with total certainty.
• The best we can say is how likely they are to happen, using the idea of
probability.
Tossing a Coin
• When a coin is tossed, there are two possible outcomes:
• heads (H) or
• tails (T)
• We say that the probability of the coin landing H is ½
• And the probability of the coin landing T is ½
Throwing Dice
• When a single die is thrown, there are six possible
outcomes: 1, 2, 3, 4, 5, 6.
• The probability of any one of them is 1/6
Example: the chances of rolling a "4" with a die
• Number of ways it can happen: 1
• (there is only 1 face with a "4" on it)
• Total number of outcomes: 6
• (there are 6 faces altogether)
• So the probability = 1/6
There are 5 marbles in a bag: 4 are blue, and 1 is red. What is the
probability that a blue marble gets picked?
• Number of ways it can happen: 4 (there are 4 blues)
• Total number of outcomes: 5 (there are 5 marbles in total)
• So the probability = 4/5 = 0.8
Frequentist probability
• The frequentist probability denotes the frequency with which an event occurs across many trials.
• Rolling a die is frequentist: a probability of 1/6 means that out of infinitely
many rolls of the die, a 6 will show up in roughly one sixth of them.
Not all scenarios are frequency related
Bayesian probability
• We say that this event could occur with a certain
probability/certainty.
• Consider the statement — there’s a 32% chance that a diabetic
patient is going to develop heart failure.
• This statement isn’t prone to repetition where we create infinite
replicas of the patient’s symptoms.
• We instead quantify with a 32% certainty that heart failure could
happen.
EVENTS
• A probability event can be defined as a set of outcomes of an experiment.
• The toss of a coin and the throw of a die are examples of random events.
• Example Events:
• Getting a Tail when tossing a coin is an event
• Rolling a "5" is an event.
• Events can be:
• Independent (each event is not affected by other events),
• Dependent (also called "Conditional", where an event is affected by other
events)
• Mutually Exclusive (events can't happen at the same time)
Independent Events
• Events can be "Independent", meaning each event is not affected by
any other events.
• Example: You toss a coin three times and it comes up "Heads" each
time ... what is the chance that the next toss will also be a "Head"?
• The chance is simply 1/2, or 50%, just like ANY OTHER toss of the coin.
• What it did in the past will not affect the current toss!
Dependent Events
• Some events can be "dependent" ... which means they can be affected by
previous events.
• Example: Drawing 2 Cards from a Deck
• Let's look at the chances of getting a King.
• For the 1st card the chance of drawing a King is 4 out of 52
• But for the 2nd card:
• If the 1st card was a King, then the 2nd card is less likely to be a King, as only 3 of the 51
cards left are Kings.
• If the 1st card was not a King, then the 2nd card is slightly more likely to be a King, as 4
of the 51 cards left are Kings.
Mutually Exclusive
• Mutually Exclusive means we can't get both events at the same time.
• Examples:
• Turning left or right are Mutually Exclusive (you can't do both at the same
time)
• Heads and Tails are Mutually Exclusive
MARGINAL PROBABILITY
• It gives the probabilities of various values of the variables
without reference to the values of the other variables
P(Female) = 0.46 which completely
ignores the sport the Female prefers,
P(Rugby) = 0.25 completely ignores
the gender.
Joint Probability
• The Joint probability is a statistical measure that is used to
calculate the probability of two events occurring together at the
same time — P(A and B) or P(A,B).
The joint probability of someone being a male
and liking football is 0.24.
The Joint probability is symmetrical meaning
that P(Male and Football) = P(Football and
Male)
Joint probability
• Find the probability that a candidate has got additional certification and also a good
salary package (counts taken from a table of candidates without vs. with additional certification).
• Ans: 30/105 ≈ 0.286
Conditional probability
• It defines the probability of one event occurring given that another
event has occurred
• If we want to calculate the probability that a person would like
Rugby given that they are a female, we must take the joint
probability that the person is female and likes rugby (P(Female
and Rugby)) and divide it by the probability of the condition.
• P(Female, Rugby) = 0.05
• P(Female) = 0.46
• P(Rugby | Female) = 0.05 / 0.46 = 0.11 (to 2 decimal places).
Bayes' Theorem
• Way of finding a probability when we know certain other
probabilities.
• Which tells us:
• how often A happens given that B happens, written P(A|B),
• When we know:
• how often B happens given that A happens, written P(B|A)
• and how likely A is on its own, written P(A)
• and how likely B is on its own, written P(B)
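Putting these four quantities together gives the usual statement of the theorem:
P(A|B) = P(B|A) × P(A) / P(B)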
EXAMPLE
• Past data tells you that 10% of patients entering your clinic have liver disease. The litmus test says
that “Patient is an alcoholic.” Five percent of the clinic’s patients are alcoholics. You might also
know that among those patients diagnosed with liver disease, 7% are alcoholics. Find the chances
of a person having liver disease given that the person is alcoholic.
• Solution:
• A – Liver disease, B – Alcoholic
• P(Liver disease) = 0.10, P(Alcoholic) = 0.05
• the probability that a patient is alcoholic, given that they have liver disease, is 7%, i.e.
P(Alcoholic | Liver Disease) = 0.07
• Bayes’ theorem tells you:
• P(Liver disease | Alcoholic) = (0.07 × 0.1) / 0.05 = 0.14
• In other words, if the patient is an alcoholic, their chances of having liver disease is 0.14
(14%).
Example
• You are planning a picnic today, but the morning is cloudy
• 50% of all rainy days start off cloudy!
• But cloudy mornings are common (about 40% of days start cloudy)
• And this is usually a dry month (only 3 of 30 days tend to be rainy, or 10%)
• What is the chance of rain during the day?
• Answer: P(Rain | Cloud) = P(Rain) × P(Cloud | Rain) / P(Cloud) = (0.1 × 0.5) / 0.4 = 0.125, so there is a 12.5% chance of rain.
Example
• Can you discover P(Man|Pink)?
Example
• Hunter says she is itchy. There is a test for Allergy to Cats, but
this test is not always right:
• For people that really do have the allergy, the test says
"Yes" 80% of the time
• For people that do not have the allergy, the test says
"Yes" 10% of the time ("false positive")
• If 1% of the population have the allergy, and Hunter's test
says "Yes", what are the chances that Hunter really has the
allergy?
• Solution:
• We want to know the chance of having the allergy when the test says "Yes", written P(Allergy | Yes).
• P(Allergy | Yes) = P(Yes | Allergy) × P(Allergy) / P(Yes) = (0.8 × 0.01) / (0.8 × 0.01 + 0.1 × 0.99) = 0.008 / 0.107 ≈ 0.075, i.e. only about a 7.5% chance.
PROBABILITY DISTRIBUTION
EXAMPLE
Types of Distributions
1. Binomial Distribution
2. Bernoulli Distribution
3. Uniform Distribution
4. Normal Distribution
5. Poisson Distribution
6. Exponential Distribution
BINOMIAL DISTRIBUTION
Which flavor people prefer most??
1. Binomial Distribution
A binomial distribution graph where the probability of success does not equal the probability of failure is skewed (asymmetric).
When the probability of success equals the probability of failure, the graph of the binomial distribution is symmetric and bell-shaped.
EXPONENTIAL
DISTRIBUTION
NORMAL DISTRIBUTION
CENTRAL LIMIT THEOREM
Use of central limit theorem
• Biologists use the central limit theorem whenever they use data
from a sample of organisms to draw conclusions about the
overall population of organisms.
• For example, a biologist may measure the height of 30 randomly
selected plants and then use the sample mean height to estimate
the population mean height.
• If the biologist finds that the sample mean height of the 30 plants is 10.3
inches, then her best guess for the population mean height will also be
10.3 inches.
Surveys
• Human Resources departments often use the central limit theorem when using
surveys to draw conclusions about overall employee satisfaction at companies.
• For example, the HR department of some company may randomly select 50
employees to take a survey that assesses their overall satisfaction on a scale of 1
to 10.
• If it’s found that the average satisfaction among employees in the survey is 8.5
then the best guess for the average satisfaction rating of all employees at the
company is also 8.5.
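A quick simulation of the idea behind these examples (the population distribution and sample size are arbitrary choices for illustration): sample means of even a skewed population pile up around the true mean and look roughly normal.

# Sketch: central limit theorem by simulation, using an exponential population.
import numpy as np

rng = np.random.default_rng(0)
population_mean = 10.0
sample_means = [rng.exponential(population_mean, size=50).mean()
                for _ in range(10_000)]
print("mean of sample means:", round(float(np.mean(sample_means)), 2))  # close to 10
print("std of sample means:", round(float(np.std(sample_means)), 2))    # about 10/sqrt(50)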
Example
• Uniform distribution
• The probability of getting heads in a coin flip is 0.5, and the probability
of tails is also 0.5.
• In the case of a die, the probability of getting a specific number
between 1 and 6 is 1/6 ≈ 0.167.
• In both these examples, the probabilities are uniformly distributed,
which means that each value has the same probability.
• Normal Distribution
• If you look at the distribution of heights within a population, you will find
that some heights are more common than others.
EXAMPLE - binomial distribution
• Banks use the binomial distribution to model the probability that a certain number of credit card
transactions are fraudulent.
• For example, suppose it is known that 2% of all credit card transactions in a certain region are
fraudulent. If there are 50 transactions per day in a certain region, we can use a Binomial Distribution
Calculator to find the probability that more than a certain number of fraudulent transactions occur in a
given day:
• P(X > 1 fraudulent transaction) = 0.26423
• P(X > 2 fraudulent transactions) = 0.07843
• P(X > 3 fraudulent transactions) = 0.01776
• And so on.
• This gives banks an idea of how likely it is that more than a certain number of fraudulent transactions
will occur in a given day.
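These probabilities can be reproduced with scipy's binomial distribution (n = 50 transactions per day, p = 0.02 probability of fraud):

# Sketch: P(X > k) for X ~ Binomial(n=50, p=0.02), matching the values above.
from scipy.stats import binom

n, p = 50, 0.02
for k in (1, 2, 3):
    print(f"P(X > {k}) = {binom.sf(k, n, p):.5f}")
# P(X > 1) = 0.26423, P(X > 2) = 0.07843, P(X > 3) = 0.01776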
Need for Probability distribution
• Probability distributions indicate the likelihood (the chance of
something happening) of an event or outcome.
• Probability distributions are used to determine the risk of certain
outcomes.
• You can use this information to make better decisions
• Probability distributions cannot guarantee an outcome; they only
give the probability of observing any given outcome.
• If the probability distribution is correct, then repeating the same
experiment multiple times will provide results that follow the trend of the
underlying probability distribution.
Let's get some terminology straight: generally, when we say 'a model' we refer to a particular method
for describing how some input data relates to what we are trying to predict. We don't generally refer to
particular instances of that method as different models. So you might say 'I have a linear regression model'
but you wouldn't call two different sets of the trained coefficients different models. At least not in the context of
model selection.
So, when you do K-fold cross validation, you are testing how well your model is able to get trained by
some data and then predict data it hasn't seen. We use cross validation for this because if you train using all
the data you have, you have none left for testing. You could do this once, say by using 80% of the data to
train and 20% to test, but what if the 20% you happened to pick to test happens to contain a bunch of points
that are particularly easy (or particularly hard) to predict? We will not have come up with the best estimate
possible of the model's ability to learn and predict.
We want to use all of the data. So to continue the above example of an 80/20 split, we would do 5-
fold cross validation by training the model 5 times on 80% of the data and testing on 20%. We ensure that
each data point ends up in the 20% test set exactly once. We've therefore used every data point we have to
contribute to an understanding of how well our model performs the task of learning from some data and
predicting some new data.
But the purpose of cross-validation is not to come up with our final model. We don't use these 5 instances of
our trained model to do any real prediction. For that we want to use all the data we have to come up with the
best model possible. The purpose of cross-validation is model checking, not model building.
Now, say we have two models, say a linear regression model and a neural network. How can we say
which model is better? We can do K-fold cross-validation and see which one proves better at predicting the
test set points. But once we have used cross-validation to select the better-performing model, we train that
model (whether it be the linear regression or the neural network) on all the data. We don't use the actual
model instances we trained during cross-validation as our final predictive model.
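A minimal sketch of this model-checking step with scikit-learn (the dataset and the two candidate models are arbitrary stand-ins):

# Sketch: 5-fold cross-validation to compare two candidate models,
# then refitting the chosen one on all of the data.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

candidates = {
    "linear regression": LinearRegression(),
    "neural network": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                                   random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)   # R^2 on each held-out fold
    print(f"{name}: mean CV score = {scores.mean():.3f}")

# Cross-validation was only for checking; the final model is trained on all the data.
final_model = LinearRegression().fit(X, y)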
GENERALIZATION
Underfitting is a scenario in data science where a data model is unable to capture the relationship
between the input and output variables accurately, generating a high error rate on both the
training set and unseen data.
Overfitting is a concept in data science which occurs when a statistical model fits too closely against
its training data. When the model memorizes the noise and fits too closely to the training set, the
model becomes “overfitted,” and it is unable to generalize well to new data.
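A common way to see both failure modes is to fit polynomials of different degrees to the same noisy data (the data below is synthetic, purely for illustration): a low degree underfits, a very high degree overfits.

# Sketch: underfitting vs. overfitting with polynomial fits on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)  # noisy training data

x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)                              # noise-free test data

for degree in (1, 3, 15):
    coeffs = np.polyfit(x, y, degree)
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")
# Degree 1: high error on both sets (underfitting).
# Degree 15: near-zero training error but larger test error (overfitting).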
• When choosing a classifier for your data, an obvious question to
ask is “What kind of data can this classifier classify?”.
• For example, if you know your points can easily be separated
by a single line, you may opt to choose a simple linear
classifier, whereas if you know your points will be in many
separate groups, you may opt to choose a more powerful
classifier such as a random forest or multilayer perceptron.
• This fundamental question can be answered using a
classifier’s VC dimension, that formally quantifies the power of
a classification algorithm.
The VC dimension of a classifier is defined by Vapnik and
Chervonenkis to be the cardinality (size) of the largest set of
points that the classification algorithm can shatter.
In order to have a VC dimension of at least N, a classifier must
be able to shatter a single configuration of N points.
In classification in general, the hypothesis class is the set of possible classification
functions you're considering;
the learning algorithm picks a function from the hypothesis class.
For a decision tree learner, the hypothesis class would just be the set of all possible
decision trees.
• VC dimension is a formal measure of bias.
• The VC dimension of a representation system is defined to be
• the maximum number of datapoints that can be separated (i.e.,
grouped) in all possible ways.
• Another way of saying this is to describe it as the most datapoints that
can be `shattered' by the representation.
• More powerful representations are able to shatter larger sets of
datapoints. These have higher VC dimension.
• Less powerful representations can only shatter smaller sets of
datapoints. These then have lower VC dimension.
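For example, a linear classifier in the plane can shatter a suitable set of 3 points but no set of 4, so its VC dimension is 3. A brute-force sketch of the shattering check (using a perceptron as the linear classifier is an assumption of this illustration, not something stated in the slides):

# Sketch: check whether a linear classifier can shatter a given set of points.
from itertools import product
import numpy as np
from sklearn.linear_model import Perceptron

def can_shatter(points):
    """True if some linear separator realizes every labeling of the points."""
    points = np.asarray(points, dtype=float)
    for labels in product([0, 1], repeat=len(points)):
        if len(set(labels)) < 2:
            continue  # single-class labelings are trivially realizable
        clf = Perceptron(max_iter=1000, tol=None).fit(points, labels)
        if clf.score(points, labels) < 1.0:   # this labeling cannot be separated
            return False
    return True

three_points = [(0, 0), (1, 0), (0, 1)]
four_points = [(0, 0), (1, 1), (1, 0), (0, 1)]   # the XOR labeling is not separable
print(can_shatter(three_points))  # True
print(can_shatter(four_points))   # False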
ERROR
• Error measures are a tool in ML that quantify the question “how
wrong was our estimation”.
• It is a function that compares the output of a learned hypothesis with
the output of the real target function.
• What this means in practice is that we compare the prediction of our
model with the real value in data.
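Two common examples of such error measures, written out as a small sketch (the example vectors are hypothetical):

# Sketch: two common error measures comparing predictions with true values.
import numpy as np

def mean_squared_error(y_true, y_pred):
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def classification_error(y_true, y_pred):
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))

print(mean_squared_error([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))   # 0.833...
print(classification_error([1, 0, 1, 1], [1, 1, 1, 0]))       # 0.5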
Bias-Variance-Noise Tradeoff
• The prediction error of a machine learning model, that is, the difference
between ground truth and the learned model, is traditionally composed of
three parts:
Error = Variance + Bias + Noise
• Here, variance measures the fluctuation of learned functions given different
datasets,
• bias measures the difference between the ground truth and the best possible
function within our modeling space,
• Noisy Target: a target function can have different outputs for two identical
observations, e.g. x₁ = x₂ = (1, 2, 3) but f(x₁) = yes while f(x₂) = no
(two different outputs for x₁ = x₂).
• and noise refers to the irreducible error due to non-deterministic outputs of the
ground truth function itself.
Example: Playing Dice
1. No Features
• Rolling a die is associated with generating a random
number M between one and six, each with an equal probability of about 16.7%.
• From this perspective, the output function (the number rolled) is
completely non-deterministic, and the error is fully characterized by
the noise term.
Error = Noise
• Since we are not using any features, there is no model to learn and
thus no variance.
2. Some Features
• Let’s repeat the same experiment, but this time, let’s record the number N facing
up at the moment the die is released and the height h it is dropped from.
• Based on these two features, we can generate pairs of training data x = (h,
N) and y = M, where M is the result of the die roll.
• Armed with enough training data and a good model, we expect that for two new
inputs h and N our model is able to predict M better than chance.
• For example, if N = 1 and h = 5 cm, our model may predict M = 1 with 60%
confidence and thus show an improvement over our first model, which would
have predicted M = 1 with only about 16.7% confidence.
• Thus, we managed to reduce the overall prediction error by reducing the noise
term. By training a machine learning model we also introduced a bias and
variance term in our overall error.
Error = Variance + Bias + Noise
3. All Features
• Rolling a die is completely deterministic and there is absolutely no
randomness in it.
• As long as we keep track of all relevant quantities such as initial speed,
angular momentum, air resistance, drop height, etc., we can predict the
outcome of the roll with 100% accuracy.
• In this case, we expect that noise is completely eliminated and we are left
with just bias and variance.
Error = Variance + Bias
• If we consider a very complex (and suitable) modeling space, we will have
almost no bias.
• If we further assume a huge amount of training data, our variance will also
be very small. In such a case, our overall prediction error will be close to
zero.
Practical Implications
• The same techniques that reduce bias also reduce noise, and vice
versa.
• In particular, techniques that reduce variance such as collecting
more training samples won’t help reduce noise.
• Adding more features and considering more complex models will
help reduce both noise and bias.