Machine learning and deep learning algorithms
Fundamentals of Deep Learning
Dr. A.Kannan, Former Professor and Head,
Department of Information Science and
Technology, CEG Campus, Anna
University, Chennai-25.
Senior Professor, School of Computer
Science and Engineering,
VIT, Vellore-632014.
.
1
MACHINE LEARNING
• “Learning is making useful changes in our
minds.” - Marvin Minsky.
• “Machine Learning (ML) refers to a system
which has the capability of autonomous
knowledge acquisition and integration of
the acquired knowledge.”
2
MACHINE LEARNING
• Machine learning is an application of
Artificial Intelligence (AI) that provides the
systems with the ability to automatically
learn and improve by themselves from the
experience gained by them without being
explicitly programmed.
• It focuses on the development of intelligent
computer programs that can access the data
and use it for learning by themselves. 3
Applications of ML
• Image Processing – Face Recognition, Hand
written character recognition, Self driving
Cars, Traffic Video analysis….
• Natural Language Processing - Social
Network Analysis, Recommendation
Systems and Sentiment Analysis.
• Medical Diagnosis: Disease Identification,
Prediction on Cancer, Diabetes etc using
past history and current data. 4
Machine Learning Paradigms
• Rote Learning
• Transfer of Learning
• Learning by Taking Advice
• Learning By Analogy
• Un-Supervised Learning (Clustering)
• Supervised Learning – Classification
• Deep Learning
5
Machine Learning Tasks
• Knowledge Representation and Reasoning
• Regression
• Classification
• Clustering
• Dimensionality reduction
• Reinforcement learning (Ranking)
6
AI - Knowledge Based Systems
• Facts
• Rules
• Knowledge base
• Knowledge Based Systems
• Knowledge Representation
• Reasoning and Inference
7
FACTS AND RULES
• Pat is a man = true
• Kumar is the father of Raja = True
• Kumar is the grandfather of Raja = False
• IF marks >=60 Then
Class = FIRST CLASS
8
Rules
• If A then B (whenever A is true, B is also
true)
• A = true (now A is TRUE)
• Inference: B is true now
• Using A → B and A, we can infer B.
• This rule is called Modus Ponens.
• A → B together with A implies B.
9
Inference
• Pat is a man = p
• Pat is a woman = q
• Pat is a man or woman = p v q = true
• Pat is not a woman = ¬q = true
• Inference: p is true
• Pat is a man or woman
• Pat is not a woman.
• Inference: Pat is a man 10
Knowledge Base Vs Database
• Knowledge base: more rules; Database: fewer rules
• Knowledge base: fewer facts; Database: more facts
• Knowledge base: explicit rules and facts; Database: explicit facts and implicit rules
• Knowledge base: updated by experts; Database: updated by clerks
• Knowledge base: main-memory based; Database: disk based
11
• AI Programs – exhibit intelligent behavior by skillful application of heuristics.
• KBS – make domain knowledge explicit.
• Expert Systems – apply expert knowledge to difficult, real-world problems.
12
Knowledge Representation
Techniques
• English or Natural Language
• Tables and Rules
• Logic (Propositional logic, Predicate logic)
• Semantic Networks
• Frames
• Conceptual Dependency
• Scripts
• Ontology 13
LOGIC
• First Order Logic
– Predicate Logic
– Propositional Logic
• Higher Order Logics
– Situational Logic
– Fuzzy Logic
– Temporal Logic
– Modal Logic
– Epistemic Logic
14
Searching
• Depth First (Missionaries and Cannibals)
• Breadth First (Water Jug Problem)
• Hill Climbing ( 8 – Puzzle)
• Best First
• A* Algorithm
• AO* Algorithm
• Mini-Max Algorithm
15
REASONING METHODS
• Reasoning By Analogy – Frames.
• Temporal Reasoning – Higher order logic
(Temporal Logic).
• Fuzzy Reasoning – Higher Order Logic
(Fuzzy Logic).
• Non-monotonic Reasoning – Higher Order
Logics (Non-monotonic Logic).
• Reasoning Agents – Epistemic Logic.
16
AI and ML
• Roughly speaking, AI and ML are good ways
to ask a computer to provide an answer to a
problem based on some past experience.
(Prediction, Learning, Explanation and Finding
Temporal Dependencies)
• It might be challenging to tell a computer what
a cat is, for instance. ( Computers don’t have
common sense – General Problem Solver-
Human Intelligence).
17
AI and ML
• Still, if you show a neural network enough images
of cats and tell it they are cats, then the computer
will be able to correctly identify other cats that it did
not see before.
• It appears that some of the most prominent and
widely used AI and ML algorithms can be sped
up significantly if they are run on quantum
computers (examples: Bayesian networks, graph
search algorithms, shortest-path algorithms,
heuristic search algorithms, and swarm
intelligence algorithms).
18
Learning Methods
• Learning from examples
• Winston’s Program
• Explanation based Learning
• Learning by Observation
• Knowledge Acquisition from experts
19
Machine Learning Methods
• Un-Supervised Learning
• Clustering
• K-means clustering
• Supervised learning
• Classification Algorithms
20
K-means Clustering
• Strengths
– Simple iterative method
– User provides “K”
• Weaknesses
– Often too simple → bad results
– Difficult to guess the correct “K”
21
K-means Clustering
Basic Algorithm:
• Step 0: Select K
• Step 1: Randomly select initial cluster seeds
(Figure: two initial seeds, Seed 1 = 650 and Seed 2 = 200)
22
K-means Clustering
• An initial cluster seed represents the “mean
value” of its cluster.
• In the preceding figure:
– Cluster seed 1 = 650
– Cluster seed 2 = 200
23
K-means Clustering
• Step 2: Calculate distance from each object
to each cluster seed.
• What type of distance should we use?
– Squared Euclidean distance
• Step 3: Assign each object to the closest
cluster
24
K-means Clustering
(Figure: each object assigned to the nearest of Seed 1 and Seed 2)
25
K-means Clustering
• Iterate:
– Calculate distance from objects to cluster
centroids.
– Assign objects to closest cluster
– Recalculate new centroids
• Stop based on convergence criteria
– No change in clusters
– Max iterations
26
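A minimal sketch of the loop just described, assuming NumPy is available; the one-dimensional points are invented for illustration, while the two seeds (650 and 200) follow the preceding figure, and the stopping rule is "no change in clusters".

```python
import numpy as np

def kmeans_1d(points, seeds, max_iter=100):
    """Basic K-means on 1-D data: assign each point to the nearest centroid, recompute, repeat."""
    centroids = np.array(seeds, dtype=float)
    assignment = None
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every centroid
        dist = (points[:, None] - centroids[None, :]) ** 2
        new_assignment = dist.argmin(axis=1)
        if assignment is not None and np.array_equal(assignment, new_assignment):
            break  # no change in clusters -> converged
        assignment = new_assignment
        # Recompute each centroid as the mean of its assigned points
        for k in range(len(centroids)):
            members = points[assignment == k]
            if members.size:
                centroids[k] = members.mean()
    return centroids, assignment

# Hypothetical 1-D data, clustered around the two seeds from the figure
data = np.array([180.0, 210.0, 240.0, 600.0, 640.0, 700.0])
centroids, labels = kmeans_1d(data, seeds=[650, 200])
print(centroids, labels)
```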
Regression
• The regression task comes from Supervised
machine learning.
• It helps us predict (estimate continuous
values) and explain a target quantity based on a
given set of numerical and categorical data.
• For example, we can predict the house
prices based on the house attributes such as
number of rooms, size, and location.
27
Regression
• In mathematical terms, linear regression models
a dataset with a straight line of the form
Y = mX + c.
• Here we take the X (independent variable) and
Y (dependent variable) data points to train the
linear regression model. The best-fitting line is
found by calculating the slope (m) and
y-intercept (c) values. 28
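A short sketch of fitting Y = mX + c by ordinary least squares; the house-size/price numbers below are made up purely for illustration.

```python
import numpy as np

# Hypothetical training data: house size (X, independent) vs. price (Y, dependent)
X = np.array([50.0, 70.0, 90.0, 110.0, 130.0])    # size in square metres
Y = np.array([150.0, 200.0, 240.0, 300.0, 340.0])  # price in thousands

# Least-squares estimates of slope m and intercept c for Y = mX + c
m = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
c = Y.mean() - m * X.mean()

print(f"m = {m:.3f}, c = {c:.3f}")
print("Predicted price for a 100 m^2 house:", m * 100 + c)
```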
Applications
• Risk assessment – Insurance, Banking
• Score prediction – Cricket, Elections
• Market forecasting – Share Market
• Weather forecasting
• Housing and product price prediction
• Analysing engine performance in
Automobiles.
29
Regression Analysis
• Regression analysis is performed with
several algorithms, namely:
• Simple linear regression
• Multiple linear regression
• Decision trees
• Random forest
• Support Vector Machines (SVM)
30
Decision Trees
31
Name Debt Income Married? Risk
--------------------------------------------------------
Joe High High Yes Good
Sue Low High Yes Good
John Low High No Poor
Mary High Low Yes Poor
Fred Low Low Yes Poor
32
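A hedged sketch of training a decision tree on the small risk table above, assuming scikit-learn is available; the categorical values are encoded by hand (High/Yes = 1, Low/No = 0) just for this illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Encode the table: Debt, Income, Married (High/Yes = 1, Low/No = 0)
X = [[1, 1, 1],  # Joe
     [0, 1, 1],  # Sue
     [0, 1, 0],  # John
     [1, 0, 1],  # Mary
     [0, 0, 1]]  # Fred
y = ["Good", "Good", "Poor", "Poor", "Poor"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["Debt", "Income", "Married"]))
```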
Decision Tree Classification
• Example: (Figure: a decision tree that splits on Income = High / Income = Low, dividing the data into partitions D1 and D2.)
33
Decision Trees Classification (cont.)
• Example: (Figure: partition D1 is split further into D1a and D1b, giving a two-level decision tree whose first split is Income = High / Income = Low.)
Random Forest
• Random forest (or random forests) is an
ensemble classifier that consists of many
decision trees and outputs the class that is
the mode of the classes output by the individual
trees. It combines "bagging" with the
random selection of features.
• For many data sets, it produces more
accurate classification results than a single
decision tree. 34
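A brief scikit-learn sketch of the idea, reusing the same toy risk table as the decision-tree example above; each tree is trained on a bootstrap sample with a random subset of features considered at every split, and the forest reports the majority (mode) vote.

```python
from sklearn.ensemble import RandomForestClassifier

# Same toy encoding as before: Debt, Income, Married (High/Yes = 1, Low/No = 0)
X = [[1, 1, 1], [0, 1, 1], [0, 1, 0], [1, 0, 1], [0, 0, 1]]
y = ["Good", "Good", "Poor", "Poor", "Poor"]

# 50 bagged trees, each considering a random subset of features at every split
forest = RandomForestClassifier(n_estimators=50, max_features="sqrt",
                                bootstrap=True, random_state=0).fit(X, y)
print(forest.predict([[1, 1, 0]]))  # majority vote over the individual trees
```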
Bagging
• Bagging, also known as bootstrap
aggregation, is the ensemble learning
method that is commonly used to
reduce variance within a noisy dataset.
• In bagging, a random sample of data in
a training set is selected with
replacement—meaning that the
individual data points can be chosen
more than once. 35
Regression Vs Classification
• The most significant difference between
regression and classification is that
regression predicts a continuous quantity,
while classification predicts discrete class
labels. There are also some overlaps between
the two types of machine learning algorithms.
36
Support Vector Machines
• It is a linear classifier.
• It classifies the dataset into two groups for
binary classification problems.
• The multi-class SVM classifies the data set
into multiple groups.
• It is a supervised learning method.
37
Linear Classifiers
• f(x, w, b) = sign(w · x + b), where a point x is assigned +1 when w · x + b > 0 and -1 when w · x + b < 0.
• (Figure: a set of points labelled +1 and -1.) How would you classify this data?
38
Linear Classifiers
• f(x, w, b) = sign(w · x + b)
• (Figure: the same labelled points.) How will you classify this data?
39
40
Maximum Margin
• f(x, w, b) = sign(w · x + b)
• The maximum margin linear classifier is the linear classifier with the maximum margin.
• This is the simplest kind of SVM (called an LSVM, a Linear SVM).
• Support vectors are those data points that the margin pushes up against.
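A small sketch, assuming scikit-learn, of fitting a maximum-margin linear classifier f(x, w, b) = sign(w · x + b) on toy linearly separable points; the data here is invented for illustration, and a large C approximates a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: class +1 in the upper-right, class -1 in the lower-left
X = np.array([[2.0, 2.0], [3.0, 3.0], [3.0, 2.5],
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

svm = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard-margin LSVM
w, b = svm.coef_[0], svm.intercept_[0]

print("w =", w, "b =", b)
print("Support vectors:", svm.support_vectors_)          # points the margin pushes against
print("sign(w.x + b) for [2.5, 2.5]:", np.sign(w @ [2.5, 2.5] + b))
```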
Probabilistic Models
• Uncertainty
• ABC Murder Story
• Bayesian Classification
• Neural Networks
• Feature Selection and Classification
• Deep Learning
41
Probabilistic Models
• In the probabilistic framework for machine
learning, learning can be thought of
as inferring plausible models to explain
observed data.
• A machine can use such models to make
predictions about future data, and take
decisions that are rational given these
predictions.
42
BAYESIAN CLASSIFICATION
• CONDITIONAL PROBABILITY
• BAYES THEOREM
• NAÏVE BAYES CLASSIFIER
• BELIEF NETWORK
• APPLICATION OF BAYESIAN NETWORK -
CYBER CRIME DETECTION
43
BAYESIAN CLASSIFICATION
• Probabilistic learning: calculates explicit
probabilities for hypotheses; this is among the most
practical approaches to certain types of
learning problems.
• Incremental: Each training example can
incrementally increase/decrease the probability that
a hypothesis is correct. Prior knowledge can be
combined with observed data.
44
BAYESIAN THEOREM
• A special case of Bayesian
Theorem:
P(A∩B) = P(B) x P(A|B)
P(B∩A) = P(A) x P(B|A)
Since P(A∩B) = P(B∩A),
P(B) x P(A|B) = P(A) x P(B|A)
=> P(A|B) = [P(A) x P(B|A)] / P(B)
Expanding P(B) with the law of total probability:
P(A|B) = P(B|A) P(A) / [P(B|A) P(A) + P(B|¬A) P(¬A)]
(Figure: overlapping events A and B.)
45
BAYESIAN THEOREM
• Example 1: A medical cancer diagnosis
problem
There are 2 possible outcomes of a diagnosis:
+ve and -ve. We know 0.8% of the world population has
cancer. The test gives a correct +ve result 98% of the
time and a correct -ve result 97% of the time.
If a patient’s test returns +ve, should we
diagnose the patient as having cancer?
46
BAYESIAN THEOREM
P(cancer) = .008 P(-cancer) = .992
P(+ve|cancer) = .98 P(-ve|cancer) = .02
P(+ve|-cancer) = .03 P(-ve|-cancer) = .97
Using Bayes' formula:
P(cancer|+ve) = P(+ve|cancer) x P(cancer) / P(+ve) = (0.98 x 0.008) / P(+ve) = 0.0078 / P(+ve)
P(-cancer|+ve) = P(+ve|-cancer) x P(-cancer) / P(+ve) = (0.03 x 0.992) / P(+ve) = 0.0298 / P(+ve)
Since 0.0298 > 0.0078, the patient most likely does not have cancer.
47
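The same arithmetic as a small script; the denominator P(+ve) is expanded with the law of total probability so the two posteriors sum to 1.

```python
# Priors and test characteristics from the slide
p_cancer, p_no_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_no_cancer = 0.98, 0.03

# Law of total probability for the evidence P(+ve)
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_no_cancer * p_no_cancer

p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
p_no_cancer_given_pos = p_pos_given_no_cancer * p_no_cancer / p_pos

print(f"P(cancer|+ve)  = {p_cancer_given_pos:.3f}")    # ~0.21
print(f"P(-cancer|+ve) = {p_no_cancer_given_pos:.3f}")  # ~0.79
```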
NAÏVE BAYES CLASSIFIER
• A simplified assumption: attributes are
conditionally independent.
• Greatly reduces the computation cost, only
count the class distribution.
48
NAÏVE BAYES CLASSIFIER
The probabilistic model of the NBC is to find the probability of a
certain class given multiple disjoint (assumed) events.
The naïve Bayes classifier applies to learning tasks where
each instance x is described by a conjunction of attribute
values and where the target function f(x) can take on any
value from some finite set V.
A set of training examples of the target function is provided,
and a new instance is presented, described by the tuple
of attribute values <a1,a2,…,an>. The learner is asked to
predict the target value, or classification, for this new
instance.
49
NAÏVE BAYES CLASSIFIER
Abstractly, the probability model for a classifier is a
conditional model
P(C|F1,F2,…,Fn)
over a dependent class variable C with a small
number of outcomes (classes), conditioned on
several feature variables F1,…,Fn.
Naïve Bayes decision rule:
class = argmax_c [P(C=c) x P(F1|C=c) x P(F2|C=c)
x…x P(Fn|C=c)] / P(F1,F2,…,Fn)
Since P(F1,F2,…,Fn) is common to all classes, we
need not evaluate the denominator for comparisons.
50
NAÏVE BAYES CLASSIFIER
Tennis example
(Table: the PlayTennis training data with attributes Outlook, Temperature, Humidity and Wind, and the class PlayTennis — 9 "yes" and 5 "no" examples.)
51
NAÏVE BAYES CLASSIFIER
• Problem:
Use training data from above to classify the
following instances:
a) <Outlook=sunny, Temperature=cool,
Humidity=high, Wind=strong>
b) <Outlook=overcast, Temperature=cool,
Humidity=high, Wind=strong>
52
NAÏVE BAYES CLASSIFIER
Answer to (a):
P(PlayTennis=yes) = 9/14 = 0.64
P(PlayTennis=no) = 5/14 = 0.36
P(Outlook=sunny|PlayTennis=yes) = 2/9 = 0.22
P(Outlook=sunny|PlayTennis=no) = 3/5 = 0.60
P(Temperature=cool|PlayTennis=yes) = 3/9 = 0.33
P(Temperature=cool|PlayTennis=no) = 1/5 = .20
P(Humidity=high|PlayTennis=yes) = 3/9 = 0.33
P(Humidity=high|PlayTennis=no) = 4/5 = 0.80
P(Wind=strong|PlayTennis=yes) = 3/9 = 0.33
P(Wind=strong|PlayTennis=no) = 3/5 = 0.60
53
NAÏVE BAYES CLASSIFIER
P(yes)xP(sunny|yes)xP(cool|yes)xP(high|yes) x
P(strong|yes) = 0.0053
P(no)xP(sunny|no)xP(cool|no)xP(high|no) x
P(strong|no) = 0.0206
So the class for this instance is ‘no’. We can
normalize the probability:
[0.0206]/[0.0206+0.0053] = 0.795
54
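A few lines that reproduce the computation for instance (a) from the conditional probabilities listed above, then normalize the two scores.

```python
# Conditional probabilities taken directly from the slide (instance a)
p_yes = 9/14 * 2/9 * 3/9 * 3/9 * 3/9   # P(yes)*P(sunny|yes)*P(cool|yes)*P(high|yes)*P(strong|yes)
p_no  = 5/14 * 3/5 * 1/5 * 4/5 * 3/5   # P(no) *P(sunny|no) *P(cool|no) *P(high|no) *P(strong|no)

print(f"score(yes) = {p_yes:.4f}")                          # ~0.0053
print(f"score(no)  = {p_no:.4f}")                           # ~0.0206
print(f"P(no | instance a) ~ {p_no / (p_yes + p_no):.3f}")  # ~0.795
```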
NAÏVE BAYES CLASSIFIER
Answer to (b):
P(PlayTennis=yes) = 9/14 = 0.64
P(PlayTennis=no) = 5/14 = 0.36
P(Outlook=overcast|PlayTennis=yes) = 4/9 = 0.44
P(Outlook=overcast|PlayTennis=no) = 0/5 = 0
P(Temperature=cool|PlayTennis=yes) = 3/9 = 0.33
P(Temperature=cool|PlayTennis=no) = 1/5 = .20
P(Humidity=high|PlayTennis=yes) = 3/9 = 0.33
P(Humidity=high|PlayTennis=no) = 4/5 = 0.80
P(Wind=strong|PlayTennis=yes) = 3/9 = 0.33
P(Wind=strong|PlayTennis=no) = 3/5 = 0.60
55
NAÏVE BAYES CLASSIFIER
Estimating Probabilities:
In the previous example, P(overcast|no) = 0 which
causes the formula-
P(no)xP(overcast|no)xP(cool|no)xP(high|no) x
P(strong|no) = 0.0
This causes problems in comparing because the
other probabilities are not considered. We can
avoid this difficulty by using m-estimate.
56
NAÏVE BAYES CLASSIFIER
M-Estimate Formula:
[c + k] / [n + m] where c/n is the original
probability used before, k=1 and
m= equivalent sample size.
Using this method, our new values of
probability are given below.
57
NAÏVE BAYES CLASSIFIER
New answer to (b):
P(PlayTennis=yes) = 10/16 = 0.63
P(PlayTennis=no) = 6/16 = 0.37
P(Outlook=overcast|PlayTennis=yes) = 5/12 = 0.42
P(Outlook=overcast|PlayTennis=no) = 1/8 = .13
P(Temperature=cool|PlayTennis=yes) = 4/12 = 0.33
P(Temperature=cool|PlayTennis=no) = 2/8 = .25
P(Humidity=high|PlayTennis=yes) = 4/11 = 0.36
P(Humidity=high|PlayTennis=no) = 5/7 = 0.71
P(Wind=strong|PlayTennis=yes) = 4/11 = 0.36
P(Wind=strong|PlayTennis=no) = 4/7 = 0.57
58
NAÏVE BAYES CLASSIFIER
P(yes)xP(overcast|yes)xP(cool|yes)xP(high|yes)xP(strong|yes) = 0.011
P(no)xP(overcast|no)xP(cool|no)xP(high|no)xP(strong|no) = 0.00486
So the class of this instance is ‘yes’.
59
NAÏVE BAYES CLASSIFIER
• The conditional probability values of all the
attributes with respect to the class are
pre-computed and stored on disk.
• This prevents the classifier from computing
the conditional probabilities every time it
runs.
• This stored data can be reused to reduce the
latency of the classifier. 60
Bayesian Belief Networks
• In Naïve Bayes Classifier we make the
assumption of class conditional
independence, that is given the class label
of a sample, the value of the attributes are
conditionally independent of one another.
• However, there can be dependencies
between the values of attributes. To handle
this, we use a Bayesian Belief Network,
which provides the joint conditional
probability distribution. 61
Bayesian Belief Networks
• A Bayesian network is a form of probabilistic
graphical model.
• Specifically, a Bayesian network is a directed acyclic
graph of nodes representing variables and arcs
representing dependence relations among the
variables.
• They provide a graphical method for getting the inferred
results through joint probabilities.
62
63
BAYESIAN BELIEF NETWORK
(Figure: the Cloudy / Sprinkler / Rain / Wet Grass network with its conditional probability tables.)
64
BELIEF NETWORKS
• By the chain rule of probability, the joint
probability of all the nodes in the graph above is:
P(C, S, R, W) = P(C) * P(S|C) * P(R|C) * P(W|S,R)
W=Wet Grass, C=Cloudy, R=Rain, S=Sprinkler
Example: P(W∩-R∩S∩C)
= P(W|S,-R)*P(-R|C)*P(S|C)*P(C)
= 0.9*0.2*0.1*0.5 = 0.009
65
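A small sketch that evaluates the factored joint P(C,S,R,W) = P(C)·P(S|C)·P(R|C)·P(W|S,R) for the event in the example. The figure is not reproduced here, so the full conditional probability tables are assumed (standard sprinkler-network values); the four entries the example actually uses — P(C)=0.5, P(S|C)=0.1, P(R|C)=0.8, P(W|S,¬R)=0.9 — match the numbers on this slide.

```python
# Assumed CPTs for the classic Cloudy/Sprinkler/Rain/WetGrass network
P_C = {True: 0.5, False: 0.5}
P_S_given_C = {True: 0.1, False: 0.5}          # P(S=true | C)
P_R_given_C = {True: 0.8, False: 0.2}          # P(R=true | C)
P_W_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.9, (False, False): 0.0}  # P(W=true | S, R)

def joint(c, s, r, w):
    """P(C=c, S=s, R=r, W=w) via the chain rule over the DAG."""
    pc = P_C[c]
    ps = P_S_given_C[c] if s else 1 - P_S_given_C[c]
    pr = P_R_given_C[c] if r else 1 - P_R_given_C[c]
    pw = P_W_given_SR[(s, r)] if w else 1 - P_W_given_SR[(s, r)]
    return pc * ps * pr * pw

# The slide's example: P(W, not R, S, C) = 0.9 * 0.2 * 0.1 * 0.5 = 0.009
print(joint(c=True, s=True, r=False, w=True))
```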
BAYESIAN BELIEF NETWORK
What is the probability of wet grass on a given
day - P(W)?
P(W) = P(W|SR) * P(S) * P(R) +
P(W|S-R) * P(S) * P(-R) +
P(W|-SR) * P(-S) * P(R) +
P(W|-S-R) * P(-S) * P(-R)
Here P(S) = P(S|C) * P(C) + P(S|-C) * P(-C)
P(R) = P(R|C) * P(C) + P(R|-C) * P(-C)
P(W)= 0.5985
66
Advantages of Bayesian Approach
• Bayesian networks can readily handle
incomplete data sets.
• Bayesian networks allow one to learn
about causal relationships
• Bayesian networks readily facilitate use of
prior knowledge.
68
ML Resources (Books)
1. Stephen Marsland, “Machine Learning –
An Algorithmic Perspective”, Second
Edition, Chapman and Hall/CRC Machine
Learning and Pattern Recognition Series,
2014.
2. Tom M. Mitchell, “Machine Learning”,
First Edition, McGraw Hill Education,
2013.
69
ML Resources (Books)
3. Nilsson, N. (2004). Introduction to Machine
Learning.
http://robotics.stanford.edu/people/nilsson/
mlbook.html.
4. Russell, S. (1997). Machine Learning.
Handbook of Perception and Cognition,
Vol. 14, Chap. 4.
5. Ethem Alpaydin, “Introduction to Machine
Learning”, (Adaptive Computation and
Machine Learning Series), Third Edition,
MIT Press, 2014. 70
Journals - IEEE
• IEEE Transactions on Neural Networks.
• IEEE Transactions on Pattern Analysis
and Machine Intelligence.
• IEEE Transactions on Neural Networks and
Learning Systems.
• IEEE Transactions on Artificial Intelligence
• IEEE Transactions on Knowledge and Data
Engineering.
71
ML Journals - Elsevier
• Machine Learning with Applications
• Expert Systems With Applications
• Applied Soft Computing
• Knowledge-based Systems
• Neural Networks
• Data & Knowledge Engineering
• Artificial Intelligence
72
Neural Networks
• Biologically motivated approach to machine learning.
• Similarity with a biological network: the fundamental processing element of a neural network is a neuron, which
1. receives inputs from other sources,
2. combines them in some way,
3. performs a generally nonlinear operation on the result, and
4. outputs the final result.
73
Similarity with Biological Network
• Fundamental processing element of a
neural network is a neuron
• A human brain has 100 billion neurons
• An ant brain has 250,000 neurons 74
Neural Network
• Neural Network is a set of connected
INPUT/OUTPUT UNITS, where each connection
has a WEIGHT associated with it.
• Neural Network learning is also called CONNECTIONIST
learning due to the connections between units.
• It is a case of SUPERVISED (classification)
learning.
75
Neural Network
• Neural Network learns by adjusting the weights so
as to be able to correctly classify the training data
and hence, after testing phase, to classify unknown
data.
• A Neural Network needs a long time for training.
• Neural Network has a high tolerance to noisy and
incomplete data.
76
Neural Network Classifier
• Input: Classification data
It contains classification attribute
• Data is divided, as in any classification problem.
[Training data and Testing data]
• All data must be normalized
(i.e. all attribute values in the database are rescaled to the interval [0,1] or [-1,1]).
A Neural Network can work with data in the range (0,1) or (-1,1).
77
One Neuron as a Network
– The neuron receives the weighted sum as input and calculates the
output as a function of input as follows :
• y = f(x) , where f(x) is defined as
• f(x) = 0 { when x< 0.5 } and f(x) = 1 {
when x >= 0.5 }
• For example, if x is 0.55, then y = 1; the
input values are classified into class 1.
• If x = 0.45, f(x) = 0; the input values
are classified into class 0.
78
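A one-neuron sketch of the rule just described: a weighted sum of the inputs followed by the 0.5 threshold function. The input and weight values are chosen only so the weighted sums come out to 0.55 and 0.45 as in the text.

```python
def step(x, threshold=0.5):
    """Threshold activation: f(x) = 1 if x >= 0.5 else 0."""
    return 1 if x >= threshold else 0

def neuron(inputs, weights):
    """Weighted sum of the inputs, then the threshold function."""
    x = sum(i * w for i, w in zip(inputs, weights))
    return step(x)

print(neuron([0.5, 0.6], [0.5, 0.5]))  # x = 0.55 -> class 1
print(neuron([0.5, 0.4], [0.5, 0.5]))  # x = 0.45 -> class 0
```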
Bias of a Neuron
• We need a bias value added to the weighted sum
∑ wi xi so that the decision boundary can be shifted away from the origin:
v = ∑ wi xi + b, where b is the bias.
(Figure: decision lines x1 - x2 = -1, x1 - x2 = 0 and x1 - x2 = 1 in the (x1, x2) plane.)
79
Bias as extra input
(Figure: input attribute values x1, …, xm with weights w1, …, wm, plus an extra input x0 = +1 with weight w0 = b, feed a summing function followed by an activation function φ(v) that produces the output class y.)
v = ∑_{j=0}^{m} wj xj, with x0 = +1 and w0 = b
80
Neuron with Activation
• The neuron is the basic information processing unit of a NN. It
consists of:
1. A set of links, describing the neuron inputs, with weights w1,
w2, …, wm.
2. An adder function (linear combiner) for computing the
weighted sum of the inputs (real numbers): u = ∑_{j=1}^{m} wj xj
3. An activation function for limiting the amplitude of the
neuron output: y = φ(u + b)
81
A Multilayer Feed-Forward
Neural Network
(Figure: input nodes receive the input record xi; hidden nodes produce outputs Oj and output nodes produce outputs Ok, which form the output class. The weights are wij and wjk, and the network is fully connected.)
82
Neural Network Learning
• The inputs are fed simultaneously into the input
layer.
• The weighted outputs of these units are fed into
hidden layer.
• The weighted outputs of the last hidden layer are
inputs to units making up the output layer.
83
A Multilayer Feed Forward Network
• The units in the hidden layers and output layer are
sometimes referred to as neurodes, due to their symbolic
biological basis, or as output units.
• A network containing two hidden layers is called a three-
layer neural network, and so on.
• The network is feed-forward in that none of the weights
cycles back to an input unit or to an output unit of a
previous layer.
84
A Multilayered Feed – Forward Network
• INPUT: records without class attribute with normalized
attributes values.
• INPUT VECTOR: X = { x1, x2, …. xn}
where n is the number of (non class) attributes.
• INPUT LAYER – there are as many nodes as non-class
attributes i.e. as the length of the input vector.
• HIDDEN LAYER – the number of nodes in the hidden
layer and the number of hidden layers depends on
implementation.
85
A Multilayered Feed–Forward
Network
• OUTPUT LAYER – corresponds to the
class attribute.
• There are as many nodes as classes (values
of the class attribute): Ok, k = 1, 2, …, #classes.
• Network is fully connected, i.e. each unit provides input
to each unit in the next forward layer.
86
Classification by Back propagation
• Back Propagation learns by iteratively processing a
set of training data (samples).
• For each sample, weights are modified to
minimize the error between network’s
classification and actual classification.
87
Steps in Back propagation
Algorithm
• STEP ONE: initialize the weights and biases.
• The weights in the network are initialized to random
numbers from the interval [-1,1].
• Each unit has a BIAS associated with it
• The biases are similarly initialized to random
numbers from the interval [-1,1].
• STEP TWO: feed the training sample.
88
Steps in Back propagation Algorithm
( cont..)
• STEP THREE: Propagate the inputs forward; we
compute the net input and output of each unit in
the hidden and output layers.
• STEP FOUR: back propagate the error.
• STEP FIVE: update weights and biases to reflect
the propagated errors.
• STEP SIX: terminating conditions.
89
Propagation through Hidden Layer
( One Node )
• The inputs to unit j are outputs from the previous layer. These are
multiplied by their corresponding weights in order to form a weighted
sum, which is added to the bias associated with unit j.
• A nonlinear activation function f is applied to the net input.
(Figure: an input vector x = (x0, x1, …, xn) is multiplied by the weight vector w = (w0j, w1j, …, wnj); the weighted sum plus the bias of unit j is formed, and the nonlinear activation function f produces the output y.)
90
Propagate the inputs forward
• For unit j in the input layer, its output is
equal to its input, that is, Oj = Ij for input unit j.
• The net input to each unit in the hidden and output layers is
computed as follows. Given a unit j in a hidden or output layer, the net input is
Ij = ∑_i wij Oi + θj
where wij is the weight of the connection from unit i in the
previous layer to unit j, Oi is the output of unit i from the
previous layer, and θj is the bias of the unit.
91
Propagate the inputs forward
• Each unit in the hidden and output layers takes its net
input and then applies an activation function. The
function symbolizes the activation of the neuron
represented by the unit. It is also called a logistic,
sigmoid, or squashing function.
• Given a net input Ij to unit j, the output of unit j is computed as
Oj = f(Ij) = 1 / (1 + e^(-Ij))
92
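A small NumPy sketch of this forward step: the net input Ij = Σ_i wij Oi + θj followed by the sigmoid Oj = 1 / (1 + e^(-Ij)). The weights and biases here reuse the worked example that appears a few slides later, so the two hidden outputs come out to 0.332 and 0.525.

```python
import numpy as np

def sigmoid(I):
    """Logistic (squashing) activation O = 1 / (1 + e^-I)."""
    return 1.0 / (1.0 + np.exp(-I))

def forward_layer(O_prev, W, theta):
    """Net input I_j = sum_i w_ij * O_i + theta_j, then the sigmoid output."""
    I = O_prev @ W + theta
    return sigmoid(I)

# One hidden layer with 2 units fed by 3 inputs
O_input = np.array([1.0, 0.0, 1.0])
W = np.array([[0.2, -0.3],
              [0.4,  0.1],
              [-0.5, 0.2]])
theta = np.array([-0.4, 0.2])
print(forward_layer(O_input, W, theta))   # outputs of the two hidden units
```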
Back propagate the error
• When reaching the Output layer, the error is
computed and propagated backwards.
• For a unit k in the output layer the error is computed
by the formula:
Errk = Ok (1 - Ok) (Tk - Ok)
where Ok is the actual output of unit k (computed by the activation
function Ok = 1 / (1 + e^(-Ik))), Tk is the true output based on the known
class label of the training sample, and Ok (1 - Ok) is the derivative
(rate of change) of the activation function.
93
Back propagate the error
• The error is propagated backwards by updating
weights and biases to reflect the error of the network
classification .
• For a unit j in the hidden layer the error is computed
by the formula:
Errj = Oj (1 - Oj) ∑_k Errk wjk
where wjk is the weight of the connection from unit j to unit k in
the next higher layer, and Errk is the error of unit k.
94
Update weights and biases
• Weights are updated by the following equations,
where l is a constant between 0.0 and 1.0 reflecting
the learning rate; this learning rate is fixed for the
implementation:
Δwij = (l) Errj Oi
wij = wij + Δwij
• Biases are updated by the following equations:
Δθj = (l) Errj
θj = θj + Δθj
95
Update weights and biases
• We are updating weights and biases after the presentation
of each sample. This is called case updating.
• Epoch: one iteration through the training set is called an epoch.
• Epoch updating: alternatively, the weight and bias increments could be
accumulated in variables and the weights and biases updated after
all of the samples of the training set have been presented.
• Case updating is more accurate.
96
Terminating Conditions
• Training stops when:
• all Δwij in the previous epoch are below some threshold, or
• the percentage of samples misclassified in the previous epoch
is below some threshold, or
• a pre-specified number of epochs has expired.
• In practice, several hundreds of thousands of epochs may be
required before the weights converge.
97
Backpropagation Formulas
(Figure: the input vector xi enters the input nodes; weights wij connect them to the hidden nodes and further weights connect the hidden nodes to the output nodes, producing the output vector.)
Ij = ∑_i wij Oi + θj
Oj = 1 / (1 + e^(-Ij))
Errk = Ok (1 - Ok) (Tk - Ok)
Errj = Oj (1 - Oj) ∑_k Errk wjk
wij = wij + (l) Errj Oi
θj = θj + (l) Errj
98
Example of Back propagation
Initial input and weights (initialize weights with random numbers from -1.0 to 1.0; input = 3, hidden neurons = 2, output = 1):
x1 x2 x3 w14 w15 w24 w25 w34 w35 w46 w56
1 0 1 0.2 -0.3 0.4 0.1 -0.5 0.2 -0.3 -0.2
99
Example ( cont.. )
• Bias added to hidden and output nodes.
• Initialize biases with random values from -1.0 to 1.0:
θ4 = -0.4, θ5 = 0.2, θ6 = 0.1
100
Net Input and Output Calculation
Unit 4: I4 = 0.2 + 0 - 0.5 - 0.4 = -0.7, O4 = 1 / (1 + e^0.7) = 0.332
Unit 5: I5 = -0.3 + 0 + 0.2 + 0.2 = 0.1, O5 = 1 / (1 + e^-0.1) = 0.525
Unit 6: I6 = (-0.3)(0.332) - (0.2)(0.525) + 0.1 = -0.105, O6 = 1 / (1 + e^0.105) = 0.475
101
Calculation of Error at Each Node
Unit 6: Err6 = 0.475 (1 - 0.475)(1 - 0.475) = 0.1311 (we assume the target T6 = 1)
Unit 5: Err5 = 0.525 (1 - 0.525)(0.1311)(-0.2) = -0.0065
Unit 4: Err4 = 0.332 (1 - 0.332)(0.1311)(-0.3) = -0.0087
102
Calculation of Weight and Bias Updating (learning rate l = 0.9)
w46: -0.3 + 0.9 (0.1311)(0.332) = -0.261
w56: -0.2 + 0.9 (0.1311)(0.525) = -0.138
w14: 0.2 + 0.9 (-0.0087)(1) = 0.192
w15: -0.3 + 0.9 (-0.0065)(1) = -0.306
θ6: 0.1 + 0.9 (0.1311) = 0.218
(the remaining weights and biases are updated similarly)
103
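A compact NumPy sketch that reproduces the whole worked example: one forward pass, the output and hidden errors, and one case update with learning rate 0.9. Small differences in the last digit come from rounding in the slides.

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))
l = 0.9                                    # learning rate
x = np.array([1.0, 0.0, 1.0]); T6 = 1.0    # training sample and target

W_ih = np.array([[0.2, -0.3], [0.4, 0.1], [-0.5, 0.2]])  # w14,w15 / w24,w25 / w34,w35
W_ho = np.array([-0.3, -0.2])                             # w46, w56
theta_h = np.array([-0.4, 0.2]); theta_o = 0.1            # biases of units 4, 5 and 6

# Forward pass
O_h = sigmoid(x @ W_ih + theta_h)          # O4 = 0.332, O5 = 0.525
O6 = sigmoid(O_h @ W_ho + theta_o)         # 0.475

# Back-propagate the error
Err6 = O6 * (1 - O6) * (T6 - O6)           # 0.1311
Err_h = O_h * (1 - O_h) * Err6 * W_ho      # Err4 = -0.0087, Err5 = -0.0065

# Case update of weights and biases
W_ho += l * Err6 * O_h                     # w46 -> -0.261, w56 -> -0.138
W_ih += l * np.outer(x, Err_h)             # w14 -> 0.192, w15 -> -0.306, ...
theta_o += l * Err6                        # theta6 -> 0.218
theta_h += l * Err_h

print(O_h, O6, Err6, Err_h)
print(W_ho, theta_o)
```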
DEEP LEARNING
• Deep learning is a subset of ML in
AI that has networks capable of
learning unsupervised from data that is
unstructured or unlabeled.
• Deep Learning is a subfield of Machine
Learning that involves the use of neural
networks to model and solve complex
problems.
104
DEEP LEARNING
• The key characteristic of Deep Learning is the use
of deep neural networks, which have multiple
layers of interconnected nodes.
• These networks can learn complex
representations of data by discovering hierarchical
patterns and features in the data.
• Deep Learning algorithms can automatically learn
and improve from data without the need for
manual feature engineering.
105
DEEP LEARNING
• Deep Learning has achieved significant success in
various fields, including image recognition,
natural language processing, speech recognition,
and recommendation systems.
• Some of the popular Deep Learning architectures
include Convolutional Neural Networks (CNNs),
Recurrent Neural Networks (RNNs), and Deep
Belief Networks (DBNs).
106
DEEP LEARNING
• Training deep neural networks typically
requires a large amount of data and
computational resources.
• However, the availability of cloud
computing and the development of
specialized hardware, such as Graphics
Processing Units (GPUs), has made it easier
to train deep neural networks.
107
Convolutional Neural Networks
• A Convolutional Neural Network (CNN) is
a type of deep learning algorithm that is
particularly well-suited for image
recognition and processing tasks.
• It is made up of multiple layers, including
convolutional layers, pooling layers, and
fully connected layers.
108
Convolutional Neural Networks
• The convolutional layers are the key component of a CNN,
where filters are applied to the input image to extract
features such as edges, textures, and shapes.
• The output of the convolutional layers is then passed
through pooling layers, which are used to down-sample the
feature maps, reducing the spatial dimensions while
retaining the most important information.
• The output of the pooling layers is then passed through one
or more fully connected layers, which are used to make a
prediction or classify the image.
109
CNN
• There are certain steps/operations involved
in a CNN. These can be categorized as
follows:
• Convolution operation
• Pooling
• Flattening
• Fully connected layers
110
CNN
111
CNN
• The convolution operation is the first and one
of the most important steps in the
functioning of a CNN. It focuses on
extracting/preserving important features from
the input (an image, etc.).
112
CNN
• To understand this operation, let us consider
an image as input to our CNN.
• When an image is given as input, it is
in the form of a matrix of pixels.
• If the image is grayscale, the image is
a single matrix, where each value
ranges from 0 to 255.
113
CNN
• We can even normalize these values, say to the
range 0-1, where 0 represents black and 1
represents white.
• If the image is colored, there are three matrices
representing the R, G and B channels, each
value in the range 0-255. The same can be seen
in the images below:
114
Fig 1: Colored image matrices
115
Fig 2: Grayscale image matrix
116
Convolution Operation
• MATHEMATICAL OPERATION
Coming to convolution operation, let us
consider an input image. Now for
convolution operation, filters or kernels are
used.
117
Convolution Operation
• The following mathematical operation is
performed:
Let the size of the image be N x N and the size of the filter be F x F.
Then (N x N) * (F x F) = (N - F + 1) x (N - F + 1),
where * is the convolution operation.
All these kernels, input channels, etc. are hyper-parameters. The result of each layer is passed on
to the next one.
118
Example
119
Example
• Here, an input of size 6x6 is given and a kernel
of size 3x3 is used. The feature map
obtained is of size 4x4.
To increase non-linearity in the image, a
rectifier function can be applied to the
feature map.
Finally, after the convolution step is
completed and the feature map is obtained, this
map is given as input to the pooling layer. 120
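A short sketch of the valid convolution just described: a 6x6 input and a 3x3 kernel produce a (6-3+1) x (6-3+1) = 4x4 feature map, followed by a rectifier (ReLU) as mentioned above. The input and kernel values here are arbitrary, chosen only to make the shapes concrete.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid 2-D convolution (cross-correlation, as used in CNNs): output is (N-F+1) x (N-F+1)."""
    N, F = image.shape[0], kernel.shape[0]
    out = np.zeros((N - F + 1, N - F + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+F, j:j+F] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)                   # arbitrary 6x6 "pixels"
kernel = np.array([[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]])   # simple edge filter

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)                 # (4, 4)
relu = np.maximum(feature_map, 0)        # rectifier adds non-linearity
print(relu)
```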
CNN Architecture
• A common CNN model architecture is to
have a number of convolution and pooling
layers stacked one after the other.
121
CNN Architecture
122
Pooling Layers
• Pooling layers are used to reduce the
dimensions of the feature maps.
• Thus, it reduces the number of parameters
to learn and the amount of computation
performed in the network.
123
Pooling Layers
• The pooling layer summarizes the features present
in a region of the feature map generated by a
convolution layer.
• So, further operations are performed on
summarized features instead of precisely
positioned features generated by the convolution
layer.
• This makes the model more robust to variations in
the position of the features in the input image.
124
Max Pooling
• Types of pooling layers: max, min and
average pooling.
Max Pooling
• Max pooling is a pooling operation that selects the
maximum element from the region of the feature
map covered by the filter.
• Thus, the output after max-pooling layer would be
a feature map containing the most prominent
features of the previous feature map.
125
Max Pooling
126
Average Pooling
• Average pooling computes the average of
the elements present in the region of feature
map covered by the filter.
• Thus, while max pooling gives the most
prominent feature in a particular patch of
the feature map, average pooling gives the
average of features present in a patch.
127
Average Pooling
128
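A sketch of 2x2 max and average pooling with stride 2; the 4x4 feature map below is an arbitrary example, not taken from the figures.

```python
import numpy as np

def pool2d(fmap, size=2, stride=2, mode="max"):
    """Down-sample a feature map by taking the max (or mean) of each size x size window."""
    H, W = fmap.shape
    out_h, out_w = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = fmap[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fmap = np.array([[1., 3., 2., 4.],
                 [5., 6., 1., 2.],
                 [7., 2., 8., 1.],
                 [3., 4., 2., 9.]])
print(pool2d(fmap, mode="max"))       # most prominent feature per 2x2 patch
print(pool2d(fmap, mode="average"))   # average of the features per 2x2 patch
```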
Flattening
• Flattening is converting the data into a 1-
dimensional array for inputting it to the next
layer.
• We flatten the output of the convolutional
layers to create a single long feature vector.
• And it is connected to the final
classification model, which is called a fully-
connected layer.
129
CNN
• In other words, we put all the pixel data in
one line and make connections with the
final layer.
• And once again: what is the final layer for?
The classification, for example of ‘cats and dogs.’
130
Flattening
131
Recurrent Neural Network
• Recurrent Neural Network(RNN) is a type of Neural
Network where the output from the previous step is fed as
input to the current step.
• In traditional neural networks, all the inputs and outputs
are independent of each other.
• But in cases where it is required to predict the next word of
a sentence (as in NLP), the previous words are required.
• Hence, there is a need to remember the previous words.
132
Recurrent Neural Network
• Thus RNN came into existence, which
solved this issue with the help of a Hidden
Layer.
• The main and most important feature of
RNN is its Hidden state, which remembers
some information about a sequence.
133
Recurrent Neural Network
• The state is also referred to as Memory
State since it remembers the previous input
to the network.
• It uses the same parameters for each input
as it performs the same task on all the inputs
or hidden layers to produce the output.
• This reduces the complexity of parameters,
unlike other neural networks.
134
RNN
135
How RNN works
• The Recurrent Neural Network consists of
multiple fixed activation function units, one
for each time step.
• Each unit has an internal state which is
called the hidden state of the unit.
• This hidden state signifies the past
knowledge that the network currently holds
at a given time step.
136
RNN
• This hidden state is updated at every time
step to signify the change in the knowledge
of the network about the past.
• The hidden state is updated using the
following recurrence relation:-
137
RNN
• The formula for calculating the current state is
ht = f(ht-1, xt)
where:
ht -> current state
ht-1 -> previous state
xt -> input state
138
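A minimal sketch of that recurrence using the common tanh form h_t = tanh(W_xh x_t + W_hh h_{t-1} + b); the weight matrices and the toy input sequence are random placeholders, not values from the slides. Note that the same parameters are reused at every time step, as described below.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3

# Shared parameters, reused at every time step
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """h_t = f(h_{t-1}, x_t): update the hidden (memory) state from the previous state and input."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden_dim)                    # initial hidden state
sequence = rng.normal(size=(5, input_dim))  # a toy sequence of 5 inputs
for x_t in sequence:
    h = rnn_step(x_t, h)                    # same parameters at every step
print(h)                                    # final state, used to compute the output
```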
RNN
• Training through RNN
• A single-time step of the input is provided
to the network.
• Then it calculates its current state using a
set of current input and the previous state.
• The current ht becomes ht-1 for the next
time step.
139
RNN
• One can go through as many time steps as the
problem requires and join the information from
all the previous states.
• Once all the time steps are completed the
final current state is used to calculate the
output.
140
RNN
• The output is then compared to the actual
output, i.e., the target output, and the error is
generated.
• The error is then back-propagated through the
network to update the weights, and hence the
network (RNN) is trained
using Backpropagation Through Time.
141
Conclusions
• Machine Learning
• Supervised and Unsupervised Learning
• Neural Networks
• CNN
• RNN
142

Machine learning and deep learning algorithms

  • 1.
    Fundamentals of DeepLearning Dr. A.Kannan, Former Professor and Head, Department of Information Science and Technology, CEG Campus, Anna University, Chennai-25. Senior Professor, School of Computer Science and Engineering, VIT, Vellore-632014. . 1
  • 2.
    MACHINE LARNING • “Learningis making useful changes in our minds.” - Marvin Minsky. • “Machine Learning (ML) refers to a system which has the capability of autonomous knowledge acquisition and integration of the acquired knowledge.” 2
  • 3.
    MACHINE LARNING • Machinelearning is an application of Artificial Intelligence (AI) that provides the systems with the ability to automatically learn and improve by themselves from the experience gained by them without being explicitly programmed. • It focuses on the development of intelligent computer programs that can access the data and use it for learning by themselves. 3
  • 4.
    Applications of ML •Image Processing – Face Recognition, Hand written character recognition, Self driving Cars, Traffic Video analysis…. • Natural Language Processing - Social Network Analysis, Recommendation Systems and Sentiment Analysis. • Medical Diagnosis: Disease Identification, Prediction on Cancer, Diabetes etc using past history and current data. 4
  • 5.
    Machine Learning Paradigms •Rote Learning • Transfer of Learning • Learning by Taking Advice • Learning By Analogy • Un-Supervised Learning (Clustering) • Supervised Learning – Classification • Deep Learning 5
  • 6.
    Machine Learning Tasks •Knowledge Representation and Reasoning • Regression • Classification • Clustering • Dimensionality reduction • Reinforcement learning (Ranking) 6
  • 7.
    AI - KnowledgeBased Systems • Facts • Rules • Knowledge base • Knowledge Based Systems • Knowledge Representation • Reasoning and Inference 7
  • 8.
    FACTS AND RULES •Pat is a man = true • Kumar is the father of Raja=True • Kumar is the grand father of Raja= False • IF marks >=60 Then Class = FIRST CLASS 8
  • 9.
    Rules • If Athen B (Whenever A is true then B is also true) • A = true (now A is TRUE) • Inference: B is true now • Using AB and A, we can infer B. • This rule is called Modus Ponens. • AB, A implies B. 9
  • 10.
    Inference • Pat isa man = p • Pat is a woman = q • Pat is a man or woman = p v q = true • Pat is not a woman = 7q = true • Inference: p is true • Pat is a man or woman • Pat is not a woman. • Inference: Pat is a man 10
  • 11.
    Knowledge Base VsDatabase More Rules Less Rules Facts Less More Facts Explicit Rules and Facts Explicit Facts and Implicit Rules Experts update Clerks update Main Memory Based Disk Based 11
  • 12.
    AI Programs -Exhibit Intelligent Behavior by skillful application of heuristics. KBS – Make domain knowledge explicit Expert Systems – Apply expert knowledge to difficult, Real world problems. 12
  • 13.
    Knowledge Representation Techniques • Englishor Natural Language • Tables and Rules • Logic (Propositional logic, Predicate logic) • Semantic Networks • Frames • Conceptual Dependency • Scripts • Ontology 13
  • 14.
    LOGIC • First OrderLogic – Predicate Logic – Propositional Logic • Higher Order Logics – Situational Logic – Fuzzy Logic – Temporal Logic – Modal Logic – Epistemic Logic 14
  • 15.
    Searching • Depth First(Missionaries and Cannibals) • Breadth First (Water Jug Problem) • Hill Climbing ( 8 – Puzzle) • Best First • A* Algorithm • AO* Algorithm • Mini-Max Algorithm 15
  • 16.
    REASONING METHODS • ReasoningBy Analogy – Frames. • Temporal Reasoning – Higher order logic (Temporoal Logic). • Fuzzy Reasoning – Higher Order Logic (Fuzzy Logic). • Non-monotonic Reasoning – Higher Order Logics (Non-monotonic Logic). • Reasoning Agents – Epistemic Logic. 16
  • 17.
    AI and ML •Roughly speaking, AI and ML are good ways to ask a computer to provide an answer to a problem based on some past experience. (Prediction, Learning, Explanation and Finding Temporal Dependencies) • It might be challenging to tell a computer what a cat is, for instance. ( Computers don’t have common sense – General Problem Solver- Human Intelligence). 17
  • 18.
    AI and ML •Still, if you show a neural network enough images of cats and tell it they are cats, then the computer will be able to correctly identify other cats that it did not see before. • It appears that some of the most prominent and widely used AI and ML algorithms can be speeded- up significantly if they are run on quantum computers. (Example: Bayesian Networks, Graph Search Algorithms- Shortest path algorithms – Heuristic Search Algorithms- and Swarm Intelligence algorithms …). 18
  • 19.
    Learning Methods • Learningfrom examples • Winston’s Program • Explanation based Learning • Learning by Observation • Knowledge Acquisition from experts 19
  • 20.
    Machine Learning Methods •Un-Supervised Learning • Clustering • K-means clustering • Supervised learning • Classification Algorithms 20
  • 21.
    K-means Clustering • Strengths –Simple iterative method – User provides “K” • Weaknesses – Often too simple  bad results – Difficult to guess the correct “K” 21
  • 22.
    K-means Clustering Basic Algorithm: •Step 0: Select K • Step 1: Randomly select initial cluster seeds Seed 1 650 Seed 2 200 22
  • 23.
    K-means Clustering • Aninitial cluster seed represents the “mean value” of its cluster. • In the preceding figure: – Cluster seed 1 = 650 – Cluster seed 2 = 200 23
  • 24.
    K-means Clustering • Step2: Calculate distance from each object to each cluster seed. • What type of distance should we use? – Squared Euclidean distance • Step 3: Assign each object to the closest cluster 24
  • 25.
  • 26.
    K-means Clustering • Iterate: –Calculate distance from objects to cluster centroids. – Assign objects to closest cluster – Recalculate new centroids • Stop based on convergence criteria – No change in clusters – Max iterations 26
  • 27.
    Regression • The regressiontask comes from Supervised machine learning. • It can help us to predict (expect continues values) and explains the objects based on a given set of numerical and categorical data. • For example, we can predict the house prices based on the house attributes such as number of rooms, size, and location. 27
  • 28.
    Regression • In mathematicalterms, the regression method provides us a linear line with the equation of Y = mX+c to model a dataset. • Here we are taking the X (Dependent variable) and Y (Independent variable) data points to train the linear regression model. The best observation line can be found by calculating the slope (m) and y-intercept (c) values. 28
  • 29.
    Applications • Risk assessment– Insurance, Banking • Score prediction – Cricket, Elections • Market forecasting – Share Market • Weather forecasting • Housing and product price prediction • Analysing engine performance in Automobiles. 29
  • 30.
    Regression Analysis • Theregression analysis is performed with various effective algorithms namely • Simple linear regression • Multiple linear regression • Decision trees • Radom forest • Support Vector Machines (SVM) 30
  • 31.
    Decision Trees 31 Name DebtIncome Married? Risk -------------------------------------------------------- Joe High High Yes Good Sue Low High Yes Good John Low High No Poor Mary High Low Yes Poor Fred Low Low Yes Poor
  • 32.
    32 Decision Tree Classification •Example D1 D2 Decision Tree Income =High Income =low D2 D1
  • 33.
    33 Decision Trees Classification(cont.) • Example D1a D2 1 2 D1b Decision Tree Income =High Income =low D2 D1 D1a D1b
  • 34.
    Random Forest • Randomforest (or random forests) is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the class's output by individual trees. It combines "bagging" and the random selection of features. • For many data sets, it produces a highly accurate classification results than decision trees. 34
  • 35.
    Bagging • Bagging, alsoknown as bootstrap aggregation, is the ensemble learning method that is commonly used to reduce variance within a noisy dataset. • In bagging, a random sample of data in a training set is selected with replacement—meaning that the individual data points can be chosen more than once. 35
  • 36.
    Regression Vs Classification •The most significant difference between regression vs classification is that while regression helps predict a continuous quantity, classification predicts discrete class labels. There are also some overlaps between the two types of machine learning algorithms. 36
  • 37.
    Support Vector Machines •It is a linear classifier. • It classifies the dataset into two groups for binary classification problems. • The multi-class SVM classifies the data set into multiple groups. • It is a supervised learning method. 37
  • 38.
    Linear Classifiers f x a yest denotes +1 denotes-1 f(x,w,b) = sign(w x + b) How would you classify this data? w x + b<0 w x + b>0 38
  • 39.
    Linear Classifiers f x a yest Denotes +1 Denotes-1 f(x,w,b) = sign(w x + b) How will you classify this data? 39
  • 40.
    40 Maximum Margin f x a yest denotes +1 denotes-1 f(x,w,b) = sign(w. x + b) The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (Called an LSVM) Support Vectors are those datapoints that the margin pushes up against Linear SVM
  • 41.
    Probabilistic Models • Uncertainty •ABC Murder Story • Bayesian Classification • Neural Networks • Feature Selection and Classification • Deep Learning 41
  • 42.
    Probabilistic Models • Theprobabilistic framework to machine learning is that learning can be thought of as inferring plausible models to explain observed data. • A machine can use such models to make predictions about future data, and take decisions that are rational given these predictions. 42
  • 43.
    BAYESIYAN CLASSIFICATION • CONDITIONAL PROBABILITY •BAYES THEOREM • NAÏVE BAYES CLASSIFIER • BELIEF NETWORK • APPLICATION OF BAYESIAN NETWORK - CYBER CRIME DETECTION 43
  • 44.
    BAYESIAN CLASSIFICATION • Probabilisticlearning: Calculates explicit probabilities for hypothesis, among the most practical approaches to certain types of learning problems. • Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data. 44
  • 45.
    BAYESIAN THEOREM • Aspecial case of Bayesian Theorem: P(A∩B) = P(B) x P(A|B) P(B∩A) = P(A) x P(B|A) Since P(A∩B) = P(B∩A), P(B) x P(A|B) = P(A) x P(B|A) => P(A|B) = [P(A) x P(B|A)] / P(B)         A B P A P A B P A P A B P A P B P A B P A P B A P | | ) | ( ) ( ) ( ) | ( ) ( ) | (    A B 45
  • 46.
    BAYESIAN THEOREM • Example1: A medical cancer diagnosis problem There are 2 possible outcomes of a diagnosis: + ve, - ve. We know .8% of world population has cancer. Test gives correct +ve result 98% of the time and gives correct –ve result 97% of the time. If a patient’s test returns +ve, should we diagnose the patient as having cancer? 46
  • 47.
    BAYESIAN THEOREM P(cancer) =.008 P(-cancer) = .992 P(+ve|cancer) = .98 P(-ve|cancer) = .02 P(+ve|-cancer) = .03 P(-ve|-cancer) = .97 Using Bayes Formula: P(cancer|+ve) = P(+ve|cancer)xP(cancer) / P(+ve) = 0.98 x 0.008 = .0078 / P(+ve) P(-cancer|+ve) = P(+ve|-cancer)xP(-cancer) / P(+ve) = 0.03 x 0.992 = 0.0298 / P(+ve). So, the patient most likely does not have cancer. 47
  • 48.
    NAÏVE BAYES CLASSIFIER •A simplified assumption: attributes are conditionally independent. • Greatly reduces the computation cost, only count the class distribution. 48
  • 49.
    NAÏVE BAYES CLASSIFIER Theprobabilistic model of NBC is to find the probability of a certain class given multiple dijoint (assumed) events. The naïve Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and where the target function f(x) can take on any value from some finite set V. A set of training examples of the target function is provided, and a new instance is presented, described by the tuple of attribute values <a1,a2,…,an>. The learner is asked to predict the target value, or classification, for this new instance. 49
  • 50.
    NAÏVE BAYES CLASSIFIER Abstractly,probability model for a classifier is a conditional model P(C|F1,F2,…,Fn) Over a dependent class variable C with a small number of outcome or classes conditional over several feature variables F1,…,Fn. Naïve Bayes Formula: P(C|F1,F2,…,Fn) = argmaxc [P(C) x P(F1|C) x P(F2|C) x…x P(Fn|C)] / P(F1,F2,…,Fn) Since P(F1,F2,…,Fn) is common to all probabilities, we need not evaluate the denominator for comparisons. 50
  • 51.
  • 52.
    NAÏVE BAYES CLASSIFIER •Problem: Use training data from above to classify the following instances: a) <Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong> b) <Outlook=overcast, Temperature=cool, Humidity=high, Wind=strong> 52
  • 53.
    NAÏVE BAYES CLASSIFIER Answerto (a): P(PlayTennis=yes) = 9/14 = 0.64 P(PlayTennis=n) = 5/14 = 0.36 P(Outlook=sunny|PlayTennis=yes) = 2/9 = 0.22 P(Outlook=sunny|PlayTennis=no) = 3/5 = 0.60 P(Temperature=cool|PlayTennis=yes) = 3/9 = 0.33 P(Temperature=cool|PlayTennis=no) = 1/5 = .20 P(Humidity=high|PlayTennis=yes) = 3/9 = 0.33 P(Humidity=high|PlayTennis=no) = 4/5 = 0.80 P(Wind=strong|PlayTennis=yes) = 3/9 = 0.33 P(Wind=strong|PlayTennis=no) = 3/5 = 0.60 53
  • 54.
    NAÏVE BAYES CLASSIFIER P(yes)xP(sunny|yes)xP(cool|yes)xP(high|yes)x P(strong|yes) = 0.0053 P(no)xP(sunny|no)xP(cool|no)xP(high|no) x P(strong|no) = 0.0206 So the class for this instance is ‘no’. We can normalize the probility by: [0.0206]/[0.0206+0.0053] = 0.795 54
  • 55.
    NAÏVE BAYES CLASSIFIER Answerto (b): P(PlayTennis=yes) = 9/14 = 0.64 P(PlayTennis=no) = 5/14 = 0.36 P(Outlook=overcast|PlayTennis=yes) = 4/9 = 0.44 P(Outlook=overcast|PlayTennis=no) = 0/5 = 0 P(Temperature=cool|PlayTennis=yes) = 3/9 = 0.33 P(Temperature=cool|PlayTennis=no) = 1/5 = .20 P(Humidity=high|PlayTennis=yes) = 3/9 = 0.33 P(Humidity=high|PlayTennis=no) = 4/5 = 0.80 P(Wind=strong|PlayTennis=yes) = 3/9 = 0.33 P(Wind=strong|PlayTennis=no) = 3/5 = 0.60 55
  • 56.
    NAÏVE BAYES CLASSIFIER EstimatingProbabilities: In the previous example, P(overcast|no) = 0 which causes the formula- P(no)xP(overcast|no)xP(cool|no)xP(high|no) x P(strong|nno) = 0.0 This causes problems in comparing because the other probabilities are not considered. We can avoid this difficulty by using m-estimate. 56
  • 57.
    NAÏVE BAYES CLASSIFIER M-EstimateFormula: [c + k] / [n + m] where c/n is the original probability used before, k=1 and m= equivalent sample size. Using this method our new values of probability is given below- 57
  • 58.
    NAÏVE BAYES CLASSIFIER Newanswer to (b): P(PlayTennis=yes) = 10/16 = 0.63 P(PlayTennis=no) = 6/16 = 0.37 P(Outlook=overcast|PlayTennis=yes) = 5/12 = 0.42 P(Outlook=overcast|PlayTennis=no) = 1/8 = .13 P(Temperature=cool|PlayTennis=yes) = 4/12 = 0.33 P(Temperature=cool|PlayTennis=no) = 2/8 = .25 P(Humidity=high|PlayTennis=yes) = 4/11 = 0.36 P(Humidity=high|PlayTennis=no) = 5/7 = 0.71 P(Wind=strong|PlayTennis=yes) = 4/11 = 0.36 P(Wind=strong|PlayTennis=no) = 4/7 = 0.57 58
  • 59.
    NAÏVE BAYES CLASSIFIER P(yes)xP(overcast|yes)xP(cool|yes)xP(high|ye s)xP(strong|yes)= 0.011 P(no)xP(overcast|no)xP(cool|no)xP(high|no) x P(strong|nno) = 0.00486 So the class of this instance is ‘yes’ 59
  • 60.
    NAÏVE BAYES CLASSIFIER •The conditional probability values of all the attributes with respect to the class are pre-computed and stored on disk. • This prevents the classifier from computing the conditional probabilities every time it runs. • This stored data can be reused to reduce the latency of the classifier. 60
  • 61.
    Bayesian Belief Networks •In Naïve Bayes Classifier we make the assumption of class conditional independence, that is given the class label of a sample, the value of the attributes are conditionally independent of one another. • However, there can be dependences between the value of attributes. To avoid this, we use Bayesian Belief Network which provides the joint conditional probability distribution. 61
  • 62.
    Bayesian Belief Networks •A Bayesian network is a form of probabilistic graphical model. • Specifically, a Bayesian network is a directed acyclic graph of nodes representing variables and arcs representing dependence relations among the variables. • They provide a graphical method for getting the inferred results through joint probabilities. 62
  • 63.
  • 64.
  • 65.
    BELIEF NETWORKS • Bythe chaining rule of probability, the joint probability of all the nodes in the graph above is: P(C, S, R, W) = P(C) * P(S|C) * P(R|C) * P(W|S,R) W=Wet Grass, C=Cloudy, R=Rain, S=Sprinkler Example: P(W∩-R∩S∩C) = P(W|S,-R)*P(-R|C)*P(S|C)*P(C) = 0.9*0.2*0.1*0.5 = 0.009 65
  • 66.
    BAYESIAN BELIEF NETWORK Whatis the probability of wet grass on a given day - P(W)? P(W) = P(W|SR) * P(S) * P(R) + P(W|S-R) * P(S) * P(-R) + P(W|-SR) * P(-S) * P(R) + P(W|-S-R) * P(-S) * P(-R) Here P(S) = P(S|C) * P(C) + P(S|-C) * P(-C) P(R) = P(R|C) * P(C) + P(R|-C) * P(-C) P(W)= 0.5985 66
  • 67.
    BAYESIAN BELIEF NETWORK Whatis the probability of wet grass on a given day - P(W)? P(W) = P(W|SR) * P(S) * P(R) + P(W|S-R) * P(S) * P(-R) + P(W|-SR) * P(-S) * P(R) + P(W|-S-R) * P(-S) * P(-R) Here P(S) = P(S|C) * P(C) + P(S|-C) * P(-C) P(R) = P(R|C) * P(C) + P(R|-C) * P(-C) P(W)= 0.5985 67
  • 68.
    Advantages of BayesianApproach • Bayesian networks can readily handle incomplete data sets. • Bayesian networks allow one to learn about causal relationships • Bayesian networks readily facilitate use of prior knowledge. 68
  • 69.
    ML Resources (Books) 1.Stephen Marsland, “Machine Learning – An Algorithmic Perspective”, Second Edition, Chapman and Hall/CRC Machine Learning and Pattern Recognition Series, 2014. 2. Tom M Mitchell, ―Machine Learning‖, First Edition, McGraw Hill Education, 2013. 69
  • 70.
    ML Resources (Books) 3.Nilsson, N. (2004). Introduction to Machine Learning. http://robotics.stanford.edu/people/nilsson/ mlbook.html. 4. Russell, S. (1997). Machine Learning. Handbook of Perception and Cognition, Vol. 14, Chap. 4. 5. Ethem Alpaydin, “Introduction to Machine Learning”, (Adaptive Computation and Machine Learning Series), Third Edition, MIT Press, 2014. 70
  • 71.
    Journals - IEEE •IEEE Transactions on Neural Networks. • IEEE Transactions on Pattern Analysis and Machine Intelligence. • IEEE Transactions on Neural Networks and Learning Systems. • IEEE Transactions on Artificial Intelligence • IEEE Transactions on Knowledge and Data Engineering. 71
  • 72.
    ML Journals -Elsevier • Machine Learning with Applications • Expert Systems With Applications • Applied Soft Computing • Knowledge-based Systems • Neural Networks • Data & Knowledge Engineering • Artificial Intelligence 72
  • 73.
    Neural Networks Similarity withbiological network Fundamental processing elements of a neural network is a neuron 1.Receives inputs from other source 2.Combines them in someway 3.Performs a generally nonlinear operation on the result 4.Outputs the final result •Biologically motivated approach to machine learning 73
  • 74.
    Similarity with BiologicalNetwork • Fundamental processing element of a neural network is a neuron • A human brain has 100 billion neurons • An ant brain has 250,000 neurons 74
  • 75.
    Neural Network • NeuralNetwork is a set of connected INPUT/OUTPUT UNITS, where each connection has a WEIGHT associated with it. • Neural Network learning is also called CONNECTIONIST learning due to the connections between units. • It is a case of SUPERVISED or CLASSIFICATION learning. 75
  • 76.
    Neural Network • NeuralNetwork learns by adjusting the weights so as to be able to correctly classify the training data and hence, after testing phase, to classify unknown data. • Neural Network needs long time for training. • Neural Network has a high tolerance to noisy and incomplete data. 76
  • 77.
    Neural Network Classifier •Input: Classification data It contains classification attribute • Data is divided, as in any classification problem. [Training data and Testing data] • All data must be normalized. (i.e. all values of attributes in the database are changed to contain values in the internal [0,1] or[-1,1]) Neural Network can work with data in the range of (0,1) or (-1,1) 77
One Neuron as a Network
• The neuron receives the weighted sum x as input and calculates the output as a function of the input as follows:
• y = f(x), where f(x) = 0 when x < 0.5 and f(x) = 1 when x >= 0.5.
• For example, if x = 0.55, then y = 1 and the input is classified into class 1.
• If x = 0.45, then f(x) = 0 and the input is classified into class 0.
Bias of a Neuron
• The bias value b is added to the weighted sum ∑ wi·xi so that the decision boundary can be shifted away from the origin: v = ∑ wi·xi + b.
• [Figure: decision boundary lines x1 - x2 = -1, x1 - x2 = 0 and x1 - x2 = 1 in the (x1, x2) plane, shifted by the bias.]
Bias as Extra Input
• The bias can be treated as an extra input x0 = +1 with weight w0 = b.
• [Figure: input attribute values x1, x2, …, xm with weights w1, w2, …, wm, plus x0 = +1 with weight w0 = b, feed a summing function followed by an activation function that produces the output class y.]
• The induced local field is then v = Σ (j = 0 to m) wj·xj, with x0 = +1 and w0 = b.
Neuron with Activation
• The neuron is the basic information processing unit of a NN. It consists of:
1. A set of links, describing the neuron inputs, with weights w1, w2, …, wm.
2. An adder function (linear combiner) that computes the weighted sum of the inputs: u = Σ (j = 1 to m) wj·xj.
3. An activation function φ that limits the amplitude of the neuron output: y = φ(u + b).
• A one-neuron sketch in code follows.
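Putting the last three slides together, a single neuron is a weighted sum plus a bias passed through an activation function. The sketch below is illustrative only; the weights, bias and input happen to be those of hidden unit 4 in the worked backpropagation example a few slides ahead, so the output should be about 0.332.

```python
import numpy as np

def sigmoid(v):
    """Logistic (squashing) activation: maps any real v into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def neuron_output(x, w, b):
    """One neuron: weighted sum of the inputs plus bias, then activation."""
    v = np.dot(w, x) + b          # v = sum_j w_j * x_j + b
    return sigmoid(v)

x = np.array([1.0, 0.0, 1.0])     # example input vector
w = np.array([0.2, 0.4, -0.5])    # example weights
b = -0.4                          # bias
print(neuron_output(x, w, b))     # ≈ 0.332
```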
A Multilayer Feed-Forward Neural Network
• [Figure: the input nodes receive the input record xi; weights wij connect them to the hidden nodes with outputs Oj; weights wjk connect the hidden nodes to the output nodes with outputs Ok, which give the output class. The network is fully connected.]
Neural Network Learning
• The inputs are fed simultaneously into the input layer.
• The weighted outputs of these units are fed into the hidden layer.
• The weighted outputs of the last hidden layer are the inputs to the units making up the output layer.
A Multilayer Feed-Forward Network
• The units in the hidden layers and output layer are sometimes referred to as neurodes, due to their symbolic biological basis, or as output units.
• A network containing two hidden layers is called a three-layer neural network, and so on.
• The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer.
A Multilayer Feed-Forward Network
• INPUT: records without the class attribute, with normalized attribute values.
• INPUT VECTOR: X = {x1, x2, …, xn}, where n is the number of (non-class) attributes.
• INPUT LAYER: there are as many nodes as non-class attributes, i.e. as the length of the input vector.
• HIDDEN LAYER: the number of nodes in the hidden layer and the number of hidden layers depend on the implementation.
A Multilayer Feed-Forward Network
• OUTPUT LAYER: corresponds to the class attribute.
• There are as many output nodes as classes (values of the class attribute): Ok, k = 1, 2, …, #classes.
• The network is fully connected, i.e. each unit provides input to each unit in the next forward layer.
Classification by Backpropagation
• Backpropagation learns by iteratively processing a set of training data (samples).
• For each sample, the weights are modified so as to minimize the error between the network's classification and the actual classification.
Steps in the Backpropagation Algorithm
• STEP ONE: initialize the weights and biases.
• The weights in the network are initialized to random numbers from the interval [-1, 1].
• Each unit has a BIAS associated with it.
• The biases are similarly initialized to random numbers from the interval [-1, 1].
• STEP TWO: feed the training sample.
Steps in the Backpropagation Algorithm (cont.)
• STEP THREE: propagate the inputs forward; compute the net input and output of each unit in the hidden and output layers.
• STEP FOUR: backpropagate the error.
• STEP FIVE: update the weights and biases to reflect the propagated errors.
• STEP SIX: check the terminating conditions.
Propagation through the Hidden Layer (One Node)
• The inputs to unit j are the outputs of the previous layer. These are multiplied by their corresponding weights to form a weighted sum, which is added to the bias associated with unit j.
• A nonlinear activation function f is then applied to the net input.
• [Figure: input vector x = (x0, x1, …, xn) and weight vector w = (w0j, w1j, …, wnj) feed a weighted sum with bias θj, followed by the activation function f, producing the output y.]
Propagate the Inputs Forward
• For unit j in the input layer, the output equals the input: Oj = Ij.
• The net input to each unit in the hidden and output layers is computed as follows.
• Given a unit j in a hidden or output layer, the net input is Ij = Σi wij·Oi + θj, where wij is the weight of the connection from unit i in the previous layer to unit j, Oi is the output of unit i in the previous layer, and θj is the bias of the unit.
Propagate the Inputs Forward
• Each unit in the hidden and output layers takes its net input and applies an activation function, which symbolizes the activation of the neuron represented by the unit. It is also called a logistic, sigmoid, or squashing function.
• Given the net input Ij to unit j, the output Oj = f(Ij) of unit j is computed as Oj = 1 / (1 + e^(-Ij)).
• A small vectorized sketch of these two steps follows.
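The two formulas above can be written as one vectorized layer. This sketch is not from the slides; it uses the hidden-layer weights of the worked example that appears a few slides ahead, so it should print approximately [0.332, 0.525].

```python
import numpy as np

def sigmoid(I):
    return 1.0 / (1.0 + np.exp(-I))

def layer_forward(O_prev, W, theta):
    """Forward-propagate one layer.

    O_prev : outputs of the previous layer, shape (n_prev,)
    W      : weights, W[i, j] = w_ij from unit i to unit j, shape (n_prev, n_curr)
    theta  : biases of the current layer, shape (n_curr,)
    """
    I = O_prev @ W + theta      # net input  I_j = sum_i w_ij * O_i + theta_j
    return sigmoid(I)           # output     O_j = 1 / (1 + e^(-I_j))

# Hidden layer of the worked example a few slides ahead:
x = np.array([1.0, 0.0, 1.0])
W_hidden = np.array([[0.2, -0.3],    # w14, w15
                     [0.4,  0.1],    # w24, w25
                     [-0.5, 0.2]])   # w34, w35
theta_hidden = np.array([-0.4, 0.2])
print(layer_forward(x, W_hidden, theta_hidden))   # ≈ [0.332, 0.525]
```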
Back Propagate the Error
• On reaching the output layer, the error is computed and propagated backwards.
• For a unit k in the output layer, the error is Errk = Ok·(1 - Ok)·(Tk - Ok), where Ok is the actual output of unit k (computed by the activation function Ok = 1 / (1 + e^(-Ik))), Tk is the true output based on the known class label of the training sample, and Ok·(1 - Ok) is the derivative (rate of change) of the activation function.
Back Propagate the Error
• The error is propagated backwards by updating the weights and biases to reflect the error of the network's classification.
• For a unit j in the hidden layer, the error is Errj = Oj·(1 - Oj)·Σk Errk·wjk, where wjk is the weight of the connection from unit j to unit k in the next higher layer, and Errk is the error of unit k.
Update Weights and Biases
• Weights are updated by the following equations, where l is a constant between 0.0 and 1.0 reflecting the learning rate (fixed for the implementation): Δwij = l·Errj·Oi and wij = wij + Δwij.
• Biases are updated by the following equations: Δθj = l·Errj and θj = θj + Δθj.
Update Weights and Biases
• Here the weights and biases are updated after the presentation of each sample. This is called case updating.
• Epoch: one iteration through the training set is called an epoch.
• Epoch updating: alternatively, the weight and bias increments could be accumulated in variables and the weights and biases updated after all of the samples of the training set have been presented.
• Case updating is more accurate.
Terminating Conditions
• Training stops when:
• all Δwij in the previous epoch are below some threshold, or
• the percentage of samples misclassified in the previous epoch is below some threshold, or
• a pre-specified number of epochs has expired.
• In practice, several hundred thousand epochs may be required before the weights converge.
Backpropagation Formulas (summary)
• Net input: Ij = Σi wij·Oi + θj
• Output: Oj = 1 / (1 + e^(-Ij))
• Output-layer error: Errk = Ok·(1 - Ok)·(Tk - Ok)
• Hidden-layer error: Errj = Oj·(1 - Oj)·Σk Errk·wjk
• Weight update: wij = wij + l·Errj·Oi
• Bias update: θj = θj + l·Errj
• [Figure: the input vector xi feeds the input nodes; weights wij connect them through the hidden nodes to the output nodes, which produce the output vector.]
Example of Backpropagation
• Network: 3 input units (1, 2, 3), 2 hidden units (4, 5), 1 output unit (6). Weights are initialized to random numbers from -1.0 to 1.0.
• Initial input and weights:
x1 = 1, x2 = 0, x3 = 1
w14 = 0.2, w15 = -0.3, w24 = 0.4, w25 = 0.1, w34 = -0.5, w35 = 0.2, w46 = -0.3, w56 = -0.2
Example (cont.)
• A bias is added to each hidden and output node, initialized to random values from -1.0 to 1.0:
θ4 = -0.4, θ5 = 0.2, θ6 = 0.1
Net Input and Output Calculation
• Unit 4: net input I4 = 0.2 + 0 - 0.5 - 0.4 = -0.7; output O4 = 1 / (1 + e^0.7) = 0.332
• Unit 5: net input I5 = -0.3 + 0 + 0.2 + 0.2 = 0.1; output O5 = 1 / (1 + e^-0.1) = 0.525
• Unit 6: net input I6 = (-0.3)(0.332) + (-0.2)(0.525) + 0.1 = -0.105; output O6 = 1 / (1 + e^0.105) = 0.475
Calculation of Error at Each Node (assuming the true output T6 = 1)
• Unit 6: Err6 = 0.475 (1 - 0.475)(1 - 0.475) = 0.1311
• Unit 5: Err5 = 0.525 (1 - 0.525)(0.1311)(-0.2) = -0.0065
• Unit 4: Err4 = 0.332 (1 - 0.332)(0.1311)(-0.3) = -0.0087
Calculation of Weight and Bias Updates (learning rate l = 0.9)
• w46 = -0.3 + (0.9)(0.1311)(0.332) = -0.261
• w56 = -0.2 + (0.9)(0.1311)(0.525) = -0.138
• w14 = 0.2 + (0.9)(-0.0087)(1) = 0.192
• w15 = -0.3 + (0.9)(-0.0065)(1) = -0.306
• θ6 = 0.1 + (0.9)(0.1311) = 0.218
• The remaining weights and biases are updated similarly. The full example is reproduced in code below.
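The whole worked example can be reproduced in a few lines of Python. The sketch below simply applies the formulas from the previous slides to the slide's initial weights, biases, input (1, 0, 1), target T6 = 1 and learning rate 0.9; the computed values agree with the tables above up to rounding.

```python
import numpy as np

sigmoid = lambda I: 1.0 / (1.0 + np.exp(-I))

# Initial weights and biases from the slides
x = np.array([1.0, 0.0, 1.0])                            # x1, x2, x3
W1 = np.array([[0.2, -0.3], [0.4, 0.1], [-0.5, 0.2]])    # w14..w35 (inputs -> hidden units 4, 5)
W2 = np.array([[-0.3], [-0.2]])                          # w46, w56 (hidden -> output unit 6)
b1 = np.array([-0.4, 0.2])                               # theta4, theta5
b2 = np.array([0.1])                                     # theta6
T, lr = 1.0, 0.9

# Forward pass
O_hidden = sigmoid(x @ W1 + b1)                          # ≈ [0.332, 0.525]
O_out = sigmoid(O_hidden @ W2 + b2)                      # ≈ [0.474]

# Backpropagate the error
err_out = O_out * (1 - O_out) * (T - O_out)              # Err6 ≈ 0.131
err_hidden = O_hidden * (1 - O_hidden) * (W2 @ err_out)  # Err4 ≈ -0.0087, Err5 ≈ -0.0065

# Update weights and biases (case updating)
W2 += lr * np.outer(O_hidden, err_out)                   # w46 ≈ -0.261, w56 ≈ -0.138
W1 += lr * np.outer(x, err_hidden)                       # w14 ≈ 0.192, w15 ≈ -0.306, ...
b2 += lr * err_out                                       # theta6 ≈ 0.218
b1 += lr * err_hidden

print(O_hidden, O_out)
print(err_out, err_hidden)
print(W2.ravel(), b2)
```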
DEEP LEARNING
• Deep learning is a subset of ML within AI, with networks capable of learning, even unsupervised, from data that is unstructured or unlabeled.
• Deep Learning is a subfield of Machine Learning that uses neural networks to model and solve complex problems.
DEEP LEARNING
• The key characteristic of Deep Learning is the use of deep neural networks, which have multiple layers of interconnected nodes.
• These networks can learn complex representations of data by discovering hierarchical patterns and features in the data.
• Deep Learning algorithms can automatically learn and improve from data without the need for manual feature engineering.
DEEP LEARNING
• Deep Learning has achieved significant success in various fields, including image recognition, natural language processing, speech recognition, and recommendation systems.
• Popular Deep Learning architectures include Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Deep Belief Networks (DBNs).
DEEP LEARNING
• Training deep neural networks typically requires a large amount of data and computational resources.
• However, the availability of cloud computing and the development of specialized hardware, such as Graphics Processing Units (GPUs), have made it easier to train deep neural networks.
Convolutional Neural Networks
• A Convolutional Neural Network (CNN) is a type of deep learning algorithm that is particularly well-suited to image recognition and processing tasks.
• It is made up of multiple layers, including convolutional layers, pooling layers, and fully connected layers.
Convolutional Neural Networks
• The convolutional layers are the key component of a CNN: filters are applied to the input image to extract features such as edges, textures, and shapes.
• The output of the convolutional layers is then passed through pooling layers, which down-sample the feature maps, reducing the spatial dimensions while retaining the most important information.
• The output of the pooling layers is then passed through one or more fully connected layers, which make the prediction or classify the image.
CNN
• Certain steps/operations are involved in a CNN. These can be categorized as follows:
• Convolution operation
• Pooling
• Flattening
• Fully connected layers
CNN
• The convolution operation is the first and one of the most important steps in the functioning of a CNN. It focuses on extracting/preserving important features from the input (an image, etc.).
CNN
• To understand this operation, let us consider an image as the input to our CNN.
• When an image is given as input, it is in the form of a matrix of pixels.
• If the image is grayscale, it is represented as a single matrix, where each value ranges from 0 to 255.
CNN
• We can even normalize these values, say to the range 0-1, where 0 represents white and 1 represents black.
• If the image is colored, there are three matrices representing the R, G and B channels, with each value in the range 0-255. This can be seen in the figures below.
Fig 1: Colored image matrices
Fig 2: Grayscale image matrix
Convolution Operation
• MATHEMATICAL OPERATION: coming to the convolution operation, let us consider an input image. For the convolution operation, filters or kernels are used.
Convolution Operation
• The following mathematical operation is performed:
Let the size of the image be N×N and the size of the filter be F×F.
Then (N×N) * (F×F) = (N-F+1)×(N-F+1), where * denotes the convolution operation.
• The kernel size, number of kernels, input channels, etc. are hyperparameters. The result of each layer is passed on to the next one.
Example
• Here, an input of size 6×6 is given and a kernel of size 3×3 is used, so the feature map obtained is of size 4×4 (6 - 3 + 1 = 4).
• To increase non-linearity, a rectifier (ReLU) function can be applied to the feature map.
• Finally, after the convolution step is completed and the feature map is obtained, this map is given as input to the pooling layer. A small NumPy sketch of this step follows.
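As a quick check of the (N-F+1) rule, here is a small "valid" convolution sketch in NumPy. The input values and the vertical-edge kernel are made up for illustration; like most CNN libraries, the code implements cross-correlation (no kernel flip).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution: an NxN image and an FxF kernel
    give an (N-F+1)x(N-F+1) feature map."""
    N, F = image.shape[0], kernel.shape[0]
    out = np.zeros((N - F + 1, N - F + 1))
    for i in range(N - F + 1):
        for j in range(N - F + 1):
            out[i, j] = np.sum(image[i:i+F, j:j+F] * kernel)
    return out

relu = lambda x: np.maximum(x, 0)              # rectifier, adds non-linearity

image = np.random.randint(0, 2, size=(6, 6))   # made-up 6x6 binary image
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])                # simple vertical-edge filter
feature_map = relu(conv2d_valid(image, kernel))
print(feature_map.shape)                       # (4, 4)
```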
CNN Architecture
• A common CNN model architecture has a number of convolution and pooling layers stacked one after the other.
Pooling Layers
• Pooling layers are used to reduce the dimensions of the feature maps.
• This reduces the number of parameters to learn and the amount of computation performed in the network.
Pooling Layers
• The pooling layer summarizes the features present in a region of the feature map generated by a convolution layer.
• Further operations are therefore performed on summarized features instead of the precisely positioned features generated by the convolution layer.
• This makes the model more robust to variations in the position of the features in the input image.
Max Pooling
• Types of pooling layers: max, min and average pooling.
• Max pooling is a pooling operation that selects the maximum element from the region of the feature map covered by the filter.
• Thus, the output of a max-pooling layer is a feature map containing the most prominent features of the previous feature map.
Average Pooling
• Average pooling computes the average of the elements present in the region of the feature map covered by the filter.
• Thus, while max pooling gives the most prominent feature in a particular patch of the feature map, average pooling gives the average of the features present in the patch. A small pooling sketch follows.
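A small sketch of both pooling variants (the 4×4 feature map values are made up; a 2×2 window with stride 2 is assumed):

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Max or average pooling over size x size windows."""
    H, W = feature_map.shape
    out_h, out_w = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

fm = np.array([[1., 3., 2., 1.],
               [4., 6., 5., 0.],
               [2., 1., 9., 8.],
               [0., 3., 7., 4.]])
print(pool2d(fm, mode="max"))       # [[6. 5.] [3. 9.]]
print(pool2d(fm, mode="average"))   # [[3.5 2. ] [1.5 7. ]]
```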
Flattening
• Flattening is converting the data into a 1-dimensional array for input to the next layer.
• We flatten the output of the convolutional layers to create a single long feature vector.
• This vector is connected to the final classification model, which is called a fully-connected layer.
CNN
• In other words, we put all the pixel data in one line and make connections with the final layer.
• And once again: what is the final layer for? The classification, e.g. of 'cats and dogs'. The whole pipeline is sketched below.
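Putting convolution, pooling, flattening and the fully connected layer together, here is a minimal sketch of such a network, assuming TensorFlow/Keras is available; the 64×64 input size, filter counts and layer widths are illustrative choices, not values from the slides.

```python
from tensorflow.keras import layers, models

# Convolution -> pooling -> convolution -> pooling -> flatten -> fully connected
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                       # 1-D feature vector
    layers.Dense(128, activation="relu"),   # fully connected layer
    layers.Dense(1, activation="sigmoid"),  # binary output, e.g. cat vs dog
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```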
Recurrent Neural Network
• A Recurrent Neural Network (RNN) is a type of neural network where the output from the previous step is fed as input to the current step.
• In traditional neural networks, all the inputs and outputs are independent of each other.
• But in cases where we must predict the next word of a sentence (as in NLP), the previous words are required.
• Hence, there is a need to remember the previous words.
Recurrent Neural Network
• Thus the RNN came into existence; it solves this issue with the help of a hidden layer.
• The main and most important feature of an RNN is its hidden state, which remembers some information about the sequence.
Recurrent Neural Network
• The state is also referred to as the memory state, since it remembers the previous input to the network.
• It uses the same parameters for each input, as it performs the same task on all the inputs or hidden layers to produce the output.
• This reduces the number of parameters, unlike other neural networks.
How an RNN Works
• The Recurrent Neural Network consists of multiple fixed activation-function units, one for each time step.
• Each unit has an internal state, which is called the hidden state of the unit.
• This hidden state signifies the past knowledge that the network holds at a given time step.
RNN
• This hidden state is updated at every time step to signify the change in the network's knowledge about the past.
• The hidden state is updated using the following recurrence relation:
RNN
• The formula for calculating the current state is ht = f(ht-1, xt), where ht is the current state, ht-1 is the previous state and xt is the input at time t.
RNN: Training through the RNN
• A single time step of the input is provided to the network.
• The network then calculates its current state using the current input and the previous state.
• The current ht becomes ht-1 for the next time step.
RNN
• One can go through as many time steps as the problem requires, joining the information from all the previous states.
• Once all the time steps are completed, the final current state is used to calculate the output, as in the sketch below.
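A minimal forward-pass sketch of this process, assuming the common tanh form of the recurrence, ht = tanh(Wxh·xt + Whh·ht-1), and an output y = Why·hT computed from the final state; the sizes and random weights are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size, T = 4, 3, 2, 5

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (shared across all steps)
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden -> output

xs = [rng.normal(size=input_size) for _ in range(T)]  # made-up input sequence
h = np.zeros(hidden_size)                             # initial hidden (memory) state

for x_t in xs:
    # recurrence: current state from the current input and the previous state
    h = np.tanh(W_xh @ x_t + W_hh @ h)

y = W_hy @ h        # output computed from the final state
print(y)
```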
RNN
• The output is then compared with the actual (target) output and the error is generated.
• The error is then back-propagated through the network to update the weights; hence the RNN is trained using Backpropagation Through Time (BPTT).
Conclusions
• Machine Learning
• Supervised and Unsupervised Learning
• Neural Networks
• CNN
• RNN