Machine learning and deep learning algorithms
Fundamentals of Deep Learning
Dr. A.Kannan, Former Professor and Head,
Department of Information Science and
Technology, CEG Campus, Anna
University, Chennai-25.
Senior Professor, School of Computer
Science and Engineering,
VIT, Vellore-632014.
.
1
MACHINE LEARNING
• “Learning is making useful changes in our
minds.” - Marvin Minsky.
• “Machine Learning (ML) refers to a system
which has the capability of autonomous
knowledge acquisition and integration of
the acquired knowledge.”
2
MACHINE LEARNING
• Machine learning is an application of
Artificial Intelligence (AI) that provides the
systems with the ability to automatically
learn and improve by themselves from the
experience gained by them without being
explicitly programmed.
• It focuses on the development of intelligent
computer programs that can access the data
and use it for learning by themselves. 3
Applications of ML
• Image Processing – Face Recognition, Hand
written character recognition, Self driving
Cars, Traffic Video analysis….
• Natural Language Processing - Social
Network Analysis, Recommendation
Systems and Sentiment Analysis.
• Medical Diagnosis: Disease Identification,
Prediction on Cancer, Diabetes etc using
past history and current data. 4
Machine Learning Paradigms
• Rote Learning
• Transfer of Learning
• Learning by Taking Advice
• Learning By Analogy
• Un-Supervised Learning (Clustering)
• Supervised Learning – Classification
• Deep Learning
5
Machine Learning Tasks
• Knowledge Representation and Reasoning
• Regression
• Classification
• Clustering
• Dimensionality reduction
• Reinforcement learning (Ranking)
6
AI - Knowledge Based Systems
• Facts
• Rules
• Knowledge base
• Knowledge Based Systems
• Knowledge Representation
• Reasoning and Inference
7
FACTS AND RULES
• Pat is a man = true
• Kumar is the father of Raja = True
• Kumar is the grandfather of Raja = False
• IF marks >=60 Then
Class = FIRST CLASS
8
Rules
• If A then B (whenever A is true, B is also
true)
• A = true (now A is TRUE)
• Inference: B is true now
• Using A → B and A, we can infer B.
• This rule is called Modus Ponens.
• A → B together with A implies B.
9
Inference
• Pat is a man = p
• Pat is a woman = q
• Pat is a man or woman = p v q = true
• Pat is not a woman = ¬q = true
• Inference: p is true
• Pat is a man or woman
• Pat is not a woman.
• Inference: Pat is a man 10
Knowledge Base Vs Database
• Knowledge base: more rules; Database: fewer rules
• Knowledge base: fewer facts; Database: more facts
• Knowledge base: explicit rules and facts; Database: explicit facts and implicit rules
• Knowledge base: updated by experts; Database: updated by clerks
• Knowledge base: main-memory based; Database: disk based
11
• AI Programs – exhibit intelligent behavior by skillful application of heuristics.
• KBS – make domain knowledge explicit.
• Expert Systems – apply expert knowledge to difficult, real-world problems.
12
Knowledge Representation
Techniques
• English or Natural Language
• Tables and Rules
• Logic (Propositional logic, Predicate logic)
• Semantic Networks
• Frames
• Conceptual Dependency
• Scripts
• Ontology 13
LOGIC
• First Order Logic
– Predicate Logic
– Propositional Logic
• Higher Order Logics
– Situational Logic
– Fuzzy Logic
– Temporal Logic
– Modal Logic
– Epistemic Logic
14
Searching
• Depth First (Missionaries and Cannibals)
• Breadth First (Water Jug Problem)
• Hill Climbing ( 8 – Puzzle)
• Best First
• A* Algorithm
• AO* Algorithm
• Mini-Max Algorithm
15
REASONING METHODS
• Reasoning By Analogy – Frames.
• Temporal Reasoning – Higher order logic
(Temporal Logic).
• Fuzzy Reasoning – Higher Order Logic
(Fuzzy Logic).
• Non-monotonic Reasoning – Higher Order
Logics (Non-monotonic Logic).
• Reasoning Agents – Epistemic Logic.
16
AI and ML
• Roughly speaking, AI and ML are good ways
to ask a computer to provide an answer to a
problem based on some past experience.
(Prediction, Learning, Explanation and Finding
Temporal Dependencies)
• It might be challenging to tell a computer what
a cat is, for instance. ( Computers don’t have
common sense – General Problem Solver-
Human Intelligence).
17
AI and ML
• Still, if you show a neural network enough images
of cats and tell it they are cats, then the computer
will be able to correctly identify other cats that it did
not see before.
• It appears that some of the most prominent and
widely used AI and ML algorithms can be sped
up significantly if they are run on quantum
computers (examples: Bayesian networks, graph
search algorithms, shortest-path algorithms,
heuristic search algorithms, and swarm
intelligence algorithms).
18
Learning Methods
• Learning from examples
• Winston’s Program
• Explanation based Learning
• Learning by Observation
• Knowledge Acquisition from experts
19
Machine Learning Methods
• Un-Supervised Learning
• Clustering
• K-means clustering
• Supervised learning
• Classification Algorithms
20
K-means Clustering
• Strengths
– Simple iterative method
– User provides “K”
• Weaknesses
– Often too simple → bad results
– Difficult to guess the correct “K”
21
K-means Clustering
Basic Algorithm:
• Step 0: Select K
• Step 1: Randomly select initial cluster seeds
(Figure: two initial seeds, Seed 1 = 650 and Seed 2 = 200)
22
K-means Clustering
• An initial cluster seed represents the “mean
value” of its cluster.
• In the preceding figure:
– Cluster seed 1 = 650
– Cluster seed 2 = 200
23
K-means Clustering
• Step 2: Calculate distance from each object
to each cluster seed.
• What type of distance should we use?
– Squared Euclidean distance
• Step 3: Assign each object to the closest
cluster
24
K-means Clustering
(Figure: each object assigned to the nearest of Seed 1 and Seed 2)
25
K-means Clustering
• Iterate:
– Calculate distance from objects to cluster
centroids.
– Assign objects to closest cluster
– Recalculate new centroids
• Stop based on convergence criteria
– No change in clusters
– Max iterations
26
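A minimal sketch of the loop just described, assuming NumPy is available; the one-dimensional points are invented for illustration, while the two seeds (650 and 200) follow the preceding figure, and the stopping rule is "no change in clusters".

```python
import numpy as np

def kmeans_1d(points, seeds, max_iter=100):
    """Basic K-means on 1-D data: assign each point to the nearest centroid, recompute, repeat."""
    centroids = np.array(seeds, dtype=float)
    assignment = None
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every centroid
        dist = (points[:, None] - centroids[None, :]) ** 2
        new_assignment = dist.argmin(axis=1)
        if assignment is not None and np.array_equal(assignment, new_assignment):
            break  # no change in clusters -> converged
        assignment = new_assignment
        # Recompute each centroid as the mean of its assigned points
        for k in range(len(centroids)):
            members = points[assignment == k]
            if members.size:
                centroids[k] = members.mean()
    return centroids, assignment

# Hypothetical 1-D data, clustered around the two seeds from the figure
data = np.array([180.0, 210.0, 240.0, 600.0, 640.0, 700.0])
centroids, labels = kmeans_1d(data, seeds=[650, 200])
print(centroids, labels)
```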
Regression
• The regression task comes from Supervised
machine learning.
• It helps us predict (estimate continuous
values) and explain a target quantity based on a
given set of numerical and categorical data.
• For example, we can predict the house
prices based on the house attributes such as
number of rooms, size, and location.
27
Regression
• In mathematical terms, linear regression models
a dataset with a straight line of the form
Y = mX + c.
• Here we take the X (independent variable) and
Y (dependent variable) data points to train the
linear regression model. The best-fitting line is
found by calculating the slope (m) and
y-intercept (c) values. 28
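A short sketch of fitting Y = mX + c by ordinary least squares; the house-size/price numbers below are made up purely for illustration.

```python
import numpy as np

# Hypothetical training data: house size (X, independent) vs. price (Y, dependent)
X = np.array([50.0, 70.0, 90.0, 110.0, 130.0])    # size in square metres
Y = np.array([150.0, 200.0, 240.0, 300.0, 340.0])  # price in thousands

# Least-squares estimates of slope m and intercept c for Y = mX + c
m = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
c = Y.mean() - m * X.mean()

print(f"m = {m:.3f}, c = {c:.3f}")
print("Predicted price for a 100 m^2 house:", m * 100 + c)
```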
Applications
• Risk assessment – Insurance, Banking
• Score prediction – Cricket, Elections
• Market forecasting – Share Market
• Weather forecasting
• Housing and product price prediction
• Analysing engine performance in
Automobiles.
29
Regression Analysis
• Regression analysis is performed with
several algorithms, namely:
• Simple linear regression
• Multiple linear regression
• Decision trees
• Random forest
• Support Vector Machines (SVM)
30
Decision Trees
31
Name Debt Income Married? Risk
--------------------------------------------------------
Joe High High Yes Good
Sue Low High Yes Good
John Low High No Poor
Mary High Low Yes Poor
Fred Low Low Yes Poor
32
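A hedged sketch of training a decision tree on the small risk table above, assuming scikit-learn is available; the categorical values are encoded by hand (High/Yes = 1, Low/No = 0) just for this illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Encode the table: Debt, Income, Married (High/Yes = 1, Low/No = 0)
X = [[1, 1, 1],  # Joe
     [0, 1, 1],  # Sue
     [0, 1, 0],  # John
     [1, 0, 1],  # Mary
     [0, 0, 1]]  # Fred
y = ["Good", "Good", "Poor", "Poor", "Poor"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["Debt", "Income", "Married"]))
```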
Decision Tree Classification
• Example: (Figure: a decision tree that splits on Income = High / Income = Low, dividing the data into partitions D1 and D2.)
33
Decision Trees Classification (cont.)
• Example: (Figure: partition D1 is split further into D1a and D1b, giving a two-level decision tree whose first split is Income = High / Income = Low.)
Random Forest
• Random forest (or random forests) is an
ensemble classifier that consists of many
decision trees and outputs the class that is
the mode of the classes output by the individual
trees. It combines "bagging" with the
random selection of features.
• For many data sets, it produces more
accurate classification results than a single
decision tree. 34
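A brief scikit-learn sketch of the idea, reusing the same toy risk table as the decision-tree example above; each tree is trained on a bootstrap sample with a random subset of features considered at every split, and the forest reports the majority (mode) vote.

```python
from sklearn.ensemble import RandomForestClassifier

# Same toy encoding as before: Debt, Income, Married (High/Yes = 1, Low/No = 0)
X = [[1, 1, 1], [0, 1, 1], [0, 1, 0], [1, 0, 1], [0, 0, 1]]
y = ["Good", "Good", "Poor", "Poor", "Poor"]

# 50 bagged trees, each considering a random subset of features at every split
forest = RandomForestClassifier(n_estimators=50, max_features="sqrt",
                                bootstrap=True, random_state=0).fit(X, y)
print(forest.predict([[1, 1, 0]]))  # majority vote over the individual trees
```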
Bagging
• Bagging, also known as bootstrap
aggregation, is the ensemble learning
method that is commonly used to
reduce variance within a noisy dataset.
• In bagging, a random sample of data in
a training set is selected with
replacement—meaning that the
individual data points can be chosen
more than once. 35
Regression Vs Classification
• The most significant difference between
regression and classification is that
regression predicts a continuous quantity,
while classification predicts discrete class
labels. There are also some overlaps between
the two types of machine learning algorithms.
36
Support Vector Machines
• It is a linear classifier.
• It classifies the dataset into two groups for
binary classification problems.
• The multi-class SVM classifies the data set
into multiple groups.
• It is a supervised learning method.
37
Linear Classifiers
• f(x, w, b) = sign(w · x + b), where a point x is assigned +1 when w · x + b > 0 and -1 when w · x + b < 0.
• (Figure: a set of points labelled +1 and -1.) How would you classify this data?
38
Linear Classifiers
• f(x, w, b) = sign(w · x + b)
• (Figure: the same labelled points.) How will you classify this data?
39
40
Maximum Margin
• f(x, w, b) = sign(w · x + b)
• The maximum margin linear classifier is the linear classifier with the maximum margin.
• This is the simplest kind of SVM (called an LSVM, a Linear SVM).
• Support vectors are those data points that the margin pushes up against.
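A small sketch, assuming scikit-learn, of fitting a maximum-margin linear classifier f(x, w, b) = sign(w · x + b) on toy linearly separable points; the data here is invented for illustration, and a large C approximates a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: class +1 in the upper-right, class -1 in the lower-left
X = np.array([[2.0, 2.0], [3.0, 3.0], [3.0, 2.5],
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

svm = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard-margin LSVM
w, b = svm.coef_[0], svm.intercept_[0]

print("w =", w, "b =", b)
print("Support vectors:", svm.support_vectors_)          # points the margin pushes against
print("sign(w.x + b) for [2.5, 2.5]:", np.sign(w @ [2.5, 2.5] + b))
```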
Probabilistic Models
• Uncertainty
• ABC Murder Story
• Bayesian Classification
• Neural Networks
• Feature Selection and Classification
• Deep Learning
41
Probabilistic Models
• In the probabilistic framework for machine
learning, learning can be thought of
as inferring plausible models to explain
observed data.
• A machine can use such models to make
predictions about future data, and take
decisions that are rational given these
predictions.
42
BAYESIAN CLASSIFICATION
• CONDITIONAL PROBABILITY
• BAYES THEOREM
• NAÏVE BAYES CLASSIFIER
• BELIEF NETWORK
• APPLICATION OF BAYESIAN NETWORK -
CYBER CRIME DETECTION
43
BAYESIAN CLASSIFICATION
• Probabilistic learning: calculates explicit
probabilities for hypotheses; this is among the most
practical approaches to certain types of
learning problems.
• Incremental: Each training example can
incrementally increase/decrease the probability that
a hypothesis is correct. Prior knowledge can be
combined with observed data.
44
BAYESIAN THEOREM
• A special case of Bayesian
Theorem:
P(A∩B) = P(B) x P(A|B)
P(B∩A) = P(A) x P(B|A)
Since P(A∩B) = P(B∩A),
P(B) x P(A|B) = P(A) x P(B|A)
=> P(A|B) = [P(A) x P(B|A)] / P(B)
Expanding P(B) with the law of total probability:
P(A|B) = P(B|A) P(A) / [P(B|A) P(A) + P(B|¬A) P(¬A)]
(Figure: overlapping events A and B.)
45
BAYESIAN THEOREM
• Example 1: A medical cancer diagnosis
problem
There are 2 possible outcomes of a diagnosis:
+ve and -ve. We know 0.8% of the world population has
cancer. The test gives a correct +ve result 98% of the
time and a correct -ve result 97% of the time.
If a patient’s test returns +ve, should we
diagnose the patient as having cancer?
46
BAYESIAN THEOREM
P(cancer) = .008 P(-cancer) = .992
P(+ve|cancer) = .98 P(-ve|cancer) = .02
P(+ve|-cancer) = .03 P(-ve|-cancer) = .97
Using Bayes' formula:
P(cancer|+ve) = P(+ve|cancer) x P(cancer) / P(+ve) = (0.98 x 0.008) / P(+ve) = 0.0078 / P(+ve)
P(-cancer|+ve) = P(+ve|-cancer) x P(-cancer) / P(+ve) = (0.03 x 0.992) / P(+ve) = 0.0298 / P(+ve)
Since 0.0298 > 0.0078, the patient most likely does not have cancer.
47
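The same arithmetic as a small script; the denominator P(+ve) is expanded with the law of total probability so the two posteriors sum to 1.

```python
# Priors and test characteristics from the slide
p_cancer, p_no_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_no_cancer = 0.98, 0.03

# Law of total probability for the evidence P(+ve)
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_no_cancer * p_no_cancer

p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
p_no_cancer_given_pos = p_pos_given_no_cancer * p_no_cancer / p_pos

print(f"P(cancer|+ve)  = {p_cancer_given_pos:.3f}")    # ~0.21
print(f"P(-cancer|+ve) = {p_no_cancer_given_pos:.3f}")  # ~0.79
```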
NAÏVE BAYES CLASSIFIER
• A simplified assumption: attributes are
conditionally independent.
• Greatly reduces the computation cost, only
count the class distribution.
48
NAÏVE BAYES CLASSIFIER
The probabilistic model of the NBC is to find the probability of a
certain class given multiple disjoint (assumed) events.
The naïve Bayes classifier applies to learning tasks where
each instance x is described by a conjunction of attribute
values and where the target function f(x) can take on any
value from some finite set V.
A set of training examples of the target function is provided,
and a new instance is presented, described by the tuple
of attribute values <a1,a2,…,an>. The learner is asked to
predict the target value, or classification, for this new
instance.
49
NAÏVE BAYES CLASSIFIER
Abstractly, the probability model for a classifier is a
conditional model
P(C|F1,F2,…,Fn)
over a dependent class variable C with a small
number of outcomes (classes), conditioned on
several feature variables F1,…,Fn.
Naïve Bayes decision rule:
class = argmax_c [P(C=c) x P(F1|C=c) x P(F2|C=c)
x…x P(Fn|C=c)] / P(F1,F2,…,Fn)
Since P(F1,F2,…,Fn) is common to all classes, we
need not evaluate the denominator for comparisons.
50
NAÏVE BAYES CLASSIFIER
Tennis example
(Table: the PlayTennis training data with attributes Outlook, Temperature, Humidity and Wind, and the class PlayTennis — 9 "yes" and 5 "no" examples.)
51
NAÏVE BAYES CLASSIFIER
• Problem:
Use training data from above to classify the
following instances:
a) <Outlook=sunny, Temperature=cool,
Humidity=high, Wind=strong>
b) <Outlook=overcast, Temperature=cool,
Humidity=high, Wind=strong>
52
NAÏVE BAYES CLASSIFIER
Answer to (a):
P(PlayTennis=yes) = 9/14 = 0.64
P(PlayTennis=no) = 5/14 = 0.36
P(Outlook=sunny|PlayTennis=yes) = 2/9 = 0.22
P(Outlook=sunny|PlayTennis=no) = 3/5 = 0.60
P(Temperature=cool|PlayTennis=yes) = 3/9 = 0.33
P(Temperature=cool|PlayTennis=no) = 1/5 = .20
P(Humidity=high|PlayTennis=yes) = 3/9 = 0.33
P(Humidity=high|PlayTennis=no) = 4/5 = 0.80
P(Wind=strong|PlayTennis=yes) = 3/9 = 0.33
P(Wind=strong|PlayTennis=no) = 3/5 = 0.60
53
NAÏVE BAYES CLASSIFIER
P(yes)xP(sunny|yes)xP(cool|yes)xP(high|yes) x
P(strong|yes) = 0.0053
P(no)xP(sunny|no)xP(cool|no)xP(high|no) x
P(strong|no) = 0.0206
So the class for this instance is ‘no’. We can
normalize the probability:
[0.0206]/[0.0206+0.0053] = 0.795
54
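A few lines that reproduce the computation for instance (a) from the conditional probabilities listed above, then normalize the two scores.

```python
# Conditional probabilities taken directly from the slide (instance a)
p_yes = 9/14 * 2/9 * 3/9 * 3/9 * 3/9   # P(yes)*P(sunny|yes)*P(cool|yes)*P(high|yes)*P(strong|yes)
p_no  = 5/14 * 3/5 * 1/5 * 4/5 * 3/5   # P(no) *P(sunny|no) *P(cool|no) *P(high|no) *P(strong|no)

print(f"score(yes) = {p_yes:.4f}")                          # ~0.0053
print(f"score(no)  = {p_no:.4f}")                           # ~0.0206
print(f"P(no | instance a) ~ {p_no / (p_yes + p_no):.3f}")  # ~0.795
```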
NAÏVE BAYES CLASSIFIER
Answer to (b):
P(PlayTennis=yes) = 9/14 = 0.64
P(PlayTennis=no) = 5/14 = 0.36
P(Outlook=overcast|PlayTennis=yes) = 4/9 = 0.44
P(Outlook=overcast|PlayTennis=no) = 0/5 = 0
P(Temperature=cool|PlayTennis=yes) = 3/9 = 0.33
P(Temperature=cool|PlayTennis=no) = 1/5 = .20
P(Humidity=high|PlayTennis=yes) = 3/9 = 0.33
P(Humidity=high|PlayTennis=no) = 4/5 = 0.80
P(Wind=strong|PlayTennis=yes) = 3/9 = 0.33
P(Wind=strong|PlayTennis=no) = 3/5 = 0.60
55
NAÏVE BAYES CLASSIFIER
Estimating Probabilities:
In the previous example, P(overcast|no) = 0 which
causes the formula-
P(no)xP(overcast|no)xP(cool|no)xP(high|no) x
P(strong|no) = 0.0
This causes problems in comparing because the
other probabilities are not considered. We can
avoid this difficulty by using m-estimate.
56
NAÏVE BAYES CLASSIFIER
M-Estimate Formula:
[c + k] / [n + m] where c/n is the original
probability used before, k=1 and
m= equivalent sample size.
Using this method, our new values of
probability are given below.
57
NAÏVE BAYES CLASSIFIER
New answer to (b):
P(PlayTennis=yes) = 10/16 = 0.63
P(PlayTennis=no) = 6/16 = 0.37
P(Outlook=overcast|PlayTennis=yes) = 5/12 = 0.42
P(Outlook=overcast|PlayTennis=no) = 1/8 = .13
P(Temperature=cool|PlayTennis=yes) = 4/12 = 0.33
P(Temperature=cool|PlayTennis=no) = 2/8 = .25
P(Humidity=high|PlayTennis=yes) = 4/11 = 0.36
P(Humidity=high|PlayTennis=no) = 5/7 = 0.71
P(Wind=strong|PlayTennis=yes) = 4/11 = 0.36
P(Wind=strong|PlayTennis=no) = 4/7 = 0.57
58
NAÏVE BAYES CLASSIFIER
P(yes)xP(overcast|yes)xP(cool|yes)xP(high|yes)xP(strong|yes) = 0.011
P(no)xP(overcast|no)xP(cool|no)xP(high|no)xP(strong|no) = 0.00486
So the class of this instance is ‘yes’.
59
NAÏVE BAYES CLASSIFIER
• The conditional probability values of all the
attributes with respect to the class are
pre-computed and stored on disk.
• This prevents the classifier from computing
the conditional probabilities every time it
runs.
• This stored data can be reused to reduce the
latency of the classifier. 60
Bayesian Belief Networks
• In Naïve Bayes Classifier we make the
assumption of class conditional
independence, that is given the class label
of a sample, the value of the attributes are
conditionally independent of one another.
• However, there can be dependencies
between the values of attributes. To handle
this, we use a Bayesian Belief Network,
which provides the joint conditional
probability distribution. 61
Bayesian Belief Networks
• A Bayesian network is a form of probabilistic
graphical model.
• Specifically, a Bayesian network is a directed acyclic
graph of nodes representing variables and arcs
representing dependence relations among the
variables.
• They provide a graphical method for getting the inferred
results through joint probabilities.
62
63
BAYESIAN BELIEF NETWORK
(Figure: the Cloudy / Sprinkler / Rain / Wet Grass network with its conditional probability tables.)
64
BELIEF NETWORKS
• By the chain rule of probability, the joint
probability of all the nodes in the graph above is:
P(C, S, R, W) = P(C) * P(S|C) * P(R|C) * P(W|S,R)
W=Wet Grass, C=Cloudy, R=Rain, S=Sprinkler
Example: P(W∩-R∩S∩C)
= P(W|S,-R)*P(-R|C)*P(S|C)*P(C)
= 0.9*0.2*0.1*0.5 = 0.009
65
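A small sketch that evaluates the factored joint P(C,S,R,W) = P(C)·P(S|C)·P(R|C)·P(W|S,R) for the event in the example. The figure is not reproduced here, so the full conditional probability tables are assumed (standard sprinkler-network values); the four entries the example actually uses — P(C)=0.5, P(S|C)=0.1, P(R|C)=0.8, P(W|S,¬R)=0.9 — match the numbers on this slide.

```python
# Assumed CPTs for the classic Cloudy/Sprinkler/Rain/WetGrass network
P_C = {True: 0.5, False: 0.5}
P_S_given_C = {True: 0.1, False: 0.5}          # P(S=true | C)
P_R_given_C = {True: 0.8, False: 0.2}          # P(R=true | C)
P_W_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.9, (False, False): 0.0}  # P(W=true | S, R)

def joint(c, s, r, w):
    """P(C=c, S=s, R=r, W=w) via the chain rule over the DAG."""
    pc = P_C[c]
    ps = P_S_given_C[c] if s else 1 - P_S_given_C[c]
    pr = P_R_given_C[c] if r else 1 - P_R_given_C[c]
    pw = P_W_given_SR[(s, r)] if w else 1 - P_W_given_SR[(s, r)]
    return pc * ps * pr * pw

# The slide's example: P(W, not R, S, C) = 0.9 * 0.2 * 0.1 * 0.5 = 0.009
print(joint(c=True, s=True, r=False, w=True))
```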
BAYESIAN BELIEF NETWORK
What is the probability of wet grass on a given
day - P(W)?
P(W) = P(W|SR) * P(S) * P(R) +
P(W|S-R) * P(S) * P(-R) +
P(W|-SR) * P(-S) * P(R) +
P(W|-S-R) * P(-S) * P(-R)
Here P(S) = P(S|C) * P(C) + P(S|-C) * P(-C)
P(R) = P(R|C) * P(C) + P(R|-C) * P(-C)
P(W)= 0.5985
66
Advantages of Bayesian Approach
• Bayesian networks can readily handle
incomplete data sets.
• Bayesian networks allow one to learn
about causal relationships
• Bayesian networks readily facilitate use of
prior knowledge.
68
ML Resources (Books)
1. Stephen Marsland, “Machine Learning –
An Algorithmic Perspective”, Second
Edition, Chapman and Hall/CRC Machine
Learning and Pattern Recognition Series,
2014.
2. Tom M. Mitchell, “Machine Learning”,
First Edition, McGraw Hill Education,
2013.
69
ML Resources (Books)
3. Nilsson, N. (2004). Introduction to Machine
Learning.
http://robotics.stanford.edu/people/nilsson/
mlbook.html.
4. Russell, S. (1997). Machine Learning.
Handbook of Perception and Cognition,
Vol. 14, Chap. 4.
5. Ethem Alpaydin, “Introduction to Machine
Learning”, (Adaptive Computation and
Machine Learning Series), Third Edition,
MIT Press, 2014. 70
Journals - IEEE
• IEEE Transactions on Neural Networks.
• IEEE Transactions on Pattern Analysis
and Machine Intelligence.
• IEEE Transactions on Neural Networks and
Learning Systems.
• IEEE Transactions on Artificial Intelligence
• IEEE Transactions on Knowledge and Data
Engineering.
71
ML Journals - Elsevier
• Machine Learning with Applications
• Expert Systems With Applications
• Applied Soft Computing
• Knowledge-based Systems
• Neural Networks
• Data & Knowledge Engineering
• Artificial Intelligence
72
Neural Networks
• Biologically motivated approach to machine learning.
• Similarity with a biological network: the fundamental processing element of a neural network is a neuron, which
1. receives inputs from other sources,
2. combines them in some way,
3. performs a generally nonlinear operation on the result, and
4. outputs the final result.
73
Similarity with Biological Network
• Fundamental processing element of a
neural network is a neuron
• A human brain has 100 billion neurons
• An ant brain has 250,000 neurons 74
Neural Network
• Neural Network is a set of connected
INPUT/OUTPUT UNITS, where each connection
has a WEIGHT associated with it.
• Neural Network learning is also called CONNECTIONIST
learning due to the connections between units.
• It is a case of SUPERVISED (classification)
learning.
75
Neural Network
• Neural Network learns by adjusting the weights so
as to be able to correctly classify the training data
and hence, after testing phase, to classify unknown
data.
• A Neural Network needs a long time for training.
• Neural Network has a high tolerance to noisy and
incomplete data.
76
Neural Network Classifier
• Input: Classification data
It contains classification attribute
• Data is divided, as in any classification problem.
[Training data and Testing data]
• All data must be normalized
(i.e. all attribute values in the database are rescaled to the interval [0,1] or [-1,1]).
A Neural Network can work with data in the range (0,1) or (-1,1).
77
One Neuron as a Network
– The neuron receives the weighted sum as input and calculates the
output as a function of input as follows :
• y = f(x) , where f(x) is defined as
• f(x) = 0 { when x< 0.5 } and f(x) = 1 {
when x >= 0.5 }
• For example, if x is 0.55, then y = 1; the
input values are classified into class 1.
• If x = 0.45, f(x) = 0; the input values
are classified into class 0.
78
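A one-neuron sketch of the rule just described: a weighted sum of the inputs followed by the 0.5 threshold function. The input and weight values are chosen only so the weighted sums come out to 0.55 and 0.45 as in the text.

```python
def step(x, threshold=0.5):
    """Threshold activation: f(x) = 1 if x >= 0.5 else 0."""
    return 1 if x >= threshold else 0

def neuron(inputs, weights):
    """Weighted sum of the inputs, then the threshold function."""
    x = sum(i * w for i, w in zip(inputs, weights))
    return step(x)

print(neuron([0.5, 0.6], [0.5, 0.5]))  # x = 0.55 -> class 1
print(neuron([0.5, 0.4], [0.5, 0.5]))  # x = 0.45 -> class 0
```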
Bias of a Neuron
• We need a bias value added to the weighted sum
∑ wi xi so that the decision boundary can be shifted away from the origin:
v = ∑ wi xi + b, where b is the bias.
(Figure: decision lines x1 - x2 = -1, x1 - x2 = 0 and x1 - x2 = 1 in the (x1, x2) plane.)
79
Bias as extra input
(Figure: input attribute values x1, …, xm with weights w1, …, wm, plus an extra input x0 = +1 with weight w0 = b, feed a summing function followed by an activation function φ(v) that produces the output class y.)
v = ∑_{j=0}^{m} wj xj, with x0 = +1 and w0 = b
80
Neuron with Activation
• The neuron is the basic information processing unit of a NN. It
consists of:
1. A set of links, describing the neuron inputs, with weights w1,
w2, …, wm.
2. An adder function (linear combiner) for computing the
weighted sum of the inputs (real numbers): u = ∑_{j=1}^{m} wj xj
3. An activation function for limiting the amplitude of the
neuron output: y = φ(u + b)
81
A Multilayer Feed-Forward
Neural Network
(Figure: input nodes receive the input record xi; hidden nodes produce outputs Oj and output nodes produce outputs Ok, which form the output class. The weights are wij and wjk, and the network is fully connected.)
82
Neural Network Learning
• The inputs are fed simultaneously into the input
layer.
• The weighted outputs of these units are fed into
hidden layer.
• The weighted outputs of the last hidden layer are
inputs to units making up the output layer.
83
A Multilayer Feed Forward Network
• The units in the hidden layers and output layer are
sometimes referred to as neurodes, due to their symbolic
biological basis, or as output units.
• A network containing two hidden layers is called a three-
layer neural network, and so on.
• The network is feed-forward in that none of the weights
cycles back to an input unit or to an output unit of a
previous layer.
84
A Multilayered Feed – Forward Network
• INPUT: records without class attribute with normalized
attributes values.
• INPUT VECTOR: X = { x1, x2, …. xn}
where n is the number of (non class) attributes.
• INPUT LAYER – there are as many nodes as non-class
attributes i.e. as the length of the input vector.
• HIDDEN LAYER – the number of nodes in the hidden
layer and the number of hidden layers depends on
implementation.
85
A Multilayered Feed–Forward
Network
• OUTPUT LAYER – corresponds to the
class attribute.
• There are as many nodes as classes (values
of the class attribute): Ok, k = 1, 2, …, #classes.
• Network is fully connected, i.e. each unit provides input
to each unit in the next forward layer.
86
Classification by Back propagation
• Back Propagation learns by iteratively processing a
set of training data (samples).
• For each sample, weights are modified to
minimize the error between network’s
classification and actual classification.
87
Steps in Back propagation
Algorithm
• STEP ONE: initialize the weights and biases.
• The weights in the network are initialized to random
numbers from the interval [-1,1].
• Each unit has a BIAS associated with it
• The biases are similarly initialized to random
numbers from the interval [-1,1].
• STEP TWO: feed the training sample.
88
Steps in Back propagation Algorithm
( cont..)
• STEP THREE: Propagate the inputs forward; we
compute the net input and output of each unit in
the hidden and output layers.
• STEP FOUR: back propagate the error.
• STEP FIVE: update weights and biases to reflect
the propagated errors.
• STEP SIX: terminating conditions.
89
Propagation through Hidden Layer
( One Node )
• The inputs to unit j are outputs from the previous layer. These are
multiplied by their corresponding weights in order to form a weighted
sum, which is added to the bias associated with unit j.
• A nonlinear activation function f is applied to the net input.
(Figure: an input vector x = (x0, x1, …, xn) is multiplied by the weight vector w = (w0j, w1j, …, wnj); the weighted sum plus the bias of unit j is formed, and the nonlinear activation function f produces the output y.)
90
Propagate the inputs forward
• For unit j in the input layer, its output is
equal to its input, that is, Oj = Ij for input unit j.
• The net input to each unit in the hidden and output layers is
computed as follows. Given a unit j in a hidden or output layer, the net input is
Ij = ∑_i wij Oi + θj
where wij is the weight of the connection from unit i in the
previous layer to unit j, Oi is the output of unit i from the
previous layer, and θj is the bias of the unit.
91
Propagate the inputs forward
• Each unit in the hidden and output layers takes its net
input and then applies an activation function. The
function symbolizes the activation of the neuron
represented by the unit. It is also called a logistic,
sigmoid, or squashing function.
• Given a net input Ij to unit j, the output of unit j is computed as
Oj = f(Ij) = 1 / (1 + e^(-Ij))
92
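A small NumPy sketch of this forward step: the net input Ij = Σ_i wij Oi + θj followed by the sigmoid Oj = 1 / (1 + e^(-Ij)). The weights and biases here reuse the worked example that appears a few slides later, so the two hidden outputs come out to 0.332 and 0.525.

```python
import numpy as np

def sigmoid(I):
    """Logistic (squashing) activation O = 1 / (1 + e^-I)."""
    return 1.0 / (1.0 + np.exp(-I))

def forward_layer(O_prev, W, theta):
    """Net input I_j = sum_i w_ij * O_i + theta_j, then the sigmoid output."""
    I = O_prev @ W + theta
    return sigmoid(I)

# One hidden layer with 2 units fed by 3 inputs
O_input = np.array([1.0, 0.0, 1.0])
W = np.array([[0.2, -0.3],
              [0.4,  0.1],
              [-0.5, 0.2]])
theta = np.array([-0.4, 0.2])
print(forward_layer(O_input, W, theta))   # outputs of the two hidden units
```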
Back propagate the error
• When reaching the Output layer, the error is
computed and propagated backwards.
• For a unit k in the output layer the error is computed
by the formula:
Errk = Ok (1 - Ok) (Tk - Ok)
where Ok is the actual output of unit k (computed by the activation
function Ok = 1 / (1 + e^(-Ik))), Tk is the true output based on the known
class label of the training sample, and Ok (1 - Ok) is the derivative
(rate of change) of the activation function.
93
Back propagate the error
• The error is propagated backwards by updating
weights and biases to reflect the error of the network
classification .
• For a unit j in the hidden layer the error is computed
by the formula:
Errj = Oj (1 - Oj) ∑_k Errk wjk
where wjk is the weight of the connection from unit j to unit k in
the next higher layer, and Errk is the error of unit k.
94
Update weights and biases
• Weights are updated by the following equations,
where l is a constant between 0.0 and 1.0 reflecting
the learning rate; this learning rate is fixed for the
implementation:
Δwij = (l) Errj Oi
wij = wij + Δwij
• Biases are updated by the following equations:
Δθj = (l) Errj
θj = θj + Δθj
95
Update weights and biases
• We are updating weights and biases after the presentation
of each sample. This is called case updating.
• Epoch: one iteration through the training set is called an epoch.
• Epoch updating: alternatively, the weight and bias increments could be
accumulated in variables and the weights and biases updated after
all of the samples of the training set have been presented.
• Case updating is more accurate.
96
Terminating Conditions
• Training stops when:
• all Δwij in the previous epoch are below some threshold, or
• the percentage of samples misclassified in the previous epoch
is below some threshold, or
• a pre-specified number of epochs has expired.
• In practice, several hundreds of thousands of epochs may be
required before the weights converge.
97
Backpropagation Formulas
(Figure: the input vector xi enters the input nodes; weights wij connect them to the hidden nodes and further weights connect the hidden nodes to the output nodes, producing the output vector.)
Ij = ∑_i wij Oi + θj
Oj = 1 / (1 + e^(-Ij))
Errk = Ok (1 - Ok) (Tk - Ok)
Errj = Oj (1 - Oj) ∑_k Errk wjk
wij = wij + (l) Errj Oi
θj = θj + (l) Errj
98
Example of Back propagation
Initial input and weights (initialize weights with random numbers from -1.0 to 1.0; input = 3, hidden neurons = 2, output = 1):
x1 x2 x3 w14 w15 w24 w25 w34 w35 w46 w56
1 0 1 0.2 -0.3 0.4 0.1 -0.5 0.2 -0.3 -0.2
99
Example ( cont.. )
• Bias added to hidden and output nodes.
• Initialize biases with random values from -1.0 to 1.0:
θ4 = -0.4, θ5 = 0.2, θ6 = 0.1
100
Net Input and Output Calculation
Unit 4: I4 = 0.2 + 0 - 0.5 - 0.4 = -0.7, O4 = 1 / (1 + e^0.7) = 0.332
Unit 5: I5 = -0.3 + 0 + 0.2 + 0.2 = 0.1, O5 = 1 / (1 + e^-0.1) = 0.525
Unit 6: I6 = (-0.3)(0.332) - (0.2)(0.525) + 0.1 = -0.105, O6 = 1 / (1 + e^0.105) = 0.475
101
Calculation of Error at Each Node
Unit 6: Err6 = 0.475 (1 - 0.475)(1 - 0.475) = 0.1311 (we assume the target T6 = 1)
Unit 5: Err5 = 0.525 (1 - 0.525)(0.1311)(-0.2) = -0.0065
Unit 4: Err4 = 0.332 (1 - 0.332)(0.1311)(-0.3) = -0.0087
102
Calculation of Weight and Bias Updating (learning rate l = 0.9)
w46: -0.3 + 0.9 (0.1311)(0.332) = -0.261
w56: -0.2 + 0.9 (0.1311)(0.525) = -0.138
w14: 0.2 + 0.9 (-0.0087)(1) = 0.192
w15: -0.3 + 0.9 (-0.0065)(1) = -0.306
θ6: 0.1 + 0.9 (0.1311) = 0.218
(the remaining weights and biases are updated similarly)
103
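A compact NumPy sketch that reproduces the whole worked example: one forward pass, the output and hidden errors, and one case update with learning rate 0.9. Small differences in the last digit come from rounding in the slides.

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))
l = 0.9                                    # learning rate
x = np.array([1.0, 0.0, 1.0]); T6 = 1.0    # training sample and target

W_ih = np.array([[0.2, -0.3], [0.4, 0.1], [-0.5, 0.2]])  # w14,w15 / w24,w25 / w34,w35
W_ho = np.array([-0.3, -0.2])                             # w46, w56
theta_h = np.array([-0.4, 0.2]); theta_o = 0.1            # biases of units 4, 5 and 6

# Forward pass
O_h = sigmoid(x @ W_ih + theta_h)          # O4 = 0.332, O5 = 0.525
O6 = sigmoid(O_h @ W_ho + theta_o)         # 0.475

# Back-propagate the error
Err6 = O6 * (1 - O6) * (T6 - O6)           # 0.1311
Err_h = O_h * (1 - O_h) * Err6 * W_ho      # Err4 = -0.0087, Err5 = -0.0065

# Case update of weights and biases
W_ho += l * Err6 * O_h                     # w46 -> -0.261, w56 -> -0.138
W_ih += l * np.outer(x, Err_h)             # w14 -> 0.192, w15 -> -0.306, ...
theta_o += l * Err6                        # theta6 -> 0.218
theta_h += l * Err_h

print(O_h, O6, Err6, Err_h)
print(W_ho, theta_o)
```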
DEEP LEARNING
• Deep learning is a subset of ML in
AI that has networks capable of
learning unsupervised from data that is
unstructured or unlabeled.
• Deep Learning is a subfield of Machine
Learning that involves the use of neural
networks to model and solve complex
problems.
104
DEEP LEARNING
• The key characteristic of Deep Learning is the use
of deep neural networks, which have multiple
layers of interconnected nodes.
• These networks can learn complex
representations of data by discovering hierarchical
patterns and features in the data.
• Deep Learning algorithms can automatically learn
and improve from data without the need for
manual feature engineering.
105
DEEP LEARNING
• Deep Learning has achieved significant success in
various fields, including image recognition,
natural language processing, speech recognition,
and recommendation systems.
• Some of the popular Deep Learning architectures
include Convolutional Neural Networks (CNNs),
Recurrent Neural Networks (RNNs), and Deep
Belief Networks (DBNs).
106
DEEP LEARNING
• Training deep neural networks typically
requires a large amount of data and
computational resources.
• However, the availability of cloud
computing and the development of
specialized hardware, such as Graphics
Processing Units (GPUs), has made it easier
to train deep neural networks.
107
Convolutional Neural Networks
• A Convolutional Neural Network (CNN) is
a type of deep learning algorithm that is
particularly well-suited for image
recognition and processing tasks.
• It is made up of multiple layers, including
convolutional layers, pooling layers, and
fully connected layers.
108
Convolutional Neural Networks
• The convolutional layers are the key component of a CNN,
where filters are applied to the input image to extract
features such as edges, textures, and shapes.
• The output of the convolutional layers is then passed
through pooling layers, which are used to down-sample the
feature maps, reducing the spatial dimensions while
retaining the most important information.
• The output of the pooling layers is then passed through one
or more fully connected layers, which are used to make a
prediction or classify the image.
109
CNN
• There are certain steps/operations involved
in a CNN. These can be categorized as
follows:
• Convolution operation
• Pooling
• Flattening
• Fully connected layers
110
CNN
111
CNN
• The convolution operation is the first and one
of the most important steps in the
functioning of a CNN. It focuses on
extracting/preserving important features from
the input (an image, etc.).
112
CNN
• To understand this operation, let us consider
an image as input to our CNN.
• When an image is given as input, it is
in the form of a matrix of pixels.
• If the image is grayscale, the image is
a single matrix, where each value
ranges from 0 to 255.
113
CNN
• We can even normalize these values, say to the
range 0-1, where 0 represents black and 1
represents white.
• If the image is colored, there are three matrices
representing the R, G and B channels, each
value in the range 0-255. The same can be seen
in the images below:
114
Fig 1: Colored image matrices
115
Fig 2: Grayscale image matrix
116
Convolution Operation
• MATHEMATICAL OPERATION
Coming to convolution operation, let us
consider an input image. Now for
convolution operation, filters or kernels are
used.
117
Convolution Operation
• The following mathematical operation is
performed:
Let the size of the image be N x N and the size of the filter be F x F.
Then (N x N) * (F x F) = (N - F + 1) x (N - F + 1),
where * is the convolution operation.
All these kernels, input channels, etc. are hyper-parameters. The result of each layer is passed on
to the next one.
118
Example
119
Example
• Here, an input of size 6x6 is given and a kernel
of size 3x3 is used. The feature map
obtained is of size 4x4.
To increase non-linearity in the image, a
rectifier function can be applied to the
feature map.
Finally, after the convolution step is
completed and the feature map is obtained, this
map is given as input to the pooling layer. 120
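A short sketch of the valid convolution just described: a 6x6 input and a 3x3 kernel produce a (6-3+1) x (6-3+1) = 4x4 feature map, followed by a rectifier (ReLU) as mentioned above. The input and kernel values here are arbitrary, chosen only to make the shapes concrete.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid 2-D convolution (cross-correlation, as used in CNNs): output is (N-F+1) x (N-F+1)."""
    N, F = image.shape[0], kernel.shape[0]
    out = np.zeros((N - F + 1, N - F + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+F, j:j+F] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)                   # arbitrary 6x6 "pixels"
kernel = np.array([[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]])   # simple edge filter

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)                 # (4, 4)
relu = np.maximum(feature_map, 0)        # rectifier adds non-linearity
print(relu)
```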
CNN Architecture
• A common CNN model architecture is to
have a number of convolution and pooling
layers stacked one after the other.
121
CNN Architecture
122
Pooling Layers
• Pooling layers are used to reduce the
dimensions of the feature maps.
• Thus, it reduces the number of parameters
to learn and the amount of computation
performed in the network.
123
Pooling Layers
• The pooling layer summarizes the features present
in a region of the feature map generated by a
convolution layer.
• So, further operations are performed on
summarized features instead of precisely
positioned features generated by the convolution
layer.
• This makes the model more robust to variations in
the position of the features in the input image.
124
Max Pooling
• Types of pooling layers: max, min and
average pooling.
Max Pooling
• Max pooling is a pooling operation that selects the
maximum element from the region of the feature
map covered by the filter.
• Thus, the output after max-pooling layer would be
a feature map containing the most prominent
features of the previous feature map.
125
Max Pooling
126
Average Pooling
• Average pooling computes the average of
the elements present in the region of feature
map covered by the filter.
• Thus, while max pooling gives the most
prominent feature in a particular patch of
the feature map, average pooling gives the
average of features present in a patch.
127
Average Pooling
128
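A sketch of 2x2 max and average pooling with stride 2; the 4x4 feature map below is an arbitrary example, not taken from the figures.

```python
import numpy as np

def pool2d(fmap, size=2, stride=2, mode="max"):
    """Down-sample a feature map by taking the max (or mean) of each size x size window."""
    H, W = fmap.shape
    out_h, out_w = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = fmap[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fmap = np.array([[1., 3., 2., 4.],
                 [5., 6., 1., 2.],
                 [7., 2., 8., 1.],
                 [3., 4., 2., 9.]])
print(pool2d(fmap, mode="max"))       # most prominent feature per 2x2 patch
print(pool2d(fmap, mode="average"))   # average of the features per 2x2 patch
```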
Flattening
• Flattening is converting the data into a 1-
dimensional array for inputting it to the next
layer.
• We flatten the output of the convolutional
layers to create a single long feature vector.
• And it is connected to the final
classification model, which is called a fully-
connected layer.
129
CNN
• In other words, we put all the pixel data in
one line and make connections with the
final layer.
• And once again: what is the final layer for?
The classification, for example of ‘cats and dogs.’
130
Flattening
131
Recurrent Neural Network
• Recurrent Neural Network(RNN) is a type of Neural
Network where the output from the previous step is fed as
input to the current step.
• In traditional neural networks, all the inputs and outputs
are independent of each other.
• But in cases where it is required to predict the next word of
a sentence (as in NLP), the previous words are required.
• Hence, there is a need to remember the previous words.
132
Recurrent Neural Network
• Thus RNN came into existence, which
solved this issue with the help of a Hidden
Layer.
• The main and most important feature of
RNN is its Hidden state, which remembers
some information about a sequence.
133
Recurrent Neural Network
• The state is also referred to as Memory
State since it remembers the previous input
to the network.
• It uses the same parameters for each input
as it performs the same task on all the inputs
or hidden layers to produce the output.
• This reduces the complexity of parameters,
unlike other neural networks.
134
RNN
135
How RNN works
• The Recurrent Neural Network consists of
multiple fixed activation function units, one
for each time step.
• Each unit has an internal state which is
called the hidden state of the unit.
• This hidden state signifies the past
knowledge that the network currently holds
at a given time step.
136
RNN
• This hidden state is updated at every time
step to signify the change in the knowledge
of the network about the past.
• The hidden state is updated using the
following recurrence relation:-
137
RNN
• The formula for calculating the current state is
ht = f(ht-1, xt)
where:
ht -> current state
ht-1 -> previous state
xt -> input state
138
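A minimal sketch of that recurrence using the common tanh form h_t = tanh(W_xh x_t + W_hh h_{t-1} + b); the weight matrices and the toy input sequence are random placeholders, not values from the slides. Note that the same parameters are reused at every time step, as described below.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3

# Shared parameters, reused at every time step
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """h_t = f(h_{t-1}, x_t): update the hidden (memory) state from the previous state and input."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden_dim)                    # initial hidden state
sequence = rng.normal(size=(5, input_dim))  # a toy sequence of 5 inputs
for x_t in sequence:
    h = rnn_step(x_t, h)                    # same parameters at every step
print(h)                                    # final state, used to compute the output
```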
RNN
• Training through RNN
• A single-time step of the input is provided
to the network.
• Then it calculates its current state using a
set of current input and the previous state.
• The current ht becomes ht-1 for the next
time step.
139
RNN
• One can go through as many time steps as the
problem requires and join the information from
all the previous states.
• Once all the time steps are completed the
final current state is used to calculate the
output.
140
RNN
• The output is then compared to the actual
output, i.e., the target output, and the error is
generated.
• The error is then back-propagated through the
network to update the weights, and hence the
network (RNN) is trained
using Backpropagation Through Time.
141
Conclusions
• Machine Learning
• Supervised and Unsupervised Learning
• Neural Networks
• CNN
• RNN
142

Machine learning and deep learning algorithms

  • 1.
    Fundamentals of DeepLearning Dr. A.Kannan, Former Professor and Head, Department of Information Science and Technology, CEG Campus, Anna University, Chennai-25. Senior Professor, School of Computer Science and Engineering, VIT, Vellore-632014. . 1
  • 2.
    MACHINE LARNING • “Learningis making useful changes in our minds.” - Marvin Minsky. • “Machine Learning (ML) refers to a system which has the capability of autonomous knowledge acquisition and integration of the acquired knowledge.” 2
  • 3.
    MACHINE LARNING • Machinelearning is an application of Artificial Intelligence (AI) that provides the systems with the ability to automatically learn and improve by themselves from the experience gained by them without being explicitly programmed. • It focuses on the development of intelligent computer programs that can access the data and use it for learning by themselves. 3
  • 4.
    Applications of ML •Image Processing – Face Recognition, Hand written character recognition, Self driving Cars, Traffic Video analysis…. • Natural Language Processing - Social Network Analysis, Recommendation Systems and Sentiment Analysis. • Medical Diagnosis: Disease Identification, Prediction on Cancer, Diabetes etc using past history and current data. 4
  • 5.
    Machine Learning Paradigms •Rote Learning • Transfer of Learning • Learning by Taking Advice • Learning By Analogy • Un-Supervised Learning (Clustering) • Supervised Learning – Classification • Deep Learning 5
  • 6.
    Machine Learning Tasks •Knowledge Representation and Reasoning • Regression • Classification • Clustering • Dimensionality reduction • Reinforcement learning (Ranking) 6
  • 7.
    AI - KnowledgeBased Systems • Facts • Rules • Knowledge base • Knowledge Based Systems • Knowledge Representation • Reasoning and Inference 7
  • 8.
    FACTS AND RULES •Pat is a man = true • Kumar is the father of Raja=True • Kumar is the grand father of Raja= False • IF marks >=60 Then Class = FIRST CLASS 8
  • 9.
    Rules • If Athen B (Whenever A is true then B is also true) • A = true (now A is TRUE) • Inference: B is true now • Using AB and A, we can infer B. • This rule is called Modus Ponens. • AB, A implies B. 9
  • 10.
    Inference • Pat isa man = p • Pat is a woman = q • Pat is a man or woman = p v q = true • Pat is not a woman = 7q = true • Inference: p is true • Pat is a man or woman • Pat is not a woman. • Inference: Pat is a man 10
  • 11.
    Knowledge Base VsDatabase More Rules Less Rules Facts Less More Facts Explicit Rules and Facts Explicit Facts and Implicit Rules Experts update Clerks update Main Memory Based Disk Based 11
  • 12.
    AI Programs -Exhibit Intelligent Behavior by skillful application of heuristics. KBS – Make domain knowledge explicit Expert Systems – Apply expert knowledge to difficult, Real world problems. 12
  • 13.
    Knowledge Representation Techniques • Englishor Natural Language • Tables and Rules • Logic (Propositional logic, Predicate logic) • Semantic Networks • Frames • Conceptual Dependency • Scripts • Ontology 13
  • 14.
    LOGIC • First OrderLogic – Predicate Logic – Propositional Logic • Higher Order Logics – Situational Logic – Fuzzy Logic – Temporal Logic – Modal Logic – Epistemic Logic 14
  • 15.
    Searching • Depth First(Missionaries and Cannibals) • Breadth First (Water Jug Problem) • Hill Climbing ( 8 – Puzzle) • Best First • A* Algorithm • AO* Algorithm • Mini-Max Algorithm 15
  • 16.
    REASONING METHODS • ReasoningBy Analogy – Frames. • Temporal Reasoning – Higher order logic (Temporoal Logic). • Fuzzy Reasoning – Higher Order Logic (Fuzzy Logic). • Non-monotonic Reasoning – Higher Order Logics (Non-monotonic Logic). • Reasoning Agents – Epistemic Logic. 16
  • 17.
    AI and ML •Roughly speaking, AI and ML are good ways to ask a computer to provide an answer to a problem based on some past experience. (Prediction, Learning, Explanation and Finding Temporal Dependencies) • It might be challenging to tell a computer what a cat is, for instance. ( Computers don’t have common sense – General Problem Solver- Human Intelligence). 17
  • 18.
    AI and ML •Still, if you show a neural network enough images of cats and tell it they are cats, then the computer will be able to correctly identify other cats that it did not see before. • It appears that some of the most prominent and widely used AI and ML algorithms can be speeded- up significantly if they are run on quantum computers. (Example: Bayesian Networks, Graph Search Algorithms- Shortest path algorithms – Heuristic Search Algorithms- and Swarm Intelligence algorithms …). 18
  • 19.
    Learning Methods • Learningfrom examples • Winston’s Program • Explanation based Learning • Learning by Observation • Knowledge Acquisition from experts 19
  • 20.
    Machine Learning Methods •Un-Supervised Learning • Clustering • K-means clustering • Supervised learning • Classification Algorithms 20
  • 21.
    K-means Clustering • Strengths –Simple iterative method – User provides “K” • Weaknesses – Often too simple  bad results – Difficult to guess the correct “K” 21
  • 22.
    K-means Clustering Basic Algorithm: •Step 0: Select K • Step 1: Randomly select initial cluster seeds Seed 1 650 Seed 2 200 22
  • 23.
    K-means Clustering • Aninitial cluster seed represents the “mean value” of its cluster. • In the preceding figure: – Cluster seed 1 = 650 – Cluster seed 2 = 200 23
  • 24.
    K-means Clustering • Step2: Calculate distance from each object to each cluster seed. • What type of distance should we use? – Squared Euclidean distance • Step 3: Assign each object to the closest cluster 24
  • 25.
  • 26.
    K-means Clustering • Iterate: –Calculate distance from objects to cluster centroids. – Assign objects to closest cluster – Recalculate new centroids • Stop based on convergence criteria – No change in clusters – Max iterations 26
  • 27.
    Regression • The regressiontask comes from Supervised machine learning. • It can help us to predict (expect continues values) and explains the objects based on a given set of numerical and categorical data. • For example, we can predict the house prices based on the house attributes such as number of rooms, size, and location. 27
  • 28.
    Regression • In mathematicalterms, the regression method provides us a linear line with the equation of Y = mX+c to model a dataset. • Here we are taking the X (Dependent variable) and Y (Independent variable) data points to train the linear regression model. The best observation line can be found by calculating the slope (m) and y-intercept (c) values. 28
  • 29.
    Applications • Risk assessment– Insurance, Banking • Score prediction – Cricket, Elections • Market forecasting – Share Market • Weather forecasting • Housing and product price prediction • Analysing engine performance in Automobiles. 29
  • 30.
    Regression Analysis • Theregression analysis is performed with various effective algorithms namely • Simple linear regression • Multiple linear regression • Decision trees • Radom forest • Support Vector Machines (SVM) 30
  • 31.
    Decision Trees 31 Name DebtIncome Married? Risk -------------------------------------------------------- Joe High High Yes Good Sue Low High Yes Good John Low High No Poor Mary High Low Yes Poor Fred Low Low Yes Poor
  • 32.
    32 Decision Tree Classification •Example D1 D2 Decision Tree Income =High Income =low D2 D1
  • 33.
    33 Decision Trees Classification(cont.) • Example D1a D2 1 2 D1b Decision Tree Income =High Income =low D2 D1 D1a D1b
  • 34.
    Random Forest • Randomforest (or random forests) is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the class's output by individual trees. It combines "bagging" and the random selection of features. • For many data sets, it produces a highly accurate classification results than decision trees. 34
  • 35.
    Bagging • Bagging, alsoknown as bootstrap aggregation, is the ensemble learning method that is commonly used to reduce variance within a noisy dataset. • In bagging, a random sample of data in a training set is selected with replacement—meaning that the individual data points can be chosen more than once. 35
  • 36.
    Regression Vs Classification •The most significant difference between regression vs classification is that while regression helps predict a continuous quantity, classification predicts discrete class labels. There are also some overlaps between the two types of machine learning algorithms. 36
  • 37.
    Support Vector Machines •It is a linear classifier. • It classifies the dataset into two groups for binary classification problems. • The multi-class SVM classifies the data set into multiple groups. • It is a supervised learning method. 37
  • 38.
    Linear Classifiers f x a yest denotes +1 denotes-1 f(x,w,b) = sign(w x + b) How would you classify this data? w x + b<0 w x + b>0 38
  • 39.
    Linear Classifiers f x a yest Denotes +1 Denotes-1 f(x,w,b) = sign(w x + b) How will you classify this data? 39
  • 40.
    40 Maximum Margin f x a yest denotes +1 denotes-1 f(x,w,b) = sign(w. x + b) The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (Called an LSVM) Support Vectors are those datapoints that the margin pushes up against Linear SVM
  • 41.
    Probabilistic Models • Uncertainty •ABC Murder Story • Bayesian Classification • Neural Networks • Feature Selection and Classification • Deep Learning 41
  • 42.
    Probabilistic Models • Theprobabilistic framework to machine learning is that learning can be thought of as inferring plausible models to explain observed data. • A machine can use such models to make predictions about future data, and take decisions that are rational given these predictions. 42
  • 43.
    BAYESIYAN CLASSIFICATION • CONDITIONAL PROBABILITY •BAYES THEOREM • NAÏVE BAYES CLASSIFIER • BELIEF NETWORK • APPLICATION OF BAYESIAN NETWORK - CYBER CRIME DETECTION 43
  • 44.
    BAYESIAN CLASSIFICATION • Probabilisticlearning: Calculates explicit probabilities for hypothesis, among the most practical approaches to certain types of learning problems. • Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data. 44
  • 45.
    BAYESIAN THEOREM • Aspecial case of Bayesian Theorem: P(A∩B) = P(B) x P(A|B) P(B∩A) = P(A) x P(B|A) Since P(A∩B) = P(B∩A), P(B) x P(A|B) = P(A) x P(B|A) => P(A|B) = [P(A) x P(B|A)] / P(B)         A B P A P A B P A P A B P A P B P A B P A P B A P | | ) | ( ) ( ) ( ) | ( ) ( ) | (    A B 45
  • 46.
    BAYESIAN THEOREM • Example1: A medical cancer diagnosis problem There are 2 possible outcomes of a diagnosis: + ve, - ve. We know .8% of world population has cancer. Test gives correct +ve result 98% of the time and gives correct –ve result 97% of the time. If a patient’s test returns +ve, should we diagnose the patient as having cancer? 46
  • 47.
    BAYESIAN THEOREM P(cancer) =.008 P(-cancer) = .992 P(+ve|cancer) = .98 P(-ve|cancer) = .02 P(+ve|-cancer) = .03 P(-ve|-cancer) = .97 Using Bayes Formula: P(cancer|+ve) = P(+ve|cancer)xP(cancer) / P(+ve) = 0.98 x 0.008 = .0078 / P(+ve) P(-cancer|+ve) = P(+ve|-cancer)xP(-cancer) / P(+ve) = 0.03 x 0.992 = 0.0298 / P(+ve). So, the patient most likely does not have cancer. 47
  • 48.
    NAÏVE BAYES CLASSIFIER •A simplified assumption: attributes are conditionally independent. • Greatly reduces the computation cost, only count the class distribution. 48
  • 49.
    NAÏVE BAYES CLASSIFIER Theprobabilistic model of NBC is to find the probability of a certain class given multiple dijoint (assumed) events. The naïve Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and where the target function f(x) can take on any value from some finite set V. A set of training examples of the target function is provided, and a new instance is presented, described by the tuple of attribute values <a1,a2,…,an>. The learner is asked to predict the target value, or classification, for this new instance. 49
  • 50.
    NAÏVE BAYES CLASSIFIER Abstractly,probability model for a classifier is a conditional model P(C|F1,F2,…,Fn) Over a dependent class variable C with a small number of outcome or classes conditional over several feature variables F1,…,Fn. Naïve Bayes Formula: P(C|F1,F2,…,Fn) = argmaxc [P(C) x P(F1|C) x P(F2|C) x…x P(Fn|C)] / P(F1,F2,…,Fn) Since P(F1,F2,…,Fn) is common to all probabilities, we need not evaluate the denominator for comparisons. 50
  • 51.
  • 52.
    NAÏVE BAYES CLASSIFIER •Problem: Use training data from above to classify the following instances: a) <Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong> b) <Outlook=overcast, Temperature=cool, Humidity=high, Wind=strong> 52
  • 53.
    NAÏVE BAYES CLASSIFIER Answerto (a): P(PlayTennis=yes) = 9/14 = 0.64 P(PlayTennis=n) = 5/14 = 0.36 P(Outlook=sunny|PlayTennis=yes) = 2/9 = 0.22 P(Outlook=sunny|PlayTennis=no) = 3/5 = 0.60 P(Temperature=cool|PlayTennis=yes) = 3/9 = 0.33 P(Temperature=cool|PlayTennis=no) = 1/5 = .20 P(Humidity=high|PlayTennis=yes) = 3/9 = 0.33 P(Humidity=high|PlayTennis=no) = 4/5 = 0.80 P(Wind=strong|PlayTennis=yes) = 3/9 = 0.33 P(Wind=strong|PlayTennis=no) = 3/5 = 0.60 53
  • 54.
    NAÏVE BAYES CLASSIFIER P(yes)xP(sunny|yes)xP(cool|yes)xP(high|yes)x P(strong|yes) = 0.0053 P(no)xP(sunny|no)xP(cool|no)xP(high|no) x P(strong|no) = 0.0206 So the class for this instance is ‘no’. We can normalize the probility by: [0.0206]/[0.0206+0.0053] = 0.795 54
  • 55.
    NAÏVE BAYES CLASSIFIER Answerto (b): P(PlayTennis=yes) = 9/14 = 0.64 P(PlayTennis=no) = 5/14 = 0.36 P(Outlook=overcast|PlayTennis=yes) = 4/9 = 0.44 P(Outlook=overcast|PlayTennis=no) = 0/5 = 0 P(Temperature=cool|PlayTennis=yes) = 3/9 = 0.33 P(Temperature=cool|PlayTennis=no) = 1/5 = .20 P(Humidity=high|PlayTennis=yes) = 3/9 = 0.33 P(Humidity=high|PlayTennis=no) = 4/5 = 0.80 P(Wind=strong|PlayTennis=yes) = 3/9 = 0.33 P(Wind=strong|PlayTennis=no) = 3/5 = 0.60 55
  • 56.
    NAÏVE BAYES CLASSIFIER EstimatingProbabilities: In the previous example, P(overcast|no) = 0 which causes the formula- P(no)xP(overcast|no)xP(cool|no)xP(high|no) x P(strong|nno) = 0.0 This causes problems in comparing because the other probabilities are not considered. We can avoid this difficulty by using m-estimate. 56
  • 57.
    NAÏVE BAYES CLASSIFIER M-EstimateFormula: [c + k] / [n + m] where c/n is the original probability used before, k=1 and m= equivalent sample size. Using this method our new values of probability is given below- 57
  • 58.
    NAÏVE BAYES CLASSIFIER Newanswer to (b): P(PlayTennis=yes) = 10/16 = 0.63 P(PlayTennis=no) = 6/16 = 0.37 P(Outlook=overcast|PlayTennis=yes) = 5/12 = 0.42 P(Outlook=overcast|PlayTennis=no) = 1/8 = .13 P(Temperature=cool|PlayTennis=yes) = 4/12 = 0.33 P(Temperature=cool|PlayTennis=no) = 2/8 = .25 P(Humidity=high|PlayTennis=yes) = 4/11 = 0.36 P(Humidity=high|PlayTennis=no) = 5/7 = 0.71 P(Wind=strong|PlayTennis=yes) = 4/11 = 0.36 P(Wind=strong|PlayTennis=no) = 4/7 = 0.57 58
  • 59.
    NAÏVE BAYES CLASSIFIER P(yes)xP(overcast|yes)xP(cool|yes)xP(high|ye s)xP(strong|yes)= 0.011 P(no)xP(overcast|no)xP(cool|no)xP(high|no) x P(strong|nno) = 0.00486 So the class of this instance is ‘yes’ 59
  • 60.
    NAÏVE BAYES CLASSIFIER •The conditional probability values of all the attributes with respect to the class are pre-computed and stored on disk. • This prevents the classifier from computing the conditional probabilities every time it runs. • This stored data can be reused to reduce the latency of the classifier. 60
  • 61.
    Bayesian Belief Networks •In Naïve Bayes Classifier we make the assumption of class conditional independence, that is given the class label of a sample, the value of the attributes are conditionally independent of one another. • However, there can be dependences between the value of attributes. To avoid this, we use Bayesian Belief Network which provides the joint conditional probability distribution. 61
  • 62.
    Bayesian Belief Networks •A Bayesian network is a form of probabilistic graphical model. • Specifically, a Bayesian network is a directed acyclic graph of nodes representing variables and arcs representing dependence relations among the variables. • They provide a graphical method for getting the inferred results through joint probabilities. 62
  • 63.
  • 64.
  • 65.
    BELIEF NETWORKS • Bythe chaining rule of probability, the joint probability of all the nodes in the graph above is: P(C, S, R, W) = P(C) * P(S|C) * P(R|C) * P(W|S,R) W=Wet Grass, C=Cloudy, R=Rain, S=Sprinkler Example: P(W∩-R∩S∩C) = P(W|S,-R)*P(-R|C)*P(S|C)*P(C) = 0.9*0.2*0.1*0.5 = 0.009 65
  • 66.
    BAYESIAN BELIEF NETWORK Whatis the probability of wet grass on a given day - P(W)? P(W) = P(W|SR) * P(S) * P(R) + P(W|S-R) * P(S) * P(-R) + P(W|-SR) * P(-S) * P(R) + P(W|-S-R) * P(-S) * P(-R) Here P(S) = P(S|C) * P(C) + P(S|-C) * P(-C) P(R) = P(R|C) * P(C) + P(R|-C) * P(-C) P(W)= 0.5985 66
  • 67.
    BAYESIAN BELIEF NETWORK Whatis the probability of wet grass on a given day - P(W)? P(W) = P(W|SR) * P(S) * P(R) + P(W|S-R) * P(S) * P(-R) + P(W|-SR) * P(-S) * P(R) + P(W|-S-R) * P(-S) * P(-R) Here P(S) = P(S|C) * P(C) + P(S|-C) * P(-C) P(R) = P(R|C) * P(C) + P(R|-C) * P(-C) P(W)= 0.5985 67
  • 68.
    Advantages of BayesianApproach • Bayesian networks can readily handle incomplete data sets. • Bayesian networks allow one to learn about causal relationships • Bayesian networks readily facilitate use of prior knowledge. 68
  • 69.
    ML Resources (Books) 1.Stephen Marsland, “Machine Learning – An Algorithmic Perspective”, Second Edition, Chapman and Hall/CRC Machine Learning and Pattern Recognition Series, 2014. 2. Tom M Mitchell, ―Machine Learning‖, First Edition, McGraw Hill Education, 2013. 69
  • 70.
    ML Resources (Books) 3.Nilsson, N. (2004). Introduction to Machine Learning. http://robotics.stanford.edu/people/nilsson/ mlbook.html. 4. Russell, S. (1997). Machine Learning. Handbook of Perception and Cognition, Vol. 14, Chap. 4. 5. Ethem Alpaydin, “Introduction to Machine Learning”, (Adaptive Computation and Machine Learning Series), Third Edition, MIT Press, 2014. 70
  • 71.
    Journals - IEEE •IEEE Transactions on Neural Networks. • IEEE Transactions on Pattern Analysis and Machine Intelligence. • IEEE Transactions on Neural Networks and Learning Systems. • IEEE Transactions on Artificial Intelligence • IEEE Transactions on Knowledge and Data Engineering. 71
  • 72.
    ML Journals -Elsevier • Machine Learning with Applications • Expert Systems With Applications • Applied Soft Computing • Knowledge-based Systems • Neural Networks • Data & Knowledge Engineering • Artificial Intelligence 72
  • 73.
    Neural Networks Similarity withbiological network Fundamental processing elements of a neural network is a neuron 1.Receives inputs from other source 2.Combines them in someway 3.Performs a generally nonlinear operation on the result 4.Outputs the final result •Biologically motivated approach to machine learning 73
  • 74.
    Similarity with BiologicalNetwork • Fundamental processing element of a neural network is a neuron • A human brain has 100 billion neurons • An ant brain has 250,000 neurons 74
  • 75.
    Neural Network • NeuralNetwork is a set of connected INPUT/OUTPUT UNITS, where each connection has a WEIGHT associated with it. • Neural Network learning is also called CONNECTIONIST learning due to the connections between units. • It is a case of SUPERVISED or CLASSIFICATION learning. 75
  • 76.
    Neural Network • NeuralNetwork learns by adjusting the weights so as to be able to correctly classify the training data and hence, after testing phase, to classify unknown data. • Neural Network needs long time for training. • Neural Network has a high tolerance to noisy and incomplete data. 76
  • 77.
    Neural Network Classifier •Input: Classification data It contains classification attribute • Data is divided, as in any classification problem. [Training data and Testing data] • All data must be normalized. (i.e. all values of attributes in the database are changed to contain values in the internal [0,1] or[-1,1]) Neural Network can work with data in the range of (0,1) or (-1,1) 77
One Neuron as a Network
• The neuron receives the weighted sum x as input and calculates the output as a function of the input as follows:
• y = f(x), where f(x) = 0 when x < 0.5 and f(x) = 1 when x >= 0.5.
• For example, if x = 0.55, then y = 1 and the input is classified into class 1.
• If x = 0.45, then f(x) = 0 and the input is classified into class 0.
Bias of a Neuron
• The bias value b is added to the weighted sum ∑ wi·xi so that the decision boundary can be shifted away from the origin: v = ∑ wi·xi + b.
• [Figure: decision boundary lines x1 - x2 = -1, x1 - x2 = 0 and x1 - x2 = 1 in the (x1, x2) plane, shifted by the bias.]
Bias as Extra Input
• The bias can be treated as an extra input x0 = +1 with weight w0 = b.
• [Figure: input attribute values x1, x2, …, xm with weights w1, w2, …, wm, plus x0 = +1 with weight w0 = b, feed a summing function followed by an activation function that produces the output class y.]
• The induced local field is then v = Σ (j = 0 to m) wj·xj, with x0 = +1 and w0 = b.
Neuron with Activation
• The neuron is the basic information processing unit of a NN. It consists of:
1. A set of links, describing the neuron inputs, with weights w1, w2, …, wm.
2. An adder function (linear combiner) that computes the weighted sum of the inputs: u = Σ (j = 1 to m) wj·xj.
3. An activation function φ that limits the amplitude of the neuron output: y = φ(u + b).
• A one-neuron sketch in code follows.
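Putting the last three slides together, a single neuron is a weighted sum plus a bias passed through an activation function. The sketch below is illustrative only; the weights, bias and input happen to be those of hidden unit 4 in the worked backpropagation example a few slides ahead, so the output should be about 0.332.

```python
import numpy as np

def sigmoid(v):
    """Logistic (squashing) activation: maps any real v into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def neuron_output(x, w, b):
    """One neuron: weighted sum of the inputs plus bias, then activation."""
    v = np.dot(w, x) + b          # v = sum_j w_j * x_j + b
    return sigmoid(v)

x = np.array([1.0, 0.0, 1.0])     # example input vector
w = np.array([0.2, 0.4, -0.5])    # example weights
b = -0.4                          # bias
print(neuron_output(x, w, b))     # ≈ 0.332
```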
A Multilayer Feed-Forward Neural Network
• [Figure: the input nodes receive the input record xi; weights wij connect them to the hidden nodes with outputs Oj; weights wjk connect the hidden nodes to the output nodes with outputs Ok, which give the output class. The network is fully connected.]
Neural Network Learning
• The inputs are fed simultaneously into the input layer.
• The weighted outputs of these units are fed into the hidden layer.
• The weighted outputs of the last hidden layer are the inputs to the units making up the output layer.
A Multilayer Feed-Forward Network
• The units in the hidden layers and output layer are sometimes referred to as neurodes, due to their symbolic biological basis, or as output units.
• A network containing two hidden layers is called a three-layer neural network, and so on.
• The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer.
A Multilayer Feed-Forward Network
• INPUT: records without the class attribute, with normalized attribute values.
• INPUT VECTOR: X = {x1, x2, …, xn}, where n is the number of (non-class) attributes.
• INPUT LAYER: there are as many nodes as non-class attributes, i.e. as the length of the input vector.
• HIDDEN LAYER: the number of nodes in the hidden layer and the number of hidden layers depend on the implementation.
A Multilayer Feed-Forward Network
• OUTPUT LAYER: corresponds to the class attribute.
• There are as many output nodes as classes (values of the class attribute): Ok, k = 1, 2, …, #classes.
• The network is fully connected, i.e. each unit provides input to each unit in the next forward layer.
Classification by Backpropagation
• Backpropagation learns by iteratively processing a set of training data (samples).
• For each sample, the weights are modified so as to minimize the error between the network's classification and the actual classification.
Steps in the Backpropagation Algorithm
• STEP ONE: initialize the weights and biases.
• The weights in the network are initialized to random numbers from the interval [-1, 1].
• Each unit has a BIAS associated with it.
• The biases are similarly initialized to random numbers from the interval [-1, 1].
• STEP TWO: feed the training sample.
Steps in the Backpropagation Algorithm (cont.)
• STEP THREE: propagate the inputs forward; compute the net input and output of each unit in the hidden and output layers.
• STEP FOUR: backpropagate the error.
• STEP FIVE: update the weights and biases to reflect the propagated errors.
• STEP SIX: check the terminating conditions.
Propagation through the Hidden Layer (One Node)
• The inputs to unit j are the outputs of the previous layer. These are multiplied by their corresponding weights to form a weighted sum, which is added to the bias associated with unit j.
• A nonlinear activation function f is then applied to the net input.
• [Figure: input vector x = (x0, x1, …, xn) and weight vector w = (w0j, w1j, …, wnj) feed a weighted sum with bias θj, followed by the activation function f, producing the output y.]
Propagate the Inputs Forward
• For unit j in the input layer, the output equals the input: Oj = Ij.
• The net input to each unit in the hidden and output layers is computed as follows.
• Given a unit j in a hidden or output layer, the net input is Ij = Σi wij·Oi + θj, where wij is the weight of the connection from unit i in the previous layer to unit j, Oi is the output of unit i in the previous layer, and θj is the bias of the unit.
Propagate the Inputs Forward
• Each unit in the hidden and output layers takes its net input and applies an activation function, which symbolizes the activation of the neuron represented by the unit. It is also called a logistic, sigmoid, or squashing function.
• Given the net input Ij to unit j, the output Oj = f(Ij) of unit j is computed as Oj = 1 / (1 + e^(-Ij)).
• A small vectorized sketch of these two steps follows.
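The two formulas above can be written as one vectorized layer. This sketch is not from the slides; it uses the hidden-layer weights of the worked example that appears a few slides ahead, so it should print approximately [0.332, 0.525].

```python
import numpy as np

def sigmoid(I):
    return 1.0 / (1.0 + np.exp(-I))

def layer_forward(O_prev, W, theta):
    """Forward-propagate one layer.

    O_prev : outputs of the previous layer, shape (n_prev,)
    W      : weights, W[i, j] = w_ij from unit i to unit j, shape (n_prev, n_curr)
    theta  : biases of the current layer, shape (n_curr,)
    """
    I = O_prev @ W + theta      # net input  I_j = sum_i w_ij * O_i + theta_j
    return sigmoid(I)           # output     O_j = 1 / (1 + e^(-I_j))

# Hidden layer of the worked example a few slides ahead:
x = np.array([1.0, 0.0, 1.0])
W_hidden = np.array([[0.2, -0.3],    # w14, w15
                     [0.4,  0.1],    # w24, w25
                     [-0.5, 0.2]])   # w34, w35
theta_hidden = np.array([-0.4, 0.2])
print(layer_forward(x, W_hidden, theta_hidden))   # ≈ [0.332, 0.525]
```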
Back Propagate the Error
• On reaching the output layer, the error is computed and propagated backwards.
• For a unit k in the output layer, the error is Errk = Ok·(1 - Ok)·(Tk - Ok), where Ok is the actual output of unit k (computed by the activation function Ok = 1 / (1 + e^(-Ik))), Tk is the true output based on the known class label of the training sample, and Ok·(1 - Ok) is the derivative (rate of change) of the activation function.
Back Propagate the Error
• The error is propagated backwards by updating the weights and biases to reflect the error of the network's classification.
• For a unit j in the hidden layer, the error is Errj = Oj·(1 - Oj)·Σk Errk·wjk, where wjk is the weight of the connection from unit j to unit k in the next higher layer, and Errk is the error of unit k.
Update Weights and Biases
• Weights are updated by the following equations, where l is a constant between 0.0 and 1.0 reflecting the learning rate (fixed for the implementation): Δwij = l·Errj·Oi and wij = wij + Δwij.
• Biases are updated by the following equations: Δθj = l·Errj and θj = θj + Δθj.
Update Weights and Biases
• Here the weights and biases are updated after the presentation of each sample. This is called case updating.
• Epoch: one iteration through the training set is called an epoch.
• Epoch updating: alternatively, the weight and bias increments could be accumulated in variables and the weights and biases updated after all of the samples of the training set have been presented.
• Case updating is more accurate.
Terminating Conditions
• Training stops when:
• all Δwij in the previous epoch are below some threshold, or
• the percentage of samples misclassified in the previous epoch is below some threshold, or
• a pre-specified number of epochs has expired.
• In practice, several hundred thousand epochs may be required before the weights converge.
Backpropagation Formulas (summary)
• Net input: Ij = Σi wij·Oi + θj
• Output: Oj = 1 / (1 + e^(-Ij))
• Output-layer error: Errk = Ok·(1 - Ok)·(Tk - Ok)
• Hidden-layer error: Errj = Oj·(1 - Oj)·Σk Errk·wjk
• Weight update: wij = wij + l·Errj·Oi
• Bias update: θj = θj + l·Errj
• [Figure: the input vector xi feeds the input nodes; weights wij connect them through the hidden nodes to the output nodes, which produce the output vector.]
Example of Backpropagation
• Network: 3 input units (1, 2, 3), 2 hidden units (4, 5), 1 output unit (6). Weights are initialized to random numbers from -1.0 to 1.0.
• Initial input and weights:
x1 = 1, x2 = 0, x3 = 1
w14 = 0.2, w15 = -0.3, w24 = 0.4, w25 = 0.1, w34 = -0.5, w35 = 0.2, w46 = -0.3, w56 = -0.2
Example (cont.)
• A bias is added to each hidden and output node, initialized to random values from -1.0 to 1.0:
θ4 = -0.4, θ5 = 0.2, θ6 = 0.1
Net Input and Output Calculation
• Unit 4: net input I4 = 0.2 + 0 - 0.5 - 0.4 = -0.7; output O4 = 1 / (1 + e^0.7) = 0.332
• Unit 5: net input I5 = -0.3 + 0 + 0.2 + 0.2 = 0.1; output O5 = 1 / (1 + e^-0.1) = 0.525
• Unit 6: net input I6 = (-0.3)(0.332) + (-0.2)(0.525) + 0.1 = -0.105; output O6 = 1 / (1 + e^0.105) = 0.475
Calculation of Error at Each Node (assuming the true output T6 = 1)
• Unit 6: Err6 = 0.475 (1 - 0.475)(1 - 0.475) = 0.1311
• Unit 5: Err5 = 0.525 (1 - 0.525)(0.1311)(-0.2) = -0.0065
• Unit 4: Err4 = 0.332 (1 - 0.332)(0.1311)(-0.3) = -0.0087
Calculation of Weight and Bias Updates (learning rate l = 0.9)
• w46 = -0.3 + (0.9)(0.1311)(0.332) = -0.261
• w56 = -0.2 + (0.9)(0.1311)(0.525) = -0.138
• w14 = 0.2 + (0.9)(-0.0087)(1) = 0.192
• w15 = -0.3 + (0.9)(-0.0065)(1) = -0.306
• θ6 = 0.1 + (0.9)(0.1311) = 0.218
• The remaining weights and biases are updated similarly. The full example is reproduced in code below.
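The whole worked example can be reproduced in a few lines of Python. The sketch below simply applies the formulas from the previous slides to the slide's initial weights, biases, input (1, 0, 1), target T6 = 1 and learning rate 0.9; the computed values agree with the tables above up to rounding.

```python
import numpy as np

sigmoid = lambda I: 1.0 / (1.0 + np.exp(-I))

# Initial weights and biases from the slides
x = np.array([1.0, 0.0, 1.0])                            # x1, x2, x3
W1 = np.array([[0.2, -0.3], [0.4, 0.1], [-0.5, 0.2]])    # w14..w35 (inputs -> hidden units 4, 5)
W2 = np.array([[-0.3], [-0.2]])                          # w46, w56 (hidden -> output unit 6)
b1 = np.array([-0.4, 0.2])                               # theta4, theta5
b2 = np.array([0.1])                                     # theta6
T, lr = 1.0, 0.9

# Forward pass
O_hidden = sigmoid(x @ W1 + b1)                          # ≈ [0.332, 0.525]
O_out = sigmoid(O_hidden @ W2 + b2)                      # ≈ [0.474]

# Backpropagate the error
err_out = O_out * (1 - O_out) * (T - O_out)              # Err6 ≈ 0.131
err_hidden = O_hidden * (1 - O_hidden) * (W2 @ err_out)  # Err4 ≈ -0.0087, Err5 ≈ -0.0065

# Update weights and biases (case updating)
W2 += lr * np.outer(O_hidden, err_out)                   # w46 ≈ -0.261, w56 ≈ -0.138
W1 += lr * np.outer(x, err_hidden)                       # w14 ≈ 0.192, w15 ≈ -0.306, ...
b2 += lr * err_out                                       # theta6 ≈ 0.218
b1 += lr * err_hidden

print(O_hidden, O_out)
print(err_out, err_hidden)
print(W2.ravel(), b2)
```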
DEEP LEARNING
• Deep learning is a subset of ML within AI, with networks capable of learning, even unsupervised, from data that is unstructured or unlabeled.
• Deep Learning is a subfield of Machine Learning that uses neural networks to model and solve complex problems.
DEEP LEARNING
• The key characteristic of Deep Learning is the use of deep neural networks, which have multiple layers of interconnected nodes.
• These networks can learn complex representations of data by discovering hierarchical patterns and features in the data.
• Deep Learning algorithms can automatically learn and improve from data without the need for manual feature engineering.
DEEP LEARNING
• Deep Learning has achieved significant success in various fields, including image recognition, natural language processing, speech recognition, and recommendation systems.
• Popular Deep Learning architectures include Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Deep Belief Networks (DBNs).
DEEP LEARNING
• Training deep neural networks typically requires a large amount of data and computational resources.
• However, the availability of cloud computing and the development of specialized hardware, such as Graphics Processing Units (GPUs), have made it easier to train deep neural networks.
Convolutional Neural Networks
• A Convolutional Neural Network (CNN) is a type of deep learning algorithm that is particularly well-suited to image recognition and processing tasks.
• It is made up of multiple layers, including convolutional layers, pooling layers, and fully connected layers.
Convolutional Neural Networks
• The convolutional layers are the key component of a CNN: filters are applied to the input image to extract features such as edges, textures, and shapes.
• The output of the convolutional layers is then passed through pooling layers, which down-sample the feature maps, reducing the spatial dimensions while retaining the most important information.
• The output of the pooling layers is then passed through one or more fully connected layers, which make the prediction or classify the image.
CNN
• Certain steps/operations are involved in a CNN. These can be categorized as follows:
• Convolution operation
• Pooling
• Flattening
• Fully connected layers
CNN
• The convolution operation is the first and one of the most important steps in the functioning of a CNN. It focuses on extracting/preserving important features from the input (an image, etc.).
CNN
• To understand this operation, let us consider an image as the input to our CNN.
• When an image is given as input, it is in the form of a matrix of pixels.
• If the image is grayscale, it is represented as a single matrix, where each value ranges from 0 to 255.
CNN
• We can even normalize these values, say to the range 0-1, where 0 represents white and 1 represents black.
• If the image is colored, there are three matrices representing the R, G and B channels, with each value in the range 0-255. This can be seen in the figures below.
Fig 1: Colored image matrices
Fig 2: Grayscale image matrix
Convolution Operation
• MATHEMATICAL OPERATION: coming to the convolution operation, let us consider an input image. For the convolution operation, filters or kernels are used.
Convolution Operation
• The following mathematical operation is performed:
Let the size of the image be N×N and the size of the filter be F×F.
Then (N×N) * (F×F) = (N-F+1)×(N-F+1), where * denotes the convolution operation.
• The kernel size, number of kernels, input channels, etc. are hyperparameters. The result of each layer is passed on to the next one.
Example
• Here, an input of size 6×6 is given and a kernel of size 3×3 is used, so the feature map obtained is of size 4×4 (6 - 3 + 1 = 4).
• To increase non-linearity, a rectifier (ReLU) function can be applied to the feature map.
• Finally, after the convolution step is completed and the feature map is obtained, this map is given as input to the pooling layer. A small NumPy sketch of this step follows.
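As a quick check of the (N-F+1) rule, here is a small "valid" convolution sketch in NumPy. The input values and the vertical-edge kernel are made up for illustration; like most CNN libraries, the code implements cross-correlation (no kernel flip).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution: an NxN image and an FxF kernel
    give an (N-F+1)x(N-F+1) feature map."""
    N, F = image.shape[0], kernel.shape[0]
    out = np.zeros((N - F + 1, N - F + 1))
    for i in range(N - F + 1):
        for j in range(N - F + 1):
            out[i, j] = np.sum(image[i:i+F, j:j+F] * kernel)
    return out

relu = lambda x: np.maximum(x, 0)              # rectifier, adds non-linearity

image = np.random.randint(0, 2, size=(6, 6))   # made-up 6x6 binary image
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])                # simple vertical-edge filter
feature_map = relu(conv2d_valid(image, kernel))
print(feature_map.shape)                       # (4, 4)
```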
CNN Architecture
• A common CNN model architecture has a number of convolution and pooling layers stacked one after the other.
Pooling Layers
• Pooling layers are used to reduce the dimensions of the feature maps.
• This reduces the number of parameters to learn and the amount of computation performed in the network.
Pooling Layers
• The pooling layer summarizes the features present in a region of the feature map generated by a convolution layer.
• Further operations are therefore performed on summarized features instead of the precisely positioned features generated by the convolution layer.
• This makes the model more robust to variations in the position of the features in the input image.
Max Pooling
• Types of pooling layers: max, min and average pooling.
• Max pooling is a pooling operation that selects the maximum element from the region of the feature map covered by the filter.
• Thus, the output of a max-pooling layer is a feature map containing the most prominent features of the previous feature map.
Average Pooling
• Average pooling computes the average of the elements present in the region of the feature map covered by the filter.
• Thus, while max pooling gives the most prominent feature in a particular patch of the feature map, average pooling gives the average of the features present in the patch. A small pooling sketch follows.
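A small sketch of both pooling variants (the 4×4 feature map values are made up; a 2×2 window with stride 2 is assumed):

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Max or average pooling over size x size windows."""
    H, W = feature_map.shape
    out_h, out_w = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

fm = np.array([[1., 3., 2., 1.],
               [4., 6., 5., 0.],
               [2., 1., 9., 8.],
               [0., 3., 7., 4.]])
print(pool2d(fm, mode="max"))       # [[6. 5.] [3. 9.]]
print(pool2d(fm, mode="average"))   # [[3.5 2. ] [1.5 7. ]]
```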
Flattening
• Flattening is converting the data into a 1-dimensional array for input to the next layer.
• We flatten the output of the convolutional layers to create a single long feature vector.
• This vector is connected to the final classification model, which is called a fully-connected layer.
CNN
• In other words, we put all the pixel data in one line and make connections with the final layer.
• And once again: what is the final layer for? The classification, e.g. of 'cats and dogs'. The whole pipeline is sketched below.
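Putting convolution, pooling, flattening and the fully connected layer together, here is a minimal sketch of such a network, assuming TensorFlow/Keras is available; the 64×64 input size, filter counts and layer widths are illustrative choices, not values from the slides.

```python
from tensorflow.keras import layers, models

# Convolution -> pooling -> convolution -> pooling -> flatten -> fully connected
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                       # 1-D feature vector
    layers.Dense(128, activation="relu"),   # fully connected layer
    layers.Dense(1, activation="sigmoid"),  # binary output, e.g. cat vs dog
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```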
Recurrent Neural Network
• A Recurrent Neural Network (RNN) is a type of neural network where the output from the previous step is fed as input to the current step.
• In traditional neural networks, all the inputs and outputs are independent of each other.
• But in cases where we must predict the next word of a sentence (as in NLP), the previous words are required.
• Hence, there is a need to remember the previous words.
Recurrent Neural Network
• Thus the RNN came into existence; it solves this issue with the help of a hidden layer.
• The main and most important feature of an RNN is its hidden state, which remembers some information about the sequence.
Recurrent Neural Network
• The state is also referred to as the memory state, since it remembers the previous input to the network.
• It uses the same parameters for each input, as it performs the same task on all the inputs or hidden layers to produce the output.
• This reduces the number of parameters, unlike other neural networks.
How an RNN Works
• The Recurrent Neural Network consists of multiple fixed activation-function units, one for each time step.
• Each unit has an internal state, which is called the hidden state of the unit.
• This hidden state signifies the past knowledge that the network holds at a given time step.
RNN
• This hidden state is updated at every time step to signify the change in the network's knowledge about the past.
• The hidden state is updated using the following recurrence relation:
RNN
• The formula for calculating the current state is ht = f(ht-1, xt), where ht is the current state, ht-1 is the previous state and xt is the input at time t.
RNN: Training through the RNN
• A single time step of the input is provided to the network.
• The network then calculates its current state using the current input and the previous state.
• The current ht becomes ht-1 for the next time step.
RNN
• One can go through as many time steps as the problem requires, joining the information from all the previous states.
• Once all the time steps are completed, the final current state is used to calculate the output, as in the sketch below.
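A minimal forward-pass sketch of this process, assuming the common tanh form of the recurrence, ht = tanh(Wxh·xt + Whh·ht-1), and an output y = Why·hT computed from the final state; the sizes and random weights are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size, T = 4, 3, 2, 5

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (shared across all steps)
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden -> output

xs = [rng.normal(size=input_size) for _ in range(T)]  # made-up input sequence
h = np.zeros(hidden_size)                             # initial hidden (memory) state

for x_t in xs:
    # recurrence: current state from the current input and the previous state
    h = np.tanh(W_xh @ x_t + W_hh @ h)

y = W_hy @ h        # output computed from the final state
print(y)
```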
RNN
• The output is then compared with the actual (target) output and the error is generated.
• The error is then back-propagated through the network to update the weights; hence the RNN is trained using Backpropagation Through Time (BPTT).
Conclusions
• Machine Learning
• Supervised and Unsupervised Learning
• Neural Networks
• CNN
• RNN