CS-402 Parallel and Distributed Systems
Spring 2025
Lecture No. 12
Artificial Neural Networks
What is an Artificial Neural Network (ANN)?
An Artificial Neural Network (ANN) is a computational model inspired by the way biological neural
networks in the human brain process information. ANNs are designed to recognize patterns and
interpret data through a series of layers that simulate how neurons in the brain communicate with
each other.
Here’s a detailed breakdown:
Key Components
Neurons (Nodes): Basic units that receive input, process it, and pass it on. Similar to
biological neurons, they apply an activation function to the input.
Layers:
Input Layer: Receives the initial data.
Hidden Layers: Intermediate layers where data is transformed and processed.
Output Layer: Produces the final result or prediction.
How does an ANN work?
Input Data: Data is fed into the input layer.
Weight and Bias: Each connection between neurons has a weight and each neuron has a bias, both
of which are adjusted during training.
Activation Function: Each neuron applies an activation function (like ReLU, Sigmoid, or Tanh) to its
weighted input to produce the output that is passed to the next layer.
Forward Propagation: Data moves through the network from input to output, being processed and
transformed along the way.
Output: The output layer provides the final prediction or classification result.
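As a minimal illustration of the steps above, the following sketch (not from the course code) pushes a
3-element input through a single sigmoid neuron; the weights, bias, and input values are made up for
the example.

    #include <cmath>
    #include <cstdio>

    // Sigmoid activation: squashes any real number into (0, 1).
    double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

    int main() {
        // Hypothetical weights, bias, and input (for illustration only).
        double w[3] = {0.5, -0.2, 0.1};
        double b = 0.3;
        double x[3] = {1.0, 2.0, 3.0};

        // Weighted sum: s = w1*x1 + ... + wm*xm + b
        double s = b;
        for (int i = 0; i < 3; i++) s += w[i] * x[i];

        // The activation function turns the weighted sum into the neuron's output.
        double out = sigmoid(s);
        std::printf("weighted sum = %f, output = %f\n", s, out);
        return 0;
    }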
Training an ANN
Backpropagation: A technique used to train ANNs by adjusting weights and biases to
minimize the difference between the predicted output and the actual output (known as the
loss function).
Epochs: One full cycle through the training dataset.
Learning Rate: A parameter that controls how much the weights and biases are adjusted
during training.
Applications
Image and Speech Recognition: Used in facial recognition systems, voice-activated
assistants, and more.
Natural Language Processing (NLP): Powers applications like language translation and
sentiment analysis.
Medical Diagnosis: Assists in predicting diseases and analyzing medical images.
Financial Forecasting: Helps in stock market predictions and risk management.
Example
Imagine training an ANN to recognize handwritten digits. The input layer receives pixel
values of images, hidden layers process these values, and the output layer predicts the digit
(0-9).
ANNs are incredibly powerful and versatile tools for solving a wide range of problems where
traditional algorithms might struggle.
What do (deep) neural networks do?
Learning (highly) non-linear functions.
x1  x2  x1 ⨁ x2
0   0   0
0   1   1
1   0   1
1   1   0
[Figure: the four input points (0, 0), (0, 1), (1, 0), (1, 1) plotted on the plane]
Logic XOR (⨁) operation
Artificial neural network example
[Figure: a network with an input layer, hidden layers 1-3, and an output layer]
A neural network consists of layers of artificial neurons and connections between them.
Each connection is associated with a weight.
Training a neural network means finding the right weights (and biases) such that the error
across the training data is minimized.
Training a neural network
A neural network is trained with m training samples
(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m))
x^(i) is an input vector, y^(i) is an output vector
Training objective: minimize the prediction error (loss)
E = Σ_i (y^(i) − o(x^(i)))^2
o(x^(i)) is the predicted output vector for the input vector x^(i)
Approach: Gradient descent (stochastic gradient descent, batch gradient descent, mini-
batch gradient descent).
o Use error to adjust the weight value to reduce the loss. The adjustment amount is proportional to the
contribution of each weight to the loss – Given an error, adjust the weight a little to reduce the error.
Stochastic gradient descent
Given one training sample (x^(i), y^(i))
Compute the output of the neural network o(x^(i))
Training objective: minimize the prediction error (loss) – there are different ways to
define error. The following is an example:
E = 1/2 (y^(i) − o(x^(i)))^2
Estimate how much each weight w contributes to the error: ∂E/∂w
Update the weight by w = w − α ∂E/∂w. Here α is the learning rate.
Algorithm for learning artificial neural network
Initialize the weights W = [w_1, w_2, ..., w_n]
Training (see the sketch below)
o For each training sample (x^(i), y^(i)), use forward propagation to compute the neural
network output vector o(x^(i))
o Compute the error E (various definitions)
o Use backward propagation to compute ∂E/∂w for each weight w
o Update w = w − α ∂E/∂w
o Repeat until E is sufficiently small.
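To make the training loop concrete, here is a minimal sketch of the same cycle on the simplest
possible model: a single weight w fitted to y = 2x with squared loss. The data, learning rate, and
stopping threshold are assumptions; a real network repeats the same update for every weight and bias.

    #include <cstdio>

    int main() {
        // Hypothetical 1-weight model o(x) = w*x trained to fit y = 2x.
        double xs[4] = {1, 2, 3, 4};
        double ys[4] = {2, 4, 6, 8};
        double w = 0.0;            // initial weight
        double alpha = 0.01;       // learning rate (assumed)

        for (int epoch = 0; epoch < 200; epoch++) {   // repeat until E is small
            double E = 0.0;
            for (int i = 0; i < 4; i++) {
                double o = w * xs[i];                 // forward propagation
                double err = o - ys[i];
                E += 0.5 * err * err;                 // E = 1/2 (o - y)^2, summed
                double dE_dw = err * xs[i];           // backward step: dE/dw
                w -= alpha * dE_dw;                   // w = w - alpha * dE/dw
            }
            if (E < 1e-6) break;                      // stop when the error is tiny
        }
        std::printf("learned w = %f (target 2.0)\n", w);
        return 0;
    }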
A single neuron
[Figure: a single neuron with inputs x_1, ..., x_m, weights w_1, ..., w_m, bias b, a weighted-sum unit ∑, and an activation function]
s = w_1*x_1 + ... + w_m*x_m + b
An artificial neuron has two components: (1) a weighted sum and (2) an activation
function.
o Many activation functions: Sigmoid, ReLU, etc.
Sigmoid function
•Shape: S-shaped curve.
•Range: Outputs values between 0 and 1.
•Usage: Often used in binary classification problems.
•Behavior:
• Small input values (negative) result in outputs close to 0.
• Large input values (positive) result in outputs close to 1.
• Middle input values result in outputs around 0.5.
Example: Imagine a light dimmer switch that smoothly transitions from off (0) to fully on (1).
Sigmoid function
σ(x) = 1 / (1 + e^(−x))
The derivative of the sigmoid function:
σ'(x) = σ(x) (1 − σ(x))
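A small sketch of the sigmoid and the derivative identity above (the test inputs are arbitrary):

    #include <cmath>
    #include <cstdio>

    // sigma(x) = 1 / (1 + e^(-x))
    double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

    // Derivative using the identity sigma'(x) = sigma(x) * (1 - sigma(x)).
    double sigmoid_prime(double x) {
        double s = sigmoid(x);
        return s * (1.0 - s);
    }

    int main() {
        double xs[3] = {-2.0, 0.0, 2.0};
        for (double x : xs) {
            std::printf("x=%5.1f  sigma=%.4f  sigma'=%.4f\n",
                        x, sigmoid(x), sigmoid_prime(x));
        }
        return 0;
    }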
ReLU (Rectified Linear Unit)
•Shape: Linear for positive values, flat for negative values.
•Range: Outputs values between 0 and infinity.
•Usage: Commonly used in hidden layers of neural networks.
•Behavior:
• Negative input values result in outputs of 0.
• Positive input values result in outputs equal to the input value.
Example: Think of a water tap that only lets water flow when turned on (positive
values), but stops completely when turned off (negative values).
Tanh (Hyperbolic Tangent)
•Shape: S-shaped curve, similar to Sigmoid but steeper.
•Range: Outputs values between -1 and 1.
•Usage: Often used in hidden layers of neural networks.
•Behavior:
•Small input values (negative) result in outputs close to -1.
•Large input values (positive) result in outputs close to 1.
•Middle input values result in outputs around 0.
•Example: Imagine a seesaw that smoothly transitions from one side (-1) to the other
side (1).
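For comparison with the sigmoid sketch, minimal ReLU and tanh implementations (std::tanh comes
from <cmath>; the sample inputs are arbitrary):

    #include <algorithm>
    #include <cmath>
    #include <cstdio>

    // ReLU: 0 for negative inputs, identity for positive inputs.
    double relu(double x) { return std::max(0.0, x); }

    int main() {
        double xs[5] = {-2.0, -0.5, 0.0, 0.5, 2.0};
        for (double x : xs) {
            // relu outputs values in [0, infinity); std::tanh outputs values in (-1, 1).
            std::printf("x=%5.1f  relu=%5.2f  tanh=%6.3f\n", x, relu(x), std::tanh(x));
        }
        return 0;
    }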
Comparison
•Sigmoid: Good for outputs that need to be between 0 and 1, but can suffer from
vanishing gradients (slow learning).
•ReLU: Efficient and widely used, but can suffer from "dead neurons" (outputs stuck at
0).
•Tanh: Similar to Sigmoid but outputs range from -1 to 1, providing stronger gradients
for learning.
Training for the logic AND with a single neuron
In general, one neuron can be trained to realize a linear (linearly separable) function.
The logic AND function is linearly separable:
x1  x2  x1 ⋀ x2
0   0   0
0   1   0
1   0   0
1   1   1
[Figure: the four input points (0, 0), (0, 1), (1, 0), (1, 1) plotted on the plane]
Logic AND (⋀) operation
Training for the logic AND with a single neuron
[Figure: a single neuron with inputs x1 = 0 and x2 = 1, weights w1 = 0 and w2 = 0, bias b = 0, weighted sum ∑, and Sigmoid activation]
s = w1*x1 + w2*x2 + b = 0
O = Sigmoid(s) = Sigmoid(0) = 0.5
Consider training data input (x1 = 0, x2 = 1), output Y = 0.
NN output O = 0.5
Error: E = 1/2 (Y − O)^2 = 0.125
To update w1, w2, and b, gradient descent needs to compute ∂E/∂w1, ∂E/∂w2, and ∂E/∂b
Chain rule for calculating ∂E/∂w1, ∂E/∂w2, and ∂E/∂b
[Figure: the same neuron, with w1 = 0, w2 = 0, b = 0, s = w1*x1 + w2*x2 + b = 0, and O = Sigmoid(0) = 0.5]
If a variable z depends on the variable y, which itself depends on the
variable x, then z depends on x as well, via the intermediate variable y.
The chain rule is a formula that expresses the derivative as:
dz/dx = (dz/dy) * (dy/dx)
Training for the logic AND with a single neuron
[Figure: the same neuron, with w1 = 0, w2 = 0, b = 0, s = 0, and O = Sigmoid(0) = 0.5]
s = w1*x1 + w2*x2 + b = 0
∂E/∂w1 = (∂E/∂O) * (∂O/∂s) * (∂s/∂w1)
∂E/∂O = ∂(1/2 (Y − O)^2)/∂O = O − Y = 0.5 − 0 = 0.5
∂O/∂s = ∂Sigmoid(s)/∂s = Sigmoid(s) (1 − Sigmoid(s)) = 0.5 (1 − 0.5) = 0.25
∂s/∂w1 = ∂(w1*x1 + w2*x2 + b)/∂w1 = x1 = 0
To update w1: w1 = w1 − α * (∂E/∂O) * (∂O/∂s) * (∂s/∂w1) = 0 − 0.1*0.5*0.25*0 = 0
Assume learning rate α = 0.1
Training for the logic AND with a single neuron
[Figure: the same neuron, with w1 = 0, w2 = 0, b = 0, s = 0, and O = Sigmoid(0) = 0.5]
s = w1*x1 + w2*x2 + b = 0
∂E/∂O = ∂(1/2 (Y − O)^2)/∂O = O − Y = 0.5 − 0 = 0.5
∂O/∂s = ∂Sigmoid(s)/∂s = Sigmoid(s) (1 − Sigmoid(s)) = 0.5 (1 − 0.5) = 0.25
∂s/∂w2 = ∂(w1*x1 + w2*x2 + b)/∂w2 = x2 = 1
To update w2: w2 = w2 − α * (∂E/∂O) * (∂O/∂s) * (∂s/∂w2) = 0 − 0.1*0.5*0.25*1 = −0.0125
Training for the logic AND with a single neuron
[Figure: the same neuron, with w1 = 0, w2 = 0, b = 0, s = 0, and O = Sigmoid(0) = 0.5]
s = w1*x1 + w2*x2 + b = 0
∂E/∂O = ∂(1/2 (Y − O)^2)/∂O = O − Y = 0.5 − 0 = 0.5
∂O/∂s = ∂Sigmoid(s)/∂s = Sigmoid(s) (1 − Sigmoid(s)) = 0.5 (1 − 0.5) = 0.25
∂s/∂b = ∂(w1*x1 + w2*x2 + b)/∂b = 1
To update b: b = b − α * (∂E/∂O) * (∂O/∂s) * (∂s/∂b) = 0 − 0.1*0.5*0.25*1 = −0.0125
Training for the logic AND with a single neuron
[Figure: the neuron after one update, with w1 = 0, w2 = −0.0125, and b = −0.0125]
Updated parameters: w1 = 0, w2 = −0.0125, b = −0.0125
This process is repeated until the error is sufficiently small.
The initial weights should be randomized. Gradient descent can get stuck in a local
optimum.
See lect7/one.cpp for training the logic AND operation with a single neuron.
Note: the logic XOR operation is not linearly separable and cannot be trained with one
neuron.
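The following is not lect7/one.cpp, only a minimal sketch in the same spirit: it repeats the w1/w2/b
updates derived above over the four AND samples. The learning rate and iteration count are
assumptions; starting from all-zero parameters is fine here because there is only one neuron.

    #include <cmath>
    #include <cstdio>

    double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

    int main() {
        // The four training samples of the logic AND function.
        double X[4][2] = {{0,0}, {0,1}, {1,0}, {1,1}};
        double Y[4]    = { 0,     0,     0,     1  };

        double w1 = 0.0, w2 = 0.0, b = 0.0;   // the slides start from all zeros
        double alpha = 0.1;                   // learning rate (assumed)

        for (int iter = 0; iter < 100000; iter++) {
            for (int i = 0; i < 4; i++) {
                double s = w1 * X[i][0] + w2 * X[i][1] + b;  // weighted sum
                double O = sigmoid(s);                       // neuron output
                // Chain rule: dE/dw = (O - Y) * O * (1 - O) * x
                double dE_ds = (O - Y[i]) * O * (1.0 - O);
                w1 -= alpha * dE_ds * X[i][0];
                w2 -= alpha * dE_ds * X[i][1];
                b  -= alpha * dE_ds;
            }
        }
        for (int i = 0; i < 4; i++) {
            double O = sigmoid(w1 * X[i][0] + w2 * X[i][1] + b);
            std::printf("%g AND %g -> %.3f (target %g)\n", X[i][0], X[i][1], O, Y[i]);
        }
        return 0;
    }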
Multi-level feedforward neural networks
A multi-level feedforward neural network is a neural network that
consists of multiple levels of neurons. Each level can have many neurons
and connections between neurons in different levels do not form loops.
o Information moves in one direction (forward) from input nodes, through hidden
nodes, to output nodes.
One artificial neuron can only realize a linear function.
Many layers of neurons combine linear functions and can be trained to approximate
arbitrarily complex functions.
o One hidden layer (with enough neurons) can be trained to approximate any continuous
function.
Multi-level feedforward neural networks examples
A layer of neurons that is neither the input layer nor the output layer is called a
hidden layer.
[Figure: a network with an input layer, hidden layers 1-3, and an output layer]
Build a 3-level neural network from scratch
3 levels: input level, hidden level, output level
o Other assumptions: fully connected between layers, all neurons use the sigmoid
function σ as the activation function.
Notations:
o N0: size of the input layer. Input: IN[N0] = [IN_1, IN_2, ..., IN_N0]
o N1: size of the hidden layer
o N2: size of the output layer. Output: OO[N2] = [OO_1, OO_2, ..., OO_N2]
Build a 3-level neural network from scratch
Notations:
o N0, N1, N2: sizes of the input layer, hidden layer, and output layer, respectively
o W0[N0][N1]: weights from the input layer to the hidden layer. W0[i][j] is the weight from
input unit i to hidden unit j. Hidden layer biases B1[N1] = [B1_1, B1_2, ..., B1_N1]
o W1[N1][N2]: weights from the hidden layer to the output layer. W1[i][j] is the weight from
hidden unit i to output unit j. Output layer biases B2[N2] = [B2_1, B2_2, ..., B2_N2]
o W0 is an N0×N1 matrix and W1 is an N1×N2 matrix:
W0 = [W0[i][j]], i = 1..N0, j = 1..N1;  W1 = [W1[i][j]], i = 1..N1, j = 1..N2
3-level feedforward neural network
[Figure: 3-level feedforward neural network]
Input layer (units 1..N0): input IN[N0]
Hidden layer (units 1..N1): weights W0[N0][N1], biases B1[N1], weighted sums HS[N1],
outputs HO[N1]
Output layer (units 1..N2): weights W1[N1][N2], biases B2[N2], weighted sums OS[N2],
outputs OO[N2]
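One possible set of C++ declarations matching the arrays in this figure (the layer sizes 2/4/1 and the
initialization range are assumptions, not fixed by the slides):

    #include <cstdlib>

    // Example layer sizes (assumed for illustration; any sizes work).
    const int N0 = 2;       // input layer size
    const int N1 = 4;       // hidden layer size
    const int N2 = 1;       // output layer size

    double IN[N0];          // input vector
    double W0[N0][N1];      // weights, input layer -> hidden layer
    double B1[N1];          // hidden layer biases
    double HS[N1];          // hidden layer weighted sums
    double HO[N1];          // hidden layer outputs
    double W1[N1][N2];      // weights, hidden layer -> output layer
    double B2[N2];          // output layer biases
    double OS[N2];          // output layer weighted sums
    double OO[N2];          // output layer outputs

    // Random initial weights in [-1, 1], as recommended earlier; biases start at 0 here.
    void init_weights() {
        for (int i = 0; i < N0; i++)
            for (int j = 0; j < N1; j++)
                W0[i][j] = 2.0 * std::rand() / RAND_MAX - 1.0;
        for (int i = 0; i < N1; i++)
            for (int j = 0; j < N2; j++)
                W1[i][j] = 2.0 * std::rand() / RAND_MAX - 1.0;
        for (int j = 0; j < N1; j++) B1[j] = 0.0;
        for (int j = 0; j < N2; j++) B2[j] = 0.0;
    }

    int main() { init_weights(); return 0; }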
Forward propagation (compute OO and E)
Compute the hidden layer weighted sums: HS[N1] = [HS_1, HS_2, ..., HS_N1]
o HS_j = IN_1 × W0[1][j] + IN_2 × W0[2][j] + ... + IN_N0 × W0[N0][j] + B1_j
o In matrix form: HS = IN × W0 + B1
Compute the hidden layer outputs: HO[N1] = [HO_1, HO_2, ..., HO_N1]
o HO_j = σ(HS_j)
o In matrix form: HO = σ(HS)
Forward propagation
From input (IN[N0]), compute output (OO[N2]) and error E.
Compute the output layer weighted sums: OS[N2] = [OS_1, OS_2, ..., OS_N2]
o OS_j = HO_1 × W1[1][j] + HO_2 × W1[2][j] + ... + HO_N1 × W1[N1][j] + B2_j
o In matrix form: OS = HO × W1 + B2
Compute the final output: OO[N2] = [OO_1, OO_2, ..., OO_N2]
o OO_j = σ(OS_j)
o In matrix form: OO = σ(OS)
Let us use the mean square error: E = 1/2 Σ_j (Y_j − OO_j)^2
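A self-contained sketch of the two forward-propagation steps and the error (the layer sizes, the sample
input, and the target are assumptions; the declarations repeat the previous sketch so this compiles on
its own):

    #include <cmath>
    #include <cstdio>

    const int N0 = 2, N1 = 4, N2 = 1;                  // assumed layer sizes

    double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

    double IN[N0], W0[N0][N1], B1[N1], HS[N1], HO[N1];
    double W1[N1][N2], B2[N2], OS[N2], OO[N2], Y[N2];

    int main() {
        IN[0] = 1.0; IN[1] = 0.0;                      // example input (assumed)
        Y[0] = 1.0;                                    // example target (assumed)
        // Weights and biases are left at 0 here; a real program initializes them randomly.

        // Hidden layer: HS_j = sum_i IN_i * W0[i][j] + B1_j,  HO_j = sigmoid(HS_j)
        for (int j = 0; j < N1; j++) {
            HS[j] = B1[j];
            for (int i = 0; i < N0; i++) HS[j] += IN[i] * W0[i][j];
            HO[j] = sigmoid(HS[j]);
        }

        // Output layer: OS_j = sum_i HO_i * W1[i][j] + B2_j,  OO_j = sigmoid(OS_j)
        for (int j = 0; j < N2; j++) {
            OS[j] = B2[j];
            for (int i = 0; i < N1; i++) OS[j] += HO[i] * W1[i][j];
            OO[j] = sigmoid(OS[j]);
        }

        // Error: E = 1/2 * sum_j (Y_j - OO_j)^2
        double E = 0.0;
        for (int j = 0; j < N2; j++) E += 0.5 * (Y[j] - OO[j]) * (Y[j] - OO[j]);
        std::printf("OO[0] = %f, E = %f\n", OO[0], E);
        return 0;
    }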
Backward propagation
The goal is to compute ∂E/∂W0, ∂E/∂B1, ∂E/∂W1, and ∂E/∂B2.
∂E/∂OO = [∂E/∂OO_1, ∂E/∂OO_2, ..., ∂E/∂OO_N2] = [(OO_1 − Y_1), (OO_2 − Y_2), ..., (OO_N2 − Y_N2)]
In matrix form: ∂E/∂OO = (OO − Y)
This can be stored in an array dE_OO[N2].
Backward propagation
The goal is to compute ∂E/∂W0, ∂E/∂B1, ∂E/∂W1, and ∂E/∂B2.
∂E/∂OO is done
∂E/∂OS = [∂E/∂OS_1, ..., ∂E/∂OS_N2]
= [∂E/∂OO_1 × σ(OS_1)(1 − σ(OS_1)), ..., ∂E/∂OO_N2 × σ(OS_N2)(1 − σ(OS_N2))]
In matrix form: ∂E/∂OS = (OO − Y) ⊙ OO ⊙ (1 − OO)
This can be stored in an array dE_OS[N2].
Backward propagation
The goal is to compute ∂E/∂W0, ∂E/∂B1, ∂E/∂W1, and ∂E/∂B2.
∂E/∂OO, ∂E/∂OS are done
OS_j = HO_1 × W1[1][j] + HO_2 × W1[2][j] + ... + HO_N1 × W1[N1][j] + B2_j
Hence, ∂OS_j/∂B2_j = 1.
∂E/∂B2 = [∂E/∂B2_1, ..., ∂E/∂B2_N2] = [∂E/∂OS_1, ..., ∂E/∂OS_N2] = ∂E/∂OS
Backward propagation
The goal is to compute ∂E/∂W0, ∂E/∂B1, ∂E/∂W1, and ∂E/∂B2.
∂E/∂OO, ∂E/∂OS, ∂E/∂B2 are done
∂E/∂W1 is an N1×N2 matrix whose (i, j) entry is ∂E/∂W1[i][j].
OS_j = HO_1 × W1[1][j] + HO_2 × W1[2][j] + ... + HO_N1 × W1[N1][j] + B2_j
Hence, ∂OS_j/∂W1[i][j] = HO_i, and ∂E/∂W1[i][j] = ∂E/∂OS_j × HO_i.
Backward propagation
The goal is to compute ∂E/∂W0, ∂E/∂B1, ∂E/∂W1, and ∂E/∂B2.
∂E/∂OO, ∂E/∂OS, ∂E/∂B2 are done
∂E/∂W1 = [∂E/∂W1[i][j]] = [HO_i × ∂E/∂OS_j] (an N1×N2 matrix)
In matrix form: ∂E/∂W1 = HO^T × ∂E/∂OS
(treating HO as a 1×N1 row vector and ∂E/∂OS as a 1×N2 row vector)
Backward propagation
The goal is to compute ∂E/∂W0, ∂E/∂B1, ∂E/∂W1, and ∂E/∂B2.
∂E/∂OO, ∂E/∂OS, ∂E/∂B2, ∂E/∂W1 are done
∂E/∂HO = [∂E/∂HO_1, ∂E/∂HO_2, ..., ∂E/∂HO_N1]
∂E/∂HO_i = ∂E/∂OS_1 × ∂OS_1/∂HO_i + ∂E/∂OS_2 × ∂OS_2/∂HO_i + ... + ∂E/∂OS_N2 × ∂OS_N2/∂HO_i
= ∂E/∂OS_1 × W1[i][1] + ∂E/∂OS_2 × W1[i][2] + ... + ∂E/∂OS_N2 × W1[i][N2]
Backward propagation
The goal is to compute ∂E/∂W0, ∂E/∂B1, ∂E/∂W1, and ∂E/∂B2.
∂E/∂OO, ∂E/∂OS, ∂E/∂B2, ∂E/∂W1 are done
In matrix form: ∂E/∂HO = [∂E/∂HO_1, ∂E/∂HO_2, ..., ∂E/∂HO_N1] = ∂E/∂OS × W1^T
Backward propagation
The goal is to compute ∂E/∂W0, ∂E/∂B1, ∂E/∂W1, and ∂E/∂B2.
∂E/∂OO, ∂E/∂OS, ∂E/∂B2, ∂E/∂W1, ∂E/∂HO are done
Once ∂E/∂HO is computed, we can repeat the process for the hidden
layer by replacing OO with HO, OS with HS, B2 with B1, and W1 with
W0 in the differentiation. For the hidden layer, the input is IN[N0] and the
output is HO[N1].
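Pulling the derivatives together, here is one self-contained sketch of a forward pass followed by a
single backward-propagation update, including the hidden-layer repetition described above. The layer
sizes, sample input, and learning rate are assumptions, and this is not the actual 3level.cpp.

    #include <cmath>
    #include <cstdio>

    const int N0 = 2, N1 = 4, N2 = 1;                  // assumed layer sizes
    const double alpha = 0.5;                          // assumed learning rate

    double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

    double IN[N0], W0[N0][N1], B1[N1], HS[N1], HO[N1];
    double W1[N1][N2], B2[N2], OS[N2], OO[N2], Y[N2];
    double dE_OO[N2], dE_OS[N2], dE_HO[N1], dE_HS[N1];

    void forward() {
        for (int j = 0; j < N1; j++) {                 // HS = IN x W0 + B1, HO = sigmoid(HS)
            HS[j] = B1[j];
            for (int i = 0; i < N0; i++) HS[j] += IN[i] * W0[i][j];
            HO[j] = sigmoid(HS[j]);
        }
        for (int j = 0; j < N2; j++) {                 // OS = HO x W1 + B2, OO = sigmoid(OS)
            OS[j] = B2[j];
            for (int i = 0; i < N1; i++) OS[j] += HO[i] * W1[i][j];
            OO[j] = sigmoid(OS[j]);
        }
    }

    void backward() {
        // Output layer: dE/dOO = OO - Y, dE/dOS = dE/dOO * OO * (1 - OO)
        for (int j = 0; j < N2; j++) {
            dE_OO[j] = OO[j] - Y[j];
            dE_OS[j] = dE_OO[j] * OO[j] * (1.0 - OO[j]);
        }
        // dE/dHO_i = sum_j dE/dOS_j * W1[i][j]  (computed before W1 is updated)
        for (int i = 0; i < N1; i++) {
            dE_HO[i] = 0.0;
            for (int j = 0; j < N2; j++) dE_HO[i] += dE_OS[j] * W1[i][j];
        }
        // Update output-layer parameters: dE/dW1[i][j] = HO_i * dE/dOS_j, dE/dB2_j = dE/dOS_j
        for (int j = 0; j < N2; j++) {
            for (int i = 0; i < N1; i++) W1[i][j] -= alpha * HO[i] * dE_OS[j];
            B2[j] -= alpha * dE_OS[j];
        }
        // Hidden layer: same formulas with OO->HO, OS->HS, W1->W0, B2->B1, and input IN
        for (int j = 0; j < N1; j++) {
            dE_HS[j] = dE_HO[j] * HO[j] * (1.0 - HO[j]);
            for (int i = 0; i < N0; i++) W0[i][j] -= alpha * IN[i] * dE_HS[j];
            B1[j] -= alpha * dE_HS[j];
        }
    }

    int main() {
        IN[0] = 1.0; IN[1] = 1.0; Y[0] = 0.0;          // one made-up training sample
        forward();
        std::printf("error before update: %f\n", 0.5 * (Y[0] - OO[0]) * (Y[0] - OO[0]));
        backward();
        forward();
        std::printf("error after update:  %f\n", 0.5 * (Y[0] - OO[0]) * (Y[0] - OO[0]));
        return 0;
    }

Running the sketch should show the error decrease after the single update, since every parameter
moves against its gradient; a full trainer simply wraps forward() and backward() in the loop from the
earlier algorithm slide.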
Summary
[Figure: input layer IN, hidden layers H1 and H2, and output layer; layers 1, 2, 3 map input X to output Y]
The output of a layer is the input of the next layer.
Backward propagation uses results from forward propagation.
o For example, ∂E/∂OO = OO − Y, ∂OO/∂OS = OO ⊙ (1 − OO), and ∂OS_j/∂W1[i][j] = HO_i all
reuse the values OO and HO computed during forward propagation.
Training for the logic XOR and AND with a 6-unit 2-level neural network
The logic XOR function is not linearly separable.
See 3level.cpp
x1  x2  x1 ⨁ x2
0   0   0
0   1   1
1   0   1
1   1   0
[Figure: the four input points (0, 0), (0, 1), (1, 0), (1, 1) on the plane, with the AND and XOR decision boundaries drawn]
Logic XOR (⨁) operation
Summary
Briefly discussed multi-level feedforward neural networks
The training of neural networks
Following 3level.cpp, one should be able to write a program for any multi-level
feedforward neural network.