Activation Function
In Artificial Neural Network
An Activation Function decides whether a neuron should be activated or
not. In other words, it uses simple mathematical operations to decide
whether the neuron’s input is important to the network’s prediction.
The role of the Activation Function is to derive output from a set of input values
fed to a node (or a layer).
If we compare the neural network to our brain, a node is a replica of a
neuron that receives a set of input signals (external stimuli).
Depending on the nature and intensity of these input signals, the brain
processes them and decides whether the neuron should be activated (“fired”)
or not.
In deep learning, this is also the role of the Activation Function, which is
why it’s often referred to as a Transfer Function in Artificial Neural Networks.
The primary role of the Activation Function is to transform the summed
weighted input of the node into an output value that is either fed to the
next layer or returned as the network’s output.
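The transformation described above can be sketched for a single node as follows. This is a minimal illustration, not a reference implementation: the function name neuron_output, the input values, the weights, and the bias are all hypothetical, and the sigmoid function (covered later in this document) is used only as one possible activation.

```python
import numpy as np

def sigmoid(z):
    # One possible activation function; others can be passed in instead.
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(inputs, weights, bias, activation):
    """Summed weighted input of a node, passed through an activation."""
    z = np.dot(weights, inputs) + bias   # summed weighted input
    return activation(z)                 # value handed to the next layer

# Hypothetical values chosen only for illustration.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.4, 0.3, -0.2])
b = 0.1

print(neuron_output(x, w, b, sigmoid))
```

Passing a different function as the activation argument changes only the last step; the weighted sum stays the same.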
Why do Neural Networks Need an Activation Function?
The purpose of an activation function is to add non-linearity to the neural
network. A neural network without an activation function is essentially just a
linear regression model. The activation function applies a non-linear
transformation to the input, which makes the network capable of learning and
performing more complex tasks.
Let’s suppose we have a neural network working without the activation
functions.
In that case, every neuron will only be performing a linear transformation on
the inputs using the weights and biases. It doesn’t matter how many hidden
layers we attach to the neural network; all layers will behave in the same
way because the composition of two linear functions is itself a linear
function.
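The collapse of stacked linear layers can be checked numerically. The sketch below assumes two layers with no activation function and arbitrary random weights; the layer sizes (3 -> 4 -> 2) and variable names are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers without activation functions (hypothetical sizes: 3 -> 4 -> 2).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Forward pass through both layers.
two_layer = W2 @ (W1 @ x + b1) + b2

# The same mapping written as a single linear layer.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True: both give the same output
```

The same algebra applies to any number of layers, which is why depth adds nothing without a non-linear activation between the layers.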
The activation function is also used to shape the output of the neural
network, for example into a yes-or-no decision. Depending on the function,
it maps the resulting values into a range such as 0 to 1 or -1 to 1.
Activation Functions can be divided into 2 types:
1. Linear Activation Function
2. Non-linear Activation Functions
The choice of activation function has a large impact on the capability and
performance of the neural network, and different activation functions may
be used in different parts of the model.
Linear Activation Function
The linear activation function, also known as "no activation" or the "identity
function" (the input multiplied by 1.0), is where the activation is
proportional to the input. The function doesn't do anything to the weighted
sum of the input; it simply returns the value it was given.
Mathematically it can be represented as: f(x) = x
However, a linear activation function has two major problems:
1. Backpropagation is of little use, because the derivative of the function
is a constant and has no relation to the input x, so the gradient carries
no information about the input.
2. All layers of the neural network will collapse into one if a linear
activation function is used. No matter the number of layers in the neural
network, the last layer will still be a linear function of the first layer.
So, essentially, a linear activation function turns the neural network into
just one layer.
Non-Linear Activation Functions
A network that uses only the linear activation function shown above is
simply a linear regression model.
Because of its limited power, it cannot create complex mappings between the
network’s inputs and outputs.
Non-linear activation functions address these limitations of linear
activation functions:
1. They allow backpropagation because the derivative is now related to the
input, so it’s possible to go back and understand which weights in the
input neurons can provide a better prediction.
2. They allow the stacking of multiple layers of neurons because the output
is now a non-linear combination of the input passed through multiple
layers. With enough such layers and neurons, a wide range of outputs can be
represented as a functional computation in a neural network.
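A small example of what stacking with a non-linearity makes possible: the XOR function cannot be computed by any single linear layer, but a two-layer network with a non-linear hidden layer can compute it. The sketch below is an illustration under stated assumptions: the weights are hand-picked rather than trained, the function name xor_net is hypothetical, and ReLU (covered later in this document) is used as the hidden activation.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hand-picked (not trained) weights for a 2 -> 2 -> 1 network.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])

def xor_net(x):
    h = relu(W1 @ x + b1)   # non-linear hidden layer
    return w2 @ h           # linear output layer

# Expected XOR outputs for the four inputs: 0, 1, 1, 0.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x, dtype=float)))
```

If relu is replaced by the identity function, the network collapses to a single linear map and can no longer produce these four outputs.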
Now, let’s have a look at different non-linear activation functions and
their characteristics.
Sigmoid / Logistic Activation Function
This function takes any real value as input and outputs values in the range of
0 to 1.
The larger the input (more positive), the closer the output value will be to 1.0,
whereas the smaller the input (more negative), the closer the output will be to
0.0.
Mathematically it can be represented as: f(x) = 1 / (1 + e^(-x))
Here’s why the sigmoid/logistic activation function is one of the most
widely used functions:
1. It is commonly used for models that must predict a probability as the
output. Since a probability always lies between 0 and 1, sigmoid is a
natural choice because of its range.
2. The function is differentiable and provides a smooth gradient, which
prevents jumps in output values. This is reflected in the S-shape of the
sigmoid activation function.
3. Uses: It is usually used in the output layer of a binary classifier,
where the result is either 0 or 1. Because the value of the sigmoid
function lies between 0 and 1, the result can be predicted as 1 if the
value is greater than 0.5 and 0 otherwise.
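The points above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation (for very large negative inputs, np.exp can overflow and emit a warning); the helper names sigmoid_derivative and predict_class and the sample scores are assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # Smooth gradient: s * (1 - s), largest at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def predict_class(x, threshold=0.5):
    # 1 if the sigmoid output is greater than the threshold, else 0.
    return (sigmoid(x) > threshold).astype(int)

scores = np.array([-4.0, -0.5, 0.0, 0.5, 4.0])  # hypothetical raw outputs
print(sigmoid(scores))             # all values lie between 0 and 1
print(sigmoid_derivative(scores))  # peaks at 0.25 for the score 0.0
print(predict_class(scores))       # [0 0 0 1 1]
```

Note that a score of exactly 0.0 maps to 0.5, which is not greater than the threshold, so it is assigned to class 0 here.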
Tanh Function (Hyperbolic Tangent)
The tanh function is very similar to the sigmoid/logistic activation function
and even has the same S-shape; the difference is its output range of -1 to 1.
In tanh, the larger the input (more positive), the closer the output value
will be to 1.0, whereas the smaller the input (more negative), the closer the
output will be to -1.0.
Mathematically it can be represented as: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Advantages of using this activation function are:
1. The output of the tanh activation function is zero-centred; hence we can
easily map the output values as strongly negative, neutral, or strongly
positive.
2. It is usually used in the hidden layers of a neural network because its
values lie between -1 and 1; therefore, the mean of the hidden layer’s
output comes out to be 0 or very close to it. This helps in centring the
data and makes learning for the next layer much easier.
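The zero-centring property can be illustrated by comparing tanh with sigmoid on inputs that are symmetric around 0. The input grid below is an assumption chosen for illustration; np.tanh is NumPy's built-in hyperbolic tangent. The last line checks the standard identity tanh(x) = 2 * sigmoid(2x) - 1, which shows that tanh is a rescaled and shifted sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 7)  # inputs symmetric around 0

tanh_out = np.tanh(x)
sigmoid_out = sigmoid(x)

print(tanh_out.mean())     # approximately 0: outputs are zero-centred
print(sigmoid_out.mean())  # approximately 0.5: outputs are not zero-centred

# tanh is a rescaled, shifted sigmoid.
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))
```

On real data the hidden-layer mean is only close to 0 when the inputs to tanh are themselves roughly balanced around 0.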
ReLU Function
ReLU stands for Rectified Linear Unit.
Although it gives the impression of a linear function, ReLU is non-linear,
has a derivative, and allows for backpropagation while remaining
computationally efficient.
The main catch here is that the ReLU function does not activate all the
neurons at the same time.
The neurons will only be deactivated if the output of the linear transformation
is less than 0.
Mathematically it can be represented as: f(x) = max(0, x)
The advantages of using ReLU as an activation function are as follows:
1. Since only a certain number of neurons are activated, and the function
itself is only a comparison with zero, ReLU is far more computationally
efficient than the sigmoid and tanh functions.
2. ReLU accelerates the convergence of gradient descent towards a minimum
of the loss function due to its linear, non-saturating property.
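ReLU and its derivative can be sketched as below. The helper name relu_derivative and the sample values are assumptions for illustration; the derivative at exactly 0 is undefined mathematically, and setting it to 0 there is a common convention rather than the only choice.

```python
import numpy as np

def relu(x):
    # max(0, x): negative inputs are set to 0, positive inputs pass through.
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 1 for positive inputs, 0 otherwise (value at x == 0 is a convention).
    return (x > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # hypothetical pre-activations

print(relu(z))                 # [0.  0.  0.  0.5 2. ]
print(relu_derivative(z))      # [0. 0. 0. 1. 1.]
print(np.mean(relu(z) == 0))   # fraction of deactivated neurons: 0.6
```

For positive inputs the derivative stays at 1 however large the input gets, which is the non-saturating behaviour mentioned above.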