Their primary role is to transform the summed weighted input from a neuron into an
output value. This output is then fed to the next layer of neurons or becomes the final output
of the network. Without them, a neural network would just be a linear regression model, no
matter how many layers it had, and couldn't learn complex patterns.
Let's explore different types of activation functions:
1. Binary Step Function
Imagine a light switch: it's either ON or OFF.
How it works: This function is like a simple "if-else" statement. You set a
"threshold" value. If the input to the neuron is greater than this threshold, the neuron
"activates" (outputs 1). Otherwise, it "deactivates" (outputs 0).
Equation:
o f(x)=1 if x≥threshold
o f(x)=0 if x<threshold
Example: Let's say your threshold is 0.5.
o If the input is 0.7, output is 1.
o If the input is 0.3, output is 0.
When to use: It's mainly used in binary classification problems (e.g., spam or not
spam) where you need a clear 0 or 1 output.
Limitations:
o Cannot handle multi-class classification: If you have more than two
categories (e.g., cat, dog, bird), this function won't work.
o No gradient: It's not differentiable, which means you can't use it with
gradient-based optimization algorithms (like backpropagation) that are crucial
for training neural networks. This is its biggest drawback.
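To make the thresholding behavior concrete, here is a minimal NumPy sketch; the function name binary_step and the 0.5 default threshold are illustrative choices, not part of any particular library.

import numpy as np

def binary_step(x, threshold=0.5):
    # Output 1 where the input meets or exceeds the threshold, 0 otherwise.
    return np.where(x >= threshold, 1, 0)

print(binary_step(np.array([0.7, 0.3])))  # -> [1 0]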
2. Linear Activation Function
Think of a direct pipeline where what goes in, comes straight out, perhaps scaled.
How it works: This function simply outputs a value that is directly proportional to the
input. It doesn't really "activate" anything in a non-linear way. It's also called "no
activation" or "identity function."
Equation: y=Wx+b (where W is weight, x is input, b is bias). In the context of an
activation function, it simplifies to f(x)=x or f(x)=cx (where c is a constant).
Example:
o If the input is 5, output is 5 (if f(x)=x).
o If the input is 2 and f(x)=2x, output is 4.
When to use: While simple, it's not commonly used as an activation function in
hidden layers of deep neural networks because stacking multiple linear layers still
results in a single linear transformation. It might be used in the output layer for
regression problems where you need to predict a continuous value.
Limitations:
o Cannot learn complex patterns: Since it's linear, it can only model linear
relationships. Neural networks need non-linear activation functions to learn
from complex, real-world data.
o Vanishing/Exploding Gradients: In deep networks, gradients can shrink
toward zero or grow uncontrollably large during training, making learning very
difficult.
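For completeness, a minimal NumPy sketch of the identity-style behavior described above; the function name linear and the scaling constant c are purely illustrative.

import numpy as np

def linear(x, c=1.0):
    # "No activation": the output is just a (possibly scaled) copy of the input.
    return c * x

print(linear(np.array([5.0])))         # -> [5.]
print(linear(np.array([2.0]), c=2.0))  # -> [4.]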
3. Non-Linear Activation Functions
These are the real stars of neural networks! They introduce non-linearity, allowing networks
to learn complex relationships and patterns in data.
a) Sigmoid / Logistic Activation Function
Imagine a smooth "S" curve that squishes any input value into a range between 0 and 1. This
is great for probabilities.
How it works: Takes any real-valued input and transforms it into an output between 0
and 1. Larger positive inputs get closer to 1, while larger negative inputs get closer to
0.
Equation: f(x) = 1 / (1 + e^(-x))
Example:
o If input is very large positive (e.g., 100), output is very close to 1 (e.g.,
0.999...).
o If input is 0, output is 0.5.
o If input is very large negative (e.g., -100), output is very close to 0 (e.g.,
0.000...).
When to use:
o Output layer for binary classification: Since its output is between 0 and 1,
it's perfect for predicting probabilities (e.g., the probability of an email being
spam).
o Historically popular in hidden layers: While less common now due to
limitations, it was widely used in hidden layers.
Advantages:
o Smooth gradient: It's differentiable everywhere, providing a smooth gradient
for backpropagation.
o Probabilistic output: Maps outputs to probabilities, which is intuitive for
many problems.
Limitations:
o Vanishing Gradient Problem: The most significant issue. As inputs move
further away from 0 (either very positive or very negative), the gradient of the
function becomes very small, almost flat. This means during backpropagation,
the gradients become tiny, and the network learns very slowly or stops
learning altogether in earlier layers. Imagine trying to learn by taking tiny, tiny
steps.
o Not Zero-centered: The output is always positive (between 0 and 1). This can
lead to issues with gradient updates where all gradients are either positive or
negative, causing a "zigzagging" effect during optimization.
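A minimal NumPy sketch of the sigmoid as defined above (the function name is illustrative; deep learning frameworks ship their own implementations):

import numpy as np

def sigmoid(x):
    # Squash any real value into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-100.0, 0.0, 100.0])))  # -> approximately [0., 0.5, 1.]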
b) Tanh Function (Hyperbolic Tangent)
Similar to Sigmoid, but "zero-centered," meaning its output ranges from -1 to 1.
How it works: Like the sigmoid, it has an S-shape, but its output values range from -
1 to 1. Larger positive inputs are closer to 1, and larger negative inputs are closer to -
1.
Equation: f(x) = (e^x − e^(-x)) / (e^x + e^(-x))
Example:
o If input is very large positive, output is very close to 1.
o If input is 0, output is 0.
o If input is very large negative, output is very close to -1.
When to use: Often preferred over sigmoid in hidden layers because of its zero-
centered output. This helps with the optimization process.
Advantages:
o Zero-centered output: This addresses one of the sigmoid's limitations,
making training generally more stable and faster. It means the mean of the
hidden layer outputs is closer to zero, which helps in centering the data for the
next layer.
o Stronger gradients near the origin: Compared to sigmoid, it has a larger
range of significant gradients.
Limitations:
o Still suffers from Vanishing Gradient Problem: Although better than
sigmoid, it still faces the vanishing gradient issue for very large or very small
inputs, as its ends are also flat.
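A minimal sketch using NumPy's built-in np.tanh; the example inputs are just for illustration.

import numpy as np

def tanh(x):
    # Zero-centered S-curve, equivalent to (e^x - e^(-x)) / (e^x + e^(-x)).
    return np.tanh(x)

print(tanh(np.array([-100.0, 0.0, 100.0])))  # -> [-1.  0.  1.]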
c) ReLU Activation Function (Rectified Linear Unit)
The most popular choice for hidden layers in deep learning today. It's simple and efficient.
How it works: It's straightforward: if the input is positive, it outputs the input
directly. If the input is negative, it outputs zero.
Equation: f(x)=max(0,x)
Example:
o If input is 5, output is 5.
o If input is -2, output is 0.
When to use: Widely used in hidden layers of deep neural networks for tasks like
image classification, natural language processing, etc.
Advantages:
o Solves Vanishing Gradient Problem (for positive inputs): For positive
inputs, the gradient is always 1, preventing the vanishing gradient issue.
o Computationally efficient: Simple calculation (just a comparison and a
return). This makes training much faster.
o Sparsity: It introduces sparsity in the network by setting negative activations
to zero. This can lead to more efficient representations.
Limitations:
o Dying ReLU Problem: This is its main drawback. If a neuron consistently
receives negative inputs, its output will always be zero, and its gradient will
also be zero. This means the neuron will stop learning and effectively "die,"
never activating again. Imagine a light that's permanently off.
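A minimal NumPy sketch of ReLU (the function name is illustrative; frameworks provide their own versions):

import numpy as np

def relu(x):
    # Positive inputs pass through unchanged; negative inputs are clamped to zero.
    return np.maximum(0, x)

print(relu(np.array([5.0, -2.0])))  # -> [5. 0.]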
d) Leaky ReLU Function
An improvement over ReLU to address the "Dying ReLU" problem.
How it works: Instead of setting negative inputs to exactly zero, Leaky ReLU allows
a small, non-zero slope for negative inputs (e.g., 0.01 * x). This small slope ensures
that the neuron can still learn, even if the input is negative.
Equation: f(x)=max(0.01x,x)
Example:
o If input is 5, output is 5.
o If input is -2, output is 0.01 × (−2) = −0.02.
When to use: When you suspect your ReLU neurons are "dying" and you want to
ensure some gradient flow for negative inputs.
Advantages:
o Addresses Dying ReLU: By allowing a small gradient for negative values, it
prevents neurons from becoming permanently inactive.
o Computationally efficient: Still very fast to compute.
o Allows negative values during backpropagation: Ensures that information
flows even for negative inputs.
Limitations:
o The "leak" (the small slope) is a fixed hyperparameter (e.g., 0.01). While
better than 0, it might not be optimal for all situations.
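A minimal NumPy sketch, assuming the commonly used 0.01 slope as the default for the leak (called alpha here):

import numpy as np

def leaky_relu(x, alpha=0.01):
    # Negative inputs keep a small slope (alpha) instead of being zeroed out.
    return np.where(x >= 0, x, alpha * x)

print(leaky_relu(np.array([5.0, -2.0])))  # -> [ 5.   -0.02]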
e) Parameterized ReLU (PReLU)
An even more flexible version of Leaky ReLU.
How it works: Instead of fixing the small slope for negative inputs (like 0.01 in
Leaky ReLU), PReLU makes this slope 'a' a learnable parameter. The network itself
learns the optimal value for 'a' during training.
Equation:
o f(x)=x if x≥0
o f(x)=ax if x<0 where 'a' is a trainable parameter.
Example:
o If input is 5, output is 5.
o If input is -2, output is a × (−2) = −2a. The value of 'a' will be learned by the network.
When to use: When you want the network to learn the optimal "leakiness" for
negative activations, potentially leading to faster and more optimal convergence.
Advantages:
o Learns optimal 'a': Allows the network to adapt the activation function to the
specific data, potentially leading to better performance than a fixed Leaky
ReLU.
o Addresses Dying ReLU: Like Leaky ReLU, it prevents dead neurons.
Limitations:
o Adds another parameter to learn, slightly increasing computational
complexity, but often worth it for the performance gains.
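A minimal NumPy sketch of the PReLU forward pass only; the slope 'a' is passed in explicitly, and the 0.25 starting value is just an assumed initialization that a framework would normally update during training.

import numpy as np

def prelu(x, a):
    # Same shape as Leaky ReLU, but the slope 'a' is a learnable parameter.
    return np.where(x >= 0, x, a * x)

a = 0.25  # assumed initial value; gradient descent would adjust it
print(prelu(np.array([5.0, -2.0]), a))  # -> [ 5.  -0.5]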
f) Softmax Function
The go-to for multi-class classification problems, providing probabilities for each class.
How it works: Unlike the previous functions that operate on a single input, Softmax
operates on a vector of inputs (typically the output of the last hidden layer). It
squashes these values into a range between 0 and 1, and crucially, all the outputs
sum up to 1. This makes them interpretable as probabilities for each class.
Equation (for a single output y_i in a vector of K outputs): y_i = e^(z_i) / Σ_{j=1..K} e^(z_j), where z_i is the raw input (logit) for class i, and the sum is over all K classes.
Example: Imagine you're classifying an image as a "cat," "dog," or "bird." The raw
outputs from the previous layer might be [2.0, 1.0, 0.1]. Applying Softmax gives
approximately [0.66, 0.24, 0.10]. This means the network predicts roughly a 66% chance
of being a cat, 24% a dog, and 10% a bird.
When to use: Almost exclusively used in the output layer of neural networks for
multi-class classification problems.
Advantages:
o Probabilistic output: Provides well-normalized probabilities for each class,
making the output easily interpretable.
o Handles multiple classes: Essential for problems with more than two
categories.
Limitations:
o Can be computationally expensive for a very large number of classes.
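A minimal NumPy sketch of softmax applied to the raw scores from the example above; subtracting the maximum is a standard numerical-stability trick, not part of the definition.

import numpy as np

def softmax(z):
    # Exponentiate (shifted for stability) and normalize so the outputs sum to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # raw scores for cat, dog, bird
print(softmax(scores))              # -> approximately [0.66 0.24 0.10]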
g) Swish Activation Function
A relatively newer, self-gated activation function showing promising results.
How it works: It's a smooth, non-monotonic function that allows small negative
values to pass through, unlike ReLU which zeros them out. This "self-gating"
mechanism helps in learning.
Equation: f(x) = x · sigmoid(x) = x / (1 + e^(-x))
Example:
o If input is very positive, it behaves similarly to ReLU (output close to input).
o If input is 0, output is 0.
o If input is slightly negative (e.g., -0.5), it allows a small negative value to pass
through, unlike ReLU which would output 0.
o If input is very negative, output approaches 0, similar to ReLU.
When to use: Being explored as an alternative to ReLU in deep networks, especially
in challenging domains like image classification and machine translation.
Advantages:
o Smooth function: Unlike ReLU, it doesn't have an abrupt change at x=0,
which can help with training stability.
o Allows small negative values: This can preserve information that might be
lost by ReLU, as small negative values might still contain relevant patterns.
o Non-monotonicity: Its non-monotonic behavior can enhance the expression
of input data and weight learning.
o Outperforms ReLU: Researchers have shown it often matches or
outperforms ReLU on deep networks.
Limitations:
o Slightly more computationally expensive than ReLU due to the sigmoid
calculation.
o Still relatively new compared to ReLU, so its long-term performance and
common pitfalls are still being researched and understood.
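A minimal NumPy sketch of Swish in the simple form f(x) = x · sigmoid(x); the function name is illustrative (some frameworks expose an equivalent function under the name SiLU).

import numpy as np

def swish(x):
    # The input gates itself: x scaled by sigmoid(x).
    return x / (1.0 + np.exp(-x))

print(swish(np.array([5.0, 0.0, -0.5, -100.0])))  # -> approximately [4.97, 0., -0.19, -0.]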