GCapsNet: Multi-Feature Aware Pose and Geometry
based Facial Expression Recognition using Deep
Learning
1. Abstract
Facial expressions are a primary way of conveying intentions and emotions, and
computers can infer human emotions by analyzing them. Facial Expression
Recognition (FER) therefore plays an important role in human-computer interaction
and in the medical field [1]. In the past, facial features were extracted manually to
recognize expressions; today, FER is an active research topic in computer vision, the
Internet of Things, and artificial intelligence. Recognizing facial expressions
efficiently involves the following processes:
Preprocessing
Segmentation
Feature Extraction
Classification
Historically, features were extracted with manually designed descriptors such as
Local Binary Patterns (LBP), Gabor wavelets, and the Histogram of Oriented
Gradients (HOG) [2], [3]. Recognizing expressions accurately from facial images
involves several challenges. Robust classification requires accounting for the
illumination and pose of the facial image, and learning pose and facial identity is
essential for accurate results; several existing works struggled with identity, pose
variation, and inter-subject variation. Existing methods estimated facial pose with
hand-crafted features and performed pose normalization by considering the facial
angle [3]; previous works also applied pixel-based normalization to increase
recognition accuracy [2]. In addition, general illumination effects such as contrast
changes and occlusion degrade accuracy.
Segmentation is one of the major processes: it partitions the facial image so that
features can be extracted. Existing methods segment facial images with techniques
such as bounding-box-based, region-based, and cluster-based segmentation, using
algorithms such as the Discrete Cosine Transform (DCT) and K-means clustering.
After segmentation, features are extracted with Machine Learning (ML) algorithms
such as Support Vector Machines (SVM), Naive Bayes, and K-Nearest Neighbors
(KNN), or with Deep Learning (DL) algorithms such as Convolutional Neural
Networks (CNN), Generative Adversarial Networks (GAN), and Long Short-Term
Memory networks (LSTM) [2], [3], [4]. Classification is then performed on the
extracted features to identify the facial expression (happy, sad, anger, etc.). Previous
works employ classifiers such as VGG-16, VGG-19, and ResNet [2], [4], which take
the extracted features as input and classify the expressions. Various datasets are used
for training and testing, most commonly FER-2013, CK+, and JAFFE [1], [2], [3].
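For illustration, a minimal sketch of two of these hand-crafted descriptors using
scikit-image is shown below. The parameter values (8 neighbours at radius 1 for
LBP, 9 orientation bins for HOG) are common defaults, not settings taken from the
cited works.

```python
import numpy as np
from skimage.feature import local_binary_pattern, hog

def handcrafted_descriptors(gray):
    """gray: 2-D grayscale face image (H, W), uint8."""
    # LBP: threshold each pixel's 8-neighbourhood at radius 1 and summarise
    # the resulting codes as a normalised histogram of uniform patterns.
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    # HOG: histograms of gradient orientations over 8x8-pixel cells.
    hog_vec = hog(gray, orientations=9, pixels_per_cell=(8, 8),
                  cells_per_block=(2, 2))
    return np.concatenate([lbp_hist, hog_vec])

feats = handcrafted_descriptors((np.random.rand(64, 64) * 255).astype(np.uint8))
```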
2. Research Aim and Objective
The aim of this research is to identify facial expressions from facial images using
deep learning techniques. In doing so, this research addresses problems such as
inadequate feature consideration, a high false-positive rate, and low accuracy.
The main objective of this research is to recognize facial expressions from facial
images with a low false-positive rate and high accuracy. The remaining research
objectives are as follows:
To increase facial image quality, preprocessing begins with normalization that
estimates the pose and angle of the facial image and reduces illumination
effects, thereby improving the accuracy of facial expression recognition.
To extract features efficiently, cluster-based segmentation is performed to
locate facial objects such as the eyebrows, eyes, nose, and mouth, which
reduces the loss of subtle expressions.
Multiple features are extracted in the feature extraction process, considering
both low-level and high-level features of the facial images, to increase the
accuracy of facial expression recognition.
To increase image quality and identify the pose of the facial images, we
perform bi-level preprocessing, normalizing the images with respect to both
illumination and pose. For illumination normalization, a grayscale algorithm
reduces occlusion, contrast, and similar effects; for pose normalization, a
geometry-based polar transformation algorithm estimates the pose to obtain
accurate results.
To reduce the false-positive rate, cluster-based segmentation is performed with
an Improved Fuzzy C-means clustering algorithm, which detects the facial
landmarks precisely.
To improve recognition, multiple features (both high-level and low-level) are
considered. Feature extraction and classification are performed by a
Graph-based Capsule Neural Network (GCapsNet) deep learning algorithm,
which reduces redundancy and increases FER accuracy.
3. Introduction
In this research, we focus on recognizing facial expressions with a hybrid deep
learning method. Accuracy is improved by performing preprocessing, segmentation,
landmark detection, and feature extraction on the facial images. The main aim of this
research is to develop highly accurate expression recognition based on effective
segmentation and classification. For facial expression recognition, we use the
Extended Cohn-Kanade (CK+) and Japanese Female Facial Expression (JAFFE)
datasets. The proposed approach has three sequential phases:
Bi-level preprocessing
Clustering-based Segmentation and Facial Landmark Detection
Multi-Feature Extraction and Facial Expression Classification
3.1 Bi-level Preprocessing
We take the facial image as input from the given dataset, i.e., the CK+ dataset.
Initially, bi-level preprocessing is applied to the facial image: illumination
normalization followed by pose normalization. In illumination normalization,
occlusion, blurring, contrast, and shadowing effects are reduced by converting the
facial image to a grayscale image. After illumination normalization, the facial angle
and geometry are normalized through pose normalization using a geometry-based
polar transformation, which removes the background from the facial image and
yields a frontal facial view for better expression recognition. The face is rotated
within the image to reach a frontal pose via the polar transformation, while the
geometry transformation handles the position of the face, which is equally important
for recognizing expressions accurately.
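A minimal sketch of this bi-level preprocessing with OpenCV is given below. Since
the geometry-based polar transformation is not spelled out here, the pose step is
approximated by a standard eye-based rotation that levels the inter-ocular line; the
eye coordinates are assumed to come from a landmark detector.

```python
import cv2
import numpy as np

def bilevel_preprocess(img_bgr, left_eye, right_eye, size=128):
    """Sketch of bi-level preprocessing. left_eye/right_eye are assumed
    (x, y) pixel coordinates obtained from a landmark detector."""
    # Level 1 -- illumination normalization: grayscale conversion suppresses
    # colour-dependent lighting effects such as contrast and shadowing.
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)

    # Level 2 -- pose normalization: rotate the face so that the eye line is
    # horizontal, approximating a frontal pose from the facial geometry.
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = np.degrees(np.arctan2(dy, dx))
    center = ((left_eye[0] + right_eye[0]) / 2.0,
              (left_eye[1] + right_eye[1]) / 2.0)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    aligned = cv2.warpAffine(gray, M, (gray.shape[1], gray.shape[0]))
    return cv2.resize(aligned, (size, size))
```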
3.2 Clustering-based Segmentation and Facial Landmark Detection
After bi-level preprocessing, segmentation and facial landmark detection are
performed. Segmentation of the facial image locates facial objects such as the
eyebrows, eyes, nose, mouth, and lips using an Improved Fuzzy C-means clustering
algorithm. The traditional fuzzy C-means algorithm partitions the image into
segments but cannot determine good cluster centers on its own. To overcome this
drawback, the improved algorithm initializes the cluster centers with the Firefly
algorithm and then runs fuzzy C-means clustering. After segmentation, facial
landmarks are detected: 44 feature points are used for better expression recognition,
comprising 10 points for the eyebrows (5 per eyebrow), 12 for the eyes (6 per eye), 9
for the nose, and 13 for the mouth (7 of them for the lips).
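The clustering step can be sketched as a plain-NumPy fuzzy C-means loop in which
the initial centers are taken as given, standing in for the Firefly-optimized
initialization described above. The membership exponent m = 2 is the conventional
choice, not a value prescribed by this work.

```python
import numpy as np

def fuzzy_cmeans(pixels, centers, m=2.0, iters=50, eps=1e-9):
    """pixels: (N, d) feature vectors; centers: (C, d) initial cluster
    centers (here assumed to come from the Firefly algorithm)."""
    for _ in range(iters):
        # Distance of every pixel to every center: (N, C).
        d = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=-1) + eps
        # Membership update: u_ic = 1 / sum_k (d_ic / d_ik)^(2/(m-1)).
        u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)
        # Center update: membership-weighted mean of the pixels.
        w = u ** m
        centers = (w.T @ pixels) / w.sum(axis=0)[:, None]
    return u, centers

# Example: segment a grayscale face into three intensity clusters.
img = np.random.rand(64, 64)
u, c = fuzzy_cmeans(img.reshape(-1, 1), centers=np.array([[0.2], [0.5], [0.8]]))
labels = u.argmax(axis=1).reshape(img.shape)  # hard segmentation map
```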
3.3 Multi-Feature Extraction and Facial Expression Classification
Various features are extracted from the landmark-detected image and grouped into
two levels: high-level features and low-level features. The high-level features are
derived from the facial objects (eyebrows, eyes, nose, lips, and mouth) and comprise
eyebrow slant, eye size, eye spacing, pupil size, nose length, nose width, nose
wrinkle, mouth openness, mouth width, mouth curvature, tight lips, and lip droop.
The low-level features are shape, texture, and color. These features are extracted and
classified with a Graph-based Capsule Neural Network (GCapsNet), a deep learning
algorithm. A traditional capsule neural network has high complexity on large
datasets; to overcome this drawback, we integrate the capsule network with a graph
neural network. The graph neural network manages large datasets, while the capsule
network performs classification and selects the optimal features, reducing overfitting
and increasing accuracy. Seven facial expressions are classified from the extracted
features: neutral, happy, sad, fear, disgust, surprise, and anger. The proposed work is
evaluated with the following performance metrics (a sketch of computing some of the
high-level geometric features appears after this list):
Accuracy
Precision
Recall
Confusion matrix
Facial landmark detection error
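To make the high-level geometric features concrete, the sketch below computes a
few of them from the 44 detected landmarks. The index layout (eyebrows 0-9, eyes
10-21, nose 22-30, mouth 31-43) and the particular distance formulas are illustrative
assumptions, not this paper's exact definitions.

```python
import numpy as np

def high_level_features(lm):
    """lm: (44, 2) array of (x, y) landmark coordinates. Assumed layout:
    0-9 eyebrows, 10-21 eyes, 22-30 nose, 31-43 mouth."""
    left_eye, right_eye = lm[10:16], lm[16:22]
    nose, mouth = lm[22:31], lm[31:44]
    eye_spacing = np.linalg.norm(left_eye.mean(axis=0) - right_eye.mean(axis=0))
    eye_size = np.ptp(left_eye[:, 1]) + np.ptp(right_eye[:, 1])  # vertical extent
    nose_length = np.ptp(nose[:, 1])
    nose_width = np.ptp(nose[:, 0])
    mouth_width = np.ptp(mouth[:, 0])
    mouth_openness = np.ptp(mouth[:, 1])
    return np.array([eye_spacing, eye_size, nose_length,
                     nose_width, mouth_width, mouth_openness])

feats = high_level_features(np.random.rand(44, 2))  # dummy landmarks
```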
4. Literature Survey
[1] The authors proposed an approach to recognize facial expressions from static
facial images using hybrid deep learning networks. First, expressional features were
extracted from the facial image with a spatial attention convolutional neural network
(SACNN), in which VGG-19 serves as the spatial attention module for extracting
pixel-based features. Landmark detection was then performed from the facial
geometry with an attention mechanism based on long short-term memory networks
(ALSTM), which also estimates the importance of the different landmark regions.
The FER-2013, CK+, and JAFFE datasets were used for the experiments. The
SACNN extracts pixel-based expressional features, but it needs a large number of
training images to extract features effectively, which increases the complexity of
expression classification. A batch normalization layer is included in the SACNN to
reduce internal covariate shift, but it needs a large batch size to normalize the
convolution-layer outputs well, and neglecting the batch size leads to a high false-
positive rate. The ALSTM discovers the facial landmarks from the extracted features
and adaptively evaluates each landmark region's importance; however, overfitting
during landmark exploration is not addressed, which decreases the accuracy of
expression classification.
[2] In this paper, the authors proposed an approach for recognizing facial expressions
with a deep convolutional neural network (DCNN) and a local gravitational force
descriptor. First, local features are extracted from the facial image with the
descriptor. The DCNN, divided into two major branches, then handles classification:
the first branch extracts geometric features such as the curves, edges, and lines
present in the facial image, while the second extracts holistic features and performs
classification. The final classification score is computed by score-level fusion. The
FER-2013, CK+, and JAFFE datasets were used for the experiments. The DCNN
captures geometric features such as curves, lines, and edges, but these alone are not
enough to classify expressions accurately, which leads to a high false-positive rate.
The DCNN performs both feature extraction and classification, yet it struggles to
classify expressions when the facial image is rotated or tilted, which reduces
accuracy. Moreover, many layers of the three layer types are used; in particular there
are five max-pooling layers, and so many max-pooling layers increase the probability
of losing useful features, again raising the false-positive rate. Preprocessing aligns
the angle and pose of the facial image, but illumination parameters are not
considered, so occlusion, contrast, and similar effects decrease the accuracy of
expression classification.
[3] The authors proposed a facial expression recognition (FER) approach using
hybrid deep learning algorithms with pose-aware face alignment. A pose-guided face
alignment method reduces the intra-class differences in the facial images through
three basic steps: target pose discovery, template generation, and target matching.
Angular symmetry eliminates redundant features, and an efficient template is
selected by clustering with the K-means algorithm. Hybrid deep learning algorithms,
CNN and RNN, extract the facial features, while VGG-16 and ResNet classify the
expressions. The Oulu-CASIA, CK+, AR, and JAFFE datasets were used for the
experiments. The outputs of the pose-guided face alignment are clustered with
K-means; however, K-means partitions the data into a fixed number k of clusters, and
retrieving the extracted features for classification across all k clusters is costly, which
decreases performance. The RNN also plays an important role in feature extraction,
but its training process is involved, which increases the training complexity of
expression classification. ResNet classifies the expressions from the extracted
features, but its long running time burdens the classification with high time
complexity. Finally, the features are extracted from the clustering output, yet
clustering the facial images alone is not enough to extract the features accurately,
which leads to a high false-positive rate.
[4] The authors proposed an approach to recognize facial expressions using an
ensemble rule with a deep learning algorithm. Face expression recognition
algorithms are divided into two types: feature-based and convolutional-neural-
network-based. First, facial landmarks are extracted from the input image, and a
frontalization algorithm manages the pose by rotating the face and adjusting the
brightness of the image. A shortcut CNN extracts the features from the facial image,
and an adaptive exponentially weighted average ensemble rule performs the FER
classification. The FER-2013, JAFFE, and CK+ datasets were used for the
experiments. The landmarks are detected with 68 feature points to perform facial
frontalization, but FER needs only a few of them; for instance, the jaw landmarks are
not important for classifying expressions accurately, and using all facial landmarks
for frontalization increases classification complexity and decreases accuracy. The
frontalization algorithm accounts for brightness, rotation, and pose, but neglecting
geometric features increases the false-positive rate. The CNN extracts features from
the frontalized facial image; however, it does not consider the coordinate frames of
the facial images, which lowers recognition accuracy.
[5] The authors proposed an approach to detect facial landmarks via semantic
segmentation. A semantic segmentation architecture encodes the facial images,
feature maps are extracted from the encoder, and the decoder reconstructs them from
the corresponding feature maps. VGG-16 then extracts facial landmarks from the
feature maps along with the facial features. For classification, the feature maps are
passed to a softmax layer and classified according to their weights. The authors built
their own dataset for the experimental analysis and used VGG-16 to train on the
facial images, classifying the feature maps by weight to create the facial landmarks.
However, VGG-16 takes a long time to train and requires substantial bandwidth,
which hurts performance.
[6] The authors proposed an approach to identify facial expressions using deep
convolutional neural networks. Edge computing is employed to protect the privacy of
the facial images from the cloud. The generative adversarial network (GAN) is
modified by adding a circular consensus (cycle-consistency) constraint to create a
CycleGAN that trains on the facial images effectively, and class-constraint
information is added to the CycleGAN to improve the style-conversion process. The
standard GAN classifier, the discriminator, is extended with an auxiliary expression
classifier to classify the expressions efficiently. The JAFFE, CK+, and FER-2013
datasets were used to measure the recognition rate. However, the GAN cannot predict
the state of the facial image, which lowers accuracy.
[7] The authors proposed an approach to recognize facial expressions while
accounting for the pose and identity invariance of the facial image using a dynamic
multi-channel metric network (DML-Net). DML-Net learns local and global features
from various facial regions through parallel convolutional networks, and joint
embedded feature learning discovers the pose- and identity-invariant aspects of the
expressions. End-to-end training on the facial images yields expression recognition
with low FER loss and limited overfitting. The BU-3DFE, Multi-PIE, SFEW 2.0, and
KDEF datasets were used for the evaluation. DML-Net handles both feature
extraction and expression classification; however, it is not well suited to classifying
expressions in unconstrained environments, i.e., across varied datasets.
[8] The authors proposed the CapsField technique for identifying facial expressions,
a combination of a convolutional neural network (CNN) and a capsule neural
network. The CNN extracts features from the pre-processed array of facial images,
and the capsule network routes the facial features to select them hierarchically. This
reduces feature redundancy and simplifies the classification process, yielding
effective expression classification. The authors introduced the light field faces in the
wild (LFFW) dataset for the experimental analysis. The CNN works with the capsule
network to extract features and classify expressions; however, the CNN does not
consider the alignment and position of the facial images, which lowers accuracy.
[9] The authors proposed an approach to recognize facial expressions using a
frequency neural network (FreNet). The facial images are pre-processed with rotation
correction (ensuring the eyes lie on a horizontal line) and resized using the discrete
cosine transform (DCT). A Block-FreNet is introduced for dimensionality reduction
and effective feature learning: low-level features are extracted with a learnable
multiplication kernel, and high-level features are then derived by a summarization
layer. An ANN-based classification layer classifies the expressions from the
extracted features. The FER-2013 dataset was used for the experimental analysis.
The DCT-based preprocessing handles resizing, rotation correction, and so on, but
neglecting geometric features leads to a high false-positive rate.
[10] The authors proposed a geometry-aware, pose-invariant approach to identifying
facial expressions using a deep learning mechanism, the generative adversarial
network (GAN). The poses of the facial images were characterized by their angles,
and facial landmarks were generated according to the facial expressions. The identity
representation is disentangled using the facial shape geometry provided by the
landmarks. The GAN extracts the features and performs the classification that detects
the expressions. The BU-3DFE, SFEW, and Multi-PIE datasets were used for the
experimental analysis. However, the GAN trains a generator and a discriminator
simultaneously, which increases the training complexity.
[11] The authors proposed an approach to recognize facial expressions by jointly
handling the alignment and synthesis of facial images with a unified deep learning
model. Facial alignment is learned to generate a geometry code for face synthesis,
and the synthesis considers three types of loss: discriminator loss, content similarity
loss, and perceptual loss. The expression and geometry codes are extracted while
accounting for pose invariance and the corresponding facial landmarks, and the
expression is then classified with a softmax layer. The datasets used are BU-3DFE,
SFEW, and Multi-PIE. The facial landmarks are normalized by the inter-ocular
distance; however, omitting several illumination parameters, such as shadowing and
contrast, lowers accuracy.
[12] The authors proposed an approach to detect facial landmarks using a heatmap-
offset regression technique. The regression network has two stages: a structural
hourglass network (SHN) and a global constraint network (GCN). Preprocessing is
performed so that features can be extracted accurately. The SHN discovers the initial
facial landmarks from the heatmap, using an improved Inception-ResNet to learn
contextual feature representations; the GCN then performs offset estimation for
precise landmark localization. A dedicated loss function refines the landmarks
further, and the outputs of the SHN and GCN are combined for the final prediction.
The datasets used are 300W, AFLW, 300-VW, and COFW. Heatmap-offset
regression predicts the landmarks precisely; however, the heatmap does not adapt to
the color of the images, which lowers accuracy.
5. Experiment and Discussion
5.1 Dataset
The CK+ dataset is used for facial expression recognition with deep learning. The
Cohn-Kanade (CK+) dataset contains 593 image sequences from subjects aged 18 to 50
of both genders, divided into seven expression classes: anger, contempt, disgust, fear,
happiness, sadness, and surprise. A facial image is taken as input from the CK+ dataset,
and the selected image is loaded successfully, as shown in Figure 1.
Figure 1. Loading of the input facial image from the dataset
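As a minimal sketch, loading one CK+ image with OpenCV might look like the
following. The directory layout and file name are hypothetical, since they depend on
how the local copy of the dataset is organized.

```python
import cv2

# Hypothetical path into a local copy of CK+ (subject/session/frame layout).
path = "ck_plus/S005/001/S005_001_00000011.png"
img = cv2.imread(path)  # returns None if the file is missing
assert img is not None, f"could not load {path}"
print("loaded image with shape", img.shape)  # e.g. (490, 640, 3)
```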
5.2 Bi-level Preprocessing
In the bi-level preprocessing, the grayscale algorithm performs illumination
normalization, converting the facial image to a grayscale image. After illumination
normalization, the facial angle and geometry are normalized through pose normalization,
using the geometry-based polar transformation to remove the background from the facial image.
Figure 2. Bi-level Preprocessing
5.3 Segmentation
After the bi-level preprocessing, clustering-based segmentation and facial landmark
detection are performed. In this step, segmentation of the facial image locates facial
objects such as the eyebrows, eyes, nose, mouth, and lips using the Improved Fuzzy
C-means clustering algorithm, with the cluster centers initialized by the Firefly algorithm.
Figure 3. Facial Landmark Result
5.4 Feature Extraction
Multi-feature extraction and facial expression classification: various features are extracted
from the landmark-detected image and grouped into two levels, high-level and low-level.
The high-level features are derived from the facial objects (eyebrows, eyes, nose, lips, and
mouth) and comprise eyebrow slant, eye size, eye spacing, pupil size, nose length, nose
width, nose wrinkle, mouth openness, mouth width, mouth curvature, tight lips, and lip
droop. The low-level features are shape, texture, and color.
Figure 4. High level and Low level Feature Extraction
5.5 Classification
The deep learning network analyzer output for the classification network is shown in
Figure 5:
Figure 5. Deep Learning Network Analyzer
Figure 6. Deep Learning Result
The classification result, "Sad", is shown in Figure 7:
Figure 7. Classification Result
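The capsule half of GCapsNet can be illustrated with a routing-by-agreement layer in
the style of Sabour et al.'s dynamic routing. The PyTorch sketch below is a generic
capsule layer under assumed sizes (44 input capsules of dimension 8, one
16-dimensional output capsule per expression class); it is not the exact GCapsNet
architecture, and the graph-network component is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    # Capsule non-linearity: keeps direction, maps length into [0, 1).
    sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq / (1.0 + sq)) * s / torch.sqrt(sq + eps)

class CapsuleLayer(nn.Module):
    """One capsule layer with dynamic routing-by-agreement."""
    def __init__(self, in_caps, in_dim, out_caps, out_dim, iters=3):
        super().__init__()
        self.iters = iters
        # W[j, i] maps input capsule i to its prediction for output capsule j.
        self.W = nn.Parameter(0.01 * torch.randn(out_caps, in_caps, out_dim, in_dim))

    def forward(self, u):                                      # u: (B, in_caps, in_dim)
        u_hat = (self.W @ u[:, None, :, :, None]).squeeze(-1)  # (B, out, in, out_dim)
        b = torch.zeros(u_hat.shape[:3], device=u.device)      # routing logits
        for _ in range(self.iters):
            c = F.softmax(b, dim=1)                            # coupling coefficients
            v = squash((c.unsqueeze(-1) * u_hat).sum(dim=2))   # (B, out, out_dim)
            b = b + (u_hat * v.unsqueeze(2)).sum(dim=-1)       # agreement update
        return v

# Seven expression capsules; the class score is each output capsule's length.
caps = CapsuleLayer(in_caps=44, in_dim=8, out_caps=7, out_dim=16)
scores = caps(torch.randn(2, 44, 8)).norm(dim=-1)              # (2, 7)
```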
5.6 Performance Metrics
Finally, we evaluate the following performance metrics: accuracy, precision, recall,
facial landmark detection error, and the confusion matrix.
Figure 8 shows the accuracy and recall metrics, while Figure 9 shows the precision and
the facial landmark detection error.
Figure 8. Accuracy, Recall
Figure 9. Precision, Facial Landmark Detection Error
Figure 10. Confusion matrix
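A sketch of computing these metrics with scikit-learn follows; y_true and y_pred
stand for the ground-truth and predicted labels over the seven classes. The
landmark-error formula shown (mean point-to-point distance normalized by the
inter-ocular distance) is a common convention, assumed here rather than taken from
this paper.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Dummy labels over the seven expression classes (0..6) for illustration.
y_true = np.random.randint(0, 7, size=200)
y_pred = np.random.randint(0, 7, size=200)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("recall   :", recall_score(y_true, y_pred, average="macro", zero_division=0))
print(confusion_matrix(y_true, y_pred))

def landmark_error(pred, gt, left_eye_idx, right_eye_idx):
    """Mean point-to-point error normalized by the inter-ocular distance.
    pred/gt: (44, 2) landmark arrays; the eye-center indices are assumed."""
    iod = np.linalg.norm(gt[left_eye_idx] - gt[right_eye_idx])
    return np.linalg.norm(pred - gt, axis=1).mean() / iod
```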
References
[1] Liu, C., Hirota, K., Ma, J., Jia, Z., & Dai, Y. (2021). Facial Expression
Recognition Using Hybrid Features of Pixel and Geometry. IEEE Access, 9,
18876-18889.
[2] Mohan, K., Seal, A., Krejcar, O., & Yazidi, A. (2021). Facial Expression
Recognition Using Local Gravitational Force Descriptor-Based Deep
Convolution Neural Networks. IEEE Transactions on Instrumentation and
Measurement, 70, 1-12.
[3] Liu, J., Feng, Y., & Wang, H. (2021). Facial Expression Recognition Using
Pose-Guided Face Alignment and Discriminative Features Based on Deep
Learning. IEEE Access, 9, 69267-69277.
[4] Tsai, K., Tsai, Y., Lee, Y., Ding, J., & Chang, R.Y. (2021). Frontalization and
adaptive exponential ensemble rule for deep-learning-based facial expression
recognition system. Signal Processing: Image Communication, 96, 116321.
[5] Kim, H., Kim, H., Rew, J., & Hwang, E. (2020). FLSNet: Robust Facial
Landmark Semantic Segmentation. IEEE Access, 8, 116163-116175.
[6] Chen, A., Xing, H., & Wang, F. (2020). A Facial Expression Recognition
Method Using Deep Convolutional Neural Networks Based on Edge
Computing. IEEE Access, 8, 49741-49751.
[7] Liu, Y., Dai, W., Fang, F., Chen, Y., Huang, R., Wang, R., & Wan, B. (2021).
Dynamic multi-channel metric network for joint pose-aware and identity-
invariant facial expression recognition. Information Sciences, 578, 195-213.
[8] Sepas-Moghaddam, A., Etemad, A., Pereira, F., & Correia, P.L. (2021).
CapsField: Light Field-Based Face and Expression Recognition in the Wild
Using Capsule Routing. IEEE Transactions on Image Processing, 30, 2627-
2642.
[9] Tang, Y., Zhang, X., Hu, X., Wang, S., & Wang, H. (2021). Facial Expression
Recognition Using Frequency Neural Network. IEEE Transactions on Image
Processing, 30, 444-457.
[10] Zhang, F., Zhang, T., Mao, Q., & Xu, C. (2020). Geometry Guided Pose-
Invariant Facial Expression Recognition. IEEE Transactions on Image
Processing, 29, 4445-4460.
[11] Zhang, F., Zhang, T., Mao, Q., & Xu, C. (2020). A Unified Deep Model for
Joint Facial Expression Recognition, Face Synthesis, and Face Alignment. IEEE
Transactions on Image Processing, 29, 6574-6589.
[12] Zhang, J., Hu, H., & Feng, S. (2020). Robust Facial Landmark Detection via
Heatmap-Offset Regression. IEEE Transactions on Image Processing, 29, 5050-
5064.