2018 4th International Conference on Frontiers of Signal Processing
Hand Gesture Feature Extraction Using Deep Convolutional Neural Network for Recognizing American Sign Language
Md Rashedul Islam, School of Computer Science and Engineering, University of Aizu, Fukushima, Japan (e-mail: rashed.cse@gmail.com)
Ummey Kulsum Mitu, Department of Computer Science and Engineering, University of Asia Pacific, Dhaka, Bangladesh (e-mail: ummey.kulsum@gmail.com)
Rasel Ahmed Bhuiyan, Department of Computer Science and Engineering, University of Asia Pacific, Dhaka, Bangladesh (e-mail: raselcse34@gmail.com)
Jungpil Shin, School of Computer Science and Engineering, University of Aizu, Fukushima, Japan (e-mail: jpshin@u-aizu.ac.jp)
Abstract—In this era, Human-Computer Interaction (HCI) is a fascinating field concerning the interaction between humans and computers. For interacting with computers, human Hand Gesture Recognition (HGR) is the most significant modality and a major part of HCI. Extracting features and detecting hand gestures from input color video is challenging because of the huge variation among hands. To resolve this issue, this paper introduces an effective HGR system for low-cost color video captured with a webcam. In the proposed model, a Deep Convolutional Neural Network (DCNN) is used to extract efficient hand features for recognizing American Sign Language (ASL) from hand gestures. Finally, a Multi-class Support Vector Machine (MCSVM) is used to identify the hand sign, where the CNN-extracted features are used to train the machine. Hand gestures of distinct persons are used for validation in this paper. The proposed model shows satisfactory performance in terms of classification accuracy, i.e., 94.57%.

Keywords—human-computer interaction (HCI); convolutional neural network (CNN); hand gesture recognition; sign language; multi-class support vector machine (MCSVM)

I. INTRODUCTION

Human-computer interaction, referred to as HCI, is an interacting interface between humans (users) and machines (computers). Through HCI, humans and computers interact with each other in novel ways. Nowadays, it is a fascinating research field focused on the design and use of computer technology and, most particularly, on the interfaces between humans and machines. HCI technology has expanded remarkably with the changes in technology [1].

Conventionally in HCI, the command line interface (CLI) uses the keyboard, and the graphical user interface (GUI) uses a keyboard and a mouse along with graphics to provide an interface for humans to interact with computers. On the basis of effective usability, new technologies introduce new user interfaces in HCI, such as Direct Neural Interfaces (DNI) [2]. Non-touch, gesture, and voice interfaces are becoming popular nowadays. The DNI is a new HCI technology for communicating with a machine by recognizing brain signals without any physical participation.

DNIs are often directed at assisting, augmenting, or repairing human cognitive functions. However, in every kind of application, this technology is difficult and expensive to embed. Therefore, such newly added technologies are introduced so as to adapt to real applications based on requirements and cost-effectiveness. Nevertheless, these newly introduced technologies have not yet reached a satisfactory level for users. To overcome the challenges, many researchers are working on improving these interfaces in terms of effectiveness, usability, and robustness [3].

An ideal interface should meet some common criteria, such as usability, accuracy, affordability, and scalability. Nowadays, human gesture has become a popular HCI interface, and its usage is increasing rapidly, as it meets all of these criteria.

HGR has many applications in different fields, such as computer games, virtual reality, and sign language recognition (SLR). Among them, SLR is the most used technique where vocal transmission is impossible. People with disabilities should have the capability to recognize signs generated by others. Therefore, many researchers have taken up the challenge of presenting a prototype for American Sign Language (ASL).

Several studies have been done on human sign recognition with a small number of symbols. However, sign recognition for the alphabet is more challenging. Many researchers have invented approaches related to human body and hand gestures to enhance the usage of technology. Kiliboz et al. introduced an effective approach for recognizing dynamic hand gestures on the basis of real-time HCI [4]. Modanwal et al. bridged the gap between machines and blind people by introducing gesture recognition [5].
Rempel et al. worked on understanding sign language using human hand gestures [6]. Denkowski et al. proposed a model to control residential and commercial building components using human gestures in a natural way [7]. Liang et al. used a hidden Markov Model (HMM) for recognizing sign language [8]. Because of the large number of alphabet gestures, those models show sub-optimal results for alphabet sign recognition.

From this point of view, this paper proposes an efficient feature extraction process using a Convolutional Neural Network (CNN). A CNN consists of one or more convolutional layers and fully connected layers, as in a standard multilayer neural network [9]. The CNN architecture is designed for handling 2D images efficiently [10], and a CNN has several dynamic parameters that make the machine easy to train [11]. Finally, a Multi-class Support Vector Machine (MCSVM) is used for recognizing gestures for sign language.

The rest of the paper is organized as follows. Section II describes the different parts of the proposed model. The experimental result is discussed in Section III. Finally, Section IV concludes this paper.

II. PROPOSED MODEL

Identifying the ASL alphabet from human hand gestures is the basic idea of our proposed model. The working procedure of the proposed model is shown in Figure 1.
Figure 1. Working procedure of the proposed model (training process: hand gesture image → background subtraction and preprocessing → feature extraction using CNN → MCSVM training → trained SVMs; evaluation process: unknown hand gesture → background subtraction and preprocessing → feature extraction using CNN → MCSVM classification).
A. Experimental Setup and Preprocessing of Gesture Images

In order to capture video frames from the webcam, the experimental setup is established at the very beginning. To discard the unwanted area of each video frame, a particular area is fixed as the Region of Interest (ROI). Figure 2 shows the Region of Interest.

Figure 2. The Region of Interest (ROI).

In the background subtraction process, a frame of the background is first captured without a human hand gesture. This captured frame is subtracted from the video frames of the hand gesture to obtain the hand sign image. Because of the background process and lighting effects, there is some noise in the gesture images. To reduce this noise, a median filter is applied, and the image is then converted into a grayscale image. Finally, human hand gesture images are taken for the 26 alphabet signs of ASL from three different persons. For each sign there are 120 images, so for each person there are 3,120 (26×120) images. Figure 3 shows the preprocessing of gesture images.

Figure 3. Preprocessing of gesture images (video frames from a webcam → set environment and ROI position → capture background image frame → capture frames of hand gesture and subtract the background to obtain the hand sign image → filter for noise reduction and convert to grayscale → store the hand sign image).
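As a concrete illustration of this preprocessing step, the following is a minimal Python/OpenCV sketch. The ROI coordinates, filter kernel size, and file name are illustrative assumptions, not values reported in the paper.

```python
import cv2

# Assumed ROI position (x, y, width, height); the paper fixes a similar region.
ROI = (100, 100, 200, 200)

def crop_roi(frame):
    x, y, w, h = ROI
    return frame[y:y + h, x:x + w]

def preprocess(gesture_frame, background_frame):
    """Background subtraction, median filtering, and grayscale conversion."""
    fg = crop_roi(gesture_frame)
    bg = crop_roi(background_frame)
    # Subtract the stored background frame from the hand-gesture frame.
    diff = cv2.absdiff(fg, bg)
    # Median filter to suppress noise from lighting and background effects.
    denoised = cv2.medianBlur(diff, 5)
    # Convert to a grayscale hand sign image.
    return cv2.cvtColor(denoised, cv2.COLOR_BGR2GRAY)

cap = cv2.VideoCapture(0)      # webcam
_, background = cap.read()     # background frame captured without the hand
_, gesture = cap.read()        # frame containing the hand gesture
cv2.imwrite("hand_sign.png", preprocess(gesture, background))
cap.release()
```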
B. Gesture Feature Extraction Using CNN

In the proposed feature extraction model, the feature vector is extracted from each video frame using a Deep Convolutional Neural Network (DCNN). All the extracted feature values of the images are stored in a file after extraction.

There are many effective machine learning algorithms for feature extraction. The Convolutional Neural Network is one of the best techniques in the field of deep learning. A CNN can be used on a large scale of diverse images, and for a wide range of images a CNN can extract potential features for the classification model.
Figure 4. The architecture of the proposed feature extraction using CNN.
There are several CNN architectures; "AlexNet" is one of the most popular networks because of its effectiveness. The AlexNet network has five convolutional layers and three fully-connected layers [9]. The input dimensions are defined in the first layer; in this paper, the input image size is 227-by-227-by-3. The bulk of the CNN is made up of the intermediate layers. Figure 4 shows the architecture of the Convolutional Neural Network used in the proposed model.
The CNN produces activations that are used as the input to each successive layer, so the convolution process runs layer by layer. However, for image feature extraction, only a few layers within a CNN are suitable. In the proposed model, the 'fc7' layer is used for feature extraction. Basic image features are captured by the layers at the beginning of the network; the deeper layers process these primitive features and combine them to form higher-level image features. These higher-level features are well suited to classification tasks, because the deeper layers combine the primitive features into a richer image representation [10].
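To make the fc7 feature extraction concrete, here is a minimal sketch using the pretrained AlexNet from torchvision. The paper does not state its implementation framework, so the truncation point, preprocessing constants, and file name below are assumptions.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Pretrained AlexNet; the classifier's second Linear layer corresponds to fc7.
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
alexnet.eval()

# Keep the classifier up to and including the ReLU after fc7 (indices 0..5),
# so the output is the 4096-dimensional fc7 activation.
fc7_extractor = torch.nn.Sequential(
    alexnet.features,
    alexnet.avgpool,
    torch.nn.Flatten(),
    alexnet.classifier[:6],
)

# Standard ImageNet normalization; 227x227x3 input as stated in the paper.
preprocess = transforms.Compose([
    transforms.Resize((227, 227)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

with torch.no_grad():
    img = Image.open("hand_sign.png").convert("RGB")
    feature = fc7_extractor(preprocess(img).unsqueeze(0))  # shape: (1, 4096)
```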
C. Gesture Classification Using SVM
Lastly, a non-linear MCSVM is used in the last stage of the proposed model for the classification of each alphabetic sign. The SVM is a widely used learning method for the classification of extracted features and for regression. The SVM is a binary classifier based on the supervised learning approach, classifying data into two classes by drawing a hyperplane [12].

The core working procedure of the SVM is to separate the input samples into two distinct classes using a hyperplane. Many datasets are not linearly separable, in which case a hyperplane alone cannot divide them into two classes. The kernel function resolves this problem of classifying non-linear datasets [13]. The Gaussian radial basis function, the polynomial function, and the hyperbolic tangent function are common non-linear kernel functions; among them, the Gaussian radial basis kernel is the most used. In this paper, the Gaussian radial basis kernel function is used, which can be expressed by equation (1):

$k(sv_i, sv_j) = \exp\left(-\frac{\lVert sv_i - sv_j \rVert^2}{2\delta^2}\right)$  (1)

where k is the kernel function of the two input feature vectors sv_i and sv_j, and δ denotes the width of the radial basis kernel function.

Generally, the SVM is a binary classifier. However, there are several multi-class forms of the SVM, such as one-against-one (OAO), one-against-all (OAA), and one-acyclic-graph (OAG) [14]. Among these methods, OAA is used in the proposed model because it is the least complicated multi-class non-linear classification method. The OAA-MCSVM contains a total of twenty-six SVMs working in parallel, as shown in Figure 5. Each SVM differentiates one class from all the others, and the final decision is taken by selecting the SVM with the largest output value.

Figure 5. An OAA-MCSVM structure for identifying alphabets.
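A minimal sketch of such an OAA MCSVM with the Gaussian RBF kernel, using scikit-learn, is shown below. The feature arrays X_train and y_train are assumed to hold the fc7 features and sign labels, and the gamma value (which corresponds to 1/(2δ²) in equation (1)) is an illustrative assumption, since the paper does not report δ.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Gaussian RBF kernel as in equation (1): k(a, b) = exp(-gamma * ||a - b||^2),
# with gamma = 1 / (2 * delta**2). delta is not reported, so gamma is assumed.
rbf_svm = SVC(kernel="rbf", gamma=1e-4)

# One-against-all MCSVM: one binary SVM per sign, twenty-six in total.
mcsvm = OneVsRestClassifier(rbf_svm)

# X_train: (n_train, 4096) fc7 feature matrix; y_train: sign labels 'A'..'Z'.
mcsvm.fit(X_train, y_train)

# Each column of the decision function is one SVM's output value; the
# predicted sign is the class whose SVM produces the largest output.
scores = mcsvm.decision_function(X_test)             # shape (n_test, 26)
predicted = np.asarray(mcsvm.classes_)[scores.argmax(axis=1)]
```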
III. EXPERIMENTAL RESULT AND EVALUATION

The proposed model is evaluated on a constructed dataset that contains the 26 signs performed by three different persons, with 120 images per sign for each person. Therefore, there are 9,360 (3×26×120) images in total. The whole dataset is split into two sets: the first contains 30% of the images for training, and the second contains the remaining 70% of the images for testing. Figure 6 shows the alphabet signs of ASL.
Figure 6. ASL representation of the alphabet.

The convolutional neural network is used for feature extraction. After feature extraction of the images using the CNN, we obtain a 2,808×4,096 feature matrix for training and a 6,552×4,096 feature matrix for testing. All of these features are informative, which helps to classify the sign categories of each person.

Using these informative training and testing features, the classification of each sign of each person is performed by the MCSVM. The classification accuracy obtained is a satisfactory 94.57%. The accuracies of the individual signs are shown in Table I, and the person-wise results of the proposed model are shown in Table II.
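The per-sign and person-wise accuracies of Tables I and II can be reproduced from the predictions with a little bookkeeping. The sketch below reuses the mcsvm estimator from the previous sketch; the array names X, y_sign, and y_person are assumed NumPy arrays, and the stratified split is an illustrative choice, as the paper does not describe how the 30/70 split was drawn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# X: (9360, 4096) fc7 features; y_sign: 26 sign labels; y_person: 'P1'..'P3'.
# 30% of the images for training and 70% for testing, as described above.
X_train, X_test, sign_train, sign_test, person_train, person_test = \
    train_test_split(X, y_sign, y_person, train_size=0.3,
                     random_state=0, stratify=y_sign)

mcsvm.fit(X_train, sign_train)
pred = mcsvm.predict(X_test)
correct = pred == sign_test

print(f"Average accuracy: {correct.mean():.2%}")
for sign in np.unique(sign_test):        # Table I: sign-wise accuracy
    mask = sign_test == sign
    print(f"{sign}: {correct[mask].mean():.2%}")
for person in np.unique(person_test):    # Table II: person-wise accuracy
    mask = person_test == person
    print(f"{person}: {correct[mask].mean():.2%}")
```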
TABLE I. SIGN-WISE CLASSIFICATION ACCURACY

Sign   Recognition Accuracy
A      88.49%
B      100%
C      91.67%
D      99.60%
E      81.35%
F      98.02%
G      99.21%
H      98.81%
I      99.60%
J      95.63%
K      92.46%
L      100%
M      87.70%
N      86.90%
O      82.94%
P      92.46%
Q      99.60%
R      96.03%
S      98.81%
T      94.44%
U      94.05%
V      96.83%
W      97.62%
X      96.43%
Y      98.02%
Z      92.06%
Average accuracy (all signs): 94.57%

TABLE II. PERSON-WISE CLASSIFICATION ACCURACY

Person   Recognition Accuracy
P1       93.19%
P2       95.29%
P3       95.24%
Average accuracy (all persons): 94.57%
IV. CONCLUSION

A significant challenge in real-life applications is hand gesture recognition, in terms of the accuracy and robustness associated with it. Non-touch hand gesture recognition of ASL is presented in this paper, where the input gestures are collected using a webcam. At the very beginning, a still hand image frame is captured from a running video, and a DCNN is applied in order to find more informative features. Finally, the alphabet sign is identified using the MCSVM. For validation of the proposed model, our constructed dataset built according to ASL conventions is used. The classification accuracy obtained is 94.57%, which is significant for introducing SLR of ASL for people with disabilities as an HCI application.
REFERENCES

[1] A. Dix, "Human-computer interaction," in Encyclopedia of Database Systems, Springer US, pp. 1327–1331, 2016.
[2] K. Nandakumar and J. L. Funk, "Understanding the timing of economic feasibility: The case of input interfaces for human-computer interaction," Technology in Society, vol. 43, pp. 33–49, Nov. 2015.
[3] B. Laurel and S. J. Mountford, The Art of Human-Computer Interface Design. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1990.
[4] N. C. Kiliboz and U. Gudukbay, "A hand gesture recognition technique for human computer interaction," Journal of Visual Communication and Image Representation, vol. 28, pp. 97–104, Apr. 2015.
[5] G. Modanwal and K. Sarawadekar, "Towards hand gesture based writing support system for blinds," Pattern Recognition, vol. 57, pp. 50–60, Sep. 2016.
[6] D. Rempel, M. J. Camilleri, and D. L. Lee, "The design of hand gestures for human-computer interaction: Lessons from sign language interpreters," International Journal of Human-Computer Studies, vol. 72, no. 10-11, pp. 728–735, Oct.-Nov. 2014.
[7] M. Denkowski, K. Dmitruk, and L. Sadkowski, "Building automation control system driven by gestures," in Proceedings of the 13th IFAC and IEEE Conference on Programmable Devices and Embedded Systems, vol. 48, no. 4, pp. 246–251, 2015.
[8] R.-H. Liang and M. Ouhyoung, "A sign language recognition system using hidden Markov model and context sensitive search," in Proceedings of the ACM Symposium on Virtual Reality Software and Technology, Hong Kong, pp. 59–66, 1996.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 25th International Conference on Neural Information Processing Systems, pp. 1097–1105, Dec. 2012.
[10] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[11] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," in Proceedings of the 31st International Conference on Machine Learning, vol. 32, no. 1, pp. 647–655, Jun. 2014.
[12] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition, vol. 3, pp. 32–36, 2004.
[13] M. R. Islam, J. Uddin, and J.-M. Kim, "Acoustic emission sensor network based fault diagnosis of induction motors using a Gabor filter and multiclass support vector machines," Ad Hoc & Sensor Wireless Networks, vol. 34, pp. 273–287, Dec. 2016.
[14] J. Manikandan and B. Venkataramani, "Evaluation of multiclass support vector machine classifiers using optimum threshold-based pruning technique," IET Signal Processing, vol. 5, no. 5, pp. 506–513, 2011.