Data Science: Theory, Analysis and Applications
Edited by
Qurban A. Memon and
Shakeel Ahmed Khoja
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors
and publishers have attempted to trace the copyright holders of all material reproduced in this
publication and apologize to copyright holders if permission to publish in this form has not been
obtained. If any copyright material has not been acknowledged please write and let us know so we
may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.
copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC),
222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that
provides licenses and registration for a variety of users. For organizations that have been granted a
photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Names: Memon, Qurban A. (Qurban Ali), editor. | Khoja, Shakeel Ahmed, editor.
Title: Data science : theory, analysis, and applications / edited by Qurban
A Memon, Shakeel Ahmed Khoja.
Description: Boca Raton : CRC Press, [2020] | Includes bibliographical
references and index. |
Identifiers: LCCN 2019029260 (print) | LCCN 2019029261 (ebook) |
ISBN 9780367208615 (hardback) | ISBN 9780429263798 (ebook)
Subjects: LCSH: Data mining—Statistical methods | Big data—Statistical
methods | Quantitative research.
Classification: LCC QA76.9.D343 D3944 2020 (print) | LCC QA76.9.D343
(ebook) | DDC 006.3/12—dc23
LC record available at https://lccn.loc.gov/2019029260
LC ebook record available at https://lccn.loc.gov/2019029261
THE STRUCTURE
The book is a collection of fourteen chapters written by scholars and experts in
this field, organized into three parts. Each part consists of a set of chapters
addressing the respective subject outline, to give readers an in-depth and focused
understanding of the concepts and technology related to that part of the book. Some
chapters in each part are written in a tutorial style, covering the development
process of data science and its emerging applications.
The book is structured as follows:
Part I comprises five chapters on data science theory, concepts, techniques, and
algorithms.
The first chapter extends the earlier work on Cassandra integrated with Hadoop
to a system called GeoMongoSpark and investigates the storage and retrieval of
geospatial data using various sharding techniques. Hashed indexing is used to
improve processing performance with less memory.
The purpose of Chapter 2 is to study the different evolutionary algorithms for
optimizing neural networks in different ways for image segmentation purposes.
Chapter 3 introduces a new adaptive algorithm called the Feature Selection Penguin
Search Optimization algorithm, a metaheuristic feature subset selection method.
It is adapted from the natural hunting strategy of penguins (the basis of the
Penguin Search Optimization Algorithm), in which a group of penguins dive to
random depths, come back, and share the status of food availability with the other
penguins; in this way, the global optimum solution is found. It is combined with
different classifiers to find an optimal feature subset.
Currently, graph technology is becoming increasingly important, and graphs are
used to model dynamic and complex relationships in data in order to generate
knowledge. In particular, Neo4j is a database management system that currently
leads the NoSQL graph database segment. The main objective of Chapter 4 is to
propose physical design guidelines that improve query execution time on graph
databases for a specific workload in Neo4j. In this work, indexes, path
materialization, and query rewriting are considered as guidelines for the physical
design of Neo4j databases.
Chapter 5 provides readers with information about the latest techniques in
large-scale data collection schemes. Continuous sensor data with different
intervals (cycles) is defined as a sensor data stream, and collection methods for
distributed sensor data streams based on a topic-based pub/sub (TBPS) system are
proposed.
Part II comprises five chapters on data design and analysis.
The objective of Chapter 6 in this part is to explain and impart solutions for the
effective analysis and management of big data in healthcare, with four main parts:
(i) collection of healthcare data; (ii) analysis of healthcare data; (iii) management
of big data; and (iv) big data in healthcare. The ability to query data is central
to reporting and analytics, so that the generated report is clear, crisp, and
accessible to the target audience when it is offered. This chapter discusses the
effective utilization of big data analysis and management in healthcare.
Chapter 7 demonstrates how the analysis of health data, such as blood cholesterol,
blood pressure, smoking, and obesity can identify high-risk heart attack patients and
how the proactive changes in these high-risk patient lifestyles and use of medication
can prevent a heart attack from taking place.
The Brugada syndrome (BrS) is a disease with a great predisposition to sudden
cardiac death due to ventricular arrhythmias. Dysfunctional ion channels are consid-
ered responsible for this entity. In an effort to characterize proteins, a nonsupervised
computational system called the Polarity Index Method® is presented in Chapter 8 to
extensively measure the “polar profile” or “electromagnetic balance” of proteins. In the
presented work, the method is calibrated with the BrS mutated proteins, searching for an
association of these entities with (i) the group of 36 BrS proteins from known 4,388 BrS
mutated proteins, (ii) two groups of intrinsically disordered proteins, (iii) six lipoprotein
groups, (iv) three main groups of antimicrobial proteins from UniProt and APD2 data-
bases, (v) a set of selective cationic amphipathic antibacterial peptides (SCAAP), and
(vi) the group of 557,713 “reviewed” proteins from the UniProt database.
Chapter 9 notes that machine learning has recently been shown to extract more
useful information from clinical images, processing them accurately and, above all,
more reliably than humans. Towards this end, the chapter uses a machine learning
algorithm to automatically discriminate healthy skin, superficial burns, and
full-thickness burns; the extracted features are used to train a multiclass support
vector machine (SVM).
The authors in Chapter 10 classify four classes of emotions (positive, negative,
depressed, and harmony) based on electroencephalography signal input collected
from brain activities. The proposed state transition system is associated with an
increased or decreased value of related channels/electrodes of the specific cortex
(frontal, temporal, occipital, etc.).
Part III comprises four chapters on applications and new trends in data science.
The authors of Chapter 11 investigate the performance of different kinds of fea-
ture generation schemes, such as local binary pattern (LBP), gradient local binary
pattern (GLBP), histogram of oriented gradient (HOG), run length feature (RLF),
and pixel density. Various similarity and dissimilarity measures are used to achieve
matching in handwritten documents.
The authors of Chapter 12 propose a unique study that combines a text mining
technique with a machine learning approach to determine the satisfaction status
of hotel guests and its determinants with the help of overall ratings, which can be
considered a proxy variable for guest satisfaction. The chapter evaluates guest
satisfaction from 239,859 hotel reviews extracted from Tripadvisor.
Chapter 13 evaluates and compares the performance of naive Bayes and SVM
algorithms in classifying sentiments embedded in movie reviews. In particular,
this study uses the MovieLens review platform to extend the application of
sentiment analysis techniques.
In Chapter 14, human emotional states are recognized from full-body movements
using a feedforward deep convolutional neural network architecture and the Visual
Geometry Group (VGG)16 model. Both models are evaluated on the emotion action
dataset (University of York) with 15 types of emotions.
SPECIAL THANKS
First of all, we would like to thank our employers, who gave us the time and
encouragement to complete this book. We would also like to thank all the authors who
contributed to this book by authoring related and focused chapters in the data sci-
ence field. Their prompt responses, adhering to guidelines, and timely submission of
chapters helped us in meeting the book preparation deadlines. A total of 47 authors
from 13 countries contributed to the development of this book. We hope that this
book will prove beneficial to professionals and researchers, and at the same time,
encourage in generating new algorithms, applications, and case studies in this mul-
tidisciplinary field.
Editors
Qurban A. Memon has contributed at the levels of teaching, research, and community
service in the area of electrical and computer engineering. He has authored/
coauthored approximately 100 publications over 18 years of his academic career.
Additionally, he has written five book chapters: 1. “Intelligent Network System for
Process Control: Applications, Challenges, Approaches” in Robotics, Automation
and Control, 2009; 2. “RFID for Smarter Healthcare Collaborative Network”
in Building Next-Generation Converged Networks: Theory and Practice, CRC
Press, 2012; 3. “Secured and Networked Emergency Notification without GPS
Enabled Devices,” in Bio-inspiring Cyber Security and Cloud Services: Trends
and Innovations, Springer, 2014; 4. “Authentication and Error Resilience in Images
Transmitted through Open Environment” in Handbook of Research on Security
Considerations in Cloud Computing, 2015; 5. “JPEG2000 Compatible Layered
Block Cipher” in Multimedia Forensics and Security: Foundations, Innovations,
and Applications, 2017. He has also edited a book titled Distributed Network
Intelligence, Security and Applications, published by CRC Press, 2013. His research
project undertaken by UAE-U students won CURA-2009 and CURA-2018 awards.
Contributors

Aakash
Department of Operational Research
University of Delhi
New Delhi, India

Antara Barman
Information Technology Department
TechnoIndia
Salt Lake, Kolkata, India

Anu A. Gokhale
Illinois State University
1 Visualization of GeoSpatial Query Processing

CONTENTS
1.1 Introduction.......................................................................................................3
1.1.1 Integration of Spark and MongoDB......................................................4
1.2 Literature Survey...............................................................................................8
1.3 Proposed System................................................................................................9
1.3.1 Methodology for Processing Spatial Queries...................................... 10
1.3.2 Spark Master-Slave Framework........................................................... 11
1.3.3 Algorithms for Sharding...................................................................... 12
1.3.3.1 Algorithm for Range Sharding............................................. 12
1.3.3.2 Algorithms for Zone Sharding.............................................. 17
1.3.4 Dataset and Statistics........................................................................... 19
1.4 Results and Performance Evaluation...............................................................20
1.5 Conclusion....................................................................................................... 23
References.................................................................................................................24
1.1 INTRODUCTION
Companies that use big data for business challenges can gain advantages by inte-
grating MongoDB with Spark. The Spark framework provides support for analytics,
where process execution is fast because of in-memory optimization. Out of various
NoSQL databases, MongoDB provides a document data model that suits the variety
and voluminous amounts of data. As such, when integrated, MongoDB and Spark
together can index data efficiently and help in the analytics of data-driven applications.
Geospatial data can be analyzed to serve the needs of various applications such as
tourism, healthcare, geomarketing, and intelligent transportation systems. Spatial
data occurs in two forms, vector and raster, which store the latitude and longitude of
objects. Keyhole Markup Language (KML) is a tag-based structure and can be used
to display geographic data.
Tableau uses various file formats such as KML, ESRI shape files, GeoJSON
(JavaScript Object Notation) files, and MapInfo Interchange formats for geographic
data analysis and display. Traditional databases process structured data and
guarantee ACID (Atomicity, Consistency, Isolation, and Durability) properties.
NoSQL databases are developed to store and process unstructured data that guar-
antees CAP (Consistency, Availability and Partition Tolerance) properties with less
response time. MongoDB has no query language support, but data can be indexed
as in relational databases and structured as JSON fragments, and it can serve social
networking applications where latency has to be optimized. Cassandra monitors
nodes, handles redundancy, and avoids lazy nodes, whereas MongoDB monitors
these activities at a finer level of granularity. Even though some works have been
reported on labeling and retrieving data from MongoDB, they are inefficient either
at indexing or at retrieval. This chapter aims at adding spatial querying
functionality to the MongoDB database by integrating it with Spark.
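As a rough illustration of this integration (not the authors' code), the sketch below reads a MongoDB collection of geospatial documents into a Spark DataFrame through the MongoDB Spark connector and filters it by a bounding box; the database, collection, and field names (Latitude, Longitude) are illustrative assumptions.

from pyspark.sql import SparkSession

# Assumes the mongo-spark-connector package is on the Spark classpath
# (e.g. supplied via --packages when launching the job).
spark = (SparkSession.builder
         .appName("GeoMongoSparkSketch")
         .config("spark.mongodb.input.uri",
                 "mongodb://localhost:27017/geodb.places")
         .getOrCreate())

# Load the collection as a DataFrame through the MongoDB Spark connector.
places = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

# Example: keep only documents whose coordinates fall inside a bounding box.
in_range = places.filter(
    (places.Latitude >= 16.0) & (places.Latitude <= 17.0) &
    (places.Longitude >= 80.0) & (places.Longitude <= 81.0))
in_range.show()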
Every location on the globe is represented by the intersection of latitude and
longitude coordinates. Spatial operations are classified by the authors of [1] into
proximity analysis queries, such as "Which parcels are within 100 meters of the
subway?"; contiguity analysis queries, such as "Which states share a border with
Coorg?"; and neighborhood analysis queries, such as "Calculate an output value at a
location from the values at nearby locations." Geohashing techniques are used to find
the nearest location to a specified source location with high precision values.
Latitude and longitude are used to calculate the geohash value: the higher the
precision value, the longer the geohash value. Instead of sequential processing of the
database, a parallel search must be done for a required destination using this
geohash. To achieve this, MongoDB is integrated with Spark, which provides efficient
distributed parallel processing.
Ranged sharding divides data into ranges based on the shard key values, and each
chunk is then assigned a range [4]. The following paragraphs present how Hashed,
Range (Tag), and Zone Sharding compute their shard key values. Figure 1.2 presents
a sample geospatial input.
Figure 1.3 presents the geohash computed using the MD5 algorithm, which guarantees
a uniform distribution of writes across shards but is less optimal for range-based
queries. Figure 1.4 presents the document after appending the HashKey. A similar
process is followed in both storage and retrieval operations to find the
corresponding shard. The calculated MD5 value is the shard key and is appended to
the document, as shown in Figure 1.5.
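A minimal sketch of this hashed-sharding step, with assumed field and collection names, is given below: the MD5 digest of the latitude/longitude pair is appended to the document as HashKey before insertion, and the same digest is recomputed at retrieval time so the lookup can be routed to the corresponding shard.

import hashlib
from pymongo import MongoClient

def with_hash_key(doc):
    # Hash the coordinate pair; the digest becomes the shard key field.
    raw = "{:.6f},{:.6f}".format(doc["Latitude"], doc["Longitude"])
    doc["HashKey"] = hashlib.md5(raw.encode("utf-8")).hexdigest()
    return doc

client = MongoClient("mongodb://localhost:27017")
places = client["geodb"]["places"]

doc = {"Name": "Sample location", "Latitude": 16.5062, "Longitude": 80.6480}
places.insert_one(with_hash_key(doc))

# Retrieval recomputes the digest for the queried coordinates, so the lookup
# is answered by the shard that stores documents with this HashKey value.
key = hashlib.md5("16.506200,80.648000".encode("utf-8")).hexdigest()
print(places.find_one({"HashKey": key}))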
FIGURE 1.4 Hashed sharding: appending the HashKey and storing in the database.
Figure 1.6 presents the shard key value for range sharding. Figure 1.7 presents the
document after appending the shard key. A similar process is followed in both storage
and retrieval operations to find the corresponding shard. The calculated shard value
is the shard key and is appended to the document based on the latitude and longitude
ranges, as shown in Figure 1.8.
Figure 1.9 presents the shard key value for zone sharding.
FIGURE 1.7 Range sharding: appending the shard key and storing in the database.
FIGURE 1.9 Zone sharding: generating shard zone based on geographic region.
Figure 1.10 presents the document after appending the Shard Zone. A similar process
is followed in both storage and retrieval operations to find the corresponding shard.
When this shard key range is modified, data is migrated to a new zone automatically.
The calculated geohash value is the shard key and is appended to the document based
on the zone of the input location, along with the zone value, as shown in Figure 1.11.
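The sketch below illustrates, under assumed range widths and zone names (not the chapter's actual configuration), how a range-based shard key and a zone-based shard key could be derived from a latitude/longitude pair, mirroring the process of Figures 1.6 through 1.11.

def range_shard_key(lat, lon, lat_step=10.0, lon_step=10.0):
    # Bucket the coordinates into fixed-width latitude/longitude ranges;
    # each bucket corresponds to one chunk of the ranged shard.
    return "lat{}_lon{}".format(int(lat // lat_step), int(lon // lon_step))

def zone_shard_key(lat, lon):
    # Assign a coarse geographic zone first, then refine it with the
    # range-based key inside that zone.
    zone = ("north" if lat >= 0 else "south") + ("-east" if lon >= 0 else "-west")
    return zone, range_shard_key(lat, lon)

print(range_shard_key(16.5062, 80.6480))   # -> 'lat1_lon8'
print(zone_shard_key(16.5062, 80.6480))    # -> ('north-east', 'lat1_lon8')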
The LTRMP (Long Term Resource Monitoring Program) Spatial Data Query and
Visualization Tool was developed by Olsen [8]. It facilitates the display and querying
of Upper Mississippi River Basin (UMRB) data sets. It requires an understanding of
geographic information system (GIS) concepts and the LTRMP component database
structure. Data processing and visualization tools are reported in [9].
Refinement tools such as DataWrangler and Google Refine; conversion tools such as
Mr. Data Converter; statistical analysis tools such as the R project for statistical
computing; generic visualization applications such as Google Fusion Tables, Tableau,
Many Eyes, CartoDB, and GeoCommons; wizards such as Google Chart Tools; JavaScript
InfoVis Toolkit, D3.js, Protovis, and Recline.js; geospatial visualization tools such
as OpenHeatMap, OpenLayers, and OpenStreetMap; temporal data visualization tools
such as TimeFlow; and tools for network analysis such as Gephi and NodeXL are
discussed. In the category of web applications, Google Fusion Tables, CartoDB,
Google Chart Tools, Recline.js, and OpenLayers are widely used.
FIGURE 1.11 Data representation in zone sharding and storing in the database.
The main objectives of this work are as follows:
1. To understand the need for integrating MongoDB and Spark and propose a
new architecture for the integration.
2. Collect benchmark datasets for testing the proposed architecture.
3. Apply various sharding techniques to partition the data.
4. Compare the performance evaluation of various sharding techniques.
5. Use Tableau for visualization.
1.2 LITERATURE SURVEY
Query processing in a centralized database system has a longer response time and
slower data access when compared with a decentralized database. Location-based services
use voluminous spatial data to answer end user queries. Hence, scalable techniques
that operate on distributed computing platforms and technologies are required for
efficient processing of spatial queries. Spatial data processing based on Hadoop, and
Spark comparison based on query language, spatial partitioning, indexing, and data
analysis operators are reported in [5]. The STARK system provides a number of spatial
predicates for filters and joins. The drawback with this framework is that filters can
delay the processing time, irrespective of any kind of data, and persistent indexing
causes memory wastage. The processing performance can be improved with the use
of hashed indexing that uses less memory. The work reported in [6] presents a distrib-
uted query scheduler for spatial query processing and optimization. It generates query
execution plans using spatial indexing techniques. Experiments are performed on real
datasets and are compared with other works. Although the query execution perfor-
mance is improved and communication cost is reduced, this architecture is costly to
implement, and the use of filters can increase the processing time of a query. Spatial-
Hadoop and GeoSpark are discussed in [7], where authors proved that GeoSpark is
faster than Spatial-Hadoop for geospatial big data analytics. Spatial data processing
system that schedules and executes range search is described in [10]. k-NN (k nearest
neighbor), spatiotextual operation, spatial-join, and k-NN-join queries are described.
Bloom filters are used to reduce network communication. Experiments are carried out
using sample data. Hashed indexing can improve the query processing time by reduc-
ing the use of memory for storing global and local indexing, even though frequently
accessed data is cached in Location Spark and less frequently used data is stored into
disk. The usage of filters on the spatial data increases the implementation cost of this
architecture. Authors in [11] described the Panda architecture for spatial predictive
queries, such as predictive range, k-NN, and aggregate queries. The Panda system
displays system behavior statistics on its interface. Usage of grid and list data
structures to store the data may cause memory wastage. Also, processing the grid data
structures requires more time. The identification of object movement is very important
in such a system, and variations may lead to major differences. Distributed profitable-area query
(DISPAQ) is described in [12]. It identifies profitable areas from raw taxi trip data
using PQ-index. Z-skyline algorithm prunes multiple blocks during query processing.
But usage of predictive function increases the computation overhead, though it reduces
the processing time for executing frequent queries. Performance can be improved by
dividing the places into zones. Data mining is used to increase marketing of an edu-
cational organization in [13]. Authors used k-means algorithm to cluster the locations.
Student residential address is calculated using latitude and longitude and visualized
minimum, maximum, and average distance. Such visualization helps to improve the
admission rate. Haversine formula is used to calculate the distance between the two
locations. Their application can be used for 1,000–2,000 students. Authors of [14] used
open shape-based strategy (OSS) to eliminate the computation related to the world
boundary. It also reduces I/O (Input-Output) and CPU (Central Processing Unit) costs
when compared with range query strategies (RQS) and is more scalable than RQS for large
datasets. However, OSS is more expensive than the filter operations in RQS. Direction
of object plays an important role in the evaluation of queries. Surrounding join query,
that retrieves, for each point in Q, all its surrounding points in P, is described in [15].
Even though the surrounding join query enriches the semantics of the conventional
distance-based spatial join query, creating a VD (Voronoi diagram) requires a lot of
processing time. Also, the pruning of unnecessary nodes in hierarchical algorithm
requires more time, which can be avoided by starting only with necessary nodes.
To summarize, the main drawback of the existing systems is their time complexity for
processing large datasets. Our work mainly focuses on reducing the query response
time by using hashed indexing and zones. The existing systems also have high time
complexity for processing large geospatial datasets and rely heavily on intermediate
data. The usage of complex data structures to hold data leads to long processing
times and requires complex infrastructure. Visualization alone cannot help to perform
analytics on the data handled by the system; visualization can be made fast and
efficient using Tableau, as it handles different types of data and supports business
analytics operations for generating automatic reports. Sharding techniques can be
treated as an alternative to partitioning techniques; as such, Range Sharding and
Zone Sharding are used in our framework. We also made an attempt to determine which
sharding technique performs better for geospatial query processing.
1.3 PROPOSED SYSTEM
This section presents a detailed architecture of GeoMongoSpark along with algo-
rithms. Figure 1.12 presents the proposed GeoMongoSpark Architecture to perform
geospatial querying, aiming at optimized query response time for spatial data analysis.
Figure 1.13 presents a detailed flow chart of the sequence of steps for the
visualization of geospatial data. Initially, data is stored by passing it to Zone
Sharding and Range Sharding. The output of a geospatial query can be converted to
JSON format and given as input to Tableau for visualization.
i. Range Sharding
Master communicates with the cluster manager and sends the task to the
executor, as shown in Figure 1.14. Figure 1.15 presents the Executor Task
in Range Sharding.
ii. Zone Sharding
The Master Task takes the user input location as its input, and its functions
are represented as shown in Figure 1.16. Figure 1.17 presents the Executor Task
in Zone Sharding. The data can be retrieved from the spatial database based on
the user location, and the query is specified as shown in Figure 1.18. During
retrieval, we compare which sharding technique gives the better response time:
Range Sharding or Zone Sharding.
For our experiments, we used a cluster of four nodes, where each node has
an Intel Core i5 processor, 16 GB RAM, and a 1 TB disk. All experiments
were executed in MongoDB Cluster mode with a single master that controls
and manages each executor in cluster nodes. On our cluster, we run Windows
10 Pro with MongoDB 3.4.5, Spark 2.2.0, Scala 2.11.8, and Java 1.8.
1. Find "Tag" variable using GetTag()       // based on the tag of the input location //
2. Create "cursor" for input query          // to access the desired documents from the database //
3. if (cursor != empty)
   Begin
       While (cursor != empty)
       Begin
           // retrieve the locations from the database //
           R.append(document);              // add the location to Result List //
       End
   End
   else
       Add null to R

Begin
    While (cursor A != empty)
    Begin
        // retrieve the locations from the database //
        R.append(document);                 // add the location to Result List //
    End
    While (cursor B != empty)
    Begin
        // retrieve the locations from the database //
        R.append(document);                 // add the location to Result List //
    End
End
else
    Add null to R

Begin
    While (cursor != empty)
    Begin
        if (k == 0) break;
        // retrieve the locations from the database //
        R.append(document);                 // add the location to Result List //
        k--;
    End
End
else
    Add null to R

Begin
    While (cursor != empty)
    Begin
        // retrieve the locations from the database //
        R.append(document);                 // add the location to Result List //
    End
End
Add null to R

Begin
    While (cursor A != empty)
    Begin
        if (k == 0) break;
        // retrieve the locations from the database //
        R.append(document);                 // add the location to Result List //
        k--;
    End
    While (cursor B != empty)
    Begin
        if (k == 0) break;
        // retrieve the locations from the database //
        R.append(document);                 // add the location to Result List //
        k--;
    End
End
else
    Add null to R
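As a concrete (hypothetical) counterpart to the pseudocode above, the sketch below shows a cursor-based range query and its k-NN variant against a MongoDB collection; the collection name and the Latitude/Longitude field names are assumptions.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
places = client["geodb"]["places"]

def range_query(lat_min, lat_max, lon_min, lon_max):
    # Cursor over all documents whose coordinates fall in the bounding box.
    cursor = places.find({"Latitude": {"$gte": lat_min, "$lte": lat_max},
                          "Longitude": {"$gte": lon_min, "$lte": lon_max}})
    result = []
    for document in cursor:
        result.append(document)          # add the location to the result list
    return result or [None]              # mirrors the "Add null to R" branch

def knn_query(lat_min, lat_max, lon_min, lon_max, k):
    cursor = places.find({"Latitude": {"$gte": lat_min, "$lte": lat_max},
                          "Longitude": {"$gte": lon_min, "$lte": lon_max}})
    result = []
    for document in cursor:
        if k == 0:                       # stop once k locations are collected
            break
        result.append(document)
        k -= 1
    return result or [None]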
ii. Uber Taxi Trip Dataset—The dataset is in .TSV form and is available
at [20]. It contains information about CarID, Time and Date, and Latitude
and Longitude values for 1,048,500 tuples.
iii. Road Network Dataset—The dataset is in .csv format and is available at
[21]. It contains information about NodeID and Latitude and Longitude
values for about 21,000 tuples. Statistics of datasets is given in Table 1.1.
i. Range Query Using Zone Sharding and Range Sharding: Given latitude and
longitude values, find the locations in a particular range.
ii. Range Join Query Using Zone Sharding and Range Sharding: Given latitude
and longitude values, find the locations in a particular range of two place types.
iii. k-NN Query Using Zone Sharding and Range Sharding: Given latitude and
longitude values, find the k-locations in a particular range.
iv. k-NN Join Query Using Zone Sharding and Range Sharding: Given
latitude and longitude values, find the k-locations in a particular range of
two place types.
TABLE 1.1
Statistics of Test Data Used

S. No.  Dataset Name with URL        Size in MB
1       India dataset [19]           0.06
2       Taxi trip dataset [20]       56.0
3       Road network dataset [21]    0.7

Table 1.2 presents the response time for querying data for various queries using
Zone and Range Sharding. It can be observed that for the India dataset and the taxi
trip dataset, Zone Sharding performed well when compared with Range Sharding for all
query types. For the road network dataset, Zone Sharding performed well for the Range
Join Query, k-NN Query, and k-NN Join Query, and Range Sharding performed well for
the Range Query. In the road network dataset, the latitude and longitude of each
tuple are in random order.
Table 1.3 compares execution time for Join Query of GeoMongoSpark,
GeoSpark [7], SpatialSpark [17], and Stark [5]. The time value taken for
GeoMongoSpark is the average of Zone Sharding and Range Sharding techniques.
Figure 1.19 presents the processing time comparison for Zone Sharding, and
Figure 1.20 presents the processing time comparison for Range Sharding for all the
three datasets.
Table 1.4 presents the execution time for k-NN search when k = 10, 20, 30,
and Table 1.5 presents the execution time (in seconds) of k-NN Join search when
k = 10, 20, 30.
It can be observed that the runtime increases as the k value increases. Figure 1.21
presents the GUI for GeoMongoSpark. The GUI presents various options to execute
the query in the left window, and the query visualization output is shown in the
right window.
Figure 1.22 presents the visualization output of the Range Join Query for Zone
Sharding. Similar visualizations are generated for all other queries. Different
colors are used in the plot to distinguish the types of places (museum, hospital,
school, bank).
TABLE 1.2
Response Time (in ms) for Querying Data of Various Queries Using Zone
Sharding (ZS) and Range Sharding (RS)

                     Range Query   Range Join Query   k-NN Query   k-NN Join Query
S. No.  Dataset       ZS    RS        ZS    RS          ZS    RS       ZS    RS
1       Dataset-1     86    92        95    97          83    90      101   108
2       Dataset-2     90    91       120   128          95    98      110   120
3       Dataset-3     93    90       115   120          92   100      115   122
TABLE 1.3
Execution Time for Join Queries in Various Frameworks
Geo Spark Spatial Spark STARK GeoMongoSpark
92.5 s 40.07 s 57.5 s 130 ms
TABLE 1.4
Runtime (in μs) of k-NN Search for k = 10, 20, 30
k-NN Query k = 10 k = 20 k = 30
Dataset-1 87,000 92,000 105,000
Dataset-2 98,000 108,000 113,000
Dataset-3 93,000 100,000 109,000
TABLE 1.5
Runtime (in s) of k-NN Join Search for k = 10, 20, 30
k-NN Query k = 10 k = 20 k = 30
Dataset-1 0.103 0.112 0.119
Dataset-2 0.110 0.116 0.125
Dataset-3 0.108 0.115 0.120
1.5 CONCLUSION
GeoMongoSpark adds geospatial query processing and works as an effective storage
and retrieval system. Using the shard key during storage and retrieval helped in
faster data access. Integrating Spark and MongoDB allowed us to process spatial
query data stored in MongoDB without having to move the data into the Spark
environment. Geohashing helped to shard the data, thereby improving query
performance. When the shard key is appended to a geospatial query, mongos routes
the query to a subset of the shards in the cluster. Three benchmark datasets are
used for experimenting with a variety of queries and two sharding techniques.
GeoMongoSpark performance is compared with GeoSpark, SpatialSpark, and Stark.
Performance is also compared for different k values (10, 20, 30) for the k-NN and
k-NN join queries. Results of geospatial queries are visualized using Tableau. Zone
sharding proved to be better than Range sharding.
The future work is to set up multinode configurations for Spark, which can produce
the best results for processing large spatial datasets. An effective searching
technique such as geoIntersect, geoWithin, or nearSphere is still required to
perform joins on fuzzy domains (when a search point belongs to more than one
domain). Such a searching technique can be used in applications such as the telecom
industry to place cell towers, and also in the tourism and development industries.
Our future work also concentrates on building a machine learning model using Scala.
Sharding techniques can be treated as an alternative to partitioning techniques,
but they pose several drawbacks such as operational complexity, a single point of
failure, etc.
REFERENCES
1. Berry, J.K. (1987). Fundamental operations in computer-assisted map analysis.
International Journal of GIS. 1. 119–136.
2. Apache Spark and MongoDB (2015), www.mongodb.com/products/spark-connector,
Last accessed on [Feb. 11, 2018].
3. Alger, K.W. (2017), MongoDB Horizontal Scaling through Sharding, https://dev.to/
kenwalger/mongodb-horizontal-scaling-through-sharding, Last accessed on [Feb. 07,
2018].
4. “Mongos” (2017), https://docs.mongodb.com/manual/core/sharded-cluster-
query-router/, Last accessed on [Aug. 18, 2017].
5. Hagedorn, S., et al. (2017), The stark framework for spatial temporal data analyt-
ics on spark, Proceedings of 20th International Conference on Extending Database
Technology (EDBT), pp. 123–142, Bonn
6. Tang, M., et al. (2016), In-memory Distributed Spatial Query Processing and
Optimization, pp. 1–23.
7. Lenka, R.K., et al. (2017), Comparative Analysis of SpatialHadoop and GeoSpark for
Geospatial Big Data Analytics, arXiv:1612.07433v2 [cs.DC], pp. 484–488.
8. Olsen, D.A. (1997), Long Term Resource Monitoring Program Spatial Data Query and
Visualization Tool, Environmental Management Technical Center, pp. 1–12.
9. Datos.gob.es (2013), Data Processing and Visualisation Tools, ePSIplatform Topic
Report, pp. 1–23.
10. Tang, M., et al. (2016), Location spark: A distributed in memory data management
system for big spatial data, Proceedings of the VLDB Endowment, Vol. 9, No. 13,
pp. 1565–1586.
11. Hendawi, A.M., et al. (2017), Panda ∗: A generic and scalable framework for predictive
spatio-temporal queries, GeoInformatica, Vol. 21, No. 2, pp. 175–208.
12. Putri, F.K., et al. (2017), DISPAQ: Distributed profitable-area query from big taxi trip
data, Sensors, Vol. 17, p. 2201.
13. Hegde, V., et al. (2016), Student residential distance calculation using Haversine for-
mulation and visualization through Googlemap for admission analysis, Proceedings
of IEEE International Conference on Computational Intelligence and Computing
Research (ICCIC), pp. 1–5, Chennai, India
14. Liu, X., et al. (2003), Object-based directional query processing in spatial databases,
IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 2, pp. 295–304.
15. Li, L., et al. (2017), Surrounding join query processing in spatial databases, Proceedings
of ADC 2017, Springer International Publishing, pp. 17–28, Brisbane, Australia
16. Vasavi, S., et al. (2018) Framework for geospatial query processing by integrat-
ing Cassandra with Hadoop, Knowledge Computing and Its Applications, Editors:
S. Margret Anouncia, Uffe Kock Wiil, Springer, Singapore, pp. 131–160.
17. You, S., Zhang, J., Gruenwald, L. (2015). Large-scale spatial join query processing
in cloud. In Proceedings of the 2015 31st IEEE International Conference on Data
Engineering Workshops (ICDEW), pp. 34–41.
18. Vasavi, S., et al. (2018), Framework for geospatial query processing by integrating
Cassandra with Hadoop, GIS Applications in the Tourism and Hospitality Industry,
Editors: Somnath Chaudhuri, Nilanjan Ray, Chapter 1, IGI Global, pp. 1–41.
19. India dataset, www.latlong.net/country/india-102.html, Last accessed on [Feb. 11,
2018].
20. Taxitrip dataset, https://raw.githubusercontent.com/dima42/uber-gps-analysis/master/
gpsdata/all.tsv, Last accessed on [Feb. 11, 2018].
21. Road network dataset, www.cs.utah.edu/~lifeifei/research/tpq/cal.cnode, Last accessed
on [Feb. 11, 2018].
2 A Study on
Metaheuristic-Based
Neural Networks for
Image Segmentation
Purposes
Navid Razmjooy
Independent Researcher, Belgium
Vania V. Estrela
Universidade Federal Fluminense
Hermes J. Loschi
State University of Campinas – UNICAMP
CONTENTS
2.1 Introduction.....................................................................................................26
2.2 Supervised Image Segmentation..................................................................... 27
2.3 Literature Review............................................................................................ 27
2.4 Artificial Neural Networks..............................................................................28
2.5 Optimization.................................................................................................... 32
2.6 Metaheuristic Algorithms................................................................................ 33
2.6.1 Genetic Algorithm...............................................................................34
2.6.2 Particle Swarm Optimization Algorithm............................................ 36
2.6.3 Imperialist Competitive Algorithm..................................................... 37
2.7 Optimization of the Neural Networks Weights Using
Optimization Algorithms................................................................................ 39
2.8 Experimental Setup and Method Analysis...................................................... 42
2.9 Conclusions......................................................................................................44
References.................................................................................................................44
2.1 INTRODUCTION
Digital images that are generated from smaller elements are called pixels. By
considering an image as function f, with two independent dimensional including x,
and y, f(x, y), where −∞ < x , y < ∞ can be illustrated on a page by taking a value for
its intensity or brightness. Here, the point (x, y) is called the point of a pixel.
The process of applying different methods on these images is called image process-
ing. The applications of image processing and machine vision in different industries
and fields are increasing day by day. There are two principal limitations that make the
digital image processing methods interesting:
As human life becomes more aware and its demands more complicated, the utilization
of new machine vision systems in the real world is inevitable [1]. Human visual
perception can easily be mistaken; this disadvantage increases the inconsistency of
human judgment, the cost of inspection, and the variability of such assessments,
which makes it necessary to have machine vision systems in industrial and sensitive
work.
One of the main parts of image processing is image segmentation. The main idea
behind image segmentation is to classify a considered image into its main compo-
nents. In other words, the main purpose of image segmentation is to simplify the
next process steps by turning the image into a more meaningful data. Image seg-
mentation uses assigning labels to all pixels of the image, where pixels with similar
features are labeled in the same class [2]. This process includes an important part
of most computer vision applications [3]. Image segmentation can be classified into
four principal categories: histogram-based thresholding methods, methods based on
classification and clustering techniques, texture analysis based methods, and region-
based split and merge methods [4].
Among the different types of image segmentation, classification is a useful
technique that separates the main components of an image, where pixels with the
same features occupy a particular range of gray levels. Classification-based
segmentation has been given considerable attention in the last few years [5–10].
Furthermore, due to their ability to achieve near-optimal solutions
for many applications, heuristic algorithms have been recently utilized to achieve
proper solutions for the problem of image segmentation [5,10–12].
Recently, neural networks have become one of the most popular methods for
image segmentation in different applications. Multilayer perceptron (MLP) is one of
the most widely used neural network models in which the connection weight training
is normally completed by a backpropagation (BP) learning algorithm [13–15].
The BP learning algorithm uses a gradient descent method to minimize the error
between the desired output and the actual output for classification. One of the
main drawbacks of the gradient descent method is getting stuck in a local minimum.
Using metaheuristic algorithms can minimize error in the neural network by
optimizing the value of its parameters like weight training, architecture adaptation
(for determining the number of hidden layers, number of hidden neurons, and node
transfer functions), and learning rules [10,13,16].
This chapter focuses on a study about how metaheuristic algorithms can opti-
mize the performance of a neural network and how they can be employed for image
segmentation purposes. The main idea is to optimize the initial weights of a feedfor-
ward neural network.
2.3 LITERATURE REVIEW
In recent years, applications of automated computer-aided assessment systems have
increased, ranging from industrial quality inspection [1], agricultural systems [19],
and aerial imaging [20] to medical imaging [5]. These processes are applied to
digital images captured by a camera.
The main purpose of supervised methods is to classify all the pixels of a digital
image into one of several classes; the main idea behind supervised methods is that
the main groups are first determined by a supervisor (expert), and then, the method
classifies the data based on a given class [8,9,21,22].
In contrast, in unsupervised methods, the groups are not defined by an expert. So,
these methods have two main tasks: first, identifying some optimal groups, and
second, classifying the given data among these groups.
Recently, a large number of different types of ANNs have been presented. All of the
introduced networks, such as MLP networks, Kohonen networks, Hopfield networks,
etc., are trained based on different conceptions, like the error BP method [45,46].
From the point of view of solution nature, ANNs can be divided into two groups:
feedforward networks and recurrent networks (which use output feedback). The way
the aforementioned cases are addressed is one of the most significant differences
between human memory and computer memory. For computer memory, addressing is based
on memory blocks or the information in permanent memory. For instance, to retrieve
a specific text or image, the memory address or file associated with that image or
text must exist.
In this type of addressing, given only a picture or a text, the structure cannot
recognize the considered target, and it needs a lot of learning samples against
which to compare the target. Clearly, doing such a thing is time-consuming and
costly. Unlike computer memory, the human brain does this work easily and quickly.
For instance, if you see an incomplete image, you can imagine the complete image in
your brain; if you see a letter in your notebook, you can remember all the things
that you should do; and so on.
This kind of memory is the so-called Content Addressable Memory [47,48]. In
this type of memory, after receiving the contents of a memory block, the address is
immediately given as output.
One of the most basic neural models available is the MLP model, which simulates the
transfer function of the human brain. In this kind of neural network, much of the
human brain's network behavior and signal propagation has been considered; hence,
it is sometimes referred to as a feedforward network [49,50].
Each of the nerve cells, called neurons, receives an input (from a neuronal or
nonneuronal cell) and then transmits it to another cell (neuronal or nonneuronal).
This behavior continues until a certain outcome is reached, which will eventually
lead to a decision, process, thought, or movement. The structure of MLPs comprises
sets of sensory units, which include an input layer, one or more hidden layers, and
an output layer. Figure 2.1 is a schematic of an MLP [51]. The error BP algorithm in
MLPs changes the weights and bias values of the network to decrease the error value
more quickly. The error BP algorithm can be described using the following formula:
x_{k+1} = x_k − α_k g_k    (2.1)

where x_k describes the weight and bias vector at the kth iteration, α_k is the
learning rate at the kth iteration, and g_k describes the gradient at the kth iteration.
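The following toy sketch (illustrative objective, not the chapter's network) shows the update rule of Equation (2.1) in code: the weight/bias vector moves against the gradient with a fixed learning rate.

import numpy as np

def objective(x):
    return np.sum((x - 3.0) ** 2)        # toy error surface with minimum at 3

def gradient(x):
    return 2.0 * (x - 3.0)

x = np.array([0.0, 0.0])                 # initial weight/bias vector x_0
alpha = 0.1                              # learning rate alpha_k (kept constant)
for k in range(100):
    g = gradient(x)                      # gradient g_k
    x = x - alpha * g                    # x_{k+1} = x_k - alpha_k * g_k
print(x, objective(x))                   # x converges towards [3, 3]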
Another commonly used network is the Radial Basis Function (RBF) based network;
similar in pattern to MLP neural networks, these are networks in which the processor
units are centrally focused on processing [52,53].
This focus is modeled using an RBF. In terms of overall structure, RBF neuronal
networks do not differ much from MLP networks, and they differ only in the type
of processing that neurons perform on their cores. However, RBF networks often
have a faster process of learning and preparation. In fact, due to the concentration
of neurons on a specific functional range, it would be easier to adjust them [54,55].
RBF networks need more neurons than the standard MLP networks, but most of
these kinds of networks need less time than the MLPs for learning. When abundant
data exists, RBF networks result in better performance than MLPs. RBF networks
comprise a kind of feedforward neural networks with a similar structure to MLP
networks. The schematic of an RBF network is shown in Figure 2.2.
One of the differences between RBF and other ordinary networks is their inputs.
In the RBF network, the inputs of the transfer function are equal to the distance vec-
tor between weights and inputs multiplied by bias. The input values are sent to the
hidden layer nodes by the input layers. Each node in the hidden layer is described
by a transmission function f that transmits the input signals; in Figure 2.3, the radial
transfer function example diagram is shown.
For the pth input pattern, i.e., X_p, the response of the jth hidden node, y_j, is
equal to

y_j = f( ‖X_p − U_j‖ / (2σ_j²) )    (2.2)

where ‖·‖ describes the Euclidean norm, σ_j is the RBF range, and U_j is the center
of the jth RBF.
The network output is obtained as the weighted linear sum of the hidden layer
responses at each of the output nodes. The following equation illustrates how to
obtain the output of the kth node in the output layer:

Z_pk = Σ_j y_j w_kj    (2.3)
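A minimal numpy sketch of Equations (2.2) and (2.3) follows; the Gaussian choice for f and the centres, widths, and weights used here are illustrative assumptions.

import numpy as np

def rbf_forward(x_p, centres, sigmas, weights):
    # Equation (2.2): y_j = f(||X_p - U_j|| / (2 * sigma_j^2)), here with a
    # Gaussian f(r) = exp(-r).
    dist = np.linalg.norm(x_p - centres, axis=1)
    y = np.exp(-dist / (2.0 * sigmas ** 2))
    # Equation (2.3): Z_pk = sum_j y_j * w_kj.
    return weights @ y

centres = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])   # U_j
sigmas = np.array([0.5, 0.5, 0.5])                          # sigma_j
weights = np.array([[0.2, 0.5, 0.3]])                       # w_kj (one output node)
print(rbf_forward(np.array([1.0, 0.5]), centres, sigmas, weights))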
In all of these models, there is a mathematical structure that can be graphically
displayed and has a set of adjustable parameters. This general structure is adjusted
and optimized by a learning algorithm so that it displays the proper behavior.
A look at the learning process in the human brain also shows that, in fact, we
experience the same process in our brains: all our skills, thoughts, and memories
are shaped by the weakening or strengthening of the relationships between the neural
cells of the brain. In mathematical models, this strengthening and weakening is
described as the setting of a parameter (known as a weight) [5,10,12,24,56,57].
2.5 OPTIMIZATION
The main purpose of optimization is to find the best acceptable solution, given the
constraints and requirements of the problem [58,59]. For a problem, there may be
several different answers; to compare them and select an optimal one, a function
called a fitness (cost) function is defined.
How to choose a fitness function depends on the nature of the problem. For
instance, consumed time or cost is one of the common goals of optimizing transpor-
tation networks. However, selecting an appropriate fitness function is one of the most
important steps in optimization.
Sometimes, in optimization, multiple objective functions are considered simulta-
neously, which are called multiobjective optimization problems [60,61].
The simplest way to deal with these problems is to form a new objective (fitness)
function as a linear combination of main objective functions, in which the effect of
each function is determined by the weight assigned to it. Each optimization problem
has a number of independent variables, called design variables, which are represented
by the n-dimensional vector x. The purpose of optimization is to determine the design
variables in such a way that the objective function is minimized or maximized.
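As a small illustration of the weighted-sum approach (with made-up objectives and weights), two objective functions are combined into a single fitness function whose trade-off is controlled by the chosen weights.

def cost(x):          # e.g. monetary cost of a design x
    return x ** 2

def time_taken(x):    # e.g. travel time of a design x
    return (x - 4.0) ** 2

def combined_fitness(x, w1=0.7, w2=0.3):
    # Linear combination of the objectives; w1 and w2 set their influence.
    return w1 * cost(x) + w2 * time_taken(x)

# The minimiser moves between x = 0 (pure cost) and x = 4 (pure time)
# depending on the weights.
print(min(range(5), key=combined_fitness))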
Optimization problems are divided into two main categories: constrained and
unconstrained problems. Equations representing constraints may be equalities or
inequalities, and in each case, the optimization method is different. In either case,
the constraints determine the acceptable design region. The optimization process can
be considered in the following stages:
Problem formulation: At this stage, a decision problem is defined along with its
general structure. This general structure may not be very accurate, but it states
the overall form of the problem, including the input and output factors and the
objectives of the problem. Clarifying and structuring the problem may be complicated
for many optimization problems [62–64].
Problem modeling: At this stage, a general mathematical model for the problem
is constructed. Modeling may help with similar models in the subject’s history. This
step resolves the problem to one or more optimization models.
Problem optimization: After modeling the problem, the solution routine produces
a good solution to the problem. This solution may be optimal or almost optimal.
The point to be taken into consideration is that the obtained result is a solution to
the model, not to the real problem. During formulation and modeling, there may be
some deviations from the real problem, so the modeled problem can be far from the actual one.
Problem deployment: The solution obtained by the decision maker is examined,
and, if acceptable, is used, and, if the solution is not acceptable, the optimization
model or the algorithm should be developed and the optimization process repeated.
A general description of the optimization problem is given in the following:
minimize   f_i(x),  (i = 1, 2, …, M),   x ∈ R^n
subject to h_j(x) = 0,  (j = 1, 2, …, J),
           g_k(x) ≤ 0,  (k = 1, 2, …, K)
where fi(x), hj(x), and gk(x) are functions of the vector. Each xi member of x is called
the decision variable that can be continuous, real, discrete, or a combination of them.
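One common way to evaluate such a constrained problem with an unconstrained (meta)heuristic search, sketched below under an assumed penalty weight, is to fold constraint violations into the objective as penalty terms.

def penalized_objective(f, gs, hs, x, rho=1e3):
    # Inequality constraints g_k(x) <= 0 contribute only when violated;
    # equality constraints h_j(x) = 0 contribute for any deviation from zero.
    penalty = sum(max(0.0, g(x)) ** 2 for g in gs)
    penalty += sum(h(x) ** 2 for h in hs)
    return f(x) + rho * penalty

f = lambda x: (x - 2.0) ** 2      # objective to minimize
gs = [lambda x: x - 1.0]          # constraint: x <= 1
hs = []                           # no equality constraints here

print(penalized_objective(f, gs, hs, 0.5))   # feasible point: just f(x)
print(penalized_objective(f, gs, hs, 3.0))   # infeasible point: f(x) + penalty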
While using the optimization algorithms, a few cases should be considered:
2.6 METAHEURISTIC ALGORITHMS
The goal of metaheuristic algorithms is to provide, within a reasonable time frame,
a solution that is appropriate to the problem; the heuristic may not find the best
possible solution, but it can come close to it.
Metaheuristic algorithms can be combined with optimization algorithms to improve
the efficiency of the algorithm. A metaheuristic algorithm is a combination of
heuristic procedures designed to find, generate, or select a candidate at each step
and provide a good solution to optimization problems. Metaheuristic algorithms make
only a few assumptions about the optimization problems to be solved.
Metaheuristic algorithms are some kinds of algorithms that are generally inspired
by nature, physics, and human life and are used to solve many of the optimization
problems. Metaheuristic algorithms are usually used in combination with other algo-
rithms to reach the optimal solution or escape from the local optimal solution [65].
In recent years, one of the most important and promising research directions has
been "inventive methods derived from nature," which have similarities with social
or natural systems. Their application is based on inventive techniques that have
produced interesting results in solving Nondeterministic Polynomial-time hard
(NP-Hard) problems.
We begin by defining the nature of these methods; they are derived from physics,
biology, and sociology, and include the following:
• Selection, which rewards strong individual qualities and penalizes weaker
individuals
• Mutation, which introduces random members and the possibility of generating a
new individual
Generally, two mechanisms are found in the methods derived from nature, selection
and mutation: selection provides the basis for optimization, while mutation
introduces random jumps that drive the ongoing search for new ideas.
Among the characteristics of the innovative methods derived from nature, one
can mention the following:
2.6.1 Genetic Algorithm
The GA is one of the most important algorithms used to optimize defined functions
on a limited domain. In this algorithm, past information is extracted according to
the inheritance mechanism of the algorithm and is used in the search process. The
concept of the GA was introduced by John Holland [66,67].
GAs are usually implemented as a computer simulation in which a population of
abstract representations (chromosomes) of candidate solutions to an optimization
problem evolves toward better solutions.
Traditionally, solutions were in the form of strings of 0s and 1s, but today they
are implemented in other ways. The algorithm begins with a completely random initial
population and proceeds in generations. In each generation, the fitness of the whole
population is evaluated; several individuals are selected stochastically from the
current generation (based on their fitness) and modified (by recombination or
mutation) to form the new generation, which becomes the current generation in the
next iteration of the algorithm. The structure of the GA
is as follows:
F(x) = G(f(x))    (2.5)

where f is the objective function, the function G converts the value of the
objective function to a nonnegative number, and F is the resulting fitness value.
The suitability of a solution is measured by the value of the obtained
objective function.
C. The population size and the number of generations: the number of chromosomes
is called the population size. One of the advantages of the GA is that it uses
parallel searches rather than traditional sequential search methods. By definition,
the population size is the number of parallel searches [68,69].
Genetic operators are used to move to a new point in the search space. Two of these
operators for generating children are crossover and mutation.
The crossover operator is the main operator for generating new chromosomes in the
GA [70]. Like its counterpart in nature, this operator produces new individuals
whose components (genes) come from their parents. Some types of crossover operators
are one-point crossover, two-point crossover, uniform crossover, etc.
The mutation operator is a random process in which the content of a gene is replaced
by another gene to produce a new genetic structure. Note that each repetition of the
algorithm that leads to the creation of a new population is called a generation [71].
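A minimal sketch of both operators on binary-string chromosomes (the chromosome length and mutation rate are illustrative) is given below.

import random

def one_point_crossover(parent_a, parent_b):
    # Cut both parents at the same random point and swap the tails.
    point = random.randint(1, len(parent_a) - 1)
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(chromosome, p_mutation=0.05):
    # Flip each gene independently with a small probability.
    return [1 - gene if random.random() < p_mutation else gene
            for gene in chromosome]

parent_a = [0, 0, 0, 0, 0, 0]
parent_b = [1, 1, 1, 1, 1, 1]
child_a, child_b = one_point_crossover(parent_a, parent_b)
print(child_a, mutate(child_b))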
The flow chart of the GA is given in Figure 2.4.
2.6.3 Imperialist Competitive Algorithm
The main idea of the imperialist competitive algorithm (ICA) is to simulate the
colonial political process. In the same way that the GA simulates biological
evolution, the imperialist competitive algorithm uses political evolution. This
algorithm was introduced in 2007 by Atashpaz-Gargari and Lucas and has since been
used as a tool in many applications and research fields. The high power of this
algorithm, especially in dealing with continuous problems, has led the imperialist
competitive algorithm to be considered one of the major tools for optimization [78].
Like other metaheuristic algorithms, the ICA starts with the random generation of a
set of possible solutions. These initial solutions are called countries. Through the
following process, the ICA gradually improves these initial solutions (countries)
and ultimately provides an optimal solution to the optimization problem.
The main foundations of this algorithm are assimilation, imperialistic competi-
tion, and revolution. This algorithm, by imitating the social, economic, and political
evolution of countries and by mathematical modeling of them, provides parts of
this process with regular operators in the form of algorithms that can help to solve
complicated optimization problems. In fact, this algorithm looks at the optimization
problems in the form of countries and tries to improve these responses during the
iterative process and ultimately to the optimal solution for the problem.
Like other evolutionary algorithms, this algorithm also begins with a random initial
population; each member is called a “country.” Some of the best members of the
population (the equivalent of elites in the GA) are selected as imperialists.
The remaining population is considered as colonies. Imperialists, depending on
their strength, pull their colonies toward themselves through a specific process. The
power of each empire depends on both the imperialist country (as the core) and its
colonies. In mathematical terms, this dependence is modeled in the definition of
imperial power: the power of an empire is the power of the imperialist country plus a
portion of the average power of its colonies.
With the formation of the initial empires, imperialistic competition begins among them.
Any empire that fails in this competition to increase its power (or at least to prevent
losing its influence) will be eliminated from the imperialistic competition, so the
survival of an empire depends on its power to capture the colonies of competing
empires and to control them.
Consequently, during these imperialistic struggles, the power of the stronger empires
will gradually increase, and the weaker empires will be eliminated. Empires must
develop their own colonies to boost their power.
In brief, this algorithm regards colonialism as an inseparable part of the evolution
of human history and uses its impact on imperial and colonial countries as the source
of inspiration for an efficient and innovative algorithm for evolutionary computing.
In this study, two parameters of ICA, namely the colony assimilation coefficient
and the revolution probability, are denoted by Beta and pRevolution.
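The sketch below illustrates the assimilation and revolution steps in Python, using the Beta and pRevolution parameters named above. The sphere cost function, the population sizes, the random colony assignment, and the omission of the imperialistic-competition (colony reassignment) step are simplifications assumed only for illustration; this is not the chapter's implementation.

import numpy as np

# Minimal ICA-style iteration: colonies move toward their imperialist
# (assimilation, controlled by beta) and are occasionally re-randomized
# (revolution, probability p_revolution). Imperialistic competition between
# empires is omitted for brevity.
rng = np.random.default_rng(0)
dim, n_countries, n_imp = 5, 50, 5
beta, p_revolution = 2.0, 0.1            # Table 2.1 of the chapter uses Beta = 5, pRevolution = 0.1

def cost(x):                             # toy cost: sphere function (to be minimized)
    return np.sum(x ** 2)

countries = rng.uniform(-10, 10, (n_countries, dim))
order = np.argsort([cost(c) for c in countries])
imperialists = countries[order[:n_imp]]              # best countries become imperialists
colonies = countries[order[n_imp:]]
owner = rng.integers(0, n_imp, len(colonies))        # simplified: random empire assignment

for _ in range(100):
    for i, col in enumerate(colonies):
        imp = imperialists[owner[i]]
        # assimilation: move the colony a random fraction of beta toward its imperialist
        colonies[i] = col + beta * rng.random(dim) * (imp - col)
        # revolution: with a small probability, replace the colony with a random country
        if rng.random() < p_revolution:
            colonies[i] = rng.uniform(-10, 10, dim)
        # position exchange: a colony that becomes better than its imperialist swaps roles
        if cost(colonies[i]) < cost(imperialists[owner[i]]):
            imperialists[owner[i]], colonies[i] = colonies[i].copy(), imperialists[owner[i]].copy()

best = min(imperialists, key=cost)
print(cost(best))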
The flow chart of the ICA is shown in Figure 2.6.
The output of an MLP can be written as a weighted combination of a set of functions gi(x):
f(x) = K(Σi ωi gi(x) + bi) (2.7)
where K is a predefined function, such as the sigmoid function. A convenient way to
simplify the collection of functions gi is to gather them into a vector g = (g1, …, gn).
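As a small illustration of Equation (2.7), the snippet below evaluates the weighted combination of the functions gi and passes it through a sigmoid K; the example functions, weights, and biases are assumptions made purely for illustration.

import numpy as np

# Illustration of Equation (2.7): f(x) = K(sum_i w_i * g_i(x) + b_i),
# with K the sigmoid function. All values below are illustrative.
def K(z):                                   # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def mlp_output(x, weights, biases, g):
    return K(sum(w * gi(x) + b for w, gi, b in zip(weights, g, biases)))

g = [lambda x: x[0], lambda x: x[1], lambda x: x[0] * x[1]]   # example g_i functions
print(mlp_output(np.array([0.5, -1.0]), [0.3, -0.2, 0.7], [0.0, 0.1, 0.0], g))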
As explained earlier, the BP method is one of the most widely used methods for training
feedforward networks. It calculates the error over all training pairs and adjusts the
weights to approach the desired output. This is performed over several iterations to
reach an optimal solution and the minimum error value on the training set.
After the training step, the network with optimal values of weights is prepared
for validation test and utilization. A schematic diagram of how to adjust an MLP by
metaheuristics is shown in Figure 2.7.
Generally, the gradient descent algorithm is utilized for obtaining the minimum
value of the error in the BP network. This algorithm has a very important problem:
it can become trapped in a local minimum, and this drawback depends strongly on
the initial (weight) settings. One of the advantages of metaheuristic algorithms is
their ability to escape from local minima thanks to their exploration characteristics.
To optimize the weights of the network, the ANN is first trained using a meta-
heuristic algorithm to find optimal initial weights. Afterward, the neural network is
trained using a BP algorithm, which results in an optimized BP network.
The mean squared error (MSE) of the network that should be minimized by the
metaheuristic is given as follows:
MSE = (1/2) Σk=1..g Σj=1..m (Yj(k) − Tj(k))² (2.8)
where m is the number of output nodes, g is the number of training samples, Yj(k) is
the desired output, and Tj(k) is the actual output.
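A hedged sketch of this two-stage scheme is given below: the network weights are encoded as a flat vector, a simple random-search routine (standing in for GA, PSO, or ICA) minimizes the MSE of Equation (2.8) to obtain initial weights, and plain gradient descent then refines them. The tiny synthetic dataset, the one-hidden-layer architecture, and the use of a numerical gradient are assumptions made purely for illustration.

import numpy as np

# Stage 1: a "metaheuristic" (here plain random search) finds good initial
# weights by minimizing the MSE of Eq. (2.8); stage 2: gradient descent
# (BP, using a numerical gradient for brevity) refines them.
rng = np.random.default_rng(1)
X = rng.random((20, 3))                                  # g = 20 samples, 3 inputs
T = (X.sum(axis=1, keepdims=True) > 1.5) * 1.0           # m = 1 output node

def unpack(w):                                           # flat vector -> layer matrices
    return w[:12].reshape(3, 4), w[12:16], w[16:20].reshape(4, 1), w[20:21]

def forward(w, X):
    W1, b1, W2, b2 = unpack(w)
    H = 1 / (1 + np.exp(-(X @ W1 + b1)))                 # sigmoid hidden layer
    return 1 / (1 + np.exp(-(H @ W2 + b2)))              # sigmoid output layer

def mse(w):                                              # Eq. (2.8): 0.5 * sum of squared errors
    return 0.5 * np.sum((forward(w, X) - T) ** 2)

best = rng.normal(0, 1, 21)                              # stage 1: random-search initialization
for _ in range(300):
    cand = best + rng.normal(0, 0.3, 21)
    if mse(cand) < mse(best):
        best = cand

w, lr, eps = best.copy(), 0.5, 1e-5                      # stage 2: BP-style refinement
for _ in range(200):
    grad = np.array([(mse(w + eps * e) - mse(w - eps * e)) / (2 * eps)
                     for e in np.eye(len(w))])
    w -= lr * grad

print("MSE after metaheuristic:", round(mse(best), 4), "after refinement:", round(mse(w), 4))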
The main purpose of image segmentation based on the optimized neural network is to
categorize all pixels of the input images into a set of predefined classes, simplifying
feature extraction, classification, and other tasks at the next level.
In image segmentation based on supervised techniques, we first have to train the
network on several different samples together with their solutions (each subject with
its label). During this learning, the neural network adjusts its weights according to
the illustrated methodology. Afterward, the neural network is able to classify an input
image based on what it has learned. Figure 2.8 shows the steps of image segmentation
based on a supervised technique.
The purpose of the proposed method for the first database is to distinguish cancer-like
pixels from healthy pixels, and the purpose for the second and third databases is to
separate the defected parts of potatoes from the healthy parts.
The input layer of the network considers three neurons from each image, classifying
pixels as either cancer (defected) or noncancer (healthy). In this analysis, a sigmoid
function is used as the activation function of the optimized MLP network, so the
generated output is a pixel intensity between 0 and 255. Because the sigmoid function
produces continuous values and a binary output is required, a single threshold value
is applied to the image.
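As an illustration of this thresholding step, the short sketch below converts a sigmoid output map into a binary (defected/healthy) mask; the array size and the threshold value of 127 are assumptions, not values taken from the chapter.

import numpy as np

# Scale the sigmoid output to 0-255 and apply a single threshold to obtain a binary mask.
sigmoid_output = np.random.default_rng(2).random((64, 64))    # stand-in network output in [0, 1]
gray = (sigmoid_output * 255).astype(np.uint8)                # pixel intensities 0-255
mask = (gray > 127).astype(np.uint8)                          # 1 = defected/cancer, 0 = healthy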
In the following, some examples of the proposed technique based on ICA algorithm
are shown in Figure 2.9.
As can be seen, the proposed technique provides a strong method for image
segmentation. It is important to note that, at this stage, no postprocessing (such as
morphological operations) has been applied.
To analyze the accuracy of the three optimization algorithms, the CDR, FAR, and
FRR are evaluated.
CDR is the correct detection rate; FAR is the false acceptance rate, i.e., the
percentage of identification moments in which a false acceptance happens; and FRR
is the false rejection rate, i.e., the percentage of identification moments in which a
false rejection happens.
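A minimal sketch of these indices, assuming the standard pixel-count definitions (the chapter's exact formulas are not reproduced here), is as follows.

# Standard pixel-count definitions of CDR, FAR and FRR (assumed), expressed as
# percentages of the total number of pixels considered.
def segmentation_rates(n_correct, n_false_accept, n_false_reject, n_total):
    cdr = 100.0 * n_correct / n_total          # correct detection rate (%)
    far = 100.0 * n_false_accept / n_total     # false acceptance rate (%)
    frr = 100.0 * n_false_reject / n_total     # false rejection rate (%)
    return cdr, far, frr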
The value of the parameters for the analyzed metaheuristic algorithms is given in
Table 2.1.
FIGURE 2.9 Some of the results for the algorithms: (A) original image and (B) segmented
image based on optimized MLP network.
TABLE 2.1
The Value for the Parameters in the Analyzed Metaheuristic Algorithms
Algorithm Parameter Value
GA Crossover probability 0.9
Mutation probability 0.01 0.1
Selection mechanism Roulette wheel
PSO Acceleration constants [2,2]
Inertia weights 0.1
ICA Number of countries 500
Number of imperialists 10
Beta 5
pRevolution 0.1
Zeta 0.1
TABLE 2.2
Classification Comparison of Performance in the Proposed Technique with
Three Different Metaheuristics
Metric Ordinary MLP MLP-GA MLP-PSO MLP-ICA
CFIA + USDA CDR (%) 88 90 89 90
FAR (%) 7.5 8 7 6.5
FRR (%) 4.5 4 4 3.5
ACD CDR (%) 90 90 93 94
FAR (%) 6 7 4 3
FRR (%) 4 3 3 3
Table 2.2 presents the efficiency of the presented segmentation algorithms based
on CDR, FAR, and FRR.
It is clear from the results above that using metaheuristics can improve the system
accuracy. It is also clear that ICA has better efficiency than the other metaheuristics
for this purpose.
2.9 CONCLUSIONS
Image segmentation is a part of data science that deals with dividing an image, viewed
as data, into its principal elements. In this chapter, the application of metaheuristic
algorithms, including the GA, the PSO algorithm, and the ICA, has been studied for
optimizing the structure of neural networks for image segmentation purposes. To analyze
the proposed technique, two different databases, including potato and melanoma skin
cancer images, have been employed, where the main purpose is to separate the defected
parts from the healthy parts based on intensity pixel classification by a trained neural
network. The weights of the neural network are optimized by a metaheuristic algorithm
to achieve a suitable accuracy. Simulation results showed that using metaheuristic
algorithms for optimizing the neural network speeds up convergence and reduces the
root-mean-square error. To compare the performance of the three introduced
metaheuristics, three metrics, CDR, FAR, and FRR, have been employed, and the results
show good efficiency for all of them, especially for the ICA. It is important to note that,
since these approaches increase the design cost and the complexity of the system, it is
better to utilize them in applications where classic methods do not give proper results.
REFERENCES
1. Razmjooy, N., Mousavi, B. S., & Soleymani, F. (2012). A real-time mathematical com-
puter method for potato inspection using machine vision. Computers & Mathematics
with Applications, 63(1): 268–279.
2. Ghamisi, P., Couceiro, M. S., Benediktsson, J. A., & Ferreira, N. M. (2012). An efficient
method for segmentation of images based on fractional calculus and natural selection.
Expert Systems with Applications, 39(16): 12407–12417.
3. Razmjooy, N., Mousavi, B. S., Sargolzaei, P., & Soleymani, F. (2011). Image thresh-
olding based on evolutionary algorithms. International Journal of Physical Sciences,
6(31): 7203–7211.
4. Brink, A. (1995). Minimum spatial entropy threshold selection. IEE Proceedings-
Vision, Image and Signal Processing, 142(3): 128–132.
5. Razmjooy, N., Mousavi, B. S., & Soleymani, F. (2013). A hybrid neural network
Imperialist Competitive Algorithm for skin color segmentation. Mathematical and
Computer Modelling, 57(3): 848–856.
6. Banimelhem, O., & Yahya, Y. A. (2011). Multi-thresholding image segmentation using
genetic algorithm. Jordan University of Science and Technology, Irbid, Jordan: 1–6.
7. Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M., & Thrun, S.
(2017). Dermatologist-level classification of skin cancer with deep neural networks.
Nature, 542(7639): 115–118.
8. Gautam, D., & Ahmed, M. (2015). Melanoma detection and classification using SVM
based decision support system. Paper presented at the India Conference (INDICON),
2015, Annual IEEE, New Delhi.
9. Mousavi, B. S., & Soleymani, F. (2014). Semantic image classification by genetic
algorithm using optimised fuzzy system based on Zernike moments. Signal, Image
and Video Processing, 8(5): 831–842.
10. Razmjooy, N., Sheykhahmad, F. R., & Ghadimi, N. (2018). A hybrid neural
network–world cup optimization algorithm for melanoma detection. Open Medicine,
13(1): 9–16.
11. Petroski Such, F., Madhavan, V., Conti, E., Lehman, J., Stanley, K. O., &
Clune, J. (2017). Deep neuroevolution: Genetic algorithms are a competitive alter-
native for training deep neural networks for reinforcement learning. arXiv preprint
arXiv:1712.06567.
12. Razmjooy, N., & Ramezani, M. (2016). Training wavelet neural networks using
hybrid particle swarm optimization and gravitational search algorithm for system
identification, International Journal of Mechatronics, Electrical and Computer
Technology, 6(21), 2987–2997.
13. Zhang, Z. (2018). Artificial neural network. Multivariate Time Series Analysis in
Climate and Environmental Research, Editor: Zhihua Zhang (pp. 1–35): Springer,
Switzerland.
14. Mastorakis, N. (2018). Neural network methods for image segmentation. Paper
presented at the Applied Physics, System Science and Computers II: Proceedings of
the 2nd International Conference on Applied Physics, System Science and Computers
(APSAC2017), September 27–29, 2017, Dubrovnik, Croatia.
15. Roffman, D., Hart, G., Girardi, M., Ko, C. J., & Deng, J. (2018). Predicting non-
melanoma skin cancer via a multi-parameterized artificial neural network. Scientific
reports, 8(1): 1701.
16. DeGroff, D., & Neelakanta, P. S. (2018). Neural Network Modeling: Statistical
Mechanics and Cybernetic Perspectives: CRC Press, Boca Raton.
17. Hemanth, D. J., & Estrela, V. V. (2017). Deep Learning for Image Processing
Applications (Vol. 31): IOS Press.
18. Nascimento, J. D., da Silva Tavares, R., Estrela, V. V., de Assis, J. T., & de Almeida,
J. C. H. Image Processing Techniques Applied to Microtectonics, Santo Amaro, Brazil.
19. Moallem, P., Razmjooy, N., & Ashourian, M. (2013). Computer vision-based potato
defect detection using neural networks and support vector machine. International
Journal of Robotics and Automation, 28(2): 137–145.
20. Estrela, V. V., & Coelho, A. M. (2013). State-of-the art motion estimation in the
context of 3D TV. Multimedia Networking and Coding, Editors: Reuben A. Farrugia,
Carl J. Debono, (pp. 148–173): IGI Global.
21. Zhang, C., Pan, X., Li, H., Gardiner, A., Sargent, I., Hare, J., & Atkinson, P. M.
(2018). A hybrid MLP-CNN classifier for very fine resolution remotely sensed image
classification. ISPRS Journal of Photogrammetry and Remote Sensing, 140: 133–144.
22. Özdoğan, M. (2019). Image classification methods in land cover and land use. Remote
Sensing Handbook-Three Volume Set, Editor: Prasad Thenkabail (pp. 265–280): CRC
Press, Boca Raton.
23. Irani, R., & Nasimi, R. (2011). Application of artificial bee colony-based neural
network in bottom hole pressure prediction in underbalanced drilling. Journal of
Petroleum Science and Engineering, 78(1): 6–12.
24. Moallem, P., & Razmjooy, N. (2012). A multi layer perceptron neural network
trained by invasive weed optimization for potato color image segmentation. Trends in
Applied Sciences Research, 7(6): 445.
25. Fadaeddini, A., Eshghi, M., & Majidi, B. (2018). A deep residual neural network
for low altitude remote sensing image classification. Paper presented at the Fuzzy and
Intelligent Systems (CFIS), 2018 6th Iranian Joint Congress on, Kerman, Iran.
26. Mikaeil, R., Haghshenas, S. S., Haghshenas, S. S., & Ataei, M. (2018). Performance
prediction of circular saw machine using imperialist competitive algorithm and fuzzy
clustering technique. Neural Computing and Applications, 29(6): 283–292.
27. Moghaddam, M. H. R., Sedighi, A., Fasihi, S., & Firozjaei, M. K. (2018). Effect of
environmental policies in combating aeolian desertification over Sejzy Plain of Iran.
Aeolian Research, 35: 19–28.
28. Azadi, S., & Karimi-Jashni, A. (2016). Verifying the performance of artificial
neural network and multiple linear regression in predicting the mean seasonal munici-
pal solid waste generation rate: A case study of Fars province, Iran. Waste manage-
ment, 48: 14–23.
29. Montana, D. J., & Davis, L. (1989). Training feedforward neural networks using
genetic algorithms, Proceedings of the 11th international joint conference on Artificial
intelligence, pp: 762–767, vol 1, 1989, San Francisco.
30. Sharma, C., Sabharwal, S., & Sibal, R. (2014). A survey on software testing
techniques using genetic algorithm. arXiv preprint arXiv:1411.1154.
31. Hosseini, H., Farsadi, M., Khalilpour, M., & Razmjooy, N. (2012). Hybrid Energy
Production System with PV Array and Wind Turbine and Pitch Angle Optimal
Control by Genetic Algorithm (GA), Journal of World’s Electrical Engineering and
Technology, 1(1): 1–4.
32. Trelea, I. C. (2003). The particle swarm optimization algorithm: convergence
analysis and parameter selection. Information processing letters, 85(6): 317–325.
33. Moallem, P., & Razmjooy, N. (2012). Optimal threshold computing in automatic
image thresholding using adaptive particle swarm optimization. Journal of applied
research and technology, 10(5): 703–712.
34. de Jesus, M. A., Estrela, V. V., Saotome, O., & Stutz, D. (2018). Super-resolution via
particle swarm optimization variants. Biologically Rationalized Computing Techniques
For Image Processing Applications, Editors: Jude Hemanth Valentina Emilia Balas,
(pp. 317–337): Springer.
35. Razmjooy, N., & Ramezani, M. (2014). An Improved Quantum Evolutionary
Algorithm Based on Invasive Weed Optimization. Indian J. Sci. Res, 4(2): 413–422.
36. Razmjooy, N., Khalilpour, M., & Ramezani, M. (2016). A new meta-heuristic
optimization algorithm inspired by FIFA World Cup competitions: Theory and its
application in PID designing for AVR system. Journal of Control, Automation and
Electrical Systems, 27(4): 419–440.
37. Razmjooy, N., Mousavi, B. S., Soleymani, F., & Khotbesara, M. H. (2013). A
computer-aided diagnosis system for malignant melanomas. Neural Computing and
Applications, 23(7–8): 2059–2071.
38. Vijayalakshmi, Y., Jose, T., Babu, S. S., Jose, S. R. G., & Manimegalai, P. (2017).
Blue brain - A massive storage space. Advances in Computational Sciences and
Technology, 10(7): 2125–2136.
39. Cao, M., He, Y., Dai, Z., Liao, X., Jeon, T., Ouyang, M., …, Dong, Q. (2016). Early
development of functional network segregation revealed by connectomic analysis of the
preterm human brain. Cerebral Cortex, 27(3): 1949–1963.
40. Dumoulin, S. O., Fracasso, A., van der Zwaag, W., Siero, J. C., & Petridou, N. (2018).
Ultra-high field MRI: advancing systems neuroscience towards mesoscopic human
brain function. Neuroimage, 168: 345–357.
41. Behrang, M., Assareh, E., Ghanbarzadeh, A., & Noghrehabadi, A. (2010). The
potential of different artificial neural network (ANN) techniques in daily global solar
radiation modeling based on meteorological data. Solar Energy, 84(8): 1468–1480.
42. Ostad-Ali-Askari, K., Shayannejad, M., & Ghorbanizadeh-Kharazi, H. (2017).
Artificial neural network for modeling nitrate pollution of groundwater in marginal
area of Zayandeh-rood River, Isfahan, Iran. KSCE Journal of Civil Engineering, 21(1):
134–140.
43. Khoshroo, A., Emrouznejad, A., Ghaffarizadeh, A., Kasraei, M., & Omid, M. (2018).
Topology of a simple artificial neural network Sensitivity analysis of energy inputs
in crop production using artificial neural networks. Journal of Cleaner Production,
197(1), 992–998.
44. Rafiei, M., Niknam, T., Aghaei, J., Shafie-khah, M., & Catalão, J. P. (2018).
Probabilistic load forecasting using an improved wavelet neural network trained by
generalized extreme learning machine. IEEE Transactions on Smart Grid.
45. Li, J., Cheng, J.-h., Shi, J.-y., & Huang, F. (2012). Brief introduction of back propagation
(BP) neural network algorithm and its improvement. Advances in Computer Science
and Information Engineering, Editors: D. Jin, S. Lin, (pp. 553–558): Springer, Berlin.
46. Ding, S., Li, H., Su, C., Yu, J., & Jin, F. (2013). Evolutionary artificial neural net-
works: a review. Artificial Intelligence Review, 39(3): 251–260.
47. Jarollahi, H., Gripon, V., Onizawa, N., & Gross, W. J. (2013). A low-power
content-addressable memory based on clustered-sparse networks. Paper presented at
the Application-Specific Systems, Architectures and Processors (ASAP), 2013 IEEE
24th International Conference on, Washington, DC.
48. Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., & Lillicrap, T. (2016).
Meta-learning with memory-augmented neural networks. Paper presented at the
International Conference on Machine Learning.
49. Ansari, H., Zarei, M., Sabbaghi, S., & Keshavarz, P. (2018). A new comprehensive model
for relative viscosity of various nanofluids using feed-forward back-propagation MLP
neural networks. International Communications in Heat and Mass Transfer, 91: 158–164.
50. Mohammadi, J., Ataei, M., Kakaei, R. K., Mikaeil, R., & Haghshenas, S. S. (2018).
Prediction of the production rate of chain saw machine using the multilayer perceptron
(MLP) neural network. Civil Engineering Journal, 4(7): 1575–1583.
51. Rezaee, M. J., Jozmaleki, M., & Valipour, M. (2018). Integrating dynamic fuzzy
C-means, data envelopment analysis and artificial neural network to online prediction
performance of companies in stock exchange. Physica A: Statistical Mechanics and its
Applications, 489: 78–93.
52. Han, H.-G., Lu, W., Hou, Y., & Qiao, J.-F. (2018). An adaptive-PSO-based self-
organizing RBF neural network. IEEE Transactions on Neural Networks and Learning
Systems, 29(1): 104–117.
53. Ahangarpour, A., Farbod, M., Ghanbarzadeh, A., Moradi, A., & MirzakhaniNafchi,
A. (2018). Optimization of continual production of CNTs by CVD method using
Radial Basic Function (RBF) neural network and the Bees Algorithm. Journal of
Nanostructures, 8(3): 225–231.
54. Tafarroj, M. M., & Kolahan, F. (2019). Using an optimized RBF neural network
to predict the out-of-plane welding distortions based on the 3-2-1 locating scheme.
Scientia Iranica, 26(2), 869–878.
55. Xie, S., Xie, Y., Huang, T., Gui, W., & Yang, C. (2019). Generalized predictive
control for industrial processes based on neuron adaptive splitting and merging RBF
neural network. IEEE Transactions on Industrial Electronics, 66(2): 1192–1202.
56. Nur, A. S., Radzi, N. H. M., & Ibrahim, A. O. (2014). Artificial neural network
weight optimization: A review. Indonesian Journal of Electrical Engineering and
Computer Science, 12(9): 6897–6902.
57. Aljarah, I., Faris, H., & Mirjalili, S. (2018). Optimizing connection weights in
neural networks using the whale optimization algorithm. Soft Computing, 22(1): 1–15.
58. Fahimnia, B., Davarzani, H., & Eshragh, A. (2018). Planning of complex supply
chains: A performance comparison of three meta-heuristic algorithms. Computers &
Operations Research, 89: 241–252.
59. Akbari, M., Gheysari, M., Mostafazadeh-Fard, B., & Shayannejad, M. (2018).
Surface irrigation simulation-optimization model based on meta-heuristic algorithms.
Agricultural Water Management, 201: 46–57.
60. Donoso, Y., & Fabregat, R. (2016). Multi-objective Optimization in Computer
Networks using Metaheuristics: Auerbach Publications, Boca Raton.
61. Sadeghi, M., Nemati, A., & Yari, M. (2016). Thermodynamic analysis and multi-
objective optimization of various ORC (organic Rankine cycle) configurations using
zeotropic mixtures. Energy, 109: 791–802.
62. Wang, G.-G., Gandomi, A. H., Alavi, A. H., & Dong, Y.-Q. (2016). A hybrid meta-
heuristic method based on firefly algorithm and krill herd. Handbook of Research
on Advanced Computational Techniques for Simulation-Based Engineering, Editor:
Pijush Samui, (pp. 505–524): IGI Global.
63. Stützle, T., & López-Ibáñez, M. (2019). Automated design of metaheuristic
algorithms. Handbook of Metaheuristics, Editors: Michel Gendreau, Jean-Yves Potvin
(pp. 541–579): Springer, Berlin.
64. Memon, Q. (2019). On assisted living of paralyzed persons through real-time
eye features tracking and classification using support vector machines. Medical
Technologies Journal, 3(1): 316–333.
65. Khalilpuor, M., Razmjooy, N., Hosseini, H., & Moallem, P. (2011). Optimal control
of DC motor using invasive weed optimization (IWO) algorithm. Paper presented at
the Majlesi Conference on Electrical Engineering, Majlesi town, Isfahan, Iran.
66. Holland, J. H. (1992). Genetic algorithms. Scientific American, 267(1): 66–73.
67. Khan, G. M. (2018). Evolutionary computation. Evolution of Artificial Neural
Development, Editor: Khan, G. M. (pp. 29–37): Springer, Berlin.
68. Askarzadeh, A. (2018). A memory-based genetic algorithm for optimization of
power generation in a microgrid. IEEE Transactions on Sustainable Energy, 9(3):
1081–1089.
69. Mirjalili, S. (2019). Genetic algorithm. Evolutionary Algorithms and Neural
Networks, Editor: Mirjalili, S., (pp. 43–55): Springer, Berlin.
70. Hosseinabadi, A. A. R., Vahidi, J., Saemi, B., Sangaiah, A. K., & Elhoseny,
M. (2018). Extended genetic algorithm for solving open-shop scheduling problem. Soft
Computing, 23: 1–18.
71. Pal, S. K., & Wang, P. P. (2017). Genetic Algorithms for Pattern Recognition: CRC
press, Boca Raton.
72. Bansal, J. C. (2019). Particle swarm optimization. Evolutionary and Swarm
Intelligence Algorithms, Editors: Bansal, Jagdish Chand, Singh, Pramod Kumar, Pal,
Nikhil R. (pp. 11–23): Springer, Berlin.
73. AminShokravi, A., Eskandar, H., Derakhsh, A. M., Rad, H. N., & Ghanadi,
A. (2018). The potential application of particle swarm optimization algorithm for fore-
casting the air-overpressure induced by mine blasting. Engineering with Computers,
34(2): 277–285.
74. Mahi, M., Baykan, O. K., & Kodaz, H. (2018). A new approach based on parti-
cle swarm optimization algorithm for solving data allocation problem. Applied Soft
Computing, 62: 571–578.
75. Atashpaz-Gargari, E., & Lucas, C. (2007). Imperialist competitive algorithm:
An algorithm for optimization inspired by imperialistic competition, Proceedings of
Evolutionary Computation, 2007. CEC 2007. IEEE Congress on, Singapore.
76. Kaveh, A., & Talatahari, S. (2010). Optimum design of skeletal structures using
imperialist competitive algorithm. Computers & Structures, 88(21–22): 1220–1229.
77. Aliniya, Z., & Mirroshandel, S. A. (2019). A novel combinatorial merge-split
approach for automatic clustering using imperialist competitive algorithm. Expert
Systems with Applications, 117: 243–266.
78. Mirhosseini, M., & Nezamabadi-pour, H. (2018). BICA: A binary imperialist
competitive algorithm and its application in CBIR systems. International Journal of
Machine Learning and Cybernetics, 9(12): 2043–2057.
79. Database, A. C., from www.aihw.gov.au/australian-cancer-database/.
80. United State Department of Agriculture, from www.usda.gov/.
81. Agency, C. F. I., from http://inspection.gc.ca/eng/1297964599443/1297965645317.
3 A Study and Analysis of a Feature Subset Selection Technique Using Penguin Search Optimization Algorithm
Agnip Dasgupta, Ardhendu Banerjee,
Aniket Ghosh Dastidar, Antara Barman,
and Sanjay Chakraborty
TechnoIndia
CONTENTS
3.1 Introduction 51
3.2 Literature Review 53
3.3 Proposed Work 54
    3.3.1 Pseudocode of the Proposed FS-PeSOA Algorithm 55
    3.3.2 Discussion 57
        3.3.2.1 Hunting Strategy of Penguins 57
        3.3.2.2 Fitness Function Evaluation 57
        3.3.2.3 Position Update Logic 58
        3.3.2.4 Oxygen Update Logic 58
3.4 Result Analysis 59
3.5 Conclusions 67
References 67
3.1 INTRODUCTION
Machine learning is a branch of artificial intelligence (AI), which allows applications
to become more accurate in predicting to which class a particular data item belongs.
Various applications of machine learning are spread over the areas like healthcare,
finance, retail, travel, social media, advertisements, and most importantly, data
mining [1]. There are two types of learning: supervised learning and unsupervised
learning. In machine learning, feature alone is a measurable property or trait of a
phenomenon being observed. Feature subset selection is one of the important tasks in
data mining. To perform effective data mining and pattern recognition tasks, we need
the help of an efficient feature selection technique. However, due to a large number of
features present in high-dimensional datasets, there is a chance of unnecessary over-
fitting, which increases the overall computational complexity and reduces the pre-
diction accuracy of the procedure. This feature selection problem belongs to a set of
NP (non-deterministic polynomial)-hard problems, where the complexity increases
exponentially if the number of features along with the size of datasets is increasing.
Feature subset selection not only helps us to get rid of the curse of dimensionality
but also helps us to shorten the training time and simplifying the model, making it
easier for the analysts to interpret it. There are various approaches that deal with
both supervised and unsupervised ways of feature subset selection [2–5]. There are
two classes of feature selection methods, such as (i) filter-based feature selection and
(ii) wrapper-based feature selection.
Filter methods are generally a part of the preprocessing step. Here, each feature
is selected on the basis of their scores in various statistical tests, some of which are
Pearson’s correlation (PCA), linear discriminant analysis (LDA), analysis of vari-
ance (ANOVA), and chi-square. The other methods are wrapper methods, where a
set of a subset is randomly chosen and then the efficiency is checked, and after that,
other features apart from the subset are chosen, and the results are checked again,
or some irrelevant or less important features are removed from the subset. This con-
tinues until we find an ideal subset. In wrapper methods, the problem is reduced to a
search problem. Using wrapper methods is quite expensive [6]. Some of the exam-
ples of wrapper methods are forward selection, backward elimination, and recursive
feature elimination. Apart from these two methods, we can also use an embedded
method that includes the qualities of both wrapper methods and filter methods. It is
used by those algorithms that have built-in feature subset selection methods [7].
After feature subset selection, we use classifiers to classify which class the
particular data belongs to. However, we have tested our algorithm using K-nearest
neighbors (KNN), support vector machine (SVM), and Random Forest classifi-
ers [8]. KNN is a very simple nonparametric decision procedure that is used to assign
unclassified observations a class with the use of a training dataset [9].
In this chapter, we have worked upon feature subset selection using “penguin
search algorithm,” inspired by the hunting strategy of the penguins [10]. Also, we
have tested our algorithm on some popular UCI (University of California, Irvine) datasets,
including Iris, Pima, Wisconsin, etc. Then, we have compared our work with the
existing feature subset algorithm. We have also used three different types of clas-
sifiers and checked how good our algorithm works in terms of parameters such as
accuracy, precision, recall, and F1 score. There are several existing algorithms or
approaches for “feature subset selection,” which are inspired by nature like ant, bee
colony, whale, etc., and so we are using “penguin search optimization algorithm
(PeSOA)” in our work, which is based on the way penguins hunt for food [10].
Penguins take random jumps in different places and random depths to find out fish,
and after they find fish, they come back and communicate about the food avail-
ability with the other penguins, and this continues till the penguins find the best
place or the place where maximum fish are present or the global maxima. The goal
of finding the global maxima continues until the oxygen level of the penguins
gets depleted. Each penguin has to return to the surface after each trip. The
duration of the trip is measured by the amount of oxygen reserves of the penguins,
the speed at which they use it up, or their metabolism rate. This behavior of the
penguins has given us motivation for the development of a new optimization method
based on this strategy of penguins. Penguins are sea birds, and they are unable to
fly [10]. Metaheuristics is mainly used for the development of new artificial systems,
and it is effective in solving NP-hard problems. It can be classified in various ways.
The first work in the field of optimization commenced in 1952, based on the use of
stochastic methods. Rechenberg designed the first algorithm using evolution
strategies for optimization in 1965 [10]. Most of the methods utilize the concept
of a population, in which a set of solutions is calculated in parallel at each iteration,
such as genetic algorithms (GAs), particle swarm optimization (PSO) algorithm, and
ant colony optimization algorithm (ACO) [11–12]. Other metaheuristic algorithms
use search results from their past experience to guide the optimization in the
following iterations, adding a learning stage over intermediate results that leads
to an optimal solution. The work reported in this chapter is an example of
it. It mainly depends on the collaborative hunting strategy of penguins. The overall
optimization process starts from the individual hunting process of each penguin,
which must share information with its group about the number of fish found in its
individual hunting area. The main objective of the group communication is to reach
a global solution (the area having an abundant amount of food). The global solution
is chosen by selecting the best group of penguins, the one that ate the maximum number
of fish. Comparative studies against other metaheuristics have shown that PeSOA
achieves better solutions thanks to its collaborative and dynamic search of the
solution space.
The rest of this chapter is organized as follows. A brief literature review has
been done in Section 3.2. In Section 3.3, we have described our proposed work with a
suitable flow chart diagram. Then, we have described a detailed performance analy-
sis of our proposed approach in Section 3.4. We have also compared our proposed
work with some previous studies related to different parameters of classification in
Section 3.4, and finally, Section 3.5 describes the conclusion of this chapter.
3.2 LITERATURE REVIEW
In the last decade, various researchers adopted various optimization methods for solv-
ing the feature selection problem. In one of the earliest works, a novel marker gene
feature selection approach was introduced. In this approach, a few high-graded infor-
mative genes were elected by the signal–noise ratio estimation process. Then, a novel
discrete PSO algorithm was used to choose a few marker genes, and SVM was used as
an evaluator for getting excellent prediction performance on colon tumor dataset. The
authors have introduced an algorithm called swarm intelligence feature selection algo-
rithm, which is mainly based on the initialization and update of the swarm particles.
In their learning process, they had tested the algorithm in 11 microarray datasets for
brain, leukemia, lung, prostate, etc. And they have noticed that the proposed algorithm
was successfully increasing the classification accuracy and reducing the number of
chosen features compared with other swarm intelligence process. The authors have
compared the utilization of PSO and GA (both combined with SVM) for the clas-
sification of high-dimensional microarray data. Those algorithms are mainly used for
finding small samples of informative genes amongst thousands of them. An SVM
classifier with tenfold cross-validation was applied to validate and assess the provided
solutions [13]. There is one research work, where whale optimization algorithm (WOA)
was introduced, through which a new wrapper feature selection approach is proposed.
It is a recently proposed algorithm that has not been systematically used to feature
selection problems. Two binary variants of the WOA algorithm were mainly intro-
duced to find the subsets of the optimal feature for classification purposes. In the first
case, the aim was mainly to study the impact of using the Tournament and Roulette
Wheel selection mechanisms instead of using a random operator in the searching pro-
cedure. In the second case, crossover and mutation operators are used to prolong the
exploitation of the WOA algorithm. The proposed methods are tested based on stan-
dard benchmark datasets. That paper also considers a comprehensive study related to
the parameter setting for the proposed technique. The results display the ability of the
proposed approaches in searching for the optimal feature subsets [14]. In a paper, a
novel version of the binary gravitational search algorithm (BGSA) was proposed and
used as a tool to choose the best subset of features with the objective of improving
classification accuracy. The BGSA has the ability to overcome the stagnation situation
by enhancing the transfer function. And then the search algorithm was used to explore
a larger group of possibilities and avoid stagnation. To assess the proposed improved
BGSA (IBGSA), classification of some familiar datasets and improvement in the
accuracy of CBIR systems were addressed, and the results were compared with those of
the original BGSA, GA, binary PSO (BPSO), and electromagnetic-like mechanism.
Relative results ensured the effectiveness of the proposed IBGSA in feature selection
[15]. Various optimization techniques (hill-climbing, PSO, etc.) are used to do an effi-
cient and effective unsupervised feature subset selection [7].
It can be seen from a survey of existing work that a few researchers in the last
decade have tried to solve the problem of feature selection using optimization
techniques. There have been a couple of attempts by researchers to unify diverse
algorithms for supervised and unsupervised feature selections. However, we think
PeSOA can provide a better selection strategy in case of wrapper-based feature
selection process. In our recent work, we discuss this area of feature selection.
3.3 PROPOSED WORK
In our algorithm, each location a penguin jumps into is considered a feature or a
dimension. As a penguin dives for food, it explores that feature in depth to check
whether the feature is important, i.e., whether the amount of food (fish) there is
ample. The results are shared among the other penguins once they come back to the
surface, and this search for the optimum result, the minimum number of features
with which classification gives good results, continues until the oxygen level
depletes or the best features (the places with the highest amount of fish) are found
by the penguins [16]. On this basis, we have designed our algorithm.
Start;
Obtain dataset;
Split the dataset in the ratio of 80:20 into a training set and a test set;
Generate a random population of P penguins;
Initialize the oxygen reserve for the penguins;
Initialize the first location of the penguins;
While (iterations < Oxygen) do
    For each penguin j do
        Look for fish available (calculate the fitness of the available data for
        the current penguin with the help of the fitness function of Section 3.3.2.2);
        Determine the quantity of fish available;
        Update the position of the penguin
        (based on the position update logic of Section 3.3.2.3);
    End for;
    Update the best solution;
    Get the food quantity (fitness) data from the penguins to update the group;
    Scale the food quantity (fitness) data for the position update in the next iteration;
    Update the oxygen reserve for the penguins
    (using the oxygen update logic of Section 3.3.2.4);
End while;
Find out the features that qualify the fitness cutoff;
Save the obtained feature subset;
Use this subset of features to undergo classification using SVM, KNN, and Random Forest;
Performance analysis using precision, recall, F1 score, and accuracy;
End;
The first step of the machine learning workflow is choosing the dataset we intend to
work upon. These datasets are described in Table 3.1. After the dataset has been
chosen, the data needs to be normalized or scaled to a particular range. This is done
because one attribute may range between 1 and 100 while another ranges from
10,000 to 50,000, and this type of variation in ranges would affect our results. Hence,
we need to scale the data down to a particular range by setting an upper limit and a
lower limit. After the scaling has been done, the dataset is divided into two parts, a
training dataset and a testing dataset, generally in the ratio of 70:30 or 80:20. The
training dataset, as the name suggests, is used to train our machine, and the testing
dataset is used to test our algorithm and how efficiently it works. This step is also
called data splitting. The overall workflow of our proposed approach is shown in
Figure 3.1.
FIGURE 3.1 The overall flow chart of the proposed FS-PeSOA algorithm.
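A minimal sketch of this preprocessing step, assuming scikit-learn is available, is shown below; the placeholder data stands in for any of the UCI datasets of Table 3.1, and the scaling range and random seed are assumptions.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Scale every attribute to a common range and split the dataset 80:20.
X = np.random.default_rng(3).random((150, 4)) * 100.0     # placeholder feature matrix
y = np.random.default_rng(4).integers(0, 3, 150)          # placeholder class labels

X_scaled = MinMaxScaler().fit_transform(X)                 # scale each attribute to [0, 1]
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42)           # 80:20 training/test split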
Now, after the data gets split into two parts, we will execute our proposed algorithm
for feature subset selection, and the best features are selected out. This step is known
as feature subset selection, which has been explained in the Introduction part.
Briefly, it means selecting the minimum number of features by which our machine
can identify which class a particular data belongs to. After feature subset selection
has been done, several classifiers are used like SVM, KNN, Random Forest, etc. for
further classification process. However, after the completion of the training phase,
we would like to test our algorithm with the testing dataset, which we kept aside and
will be used for performing the same steps again for classification.
After the classification is done, we check how accurate our results are and how well
the algorithm performs when compared with other benchmark feature selection
algorithms.
In the “Result Analysis” section, the comparison factors are accuracy, precision,
recall, and F1 score, and the benchmark feature selection algorithms which we have
compared are LDA and PCA.
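The evaluation step can be sketched as follows, again assuming scikit-learn with default classifier settings; the synthetic data and the selected feature indices (standing in for the subset returned by FS-PeSOA) are assumptions made only for illustration.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Feed a (hypothetical) selected feature subset to KNN, SVM and Random Forest
# and report accuracy, precision, recall and F1 score on the held-out test set.
rng = np.random.default_rng(5)
X = rng.random((200, 6))
y = (X[:, 0] + X[:, 2] > 1.0).astype(int)                  # labels depend on features 0 and 2
selected = [0, 2]                                           # hypothetical FS-PeSOA subset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for name, clf in {"KNN": KNeighborsClassifier(),
                  "SVM": SVC(),
                  "Random Forest": RandomForestClassifier(random_state=42)}.items():
    clf.fit(X_train[:, selected], y_train)
    pred = clf.predict(X_test[:, selected])
    p, r, f1, _ = precision_recall_fscore_support(y_test, pred, average="weighted", zero_division=0)
    print(f"{name}: acc={accuracy_score(y_test, pred):.2f} P={p:.2f} R={r:.2f} F1={f1:.2f}")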
3.3.2 Discussion
3.3.2.1 Hunting Strategy of Penguins
The hunting strategy of penguins is theorized in [10]. This hypothesis states that
their hunting strategy may be explained in economical terms. Penguins are
biological beings with a definite capacity; by capacity it is meant that they have a
definite amount of oxygen and strength available for their search for food. During
the process of finding food for their survival, they dive into the water to find fish.
They behave economically in where they search for food: they search in places
where the amount of food found is on par with the energy spent to find it.
Moreover, they have a limited amount of oxygen left
with them that limits the amount of time they can continue with the total hunting
procedure. They do dive in the water and look for the available food and consume
them, and when they come back to the surface, they again communicate with the
whole group about the amount of food found and at which location, and this commu-
nication among the penguins plays a vital role in this whole searching of food by the
penguins. Whenever they communicate among them about the data of the location
and quantity of food found, the penguins that have found a lesser amount of food
tend to travel to the location that has been reported to have more amount of food in
comparison to others. However, we have tried to visualize this hunting strategy of
penguins to optimize our searching technique for an optimal subset of all features of
the whole dataset in this chapter. We have tried to visualize the amount of food found
by the penguins as the goodness or the fitness of features the penguins travel to. All
the penguins work as a whole unit to deliver a single objective, to find the maximum
amount of food that they can find with a limited number of resources. There is a
cycle where they go in search of food and come back with food and communicate
with the other penguins with the data of the food and location, and then the penguins
travel to other locations, and then they again go in search of food. This whole cycle
goes on and on, until and unless they have found the required amount of food.
3.3.2.2 Fitness Function Evaluation
The fitness of a feature is evaluated using an eigenvalue-based measure, as described
in [17]. The eigenvalue works on the principle of calculating the variance of a
particular feature over all the records of that feature. The basic idea is that the more
the values of a particular feature are scattered over a wide range, the more fit the
feature is for being used for classification.
Whenever a penguin visits a particular feature, the eigenvalue function is used to
get the variance of the feature, and this is its fitness. Based on this fitness value, the
features are selected according to a cut-off criterion. Now, suppose we have a dataset
{x(i), i = 1, 2, 3, …, m} of m different features, so the data lies in an m-dimensional
subspace, approximately along the diagonal of this m-dimensional subspace. Now,
we find the mean of the data using Equation (3.1) as follows [17]:
µ = (1/m) Σi=1..m x(i) (3.1)
Replace each x(i) with x(i) − µ; this normalizes the data to have zero mean. (This step
may be omitted if the data already has zero mean.) Now, we find the variance of the
data using Equation (3.2).
σj² = (1/m) Σi (xj(i))² (3.2)
Replace each xj(i) with xj(i)/σj; this scales the data to a comparable range and
normalizes it with respect to the whole dataset and all features. This is the final step
of normalization. In this way, we obtain the variance of the data [17].
This resultant variance is the score of the feature over the set of all data records.
It represents how the data is scattered between the minimum and maximum values of
the feature, which determines whether the feature is suitable for classification.
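One plausible reading of this fitness computation is sketched below: features are first brought to a comparable scale, and the spread (variance) of each scaled feature is used as its fitness, with a cut-off deciding which features qualify. The scaling choice, the cut-off, and the function names are assumptions for illustration, not the chapter's exact code.

import numpy as np

# Variance-based feature fitness: scale each feature to a common [0, 1] range,
# use the variance of the scaled feature as its score, and keep the features
# whose score passes a (hypothetical) cut-off.
def feature_fitness(X):
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)   # guard against constant features
    X_scaled = (X - mins) / span                     # common range per feature
    return X_scaled.var(axis=0)                      # spread of the scaled feature = fitness

X = np.random.default_rng(6).random((100, 5)) * np.array([1, 10, 100, 1000, 5])
scores = feature_fitness(X)
selected = np.where(scores >= np.median(scores))[0]  # hypothetical fitness cut-off
print(np.round(scores, 3), selected)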
Based on this position update logic, the penguins will keep updating their position
until they find the optimal amount of food.
3.3.2.4 Oxygen Update Logic
The oxygen reserve controls how long the penguins continue their search for food.
We have a predefined value of oxygen that is the same for all the penguins at the
beginning, and at the end of each iteration the amount of oxygen is reduced by a
fixed value, so that after a definite number of iterations the search comes to an end.
This can also be considered as the number of generations our iteration will go on
until an optimal solution is reached.
3.4 RESULT ANALYSIS
The implementation of the algorithm has been done in the Python programming
language using the Anaconda distribution and Jupyter Notebook, and the experi-
ments are run on a computer with 4 GB RAM, an Intel Core i3
processor, and a 500 GB hard disk. Seven real-world “UCI Machine
Learning Repository”-approved datasets are used to assess the efficiency of our pro-
posed algorithm [18]. Some of them have about 4–5 features and some have about
30–40 features, which make it appropriate for us to perform the feature subset selec-
tion. These datasets are also diverse in the context of number of classes and samples.
These datasets have been represented in Table 3.1.
These datasets include Ion, Pima, Iris, Vehicle, Wisconsin, Glass, and Wine.
Using our proposed algorithm, i.e., FS-PeSOA, we have performed the feature subset
selection, and then used KNN, Random Forest, and SVM classifiers to perform the
classification. We have also tested the datasets using other algorithms like PCA and
LDA, and then we compared the results hoping to get better performance.
These are the UCI-approved datasets that have been used to check the efficiency
of our algorithm. According to the UCI repository of machine learning [18], Iris is
the flower dataset that is perhaps the best-known database to be found in the pattern
recognition field. It has three classes and four attributes, and the attributes include
sepal length, petal length, sepal width, and petal width. The classes are Iris Setosa,
Iris Versicolour, and Iris Virginica.
Ion dataset or ionosphere dataset is a collection of radar data collected in Goose
Bay, Labrador. The system has a phased array of 16 high-frequency antennas with
a transmitting power on the order of 6.4 kW, and the targets are free electrons that
are present in the ionosphere. This dataset consists of 33 attributes and two classes,
namely good radar and bad radar.
TABLE 3.1
List of Datasets from UCI [18] which Are Tested Using PeSOA
Datasets No. of Observations No. of Classes
Iris 150 3
Glass 214 6
Ion 351 2
Pima 768 2
Vehicle 846 4
Wine 178 3
Wisconsin 569 2
Good radar is that which shows some structure in the ionosphere, and the bad
radar is that whose signals cannot pass through the ionosphere. Pima dataset is col-
lected by a survey done on 768 people by the National Institute of Diabetes and
Digestive and Kidney Diseases. The motive of the dataset or the machine learning
part of the dataset is to find out whether a particular patient has diabetes or not
on the basis of various diagnostic measurements. In the dataset, the survey is done
on Indian females who are above 21 years and are of Pima Indian Heritage. This
dataset has two classes: either the patient is suffering or not. According to the “UCI
repository of Machine Learning,” the wine dataset contains the results of a chemi-
cal analysis of wines. It has three classes and multiple attributes like fixed acidity,
volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total
sulfur dioxide, density, pH, sulfates, alcohol, etc. The number of features before
subset selection was 13 and that after feature subset selection was reduced to 4.
According to the UCI Repository, the vehicle dataset classifies a given silhouette
as one of four types of vehicles with 18 features, some of which are Compactness,
Circularity, Radius Ratio, Elongatedness, etc. This data was originally gathered in
1986–1987 by J.P. Siebert. According to the UCI Repository of Machine Learning,
the study of classification of types of glass is motivated by a criminological inves-
tigation. At the scene of the crime, the glass left can be used as evidence. It has six
classes and multiple attributes such as Id number: 1–214, refractive index, Sodium
(unit measurement: weight percent in corresponding oxide, as are attributes 4–10),
Magnesium, Aluminum, Silicon, Potassium, Calcium, Barium, Iron, Type of glass:
(class attribute): building_windows_float_processed,building_windows_non_float_
processed,vehicle_windows_float_processed, vehicle_windows_non_float_processed
(none in this database), containers, tableware, and headlamps. The number of features
before subset selection was 10 and that after feature subset selection was reduced to 4.
According to the UCI Repository of Machine Learning, the features are computed
from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe
the characteristics of the cell nuclei present in the image. The Wisconsin dataset has
two classes and multiple attributes like ID number, Diagnosis (M = malignant, B =
benign), and ten real-valued features that are computed for each cell nucleus like radius
(mean of distances from center to points on the perimeter), texture (standard deviation
of grayscale values), perimeter, area, smoothness (local variation in radius lengths),
compactness (perimeter^2/area − 1.0), concavity (severity of concave portions of the
contour), concave points (number of concave portions of the contour), symmetry, and
fractal dimension (“coastline approximation” − 1). The number of features before sub-
set selection was 30 and that after feature subset selection was reduced to 6. Table
3.2 introduces us to the results, i.e., Accuracy, Precision, Recall, and F1 score of our
algorithm when tested upon the datasets in Table 3.1. The table also explains the total
number of features that were present before and after the feature subset selection. The
table gives detailed results when the different classifiers are used.
Figure 3.2 shows the performance of accuracy, precision, recall, and the F1 score
of a Glass dataset under KNN, Random Forest, and SVM, when previously operated
by PCA, LDA, and our proposed FS-PeSOA algorithm. The following graph is of
Ion dataset and its results.
TABLE 3.2
Proposed Algorithm Result Analysis
Total Feature
Dataset Before After Classification Accuracy Precision Recall F1 Score
Iris 4 1 KNN 90 0.94 0.90 0.90
4 1 Random Forest 90 0.94 0.90 0.90
4 1 SVM 90 0.94 0.90 0.90
Glass 10 4 KNN 81.39 0.82 0.81 0.80
10 4 Random Forest 79.06 0.81 0.79 0.79
10 4 SVM 81.39 0.80 0.81 0.79
Ion 33 8 KNN 95.91 0.88 0.86 0.85
33 8 Random Forest 91.54 0.92 0.92 0.91
33 8 SVM 91.54 0.93 0.92 0.91
Pima 8 1 KNN 59.74 0.58 0.60 0.59
8 1 Random Forest 64.28 0.64 0.64 0.64
8 1 SVM 65.58 0.62 0.66 0.60
Vehicle 18 5 KNN 57.05 0.58 0.57 0.57
18 5 Random Forest 52.35 0.52 0.52 0.52
18 5 SVM 58.82 0.59 0.59 0.58
Wine 13 4 KNN 69.44 0.69 0.69 0.69
13 4 Random Forest 55.55 0.64 0.56 0.57
13 4 SVM 38.88 0.22 0.39 0.28
Wisconsin 30 6 KNN 79.04 0.83 0.82 0.84
30 6 Random Forest 84.61 0.79 0.81 0.82
30 6 SVM 95.61 0.96 0.96 0.96
FIGURE 3.2 Performance analysis of the proposed FS-PeSOA with other algorithms on a
Glass dataset.
Figure 3.3 shows the performance of accuracy, precision, recall, and the F1 score
of an Ion dataset under KNN, Random Forest, and SVM when previously operated
by PCA, LDA, and our proposed FS-PeSOA algorithm. The following graph is of
Pima dataset and its results.
Figure 3.4 shows the performance of accuracy, precision, recall, and the F1 score
of a Pima dataset under KNN, Random Forest, and SVM when previously operated
by PCA, LDA, and our proposed FS-PeSOA algorithm. The following graph is of
Wine dataset and its results.
Figure 3.5 shows the performance of accuracy, precision, recall, and the F1 score
of a Wine dataset under KNN, Random Forest, and SVM when previously operated
by PCA, LDA, and our proposed FS-PeSOA algorithm. The following graph is of
Wisconsin dataset and its results.
FIGURE 3.3 Performance analysis of the proposed FS-PeSOA with other algorithms on an
Ion dataset.
FIGURE 3.4 Performance analysis of the proposed FS-PeSOA with other algorithms on a
Pima dataset.
FIGURE 3.5 Performance analysis of the proposed FS-PeSOA with other algorithms on a
Wine dataset.
FIGURE 3.6 Performance analysis of the proposed FS-PeSOA with other algorithms on a
Wisconsin dataset.
Figure 3.6 shows the performance of accuracy, precision, recall, and the F1 score
of a Wisconsin dataset under KNN, Random Forest, and SVM when previously oper-
ated by PCA, LDA, and our proposed FS-PeSOA algorithm. The following graph is
of Vehicle dataset and its results.
The Figure 3.7 shows the performance of accuracy, precision, recall, and the F1
score of a Vehicle dataset under KNN, Random Forest, and SVM when previously
operated by PCA, LDA, and our proposed FS-PeSOA algorithm. These are the
comparison graphs of different algorithms and our FS-PeSOA algorithm. We have
gathered a comparison result in Table 3.3, which gives a data representation for these
graphs.
FIGURE 3.7 Performance analysis of the proposed FS-PeSOA with other algorithms on a
Vehicle dataset.
Tables 3.3 and 3.4 show that our proposed FS-PeSOA algorithm generates the
most number of “Win”s for supervised feature selection. To summarize, in the con-
text of the seven UCI datasets used in the experiments,
• FS-PeSOA gives better average accuracy than LDA and PCA algorithms.
• FS-PeSOA has more “Win”s than any other algorithm (refer to
Table 3.4).
TABLE 3.3
Comparison Based on Accuracy, Precision, Recall, and F1 Score Parameters
Dataset Algorithm Accuracy Precision Recall F1 Score
A
Iris KNN with PCA 80 0.90 0.80 0.80
KNN with LDA 60 0.47 0.60 0.50
KNN with FSPeSOA 90 0.94 0.90 0.90
Random Forest with PCA 77.77 0.89 0.78 0.77
Random Forest with LDA 60 0.47 0.60 0.50
Random Forest with FSPeSOA 91.22 0.94 0.90 0.90
SVM with PCA 80 0.90 0.80 0.80
SVM with LDA 60 0.47 0.60 0.50
SVM with FSPeSOA 90 0.94 0.90 0.90
Glass KNN with PCA 64.61 0.62 0.65 0.62
KNN with LDA 53.84 0.67 0.54 0.59
KNN with FSPeSOA 81.39 0.82 0.81 0.80
Random Forest with PCA 66.15 0.66 0.66 0.65
B
Vehicle KNN with PCA 43.70 0.45 0.40 0.43
KNN with LDA 38.58 0.48 0.39 0.36
KNN with FSPeSOA 57.05 0.58 0.57 0.57
Random Forest with PCA 30.31 0.32 0.30 0.30
Random Forest with LDA 35.82 0.36 0.36 0.35
Random Forest with FSPeSOA 52.35 0.52 0.52 0.52
SVM with PCA 38.18 0.30 0.38 0.33
SVM with LDA 30.31 0.14 0.30 0.19
SVM with FSPeSOA 58.82 0.59 0.59 0.58
Wine KNN with PCA 80 0.90 0.80 0.80
KNN with LDA 64.4 0.72 0.64 0.65
KNN with FSPeSOA 69.44 0.69 0.69 0.69
Random Forest with PCA 75.5 0.81 0.76 0.76
Random Forest with LDA 71.1 0.77 0.71 0.72
Random Forest with FSPeSOA 55.55 0.64 0.56 0.57
SVM with PCA 80 0.90 0.80 0.80
SVM with LDA 20 0.04 0.20 0.07
SVM with FSPeSOA 38.88 0.22 0.39 0.28
TABLE 3.4
Win–Loss Ratio of Algorithms for Different Datasets
Dataset Classifier Accuracy Precision Recall F1 Score
Iris KNN Win Win Win Win
Random Forest Win Win Win Win
SVM Win Win Win Win
Glass KNN Win Win Win Win
Random Forest Win Win Win Win
SVM Win Win Win Win
Ion KNN Win Win Win Win
Random Forest Win Win Win Win
SVM Win Win Win Win
Pima KNN
Random Forest Win Win Win
SVM
Vehicle KNN Win Win Win Win
Random Forest Win Win Win Win
SVM Win Win Win Win
Wine KNN
Random Forest
SVM
Wisconsin KNN
Random Forest
SVM Win Win Win
Win/Loss 14 Win/7 Loss 12 Win/9 Loss 14 Win/7 Loss 14 Win/7 Loss
3.5 CONCLUSIONS
In this chapter, a new supervised feature selection approach based on penguin search optimization has been presented. We have translated the foraging workflow of penguins into an algorithm and applied it to the task of feature selection on different datasets. In terms of performance, experiments using seven publicly available datasets have shown that the proposed FS-PeSOA algorithm gives better results than the full feature set and the benchmark feature selection algorithms; it has been compared against PCA and LDA for supervised feature selection. As future work, the proposed algorithm can be used for science and development purposes, and in biomedical research areas where data analysis is required to identify the patterns of different diseases.
REFERENCES
1. Stewart, S., & Thomas, M. (2007). Eigenvalues and eigenvectors: Formal, symbolic,
and embodied thinking. In The 10th Conference of the Special Interest Group of the
Mathematical Association of America on Research in Undergraduate Mathematics
Education (pp. 275–296), San Diego, California.
2. Tibrewal, B., Chaudhury, G. S., Chakraborty, S., & Kairi, A. (2019). Rough set-based
feature subset selection technique using Jaccard’s similarity Index. In Proceedings of
International Ethical Hacking Conference 2018 (pp. 477–487). Springer, Singapore.
3. Goswami, S., Das, A.K., Guha, P. et al. (2017). An approach of feature selection
using graph-theoretic heuristic and hill climbing. Pattern Analysis and Applications,
Springer. doi:10.1007/s10044-017-0668-x.
4. Goswami, S., Das, A.K., Guha, P. et al. (2017). A new hybrid feature selection approach
using feature association map for supervised and unsupervised classification. Expert
Systems with Applications, Elsevier, 88, 81–94. doi:10.1016/j.eswa.2017.06.032.
5. Al-Kassim, Z., Memon, Q. (2017). Designing a low-cost eyeball tracking keyboard for
paralyzed people. Computers & Electrical Engineering, 58, 20–29.
6. Ng, A. (2000). CS229 Lecture notes. CS229 Lecture Notes, 1(1), 1–3.
7. Goswami, S., Chakraborty, S., Guha, P., Tarafdar, A., & Kedia, A. (2019). Filter-based
feature selection methods using hill climbing approach. In Natural Computing for
Unsupervised Learning (pp. 213–234). Springer, Cham.
8. Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection.
Journal of Machine Learning Research, 3(Mar), 1157–1182.
9. Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial
Intelligence, 97(1–2), 273–324.
10. Gheraibia, Y., & Moussaoui, A. (2013, June). Penguins search optimization algo-
rithm (PeSOA). In International Conference on Industrial, Engineering and Other
Applications of Applied Intelligent Systems (pp. 222–231). Springer, Berlin, Heidelberg.
11. Chandrasekhar, G., & Sahin, F. (2014). A survey on feature selection methods.
Computers & Electrical Engineering, 40(1), 16–28.
12. Al-Ani, A. (2005). Feature subset selection using ant colony optimization. International
Journal of Computational Intelligence, 2(1), 53–58.
13. Sahu, B., & Mishra, D. (2012). A novel feature selection algorithm using particle swarm
optimization for cancer microarray data. Procedia Engineering, 38, 27–31.
14. Mafarja, M., & Mirjalili, S. (2018). Whale optimization approaches for wrapper feature
selection. Applied Soft Computing, 62, 441–453.
15. Rashedi, E., & Nezamabadi-pour, H. (2014). Feature subset selection using improved
binary gravitational search algorithm. Journal of Intelligent & Fuzzy Systems, 26(3),
1211–1221.
16. Parsopoulos, K. E., & Vrahatis, M. N. (2002). Particle swarm optimization method for
constrained optimization problems. Intelligent Technologies–Theory and Application:
New Trends in Intelligent Technologies, 76(1), 214–220.
17. Kotsiantis, S. B., Zaharakis, I., & Pintelas, P. (2007). Supervised machine learning: A
review of classification techniques. Emerging Artificial Intelligence Applications in
Computer Engineering, 160, 3–24.
18. Lichman, M., & Bache, K. (2013). UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. [Online]. Available: http://archive.ics.uci.edu/ml.
4 A Physical Design
Strategy on a
NoSQL DBMS
Marcos Jota, Marlene Goncalves,
and Ritces Parra
Universidad Simón Bolívar
CONTENTS
4.1 Introduction..................................................................................................... 69
4.2 Motivation Example........................................................................................ 70
4.3 Neo4j................................................................................................................ 73
4.4 Design Guidelines............................................................................................ 74
4.5 Physical Design................................................................................................ 76
4.5.1 Query Rewriting Using Path Redundancy Pattern.............................. 77
4.5.2 Query Rewriting Using Minimal Query Pattern................................. 79
4.5.3 Path Materialization............................................................................80
4.5.4 Index Creation.....................................................................................80
4.6 Experimental Study......................................................................................... 81
4.6.1 Experimental Design........................................................................... 81
4.6.2 Impact of the Proposed Physical Design on Query Performance
for a 1 GB Database............................................................................ 82
4.6.3 Impact of the Proposed Physical Design on Query Performance
for a 10 GB Database.......................................................................... 85
4.6.4 Impact of the Proposed Physical Design on Query Performance
for a 100 GB Database........................................................................ 87
4.7 Related Work................................................................................................... 88
4.8 Discussion........................................................................................................90
4.9 Future Research Directions.............................................................................90
4.10 Conclusion....................................................................................................... 91
References................................................................................................................. 91
4.1 INTRODUCTION
Massive amounts of data are being generated by users and electronic devices every day. Sensors connected to electronic devices provide an enormous amount of accurate information in real time, while the data generated by users on social networks produce valuable information about them. In the era of big data, organizations need to gather and efficiently manage such amounts of data for their decision-making processes.
Currently, graph technology is becoming increasingly important, and graphs are used to model dynamic and complex relationships in the data in order to generate knowledge. For example, Google has incorporated a Knowledge Graph into its search algorithm to significantly improve semantic searches; the searches are disambiguated and contextualized [1].
Although data have traditionally been stored in relational databases, these systems have serious scalability problems in information management [2]. Thus, a new paradigm, called NoSQL (not only SQL) databases, has emerged. NoSQL databases provide better scalability because it is easier to add, delete, or perform operations in the system without affecting performance. In particular, Neo4j is a graph-oriented NoSQL database management system (DBMS). In this chapter, our main objective is to introduce a physical design strategy that improves query execution for a specific workload in Neo4j. A physical design consists of proposing several data structures or guidelines in order to improve data access in a database. It is worth noting that the structures used for a physical design depend on the DBMS. Therefore, the first step of a physical design is to acquire a good knowledge of the DBMS. We selected Neo4j because it is a pioneering DBMS in the use of graphs as the physical structure for storing data.
This chapter is structured as follows: Section 4.2 presents a motivation exam-
ple for our problem of physical design in a graph-based DBMS, Neo4j. Section 4.3
describes Neo4j as a graph-based DBMS and its execution plans and physical oper-
ators. Section 4.4 proposes a set of guidelines for a physical design on databases
stored in Neo4j. Section 4.5 explains the application of our physical design guide-
lines for some example queries. The performance of the proposed strategy will be
empirically evaluated in Section 4.6. Related work is presented in Section 4.7. Some
limitations of our guidelines are discussed in Section 4.8. Finally, the future research
directions and conclusion of this chapter will be discussed in Sections 4.9 and 4.10,
respectively.
4.2 MOTIVATION EXAMPLE
Suppose a movie database contains 171 nodes. Each node can be a movie or a person; there are 38 movie nodes and 133 person nodes. Additionally, this database has six relationship types between these nodes. Figure 4.1 introduces the schema of our example database, which includes one loop (a Person-to-Person relationship) and five relationship types between Person and Movie. Not all nodes of a given type have to participate in the relationships shown in the schema of Figure 4.1.
In addition, consider a query to find information about the movie “The Matrix”. We can specify this query in the Cypher language, as shown in Figure 4.2.
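As a rough sketch of such a query (assuming the movie title is stored in a title property), a label-free lookup in Cypher could be written as follows:

    // Look up the movie by title without restricting the node label,
    // which forces Neo4j to inspect every node in the graph.
    MATCH (m)
    WHERE m.title = 'The Matrix'
    RETURN m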
The query in Figure 4.2 was executed, and the execution plan retrieved by Neo4j is presented in Figure 4.3. We can observe in Figure 4.3 that the first operator, AllNodeScan, performs a scan over every node in the graph; the Filter operator then evaluates the query condition, selecting those results that meet it; and the ProduceResults operator finally returns the results to the user. It is noteworthy that if the database comprised thousands or millions of nodes instead of 171, the query in Figure 4.2 would be far more costly.
We can also help the Neo4j planner by incorporating additional information into our query. Since we know that the node is a Movie node rather than a Person node, we can pose the query so that it only searches the nodes of the Movie type. Figure 4.4 illustrates an example query where the type of node is specified in the match clause.
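A corresponding sketch of the label-restricted form (again assuming a title property) is:

    // Restricting the pattern to Movie nodes lets the planner use
    // NodeByLabelScan instead of scanning every node in the graph.
    MATCH (m:Movie)
    WHERE m.title = 'The Matrix'
    RETURN m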
When the type of node is incorporated into the query, a better plan is obtained. In Figure 4.5, it can be noted that the NodeByLabelScan operator is applied rather than AllNodeScan. Since the NodeByLabelScan operator is only executed on Movie nodes, it produces 38 estimated nodes, i.e., it discards approximately 78% of the nodes in its search.
Intuitively, a small sample of physical design strategies has been introduced in this section as an example in Neo4j [3]. A physical design produces an implementation-level description of a database: the storage structures and methods used for efficient access to the data [4]. An adequate physical design is the key to quickly accessing the information. It depends on specific aspects of the DBMS on which it is implemented, and important clues must be followed to design an optimal physical database. Thus, we must know the internal workings of our DBMS well in order to understand what is happening and how we can optimize a query.
FIGURE 4.4 Query for finding the movie matrix using the node label.
FIGURE 4.5 Plan generated for executing a query using node labels.
In our motivating example, a rewritten query improved its performance; however, not all classic physical design strategies are successful in Neo4j. In particular, Neo4j offers a native index whose implementation is not clearly specified and which is limited to certain data types such as spatial, temporal, and numeric values. For the other data types, Neo4j uses Lucene, which supports inverted indexing. Neo4j indexing differs from traditional systems such as PostgreSQL and Oracle, which include several types of indexes, including B-Trees. In consequence, Neo4j does not have a simple way of improving range comparisons, because a B-Tree is usually the recommended strategy for this purpose; Neo4j supports only a limited use of indices for range comparisons [5]. In addition, Neo4j lacks collection operators; it only has UNION and UNION ALL operators, which reduces the possibilities of rewriting a query to avoid a certain operator.
Despite these limitations, Neo4j also presents different ways to manipulate data; e.g., it introduces a list data structure in order to manipulate data in several ways. In this sense, a new challenge is created for database administrators, who have to adapt to new technologies, including NoSQL DBMSs. Additionally, physical database design is a fundamental phase in ensuring good performance for database applications. When a database application grows in scale, the need to improve query performance also increases. If query performance begins to worsen, it is necessary to adequately handle significant amounts of information and efficiently execute different operations on the data.
Our goal in this chapter is to propose a physical database design strategy that improves query performance for the Linked Data Benchmark Council (LDBC) Social Network Benchmark (SNB) [6] using a graph NoSQL DBMS, Neo4j. Neo4j is one of the main NoSQL graph DBMSs; graph databases are widely used because graphs are the most natural way to understand any system that surrounds us and to represent its connections. For example, Facebook and Google support their businesses by means of graphs: Facebook is built on the value of relationships between people, capturing all these connections in a social graph [7], while Google improves its searches using a Knowledge Graph [1].
4.3 NEO4J
Neo4j is an open-source, native graph DBMS that provides an ACID (Atomicity,
Consistency, Isolation, Durability)-compliant transactional backend [8]; its source
code is written in Java and Scala. Particularly, Neo4j efficiently implements a
property graph model. A property graph is built using nodes and edges, but it has
two additional definitions: properties and labels. A property is an attribute of a node or a relationship. A label is used to tag a group of nodes or relationships in order to represent their role in the model. An example of a property graph can be observed in [8].
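As a minimal, hypothetical illustration of these concepts (the Person label, the KNOWS relationship type, and the property names below are invented for the example and are not part of the chapter's dataset), the following Cypher statement creates two labeled nodes joined by a relationship that carries its own property:

    // Two Person nodes with a name property, connected by a KNOWS
    // relationship that itself holds a since property.
    CREATE (a:Person {name: 'Alice'})-[:KNOWS {since: 2010}]->(b:Person {name: 'Bob'})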
Additionally, Neo4j is able to generate a physical or execution plan for a query. An
execution plan is a tree-like structure composed of operators [9]. These operators can
be classified into the following five categories:
• Expanding operators: They are used to traverse the graph. Given a set of nodes, they traverse the relationships, outgoing or incoming, and retrieve the end nodes.
• Node-retrieving operators: They retrieve the starting point of a search in a graph. If an index is not specified, a scan over all nodes is performed; otherwise, an index is used. These operators are leaf nodes in an execution plan.
• Filter operators: They select a subset of data. For example, removing all
nodes that don’t fulfill a certain condition or removing all nodes except the
first X nodes. These operators are usually found between operators.
• Join operators: They integrate the result of two branches with already
expanded paths. These operators are performed when the execution plan
has to merge two branches to produce one result.
• Aggregation operators: These operators correspond to usual aggregation
operations such as count, sum, average, etc.
It can be noted that there are other types of operators for writing, such as create,
update, or delete, but they are not considered in this chapter; additional details of
Neo4j operators can be found in [9].
To generate an execution plan, Neo4j requires certain statistics to determine which plan is best according to the planner. These statistics are the number of nodes having a certain label, the number of relationships by type, the selectivity per index, and the number of relationships by type grouped by the label of the ending or starting node [9]. These statistics are then applied to estimate the number of records that each operator is expected to produce. Once the best plan is built by the planner, it is executed. In addition, the execution plan is stored in cache to prevent unnecessary recalculation. However, the execution plan is generated again if the computed statistics change.
Once a plan is executed, Neo4j shows the actual records for each operator and an abstract unit called DB (Database) Hits. A DB Hit is an abstract unit of storage engine work and could correspond to looking up a node, property, or relationship. Since the number of DB Hits is roughly proportional to the time it takes to execute the query, one of the main objectives should be to minimize DB Hits.
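As a sketch of how these figures can be inspected in practice, prefixing a query with the PROFILE keyword makes Neo4j execute it and report, for each operator of the plan, the rows produced and the DB Hits incurred (the pattern below reuses the hypothetical movie example; the ACTED_IN relationship type is an assumption):

    // Execute the query and report rows and DB Hits per operator.
    PROFILE
    MATCH (m:Movie {title: 'The Matrix'})<-[:ACTED_IN]-(p:Person)
    RETURN p.name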
Finally, it is important to note that some considerations must be taken into account to improve the execution time of a query. The operators of an execution plan, the number of records each operator processes and outputs, and the number of DB Hits generated while executing a plan are clues that help a database administrator carry out the physical design and reduce the processing time of a query.
4.4 DESIGN GUIDELINES
A good physical design can be achieved if the designer understands how the DBMS works, especially the query processing techniques it offers. The following guidelines for a physical design in Neo4j summarize techniques that can be applied by a database designer:
4.5 PHYSICAL DESIGN
Before starting a physical design of a database, its data must be analyzed in
conjunction with the queries that are frequently executed, in order to propose strate-
gies and structures that can improve the processing time of its queries. Thus, the
dataset selected for our physical design was generated from the SNB [11] of the
LDBC. The SNB provides a data generator that produces synthetic social network datasets. In this experimental study, we work with the Interactive Workload and the Business Intelligence Workload (BIW); the latter is the only workload of this benchmark developed for Cypher, and it consists of analytical queries. The specification for each of the queries of these workloads can be found in [12,13].
Based on the choke points, we selected 13 of the 25 queries of the Interactive Workload and BIW. A choke point is an aspect observed during the evaluation of queries and corresponds to an opportunity for physical design to improve the query processing time [13]. Moreover, the choke points may be classified in terms of aggregation performance, join performance, data access locality, expression calculation, correlated subqueries, parallelism and concurrency, and RDF (Resource Description Framework) and graph specifics.
Each of the queries of the Interactive Workload and BIW is associated with at least one choke point. In this work, we only consider 22 choke points, although the benchmark subsequently changed, adding six additional choke points.
The process for selecting 13 queries from the Interactive Workload and BIW consisted of randomly choosing a combination of 13 queries from the set of 25. It is noteworthy that the number of such combinations, C(25, 13) = 5,200,300, is very large. In this sense, we applied two criteria to reduce the total number of combinations. These criteria are the following: (i) for each choke point, there must be at least one query associated with it; (ii) there cannot be a choke point that is associated with all of the queries in a combination.
Since we want to test the design guidelines against all the choke points available,
we defined the first criterion. Thus, we decided that the final combination must asso-
ciate each choke point with at least one query. Figure 4.6 illustrates a table that contains the choke points for each query of the selected combination: BI 3, BI 4, BI 6, BI 7, BI 9, BI 12, BI 14, BI 16, BI 18, BI 19, BI 21, BI 22, and BI 23, where the BI acronym means business intelligence, as assigned by the SNB. In Figure 4.6,
each row represents a query, and each column corresponds to a choke point. It can
be noted that each of the columns has at least one dot, i.e., each choke point is in this
combination.
The second criterion was defined to avoid having one choke point shared by all of the queries in a combination. To test the physical design guidelines in several situations, we avoided having a common characteristic among all of the queries in a combination. We can observe in Figure 4.6 that no column is completely filled with dots.
With these two criteria, the total number of combinations was reduced to 276,194. Finally, for our experimental study, we randomly chose one of these combinations.
FIGURE 4.6 Choke points associated with each query for a combination. (Source: Ref. 12.)
following the selected month and year. Thus, we rewrite the query BI 3 to expand the relationship between Message and Tag only once by means of a case expression. The lines of the query BI 3 associated with the resulting rewriting are shown in Figure 4.8, where the common path in the two optional match clauses is highlighted in bold.
Since the version of Neo4j that we are using did not support date formats, the date was stored as a string. Therefore, the expression message.creationDate/100,000,000,000 = (year1 * 100) + month1 corresponds to an operation that extracts the month and year according to their position in the string in order to compare them.
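As a minimal sketch of this kind of test (the Message label, the parameter names, and the use of toInteger are assumptions made here for illustration; the actual BI 3 rewriting in Figure 4.8 differs), a date encoded as a 17-digit yyyyMMddHHmmssSSS number divided by 10^11 keeps only its yyyyMM prefix, which can then be compared with the requested year and month:

    // Keep only the messages created in the requested year and month.
    MATCH (message:Message)
    WHERE toInteger(message.creationDate) / 100000000000 = ($year1 * 100) + $month1
    RETURN count(message)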
Finally, we discarded the third guideline. In particular, the query BI 3 has optional match clauses; since this kind of clause does not support indexes, the third guideline cannot be applied.
4.5.3 Path Materialization
We materialized one of the paths of the query BI 4, considering the second guideline
for this case. The query BI 4 returns all the forums created in a given country that contain at least one post with a tag of a given tag class (TagClass); the forum location is taken to be the location of the forum moderator. Its execution plan is characterized by (i) two
scan operators on the nodes Country and TagClass; (ii) six expand operators on the
relationship between Country and City, City and Person, Person and Forum, Forum
and Post, Post and Tag, and Tag and TagClass; (iii) three filter operators on the nodes
Country, Tag, and TagClass; (iv) a hash join between the Tag obtained from the path
between Post and Tag and the Tag retrieved from the path between Tag and TagClass;
(v) an aggregation operator to count the number of posts; (vi) a projection and a sort
operator by two properties; (vii) a Top operator to return the first 20 results.
It is important to note that the execution plan of the query BI 4 has a filter
operator on a unique key, the country name. The country name is a good candidate
for an index creation following the first guideline, because one country node will
be retrieved instead of scanning 112 country nodes. Nevertheless, an index on the
country name does not significantly improve the query performance because the
scan on all country nodes is cheap. Another index that can be defined for this query is an index on the name property of the TagClass node. However, the execution time of the query BI 4 worsened considerably when we created an index on TagClass nodes, since the engine then starts to expand from the TagClass nodes rather than from the country ones, even though the number of TagClass nodes is much higher than the number of countries and despite having created an index on the country nodes. One aspect to be highlighted is that the index on TagClass can be created to improve the queries BI 9, BI 16, and BI 19, but it would have a negative impact on the query BI 4.
Finally, since the index creation could improve the processing time of the query BI 4 only a little, or even worsen it, we decided to follow the second guideline. In particular, the most expensive path in terms of DB Hits is the one composed of the nodes Person, Forum, and Post. Nevertheless, this path was not materialized because the Forum node is required by the projection operator: the query BI 4 returns the id, the title, and the creation date of the forum, so a materialized path comprising Person, Forum, and Post would not be used by the planner. In consequence, the path with the second highest DB Hits was materialized, i.e., the path composed of Post, Tag, and TagClass. Lastly, this same guideline was applied for the following queries: BI 9, BI 16, BI 18, BI 19, BI 21, BI 22, and BI 23. It is worth noting that some relationships created for some queries were also used by other queries. Thus, the new relationship created for query BI 4 was employed in the plan of query BI 16, the materialized path for query BI 7 was used in the plans of queries BI 21 and BI 22, and the relationship defined for query BI 18 was utilized in the plan of query BI 19.
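A hypothetical Cypher sketch of this materialization (the relationship type names HAS_TAG, HAS_TYPE, and HAS_TAG_OF_TYPE are illustrative assumptions, not the names used in the chapter's implementation) precomputes the two-hop Post-Tag-TagClass path as a single relationship:

    // Materialize the path Post->Tag->TagClass as one direct relationship,
    // so that a query can reach the TagClass of a Post in a single expand.
    MATCH (p:Post)-[:HAS_TAG]->(:Tag)-[:HAS_TYPE]->(tc:TagClass)
    MERGE (p)-[:HAS_TAG_OF_TYPE]->(tc)

The affected queries are then rewritten to traverse the precomputed relationship, and, as discussed later in Section 4.8, the designer is responsible for keeping it up to date.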
4.5.4 Index Creation
The third guideline was applied for the query BI 6, which identifies people who have
created at least one message with a given tag. Also, it counts the number of likes
and replies that the message has received. The execution plan built by Neo4j for this
query has the following operators: (i) a scan on the Tag node; (ii) three filters on the
nodes Tag, Message, and Person; (iii) four expands on Tag and Message, Message
and Person, Person and Message, Message and Comment; (iv) a sort operator on Tag
name; (v) an aggregation to count the number of likes that the message has received
and the number of replies that the message has had; (vi) a projection; (vii) a Top for
the first 100 results.
In this case, an index on the Tag name was created by applying the third guideline, considering that the Tag name is a unique key. Contrary to the query BI 4, the Tag node has a notably large number of instances, and therefore, the query performance improvement is significant because a scan over Tag nodes would be highly costly. Lastly, this same guideline was applied for the query BI 14.
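A minimal sketch of this guideline, using the CREATE INDEX syntax of the Neo4j 3.x line that this chapter appears to target (newer Neo4j versions use a different syntax), is:

    // Index the name property of Tag nodes so that BI 6 can start from an
    // index seek instead of scanning every Tag node.
    CREATE INDEX ON :Tag(name)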
4.6 EXPERIMENTAL STUDY
In this section, the experimental study is reported. First, it begins by describing
the metrics and the configuration that were considered in this experimental study.
Second, it continues by comparing the physical design strategy following the
guidelines proposed in this work against the default strategy of Neo4j.
4.6.1 Experimental Design
Datasets: This study was conducted on synthetic datasets. Synthetic datasets consist
of data produced by a data generator of SNB called DATAGEN [14]. DATAGEN
generates data based on a scheme of a social network, and it extracts some data
collected from DBpedia [2], which allows the data to be realistic and correlated;
the posts constitute the largest amount of data in the generated dataset and contain
textual data extracted from DBpedia. The database schema contains people who
live in a city, which is part of a country or continent. These people know others and
write messages that are tagged according to a tag class. Also, people are members
of forums that contain messages, they can work in a company, or they can study in a
university. Additionally, the nodes in the schema have a degree distribution similar to a power law. Thus, to generate the data, DATAGEN creates all relationships from a given number of people to ensure that the data and link distributions are generated as in a real social network like Facebook [2]. According to the documentation [12], DATAGEN is able to generate datasets at scales of 1 GB, 3 GB, 10 GB, 30 GB, etc. Even though DATAGEN allows any scale factor, its documentation also specifies that the maximum recommended factor for a dataset should be 1 TB. In our experiments, the dataset size was set to 1, 10, and 100 GB. We limited the dataset size
because of space requirements. First, DATAGEN runs on Hadoop, which needs a lot of disk space to execute MapReduce jobs. Second, DATAGEN generates CSV (comma-separated values) files of considerable size; for example, 1.2 GB of space is required for a 1 GB dataset. Third, our physical design involves materialized paths, increasing the database size to 2.7 GB for a 1 GB dataset.
Queries: Besides DATAGEN, the benchmark provides a BIW. This workload was designed around choke points that are considered to test the DBMS performance on
FIGURE 4.10 Time executed by each query on a 1 GB database using physical design and
the default Neo4j evaluation.
TABLE 4.1
Percentage of Improvements in Runtime of Our Physical Design Strategy
w.r.t. the Default Neo4j Evaluation for Each Query on a 1 GB Database
Query BI 3 BI 4 BI 6 BI 7 BI 9 BI 14 BI 16 BI 18 BI 19 BI 21 BI 22 BI 23
Time −1.48 31.05 14.53 97.53 15.67 53.19 20.92 9.77 11.83 −0.94 98.14 9.6
improv. (%)
DB Hits 10.55 13.91 17.54 96.02 18.04 53.32 62.63 5.66 2.15 0.3 95.91 0.36
improv. (%)
are unavoidable, and they still impact the final execution plan. Additionally, path materialization was more effective for queries BI 4, BI 9, and BI 16 when compared against queries BI 18, BI 19, and BI 23; their improvement percentages vary from 15.67% to 31.05% because the materialized paths have a higher impact on the query performance.
The third highest performance improvement was obtained for the query BI 14, which ran in half the time of the strategy without physical design. Its improvement is 53.19% because an index structure was defined for a highly selective condition. Also, the initial execution plan expanded two different paths and then joined them with the remaining nodes; however, the planner improves its execution plan by filtering intermediate results instead of completely expanded paths. In consequence, unnecessary traversals are avoided, significantly reducing the cost of retrieving nodes. In addition, the performance of query BI 6 was slightly better when the index was created, and its improvement percentage was 14.53%.
FIGURE 4.11 DB Hits by each query on a 1 GB database using physical design and the
default Neo4j evaluation.
TABLE 4.2
Percentage of Improvements in DB Hits of Our Physical Design Strategy
w.r.t. the Default Neo4j Evaluation for Each Query on a 1 GB Database
Query BI 3 BI 4 BI 6 BI 7 BI 9 BI 14 BI 16 BI 18 BI 19 BI 21 BI 22 BI 23
Time −1.48 31.05 14.53 97.53 15.67 53.19 20.92 9.77 11.83 −0.94 98.14 9.6
improv. (%)
DB Hits 10.55 13.91 17.54 96.02 18.04 53.32 62.63 5.66 2.15 0.3 95.91 0.36
improv. (%)
FIGURE 4.12 Time executed by each query on a 10 GB database using physical design and
the default Neo4j evaluation.
TABLE 4.3
Percentage of Improvements in Runtime of Our Physical Design Strategy
w.r.t. the Default Neo4j Evaluation for Each Query on a 10 GB Database
Query BI 3 BI 4 BI 6 BI 7 BI 9 BI 14 BI 16 BI 18 BI 19 BI 21 BI 22 BI 23
Time 7.76 51.83 14.86 97.56 33.48 53.28 11.91 16.16 12.29 12.67 97.75 14.11
improv. (%)
DB Hits 9.14 13.95 2.84 97.71 32.49 38.18 24.63 24.57 18.79 0.13 95.6 0.14
improv. (%)
FIGURE 4.13 DB Hits by each query on a 10 GB database using physical design and the
default Neo4j evaluation.
TABLE 4.4
Percentage of Improvements in DB Hits of Our Physical Design Strategy
w.r.t. the Default Neo4j Evaluation for Each Query on a 10 GB Database
Query BI 3 BI 4 BI 6 BI 7 BI 9 BI 14 BI 16 BI 18 BI 19 BI 21 BI 22 BI 23
Time 7.76 51.83 14.86 97.56 33.48 53.28 11.91 16.16 12.29 12.67 97.75 14.11
improv. (%)
DB Hits 9.14 13.95 2.84 97.71 32.49 38.18 24.63 24.57 18.79 0.13 95.6 0.14
improv. (%)
FIGURE 4.14 Time executed by each query on a 100 GB database using physical design
and the default Neo4j evaluation.
TABLE 4.5
Percentage of Improvements in Runtime of Our Physical Design Strategy
w.r.t. the Default Neo4j Evaluation for Each Query on a 100 GB Database
Query BI 3 BI 4 BI 6 BI 9 BI 14 BI 16 BI 18 BI 19 BI 21 BI 23
Time improv. (%) 98.65 64.42 36.74 68.82 71.48 77.82 70.78 78.71 58.89 38.39
DB Hits improv. (%) 8.34 16.65 0.32 39.78 42.6 60.55 2.64 1.06 0.08 0.1
0.05, the mean of the query execution times on a 100 GB database without physical
design is significantly higher than the mean of the query execution times on a 100
GB database with physical design.
In addition, Figure 4.15 shows the DB Hits for each query and strategy. The X axis
contains the query identifiers, and the Y axis corresponds to the DB Hits normalized
in the range (0, 1). Also, Table 4.6 includes the percentage of improvements in DB
Hits for each query.
It can be noted in Figure 4.15 and Table 4.6 that the DB Hits results are similar to the execution time results. However, we can observe that the improvement for query BI 6 was lower than that obtained for the same query on the 1 GB and 10 GB databases.
4.7 RELATED WORK
Physical database design has attracted much interest in the database community for decades [17,18,19,20,21]. These works propose guidelines to follow in the field
FIGURE 4.15 DB Hits by each query on a 100 GB database using physical design and the
default Neo4j evaluation.
TABLE 4.6
Percentage of Improvements in DB Hits of Our Physical Design Strategy
w.r.t. the Default Neo4j Evaluation for Each Query on a 100 GB Database
Query BI 3 BI 4 BI 6 BI 9 BI 14 BI 16 BI 18 BI 19 BI 21 BI 23
Time improv. (%) 98.65 64.42 36.74 68.82 71.48 77.82 70.78 78.71 58.89 38.39
DB Hits improv. (%) 8.34 16.65 0.32 39.78 42.6 60.55 2.64 1.06 0.08 0.1
4.8 DISCUSSION
In this section, some limitations of our proposed guidelines are discussed. The first limitation is that the proposed guidelines assume that all designers have basic physical database design skills; without such knowledge, it will be a challenge to start designing with the proposed guidelines. In this sense, path materialization can be difficult to apply because Neo4j does not provide support for materialized views. It is the designer's responsibility to create and update materialized paths, in addition to rewriting the query by replacing the portion of the path to be materialized with the materialized relationship. However, a designer can effectively apply the guidelines through experimentation and determine which guideline is best according to the query characteristics. Also, our proposed guidelines may be the basis for automating the physical design process on graph-oriented NoSQL databases from scratch, which may not require further technical training: the optimizer could automatically select the best materialized path for a query, and the database engine could keep all the materialized paths updated.
The second limitation relates to the few structures provided by the DBMS. The
indices are not used by Neo4j if they are defined on a property in the condition of
an optional match clause. Sometimes, a hash index can be advantageous when a
condition involves an exact value instead of a range of values. Thus, the number
of structures limits the number of guidelines that can be defined for the database
physical design.
The third limitation is the variety of NoSQL DBMSs. The guidelines described herein do not necessarily apply to other DBMSs; for example, path materialization no longer makes sense for other types of DBMSs. Also, there is no standard query language as there is for relational DBMSs. The physical design guidelines associated with query rewriting are tied to the Cypher language of Neo4j. When the DBMS changes, the query language becomes different, and therefore, the query rewriting guidelines must be rethought.
newest version of Neo4j as future work to define new guidelines that help the designer improve query performance.
In addition, the proposed physical design guidelines are based on query rewriting,
path materialization, and indexing. These guidelines help to improve those queries
characterized by optional match clauses and large search spaces due to a consider-
able number of relationships and nodes. Some queries worsened or slightly improved
mainly because indexing or materialization could not be applied or they just did not
contribute enough to reduce the processing time. In this sense, the strategy of how
Neo4j evaluates such queries must be analyzed in depth to propose better physical
design guidelines that achieve a better query performance.
Lastly, multimodel NoSQL systems are becoming more popular. Proposing a physical design for a multimodel NoSQL system can be highly challenging because such systems support several different data models. Also, this type of DBMS could establish a standard query language, and thus guidelines such as query rewriting could be sustained over time without having to be rethought for another query language.
4.10 CONCLUSION
This chapter has unveiled a physical design proposal for a database of SNB and has
also presented an experimental study indicating that there is a substantial improve-
ment for the query performance in Neo4j when the physical design guidelines pro-
posed in this work are applied; performance is measured in terms of execution time
and DB Hits.
The proposed guidelines defined in this work are based on query rewriting, mate-
rialization, and indexes. There is a cost associated with the creation of an index or
path materialization, because the index requires additional space, and the path must
be precomputed and stored. However, the SNB queries are analytical and are fre-
quently executed, and therefore, paying this price seems worthwhile.
The proposed guidelines were applied to queries on databases of different sizes to empirically study the query performance. In addition, statistical tests were performed to validate our results. Nonparametric tests were chosen because statistical tests showed that the sampled data were not normally distributed. Based on these tests, it was shown that the runtimes of the 13 queries using the proposed physical design guidelines were better, with a statistically significant difference. Additionally, our experimental study shows that the execution-time improvement grows as the database size increases.
REFERENCES
1. A. Singhal, “Introducing the Knowledge Graph: Things, Not Strings,” Official Google
Blog. May, 2012. [Online]. Available at: www.blog.google/products/search/introducing-
knowledge-graph-things-not/. [Accessed Jan. 30, 2019].
2. A. Castro, J. González, and M. Callejas, “Utilidad y funcionamiento de las bases de
datos NoSQL,” Revista Facultad De Ingeniería, vol. 21, pp. 21–32, Jan. 2013.
3. P. Selmer and M. Needham, “Tuning Your Cypher: Tips & Tricks for More Effective
Queries. Neo4j Graph Database Platform,” 2016, [Online]. Available at: https://neo4j.
com/blog/tuning-cypher-queries/. [Accessed Oct. 30, 2018].
23. M. Goncalves and J. N. Mendoza, “A physical design strategy for datasets with multiple dimensions,” in Intelligent Multidimensional Data Clustering and Analysis, S. Bhattacharyya, S. De, I. Pan, and P. Dutta, Eds., IGI Global, pp. 1–27, 2017.
24. A. A. Imam, S. Basri, R. Ahmad, J. Watada, M. T. Gonzalez-Aparicio, and M. A.
Almomani, “Data modeling guidelines for NoSQL document-store databases,”
International Journal of Advanced Computer Science and Applications, vol. 9, no. 10,
pp. 544–555, 2018.
25. C. Asaad and K. Baïna, “NoSQL databases–seek for a design methodology,” in
International Conference on Model and Data Engineering, Springer, Cham, pp.
25–40, 2018.
26. M. J. Mior, K. Salem, A. Aboulnaga, and R. Liu, “NoSE: Schema design for NoSQL
applications,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 10,
pp. 2275–2289, 2017.
27. S. Agarawal, S. Chaudhuri, and V. Narasayya, “Automated selection of materialized
views and indexes for SQL databases,” in Proceedings of 26th International Conference
on Very Large Databases, Cairo, Egypt, pp. 191–207, 2000.
28. Y. Papakonstantinou, “Polystore query rewriting: The challenges of variety,” in EDBT/ICDT Workshops, Bordeaux, France, 2016.
5 Large-Scale Distributed
Stream Data
Collection Schemes
Tomoya Kawakami
Nara Institute of Science and Technology
Tomoki Yoshihisa
Osaka University
Yuuichi Teranishi
Osaka University
National Institute of Information and
Communications Technology
CONTENTS
5.1 Introduction.....................................................................................................96
5.2 Data Collection Scheme for Distributed TBPS...............................................97
5.2.1 Assumed Environment........................................................................97
5.2.1.1 Assumed TBPS Architecture................................................97
5.2.1.2 Assumed Overlay for Distributed TBPS............................... 98
5.2.2 Proposed Method............................................................................... 100
5.2.2.1 Methodology Principle....................................................... 100
5.2.2.2 Collective Store and Forwarding........................................ 101
5.2.2.3 Adaptive Data Collection Tree............................................ 103
5.2.3 Evaluation.......................................................................................... 105
5.2.3.1 Simulation Parameters........................................................ 105
5.2.3.2 Simulation Results.............................................................. 106
5.3 Data Collection Scheme Considering Phase Differences.............................. 109
5.3.1 Problems Addressed.......................................................................... 109
5.3.1.1 Assumed Environment........................................................ 109
5.3.1.2 Input Setting........................................................................ 110
5.3.1.3 Definition of a Load............................................................ 110
5.3.2 Proposed Method............................................................................... 111
5.3.2.1 Skip Graphs......................................................................... 111
5.3.2.2 Phase Differences............................................................... 112
5.1 INTRODUCTION
The Internet of Things (IoT) [1] has attracted greater interest and attention with
the spread of network-connected small devices such as sensors, smartphones, and
wearable devices. In the data science field, stream data generated from IoT devices
are analyzed to get various information. A larger amount of data can lead to high-
quality information as faster stream data collection is one of the main techniques in
the data science field, and various schemes have been proposed. To enable IoT appli-
cations for data collection, pub/sub messaging [2] is considered to be a promising
event delivery method that can achieve asynchronous dissemination and collection
of information in real time in a loosely coupled environment. For example, sensor
devices correspond to publishers, and IoT application corresponds to a s ubscriber.
Topic-Based Pub/Sub (TBPS) protocols such as MQTT (Message Queuing Telemetry
Transport) [3] and AMQP (Advanced Message Queuing Protocol) [4] are widely
used by many IoT applications. These systems have a broker server for managing
topics. The broker gathers all the published messages and forwards them to the
corresponding subscribers.
In IoT and Big Data applications, collecting all of the raw (unfiltered) sensor data
is important for conducting various forms of analysis [5]. In this case, the larger the number of sensors an application analyzes, the larger the number of messages that need to be received per unit time by the broker and subscribers in TBPS. For example, when the publishers correspond to a certain kind of sensor that publishes data every 10 s and the number of target sensors in an application is 10,000, the broker must receive 1,000 messages per second on average. Thus, the number of messages received by the broker and subscribers tends to explode in IoT and Big Data applications. In general, the number of messages sent and received per unit time affects the network process load, because tasks such as adding/removing headers and serializing/deserializing payloads are required for each message. Therefore, even though each sensor datum is small, the increase in the number of publishers can cause network process overloads on the broker and subscribers. This leads to data loss or unusual increases in delivery latency, problems that have an adverse effect on IoT and Big Data applications.
Many existing studies tackle the problem of scalability in TBPS systems. The approach of these studies is based on distributed brokers, in which brokers run as peers in a peer-to-peer system. The brokers construct an overlay network among themselves. For example, there are Distributed Hash Table (DHT)-based approaches [6,7], hybrid overlay approaches [8], and Skip Graph (SG)-based [9] approaches [10,11]. These approaches can keep the number of connections that each broker needs to accept small through multihop message forwarding on overlays. However, the aim of these existing methods is to deliver messages from one publisher to multiple subscribers in a scalable manner. Thus, they are unable to avoid the network process overloads caused by collection, such as when messages are received from a large number of publishers. In addition, the existing techniques do not assume that data are collected periodically from publishers with different intervals at the same time.
Therefore, we define continuous sensor data with different intervals (cycles) as a sensor data stream and have proposed collection methods for distributed sensor data streams as a TBPS system [12,13]. In particular, in [12] we proposed a message forwarding scheme on overlays called “Collective Store and Forwarding” (CSF). The scheme can dramatically reduce the load of network processes when there are a large number of publishers (nodes), while maintaining the delivery time constraints given for the messages. In addition, we have also proposed a flexible tree construction method called “Adaptive Data Collection Tree” (ADCT) on Chord# [14] that can adjust the maximum load of network processes on distributed brokers to avoid overloads caused by CSF. Moreover, we have proposed an extended method that assigns phase differences to balance the collection times among nodes with the same or specific collection cycles [15]. We call this novel approach “phase shifting” (PS). The PS approach enables the SG method to decrease the probability of load concentration at a specific time or node. By assigning phase differences at random to the nodes, the collection times are distributed even if there are nodes with the same collection cycle. We have evaluated our proposed methods in simulation. Our experimental results show that our proposed methods can reduce the loads of nodes and realize highly scalable systems for periodically collecting distributed sensor data. The scalability of the data collection systems is significantly important to accommodate a huge number of objects and to encourage the growth of the data science field.
In the following, the data collection scheme for distributed TBPS is described in
Section 5.2. The data collection scheme considering phase differences is described
in Section 5.3. We describe the discussion and related work in Sections 5.4 and 5.5,
respectively. Finally, the conclusion of the chapter is presented in Section 5.6.
5.2.1 Assumed Environment
In this section, we present an overview of the distributed TBPS environment that we
assume.
scalability. Each broker corresponds to the network entity located physically close to
each end device. The end devices correspond to IoT entities such as sensors, smart-
phones, and appliances.
The broker can run on any networked computer that is located close to the end
device. The computer can be a server at the distributed data center, a Wi-Fi access
point with computing power, and a personal desktop computer. The model is com-
patible with the hierarchical cloud architectures recently proposed for IoT (the so-
called “edge computing” architecture).
A broker contains a subscriber or a publisher of a topic when one of the devices
accommodated on the broker attempts to subscribe/publish to the topic. The brokers
then construct an overlay network for message delivery on each topic. Each broker
joins the overlay network to handle distributed message deliveries. The end device
can be a broker if it is able to handle an overlay function. Hereafter, an entity that
joins the overlay is referred to as a “node.” The node can be a publisher, a subscriber,
or both. Brokers can contain multiple publishers or subscribers across different
topics.
FIGURE 5.2 TBPS overlay structure on Chord# and its forwarding path.
2^l-th skipped node on the right side. These pointers are maintained as the routing table of the node. For example, the node with ID 1 has a pointer to 3 at level 1 and to 5 at level 2, as shown in the figure. This pointer set is called a “finger table.” The structure of the finger table is constructed by using periodic message exchanges between the neighbor nodes on each level.
A message toward a key is forwarded using the upper-level pointers. By using this structure, any key can reach the corresponding node in at most log2 N hops, where N is the total number of nodes. Moreover, Chord# can maintain the same level of performance even when there are a small number of nodes, as long as the routing tables are updated sufficiently. Refer to [14] for more details about Chord#.
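As a quick worked example at the scale used later in the evaluation, with N = 10,000 nodes the bound gives log2 10,000 ≈ 13.3, so any key can be resolved in at most 14 hops.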
In ST, each subscriber or publisher joins a structured overlay network based on
its key, which includes a string for the topic name. One broker can have multiple
keys to contain multiple publishers or subscribers of different topics. To distinguish
subscribers with the same topic name, a unique suffix (ID) is added to the topic
name on each key. There is a subscriber segment and a publisher segment on each
topic. The subscriber segment must exist on the right side of the publisher segment
on Chord#, because it only has right-side pointers on upper levels. This structure
enables publishers to detect the existence of subscribers using the neighbor link at
level 0, thereby reducing the amount of redundant traffic from publishers when there
are no subscribers [10].
In this TBPS overlay structure, the following theorem holds, as it uses Chord# as
the base structure.
Theorem 1
The maximum number of hops required to deliver a message for a topic and the number of connections each node needs to accept are log2 M, where M is the number of publishers and subscribers for the topic.
By this theorem, the message delivery for a large number of subscribers is exe-
cuted in a scalable manner. However, the broker that accommodates a subscriber can
cause network process overloads when an application needs to accept a large number
of publishers. The number of messages the node needs to receive in a short period
becomes at most as many as the number of publishers, because each publisher runs
asynchronously. In addition, a few publishers need to forward a larger number of
messages than others due to the path concentration of data forwarding on structured
overlays. As a result, the load of the network process easily exceeds the capacity of
the node.
5.2.2 Proposed Method
In this section, we describe the details of our proposed TBPS scheme for solving the
problems associated with large-scale data collection.
5.2.2.1 Methodology Principle
The basic idea of our proposal is to reduce the overhead of forwarding a large number of messages by using a technique known as “message merging.”
In general, if the total size of the data to be transferred is the same, the performance is better when multiple items of data are sent as one large chunk than when they are sent in multiple small segments [16]. Therefore, if there are a large number of small messages that need to be delivered, the message transfer time is reduced by merging the messages into one large message.
Figure 5.3 shows the results of a preliminary experiment that measured the trans-
mission speed between two virtual machines on a cloud test bed [17] using a Java-
based messaging framework [18]. The specifications of the experiment environment
are listed in Table 5.1.
TABLE 5.1
Preliminary Experiment Setup
Parameters Value
Virtual CPU 1 core/2.1 GHz
No. of virtual machines 1/Host
Memory 2 Gbyte
Network 1000 BASE-T
OS Ubuntu 12.04
Protocol TCP/IP
Hypervisor KVM 3.5
The average of ten runs for each data size is plotted. As shown in the figure, when the message size is small, the network speed is limited. As the message size grows, the network speed rises. In this experimental environment, the network speed saturates once the message size exceeds 16 kbyte. The cause of the speed limitation is the overhead needed to handle each message, such as adding/removing headers and serializing/deserializing payloads.
In TBPS systems, publishers act asynchronously, and the messages are published
at the timing set by the publisher. To merge messages in this situation, we propose a
message forwarding scheme called CSF on Chord#-based distributed TBPS. In CSF,
we assume that each message has its own “delivery deadline.” In IoT applications,
the sensor data has its own expiration time, which refers to the duration for which the
data is considered to be valid. For example, if an image sensor observes the density
of people in a certain area every 1 min, the observed value may be valid for 1 min. In
this case, the deadline of the message is 1 min. The delivery deadline is defined by
publishers based on the time constraints of the message content. The relay nodes on
the message delivery tree on CSF store the received messages and forward merged
messages, observing the delivery deadline.
We also propose an adaptive tree construction method called ADCT. ADCT adaptively adjusts the maximum overhead that nodes incur when forwarding messages in the CSF process. By sharing the message processing load among the publisher nodes that are close to the subscriber segment, ADCT can flexibly reduce the maximum overhead and prevent the concentration of network process loads.
memory or disk. In other cases, the message is forwarded to the next node. T is the
margin time for message transfers, which should be specified if the message must
strictly meet the deadline. The value should be larger than the node-to-node latency
multiplied by the maximum number of hops. The maximum number of hops can be
estimated by the maximum finger table entry level that points to a node within the
same topic.
In the publish function, the append function creates a payload by merging the stored messages. At this time, the deadline is set to the earliest deadline among them (a process we refer to as “deadline reduction”). The message is then forwarded to the next node, and the storage is cleared so that it can hold the messages received until the next instance of publishing on the topic. As described earlier, CSF requires an extra process to store and merge messages compared with normal overlay implementations. However, it is a simple algorithm to implement. If the physical storage size is limited, the algorithm can easily be modified to remain within the predefined storage size.
Figure 5.5 shows the cases of CSF. In these sequence diagrams, the small circle
indicates the timing for invoking the publish function.
There are three publishers, and two of them (publishers 2 and 3) also act as relays.
Figure 5.5a shows a typical case in which the messages can be merged. In this exam-
ple, the deadline of message 1 is after the generation time of message 4 on pub-
lisher 2. Therefore, message 1 is merged with message 4. Likewise, message 4 is
merged with message 8 on publisher 3. As a result, the subscriber receives only one
message when receiving the contents of 1, 4, and 8.
FIGURE 5.5 Message merging. (a) Merge without deadline reduction, (b) merge with deadline reduction, (c) pass through (no merge).
Figure 5.5b shows another typi-
cal case in which the deadline of the published message on the relay node is reduced
(deadline reduction). Because message 1 is merged with 4, the deadline is reduced to
that of 1 since it has an earlier deadline. Message 4 could not be merged on publisher
3 because the deadline was reduced.
If the deadline of the received message is earlier than the next time to publish,
the message must be forwarded immediately. Figure 5.5c provides an example of
this case.
According to Theorem 1, a message published for a topic t can be merged at most
log2 |Nt| times and log2 |Nt|/2 times on average, where |Nt| is the number of publishers
for t when there is only one subscriber for t.
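For instance, under this assumption (a single subscriber and, say, |Nt| = 1,024 publishers for topic t), a published message can be merged at most log2 1,024 = 10 times on its way to the subscriber, and about 10/2 = 5 times on average. This figure is ours for illustration only and is not taken from the simulations reported below.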
boundary nodes based on this information. Figure 5.6 shows a boundary node adjust-
ment protocol on ADCT. The responsive subscriber is hereafter written as r. If one
of the boundary nodes becomes overloaded, an INVITE message is sent to the node
that has the minimum key among boundary nodes with the pointer to r. After that,
the message is forwarded to an adjacent node. The node that receives the INVITE
message acts as a boundary node. If all boundary nodes are underloaded, then a
WITHDRAW message is sent to the leftmost boundary node, and the number of
boundary nodes is reduced.
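The adjustment decision itself can be summarized in a few lines. The sketch below is our own reading of the protocol (the load thresholds and the representation of boundary-node loads are assumptions, not taken from the chapter): INVITE grows the boundary set when any boundary node is overloaded, and WITHDRAW shrinks it when all boundary nodes are underloaded.

```python
from typing import List, Optional

def adjust_boundary(boundary_loads: List[float],
                    high_watermark: float,
                    low_watermark: float) -> Optional[str]:
    """Decide which adjustment message the responsive subscriber r should trigger.

    `boundary_loads` holds the current network-processing load of each boundary node;
    the watermark thresholds are illustrative assumptions.
    """
    if any(load > high_watermark for load in boundary_loads):
        # Overload: send INVITE (carrying the pointer to r) towards the boundary node
        # with the minimum key; the adjacent node that receives it becomes a boundary node.
        return "INVITE"
    if all(load < low_watermark for load in boundary_loads):
        # All underloaded: send WITHDRAW to the leftmost boundary node to reduce |B|.
        return "WITHDRAW"
    return None
```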
On the boundary nodes, the following theorem holds.
Theorem 2
5.2.3 Evaluation
In order to show the effectiveness of our TBPS method, we conducted simulations
to evaluate its performance. We developed an original simulator program to evalu-
ate the performance on a 10,000 node-scale peer-to-peer system, because existing
network simulators do not have the ability to evaluate this scale. Instead of simulat-
ing underlay physical networks, we conducted a peer-to-peer simulation in which
we calculated the network performance using real data obtained from experiments.
We evaluated the received number of messages on a publisher/subscriber, the
received message size on a subscriber, and the estimated network occupation time
for message handling on a publisher/subscriber.
5.2.3.1 Simulation Parameters
The simulation parameters are shown in Table 5.2. To evaluate the basic perfor-
mance of our proposal, the messages were published periodically with the same
cycle (1 time unit), the deadline was set to be the same as the length of the cycle,
and T was set to zero (i.e., the transmission latency was ignored). The message size was set to
128 bytes, which is relatively small but sufficient for sensor data. To estimate the
network occupation time, the result of the preliminary experiment in Figure 5.3 was
used. The simulations were executed ten times for each parameter, and the average
value was plotted.
TABLE 5.2
Simulation Setup
Parameters Value
Publish cycles 1 (time unit)
Deadline 1 (time unit)
Message data size 128 (bytes)
Transmission speed Follows Figure 5.3
Number of nodes 1,000–10,000
Elapsed time 128
T 0
Tree structures DAT, RQ, ADCT
As a comparison tree structure for ADCT, the aggregation tree structure of
Distributed Aggregation Tree (DAT) and range query by Split Forward Broadcasting
(SFB) (denoted as RQ) were used. Though DAT is proposed as an aggregation
method, it is a basic tree construction method and can be applied to CSF on Chord#.
RQ is the reverse tree structure of a range query from r. The query path was con-
structed for all publishers from r by SFB. RQ requires periodic range queries to be
executed, but DAT and ADCT do not require this process. Example structures of
DAT and RQ are shown in Figure 5.8.
5.2.3.2 Simulation Results
Figures 5.9–5.11 show the results when the number of publishers was changed from
1,000 to 10,000. In these cases, |B| in ADCT was set as 2. Figures 5.12–5.14 show
the results when |B| was changed in ADCT. DAT and RQ become a constant value in
these cases. The number of publishers was set as 10,000.
Figures 5.9 and 5.12 show the number of messages on r and the publisher
node that received the maximum number of messages (hereafter, p). As shown in
Figure 5.9, CSF can reduce the load dramatically. When there are 10,000 nodes,
the number of messages was reduced by nearly 98%–99% on both r and p. When |B|
increased, the load on p was reduced as expected.
Figures 5.10 and 5.13 show the maximum and average size of the messages
received on r. Though the maximum size of the messages exhibits a similar tendency
for all tree structures, the average message size on ADCT becomes larger than that of
the other structures, especially when |B| is small.
Figures 5.11 and 5.14 show the network occupation time to receive messages on
r and p. The performance using CSF seems similar in each tree structure. Compared
with the structure without CSF, about 90% of the network occupation time was
reduced on both r and p. When CSF is not used, about 350 ms was consumed on r.
This is too long for a single network process, given that the result is for only one topic,
while a TBPS system must handle multiple topics.
The network occupation time on p became lower than that of the other methods when |B| is
larger than 4. The results show that the boundary node adjustment protocol can prop-
erly control the network load. On the other hand, although the number of messages on
r increased, the network occupation time increased only slightly. That is because the
average message size on ADCT remains as large as that of the other methods, even when
|B| is increased, as shown in Figure 5.13.
5.3.1 Problems Addressed
In this section, we describe the problems addressed by our data collection scheme
considering phase differences.
5.3.1.1 Assumed Environment
The purpose of this study is to disperse the communication load in sensor stream
collections that have different collection cycles. The source nodes have sensors and
obtain sensor data periodically. The source nodes and the collection node (sink node)
of those sensor data construct P2P (peer-to-peer) networks. The sink node searches for
source nodes and requests sensor data streams with the specified collection cycles in the
P2P network. Upon receiving the query from the sink node, a source node starts
to deliver its sensor data stream via other nodes in the P2P network. The intermedi-
ate nodes relay the sensor data stream to the sink node based on their routing tables.
5.3.1.2 Input Setting
The source nodes are denoted as Ni (i = 1, …, n), and the sink node of sensor data is
denoted as S. In addition, the collection cycle of Ni is denoted as Ci.
In Figure 5.15, each node indicates a source node or sink node, and the branches
indicate collection paths for the sensor data streams. Concretely, they indicate com-
munication links in an application layer. The branches are indicated by dotted lines
because there is a possibility that the branches may not collect a sensor data stream
depending on the collection method. The sink node S is at the top, and the four
source nodes N1, …, N4 (n = 4) are at the bottom. The number next to each
source node indicates its collection cycle: C1 = 1, C2 = 2, C3 = 2, and C4 = 3.
This corresponds, for example, to the case where a live camera acquires an image every
second, and N1 records the image once every second, N2 and N3 record the image
once every 2 s, and N4 records the image once every 3 s. Table 5.3 shows
the collection cycle of each source node and the sensor data to be received in the
example in Figure 5.15.
5.3.1.3 Definition of a Load
The communication load of the source nodes and sink node is given as the total of
the load due to the reception of the sensor data stream and the load due to the trans-
mission. The communication load due to reception is referred to as the reception
load, the reception load of Ni is Ii, and the reception load of S is I0. The communica-
tion load due to the transmission is referred to as the transmission load, the transmis-
sion load of Ni is Oi, and the transmission load of S is O0.
TABLE 5.3
An Example of the Sensor Data Collection
Time  N1 (Cycle = 1)  N2 (Cycle = 2)  N3 (Cycle = 2)  N4 (Cycle = 3)
0     *               *               *               *
1     *
2     *               *               *
3     *                                               *
4     *               *               *
5     *
6     *               *               *               *
7     *
…     …               …               …               …
In many cases, the reception load and the transmission load are proportional to
the number of sensor data pieces per unit time in the sensor data stream to be sent
or received. The number of pieces of sensor data per unit time in the sensor data
stream delivered by Np to Nq (q ≠ p; p, q = 1, …, n) is R(p, q), and the
number delivered by S to Nq is R(0, q).
5.3.2 Proposed Method
In this section, we present our SG-based method considering phase differences.
5.3.2.1 Skip Graphs
In this study, we assume an overlay network for the SG-based TBPS, such as Banno
et al. [10].
SGs are overlay networks where the skip lists are applied in the P2P model [9].
Figure 5.16 shows the structure of an SG. In Figure 5.16, squares show entries of
routing tables on peers (nodes), and the number inside each square shows the key of
the peer. The peers are sorted in ascending order by those keys, and bidirectional
links are created among the peers. The numbers below the entries are called membership vectors.
5.3.2.2 Phase Differences
We have previously proposed a large-scale data collection scheme for distributed
TBPS [12]. In [12], we employ “CSF,” which stores and merges multiple small
messages into a large message along a multihop tree structure on a structured
overlay for TBPS, taking into account the delivery time constraints. This makes
it possible to reduce the overhead of network process even when a large number
of sensor data is published asynchronously. In addition, we have proposed a col-
lection system considering phase differences [13]. In the proposed method, the
phase difference of the source node Ni is denoted as di (0 ≦ di < Ci). In this case,
the collection time is represented as Cip + di (p = 0, 1, 2, …). Table 5.4 shows the
time to collect data in the case of Figure 5.15, where the collection cycle of each
source node is 1, 2, or 3. By considering phase differences like Table 5.4, the col-
lection time is balanced within each collection cycle, and the probability of load
concentration at a specific time or node is decreased. Each node sends sensor
data at times based on its own collection cycle and phase difference, and other nodes
relay the sensor data to the sink node. In this chapter, we refer to the use of phase differences
as “PS.” Figures 5.17 and 5.18 show an example of the data forwarding paths on SGs
without and with PS, respectively.
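As a concrete illustration of the collection times Cip + di and of the load-balancing effect, the short Python sketch below (our own; the node counts and time horizon are arbitrary assumptions) reproduces the schedule of Table 5.4 and counts how many nodes must be collected at each time, with and without PS.

```python
import random
from collections import Counter

def collection_times(cycle: int, phase: int, horizon: int):
    """Collection times C*p + d (p = 0, 1, 2, ...) for a node with cycle C and phase d, 0 <= d < C."""
    return range(phase, horizon, cycle)

def targeted_nodes_per_time(cycles, horizon, use_ps):
    """Count, for each time slot, how many source nodes must be collected."""
    counts = Counter()
    for c in cycles:
        d = random.randrange(c) if use_ps else 0   # with PS, each node picks a random phase
        counts.update(collection_times(c, d, horizon))
    return counts

cycles = [1, 2, 3] * 100   # 300 source nodes with cycles 1, 2, and 3 (illustrative)
print(sorted(targeted_nodes_per_time(cycles, 12, use_ps=False).items()))  # peaks at t = 0, 6, ...
print(sorted(targeted_nodes_per_time(cycles, 12, use_ps=True).items()))   # counts are smoothed
```

Without PS, every node with every cycle fires at the common multiples of the cycles, producing the sharp peaks discussed in Section 5.3.3; with random phases, the per-slot counts are probabilistically equalized.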
5.3.3 Evaluation
In this section, we describe the evaluation of the proposed SG-based method with
PS by simulation.
TABLE 5.4
An Example of the Collection Time Considering Phase Differences
Cycle  Phase Difference  Collection Time
1      0                 0, 1, 2, 3, 4, …
2      0                 0, 2, 4, 6, 8, …
2      1                 1, 3, 5, 7, 9, …
3      0                 0, 3, 6, 9, 12, …
3      1                 1, 4, 7, 10, 13, …
3      2                 2, 5, 8, 11, 14, …
TABLE 5.5
Simulation Environments
Environment No. of Nodes Cycles
1 1,000 1, 2, 3
2 1,000 1, 2, …, 10
the number of targeted nodes at each time. In simulation environment 1, shown
in Figure 5.19, the case without phase differences collects data from all
1,000 nodes at times 0, 6, 12, …, 96. This is because the collection cycle is 1, 2, or
3, and the lowest common multiple is 6. At other times, without phase
differences, the number of targeted nodes fluctuates sharply and repeatedly.
On the other hand, the collection time is shifted by the phase difference
in our proposed system, and the number of nodes is probabilistically equalized at each
time if the phase difference of each node is determined at random. Therefore, the
probability of load concentration is decreased. Also, in simulation environment
2, shown in Figure 5.20, our proposed system achieves load balancing similar to
the results in simulation environment 1, whereas in the case without phase
differences the number of nodes varies in a complex pattern determined by the combination
of cycles from 1 to 10.
FIGURE 5.27 The maximum instantaneous load by the allowable number of stream
aggregation.
FIGURE 5.28 The total loads by the allowable number of stream aggregation.
5.4 DISCUSSION
We described the data collection scheme for distributed TBPS and the PS approach
in Sections 5.2 and 5.3, respectively. Our experimental results show that our proposed
method can reduce the loads of nodes and realize highly scalable systems to periodi-
cally collect distributed sensor data.
As a limitation of our current study, we assume that the pieces of data do not
differ greatly from each other. In the real world, however, various types of data
are published at the same time, such as texts, images, and audio. These pieces of
data have different sizes and processing loads. We can address this limitation
by considering not only the number of data pieces transmitted or received but
also their types. Similarly, this study does not consider differences in the nodes'
performance, which is another limitation. We can address it by taking into account
node characteristics such as processing power,
memory size, and network environment. In addition, this study has limitations
from the viewpoint of security and privacy. For example, private data should preferably
be sent to the subscriber via fewer nodes. Encryption of the data or of the communica-
tion is one common approach, and arranging the data forwarding
paths with security and privacy in mind is another way to address this limitation,
e.g., for private data, the forwarding paths are connected directly to the valid subscribers.
5.5 RELATED WORK
Related to the distributed stream data collection, various techniques have been pro-
posed to disperse the communication loads for stream delivery [19].
P2P stream delivery techniques have been proposed to use a P2P architecture and
disperse the communication loads among the processing computers (nodes) [20–24].
The P2P stream delivery techniques are divided into pull type and push type. In
pull-type techniques such as PPLive, DONet [20], and SopCast, the reception nodes
request data from other nodes and receive them. The reception nodes request only the
data that they have not yet received, and hence redundant communi-
cations do not occur. In push-type techniques such as AnySee, data are sent from
the transmission node to other nodes [21]. The transmission nodes find the nodes
that have not yet received the data, so redundant communications do not
occur. Techniques combining the pull and push types have also been proposed,
such as PRIME [22].
Data delivery path construction techniques, such as multicast trees, have been proposed
to prevent the concentration of communication loads on a specific node [25–29]. In
the ZIGZAG method, nodes construct clusters, and the multicast tree is constructed
from the clusters [26]. The number of clusters included at each depth of the multicast tree
is made the same, and thus, the loads are dispersed. Multicast trees are constructed
only from information gained in the application layer, and it is not necessary to
understand the physical network structure.
In the MSMT/MBST method, the concentration of communication loads is prevented
more effectively than in the ZIGZAG method by considering the physical network
structure [27]. However, the implementability of the MSMT/MBST method is poor because it
requires knowledge of the entire network structure between the nodes. In locality-
aware clustering (LAC), the loads are dispersed more than in the ZIGZAG method
by considering only part of the nodes, even though the full physical network structure
cannot be known [28].
In the aforementioned P2P stream delivery techniques, the same data stream is
assumed to be sent to many reception nodes. In the delivery of sensor data
streams, however, the same sensor data stream may have to be delivered with different
delivery cycles. In this case, those sensor data streams are delivered as differ-
ent data streams for each delivery cycle. Thus, the communication loads cannot be
efficiently dispersed. On the other hand, our proposed methods consider the different
frequencies or cycles of each data stream and construct delivery paths to efficiently
collect them.
As distributed stream data collection systems, an existing method to reduce the
number of messages to receive data from large-scale nodes is to execute a range
query on key order-preserving overlays. For example, “SFB” [30] is an efficient way
to construct tree structures for range queries. The data collection from publishers
on a subscriber in TBPS corresponds to the execution of a range query over the nodes that
have keys associated with the topic. It can reduce the number of messages by merging responses
from nodes along the reverse path of the query delivery tree structure. However, this
method loses the asynchronous real-time feature of TBPS. The latest sensor data is
not delivered until the subscriber executes a range query. Once the tree structure is
constructed, it can be reused, but the periodic execution of range queries is needed
to catch up with the joins/leaves of publishers and subscribers.
Some existing works address the “aggregation problem” on structured overlays.
DAT [31] constructs a tree to aggregate data from distributed nodes using the Chord
[32] overlay structure. DAT computes the aggregated value of all the local values,
applying a given aggregation function on the distributed nodes. DAT can be used for
data collections if the nodes execute the message merging function as an aggregation
function. However, to merge and collect published messages, the publishers need
to publish messages at exactly the same time, which is not a realistic assumption.
Moreover, the aforementioned methods cause path concentration of data being for-
warded on the nodes that are located close to the subscriber nodes on the overlay. In
addition, once the tree structure is decided, it cannot be changed dynamically. As a
result, a network process overload tends to occur on these nodes. On the other hand,
our proposed ADCT method can construct a flexible collection tree and adaptively
adjust the maximum overhead for the nodes to merge and forward messages.
5.6 CONCLUSION
This chapter has presented the latest techniques for large-scale data collec-
tion schemes. We defined continuous sensor data with different inter-
vals (cycles) as a sensor data stream and described the techniques for TBPS systems
called “CSF,” “ADCT,” and “PS.” Our experimental results show that our proposed
methods can reduce the loads on nodes and realize highly scalable systems that periodi-
cally collect distributed sensor data. The scalability of data collection systems is
critically important to accommodate a huge number of objects and encourage the
growth of the data science field.
In future work, we will try to address the current limitations described in Section 5.4.
More specifically, we will consider other information to determine the data forward-
ing paths such as data types, node performances, and security/privacy.
ACKNOWLEDGEMENTS
This work was supported by JSPS KAKENHI Grant Numbers 16K16059, 17K00146,
18K11316, Hoso Bunka Foundation, and I-O DATA Foundation.
REFERENCES
1. Steve Hodges, Stuart Taylor, Nicolas Villar, James Scott, Dominik Bial, and Patrick
Tobias Fischer, Prototyping Connected Devices for the Internet of Things, IEEE
Computer, Vol. 46, No. 2, pp. 26–34, Feb. 2013.
2. Patrick Th. Eugster, Pascal A. Felber, Rachid Guerraoui, and Anne-Marie Kermarrec,
The Many Faces of Publish/Subscribe, ACM Computing Surveys, Vol. 35, No. 2, pp.
114–131, June 2003.
3. MQTT Version 3.1.1. http://docs.oasis-open.org/mqtt/mqtt/v3.1.1/mqtt-v3.1.1.pdf
(accessed Jan. 31, 2019).
4. Advanced Message Queuing Protocol. www.amqp.org/ (accessed Jan. 31, 2019).
5. Nik Bessis and Ciprian Dobre, Big Data and Internet of Things: A Roadmap for Smart
Environments, Studies in Computational Intelligence, Springer, Vol. 546, Berlin, 2014.
6. Miguel Castro, Peter Druschel, Anne-Marie Kermarrec, and Antony Rowstron,
SCRIBE: A Large-Scale and Decentralized Application-Level Multicast Infrastructure,
IEEE Journal on Selected Areas in Communications (JSAC), Vol. 20, No. 8, pp.
1489–1499, 2002.
7. Sylvia Ratnasamy, Mark Handley, Richard M. Karp, and Scott Shenker, Application-
Level Multicast Using Content-Addressable Networks, in Proceedings of the 3rd
International COST264 Workshop on Networked Group Communication (NGC 2001),
pp. 14–29, Nov. 2001, Seoul, Korea
8. Fatemeh Rahimian, Sarunas Girdzijauskas, Amir H. Payberah, and Seif Haridi,
Vitis: A Gossip-Based Hybrid Overlay for Internet-Scale Publish/Subscribe Enabling
Rendezvous Routing in Unstructured Overlay Networks, in Proceedings of the 25th
IEEE International Parallel and Distributed Processing Symposium (IPDPS 2011), pp.
746–757, May 2011.
9. James Aspnes and Gauri Shah, Skip Graphs, ACM Transactions on Algorithms (TALG),
Vol. 3, No. 4 (37), pp. 1–25, Nov. 2007.
10. Ryohei Banno, Susumu Takeuchi, Michiharu Takemoto, Tetsuo Kawano, Takashi
Kambayashi, and Masato Matsuo, Designing Overlay Networks for Handling Exhaust
Data in a Distributed Topic-based Pub/Sub Architecture, Journal of Information
Processing (JIP), Vol. 23, No. 2, pp. 105–116, Mar. 2015.
11. Yuuichi Teranishi, Ryohei Banno, and Toyokazu Akiyama, Scalable and Locality-
Aware Distributed Topic-Based Pub/Sub Messaging for IoT, in Proceedings of the 2015
IEEE Global Communications Conference (GLOBECOM 2015), pp. 1–7, Dec. 2015,
San Diego, CA, USA
12. Yuuichi Teranishi, Tomoya Kawakami, Yoshimasa Ishi, and Tomoki Yoshihisa,
A Large-Scale Data Collection Scheme for Distributed Topic-Based Pub/Sub, in
Proceedings of the 2017 International Conference on Computing, Networking and
Communications (ICNC 2017), 6 pages, Jan. 2017.
13. Tomoya Kawakami, Yoshimasa Ishi, Tomoki Yoshihisa, and Yuuichi Teranishi,
A Skip Graph-Based Collection System for Sensor Data Streams Considering Phase
Differences, in Proceedings of the 8th International Workshop on Streaming Media
Delivery and Management Systems (SMDMS 2017) in Conjunction with the 12th
International Conference on P2P, Parallel, Grid, Cloud and Internet Computing
(3PGCIC 2017), pp. 506–513, Nov. 2017.
14. Thorsten Schütt, Florian Schintke, and Alexander Reinefeld, Range Queries on
Structured Overlay Networks, Computer Communications, Vol. 31, No. 2, pp. 280–291,
Feb. 2008.
15. Tomoya Kawakami, Tomoki Yoshihisa, and Yuuichi Teranishi, A Load Distribution
Method for Sensor Data Stream Collection Considering Phase Differences, in
Proceedings of the 9th International Workshop on Streaming Media Delivery and
Management Systems (SMDMS 2018) in Conjunction with the 13th International
Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC 2018),
pp. 357–367, Oct. 2018, Taichung, Taiwan
16. Kazuyuki Shudo, Collective Forwarding on Structured Overlays, IPSJ Transactions on
Advanced Computing Systems, Vol. 2, No. 3, pp. 39–46, Mar. 2009 (In Japanese).
17. Y. Teranishi, Y. Saito, S. Murono, and N. Nishinaga, JOSE: An Open Testbed for Field
Trials of Large-Scale IoT Services, NICT Journal, Vol. 6, No. 2, pp. 151–159, Mar.
2016.
18. Y. Teranishi, PIAX: Toward a Framework for Sensor Overlay Network, in Proceedings
of CCNC 2009, pp. 1–5, 2009, Las Vegas, US
19. Zhijie Shen, Jun Luo, Roger Zimmermann, and Athanasios V. Vasilakos, Peer-to-Peer
Media Streaming: Insights and New Developments, Proceedings of the IEEE, Vol. 99,
No. 12, pp. 2089–2109, Oct. 2011.
20. Xinyan Zhang, Jiangchuan Liu, Bo Li, and Tak-Shing Peter Yum, CoolStreaming/
DONet: A Data-Driven Overlay Network for Peer-to-Peer Live Media Streaming,
in Proceedings of the 24th Annual Joint Conference of the IEEE Computer and
Communications Societies (INFOCOM 2005), pp. 2102–2111, Mar. 2005, Miami, US
21. Xiaofei Liao, Hai Jin, Yunhao Liu, Lionel M. Ni, and Dafu Deng, AnySee: Peer-to-
Peer Live Streaming, in Proceedings of the 25th IEEE International Conference on
Computer Communications (INFOCOM 2006), pp. 1–10, Apr. 2006.
22. Nazanin Magharei and Reza Rejaie, PRIME: Peer-to-Peer Receiver-Driven Mesh-
Based Streaming, in Proceedings of the 26th IEEE International Conference on
Computer Communications (INFOCOM 2007), pp. 1415–1423, May 2007.
23. Linchen Yu, Xiaofei Liao, Hai Jin, and Wenbin Jiang, Integrated Buffering Schemes
for P2P VoD Services, Peer-to-Peer Networking and Applications, Vol. 4, No. 1,
pp. 63–74, 2011.
24. Suguru Sakashita, Tomoki Yoshihisa, Takahiro Hara, and Shojiro Nishio, A Data
Reception Method to Reduce Interruption Time in P2P Streaming Environments, in
Proceedings of the 13th International Conference on Network-Based Information
Systems (NBiS 2010), pp. 166–172, Sept. 2010, Takayama, Gifu, Japan.
25. Ashwin R. Bharambe, Mukesh Agrawal, and Srinivasan Seshan, Mercury: Supporting
Scalable Multi-Attribute Range Queries, in Proceedings of the ACM Conference
on Applications, Technologies, Architectures, and Protocols for Computer
Communications (SIGCOMM 2004), pp. 353–366, Aug. 2004.
26. Duc A. Tran, Kien A. Hua, and Tai Do, ZIGZAG: An Efficient Peer-to-Peer Scheme
for Media Streaming, in Proceedings of the 22nd Annual Joint Conference of the IEEE
Computer and Communications Societies (INFOCOM 2003), pp. 1283–1292, Mar.
2003.
27. Xing Jin, W.-P. Ken Yiu, S.-H. Gary Chan, and Yajun Wang, On Maximizing Tree
Bandwidth for Topology-Aware Peer-to-Peer Streaming, IEEE Transactions on
Multimedia, Vol. 9, No. 8, pp. 1580–1592, Dec. 2007.
28. Kanchana Silawarawet and Natawut Nupairoj, Locality-Aware Clustering Application
Level Multicast for Live Streaming Services on the Internet, Journal of Information
Science and Engineering, Vol. 27, No. 1, pp. 319–336, 2011.
29. Tien Anh Le and Hang Nguyen, Application-Aware Cost Function and Its Performance
Evaluation over Scalable Video Conferencing Services on Heterogeneous Networks,
in Proceedings of the IEEE Wireless Communications and Networking Conference:
Mobile and Wireless Networks (WCNC 2012 Track 3 Mobile and Wireless),
pp. 2185–2190, Apr. 2012, Paris, France
30. Ryohei Banno, Tomoyuki Fujino, Susumu Takeuchi, and Michiharu Takemoto,
SFB: A Scalable Method for Handling Range Queries on Skip Graphs, IEICE
Communications Express, Vol. 4 (2015), No. 1, pp. 14–19, Feb. 2015.
31. Min Cai and Kai Hwang, Distributed Aggregation Algorithms with Load-Balancing
for Scalable Grid Resource Monitoring, in Proceedings of the 21st IEEE International
Parallel and Distributed Processing Symposium (IPDPS 2007), pp. 1–10, Mar. 2007,
Long Beach, California
32. Ion Stoica, Robert Morris, David Liben-Nowell, David R. Karger, M. Frans Kaashoek,
Frank Dabek, and Hari Balakrishnan, Chord: A Scalable Peer-to-Peer Lookup Protocol
for Internet Applications, IEEE/ACM Transactions on Networking, Vol. 11, No. 1,
pp. 17–32, Feb. 2003.
Part II
Data Design and Analysis
6 Big Data Analysis
and Management
in Healthcare
R. Dhaya and M. Devi
King Khalid University
CONTENTS
6.1 Introduction................................................................................................... 128
6.2 Preliminary Studies....................................................................................... 130
6.3 Healthcare Data............................................................................................. 132
6.4 Need of Big Data Analytics in Healthcare.................................................... 135
6.5 Challenges in Big Data Analysis in Healthcare............................................ 137
6.5.1 Capture.............................................................................................. 137
6.5.2 Cleaning............................................................................................. 137
6.5.3 Storage............................................................................................... 138
6.5.4 Security.............................................................................................. 138
6.5.5 Stewardship........................................................................................ 138
6.5.6 Querying............................................................................................ 139
6.5.7 Reporting........................................................................................... 139
6.5.8 Visualization...................................................................................... 139
6.5.9 Updating............................................................................................ 140
6.5.10 Sharing............................................................................................... 140
6.6 Collection of Healthcare Data....................................................................... 140
6.6.1 Importance in Healthcare Data Collection........................................ 141
6.6.2 Complications and Clarifications of Healthcare Data Collection..... 142
6.6.3 Current Data Collection Methods...................................................... 143
6.6.4 Advanced Data Collection Tools....................................................... 144
6.6.5 Healthcare Data Standards................................................................ 145
6.6.6 Inferences of Patient Data Collection in Healthcare......................... 146
6.7 Analysis of Healthcare Data.......................................................................... 148
6.8 Healthcare Data Management....................................................................... 150
6.8.1 Big Data and Care Management........................................................ 150
6.8.2 Advantages of Healthcare Data Management................................... 151
6.1 INTRODUCTION
The promise of big data has received great attention in healthcare, as it supports
research into drug discovery, treatment development, personalized
medicine, and optimal patient care that can reduce costs and improve patient
outcomes. Billions of dollars have been invested to capture data in large
initiatives that are frequently disconnected from one another. The Continuously Learning
Healthcare System is also being promoted by the Institute of Medicine to close the
gap between scientific discovery, patient and clinician engagement, and clinical
practice. However, the promise of big data has not yet been realized
to its potential, as the mere availability of data does not translate into knowledge
or clinical practice. In addition, because of the variety in data complexity
and structure, the unavailability of computational technologies, and concerns about
sharing private patient data, few large clinical data sets are, as a rule, made
accessible to researchers [1].
The idea of “big data” is not new; the manner in which it is defined is
continually evolving. Various attempts at defining big data essentially
describe it as a collection of data elements whose size, speed, type,
and/or complexity require one to seek, adopt, and develop new
hardware and software systems in order to effectively store, analyze,
and visualize the data [2]. Healthcare is a prime example of how the three V’s of
data – velocity, variety, and volume – are an inherent part of the data it produces.
This data is spread among numerous healthcare systems, health insurers,
researchers, government entities, etc. Moreover, each of these data
repositories is siloed and inherently incapable of providing a platform for global data
transparency. In addition to the three V’s, which are shown in Figure 6.1, the
veracity of healthcare data is likewise critical for its meaningful use towards
generating valuable research.
6.2 PRELIMINARY STUDIES
The relevant research papers on big data in healthcare applications have been sur-
veyed in various streams that include basics of big data in healthcare, collection of
data, and analysis of data and types.
Nada Elgendy and Ahmed Elragal (2014) analyzed various analytic methods and
tools that can be applied to big data, as well as the opportunities created by using
big data analytics in several decision-making areas. M. D. Anto Praveena and B.
Bharathi (2017) provided an overview of big data analytics, issues, challenges,
and the different technologies that pertain to big data. E. A. Archenaa and Mary
Anita (2015) gave an insight into how added value can be extracted from the informa-
tion produced by healthcare and public sector offices; however, without genuine data
analytic methods, these data will be of no use. Linnet Taylor, Ralph Schroeder, and
Eric Meyer (2014) outlined applications including the prediction and “nowcasting” of
economic trends and the mapping and prediction of influence in the context of marketing
or labor market data.
C. Lakshmi and V. Nagendra Kumar (2016) presented the meaning and explana-
tion of big data and defined the big data challenges. They also presented a proper
framework to decompose big data systems into four sequential modules, namely data
generation, data acquisition, data storage, and data analytics, to form a big data value
chain. Ticiana L. Coelho da Silva, Regis P. Magalhaes, et al. (2018) aimed at support-
ing organizations in choosing the technologies or platforms best suited to their
analytic processes by providing a short review organized by categories of Big Data
problems: processing (streaming and batch), storage, data integration, analytics,
data governance, and monitoring. Tasleem Nizam and Syed Imtiyaz Hassan (2017)
presented the architecture of various technologies that can be applied for handling
Big Data, in addition to the applications of Big Data system
representation. A McKinsey survey (2012) of many enterprises found that organizations
know big data is something they must handle, but the issue they now face is how to apply it
effectively. Jafar Raza Alam, Asma Sajid, Ramzan Talib, and Muneeb Niaz (2014)
studied the key reasons why organizations have not yet started planning to
implement and execute a big data strategy, namely that they do not understand enough
about big data and its benefits.
Suriya Begum and Kavya Sulegaon (2016) discussed the analysis of Big Data
Analytics concepts and the existing techniques and tools like Hadoop for data secu-
rity. Martin Sarnovsky, Peter Bednar, and Miroslav Smatana (2018) designed a
framework that enlarges the concept of Healthcare Industrial internet of things (IoT)
for principles of edge and cognitive computing to improve the quality of healthcare,
improve security, and minimize costs for cloud services and network traffic in future
health IoT environments. Parupudi Aditya Vineet Kumar (2018) classified
the literature according to the phases of design science in which big
data analytics can best be applied, covering phases that had not been elaborated
properly.
Quek Kia Fatt and Amutha Ramadas (2018) discussed the applications and chal-
lenges of applying big data in healthcare, especially in
relation to privacy, security, standards, governance, integration of data, data accom-
modation, data classification, incorporation of technology, etc. Gesundheit Österreich
Forschungs- und Planungs GmbH (2016) studied Big Data in Public Health,
Telemedicine, and Healthcare, which helps to determine applicable examples of the
use of Big Data in health and create recommendations for their implementation in
the European Union. Ashwin Belle, Raghuram Thiagarajan, et al. (2015) discussed
about the major challenges involved, with a focus on three upcoming and promising
areas of medical research: image, signal, and genomic-based analytics. Potential
areas of research within this domain that have the provision to a meaningful effect
on healthcare delivery are also analyzed. Sanskruti Patel and Atul Patel (2016) fun-
damentally studied the effect of implementing big data solutions in the healthcare
sector, the potential opportunities, challenges and available areas, and tools to imple-
ment big data analytics.
Revanth Sonnati (2017) focused on the analysis of data and the strengths and
drawbacks compared with the conventional techniques available. Lidong Wang and
Cheryl Ann Alexander (2015) introduced the Big Data concept and characteristics,
healthcare data, and some vital issues of Big Data. Muhammad Umer Sarwar et al.
(2017) surveyed the Big Data Analytics in healthcare and highlighted the limita-
tions of existing machine learning algorithms that are outlined for big data analysis
in healthcare. R. Hermon and P. A. Williams (2014) used a systematic review
methodology to provide a classification of big data use in healthcare; their
results show that the natural classification is not based on clinical application but
rather falls into four broad categories: administration and delivery, clinical deci-
sion support (CDS), consumer behavior, and support services. W. Raghupathi and V.
Raghupathi (2014) addressed what biomedical and health analysts work-
ing in analytics need to know about Big Data.
P. Groves, B. Kayyali, D. Knott, and S. Van Kuiken (2013) analyzed the effective
tools used for visualization of big data and the implementation of new visualization
tools in the healthcare industry to understand the processes and use of
big data in healthcare management.
6.3 HEALTHCARE DATA
Various use cases in healthcare are appropriate for big data. Some academic or
research-centered healthcare institutions are either experimenting with
big data or utilizing it in cutting-edge research projects. Those organi-
zations draw upon data scientists, analysts, and graduates to wrangle the
complexities of big data. In the following sections, we address a portion of
those complexities and what is being done to streamline big data and make it more
accessible.
In healthcare, we do have huge volumes of data coming in. EMRs (electronic
medical records) alone gather immense amounts of data. The greater part of that data
is gathered for recreational purposes, as indicated by Brent James of Intermountain
Healthcare. However, neither the volume nor the velocity of data in healthcare
is really high enough to require big data today. Our work with health systems
demonstrates that only a small portion of the tables in an EMR database is relevant to
the current practice of medicine and the analytics associated with its use
cases. Accordingly, by far most of the data accumulation in healthcare today
could be viewed as recreational. Although that data may gain value down the road
as the number of use cases grows, there are not many genuine
use cases for much of that data today.
There is certainly variety in the data, yet most systems fundamentally
gather the same data objects with only occasional changes to the model. Still, new use
cases supporting genomics will absolutely require a big data approach. Figure 6.3
shows the types of healthcare data: electronic healthcare data (EHD),
administrative reports, claims data, patient or disease registry data, health survey
data, and clinical trial data.
One type of healthcare data is EHD, which is saved as an electronic health
record (EHR), a digital version of a patient’s paper chart. EHRs
are real-time, patient-centered records that make data available instantly and securely
to authorized users. While an EHR contains the medical and treatment histories
of patients, an EHR system is built to go beyond the standard clinical information
gathered in a provider’s office and can encompass a broader view
of a patient’s care. EHRs can contain a patient’s medical history,
diagnoses, medications, treatment plans, immunization dates, allergies, radiology
images, and laboratory and test results; enable access to evidence-based tools that
providers can use to make decisions about a patient’s care; and automate
and streamline provider workflow. One of the key features of an EHR is that
health data can be created and managed by authorized providers in a digital
format capable of being shared with other providers across more than
one healthcare organization. EHRs are built to share data with other healthcare providers
and organizations, such as laboratories, specialists, medical imaging facilities,
pharmacies, emergency facilities, and school and workplace clinics, and so they
contain data from all clinicians involved in a patient’s care. The data
set in Table 6.1 presents the age-adjusted death rates for the ten leading causes of
death in the United States from 1999. (Courtesy: NCHS—Leading Causes of Death:
United States, www.healthdata.gov [5].) The objective of presenting this data set is
to categorize the disease name, reason for the disease, year of happenings, place in
which the disease falls, number of deaths, and age-adjusted death rate to analyze
sample healthcare data logically.
Data is based on information from all resident death certificates filed in the
50 states and in the District of Columbia using demographic and medical char-
acteristics. Age-adjusted death rates (per 100,000 population) are based on
the 2000 U.S. standard population. Populations used for computing death rates
after 2010 are postcensal estimates based on the 2010 census, estimated as of July
1, 2010. Rates for census years are based on populations enumerated in the cor-
responding censuses. Rates for noncensus years before 2010 are revised using
updated intercensal population estimates and may differ from the rates previously
published.
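For readers unfamiliar with the calculation, a minimal sketch of direct age adjustment consistent with the description above follows: the age-adjusted rate is a weighted average of age-specific death rates, with weights taken from a standard population. The age groups, rates, and weights below are illustrative assumptions only, not the actual 2000 U.S. standard population distribution.

```python
age_specific = {          # deaths per 100,000 in each age group (hypothetical)
    "0-24": 60.0,
    "25-44": 150.0,
    "45-64": 700.0,
    "65+": 4500.0,
}
standard_weights = {      # share of the standard population in each age group (hypothetical)
    "0-24": 0.35,
    "25-44": 0.30,
    "45-64": 0.22,
    "65+": 0.13,
}

# Direct standardization: weight each age-specific rate by the standard population share.
age_adjusted_rate = sum(age_specific[g] * standard_weights[g] for g in age_specific)
print(f"Age-adjusted death rate: {age_adjusted_rate:.1f} per 100,000")
```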
TABLE 6.1
Sample Healthcare Data Set
Year  Cause Name (with ICD-10 codes)  Cause Name (short)  State  Deaths  Age-Adjusted Death Rate
2016 Accidents (unintentional injuries) Unintentional Alabama 2755 55.5
(V01-X59,Y85-Y86) injuries
2016 Accidents (unintentional injuries) Wyoming 371 61.9
(V01-X59,Y85-Y86)
2013 Alzheimer’s disease (G30) Alzheimer’s Alaska 72 18.9
2012 Alzheimer’s disease (G30) disease 102 27.2
2002 Alzheimer’s disease (G30) Delaware 128 16.7
2016 Alzheimer’s disease (G30) District of 120 18.3
2012 Alzheimer’s disease (G30) Columbia 129 20.5
2009 Alzheimer’s disease (G30) Rhode 321 21.7
2008 Alzheimer’s disease (G30) Island 359 24.4
2007 Alzheimer’s disease (G30) 328 22.2
2006 Alzheimer’s disease (G30) 297 20.4
2005 Alzheimer’s disease (G30) 298 20.9
2004 Alzheimer’s disease (G30) 283 20.1
2003 Alzheimer’s disease (G30) 303 22
2004 Malignant neoplasms (C00-C97) Cancer Vermont 1212 179.5
2003 Malignant neoplasms (C00-C97) 1210 183.6
2002 Malignant neoplasms (C00-C97) 1224 188.3
2001 Malignant neoplasms (C00-C97) 1249 196.4
2000 Malignant neoplasms (C00-C97) 1240 198
1999 Malignant neoplasms (C00-C97) 1255 203.7
2016 Malignant neoplasms (C00-C97) Virginia 15027 156.1
2014 Malignant neoplasms (C00-C97) 14749 161.5
2013 Malignant neoplasms (C00-C97) 14414 162
2011 Malignant neoplasms (C00-C97) 14376 170.8
2016 Chronic lower respiratory diseases CLRD Idaho 865 45.7
(J40-J47)
2015 Chronic lower respiratory diseases 843 46.3
(J40-J47)
2014 Chronic lower respiratory diseases 819 45.8
(J40-J47)
2013 Chronic lower respiratory diseases 808 46.7
(J40-J47)
2012 Chronic lower respiratory diseases 754 46
(J40-J47)
2011 Chronic lower respiratory diseases 824 51.5
(J40-J47)
2010 Chronic lower respiratory diseases 727 47
(J40-J47)
(Continued)
TABLE 6.2
HCUP Summary Statistics Report: KID 2016—Core File Means
of Continuous Data Elements
Variable/Label N N Miss Min. Max. Mean Std. Dev.
HOSP_KID: KID hospital 3,117,413 0 10,001 40,944 27,305.14 10,147.02
number
AGE: Age in years at 3,117,413 0 0.00 20.00 5.88 7.69
admission
AGE_NEONATE: Neonatal 1,681,361 1,436,052 0.00 1.00 0.87 0.34
age (first 28 days after birth)
indicator
AMONTH: Admission month 3,116,797 616 1.00 12.00 6.51 3.48
AWEEKEND: Admission day 3,117,403 10 0.00 1.00 0.21 0.41
is a weekend
DIED: Died during 3,113,802 3,611 0.00 1.00 0.01 0.07
hospitalization
DISCWT: KID discharge 3,117,413 0 0.86 24.44 2.01 2.37
weight
DISPUNIFORM: Disposition 3,113,802 3,611 1.00 99.00 1.34 1.87
of patient (uniform)
DQTR: Discharge quarter 3,116,797 616 1.00 4.00 2.50 1.13
DRG: DRG in effect on 3,117,413 0 1.00 999.0 660.53 250.43
discharge date
DRGVER: DRG grouper 3,117,413 0 33.00 34.00 33.25 0.44
version used on discharge date
DRG_NoPOA: DRG in use on 3,117,413 0 1.00 999.0 660.37 250.42
discharge date, calculated
without POA
DXVER: Diagnosis version 3,117,413 0 10.00 10.00 10.00 0.00
ELECTIVE: Elective versus 3,108,222 9,191 0.00 1.00 0.12 0.33
nonelective admission
FEMALE: Indicator of sex 3,116,216 1,197 0.00 1.00 0.52 0.50
HCUP_ED: HCUP emergency 3,117,413 0 0.00 4.00 0.42 0.86
department service indicator
HOSP_REGION: Region of 3,117,413 0 1.00 4.00 2.66 1.01
hospital
I10_HOSPBRTH: ICD- 3,117,413 0 0.00 1.00 0.42 0.49
10-CM indicator of birth in
this hospital
I10_NDX: ICD-10-CM 3,117,413 0 0.00 30.00 5.33 3.91
number of diagnoses on this
record
I10_NECAUSE: ICD-10-CM 3,117,413 0 0.00 4.00 0.12 0.49
number of external cause
codes on this record
Courtesy: www.hcup-us.ahrq.gov/db/nation/kid/kidsummarystats.jsp#2016.
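A summary table of this kind can be reproduced with a few lines of analysis code. The sketch below is hypothetical (the DataFrame and its column names stand in for an actual HCUP core file) and shows how the N, N Miss, Min., Max., Mean, and Std. Dev. columns of Table 6.2 might be computed with pandas.

```python
import pandas as pd

# Hypothetical stand-in for a loaded core file; real KID files have millions of rows.
df = pd.DataFrame({
    "AGE": [0, 2, 17, None, 5],
    "AMONTH": [1, 12, 6, 7, None],
})

summary = pd.DataFrame({
    "N": df.count(),            # non-missing observations per variable
    "N Miss": df.isna().sum(),  # missing observations per variable
    "Min.": df.min(),
    "Max.": df.max(),
    "Mean": df.mean(),
    "Std. Dev.": df.std(),
})
print(summary)
```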
6.5.1 Capture
All information originates from somewhere; unfortunately for many healthcare
providers, it does not always originate from somewhere with sound
data governance habits. Capturing information that is clean, complete,
accurate, and formatted correctly for use in different systems is a continu-
ous fight for organizations. Poor EHR usability, tangled workflows, and an
inadequate understanding of why big data is important to capture well
can all contribute to quality issues that will plague the information throughout
its life cycle.
6.5.2 Cleaning
Healthcare providers are intimately familiar with the importance of
cleanliness in the clinic. Dirty data can quickly derail a big data
analytics project, particularly when bringing together disparate data sources
that may record clinical or operational elements in slightly different formats.
Data cleaning, otherwise known as cleansing or scrubbing, ensures
that data sets are accurate, correct, consistent, relevant, and not corrupted in any way. While
most data cleaning processes are still performed manually, some IT vendors
do offer automated scrubbing tools that use logic rules to compare,
contrast, and correct large data sets to guarantee high levels of accuracy and
integrity in healthcare data warehouses.
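A small, hypothetical example of such rule-based scrubbing is sketched below: identifiers are normalized, categorical values recorded in different formats are harmonized, impossible measurements are flagged, and duplicate records are dropped. The table and column names are invented for illustration.

```python
import pandas as pd

records = pd.DataFrame({
    "patient_id": [" 001", "002", "002", "003"],
    "sex": ["F", "female", "female", "M"],
    "weight_kg": [65.2, None, 70.1, -1.0],   # -1.0 is an obvious recording error
})

records["patient_id"] = records["patient_id"].str.strip()        # normalize identifiers
records["sex"] = records["sex"].str.upper().str[0]                # harmonize to 'F'/'M'
records.loc[records["weight_kg"] <= 0, "weight_kg"] = None        # flag impossible values
records = records.drop_duplicates(subset=["patient_id"], keep="last")
print(records)
```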
6.5.3 Storage
As the volume of healthcare data increases exponentially, some provid-
ers are no longer able to manage the costs and impacts of on-premise data
centers. Cloud storage is becoming an increasingly popular alternative as
costs drop and reliability grows. Nearly 90 percent of healthcare
organizations are using some form of cloud-based health IT infrastructure, including
storage and applications. The cloud offers nimble disaster recovery, lower
up-front costs, and easier expansion, although organizations
must be extremely careful about choosing partners that understand the
importance of the Health Insurance Portability and Accountability Act (HIPAA) and
other healthcare-specific compliance and security issues.
6.5.4 Security
Data security is the top priority for healthcare organizations, particularly in
the wake of a rapid-fire series of high-profile breaches, hackings, and ransom-
ware episodes. From phishing attacks to malware to laptops accidentally
left in a taxi, healthcare data is subject to a nearly endless array of vulnerabili-
ties. The HIPAA Security Rule includes an extensive list of technical
safeguards for organizations storing Protected Health Information (PHI), including
transmission security, authentication protocols, and controls over access, integrity,
and auditing. In practice, these safeguards translate into a common-sense security
strategy [8], for example, using up-to-date antivirus soft-
ware, setting up firewalls, encrypting sensitive data, and using multifactor
authentication. Healthcare organizations should frequently remind their staff mem-
bers of the critical nature of data security protocols and consistently review [9–10]
who has access to high-value data assets to keep malicious parties from
causing harm.
6.5.5 Stewardship
Healthcare data, particularly on the clinical side, has a long shelf
life. Even where it is not required to keep data accessible for something like 6 years,
providers may wish to use de-identified data sets for research projects, which
makes ongoing stewardship and curation a vital concern. Data may likewise be
reused or reexamined for other purposes, for example, quality measurement or perfor-
mance benchmarking. It is difficult to know when the data was created, by whom,
and for what reason, as well as who has previously used the data, why,
and how. Creating complete, accurate, and up-to-date metadata is a key part of an effective data stewardship strategy.
6.5.6 Querying
Additionally, robust metadata and solid stewardship protocols make it simpler for
organizations to query their data and to find the answers they are looking for. The ability
to query data is fundamental for reporting and analytics, yet healthcare organi-
zations must commonly overcome various difficulties before they can engage in
meaningful analysis of their big data assets. First, they must
overcome data silos and interoperability issues that keep query tools from
accessing the organization’s entire repository of data. If differ-
ent components of a data set are held in different walled-off systems or in different
formats, it may not be possible to create a complete picture of an organization’s
status or an individual patient’s health. Many organizations use Structured
Query Language (SQL) to dive into large data sets and relational databases, but
it is only effective when a user can first trust the accuracy, completeness,
and standardization of the data at hand [11].
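A minimal, hypothetical illustration of such a query is shown below, using Python's built-in sqlite3 module; the table and column names are invented for the example, and the point stands only if the underlying data is trustworthy, as the text emphasizes.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE encounters (patient_id TEXT, dept TEXT, los_days REAL)")
con.executemany(
    "INSERT INTO encounters VALUES (?, ?, ?)",
    [("001", "cardiology", 3.0), ("002", "cardiology", 5.5), ("003", "oncology", 2.0)],
)

# Average length of stay per department.
for dept, avg_los in con.execute(
    "SELECT dept, AVG(los_days) FROM encounters GROUP BY dept"
):
    print(dept, round(avg_los, 2))
```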
6.5.7 Reporting
After providers have nailed down the query process, they must produce a report
that is clear, concise, and accessible to the intended audience. Once again, the accuracy
and integrity of the data have a critical downstream effect on the accuracy and reli-
ability of the report. Poor data at the start will produce suspect reports at
the end of the process, which can hinder clinicians attempting to use the
data to treat patients. Providers should also understand the difference between
“analysis” and “reporting.” Reporting is often a prerequisite for analysis, since the
data must be extracted before it can be examined, but reporting can
also stand on its own as an end product. While some reports may be
geared towards highlighting a particular trend or arriving at a specific conclusion to support
an explicit action, others must be presented in a way that allows the reader to draw
his or her own inferences about what the full range of data implies. Providers
have various options for meeting these different needs, including qualified
registries, reporting tools built into their electronic health records,
and web-based interfaces hosted by the Centers for Medicare & Medicaid Services
(CMS) and other groups.
6.5.8 Visualization
At the point of care, clean and compelling data visualization can make it much
easier for a clinician to absorb and use information properly. Color coding is a
popular data visualization technique that typically produces an immediate response;
for instance, red, yellow, and green are universally understood to mean stop, caution, and go.
Organizations should also consider good data presentation practices, for exam-
ple, charts that use appropriate proportions to illustrate contrasting figures and
correct labeling of data to reduce potential confusion [12]. Convoluted flowcharts,
cramped or overlapping text, and low-quality graphics can frustrate and annoy recipi-
ents, leading them to ignore or misinterpret data.
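The color-coding idea can be illustrated with a few lines of plotting code. The sketch below is hypothetical (the patients, thresholds, and values are invented) and simply maps values inside, near, and outside a reference range to green, yellow, and red so that status can be read at a glance.

```python
import matplotlib.pyplot as plt

patients = ["A", "B", "C", "D"]
systolic_bp = [118, 132, 147, 165]          # mmHg (made-up values)

def status_color(value):
    if value < 130:
        return "green"    # within range
    if value < 150:
        return "yellow"   # caution
    return "red"          # alert

plt.bar(patients, systolic_bp, color=[status_color(v) for v in systolic_bp])
plt.axhline(130, linestyle="--", linewidth=1)   # reference threshold
plt.ylabel("Systolic blood pressure (mmHg)")
plt.title("Color-coded status per patient")
plt.show()
```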
6.5.9 Updating
Healthcare data is not static, and most elements will require relatively frequent
updates in order to remain current and relevant. For some data sets, such
as patient vital signs, these updates may occur every few seconds. Other data,
such as home address or marital status, may only change a few times during
a person’s entire lifetime. Understanding the volatility of big data, or how regu-
larly and to what degree it changes, can be a challenge for organizations that do not
consistently monitor their data assets. Providers must have a clear idea of which
data sets require manual updating, which can be automated, how to complete this
process without downtime for end users, and how to ensure that updates can
be conducted without harming the quality or integrity of the data set. Organizations
should also ensure that they are not creating unnecessary duplicate records when
updating a single element, which may make it difficult for clinicians to access
the data they need for patient decision-making [13]. The challenges of big data in
healthcare are listed in Table 6.3.
6.5.10 Sharing
Few providers work in a vacuum, and fewer patients receive all of their
care at a single location. This means that sharing data with external
partners is essential, particularly as the industry moves towards population health
management and value-based care. Data interoperability is a
perpetual concern for organizations of all types, sizes, and positions along the
data maturity spectrum. Major differences in the way electronic health
records are designed and implemented can severely limit the ability to move
data between organizations, often leaving clinicians without the data they need
to make key decisions, follow up with patients, and develop strategies to
improve overall outcomes. The industry is currently endeavoring to improve the
sharing of data across technical and organizational boundaries [14].
TABLE 6.3
The Challenges of Big Data in Healthcare
Domain Challenges
Capture Capturing data that is clean, complete, accurate, and formatted correctly for use in
different systems is a continuous fight for organizations
Cleaning Cleansing or scrubbing ensures that data sets are accurate, correct, consistent, relevant,
and not corrupted in any way
Storage Cloud storage is becoming an increasingly popular alternative as costs drop and
reliability grows
Nearly 90% of healthcare organizations are using some form of cloud-based health IT
infrastructure, including storage and applications
Security The top priority for healthcare organizations, particularly in the wake of a rapid-fire
series of high-profile breaches, hackings, and ransomware episodes
Stewardship A data steward can guarantee that all elements have standard definitions and formats,
are documented appropriately from creation to deletion, and remain useful for the
tasks at hand
Querying The ability to query data is fundamental for reporting and analytics, yet healthcare
organizations must commonly overcome various difficulties before they can engage
in meaningful analysis of their big data assets
Visualization Convoluted flowcharts, cramped or overlapping text, and low-quality graphics can
frustrate and annoy recipients, leading them to ignore or misinterpret data. Common
examples of data visualizations include heat maps, bar charts, pie charts,
scatterplots, and histograms, all of which have their own particular uses to illustrate
concepts and data
Updating Understanding the volatility of big data, or how regularly and to what degree it
changes, can be a challenge for organizations that do not consistently monitor their
data assets
Sharing Sharing data with external partners is essential, particularly as the industry moves
towards population health management and value-based care
Reporting Reporting is often a prerequisite for analysis; the data must be extracted before it can
be examined
1. Sample versus census data: some data are collected for the entire population to which they apply; such data are sometimes referred to as census data. One example is the actual decennial census, which aims to obtain counts by geographic area and basic demographic characteristics for the entire resident population of the United States. However, the term census may be used to refer to any data collection covering every unit in the population. On the other hand, much data cannot be gathered for the entire population without excessive cost and burden on respondents. Instead, the data is collected from a subset of the population, or a sample, selected in a way that makes it representative of the entire population; in this way, estimates can be computed from the sample that approximate the values for the entire population.
TABLE 6.4
Healthcare Data Collection Complications and Clarifications
Domain: Poor data quality. Complications: data collection is the underlying quality issue; it can be less productive with regard to producing genuine data. Clarifications: use a Spreadmart to produce the essential data; this procedure nevertheless requires confirmation that the received data matches the MRNs.
Domain: Complex relationships. Complications: Excel spreadsheets are the easiest way to collect and maintain self-created data; however, this creates another data silo. Clarifications: locking the records and granting access only to the responsible individual may seem appropriate; as each individual depends on a specific schedule, this may result in inefficient use of working time.
Domain: Data security. Complications: depends on data collection in nursing and other healthcare fields. Clarifications: use cloud-based platforms; require customized training for staff members.
Some examples of health data standards are listed in Table 6.5. (Courtesy:
www.searo.who.int/entity/health_situation_trends/topics/health_data_standards/.)
TABLE 6.5
Some of the Examples of Health Data Standards

International classification of diseases (ICD)
• International standard diagnostic classification for all general epidemiological purposes, many health management purposes, and clinical use

LOINC—Logical observation identifiers names and codes
• A universal code system for identifying laboratory and clinical observations
• LOINC has standardized terms that enable exchange and aggregation of electronic health data from many independent systems, and a database that contains the usual categories of chemistry, hematology, serology, microbiology, and toxicology, as well as categories for drugs and cell counts, antibiotic susceptibilities, and more

HL7—Health level seven
• HL7 is an international community that provides a framework for the exchange, integration, sharing, and retrieval of electronic healthcare information
• It is a coordinated message-based connection between two systems that allows reliable information exchange between application programs, and it includes patient demographics

SNOMED—Systematized nomenclature of medicine
• It is designed as a comprehensive nomenclature of clinical medicine for the purpose of accurately storing and/or retrieving records of clinical care in human and veterinary medicine
• It provides the core general terminology for the EHR and contains active concepts with unique meanings and formal logic-based definitions organized into hierarchies
• It offers a standards-based foundation for different functions, e.g., collection of a variety of clinical information, linkage to clinical knowledge bases, information retrieval, data aggregation, analyses, exchange, etc.
• It contributes to the improvement of patient care by underpinning the development of systems that accurately record healthcare encounters and deliver decision support

SDMX—Statistical data and metadata exchange
• SDMX provides standard formats for data and metadata, together with content guidelines, and an IT architecture for the exchange of data and metadata
• SDMX-HD (Health Domain) is a WHO implementation of the SDMX standard that allows medical facilities to share and exchange medical indicators and metadata between medical organizations
At the present time, however, more than just medicine is evolving. The manner
in which we interact with our doctors and the manner in which our doctors
interact with one another is redefining what patient care looks like. Doctors
are available on demand, medical records can be accessed without waiting
for a courier, and the amount of information available to the general public is growing
constantly [20]. Concerns about security exist alongside the benefits of
expanded access, and medical facilities can be vulnerable to a variety
of cyberattacks.
health. Although speaking in such broad terms can make the process seem far
removed, evidence-based medical treatment only stands to benefit from having
more data informing best practices. As the database grows [23], the effects will trickle down
to individual patient outcomes through teams of coordinated care providers.
However, as the data repository grows, it also becomes more
vulnerable, and responsibility for it is currently in transition.
The three other primary areas of big data analytics in healthcare, apart from those discussed earlier, are:

the ability to identify any potential health issues and alert a care manager to
intervene. For instance, if a patient's blood pressure spikes, the system will send an alert
in real time to a care manager, who can then work with the patient to bring the
blood pressure back into a healthy range. Big data is the only realistic way
of handling the volume, velocity, and variety of this sensor data.
6.9.4 Real-Time Warning
Other examples of big data analytics in healthcare share one crucial functionality:
real-time alerting. In hospitals, CDS software analyzes medical data on the spot,
providing health professionals with advice as they make prescriptive decisions.
However, doctors want patients to stay out of hospitals and avoid costly in-house
treatments. Personal analytics devices, already trending as business
intelligence buzzwords, have the potential to become part of a new strategy.
Wearable devices will gather patients' health data continuously and send this data to the
cloud. In addition, this data will feed a database on the health condition of
the general population, which will enable doctors to examine the data in its socioeconomic
context and adjust delivery strategies accordingly. Organizations and care
managers will use sophisticated tools to monitor this massive data stream and
will react whenever the results are disturbing. For example, if a patient's blood
pressure rises alarmingly, the system will send an alert in real time
to the doctor, who will then take action to reach the patient and administer
measures to bring the pressure down. Another example is that of Asthmapolis, which
has started to use inhalers with Global Positioning System (GPS)-enabled
trackers in order to identify asthma trends, both at the individual level
and across larger populations. This data is being used in conjunction with
data from the CDC in order to provide better treatment for asthmatics.
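As a rough illustration (not part of the original description), a minimal Python sketch of such a threshold-based alert rule over incoming wearable readings is given below; the threshold values and the notify() helper are illustrative assumptions only.

SYSTOLIC_LIMIT = 180   # mmHg, illustrative alert threshold
DIASTOLIC_LIMIT = 120  # mmHg, illustrative alert threshold

def notify(recipient, patient_id, reading):
    # Placeholder for a pager/SMS/EHR notification integration.
    print(f"ALERT to {recipient}: patient {patient_id} reported {reading}")

def screen_reading(patient_id, reading, recipient="care manager"):
    # reading is a dict such as {"systolic": 185, "diastolic": 95}.
    if reading["systolic"] >= SYSTOLIC_LIMIT or reading["diastolic"] >= DIASTOLIC_LIMIT:
        notify(recipient, patient_id, reading)

screen_reading("P-001", {"systolic": 185, "diastolic": 95})  # example incoming reading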
brought largely outweigh the risks. Likewise, big data can help prevent
fraud and inaccurate claims in a systematic, repeatable way. Analytics
helps to streamline the processing of insurance claims, enabling patients to get
better returns on their claims and caregivers to be paid more quickly.
6.9.9 Telemedicine
The term telemedicine refers to the delivery of remote clinical services using
technology. It is used for primary consultations and initial diagnosis,
remote patient monitoring, and medical education for health professionals. Some
more specific uses include telesurgery, in which surgeons perform operations with the
use of robots and high-speed real-time data delivery without being physically in the
same location as the patient. Clinicians use telemedicine to provide personalized treatment
plans and prevent hospitalization or readmission. Such use of healthcare
data analytics can be linked to the use of predictive analytics
discussed earlier. It enables clinicians to predict acute medical events
in advance and prevent deterioration of patients' conditions. By keeping patients
away from hospitals, telemedicine reduces costs and improves the quality of
service. Patients can avoid waiting lines, and doctors do not waste
time on unnecessary consultations and paperwork. Telemedicine also improves
the availability of care, as a patient's condition can be monitored and consulted anywhere
and at any time.
up to create a program called PreManage ED, which shares records across emergency
departments.
6.11 CONCLUSION
The analysis and management of big data in healthcare have been presented in
this chapter. The procedure for collecting healthcare data was discussed in
the initial sections. The four categories of big data analytics were
illustrated in the second part, covering the descriptive, diagnostic, predictive,
and prescriptive types. The third part of this chapter was dedicated
to the management of big data in healthcare and its integration with IoT, which
involves the volume, velocity, and variety of data. The chapter also covered
a variety of applications, such as EHRs, augmenting patient engagement, predictive
analysis in healthcare, and telemedicine. In addition, the challenges of big
data in healthcare were analyzed in terms of cleaning, storage, security,
stewardship, querying, reporting, visualization, updating, and sharing. As a concluding
remark, the use of machine learning could be the best solution for improving
patient care along with healthcare information.
REFERENCES
1. N. Elgendy and A. Elragal, “Big data analytics: A literature review paper”, Lecture
Notes in Computer Science, Industrial Conference on Data Mining, Springer
International Publishing, Vol. 8557, 2014, pp. 214–227.
2. M. D. Anto Praveena and B. Bharathi, “A survey paper on big data analytics,” 2017
International Conference on Information Communication and Embedded Systems
(ICICES), 2017, pp. 1–7, Chennai, India
3. C. Lakshmi and V. V. Nagendra Kumar, “Survey paper on big data,” International
Journal of Advanced Research in Computer Science and Software Engineering, Vol. 6,
No. 8, 2016, pp. 368–381.
4. R. Hermon and P. A. Williams, “Big data in healthcare: What is it used for?” 2014.
5. Sample healthcare data set (Courtesy: NCHS - Leading Causes of Death: United
States). www.hcup-us.ahrq.gov/db/nation/kid/kidsummarystats.jsp#2016.
6. M. U. Sarwar, et al., “A survey of big data analytics in healthcare,” International Journal
of Advanced Computer Science and Applications, Vol. 8, No. 6, 2017, pp. 355–359.
7. www.searo.who.int/entity/health_situation_trends/topics/health_data_standards.
8. J. R. Alam, A. Sajid, R. Talib, and M. Niaz, “A review on the role of big data in busi-
ness,” International Journal of Computer Science and Mobile Computing, Vol. 3,
No. 4, 2014, pp. 446–453.
9. E. A. Archenaaa and M. Anitab, “A survey of big data analytics in healthcare and
government,” Procedia Computer Science, Vol. 50, 2015, pp. 408–413.
10. Q. Memon, “Smarter health-care collaborative network,” Building Next-Generation
Converged Networks: Theory and Practice, Editors: Al-Sakib K. Pathan, Muhammad
M. Monowar, Zubair M. Fadlullah, 2013, pp. 451–476, CRC Press
11. B. Ristevski and M. Chen, “Big data analytics in medicine and healthcare”, Journal of
Integrative Bioinformatics, 2018, Vol. 15, No. 3, pp. 1–5. doi:10.1515/jib-2017-0030.
12. W. Raghupathi and V. Raghupathi, “Big data analytics in healthcare: Potent potential,”
Health Information Science and Systems, Vol. 2, No. 23, 2014.
13. Q. K. Fatt and A. Ramadas, “The usefulness and challenges of big data in healthcare,”
Journal of Healthcare Communications, Vol. 3, No. 2:21, 2018, pp. 1–4.
14. Big Data Technologies in Healthcare: Needs, Opportunities and Challenges, TF7
Healthcare Subgroup, Report, 2016, pp. 1–31.
15. M. S. Islam, M. M. Hasan, X. Wang, H. D. Germack, and M. Noor-E-Alam, “A sys-
tematic review on healthcare analytics: Application and theoretical perspective of
data mining,” Journal of Healthcare (Basel), Vol. 6, No. 2, p. 54, 2018. doi:10.3390/
healthcare6020054.
16. J. R. Alam, A. Sajid, R. Talib, and M. Niaz, “A review on the role of big data in
business,” International Journal of Computer Science and Mobile Computing, Vol. 3,
No. 4, 2014, pp. 446–453.
17. I. Barbier-Feraud and J. B. Malafosse, “Big data and prevention from prediction to
Demonstration”, Report, 2016, pp. 1–80.
18. S. Kumar and M. Singh, “Big data analytics for healthcare industry: Impact, appli-
cations, and tools,” Journal of Big Data Mining and Analytics, Vol. 2, No. 1, 2019,
pp. 48–57.
19. P. A. Vineet Kumar, “The use of big data analytics in information systems research,”
2018, https://ssrn.com/abstract=3185883 or http://dx.doi.org/10.2139/ssrn.3185883.
20. S. Patel and A. Patel, “A big data revolution in health care sector: Opportunities,
challenges and technological advancements,” International Journal of Information
Sciences and Techniques (IJIST), Vol. 6, No. 1/2, 2016, pp. 155–162.
21. P. Groves, B. Kayyali, D. Knott, and S. Van Kuiken, “The big data revolution in health-
care,” McKinsey Quarterly, Vol. 2, 2013.
22. T. L. Coelho da Silva and R. P. Magalhaes, et al., “Big data analytics technologies
and platforms: A brief review,” LADaS 2018 - Latin America Data Science Workshop,
2018, pp. 25–32, Rio de Janeiro, Brazil
23. T. Nizam and S. I. Hassan, “Big data: A survey paper on big data innovation and its
technology,” International Journal of Advanced Research in Computer Science, Vol. 8,
No. 5, 2017, pp. 2173–2179.
24. S. Kumari and K. Sandhya Rani Dr., “Big data analytics for healthcare system,” 2018
IADS International Conference on Computing, Communications & Data Engineering
(CCODE), 2018, India
25. L. Taylor, R. Schroeder, and E. Meyer, “Emerging practices and perspectives on Big
Data analysis in economics: Bigger and better or more of the same?” Big Data &
Society, Vol. 1, No. 2, 2014, pp. 1–10.
26. M. Sarnovsky, P. Bednar, and M. Smatana, “Big data processing and analytics platform
architecture for process industry factories,” Big Data and Cognitive Computing, Vol. 2,
No. 1, 2018, pp. 2–18.
27. R. Sonnati, “Improving healthcare using big data analytics,” International Journal Of
Scientific & Technology Research, Vol. 6, No. 03, 2017, pp. 142–147.
28. N. El Aboudi and L. Benhlima, “Big data management for healthcare systems:
Architecture, requirements, and implementation,” Journal of Advanced Bioinformatics,
Vol. 2018, 2018. doi:10.1155/2018/4059018.
29. Gesundheit Österreich Forschungs- und Planungs GmbH, “Study on big data in
public health, telemedicine and healthcare,” Final Report, December 2016.
30. L. Wang and C. A. Alexander, “Big data in medical applications and health care,”
Current Research in Medicine, Vol. 6, No. 1, 2015, pp. 1–8.
31. A. Belle and R. Thiagarajan, et al., “Big data analytics in healthcare”, BioMed Research
International Journal, Vol. 2015, 2015, pp. 1–16.
32. E. A. Archenaaa and M. Anitab, “A survey of big data analytics in healthcare and
government,” Procedia Computer Science, Vol. 50, 2015, pp. 408–413.
33. McKinsey, “Big data, big Transformations”, McKinsey Global Survey, Minding Your
Digital Business, 2012.
34. Big Data is the Future of Healthcare, Cognizant 20-20 insights | September 2012.
7 Healthcare Analytics: A Case Study Approach
CONTENTS
7.1 Introduction and Background to the Case Study: Framingham
Heart Study.................................................................................................... 159
7.2 Literature Review.......................................................................................... 160
7.3 Introduction to the Data Analytics Framework............................................. 161
7.3.1 Step 1. Defining the Healthcare Problem.......................................... 162
7.3.2 Step 2. Explore the Healthcare Data.................................................. 162
7.3.3 Step 3. Predict What Is Likely to Happen; or Perform
Classification Analysis............................................................................. 163
7.3.4 Step 4. Check the Modeling Results.................................................. 164
7.3.5 Step 5. Optimize (Find the Best Solution)......................................... 165
7.3.6 Step 6. Derive a Clinical Strategy for Patient Care and
Measure the Outcome.............................................................................. 165
7.3.7 Step 7. Update the CDS System......................................................... 165
7.4 Data Exploration and Understanding of the Healthcare Problem................. 166
7.5 Machine-Learning Model Application.......................................................... 168
7.6 Evaluation of the Machine-Learning Model Results.................................... 169
7.7 Conclusion..................................................................................................... 171
7.8 Future Direction............................................................................................ 171
Acknowledgements................................................................................................. 172
References............................................................................................................... 172
7.2 LITERATURE REVIEW
Today’s healthcare industries are moving from a volume-based business into a value-
based business, which requires extra work from doctors and nurses to be more
productive and efficient. Efficient healthcare industries improve healthcare practice;
change individual lifestyles; drive patients to live longer; and prevent diseases, ill-
nesses, and infections [1].
CVD is one of the most common causes of death globally. Analytics is the way of
developing insight through the efficient use of data and application of quantitative and
qualitative analysis [2]. It can generate fact-based decisions for “planning, management,
measurement, and learning” purposes. The most important risk factors identified for
coronary heart disease (CHD) were age, gender, blood pressure, blood glucose, and the
number of cigarettes per day. Other studies on CHD show similar results: age, smoking
habit, history of hypertension, family history, and history of diabetes [3].
A big data analytical framework that utilizes a ubiquitous healthcare system was
established [4]. The framework analyzes vital signs extracted from accelerometers to pro-
vide healthcare services. Vital signs are continuous time-series data that are unstructured
in nature and cannot be stored adequately in traditional databases. Electrocardiogram
(ECG), respiration, and motion data have been considered as vital signs.
For the past two years, predictive analysis has been recognized as one of the
major business intelligence approaches, but its real-world applications extend far
beyond the business context. Parkland Hospital in Dallas, Texas, has launched a predictive
system that scans all patients’ details and information to identify potential
risks and outcomes. As a result, the hospital has saved more than half a million
dollars, especially in heart failure and disease prediction, by monitoring
patients and avoiding future complications [5].
The CDS system aims to increase the quality of healthcare services by enhancing
outcomes. Its primary focus is to provide the right information to
the right people and a properly customized healthcare management process, not limited
to clinical guidelines, documentation, and diagnosis [6].
Through analytics on their data, patients gain insight into the current state of
their health and can take more ownership of their healthcare. The information-sharing
mechanism increases productivity and reduces overlapping of data. Thereby, it enhances
the coordination of care. Big data will further personalize medicine by determining
the tests and treatments needed for each patient. The provision of earlier treatment
can reduce health costs and can eliminate the risk of chronic diseases [7].
being saved. Healthcare analytics applications are used for a variety of reasons,
ranging from improving patient service to improving the hospital’s capability to predict
patient heart attacks and offering valuable real-time patient insights on hospital
computer monitoring dashboards.
Today, these real-time dashboards may be used to effectively track key perfor-
mance clinical indicators with the most current information and patient data available.
These dashboards help the clinician to make key decisions in a timely manner, thereby
increasing the number of early diagnoses for a particular disease condition, reducing
patient mortality, and overall increasing the efficiency of the medical organization.
Today, many medical organizations are exploiting healthcare analytics to enable
proactive clinical decision-making; in other words, they are switching from reacting
to healthcare situations to anticipating them.
For whatever purpose healthcare analytics is applied, the key outcome is the
same: solving a healthcare problem using relevant data and turning
it into insights that provide the clinician with the knowledge he or she needs to
make clinical decisions. So what is healthcare analytics? Essentially, healthcare ana-
lytics is a seven-step process, outlined as follows.
derive actionable insights from the patient data, it is at this stage that the analyst
uses statistical hypothesis testing to compare different patient groups under different
assumptions.
To identify possible correlation between factors, the analyst will plot the patient
data using scatterplots and apply statistical methods such as correlation analysis to
identify relationships between factors.
The analyst will also perform simple regression analysis to see whether simple
predictions can assist with clinical decision-making. At this stage, the analyst is
already looking for general patterns and actionable insights to achieve the healthcare
analytics goal.
Two very important checks that need to be done before the modeling stage are as
follows:
1. Check whether two or more variables are highly correlated with each other
(multicollinearity). If two or more variables are highly correlated with
each other, only one of these variables needs to be included in the model. If
the highly correlated variables are not removed, some of the model coefficients
will have the wrong sign or extremely high or extremely low values.
In other words, some of the model coefficients will likely be unreliable if
multicollinearity is not removed.
2. Check whether the target variable classes are balanced. Class imbalance
problems are quite common in clinical applications. Let us consider an
example that demonstrates the class imbalance problem where the clinical
dataset has 5% of patients with heart disease and 95% of patients without
heart disease. The model may have an overall accuracy of close to 100%
and the specificity (which measures the ratio of the number of patients
without heart disease that are correctly identified) metric close to 100%,
while the sensitivity (which measures the ratio of the number of actual
heart disease patients that are correctly identified) metric is less than 20%.
This clearly demonstrates that the clinical model is not good because we are
more interested in a model that can
accurately predict the heart disease cases (which is the minority class in this
example).
To overcome the imbalance problem, the bagging and oversampling techniques may
be used.
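As a brief, hedged sketch of both checks (using pandas and scikit-learn; the target column name "TenYearCHD" matches the Framingham dataset used later in this chapter, and random oversampling stands in here for the more elaborate bagging and resampling schemes mentioned above):

import pandas as pd
from sklearn.utils import resample

def correlated_pairs(predictors: pd.DataFrame, threshold: float = 0.8):
    # Flag predictor pairs whose absolute Pearson correlation exceeds the threshold;
    # only one variable of each flagged pair should be kept in the model.
    corr = predictors.corr().abs()
    cols = corr.columns
    return [(cols[i], cols[j], corr.iloc[i, j])
            for i in range(len(cols)) for j in range(i + 1, len(cols))
            if corr.iloc[i, j] > threshold]

def oversample_minority(df: pd.DataFrame, target: str = "TenYearCHD") -> pd.DataFrame:
    # Randomly duplicate minority-class rows until both classes have equal size.
    majority = df[df[target] == 0]
    minority = df[df[target] == 1]
    upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)
    return pd.concat([majority, upsampled])

In practice, the oversampling would be applied to the training split only, so that the validation data keeps its original class distribution.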
Once the multicollinearity and imbalance have been removed, depending on the
healthcare analytics goal, the analyst may choose to perform predictive or classifica-
tion modeling.
1. To check whether the model results are reasonable, we need to check the
following:
• Are all the input variables statistically significant? The input variables
are statistically significant when their p-values are less than 0.05.
• Do the model coefficients have the right signs? If a coefficient has
a positive sign and the expected sign should be negative, further
investigation needs to be made. For example, the analyst needs to check
whether there are outliers in the data or whether there are variables in
the model that are highly correlated.
• Is the size of the coefficients realistic? For example, if some coefficients
are extremely small, almost zero, and some coefficients are extremely
large, this indicates that the input data needs to be scaled. This can
be done by standardizing the data to z-scores (rescaling the data to have a
mean of 0 and a standard deviation of 1, i.e., unit variance). Another method
is to normalize the data (rescaling the values into the range [0, 1]); a brief
sketch of both rescalings is given after this list.
• Is the standard deviation of the model coefficient acceptable? We some-
times find that the standard deviation of some model coefficients is
extremely large. When we have this situation, it indicates that there are
probably a few outliers for that variable, and we may need to remove the
outliers so that the coefficient estimate of the variable is more accurate
and reliable because the standard deviation will reduce and will not be
so large.
2. To check whether the model performs well, we need to use the confusion
matrix. The confusion matrix is a tool that is commonly used to evaluate
the accuracy of the prediction model.
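As referenced in the scaling bullet above, a minimal sketch of the two rescaling options (z-score standardization and min-max normalization) with scikit-learn; the DataFrame X is assumed to hold only the numeric input variables:

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

def standardize(X: pd.DataFrame) -> pd.DataFrame:
    # Rescale each column to mean 0 and standard deviation 1 (z-scores).
    return pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns, index=X.index)

def normalize(X: pd.DataFrame) -> pd.DataFrame:
    # Rescale each column into the range [0, 1] (min-max normalization).
    return pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns, index=X.index)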
Once the clinical prediction model is built, it is important for the analyst to evaluate
how good the prediction model is by testing the prediction model on unseen data
(validation dataset). Using the validation dataset, the analyst needs to compute the
overall accuracy, sensitivity, and specificity of the prediction model.
As a rule of thumb, the overall accuracy should be at least 70%.
From the confusion matrix, the analyst computes the sensitivity of the model,
also known as the true positive rate (TPR) or recall. The sensitivity of the model
measures the ratio of the number of patients with heart disease correctly identified.
The sensitivity measure should be at least 70%.
From the confusion matrix, the analyst computes the specificity, also known
as the true negative rate (TNR), which measures the ratio of the number of patients
without heart disease that are correctly identified. The specificity measure should
be at least 70%.
If the overall accuracy, sensitivity, and specificity of the model are at least 70%,
this indicates the acceptance of the clinical predictive model and confirms that the
model is performing well.
Every year, the clinical model may need to be modified because patient groups’
lifestyles change over the years, and so the clinical model needs to adapt
accordingly with more relevant input variables.
If the CDS tool is not monitored over time, it is possible that newer important
input variables for clinical decision-making become available over time, and if these
variables are not used for clinical modeling, the CDS tool will become less accurate
over time, and eventually clinical treatment will become ineffective over time.
For example, if the initial accuracy of the clinical model is 95% and, over time, this
accuracy drops to 85%, the analyst should take this as a serious signal to investigate
whether the model needs to be revamped completely and rebuilt from scratch.
It may happen that a fresh clinical model needs to be built every three years, when
the changes in patient lifestyle are much greater and the accuracy of the
clinical model has decreased so considerably that a small modification to the model
is not sufficient.
It is highly recommended that the analyst builds a data-driven clinical decision-
making system that includes trigger alerts to the analyst when unacceptable changes
to the model accuracy occur. Changes to the clinical model accuracy must be com-
municated to all clinicians and healthcare professionals. The clinicians and health-
care professionals need to consult with their patients to better understand the patient
lifestyle changes and health changes so that the analyst has better insights on how to
modify or revamp the clinical model. This communication will likely improve the
clinical model and patient treatment, resulting in better patient health.
The benefit of monitoring the CDS system is that it allows the analyst to be proac-
tive in investigating the clinical models when the accuracy of the model reduces con-
siderably. This way, the patients will receive the right treatment when their lifestyle
and health conditions change.
So, in a nutshell, an automated CDS system will enable clinicians and healthcare
professionals to be proactive and to inform the analyst through the decision support
system when patient lifestyle changes take place. The automated CDS system will
improve the communication between the analyst and clinicians, resulting in a more
effective patient care system. Analysts will receive patient treatment and health
condition updates in a timely manner, allowing the CDS system to automatically prompt
the clinician with the right treatment for the patient, at the right time, resulting in an
effective patient care system.
An initial data analysis using 4,240 patients of the framingham.csv [8] dataset
was undertaken to understand the quality of the patient health data and to correct any
errors if found. Using frequency counts, it was found that 582 (13.7%) patients had
some missing data. As missing data was only 13.7%, patients who had missing data
were excluded from the study analysis. The study now consisted of 3,658 patients
with 2,035 (56%) females and 1,623 (44%) males.
Further, the predictive model input data was standardized to enable all input vari-
ables in the model to have a fair chance of being significant. Predictive modeling is
a process that uses data mining and probability to forecast outcomes. Each model is
made up of a number of risk factors, which are variables that are likely to influence
Ten Year CHD, the target variable of interest [9]. Once data has been collected for
the relevant risk factors, a statistical model is formulated.
The dataset had four types of risk factors:
Our logistic regression model was carried out using the backward stepwise regres-
sion method. Variables with a p-value greater than 0.05 were deemed as insignificant
to “TenYearCHD” and were removed from the model. The process stopped when the
model was left with only variables with p-values less than 0.05.
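A minimal sketch of this backward elimination procedure with statsmodels is shown below; the column name "TenYearCHD" comes from the framingham.csv dataset, while the helper itself is our own reconstruction of the stepwise loop, not the authors' original code.

import pandas as pd
import statsmodels.api as sm

def backward_stepwise_logit(df: pd.DataFrame, target: str = "TenYearCHD", alpha: float = 0.05):
    # Repeatedly fit a logistic regression and drop the least significant predictor
    # until every remaining predictor has a p-value below alpha.
    predictors = [c for c in df.columns if c != target]
    while True:
        X = sm.add_constant(df[predictors])
        model = sm.Logit(df[target], X).fit(disp=0)
        pvalues = model.pvalues.drop("const")
        worst = pvalues.idxmax()
        if pvalues[worst] < alpha:
            return model, predictors
        predictors.remove(worst)

# Example usage (after dropping rows with missing values and standardizing the inputs):
# data = pd.read_csv("framingham.csv").dropna()
# model, kept = backward_stepwise_logit(data)
# print(model.summary())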
TABLE 7.1
Significant Risk Factors for CHD
Coefficient Estimate Std. Error Z-Value p-Value Odds Ratio
Intercept −2.02182 0.069 −29.216 0.000 0.132
Male 0.30334 0.063 4.797 0.000 1.354
Age 0.61417 0.662 9.276 0.000 1.848
cigsPerDay 0.22166 0.058 3.819 0.000 1.248
sysBP 0.38852 0.056 6.954 0.000 1.475
Glucose 0.20659 0.047 4.376 0.000 1.229
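For reference, each odds ratio in Table 7.1 is simply the exponentiated coefficient, OR = e^β; for example, for Age, e^0.61417 ≈ 1.848, so (with the standardized inputs noted above) an increase of one standard deviation in age multiplies the odds of TenYearCHD by roughly 1.85.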
TABLE 7.2
Confusion Matrix for Test Data (rows: actual/reference class; columns: predicted class)
Reference      Predicted 0   Predicted 1   Total
0 (no CHD)     654           276           930
1 (CHD)        58            109           167
Total          712           385           1,097
From the confusion matrix, 654 plus 109 patients were classified correctly as not
having a risk of CHD or having a risk of CHD, respectively. Hence, from Table 7.2,
the overall accuracy of the CHD risk factor model is 70%. However, 276 patients
who did not have CHD were classified as having CHD, and 58 patients who did
have CHD were predicted as not having CHD.
The sensitivity (“True Positive Rate”/“Recall”) value = True Positive / (True Positive + False Negative) = 109/(109 + 58) ≈ 65%, which means that for every 100 patients who have a TenYearCHD risk, the model is able to classify 65 of them correctly.
Similarly, the specificity (“True Negative Rate”) value = True Negative / (True Negative + False Positive) = 654/(654 + 276) ≈ 70%, which means that for every 100 patients without TenYearCHD, the model is able to classify 70 of them correctly.
Sensitivity and specificity are inversely related: when we increase the sensitivity,
the specificity decreases, and vice versa.
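These figures can be reproduced directly from the Table 7.2 counts; a minimal Python sketch (variable names are ours):

# Counts from Table 7.2 (rows = actual class, columns = predicted class).
tn, fp = 654, 276   # actual no-CHD patients
fn, tp = 58, 109    # actual CHD patients

accuracy = (tp + tn) / (tp + tn + fp + fn)   # ~0.70
sensitivity = tp / (tp + fn)                 # ~0.65 (recall / true positive rate)
specificity = tn / (tn + fp)                 # ~0.70 (true negative rate)
print(round(accuracy, 2), round(sensitivity, 2), round(specificity, 2))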
The ROC curve is plotted with the TPR on the y-axis against the false positive rate on the
x-axis, and the AUC is the area under this curve [12]. See Figure 7.1.
The AUC is a performance measurement for a classification problem at various
threshold settings. Our classification problem was to classify patients who were at
risk of a “TenYearCHD” and those who were not at risk. The AUC tells us how
much our model is capable of distinguishing between the “TenYearCHD” class and
the “Non-TenYearCHD” class. The higher the AUC value, the better the model is
at distinguishing between patients with disease and without disease. An excellent
model has an AUC value near to 1, which means it has good measure of separa-
bility. A poor model has AUC near to 0, which means it has a poor measure of
separability.
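A hedged sketch of how the ROC curve and AUC could be computed with scikit-learn; the label and probability arrays below are small placeholders standing in for the test-set labels and the fitted model's predicted probabilities.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Placeholder arrays; in the study these would be the test labels and predicted probabilities.
y_test = np.array([0, 0, 1, 0, 1, 1, 0, 1])
probs = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.65, 0.30, 0.70])

fpr, tpr, _ = roc_curve(y_test, probs)
auc = roc_auc_score(y_test, probs)

plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")  # chance diagonal
plt.xlabel("False positive rate")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()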
Further, the AUC for our classification model was 71%, demonstrating that the
model separates the two classes reasonably well.
The logistic regression model results were considered good, and hence, the train-
ing CHD risk factor model was accepted for predicting patients at risk of CHD.
7.7 CONCLUSION
The logistic regression model was used to identify which patients are likely to be at
high risk of CHD. The outcome from this study confirms first that machine-learning
techniques, such as logistic regression technique, can be used to classify patients as
high-risk CHD patients.
Further, from 15 risk factors, five were identified as CHD indicators.
Age was the most important risk factor: the older the patient, the more likely they
were to be at risk of CHD. Male patients were found to have a higher risk of CHD than
females. Systolic blood pressure was a significant factor; because systolic and
diastolic blood pressure were highly correlated, diastolic blood pressure can also
be regarded as significant. So, patients with high blood pressure
(hypertension) have a high risk of CHD. Lastly, cigsPerDay was also a significant
risk factor for CHD: the more cigarettes a patient smoked, the more likely they were
to be at risk of CHD.
7.8 FUTURE DIRECTION
Data analytical techniques, particularly machine-learning techniques, can be very
valuable for clinicians, as these techniques can accurately identify which patients
are high-risk CHD patients and require treatment that can reduce their risk of CHD
or prevent them from getting a heart attack. The coronary risk factor model can be
used by clinicians as a CDS tool (CDST). The CDST can serve as a good guide for
the clinicians when diagnosing patients by providing the required support and con-
fidence to the clinician when decisions about the patient’s risk of a heart attack have
to be made.
With the advancement of technology, clinicians can quickly update patient infor-
mation in the CDST, and the clinician will be immediately informed of the patient’s
CHD risk level. The use of a CDST will assist clinicians in making important clinical
decisions faster; hence, patient treatment can be administered earlier, preventing
or reducing the patient’s coronary heart risk.
ACKNOWLEDGEMENTS
The Framingham Heart Study is supported by Contract No. HHSN268201500001I
from the National Heart, Lung, and Blood Institute (NHLBI) with additional support
from other sources. This manuscript was not approved by the Framingham Heart
Study. The opinions and conclusions contained in this publication are solely those of
the author and are not endorsed by the Framingham Heart Study or the NHLBI and
should not be assumed to reflect the opinions or conclusions of either.
REFERENCES
1. Alkhatib M. A., Talaei-Khoei A., Ghapanchi A. H. Analysis of Research in Healthcare
Data Analytics. Australasian Conference on Information Systems, Sydney, 2015.
2. Simpao A.F., Ahumada L.M., Gálvez J.A., Rehman M.A. A review of analytics and clini-
cal informatics in health care. J. Med. Syst. 2014;38:45. doi:10.1007/s10916-014-0045-x.
3. Karaolis M., Moutiris J.A., Papaconstantinou L., Pattichis C.S. Association rule analy-
sis for the assessment of the risk of coronary heart events, Proceedings of the Annual
International Conference of the IEEE Engineering in Medicine and Biology Society,
Minneapolis, MN, USA. 3–6 September 2009, pp. 6238–6241.
4. Kim, T.W., Park, K.H., Yi, S.H., Kim, H.C. A big data framework for u-healthcare sys-
tems utilizing vital signs. Proceedings – 2014 International Symposium on Computer,
Consumer and Control, IS3C, 2014, pp. 494–497. doi:10.1109/IS3C.2014.135.
5. Jacob, S. Young parkland physician makes a splash with predictive modeling software,
D Healthcare Daily, Dallas, 2012, http://healthcare.dmagazine.com/2012/12/10/young-
parkland-physician-makes-asplash-with-predictive-modeling-software/, Retrieved 09
August, 2015.
6. Sonnati, R. Improving healthcare using big data analytics. Int. J. Sci. Tech. Res. March
2017;6(03), ISSN 2277-8616.
7. Big Data Offers Big Opportunities in Healthcare, Retrieved from www.villanovau.
com/resources/bi/big-data-healthcare-opportunities/#.VnfRArZ95kg.
8. https://courses.edx.org/asset-v1:MITx+15.071x_2a+2T2015+type@asset+block/
framingham.csv.
9. https://en.wikipedia.org/wiki/Predictive_modelling.
10. Menard, S.W., NetLibrary, I. Applied Logistic Regression Analysis. Thousand Oaks,
CA: Sage Publications.
11. Hosmer, D.W., Lemeshow, S., Sturdivant, R.X. Applied Logistic Regression (third ed.).
Hoboken, NJ: Wiley.
12. https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5.
8 Bioinformatics Analysis of
Dysfunctional (Mutated)
Proteins of Cardiac Ion
Channels Underlying
the Brugada Syndrome
Carlos Polanco
Universidad Nacional Autónoma de México
Manlio F. Márquez
Instituto Nacional de Cardiología Ignacio Chávez
Vladimir N. Uversky
University of South Florida
Thomas Buhse
Universidad Autónoma del Estado de Morelos
CONTENTS
8.1 Introduction.................................................................................................... 174
8.2 Results............................................................................................................ 175
8.2.1 Brief Description of Unique BrS-Related Proteins........................... 176
8.2.2 PIM-Based Analysis of the Unique BrS-Related Proteins................ 178
8.2.3 Intrinsic Disorder Analysis of the BrS-Related Proteins................... 178
8.2.4 Kolmogorov–Smirnov Test................................................................ 185
8.3 Discussion...................................................................................................... 185
8.4 Materials and Methods.................................................................................. 186
8.4.1 Evaluation of Polar Profile................................................................. 186
8.4.1.1 Weighting of Polar Profiles................................................. 188
8.4.1.2 Comparison of Polar Profiles.............................................. 188
8.1 INTRODUCTION
This work illustrates a possible physicochemical trait that is unique to mutated pro-
teins that underlie a malignant disease known as Brugada syndrome (BrS). We spec-
ulate that this information could be incorporated into a bionanosensor to identify the
related proteins by this notable physicochemical trait.
The electrical activity of the heart is based on the proper function of ion channels,
specific pores located mainly in the cytoplasmic membrane that permit the passage of
ions through them. In subjects who have BrS [1], specific defects in some of these
channels have been reported, resulting in a characteristic electrocardiographic pheno-
type and the possibility to generate malignant ventricular arrhythmias [2]. Mutations in
BrS-related proteins affect the physical form of these ion channels. Currently, the only
treatment that can effectively prevent this syndrome is the insertion of an automatic
implantable cardiac defibrillator, or, in subjects who already have such a device, the
recurrence of ventricular arrhythmias can be prevented with a quinidine treatment [3].
Medical studies on close relatives of subjects diagnosed with BrS have determined
that this syndrome can be inherited [4], as some of these relatives also have mutations
in BrS-related proteins. In the UniProt database [5], there are 4,388 overrepresented
(redundant) BrS mutated proteins from 36 proteins [2] that underlie this abnormality.
This overrepresentation or duplicity is not exclusive of this database, since it often
comes from the multiple functions found for the same protein.
To develop a new method for the molecular diagnosis of the syndrome, based on
parameters other than the detection of the mutation, this work introduces a highly
discriminating metric based on the concept of electronegativity in Pauling’s work
[6] called “polar profile” [7]. Several protein groups were used in the evaluation of
the BrS mutated proteins that were selected based on one of the following factors:
(i) they are peptides or proteins previously used by this team to carry out analysis of
functional identification, (ii) there is a large number of corresponding proteins found
in nature, (iii) they have a leading role in diseases that affect humans, and (iv) they
are closely related to the BrS mutated protein, as it is the case of BrS proteins. These
protein groups are (i) a set of BrS proteins from which the BrS-related mutated pro-
teins derive; (ii) a set of antibacterial peptides named selective cationic amphipathic
antibacterial peptides (SCAAP) [8], and this group stands out for being highly toxic
to bacteria but causes almost no harm to human cells; (iii) three sets of antimicrobial
proteins: bacteria, fungi, and viruses extracted from UniProt and APD2 [9] databases;
(iv) six sets of lipoproteins extracted from UniProt database [5] and associated with
coronary and heart diseases; (v) two sets of proteins with different degrees of struc-
tural disorder named intrinsically disordered (unfolded) and partially folded proteins,
taken from the work of Oldfield and coworkers [10]. This group is particularly impor-
tant because it is used as a reference on the degree of disorder [11] for other protein
groups; and (vi) a set of 557,713 “reviewed” proteins from UniProt database [5] that
includes all the proteins whose annotation is confirmed as correct. This group was
not used in the calibration of the Polarity Index Method® (PIM), but in the search of
proteins with a “polar profile” similar to that of BrS mutated proteins.
To obtain the “polar profile” of BrS mutated proteins, a computational system
called PIM [7] was used. It was designed and programmed to perform two func-
tions: to obtain the BrS “polar profile” of the mutated proteins and to compare that
profile with the polar profiles of other protein groups. The PIM is a nonsupervised
system, with a set of programs written in Fortran 77 [12] and scripts in Linux [13]
that extensively evaluate a single physicochemical property, the “polar profile.” The
PIM system is trained in evaluating different groups of proteins and making auto-
matic changes on the “polar profile” of the target protein group, to reach the “polar
profile” that best characterizes this target. The process is independent of the number
of protein groups participating. Its metric only uses the linear representation of the
protein, in FASTA format, reading one pair of amino acids at a time from one end to
the other to find the polarity type that represents each pair.
The discriminatory efficiency of the PIM system was verified by comparison of
the proportion of accepted/rejected proteins from the BrS mutated protein group
and BrS protein group with respect to the real proportion of proteins in the same
protein groups. This analysis was performed using the nonparametric two-sided
Kolmogorov–Smirnov test (Section 8.4.4). An analysis of the same protein groups
was carried out to obtain the characterization of their “degree of disorder” using the
set of supervised per-residue disorder predictors, such as PONDR® FIT, PONDR®
VLXT, PONDR® VSL2, PONDR® VL3, FoldIndex, IUPred, and TopIDP (Section
8.4.2). This analysis was aimed at verifying the efficiency of PIM when identifying
and discriminating the “polar profile” of the set of BrS mutated proteins and getting
a “fingerprint” of the degree of disorder of this group.
This article comprises three sections: (i) A computational analysis of the degree
of disorder of each BrS mutated protein as well as its association with other diseases,
(ii) A bioinformatic characterization (through the PIM®) of the BrS mutated proteins,
and their contrast with a large and diverse set of protein groups of different structural
and functional types, with the objective to obtain a “fingerprint” of the BrS mutated
protein set, and (iii) the use of this “fingerprint” on all of the reviewed proteins listed
in the largest known public primary database UniProt containing protein sequence
and functional information. We speculate that this characterization could be used for
the rapid identification of this syndrome at early stages, potentially even before the
development of major symptoms.
8.2 RESULTS
We provide a workflow of the PIM® (see Figure 8.1) to clarify and make the results
and procedures of this nonsupervised computational algorithm understandable.
FIGURE 8.1 Workflow of PIM®. (i) Assembly of the target set whose polar profile will be
obtained by the PIM system. (ii) Assembly of protein sets whose functions and/or structures
are known experimentally. (iii) Extraction of the polar profile characteristic of the target set.
(iv) Polar profile of each protein in protein sets, whose functions and/or structures are known.
(v) Comparison and reacquisition of the polar profile of the target set, based on the calculation
of similarity of the target set and the group of protein type (ii) (see Section 8.4.1).
Brugada proteins: 3 16 6 18 7 0 64 11
Brugada mutated proteins: 0 5 8 0 7 0 25 67
Similarities (%) found by the PIM system in the protein groups (the column headers identifying the comparison protein groups are not recoverable here). The score represents the percentage of proteins (column) with a similar “polar profile” (row). For example, the PIM system calibrated with the BRUGADA group has a similarity of 1% with the “polar profile” of the completely disordered protein group. See Section 8.4.1.
FIGURE 8.2 Graphical representation of the BrS protein and BrS protein mutations.
(a) Column-Normal Excel software. (b) Column-Stacked Excel software. The X-axis repre-
sents the 16 polar interactions.
protein and PONDR® VSL2-based PPID values. Some of the data from Table 8.2
was used to generate Figure 8.3, which provided a compelling overview of the over-
all intrinsic disorder predisposition of these proteins.
In fact, Table 8.2 and Figure 8.3 clearly show that all BrS proteins analyzed
in this study contain noticeable levels of intrinsic disorder, and a vast majority of
them clearly belong to the category of hybrid proteins containing ordered domains/
regions and intrinsically disordered regions of different lengths. This conclusion is
further illustrated by the mean disorder profiles for these proteins provided in the
Supplementary Materials section, where the presence of intrinsically disordered
regions in almost all these proteins is clearly seen. The conclusion on the highly
disordered nature of BrS proteins is further supported by grouping these proteins
using the accepted classification of intrinsic disorder-containing proteins based on
their PPID values, where proteins are considered as highly ordered, moderately
disordered, or highly disordered, if their PPID < 10%, 10% ≤ PPID < 30%, or PPID ≥
30%, respectively [14]. This analysis based on the investigation of the mean disorder
profiles revealed that only three proteins from this dataset, SCN3B, SCN1B, and
GPD1L, were predicted as highly ordered, whereas 6 and 11 BrS proteins were clas-
sified as moderately and highly disordered (see Table 8.2). According to a similar
classification based on the analysis of PONDR® VSL2 profiles, all BrS proteins are
either moderately or highly disordered (6 or 14 proteins, respectively). Furthermore,
seven BrS proteins were predicted to have PPID values exceeding 50% by both
approaches. Since the intrinsic disorder is crucial for the functionality of many
proteins [14–22], and since high levels of intrinsic disorder are commonly found in
proteins associated with various diseases [23–29], the results of this analysis suggest
that structural plasticity is crucial for functionality and pathological implications of
BrS-related proteins. Despite these important conclusions, disorder-based profiles of
BrS proteins do not contain any common features that could be used for their clas-
sification and differentiation from other disorder-containing proteins.
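For reference, the PPID-based grouping cited above [14] can be written as a small rule; the function name and the example calls (values taken from Table 8.2) are ours.

def disorder_class(ppid_percent: float) -> str:
    # Thresholds as cited above: <10% highly ordered, 10-30% moderately disordered, >=30% highly disordered.
    if ppid_percent < 10:
        return "highly ordered"
    if ppid_percent < 30:
        return "moderately disordered"
    return "highly disordered"

print(disorder_class(2.85), disorder_class(22.33), disorder_class(57.86))
# GPD1L -> highly ordered, KCNE3 -> moderately disordered, HEY2 -> highly disordered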
Similarly, analysis of the effects of BrS-related mutations on the disorder propen-
sity of carrier proteins did not show any repeated peculiarity that can be used for the
TABLE 8.2
Peculiarities of Intrinsic Disorder Analysis of the Unique BrS-related Proteins
Protein (Disease)   UniProt ID   Length   pI (Charges)   Mean Disorder   Standard Deviation   Mean PPID (%)   PONDR® VSL2 PPID (%)
HEY2 (BRGDA) Q9UBP5 337 8.31 (+29/−27) 0.54 0.17 57.86 84.91
NPPA (ATFB6; ATRST2) P01160 153 6.59 (+17/−17) 0.57 0.17 71.90 75.97
ANK3 (BRGDA; MRT37) Q12955 4,377 6.07 (+544/−608) 0.57 0.14 59.36 73.46
CALM1 (LQT14; CPVT4) P0DP23 149 4.09 (+14/−38) 0.52 0.12 60.40 70.00
CACNB2 (BRGDA4) Q08289 660 8.11 (+94/−92) 0.58 0.14 65.45 69.14
MYH7 (CMH1; MSMA; SPMM; P12883 1,935 5.63 (+316/−358) 0.46 0.15 54.99 66.43
CMD1S; MPD1; LVNC5)
HCN4 (BRGDA8; SSS2) Q9Y3Q4 1,203 9.07 (+113/−99) 0.52 0.11 50.21 58.47
KCNH2 (SQT1; LQT2) Q12809 1,159 8.20 (+120/−115) 0.39 0.11 37.62 45.95
KCNE3 (BRGDA6) Q9Y6H6 103 8.82 (+12/−10) 0.34 0.15 22.33 41.35
KCNQ1 (ATFB3; LQT1; JLNS1; P51787 676 9.88 (+86/−53) 0.36 0.12 30.92 41.21
SQT2; NIDDM)
KCND3 (BRGDA9; SCA19) Q9UK17 655 8.56 (+69/−62) 0.35 0.11 34.05 40.09
CACNA1C (BRGDA3; TS) Q13936 2,221 6.33 (+225/−240) 0.32 0.12 29.81 40.05
SCN5A (BRGDA1; LQT3; Q14524 2,016 5.34 (+187/−229) 0.32 0.11 30.06 36.64
ATFB10; SSS1; VF1; PFHB1A;
ATRST1; CMD1E)
SCN10A (BRGDA; FEPS2) Q9Y5Y9 1,956 5.67 (+189/−218) 0.29 0.11 23.26 34.95
TGFB3 (ARVD1; LDS5) P10600 412 8.31 (+54/−50) 0.33 0.12 20.15 29.78
SCN2B (ATFB14; BRGDA; O60939 215 5.98 (+25/−28) 0.32 0.12 16.74 25.00
SIDS)
KCNJ8 (SIDS; HTOCD; Q15842 424 9.38 (+52/−40) 0.28 0.13 15.09 23.53
BRGDA)
SCN3B (BRGDA7; ATFB16) Q9NY72 215 4.66 (+19/−32) 0.25 0.13 8.83 22.22
SCN1B (BRGDA5; EIEE52; Q07699 218 4.86 (+21/−33) 0.23 0.13 4.13 17.81
GEFS+1; ATFB13)
GPD1L (BRGDA2) Q8N335 351 6.61 (+40/−41) 0.197 0.097 2.85 10.51
Some basic properties of the unique BrS-related proteins are shown. We list proteins names and associated diseases, UniProt IDs for proteins, length of their sequences
(as corresponding number of residues), pI, and the number of positively and negatively charged residues provided by ProtParam tool (https://web.expasy.org/protparam/) of
ExPaSy Bioinformatics Resource Portal. For each query protein, we also list a mean disorder score evaluated by averaging the per-residue disorder profiles generated by
individual disorder predictors and the corresponding standard deviation as well as a mean PPID (which is a content of residues predicted to be disordered; i.e., those with
disorder scores at or above the 0.5 threshold) calculated using the mean per-residue disorder profile generated for each query protein and also PONDR® VSL2-based PPID values.
identification of such mutations. To illustrate this point, Figure 8.4 shows the muta-
tion effect on intrinsic disorder profiles of one of the most disordered proteins in the
dataset, NPPA (Figure 8.4a), whereas Figure 8.4b shows the disorder profiles of the
wild-type and mutated forms of one of the most highly ordered proteins, SCN1B.
It can be seen that, in both these cases, BrS-related mutations induce some changes
(increases or decreases) in local intrinsic disorder propensity of regions in close
proximity to mutations. Again, there are no features in the corresponding disorder
profiles that can discriminate BrS-related mutations from any other mutations in
these mutated proteins. This is in contrast to the PIM-based analysis, which gener-
ates specific polar profiles containing characteristic features that can be used for
finding computable differences to identify BrS mutated proteins.
FIGURE 8.4 Illustrative examples of the effect of BrS-related mutations on the intrinsic
disorder propensity of a highly disordered protein, NPPA (a) and a mostly ordered protein,
SCN1B (b). In both cases, wild-type proteins are shown by black lines, whereas different
lines represent disorder profiles of several BrS-related mutated proteins.
8.2.4 Kolmogorov–Smirnov Test
Motivation: There are several statistical tests (parametric and nonparametric) that
assess the correlation between two or more samples. In particular, nonparametric
tests do not presuppose a normal distribution. This is an advantage for the analysis,
where it is not known what distribution would follow the “polar profile” of a protein.
Within the nonparametric tests, two tests are equivalent and robust: Mann–Whitney U
test and Kolmogorov–Smirnov test, but the last one is less demanding. The statistical
two-sided test confirmed (with alpha = 0.01) that the proportion of proteins accepted/
rejected by the PIM system correlates with the actual proportion of the groups of
BrS proteins and BrS mutated proteins. This also provided support to the conclusion
that the “polar profile” of each one of these groups contains group-specific features.
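For illustration only, a two-sample Kolmogorov–Smirnov comparison can be run in Python with SciPy as sketched below; the two arrays are placeholders, not the actual accepted/rejected proportions from this study.

import numpy as np
from scipy.stats import ks_2samp

observed = np.array([0.64, 0.11, 0.25, 0.67, 0.07, 0.16])  # placeholder proportions
expected = np.array([0.60, 0.13, 0.27, 0.63, 0.08, 0.18])  # placeholder proportions

stat, p_value = ks_2samp(observed, expected)
# With alpha = 0.01, a p_value above 0.01 would be read as the two samples being consistent.
print(stat, p_value)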
8.3 DISCUSSION
The nonsupervised PIM® can be implemented in parallel mode. Since it uses the
master-slaves scheme, its processing time is drastically reduced to one-tenth of the
corresponding CPU time of the methods utilizing the monoprocessor scheme. We were
able to implement a version in CUDA (Compute Unified Device Architecture) lan-
guage to be executed on NVIDIA microprocessors. This implementation is viable
because the only element that resides in memory is the representative vector of the
polar profile of the target protein and the protein whose function and/or structure
is known from experimental studies. We believe that the PIM system constitutes a
fortunate computational innovation, because its high-performance computing scheme
corresponds to master-slaves, which substantially reduces the processing time, and
from a biological point of view, its metrics use only a linear representation of the pro-
tein and not its three-dimensional structure, which effectively reduces the complexity
of the programming in comparison with metrics that act in three-dimensional
space. Another important consideration is the physicochemical nature of the only
property assessed by the PIM system, the polarity, which represents the electromag-
netic balance of a protein. This is one of the four major forces that affect matter.
A distinctive “polar profile” was obtained for the set of BrS mutated proteins
that discriminated it from the other protein groups analyzed in this study, including
the set of 36 BrS proteins from which the BrS mutated proteins were derived. When comparing the “polar
profile” of BrS mutated proteins with the “reviewed” proteins from the UniProt
database [5], it was identified that 6.4% of them share the same “polar profile”.
The metric of the PIM system, named the “polar profile,” takes only the polarity of
a peptide or a protein and expresses that measure with 16 numbers as an incidence
matrix, equivalent to the 16 possible polar interactions from {P+, P−, N, NP} ×
{P+, P−, N, NP}. A typical algorithm evaluating protein polarity (which is not PIM-
based) generates a single real number to characterize the polarity status of a query
protein. However, we assume that the characterization is more precise when the algorithm uses 16
real numbers instead of a single one. It is possible that this particularity is what
makes the PIM system effective; using the polar interaction with 16 values gives
more representability to the amino acids and their physical properties at the molecu-
lar level. Therefore, our method can effectively identify a specific protein or find
related protein groups. The amino acids are assigned to the four
groups {P+, P−, N, NP} based on their polar charge affinity: P+ (polar positively
charged) = {H, K, R}, P− (polar negatively charged) = {D, E}, N (polar neutral) =
{C, G, N, Q, S, T, Y}, and NP (nonpolar) = {A, F, I, L, M, P, V, W}.
The PIM algorithm starts by replacing each amino acid of the protein with its
numeric equivalent according to this rule: if the amino acid is P+ = {H, K, R}, the
equivalent would be 1; if the amino acid is P− = {D, E}, the equivalent would be 2;
if the amino acid is N = {C, G, N, Q, S, T, Y}, the equivalent would be 3; and if the
amino acid is NP = {A, F, I, L, M, P, V, W}, the equivalent would be 4. For instance, if
the proteins studied were (i) MSWQSYVDDHLMCDVE and (ii) FLPLEDGTY, their
numeric equivalent would be (i) 4343334221443242, and (ii) 444422333, respectively.
From this numeric representation of the protein, a 4 × 4 A[i, j] incidence
matrix is built, where i is the row and j is the column, each row and column
representing one of the four types of polarity [7]. To illustrate this process, the
protein (i) MSWQSYVDDHLMCDVE will be taken with its numerical equivalent
4343334221443242. The first step is to set the matrix to zero, A[i, j] = 0 for all i, j =
1, …, 4. Then, the protein is read in numerical notation by pairs, one digit at a time,
from left to right until the end of the protein. In this example, the first pair is “43,”
which means that 1 will be added to element (i, j) = (4, 3) of the matrix A[i, j]. Then,
taking one digit to the right gives the pair “34”; therefore, 1 will be added to element (3, 4) of
the matrix A[i, j] (Table 8.3, a). If this procedure is repeated with protein
(b) FLPLEDGTY, the B[i, j] incidence matrix will be obtained (Table 8.3, b). These
two examples show how to get the “polar profile” of a protein.
To obtain the “polar profile” of a group of proteins, this procedure will be applied
to each protein of the group, but the polar incidences will be accumulated in the
same matrix. As a result, the incidence matrix of the group of proteins studied, in
this case, the BrS protein mutations, will be unique, and it will be called the “target
incidence matrix” A[i, j].
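The following minimal Python sketch illustrates the construction just described; the amino acid grouping and the overlapping-pair counting follow the text, while the function names are ours and not part of the PIM software.

```python
# Minimal sketch of the polar-profile construction described above (function
# names are ours, not part of the PIM software). Amino acids are mapped to the
# four polarity groups, and overlapping pairs are counted in a 4x4 matrix.
import numpy as np

GROUP = {}
GROUP.update({aa: 0 for aa in "HKR"})       # P+  (code 1 in the text)
GROUP.update({aa: 1 for aa in "DE"})        # P-  (code 2)
GROUP.update({aa: 2 for aa in "CGNQSTY"})   # N   (code 3)
GROUP.update({aa: 3 for aa in "AFILMPVW"})  # NP  (code 4)

def incidence_matrix(sequence):
    """Count overlapping polarity pairs of one protein in a 4x4 matrix."""
    codes = [GROUP[aa] for aa in sequence.upper()]
    m = np.zeros((4, 4), dtype=int)
    for a, b in zip(codes, codes[1:]):      # read the sequence by pairs
        m[a, b] += 1
    return m

def group_profile(sequences):
    """Accumulate the incidences of a whole protein group (target matrix)."""
    return sum(incidence_matrix(s) for s in sequences)

print(incidence_matrix("MSWQSYVDDHLMCDVE"))   # protein (i) from the text
print(incidence_matrix("FLPLEDGTY"))          # protein (ii) from the text
```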
The PIM system only requires the “target” protein group and the other protein
groups it will be compared to. With this information, the PIM system calculates the
incidence matrix representative of the target A[i, j], normalizes it, and compares it
TABLE 8.3
Evaluation of Polar Profiles
(a) P+ P− N NP
AMSWQSYVDDHLMCDVE[i,j] = P+ 0 0 0 1
P− 1 1 0 1
N 0 1 2 2
NP 1 1 3 1
(b) P+ P− N NP
BFLPLEDGTY[i,j] = P+ 0 0 0 0
P− 0 1 1 0
N 0 0 2 0
NP 0 1 0 3
with the incidence matrix of each protein of the other protein groups. The incidence
matrix of each of these proteins will be called incidence matrix Bk[i, j], where “k”
is the identifier of each protein. Now it will be explained in detail how the incidence
matrices A[i, j] and Bk[i, j] are compared.
TABLE 8.4
Weighting of Polar Profiles
(a) AMSWQSYVDDHLMCDVE[i,j] + BFLPLEDGTY[i,j] =
        P+      P−      N       NP
P+   0.0000  0.0000  0.0000  0.0370
P−   0.0370  1.0370  1.0000  0.0370
N    0.0000  0.0370  2.0741  0.0741
NP   0.0000  1.0741  0.1111  3.0370
(b) AMSWQSYVDDHLMCDVE[i,j] =
        P+      P−      N       NP
P−   0.0000  0.0000  0.0000  0.3333
N    0.3333  0.3333  0.0000  0.6666
NP   0.0000  0.6666  1.0000  0.3333
TABLE 8.5
Comparison of Polar Profiles
Position 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
A[i,j] 15 14 12 11 16 10 8 6 5 4 13 9 7 3 2 1
B[i,j] 16 11 14 6 7 15 12 10 8 5 4 13 9 3 2 1
Similarity × × × × × × × × × × × × × ✓ ✓ ✓
Comparison of sample proteins by position (see Section 8.4.1.2). (✓) The position matches in the
matrices. (×) The position does not match in the matrices.
level of similarity would be 1 in 16, i.e., (1/16 = 6.25%). This procedure continues
until the lowest relative frequency in both matrices is determined (Table 8.5, posi-
tion 16). When comparing two groups of proteins, the corresponding results will be
referred to as the “level of similarity,” which will be stated with the corresponding
percentage. Getting back to the example, the level of similarity would be 3 in 16,
which means 18.75% (3/16 = 18.75%).
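A hedged sketch of this position-wise comparison, as we read it from Table 8.5 and the example above (it is not the PIM implementation itself), is given below: the 16 cells of each incidence matrix are ordered by relative frequency, and the level of similarity is the fraction of the 16 ordered positions that coincide.

```python
# Hedged sketch of the position-wise comparison suggested by Table 8.5 (our
# reading of the procedure, not the PIM implementation): the 16 cells of each
# incidence matrix are ranked from highest to lowest relative frequency, and
# the level of similarity is the fraction of the 16 rank positions that match.
import numpy as np

def rank_positions(matrix):
    """Order the 16 cells from the highest to the lowest relative frequency."""
    return list(np.argsort(-matrix.flatten(), kind="stable"))

def level_of_similarity(a, b):
    ranks_a, ranks_b = rank_positions(a), rank_positions(b)
    matches = sum(1 for x, y in zip(ranks_a, ranks_b) if x == y)
    return matches / 16.0

a = np.random.randint(1, 5, size=(4, 4))  # placeholder incidence matrices
b = np.random.randint(1, 5, size=(4, 4))
print(f"level of similarity = {level_of_similarity(a, b):.2%}")
```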
FIGURE 8.5 Graphical representation of sample proteins. (a) Normal column chart (Excel
software). (b) Stacked column chart (Excel software). The X-axis represents the 16 polar interactions.
8.4.3 Data Files
The files were formed based on the different groups of proteins. Some proteins
and mutated proteins associated with BrS located in the UniProt database [5]
are duplicated. In each case (Table 8.6), they are highlighted in the “duplicated pro-
tein” column and in the mutated proteins evaluated by the PIM system. The PIM
loses sensitivity when identical sequences are input; for that reason, the identical
sequences were eliminated from all protein groups described here.
The sets of proteins, with which the BrS mutated proteins will be compared
(Tables 8.8 and 8.9), intend to establish (or discard) their functional similarity with
respect to these groups, in order to characterize (by association) the BrS mutated
proteins. The PIM system automatically measures the similarity between the target
protein set and each of the proteins that is part of the other groups, and for this
reason, it is important to know the preponderant function of proteins in the groups
used for comparison. We think that it is logical to consider possible associations of
BrS mutated proteins with the group of lipoproteins, and in addition, to establish,
in a similar manner, some similarity with proteins from the fungi, virus, and bacteria
groups. In particular, the bacteria group contains the SCAAP group, which, together
with the lipoprotein group, has a mechanism of action similar to that of the BrS mutated
proteins.
TABLE 8.6
Whole Set of Brugada Proteins Found in UniProt
#   ID UniProt   Number of Unique BrS Proteins   Number of Unique BrS Mutated Proteins   Reference
1 ANK3 1 6 [41]
2 ARVD1 1 2 [42]
3 ATFB10 1 252 [37]
4 ATFB3 1 153 [43]
5 ATFB6 1 3 [44]
6 BRGDA1 0 0 [22]
7 BRGDA2 1 4 [45]
8 BRGDA3 1 30 [46]
9 BRGDA4 1 3 [47]
10 BRGDA5 1 8 [47]
11 BRGDA6 1 3 [48]
12 BRGDA7 1 9 [49]
13 BRGDA8 1 3 [46]
14 BRGDA9 1 8 [50]
15 CACNA1C 0 0 [46]
16 CACNB2 0 0 [47]
17 CALM1 0 0 [51]
18 GPD1L 0 0 [45]
19 HCN4 0 0 [46]
20 HEY2 1 2 [52]
21 KCND3 0 0 [50]
22 KCNE3 0 0 [48]
23 KCNH2 1 168 [53]
24 KCNJ8 1 4 [54]
25 KCNQ1 0 0 [43]
26 LQT14 1 0 [53]
27 LQT3 0 0 [51]
28 MYH7 1 137 [55]
29 SCN10A 1 8 [56]
30 SCN1B 0 0 [47]
31 SCN2B 1 4 [57]
32 SCN3B 1 4 [49]
33 SCN5A 0 0 [51]
34 SQT1 0 0 [53]
35 SSS1 0 0 [51]
36 VF1 0 0 [51]
Total 20 804
Unique BrS proteins and BrS mutated proteins extracted from UniProt database [5].
TABLE 8.7
Analysis of the Overrepresentation of Brugada Groups
#   PIM System Trained with Protein/Mutated Protein   ProteinsP/Mutated ProteinsM with Similar “Polar Profile”
1 BRGDA8 HCN4PM
2 BRGDA3 CACNA1CPM
3 BRGDA6 KCNE3PM
4 BRGDA7 SCN3BPM
5 BRGDA4 CACNB2PM
6 BRGDA5 SCN1BPM
7 BRGDA2 GPD1LPM
8 BRGDA9 KCND3PM
9 BRGDA1 SSS1,PM LQT3,PM SCN5A,PM VF1,PM and
ATFB10P
10 SSS1 BRGDA1,PM LQT3,PM SCN5A,PM VF1,PM and
ATFB10P
11 CACNB2 BRGDA4,PM BRGDA1,PM SSS1,PM KCND3,M
VF1,M SCN5A,P KCNH2,P and ATFB10P
12 LQT3 BRGDA1,PM SSS1,PM
13 LQT14 CALM1PM
14 SQT1 KCNH2PM
15 GPD1L BRGDA2PM
16 SCN3B BRGDA7PM
17 SCN5A BRGDA1,PM SSS1,PM LQT3,PM VF1,P ATFB10,P
BRGDA1,PM SSS1,PM LQT3,M CACNA1C,M
and ANK3M
18 KCND3 BRGDA9PM
19 KCNE3 BRGDA6PM
20 SCN1B BRGDA5PM
21 HCN4 BRGDA8PM
22 CACNA1C BRGDA3PM
25 KCNQ1 ATFB3PM
27 CALM1 LQT14PM
28 VF1 BRGDA1,PM SSS1,PM LQT3,PM SCN5A,PM and
ATFB10PM
29 KCNH2 SQT1PM
31 ATFB3 KCNQ1PM
35 ATFB10 BRGDA1,PM SSS1,PM LQT3,PM SCN5A,PM and
VF1PM
Overrepresentation of Brugada groups. For example, the PIM system identified that the “polar profile” of
protein/mutation LQT3 is similar to the polar profile of proteins BRGDA1,P SSS1,P SCN5A,P KCNH2,P
and ATFB10P; and to mutated proteins BRGDA1,M SSS1,M KCND3,M and VF1M. The proteins not men-
tioned in this table have no similarity.
8.4.4 Kolmogorov–Smirnov Test
The Kolmogorov–Smirnov two-sided test (alpha = 0.01) [58] was applied to the
BrS proteins (36 proteins) and the BrS mutated proteins (4,388 proteins), count-
ing the number of matches and rejections produced by the PIM system. The Excel
file and Kolmogorov–Smirnov test related to these proteins can be found in the
Supplementary Materials section.
8.4.5 Test Plan
8.4.5.1 Polar Profile
The PIM system implements two procedures to characterize the mutated proteins
associated with BrS. It registers the number of similarities (see Section 8.4.1.2) to
identify the BrS protein group (Table 8.6) and compares it with the other groups
(Tables 8.7 and 8.8). For instance, if the PIM system is calibrated with the unique
BrS mutated proteins, it is tested with the unique groups of BrS proteins (Table 8.6),
antimicrobial proteins (Table 8.8), disordered proteins, and lipoproteins (Table 8.9).
This computational test generates false positives and false negatives. In addition,
the “polar profiles” of unique proteins and mutated proteins associated with BrS
are also compared (Figure 8.2). At this stage, the “polar profile” representative of
the unique mutated proteins associated with BrS is obtained and, with this infor-
mation, the PIM system looks for the coincidences of this “polar profile” with the
“polar profiles” generated for the total reviewed proteins registered in the UniProt
database [5].
TABLE 8.8
Antimicrobial Peptides
# Number Group References
1 21 SCAAP [8, Table 8.7]
APD2 Database
2 469 Bacteria [9]
3 86 Fungi [9]
4 21 Virus [9]
UniProt Database
5 117 Bacteria [5]
6 46,342 Fungi [5]
7 1,104 Virus [5]
TABLE 8.9
Group of Lipoproteins Analyzed in This Study
#  Quantity  Description  Symbol  Search Engine
1  19  High-density lipoprotein  HDL  (“high density lipoprotein” AND “homo sapiens”) AND hdl AND reviewed:yes NOT “lipoprotein binding protein” NOT (disease:“high density lipoprotein”)
2  0  Intermediate-density lipoprotein  IDL  (“intermediate density lipoprotein” AND “homo sapiens”) AND idl AND reviewed:yes NOT “lipoprotein binding protein” NOT (disease:“intermediate density lipoprotein”)
3  51  Low-density lipoprotein  LDL  (“low density lipoprotein” AND “homo sapiens”) AND ldl AND reviewed:yes NOT “lipoprotein binding protein” NOT (disease:“low density lipoprotein”)
4  17  Very-low-density lipoprotein  VLDL  (“very low density lipoprotein” AND “homo sapiens”) AND vldl AND reviewed:yes NOT “lipoprotein binding protein” NOT (disease:“very low density lipoprotein”)
5  14  Chylomicron density lipoprotein  Chylomicrons  (“chylomicrons” AND “homo sapiens”) AND chylomicrons AND reviewed:yes NOT “lipoprotein binding protein” NOT (disease:“chylomicrons density lipoprotein”)
6  11  Lipoproteins relevant to atherosclerosis  Atherosclerosis  (lipoproteins atherosclerosis AND organism:“Homo sapiens (Human) [9606]”)
Unique lipoproteins extracted from the UniProt database [5]; only human proteins with the
annotation “reviewed” are included.
8.5 CONCLUSIONS
In this work, the bioinformatics algorithm PIM was tested and validated; its
“fingerprint” enables the rapid identification of BrS mutated proteins. The polar
profile of the BrS mutated proteins analyzed in this study is different from the polar
profiles of proteins from the structural group of intrinsically disordered proteins and
from those of proteins from the functional groups of lipoproteins, fungi, viruses, and bac-
teria. On the other hand, the polar profile of each protein group associated with
BrS mutated proteins serves as an effective discriminator. Therefore, it is concluded
that it is possible to use the polar profile as a “fingerprint” that identifies, with high
efficiency, those mutated proteins. The polar profile constitutes a metric that only
measures the electromagnetic balance of the protein using its amino acid sequence,
which makes it useful for database analysis.
Acknowledgments: The authors thank Concepción Celis Juárez for
proofreading.
Author contributions: Theoretical conceptualization, and design: CP.
Computational performance: CP, and VNU. Data analysis: CP, MFM, VNU, TB,
and MAE. Results discussion: CP, MFM, VNU, TB, and MAE. Competing interests:
We declare that we do not have any financial or personal interests with other people
or organizations that could inappropriately influence (bias) our work. Data and
materials availability: Copyright & Trademark. All rights reserved (México), 2018:
PIM®, PONDR® FIT, PONDR® VLXT, PONDR® VSL2, PONDR® VL3, and PONDR®
VSL2-based PPID values. Software & Hardware: Hardware:
The computational platform used to process the information was HP Workstation
z210—CMT—4 x Intel Xeon E3–1270/3.4 GHz (Quad-Core) —RAM 8 GB—SSD
1 x 160 GB—DVD SuperMulti—Quadro 2000—Gigabit LAN, Linux Fedora 14,
64-bits. Cache Memory 8 MB. Cache Per Processor 8 MB. RAM 8. Software:
PONDR® FIT, PIM®, PONDR® VLXT, PONDR® VSL2, PONDR® VL3, FoldIndex,
IUPred, and TopIDP, as well as PONDR® VSL2-based values. Supplementary
Materials: The test files and the PIM® system were supplied as support of the
manuscript to the journal, but they can be requested from the corresponding author
(polanco@unam.mx). The materials related to “Intrinsic disorder propensity in 20
unique BrS-related proteins,” were supplied as support of the manuscript to the
journal.
REFERENCES
1. P. Brugada, J. Brugada, Right bundle branch block, persistent ST segment elevation
and sudden cardiac death: a distinct clinical and electrocardiographic syndrome.
A multicenter report. J Am Coll Cardiol 20, 1391–1396 (1992).
2. S. G. Priori, C. Blomström-Lundqvist, A. Mazzanti, N. Blom, M. Borggrefe, J. Camm,
P. M. Elliott, D. Fitzsimons, R. Hatala, G. Hindricks, P. Kirchhof, K. Kjeldsen, K. H.
Kuck, A. Hernandez-Madrid, N. Nikolaou, T. M. Norekvål, C. Spaulding, D. J. Van
Veldhuisen, ESC Scientific Document Group. 2015 ESC Guidelines for the manage-
ment of patients with ventricular arrhythmias and the prevention of sudden cardiac
death: the task force for the management of patients with ventricular arrhythmias and
the prevention of sudden cardiac death of the European Society of Cardiology (ESC).
Endorsed by: Association for European Paediatric and Congenital Cardiology (AEPC).
Eur Heart J. 36, 2793–2867 (2015).
3. M. F. Márquez, A. Bonny, E. Hernández-Castillo, A. De Sisti, J. Gómez-Flores,
S. Nava, F. Hidden-Lucet, P. Iturralde, M. Cárdenas, J. Tonet, Long-term efficacy
of low doses of quinidine on malignant arrhythmias in Brugada syndrome with an
implantable cardioverter-defibrillator: a case series and literature review. Heart
Rhythm. 9, 1995–2000 (2012).
4. J-M. J. Juang, M. Horie, Genetics of Brugada syndrome. Journal of Arrhythmia. 32,
418–425 (2016).
5. UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res.
43(database issue), D204–D212 (www.ncbi.nlm.nih.gov/omim), accessed May 17,
2016 (2015).
6. L. Pauling, The Nature of the Chemical Bond and the Structure of Molecules and
Crystals: An Introduction to Modern Structural Chemistry. ISBN:9780801403330
(Cornell University Press, 1960).
7. C. Polanco, Polarity Index in Proteins-A Bioinformatics Tool. doi:10.2174/97816810
826911160101, eISBN: 978-1-68108-270-7, 2016, ISBN: 978-1-68108-269-1 (Bentham
Science Publishers, Sharjah, 2016).
8. C. Polanco, J. L. Samaniego, J. A. Castañon-González, T. Buhse, M. Leopold Sordo,
Characterization of a possible uptake mechanism of selective antibacterial peptides.
Acta Biochim Pol. 60, 629–633 (2013).
9. G. Wang, X. Li, Z. Wang, APD2: The updated antimicrobial peptide database and
its application in peptide design. Nucleic Acids Res. 37, D933–D937 (2012).
10. C. J. Oldfield, A. K. Dunker, Intrinsically disordered proteins and intrinsically
disordered protein regions. Annu Rev Biochem. 83, 553–584 (2014).
11. C. Polanco, J. L. Samaniego, V. N. Uversky, J. A. Castañón-González, T. Buhse,
M. Leopold-Sordo, A. Madero-Arteaga, A. Morales-Reyes, L. Tavera-Sierra,
J. A. González-Bernal, M. Arias-Estrada, Identification of proteins associated with
a myloidosis by polarity index method. Acta Biochim Pol. 62, 41–55 (2014).
12. J. M. Claverie, A common philosophy and FORTRAN 77 software package for
implementing and searching sequence databases. Nucleic Acids Res. 12, 397–407 (1984).
13. M. Hanke, Y. O. Halchenko, Neuroscience Runs on GNU/Linux. Front Neuroinf.
5, 8 (2011).
14. P. E. Wright, H. J. Dyson, Intrinsically unstructured proteins: re-assessing the protein
structure-function paradigm. J Mol Biol. 293, 321–331 (1999).
15. V. N. Uversky, A. K. Dunker, Understanding protein non-folding. Biochim Biophys
Acta. 1804, 1231–1264 (2010).
16. A. K. Dunker, C. J. Oldfield, J. Meng, P. Romero, Y. J. Yang, J. W. Chen, V. Vacic, Z.
Obradovic, V. N. Uversky, The unfoldomics decade: an update on intrinsically disor-
dered proteins. BMC Genom. 9(Suppl 2), S1 (2008).
17. V. U. Uversky, J. R. Gillespie, A. L. Fink, Why are “natively unfolded” proteins
unstructured under physiologic conditions? Proteins. 41, 415–427 (2000).
18. A. K. Dunker, J. D. Lawson, C. J. Brown, R. M. Williams, P. Romero, J. S. Oh, C. J.
Oldfield, A. M. Campen, C. M. Ratliff, K. W. Hipps, J. Ausio, M. S. Nissen, R. Reeves,
C. Kang, C. R. Kissinger, R. W. Bailey, M. D. Griswold, W. Chiu, E. C. Garner, Z.
Obradovic, Intrinsically disordered protein. J Mol Graph Model. 19, 26–59 (2001).
19. A. K. Dunker, Z. Obradovic, The protein trinity--linking function and disorder. Nat
Biotechnol. 19, 805–806 (2001).
20. A. K. Dunker, I. Silman, V. N. Uversky, J. L. Sussman, Function and structure of inher-
ently disordered proteins. Curr Opin Struct Biol. 18, 756–764 (2008).
21. V. U. Uversky, A decade and a half of protein intrinsic disorder: biology still waits for
physics. Protein Sci. 22, 693–724 (2013).
22. A. L. Darling, V. U. Uversky, Intrinsic disorder and posttranslational modifications: the
darker side of the biological dark matter. Front Genet. 9, 158 (2018).
23. V. U. Uversky, C. J. Oldfield, A. K. Dunker, Intrinsically disordered proteins in human
diseases: introducing the D2 concept. Annu Rev Biophys. 37, 215–246 (2008).
24. U. Midic, C. J. Oldfield, A. K. Dunker, Z. Obradovic, V. N. Uversky, Unfoldomics of
human genetic diseases: illustrative examples of ordered and intrinsically disordered
members of the human diseasome. Protein Pept Lett. 16, 1533–1547 (2009).
25. V. N. Uversky, Intrinsic disorder in proteins associated with neurodegenerative
diseases. Front Biosci. (Landmark Ed). 14, 5188–5238 (2009).
26. V. N. Uversky, C. J. Oldfield, U. Midic, H. Xie, B. Xue. S. Vucetic, L. M. Iakoucheva,
Z. Obradovic, A. K. Dunker, Unfoldomics of human diseases: linking protein intrinsic
disorder with diseases. BMC Genomics. 10(Suppl 1), S7 (2009).
Kirsty M. Smith
Bradford Teaching Hospitals
CONTENTS
9.1 Introduction...................................................................................................202
9.2 Literature Review..........................................................................................203
9.2.1 Skin Burns......................................................................................... 203
9.2.2 Causes of Burn Injuries..................................................................... 203
9.2.3 Burns Category..................................................................................204
9.2.4 Burn Assessment Techniques............................................................204
9.2.4.1 Clinical Assessment............................................................204
9.2.4.2 Blood Perfusion Measurement............................................205
9.3 Machine Learning......................................................................................... 210
9.3.1 Convolutional Neural Networks ....................................................... 210
9.3.1.1 Convolution Layer............................................................... 211
9.3.1.2 Pooling Layer...................................................................... 211
9.3.1.3 Output/Classification Layer................................................ 212
9.3.2 Training a ConvNet........................................................................... 212
9.3.3 Common ConvNet Models................................................................ 213
9.3.3.1 AlexNet............................................................................... 213
9.3.3.2 GoogleNet........................................................................... 214
9.1 INTRODUCTION
Early identification of burn injuries and the precise evaluation of burn depth is of
great importance to the patient towards management of an injury, thereby ensur-
ing early decision-making on whether surgical intervention is required or not.
Accurate assessment is important because it helps to decide the potential healing
times of any identified burn wound that allows proper treatment to be performed as
early as possible. Burns that take a long period to heal, such as superficial dermal
and deep dermal burns, can be treated as early as possible if they are identified,
which may minimize further complications due to assessment delay [1]. Although
superficial epidermal and full-thickness burns are easy to identify clinically,
can machine learning (ML) algorithms be used to differentiate superficial burns
from full-thickness burns?
According to a report released by the World Health Organization (WHO) in 2008,
burn injury is ranked as the fourth most devastating injury after road accidents and
internal conflicts, with approximately 11 million patients [2,3], while in the United States
alone, almost 450,000 patients are treated for burn injuries every year. Burns are
ranked as the third most common injury affecting children below the age of 5 in
England. This was reported in a study by Kandiyali et al. [4], whose investiga-
tion shows that more than 7,000 children were seen in hospital in 2015
for burn management. Burn incidences are similarly rampant in low- and
middle-income countries, such as those in Africa. Reports also show that children are mostly
affected by burn accidents in Africa, where sub-Saharan Africa has the highest
mortality rate, with about 4.5 per 100,000 compared with 2.5 per 100,000 across
103 countries [5].
A large volume of medical images is being collated every day due to the increase
in image-capturing devices. These images are processed by visual inspection by
medical experts, but concern has been raised about the subjective nature (assessment
inconsistency) associated with humans [6,7]. Furthermore, evaluation of burn inju-
ries by visual inspection is subject to two potential errors: underestimation and
overestimation [8]. When burns are underestimated, patients are likely to spend a
long time in hospital, and further complications may arise due to assessment delay.
Overestimation may subject patients to unnecessary surgical procedures such as skin
grafting. In both cases, hospital costs increase unnecessarily.
9.2 LITERATURE REVIEW
In this section, the various causes of burn injuries and their classification are presented.
Additionally, the section presents a review of some effective diagnostic procedures
for assessing burn injuries and highlights some associated limitations.
9.2.1 Skin Burns
According to the WHO, a “burn is an injury caused by heat (hot objects, gases, or flames),
chemicals, electricity and lightning, friction, or radiation” [9]. It is an injury that
damages the body tissues and affects people of different ages. It is one of the most
disturbing injuries in the world, affecting both children and adults, though children are
affected in most cases.
9.2.3 Burns Category
A burn is categorized into three stages depending on the severity or how deep the
body tissue is affected:
• First-degree burn
• Second-degree burn
• Third-degree burn
A first-degree burn is a situation where the burn affects only the uppermost layer of the
skin (epidermis) [11]. This type of burn is most common in Caucasian people,
where sunlight is usually the causative agent. This category of burn does not raise
much concern due to its ability to heal spontaneously within the first 7 days after
injury. The most common features that identify this type of burn are a reddish color
and the absence of swelling and blistering, though it is slightly painful. A burn does more harm
when its impact reaches the layer beneath the epidermis (dermis). This is because
the dermis houses most of the essential structures that keep the skin in good shape.
Most of the nutrients and oxygen supplied to the epidermis are contained in the der-
mis. Therefore, when a burn affects this layer, the body’s ability to regulate its
temperature and to synthesize vitamin D is impaired. In a nutshell, when a
burn destroys the epidermis and affects the dermis layer, this category of burn is
referred to as a second-degree burn [12]. Burns extending deep into the subcutis layer,
destroying the whole dermis and thereby exposing the subcutaneous layer, including
muscles and tendons, are called third-stage (degree) burns [10,12]. In other words, this
category of burn is called a full-thickness burn since all three layers of the skin
are affected.
9.2.4.1 Clinical Assessment
Clinically, burn assessment begins by capturing the history of the injury, such as the mode
of injury, the contact duration with the heat source, and the type of first aid given to the patient.
Afterwards, the burn is assessed based on the physical appearance of the wound by
considering color, sensitivity to pain, and capillary refill. The feature
observation is conducted by a human expert, and diagnosis is based on human judg-
ment, which depends on the level of experience. Color is the characteristic most
considered by medical personnel when assessing burns visually and during blister
formation. Burns that only affect the upper layer of skin (superficial) are known to be
reddish, dry, and without blistering. If skin disruption and the reddish appearance
of the burn wound are considered as criteria to identify a burn injury, there exist skin
injuries, such as bruises, that share similar physical characteristics with burns, and as such
a potential misdiagnosis may occur. Therefore, in a situation that involves similar
characteristics shared by different skin injuries, the probability of misdiagnosis is very
high [13]. Misdiagnosis in this context can be likely of assessing and categorizing
The investigation was conducted by the authors in [19], where patients were
recruited based on their willingness and acceptance to participate, all of whom sus-
tained burns that are challenging to diagnose with visual assessment. Most of the
burns sustained by these patients were within 24 h and were suspected to be deep
dermal burns covering less than 12% of the body surface area. To compare the effectiveness
of this new technique, samples were collected using punch biopsies at 72 h after
injury. Scanning was conducted with the burn wounds lying 1.6 m from the scanner
in a room at a controlled temperature of 25°C, and the scan
images were generated by the computer. The result of the study shows that there was
a 77% correlation between the assessments by LDI and clinical assessment. Superficial
dermal or epidermal depth that was correctly assessed clinically indicated a high per-
fusion rate using LDI, whereas deep dermal and full-thickness burns indicated a low
perfusion rate. However, 15.3% were assessed clinically as deep dermal wounds, but
with LDI, there was an indication of a high perfusion rate. This shows overestima-
tion by clinicians. Later studies by different scholars ranged from LDI’s
ability to predict healing times in the early hours after injury to the assessment of
burns in children.
Using LDI to diagnose burns in children was described as a challenging task,
because scanning is expected to be carried out while the patient is still [21], and
restlessness is believed to affect the scanning process and render an inaccurate out-
come. As such, the study investigated the prediction of burn wound outcomes
in pediatrics using LDI. The study comprised 57 patients and was approved by the
ethics committee of the children’s hospital at Westmead, Australia. Colored images of
the burn wounds were captured using an Olympus Camedia C-2500L digital camera.
The scanning was conducted once, 36 to 72 h after injury. Each scan took about
5–7 min for children who cooperated and stayed still, while for restless children
the scanning time was reduced to between 90 s and 3.5 min, although some generated
images are noisy due to movement artifacts. The acquired images were then assessed
reliable technique. The authors proposed a study to determine the accuracy of LDI
in assessing burn depth and healing times on different days, starting from day 0.
This study aimed to provide vital information regarding on which day after the
burn injury is sustained LDI assesses burn depth most effectively when compared
with clinical evaluation. They conducted the study on 40 patients (7 women and 33
men), all with intermediate burn wounds, at the burn center of the University of Gent,
Belgium. Both clinical and LDI techniques were used to determine the accuracy
of assessments. The authors highlighted that, due to unforeseen conditions, not all
patients attended on all the days. This was a result of patients not arriving or reporting to
the hospital on the day that the injury occurred, or of restrictions suggested by
medical personnel against further scanning. The assessment was conducted on days 0,
1, 3, 5, and 8. As shown in Table 9.1, LDI is less accurate at day 0, and afterwards,
the accuracy of LDI in assessing burn depth increases, with 100% accuracy at day
8 after the burn incident. Similarly, clinical evaluation shows inefficiency at day 0,
improves at day 1 after the burn, deteriorates at day 3, regains accuracy at day 5,
and shows its full effectiveness (100%) at day 8 after the burn. The deterioration in
assessment at day 3 was reported to be a result of the presence of dead tissue on
the wound surface. However, the adherence of dead tissue on day 3 was shown to have
no effect on the assessment using LDI. The result also shows that LDI might not be
required at day 8 after the burn, because clinical assessment is easy, less costly, and
achieved 100% accuracy.
It was stated that differentiating first-degree and third-degree burns is easier and effec-
tive with just clinical evaluation, but differentiating burns that include second-degree
burns is the challenge faced by medical personnel, even experienced surgeons. A
comparison of LDI and clinical assessments is presented in [24]. The authors’ aim was
to compare the performance efficiency of clinical assessment and LDI in discriminat-
ing superficial burns and intermediate burn wounds (deep partial-thickness burns). The
investigation was conducted at a burn unit in Mayo Hospital Lahore, Pakistan, where 34
patients were recruited, and a total of 92 burn wounds from these par-
ticipants were considered within a period of 21 months (March 2015–November 2016). Laser Doppler
scanning was done between days 3 and 5 after injury, because burn depth assessment is more
reliable after the third day. Clinical evaluation was simultaneously performed by expe-
rienced surgeons who were blinded to the laser Doppler scanning outcome. The assessment
showed the accuracy of LDI outperforming clinical evaluation, with an LDI sensitivity
of 92.75% against 81% for clinical assessment; both achieved a specificity of 82%,
a positive predictive value of 94% for LDI against 93% for clinical assessment, a negative
predictive value of 79% for LDI against 59% for clinical assessment, and a diagnostic
accuracy of 90.12% for LDI against 81.52% for clinical assessment.
Moreover, another study was approved by the ethics committee at the Children’s
Hospital Westmead in Australia, in which 400 patients were included in the study
over a period of 12 months. This study served as a follow-up to the study presented in [21],
where the effective performance of using LDI for the assessment of burn wounds in
children between 48 h and 72 h after injury was investigated. How effective is LDI
in assessing burns in children before 48 h? This is the aim of the study in [25]. The
authors found that the sensitivity and specificity using LDI are 78% and 74%, respec-
tively, compared with those scans conducted after 48 h, which resulted in 75% and
TABLE 9.1
Performance Comparison of Clinical Assessment and LDI
Days Number Clinical Evaluation (%) LDI (%)
0 31 40.6 54.7
1 39 61.5 79.5
3 40 52.5 95.0
5 34 71.4 97.0
8 25 100 100
85% for sensitivity and specificity, respectively. Statistically, this shows no significant
difference. This indicates an inefficiency of LDI in detecting burns in the early hours, as
reported by [23], and the need for an efficient technique that can do the job in the early
hours is therefore of paramount importance.
The limitations of using LDI, compared with visual assessment and histological means,
are as follows:
i. Studies have shown that, in the first 24 h postburn, the LDI assessment is
very poor, whereas it is found to be good after 2–5 days. Late assessment of
burn wounds is inaccurate due to the healing process and the formation of
new tissue (also known as granulation tissue).
ii. The penetration power of laser light is limited by certain factors, such as
blisters, cream, unremoved dead tissue, and coverings. Other factors that
make burn evaluation challenging include tattoos, which were found to absorb
laser light [1,26], and natural skin pigment in situations where it is bound to
the dermis. However, natural skin pigmentation, in some cases, was found
to be a noninfluencing factor because the pigment is detached along with
the epidermis.
iii. Patient sickness and some vasoactive medications have been discovered to
affect LDI measurement. Vasoactive medications reduce blood flow and as
such affect burn wound assessment using LDI. Similarly, sickness affects
skin blood flow.
iv. Skin appendages that are essential for the restoration of damaged tissue play
a vital role when probing a burn wound with LDI. However, areas that were
previously grafted for some reason give an inaccurate LDI prediction.
v. In a situation where the patient is restless or refuses to stay still, the movements
were found to increase the LDI flux and tend to diminish the image quality.
vi. Parts of the body, such as those at the edges of a limb and other parts that
bend away from the LDI spot, have been found to receive limited LDI laser
illumination and need to be rescanned from an appropriate direction.
vii. Another factor that hinders the use of LDI is the expense of the modality, as
reported by [6].
9.3 MACHINE LEARNING
ML is a subset of artificial intelligence (AI) that provides ways to make machines
intelligent and enable them to act like humans [27–29]. The terms AI and ML are
sometimes used interchangeably; however, the terms differ but have a strong relation-
ship. AI is the concept of making machines intelligent, while ML is a technique for
achieving AI. ML can basically be grouped into two main categories: supervised ML
and unsupervised ML. In the supervised learning technique, machines are trained with
labeled data so as to learn to map input data to the corresponding output. Labeled
data means known data (i.e., a problem at hand with a known solution),
and the idea is to enable machines to learn the relationship between data and output
by learning unique representations that associate the input with the output. The
goal is to enable accurate prediction on unseen data presented to the machine
without human intervention. The process of learning that involves only input data
with no output information is called unsupervised learning. In this type of learning,
machines are allowed to figure out and group data based on the similarity of repre-
sentations. Unsupervised learning is considered closer to true AI by some research-
ers than the supervised learning approach, because during the learning process, there is
a complete absence of human intervention to guide the learning. Some exam-
ples of supervised ML algorithms are support vector machines (SVMs) and artificial
neural networks, while clustering is an example of an unsupervised learning strategy.
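The distinction can be illustrated with a small scikit-learn sketch on synthetic data (this is purely illustrative and is not the burn dataset used later in the chapter): an SVM is fitted on labeled samples, while k-means groups the same samples without labels.

```python
# Illustrative sketch of the two learning paradigms mentioned above, using
# scikit-learn on synthetic data (not the chapter's burn dataset).
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Supervised: an SVM is trained on labeled data (inputs paired with outputs).
svm = SVC(kernel="rbf").fit(X, y)
print("SVM prediction for one sample:", svm.predict(X[:1]))

# Unsupervised: clustering groups the same inputs without any labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster assignment for one sample:", kmeans.labels_[:1])
```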
Until around the 1990s, neural networks remained limited to no more than three layers, consisting
of an input layer, a single hidden layer, and lastly the output layer.
Subsequently, the evolution progressed to networks with multiple hidden layers.
However, multiple-layered networks were hindered by the lack of training and evaluation
datasets and the unavailability of powerful computational machines to run the experi-
ments. In 2010, the availability of a large database of annotated clean images called
ImageNet [30] revived vision research. Since then, each year, vision commu-
nities have been participating in a challenge to train and evaluate their innovations.
and the lack of machines with computational power to train with. It became more
popular in 2010 with the availability of a large database of annotated images com-
prising about 1,000 categories of different classes of images and were made publicly
available to researchers who are willing to train and test their proposed models [30].
All ConvNet models are composed of similar architectural layers, such as
input layers, convolutional layers, pooling layers, and output layers or fully con-
nected layers, but may differ in the layout in which they are arranged. The input layer is the first
layer; it interfaces with the outside world and is responsible for receiving the input to be processed.
However, the input layer is not counted as part of the layers of a ConvNet model.
9.3.1.1 Convolution Layer
The convolutional layer is the most crucial part of the neural network. It is a
layer of the CNN architecture that serves as the learning layer. It is a special layer of
the ConvNet architecture described in [32], and it consists of numerous neurons grouped
together as filters. This layer performs the convolution (learning) operation on every
portion of the input (usually an image) and learns each pixel value [33]; the convolu-
tion operation means sliding a filter over an image while learning every single part
of the input image.
A good way to understand the filter is by introducing a torchlight analogy.
Assuming one is looking for a needle that falls in a room, say on a carpet in a sitting
room that is very dim and makes searching a difficult task. One of the best means to
look for it is to get a bright light into the searching environment. One might decide
to use torchlight for the search, although the torchlight may not provide the desired
brightness that can illuminate the entire region in the room. One may realize that
the torchlight is brighter in a single focused area, and in ConvNet, this focused area
is called the receptive field. The filter can be of different sizes: a 3 × 3 filter,
a 5 × 5 filter, a 7 × 7 filter, etc., depending on the architecture defined by the
developer. The operation is performed by convolving (sliding) a filter over the input
image, where on every receptive field covered by the filter, the elements of the filter
are multiplied with the receptive field’s elements. The output of this operation is
called a feature map, and it serves as the input to the next layer in the net-
work. In addition, the convolving or sliding operation is guided by how many pixels
a filter moves at a time. This step size is called the stride. For example, a stride
of 1 means a filter moves from one receptive field to another by one pixel, and a stride
of 2 means moving the filter from one receptive field position to another by
two pixels.
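The sliding-filter operation and the effect of the stride can be sketched in a few lines of NumPy; the image and filter values below are toy examples, not taken from the chapter's experiments.

```python
# Minimal NumPy sketch of the convolution (sliding-filter) operation with a
# stride, as described above; shapes and values are illustrative only.
import numpy as np

def convolve2d(image, kernel, stride=1):
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # receptive field currently covered by the filter
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 input
kernel = np.ones((3, 3)) / 9.0                     # toy 3x3 filter
print(convolve2d(image, kernel, stride=1).shape)   # (4, 4)
print(convolve2d(image, kernel, stride=2).shape)   # (2, 2)
```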
9.3.1.2 Pooling Layer
Thereafter, pooling layers are used to reduce the number of features output by
convolution layers so as to minimize computational cost and increase efficiency [34].
Pooling operations are performed similarly to the convolution operation, by convolving
over the features extracted by a convolution layer, and the operation is either a max-
imum pooling operation or an average pooling operation. The pooling operation reduces
the number of features obtained through convolution operations into specific, small
features. This is much like discarding unnecessary features, thereby minimizing
the cost of computation and increasing the efficiency [34]. Examples of pooling
operations are maximum pooling (max-pooling), where the maximum value in the
receptive field is selected to be the new pixel value in the condensed feature map,
and average pooling, where the average of the features in the receptive field is deter-
mined and the resulting average value serves as the new feature in the subregion of the
new feature map. Examples of both maximum and average pooling filters are shown
in Figures 9.2 and 9.3, respectively, both with a 2 × 2 filter and a stride of 2.
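A corresponding NumPy sketch of 2 × 2 max and average pooling with a stride of 2 is shown below; the feature-map values are illustrative only.

```python
# Minimal NumPy sketch of 2x2 max and average pooling with a stride of 2,
# mirroring the operations illustrated in Figures 9.2 and 9.3.
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    reduce_fn = np.max if mode == "max" else np.mean
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = reduce_fn(window)   # keep the max (or the average)
    return out

fm = np.array([[1., 3., 2., 4.],
               [5., 6., 1., 2.],
               [7., 2., 9., 0.],
               [3., 4., 6., 5.]])
print(pool2d(fm, mode="max"))      # [[6. 4.] [7. 9.]]
print(pool2d(fm, mode="average"))  # [[3.75 2.25] [4.   5.  ]]
```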
9.3.1.3 Output/Classification Layer
The last layer of a ConvNet model is the output layer, which outputs the information
regarding the class to which each instance learned by the previous layers belongs. In a nut-
shell, the output layer is a classification layer.
9.3.2 Training a ConvNet
Training a ConvNet model requires a huge amount of data of very good quality.
The presence of noise in the training dataset may hamper the performance
of the learning model and prevent it from achieving the desired out-
put. Machines with high computational power are also a prerequisite for training
deep learning models, and the emergence of graphics processing units greatly aided the
wide popularity of deep learning. Training is normally performed by dividing the
data into three subsets: the training set, used for the training; the validation set, used during
the training to examine how accurate the model is on unseen samples; and lastly,
the testing set, used after training to inspect the performance of the model. This was
aided by the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), introduced as an
object recognition and classification benchmark in 2010 [35,36]. Even with a lim-
ited database of images, it is now possible to reuse pretrained models via a concept
called transfer learning. Transfer learning is a good strategy to train a ConvNet model
with limited datasets, where a model that was trained on a large database can be
employed by reusing its lower layers as feature extractors. This approach has been
reported in much of the literature [37,38] to be good practice when dealing with insuf-
ficient data. A good example of the application of this strategy is in the health sector,
not because the data is limited, but because its availability is hampered by ethical
issues [33]. Therefore, the experiment provided in Section 9.4 utilizes the concept of
reusing a pretrained ConvNet model as a feature extractor while applying a classifi-
cation algorithm for the classification task.
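A hedged sketch of this transfer-learning strategy is shown below: a ConvNet pretrained on ImageNet is reused as a fixed feature extractor, and a separate classifier is trained on the extracted features. It is not the authors' exact pipeline, and `load_burn_images()` is a hypothetical loader named only for illustration.

```python
# Hedged sketch of the transfer-learning strategy described above: a ConvNet
# pretrained on ImageNet is reused as a fixed feature extractor and a separate
# classifier is trained on the extracted features. Illustrative only; not the
# authors' exact pipeline. `load_burn_images()` is a hypothetical loader.
import torch
import torchvision.models as models
from sklearn.svm import SVC

resnet = models.resnet101(pretrained=True)
resnet.fc = torch.nn.Identity()          # drop the ImageNet classification head
resnet.eval()                            # reuse the lower layers as-is

def extract_features(batch):             # batch: (N, 3, 224, 224) tensor
    with torch.no_grad():
        return resnet(batch).numpy()     # (N, 2048) feature vectors

# images, labels = load_burn_images()    # hypothetical dataset loader
# features = extract_features(images)
# clf = SVC(kernel="rbf").fit(features, labels)
```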
9.3.3.1 AlexNet
AlexNet is the first popular ConvNet architecture, developed by Alex Krizhevsky
et al. [39]; in 2012 it participated in the ILSVRC and outperformed all other
models presented that year by achieving a top-five error rate of 16%. AlexNet
has a total of eight layers, which consist of five convolutional layers (comprising both convolu-
tion and pooling operations) and three fully connected layers. Convolution is performed
using an 11 × 11 filter size, while the pooling operation is performed using 3 × 3
filters in the first layer with a stride of 2. In the second layer, the same operation is
repeated but with a filter of size 5 × 5, while in the remaining three layers (third,
fourth, and fifth), 3 × 3 filters are used. Figure 9.4 gives a graphical view of the
AlexNet model.
9.3.3.2 GoogleNet
GoogleNet, by Szegedy et al. [40] from Google, achieved a top-five error rate of 6.67%
in 2014. The GoogleNet model has multiple convolution layers arranged in parallel
within each layer, which were proposed and incorporated as inception layers, as shown in
Figure 9.5. These convolution layers have variable sizes, and their out-
puts are concatenated to form a single input to the next layer. GoogleNet has 22
layers, which makes it a deeper network than any model proposed before it.
9.3.3.3 VGGNet
Similarly, the Visual Geometry Group (VGG) at the University of Oxford proposed
a model that won second place in the ILSVRC challenge in 2014 [31]. Two variants
of this network (VGG-16 and VGG-19) were the best-performing models; they are
very deep, with a similar architectural design but different numbers of layers, and use
smaller convolution filters (i.e., 3 × 3) compared with the 11 × 11 filter size in AlexNet.
VGGNet differs from AlexNet in the placement of pooling layers: AlexNet stacks a
pooling layer after every convolution, while VGGNet uses a pooling layer
after two or three convolution layers.
9.3.3.4 Residual Network
In 2015, Microsoft proposed a deep network called Residual Network (ResNet) [41],
which is almost eight times deeper than VGGNet and won the first position on the
ImageNet ILSVRC challenge, with an error rate of 3.57%. The unique feature of
this network is the residual connection (shortcut connection). In a nutshell, a shortcut
connection (or, in some contexts, a skip connection) is the addition of the input of the
previous layer to the output of a lower layer; the resulting value then passes
through an activation function and serves as the input for the next layer down the
network. This provides the possibility of stacking a network with more layers, with
increasing accuracy, as shown in Figure 9.6.
Skip connection was adopted after every few stacked convolution layers, which is
represented by a shortcut path as shown in Figure 9.6. Mathematically, the resulting
operation is represented by the following equation:
y = f(z) + id(z) = f(z) + z (9.1)
The id(z) function, which is the shortcut presented in Figure 9.6, allows the gradient to
be maintained and gives an opportunity to train the network quicker and with more
layers stacked, and f(z) is the output of the convolution layer (weight layer). The
convolution layers may vary depending on the number of hyperparameters contained
in each layer.
Figure 9.6 shows that ResNet has two parts: the main part and the shortcut part.
The main part comprises the regular convolution (weight layer) and activation
(ReLU) layers, while the shortcut path takes the input of the previous layer and adds it
directly to the output of the subsequent layers. After this addition, nonlinearity
is applied, as represented in the following equation: y′ = σ(f(z) + z), where σ is the
activation operation (nonlinearity) after the elementwise operation, as depicted in
Figure 9.6, and y′ is the output of the nonlinearity operation.
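A minimal PyTorch sketch of a residual block in the sense of Equation (9.1) is given below; it is a generic illustration of the shortcut connection, not the exact ResNet block used in [41].

```python
# Minimal sketch of a residual (shortcut) connection in the sense of Equation
# (9.1): the block output is the activation of f(z) added to the unchanged
# input z. Generic illustration only, not the exact ResNet-101 block.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(                       # the "main part"
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.relu = nn.ReLU()

    def forward(self, z):
        return self.relu(self.f(z) + z)               # y' = sigma(f(z) + z)

block = ResidualBlock(channels=16)
x = torch.randn(1, 16, 32, 32)
print(block(x).shape)                                 # torch.Size([1, 16, 32, 32])
```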
Accuracy = (1/n) ∑ Ai (9.2)
where n is the number of folds and Ai is the accuracy obtained in each iteration.
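A small scikit-learn sketch of this averaging over folds (on synthetic data, not the chapter's dataset) could look as follows.

```python
# Hedged sketch of Equation (9.2): the reported accuracy is the mean of the
# per-fold accuracies from k-fold cross-validation. Data here is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
fold_accuracies = cross_val_score(SVC(), X, y, cv=5, scoring="accuracy")
print("per-fold accuracies:", np.round(fold_accuracies, 3))
print("Accuracy = (1/n) * sum(A_i) =", fold_accuracies.mean())
```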
TABLE 9.2
Classification Result
Target Classes
a b c
Predicted classes a 660 0 1
b 0 660 0
c 0 0 659
predicted as burns are referred to as TPs, and this can be obtained in cell [c, c].
Moreover, cell [b, b] also presents a TP class for superficial epidermal burns.
9.5.2 Classifier Performance
The performance evaluation of the classification algorithm in categorizing each
instance class is presented in Table 9.3. The classifier mapped each instance from
the three classes onto one of the healthy, superficial burn, or full-thickness burn class
labels. The performance evaluation metrics such as precision, sensitivity (recall), and
specificity are computed as well. Precision gives the proportion of relevant instances
retrieved by the classifier (i.e., fraction of relevant instances among the total retrieved
instances) [46] as represented in Equation (9.3). Recall is the number of accurately
categorized positive samples divided by the number of positive samples contained
in the data. Recall is a term sometimes referred to as sensitivity (accurate prediction
of positive instances), as formulated in Equation (9.4). Specificity is the measure of
negative sample recognition [47], as formulated in Equation (9.5). Moreover, the
overall accuracy (OA) of the classification algorithm and the corresponding preci-
sion metric are shown in Table 9.4.
Precision = TP/(TP + FP) (9.3)

Sensitivity (Recall) = TP/(TP + FN) (9.4)

Specificity = TN/(TN + FP) (9.5)
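These three metrics can be computed directly from a confusion matrix laid out as in Table 9.2 (rows: predicted classes; columns: target classes). The sketch below uses the values reported in that table and reproduces the figures in Table 9.3 up to rounding.

```python
# Sketch of computing per-class precision, sensitivity, and specificity from a
# confusion matrix such as Table 9.2 (rows: predicted classes, columns: target
# classes); the matrix below uses the values reported in that table.
import numpy as np

cm = np.array([[660, 0, 1],     # predicted a (healthy)
               [0, 660, 0],     # predicted b (superficial)
               [0, 0, 659]])    # predicted c (full-thickness)

total = cm.sum()
for k, name in enumerate(["healthy", "superficial", "full-thickness"]):
    tp = cm[k, k]
    fp = cm[k, :].sum() - tp      # predicted as class k but actually another class
    fn = cm[:, k].sum() - tp      # class k predicted as something else
    tn = total - tp - fp - fn
    print(f"{name}: precision={tp/(tp+fp):.3%}, "
          f"recall={tp/(tp+fn):.3%}, specificity={tn/(tn+fp):.3%}")
```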
TABLE 9.3
Performance Metric of an Individual Class
Metrics Healthy Superficial Full-Thickness
Precision (%) 99.848 100.000 100.000
Sensitivity (Recall) (%) 100.000 100.000 99.848
Specificity (%) 99.924 100.000 100.000
TABLE 9.4
OA and OP of the Classification Algorithm
OA (%) OP (%)
99.949 99.924
where TP(S) is the TP for healthy skin class, TP(S) is the TP for superficial epider-
mal burns class, and TP(F) is the TP for the full-thickness burns class. The result
obtained by Equation (9.6) is presented in Table 9.4.
The overall precision (OP) of the classification algorithms was obtained as
expressed in Equation (9.7)
OP = (TP(S) + TP(F))/(TP(S) + TP(F) + FP) (9.7)
9.6 CONCLUSIONS
The conclusive objective of diagnosing burns is to provide immediate assessment,
reduce the length of hospital stays, avoid complications that are likely to be acquired
due to long hospitalization, and prevent loss of lives as a result of the unavailability of
medical facilities. Thus, an ML algorithm was used to discriminate healthy skin, superficial
burns, and full-thickness burns. The extraction of image features was conducted using
a pretrained model (ResNet101), and an SVM classifier was subsequently utilized for the
classification task. The result shows a performance accuracy of up to 99.949%. An interest-
ing part of this investigation is that the feature extraction model was trained on an object
recognition and classification task, yet it performs well when applied to medical images
for feature extraction. ML can be applied to tackle the unavailabil-
ity of medical personnel in some remote locations, where limited access to health facilities and
experienced health practitioners subjects patients to difficult situations.
One of the limitations of the investigation in this work is the noninclusion of interme-
diate burns. This was due to the unavailability of such data at our disposal. Second, the
investigation was conducted on Caucasian patients (people with white skin), and it is
worthwhile to include data from people of different ethnicities, such as burn images from
skin-of-color ethnicities. Therefore, these provide room for further investigation.
REFERENCES
1. C. Wearn, K. C. Lee, J. Hardwicke, A. Allouni, A. Bamford, P. Nightingale, and N. J.
B. Moiemen, “Prospective comparative evaluation study of Laser Doppler Imaging and
thermal imaging in the assessment of burn depth,” Burns, vol. 44, no. 1, pp. 124–133,
2018.
2. A. D. Gilbert, E. Rajha, C. El Khuri, R. B. Chebl, A. Mailhac, M. Makki, and
M. El Sayed, “Epidemiology of burn patients presenting to a tertiary hospital emer-
gency department in Lebanon,” Burns, vol. 44, no. 1, pp. 218–225, 2018.
3. P. Brassolatti, P. S. Bossini, H. W. Kido, M. C. D. Oliveira, L. Almeida-Lopes,
L. M. Zanardi, M. A. Napolitano, L. R. d. S. de Avó, F. M. Araújo-Moreira, and N. A.
Parizotto, “Photobiomodulation and bacterial cellulose membrane in the treatment of
third-degree burns in rats,” Journal of Tissue Viability, vol. 27, no. 4, pp. 249–256, 2018.
4. R. Kandiyali, J. Sarginson, L. Hollén, F. Spickett-Jones, and A. Young, “The man-
agement of small area burns and unexpected illness after burn in children under five
years of age—A costing study in the English healthcare setting,” Burns, vol. 44, no. 1,
pp. 188–194, 2018.
5. S. Wall, N. Allorto, R. Weale, V. Kong, and D. Clarke, “Ethics of burn wound care in a
low-middle income country,” AMA Journal of Ethics, vol. 20, no. 6, p. 570, 2018.
6. D. McGill, K. Sørensen, I. MacKay, I. Taggart, and S. Watson, “Assessment of burn
depth: a prospective, blinded comparison of laser Doppler imaging and videomicros-
copy,” Burns, vol. 33, no. 7, pp. 833–842, 2007.
7. A. D. Jaskille, J. C. Ramella-Roman, J. W. Shupp, M. H. Jordan, and J. C. Jeng, “Critical
review of burn depth assessment techniques: part II. Review of laser Doppler technol-
ogy,” Journal of Burn Care & Research, vol. 31, no. 1, pp. 151–157, 2010.
46. V. López, A. Fernández, S. García, V. Palade, and F. Herrera, “An insight into clas-
sification with imbalanced data: Empirical results and current trends on using data
intrinsic characteristics,” Information Sciences, vol. 250, pp. 113–141, 2013.
47. M. Sokolova and G. Lapalme, “A systematic analysis of performance measures for clas-
sification tasks,” Information Processing & Management, vol. 45, no. 4, pp. 427–437,
2009.
10 A Study and Analysis of
an Emotion Classification
and State-Transition
System in Brain
Computer Interfacing
Subhadip Pal, Shailesh Shaw, Tarun Saurabh,
Yashwant Kumar, and Sanjay Chakraborty
TechnoIndia
CONTENTS
10.1 Introduction................................................................................................... 226
10.2 Literature Review.......................................................................................... 228
10.3 Proposed Work.............................................................................................. 229
10.3.1 Classification Processes..................................................................... 231
10.3.1.1 SVM Classifier.................................................................... 231
10.3.1.2 KNN Classifier................................................................... 232
10.3.1.3 Random Forest Classifier.................................................... 232
10.3.2 State-Transition Machine................................................................... 234
10.3.2.1 Proposed Algorithm of Emotional State Transition
Based on Channel Value for a Fixed Time Interval.................. 234
10.4 Result Analysis.............................................................................................. 236
10.4.1 Requirement....................................................................................... 236
10.4.2 Result Comparisons of SVM, KNN, and Random
Forest Classifiers.................................................................................. 236
10.4.3 SVM Polynomial Kernel Performance Analysis............................... 237
10.4.4 Analysis of the State-Transition Machine.......................................... 238
10.4.5 Comparison with Previous Works..................................................... 243
10.4.6 Computational Complexity................................................................244
10.5 Conclusion.....................................................................................................244
Acknowledgment.................................................................................................... 245
References............................................................................................................... 245
10.1 INTRODUCTION
In the human world, emotion plays a vital role for successful communication.
Ability to understand the emotional states of a person whom we are communicating
with comes to humans naturally. But that is not the case with machines. Whenever
we think about machine communication, we assume its artificial machine tone and
lack of mutual rapport. However, considering the explosion of machines in our
recent lives, there should be no hesitation to admit that any interface that disregards
human affective states in the interaction will appear cold and socially inept to users.
To achieve effective human–machine interaction, one of the most important
prerequisites is a reliable emotion recognition system that stands tall in some param-
eters like recognition accuracy, robustness against any artifacts, and adaptability to
practical applications. Emotion recognition is performed by classifying emotional
features measured from the implicit emotion channels of human communication,
such as speech, facial expression, gesture, pose, physiological response, etc. The
first kind of approach revolves around the analysis of facial expression or speech
[1–3]. However, these techniques are susceptible to deception. Peripheral
physiological signals provide the basis for another kind of approach. Changes in
the autonomic nervous system in the periphery, such as electrocardiography (ECG),
skin conductance (SC), respiration, and pulse can help us in detecting changes
in the emotional state [4–6]. This is a more detailed and complex technique than
audiovisual-based techniques.
The second kind of approach focuses on brain signals captured from the central
nervous system (CNS), such as electroencephalography (EEG), where electrodes
are placed at different regions on the skull, which measure how active the part of the
brain is using voltage fluctuation (Figure 10.1), electrocorticography (ECoG), and
functional magnetic resonance imaging (fMRI). Among these, EEG signals have
proven to be more informative about emotional states.
The overall brain–computer interface (BCI) setup is shown in Figure 10.2. BCI
is an interface between the brain and the outside world, without the direct interven-
tion of muscular activities. BCI technology mainly consists of four stages: signal
acquisition, signal processing, feature extraction and classification, and
an application interface. The brain signal is captured using an EEG signal acquisition
technique, and the scalp voltage level is then translated into physical movements.
In papers [7,8], an efficient algorithm of cursor movement has been proposed
to reach the desired target in minimum time. In our work, we have implemented
an emotion classification technique, and we have also implemented a new model of
emotional state-transition machine using a supervised learning approach. With the
help of an emotional state-transition system, we can identify the change of emotional
states at a fixed time interval. According to the two-dimensional model of emotion
described by Davidson et al. [9], emotion is represented as a two-dimensional space
(arousal and valence). There are multiple ways to explain emotion states, such as
(i) visual (images/pictures), (ii) audiovisual (clips/video clips), (iii) audio (songs/
sounds) [10], etc. In paper [2], the author has described emotion detection based
on facial expression and speech. Physiological signal analysis is another popular
approach to emotion recognition. Various studies show that peripheral physiologi-
cal signals like ECG, skin conductive resistance (SCR), and blood volume pres-
sure (BVP) can also reflect changes in emotion [11]. Davidson et al. [12] suggested that
frontal brain activity is related to positive and negative emotions. In paper [13], the
author has recorded EEG signals to analyze the oscillatory brain activity while the
subjects were listening to music. Qing et al. [14] have implemented an integrated
emotion recognition system using brain activity, facial expression, and gist to dis-
cover the emotions from the neutral scene. In paper [15], the author has developed
a system that can recognize different emotional states by analyzing the speaker’s
voice/recordings. In our work, we have recognized four different types of emotion
(positive, negative, depressed, and harmony). After classifying the emotions com-
pletely, we proceed for the transition of different emotional states from the neutral
state at a fixed time interval using state-transition machine. In this chapter, we use
support vector machine (SVM), k-nearest neighbors (KNN) and random forest for
classification. We are trying to recognize the emotional state and predict the next
state after some time interval. The goal is to recognize emotions defined by an accepted
psychological model from EEG signals using classification algorithms such as KNN. This
entails an investigation of the problem areas (emotions, EEG, and BCIs) and of the com-
putational methods commonly used in them; this should result in a method
and an implementation that accomplish the goal of recognizing different emotions
in EEG. The main motivation of this work is to classify emotions in a way that
helps in rational decision-making, perception, human interaction, and human intel-
ligence processes. Emotions are fundamental components of being human,
as they motivate action and add meaning and richness to virtually all human experi-
ence. They hold a great potential for people who are paralyzed or otherwise unable
to use their hands or who cannot talk.
The rest of this chapter is organized as follows. A brief literature review has
been done in Section 10.2. In Section 10.3, we have described our proposed work
with a suitable flow chart diagram, which includes emotion classification and transi-
tion function for the state-transition system. Besides that, we have also analyzed a
state-transition system based on the increased or decreased activity of specific cortex
of the brain in Section 10.3. Then, we have done a detailed performance analysis of
our approach in Section 10.4. We have compared our proposed work with previous
studies related to different parameters of classification in Section 10.4, and finally,
Section 10.5 describes the conclusion of this work.
10.2 LITERATURE REVIEW
Research on human emotions has existed for a long time in
psychology and psychophysiology. But in engineering, research in that field is rela-
tively new. However, many efforts have been made to recognize human emotions
using audiovisual channels of emotion expression like speech, facial expression,
and gestures till now. These audiovisual techniques help in noncontact detection
of emotions, and so the subject is always in comfort. These techniques are however
deception prone. Some of the audiovisual recognition techniques are elaborately
discussed in [1–3]. However, the attention is now gradually moving towards using
physiological measures [4–7] and, most recently, towards EEG signals. Recent theo-
ries on emotion [16,17] explain that physiological activity is an important aspect
of emotion, or rather, we can say that physiological activities help to construct a
certain emotion. Several previous works show the associativity of increased activity
of certain parts of our brain with basic emotions, which can be captured through
an EEG signal. Yazdani et al. [18] proposed using a BCI based on P300 evoked
potential to emotionally tag videos with basic emotions. This book chapter proposes
an effective algorithm to classify EEG signals, explains the activities of different
parts of the brain based on values measured with electrodes, and demonstrates
how the transition of emotion can happen for a human. Bhardwaj, Gupta, Jain,
Rani, and Yadav (2015) worked on six different emotional states and used lin-
ear discriminant analysis (LDA) and SVM classification to determine the different
emotional states. Other papers have lower classification accuracy in comparison to
our proposed algorithm, as in the papers (Anh et al., 2012; Jatupaiboon, Panngum,
& Israsena, 2013; Jirayucharoensak et al., 2014; Wijeratne & Perera, 2012; Yoon
& Chung, 2013). In [19], a framework was proposed to optimize EEG-based
emotion recognition systematically: (i) seeking emotion-specific EEG features and
(ii) exploring the efficacy of the classifiers. They classified four emotional states (joy,
anger, sadness, and pleasure) using SVM and obtained an averaged classification
accuracy of 82.29% ± 3.06% across 26 subjects [19]. In paper [18], for implicit emo-
tional tagging of multimedia content, the author proposed a BCI system based on
P300 evoked potential. Their system performs implicit emotional tagging and can
be used efficiently on naïve subjects who have not participated in the
training. They proposed a subjective metric called "emotional taggability." The
recognition performance of the system was analyzed with respect to the degree of ambiguity
that exists in the emotional values associated with multi-
media content [18]. BCI is able to access brain activity, which helps in understanding the
emotional state of any human. One very interesting work of EEG signal-based emo-
tion classification using LDA classifier with “Correlation-Based Subset Selection”
technique was introduced in [20].
In paper [21], they used this information in two manners: (i) The intention of
the user is correctly interpreted in spite of signal deviations induced by the sub-
ject’s emotional state, because the influence of the emotional state on brain activity
patterns can allow the BCI to adapt its recognition algorithms. (ii) The user can
use more natural ways of controlling the BCI through affective modulation, which can
potentially lead to higher communication throughput through the ability to recognize
emotions.
Inspired by the earlier works, we have used some popular supervised learning
algorithms for classification (SVM, random forest, and KNNs) in our work. This
study has been applied to various machine-learning algorithms to categorize EEG
dynamics according to subject self-reported emotional states while watching videos.
In our work, we have classified emotions into four states (negative emotion, posi-
tive emotion, depression, and harmony). Also, we have proposed a “State-Transition
Algorithm” as well and tried to transit from one emotional state to another.
10.3 PROPOSED WORK
• Analysis of EEG data through signal processing
Here we have used the Kaggle NER dataset [22], where the EEG signal
is captured at a 1-min time gap and sampled at 600 Hz.
The EEG signal is then preprocessed to detect the presence of artifacts,
which are removed using adaptive filtering. After filtering,
we get useful features from the signal by applying a suitable feature sub-
set selection method [20]. After some useful feature extraction [20], we
proceed to emotion classification using SVM, KNN, and random forest, and to
emotion transition. The flow chart of the entire proposed approach is shown in Figure 10.3.
10.3.1 Classification Processes
10.3.1.1 SVM Classifier
SVM multiclass classification with “one against all” approach uses a set of binary
classifiers that are trained to separate each class from all others. Then, each data object
is classified to a particular class, for which the largest decision value was determined
(Hsu and Lin, 2002) [23,24]. This method trains four SVMs (where four is the number
of classes) and four decision functions. The basic idea is to reduce the multiclass into
a set of binary SVM problems. Briefly, the approach is as follows:
The training data are represented by {x_i, y_i}, i = 1, 2, …, N, and c1, c2, c3, and c4 are the
classes representing positive, negative, depressed, and harmony. For implementation
purposes, we can use the widely used Python [25] SVM implementation in scikit-learn [26].
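As an illustration, a minimal sketch of this one-against-all strategy with scikit-learn is given below. The feature matrix, labels, and kernel settings are hypothetical placeholders rather than the actual data or configuration used in this study.

import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical EEG feature matrix (N samples x d features) and labels
# standing in for the four classes c1..c4 used in this chapter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = rng.choice(["positive", "negative", "depressed", "harmony"], size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# One binary SVM per class ("one against all"); the class with the
# largest decision value wins, as described above.
clf = OneVsRestClassifier(SVC(kernel="poly", degree=3))
clf.fit(X_train, y_train)
print(clf.predict(X_test[:5]))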
10.3.1.2 KNN Classifier
KNN is also known as a lazy learning classifier. Decision tree and rule-based clas-
sifiers are designed to learn a model that maps the input attributes to the class
labels as soon as the training data becomes available, and thus, they are known
as eager learning classifiers. Unlike eager learning classifiers, KNN does not con-
struct a classification model from the data; it performs classification by matching
the test instance with K training examples and decides its class based on the simi-
larity to its k nearest neighbors. Briefly, the approach is as follows:
For each training point x_i, i = 1, 2, …, n:
    d(x, x_i) = \left( \sum (x - x_i)^2 \right)^{1/2}  (Euclidean distance)
End for
Sort the n Euclidean distances in non-decreasing order.
Let k be a positive integer; take the first k distances from
this sorted list.
Find the k points corresponding to these k distances.
Let k_i denote the number of points belonging to the ith class
among these k points, i.e., k_i ≥ 0.
If k_i > k_j ∀ i ≠ j, then put x in class i.
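A direct NumPy transcription of the above nearest-neighbor procedure might look as follows; the training set, labels, and the choice k = 5 are illustrative assumptions only.

import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=5):
    """Classify x by majority vote among its k nearest training points
    (Euclidean distance), following the pseudocode above."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # d(x, x_i) for all i
    nearest = np.argsort(dists)[:k]                     # first k distances
    counts = Counter(y_train[i] for i in nearest)       # k_i per class
    return counts.most_common(1)[0][0]                  # class with the largest k_i

# Illustrative usage with random data.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 8))
y_train = rng.choice(["positive", "negative", "depressed", "harmony"], size=100)
print(knn_predict(rng.normal(size=8), X_train, y_train, k=5))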
10.3.1.3 Random Forest Classifier
The random forest classifier can be summarized with the following pseudocode:

function RandomForest(S, F)
    H ← ∅
    for i ∈ 1, …, B do
        S(i) ← a bootstrap sample from S
        h_i ← RandomizedTreeLearn(S(i), F)
        H ← H ∪ {h_i}
    end for
    return H
end function

function RandomizedTreeLearn(S, F)
    At each node:
        f ← a very small subset of F
        Split on the best feature in f
    return the learned tree
end function
The algorithm works as follows: for each tree in the forest, we select a bootstrap sample
from S, where S(i) denotes the ith bootstrap. We then learn a decision-tree using a modi-
fied decision-tree learning algorithm. The algorithm is modified as follows: at each node
of the tree, instead of examining all possible feature splits, we randomly select some
subset of features f ⊆ F, where F is the set of features. The node then splits on the best
feature in f rather than F. In practice, f is much smaller than F. Deciding on which feature
to split is oftentimes the most computationally expensive aspect of decision-tree learn-
ing. By narrowing the set of features, we drastically speed up the learning of the tree.
In KNN, the predicted output of the input feature vector is the average output
value of all the “K” neighbors within the feature space. KNN is used to classify the
spectrogram image using brainwave balancing application in BCI. It is also used to
classify normal brain activity through EEG signal. This algorithm was used to clas-
sify chronic mental stress and performs satisfactory accuracy and sensitivity analy-
sis. KNN algorithm is also used to classify the features of left- and right-hand motor
imagery from EEG signals.
SVM uses optimal hyperplane to separate feature vectors between two classes.
Different types of kernel functions are available for feature classification. The
Gaussian kernel function in SVM has been applied in BCIs to classify P300
evoked potentials. SVM has been implemented for multiclass analysis of EEG channel
signals. In a similar work, a semisupervised SVM has significantly reduced the
training effort of P300-based BCI speller. Often in data science, we have hun-
dreds or even millions of features, and we want a way to create a model that only
includes the most important features. This has three benefits. First, we make our
model simpler to interpret. Second, we can reduce the variance of the model, and
therefore overfitting. Finally, we can reduce the computational cost (and time)
of training a model. The process of identifying only the most relevant features
is called “feature selection.” Random forests are often used for feature selection
in a data science workflow. The reason is that the tree-based strategies used by
random forests naturally rank features by how well they improve the purity of the nodes,
that is, by the mean decrease in impurity over all trees. Nodes with the greatest decrease
in impurity occur at the start of the trees, while nodes with the least decrease
in impurity occur at the end of the trees. Thus, by pruning trees below a particular
node, we can create a subset of the most important features.
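The sketch below illustrates this impurity-based feature selection with scikit-learn's random forest; the synthetic data and the 0.05 importance threshold are assumptions made for illustration, not values used in this chapter.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 20))                   # hypothetical feature matrix
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)    # labels driven by features 0 and 3

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Mean decrease in impurity, averaged over all trees in the forest.
importances = forest.feature_importances_
selected = np.where(importances > 0.05)[0]       # keep only the most relevant features
print("selected features:", selected)
X_reduced = X[:, selected]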
10.3.2 State-Transition Machine
The machine M can be described as a six-tuple M = (Q, Σ, Γold, Γnew, F, T),
where
Q = Finite number of emotional states: {Neutral, Harmony, Depressed,
PositiveEmotion, NegativeEmotion, etc.}
Σ = Finite number of channels: {FP1, FP2, T7, T8, F3, F4, F7, F8, PO7}.
Γold = The old value of the channels.
Γnew = The new value of the channels after a fixed (“T”) time interval.
F = Transition function which is described as follows.
F: (q1 × Σ × Γold × Γnew × T) → q2, where q1, q2 ∈ Q, with the present state q1
and the next state q2.
T = Fixed time interval for changing the states.
In the model of emotional state-transition machine, the “Neutral” is the starting state
of emotion from where human emotion can be changed to different states based on
alpha/beta/gamma/delta wave activation in the specific region of the hemisphere.
The increased or decreased signal values of specific channels within the hemisphere
are captured for a fixed time interval, and the current state, the channel's current
value, and its new value are then passed to the transition function. The transition function of
the machine calls the proposed transition function algorithm, which will take the
decision of state change with respect to the given inputs. A model of an emotional
state-transition machine is illustrated in Figure 10.4.
There is no predefined final state of this machine, as the emotion of people can be
changed to any state (positive, negative, depressed, and harmony) from the neutral
state after a fixed time interval and that state will be the final state of the machine at
that particular time instance.
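A minimal Python sketch of such a machine is given below. The state and channel names follow the chapter, but the threshold logic inside the transition function is a simplified, hypothetical stand-in for the proposed transition-function algorithm, which also depends on the frequency-band activity.

# Emotional state-transition machine M = (Q, Sigma, Gamma_old, Gamma_new, F, T): a sketch.
Q = {"Neutral", "Harmony", "Depressed", "PositiveEmotion", "NegativeEmotion"}
SIGMA = {"FP1", "FP2", "T7", "T8", "F3", "F4", "F7", "F8", "PO7"}
T = 60  # fixed time interval in seconds (assumed)

def transition(state, channel, old_value, new_value):
    """Simplified transition function F(q1, channel, old, new, T) -> q2.
    The channel-to-emotion mapping mirrors Table 10.4; the rule that an
    increase (or decrease) triggers the change is an illustrative assumption."""
    if state != "Neutral":
        return state
    if channel in ("T7", "T8") and new_value > old_value:
        return "PositiveEmotion"
    if channel in ("FP1", "FP2") and new_value > old_value:
        return "NegativeEmotion"
    if channel in ("F3", "F7") and new_value < old_value:
        return "Depressed"
    if channel == "PO7" and new_value > old_value:
        return "Harmony"
    return state  # wrong inputs: loop in the same state

print(transition("Neutral", "T7", -861, -706))  # -> PositiveEmotion (row 1 of Table 10.4)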
10.4 RESULT ANALYSIS
10.4.1 Requirement
The proposed transition function algorithm and classification are performed on the
following computing platform.
Initially, the signal is captured at a fixed time interval and then fast Fourier transform
(FFT) is applied on that signal to remove the noise from the raw EEG signal. After
filtering and smoothing processes, we classify the samples of the useful channels
related to emotion detection. Once the classification process is over, we proceed
to the implementation of the emotional state-transition machine.
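A minimal sketch of this frequency-domain denoising step using NumPy is shown below; the 600 Hz sampling rate follows the dataset description above, while the 0.5–45 Hz pass band and the synthetic signal are illustrative assumptions rather than the exact filter used in this work.

import numpy as np

fs = 600.0                      # sampling frequency (Hz), as stated for the dataset
t = np.arange(0, 10, 1.0 / fs)  # 10 s of signal
eeg = np.sin(2 * np.pi * 10 * t) + 0.3 * np.random.randn(t.size)  # toy EEG plus noise

# FFT, zero out components outside an assumed 0.5-45 Hz band, inverse FFT.
spectrum = np.fft.rfft(eeg)
freqs = np.fft.rfftfreq(eeg.size, d=1.0 / fs)
spectrum[(freqs < 0.5) | (freqs > 45.0)] = 0.0
eeg_filtered = np.fft.irfft(spectrum, n=eeg.size)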
TABLE 10.1
Average Values of Precision, Recall, and f-Score
Algorithm Precision (%) Recall (%) F-Score (%)
SVM 90 92 91
Random forest 78 71 74
KNN 71 70 70
TABLE 10.2
Kernel Accuracy (SVM)
Kernels Accuracy
Linear 62
Polynomial 90
Gaussian radial basis function 87
Sigmoid 62
The confusion matrix shows how much our classification model gets "confused" during predictions. It gives us
insight not only on the errors being made by a classifier but also, more importantly,
on the types of errors that are being made. Each row of the table represents the
instances in a predicted class versus the instances in an actual class in the column.
The name is taken from the fact that it makes it easy to see whether the system is
confusing two classes (i.e., commonly mislabeling one as another).
Here, we have evaluated the classification accuracy of our proposed work using
the confusion matrix. Sensitivity/recall refers to the true prediction of each class
(emotional state) when it is actually true. Specificity refers to the correct rejection of each
class (emotional state) when it is actually false. Precision means how many of the
truly classified samples are relevant in each class. F-measure computes the accuracy
of the multiclass problem, and it is calculated using precision and recall values.
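These per-class quantities can be computed directly from a confusion matrix, for instance with scikit-learn as sketched below; the true and predicted label arrays are hypothetical and do not reproduce the results of Table 10.3.

import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

classes = ["positive", "negative", "angry", "harmony"]
rng = np.random.default_rng(3)
y_true = rng.choice(classes, size=200)   # hypothetical ground truth
y_pred = rng.choice(classes, size=200)   # hypothetical classifier output

cm = confusion_matrix(y_true, y_pred, labels=classes)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=classes, zero_division=0)

# Specificity (true negative rate) per class derived from the confusion matrix.
total = cm.sum()
for i, c in enumerate(classes):
    tp = cm[i, i]
    fp = cm[:, i].sum() - tp
    fn = cm[i, :].sum() - tp
    tn = total - tp - fp - fn
    print(c, "recall=%.2f precision=%.2f specificity=%.2f F=%.2f"
          % (recall[i], precision[i], tn / (tn + fp), f1[i]))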
From Table 10.3 and Figure 10.8, we can conclude that emotional state “Harmony”
(class 4) has the maximum sensitivity, so the classifier shows the maximum true
prediction rate for that class, whereas emotional states “Positive emotion” (class 1),
“Angry” (class 3), and “Harmony” (class 4) have specificity in the same range with
least specificity for state “Negative emotion” (class 2). “Negative emotion” (class 2)
also has the least precision value, indicating the minimum number of correctly clas-
sified samples that belong to a given class. Class 4 has the highest accuracy based on
maximum F-measure, with class 2 giving us the least. We also observe that class 3
has the least sensitivity and hence yields the fewest true positive predictions when
the class is actually present.
TABLE 10.3
Predefined Classwise Statistical Feature Values
Performance Analysis of Each Class Using Statistical Parameters
Parameters (%) Positive Emotion Negative Emotion Angry Harmony
Sensitivity 90 92 86 96
Precision 94 80 92 97
Specificity 96 92 97 98
F-measure 91 86 89 97
With this model, we illustrate how the emotion of a subject has changed from the neutral
state to a different emotional state after a fixed time interval based on alpha/beta/
gamma/delta activity of EEG signal on different cortices (frontal, frontopolar, tempo-
ral, etc.). The increased or decreased activity on the left or right side of the different
cortices are measured by the proper channel (frontal-F3/F4, frontopolar-FP1/FP2,
temporal-T7/T8) placed over those cortex regions. The emotional state of a subject can
be changed using our proposed transition function algorithm based on the increased
or decreased value of each channel. Initially, from the neutral or starting state, if the
transition algorithm finds suitable parameters, then the machine will move to the next
state; otherwise, the machine will remain in the same state. The mapping between the
different emotional states and the corresponding channel values after a fixed time
interval is given in Table 10.4. The state-transition activity based on the old and new
value of each channel is shown in Figures 10.9–10.13.
The emotional state-transition machine describes the transition from one emo-
tional state to another using a modified value of different channels in a fixed time
interval. Here we assume that neutral is the initial or starting emotional state of
the machine. Electrodes placed over the frontal section of the brain are responsible
for the transition from a neutral state to depressed emotion and negative to positive
emotion. Electrodes used in the temporal section are useful for the transition from
neutral to positive emotion. The machine moves from one state to another using
the proposed transition function algorithm. The transition function algorithm was
described in the “proposed work” section. For implementing the state-transition pro-
cess, we have noticed the initial and final signal values of different channels for
each subject, which is described in the state-channel table (Table 10.4). The table
contains different emotional states and the initial, new value of all the required chan-
nels at a specific instance of time for all combination of transition possible on our
state-transition diagram. After some fixed time interval (1-min), we check the value
of those required channels with respect to alpha/beta/gamma activity of EEG and
TABLE 10.4
State-Channel Table
Sl. No.  Current State  Electrode  Old Value (Time 0.005 min)  New Value (Time 1.005 min)  Next State
1 Neutral T7 −861 −706 Positive
2 Neutral FP1 −840 −657 Negative
3 Neutral F3 −628 −851 Depressed
4 Neutral PO7 193 202 Harmony
5 Negative F3 147 319 Positive
6 Positive F4 246 412 Negative
7 Neutral F7 298 197 Depressed
8 Neutral PO7 202 193 Harmony
9 Neutral T8 378 653 Positive
10 Neutral FP2 103 263 Negative
11 Negative F7 −177 −113 Positive
12 Positive F8 600 717 Negative
13 Positive (F4, F8) {511, 471} {356, 128} Positive
14 Negative (F3, F7) {319, 424} {105, 295} Negative
15 Neutral T7 175 102 Neutral
FIGURE 10.9 State transition from neutral (starting) state to positive (final) state with cor-
rect inputs.
FIGURE 10.10 State transition from neutral (starting) state to negative (final) state with
correct inputs.
FIGURE 10.11 State transition from neutral (starting) state to depressed (final) state with
correct inputs.
FIGURE 10.12 State transition from neutral (starting) state to harmony (final) state with
correct inputs.
FIGURE 10.13 State transition from neutral (starting) state to the same state with wrong
inputs.
execute our transition function algorithm. The next state of the finite machine
is verified using the prediction of the classification algorithm. We describe the state
channels table using both initial and final values of the machine in Table 10.4 and
illustrate the process of state transition in Figures 10.9–10.13. The state-transition
machine (M) is described using six tuples mentioned in the proposed work section.
In Figures 10.9–10.13, we have shown how the different emotional states are reached
based on different electrode values discussed in Table 10.4. From Table 10.4, we can
conclude that increased value of channel (PO7) placed over parietooccipital area
leads to harmony emotions from the neutral state. Harmony emotions have a high
correlation in alpha band power over the parietooccipital area [27]. A depressed
subject has less activation in the left frontal area than the normal subject [28]. An
increase in the value of the frontopolar area leads to negative emotions (anger, fear,
sad, etc.). The EEG activity in the frontal and frontopolar cortex is strongly cor-
related with emotion regulation [29]. Hence, negative emotions and depressed
states are highly correlated. An increase in the value of channels placed over a tem-
poral area leads to positive emotions. In paper [30], the authors describe that positive
emotions exhibit stable patterns of increased activity in the lateral temporal
area for the beta and gamma bands, whereas a stable pattern of higher gamma response in the
frontopolar cortex leads to negative emotions. The state-transition diagram has been
drawn using “JFLAP” software [31]. From Figures 10.9, 10.10, and 10.12, we have
observed that the subject’s emotion has changed from neutral to positive, negative
and harmony states due to the increased value of specific channels, but the emotion
of the subject has changed to a depressed state due to the decreased value of chan-
nel, as shown in Figure 10.11. All the inputs are taken from Table 10.4. Here we
assume that, initially, all the subjects are in “Neutral” state, so we consider “Neutral”
state as a starting state in our machine. Though we have captured the brain signal
through various channels, the machine moves to only one destination state at a fixed
time interval after taking the user inputs, so we only consider one specific channel
value (T7/FP1/F3/PO7, etc.) as input at a fixed time interval. For wrong inputs (T7.
oldvalue and T7.newvalue), the machine does not move to the final emotional state (posi-
tive), but only loops within the same starting state (Figure 10.13). In Figure 10.14, we
have plotted the state transition based on the changed value of different electrodes,
and increased channel value of FP1, T8, and PO7 indicate negative, positive, and
harmony emotions, respectively, whereas decreased channel value of F3 indicates a
depressed state. Neutral state (starting state in emotional state-transition machine) of
the machine shows no change in the channel’s value. Figure 10.15 (3D plot) describes
FIGURE 10.15 3D plotting of emotions and corresponding electrodes with the amplitude
value of those electrodes.
the different emotional states based on the corresponding electrodes with their
amplitude/channel value.
FIGURE 10.16 Accuracy comparison of the proposed approach with some popular
approaches.
TABLE 10.5
Comparison of Proposed Model with Other Popular Models
Study  Number of Electrodes  Sampling Frequency (Hz)  Classifier  Number of Emotions  Stimuli  Accuracy (%)
Proposed 13 512 SVM 4 Video 90
Wang et al., 2014 128 200 SVM 3 Music, 90
video
Bastos-Filho 32 256 kNN 2 Video 69.5
et al., 2012
D. Chakladar, S. Chakraborty 4 512 LDA 4 Video 82
Chanel et al., 19 256 LDA 3 Video, 63
2011 games
four emotions than the model referred to by [28], but the number of emotions recog-
nized by our model is four, which is greater than that of [28]. So, considering all the
parameters, we can conclude that our proposed classification work stands apart from
the other models of emotion detection.
10.4.6 Computational Complexity
In the case of classification, training with SVM has a time complexity that depends on
both the size of the training set and the number of support vectors, and can be bounded
approximately by n³, hence O(n³). The state-transition process runs with linear time
complexity O(n) for input size n, since each input takes constant time to process.
10.5 CONCLUSION
In this chapter, we have done a comprehensive survey of various classification
methods used in BCI research as well as investigated different emotional states of
humans. SVM, KNN, and random forest are the techniques that we have been spe-
cifically interested in and worked on. The classification accuracy has been computed
using the confusion matrix and different statistical parameters (sensitivity, specific-
ity, prevalence, and precision). After comparing with other well-known models, we
can conclude that our proposed model outperforms others. The classification pro-
cess for a large amount of nonlinear data using different SVM kernels
shows that the most efficient kernel for emotion classification is the polynomial kernel.
After achieving 90% accuracy in classification, we use the state-transition model
for different emotional states. For performance analysis of the classifier, we have
used a confusion matrix, and based on different parameters (sensitivity, specificity,
prevalence, precision), we measure the accuracy and prediction of the classification
process. Both cases of performance analysis show quite optimistic results. The state-
transition process will be useful for emotional state prediction in the near
future. We also intend to implement the state-transition machine for paralyzed people in
the near future, which would boost the performance of BCI systems.
ACKNOWLEDGMENT
No research funding has been received for this work.
REFERENCES
1. Petrushin, V. (1999, November). Emotion in speech: Recognition and application to call
centers. In Engr., St. Louis, MO, pp. 7–10.
2. Black, M. J., & Yacoob, Y. (1997). Recognizing facial expressions in image sequences
using local parameterized models of image motion. International Journal of Computer
Vision, 25(1), 23–48.
3. Anderson, K., & McOwan, P. W. (2006). A real-time automated system for the rec-
ognition of human facial expressions. IEEE Transactions on Systems, Man, and
Cybernetics, Part B (Cybernetics), 36(1), 96–105.
4. Wagner, J., Kim, J., & André, E. (2005, July). From physiological signals to emotions:
Implementing and comparing selected methods for feature extraction and classifica-
tion. In Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on
(pp. 940–943). IEEE., Amsterdam, Netherlands
5. Kim, K. H., Bang, S. W., & Kim, S. R. (2004). Emotion recognition system using short-
term monitoring of physiological signals. Medical and Biological Engineering and
Computing, 42(3), 419–427.
6. Brosschot, J. F., & Thayer, J. F. (2003). Heart rate response is longer after negative emo-
tions than after positive emotions. International Journal of Psychophysiology, 50(3),
181–187.
7. Memon, Q., & Mustafa, A. (2015). Exploring mobile health in a private online social
network. International Journal of Electronic Healthcare, 8(1), 51–75.
8. Murugappan, M., Ramachandran, N., & Sazali, Y. (2010). Classification of human
emotion from EEG using discrete wavelet transform. Journal of Biomedical Science
and Engineering, 3(04), 390.
9. Chakladar, D. D., & Chakraborty, S. (2017, March). Study and analysis of a fast mov-
ing cursor control in a multithreaded way in brain computer interface. In International
Conference on Computational Intelligence, Communications, and Business Analytics
(pp. 44–56). Springer, Singapore.
10. Davidson, R. (1979). Frontal versus parietal EEG asymmetry during positive and nega-
tive affect. Psychophysiology, 16(2), 202–203.
11. Chakladar, D. D., & Chakraborty, S. (2018). Multi-target way of cursor movement in
brain computer interface using unsupervised learning. Biologically Inspired Cognitive
Architectures, Elsevier, 25, 88–100.
12. Picard, R. W. (2000). Toward computers that recognize and respond to user emotion.
IBM Systems Journal, 39(3.4), 705–719.
13. Davidson, R. J., & Fox, N. A. (1982). Asymmetrical brain activity discriminates
between positive and negative affective stimuli in human infants. Science, 218(4578),
1235–1237.
14. Baumgartner, T., Esslen, M., & Jancke, L. (2006). From emotion perception to emotion
experience: Emotions evoked by pictures and classical music. International Journal of
Psychophysiology, 60(1), 34–43.
15. Zhang, Q., & Lee, M. (2010). A hierarchical positive and negative emotion understand-
ing system based on integrated analysis of visual and brain signals. Neurocomputing,
73(16), 3264–3272.
16. Cornelius, R. R. (1996). The Science of Emotion: Research and Tradition in the
Psychology of Emotions. Prentice-Hall, Inc.
17. Sander, D., Grandjean, D., & Scherer, K. R. (2005). A systems approach to appraisal
mechanisms in emotion. Neural Networks, 18(4), pp. 317–352.
18. Yazdani, A., Lee, J. S., & Ebrahimi, T. (2009, October). Implicit emotional tagging
of multimedia using EEG signals and brain computer interface. In Proceedings of the
First SIGMM Workshop on Social Media (pp. 81–88). ACM., Beijing, China
19. Lin, Y. P., Wang, C. H., Jung, T. P., Wu, T. L., Jeng, S. K., Duann, J. R., & Chen, J. H.
(2010). EEG-based emotion recognition in music listening. IEEE Transactions on
Biomedical Engineering, 57(7), 1798–1806.
20. Chakladar, D. D., & Chakraborty, S. (2018). EEG based emotion classification using
correlation based subset selection. Biologically Inspired Cognitive Architectures,
Elsevier, 24, 98–106.
21. Molina, G. G., Tsoneva, T., & Nijholt, A. (2009, September). Emotional brain-
computer interfaces. In 2009 3rd International Conference on Affective Computing
and Intelligent Interaction and Workshops (pp. 1–9). IEEE., Amsterdam, Netherlands.
22. Kaggle dataset on EEG-based emotion detection. www.kaggle.com/c/inria-bci-challenge.
23. Hsu, C.-W., Chang, C.-C., & Lin, C.-J. (2003). A Practical Guide to Support Vector
Classification. Dept. of Computer Sci. National Taiwan Uni, Taipei, 106, Taiwan.
24. Wang, Z., & Xue, X. (2014). Multi-class support vector machine. In Support Vector
Machines Applications, Editors: Yunqian Ma, Guodong Guo, (pp. 23–48). Springer,
Cham.
25. Python Software Foundation. Python Language Reference, version 2.7.
26. Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., & Layton,
R. (2013). API design for machine learning software: experiences from the scikit-learn
project. arXiv preprint arXiv:1309.0238.
27. Hu, X., Yu, J., Song, M., Yu, C., Wang, F., Sun, P., & Zhang, D. (2017). EEG correlates
of ten positive emotions. Frontiers in Human Neuroscience, 11, 26.
28. Henriques, J. B., & Davidson, R. J. (1991). Left frontal hypoactivation in depression.
Journal of Abnormal Psychology, 100(4), 535.
29. Dennis, T. A., & Solomon, B. (2010). Frontal EEG and emotion regulation: Electrocortical
activity in response to emotional film clips is associated with reduced mood induction
and attention interference effects. Biological Psychology, 85(3),456–464.
30. Zheng, W. L., Zhu, J. Y., & Lu, B. L. (2017). Identifying stable patterns over time for
emotion recognition from EEG. Preprint. IEEE Transactions on Affective Computing.
31. www.jflap.org/.
Part III
Applications and New Trends
in Data Science
11 Comparison of Gradient
and Textural Features
for Writer Retrieval in
Handwritten Documents
CONTENTS
11.1 Introduction ..................................................................................................... 249
11.2 Literature Review.......................................................................................... 250
11.3 Adopted Features........................................................................................... 251
11.3.1 Local Binary Pattern ........................................................................ 252
11.3.2 Histogram of Oriented Gradients...................................................... 253
11.3.3 Gradient Local Binary Pattern ......................................................... 253
11.3.4 Pixel Density...................................................................................... 255
11.3.5 Run Length Feature........................................................................... 255
11.4 Matching Step................................................................................................ 256
11.5 Experimental Evaluation............................................................................... 256
11.5.1 Evaluation Criteria............................................................................. 257
11.5.2 Experimental Setup........................................................................... 258
11.5.3 Retrieval Results................................................................................ 261
11.6 Discussion and Comparison.......................................................................... 262
11.7 Conclusion ....................................................................................................... 263
References...............................................................................................................264
11.1 INTRODUCTION
In the past years, numerous handwriting analysis systems were developed for real-
life purposes, such as identity verification and bank check reading. Such systems
are commonly based on the recognition of handwritten signatures, words, or digits.
More recently, new applications have emerged in this field. For instance, the require-
ment of navigation and indexing tools dealing with the huge amount of digitized
handwritten documents promoted the use of word spotting and Writer Retrieval.
Introduced by Atanasiu et al. in 2011 [1], the Writer Retrieval aims to find all docu-
ments belonging to the same writer, despite being written at different moments,
in different languages, and thus having different textual contents. The retrieval is
achieved without the need of any prior knowledge about the writer’s identity. In
fact, the system tries to select all documents sharing the same writing traits with the
query. Conventionally, Writer Retrieval systems are composed of feature generation
and matching steps (see Figure 11.1).
To generate features, all descriptors employed for handwriting recognition, such
as gradient features, textural features, and topological features can be used [1]. The
matching step is performed through simple dissimilarity measures. Thereby, to
improve retrieval scores, researchers focused on developing robust features.
Presently, we investigate the performance of different kinds of feature generation
schemes for solving Writer Retrieval. Precisely, local binary patterns (LBP) and the
rotation invariant LBP are used for texture characterization. As gradient features,
we propose the use of histogram of oriented gradients (HOG) and gradient LBPs
(GLBP). Finally, pixel density and run length feature (RLF) are used as topological
features. To achieve the retrieval stage, various similarity and dissimilarity measures
are used. Experiments are conducted on the two versions of ICDAR (International
Conference on Document Analysis and Recognition)-2011 writer identification
dataset.
The remaining chapter is organized as follows: Section 11.2 reviews the state of the
art. Section 11.3 describes the methods adopted for feature generation, while Section
11.4 presents the similarity measures employed in the retrieval step. Experiments
and discussion are reported in Sections 11.5 and 11.6, respectively. The last section
gives the conclusion of this work.
11.2 LITERATURE REVIEW
Writer Retrieval is a new topic in the handwriting recognition field introduced in
2011 [1]. For this reason, there are few research works dealing with this topic, which
commonly overlaps with writer identification. Since the retrieval step is car-
ried out through distance-based matching, the performance is governed by the
feature generation step. Therefore, the state of the art reveals the research efforts
to find robust features for characterizing handwritten documents. Roughly, various
kinds of features such as statistical, topological, as well as trainable features have
been employed for Writer Retrieval. In [1], Atanasiu et al. employed the local ori-
entation of handwriting contours to generate discriminative handwriting features.
The retrieval step was based on Euclidean distance. Experiments were conducted
using samples of 301 writers selected from the Institute of Applied Mathematics
(IAM) dataset [2]. Each writer is represented by five samples. Results reveal that
the principal mode of the orientation pdf provides the best performance. On the
other hand, Shirdhonkar et al. [3] proposed the use of contourlet transform associ-
ated with Canberra distance. Experiments were carried out on a private dataset of
15 writers represented by 16 handwritten texts. The contourlet was compared with
a curvelet transform. Using soft retrieval ranking, both methods derived precisions
lower than 60%, which means that precisions will be substantially worse when using
a hard retrieval score. This outcome reveals that global transforms do not provide
a good characterization. Therefore, Fiel and Sablatnig [4] introduced the use of
scale-invariant feature transform (SIFT) as local document features. The Chi-square
distance was adopted for the matching step. The retrieval accuracy obtained for the
IAM dataset reached 97.2% by considering the percentage of correct ranking among
the two TOP similar documents (TOP-2). In [5], retrieval tests were performed on
the Computer and Vision Laboratory (CVL) dataset. Several features, such as con-
tour orientation, edge orientation, run length, and local microstructure features, were
compared using Euclidian distance. The best retrieval score that is about 96.8% at
TOP-2 is obtained by using microstructure features. Thereafter, this score has been
improved to 97.1% by using local SIFT features, for which visual words were repre-
sented using Gaussian mixture models [6]. However, the results obtained were not
replicable for other datasets. For instance, a substantially lower performance was
obtained when applying the same system on ICDAR-2011 dataset, which contains
images of two-line texts. For this dataset, the retrieval score did not exceed 87.7%.
Furthermore, in [7], authors employed a histogram of templates (HOT) for feature
generation associated with reinforced retrieval step that uses support vector machine
(SVM). The result obtained on CVL dataset is about 70% at TOP-2. After that,
the same research team replicated the system using gradient features. Experiments
conducted on ICDAR-2011 dataset shows a high accuracy [8]. Finally, in [9], Fiel
and Sablatnig introduced the use of convolutional neural networks (CNN) as feature
generators, where CVL score was improved to 98.3% at TOP-2.
The common observation from all these research works is that the retrieval
performance is mainly related to the feature’s robustness, since the retrieval step
is carried out through distance measures. Moreover, experiments reveal that
there is no feature that can give an optimal performance whatever the dataset.
In this respect, this chapter addresses the comparison of different descriptors
and analyses their behavior with different similarity measures. Precisely, new
features such as LBPs and gradient features are proposed and evaluated with
respect to classical features.
11.3 ADOPTED FEATURES
To develop the Writer Retrieval system, a set of textural, gradient, and topological
features are investigated. These features are presented in the following sections.
11.3.1 Local Binary Pattern
The LBP operator compares each pixel with its P neighbors sampled on a circle of radius R around it. Let g_c denote the gray value of the central pixel and g_p (p = 0, …, P − 1) the gray values of its neighbors. The LBP code is calculated as

LBP_{P,R}(x, y) = \sum_{p=0}^{P-1} s(g_p - g_c) \times 2^p    (11.1)

with

s(l) = \begin{cases} 1, & l \ge 0 \\ 0, & l < 0 \end{cases}    (11.2)
The LBP code is obtained by multiplying the thresholded values with weights given
by the corresponding pixels and summing up the result, which replaces the central
pixel value, as shown in Figure 11.2.
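As an illustration, the following sketch computes the basic LBP code of Equation (11.1) for the interior pixels of a grayscale image with the 8-neighborhood (P = 8, R = 1); it is a simplified stand-in, since circular sampling with interpolation (e.g., P = 16, R = 2 as used later) would be needed in practice.

import numpy as np

def lbp_8_1(img):
    """Basic LBP (P=8, R=1) on the interior pixels of a 2-D grayscale array,
    following Equations (11.1)-(11.2)."""
    img = img.astype(np.int32)
    gc = img[1:-1, 1:-1]                       # central pixels
    # 8 neighbors, ordered clockwise from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(gc)
    for p, (dy, dx) in enumerate(offsets):
        gp = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        code += ((gp - gc) >= 0).astype(np.int32) << p   # s(gp - gc) * 2^p
    return code

# Histogram of LBP codes used as a 256-bin texture descriptor.
rng = np.random.default_rng(4)
image = rng.integers(0, 256, size=(64, 64))
hist, _ = np.histogram(lbp_8_1(image), bins=256, range=(0, 256))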
A major drawback of LBP is related to its variability towards small rotation or
illumination changes. To get rotation invariance within LBP, the rotation invariant
uniform LBP LBPriu was introduced in [11,12]. It is calculated according to the fol-
lowing equation:
LBP^{riu}_{P,R}(x, y) = \begin{cases} \sum_{p=0}^{P-1} s(g_p - g_c), & U(P) \le 2 \\ P + 1, & \text{otherwise} \end{cases}    (11.3)

U(P) = \left| s(g_{P-1} - g_c) - s(g_0 - g_c) \right| + \sum_{p=1}^{P-1} \left| s(g_p - g_c) - s(g_{p-1} - g_c) \right|    (11.4)
11.3.2 Histogram of Oriented Gradients
HOG features are computed from the image gradients as follows:
1. Compute the horizontal and vertical gradient components:

G_x(x, y) = I(x + 1, y) - I(x - 1, y)    (11.5)
G_y(x, y) = I(x, y + 1) - I(x, y - 1)    (11.6)

2. Calculate the gradient magnitude and phase by using Equations (11.7) and (11.8), respectively:

G(x, y) = \sqrt{G_x^2(x, y) + G_y^2(x, y)}    (11.7)
\theta(x, y) = \tan^{-1}\left( \frac{G_y(x, y)}{G_x(x, y)} \right)    (11.8)

The gradient orientations are then accumulated into local orientation histograms computed over image cells.
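A direct NumPy transcription of Equations (11.5)–(11.8) might look as follows; the random test image is a placeholder, and the later histogram-binning steps of HOG are not reproduced here.

import numpy as np

def gradient_magnitude_orientation(img):
    """Central-difference gradients, magnitude, and phase, Eqs. (11.5)-(11.8)."""
    img = img.astype(np.float64)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]     # I(x+1, y) - I(x-1, y)
    gy[1:-1, :] = img[2:, :] - img[:-2, :]     # I(x, y+1) - I(x, y-1)
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    orientation = np.arctan2(gy, gx)           # robust form of tan^-1(Gy/Gx)
    return magnitude, orientation

rng = np.random.default_rng(5)
mag, ori = gradient_magnitude_orientation(rng.integers(0, 256, size=(32, 32)))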
11.3.3 Gradient Local Binary Pattern
The size of the GLBP matrix is defined by all possible angle and width values.
Specifically, there are eight possible Freeman directions for angle values, while the
number of “1” in uniform patterns can go from 1 to 7 (see Figure 11.4). This yields
a 7 × 8 GLBP matrix, in which gradient features are accumulated. Finally, the L2
normalization is applied to scale features in the range [0, 1].
FIGURE 11.4 Summary of the GLBP calculation for a central pixel from [19].
11.3.4 Pixel Density
Density features are calculated by applying a uniform grid over document images
(see Figure 11.5). This consists of dividing the document images into a predefined
number of cells having the same size as much as possible. For each cell, the pixel
density corresponds to the ratio between the number of text pixels and the cell size.
In this respect, each cell is substituted by the density value. Then, the size of the
density feature vector is equal to the number of cells.
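A small sketch of this grid-based density feature is given below; the 8 × 6 grid anticipates the configuration retained later in the experiments, and the assumption that text pixels are darker than the background is made only for illustration.

import numpy as np

def pixel_density(img, rows=8, cols=6, text_threshold=128):
    """Divide the image into rows x cols cells and return, for each cell,
    the ratio of text pixels (assumed dark) to the cell size."""
    binary = img < text_threshold                    # assumption: text is darker than background
    h_edges = np.linspace(0, img.shape[0], rows + 1, dtype=int)
    w_edges = np.linspace(0, img.shape[1], cols + 1, dtype=int)
    feats = []
    for r in range(rows):
        for c in range(cols):
            cell = binary[h_edges[r]:h_edges[r + 1], w_edges[c]:w_edges[c + 1]]
            feats.append(cell.mean())                # text pixels / cell size
    return np.array(feats)

rng = np.random.default_rng(6)
density_vector = pixel_density(rng.integers(0, 256, size=(600, 400)))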
11.4 MATCHING STEP
The matching step consists of evaluating the similarity between the feature vector
of the query document with feature vectors of all documents of a given database.
In this stage, dissimilarity or similarity measures can be used. To calculate retrieval
accuracies, similarities obtained for all referential documents are sorted from the
largest to the smallest value, and in case of dissimilarities, from the smallest to the
largest value. Table 11.1 reports the similarity and dissimilarity measures adopted
in this work.
The matching process is carried out by calculating the dissimilarity between
the query and all documents available in the reference dataset. Then, documents
expressing highest similarities with the query are supposed to belong to the query
writer.
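The matching stage can be sketched as follows, using the Manhattan and Chi-square distances of Table 11.1 as examples; the feature vectors are random placeholders for the document descriptors.

import numpy as np

def rank_documents(query, reference, metric="manhattan"):
    """Return indices of reference documents sorted from most to least
    similar to the query (smallest dissimilarity first)."""
    if metric == "manhattan":
        d = np.abs(reference - query).sum(axis=1)
    elif metric == "chi2":
        d = ((reference - query) ** 2 / (reference + 1e-12)).sum(axis=1)
    else:
        raise ValueError(metric)
    return np.argsort(d)

rng = np.random.default_rng(7)
reference = rng.random((208, 50))             # e.g., 208 documents x 50 features
query = rng.random(50)
print(rank_documents(query, reference)[:7])   # the TOP-7 most similar documents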
11.5 EXPERIMENTAL EVALUATION
Experiments were conducted on a public dataset, which was proposed for writer
identification contest organized at the International Conference on Document
Analysis and Recognition (ICDAR 2011). The original version of this dataset
consists of 208 documents written by 26 writers. Each writer contributed eight
documents containing text in four languages (English, French, German, and
Greek).1 A cropped corpus has been extracted by considering the first two lines
1 www.cvc.uab.es/icdar2011competition.
TABLE 11.1
Adopted Similarity and Dissimilarity Measures

Dissimilarity/Similarity Measure    Equation
Euclidean distance    D_{Euclidean}(a, b) = \left( \sum_{i=1}^{n} |a_i - b_i|^2 \right)^{1/2}
Manhattan distance    D_{Manhattan}(a, b) = \sum_{i=1}^{n} |a_i - b_i|
Canberra distance     D_{Canberra}(a, b) = \sum_{i=1}^{n} \frac{|a_i - b_i|}{|a_i| + |b_i|}
Chi-square distance   D_{\chi^2}(a, b) = \sum_{i=1}^{n} \frac{(a_i - b_i)^2}{b_i}
Cosine similarity     S_{Cosine}(a, b) = \frac{\sum_{i=1}^{n} a_i b_i}{\left(\sum_{i=1}^{n} a_i^2\right)^{1/2} \left(\sum_{i=1}^{n} b_i^2\right)^{1/2}}
Jaccard similarity    S_{Jaccard}(a, b) = \frac{\sum_{i=1}^{n} a_i b_i}{\sum_{i=1}^{n} a_i^2 + \sum_{i=1}^{n} b_i^2 - \sum_{i=1}^{n} a_i b_i}

a and b are two feature vectors, and n is the size of these vectors.
of each document [23]. Figure 11.7 shows the samples from the original and
cropped corpuses.
11.5.1 Evaluation Criteria
Two typical performance measures for information retrieval systems are used. The
first is the TOP-N precision, which is the most commonly used criterion [3,5,6].
It corresponds to the percentage of correct documents among the N most similar
documents to the query. Since each writer is represented by eight documents, we
consider the retrieval from TOP-2 to TOP-7.
The second criterion is the mean average precision (MAP). It allows a global
evaluation of the Writer Retrieval system, since it considers the mean of the average
precision (AP), which is calculated for all query documents. Considering only the
ranks where documents are relevant, the AP is calculated as follows:
AP = \frac{\sum_{k} [P@k] \times rel(k)}{M}    (11.10)
With
P@k: Precision at rank k; rel(k): equal to 1 if the document at rank k is relevant, and 0 otherwise.
M: The number of relevant documents in the dataset (seven for ICDAR-2011).
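For completeness, a short sketch of the AP and MAP computation of Equation (11.10) is given below; the ranked relevance lists are hypothetical.

import numpy as np

def average_precision(relevance, n_relevant):
    """AP = sum_k (P@k * rel(k)) / M over a ranked list of 0/1 relevance flags."""
    relevance = np.asarray(relevance, dtype=float)
    hits = np.cumsum(relevance)
    precision_at_k = hits / np.arange(1, relevance.size + 1)
    return float((precision_at_k * relevance).sum() / n_relevant)

# Hypothetical ranked results for two queries (1 = same writer as the query).
queries = [[1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0],
           [1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1]]
mean_ap = np.mean([average_precision(r, n_relevant=7) for r in queries])
print("MAP = %.4f" % mean_ap)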
11.5.2 Experimental Setup
Some of the adopted features such as LBP and HOG require an experimental tuning
of their setup parameters. Therefore, a set of experiments is performed for all pos-
sible configurations to find the best retrieval performance. For this step, we consider
the original ICDAR. In the case of LBP, experiments were conducted over the whole
images by varying the number of considered neighbors P as well as the radius of the
neighborhood R. Results obtained are shown in Figure 11.8, where the LBP16×2 gives
the best MAP, which is about 78.34%.
Pixel density and HOG are locally calculated by applying a uniform grid over the
document images. Consequently, the suitable number of cells needs an experimental
FIGURE 11.9 Variations of MAP according to the number of cells used to calculate HOG.
(The best score appears in light gray in pdf version.)
selection. In this respect, we varied this parameter from 4 to 36 cells for HOG
and from 4 to 100 for pixel density. Figures 11.9 and 11.10 depict the MAP scores.
Roughly, the MAP fluctuates in the range of 50%–70% for HOG features, while it
varies from 27% to 43% when using pixel density. This weak performance is due
to the fact that we substitute each cell by a single value that is pixel density. On the
contrary, HOG provides 81 features for each cell. Thereby, its best MAP, which is
about 62.67%, is obtained for 3 × 6 cells. Pixel density exhibits a lower score with
41.87% for 8 × 6 cells.
In a second experiment, features were tested with all dissimilarity and similarity
measures to select the suitable measure for each descriptor. Table 11.2 presents the
FIGURE 11.10 Variation of MAP according to the number of cells used to calculate pixel
density. (The best score appears in light gray.)
MAP scores obtained for original ICDAR-2011. It is clear that there is no similarity
measure that gives the best performance for all features. We note that Manhattan
distance is the most suitable for HOG, pixel density, and RLF. On the other hand,
Canberra distance fits with LBPriu and GLBP features, while the Chi-square dis-
tance gives the best performance for LBP. Moreover, one can see that dissimilarity
measures outperform similarities, which seem less effective for Writer Retrieval.
So, in the retrieval experiments, each feature is associated with the most suitable
dissimilarity.
Overall, the best configuration for LBP descriptors is (16 × 2) associated with
the Chi-square distance for LBP and Canberra distance for LBPriu. HOG and pixel
density features reach their best performance using uniform grids with (3 × 6) and
(8 × 6) cells, respectively. On the other hand, Manhattan distance gives the best
performance for pixel density, HOG, and RLF. Finally, GLBP is more efficient
TABLE 11.2
MAP (%) Obtained Using different Similarity and Dissimilarity Measures
Descriptor Manhattan Euclidean Canberra Chi-square Cosine Jaccard
LBP16x2 54.74 27.97 63.10 78.34 60.96 6.34
LBPriu16×2 27.17 22.51 50.43 42.91 33.63 6.34
HOG 62.67 53.45 54.33 49.11 53.29 50.99
GLBP 49.25 46.77 79.01 58.55 61.55 9.62
Pixel density 41.87 41.38 41.63 41.33 31.89 7.92
RLF 92.22 85.15 87.84 90.88 84.87 50.48
when it is associated with Canberra distance. Once each feature has been asso-
ciated to its suitable dissimilarity measure, we performed the Writer Retrieval
experiments.
11.5.3 Retrieval Results
Writer Retrieval test is carried out by considering each document from the d ataset
as a query. Specifically, a vector containing dissimilarities between the query and all
existing documents in the dataset is calculated and ranked to select the most similar
documents. Tables 11.3 and 11.4 jointly report the retrieval results obtained at different
ranks and MAP scores using all descriptors with their suitable dissimilarities. For
both datasets, RLF-based system outperforms all other systems. In fact, the precision
reaches 99.03% at TOP-2 for the original ICDAR-2011 and 96.15% for the cropped
version. The MAP scores are about 92.22% and 83.16%, respectively. On the other
hand, GLBP and LBP16×2 show competitive performance. However, the accuracy
drops significantly for farther ranks. For instance, it falls from 96.39% at TOP-2
to 80.15% at TOP-7 when using LBP16×2, and from 97.59% at TOP-2 to 79.87% at
TOP-7 for GLBP features.
TABLE 11.3
Retrieval Results (%) for Original ICDAR-2011
TOP-2 TOP-3 TOP-4 TOP-5 TOP-6 TOP-7 MAP
LBP16×2 + Chi-square 96.39 93.31 90.62 87.78 84.37 80.15 78.34
LBPriu16×2 + Canberra 88.70 78.36 70.31 65.48 61.21 56.79 50.43
HOG + Manhattan 88.70 81.50 77.16 72.30 69.15 65.65 62.67
GLBP + Canberra 97.59 95.51 92.30 88.36 83.73 79.87 79.01
Pixel density + Manhattan 75.00 65.54 59.02 54.51 50.32 46.91 41.87
RLF + Manhattan 99.03 98.55 97.43 96.27 94.61 92.14 92.22
TABLE 11.4
Retrieval Results (%) for Cropped ICDAR-2011
TOP-2 TOP-3 TOP-4 TOP-5 TOP-6 TOP-7 MAP
LBP16×2 + Chi-square 76.68 65.54 58.53 53.55 50.00 47.18 40.93
LBPriu16×2 + Canberra 64.90 51.12 43.99 37.98 33.01 30.35 21.71
HOG + Manhattan 73.07 60.89 52.76 47.78 43.58 40.93 32.83
GLBP + Canberra 89.18 82.37 78.24 73.84 70.59 66.27 62.89
Pixel density + Manhattan 69.47 57.37 49.87 45.19 41.26 37.98 28.50
RLF + Manhattan 96.15 93.50 92.30 90.14 86.92 82.45 83.16
TABLE 11.5
Comparison with the State of the Art for Original ICDAR-2011
References Feature Matching TOP-2 TOP-7
[6] SIFT and GMM Cosine similarity 99.3% 91.2%
[8] HOG Cosine Similarity 86.5% 53.3%
GLBP Cosine Similarity 88.5% 65.9%
Proposed methods LBP16×2 Chi-square distance 96.4% 80.2%
LBPriu16×2 Canberra distance 88.7% 56.9%
HOG Manhattan distance 88.7% 65.7%
GLBP Canberra distance 97.6% 79.9%
Pixel density Manhattan distance 75.0% 46.9%
RLF Manhattan distance 99.0% 92.1%
TABLE 11.6
Comparison with State of the Art for Cropped ICDAR-2011
References Feature Matching TOP-2 TOP-7
[6] SIFT and GMM Cosine similarity 87.0% 69.2%
[8] HOG Cosine similarity 69.2% 41.8%
GLBP Cosine similarity 80.8% 48.9%
Proposed methods LBP16×2 Chi-square distance 76.7% 47.2%
LBPriu16×2 Canberra distance 64.9% 30.4%
HOG Manhattan distance 73.1% 40.9%
GLBP Canberra distance 89.2% 66.3%
Pixel density Manhattan distance 69.5% 37.9%
RLF Manhattan distance 96.2% 82.5%
• Topological features are the most adequate for Writer Retrieval in handwritten documents. The RLF descriptor shows the best accuracy on the two datasets. Even with only two lines of text, the system achieves high precision with a small loss at farther ranks. In fact, when testing the cropped ICDAR-2011, the MAP reaches 83.16%, and the loss of accuracy from TOP-2 to TOP-7 is about 13.7%, which is small compared with all other systems.
• The pixel density descriptor provides insufficient information, since it replaces each cell by a single density measure. Such a characterization cannot reflect the writing style of each individual, which is why its retrieval scores were substantially weaker. Compared with RLF, there is a loss of accuracy of about 24% at TOP-2 and 45% at TOP-7.
• Furthermore, the classical LBP outperforms LBPriu with a gain of 7.69% and 12% at TOP-2 on the original and cropped sets, respectively. However, this descriptor seems to need more than two lines of text, since its performance drops significantly on the cropped dataset, where the MAP loses 37.41%.
• Gradient descriptors show competitive performance, especially GLBP, for which the precision reaches 97.6% for the original set and 89.2% for the cropped set. The HOG feature requires a local calculation over a uniform grid to enhance retrieval accuracy. However, its scores remain lower than those obtained using GLBP, RLF, and LBP16×2.
11.7 CONCLUSION
Currently, libraries contain a huge number of digitized documents, including historical manuscripts and other archives. Since the manual handling of these data is not trivial, researchers have tried to develop automatic navigation tools such as spotting and retrieval techniques. In this respect, Writer Retrieval systems are developed to find all documents belonging to the same person. Specifically, such systems should recognize the handwriting of the same individual, whatever the language and textual content. Conversely, they should differentiate between the handwritings of different individuals. This can be done by using features that reduce the intrawriter variability while increasing the interwriter variability. To find suitable features for the Writer Retrieval task, this work evaluates various kinds of features. Specifically, we used LBP and its rotation-invariant uniform version LBPriu as textural descriptors, together with HOG and GLBP as gradient descriptors. In addition, two topological features, RLF and pixel density, are used. For a straightforward evaluation, several similarity and dissimilarity measures, including the Manhattan, Euclidean, Canberra, Chi-square, and Jaccard distances and Cosine similarity, are employed in the experiments. Experiments conducted on the two ICDAR-2011 datasets showed that the RLF descriptor outperforms all other descriptors as well as the state of the art. These findings reveal that pixel run length provides a pertinent characterization of the writing style. On the other hand, textural and gradient features, such as LBP16×2, GLBP, and HOG, provide satisfactory retrieval, especially when associated with a suitable dissimilarity measure.
REFERENCES
1. V. Atanasiu, L. Likforman-sulem, N. Vincent, Writer retrieval exploration of a novel
biometric scenario using perceptual features derived from script orientation, in: 11th
International Conference on Document Analysis and Recognition, 18–21 September,
Beijing, 2011, pp. 628–632.
2. C. Djeddi, I. Siddiqi, L. Souici-Meslati, A. Ennaji, Text-independent writer recognition
using multi-script handwritten texts, Pattern Recognition Letters 34 (2013) 1196–1202.
3. M.S. Shirdhonkar, M.B. Kokare, Writer based handwritten document image retrieval
using contourlet transform, Nagamalai D., Renault E., Dhanuskodi M. (eds.) Advances
in Digital Image Processing and Information Technology. Communications in
Computer and Information Science 205 (2011) 108–117.
4. S. Fiel, R. Sablatnig, Writer retrieval and writer identification using local features, in:
10th IAPR International Workshop on Document Analysis Systems, 27–29 March,
Queensland, 2012, pp. 145–149.
5. F. Kleber, S. Fiel, M. Diem, R. Sablatnig, CVL-database: An off-line database for
writer retrieval, writer identification and word spotting, in: International Conference
on Document Analysis and Recognition, 28–21 September, Beijing, 2013, pp. 560–564.
6. S. Fiel, R. Sablatnig, Writer identification and writer retrieval using the fisher vector
on visual vocabularies, in: International Conference on Document Analysis and
Recognition, 25–28 August, Washington, 2013, pp. 545–549.
7. M.L. Bouibed, H. Nemmour, Y. Chibani, Writer retrieval using histogram of templates
features and SVM, in: 3rd International Conference on Electrical Engineering and
Control Applications, 25–26 November, Constantine, 2017, pp. 537–544.
8. M.L. Bouibed, H. Nemmour, Y. Chibani, Evaluation of gradient descriptors and
dissimilarity learning for writer retrieval, in: 8th International Conference on
Information Science and Technology, 2–6 June, Cordoba, 2018, pp. 252–256.
9. S. Fiel, R. Sablatnig, Writer identification and retrieval using a convolutional neural
network, in: International Conference on Computer Analysis of Images and Patterns,
2–4 September, Valetta, 2015, pp. 26–37.
10. T. Ojala, M. Pietikäinen, D. Harwood, A comparative study of texture measures with
classification based on featured distributions, Pattern Recognition 29 (1996) 51–59.
11. J.F. Vargas, M.A. Ferrer, C.M. Travieso, J.B. Alonso, Off-line signature verification
based on grey level information using texture features, Pattern Recognition 44 (2)
(2011) 375–385.
12. M. Pietikäinen, A. Hadid, G. Zhao, T. Ahonen. Computer Vision Using Local Binary
Patterns. Springer-Verlag, London, 2011.
13. N. Dalal, B. Triggs, Finding people in images and videos, PhD thesis, French National
Institute for Research in Computer Science and Control (INRIA), July 2006.
14. M.B. Yilmaz, B. Yanikoglu, C. Tirkaz, A. Kholmatov, Offline signature verification
using classifier combination of HOG and LBP features, In: International Joint
Conference on Biometrics, 11–13 October, Washington DC, 2011, pp. 1–7.
15. N. Bouadjenek, H. Nemmour, Y. Chibani, Histogram of Oriented Gradients for writer’s
gender, handedness and age prediction, in: International Symposium on Innovations in
Intelligent Systems and Applications, 2–4 September, Madrid, 2015, pp. 1–5.
16. M.L. Bouibed, H. Nemmour, Y. Chibani, New gradient descriptors for keyword
spotting in handwritten documents, in: 3rd International Conference on Advanced
Technologies for Signal and Image Processing, 22–24 May, Fez, 2017, pp. 1–5.
17. N. Jiang, J. Xu, W. Yu, S. Goto, Gradient local binary patterns for human detection,
in: International Symposium on Circuits and Systems, 19–23 May, Beijing, 2013,
pp. 978–981.
18. N. Bouadjenek, H. Nemmour, Y. Chibani, Age, gender and handedness prediction from
handwriting using gradient features, in: 13th International Conference on Document
Analysis and Recognition, 23–26 August, Nancy, 2015, pp. 1116–1120.
19. Y. Serdouk, H. Nemmour, Y. Chibani. New off-line handwritten signature v erification
method based on artificial immune recognition system. Expert Systems with
Applications, 51 (2016) 186–194.
20. H. Freeman. On the encoding of arbitrary geometric configurations, IRE Transactions
on Electronic Computers EC-10 (1961) 260–268.
21. C. Djeddi, I. Siddiqi, L. Souici-Meslati, A. Ennaji, Text-independent writer recog-
nition using multi-script handwritten texts, Pattern Recognition Letters 34 (2013)
1196–1202.
22. J. Liu, W. Li, Y. Tian. Automatic thresholding of gray-level pictures using two-
dimension Otsu method. In International Conference on Circuits and Systems, 15–16
June, Shenzhen, 1991, pp. 325–327.
23. G. Louloudis, N. Stamatopoulos, B. Gatos, ICDAR 2011 writer identification contest,
in: 11th International Conference on Document Analysis and Recognition, 18–21
September, Beijing, 2011, pp. 1475–1479.
12 A Supervised Guest
Satisfaction Classification
with Review Text
and Ratings
Himanshu Sharma, Aakash,
and Anu G. Aggarwal
University of Delhi
CONTENTS
12.1 Introduction................................................................................................... 267
12.2 Related Literature.......................................................................................... 270
12.2.1 Guest Satisfaction and Online Reviews............................................. 270
12.3 Methodology.................................................................................................. 273
12.3.1 Data Description and Analysis.......................................................... 273
12.3.2 Data Cleaning.................................................................................... 274
12.3.3 Latent Semantic Analysis.................................................................. 275
12.3.4 Classifiers and Performance Measures.............................................. 276
12.4 Experimental Results..................................................................................... 277
12.4.1 Features Related to Guest Satisfaction.............................................. 277
12.4.2 Hotel Guest Satisfaction Prediction................................................... 279
12.5 Discussions and Conclusion........................................................................... 279
12.6 Implications...................................................................................................280
12.6.1 Theoretical Implications....................................................................280
12.6.2 Managerial Implications....................................................................280
12.7 Limitations and Future Scope....................................................................... 281
References............................................................................................................... 281
12.1 INTRODUCTION
The digital revolution has resulted in the transformation of many businesses to
online mode. Digitalization has modified the operational strategy of online firms
due to advantages such as cost-effectiveness, 24 × 7 availability, no geographical
limitations, and low entry and exit barriers, to name a few [1–2]. The upward trend in online marketing is especially noticeable in the service sector. Popular categories under the service sector include retail, banking, hotels, airlines, healthcare, and education. Automation in conducting business practices has forced market
Hoteliers are recognizing the impact of guest satisfaction on their present and future business prospects, and this impact is thus being highlighted in recent studies. Guest satisfaction evaluates how well the customers' expectations are fulfilled by the service provider [18]. Since the content available on hotel web sites plays
a part in creating awareness for potential travelers, e-services are efficiently utilizing EWOM as a promotion tool by encouraging users to provide their feedback on the platform [19]. OTA web sites enable their users to rate the hotel and share their experience in textual form. These overall ratings and textual reviews empirically represent the guest experience, which is observed by people from the travel community as well as by hoteliers to make informed decisions in the customers' interest [20]. Guest satisfaction affects the sales, revisit intention, positive EWOM probability, and market reputation of the firm [21]. Since the hospitality industry deals in experience goods, researchers suggest that EWOM holds particular importance here, as the quality level of such goods is obscured before consumption [22]. Online reviews have a notable effect in the hotel sector when compared with other tourism segments, since they are referred to by most users when making stay decisions. Therefore, the guest experience can affect all aspects of the hotel business.
There exist some latent dimensional variables that represent a large number of attributes which consumers might not explicitly mention. This creates a need for techniques that evaluate these collections (termed documents) by sorting, probing, tagging, and searching using computers. With the help of machine learning (ML), researchers have successfully proposed models that find patterns of words in these documents under hierarchical probabilistic models. This is referred to as topic modeling. The rationale behind topic modeling is to identify word-use patterns and determine how documents portraying similar patterns should be connected. Under text analytics, the model makes use of the bag-of-words concept and ignores word ordering [23]. Topic modeling generally relies on four methods, namely latent semantic analysis (LSA), probabilistic LSA (PLSA), latent Dirichlet allocation (LDA), and correlated topic model (CTM). LSA, earlier known as latent semantic indexing (LSI), creates vector-based representations of texts to capture semantic content, making use of a predefined dictionary [24].
PLSA automates document indexing based on a statistical model for factor analysis
of count data, without referring to a predefined dictionary [25]. LDA is a Bayesian-
based unsupervised technique for topic discovery in abundant documents, without
considering any parental distribution [24]. CTM helps in discovering the topics in a group of documents, underpinned by a logistic normal distribution [26].
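Although this chapter itself adopts LSA (Section 12.3.3), a brief, hedged sketch of one of the methods listed above, LDA via scikit-learn on a toy set of review strings, illustrates the shared bag-of-words workflow; the corpus and parameter choices below are purely illustrative.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus of raw review strings (illustrative only)
reviews = [
    "great location and friendly staff",
    "room was dirty and the street was noisy",
    "excellent breakfast and a beautiful pool",
    "staff were rude but the location was central",
]

# Bag-of-words: word order is ignored, only term counts are kept
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(reviews)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the top words of each discovered topic
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {k + 1}: {', '.join(top)}")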
Another term that has gained popularity with the widespread use of internet
technology is ML. ML refers to algorithms that enable systems to learn and predict automatically from previous experience, without being explicitly programmed [27]. It
is divided into two categories, namely supervised and unsupervised. Under super-
vised learning, the future events can be predicted with the help of labeled past data,
whereas unsupervised learning does not require a labeled dataset [28]. However,
if both types of data are available, then semisupervised learning is a good option.
A key component of ML is classification algorithms. A few popular classifiers adopted in this study are naïve Bayes (NB), decision tree (DT), random forest (RF),
support vector machine (SVM), and artificial neural networks (ANN). NB is a set of
algorithms using Bayes’ theorem at grass-root level. It considers all the features to
be independent of each other. DT organizes the testing conditions in a tree structure,
where root and internal nodes represent different test conditions and their character-
istics. RF is an ensemble model that considers many DTs at one time. The result from
this model is usually better than the result from one of the individual models. SVM
makes use of hyperplanes in a multidimensional space that divides cases of diverse
class labels. It consists of regression and classification by making use of continuous
and categorical variables. An ANN comprises layers of neurons, which convert an
input vector into some output. Each unit takes an input, applies a nonlinear function
to it, and then forwards the output to the succeeding layer.
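For illustration, the five classifiers just described are all available in scikit-learn; the sketch below, which uses synthetic data in place of the review features actually used in this chapter, shows how they might be instantiated and compared.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the review feature matrix and satisfaction labels
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

classifiers = {
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(kernel="rbf"),
    "ANN": MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=0),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")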
This chapter aims to evaluate the satisfaction level of guests from a hotel
review dataset obtained from TripAdvisor. We make use of overall star ratings as a proxy variable for the guest satisfaction level when labeling the data. First, topic modeling
is applied with the help of LSA, which results in topics that represent the whole
useful review information. Then, for measuring the satisfaction level, we make
use of classifiers such as NB, DT, RF, SVM, and ANN, and check their accuracy using performance measures. Thus, we aim to answer the following research questions:
12.2 RELATED LITERATURE
12.2.1 Guest Satisfaction and Online Reviews
EWOM is defined as “any positive or negative statement made by potential, actual, or
former customer about a product or company, made available to a multitude of people and
institutions via internet” [29]. Due to the innate uncertainty associated with an untried
product/service, consumers often depend upon word-of-mouth for making purchase
decisions. However, in this new digital age where communication of individual’s opin-
ions knows no bounds, the UGCs have dethroned the hoteliers from the role of travel
opinion leaders [13]. Purchase involvement and service experience have also been
identified as vital antecedents of review, providing motivation. A travel trip whether for
recreational or work purpose is always followed by an experience felt or encountered
by the traveler. These guest experiences are the major source of information for the
service provider [30]. It can provide a snapshot of hotel performance, which can be
evaluated and analyzed for various managerial purposes. Understanding the customer experience involves various intricate factors, mainly because of its personal nature.
Due to the competitive nature of hospitality industry, it is vital for the hoteliers to
understand guest experience and improve their satisfaction level.
A study consisting of 60,000 reviews made by travelers on a distribution web site was conducted to determine the factors impacting the satisfaction of guests [19]. They found that both the qualitative and quantitative aspects of reviews matter for satisfaction. Using an integrative analysis, the findings suggested that the factors with a high influence over customers are cleanliness, the bathroom, and the beds. Reviews also placed considerable weight on the hotel's convenience to attractions, shopping, airports, and downtown. Food and beverage items play a part as well. Pantelidis [31] studied the impact of
food characteristics over guest satisfaction. He considered 2,471 customer comments
from 300 restaurants in London. Content analysis results showed the importance
of food, and in particular, starters influence the experience. However, along with
food, other determinants of guest experience are service, ambience, price, menu,
and décor. He found that the results were consistent even during an economic crisis, and
that these experiences determine the longevity of business and customer’s intention
of loyalty.
An ecotourism satisfaction study was performed to determine the ecolodge stay
experience [32]. Content analysis was used to analyze an ecotourism data obtained
from Tripadvisor site. A two-step statistical procedure was implemented to clas-
sify the experience level into four categories, namely satisfiers, dissatisfiers, critical,
and neutral. The study considered the frequency of online reviews, expression of
favorable attitude, and the overall satisfaction levels reported by the reviewers along
with textual comments. Li et al. [33] took 42,668 online reviews from 778 hotels to
study guest satisfaction. Content analysis was used to obtain the study results. The
findings showed that transportation convenience, food and beverage management,
convenience to tourist destinations, and value for money impact the guest experience
of tourists from both luxury and budget hotels. Guests paid more consideration to
bed, reception services, room size, and decoration.
Another study considering 1,345 customer feedbacks from 97 hotels in the
Hangzhou area was performed to evaluate guest satisfaction [9]. Twenty-three attri-
butes considered as the determinants of customer satisfaction were divided into four categories, namely satisfiers, dissatisfiers, bidirectional, and neutral. These attributes
incorporated features such as room facilities, general hotel facilities, food quality,
dining environment, price, location, and staff service. One-way ANOVA (Analysis
of Variance) results showed the positive influence of soft beds, the availability of Western food, 24-h reception, soundproofing of the room, and on-site parking. Xiang et al. [10] used big data analytics considering 60,648 reviews cover-
ing 10,537 hotels extracted from Expedia.com to study the guest experience towards
hotel. Regression results show the importance of UGC for determining the guest
behavior in hotels. Research findings emphasized semantic differentiations in comparison with motivation and hygiene variables. Also, a strong association between
experience and satisfaction was obtained.
TABLE 12.1
ML and Hotel Online Reviews Literature
Author(s) Purpose of the Study Technique(s)
[27] Sentiment of reviews SVM
[35] Sentiment of reviews NB
[28] Polarity of documents TF-IDF
[36] Sentiment classification of reviews SVM
[37] Hotel service quality NB
[38] Sentiment classification SVM
[39] Opinion spam detection Sparse additive generative model (SAGE)
[40] Fake review classification Logistic regression, RF, DTs, SVM, NB
[41] Opinion mining SVM and fuzzy domain ontology (FDO)
[42] Comparative analysis of online review platforms Topic modeling
[43] Aspect-based sentiment analysis Recurrent neural network (RNN) and SVM
Present study Hotel guest satisfaction SVM, RF, DTs, ANN, NB
12.3 METHODOLOGY
12.3.1 Data Description and Analysis
A text analytics study was conducted with the aim of analyzing the hotel guest satisfaction represented through EWOM and its influence on the overall hotel ratings available at TripAdvisor.com. We used the data of TripAdvisor.com because it is the world's biggest social media platform specific to e-services and has more than 460 million EWOMs related to hotels, restaurants, and other e-services. We used the TripAdvisor.com EWOM dataset from [44]. A total of 569,861 reviews were downloaded in 2015.
TABLE 12.2
Dataset Information
Categories No. of Reviews
Satisfied guests 219,984
Unsatisfied guests 19,875
Total number of hotel reviews (N) 239,859
Total number of reviews 569,861
12.3.2 Data Cleaning
Data cleaning followed processes consistent with previous studies [45,46], including the removal of punctuation, stop words, and non-English words, together with tokenization and stemming. We implemented the text preprocessing through the Natural Language Toolkit (NLTK) module using Python.
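The chapter does not list the exact preprocessing calls, so the following is only a minimal sketch of a typical NLTK-based pipeline covering the steps named above; the sample sentence and function name are illustrative.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean_review(text):
    """Lowercase, strip punctuation and non-English characters, tokenize,
    drop stop words, and stem the remaining tokens."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = text.split()
    return [stemmer.stem(t) for t in tokens if t not in stop_words and len(t) > 2]

print(clean_review("The rooms were spotless, and the staff was wonderful!!!"))
# e.g. ['room', 'spotless', 'staff', 'wonder']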
In the probabilistic formulation, the joint probability of a document d and a word w is expressed through a latent topic variable z as

P(d, w) = P(d) P(w | d), (12.1)

P(w | d) = Σz P(w | z) P(z | d), (12.2)

where z ranges over the latent topics. The model parameters are obtained by maximizing the log-likelihood

L = Σd Σw n(d, w) log P(d, w), (12.3)

where n(d, w) denotes the term frequency, i.e., the number of times w occurred in d. Based on the maximum likelihood function defined in Equation (12.3), we get the probabilities required in Equations (12.1) and (12.2). Note that an equivalent symmetric version of the model can be obtained by inverting the conditional probability P(z | d) by using Bayes' rule.
This chapter uses LSA to obtain and tag guest satisfaction attributes for all 239,859
hotel reviews accessed in our analysis. LSA does not make any assumption regard-
ing the structure of text or the grammatical properties of the language. LSA
extracted 20 topics and every topic contained seven words. These topics represent
the important aspects related to travelers’ satisfaction, depending on the frequency
of occurrence associated with consumers’ own hotel experiences.
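In practice, LSA amounts to a truncated singular value decomposition of the TF-IDF document-term matrix; the sketch below shows one way the 20-topic, seven-word-per-topic extraction could be reproduced with scikit-learn, using a tiny illustrative corpus and a smaller component count in place of the real review data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical preprocessed review strings; the real corpus holds 239,859 reviews
cleaned_reviews = [
    "beach ocean pool resort beautiful view",
    "recommend best hotel staff service housekeeping",
    "room small bed soft bathroom clean",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(cleaned_reviews)

# The chapter extracts 20 topics; 2 are used here because the toy corpus is tiny
svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(X)

terms = tfidf.get_feature_names_out()
for k, comp in enumerate(svd.components_):
    top7 = [terms[i] for i in comp.argsort()[::-1][:7]]
    print(f"Topic {k + 1}: {top7}")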
TABLE 12.3
Classifiers' Performance Measures
Measure Formula Description
Precision TP / (TP + FP) Of the data points predicted as relevant, the fraction that actually were relevant
Recall TP / (TP + FN) Ability to find all the relevant occurrences in the data
F1-measure 2 × (Precision × Recall) / (Precision + Recall) A measure that seeks a balance between precision and recall
Accuracy (TP + TN) / (TP + TN + FP + FN) Effectiveness of a classifier in prediction
obtaining satisfied guests from the collected review dataset. We divided the guest categories into satisfied and unsatisfied guests on the basis of a threshold applied to the overall rating. For our analysis, it is set at 0.60 (i.e., 3/5 = 0.60). If customers gave an overall rating of three or greater, they fall under the category of satisfied guests; otherwise, they are labeled unsatisfied. That is, a customer who rated the hotel at 60% or more is labeled a satisfied guest and, otherwise, an unsatisfied guest.
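Applied to the raw data, the rule above reduces to a one-line thresholding step; the sketch below assumes a pandas DataFrame with hypothetical column names overall_rating and review_text.

import pandas as pd

# Hypothetical raw data; in the chapter this comes from the TripAdvisor dataset
df = pd.DataFrame({
    "review_text":    ["lovely stay", "never again", "decent value"],
    "overall_rating": [5, 1, 3],
})

# Threshold of 0.60 on a 5-point scale, i.e. a rating of 3 or higher
df["satisfied"] = (df["overall_rating"] / 5 >= 0.60).astype(int)
print(df)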
12.4 EXPERIMENTAL RESULTS
Here, we report the findings of dimension extraction for hotel guest satisfaction. We then analyze the influence of these extracted features on hotel guest satisfaction. The experiments comprised feature extraction, analysis of their relative importance, and
guest satisfaction prediction analysis.
TABLE 12.4
Examples of Identified Topic Labels
Topic Relative Weight (%) Topic Relative Weight (%)
Topic 1: Natural Beauty Topic 2: Recommendation
Beach 20.5 Recommend 14.8
Ocean 3.4 Best 12.6
Pool 3.4 Hotel 3.6
Resort 3.3 Staff 1.8
Beautiful 1.8 Service 1.6
Kids 1.6 Modern 1.3
View 1.3 Housekeeping 1.3
good about the hotel's actual appearance. It shows that the selected hotel delivers a clean and beautiful place as promised. Satisficing shows that customers found
satisfactory services by the hotel, which met their expectations towards that hotel.
The features that showed the level of satisfaction and dissatisfaction of hotel guests
are guest experience (critique), guest experience (praise), and recommendation. One
of the topics was labeled as “Hybrid,” as it incorporated more than one group of
words that represents very different levels of hotel guest experiences. The other top-
ics show 14 specific aspects related to service quality (e.g., staff, room experience,
and bathroom). Therefore, all these 20 features can influence the hotel guest satisfac-
tion. The relative percentages are not fixed and may differ across hotels, investors, and owners. However, hotel managers and practitioners
should focus on these features related to hotel guest satisfaction. It can enhance the
level of guest satisfaction towards that hotel.
of these features on the hotel guest satisfaction prediction. Our proposed hotel guest
prediction model is very important to identify the features that aid marketers to
judge the guest satisfaction level. The suggested predictive model found predictive
f-measure above 80% with respect to each classifier using the top-20 hotel guest satisfac-
tion features (Figure 12.4). It also obtained a predictive f-measure above 85% for all
other classifiers except NB. The RF classifier is noted to be a good predictive classi-
fier that outperformed the others as shown in Figure 12.4. The relative importance of
each extracted topic is also examined.
12.6 IMPLICATIONS
12.6.1 Theoretical Implications
Many of the previous studies in the hospitality sector focusing on customer (guest)
satisfaction were conducted empirically or through econometric modeling. Compared
with the earlier studies that made use of primary datasets obtained from survey
questionnaires and focus groups (in case of empirical studies) or using technical
aspects of UGC, such as length, sentiments, valence, and many more, this study uses
topic model to gain insights into what guests talk about in their reviews, and based
on the obtained topics, we judge the satisfaction level with the help of ML classifiers.
One of the research contributions is the extraction of topics that represent the collection of buzzwords that reviewers talk about. The topic modeling technique adopted in
this chapter is LSA. The 20 most important features (topics) extracted from 239,859
hotel reviews were nightlife, value, amenities, natural beauty, guest experience
(critique), recommendation, staff, guest experience (praise), the location of hotel,
accessibility, car parking, visitor suitability (critique), satisficing, style and décor,
bathroom, deals, visitor suitability (praise), hybrid, room experience, and high stan-
dards. These are consistent with previous works in tourism literature.
Since online reviews are available in abundance and represent information in an unstructured form, they are complex to analyze. This study tackles
these shortcomings by using five supervised ML techniques, namely RF, DTs, NB,
SVM, and ANNs to classify the reviews into satisfiers and dissatisfiers. The RF clas-
sifier delivers the best performance when compared with other four classifiers, with
performance values having 91.94% accuracy, 88.8% f-measure, 92.75% recall, and
85.18% precision. This is consistent with recent studies conducted using ML.
12.6.2 Managerial Implications
The study offers a few significant implications for managers. It shows hoteliers and investors the importance of the latent dimensions of guest satisfaction present in the UGC available on their sites. The study makes use of text mining methodologies
that help the practitioners in understanding the linguistic aspects of online reviews
and how these features impact the overall satisfaction of the guests. Therefore, apart
from concentrating on the perceptions made by guests through textual reviews
and improving their services accordingly, the marketers must also take note of the
obtained latent features helpful in achieving the maximum satisfaction level of cus-
tomers. Recent statistical reports project the importance of EWOM and how these
REFERENCES
1. Aggarwal, A.G. and Aakash, N.A., Multi-criteria-based prioritisation of B2C e-commerce
website. International Journal of Society Systems Science, 2018. 10(3): pp. 201–222.
2. Memon, Q. and Khoja, S., Semantic web for program administration. International
Journal of Emerging Technologies in Learning, 2010. 5(4): pp. 31–40.
3. Tandon, A., Sharma, H., and Aggarwal, A.G., Assessing travel websites based on
service quality attributes under intuitionistic environment. International Journal of
Knowledge-Based Organizations (IJKBO), 2019. 9(1): pp. 66–75.
4. Casaló, L.V., Flavián, C., and Guinalíu, M., The role of satisfaction and website u sability
in developing customer loyalty and positive word-of-mouth in the e-banking services.
International Journal of Bank Marketing, 2008. 26(6): pp. 399–417.
5. Memon, Q. and Khoja, S., Semantic web approach for program assessment.
International Journal of Engineering Education, 2009, 25(5): pp. 1020–1028.
6. Statista. 2018 [cited 2018 August]; Available from: www.statista.com/statistics/289770/
india-retail-e-commerce-sales/.
7. Hao, J.-X., et al., A genetic algorithm-based learning approach to understand customer
satisfaction with OTA websites. Tourism Management, 2015. 48(June): pp. 231–241.
8. Ali, F., Hotel website quality, perceived flow, customer satisfaction and purchase inten-
tion. Journal of Hospitality and Tourism Technology, 2016. 7(2): pp. 213–228.
9. Zhou, L., et al., Refreshing hotel satisfaction studies by reconfiguring customer review
data. International Journal of Hospitality Management, 2014. 38: pp. 1–10.
10. Xiang, Z., et al., What can big data and text analytics tell us about hotel guest expe-
rience and satisfaction? International Journal of Hospitality Management, 2015.
44: pp. 120–130.
11. Zhang, Y. and Cole, S.T., Dimensions of lodging guest satisfaction among guests
with mobility challenges: A mixed-method analysis of web-based texts. Tourism
Management, 2016. 53: pp. 13–27.
12. Zhao, Y., Xu, X., and Wang, M., Predicting overall customer satisfaction: Big data
evidence from hotel online textual reviews. International Journal of Hospitality
Management, 2019. 76: pp. 111–121.
13. Berezina, K., et al., Understanding satisfied and dissatisfied hotel customers: Text
m ining of online hotel reviews. Journal of Hospitality Marketing & Management,
2016. 25(1): pp. 1–24.
14. Shopify. 2018 [cited 2018 August]; Available from: www.shopify.com/enterprise/
global-ecommerce-statistics.
15. Mishra, A., et al., Adolescent’s eWOM intentions: An investigation into the roles of
peers, the Internet and gender. Journal of Business Research, 2018. 86: pp. 394–405.
16. Fan, Z.-P., Che, Y.-J., and Chen, Z.-Y., Product sales forecasting using online reviews
and historical sales data: A method combining the Bass model and sentiment analysis.
Journal of Business Research, 2017. 74: pp. 90–100.
17. Aakash, A. and Aggarwal, A.G., Role of EWOM, product satisfaction, and website
quality on customer repurchase intention, in Strategy and Superior Performance of
Micro and Small Businesses in Volatile Economies. Editors: João Conrado de Amorim
Carvalho and Emmanuel M.C.B. Sabino, 2019, IGI Global. pp. 144–168.
18. Hargreaves, C.A., Analysis of hotel guest satisfaction ratings and reviews: an
application in Singapore. American Journal of Marketing Research, 2015. 1(4):
pp. 208–214.
19. Stringam, B.B. and Gerdes Jr, J., An analysis of word-of-mouse ratings and guest
comments of online hotel distribution sites. Journal of Hospitality Marketing &
Management, 2010. 19(7): pp. 773–796.
20. Liu, Y., et al., Big data for big insights: Investigating language-specific drivers of hotel
satisfaction with 412,784 user-generated reviews. Tourism Management, 2017. 59:
pp. 554–563.
21. Radojevic, T., et al., The effects of traveling for business on customer satisfaction with
hotel services. Tourism Management, 2018. 67: pp. 326–341.
22. Min, H., Lim, Y., and Magnini, V.P., Factors affecting customer satisfaction in
responses to negative online hotel reviews: The impact of empathy, paraphrasing, and
speed. Cornell Hospitality Quarterly, 2015. 56(2): pp. 223–231.
23. Wallach, H.M., Topic modeling: Beyond bag-of-words. In Proceedings of the 23rd
international conference on Machine learning. 2006. ACM. Pittsburgh, Pennsylvania,
USA
24. Crossley, S., Dascalu, M., and McNamara, D., How important is size? An investigation
of corpus size and meaning in both latent semantic analysis and latent dirichlet alloca-
tion. In The Thirtieth International Flairs Conference. 2017. Marco Island, FL, USA
25. Hofmann, T., Probabilistic latent semantic indexing. In ACM SIGIR Forum. 2017.
ACM.
26. Dybowski, T. and Adämmer, P., The economic effects of US presidential tax com-
munication: Evidence from a correlated topic model. European Journal of Political
Economy, 55(C), 511–525, 2018.
27. Zheng, W. and Ye, Q., Sentiment classification of Chinese traveler reviews by support
vector machine algorithm. In Intelligent Information Technology Application, 2009.
IITA 2009. Third International Symposium on. 2009. IEEE., Nanchang, China.
28. Shi, H.-X. and Li, X.-J., A sentiment analysis model for hotel reviews based on super-
vised learning. In Machine Learning and Cybernetics (ICMLC), 2011 International
Conference on. 2011. IEEE., Guilin
29. Chu, S.-C. and Kim, Y., Determinants of consumer engagement in electronic word-of-
mouth (eWOM) in social networking sites. International Journal of Advertising, 2011.
30(1): pp. 47–75.
30. Saha, G.C. and Theingi, Service quality, satisfaction, and behavioural intentions:
A study of low-cost airline carriers in Thailand. Managing Service Quality: An
International Journal, 2009. 19(3): pp. 350–372.
31. Pantelidis, I.S., Electronic meal experience: A content analysis of online restaurant
comments. Cornell Hospitality Quarterly, 2010. 51(4): pp. 483–491.
32. Lu, W. and Stepchenkova, S., Ecotourism experiences reported online: Classification of
satisfaction attributes. Tourism management, 2012. 33(3): pp. 702–712.
33. Li, H., Ye, Q., and Law, R., Determinants of customer satisfaction in the hotel industry:
An application of online review analysis. Asia Pacific Journal of Tourism Research,
2013. 18(7): pp. 784–802.
34. Xiang, Z., Schwartz, Z., and Uysal, M., What types of hotels make their guests (un)
happy? Text analytics of customer experiences in online reviews, in Information
and Communication Technologies in Tourism 2015, L. Tussyadiah and A. Inversini,
Editors. 2015, Springer. pp. 33–45.
35. Baharudin, B. Sentence based sentiment classification from online customer reviews.
In Proceedings of the 8th International Conference on Frontiers of Information
Technology. 2010. ACM., Islamabad
36. Yin, P., Wang, H., and Zheng, L., Sentiment classification of Chinese online reviews:
Analysing and improving supervised machine learning. International Journal of Web
Engineering and Technology, 2012. 7(4): pp. 381–398.
37. Duan, W., et al., Mining online user-generated content: Using sentiment analysis tech-
nique to study hotel service quality. In System Sciences (HICSS), 2013 46th Hawaii
International Conference on. 2013. IEEE. Wailea, Maui, HI USA
38. Wang, H., et al., Text feature selection for sentiment classification of Chinese online
reviews. Journal of Experimental & Theoretical Artificial Intelligence, 2013. 25(4):
pp. 425–439.
39. Li, J., et al., Towards a general rule for identifying deceptive opinion spam. In
Proceedings of the 52nd Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers). 2014. Baltimore, Maryland
40. Banerjee, S., Chua, A.Y., and Kim, J.-J., Using supervised learning to classify authentic and
fake online reviews. In Proceedings of the 9th International Conference on Ubiquitous
Information Management and Communication. 2015. ACM.
41. Ali, F., Kwak, K.-S., and Kim, Y.-G., Opinion mining based on fuzzy domain ontol-
ogy and support vector machine: A proposal to automate online review classification.
Applied Soft Computing, 2016. 47: pp. 235–250.
42. Xiang, Z., et al., A comparative analysis of major online review platforms: Implications
for social media analytics in hospitality and tourism. Tourism Management, 2017.
58: pp. 51–65.
43. Al-Smadi, M., et al., Deep recurrent neural network vs. support vector machine for
aspect-based sentiment analysis of Arabic hotels’ reviews. Journal of computational
science, 2018. 27: pp. 386–393.
44. Roshchina, A., Cardiff, J., and Rosso, P., TWIN: personality-based intelligent recom-
mender system. Journal of Intelligent & Fuzzy Systems, 2015. 28(5): pp. 2059–2071.
45. Malik, M. and Hussain, A., An analysis of review content and reviewer variables that
contribute to review helpfulness. Information Processing & Management, 2018. 54(1):
pp. 88–104.
46. Singh, J.P., et al., Predicting the “helpfulness” of online consumer reviews. Journal of
Business Research, 2017. 70: pp. 346–355.
13 Sentiment Analysis
for Decision-Making
Using Machine
Learning Algorithms
Mohamed Alloghani
Liverpool John Moores University
Abu Dhabi Health Services Company (SEHA)
Mohammed Khalaf
Almaaref University College
Jamila Mustafina
Kazan Federal University
CONTENTS
13.1 Introduction................................................................................................... 286
13.2 Literature Review.......................................................................................... 287
13.2.1 Related Studies and Techniques........................................................ 289
13.3 Methods......................................................................................................... 291
13.3.1 Naïve Bayes Classifier....................................................................... 292
13.3.2 Support Vector Machine.................................................................... 292
13.4 Data Analysis and Algorithm Initialization.................................................. 293
13.4.1 Experimental Setup........................................................................... 293
13.4.2 Data Preparation and Model Initialization........................................ 293
13.4.3 Evaluation Measures.......................................................................... 294
13.4.4 Planned Approach............................................................................. 295
13.5 Results and Model Evaluation....................................................................... 295
13.6 Conclusion and Future Works........................................................................ 301
13.7 Acknowledgment........................................................................................... 301
References............................................................................................................... 301
13.1 INTRODUCTION
Since time immemorial, people have always been curious to understand their surroundings. The advent of data mining brought with it "Opinion Mining," sentiment analysis (SA), and other techniques that have improved the quest to understand complex textual circumstances and the environment. In the context of natural
language processing (NLP), opinion mining and SA refer to algorithms that com-
pute and identify critical patterns in opinions, sentiments, and subjectivity from
written texts [1]. The technological advancement and subsequent development in
SA techniques have promoted opinion to a first-class attribute with relatable constructs. Machine learning (ML) algorithms have proven efficient and effective
in different classification and prediction tasks, including the context of document
analysis. Whether using supervised or unsupervised learning techniques, with
proper improvisations and modifications, these algorithms can be used to analyze
and collate negative and positive sentiments in documents and unit texts [1,2]. Both
supervised and unsupervised learning techniques can detect polarity in sentimental
reviews, although their deployment uses different approaches; for example, supervised learning is only applicable when training subsets are available, while unsupervised learning suits datasets with linguistic resources but no training subset. The basis of SA is the assumption that opinions, attitudes, and emotions are
subject to impressions that have innate binary opposition [2]; in this case, opinions
are expressed as either like or dislike, good or bad, positive or negative, among oth-
ers. The analysis of such sentiments uses either NLP, statistics, or ML techniques
in the characterization of sentiments embedded within a text unit. The concern of
SA is to extract a specific content from the provided text. It entails information
retrieval, which involves effective techniques for discarding content that is subjec-
tive. Some SA tasks also recognize and isolate opinion-inclined queries. Above all,
SA summarizes multiple perspectives and returns an actionable stance regarding
the issue of interest.
This technique is applicable in business intelligence and interindustry business
operations. In business intelligence, SA techniques permit search on opinions
relating to a product of interest. Information on prices and customer satisfaction
from a consensus point of view can help in obtaining accurate information, without
necessarily subscribing to a service or creating a profile. It is this feature that marks
a significant difference between SA and other data mining techniques. A major-
ity of scholarly articles on SA have deployed different algorithms at a document
level. Also, most articles distinguish negative and positive sentiments in such texts
in the form of reviews. Nonetheless, the current development in NLP has brought
with it more functionalities, including those that support sentence-level analysis of
multiperspective documents.
SA is a classification process with three distinct levels. The first is a document-level
SA that classifies the unit text as expressing either negative or positive sentiments.
The second is a sentence-level SA that identifies and extracts sentiments found in
each sentence. In the sentence-level analysis, it is mandatory to establish whether a
sentence is subjective or not. For subjective sentences, this SA technique e stablishes
whether it expresses negative or positive sentiments [1,2]. However, it is debatable
whether sentences that bear sentiments are necessarily subjective. The first and
second levels of SA are indistinguishable, although informed and detailed analysis
is obtained using aspect-level SA, which classifies sentiments based on the object
or topic of interest but using an attribute. Hence, aspect-level SA first identifies an
object, for example, a movie and its identifiers, including genre and characters,
among others. The approach is necessary because different people have different
preferences, especially regarding some specific attributes [2]. In the context of mov-
ies, die-hard action fans may rate a movie differently based on their opinion of the stunts and how well or badly they were pulled off.
SA techniques, alongside other NLP algorithms, are becoming more relevant with each emerging Big Data concept. For a long time, Big Data was associated with numerical or quantitative data, but this is no longer the case as new application fields emerge. The dataset used in the research qualifies as big data because it contains over 465,000 tag applications, while the accompanying tag genome data contains 12 million relevance scores covering over 27,000 movies. Big
Data analytics relies on advanced tools to identify patterns, reveal trends and corre-
lations, and such tools are used in different decision-making instances. Specifically,
Big Data has applications in customer, fraud, compliance, and operational analytics.
In customer analytics, it intends to facilitate the processes of product improvement,
value addition, and customer satisfaction strategies, and as such is applicable in the
movie industry. SA of the big data in the movie industry can help in identifying
trends and patterns that can aid in movie creation and budgeting.
13.2 LITERATURE REVIEW
The SA techniques rely on either ML techniques or lexicon-based models. ML
approaches consist of unsupervised and supervised learning algorithms. Supervised
learning requires a labeled dataset with training and validating subsets [3]. The
algorithms learn the data using the training set, make predictions, and test their
deviations from the values in the testing subset. The algorithms include decision
trees, linear discriminants, and rule-based models among others [4–11]. The linear
classifiers consist of support vector machines (SVM) and neural networks (NN),
while probabilistic classifiers consist of maximum entropy, Bayesian network,
and naïve Bayes. The classification in the article relied on SVM and naïve Bayes.
Arguably, these are the two most commonly used techniques in SA [12].
Both ML and lexicon-based techniques have their merits and demerits. For
instance, ML models require labeled training subsets that may not be necessarily
available. Conversely, the lexicon-based models utilize language-dependent
resources; otherwise, these techniques return lower recall measures. SA analy-
sis uses NLP, textual investigation, and computational linguistics for the identification
and extraction of subjective information within the sourced material [13]. That is,
SA techniques have proven effective in emotion detection (ED), transfer learning
(TL), and building resources (BR). All the application domains involve extraction
and analysis of implicit or explicit emotions and the creation of lexical corpora
[14]. Cross-domain learning, or TL, specializes in lexical annotation based
Feature selection occurs when different users apply similar tags to review the same movie, so that the feature space is defined as follows [21]:
F2 = {‘tag2’ : 1, ‘tag5’ : 1, ‘tag6’ : 1} (13.2)
For example, in the two instances, the “tag2” is used to review the same movie so
that a user similarity exists with respect to that tag for that item.
Those tags identified as belonging to the positive class have a higher similar-
ity [22]. However, missing or omitted information is identified and grouped in the
negative class, and this is the tenet of the All Missing as Negative assumption used in most studies, despite the bias it introduces into recommendations [23]. However, the absence of a value does not carry as much information as the presence of a positive one. As such, it is important to allocate lower weights to the negative tags. Most weighting
models assume that the missingness of the data includes negative ratings with equal
probabilistic distribution across users and movies [24]. In this case, the weights are
uniformly assigned using a distribution function defined as δ ∈ [0, 1], and it rep-
resents negative tags [25]. Similarly, a weighting system that considers additional
positive tags can also be adopted because the probability of a user not liking many
movies tends to be high. In this case, the nonexistent information does not have
any impact on the model. Additionally, such a weighting system assumes that the
missing information, even though include positive tags, is inconsequential compared
with the negative sentiments [26]. The weighting approaches are summarized in the
following table.
Table 13.1 summarizes the weighting used in the low-rank approximation of the rating matrix. The weights are based on uniform (consistent), user-oriented (user leaning), and item-oriented (movie leaning) schemes. The consistent scheme assigns a fixed confidence to the missing entries, while the user-leaning scheme awards higher confidence to users with more positive ratings. Further, the movie-leaning scheme uses the assumption that a movie with fewer positive reviews has a higher probability of its missing entries being truly negative.
TABLE 13.1
Weighted Alternating Least Squares
Scheme Positive Tags Negative Tags
Consistent Wij = 1 Wij = δ
User leaning Wij = 1 Wij = Σj Rij
Movie leaning Wij = 1 Wij = m − Σi Rij
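Table 13.1 can be read as three recipes for filling the weight matrix used in weighted alternating least squares. The sketch below is one possible reading, under the assumptions that R is a binary user × movie matrix, that the user-leaning weight of a missing entry is that user's number of positive tags (row sum), and that the movie-leaning weight is m minus that movie's number of positive tags (column sum), with m the number of users; all names are illustrative.

import numpy as np

def negative_weights(R, scheme="consistent", delta=0.1):
    """Weights for the missing (assumed-negative) entries of a binary
    user x movie matrix R; positive entries always receive weight 1."""
    m, n = R.shape
    W = np.ones_like(R, dtype=float)          # positives: W_ij = 1
    missing = (R == 0)
    if scheme == "consistent":                # uniform confidence delta
        W[missing] = delta
    elif scheme == "user":                    # assumed row-sum weighting
        W[missing] = np.broadcast_to(R.sum(axis=1, keepdims=True), R.shape)[missing]
    elif scheme == "movie":                   # assumed m minus column-sum weighting
        W[missing] = np.broadcast_to(m - R.sum(axis=0, keepdims=True), R.shape)[missing]
    return W

R = np.array([[1, 0, 1],
              [0, 0, 1]])
print(negative_weights(R, "movie"))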
TABLE 13.2
Summary of SA Studies and Techniques
Article Year Algorithms Used Polarity Data Source Dataset Source (s)
[27] 2016 Classification-based G Web forms IDM movie review
[28] 2016 Log-likelihood ratio (LLR) G Movie review pages 4,000 Chinese
and delta TF-IDF movie reviews
[29] 2016 MapReduce G Hadoop platform Hadoop database
[30] 2016 Naive Bayes, SVM G Tweet movie reviews Twitter
[31] 2017 GINI index and SVM Neg/pos Movie reviews N/A
sources, and comparison may be difficult because of the subjective nature of opinions
depending on the subject (Table 13.2).
Nonetheless, Kim et al. [32] conducted a comparative study in which they
established that NN and SVM are more accurate than the naïve Bayes classifier. Based
on the study, NN had an accuracy of 70.72%, while SVM had 70.63% accuracy. In
another paper, Kalaivani and Shunmuganathan [33] used SA to classify the user
review of movies using SVM, naïve Bayes, and kNN (k- nearest neighbor) algorithms.
According to the study, SVM outperformed both NN and kNN and recorded over
80% accuracy in making the predictions. The SA task is time consuming in cases
where the target variable has longer n-gram lengths. However, texts or reviews with
shorter n-gram lengths also result in unbalanced positive, negative, and neutral cor-
pora. Consequently, moving forward and considering the influence of Big Data ana-
lytics in different NLP applications, it is important to develop techniques for handling
the computational memory problem. The R package Big Data and its dependencies
are a step towards handling ML problems, especially for text-based data.
Sahu and Ahuja [27] conducted SA analysis of an IDM (movie review based on a
structured N-gram) and concluded that their approach had a classification accuracy
of 88.95%. Chu et al. [28] also conducted an SA of movie reviews from Chinese
databases and established that despite favoritism toward SVM and naïve Bayes as
the best performing SA analysis techniques, LLR performs better and has lower
misclassification error margins. Furthermore, Amolik et al. [30] conducted an SA of
movie reviews based on comments retrieved from Twitter using SVM and NB (Naive
Bayes). The study classified each of the reviews as positive, negative, or neutral, and
established that NB and SVM have accuracies of 75% and 61%, respectively. In their
conclusion, the authors asserted that the accuracy of an SVM algorithm increased
with an increase in the size of the training subset. Finally, Tripathy, Agrawal, and
Rath [34] implemented logistic regression, SVM, and NB to conduct an SA of movie
reviews obtained from different platforms and claimed that logistic regression clas-
sifier is more accurate than SVM and NB. However, SVM performed better than NB
in that analysis.
It is clear from these studies that SVM is more accurate, but not necessarily
precise, compared with NB algorithm. Additionally, the results of the studies estab-
lished an accuracy that ranges between 75% and 88.95%, and based on this general
trend, this study obtained a lower accuracy. The lower accuracy can be ascribed to
the size of the training data, data sparsity, and cold start problem. Nonetheless, the
algorithm had the highest sensitivity (88%) compared with the rest of the studies and,
hence, had the highest true positive (TP) or probability of detection.
13.3 METHODS
SA techniques include binary and multiclass classification. Binary classification labels documents or corpus reviews as either positive or negative, whereas a multiclass sentiment classification algorithm divides user reviews into strong positive, positive, neutral, negative, or strong negative [35]. In general, binary SA classification suits comparison
of documents or corpus.
The MovieLens dataset contains tags stored in unformatted text format. As such,
the data requires transformation before loading for ML algorithms [35,36]. The
data transformation involves the ratings, but the target attributes are the movie tags
and user comments. The data was processed through tasks such as feature selection, comment segmentation for corpus creation, and transformation of the rating vectors using term frequency-inverse document frequency (TF-IDF). The transformation of ratings
alongside the creation of the corpus from the comments and reviews yielded the
data used to complete the analysis. The processed data (the corpus) was loaded, and naïve Bayes and SVM were used to classify it and obtain sentiments. The
visual program that was implemented to obtain the results of the study is as shown
in Figure 13.1.
Figure 13.1 illustrates the process of implementing SA. It consists of data impor-
tation and preparation, and implementation of both naïve Bayes and support vector
SA algorithms. The imported MovieLens dataset contains text-based comments and
numerical ratings; both are converted as corpus and merged to a data table and used
as input for NB and SVM SA analysis.
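The flow just described, corpus construction, TF-IDF transformation, a train/test split, and classification with NB and SVM, can be sketched with scikit-learn as follows; the toy texts and labels stand in for the MovieLens tags, and the particular estimators (MultinomialNB, LinearSVC) are illustrative choices rather than the authors' exact configuration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Toy corpus: tag/comment text with positive (1) / negative (0) sentiment labels
texts = ["great plot and acting", "boring and predictable",
         "wonderful soundtrack", "terrible pacing, fell asleep"]
labels = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0)

for name, clf in [("Naive Bayes", MultinomialNB()), ("SVM", LinearSVC())]:
    clf.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, clf.predict(X_test), zero_division=0))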
P(tag) refers to the probability of the tag in the rating matrix, and it shows the like-
lihood of a random user giving the review. P(feature | tag) refers to the prior prob-
ability for the algorithm, which classifies a feature as a tag [37]. However, the naïve
algorithm must meet the requirement of independence among the features so that
Equation (13.3) becomes [37]
P(tag | feature) = [P(tag) × P(f1 | tag) × ... × P(fn | tag)] / P(feature) (13.4)
Equation (13.4), the improved version of Equation (13.3), addresses the problem of the predicted probability of positive classes being higher than that of negative classes [37]. Despite the improvement, the average accuracy of the model decreases, but both precision and recall improve with the new model.
which region the hyperplane rests. According to Equation (13.5), cj ∈ {−1, 1} indicates the (positive, negative) class of a file (denoted as dj). The mathematical statement of w is provided by [40]

w = Σj aj cj dj,  aj ≥ 0 (13.5)
The dual optimization problem provides the values of the aj's. The dj whose aj is greater than zero are defined as support vectors, not only because they are document based but also because they contribute to w. Other research studies have shown
that naïve Bayes with local weightings outperforms both naïve Bayes itself and kNN.
It also compares favorably with more sophisticated ways of enhancing naïve Bayes
by relaxing its intrinsic independence assumption [41]. Locally weighted learn-
ing only assumes independence within a neighborhood, not globally in the whole
instance space as a standard naïve Bayes does.
13.4.1 Experimental Setup
The MovieLens dataset was retrieved from www.movielens.org, and it contains
100,004 ratings and 1,296 tags for 9,125 movies. The data reviews were written by
671 users between January 1995 and October 2016, and all the users were included
in the analysis. However, the qualification criterion included users’ tags with at least
20 movies reviewed, all users were identified using a unique ID without more demo-
graphic information.
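Loading the MovieLens files and applying the 20-movie qualification criterion is a short pandas exercise; the sketch below assumes the standard ratings.csv and tags.csv files of the small MovieLens release, which may differ from the exact files the authors used.

import pandas as pd

ratings = pd.read_csv("ratings.csv")   # userId, movieId, rating, timestamp
tags = pd.read_csv("tags.csv")         # userId, movieId, tag, timestamp

# Keep only users who have rated at least 20 movies
counts = ratings.groupby("userId")["movieId"].nunique()
qualified = counts[counts >= 20].index
ratings = ratings[ratings["userId"].isin(qualified)]
tags = tags[tags["userId"].isin(qualified)]

print(f"{ratings.shape[0]} ratings from {len(qualified)} qualifying users")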
The dataset contains links, movies, ratings, and tags files, and each contains specific information. Using the two approaches, SA analysis was applied to the tags. For feature selection, K-means clustering was used instead of other conventional methods. Ideally, the tags were predicted using an algorithm that is verifiable and can be evaluated. In other terms, our expectation is that users with the same taste will normally rate movies in a highly correlated way [42].
data type specifications were performed. The loaded data contained 100,005 ratings,
with timestamp, genre, and tags as the only variables. The tag variable was the only
variable used in learning and classification. The tags were categorized into positive and negative, appended accordingly, and stored in pre-created empty lists. The modeling scheme designated 75% of the data to a
training set and the rest to validation.
13.4.3 Evaluation Measures
The performance metrics used to evaluate the two algorithms were based on the
classical confusion matrix experiment and computations. The matrix presented in
Table 13.3 contains counts of sentiment instances correctly classified and misclas-
sified. The performance of the models was evaluated based on TP, True Negative
(TN), False Positive (FP), and False Negative (FN). The TP is the count of the num-
ber of correctly predicted sentiments [43]. That is, the model predicts the actual
positive sentiments correctly. The number of actual positive sentiments predicted as
negative constitute the FN measure, while the number of negative sentiments pre-
dicted or classified as positive sentiments constitute the FP measure [43]. Finally, the
number of negative sentiments correctly predicted is the TN measure.
From the confusion matrix table, metrics such as precision, recall, F-measure,
and accuracy can be computed. The definition and computational formula for each
of the metrics are discussed as follows.
Precision: It refers to the ratio between the number of correctly predicted positive
sentiments to the sum of correctly predicted positive sentiments and the negative
sentiments predicted as positive sentiment [43]. The computation equation is shown
as follows.
Precision = TP / (TP + FP)    (13.6)
Recall: It is a measure of sensitivity, because it refers to the ratio between correctly classified positive sentiments and the sum of correctly classified positive sentiments and positive sentiments incorrectly classified as negative [43]. The computation equation is shown as follows.

Recall = TP / (TP + FN)    (13.7)
TABLE 13.3
Confusion Matrix and Model Sentiment Prediction Elements

                                        Predicted Sentiment Class
                                        Positive        Negative
Actual Sentiment Class    Positive      TP              FN
                          Negative      FP              TN
Recall and precision tend to contradict each other, because models with higher
precision tend to be less sensitive.
F-measure: It is a single-value measure that combines recall and precision, and it is also referred to as the F1 measure [42]. The computation equation is as shown below.

F-Measure = (2 × Precision × Recall) / (Precision + Recall)    (13.8)
Accuracy: It is the classification accuracy of the model and is a ratio between the
sum of TP and TN and the total number of sentiments within the dataset [43]. The
computation equation is shown as follows.
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (13.9)
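As a quick check of Equations (13.6) through (13.9), the counts from the SVM confusion matrix reported later in Table 13.4 can be plugged in directly; the small script below yields values close to those quoted in the results (the percentages in the text are averages over validation runs, so they differ slightly).

# Recomputing the metrics of Equations (13.6)-(13.9) from the Table 13.4 counts.
TP, FN, FP, TN = 93, 12, 66, 29

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f_measure = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + TN + FP + FN)
specificity = TN / (TN + FP)

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"f={f_measure:.4f} accuracy={accuracy:.4f} specificity={specificity:.4f}")
# precision ~ 0.585, recall ~ 0.886, f ~ 0.705, accuracy = 0.61, specificity ~ 0.305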
13.4.4 Planned Approach
The implementation of SA for analyzing the reviews in the MovieLens data consisted of several preprocessing activities: handling of missing data and sparsity, removal of special characters such as "!" and "@" and of unneeded blank spaces, and creation of a corpus based on unsupervised learning to label the data. Interjections such as "wow" and "ooh" as well as repetitive characters were also removed. The dataset consists of numerical and string attributes. The numerical attributes are the five-star ratings, while the string attributes contain the comments and reviews on the movies. Studies that have used SA for phrase-level categorization or polarity classification mostly use a bag-of-words model to select features when dealing with large datasets [44]. However, the approach used in this chapter relied on a k-means algorithm to predict tags or comments that were common among users. The use of the two supervised learning algorithms permitted control of the qualifying polar phrases, and the predicted words from k-means were used in constructing the count vectorizer matrix, as sketched below.
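The following sketch illustrates, under stated assumptions (scikit-learn, TF-IDF features, toy tag strings), how a k-means step can surface the common tag words that are then used as the vocabulary of a count-vectorizer matrix; the chapter does not specify the exact clustering configuration.

# Rough sketch of K-means-based selection of common tag words (toy data only).
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
import numpy as np

tags = ["lovely drama", "lovely funny", "sad scary", "sad dangerous", "thoughtful sci-fi"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(tags)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Take the terms closest to each cluster centre as the "common" tag words ...
terms = np.array(tfidf.get_feature_names_out())
for centre in km.cluster_centers_:
    print(terms[np.argsort(centre)[::-1][:3]])

# ... and use them as the vocabulary of the count-vectorizer matrix.
vocab = set(terms[np.argsort(km.cluster_centers_, axis=1)[:, -3:].ravel()])
counts = CountVectorizer(vocabulary=sorted(vocab)).fit_transform(tags)
print(counts.toarray())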
Sentiments were mapped to a real-valued scale, with the most positive sentiment being 0 and the most negative being 1. The maximum, minimum, and average real-valued ratings of the negative sentiments have the same value of 0.587. Similarly, the maximum, minimum, and average real-valued ratings of the positive sentiments have the same value of 0.413. The negative comments dominate the dataset, partly because of the inclusion of the white space (lack of reviews) as negative reviews. However, both positive and negative sentiments are representable using a single value.
The confusion matrix associated with the model is shown in Table 13.4. The consequence of the inclusion of the whitespace is a higher prediction precision of negative reviews.
From Table 13.4, the SVM model had an accuracy of 61% ± 10.44%, which is lower than the values that other researchers have established. Furthermore, the model had a precision of 59.05% ± 7.34%, a recall of 88.73% ± 9.04%, and an f-measure of 70.61% ± 6.91%. Hence, the model has a sensitivity of 88.73% ± 9.04% and a specificity of 30.67% ± 18.11%.
TABLE 13.4
Confusion Matrix Retrieved from the SA SVM Model

                    True Negative    True Positive    Class Precision
Pred. Negative      29               12               70.73%
Pred. Positive      66               93               58.49%
Class Recall        30.53%           88.57%
TABLE 13.5
Token Count Matrix
Attribute No. 1 No. 2 No. 3 No. 4 No. 5
Phrase 1 1 1 1 0 0
Phrase 2 1 1 0 1 0
Phrase 3 1 1 0 0 1
The token matrix obtained from the tokenization and vectorization of sentiments in the reviews is presented in Table 13.5. The tokenization is based on the frequency of occurrence of positive, neutral, and negative sentiments using the count vectorizer of the "great," "fine," and "awful" phrases.
The generated token matrix is 3 × 5 for the three phrases and five sentiment
classes (very negative and positive, somewhat negative and positive, and neutral).
The counts are an indication of the term frequency for each word used in the
corpus or grouped reviews. For instance, for a corpus consisting of 100,000 text
phrases with word “awesome” appearing 100 times, it suffices to conclude that the
term frequency for “awesome” is 0.001. The term frequency refers to the propor-
tion representing the number of times a word appears in a corpus. The term fre-
quency for the two algorithms is shown in Table 13.6 and Figure 13.4. From Table
13.6, it is apparent that the term frequency increased from 2,000 tags onwards,
and the two models have an almost equal term frequency at the corresponding
number of tags.
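A small sketch of the tokenization step is given below: it builds a token-count matrix in the spirit of Table 13.5 and computes a term frequency as a proportion of occurrences, mirroring the 100/100,000 = 0.001 style of example (the phrases are illustrative only, not drawn from the corpus).

# Token-count matrix (cf. Table 13.5) and term frequency as a proportion of occurrences.
from sklearn.feature_extraction.text import CountVectorizer

phrases = ["great great fine", "great awful", "fine fine awful"]   # illustrative only
vec = CountVectorizer()
matrix = vec.fit_transform(phrases)
print(vec.get_feature_names_out())
print(matrix.toarray())                       # rows = phrases, columns = tokens

# Term frequency of one word relative to the number of phrases in the corpus,
# mirroring the "awesome" example of 100 occurrences in 100,000 phrases.
tf_great = matrix[:, vec.vocabulary_["great"]].sum() / len(phrases)
print(float(tf_great))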
From Table 13.6 and Figure 13.4, it is clear that SVM performed better than NB
in terms of term frequency. In both cases, the maximum term frequency is obtained when the number of tags is less than 500.
Despite the similarity in the profiles of the algorithms in Figure 13.4, SVM had
a maximum term frequency of 4.2%, while the NB algorithm had a term frequency
of 4%, and, as such, SVM is more suitable in predicting the sentiment classes in
the dataset. Further assessment of the algorithms required information on histori-
cal movie production trends and genres. Figure 13.5 shows that movie production
grew exponentially between the early 1980s and 2016. The trend is of importance
because movie production has undergone tremendous technological development. Such developments influence the sentiments toward movies.
TABLE 13.6
Accuracy Comparison of Datasets
Term Frequency
No. of Experiments Number of Tags in the Training Set Naïve Bayes Rating SVM Rating
1 100 3.295 3.37
2 200 4.035 4.2
3 500 4.008333 3.965
4 1,000 3.541 3.519
5 1,500 2.94 2.935
6 2,000 2.599 2.608
7 2,500 2.384 2.368
8 3,000 3.338677 3.37976
9 4,000 3.4335 3.4285
10 4,500 3.655 3.663
Besides the history, the demand for movies is driven by genre popularity, as
shown in Table 13.7. Based on the table, Drama, Comedy, and Thriller are the most
popular genres.
However, the distribution curves for the genres suggest that action genre tends to
be more popular over time compared with other genres.
The graphical data in Figure 13.6 show that movies tagged "thoughtful," which are mostly Sci-Fi, are less preferred than movies tagged "lovely," which are associated with the drama genre.
The visualization in Figure 13.7 suggests that Lovely and Sad tags were the
most frequently used words in the review. The Lovely comments were associated
with sentiments such as funny and revealing, while the sad tag was associated with
TABLE 13.7
Tabulation of Movie Genre Popularity Distribution
Genre Count
Drama 23,152
Comedy 16,712
Thriller 11,320
Romance 10,759
Action 10,370
Horror 9,180
Crime 9,094
Documentary 1,190
Adventure 6,801
Sci-Fi 1,420
dangerous and scary sentiments. The n-gram in the figure summarizes the distribu-
tion of sentiments associated with the frequencies presented in Figure 13.7.
Figure 13.8 shows the distribution of review sentiments based on the polarity
of tags. The figure asserts that reviews or comments with shorter n-gram lengths
tend to be mostly neutral, while those with longer n-gram lengths tend to be either
negatively or positively polarized.
FIGURE 13.8 Illustration of the profiles of both positive and negative tags.
13.7 ACKNOWLEDGMENT
The dataset used in this chapter was retrieved from www.movielens.org. We are grateful to the dataset providers and would like to thank them for making such an open dataset available online, which helped us to achieve the objectives and results of the study.
REFERENCES
1. Perea-Ortega JM, Martinez-Cámara E, Martn-Valdivia M-T, Ureña-López LA (2013)
Combining supervised and unsupervised polarity classification for non-English reviews.
In: International Conference on Intelligent Text Processing and Computational
Linguistics, pp. 63–74. Samos, Greece
2. Pang B, Lee L, et al. (2008) Opinion mining and sentiment analysis. Found Trends® Inf
Retr 2:1–135.
3. Devika MD, Sunitha C, Ganesh A (2016) Sentiment analysis: A comparative study on
different approaches. Procedia Comput Sci 87:44–49.
4. Aljaaf AJ, Al-Jumeily D, Hussain AJ, Fergus P, Al-Jumaily M, Abdel-Aziz K (2015)
Toward an optimal use of artificial intelligence techniques within a clinical decision
support system. In: 2015 Science and Information Conference (SAI), London,
pp. 548–554.
5. Keight R, Aljaaf A, Al-Jumeily D, Hussain A, Özge A, Mallucci C (2017) An intelligent
systems approach to primary headache diagnosis. In: Intelligent Computing Theories
and Application - 13th International Conference, ICIC 2017, Liverpool, UK, August
7–10.
6. Aljaaf AJ, Hussain AJ, Fergus P, Przybyla A, Barton GJ (2016) Evaluation of machine
learning methods to predict knee loading from the movement of body segments. In:
2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC,
pp. 5168–5173.
7. Aljaaf AJ, et al. (2018) Early prediction of chronic kidney disease using machine
learning supported by predictive analytics. In: 2018 IEEE Congress on Evolutionary
Computation (CEC), Rio de Janeiro, pp. 1–9.
8. Aljaaf AJ, Al-Jumeily D, Hussain AJ, Fergus P, Al-Jumaily M, Radi N (2015) Applied
machine learning classifiers for medical applications: Clarifying the behavioural pat-
terns using a variety of datasets. In: 2015 International Conference on Systems, Signals
and Image Processing (IWSSIP), London, pp. 228–232.
9. Aljaaf AJ, Al-Jumeily D, Hussain AJ, Fergus P, Al-Jumaily M, Radi N (2015) A system-
atic comparison and evaluation of supervised machine learning classifiers using head-
ache dataset. In: Huang DS, Han K (eds.) Advanced Intelligent Computing Theories
and Applications. ICIC 2015. Lecture Notes in Computer Science, vol. 9227. Springer,
Cham.
10. Aljaaf AJ, Al-Jumeily D, Hussain AJ, Lamb D, Al-Jumaily M, Abdel-Aziz K (2014)
A study of data classification and selection techniques for medical decision support
systems. In: Huang DS, Jo KH, Wang L (eds.) Intelligent Computing Methodologies.
ICIC 2014. Lecture Notes in Computer Science, vol. 8589. Springer, Cham.
11. Al-Kassim Z, Memon Q (2017) Designing a low-cost eyeball tracking keyboard for
paralyzed people. Computers & Electrical Engineering 58:20–29.
12. Medhat W, Hassan A, Korashy H (2014) Sentiment analysis algorithms and applica-
tions: A survey. Ain Shams Eng J 5:1093–1113.
13. Hemmatian F, Sohrabi MK (2017) A survey on classification techniques for opinion
mining and sentiment analysis. Artif Intell Rev 1–51.
14. Bhadane C, Dalal H, Doshi H (2015) Sentiment analysis: Measuring opinions. Procedia
Comput Sci 45:808–814.
15. Guzman E, Maalej W (2014) How do users like this feature? A fine grained sentiment
analysis of app reviews. In: 2014 IEEE 22nd International Requirements Engineering
Conference (RE), pp. 153–162. Karlskrona.
16. Araque O, Corcuera-Platas I, Sanchez-Rada JF, Iglesias CA (2017) Enhancing deep
learning sentiment analysis with ensemble techniques in social applications. Expert
Syst Appl 77:236–246.
17. Galvis Carreño LV, Winbladh K (2013) Analysis of user comments: An approach for
software requirements evolution. In: Proceedings of the 2013 International Conference
on Software Engineering, pp. 582–591. San Francisco, CA, USA.
18. Taboada M, Brooke J, Tofiloski M, et al. (2011) Lexicon-based methods for sentiment
analysis. Comput Linguist 37:267–307.
19. Prabowo R, Thelwall M (2009) Sentiment analysis: A combined approach. J Informetrics 3(2):143–157.
20. Pappas N, Popescu-Belis A (2013) Sentiment analysis of user comments for one-
class collaborative filtering over ted talks. In: Proceedings of the 36th International
ACM SIGIR Conference on Research and Development in Information Retrieval,
pp. 773–776. Dublin, Ireland.
21. Siersdorfer S, Chelaru S, Nejdl W, San Pedro J (2010) How useful are your comments?:
Analyzing and predicting youtube comments and comment ratings. In: Proceedings of
the 19th International Conference on World Wide Web, pp. 891–900.
22. Ahmad M, Aftab S (2017) Analyzing the performance of SVM for polarity detection
with different datasets. Int J Mod Educ Comput Sci 9:29.
23. Serrano-Guerrero J, Olivas JA, Romero FP, Herrera-Viedma E (2015) Sentiment analy-
sis: A review and comparative analysis of web services. Inf Sci (Ny) 311:18–38.
24. Hu N, Bose I, Koh NS, Liu L (2012) Manipulation of online reviews: An analysis of
ratings, readability, and sentiments. Decis Support Syst 52:674–684.
25. Kaur D (2017) Sentimental analysis on Apple Tweets with machine learning technique.
Int J Sci Eng Comput Technol 7:76.
26. Wilson T, Wiebe J, Hoffmann P (2005) Recognizing contextual polarity in phrase-
level sentiment analysis. In: Proceedings of Human Language Technology Conference
and Conference on Empirical Methods in Natural Language Processing. Vancouver,
British Columbia, Canada.
27. Sahu TP, Ahuja S (2016) Sentiment analysis of movie reviews: A study on feature selec-
tion and classification algorithms. In: International Conference on Microelectronics,
Computing and Communication, MicroCom.
28. Chu CH, Wang CA, Chang YC, et al. (2017) Sentiment analysis on Chinese
movie review with distributed keyword vector representation. In: TAAI 2016-2016
Conference on Technologies and Applications of Artificial Intelligence, Proceedings,
pp. 84–89.
29. Gupta P, Sharma A, Grover J (2016) Rating based mechanism to contrast abnormal
posts on movies reviews using MapReduce paradigm. In: 2016 5th International
Conference on Reliability, Infocom Technologies and Optimization, ICRITO 2016:
Trends and Future Directions, pp. 262–266.
30. Amolik A, Jivane N, Bhandari M, Venkatesan M (2016) Twitter sentiment analysis
of movie reviews using machine learning technique. Int J Eng Technol 7:2038–2044.
doi:10.5120/ijca2017916005.
31. Manek AS, Shenoy PD, Mohan MC, Venugopal KR (2017) Aspect term extraction for
sentiment analysis in large movie reviews using Gini Index feature selection method
and SVM classifier. World Wide Web 20:135–154. doi:10.1007/s11280-015-0381-x.
32. Kim Y, Kwon DY, Jeong SR (2015) Comparing machine learning classifiers for
movie WOM opinion mining. KSII Trans Internet Inf Syst 9:3178–3190. doi:10.3837/
tiis.2015.08.025.
33. Kalaivani P, Shunmuganathan K (2013) Sentiment classification of
movie reviews by supervised machine learning approaches. Indian J Comput Sci Eng.
doi:10.1016/j.proeng.2014.03.129.
34. Tripathy A, Agrawal A, Rath SK (2015) Classification of sentimental reviews using
machine learning techniques. In: Procedia Computer Science, pp. 821–829. Delhi,
India.
35. Liu B (2015) Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Doi:
10.1017/CBO9781139084789.
36. Qiu G, He X, Zhang F, et al. (2010) DASA: Dissatisfaction-oriented advertising based
on sentiment analysis. Expert Syst Appl 37:6182–6191.
37. Dey L, Chakraborty S, Biswas A, et al. (2016) Sentiment analysis of review datasets
using naive bayes and k-nn classifier. arXiv Prepr arXiv161009982.
38. Di Caro L, Grella M (2013) Sentiment analysis via dependency parsing. Comput Stand
Interfaces. doi:10.1016/j.csi.2012.10.005.
39. Memon Q (2019) On assisted living of paralyzed persons through real-time eye fea-
tures tracking and classification using Support Vector Machines. Medical Technologies
Journal, 3(1):316–333.
40. Fan T-K, Chang C-H (2011) Blogger-centric contextual advertising. Expert Syst Appl
38:1777–1788.
41. Moreo A, Romero M, Castro JL, Zurita JM (2012) Lexicon-based comments-oriented
news sentiment analyzer system. Expert Syst Appl 39:9166–9180.
42. Lane PCR, Clarke D, Hender P (2012) On developing robust models for favourabil-
ity analysis: Model choice, feature sets and imbalanced data. Decis Support Syst
53:712–718.
Deep Learning Model for Emotion Recognition

CONTENTS
14.1 Introduction................................................................................................... 305
14.2 Related Works................................................................................................306
14.3 Learning........................................................................................................308
14.3.1 Nondeep Learning-Based Approaches..............................................309
14.3.1.1 Dictionary Learning Approaches.......................................309
14.3.1.2 Genetic Programming Approaches....................................309
14.3.2 Deep Learning-Based Approaches.................................................... 310
14.3.2.1 Generative/Unsupervised Approaches................................ 310
14.3.2.2 Discriminative/Supervised Approaches............................. 311
14.3.3 Convolutional Neural Network.......................................................... 311
14.3.3.1 Convolutional Layer............................................................ 311
14.3.3.2 Subsampling Layers............................................................ 312
14.3.3.3 Rectified Linear Unit.......................................................... 313
14.3.3.4 Fully Connected Layer........................................................ 313
14.3.3.5 Softmax Layer..................................................................... 313
14.3.3.6 Output Layer....................................................................... 313
14.3.4 Feedforward Deep Convolutional Neural Network .......................... 314
14.3.5 VGG16 Model.................................................................................... 314
14.4 Performance Evaluation Metrics................................................................... 314
14.4.1 Dataset............................................................................................... 314
14.4.2 Performance Evaluation Metrics....................................................... 315
14.5 Results and Discussions................................................................................. 317
14.6 Conclusion..................................................................................................... 318
References............................................................................................................... 319
14.1 INTRODUCTION
Understanding human emotions is a key area of research, since recognizing emotions
may provide a plethora of opportunities and applications; for instance, friendlier
human–computer interactions and enhanced communication among humans by refining emotional intelligence [1]. Recent research on experimental psychology
14.2 RELATED WORKS
A system for automatic emotion recognition was developed using gesture dynamics features from surveillance video and evaluated with supervised classifiers (dynamic time warping, support vector machine (SVM), and naïve Bayes) [7]. A framework
is proposed to synthesize body movements based on high-level parameters and is
represented by the hidden units of a convolutional autoencoder [8]. A system for rec-
ognizing the affective state of a person is proposed from face-and-body video using
space–time interest points in video and canonical correlation analysis (CCA) for
fusion [9]. A comprehensive survey of deep learning and its current applications in
sentiment analysis is performed [10]. Recent works on high-performance motion data are described, and relevant technologies in real-time systems are proposed [11–12].
A deep learning algorithm that discovers structure in large datasets by using backpropagation, and that processes images, video, speech, and audio for emotion recognition, is described [13]. A self-organizing neural architecture was
developed for recognizing emotional states from full-body motion patterns [14]. A
system for emotion recognition on video data is developed using both convolutional
neural network (CNN) and recurrent neural network (RNN) [15]. The emoFBVP database of multimodal (face, body gesture, voice, and physiological signals) recordings of actors enacting various emotional expressions is presented [16]. A model with hierarchical feature representation is proposed for nonverbal emotion recognition, and the experiments show a significant accuracy improvement [17–18]. The novel design of an artificially intelligent system
is proposed for emotion recognition using promising neural network architectures
[19]. A novel system for Emotion Recognition in the Wild (EmotiW) is developed
using hybrid CNN–RNN architecture to achieve better results over other techniques
[20]. A new emotional body gesture recognition framework is developed to account for culture and gender differences in automatic emotional body gesture recognition [21]. A novel approach
for recognition of facial emotion expressions in video sequences proposes an inte-
grated framework of two networks: a local network and a global network, which
are based on local enhanced motion history image (LEMHI) and CNN-LSTM
cascaded networks, respectively [22]. The deep learning algorithm for face detec-
tion achieves the state-of-the-art detection performance on the well-known FDDB
face detection benchmark evaluation. In particular, the authors improve the state-of-the-art faster RCNN (region-based convolutional neural network) framework by combining a
number of strategies, including feature concatenation, hard negative mining, mul-
tiscale training, model pretraining, and proper calibration of key parameters [23].
The author proposed architecture-based deep neural nets for expression recognition
in videos, using an adaptive weighting scheme to cope with a reduced-size labeled dataset [24]. The proposed technique uses movements of the human body for iden-
tification, particularly movement of the head, shoulders, and legs using CNN [25].
The challenging task of detecting salient body motion in scenes with more than one
person is addressed, and a neural architecture that only reacts to a specific kind
of motion in the scene is proposed: a limited set of body gestures [26]. The recent
advances in CNNs and achievement of better performance in different applications
like natural language processing, speech recognition, emotion recognition, activ-
ity recognition, and handwritten recognition are discussed [27]. Computer vision
applications like emotion recognition, action recognition, image and video classifi-
cation have been investigated using dictionary learning-based approaches. In this concept, representative vectors are learned from a large number of samples [28].
The researchers developed a framework for human action recognition using dic-
tionary learning methods [29]. Based on the hierarchical descriptor, the proposed
method for human activity recognition outperforms the state-of-the-art methods.
For visual recognition, a cross-domain dictionary learning-based method was
developed [30–31]. An unsupervised model was developed for cross-view human
action recognition [32] without any label information. The coding descriptors of
14.3 LEARNING
Artificial intelligence is realized through machine learning techniques, and learning is considered an important aspect of intelligent machines. In artificial intelligence research, a machine is taught to detect various patterns using machine learning techniques. Conventional machine learning techniques require expertise to design a feature extractor that transforms the raw data into a feature vector used to detect or classify patterns in the input data. Deep learning is a specialized form of machine learning, the discipline that instructs computers to perform operations and behave like humans. A machine learning task starts with feature extraction from the input data; the features are then fed to a model that classifies the objects in the image. Feature hierarchies are learned by combining low-level and high-level features [36]. In a deep learning model, features are automatically extracted from the input data and organized over several levels of abstraction. The number of applications of machine learning techniques is increasing at an enormous rate. Figure 14.1 explains the learning-based approaches, which are of the following two types.
Deep learning has emerged as a popular approach within machine learning. The traditional handcrafted feature-based approach and the deep learning-based approach are the two major approaches to the HER problem [37]. A limitation of the handcrafted approach is that it cannot learn features automatically from the input data, nor can it provide a deep representation of the data to classifiers. Alternatively, the deep learning-based approach realizes the concept of end-to-end learning by using a trainable feature extractor followed by a trainable classifier. Multiple layers of features are automatically extracted from the raw data. The algorithm develops a multilayer representation of different patterns in the input data, where each successive layer is responsible for learning increasingly complex features [38]. The lower layers extract low-level features from the input data, and the representation increases its abstraction level at each consecutive layer. The need for handcrafted feature detectors and descriptors is eliminated by this automatic learning ability of deep learning models. In many visual categorization tasks, deep learning models have shown higher performance than traditional handcrafted feature-based techniques [39].
Deep learning models such as CNNs, deep belief networks (DBNs), deep RNNs,
and deep Boltzmann machines have been successfully employed for many visual
categorization tasks. Among these models, DBN is an unsupervised probabilistic
graphical model capable of learning from the input data without prior knowledge.
This model can also be trained in a semisupervised or unsupervised fashion, which
is quite helpful for labeling data or dealing with unlabeled data. Learning-based
approaches are divided into two approaches: nondeep learning approach and deep
learning approach [40].
Gabor filters and wavelet filters [48] have evolved for this motion feature. Similarly, a valuable set of features was learned for emotion recognition.
A probabilistic graphical model indicates the dependencies and random variables
in a directed acyclic graph form. Different variations of Bayesian network have been
introduced, such as conditional Bayesian networks, temporal Bayesian networks,
and multientity Bayesian network (MEBN). In the work of Zhang et al. [49], an
interval temporal Bayesian network (ITBN) was introduced for the recognition of
complex human activities. To evaluate the performance of the proposed method, a
cargo loading dataset was considered for experimentations and evaluations. Khan
et al. [50] proposed another method for action detection using dynamic conditional
Bayesian network, which also achieved the state-of-the-art results. In Park et al. [51],
MEBN was used for predictive situation awareness (PSAW) using multiple sensors.
These networks are robust for reasoning the uncertainty in the complex domains for
predicting and estimating the temporally evolving situations.
14.3.2.1 Generative/Unsupervised Approaches
In an unsupervised deep learning approach, the class labels are not required for the
learning process. These types of approaches are specifically useful when labeled
data are relatively unavailable. A remarkable surge in the history of deep models
was triggered by the work of Hinton et al. [54], who developed a feature reduction
technique and highly efficient DBN. In an unsupervised pretraining learning stage,
a backpropagation method is used for fine-tuning. These types of deep learning
approaches are used for many applications like object identification, image classifi-
cation, speech classification, activity, and emotion recognition.
An unsupervised feature learning model from video data was proposed in the
work of Le et al. [55] for human action recognition. The authors used an indepen-
dent subspace analysis algorithm to learn spatiotemporal features, combining them
with deep learning techniques such as convolutional and staking for action repre-
sentation and recognition. DBNs trained with RBMs were used for HER [56]. This
method performs better than the handcrafted learning-based approach on two public
datasets. Learning the features continuously without any labels from the streaming
video is a challenging task. Hasan and Roy-Chowdhury [57] addressed this type of
problem using an unsupervised deep learning model. Most of the action datasets
have been recorded under a controlled environment; besides, action recognition from
unconstrained videos is a challenging task. A method for human action recognition
from unconstrained video sequences was proposed by Ballan et al. [58] using DBNs.
Unsupervised learning played a pivotal role in reviving the interests of researchers
in deep learning.
14.3.2.2 Discriminative/Supervised Approaches
The CNN is the most frequently used model from the supervised category. CNN [56]
is a type of deep learning model that has shown better performance at tasks such
as image classification, pattern recognition, human action recognition, handwritten
digit classification, and HER. The multiple hidden layers present in the hierarchical
learning model are used to transform the input data into output categories. Its archi-
tecture consists of three main types of layers:
a. Convolutional layer
b. Pooling layer
c. Fully connected layer
Mapping the different layers of a CNN back to the input space is known as Deconvnets. The objects in images are represented and recognized using deep CNN models [59]. The authors of [59] proposed spatial and temporal streams of a CNN for action and emotion recognition; the combination of these two streams outperformed the other methods. The RNN is the other popular model in the supervised category. Skeleton-based action and emotion recognition using RNNs was developed in [60], where the five parts of the human skeleton were separately fed into five subnets. The outputs from the subnets were combined and fed into a single layer for the final representation. For the training process, deep learning-based models need a large amount of video data, and collecting and annotating such data requires enormous computational resources. Nevertheless, outstanding accuracy has been achieved in many application fields.
14.3.3.1 Convolutional Layer
The process of convolution has four steps. Figure 14.2 shows the architecture of an FDCNN model. The convolution operation is defined as
C(x_{u,v}) = Σ_{i=−n/2}^{n/2} Σ_{j=−m/2}^{m/2} f_k(i, j) x_{u−i, v−j}    (14.1)
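For illustration, Equation (14.1) can be evaluated directly with a short NumPy sketch; the loop below is a plain valid-mode 2-D convolution over a toy feature map and is not meant to reflect the optimized implementation used in the experiments.

# Plain valid-mode 2-D convolution evaluating Equation (14.1) (summation limits
# shifted to start at 0 instead of being centred, for simplicity).
import numpy as np

def conv2d(x, f):
    n, m = f.shape
    H, W = x.shape
    out = np.zeros((H - n + 1, W - m + 1))
    for u in range(out.shape[0]):
        for v in range(out.shape[1]):
            patch = x[u:u + n, v:v + m][::-1, ::-1]   # reversed patch gives the x[u-i, v-j] ordering
            out[u, v] = np.sum(f * patch)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)          # toy single-channel feature map
f = np.array([[0.0, 1.0], [1.0, 0.0]])                # toy 2x2 filter f_k
print(conv2d(x, f))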
14.3.3.2 Subsampling Layers
Subsampling or pooling layers shrink the feature map to a smaller size. Figure 14.3 shows an example of a max-pooling layer. The pooling function is implemented as follows:
M(x_i) = max(x_{i+k, i+l}),   k ≤ m/2,  l ≤ n/2,  k, l ∈ N    (14.2)
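A corresponding sketch of the max-pooling operation of Equation (14.2), again in NumPy and with a non-overlapping 2 × 2 window as an assumed configuration, is shown below.

# Max-pooling sketch in the spirit of Equation (14.2): each output value is the
# maximum over a non-overlapping window of the input map.
import numpy as np

def max_pool(x, size=2):
    H, W = x.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i * size:(i + 1) * size, j * size:(j + 1) * size].max()
    return out

x = np.array([[1., 3., 2., 4.],
              [5., 6., 1., 2.],
              [7., 2., 9., 0.],
              [3., 1., 4., 8.]])
print(max_pool(x))   # [[6. 4.] [7. 9.]]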
14.3.3.3 Rectified Linear Unit
A rectified linear unit is an activation function whose output is zero when the input is below zero. Figure 14.4 shows an example of a rectified linear unit. It is calculated as follows:
R(x) = max(0, x)    (14.3)

14.3.3.4 Fully Connected Layer
The fully connected layer applies a weight matrix W to its input followed by an activation σ:

F(x) = σ(W · x)    (14.4)
14.3.3.5 Softmax Layer
Backpropagation can be done in this layer; the network backpropagates the error to improve performance. If N is the size of the input vector, then S(x): R^N → [0, 1]^N. It is calculated by
S(x)_j = e^{x_j} / Σ_{i=1}^{N} e^{x_i},   where 1 ≤ j ≤ N    (14.5)
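The ReLU of Equation (14.3) and the softmax of Equation (14.5) can be written compactly as follows (a NumPy sketch; subtracting the maximum is a standard numerical-stability step and does not change the result).

# ReLU (Eq. 14.3) and a numerically stable softmax (Eq. 14.5) in NumPy.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - np.max(x))       # max-subtraction avoids overflow, result unchanged
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])
print(relu(logits))                              # [2.  0.  0.5]
print(softmax(logits), softmax(logits).sum())    # probabilities summing to 1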
14.3.3.6 Output Layer
The size of the output layer is equal to the number of classes. It represents the class
of the input image.
14.3.5 VGG16 Model
Figure 14.6 describes the architecture of the Visual Geometry Group (VGG)16 model. The VGG16 model consists of five blocks with 16 layers. The first two blocks have two convolution layers and one maxpooling layer each. The remaining blocks have three convolution layers and one maxpooling layer each. Finally, the fully connected part has one flatten layer and three dense layers. Sequences of RGB (red, green, blue) images of size 256 × 256 are the inputs to this model. After each block, the size of the feature map changes, as defined in the architecture diagram.
The sequences were taken over a static (black) background with a frame size of
1,920 × 1,080 pixels at a rate of 25 fps [61].
All the experiments were implemented on the Windows 10 operating system with an Intel Core i5 3.3 GHz processor, using Anaconda Python and Jupyter Notebook. The model is trained with a batch size of 64 for 20 epochs.
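A hedged sketch of a VGG16-style classifier for the 15 emotion classes is given below, assuming TensorFlow/Keras; beyond the 256 × 256 RGB input, the batch size of 64, and the 20 epochs stated above, the dense-layer sizes, optimizer, and variable names (train_frames, train_labels) are illustrative assumptions rather than the chapter's exact configuration.

# VGG16-style model sketch for 15 emotion classes (assumes TensorFlow/Keras).
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.VGG16(include_top=False, weights=None,
                                   input_shape=(256, 256, 3))   # five convolutional blocks

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),    # dense sizes are illustrative assumptions
    layers.Dense(256, activation="relu"),
    layers.Dense(15, activation="softmax"),  # 15 emotion classes (AJ ... UW)
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Hypothetical training arrays; batch size and epoch count follow the text.
# model.fit(train_frames, train_labels, batch_size=64, epochs=20,
#           validation_data=(val_frames, val_labels))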
• True Positives (TP): the actual class of the data point is True and the predicted class is also True.
• True Negatives (TN): the actual class of the data point is False and the predicted class is also False.
• False Positives (FP): the actual class of the data point is False but the predicted class is True.
• False Negatives (FN): the actual class of the data point is True but the predicted class is False.
The performance evaluation of the proposed work can be calculated using Accuracy,
Recall, F-Score, Specificity, and Precision. Accuracy in classification problems is the
number of correct predictions made by the model over all predictions made,
which can be calculated using Equation (14.7).
Accuracy = (TP + TN) / (TN + FP + TP + FN)    (14.7)

Recall = TP / (TP + FN)    (14.8)
The harmonic mean of Precision and Recall is called the F-score.
F-score = 2 × (Precision × Recall) / (Precision + Recall)    (14.9)

Specificity = TN / (TN + FP)    (14.10)

Precision = TP / (TP + FP)    (14.11)
where TP and TN are the numbers of true positive and true negative predictions of the class, and FP and FN are the numbers of false positive and false negative predictions.
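The metrics in Equations (14.7) through (14.11) can be computed from predicted and actual labels as in the sketch below (scikit-learn assumed; the labels are toy values, not the emotion dataset).

# Computing Equations (14.7)-(14.11) from predictions; illustrative labels only.
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, accuracy_score

y_true = np.array([0, 0, 1, 1, 1, 0])   # toy binary example (True = 1, False = 0)
y_pred = np.array([0, 1, 1, 1, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)            # Eq. (14.10)

print(accuracy_score(y_true, y_pred),   # Eq. (14.7)
      recall_score(y_true, y_pred),     # Eq. (14.8)
      f1_score(y_true, y_pred),         # Eq. (14.9)
      specificity,
      precision_score(y_true, y_pred))  # Eq. (14.11)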
TABLE 14.1
Individual Accuracy (%) for 15 Class Emotions of an FDCNN Model
AJ AS AW FJ FS FW HJ HS HW SJ SS SW UJ US UW
AJ 95 0 0 1 0 0 2 0 0 0 0 0 2 0 0
AS 0 96 0 0 1 0 0 2 0 0 1 0 0 0 0
AW 0 0 96 0 0 1 0 0 1 0 0 0 0 0 2
FJ 2 0 0 94 0 0 1 0 0 2 0 0 1 0 0
FS 0 1 0 0 95 0 0 1 0 0 2 0 0 1 0
FW 0 0 3 0 0 95 0 0 2 0 0 0 0 0 0
HJ 0 0 0 1 0 0 96 0 0 2 0 0 1 0 0
HS 0 1 0 0 1 0 0 97 0 0 0 0 0 1 0
HW 0 0 0 0 0 1 0 0 94 0 0 3 0 0 2
SJ 4 0 0 1 0 0 1 0 0 94 0 0 0 0 0
SS 0 3 0 0 0 0 0 1 0 0 95 0 0 1 0
SW 0 0 4 0 0 2 0 0 0 0 0 92 0 0 2
UJ 0 0 0 2 0 0 3 0 0 4 0 0 91 0 0
US 0 3 0 0 2 0 0 1 0 0 2 0 0 92 0
UW 0 0 2 0 0 5 0 0 1 0 0 1 0 0 91
TABLE 14.2
Individual Accuracy (%) for 15 Class Emotions of VGG16 Model
AJ AS AW FJ FS FW HJ HS HW SJ SS SW UJ US UW
AJ 96 0 0 1 0 0 1 0 0 0 0 0 2 0 0
AS 0 96 0 1 0 0 0 1 0 0 1 0 0 1 0
AW 0 0 95 0 0 1 0 0 1 0 0 0 1 0 2
FJ 2 0 0 96 0 0 1 0 0 2 0 0 1 0 1
FS 0 1 0 0 96 0 0 1 0 0 2 0 1 1 0
FW 1 0 2 0 0 95 0 0 2 0 0 0 0 0 0
HJ 0 0 0 1 0 0 96 0 0 2 0 0 1 0 0
HS 0 1 1 0 1 0 0 96 0 0 0 0 0 1 0
HW 0 0 0 0 0 1 0 0 94 0 0 3 0 0 2
SJ 4 0 0 1 0 0 1 0 0 96 0 0 0 2 0
SS 0 3 0 0 0 2 0 1 0 0 97 0 0 1 0
SW 1 0 4 0 0 2 0 0 0 2 0 95 0 0 2
UJ 2 0 0 2 0 0 1 0 0 1 0 0 94 0 0
US 0 0 0 0 2 0 0 1 0 0 2 0 0 95 0
UW 0 0 2 0 0 0 0 0 1 0 0 1 0 0 96
The FDCNN model is better at recognizing emotions like happy walk, sad walk, sad sit, and happy sit than other emotions. Table 14.2 describes the correlation among all 15 classes and shows the individual recognition accuracy of each class using the VGG16 model. Tables 14.1 and 14.2 present the individual recognition accuracy of all classes, and the values in those tables do not exceed 100%. Table 14.3 shows the performance measures of precision,
recall, and f-measure values of FDCNN and VGG16 model. The value 1 means best,
and the value 0 means worst. In this experiment, the values obtained in Table 14.3
are mostly close to 1.
14.6 CONCLUSION
In this chapter, FDCNN and VGG16 models for recognizing human emotions from body movements in sequences of frames were proposed. These models use deep features to extract saliency information at multiple scales. The proposed method is evaluated on a challenging benchmark emotion dataset (University of York). The emotions Angry Jump, Angry Sit, Angry Walk, Fear Jump, Fear Sit, Fear Walk, Happy Jump, Happy Sit, Happy Walk, Sad Jump, Sad Sit, Sad Walk, Untrustworthy Jump, Untrustworthy Sit, and Untrustworthy Walk are used in this work. Of the two models, VGG16 performed better at recognizing emotions. The performance of both models is better than that of the baseline models.
Future work aims at developing research applications to recognize the emotions
of children with autism spectrum disorder (ASD). Children with ASD express their
TABLE 14.3
Performance Measure of an Emotion Dataset with FDCNN and VGG16
Precision Recall F-Measure
Model VGG16 FDCNN VGG16 FDCNN VGG16 FDCNN
AJ 0.937 0.891 0.828 0.821 0.933 0.881
AS 0.881 0.854 0.897 0.885 0.909 0.877
AW 0.842 0.825 0.921 0.913 0.911 0.901
FJ 0.881 0.862 0.897 0.886 0.919 0.891
FS 0.871 0.869 0.932 0.924 0.949 0.876
FW 0.895 0.883 0.891 0.883 0.907 0.879
HJ 0.884 0.873 0.935 0.927 0.915 0.913
HS 0.862 0.854 0.899 0.879 0.919 0.902
HW 0.843 0.839 0.928 0.911 0.931 0.911
SJ 0.931 0.930 0.858 0.849 0.927 0.886
SS 0.852 0.844 0.887 0.879 0.915 0.897
SW 0.892 0.882 0.931 0.922 0.951 0.879
UJ 0.933 0.924 0.818 0.821 0.973 0.880
US 0.891 0.887 0.890 0.881 0.899 0.878
UW 0.852 0.831 0.910 0.928 0.915 0.909
emotion through facial and body movements. ASDs are neurodevelopmental dis-
orders in which multiple genetic and environmental factors play roles. Symptoms
of deficit in social communication and restrictive, repetitive behavioral patterns
emerge early in a child’s development. ASD is a lifelong neurodevelopmental dis-
order involving core deficit in interpersonal communication and social interactions,
as well as restricted, repetitive mannerisms and interests (American Psychiatric
Association, 2013). Children with ASD struggle with significant relationships and
behavioral challenges and, in most cases, have serious implications for social inclu-
sion in adulthood.
REFERENCES
1. R. Plutchik, The Emotions, University Press of America, 1991. Lanham, Maryland
2. F. E. Pollick, H. M. Paterson, A. Bruderlin, A. J. Sanford, “Perceiving Affect from Arm
Movement,” Cognition, Vol. 82, Issue 2, pp. 51–61, 2001.
3. P. N. Lopes, P. Salovey, R. Straus, “Emotional intelligence, personality, and the per-
ceived quality of social relationships,” Personality and Individual Differences, Vol. 35,
Issue 3, pp. 641–658, 2003.
4. M. A. Goodrich, A. C. Schultz, “Human-Robot interaction: A survey,” Foundations and Trends in Human-Computer Interaction, Vol. 1, Issue 3, pp. 203–275, 2007.
5. H. R. Markus, S. Kitayama, “Culture and the self: Implications for cognition, emotion, and motivation,” Psychological Review, Vol. 98, pp. 224–253, 1991.
6. Q. Memon, “On assisted living of paralyzed persons through real-time eye features
tracking and classification using Support Vector Machines,” Medical Technologies
Journal, Vol. 3, Issue 1, pp. 316–333, 2019.
7. J. Arunnehru, M. Kalaiselvi Geetha, “Automatic human emotion recognition in sur-
veillance video,” Intelligent Techniques in Signal Processing for Multimedia Security,
Editors: Dey, Nilanjan, Santhi, V, Springer-Verlag, pp. 321–342, 2017. Berlin
8. D. Holden, J. Saito, T. Komura, “A deep learning framework for character motion syn-
thesis and editing,” SIGGRAPH ’16 Technical Paper, July 24–28, Anaheim, CA, ISBN:
978-1-4503-4279-7/16/07, 2016.
9. H. Gunes, C. Shan, S. Chen, Y. L. Tian, “Bodily expression for automatic affect rec-
ognition,” Emotion Recognition: A Pattern Analysis Approach, John Wiley and Sons,
Editors: Amit Konar, Aruna Chakraborty, pp. 343–377, 2015. Hoboken, New Jersey
10. L. Zhang, S. Wang, B. Liu, “Deep learning for sentiment analysis: A survey,” 2018.
https://arxiv.org/pdf/1801.07883.
11. H. Brock, “Deep learning - Accelerating next generation performance analysis sys-
tems,” in 12th Conference of the International Sports Engineering Association,
Brisbane, Queensland, Australia, pp. 26–29, 2018.
12. Z. Ali, Q. Memon, “Time delay tracking for multiuser synchronization in CDMA net-
works,” Journal of Networks, Vol. 8, Issue 9, pp. 1929–1935, 2013.
13. Y. LeCun, Y. Bengio, G. Hinton, “Deep learning,” Nature, Vol. 521, pp. 436–444, 2015.
14. N. Elfaramawy, P. Barros, G. I. Parisi, S. Wermter, “Emotion recognition from body
expressions with a neural network architecture,” Session 6: Algorithms and Learning,
Bielefeld, Germany, 2017.
15. P. Khorrami, T. Le Paine, K. Brady, C. Dagli, T. S. Huang, “How deep neural networks
can improve emotion recognition on video data,” 2017. https://arxiv.org/pdf/1602.07377.
pdf.
16. H. Ranganathan, S. Chakraborty, S. Panchanathan, “Multimodal emotion recognition
using deep learning architectures,” 2017. http://emofbvp.org/.
17. P. Barros, D. Jirak, C. Weber, S. Wermter, “Multimodal emotional state recognition
using sequence-dependent deep hierarchical features,” Neural Networks, Vol. 72,
pp. 140–151, 2015.
18. Z. Al-Kassim, Q. Memon, “Designing a low-cost eyeball tracking keyboard for para-
lyzed people,” Computers & Electrical Engineering, Vol. 58, pp. 20–29, 2017.
19. E. Correa, A. Jonker, M. Ozo, R. Stolk, “Emotion recognition using deep convolutional
neural networks,” 2016.
20. S. E. Kahou, V. Michalski, K. Konda, R. Memisevic, C. Pal, “Recurrent neu-
ral networks for emotion recognition in video,” ICMI 2015, USA, pp. 9–13, 2015.
doi:10.1145/2818346.2830596.
21. F. Noroozi, C. A. Corneanu, D. Kamínska, T. Sapínski, S. Escalera, G. Anbarjafari,
“Survey on emotional body gesture recognition,” Journal of IEEE Transactions on
Affective Computing, 2015. PrePrints, DOI Bookmark: 10.1109/TAFFC.2018.2874986
22. M. Hu, H. Wang, X. Wang, J. Yang, R. Wang, “Video facial emotion recognition based
on local enhanced motion history image and CNN-CTSLSTM networks,” Journal
of Visual Communication and Image Representation, Vol. 59, pp. 176–185, 2018.
doi:10.1016/j.jvcir.2018.12.039.
23. X. Sun, P. Wu, S. C. H. Hoi, “Face detection using deep learning: An improved faster
RCNN approach,” Neurocomputing, Vol. 299, pp. 42–50, 2018.
24. O. Gupta, D. Raviv, R. Rasker, “Illumination invariants in deep video expression rec-
ognition”, Pattern Recognition, 2017. doi:10.1016/j.patcog.2017.10.017.
25. G. Batchuluun, R. Ali, W. Kim, K. R. Park, “Body movement based human identi-
fication using convolutional network,” Expert Systems with Applications, 2018.
doi:10.1016/j.eswa.2018.02.016.
26. F. Letsch, D. Jirak, S. Wermter, “Localizing salient body motion in multi-person scenes
using convolutional neural network,” Neurocomputing, pp. 449–464, 2019.
27. J. Gu, Z. Wan, J. Kuen, “Recent advances in convolutional neural network,” Pattern
Recognition, Vol. 77, pp. 354–377, 2018.
28. D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, “Learning spatiotemporal fea-
tures with 3D convolutional networks,” IEEE International Conference on Computer
Vision (ICCV), pp. 4489–4497, 2015. Santiago, Chile
29. T. Guha, R. K. Ward, “Learning sparse representations for human action recognition,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34, Issue 8,
pp. 1576–1588, 2012.
30. H. Wang, C. Yuan, W. Hu, C. Sun, “Supervised class-specific dictionary learning
for sparse modeling in action recognition,” Pattern Recognition, Vol. 45, Issue 11,
pp. 3902–3911, 2012.
31. F. Zhu, L. Shao, “Weakly-supervised cross-domain dictionary learning for visual rec-
ognition,” International Journal of Computer Vision, Vol. 109, Issue 1–2, pp. 42–59,
2014.
32. F. Zhu, L. Shao, “Correspondence-free dictionary learning for cross-view action recog-
nition,” in ICPR, pp. 4525–4530, 2015. Piscataway, NJ
33. Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D.
Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural compu-
tation, Vol. 1, Issue 4, pp. 541–551, 1989.
34. A. Krizhevsky, I. Sutskever, G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, pp. 1097–1105, 2012. doi:10.1145/3065386.
35. Y. Du, W. Wang, L. Wang, “Hierarchical recurrent neural network for skeleton based
action recognition,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 1110–1118, 2015. Boston, MA
36. F. Zhu, L. Shao, J. Xie, Y. Fang, “From handcrafted to learned representations for
human action recognition: A survey”, Image and Vision Computing, 2016. Volume 55
Issue P2, pp: 42-52
37. C. Cortes, V. Vapnik, “Support-vector networks,” Machine Learning, Vol. 20, Issue 3,
pp. 273–297, 1995.
38. G. E. Hinton, R. R. Salakhutdinov, “Reducing the dimensionality of data with neural
networks,” Science, Vol. 313, Issue 5786, pp. 504–507, 2006.
39. P. Smolensky, “Information processing in dynamical systems: Foundations of harmony
theory,” Report, DTIC Document, 1986.
40. L. Sun, K. Jia, T. H. Chan, Y. Fang, G. Wang, S. Yan, “DL-SFA: Deeply-learned
slow feature analysis for action recognition,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 2625–2632. 2015, 10.1109/
CVPR.2014.336. Columbus, United States
41. D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, “Learning spatiotemporal fea-
tures with 3D convolutional networks,” IEEE International Conference on Computer
Vision (ICCV), pp. 4489–4497, 2015. Santiago, Chile
42. T. Guha, R. K. Ward, “Learning sparse representations for human action recognition,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34, Issue 8,
pp. 1576–1588, 2012.
43. H. Wang, C. Yuan, W. Hu, C. Sun, “Supervised class-specific dictionary learning
for sparse modeling in action recognition,” Pattern Recognition, Vol. 45, Issue 11,
pp. 3902–3911, 2012.
44. F. Zhu, L. Shao, “Weakly-supervised cross-domain dictionary learning for visual
recognition,” International Journal of Computer Vision, Vol. 109, Issue 1–2, pp. 42–59,
2014.
45. F. Zhu, L. Shao, “Correspondence-free dictionary learning for cross-view action recog-
nition,” in ICPR, pp. 4525–4530. 2014, Piscataway, NJ
46. J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong, “Locality-constrained linear coding
for image classification,” in Computer Vision and Pattern Recognition (CVPR), IEEE
Conference on, pp. 3360–3367, 2010. San Francisco, CA, USA
47. A. Cohen, I. Daubechies, J.-C. Feauveau, “Biorthogonal bases of compactly supported
wavelets,” Communication in Pure Applied Mathematics, Vol. 45, Issue 5, pp. 485–560,
1992.
48. L. Liu, L. Shao, X. Li, K. Lu, “Learning spatio-temporal representations for action
recognition: A genetic programming approach,” IEEE Transactions on Cybernetics,
Vol. 46, Issue 1, pp. 158–170, 2016.
49. Y. Zhang, Y. Zhang, E. Swears, N. Larios, Z. Wang, Q. Ji, “Modeling temporal inter-
actions with interval temporal bayesian networks for complex activity recognition,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 35, Issue 10,
pp. 2468–2483, 2013.
50. F. M. Khan, S. C. Lee, R. Nevatia, “Conditional Bayesian networks for action detection,”
in Advanced Video and Signal Based Surveillance (AVSS), 10th IEEE International
Conference, pp. 256–262, 2013. Krakov, Poland
51. C. Y. Park, K. B. Laskey, P. C. Costa, S. Matsumoto, “A process for human aided multi-
entity Bayesian networks learning in predictive situation awareness,” in Information
Fusion (FUSION), 19th International Conference on, pp. 2116–2124, 2016.
52. L. Deng, D. Yu, “Deep learning,” Signal Processing, Vol. 7, pp. 3–4, 2014.
53. A. Ivakhnenko, “Polynomial theory of complex systems,” IEEE Transactions on
Systems, Man, and Cybernetics, Vol. SMC-1, Issue 4, pp. 364–378, 1971.
54. G. E. Hinton, S. Osindero, Y. W. Teh, “A fast learning algorithm for deep belief nets,”
Neural Computation, Vol. 18, Issue 7, pp. 1527–1554, 2006.
55. Q. V. Le, W. Y. Zou, S. Y. Yeung, A. Y. Ng, “Learning hierarchical invariant
spatio-temporal features for action recognition with independent subspace analysis,”
in Computer Vision and Pattern Recognition (CVPR), pp. 3361–3368, 2011. Colorado,
USA
56. P. Foggia, A. Saggese, N. Strisciuglio, M. Vento, “Exploiting the deep learning paradigm
for recognizing human actions,” in Advanced Video and Signal Based Surveillance
(AVSS), pp. 93–98, 2014. Seoul, Korea
57. M. Hasan, A. K. Roy-Chowdhury, “Continuous learning of human activity models
using deep nets,” in European Conference on Computer Vision, pp. 705–720, Springer.
Zurich, Switzerland, 2014
58. L. Ballan, M. Bertini, A. Del Bimbo, L. Seidenari, G. Serra, “Effective codebooks
for human action representation and classification in unconstrained videos,” IEEE
Transactions on Multimedia, Vol. 14, Issue 4, pp. 1234–1245, 2012.
59. A. Krizhevsky, I. Sutskever, G. E. Hinton, “Imagenet classification with deep convo-
lutional neural networks,” in Advances in Neural Information Processing Systems,
pp. 1097–1105. 2012, San Francisco, CA
60. Y. Du, W. Wang, L. Wang, “Hierarchical recurrent neural network for skeleton based
action recognition,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 1110–1118. Boston, MA, USA, 2015
61. Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, Vol. 1, Issue 4, pp. 541–551, 1989.
Index
A
Adaptive Data Collection Tree, 97, 103
Aggregation, 106, 115–118, 120
Aggregation operators, 74
Arrhythmias, 174

B
Backward stepwise regression method, 169
Bagging and oversampling techniques, 163
Big data, 128–132, 135, 137, 140, 142, 143, 145, 149–155
Binary response variable, 168
Biological data, 128–132, 135, 137, 140, 141, 145, 149–155
Bio-nano-sensor, 174
Blood perfusion, 223
Body Movements, 306, 318, 319
Boundary nodes, 103, 104
Brain Computer Interface, 245
Broker, 96–100
Brugada syndrome, 174
Burns, 220–227, 234–238
Business Intelligence Workload, 76

C
Canonical Correlation Analysis (CCA), 306
Choke points, 76, 77, 81, 82
Chord, 97–99, 101, 103, 106, 120
Classification, 26–28, 41, 42, 44, 53, 54, 57–61
Clinical decision support, 160
Clinical indicators, 162
Collaborative filtering, 288
Collective Store and Forwarding, 97, 101
Confusion matrix, 160–165, 169, 170, 315, 316
Contingency table, 235
Convolutional Neural Network, 221, 307, 314

D
Data analysis, 293
Data analytics framework, 160, 161
Data management, 293
Data security, 131, 138, 143
DB Hits, 74, 75, 79, 80, 82–89, 91
Deadline, 101–103, 105, 106
Decision Tree (DT), 269, 270, 273, 276, 280
Deep Belief Network (DBN), 309–311
Deep Learning, 306–311
Descriptive statistics, 162
Design guidelines, 70, 74, 76, 89–91
Distributed hash table, 96

E
Edge computing, 98
EEG data, 247
Electronegativity, 174
Electronic word-of-mouth (EWOM), 268–270, 273, 274, 277, 280, 281
Emotional state transition, 245, 249, 252, 254, 257, 260
Execution plan, 70, 73, 74, 77–83, 90
Expanding operators, 73

F
False positive rate, 170
Feature subset, 51, 52, 54–56, 59, 60
Filter operators, 74, 80
Finger table, 99, 102
Fitness Function, 55, 57
F1-Measure, 277
Framingham Heart Study, 159, 160, 166, 172

G
Genetic Algorithm, 28
GeoHash, 4, 5, 7, 11, 17, 20, 23
GeoJSON, 4
Geospatial Data, 4, 9, 10
Geospatial Query, 4, 5, 7–9, 11, 23
Gradient feature, 268, 269, 272, 281
Guest satisfaction, 269–281

H
Healthcare analytics, 160–163
High performance computing, 185
Histogram of Oriented Gradient, 268
Hunting Strategy, 52, 53, 57

I
Internet of Things, 96
Intrinsic disorder predisposition, 178, 180

J
Join operators, 74

T
Ten-fold cross-validation, 276, 279
Topic-based pub/sub (TBPS), 96–101, 105, 108, 111, 112, 118, 120
Tripadvisor, 270–273, 279
True negative rate, 165
True positive rate, 164, 170

U
Updating, 141, 155
User generated content (UGC), 268, 270–272, 280

V
Validation dataset, 164
Visual Geometric Group (VGG), 314, 315, 318, 319
Visualization, 3, 7–11, 21, 23, 132, 141, 149

W
Writer Retrieval, 268, 269, 275, 278–281