M. Arif Wani
Bisma Sultan
Sarwat Ali
Mukhtar Ahmad Sofi
Advances in Deep Learning, Volume 2
Studies in Big Data
Volume 12
Series Editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Big Data” (SBD) publishes new developments and advances
in the various areas of Big Data, quickly and with a high quality. The intent is to
cover the theory, research, development, and applications of Big Data, as embedded
in the fields of engineering, computer science, physics, economics and life sciences.
The books of the series refer to the analysis and understanding of large, complex,
and/or distributed data sets generated from recent digital sources coming from
sensors or other physical instruments as well as simulations, crowd sourcing, social
networks or other internet transactions, such as emails or video click streams, and
others. The series contains monographs, lecture notes and edited volumes in Big
Data spanning the areas of computational intelligence including neural networks,
evolutionary computation, soft computing, fuzzy systems, as well as artificial
intelligence, data mining, modern statistics and operations research, as well as
self-organizing systems. Of particular value to both the contributors and the
readership are the short publication timeframe and the world-wide distribution,
which enable both wide and rapid dissemination of research output.
The books of this series are reviewed in a single blind peer review process.
Indexed by SCOPUS, EI Compendex, SCIMAGO and zbMATH.
All books published in the series are submitted for consideration in Web of Science.
M. Arif Wani · Bisma Sultan · Sarwat Ali · Mukhtar Ahmad Sofi
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2025
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface
Deep learning has revolutionized the Artificial Intelligence (AI) field, offering unpar-
alleled performance across a variety of complex tasks. Its ability to automatically
learn features from vast amounts of data makes it one of the most suitable learning
techniques for numerous applications. Several areas and applications of deep learning
have witnessed significant progress; these include Neural Architecture Search (NAS),
Steganography, and Medical Applications.
Neural Architecture Search has significantly contributed to the design of deep
learning models by automating the process of architecture creation. Traditionally,
designing deep learning networks required expert knowledge and trial-and-error
methods, which were both time-consuming and limited by human intuition. NAS
overcomes these limitations by employing algorithms to explore a vast search space
of potential architectures and selecting the best architecture.
Steganography, the practice of hiding information within digital media, has long
been used for secure communication. Traditional methods of steganography were
often susceptible to detection or offered limited capacity for hidden data. However,
with the advent of deep learning, new techniques have emerged that significantly
improve the robustness, capacity, and security of hidden information.
The ability to analyze vast amounts of medical data has opened new fron-
tiers in healthcare, encompassing disease diagnosis, personalized medicine, and
drug discovery. Deep learning models are now integral to interpreting medical
images, predicting patient outcomes, and identifying potential therapeutic targets,
significantly enhancing both accuracy and efficiency of healthcare services.
This book aims to explore these three areas, highlighting the significant work
accomplished, the impact made, and the future research scope in these areas. The
book is organized into thirteen chapters.
Chapter 1 discusses the impact of deep learning in three important areas: Neural
Architecture Search (NAS), Steganography, and Medical Applications. The chapter
introduces NAS, an approach that automates the neural network architecture search
process, resulting in the design of efficient and high-performance models. It then
discusses steganography, where the impact of advanced deep learning-based methods
for secure data embedding within digital media is outlined. The chapter highlights
the achievements of deep learning in the medical field. The chapter also outlines the
future research directions in these three important areas.
Chapter 2 explores the fundamental concepts of Evolutionary Algorithm-Based
Neural Architecture Search (NAS). By employing principles of natural selection,
evolutionary NAS iteratively evolves optimal neural architectures across generations,
effectively exploring a vast search space of architectures. The integration of tech-
niques such as mutation, crossover, and selection enhances the diversity and adapt-
ability of architectures for complex tasks like image classification. The chapter also
evaluates the performance of various evolutionary NAS methods, comparing their
effectiveness and identifying promising avenues for future research and development.
Chapter 3 focuses on Gradient-Based Neural Architecture Search, an approach
that automates the design of neural network architectures. It examines in depth the
DARTS methodology, a foundational gradient-based NAS technique that formulates
architecture search as a continuous optimization problem. An experimental analysis
showcases the efficiency and effectiveness of gradient-based NAS, highlighting its
practical applications. The chapter concludes with a discussion on future directions
for research, underscoring the importance of balancing accuracy, efficiency, and
computational costs in advancing this powerful architecture search paradigm.
Chapter 4 presents a new training methodology aimed at improving the perfor-
mance of deep learning models. The approach utilizes a coarse-to-fine-tuning
strategy that incorporates selective freezing techniques, specifically Simple Selec-
tive Freezing (SSF) and Progression-Based Selective Freezing (PSF). Initially, coarse
training is performed on deep learning architectures, followed by the application of
these selective freezing methods to fine-tune the model. This approach can be applied
to architectures obtained either manually or through Neural Architecture Search
(NAS) methods. The experiments on the CIFAR-10 dataset, using an architecture
derived from DARTS, reveal that the coarse-to-fine-tuning approach outperforms
traditional training methods.
Chapter 5 discusses various Generative Adversarial Networks (GANs) architec-
tures that have been used in image steganography. Generative Adversarial Networks
have gained considerable attention in image steganography mainly because these
networks can encode and decode secret information using digital images efficiently.
Various GAN-based techniques that embed and extract secret data seamlessly within
images, offering a robust solution for secure communication and data concealment,
are discussed.
Chapter 6 discusses advanced deep learning-based image steganography tech-
niques. The three key parameters (security, embedding capacity, and invisibility) for
measuring the quality of an image steganographic technique are described. Various
steganography techniques, with emphasis on the above three key parameters, are
described. The results reported by researchers on benchmark datasets, CelebA, Boss-
base, PASCAL-VOC12, CIFAR-100, ImageNet, and USC-SIPI, are used to evaluate
the performance of various steganography techniques. Analysis of the results shows
that there is scope for new suitable deep learning architectures that can improve the
capacity and invisibility of image steganography.
7.4.3 Performance Comparison with Other GAN-Based Models 108
7.4.4 Performance Comparison with Hybrid Methods 109
7.5 Challenges and Future Directions 109
7.6 Summary 109
Bibliography 110
8 Two-Stage Generative Adversarial Networks for Image Steganography with Multiple Secret Messages 111
8.1 Introduction 111
8.2 GAN Based Image Steganography System for Multiple Secret Messages 112
8.2.1 Workflow Diagram of Image Steganography Process for Multiple Secret Messages 112
8.2.2 Generator Networks 113
8.2.3 Steganalyzer Network 116
8.2.4 Extractor Network 116
8.2.5 Algorithm for Training the Networks 116
8.3 Results and Discussions 118
8.3.1 Dataset Used 118
8.3.2 Performance of Generator Models 119
8.3.3 Performance of Late Embedding Generator Model and Steganography Methods 121
8.3.4 Performance of the Late Embedding Generator Model and Deep Learning Models 121
8.4 Challenges and Future Directions 121
8.5 Summary 122
Bibliography 122
9 Deep Learning in Healthcare and Computational Biology 125
9.1 Introduction 125
9.2 Deep Learning for Sequence Data: A Case Study of Protein Secondary Structure Prediction 126
9.3 Deep Learning for Genomic Data: A Case Study of Pan-Cancer Classification 128
9.3.1 Pan-Cancer Classification: A Paradigm Shift 129
9.4 Deep Learning for Image Data: A Case Study of Tumor Prediction Using MRI Scans 130
9.5 Challenges and Future Directions 133
9.6 Summary 133
Bibliography 134
About the Authors
Dr. M. Arif Wani received an M.Tech. degree in Computer Technology from the
Indian Institute of Technology, Delhi, and a Ph.D. degree in Computer Vision from
Cardiff University, UK. Currently, he is a Professor at the University of Kashmir,
having previously served as a Professor at California State University, Bakersfield.
His research interests are in the area of machine learning, with a focus on neural
networks, deep learning, inductive learning, support vector machines, computer
vision, pattern recognition, and classification tasks. He has published many papers in
reputed journals and conferences in these areas. Dr. Wani has co-authored the book
Advances in Deep Learning, and co-edited many books on Machine Learning and
Deep Learning applications. He is a member of many academic and professional
bodies.
Bisma Sultan earned a B.Tech. in Computer Science and Engineering from the
University of Kashmir, an M.Tech. in Computer Science from the University of
Jammu, and completed her Ph.D. in Computer Science at the University of Kashmir.
She has qualified the National Eligibility Test (NET) and the Graduate Aptitude Test in
Engineering (GATE). She served as a Senior Research Fellow in the Department of
Computer Sciences at the University of Kashmir. Currently, she holds a faculty posi-
tion within the Department of Computer Sciences at the University of Kashmir. Dr.
Bisma’s research interests encompass a broad spectrum of fields, including machine
learning, deep learning, information security, web technologies, and networking. She
has made contributions to these areas by publishing a number of research articles in
high-ranking academic journals and conferences.
Sarwat Ali has a Bachelor’s and Master’s degree in Computer Applications from
the University of Kashmir, Hazratbal, Srinagar. She has worked as a Research Asso-
ciate in the Artificial Intelligence Research Center project supported by Rashtriya
Uchchatar Shiksha Abhiyan (RUSA 2.0), an initiative of the Government of India.
Currently, she is dedicated to her Ph.D. studies in Neural Architecture Search (NAS)
at the Post Graduate Department of Computer Science (PGDCS), University of
Kashmir. She has contributed to publications in prestigious journals and research
conferences in this specialized field. Sarwat’s academic journey has been marked
by achievements, including the receipt of gold medals for both her Bachelor’s and
Master’s degrees from the University of Kashmir. She has also achieved success in the
University Grants Commission National Eligibility Test (UGC NET) examination.
Mukhtar Ahmad Sofi received MCA and M.Tech. degrees in Computer Science and
Engineering from Pondicherry Central University and a Ph.D. in Computer
Science from the University of Kashmir. Currently, he is an Assistant Professor in
the Information Technology Department at the BVRIT Hyderabad College of Engi-
neering for Women. Dr. Sofi’s diverse research interests span data mining, machine
learning, deep learning, natural language processing, computational biology, and
bioinformatics. He has made notable contributions to these areas through numerous
research articles published in prestigious academic journals and presentations at inter-
nationally acclaimed conferences. Dr. Sofi continues to make impactful contributions
to the academic and research community.
Chapter 1
Introduction to Deep Learning
Applications
1.1 Introduction
Lastly, the medical applications of deep learning represent perhaps the most
profound impact of this technology. The ability to analyze vast amounts of medical
data has opened new frontiers in healthcare, encompassing disease diagnosis, person-
alized medicine, and drug discovery. Deep learning models are now integral to inter-
preting medical images, predicting patient outcomes, and identifying potential thera-
peutic targets, significantly enhancing both the accuracy and efficiency of healthcare
services. As the volume of medical data continues to grow, the role of deep learning
in medicine is set to expand, offering innovative possibilities for treatment and care.
In conclusion, this chapter highlights the transformative potential of deep learning
across NAS, steganography, and medical applications. By highlighting the significant
advancements, impacts, and achievements within these domains, we gain valuable
insights into how deep learning is reshaping various fields and paving the way for
future innovations.
The field of Neural Architecture Search (NAS) has undergone significant evolu-
tion, driven primarily by advancements in search strategies designed to navigate
the vast space of potential architectures more efficiently. Early NAS approaches,
such as Grid Search and Random Search, provided foundational methodologies but
lacked the efficiency required for large-scale applications. Over time, these basic
approaches have given way to more sophisticated search strategies, and gradient-based
NAS in particular has emerged as one of the most efficient techniques. Its scalability
to large datasets and complex architectures offers a compelling balance between search
efficiency and performance, positioning it as one of the leading methods in modern NAS.
In brief, the progression of NAS search strategies—from the early methods of
Grid and Random Search to the more sophisticated approaches of Evolutionary
Algorithms, Reinforcement Learning, and Gradient-based NAS—has dramatically
improved the process of architecture discovery. While early approaches provided
important groundwork, the advanced methods employed today make NAS far more
accessible, efficient, and effective across a wide range of applications.
The integration of Neural Architecture Search (NAS) into deep learning has signifi-
cantly impacted both research and industry applications. One of the important contri-
butions has been the optimization of architecture design. Traditionally, designing
neural networks was a manual and time-consuming process that relied heavily on
expert knowledge and iterative trial-and-error methods. NAS has revolutionized this
process by automating the discovery of architectures, enabling the creation of models
customized to specific tasks. This automation has not only streamlined the develop-
ment process but also opened the door to more innovative and task-specific model
designs that might have been overlooked by human experts.
Another key impact of NAS is the enhancement of model performance. NAS-
generated architectures have consistently achieved state-of-the-art results across a
variety of important benchmarks, including CIFAR-10, ImageNet, and COCO. These
architectures often outperform handcrafted models in terms of both accuracy and
efficiency, demonstrating the potential of NAS to push the boundaries of what deep
learning models can achieve. This improvement in performance extends across a
range of applications, from image classification to object detection and beyond,
highlighting the versatility and effectiveness of NAS-driven designs.
Another transformative impact of NAS is the reduction of human intervention in
the network design process. By automating the search for optimal architectures, NAS
minimizes the need for expert knowledge, making it possible for even non-experts
to achieve competitive results. This democratization of neural network design not
only accelerates the pace of innovation but also broadens access to advanced deep
learning techniques, allowing a wider range of practitioners to employ cutting-edge
models in their respective fields.
The achievements of Neural Architecture Search (NAS) are both extensive and
continually expanding as the field progresses. One of the most notable accomplish-
ments is NAS’s ability to deliver state-of-the-art performance across various deep
learning tasks. NAS-generated architectures have consistently outperformed manu-
ally designed models on standard datasets, like ImageNet. These successes demon-
strate the effectiveness of NAS in discovering highly optimized architectures that
push the boundaries of what neural networks can achieve in terms of accuracy and
performance.
NAS has also brought about significant efficiency gains in the process of neural
network design. Historically, designing effective models was a resource-intensive
task, requiring considerable time and computational power. NAS has streamlined this
process by incorporating techniques like weight sharing and early stopping, which
drastically reduce the computational demands of searching for optimal architectures.
These innovations ensure that the search process remains efficient while maintaining
high levels of performance, allowing models to be developed faster and with fewer
resources.
Another key achievement of NAS is the flexibility it introduces into neural network
design. NAS allows for the exploration of a wide variety of architectural possibil-
ities, including novel combinations of layers and connections that were previously
unexplored. This flexibility has led to the discovery of unconventional architectures
that challenge traditional design paradigms, further expanding the range of model
designs available for various tasks.
Neural Architecture Search (NAS) has already found practical applications in
several real-world scenarios, further highlighting its achievements. For instance,
NAS has been employed by Google to develop EfficientNet, a family of models that
achieve superior performance on image classification tasks. Additionally, NAS has
been integrated into medical imaging, where it has helped optimize architectures
for tasks like tumor detection and segmentation, leading to more accurate and faster
diagnosis. These examples show how NAS advances both practical applications and
theoretical innovations.
Despite the remarkable progress made in Neural Architecture Search (NAS), there
are several promising avenues for future exploration. One key area is the develop-
ment of improved search algorithms. Although current NAS methods have achieved
impressive results, there is still a need for more efficient search strategies that can
reduce computational costs without compromising performance. As NAS continues
to evolve, researchers are expected to devise algorithms that streamline the search
process, making it faster and more resource-efficient, thus broadening its accessibility
and practical application.
Neural Architecture Search also holds great potential for application in new
domains beyond its traditional focus areas. Fields such as medical imaging,
autonomous driving, and drug discovery present exciting opportunities for inno-
vation. As NAS methodologies are adapted to these complex, high-stakes domains,
the potential for groundbreaking advancements in technology and science grows. For
instance, in medical imaging, NAS could help design models that accurately diag-
nose diseases with greater efficiency, while in autonomous driving, it could enhance
the safety and reliability of vehicle perception systems.
A significant challenge that NAS continues to face is its high computational cost.
While strides have been made to improve efficiency, the resource demands of archi-
tecture search remain substantial. Future research will likely focus on reducing this
cost even further, possibly through the development of lighter, more efficient search
techniques that require fewer computational resources. Achieving this would make
NAS more accessible to researchers and developers who may not have access to
extensive computing power.
Another promising direction is the incorporation of multi-objective optimiza-
tion within NAS frameworks. Currently, most NAS systems primarily optimize for
accuracy, but future research could explore balancing accuracy with other factors
such as model size, energy consumption, and latency. This would be particularly
important for deploying NAS-generated models on edge and mobile devices, where
computational resources and energy efficiency are critical.
Additionally, hardware-aware NAS is poised to become a major focus in the
future. As specialized hardware like Tensor Processing Units (TPUs) and Field-
Programmable Gate Arrays (FPGAs) become more prevalent, NAS models will
need to be co-optimized with specific hardware constraints in mind. This approach
could lead to architectures that are fine-tuned for specific deployment environments,
ensuring optimal performance and efficiency on various hardware platforms.
Steganography, the practice of hiding information within digital media, has been
significantly transformed by deep learning techniques. Traditional methods relied on
algorithms that manipulated image, audio, or video data to embed secret messages
while maintaining imperceptibility. However, deep learning offers a more sophis-
ticated approach, enabling the design of models that can automatically learn intri-
cate features for secret message embedding and extraction. This evolution has led to
improved capacity, security, and resistance to steganalysis. In this section, we explore
the application of deep learning in steganography, highlighting key developments,
impacts, achievements, and future directions.
The integration of deep learning into steganography has had a profound impact on
both theoretical and practical applications. Key impacts include improved security,
where deep learning models, particularly adversarial networks, have enhanced the
robustness of stego images against modern steganalysis tools, making it more difficult
to detect hidden messages. Additionally, these models have enabled higher payload
capacities while maintaining visual fidelity, a significant advancement over tradi-
tional techniques. By automating feature extraction and learning optimal embedding
patterns, deep learning reduces the need for manual intervention, which was prevalent
in earlier approaches.
The future of deep learning in steganography is promising, with several areas ripe
for exploration. As deep learning architectures become more efficient, the focus
will likely shift towards creating stego images with higher resolutions (e.g., 512
× 512 or larger) without sacrificing imperceptibility. Another exciting direction is
the development of real-time steganography systems, allowing for the instantaneous
embedding and extraction of messages in live video streams or interactive appli-
cations. While defensive models are improving, the future will also see advance-
ments in deep learning-driven steganalysis, pushing researchers to develop even
more robust and secure steganographic methods. Additionally, combining different
types of media (e.g., image, audio, and text) in a single steganographic framework
using deep learning could lead to richer and more complex hidden communication
systems.
Deep learning has emerged as a transformative force in the medical field, offering
innovative solutions to some of the most pressing challenges in healthcare. By incor-
porating the power of advanced algorithms and vast datasets, deep learning tech-
niques have demonstrated remarkable capabilities in areas such as medical imaging,
analysis of complex biological data, and treatment optimization. These technolo-
gies enable healthcare professionals to analyze complex medical data with unprece-
dented accuracy, enhancing diagnostic processes and personalizing patient care.
From improving the detection of diseases in medical images to accelerating drug
discovery and enabling real-time health monitoring, deep learning is reshaping the
landscape of modern medicine. This section highlights the work done in this area,
the impacts made, notable achievements, and future directions for deep learning
applications in healthcare.
The integration of deep learning methodologies into medical imaging and the analysis
of proteomic and genomic data has profoundly impacted healthcare, bringing signif-
icant improvements in various aspects of the field. One of the most notable advance-
ments has been the improvement in diagnostic accuracy. Deep learning algorithms,
particularly in areas like image classification and object detection, have demonstrated
performance that rivals, and in some cases surpasses, that of human practitioners.
This has led to earlier identification of medical conditions, a reduction in both false
positives and false negatives, and more reliable and objective evaluations. These
advancements enhance the quality of patient care by enabling more precise and
timely diagnoses.
Another key impact of deep learning in medical imaging is the increased effi-
ciency in diagnostic processes. Automation of traditionally labor-intensive tasks,
such as image segmentation and analysis, has dramatically accelerated the pace of
medical image interpretation. This has resulted in faster reporting times, allowing
for quicker clinical decisions. Additionally, the reduced workload on radiologists
enables them to focus on more complex cases, improving the overall operational
flow within radiology departments and healthcare facilities.
Deep learning holds immense potential in processing proteomic and genomic
data. In proteomics, it enables accurate prediction of protein structures, unrav-
eling complex biological functions and accelerating drug discovery. In genomics,
it empowers the classification of cancers by identifying shared molecular patterns
across diverse cancer types, paving the way for personalized medicine and targeted
therapies.
Deep learning also has the potential to enhance accessibility to high-quality diag-
nostics, particularly in underserved regions. In areas with limited access to special-
ized radiologists, deep learning models can facilitate large-scale screening programs,
ensuring that individuals in remote or resource-limited environments receive timely
diagnoses. Furthermore, these models can provide second opinions, increasing the
availability of expert-level diagnostic evaluations without the need for physical
consultations, thereby bridging the healthcare gap between urban and rural settings.
Deep learning models have, for example, been developed to automatically
segment brain tumors from MRI scans. These models are now approaching human-
level performance, offering a more precise and consistent method of tumor detection,
which is essential for treatment planning and prognosis in neuro-oncology.
During the COVID-19 pandemic, deep learning played a crucial role in the rapid
development of models for detecting the virus from chest X-rays and CT scans.
These models were instrumental in aiding healthcare professionals with quick triage
and diagnosis, especially in overwhelmed healthcare systems. The ability of deep
learning models to analyze imaging data quickly and accurately was vital in managing
the pandemic, providing valuable insights into the progression of the disease and its
impact on the lungs. In addition, AlphaFold, developed by DeepMind, demonstrated
near-experimental accuracy in predicting protein structures and played a crucial
role in understanding the structural biology of SARS-CoV-2, aiding in the fight
against COVID-19 by accelerating the discovery of viral protein interactions and
potential therapeutic targets. These achievements collectively highlight the growing
importance of deep learning in enhancing the accuracy, efficiency, and accessibility
of medical diagnostics.
The future potential of deep learning in medical imaging is vast and holds promise
across several key areas. One exciting direction is the integration of multi-modal
data, combining imaging with genetic, clinical, and laboratory information to offer a
holistic view of a patient’s health. This could lead to earlier diagnoses, improved prog-
nostic assessments, and more personalized treatment plans. Additionally, advance-
ments in computational power, particularly through cloud computing and edge
processing, could enable real-time image analysis, providing immediate feedback
to clinicians at the point of care and speeding up the diagnostic process. Explainable
AI (XAI) is another area of intense research, aimed at making deep learning models
more transparent and trustworthy by using tools like heat maps to show how models
arrive at specific diagnoses. In terms of disease detection, deep learning has the
potential to identify conditions in their earliest stages, even before symptoms appear,
such as in cancer screening using low-dose CT scans or biomarkers. Looking further
ahead, autonomous diagnostic systems could operate independently of human over-
sight, particularly in resource-constrained environments or during emergencies. The
integration of deep learning into personalized medicine also holds promise, as AI
could analyze imaging alongside genomics and patient history to predict individual
responses to treatments.
Finally, combining deep learning with robotic systems for image-guided surg-
eries could revolutionize surgical precision and safety, enabling more accurate
intra-operative decision-making and improving patient outcomes.
In conclusion, deep learning is reshaping the medical field by offering more
accurate, scalable, and personalized approaches to diagnostics and treatment. By
improving tasks such as protein structure prediction and gene expression clas-
sification, it is paving the way for more effective medical research and patient
care.
1.5 Summary
Bibliography
Chapter 2
Evolutionary Algorithm-Based Neural Architecture Search
2.1 Introduction
The increasing demand for specialized neural network architectures that cater to
specific tasks has given rise to automated methods for architecture design, allevi-
ating the need for manual, labor-intensive processes. Neural Architecture Search
(NAS) has emerged as a key solution, enabling the discovery of optimized neural
networks without human intervention. Among the various approaches to NAS, Evolu-
tionary Algorithm-based NAS has proven to be particularly effective due to its ability
to efficiently navigate the vast search space of neural architectures by employing
biologically inspired optimization techniques.
Evolutionary Algorithms (EAs) are rooted in the principles of Darwinian natural
selection. By evolving a population of candidate solutions—neural architectures—
over multiple generations, EAs apply genetic operations such as mutation, crossover,
and selection to gradually improve architecture performance. The strength of Evolu-
tionary NAS lies in its ability to balance exploration and exploitation. Through muta-
tion, the search can explore new areas of the architecture space, while crossover
combines the strengths of high-performing architectures to refine solutions.
It is worth noting that the earliest NAS approaches were closely tied to genetic
algorithms, which inspired many of the core concepts in Evolutionary NAS. These
early Genetic Algorithm-based NAS methods demonstrated the feasibility of opti-
mizing neural architectures through Evolutionary processes. However, before the rise
of Evolutionary NAS, simpler techniques such as Grid Search and Random Search
were commonly employed. In Grid Search, a predefined set of hyperparameters is
exhaustively evaluated, but this method suffers from inefficiency due to its compu-
tational cost and lack of adaptability to dynamic search spaces. Random search, on
the other hand, involves selecting hyperparameters at random and evaluating them.
While more efficient than Grid Search in certain cases, Random Search lacks the
structured guidance provided by Evolutionary or other NAS methods.
The search space in Evolutionary NAS forms the fundamental landscape that defines
all potential neural network configurations available for exploration and optimization.
It encompasses two key aspects: the operation space, which outlines the possible
architecture component choices and hyperparameters, and the representation of these
architecture components, which provides a structured way to encode and manipulate
network designs. Together, these components enable the Evolutionary Algorithm to
systematically explore and refine neural network architectures.
Two key types of search spaces that can be explored using Evolutionary Algo-
rithms are the chain-structured search space and the cell-based search space. Each
of these approaches offers unique ways to define and optimize the structure of neural
networks.
A. Chain-Structured Search Space
In a chain-structured search space, the architecture is represented as a simple
sequence of layers, typically organized in a feed-forward manner. The search process
aims to evolve this sequence by altering the types and configurations of layers (e.g.,
convolutional, pooling, and dense layers), the number of filters or units, activation
functions, and other hyperparameters. Evolutionary operations such as mutation
and crossover are applied to modify the architecture’s structure, thereby exploring
different configurations and identifying the optimal chain of layers that yields the
best performance.
For example, in a chain-structured CNN architecture:
• The input layer is followed by a series of convolutional layers.
• The final convolutional layer output is flattened and passed to dense layers for
classification.
• The Evolutionary process searches for the best combination of convolutional
filters, dense units, and activations.
This approach is intuitive and easy to implement, but it may lack the flexibility
needed to design more complex architectures with repeated patterns or complex data
flow.
An example of a chain-structured search space, in which a neural network config-
uration is encoded in a structured format, such as a Python dictionary, is shown in
Table 2.1. This representation captures all essential hyperparameters, including the
number of layers and their respective sizes, enabling a clear and consistent approach
to defining architectures.
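A concrete way to write such an encoding is as a plain Python dictionary. The sketch below mirrors the sample individual used later in Fig. 2.1; the exact key names are illustrative rather than a fixed convention.

# Illustrative chain-structured encoding (key names follow the sample individual in Fig. 2.1).
sample_architecture = {
    'num_conv_layers': 2,
    'num_flatten_layers': 1,
    'num_dense_layers': 1,
    'filter_count_convolution_layer_1': 32,
    'filter_count_convolution_layer_2': 64,
    'filter_size_convolution_layer_1': 3,
    'filter_size_convolution_layer_2': 5,
    'filter_size_dense_layer_1': 128,   # number of units in the dense layer
    'activation_functions': ['relu'],
}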
In this example, the network has two convolutional layers and one dense layer. The
first convolutional layer uses 32 filters of size 3 × 3, while the second convolutional
layer uses 64 filters of size 5 × 5. The dense layer has 128 units. This structured
representation simplifies the implementation of NAS algorithms, allowing for effi-
cient exploration and optimization of different network configurations. Figure 2.1
depicts a simplified flow of various steps in the Evolutionary Algorithm-based NAS,
utilizing the chain-structured search space.

Fig. 2.1 Flow of steps in the Evolutionary Algorithm-based NAS with a chain-structured search space: (1) initialize a population of individuals whose genes encode parameters such as the number of convolution and dense layers; (2) develop and train an architecture for each individual; (3) compute fitness by evaluating each model on test data and recording accuracy and parameters; (4) select parents; (5) perform crossover between the selected parents to produce new offspring; (6) mutate the children to introduce diversity; (7) replace the least-performing individuals with the newly generated offspring. The loop repeats until the end of generations, after which the best-performing candidate architecture is retained and output. The sample individual shown encodes two convolutional layers (32 filters of size 3 and 64 filters of size 5), one flatten layer, and one dense layer with 128 units, giving the developed architecture Input, Convolution (32, 3), Convolution (64, 5), Flatten, Dense (128), Output.

B. Cell-Based Search Space
The cell-based search space in Neural Architecture Search (NAS) offers a modular
and flexible approach to designing deep learning models. In this framework, a neural
network is built from multiple cells that are repeated throughout the architecture. Each
cell functions as an independent unit, performing a specific set of operations such
as convolutions, pooling, or skip connections. By focusing the search on optimizing
the configuration of these cells, NAS can efficiently explore the design space and
construct complex models with repeated, well-optimized components.
In an Evolutionary NAS, cells evolve through genetic operations like mutation
and crossover, which modify the structure of the cell by altering paths between nodes
or changing the operations along those paths. This process allows the Evolutionary
Algorithm to optimize the cell structure incrementally. Once an optimized cell is
discovered, it can be reused across various layers in the network, leading to sophisti-
cated architectures that benefit from repeated patterns of operations. This approach
significantly reduces the complexity of searching for optimal architectures, as it
concentrates on refining smaller, reusable components.
To efficiently represent the structure of a cell, path-based one-hot encoding is
often employed. For example, consider a NAS framework operating on a simple cell
with four nodes, as shown in Fig. 2.2. Each node represents a feature map, while the
edges connecting the nodes represent operations like convolutions or pooling. The
goal is to represent the cell’s structure using one-hot encoding, which captures the
operation along each edge as a binary feature.
In this example, there are two types of operations: Operation 1 (denoted by green
edges, e.g., a 3 × 3 convolution) and Operation 2 (denoted by yellow edges, e.g.,
a 5 × 5 convolution). One-hot encoding is used to represent the operation on each
edge as a binary vector. For every edge between two nodes, a binary pair is assigned
to indicate which operation is present. If Operation 1 is active on a given edge, the
encoding would be [1, 0], meaning that Operation 1 is present, and Operation 2 is
absent. Conversely, if Operation 2 is selected, the encoding would be [0, 1].
This encoding technique provides a compact and efficient way to represent the
cell’s structure, facilitating the manipulation and optimization of the architecture
during the Evolutionary process. By encoding each operation as a binary feature,
the algorithm can efficiently explore various combinations of operations and paths
within the cell, ultimately leading to an optimized network architecture.
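As a small illustration of this idea, the sketch below maps each edge of a cell (taken in a fixed order) to a binary pair and concatenates the pairs into one encoding vector; the operation names and the helper function are hypothetical.

import random

# Hypothetical sketch of path-based one-hot encoding for a cell.
# Each edge carries one of two operations:
#   Operation 1 (e.g., 3 x 3 convolution) -> [1, 0]
#   Operation 2 (e.g., 5 x 5 convolution) -> [0, 1]
OP_TO_ONE_HOT = {'conv3x3': [1, 0], 'conv5x5': [0, 1]}

def encode_cell(edge_operations):
    """edge_operations: operation name for each edge, listed in a fixed edge order."""
    encoding = []
    for op in edge_operations:
        encoding.extend(OP_TO_ONE_HOT[op])
    return encoding

# Example: a cell whose five edges carry these operations
print(encode_cell(['conv3x3', 'conv5x5', 'conv3x3', 'conv3x3', 'conv5x5']))
# -> [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]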
Figure 2.2 illustrates the flow of steps in an Evolutionary Algorithm-based NAS,
utilizing the cell-based search space. The key components involved in this process—
such as the search space, genetic operations, and evaluation strategies—are discussed
in subsequent sections, providing a deeper understanding of how Evolutionary NAS
optimizes neural architectures.
This approach exemplifies how Evolutionary methods, inspired by natural selec-
tion, can be applied to optimize neural network architectures, yielding efficient and
high-performing models by evolving simple, modular cell structures.
Fig. 2.2 Flow of steps in the Evolutionary Algorithm-based NAS with a cell-based search space. The example search space specifies No_of_conv_layers: [1, 2, 3, 4, 5], Conv_filtercount: [16, 32, 64, 128], Conv_filtersize: [3, 5], Pooling layers: [max_pool, avg_pool], Pool_size: [2, 3], No_of_flatten_layers: [1, 2, 3, 4], and Activation functions: ['relu', 'gelu', 'sigmoid']. Panel (a) shows an example of a simple cell structure in which nodes represent features and edges represent operations; panel (b) shows the path-based encoding of the cell structure. The steps are: (1) population initialization by random generation of cells; (2) architecture development by stacking N similar cells (e.g., a sample architecture of four cells) and training the resulting model; (3) fitness computation by evaluating each model on test data and recording accuracy and parameters; (4) Roulette Wheel selection of the two best parents based on their fitness values; (5) crossover between the selected parents to produce new children; (6) mutation of the children to introduce diversity; (7) replacement of the least-performing individuals with the newly generated offspring. The loop repeats until the end of generations, after which the best-performing candidate architecture, built from stacked cells, is output.
A. Population Initialization
In the Evolutionary Neural Architecture Search (NAS) framework, the popula-
tion initialization phase marks the beginning of the search process. During this
phase, an initial population of candidate architectures—each representing a potential
neural network design—is generated. The population can consist of chain-structured
networks or cell-based architectures, depending on the search space being utilized.
Each candidate, or individual, within this population, is encoded using a set of genes
that define essential architectural parameters such as the number of convolutional
layers, dense layers, filter sizes, and activation functions.
These genes are randomly initialized within predefined bounds, ensuring diver-
sity within the population and allowing for the exploration of a wide variety of
network configurations. The initial population size is typically set to a reason-
ably large number, ensuring that the search begins from a broad and diverse set of
candidate solutions. This random generation process results in architectures ranging
from simple to highly complex designs, providing a rich foundation for subsequent
Evolutionary operations.
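A minimal sketch of such random initialization for a chain-structured encoding is given below. The convolution-related candidate values follow the example search space of Fig. 2.2, while the dense-layer genes, gene names, and population size are illustrative assumptions.

import random

# Candidate values per gene (convolution entries follow the example search space in Fig. 2.2;
# the dense-layer entries are illustrative assumptions).
SEARCH_SPACE = {
    'num_conv_layers': [1, 2, 3, 4, 5],
    'filter_count': [16, 32, 64, 128],
    'filter_size': [3, 5],
    'num_dense_layers': [1, 2, 3],
    'dense_units': [64, 128, 256],
    'activation': ['relu', 'gelu', 'sigmoid'],
}

def random_individual():
    """Randomly sample the genes of one chain-structured architecture (illustrative)."""
    individual = {
        'num_conv_layers': random.choice(SEARCH_SPACE['num_conv_layers']),
        'num_dense_layers': random.choice(SEARCH_SPACE['num_dense_layers']),
        'activation': random.choice(SEARCH_SPACE['activation']),
    }
    for i in range(1, individual['num_conv_layers'] + 1):
        individual[f'filter_count_convolution_layer_{i}'] = random.choice(SEARCH_SPACE['filter_count'])
        individual[f'filter_size_convolution_layer_{i}'] = random.choice(SEARCH_SPACE['filter_size'])
    for i in range(1, individual['num_dense_layers'] + 1):
        individual[f'units_dense_layer_{i}'] = random.choice(SEARCH_SPACE['dense_units'])
    return individual

population = [random_individual() for _ in range(20)]   # population size is a design choice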
The purpose of this diverse initialization is to ensure that the search explores
various architectural designs early in the process. Each individual’s performance
is evaluated based on specific criteria, such as accuracy, computational efficiency,
or a task-specific objective. The highest-performing architectures are then selected
as “parents” for the next phase of the algorithm, where genetic operators such as
crossover and mutation are applied. This allows the search to iteratively refine the
architecture design, progressively leading to higher-performing neural networks.
The random initialization of the population ensures that the search covers a broad
spectrum of possibilities, increasing the likelihood of discovering optimal or near-
optimal solutions as the Evolutionary process unfolds.
B. Architecture Development
In the architecture development phase of Evolutionary Neural Architecture Search
(NAS), each individual within the population is transformed into a unique neural
network architecture based on its encoded genetic parameters. These parameters,
which define the network’s structure and components—such as the number of convo-
lutional layers, dense layers, filter sizes, and activation functions—act as the building
blocks for each architecture. By decoding these genetic encodings, each individual is
instantiated as a distinct neural network configuration, ensuring architectural diver-
sity within the population. A simple example of architecture development in chain
structure-based EA-NAS is shown in Fig. 2.3.
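For the chain-structured case, decoding an individual into a trainable network can be sketched as follows. The snippet assumes a TensorFlow/Keras backend and the gene names used in the initialization sketch above; the default values guard against per-layer genes that may be missing after later crossover or mutation.

from tensorflow.keras import layers, models

def build_model(individual, input_shape=(32, 32, 3), num_classes=10):
    """Decode a chain-structured individual into a Keras model (illustrative)."""
    activation = individual.get('activation', 'relu')
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for i in range(1, individual['num_conv_layers'] + 1):
        model.add(layers.Conv2D(
            individual.get(f'filter_count_convolution_layer_{i}', 32),
            individual.get(f'filter_size_convolution_layer_{i}', 3),
            activation=activation, padding='same'))
    model.add(layers.Flatten())
    for i in range(1, individual['num_dense_layers'] + 1):
        model.add(layers.Dense(individual.get(f'units_dense_layer_{i}', 128),
                               activation=activation))
    model.add(layers.Dense(num_classes, activation='softmax'))
    return model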
In the case of a cell-based search space, the architecture is created by stacking
multiple cells together, as exemplified in Fig. 2.4. These cells function as modular
building blocks, each defined by a set of operations—such as convolutions, pooling,
or skip connections—and the specific connections between nodes within the cell.
The configuration of each cell, including the operations performed along the edges
and the connections between nodes, is determined by the genetic encoding provided
by the Evolutionary Algorithm. By stacking N cells in sequence, the overall network
gains both depth and complexity, allowing for more sophisticated feature extraction
and data transformation.
Sample_architecture =
{
'num_conv_layers': 2,
'num_flatten_layers': 1,
'num_dense_layers': 1,
'filter_count_convolution_layer_1': 32, Develop
'filter_count_convolution_layer_2': 64, Architecture
'filter_size_convolution_layer_1': 3,
'filter_size_convolution_layer_2': 5,
'filter_size_dense_layer_1': 128
Activation functions: [‘relu’}
}
Architecture representation
Developed Architecture
0 1 2 3 4 5
This cell-based approach offers a flexible framework for designing deep learning
models. Once an optimized cell structure is discovered, it can be reused across the
network, resulting in an efficient design process. The repetition of cells not only
simplifies the architecture search but also constrains the search space, enabling a more
targeted exploration of optimal network configurations. As a result, the developed
architectures balance complexity and efficiency, with the potential to yield high-
performing models suited to the specific task at hand.
C. Compute Fitness
Once the neural network architectures are constructed, their performance is evaluated
through a fitness assessment using a designated dataset (see Fig. 2.5). The primary
metric for determining the fitness of each individual architecture is accuracy, which
reflects the proportion of correct predictions made by the model on unseen data.
This accuracy score provides a quantitative measure of how well the architecture
generalizes to new inputs, serving as a direct indicator of its effectiveness for the
given task.
During the evaluation process, each model’s accuracy is calculated and recorded
alongside its corresponding architectural parameters. This accuracy score acts as the
fitness value in the Evolutionary Algorithm, guiding the selection of individuals for
further processes. Models that achieve higher accuracy are considered fitter and are
more likely to be selected for the next phase of evolution.
Following the fitness evaluation, the best-performing architectures will undergo
genetic operations, such as crossover and mutation, to refine and optimize their design
in subsequent generations.
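Under these definitions, a fitness evaluation can be sketched as a short training run followed by a test-set evaluation. The training settings below are illustrative, and build_model refers to the decoding sketch above.

def compute_fitness(individual, x_train, y_train, x_test, y_test):
    """Use test accuracy after a short training run as the fitness value (illustrative)."""
    model = build_model(individual)
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',   # assumes integer class labels
                  metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=5, batch_size=128, verbose=0)
    _, accuracy = model.evaluate(x_test, y_test, verbose=0)
    individual['fitness'] = accuracy    # record accuracy alongside the parameters
    return accuracy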
D. Parent Selection
The next phase in the Evolutionary NAS process is selection, where individuals with
the highest fitness values are chosen to become parents for the next generation. The
Roulette Wheel Selection method is commonly used, which gives higher probabilities
of selection to individuals with better fitness scores, while still allowing a chance for
less fit individuals to be selected. This ensures diversity and helps avoid premature
convergence to suboptimal solutions.
In the Roulette Wheel Selection process, the fitness value of each individual is
normalized to reflect its proportion relative to the total fitness of the population. A
“wheel” is imagined, where each individual occupies a segment proportional to its
fitness. The algorithm then randomly “spins the wheel” to select two parents, with
the likelihood of selection being proportional to the individual’s fitness.
This method ensures that the best-performing individuals have a higher chance
of being selected as parents while still maintaining diversity in the population. Once
the two parent individuals are selected, they are used in the next stage to undergo
genetic operations, such as crossover and mutation, to create offspring for the next
generation.
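A straightforward implementation of Roulette Wheel Selection samples each parent with probability proportional to its fitness, as in the sketch below (illustrative, assuming the 'fitness' field recorded during evaluation).

import random

def roulette_wheel_select(population, num_parents=2):
    """Select parents with probability proportional to fitness (illustrative)."""
    total_fitness = sum(ind['fitness'] for ind in population)
    parents = []
    for _ in range(num_parents):
        spin = random.uniform(0, total_fitness)    # "spin the wheel"
        running = 0.0
        for ind in population:
            running += ind['fitness']
            if running >= spin:
                parents.append(ind)
                break
    return parents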
E. Genetic Operations
Genetic operations in Evolutionary Algorithms and Neural Architecture Search
(NAS) are fundamental processes that drive the exploration and optimization of
neural network architectures. These operations, primarily crossover and mutation,
play key roles in generating new solutions from existing ones, mimicking biological
evolution.
Crossover involves combining elements of two parent architectures to create
offspring (candidate architectures) with potentially improved characteristics. The
process mirrors genetic recombination in biological reproduction and aims to explore
novel architectural configurations. Once the parents are selected, the crossover
method (shown in Fig. 2.6) is applied to generate new offspring. Each hyperparam-
eter of the offspring is independently inherited from one of the two parents with
equal probability.
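The crossover described above corresponds to a uniform crossover over the gene encoding; a minimal sketch, reusing the dictionary encoding from the earlier sketches, is:

import random

def uniform_crossover(parent_a, parent_b):
    """Each gene of the offspring is inherited from either parent with equal probability."""
    child = {}
    for key in parent_a:
        if key == 'fitness':                           # fitness is recomputed for the child
            continue
        value_b = parent_b.get(key, parent_a[key])     # fall back if parents differ in layer count
        child[key] = parent_a[key] if random.random() < 0.5 else value_b
    return child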
In the cell-based crossover, operations between the cells of two parent architec-
tures are interchanged to create offspring, as shown in Fig. 2.7. This involves swap-
ping specific operations—such as convolutional layers, pooling layers, or activation
functions—between corresponding cells in the parent architectures. By interchanging
these operations, the offspring inherit a mix of features from both parents, enabling
the exploration of novel architectural designs. The exchange of operations can occur
at multiple levels within the cells, such as exchanging convolution types (e.g., depth-
wise vs. standard convolution) or altering kernel sizes, allowing for a diverse search of
potential performance improvements. This method preserves the structural integrity
of the cells while introducing variability in their functional components, enhancing
the diversity of candidate architectures.
Mutation introduces further diversity by making small, random adjustments to an
architecture. These changes might include varying the number of layers or altering
filter sizes. The mutation probability—typically a small value such as 0.4—deter-
mines how often mutations occur. By applying mutations with a certain probability,
the algorithm can explore different configurations and potentially discover novel and
high-performing architectures that might not emerge through crossover alone.
Specifically, in a point mutation, as illustrated in Figs. 2.8 and 2.9, a single hyper-
parameter of the architecture is randomly changed. In chain-structured architectures
(Fig. 2.8), this might involve altering the depth or the type of layers, while in cell-
based architectures (Fig. 2.9), the mutation can affect operations within individual
cells, such as changing the convolution type, activation function, or other hyperpa-
rameters. After both crossover and mutation operations, the modified offspring are
added to the new generation, ensuring continued exploration and refinement of the
architecture space.
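A point mutation over this encoding can be sketched as below; the 0.4 mutation probability follows the text, while the gene handling reuses the illustrative search-space bounds defined earlier.

import random

def point_mutate(child, mutation_prob=0.4):
    """With the given probability, randomly change a single gene of the child (illustrative)."""
    if random.random() >= mutation_prob:
        return child
    gene = random.choice([k for k in child if k != 'fitness'])
    if gene in SEARCH_SPACE:                              # e.g. 'num_conv_layers', 'activation'
        child[gene] = random.choice(SEARCH_SPACE[gene])
    elif 'filter_count' in gene:
        child[gene] = random.choice(SEARCH_SPACE['filter_count'])
    elif 'filter_size' in gene:
        child[gene] = random.choice(SEARCH_SPACE['filter_size'])
    elif 'units_dense' in gene:
        child[gene] = random.choice(SEARCH_SPACE['dense_units'])
    return child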
F. Replacement
After the selection and genetic operations (like crossover and mutation) generate
new offspring, the next step is to replace the least-performing individuals in the
current population with the newly generated offspring. The fitness of the individuals
in the population is reassessed, and those with the lowest performance—typically
those with the poorest accuracy or other relevant metrics—are removed. These least-
performing individuals are replaced by the new offspring, ensuring that the population
evolves by retaining high-performing architectures while introducing new potential
solutions.
This replacement strategy ensures that the Evolutionary process maintains diver-
sity and continually improves the population by discarding less effective models in
favor of potentially stronger ones from the next generation.
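Replacement can then be realized by sorting the population by fitness and substituting the weakest individuals with the new offspring, as in this sketch:

def replace_weakest(population, offspring):
    """Replace the least-fit individuals with the newly generated offspring (illustrative)."""
    population.sort(key=lambda ind: ind['fitness'], reverse=True)   # best individuals first
    population[-len(offspring):] = offspring                        # drop the weakest
    return population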
G. Termination Condition
Steps B, C, D, E, and F of the Evolutionary NAS process continue iteratively until
a predefined termination condition is met. Common termination criteria include
achieving a specified accuracy threshold, reaching a maximum number of gener-
ations, or observing no significant improvement in fitness over a set number of iter-
ations. Once the termination condition is satisfied, the algorithm concludes, and the
best-performing architecture identified throughout the process is selected for deploy-
ment. This final architecture reflects the culmination of the Evolutionary search,
optimized for the specific task of interest, such as image classification.
H. Final Architecture
The architecture with the highest fitness score, tracked globally across all generations,
is selected as the final output of the NAS process.
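The overall generational loop described in Steps B–F can be summarized in a compact sketch. The Python code below is a simplified illustration: the placeholder fitness function stands in for training and validating each candidate (which dominates the cost in practice), and the population size, selection scheme, and stopping thresholds are assumptions chosen only for readability.

import random

OPS = ["conv3x3", "conv5x5", "max_pool3x3", "identity"]

def random_architecture(length=6):
    return [random.choice(OPS) for _ in range(length)]

def evaluate(arch):
    # Placeholder fitness; in practice this is the validation accuracy obtained
    # by training the candidate architecture on the task of interest.
    return sum(1.0 for op in arch if "conv" in op) + random.random()

def mutate(arch, p_mut=0.4):
    child = list(arch)
    if random.random() < p_mut:
        child[random.randrange(len(child))] = random.choice(OPS)
    return child

population = [random_architecture() for _ in range(10)]
best, best_fit, stale = None, float("-inf"), 0

for generation in range(50):                                  # maximum-generation cap
    scored = sorted(((evaluate(a), a) for a in population), key=lambda t: t[0], reverse=True)
    top_fit, top_arch = scored[0]
    parents = [a for _, a in scored[:4]]                      # selection: keep the fittest
    offspring = [mutate(random.choice(parents)) for _ in range(4)]
    population = [a for _, a in scored[:-4]] + offspring      # replace the 4 weakest
    if top_fit > best_fit:
        best, best_fit, stale = top_arch, top_fit, 0
    else:
        stale += 1
    if stale >= 10:                                           # no-improvement stopping rule
        break

print("best architecture found:", best)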
in automating the design of neural networks. While this performance was commend-
able at the time, subsequent techniques have surpassed it, reflecting the rapid
advancements in the field and the continual refinement of search strategies and
architectures.
Hierarchical Evolution, introduced later the same year, mimics the modularized
(cell-based) design pattern typically employed by human experts, enabling it to
discover architectures that outperform many manually designed models for image
classification. This approach reduced the error rate to 3.63%, demonstrating the effi-
cacy of incorporating hierarchical structures into the architecture search process.
Further refinements in Evolutionary methods, as seen in AmoebaNet-A (2018),
brought the error rate down to 3.34%. This improvement reflects the modification of
the tournament selection Evolutionary Algorithm by introducing an age property to
favor the younger genotypes.
CNN-GA, also introduced in 2018, slightly improved on this with an error rate of
3.22%, highlighting the effectiveness of genetic algorithms specifically customized
to convolutional neural networks. A more substantial advancement in accuracy came from
LEMONADE (2018), which achieved an error rate of 2.58%. This improvement can
be attributed to its Lamarckian inheritance mechanism, allowing child networks to
inherit and benefit from the predictive performance of their trained parents, enhancing
the efficiency of the architecture search process. While LEMONADE’s primary
variant excelled in accuracy, a secondary version of the algorithm produced an error
rate of 4.57%, using significantly fewer parameters (0.5 million compared to the
original’s 13.1 million). This result highlights the inherent trade-off between model
complexity and accuracy in NAS. Reducing the number of parameters often leads to
less accurate models, but in this case, the parameter-efficient variant of LEMONADE
still performs competitively, making it suitable for applications where computational
resources are limited.
The evolution of NAS methods has also focused on reducing model complexity, as
reflected in the number of parameters required for each architecture. Early methods,
such as LargeEvo and AmoebaNet-A, required substantial model sizes, with the latter
using 3.2 million parameters. In contrast, CNN-GA managed to reduce this to 2.9 million parameters.
The field of Evolutionary Neural Architecture Search (NAS) offers numerous oppor-
tunities for exploration and innovation. Future research could focus on enhancing
genetic operations by investigating more sophisticated techniques, such as advanced crossover and mutation strategies.
2.5 Summary
Bibliography
3.1 Introduction
The rapid advancement of deep learning has revolutionized numerous fields, from
computer vision to natural language processing. However, the effectiveness of deep
learning models heavily relies on the choice of their architectures. Traditionally,
designing these architectures has been a labor-intensive and expert-driven process. To
address this challenge, Neural Architecture Search (NAS) has emerged as a promising
solution, aiming to automate the architecture design process.
Within the realm of NAS, various techniques have been explored, including Evolu-
tionary Algorithms and reinforcement learning. While these methods have shown
promise, they are often computationally expensive and time-consuming. In contrast,
Gradient-based NAS techniques have gained attention for efficiently exploring the
search space.
Among these Gradient-based approaches, Differentiable Architecture Search
(DARTS) stands out for its innovative formulation of the architecture search problem
as a continuous optimization task. By allowing for the simultaneous optimization
of model parameters and architecture through Gradient descent, DARTS signifi-
cantly reduces the computational costs associated with traditional NAS methods. This
unique framework facilitates exploration of the architecture search space, enabling
rapid convergence to high-performance models.
In this chapter, we will provide a comprehensive overview of Gradient-based NAS,
focusing specifically on DARTS, a basic method in this category. We will begin by
discussing the fundamental concepts of Gradient-based NAS, transitioning into a
detailed examination of the DARTS methodology. By the end of this chapter, readers
will gain a clearer understanding of how Gradient-based techniques, particularly
DARTS, are reshaping the landscape of neural network design.
The search space in DARTS is a central component that defines the possible archi-
tectures the method can explore during the optimization process. It consists of a
collection of candidate operations and their arrangement within a modular framework
known as cells.
A. Cell Representation
In DARTS, the architecture is built from stacked cells, where each cell is represented
as a Directed Acyclic Graph (DAG) structure as shown in Fig. 3.1. This structure
allows for the flow of information from one operation to another, enabling complex transformations within the cell.
This approach allows for efficient exploration of the search space, as the opti-
mization focuses on a smaller, more manageable unit rather than the entire network.
Gradient-based NAS methods have demonstrated the ability to find highly effec-
tive cell architectures with reduced computational resources compared to traditional
search methods.
B. Candidate Operations
DARTS explores a diverse set of operations within a Convolutional Neural Network
(CNN) cell, encompassing seven potential operations listed in Table 3.1. Each oper-
ation represents a different way to process data within the network structure. Impor-
tantly, there is also a zero operation which signifies no connection between nodes,
providing flexibility in architectural configurations.
Initially, these operations are placed between each pair of nodes within the cell,
with their associated weights set to zero. As the search progresses, DARTS dynam-
ically learns the strengths and weights for each operation through a Gradient-based
optimization process. This iterative approach allows DARTS to adaptively adjust the
contributions of each operation within the cell, aiming to maximize the performance
of the neural network architecture on the given task.
In DARTS, the base network utilized during the architecture search phase is a rela-
tively simple structure consisting of eight cells, as shown in Fig. 3.2. Within this
configuration, the third and sixth cells serve as reduction cells, which are specifi-
cally designed to down-sample the feature maps, effectively managing the spatial
dimensions. The remaining cells are designated as normal cells, focusing on feature
extraction through various operations.
Fig. 3.3 Difference in architecture depths during search and evaluation phases
While this base network provides a foundational framework for exploring different
cell configurations, it is important to note that the evaluation of the discovered cell
architecture occurs within a deeper network that includes 20 cells, as illustrated in
Fig. 3.3. This deeper network evaluation allows for a more comprehensive assessment
of the architecture’s performance across a wider range of complexities and input
variations.
One of the critical challenges in DARTS arises from the inherent differences in
behavior between shallow and deep networks. The configurations identified during
the architecture search phase may not necessarily yield optimal results when eval-
uated within the deeper network. This discrepancy is commonly referred to as the
“depth gap.” The depth gap highlights the potential pitfalls of relying solely on shallow
architectures for guiding the search process, as they may not capture the full range
of interactions and dependencies present in deeper architectures.
The Gradient-based search strategy in DARTS comprises two key steps: Continuous
Relaxation and Bi-level Optimization. Continuous relaxation transforms discrete
architectural decisions into a continuous space, allowing Gradients to flow smoothly
through architecture parameters during optimization. This transformation enables
the integration of architecture search directly into the training process, significantly
enhancing the overall search efficiency.
Bi-level optimization further refines this approach by simultaneously handling
the search for optimal architectures and model training. This framework allows for
the iterative refinement of both the architecture and its weights. Below, we explain
these steps in detail:
A. Continuous Relaxation
In DARTS-based methods, a key aspect is the placement of mixed operations on the
edges connecting nodes within a cell. In this approach, the mixed operation on each
edge (i, j) of the directed acyclic graph, denoted ō^(i,j), is a linear combination of all the
candidate operations. Every candidate operation is assigned a weight, and these weights,
collectively represented as α, are normalized with a softmax so that they sum to 1.
As a result, every edge embodies a weighted combination of all operations within
the search space. The variable α can be fine-tuned using Gradient descent. The
goal is to find the optimal cell topologies by learning a collection of variables (α )
that specify how the operations should be combined for a provided set of nodes.
The categorical decision-making process for selecting operations is converted into a
Softmax operation encompassing all feasible operations in DARTS, as described in
Eq. (3.2).
ō^(i,j)(x) = Σ_{o∈O} [ exp(α_o^(i,j)) / Σ_{o′∈O} exp(α_{o′}^(i,j)) ] · o(x)    (3.2)
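The mixed operation of Eq. (3.2) can be expressed directly in code. The PyTorch sketch below is a minimal illustration; the candidate operation set, the channel count, and the zero initialization of the architecture parameters are simplifying assumptions rather than the full DARTS operation list.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Weighted sum of candidate operations on one edge (cf. Eq. 3.2)."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),                      # standard 3x3 conv
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),     # depthwise 3x3 conv
            nn.MaxPool2d(3, stride=1, padding=1),                             # pooling
            nn.Identity(),                                                    # skip connection
        ])
        # One architecture parameter alpha per candidate operation.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)       # softmax over the candidates
        return sum(w * op(x) for w, op in zip(weights, self.ops))

edge = MixedOp(channels=16)
out = edge(torch.randn(2, 16, 32, 32))
print(out.shape)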
B. Two-Level Optimization
Following the continuous relaxation, the subsequent step involves two-level opti-
mization, which seeks to simultaneously acquire both α (operation strength param-
eters) and the weights (ŵ) associated with every operation in the network (Fig. 3.5).
The Ltrain (training loss) and Lval (validation loss) are influenced by the network
Fig. 3.4 Continuous relaxation of the search space in DARTS cell. a Visualization of the DAG
corresponding to cell structure b Shows mixed operation. (The count of operations positioned on
the edges linking every pair of nodes is indicated by N.)
Fig. 3.5 DARTS Summary a The operations at the edges are initially unspecified. b Continuous relaxation of the search space c Resolution of a two-level
optimization problem to enhance mixing probabilities and network weights simultaneously. d Utilization of acquired mixing probabilities to deduce the final
architecture
configuration as well as the weights ŵ of the network. In the context of architecture
search, the objective is to obtain α∗ that minimizes the validation loss Lval(ŵ∗, α∗),
where the optimal weights ŵ∗ for the given network are found by minimizing the
training loss: ŵ∗ = argmin_ŵ Ltrain(ŵ, α∗). This challenge can be formulated as a
two-level optimization problem, as denoted by Eqs. (3.3) and (3.4) below:

min_α Lval(ŵ∗(α), α)    (3.3)

s.t. ŵ∗(α) = argmin_ŵ Ltrain(ŵ, α)    (3.4)

Solving the inner problem (Eq. 3.4) to full convergence for every update of α would be
prohibitively expensive. Instead of pursuing full convergence through continuous training
for the inner optimization, the approach approximates ŵ∗(α) by adjusting ŵ in one training
iteration, so that the gradient of the validation loss with respect to α is approximated as

∇_α Lval(ŵ∗(α), α) ≈ ∇_α Lval(ŵ − ξ ∇_ŵ Ltrain(ŵ, α), α)    (3.6)

where ŵ represents the existing weights and ξ represents the learning rate. The formula
to update the weights ŵ based on the training loss is given as follows:

ŵ∗ = ŵ − ξ ∇_ŵ Ltrain(ŵ, α)    (3.7)

In these equations, ∇_ŵ Ltrain represents the gradient of the training loss with respect to the
weights ŵ, and ∇_α Lval represents the gradient of the validation loss with respect to the
architecture parameters α. The step size of the update is determined by the learning rate ξ.
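A minimal, runnable sketch of this alternating scheme is shown below. It uses the first-order simplification in which the current weights stand in for ŵ∗(α) (i.e., ξ = 0 in Eq. (3.6)) rather than the full unrolled gradient, and the toy model, layer sizes, optimizers, and synthetic data are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy "search" model: two candidate linear operations mixed by a softmax
# over architecture parameters alpha; only the alternating update matters here.
class TinySearchModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.op_a = nn.Linear(8, 2)
        self.op_b = nn.Linear(8, 2)
        self.alpha = nn.Parameter(torch.zeros(2))    # architecture parameters

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return w[0] * self.op_a(x) + w[1] * self.op_b(x)

model = TinySearchModel()
weights = [p for n, p in model.named_parameters() if n != "alpha"]
w_opt = torch.optim.SGD(weights, lr=0.05)            # optimizer for the network weights
a_opt = torch.optim.Adam([model.alpha], lr=0.01)     # optimizer for alpha

x_train, y_train = torch.randn(32, 8), torch.randint(0, 2, (32,))
x_val,   y_val   = torch.randn(32, 8), torch.randint(0, 2, (32,))

for step in range(100):
    # Step 1: update the weights on the training loss (inner problem, Eq. 3.7).
    w_opt.zero_grad()
    F.cross_entropy(model(x_train), y_train).backward()
    w_opt.step()

    # Step 2: update alpha on the validation loss (outer problem, Eq. 3.3),
    # treating the current weights as a one-step approximation of w*(alpha).
    a_opt.zero_grad()
    F.cross_entropy(model(x_val), y_val).backward()
    a_opt.step()

print("learned operation mixing weights:", F.softmax(model.alpha, dim=0))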
In DARTS, the process of obtaining the best neural architecture begins with the
exploration of a continuously relaxed architecture search space. Instead of making
early commitments to discrete architectural choices, DARTS employs a weighted
combination of possible operations. This flexibility allows the optimization process
to utilize gradient descent effectively, enabling the search for optimal configurations
without being constrained by binary decisions.
After the optimization phase, the next step is to convert this relaxed representation
into a discrete cell architecture. This is achieved by selecting the top-ranked opera-
tions based on their learned weights. Specifically, DARTS retains the operations that
have the highest probabilities, ensuring that the most promising configurations are
incorporated into the final model.
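The sketch below illustrates this discretization step on hypothetical learned architecture parameters: after the softmax, the strongest non-zero operation is kept on each edge. Note that the full DARTS procedure additionally retains only the two strongest incoming edges per node, a detail omitted here for brevity.

import torch
import torch.nn.functional as F

OPS = ["conv3x3", "depthwise_conv3x3", "max_pool3x3", "identity", "zero"]

# Hypothetical learned architecture parameters: one row of alphas per edge.
alphas = torch.randn(4, len(OPS))            # 4 edges, |OPS| candidates each

probs = F.softmax(alphas, dim=-1)            # mixing probabilities per edge
discrete_cell = []
for edge_id, p in enumerate(probs):
    p = p.clone()
    p[OPS.index("zero")] = 0.0               # never select the 'no connection' operation
    discrete_cell.append((edge_id, OPS[int(p.argmax())]))

print(discrete_cell)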
This technique allows for a more informed and data-driven approach to neural
architecture design, ultimately leading to architectures that are not only efficient but
also well-suited to the tasks they are intended to perform. By utilizing the strengths
of both continuous relaxation and discrete selection, DARTS effectively bridges
the gap between theoretical exploration and practical implementation, resulting in
high-performing neural networks tailored for diverse applications.
In this section, we present the final cell architectures (both normal and reduction),
learned using the DARTS framework. Each DARTS cell has two inputs, c_{k-1}
and c_{k-2}, which represent the output and input of the previous cell, respectively.
The architecture is designed to produce a single output, c_{k}. Figure 3.6 shows the
normal cell learned through DARTS. Normal cells are designed for feature extraction
and consist of various operations, such as convolutions and pooling, optimized to
capture essential patterns in the data. These configurations enable the network to
effectively process input while maintaining computational efficiency.
Reduction cells, on the other hand, are responsible for down-sampling the feature
maps. They help manage the spatial dimensions of the data as it passes through the
network, allowing for deeper and more abstract feature representations. Together,
these learned normal and reduction cells form the building blocks of the target
architecture, ensuring robust performance across various tasks. Figure 3.7 shows
the reduction cell architecture learned through DARTS.
In the DARTS framework, once the cell architectures are learned, both normal and
reduction cells are utilized to construct a deeper architecture, typically comprising
20 cells, as shown in Fig. 3.8. This 20-cell deep architecture is then employed for
the evaluation of the learned cells, allowing for an assessment of their performance
on various tasks.
The decision to use a shallower network during the architecture search process
is primarily motivated by the need to save computational resources. By limiting
the depth of the network during the search phase, the process becomes more effi-
cient, enabling faster experimentation and iteration. This approach helps in rapidly
identifying promising architectures without incurring the high computational costs
associated with training deep networks.
Once optimal cell architectures are established, they are integrated into the deeper
model for thorough evaluation, ensuring that the discovered configurations are both
effective and practical for deployment. The deeper target network is then trained
from scratch and evaluated for performance.
3.3 Results and Discussion
In this section, we present the results of the DARTS method and compare its perfor-
mance with few established NAS methods, including Reinforcement Learning-based
methods and Evolutionary Algorithms. Our evaluation is based on various bench-
marks, focusing on accuracy, efficiency, and the quality of the discovered architec-
tures. Table 3.2 shows the performance of various Neural Architecture Search (NAS)
methods on the CIFAR-10 dataset.
Evolutionary Algorithms have proven to be effective in the domain of NAS on
image classification tasks. NSGANETv1 showcased both accuracy and efficiency,
achieving an impressive 97.98% accuracy within 27 GPU days. OSNAS followed suit
with a noteworthy accuracy of 97.44% in 4 GPU days. These methods demonstrate the
capacity of Evolutionary approaches to explore the architecture space and uncover
high-performing solutions, though with varying computational demands.
Reinforcement Learning techniques have also exhibited promising results in NAS
endeavors. MetaQNN attained an accuracy of 93.08% on the CIFAR-10 dataset with
11.2 million parameters, requiring a substantial 100 GPU days. BlockQNN demon-
strated a remarkable accuracy of 97.42%, utilizing reinforcement learning techniques
Table 3.2 Comparative evaluation of different NAS methods on the CIFAR-10 dataset

Reference    Search method            Accuracy (%)   Parameters (Millions)   Total GPU days
MetaQNN      Reinforcement learning   93.08          11.2                    100
BlockQNN     Reinforcement learning   97.42          3.3                     96
NSGANETv1    Evolutionary Algorithm   97.98          2.9                     27
OSNAS        Evolutionary Algorithm   97.44          3.3                     4
DARTS        Gradient optimization    97.24          3.3                     4
within a span of 96 GPU days. These results highlight the potential of reinforcement
learning to guide the search process effectively, leading to competitive architectures.
Gradient optimization-based methods, emphasizing computational efficiency,
have displayed strong performance on image classification datasets
like CIFAR-10. DARTS, as an exemplar, reached an accuracy of 97.24% with 3.3
million parameters, all in just 4 days of GPU time. This method underscores the
advantage of rapid exploration through its efficient search process.
In summary, the performance comparison of various NAS methods on the CIFAR-
10 dataset reveals distinct trends among different search paradigms. Evolutionary
Algorithms have showcased competitive accuracies, with methods like NSGANETv1
and OSNAS demonstrating notable efficiency. Reinforcement learning approaches,
exemplified by MetaQNN and BlockQNN, have proven effective in achieving high
accuracies, albeit with varying computational investments. Gradient optimization
techniques, as demonstrated by DARTS, emphasize computational efficiency and
rapid exploration, achieving remarkable accuracies in short periods. The choice of
NAS method depends on a trade-off between accuracy and computational resources,
with each approach catering to different preferences and priorities.
Future research directions for Gradient-based NAS should focus on addressing the
challenges associated with the determination of optimal cell structures. Developing
methods that can effectively navigate the complexities of cell design will enhance
the search process. Additionally, overcoming the depth gap encountered in DARTS
during the transition from architecture search to evaluation is crucial; this may
involve innovating techniques that ensure smoother transitions and better perfor-
mance assessments. Moreover, refining the operations utilized in DARTS to minimize
the impact of irrelevant characteristics will be essential for improving the efficiency
and accuracy of the search process. On the other hand, the strengths of cell-based
search spaces should be leveraged further. Future studies could explore novel config-
urations and combinations of candidate operations within cells to maximize flexi-
bility in design. Additionally, investigating ways to customize stacked cells could
lead to more adaptive and efficient neural architectures, ultimately contributing to
the advancement of automated design methodologies in deep learning.
3.5 Summary
Bibliography
12. C.-H. Hsu et al., MONAS: multi-objective neural architecture search. www.tensorflow.org/tut
orials/deep. (Online)
13. Z. Zhong et al., BlockQNN: Efficient Block-wise Neural Network Architecture Generation
14. C.J.C.H. Watkins, Learning from delayed rewards, pp. 1–234. Ph.D. Thesis (1989)
15. V. Mnih et al., Human-level control through deep reinforcement learning. Nature 518 (2015).
https://doi.org/10.1038/nature14236
16. L.-J. Lin. Reinforcement Learning for Robots using Neural Networks. Carnegie Mellon
University, USA (1992).
17. M. Längkvist, L. Karlsson, A. Loutfi, Inception-v4, Inception-ResNet and the impact of residual
connections on learning. Pattern Recognit. Lett. 42(1), 11–24 (2014). http://arxiv.org/abs/1512.
00567. (Online)
18. X. Chen, L. Xie, J. Wu, Q. Tian, Imagenet classification with deep convolutional neural
networks. https://github.com/. (Online)
19. G. Huang, S. Liu, L. Van Der Maaten, K.Q. Weinberger, CondenseNet: An Efficient DenseNet
using Learned Group Convolutions
20. G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely connected convolutional
networks, in Proceedings of the 30th IEEE Conference on Computer Vision and Pattern
Recognition, CVPR 2017, vol. 217 (2017), pp. 2261–2269. https://doi.org/10.1109/CVPR.201
7.243
21. Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521, 436–444 (2015). https://doi.
org/10.1038/nature14539
22. J. Peters, S. Schaal, Policy Gradient methods for robotics, in IEEE International Conference
on Intelligent Robots and Systems (2006), pp. 2219–2225. https://doi.org/10.1109/IROS.2006.
282564
23. Z. Lu et al., Multi-objective Evolutionary design of deep convolutional neural networks for
image classification. https://github.com/mikelzc1990/nsganetv1. (Online)
24. H. Zhang, Y. Jin, K. Hao, Evolutionary search for complete neural network architectures
with partial weight sharing. IEEE Trans. Evol. Comput. 26(5), (2022). https://doi.org/10.1109/
TEVC.2022.3140855
25. J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR) (2018), pp. 7132–7141. http://openaccess.thecvf.com/content_cvpr_2018/html/Hu_Squeeze-and-Excitation_N
etworks_CVPR_2018_paper.html. (Online)
26. F. Juefei-Xu, V.N. Boddeti, M. Savvides, Local Binary Convolutional Neural Networks (2017).
https://doi.org/10.1109/CVPR.2017.456
27. L. Chen, D. Alahakoon, NeuroEvolution of augmenting topologies with learning for data
classification, in 2006 International Conference on Information and Automation (2006). https://
doi.org/10.1109/ICINFA.2006.374100
28. X. Chen, L. Xie, J. Wu, Q. Tian, Progressive differentiable architecture search: bridging the
depth gap between search and evaluation. https://github.com/. (Online)
29. H. Cai, L. Zhu, S. Han, Proxylessnas: direct neural architecture search on target task and
hardware. https://github.com/MIT-HAN-LAB/ProxylessNAS. (Online)
30. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way
to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014)
31. S. Ali, M.A. Wani, Recent trends in neural architecture search systems, in 2022 21st IEEE
International Conference on Machine Learning and Applications (ICMLA), Nassau, Bahamas
(2022), pp. 1783–1790. https://doi.org/10.1109/ICMLA55696.2022.00272
32. S. Ali, M.A. Wani, Gradient-based neural architecture search: a comprehensive evaluation.
Mach. Learn. Knowl. Extr. 5, 1176–1194 (2023). https://doi.org/10.3390/make5030060
Chapter 4
Efficient Training of Deep Learning
Architectures
4.1 Introduction
The development of deep learning models involves two key stages: architecture
generation and training. Traditionally, creating neural architectures involves a manual
process requiring extensive adjustments and deep expertise. Recently, however, there
has been a significant shift with the advent of automated methods for architecture
generation, such as Neural Architecture Search (NAS). NAS represents a major
advancement by automating the architecture generation process, which has led to
considerable improvements in model development.
While substantial advancements have been made in architecture generation—
from manual to automated methods—less focus has been given to the training phase
of deep learning model development. Traditional training techniques have domi-
nated this stage. To address this gap, a novel training method based on a coarse-to-
fine-tuning approach is introduced. This method starts with coarse training of the
target architecture, allowing the neural network to learn general features applicable
to various cases but with lower accuracy. In the subsequent fine-tuning stage, this
initial broad understanding is refined to enhance task-specific accuracy. The fine-
tuning process incorporates two techniques: Simple Selective Freezing (SSF) and
Progression-Based Selective Freezing (PSF). These techniques selectively freeze
certain layers during fine-tuning to better control adaptation and improve overall
classification accuracy.
To evaluate the effectiveness of the new training method, experiments using a
20-cell deep architecture generated through a Gradient-based architecture search
were conducted. The results on the CIFAR-10 dataset demonstrate that training the
architecture with the new coarse-to-fine-tuning approach yields improved accuracy.
Additionally, experiments were performed to assess the effectiveness of this approach
on the same architecture using the CIFAR-100 and MNIST datasets for evaluating
performance across different datasets.
The process of deep learning involves two main stages: architecture generation and
training (Fig. 4.1). In the architecture generation stage, the structure of the neural
network is established within a predefined search space, using either manual or
automated methods. Following this, the defined architecture undergoes training on
the specific task it is intended to perform.
In traditional deep learning training methods, neural network weights are initial-
ized randomly and then optimized using Gradient-based techniques like stochastic
gradient descent (SGD). During this process, the network iteratively adjusts its
parameters to minimize a predefined loss function, computing gradients and updating
weights over multiple epochs. Training continues until the model converges or
achieves an acceptable level of performance. However, this approach can some-
times lead to suboptimal results. This issue arises when adjustments in earlier layers
alter the weight distribution throughout the network, which can negatively impact
learning in subsequent layers and result in less-than-ideal solutions.
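For reference, the conventional procedure described above amounts to the following PyTorch sketch, in which all layers are updated simultaneously at every step; the placeholder model, synthetic data, and hyperparameters are illustrative only.

import torch
import torch.nn as nn

# Placeholder model and synthetic data standing in for a real architecture
# and dataset; only the structure of the conventional training loop matters here.
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(64, 3, 32, 32)
labels = torch.randint(0, 10, (64,))

for epoch in range(5):                       # in practice: many epochs over the full dataset
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)    # forward pass and loss
    loss.backward()                          # gradients for all (randomly initialized) weights
    optimizer.step()                         # simultaneous update of every layer
    print(f"epoch {epoch}: loss {loss.item():.4f}")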
Training the entire network simultaneously can cause changes in earlier layers to
affect the training of later layers, potentially impairing their ability to learn mean-
ingful representations. This issue is especially pronounced in deep networks, where
the cumulative effects across multiple layers can hinder the network’s ability to
converge to an optimal solution. Addressing this challenge, a new training approach
designed to enhance the accuracy of deep learning models is developed, which is
detailed in the following section.
Fig. 4.1 Two-stage process of model development with conventional training in deep learning
4.3 Efficient Training Via Coarse-to-Fine-Tuning
While significant attention has been given to architecture generation in deep learning,
the training phase has received comparatively less focus. To bridge this gap, a coarse-
to-fine-tuning approach is introduced, which incorporates selective freezing to refine
the training process and address this imbalance.
Initially, the target network is trained to capture general features that apply across
a range of scenarios, aiding in the classification of various classes but with reduced
accuracy. The network begins with random weights and is trained through multiple
epochs, adjusting parameters iteratively to minimize the loss function. This coarse
training phase follows traditional deep learning methods but has a different objective:
rather than aiming to achieve the final model, it focuses on learning broad, general
features. This foundational step ensures that the network establishes a solid base by
identifying generalizable patterns in the data, which will be refined in subsequent
stages.
The new method, known as selective freezing, provides a simple yet effective solution
to mitigate the impact of changes in certain layers on the training of other layers within
the network. Selectively freezing a portion of the network’s weights during training,
can reduce the influence of adjustments in these frozen layers on subsequent layers.
This approach allows for more focused parameter tuning and leads to improved
learning outcomes.
Selective freezing supports coarse-to-fine-tuning, which starts with training the
network to learn coarse features and then refines the model’s performance through
focused fine-tuning. Figure 4.2 illustrates the block diagram of the coarse-to-fine-
tuning approach with selective freezing.
Two methods are presented for implementing selective freezing: Simple Selective
Freezing and Progression-Based Selective Freezing. Both methods facilitate incre-
mental weight learning, allowing the model to improve its accuracy through iterative
Fig. 4.2 Two-stage model development process with the new coarse-to-fine-tuning approach:
Utilizing simple and progression-based selective freezing during fine-tuning
refinement. The following sections detail these two approaches to selective freezing
during fine-tuning.
A. Simple Selective Freezing
In a network with depth N, the Simple Selective Freezing (SSF) technique divides the
weights ω learned during coarse training into two parts: the weights ω1 associated
with the first N/2 layers (or cells) are selectively frozen, while the weights ω2 of the
remaining N/2 layers are optimized during fine-tuning. The weights ω associated
with the entire network that is obtained during the coarse training can be represented
using concatenation notation as follows:
ω = [ω1 , ω2 ] (4.1)
The frozen layers, representing the initial N/2 layers, retain essential general
features learned during coarse training, while the unfrozen layers, closer to the
output, are fine-tuned to adapt to specific task objectives or requirements.
The weight update process in the SSF technique is shown by Eqs. (4.2) and (4.3):

ω1(p + 1) = ω1(p)    (4.2)

ω2(p + 1) = ω2(p) − ξ ∇ω2(p) Ltrain(ω(p))    (4.3)
Here, ω(p) represents the overall weights of the network, ξ is the learning rate param-
eter, and ∇ω2 (p) Ltrain (ω(p)) represents the gradient of the training loss function for
updating the weights of unfrozen layers. ω1 (p) and ω1 (p + 1) denote the weights of
the frozen segment before and after fine-tuning, respectively. Similarly, ω2 (p) and
ω2 (p + 1) denote the weights of the unfrozen segment before and after fine-tuning,
respectively. The illustration of simple selective freezing on an “N-layer” network is
given in Fig. 4.3.
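In a framework such as PyTorch, the SSF split of Eq. (4.1) can be realized by disabling gradients for the first N/2 layers and passing only the remaining parameters to the optimizer, as in the illustrative sketch below; the stand-in model, depth, and learning rate are assumptions.

import torch
import torch.nn as nn

# Stand-in for a coarsely trained N-layer (or N-cell) network.
N = 8
model = nn.Sequential(*[nn.Linear(16, 16) for _ in range(N)])

# SSF: freeze the weights w1 of the first N/2 layers, fine-tune the rest (w2).
for layer in list(model)[: N // 2]:
    for param in layer.parameters():
        param.requires_grad = False           # w1(p + 1) = w1(p), Eq. (4.2)

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)   # only w2 is updated, Eq. (4.3)

x, y = torch.randn(32, 16), torch.randn(32, 16)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
optimizer.step()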
Fig. 4.3 Illustration of simple selective freezing on an “N-layer” network. a Depicts the interme-
diate model obtained after the coarse training phase with weights ω, which consists of two parts
i.e., ω1 (p) and ω2 (p). b Shows the final model, after selective freezing, with the first half of the
weights remaining unchanged (i.e., ω1 (p + 1) = ω1 (p)) and the second part of weights is updated
to ω2 (p + 1). Note that the frozen part of the network is shown in the solid green block and the
updating part is shown in the solid yellow block
The steps of training the target network with the SSF technique are summarized
below in Algorithm 1.
IV. Divide the network into two segments, each consisting of N/2 layers (or cells), with weights ω1 and ω2 associated
with the two segments.
V. Freeze the weights ω1 associated with the first segment.
VI. While not converged: (Fine-tuning phase)
Update weights ω2 of the second segment by descending the gradient of the training loss
function (Ltrain) using the learning rate ξ:

ω2(p + 1) = ω2(p) − ξ ∇ω2(p) Ltrain(ω(p))    (4.5)
Here, ω(p) represents the overall weights of the network, ξ is the learning rate param-
eter, ∇ω1 (p) Ltrain (ω(p)) and ∇ω3 (p) Ltrain (ω(p)) represent the gradient of the training
loss functions in the (p + 1)th iteration. ω1 (p) and ω1 (p + 1) denote the weights
of the first (unfrozen) segment before and after fine-tuning, respectively. Similarly,
(ω2 (p), ω2 (p + 1)), and (ω3 (p), ω3 (p + 1)) denote the weights of the second (frozen)
segment and third (unfrozen) segment before and after fine-tuning, respectively. Note
that, in the first iteration, there are no layers in the first segment. In each successive
iteration, the first segment grows, the second segment shifts downwards and the third
segment shrinks. By freezing specific layers in each iteration, the training process
is stabilized by preventing previously learned parameters (of the frozen segment)
from excessively influencing the remaining layers. This balanced approach ensures
that the unfrozen layers can adapt more effectively to the task at hand without being
overwhelmed by the frozen ones. Figure 4.4 illustrates the PSF technique for an
N-layer network.
The steps for training the target network with PSF are outlined below in Algorithm
2.
Fig. 4.4 Illustration of progression-based selective freezing. a The weights of the intermediate
model obtained after the coarse training. b During the Ist iteration, the weights of the first m(=K)
layers are frozen while the weights of the remaining layers are updated. c During the 2nd iteration,
the weights of the second set of m layers are frozen while the weights of the remaining layers are
updated. d During last iteration, the weights of the last but one m layers are frozen while the weights
of the remaining layers are updated. Note that frozen part of network is shown in solid green block
and updated part is shown in solid yellow block
IV. Divide the network layers into three segments having 'l', 'm', and 'n' number of layers respectively, with
ω1, ω2, and ω3 representing the weights of each segment.
V. Do: (Fine-tuning phase)
l=0 for the first iteration, l = l+K otherwise, starting at 1st position
m = K starting at (l+1)th position ,
n = (N-l-m). starting at (l+m+1)th position
Freeze the weights (ω2) for the second segment of the network
While not converged
Update weights of the first segment (ω1) and last segment (ω3) by descending
the gradient of the training loss function (Ltrain) using the learning rate ξ:

ω1(p + 1) = ω1(p) − ξ ∇ω1(p) Ltrain(ω(p))
ω3(p + 1) = ω3(p) − ξ ∇ω3(p) Ltrain(ω(p))
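The sketch below illustrates the PSF schedule on a stand-in network: in each round a block of K consecutive layers is frozen while every other layer is fine-tuned, and the frozen window slides down the network until only the final block has never been frozen. The depth, block size K, epoch counts, and synthetic data are illustrative assumptions.

import torch
import torch.nn as nn

N, K = 8, 2                                    # network depth and frozen-block size (illustrative)
model = nn.Sequential(*[nn.Linear(16, 16) for _ in range(N)])
x, y = torch.randn(32, 16), torch.randn(32, 16)

# PSF: the frozen segment of K layers starts at the top and slides downwards.
for start in range(0, N - K, K):
    # Unfreeze everything, then freeze only the current window [start, start + K).
    for i, layer in enumerate(model):
        for p in layer.parameters():
            p.requires_grad = not (start <= i < start + K)

    optimizer = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.01)
    for _ in range(10):                        # a few fine-tuning epochs per iteration
        optimizer.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        optimizer.step()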
The proposed approach is versatile and can be applied to various deep learning
architectures. To illustrate its effectiveness, we specifically implement it in a 20-
layer architecture derived using DARTS. Traditionally, neural networks are trained
using standard training methods. To assess the new approach, the performance of the
target network trained with both conventional methods and our proposed technique
is compared. The results indicate that the new method consistently yields models
with improved accuracy. The following section provides a detailed discussion of the
application of the selective freezing training technique to the target architecture.
4.4 Experiments
The performance of the selective freezing training approach is evaluated in the context
of image classification tasks. Several well-known image classification datasets,
namely MNIST, CIFAR-10, and CIFAR-100 were used. An architecture search was
conducted specifically on the CIFAR-10 dataset. The target architecture created was
trained across all three datasets. The transferability feature of using one dataset
for creating a target architecture and training it with the proposed approach across
different datasets is also discussed.
4.4.1 Datasets
The CIFAR-10 dataset is a widely used benchmark consisting of 60,000 color images
divided into 10 classes, with 50,000 images used for training purposes and 10,000
images for testing. Each image in the dataset has a dimension
of 32 × 32 pixels. The CIFAR-10 dataset is publicly accessible at (https://www.cs.
toronto.edu/~kriz/CIFAR.html).
The CIFAR-100 is another widely used benchmark dataset consisting of 60,000
color images, similar to CIFAR-10, but is divided into 100 different classes instead
of 10. Each class has 600 images, with 500 images in the training set and 100 images
in the test set. The images are of size 32 × 32 pixels, just like in CIFAR-10. The
CIFAR-100 dataset poses a more challenging classification task due to its increased
number of classes, making it valuable for evaluating the robustness and generalization
capabilities of machine learning models. The dataset is accessible at (https://www.
cs.toronto.edu/~kriz/CIFAR.html).
The MNIST dataset is a well-known dataset used for image classification appli-
cations. It is a collection of grayscale handwritten digits. The dataset has 60,000
images for training and 10,000 images for testing.
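All three datasets are available through standard loaders; for instance, a torchvision-based setup might look like the sketch below, where the download directory and the minimal preprocessing are illustrative choices.

import torchvision
import torchvision.transforms as transforms

transform = transforms.ToTensor()              # minimal preprocessing for illustration

# Each call downloads the dataset to ./data on first use.
cifar10_train = torchvision.datasets.CIFAR10("./data", train=True, download=True, transform=transform)
cifar10_test = torchvision.datasets.CIFAR10("./data", train=False, download=True, transform=transform)
cifar100_train = torchvision.datasets.CIFAR100("./data", train=True, download=True, transform=transform)
mnist_train = torchvision.datasets.MNIST("./data", train=True, download=True, transform=transform)

print(len(cifar10_train), len(cifar10_test))   # 50000 10000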
The results of the new coarse training integrated with selective freezing techniques
for a target network are discussed below and are compared with the results of the
traditional training method.
A. Results of Coarse Training Integrated with SSF Technique
The impact of using different data proportions for training the target network using
the new training method is explored. The target model is trained with 60%, 80%,
and the entire training dataset for evaluating the new coarse training integrated with
the SSF technique.
To gain a comprehensive understanding of the effectiveness of the new training
approach, parallel experiments were conducted to compare it with the traditional
training method. Both approaches were subjected to an equal number of epochs,
corresponding to a given proportion of the dataset.
The results on CIFAR-10, presented in Table 4.1, clearly demonstrate the SSF
technique’s effectiveness in improving the target model’s accuracy. When utilizing
traditional training methods, the model attained accuracies of 85.12%, 89.09%, and
96.11% for 60%, 80%, and 100% of the training data, respectively. These results were
achieved across 400 epochs for 60% of the data, 500 epochs for 80% of the data, and
600 epochs for the entire dataset. In contrast, the coarse training integrated with the
SSF technique produced 94.10%, 95.59%, and 96.22% accuracy for the corresponding
data proportions over the same epoch counts.
Table 4.1 Performance comparison of coarse training integrated with SSF versus traditional
training on CIFAR-10
(P1 = parameters updated during coarse or traditional training; P2 = parameters updated during fine-tuning; all values in millions)

Training method                  Data used (%)   P1    P2     Average [(P1 + P2)/2]   Accuracy (%)
Traditional training             60              3.3   –      3.3                     85.12
Traditional training             80              3.3   –      3.3                     89.09
Traditional training             100             3.3   –      3.3                     96.11
Coarse-to-fine-tuning with SSF   60              3.3   1.65   2.5                     94.10
Coarse-to-fine-tuning with SSF   80              3.3   1.65   2.5                     95.59
Coarse-to-fine-tuning with SSF   100             3.3   1.65   2.5                     96.22
These findings highlight the improvements offered by the coarse training inte-
grated with the SSF technique, demonstrating its potential as a valuable tool for deep
learning model optimization.
B. Cross-Dataset Transferability Results of Coarse Training Integrated with the
SSF technique
The learned cell obtained by employing the new coarse training integrated with the
SSF technique on the CIFAR-10 dataset was used to evaluate its performance on
CIFAR-100 and MNIST datasets, to check the versatility of the new approach. The
results from the CIFAR-100 dataset are given in Table 4.2. With a training data
proportion of 60%, the new method improved accuracy from 74.29 to 75.56% over
a total of 400 epochs. At 80% of training data, accuracy with the new method was
79.16% compared to 77.54% using traditional training, spanning 500 epochs. With
complete training data, accuracy reached 80.35% with the new training method,
compared to 79.86% with the traditional training, across 600 epochs.
The experimental results for the MNIST dataset are shown in Table 4.3. Using the
traditional training method, accuracies of 88.19, 94.09, and 94.99% were achieved
for 60, 80, and 100% of the training data, respectively, with epochs set at 50, 60, and
70. In contrast, the new coarse training integrated with the SSF approach, involving
40, 50, and 60 epochs of coarse training followed by 10 epochs of fine-tuning for the
respective data proportions, significantly surpassed traditional training results. The
new method demonstrated enhanced accuracies of 99.51%, 99.55%, and 99.67%,
respectively, for the same data proportions.
The results in Tables 4.2 and 4.3 indicate that the new method not only enhances
performance compared to traditional training but is also highly transferable, where
an architecture learned on one dataset can be trained on other datasets. For instance,
when a network with cells learned on CIFAR-10 was trained using the new approach
on CIFAR-100, the accuracy increased by about 1.7% and 2.1% with the 60% and 80% training data proportions, respectively.
Table 4.2 Transferability performance comparison of coarse training integrated with SSF versus
traditional training on CIFAR-100
(P1 = parameters updated during coarse or traditional training; P2 = parameters updated during fine-tuning; all values in millions)

Training method                  Data used (%)   P1    P2    Average [(P1 + P2)/2]   Accuracy (%)
Traditional training             60              3.4   –     3.4                     74.29
Traditional training             80              3.4   –     3.4                     77.54
Traditional training             100             3.4   –     3.4                     79.86
Coarse-to-fine-tuning with SSF   60              3.4   1.7   2.6                     75.56
Coarse-to-fine-tuning with SSF   80              3.4   1.7   2.6                     79.16
Coarse-to-fine-tuning with SSF   100             3.4   1.7   2.6                     80.35
Table 4.3 Transferability performance comparison of coarse training integrated with SSF versus
traditional training on MNIST
(P1 = parameters updated during coarse or traditional training; P2 = parameters updated during fine-tuning; all values in millions)

Training method                  Data used (%)   P1    P2     Average [(P1 + P2)/2]   Accuracy (%)
Traditional training             60              3.3   –      3.3                     88.19
Traditional training             80              3.3   –      3.3                     94.09
Traditional training             100             3.3   –      3.3                     94.99
Coarse-to-fine-tuning with SSF   60              3.3   1.65   ≈2.5                    99.51
Coarse-to-fine-tuning with SSF   80              3.3   1.65   ≈2.5                    99.55
Coarse-to-fine-tuning with SSF   100             3.3   1.65   ≈2.5                    99.67
While the new method’s performance gains may appear relatively smaller on complete datasets, this
is primarily due to the inherent complexity of the task, where marginal improve-
ments are harder to achieve regardless of the training method employed. Thus,
the new method’s versatility in optimizing model performance across different
data scenarios underscores its potential for improving model performance for deep
learning applications.
C. Results of Coarse Training Integrated with the PSF Technique
The coarse training integrated with the Progression-Based Selective Freezing (PSF)
technique builds upon the SSF technique where freezing 50% of the weights yielded
notable improvements. The PSF technique is applied to the already learned weights
obtained during coarse training. The location of the frozen segment of the network
progressively changes as training progresses. Three distinctive variants of PSF are
examined, strategically freezing 1/4, 1/5, and 1/10th of the weights. The impact
of different freezing proportions on the performance of the trained models, on the
CIFAR-10 dataset, is examined.
Table 4.4 presents experimental results showing the impact of progressively
freezing 1/4th of the weights on the performance of the trained model. In the initial
iteration, where the first 1/4th of the weights (already learned during coarse training)
were frozen, an accuracy of 96.11% was noted. The subsequent iterations indicated
improved accuracies: 96.22%, and finally, 96.27%. The new approach involved two
main phases: coarse training and fine-tuning. During coarse training, the model
builds its foundational knowledge for 500 epochs. Following this, in each round
of fine-tuning, the model improves its performance further for 30 epochs. Moreover,
during fine-tuning only 2.5 million parameters are updated, in addition to the 3.3
million parameters updated during coarse training. The average number of parameters
updated in the coarse and fine-tuning phases is approximately 2.9 million.
Table 4.5 illustrates the impact of freezing 1/5th of the weights and modifying the
remaining ones, giving insights into improvement in accuracy and associated compu-
tational cost. In the initial iteration, where the first 1/5th of the weights (already learned during coarse training) were frozen, an accuracy of 96.25% was noted.
Table 4.4 Performance metrics of coarse training integrated with 1/4th weight freezing (Variant 1)
of PSF on CIFAR-10 dataset
Variant: coarse-to-fine-tuning with 1/4th weights frozen (Variant 1) in PSF

Iteration          Coarse training parameters P1 (M)   Fine-tuning parameters P2 (M)   Average [(P1 + P2)/2] (M)   Accuracy (%)
1st 1/4th frozen   3.3                                  2.5                             2.9                         96.11
2nd 1/4th frozen   3.3                                  2.5                             2.9                         96.22
3rd 1/4th frozen   3.3                                  2.5                             2.9                         96.27
Table 4.5 Performance metrics of coarse training integrated with 1/5th weight freezing (Variant 2)
of PSF on CIFAR-10 dataset
Variant: coarse-to-fine-tuning with 1/5th weights frozen (Variant 2) in PSF

Iteration          Coarse training parameters P1 (M)   Fine-tuning parameters P2 (M)   Average [(P1 + P2)/2] (M)   Accuracy (%)
1st 1/5th frozen   3.3                                  2.6                             2.95                        96.25
2nd 1/5th frozen   3.3                                  2.6                             2.95                        96.27
3rd 1/5th frozen   3.3                                  2.6                             2.95                        96.27
4th 1/5th frozen   3.3                                  2.6                             2.95                        96.29
Table 4.6 Performance metrics of coarse training integrated with 1/10th weight freezing (Variant 3)
of PSF on CIFAR-10 dataset
Variant: coarse-to-fine-tuning with 1/10th weights frozen (Variant 3) in PSF

Iteration           Coarse training parameters P1 (M)   Fine-tuning parameters P2 (M)   Average [(P1 + P2)/2] (M)   Accuracy (%)
1st 1/10th frozen   3.3                                   2.97                            3.14                        96.14
2nd 1/10th frozen   3.3                                   2.97                            3.14                        96.21
3rd 1/10th frozen   3.3                                   2.97                            3.14                        96.26
4th 1/10th frozen   3.3                                   2.97                            3.14                        96.32
5th 1/10th frozen   3.3                                   2.97                            3.14                        96.33
6th 1/10th frozen   3.3                                   2.97                            3.14                        96.35
7th 1/10th frozen   3.3                                   2.97                            3.14                        96.37
8th 1/10th frozen   3.3                                   2.97                            3.14                        96.39
9th 1/10th frozen   3.3                                   2.97                            3.14                        96.40
Future work can explore several key areas to enhance and validate the coarse-to-
fine-tuning approach. One focus can be on integrating this approach with a wider
variety of manually designed and automated architectures to assess its versatility and
effectiveness across different network designs.
Evaluating the scalability of the approach for large models and datasets, as well
as its computational demands, can be crucial for practical applications in resource-
constrained environments. Finally, comparing the coarse-to-fine-tuning approach
with other training techniques, such as weight-sharing methods, can provide insights
into its relative strengths and potential areas for improvement. Through these efforts,
one can refine and extend the approach, establishing its value and applicability in
deep learning model training.
4.6 Summary
Bibliography
1. L. Li, A. Talwalkar, Random search and reproducibility for neural architecture search, in
Proceedings of PMLR (2020), pp. 367–377
2. Z. Zhong et al., BlockQNN: efficient block-wise neural network architecture generation. IEEE
Trans. Pattern Anal. Mach. Intell. 43(7), 2314–2328 (2021). https://doi.org/10.1109/TPAMI.
2020.2969193
3. B. Baker, O. Gupta, N. Naik, R. Raskar, Designing neural network architectures using reinforce-
ment learning, in Proceedings of the 5th International Conference on Learning Representations
(ICLR), Toulon, France, April 24–26 (2017)
4. B. Zoph, Q.V. Le, Neural architecture search with reinforcement learning, in Proceedings of
the 5th International Conference on Learning Representations (ICLR) 2017, Toulon, France,
April 24–26 (2017)
5. Y. Gu, Y. Cheng, C.L.P. Chen, X. Wang, Proximal policy optimization with policy feedback.
IEEE Trans. Syst. Man Cybern. Syst. 52(7), 4600–4610 (2022). https://doi.org/10.1109/TSMC.
2021.3098451
6. B. Zoph, G. Brain, V. Vasudevan, J. Shlens, Q.V. Le, G. Brain, Learning transferable architec-
tures for scalable image recognition, in Proceedings of the CVF Conference on Computer Vision
and Pattern Recognition (CVPR) (2018), pp. 8697–8787. https://doi.org/10.1109/CVPR.2018.
00907
7. E. Real, Large-scale evolution of image classifiers, in Proceedings of the 34th International
Conference on Machine Learning (ICML), vol. 70 (2017), pp. 2902–2911
8. E. Real, A. Aggarwal, Y. Huang, Q.V. Le, Aging evolution for image classifier architecture
search, in Proceedings of the AAAI Conference on Artificial Intelligence (2019), pp. 4780–4789
9. H. Liu, K. Simonyan, O. Vinyals, C. Fernando, K. Kavukcuoglu, Hierarchical representations
for efficient architecture search, in International Conference on Learning Representations
(2018). http://arxiv.org/abs/1711.00436. (Online)
10. D.E. Goldberg, K. Deb, A comparative analysis of selection schemes used in genetic algorithms,
in Foundations of Genetic Algorithms, vol. 1 (1991), pp. 69–93. Elsevier. https://doi.org/10.
1016/B978-0-08-050684-5.50008-2. ISSN 1081-6593, ISBN 9780080506845
11. T. Elsken, J. H. Metzen, F. Hutter, Efficient multiobjective neural architecture search via Lamar-
ckian evolution, in International Conference on Learning Representations (2019). http://arxiv.
org/abs/1804.09081. (Online)
12. S. Ali, M.A. Wani, Gradient-based neural architecture search: a comprehensive evaluation.
Mach. Learn. Knowl. Extr. 5(3), 1176–1194 (2023). https://doi.org/10.3390/make5030060
13. H. Liu, K. Simonyan, Y. Yang, DARTS: differentiable architecture search, in 7th International
Conference of the Learning Represent (ICLR) (2019), pp. 1–13
14. E. Real, A. Aggarwal, Y. Huang, Q.V. Le, Regularized evolution for image classifier architec-
ture search, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33 (2019),
pp. 4780–4789
15. X. Chen, L. Xie, J. Wu, Q. Tian, Progressive differentiable architecture search: bridging the
depth gap between search and evaluation, in Proceedings of the IEEE/CVF International
Conference on Computer Vision (2019), pp. 1294–1303. https://github.com/. (Online)
16. K. Nakai, T. Matsubara, K. Uehara, Att-DARTS: differentiable neural architecture search for
attention, in Proceedings of the IEEE International Joint Conference on Neural Networks
(IJCNN) (2020), pp. 1–8
17. G. Raskutti, M.J. Wainwright, Early stopping and non-parametric regression: an optimal data-
dependent stopping rule. J. Mach. Learn. Res. 15, 335–366 (2014)
18. W. Roth, F. Pernkopf, S. Member, Bayesian neural networks with weight sharing using Dirichlet
processes. IEEE Trans. Pattern Anal. Mach. Intell. 42(1), 246–252 (2020). https://doi.org/10.
1109/TPAMI.2018.2884905
19. S. Ali, M. A. Wani, Recent trends in neural architecture search systems, in 2022 21st IEEE
International Conference on Machine Learning and Applications (ICMLA), Nassau, Bahamas
(2022), pp. 1783–1790. https://doi.org/10.1109/ICMLA55696.2022.00272
Chapter 5
Generative Adversarial Networks
in Image Steganography
5.1 Introduction
Vanilla GAN, also known as the original GAN model, serves as the basic network on
which subsequent GAN architectures are built. Vanilla GAN has a straightforward
architecture consisting of a generator and a discriminator network that engages in a
competitive training process. The generator aims to produce realistic synthetic data
samples, while the discriminator strives to differentiate between real and synthetic
samples. Vanilla GAN is simple and easy to implement, making it an ideal starting
point for exploring its use in various applications. Despite its simplicity, Vanilla
GAN demonstrates remarkable capabilities in generating diverse and realistic data
distributions, laying the groundwork for advancing its application in various domains,
including image generation and steganography.
The block diagram of the vanilla GAN is given in Fig. 5.1. Multilayer Percep-
tron (MLP) is commonly used as the basic building block for both the generator
and discriminator networks in Vanilla GAN. MLPs are the simplest form of neural
networks, consisting of multiple layers of neurons with fully connected connections
between each layer.
One analogy characterizes GAN as a thief (the Generator) who forges currency
and a detective (the Discriminator) who tries to apprehend him. The detective must
become more efficient at spotting fake currency as it becomes more realistic-looking,
and vice versa.
The word “generative” refers to the model’s primary goal of producing new
synthetic data. Depending on the training dataset used, a GAN can learn to make
different types of synthetic data.
The word “adversarial” refers to the competitive, game-like dynamics between the
Generator and Discriminator, the two models that make up the GAN framework. The
two networks are constantly competing to outsmart one another: the Discriminator
becomes more effective in differentiating real samples from synthetic ones, while
the Generator becomes more efficient in producing synthetic data similar to the real
data.
The architecture of the generator network is given in Fig. 5.2. The generator uses
random noise, usually represented as a simple N-dimensional Gaussian random
vector, to produce synthetic data that closely resembles the distribution of real data.
To achieve this, the generator typically comprises a series of fully connected layers,
each followed by activation functions. As the noise data moves through these layers,
it undergoes a process of up-sampling, gradually transforming it into high-resolution
output images.
In essence, the generator functions as a network of interconnected layers, with
each layer playing a specific role in shaping the output. This network architecture
ensures that the generated synthetic data aligns with the desired characteristics of
the real data distribution. Once the data passes through all the layers, it reaches the
output layer, where necessary adjustments are made to ensure compatibility with the
discriminator network. Ultimately, the goal of the generator is to produce synthetic
samples that can effectively deceive the discriminator, so that it can no longer differentiate
between the real data and the synthetic data.
When processing data, the discriminator network analyzes both real and synthetic
samples, extracting features that differentiate between them. These features are then
used to make binary classification decisions, determining whether each input sample
is real or synthetic. As the data moves through the network, it undergoes a process
of down-sampling, gradually condensing the information to a binary classification
output.
The discriminator operates as a network of interconnected layers, with each layer
contributing to the discrimination process. The network architecture is designed to
effectively distinguish between real and synthetic data samples, providing feedback
to the generator for improving its ability to generate realistic data.
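A minimal MLP-based generator and discriminator of the kind described above might be defined as in the sketch below; the noise dimension, layer widths, and the assumption of 28 × 28 gray-scale images are illustrative choices rather than a prescribed design.

import torch
import torch.nn as nn

# Minimal MLP generator and discriminator for 28 x 28 gray-scale images.
noise_dim, img_dim = 100, 28 * 28

generator = nn.Sequential(
    nn.Linear(noise_dim, 256), nn.ReLU(),
    nn.Linear(256, 512), nn.ReLU(),
    nn.Linear(512, img_dim), nn.Tanh(),        # up-sample noise to an image-sized output
)

discriminator = nn.Sequential(
    nn.Linear(img_dim, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),                         # down-sample to a single real/fake score (logit)
)

z = torch.randn(16, noise_dim)
fake = generator(z)
score = discriminator(fake)
print(fake.shape, score.shape)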
min_G max_D V(D, G) = Ex∼pdata(x)[log D(x)] + Ez∼pz(z)[log(1 − D(G(z)))]    (5.1)
In this equation, G represents the generator network, D represents the discrimi-
nator network, x denotes the real data samples, z denotes the random noise input to
the generator, pdata (x) represents the distribution of real data, and pz (z) represents the
distribution of the noise vector z. The discriminator seeks to maximize both terms,
while the generator influences only the second term, which it seeks to minimize.
This loss function makes the generator and discriminator compete with each
other, leading to the convergence of both networks towards an equilibrium where the
generator produces realistic samples and the discriminator struggles to differentiate
between real and synthetic samples effectively.
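One adversarial training step corresponding to Eq. (5.1) can be sketched as follows. The tiny stand-in networks, optimizers, and batch of random "images" are placeholders, and the generator update uses the commonly adopted non-saturating form (maximizing log D(G(z))) instead of literally minimizing log(1 − D(G(z))).

import torch
import torch.nn as nn

# Tiny stand-in networks; see the MLP sketch above for fuller definitions.
G = nn.Sequential(nn.Linear(100, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 1))
bce = nn.BCEWithLogitsLoss()
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)

real = torch.randn(64, 784)                   # placeholder for a batch of real images
z = torch.randn(64, 100)

# Discriminator step: maximize log D(x) + log(1 - D(G(z)))  (Eq. 5.1).
opt_d.zero_grad()
d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
d_loss.backward()
opt_d.step()

# Generator step: fool the discriminator (non-saturating form).
opt_g.zero_grad()
g_loss = bce(D(G(z)), torch.ones(64, 1))
g_loss.backward()
opt_g.step()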
5.3 Deep Convolutional Generative Adversarial Networks
The DCGAN generator network is depicted in Fig. 5.5. The initial phase involves
initializing a random noise vector, typically denoted as z, with dimensions [100,
1]. This vector is then passed through a fully connected layer that linearly projects it
to a vector with dimensions [16384, 1].
Subsequently, in the reshaping step, the outputs from the previous phase are trans-
formed into a feature map of size 4 × 4 × 1024, which is fed into the network’s first
convolution layer using fractional strides.
The generator network comprises four convolution layers that use fractional
strides, each designed to up-sample the input feature maps; the spatial size of the
feature maps doubles at each layer (4 × 4 → 8 × 8 → 16 × 16 → 32 × 32 → 64 × 64). The architectural choices ensure that the generator
progressively refines the random noise input, generating increasingly detailed and
coherent images as it traverses through the network layers.
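The projection, reshape, and up-sampling path described above can be sketched in PyTorch as follows. The channel progression (4 × 4 × 1024 up to a 64 × 64 × 3 image) follows the description, while the kernel size, batch normalization, and activation choices are the usual DCGAN defaults and should be read as assumptions.

import torch
import torch.nn as nn

class DCGANGenerator(nn.Module):
    def __init__(self, z_dim=100):
        super().__init__()
        self.fc = nn.Linear(z_dim, 4 * 4 * 1024)       # linear projection to 16384 values
        def up(c_in, c_out, last=False):
            layers = [nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1)]
            layers += [nn.Tanh()] if last else [nn.BatchNorm2d(c_out), nn.ReLU()]
            return nn.Sequential(*layers)
        self.deconv = nn.Sequential(
            up(1024, 512),          # 4x4   -> 8x8
            up(512, 256),           # 8x8   -> 16x16
            up(256, 128),           # 16x16 -> 32x32
            up(128, 3, last=True),  # 32x32 -> 64x64x3 image
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 1024, 4, 4)             # reshape to 4 x 4 x 1024 feature map
        return self.deconv(x)

img = DCGANGenerator()(torch.randn(8, 100))             # -> torch.Size([8, 3, 64, 64])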
The loss function used in WGANs is based on the Wasserstein distance, also known as
the Earth Mover’s Distance (EMD). Unlike traditional GANs, which use the binary
cross-entropy loss to train the discriminator and generator, WGANs optimize the
Wasserstein distance between the distributions of real and generated data. Wasser-
stein distance is directly tied to the fidelity of generated samples to the true data
distribution.
The Wasserstein distance measures the minimum amount of work (or “distance”)
required to transform one probability distribution into another. In the context of
WGANs, this distance is computed between the distribution of real data x and
the distribution of the generated data G(z). The generator aims to minimize this distance,
effectively bringing the generated distribution closer to the real data distribution.
The loss function aims to maximize the difference between the critic’s output scores
for real and generated samples. This difference quantifies how well the critic can
distinguish between real and synthetic data. The discriminator loss function in
WGANs is expressed as
$$\text{Wasserstein Critic Loss} = \max_{\theta_D}\; \mathbb{E}_{x \sim p_r}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))] \tag{5.2}$$
The generator loss function involves minimizing the Wasserstein distance but with
a flipped sign. It aims to minimize the discrepancy between the critic’s scores for
real and generated samples. The generator loss function in WGANs is expressed as
$$\text{Generator Loss} = \min_{\theta_G}\; -\mathbb{E}_{z \sim p_z}[D(G(z))] \tag{5.3}$$
This loss function drives the training of WGANs, leading to the refinement of
both the generator and discriminator networks over successive iterations.
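The two WGAN losses can be written compactly as below; D here is the critic, and the weight clipping used by the original WGAN to keep the critic approximately 1-Lipschitz is included as an assumption (gradient-penalty variants replace it).

import torch

def critic_loss(D, real, fake):
    # Critic maximises E[D(x)] - E[D(G(z))]  (Eq. 5.2); we minimise the negative.
    return -(D(real).mean() - D(fake).mean())

def generator_loss(D, fake):
    # Generator minimises -E[D(G(z))]  (Eq. 5.3).
    return -D(fake).mean()

def clip_critic_weights(D, c=0.01):
    # Weight clipping from the original WGAN to keep the critic approximately 1-Lipschitz.
    with torch.no_grad():
        for p in D.parameters():
            p.clamp_(-c, c)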
The architecture of an ACGAN consists of two main components: the generator and
the discriminator. As illustrated in Fig. 5.8, the generator takes random noise (z) as
input, along with the desired conditioning information (c), and learns to generate
samples that not only resemble real data but also exhibit the specified attributes.
This is typically achieved through a series of convolutional and up-sampling layers,
where the noise vectors and auxiliary information are combined and transformed
into high-dimensional representations, ultimately producing realistic images.
The discriminator in an ACGAN serves a dual role. In addition to discriminating
between real and generated samples, it is also responsible for predicting the auxiliary
information associated with the samples. The discriminator receives both real and
generated samples, along with their respective auxiliary conditioning information,
and learns to classify them into their corresponding classes or attribute categories.
This dual-task architecture encourages the discriminator to learn a more informative
representation of the data, enabling it to accurately predict the auxiliary information
while effectively distinguishing between real and synthetic samples.
The ACGAN discriminator network is designed to evaluate the authenticity and class
of input images, as depicted in Fig. 5.10. The network comprises several convolu-
tional layers that progressively down-sample the input image, extracting hierarchical
features that feed the final classification layers. As shown in Fig. 5.10, the four
convolution layers (5 × 5 kernels) produce feature maps of size 32 × 32 × 128,
16 × 16 × 256, 8 × 8 × 512, and 4 × 4 × 1024, which are flattened and fed to a fully
connected layer with a single unit that labels the input as a real or fake image.
The ACGAN loss functions for both the generator and discriminator are tailored to
optimize two main objectives: generating realistic images and predicting auxiliary
information, such as class labels or attributes, associated with the samples. These
loss functions play a crucial role in guiding the training process and ensuring that
the generator produces high-quality images aligned with the specified attributes.
The ACGAN generator loss function is composed of two components: the adversarial
loss and the auxiliary classification loss. The adversarial loss, similar to traditional
GANs, encourages the generator to produce samples that are indistinguishable from
real data, as perceived by the discriminator. Formally, the generator’s loss function
can be expressed as
$$\text{Generator Adversarial Loss} = -\mathbb{E}_{z \sim p_z}[\log D(G(z, c))] \tag{5.4}$$
The generator aims to minimize this loss, effectively fooling the discriminator
into believing that the generated samples are real.
The auxiliary classification loss encourages the generator to produce samples that
match the specified attributes or classes. Formally, the auxiliary classification loss
for the generator can be expressed as
$$\text{Generator Auxiliary Loss} = -\mathbb{E}_{z \sim p_z,\, c \sim p_c}[\log Q(c \mid G(z, c))] \tag{5.5}$$
Here Q(c | G(z, c)) is the auxiliary classifier head of the discriminator, which predicts
the conditioning information c for the generated samples. The gener-
ator seeks to minimize this loss, ensuring that the generated samples are aligned with
the specified attributes.
The ACGAN discriminator loss function also consists of two components: the adversarial loss
and the auxiliary classification loss. The adversarial loss guides the discriminator
to effectively distinguish between real and generated samples, while the auxiliary
classification loss encourages the discriminator to accurately predict the auxiliary
information associated with the samples.
Formally, the discriminator’s loss function can be expressed as
$$\text{Discriminator Adversarial Loss} = -\mathbb{E}_{x \sim p_{data}(x)}[\log D(x, c)] - \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z, c)))] \tag{5.6}$$
$$\text{Discriminator Auxiliary Loss} = -\mathbb{E}_{x \sim p_{data}(x),\, c \sim p(c)}[\log Q(c \mid x)] - \mathbb{E}_{z \sim p(z),\, c \sim p(c)}[\log(1 - Q(c \mid G(z, c)))] \tag{5.7}$$
The discriminator aims to minimize both the adversarial loss and the auxiliary classification
loss, so that it distinguishes real from synthetic samples while correctly predicting the associated auxiliary information.
By jointly optimizing these loss functions, ACGAN facilitates the generation of
diverse and controllable outputs while ensuring alignment with specified attributes
or characteristics. The adversarial and auxiliary classification components of the loss
functions guide the training process, leading to the synthesis of high-quality images
that not only resemble real data but also exhibit the desired attributes or classes.
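A sketch of how these two loss components are combined in code is given below. It assumes a discriminator with two heads: a source head producing a sigmoid probability and an auxiliary class head producing raw class logits; these names and the optimizer-side details are assumptions. Note that the auxiliary term for generated samples is written in the standard ACGAN form (maximizing log Q(c | G(z, c))), which differs slightly from the sign convention of Eq. (5.7).

import torch
import torch.nn.functional as F

def acgan_generator_loss(d_src_fake, q_cls_fake, labels):
    # d_src_fake: sigmoid output of the source head for generated samples
    # q_cls_fake: raw class logits of the auxiliary head; labels: conditioning classes c
    adv = F.binary_cross_entropy(d_src_fake, torch.ones_like(d_src_fake))   # Eq. (5.4)
    aux = F.cross_entropy(q_cls_fake, labels)                               # Eq. (5.5)
    return adv + aux

def acgan_discriminator_loss(d_src_real, d_src_fake, q_cls_real, q_cls_fake, labels):
    # Adversarial term (Eq. 5.6): real -> 1, generated -> 0.
    adv = F.binary_cross_entropy(d_src_real, torch.ones_like(d_src_real)) + \
          F.binary_cross_entropy(d_src_fake, torch.zeros_like(d_src_fake))
    # Auxiliary term: predict the correct class for both real and generated samples.
    aux = F.cross_entropy(q_cls_real, labels) + F.cross_entropy(q_cls_fake, labels)
    return adv + aux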
5.7 Summary
This chapter focused on GAN, the prevailing deep learning architecture widely
employed in several applications such as medical, remote sensing, computer vision,
natural language processing, and steganography. It delved into the discussion of
various Generative Adversarial Network (GAN) models, including Vanilla GAN,
Deep Convolutional GAN (DCGAN), Wasserstein GAN (WGAN), and Auxiliary
Classifier GAN (ACGAN). As researchers continue to refine GAN-based approaches
to image steganography, the potential for enhancing data security and privacy remains
promising, paving the way for new advancements and applications in the field.
Chapter 6
Deep Learning Architectures for Image Steganography
6.1 Introduction
A steganographic encoder embeds a secret message m into a cover object C using a
stego key k to produce the stego object S:
$$S = f(C, m, k)$$
At the receiver, the steganographic decoder recovers the message from the stego object
using the same key:
$$m' = f'(S, k)$$
(Figure: workflow of a steganographic system, with an encoder producing the stego object
S = f(C, m, k), a steganalyzer monitoring the transmission channel, and a decoder
recovering the message m′ = f′(S, k).)
During transmission of the stego object from the encoder to the decoder, the stegan-
alyzer analyses the stego object. This process, in which an intruder inspects the
communication channel to find out whether secret communication is taking place,
is known as steganalysis. The security of a steganographic system can be increased by
making the probability distribution P_S of the stego object similar to the probability
distribution P_C of the cover object. Thus
$$D(P_C, P_S) < \epsilon$$
where D(·, ·) measures the difference between the two distributions and ε is a small
positive threshold.
The metrics of the three parameters: (i) Security, (ii) Hiding Capacity, and (iii)
Invisibility for evaluating the performance of a steganographic technique are given
below.
(a) Security: Steganalysis test determines whether a given image is embedded with
secret data or not; it labels a given image as a cover or stego. A high error rate
implies a secure steganographic technique and is computed using the following:
number of true positives (TP), number of false negatives (FN), number of true
negatives (TN), and number of false positives (FP). The accuracy and error rate
are defined as
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Error rate} = 1 - \text{Accuracy}$$
(b) Hiding Capacity: Hiding capacity is the number of secret data bits that can
be effectively embedded inside the cover image. It is computed as the ratio of
maximum hiding capacity to the image size.
$$\text{Relative Capacity} = \frac{\text{Absolute Capacity}}{\text{Image Size}}$$
Maximum hiding capacity or absolute capacity gives the number of secret data
bits hidden inside an image and relative capacity gives the maximum number of bits
hidden per pixel. Relative capacity is also referred to as the bit rate and is measured in
bits per pixel (bpp) or bytes per pixel (Bpp).
(c) Extraction Accuracy: Extraction accuracy is obtained by dividing the number
of secret bits that are accurately decoded by the extractor model to the number of
secret bits concealed inside the cover image. With more steganography capacity,
the extractor network’s performance declines.
The invisibility of a steganographic technique is commonly measured by the peak
signal-to-noise ratio (PSNR), given as:
$$\text{PSNR} = 10 \log_{10}\!\left(\frac{\text{Max}_I^2}{\text{MSE}}\right)$$
Here Max_I indicates the maximum possible intensity level (pixel value) in an image,
and MSE is the mean square error that is given as:
$$\text{MSE} = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\left[C(i, j) - C'(i, j)\right]^2$$
where C(i, j) represents the cover image pixel and C′(i, j) represents the stego image
m and n represent the number of rows and columns respectively in the images.
SSIM measures the structural similarity between two images and is given as:
$$\text{SSIM}(C, C') = \frac{(2\mu_C\mu_{C'} + C_1)(2\sigma_{CC'} + C_2)}{(\mu_C^2 + \mu_{C'}^2 + C_1)(\sigma_C^2 + \sigma_{C'}^2 + C_2)}$$
Here μ_C denotes the mean intensity of the cover image (C), and μ_C′ denotes the mean
intensity of the stego image (C′); σ_C² and σ_C′² represent the variances of C and C′ respec-
tively; σ_CC′ gives the covariance between C and C′; C_1 and C_2 are two parameters
for stabilizing divisions with a weak denominator and are given by C_1 = (k_1 L)² and
C_2 = (k_2 L)², where L indicates the dynamic range of pixel values (typically
2^(#bits per pixel) − 1) and k_1 and k_2 are set to 0.01 and 0.03 respectively as default values.
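A NumPy sketch of these evaluation metrics is given below. It implements the global (single-window) form of SSIM exactly as written above; practical SSIM implementations normally average the index over local windows. Function names and default values are illustrative.

import numpy as np

def mse(cover, stego):
    c, s = cover.astype(np.float64), stego.astype(np.float64)
    return np.mean((c - s) ** 2)

def psnr(cover, stego, max_i=255.0):
    # PSNR = 10 log10(MaxI^2 / MSE), in dB.
    return 10.0 * np.log10(max_i ** 2 / mse(cover, stego))

def ssim_global(cover, stego, dynamic_range=255.0, k1=0.01, k2=0.03):
    c, s = cover.astype(np.float64), stego.astype(np.float64)
    mu_c, mu_s = c.mean(), s.mean()
    var_c, var_s = c.var(), s.var()
    cov = ((c - mu_c) * (s - mu_s)).mean()
    c1, c2 = (k1 * dynamic_range) ** 2, (k2 * dynamic_range) ** 2
    return ((2 * mu_c * mu_s + c1) * (2 * cov + c2)) / \
           ((mu_c ** 2 + mu_s ** 2 + c1) * (var_c + var_s + c2))

def relative_capacity(absolute_capacity_bits, height, width):
    # Bits hidden per pixel (bpp).
    return absolute_capacity_bits / (height * width)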
Cover generation techniques generate secure cover images for steganography. These
artificial cover images are generated by GAN models such as SGAN and WGAN. The
models described in SGAN (Volkhonskiy et al. 2020), Zi et al. (2019), and Shi et al.
(2018) have been developed for generating secure covers for steganography, and a
comparison of the performance of these models is given in Table 6.2. The workflow
diagram of the SGAN model for image steganography is shown in Fig. 6.2.
The model consists of a DCGAN generator, a DCGAN discriminator, and a CNN-
based steganalyzer. DCGAN generator receives random noise as input and generates
an image from this noise. The discriminator receives generated and real images
alternatively and its role is to distinguish between real and generated images. A secret
message is embedded in the generated cover image using a traditional steganography
algorithm such as LSB. (Fig. 6.2: a noise vector Z is fed to the DCGAN generator, which
produces a generated cover image; the secret message is embedded into it with a
traditional algorithm to form the stego image.) In the original DCGAN, only the
discriminator is used to perform classification, but in SGAN, a steganalyzer is also used
to perform binary classification.
Steganalyzer’s role is to distinguish between the generated cover image and the stego
image. During the early stages of training, the generator produces images that do
not resemble real images. As the training progresses, the generator produces better
and better images with the help of the feedback received from the discriminator
and the steganalyzer. The trained generator produces cover images that cannot be
differentiated from the real images or stego images. This results in secure cover
images for steganography. The generator is trained against the discriminator and the
steganalyzer simultaneously using the loss function given below:
$$L = \alpha\left(\mathbb{E}_{X \sim p(X)}[\log D(X)] + \mathbb{E}_{Z \sim p(z)}[\log(1 - D(G(Z)))]\right) + (1 - \alpha)\,\mathbb{E}_{Z \sim p(z)}\left[\log S(\text{Stego}(G(Z))) + \log(1 - S(G(Z)))\right]$$
where α is a parameter that is used to control the trade-off between the realism of
covers and security of covers.
Z is a noise vector.
X is a real image.
E (.) is the expectation value.
Stego (.) is the output of a traditional embedding algorithm.
S (.) is the output of the steganalyzer network.
G (.) is the output of the generator network (generated cover image).
D (.) is the output of the discriminator network.
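A literal transcription of this combined objective is sketched below in PyTorch. D, S, G, and stego_fn (the traditional embedding step, e.g. LSB) are assumed to be callables returning probabilities or images, and α = 0.7 is an illustrative value; the generator is trained against the discriminator and the steganalyzer using this value, as described in the text.

import torch

def sgan_value_function(D, S, stego_fn, G, real_x, z, alpha=0.7):
    # Combined objective from the text; alpha (here 0.7, an illustrative value)
    # trades off the realism of the generated covers against their security.
    fake_cover = G(z)
    d_term = torch.log(D(real_x)).mean() + torch.log(1.0 - D(fake_cover)).mean()
    s_term = torch.log(S(stego_fn(fake_cover))).mean() + \
             torch.log(1.0 - S(fake_cover)).mean()
    return alpha * d_term + (1.0 - alpha) * s_term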
Distortion learning techniques use high pass filtered cover images in distortion gener-
ators to produce distortion maps for steganography. A pre-trained simulator is used
to embed secret messages in the distortion map. The distortion maps are generated by
GAN-based distortion learning models like ASDL-GAN (Tang et al. 2017), UT-SCA-
GAN (Yang et al. 2018), and UT-6-HPF-GAN (Yang et al. 2020). A comparison of
the performance is presented in Table 6.3. The workflow diagram of the ASDL-GAN
model for image steganography is shown in Fig. 6.3.
The distortion-based steganography algorithms use a high pass filtered version of
the cover image to generate residual maps. On similar lines, GAN-based distortion
learning models use high pass filtered versions of cover images for producing distor-
tion maps and as the input to the steganalyzer. This ensures consistency between the generator and
steganalyzer at the input level. As depicted in Fig. 6.4, the generator network receives
the high pass filtered cover image as an input and generates the distortion map. The
distortion map is fed into a small network called Ternary Embedding Simulator (TES)
that is pre-trained and acts as an activation function for the generation of modified
distortion map m. The modified distortion map m and the cover image are added
together to form the stego image S. The role of the steganalyzer is to distinguish
between generated stego images and the cover images. The two networks compete
with each other to achieve the task. TES is pre-trained and uses the loss function
given below:
$$l_{TES} = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\left(m_{i,j} - m'_{i,j}\right)^2$$
where m_{i,j} is the modified distortion map and m′_{i,j} is the ground truth.
The steganalyzer is trained using the function:
$$l_D = -\sum_{i=1}^{2} y_i \log y'_i$$
where y_i represents the ground truth and y′_i represents the output of the steganalyzer.
The generator is trained using the loss function given below:
$$l_G = \alpha\, l_1 + \beta\, l_2$$
where l_1 is the adversarial loss provided by the steganalyzer, l_2 is a capacity term that
constrains the embedding to the target payload, and α and β are weighting parameters.
GAN embedding techniques for image steganography are unlike traditional algo-
rithms as these techniques are free from any human intervention. A model repre-
senting this technique is trained using the adversarial training of GAN to embed a
secret message in a cover image. The techniques described in (Abadi and Andersen
2016), GSIVAT (Hayes and Danezis 2017; Shi et al. 2019; Liu et al. 2018; Yedroudj
et al. 2020; Yang et al. 2019; Fu et al. 2020) are GAN embedding techniques. A
comparison of the performance of such techniques is given in Tables 6.5, 6.9, and
6.11. The GAN embedding GSIVAT model workflow is shown in Fig. 6.5.
A secret message (M) and a cover image (C) are fed to the generator network to
generate the stego image (C′) as shown in Fig. 6.5. The extractor network receives the
stego image generated and recovers the secret message. The discriminator network
serves as a steganalyzer and receives both the cover image and the stego image
alternatively and provides feedback for training the generator. The generator is trained
so that the steganalyzer cannot differentiate between the cover image and the stego
image. An output value of the steganalyzer close to 0 indicates that the input
image is a stego image, and a value close to 1 indicates that the input
image is a cover image. Initially, the generator produces noisy images from which
the extractor is not able to extract the secret message successfully. As the training
progresses, all the networks improve their performances. With the trained generator,
the extractor can successfully recover the secret message from the stego image. The
extractor network is trained by minimizing the loss function shown below:
$$L_E(\theta_G, \theta_E, M, C) = d(M, M') = d(M, B(\theta_E, C')) = d(M, B(\theta_E, A(\theta_G, C, M)))$$
where θ_G and θ_E denote the parameters of the generator and extractor networks,
A(·) and B(·) denote the generator (encoder) and extractor mappings,
M and M′ denote the original secret message and the recovered secret message,
C and C′ denote the cover and stego images, and
d(M, M′) is the Euclidean distance between M and M′.
The steganalyzer uses the sigmoid cross-entropy loss function, given as:
$$L_S(\theta_S, C, C') = -y\log(E(\theta_S, X)) - (1 - y)\log(1 - E(\theta_S, X))$$
where X is the input image (the cover image C or the stego image C′), y is its label, and
E(θ_S, X) is the steganalyzer output.
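The two losses can be sketched as follows; the scores are assumed to be sigmoid outputs of the steganalyzer, and the label convention (y = 1 for cover, y = 0 for stego) follows the preceding paragraph.

import torch
import torch.nn.functional as F

def extractor_loss(M, M_rec):
    # Euclidean distance d(M, M') between the embedded and recovered messages.
    return torch.norm(M - M_rec, p=2, dim=-1).mean()

def steganalyzer_loss(score, is_cover):
    # Sigmoid cross-entropy: label y = 1 for cover images, y = 0 for stego images.
    target = torch.ones_like(score) if is_cover else torch.zeros_like(score)
    return F.binary_cross_entropy(score, target)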
Fig. 6.6 a Training stego image generation, b Training secret message extractor, c Workflow of
embedding less steganography model
In the embedding less approach, the secret message is first divided into segments and each
segment is mapped to a noise vector using predefined mapping rules (Fig. 6.6).
These transformed noise vectors are then fed into the DCGAN generator for the gener-
ation of stego images. The trained generator is obtained after DCGAN converges
using the loss function given below:
$$\min_{G}\max_{D} V(G, D) = \mathbb{E}_{X \sim p(x)}[\log D(X)] + \mathbb{E}_{Z \sim p(z)}[\log(1 - D(G(Z)))]$$
The extractor network is trained using the loss function:
$$L_E = \sum_{i=1}^{n} \left\|Z_i - E(\text{Stego}_i)\right\|^2 = \sum_{i=1}^{n} \left\|Z_i - E(G(Z_i))\right\|^2$$
where Z is the noise vector, E(Stego) is the noise vector recovered by the extractor
network. This noise vector is then used to recover the secret message bits using
reverse mapping rules.
Figure 6.6c displays how the communication takes place. The sender uses the
trained generator and the receiver uses the trained extractor. At the sender side, the
secret message S is divided into segments Si and then each segment Si is mapped
into a noise vector Zi. These noise vectors are fed to the trained generator, which generates
stego images. The extractor at the receiver side extracts the noise vector from the
stego image. The secret data is recovered from the Noise vector as per the reverse
mapping rules.
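One simple mapping rule of this kind is sketched below: each group of secret bits selects a sub-interval of [−1, 1] and is replaced by a random value inside that interval, and the reverse mapping recovers the bits by locating the interval of each value extracted from the stego image. This is an illustrative rule, not necessarily the exact mapping used in the cited works.

import numpy as np

def bits_to_noise(bits, bits_per_value=2, rng=None):
    # Map each group of bits to a random value inside the sub-interval of [-1, 1]
    # indexed by that group (one simple, hypothetical mapping rule).
    rng = rng or np.random.default_rng()
    groups = bits.reshape(-1, bits_per_value)
    idx = groups.dot(1 << np.arange(bits_per_value)[::-1])      # bit group -> integer
    n = 2 ** bits_per_value
    low = -1.0 + idx * (2.0 / n)
    return rng.uniform(low, low + 2.0 / n)                      # noise vector Z

def noise_to_bits(z, bits_per_value=2):
    # Reverse mapping: locate the sub-interval of each recovered value Z'.
    n = 2 ** bits_per_value
    idx = np.clip(((z + 1.0) / 2.0 * n).astype(int), 0, n - 1)
    return ((idx[:, None] >> np.arange(bits_per_value)[::-1]) & 1).reshape(-1)

bits = np.random.randint(0, 2, 128)
assert np.array_equal(bits, noise_to_bits(bits_to_noise(bits)))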
Category label image steganography techniques use the conditional GAN model for
the generation of stego images. The category labels and noise are fed to the generator
network for the generation of stego images. The labels act as a driver for the stego
image generation. The techniques described in (Zhang et al. 2020; Liu et al. 2018)
are category label techniques. A comparison of the performance of these techniques
is given in Table 6.7. Figure 6.7a and b show the workflow of the category label image
steganography technique described in Liu et al. (2018); in Fig. 6.7b, secret data fragments
are mapped to category labels through a code dictionary, the pre-trained ACGAN generator
produces the stego images from these labels and a noise vector, and the receiver recovers
each category label (and hence each secret data fragment) by taking the argmax over the
category output of the pre-trained ACGAN discriminator.
The ACGAN generator is trained using a noise vector and the category label as
two inputs as shown in Fig. 6.7a. The ACGAN discriminator has two outputs: one
is the class probability value of the input image and the second is the category label
of the input image. ACGAN is trained using a two-fold objective function which is
given as:
$$L_G = \mathbb{E}[\log P(S = \text{real} \mid X_{\text{real}})] + \mathbb{E}[\log P(S = \text{stego} \mid X_{\text{stego}})]$$
$$L_C = \mathbb{E}[\log P(C = c \mid X_{\text{real}})] + \mathbb{E}[\log P(C = c \mid X_{\text{stego}})]$$
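The sender-side and receiver-side mapping between secret data fragments and category labels can be sketched as follows; the code dictionary used here (every 3-bit fragment mapped to one of eight labels) is hypothetical, and the real systems use their own dictionaries and label sets.

import numpy as np

# Hypothetical code dictionary: every 3-bit fragment of the secret data is
# mapped to one of 8 category labels.
FRAG_BITS = 3
frag_to_label = {i: i for i in range(2 ** FRAG_BITS)}
label_to_frag = {v: k for k, v in frag_to_label.items()}

def secret_to_labels(bits):
    # bits: flat 0/1 array whose length is a multiple of FRAG_BITS.
    frags = bits.reshape(-1, FRAG_BITS).dot(1 << np.arange(FRAG_BITS)[::-1])
    return np.array([frag_to_label[int(f)] for f in frags])    # conditioning labels c_i

def labels_to_secret(class_logits):
    # Receiver side: argmax of the discriminator's class output gives the label,
    # which the reverse dictionary converts back to a secret fragment.
    labels = class_logits.argmax(axis=1)
    frags = np.array([label_to_frag[int(l)] for l in labels])
    return ((frags[:, None] >> np.arange(FRAG_BITS)[::-1]) & 1).reshape(-1)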
The commonly used datasets in image steganography are briefly summarized below.
(a) CelebA (CelebA Dataset, n.d.) is a large–scale dataset of celebrity faces. It
consists of 202,599 face images of various celebrities across the globe with
10,177 distinct identities. Each image has 40 attribute annotations and five
landmark locations.
(b) Bossbase (Bossbase Dataset, n.d.) is the benchmark dataset for steganography.
It consists of 10,000 grayscale images of diverse scenes. The dataset has images
of buildings, animals, landscapes, etc. Each image in the dataset is of size 512
× 512.
(c) ImageNet (ImageNet Object Localization Challenge | Kaggle, n.d.) is a large-
scale dataset of annotated images. It consists of approximately 14 million images
of 21 thousand categories. Each category in the dataset consists of hundreds of
images.
(d) PASCAL-VOC12 (PASCAL VOC 2012 | Kaggle, n.d.) consists of images of 20
different object classes such as dogs, cats, sheep, persons, sofas, plants, etc.
There is a bounding box annotation, object class annotation, and pixel-level
segmentation annotation for all the images in the dataset.
(e) CIFAR-100 (CIFAR-10 and CIFAR-100 Datasets, n.d.) is a tiny color image
dataset of 100 different object categories such as butterfly, mountain, lion,
mouse, etc. It consists of 50,000 labeled training images (500 per class) and 10,000
test images (100 per class). Each image is of size 32 × 32 × 3.
(f) USC-SIPI (SIPI Image Database, n.d.) is a digital image database that consists
of a collection of images such as Brodatz and mosaic textures, high-altitude aerial
images, mandrill, peppers, moving heads, moving vehicles, fly-overs, etc. The
database is split into volumes having different image sizes of 256 × 256, 512
× 512, or 1024 × 1024. All images are 8 bits/pixel for black and white images
and 24 bits/pixel for color images.
This section gives the security comparison of different steganography techniques. The
security of a steganography technique is determined by the error rate of the steganal-
ysis test performed by using a steganalyzer. The dataset used for the steganalysis test
is split into two parts: the training dataset and the testing dataset. The steganalyzer
is trained using the images in the training dataset and corresponding stego images. It
is tested by using the testing dataset and the corresponding stego images. The error
rate of the steganalysis test determines the security of a steganography technique. A
higher error rate indicates better security and vice versa.
The security comparison of content-based steganography techniques such as
WOW, HUGO, S-UNIWARD, and LSB is shown in Table 6.1. The steganalyzer
employed in these methods is SRM except for (Lu et al. 2021) which uses RLCM-
100D steganalyzer. All these methods employ the Bossbase dataset. The embedding
capacity is set to 0.4 or 0.01 bpp for the steganalysis test. From Table 6.1, it can be seen
that content-based steganography methods have an error rate in the range of 10–25%,
which is not considered high.
The security performance of steganography approaches that use different cover
generation techniques is compared next. These techniques generate artificial cover images
by using a given dataset. The cover generation models like SGAN and SSGAN
are compared in Table 6.2, which shows the security performance of these cover
generation techniques against the steganalysis test. These techniques use the CelebA
dataset for training the model to generate artificial covers. The secret data is embedded
in the generated covers using the LSB and HUGO methods with an embedding
capacity of 0.4 bpp. The steganalyzer used by the SGAN is a self-defined steganalyzer
whereas SSGAN employs a self-defined steganalyzer for LSB embedding and Qian’s
Net for HUGO embedding. The table shows that the SSGAN technique performs
better than the SGAN technique for both HUGO and LSB embeddings. The results
indicate that in general cover generation steganographic techniques are more secure
than content-based steganographic techniques. In particular, the steganalysis test of
the SSGAN technique has produced very good results with an error rate of more than
70%.
The security comparison of the steganography methods that automatically learn
the distortion function using a GAN is given next. The distortion
learning steganography models like ASDL-GAN, UT-6HPF-GAN, and UT-SCA-
GAN are compared for security in Table 6.3. All these models employ the Bossbase
dataset for the steganalysis test. The embedding capacity is set to 0.4 bpp, and the SRM
steganalyzer is employed for the steganalysis. As depicted in the table, such techniques do
not possess high resistance to steganalysis test and have security similar to that of
the content-based traditional steganographic techniques.
The security performance of approaches that use the adversarial image embedding
technique is compared next. This technique uses adversarial images as cover images.
The adversarial image embedding steganography models like ADV-EMD and CNN-
Adv-EMD are compared in Table 6.4. The steganalysis test of the adversarial image
embedding technique employs the Bossbase dataset. The embedding capacity is set to
0.4 bpp. The test makes use of the XuNET steganalyzer. As depicted in the table, these
models achieve higher error rates than the distortion learning steganography models,
but lower error rates than cover generation steganography models. Thus adversarial
image embedding steganography models are more secure than the distortion learning
steganography models and less secure than cover generation steganography models.
The security performance comparison of the GAN embedding technique is
given next. This technique utilizes deep learning models for the entire steganography
process without using any rules framed by humans. The technique uses cover images
from a given dataset. The security measure of GAN embedding steganographic
models like HIGAN and the model described in (Duan et al. 2019) is compared
in Table 6.5. The steganalysis test uses the ImageNet dataset and XuNet stegana-
lyzer. The model described in (Duan et al. 2019) produces an appreciable error rate
of 60.3% during the steganalysis test, but this error rate is lower than the 72% obtained
with the cover generation steganography techniques.
The security performance comparison of the embedding less technique is given next.
This technique directly generates stego images using noise and secret data without
embedding the secret data in the cover image. The security comparison of embed-
ding less steganography models like SsteGAN, GSS, and SWE-GAN is shown in
Table 6.6. The steganalysis test of these models uses the CelebA dataset. As depicted
in the table, the test of SsteGAN and GSS models produces an error rate of around
50%. The SWE-GAN model is tested for two cases: in case 1, stego images are kept
private and are not used by the steganalyzer CRM for training, which results in low
steganalysis accuracy and high security; in case 2, stego images are made public and are
used by the steganalyzer CRM for training, which decreases the error rate of the
steganalyzer from 99.2% to 44% and hence decreases the security. These models have overall high
security but when the stego data is made public there is an appreciable decline in the
security.
The security performance comparison of the category label steganographic technique
is given next. This technique directly generates stego images from the noise and the
category labels. The category labels act as a driver for the stego image generation. The
security performance comparison of the category labels models like CIH-GAN and
SSS-GAN is shown in Table 6.7. CelebA dataset has been used for the steganalysis
test. The CIH-GAN model has resulted in an error rate of 52%. SSS-GAN has been
tested for two cases: in case 1, stego images are kept hidden, and are not used by
the steganalyzer CRM for training; in case 2, stego images are made public, and are
used by the steganalyzer CRM for training. The error rate decreases from 99.9% to
56% when stego images are made public, and hence the security decreases. These
models have overall high security but when the stego data is made public the security
decreases appreciably.
6.5 Challenges and Future Direction
Future research should explore new deep learning architectures that can improve the
capacity and invisibility of image steganography.
Despite the progress made, deep learning-based image steganography faces several
challenges. A primary issue is balancing imperceptibility with payload capacity, as
embedding more data can introduce detectable distortions, which compromise secu-
rity. While deep learning has improved this trade-off, achieving a higher payload
without sacrificing image quality remains difficult. Another challenge is the robust-
ness of stego images against increasingly sophisticated steganalysis techniques that
utilize deep learning models like XuNet, which can detect minute alterations in
images. Additionally, the generalization of these models across different datasets
poses problems. Although models work well on specific datasets, ensuring they
perform equally well on new, unseen data requires further research.
Resource efficiency is another significant challenge. Deep learning models are
computationally expensive, requiring large datasets and powerful hardware. This
limits their accessibility for real-time applications. Furthermore, scaling models to
handle high-resolution images introduces complexity, as the architecture must be
more sophisticated to maintain imperceptibility and capacity in larger images. Lastly,
deep learning models are susceptible to adversarial attacks, where small modifica-
tions in the input data can disrupt the embedding or extraction process, threatening
the security of hidden communications.
Looking ahead, there are several promising directions for research in deep
learning-based steganography. Future efforts will likely focus on increasing the
payload capacity while minimizing distortion, possibly by employing more advanced
architectures like transformers or diffusion models. Adversarial training can also
enhance the security of stego images, making them more resistant to both tradi-
tional and deep learning-based steganalysis. Another exciting avenue is cross-
media steganography, where information can be embedded across multiple types of
media, such as images, audio, and video, enhancing the versatility of steganographic
methods.
Lightweight models that reduce computational costs are crucial for expanding
the use of steganography in real-time applications, such as secure video streaming.
Techniques like model pruning and knowledge distillation can be explored to develop
efficient, scalable solutions. On the security front, combining steganography with
blockchain and cryptography could add layers of protection, offering stronger encryp-
tion and secure key management. Steganography may also become integral to
privacy-preserving systems, protecting sensitive data in communications and cloud
storage environments, which will be especially valuable in an era of increasing digital
surveillance and cyber threats.
6.6 Summary
This chapter described deep learning architectures for image steganography. It discussed
cover generation, distortion learning, adversarial image embedding, GAN embedding,
embedding less, and category label techniques, summarized the datasets commonly used
in image steganography, and compared the security of these techniques against steganalysis.
Bibliography
1. B. Sultan, M.A. Wani, Enhancing steganography capacity through multi-stage generator model
in generative adversarial network based image concealment. J. Electron. Imaging 33(3), 033026
(2024). https://doi.org/10.1117/1.JEI.33.3.033026
2. B. Sultan, M.A. Wani, A new framework for analyzing color models with generative adversarial
networks for improved steganography. Multimed. Tools Appl. (2023). https://doi.org/10.1007/
s11042-023-14348-7
3. M.A. Wani, B. Sultan, Deep learning based image steganography: a review. Wiley Interdiscip.
Rev. Data Min. Knowl. Discov. 1–26 (2022). https://doi.org/10.1002/widm.1481
4. W. Lu, Y. Xue, Y. Yeung, H. Liu, J. Huang, Y.Q. Shi, Secure halftone image steganography
based on pixel density transition. IEEE Trans. Dependable Secure Comput. 18(3), 1137–1149
(2021). https://doi.org/10.1109/TDSC.2019.2933621
5. D. Volkhonskiy, I. Nazarov, E. Burnaev, Steganographic generative adversarial networks
(2020), p. 97. https://doi.org/10.1117/12.2559429
6. J. Yang, D. Ruan, J. Huang, X. Kang, Y.Q. Shi, An embedding cost learning framework using
GAN. IEEE Trans. Inf. Forensics Secur. 15, 839–851 (2020). https://doi.org/10.1109/TIFS.
2019.2922229
7. L. Zhou, G. Feng, L. Shen, X. Zhang, On security enhancement of steganography via generative
adversarial image. IEEE Signal Process. Lett. 27, 166–170 (2020). https://doi.org/10.1109/LSP.
2019.2963180
8. Z. Zhang, G. Fu, R. Ni, J. Liu, X. Yang, A generative method for steganography by cover
synthesis with auxiliary semantics. Tsinghua Sci. Technol. 25(4), 516–527 (2020). https://doi.
org/10.26599/TST.2019.9010027
9. M. Yedroudj, F. Comby, M. Chaumont, Steganography using a 3-player game. J. Vis. Commun.
Image Represent. 72, 102910 (2020). https://doi.org/10.1016/j.jvcir.2020.102910
10. Z. Fu, F. Wang, X. Cheng, The secure steganography for hiding images via GAN. EURASIP
J. Image Video Process. (2020). https://doi.org/10.1186/s13640-020-00534-2
11. H. Shi, X.Y. Zhang, S. Wang, G. Fu, J. Tang, Synchronized detection and recovery of stegano-
graphic messages with adversarial learning. Lecture Notes in Computer Science (including
subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), vol. 11537.
(Springer, 2019), pp. 31–43
12. W. Tang, B. Li, S. Tan, M. Barni, J. Huang, CNN-based adversarial embedding for image
steganography. IEEE Trans. Inf. Forensics Secur. 14(8), 2074–2087 (2019). https://doi.org/10.
1109/TIFS.2019.2891237
13. S. Ma, X. Zhao, Y. Liu, Adaptive spatial steganography based on adversarial examples.
Multimed. Tools Appl. 78(22), 32503–32522 (2019). https://doi.org/10.1007/s11042-019-07994-3
14. Z. Zhang, J. Liu, Y. Ke, Y. Lei, J. Li, M. Zhang, X. Yang, Generative steganography by sampling.
IEEE Access 7, 118586–118597 (2019). https://doi.org/10.1109/ACCESS.2019.2920313
15. H. Zi, Q. Zhang, J. Yang, X. Kang, Steganography with convincing normal image from a joint
generative adversarial framework, in 2018 Asia-Pacific Signal and Information Processing
Association Annual Summit and Conference, APSIPA ASC 2018: Proceedings, November
(2019), pp. 526–532. https://doi.org/10.23919/APSIPA.2018.8659716
16. A. ur Rehman, R. Rahim, S. Nadeem, S. ul Hussain, End-to-end trained CNN encoder-decoder
networks for image steganography. Lecture Notes in Computer Science (including subseries
lecture notes in artificial intelligence and lecture notes in bioinformatics), 11132 LNCS (2019),
pp. 723–729. https://doi.org/10.1007/978-3-030-11018-5_64
17. G. Swain, Very high capacity image steganography technique using quotient value differencing
and LSB substitution. Arab. J. Sci. Eng. 44(4), 2995–3004 (2019). https://doi.org/10.1007/s13369-018-3372-2
18. X. Duan, K. Jia, B. Li, D. Guo, E. Zhang, C. Qin, Reversible image steganography scheme
based on a U-net structure. IEEE Access 7, 9314–9323 (2019). https://doi.org/10.1109/ACCESS.2019.2891247
19. H. Shi, J. Dong, W. Wang, Y. Qian, X. Zhang, SSGAN: Secure Steganography Based on
Generative Adversarial Networks, vol. 10735 (Springer International Publishing, LNCS, 2018)
20. M.M. Liu, M.Q. Zhang, J. Liu, P.X. Gao, Y.N. Zhang, Coverless information hiding based on
generative adversarial networks. Yingyong Kexue Xuebao/J. Appl. Sci. 36(2), 371–382 (2018).
https://doi.org/10.3969/j.issn.0255-8297.2018.02.015
21. I.R. Grajeda-Marín, H.A. Montes-Venegas, J.R. Marcial-Romero, J.A. Hernandez-Servín, V.
Muñoz-Jiménez, G.D.I. Luna, A new optimization strategy for solving the fall-off boundary
value problem in pixel-value differencing steganography. Int. J. Pattern Recogn. Artif. Intell.
32(1), 1–17 (2018). https://doi.org/10.1142/S0218001418600108
22. Z. Wang, N. Gao, X. Wang, X. Qu, L. Li, SSteGAN: Self-learning steganography based on
generative adversarial networks. Lecture Notes in Computer Science (including subseries
lecture notes in artificial intelligence and lecture notes in bioinformatics), vol. 11302 (Springer
International Publishing, 2018). https://doi.org/10.1007/978-3-030-04179-3_22
23. X. Liao, S. Guo, J. Yin, H. Wang, X. Li, A.K. Sangaiah, New cubic reference table based
image steganography. Multimed. Tools Appl. 77(8), 10033–10050 (2018). https://doi.org/10.
1007/s11042-017-4946-9
24. J. Yang, K. Liu, X. Kang, E.K. Wong, Y.-Q. Shi, Spatial image steganography based on
generative adversarial network 1, 1–7 (2018). http://arxiv.org/abs/1804.07939
25. Y. Zhang, W. Zhang, K. Chen, J. Liu, Y. Liu, N. Yu, Adversarial examples against deep neural
network based steganalysis, in Proceedings of the 6th ACM Workshop on Information Hiding
and Multimedia Security (IH&MMSec’18). (Association for Computing Machinery, 2018),
pp. 67–72. https://doi.org/10.1145/3206004.3206012
26. D. Hu, L. Wang, W. Jiang, S. Zheng, B. Li, A novel image steganography method via deep
convolutional generative adversarial networks. IEEE Access 6, 38303–38314 (2018). https://
doi.org/10.1109/ACCESS.2018.2852771
27. J. Ye, J. Ni, Y. Yi, Deep learning hierarchical representations for image steganalysis. IEEE
Trans. Inf. Forensics Secur. 12(11), 2545–2557 (2017). https://doi.org/10.1109/TIFS.2017.
2710946
28. C.C. Chang, Y.H. Huang, T.C. Lu, A difference expansion based reversible information hiding
scheme with high stego image visual quality. Multimed. Tools Appl. 76(10), 12659–12681
(2017). https://doi.org/10.1007/s11042-016-3689-3
Bibliography 99
29. M. Hussain, A.W. Abdul Wahab, A.T.S. Ho, N. Javed, K.H. Jung, A data hiding scheme using
parity-bit pixel value differencing and improved rightmost digit replacement. Signal Process.
Image Commun. 50, 44–57 (2017). https://doi.org/10.1016/j.image.2016.10.005
30. W. Tang, S. Tan, B. Li, J. Huang, Automatic steganographic distortion learning using a gener-
ative adversarial network. IEEE Signal Process. Lett. 24(10), 1547–1551 (2017). https://doi.
org/10.1109/LSP.2017.2745572
31. J. Hayes, G. Danezis, Generating steganographic images via adversarial training. Adv. Neural
Inf. Process. Syst. 2017, 1955–1964 (2017)
32. Y. Ke, M. Zhang, J. Liu, T. Su, X. Yang, Generative steganography with Kerckhoffs’ principle
based on generative adversarial networks 1–5 (2017). http://arxiv.org/abs/1711.04916
33. M. Abadi, D.G. Andersen, Learning to protect communications with adversarial neural cryp-
tography. Nature 1–15 (2016). http://arxiv.org/abs/1610.06918. https://doi.org/10.1007/978-3-
030-22741-8_3
34. G. Swain, A steganographic method combining LSB substitution and PVD in a block. Procedia
Comput. Sci. 85, 39–44 (2016). https://doi.org/10.1016/J.PROCS.2016.05.174
35. G. Xu, H.Z. Wu, Y.Q. Shi, Structural design of convolutional neural networks for steganalysis.
IEEE Signal Process. Lett. 23(5), 708–712 (2016). https://doi.org/10.1109/LSP.2016.2548421
36. D. Lerch-Hostalot, D. Megías, Unsupervised steganalysis based on artificial training sets. Eng.
Appl. Artif. Intell. 50(April), 45–59 (2016). https://doi.org/10.1016/j.engappai.2015.12.013
37. V. Holub, J. Fridrich, T. Denemark, Universal distortion function for steganography in an
arbitrary domain. EURASIP J. Inf. Secur. 2014, 1–13 (2014). https://doi.org/10.1186/1687-
417X-2014-1
38. V. Holub, J. Fridrich, Designing steganographic distortion using directional filters, in 2012
IEEE International Workshop on Information Forensics and Security (WIFS) (Costa Adeje,
Spain, 2012), pp. 234–239. https://doi.org/10.1109/WIFS.2012.6412655
39. J. Fridrich, J. Kodovsky, Rich models for steganalysis of digital images. IEEE Trans. Inf.
Forensics Secur. 7(3), 868–882 (2012). https://doi.org/10.1109/TIFS.2012.2190402
40. T. Pevný, T. Filler, P. Bas, Using high-dimensional image models to perform highly unde-
tectable steganography. Lecture Notes in Computer Science (including subseries lecture notes
in artificial intelligence and lecture notes in bioinformatics), 6387 LNCS (2010), pp. 161–177.
https://doi.org/10.1007/978-3-642-16435-4_13
Chapter 7
Two-Stage Depth-Balanced Generative
Adversarial Networks for Image
Steganography
7.1 Introduction
In GAN-based image steganography, the generator learns to produce a stego image S
directly from a cover image C and a secret message M:
$$S = f(C, M) \tag{7.1}$$
The approach involves a depth balancing mechanism that optimizes the depth of
the two stages of the generator network, ensuring efficient and effective data hiding
while maintaining a good quality of the stego image. It also enhances its robustness
against detection by advanced steganalysis techniques.
Fig. 7.2 Two stage generator for depth balanced steganography approach
Here cover images (C) undergo an initial reconstruction process, which is unlike the
traditional approach, where a cover image is directly used to create the stego image.
The cover image is subjected to a series of transformations to produce the recon-
structed cover image, referred to as C_R. The dimensions of the reconstructed cover
image, C_R, are modified according to the depth of the cover image reconstruction
architecture. The process of reconstruction is expressed as follows:
$$C_R = f_1(C) \tag{7.2}$$
This stage employs the reconstructed cover image and the secret message to produce
the stego image (S). A set of convolution operations is applied to the reconstructed
cover image (C_R) and the secret message (M) to create the stego image. The process
is expressed as follows:
$$S = f_2(C_R, M) \tag{7.3}$$
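The two-stage forward pass of Eqs. (7.2) and (7.3) can be sketched as follows; the fully connected stages below stand in for the convolutional stages of the actual generator, and all layer sizes and activation choices are assumptions.

import torch
import torch.nn as nn

class TwoStageGenerator(nn.Module):
    # Illustrative sketch of Eqs. (7.2) and (7.3); layer sizes are assumptions.
    def __init__(self, img_dim=64 * 64 * 3, msg_dim=4096, hidden=8192):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())            # C -> C_R
        self.stage2 = nn.Sequential(nn.Linear(hidden + msg_dim, img_dim), nn.Tanh())  # (C_R, M) -> S

    def forward(self, cover, message):
        c_r = self.stage1(cover.flatten(1))            # C_R = f1(C)
        x = torch.cat([c_r, message], dim=1)           # concatenate C_R with the secret message M
        return self.stage2(x).view(-1, 3, 64, 64)      # S = f2(C_R, M)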
The training algorithms optimize the model parameters, ensuring faster convergence
and improved performance. Binary cross-entropy and mean squared error loss func-
tions are used for both the cover image reconstruction and the stego image generation
tasks.
Algorithm 1. Cover Image Reconstruction
1. Choose a random sample C with dimensions 64 × 64 from the dataset of cover images.
2. Flatten the selected sample C and feed it into stage 1 of the Generator Network.
3. Implement a series of transformations on this flattened cover image and subject it to training to generate
a reconstructed cover image C_R, the dimensions of which are determined by the depth of stage 1 of
the generator network.
Algorithm 2 embeds the secret messages with the reconstructed cover image.
It takes a mini-batch of reconstructed cover images and a corresponding mini-
batch of secret messages to produce stego images. The resulting stego images are
indistinguishable from cover images but securely hide secret messages.
Algorithm 2. Stego Image Construction
1. Concatenate the secret messages M with the reconstructed cover images CR.
2. Send the concatenated input to stage 2 of the Generator Network.
3. Implement a series of transformations on the concatenated input and subject it to training to generate
the stego image S with the same dimensions as the cover image.
The loss functions compute the difference between the predicted and actual outputs,
facilitating the model to optimize its parameters. The first stage of the generator
model takes cover image C as input and it produces the reconstructed cover image
C R which is given as:
$$C_R = G_{CR}(\theta_{CR}, C)$$
• GCR (.) represents the first stage of the Generator Model (cover image reconstruc-
tion stage).
• θCR represents the parameters of the first stage of the Generator Model.
The second stage of the generator model takes the reconstructed cover image C R
and the secret message M as input and it produces the stego image S which is given
as:
$$S = G_{SC}(\theta_{SC}, C_R, M)$$
• GSC (.) represents the 2nd stage of the generator model (stego image construction
stage).
• θSC represents the parameters of the second stage of the generator model.
The decoder model uses the stego image S to recover the secret message M′, which
is given as:
$$M' = D(\theta_D, S)$$
The decoder network is fine-tuned by reducing the Euclidean distance ‘d’ between
the original secret message (M) and its reconstructed version (M ). This optimization
is accomplished by utilizing the loss function LD given in Eq. (7.4) below.
$$L_D(\theta_D, M') = d(M, M') = d(M, D(\theta_D, S)) \tag{7.4}$$
The steganalyzer model conducts binary classification, with its loss function L_ST
defined by the sigmoid cross-entropy equation shown in Eq. (7.5) below:
$$L_{ST}(\theta_{ST}, x) = -y\log(ST(\theta_{ST}, x)) - (1 - y)\log(1 - ST(\theta_{ST}, x)) \tag{7.5}$$
When the input x is the cover image C (x = C), the resulting output y is assigned
the value of 1, and when x is the stego image S (x = S), y is set to 0.
The generator network is trained in two stages. The cover image reconstruction
stage and the stego images construction stage. The loss function of the cover image
reconstruction stage LCR is defined in Eq. (7.6) below.
The stego image construction stage undergoes training using a weighted blend of
three loss functions: the cover image reconstruction loss LCR , the decoder loss LD ,
the steganalyzer loss LST . The stego image construction loss function LSC is shown
in Eq. (7.7) below.
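A sketch of how these pieces combine during training is given below; the exact form of Eq. (7.7) is not reproduced in this section, so the individual weights of the blend are placeholders, and the function names are assumptions.

import torch

def decoder_loss(M, M_rec):
    # Eq. (7.4): Euclidean distance between the original and recovered messages.
    return torch.norm(M - M_rec, p=2, dim=-1).mean()

def stego_construction_loss(l_cr, l_d, l_st, w_cr=1.0, w_d=1.0, w_st=1.0):
    # Weighted blend of the cover reconstruction loss, the decoder loss, and the
    # steganalyzer loss described for Eq. (7.7); the weights here are placeholders.
    return w_cr * l_cr + w_d * l_d + w_st * l_st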
The experimental setup and dataset employed are discussed in this section.
The performance of the approach discussed here is compared to other GAN-based
models reported by other researchers.
The experiments used 495 MRI and CT images of size 512 × 512 × 3 from a DICOM
dataset, which were converted from the dcm format to the jpg format. Out of 495 images,
396 were used for training while the remaining 99 were utilized for testing purposes.
Adam optimizer with a learning rate set at 0.0002 was used. The experiments were
conducted using TensorFlow 1.15.0, running on the DGX A100 GPU.
The two-stage steganography model was trained with varying payloads ranging from
1 bit per pixel (bpp) to 3 bpp and results are shown in Table 7.1. It can be seen
from Table 7.1 that as the embedding capacity is increased from 1 to 2 bpp, the
message extraction accuracy decreases from 99.58% to 96.34%, the peak signal-to-noise
ratio (PSNR) decreases from 39.27 to 35.77 dB, and the structural similarity index (SSIM)
decreases from 0.9375 to 0.9072. As the embedding capacity is further increased to
3 bpp, these metrics further decrease to 92.52%, 32.66 dB, and 0.8935 respectively. This
decline can be attributed to the fact that as embedded information increases, images
inevitably become noisier, resulting in a reduction in image quality.
The cover image and the corresponding stego image generated by the two-stage
model for 1 bpp embedding capacity are shown in Fig. 7.4. It can be seen by visual
inspection that the quality of stego image is not degraded at 1 bpp embedding capacity
value.
a b
Fig. 7.4 Shows the cover and stego images generated by the two-stage model for visual comparison
Table 7.2 compares the performance results of the two-stage depth balance approach
with other existing GAN-based models using the CelebA dataset. The results
presented in Table 7.2 indicate that the two-stage depth balance approach outperforms
existing deep learning models across all evaluation metrics. It is noteworthy that
deep learning methods often face challenges in achieving high extraction accuracy
as the capacity increases. However, the proposed two-stage depth-balanced method
achieves good extraction accuracy even with higher payloads compared to other
existing deep learning methods.
Table 7.2 Comparison of two-stage depth balanced model with other GAN-based methods using
the CelebA dataset

Method                                  | Capacity (bpp) | Extraction accuracy (%) | Steganalyzer | Steganalyzer error (%)
GSIVAT (Hayes and Danezis [3])          | 0.4            | 100                     | Self-defined | 21
SsteGAN (Wang et al. [4])               | 0.4            | 98.8                    | Self-defined | 40.84
SWE-GAN (Hu et al. [6])                 | 0.0732         | 89.3                    | CRM          | 44
Generative sampling (Zhang et al. [7])  | 0.05           | 91                      | SPAM         | <50
Two-stage depth balanced method         | 1              | 99.98                   | XuNET        | 50
Two-stage depth balanced method         | 2              | 97.68                   | XuNET        | 49
Table 7.3 Comparison of two-stage depth balanced model with hybrid methods using the CelebA
dataset

Method                            | Capacity (bpp) | Steganalyzer | Error rate of steganalyzer (%)
SSGAN (Shi et al. [9])            | 0.4            | Self-defined | 50
CNN-Adv-Emb (Tang et al. [12])    | 0.4            | Qian's NET   | 71
Two-stage depth balanced method   | 1              | XuNET        | 50
Two-stage depth balanced method   | 2              | XuNET        | 49
Hybrid systems blend traditional methods with deep neural networks or GANs to
improve steganography. The performance of the two-stage depth-balanced method is
compared with hybrid methods in Table 7.3. The two-stage depth-balanced method
demonstrates superior performance in terms of capacity. It is worth noting that
existing hybrid methods generally operate at a capacity level of around 0.4 bits per
pixel (bpp). The two-stage depth-balanced method, however, maintains robust security
even when the capacity is increased to more than four times that level.
7.6 Summary
This chapter presented a two-stage depth-balanced GAN approach for image steganography
and compared its performance with other GAN-based models. It was observed that the
two-stage depth-balanced approach produced better results than other GAN-based
steganography methods.
Bibliography
1. J. Mao, Y. Yang, T. Zhang, Empirical analysis of attribute inference techniques in online social
network. IEEE Trans. Netw. Sci. Eng. 8(2), 881–893 (2021). https://doi.org/10.1109/TNSE.
2020.3009864
2. M.A. Wani, B. Sultan, Deep learning based image steganography: a review. Wiley Interdiscip.
Rev. Data Min. Knowl. Discov. 1–26 (2022). https://doi.org/10.1002/widm.1481
3. J. Hayes, G. Danezis, Generating steganographic images via adversarial training. Adv. Neural
Inf. Process. Syst. 1955–1964 (2017)
4. Z. Wang, N. Gao, X. Wang, X. Qu, L. Li, SSteGAN: Self-Learning Steganography Based on
Generative Adversarial Networks, vol. 11302 (Springer International Publishing, LNCS, 2018)
5. M. Yedroudj, F. Comby, M. Chaumont, Yedroudj-Net: an efficient CNN for spatial steganalysis,
in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, vol.
1 (2018), pp. 2092–2096. https://doi.org/10.1109/ICASSP.2018.8461438
6. D. Hu, L. Wang, W. Jiang, S. Zheng, B. Li, A novel image steganography method via deep
convolutional generative adversarial networks. IEEE Access 6(c), 38303–38314 (2018). https://
doi.org/10.1109/ACCESS.2018.2852771
7. Z. Zhang et al., Generative steganography by sampling. IEEE Access 7, 118586–118597 (2019).
https://doi.org/10.1109/ACCESS.2019.2920313
8. W. Tang, S. Tan, B. Li, J. Huang, Automatic steganographic distortion learning using a gener-
ative adversarial network. IEEE Signal Process. Lett. 24(10), 1547–1551 (2017). https://doi.
org/10.1109/LSP.2017.2745572
9. H. Shi, J. Dong, W. Wang, Y. Qian, X. Zhang, SSGAN: Secure Steganography Based on
Generative Adversarial Networks, vol. 10735 (Springer International Publishing, LNCS, 2018)
10. Y. Zhang, W. Zhang, K. Chen, J. Liu, Y. Liu, N. Yu, Adversarial examples against deep neural
network based steganalysis, in IH MMSec 2018 Proceedings of the 6th ACM Workshop on
Information Hiding and Multimedia Security (2018), pp. 67–72. https://doi.org/10.1145/320
6004.3206012
11. I.J. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples, in
3rd International Conference on Learning Representations, ICLR 2015. Conference Track
Proceedings (2015). pp. 1–11
12. W. Tang, B. Li, S. Tan, M. Barni, J. Huang, CNN-based adversarial embedding for image
steganography. IEEE Trans. Inf. Forensics Secur. 14(8), 2074–2087 (2019). https://doi.org/10.
1109/TIFS.2019.2891237
13. J. Tan, X. Liao, J. Liu, Y. Cao, H. Jiang, Channel attention image steganography with generative
adversarial networks. IEEE Trans. Netw. Sci. Eng. 9(2), 888–903 (2022). https://doi.org/10.
1109/TNSE.2021.3139671
14. B. Sultan, M. Arif Wani, A new framework for analyzing color models with generative adver-
sarial networks for improved steganography. Multimed. Tools Appl. (2023). https://doi.org/10.
1007/s11042-023-14348-7
Chapter 8
Two-Stage Generative Adversarial
Networks for Image Steganography
with Multiple Secret Messages
8.1 Introduction
The GAN-based image steganography system for multiple secret messages incor-
porates multiple decoder networks, each dedicated to decoding a distinct secret
message.
The workflow diagram of the GAN-based image steganography process for multiple
secret messages is given in Fig. 8.1.
The cover image (C) is fed into the Generator network’s first stage, where several
transformations are applied. The two secret messages (M1 and M2 ) are combined
with the output of the first stage and provided as input to the generator’s second
stage. The combined input is subjected to a series of transformations in the second
stage, which creates the stego images (S). The stego image is used in the extractor
networks and the steganalyzer network. The task of the extractor1 network is to
recover the first secret message so that M1′ = M1, and the task of the extractor2 network is
to recover the second secret message so that M2′ = M2. The steganalyzer network
distinguishes between the cover image and the stego image. The generator network’s
first stage is trained on the cover image. The generator network’s second stage is
Fig. 8.1 Workflow diagram of image steganography process for multiple secret messages
trained simultaneously against the steganalyzer to learn the two hidden message
embeddings. As a result, the generator network can create a stego image that is
difficult for the steganalyzer network to distinguish it from the cover image, and
the extractor networks can successfully recover the hidden messages from the stego
image. Initially, the extractor networks cannot retrieve the secret messages, while the
steganalyzer network can easily distinguish between cover images and stego images.
As training progresses, the generator learns to create stego images that cannot be
distinguished from the cover images, and the secret messages are fully decoded. The
generator networks proposed here are described in the following section.
Four generator networks are described which differ in the depth of the network
at which the secret message is embedded. These four networks are the conven-
tional embedding generator network, the early embedding generator network, the
mid-embedding generator network, and the late embedding generator network. The
architecture of these four generator networks is discussed below.
The early embedding generator network for multiple secret messages is shown in
Fig. 8.3. Only the FC layer of the generator constitutes the first stage. The second
stage comprises four fractionally strided convolution layers. The first stage of the
network does not use the secret messages M1 and M2 . The cover image (C) is flattened
and input into the first stage of the generator network. The output of the first stage
(FC) is reshaped into 512 feature maps of size 4 × 4. Secret messages M1 and M2 of
size 4096 each are also reshaped into two tensors of size 4 × 4 × 256 to match the
shape of the feature maps. The input for the second stage is prepared by concatenating
these two reshaped tensors (4 × 4 × 256) with the reshaped output of the first stage
(4 × 4 × 512) resulting in 1024 feature maps of size 4 × 4. The concatenated input
Fig. 8.2 Conventional embedding generator network for multiple secret messages
Fig. 8.3 Early embedding generator network for multiple secret messages
is then passed through the layers of the second stage of the generator network to
generate the stego image S of size 64 × 64 × 3.
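The reshape-and-concatenate step of the early embedding network can be expressed directly in PyTorch as below, using the tensor shapes given above; the function name is illustrative.

import torch

def early_embedding_concat(fc_out, m1, m2):
    # fc_out: output of the first stage (FC layer), shape (B, 8192)
    # m1, m2: secret messages of 4096 elements each, shape (B, 4096)
    x = fc_out.view(-1, 512, 4, 4)          # reshape to 512 feature maps of size 4 x 4
    m1 = m1.view(-1, 256, 4, 4)             # each message becomes a 4 x 4 x 256 tensor
    m2 = m2.view(-1, 256, 4, 4)
    return torch.cat([x, m1, m2], dim=1)    # 1024 feature maps of size 4 x 4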
Fig. 8.4 Mid embedding generator network for multiple secret messages
The mid embedding generator network for multiple secret messages is shown in
Fig. 8.4. The FC layer and the first fractionally-strided convolution layer constitute the
first stage, while the remaining three fractionally-strided convolution layers constitute
The mid-embedding generator network for multiple secret messages is shown in Fig. 8.4. The FC layer and the first fractionally-strided convolution layer constitute the first stage, while the remaining three fractionally-strided convolution layers constitute the second stage of the generator network. The cover image C is flattened and passed
through the first stage of the generator network. The output of the first stage is 256
feature maps of size 8 × 8. The secret messages M1 and M2 of size 4096 each are
reshaped into two tensors of size 8 × 8 × 64 to match the output shape of the first
stage. The input for the second stage is prepared by concatenating the two reshaped
tensors with the output of the first stage resulting in 384 feature maps of size 8 ×
8. The concatenated input is then passed through the second stage of the generator
network to generate the stego image S of size 64 × 64 × 3.
The late embedding generator network for multiple secret messages is displayed
in Fig. 8.5. The FC layer and the first two fractionally-strided convolution layers
constitute the first stage while the last two fractionally-strided convolutions constitute
the second stage of the generator network. The cover image (C) is flattened and fed
into the first stage of the generator network. The output of the first stage is 128 feature
maps of size 16 × 16. The secret messages M1 and M2 each of size 4096 are reshaped
into two tensors of size 16 × 16 × 16 each to match the output shape of the first
stage. The input for the second stage is prepared by concatenating these reshaped
tensors with the output of the first stage, resulting in 160 feature maps of size 16 ×
16. This concatenated input is then passed through the second stage of the generator
network to generate the stego image S of size 64 × 64 × 3.
The generator networks employ ReLU activation and batch normalization in all
layers except the last layer, which only uses tanh activation.
Fig. 8.5 Late embedding generator network for multiple secret messages
The architecture of the extractor network is similar to the steganalyzer except that
its fully connected layer has a number of elements equal to the message size. Instead of
sigmoid activation, it employs tanh activation.
The algorithms used for training the networks are discussed in this section.
The generator network receives the cover image and secret messages to be embedded
to produce the stego image. Algorithm 1 gives the steps to generate the stego image.
Algorithm 1. Training the Generator Network
The steganalyzer network’s task is to differentiate between the cover image and the
generated stego image. Algorithm 2 gives the steps for this task.
Algorithm 2. Training the Steganalyzer Network
Input: Cover image C, Stego image S, Secret Message M1, Secret Message M2
Output: Probability value indicating whether the input image is a cover image or a stego image
Each extractor network receives the stego image (S) and outputs the extracted secret
message. Algorithm 3 gives the steps for this task. The algorithm is repeated for each
extractor network.
Algorithm 3 Training the Extractor Networks
Input: Secret Message M1, Secret Message M2, Cover image C, Stego image S
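The following is a minimal sketch of one joint training step for the generator, steganalyzer, and extractor networks (Algorithms 1–3), assuming TensorFlow 2.x / Keras. The tiny stand-in networks and the equal loss weighting are illustrative assumptions; only the Adam learning rate of 0.0002 and the batch size of 32 follow the experimental setup reported in this chapter.

```python
# Sketch of one adversarial training step for the steganography GAN, assuming
# TensorFlow 2.x / Keras. The stand-in networks and loss weighting are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

def tiny_generator():
    """Stand-in two-stage generator: (cover, M1, M2) -> stego image."""
    cover = tf.keras.Input((64, 64, 3))
    m1, m2 = tf.keras.Input((4096,)), tf.keras.Input((4096,))
    x = layers.Concatenate()([layers.Flatten()(cover), m1, m2])
    x = layers.Dense(512, activation="relu")(x)
    stego = layers.Reshape((64, 64, 3))(layers.Dense(64 * 64 * 3, activation="tanh")(x))
    return tf.keras.Model([cover, m1, m2], stego)

def tiny_conv_net(out_units, out_act):
    """Stand-in steganalyzer (1 sigmoid unit) or extractor (4096 tanh units)."""
    img = tf.keras.Input((64, 64, 3))
    x = img
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
    return tf.keras.Model(img, layers.Dense(out_units, activation=out_act)(layers.Flatten()(x)))

generator, steganalyzer = tiny_generator(), tiny_conv_net(1, "sigmoid")
extractor1, extractor2 = tiny_conv_net(4096, "tanh"), tiny_conv_net(4096, "tanh")
bce, mse = tf.keras.losses.BinaryCrossentropy(), tf.keras.losses.MeanSquaredError()
opt_g, opt_s, opt_e = (tf.keras.optimizers.Adam(2e-4) for _ in range(3))  # lr from the chapter

@tf.function
def train_step(cover, m1, m2):
    with tf.GradientTape(persistent=True) as tape:
        stego = generator([cover, m1, m2], training=True)
        p_cover = steganalyzer(cover, training=True)   # steganalyzer labels: cover -> 0
        p_stego = steganalyzer(stego, training=True)   # steganalyzer labels: stego -> 1
        loss_s = bce(tf.zeros_like(p_cover), p_cover) + bce(tf.ones_like(p_stego), p_stego)
        # Extractors: recover each hidden message from the stego image.
        loss_e = mse(m1, extractor1(stego, training=True)) + mse(m2, extractor2(stego, training=True))
        # Generator: fool the steganalyzer, stay close to the cover, keep messages recoverable.
        loss_g = bce(tf.zeros_like(p_stego), p_stego) + mse(cover, stego) + loss_e
    opt_s.apply_gradients(zip(tape.gradient(loss_s, steganalyzer.trainable_variables),
                              steganalyzer.trainable_variables))
    ext_vars = extractor1.trainable_variables + extractor2.trainable_variables
    opt_e.apply_gradients(zip(tape.gradient(loss_e, ext_vars), ext_vars))
    opt_g.apply_gradients(zip(tape.gradient(loss_g, generator.trainable_variables),
                              generator.trainable_variables))
    del tape
    return loss_g, loss_s, loss_e

# Example usage with a random batch of 32 covers and messages scaled to [-1, 1]:
# losses = train_step(tf.random.uniform((32, 64, 64, 3), -1, 1),
#                     tf.random.uniform((32, 4096), -1, 1),
#                     tf.random.uniform((32, 4096), -1, 1))
```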
The experimental design and dataset used here are discussed in this section. The
performance of the proposed models is compared to that of other GANs-based
methods described in the literature.
The CelebA dataset, containing about 200,000 images of celebrity faces, is used here for comparing the proposed models with other models. The images were resized to 64 × 64 for experimental purposes. The proposed models were trained using a set of 60,000 images chosen randomly. With a batch size of 32, the stochastic gradient descent approach was used to train the generator, steganalyzer, and extractor networks. With the learning rate set to 0.0002, the model was optimized using the Adam optimizer. TensorFlow 1.15.0 with a DGX A100 GPU was used to conduct the experiments.
All four generator models were trained for a capacity of 2 bits per pixel (bpp),
allowing 1 bpp for each secret message. The performance of the four generator
models is shown in Table 8.1, which shows the accuracy for the extraction of secret
messages and distortion measurement of these models.
It can be seen from Table 8.1 that the late embedding generator model outperforms
the other three models on all the evaluation metrics. The extraction accuracy is also
significantly increased with the late embedding model compared to the other three
generator models. The late embedding model is capable of embedding 2 bpp without deteriorating the other evaluation metrics. As the capacity is increased further, the extractor is not able to successfully extract the secret messages.
To assess the security of the steganography, the steganalyzer network (Xu’s NET)
was trained with 4000 images, and 2000 images (1000 cover and 1000 stego) were
utilized for validation, while 500 cover and 500 stego images were used for testing.
Table 8.2 lists the steganalyzer error rates of the four generator models. A high error rate denotes higher security because the model can conceal the secret data without being noticed by the steganalyzer. The late embedding generator model has the highest error rate compared to the other models.
Table 8.3 shows the results of changing the embedding capacity from 1 to 3 bpp.
As the embedding capacity is increased from 1 to 2 bpp, the image quality depicted
Table 8.2 Security comparison of the four generator models for 2 bpp
Generator model | # of images misclassified | Error rate of steganalyzer (%)
Conventional embedding generator model | 440 | 44
Early embedding generator model | 470 | 47
Mid embedding generator model | 289 | 28.90
Late embedding generator model | 500 | 50
by PSNR drops slightly from 77.04 to 74.32 but the extraction accuracy and security
are still good. However, when the capacity is increased to 3 bpp, it is observed that all
metrics degrade. This is because as embedded information increases, images become
noisier, lowering the quality of the images.
The visual comparison of cover images and stego images generated by the Late
Embedding Generator model for the capacity of 1 and 2 bpp values is given in
Table 8.4.
Table 8.4 Cover and stego images of late embedding generator model for capacity of 1 and 2 bpp
values
Capacity Cover images Stego images
1 bpp Block 1(a) Block 1(b)
Table 8.5 Performance of late embedding generator model with other steganography methods
Model | Capacity (bpp) | SSIM | PSNR
Multi-image steganography (Sharma et al. [13]) | 0.114 | – | 68.67
RSM (Jarusek et al. [12]) | 0.0015 | 0.9744 | 30
Hiding images within images (Baluja [2]) | 2 | 0.98 | 41.2
Multi-data steganography (Sultan and Wani [3]) | 0.2 | 0.89 | 69.54
Late embedding model | 1 | 0.9658 | 77.04
Late embedding model | 2 | 0.9494 | 74.32
The performance of the late embedding generator model and other steganography
methods described in the literature is given in Table 8.5. It can be seen that the late
embedding generator model outperforms the other steganography methods in terms
of invisibility and capacity metrics.
The performance of the late embedding generator model and deep learning models
described in the literature on the CelebA dataset is given in Table 8.6. It can be seen
that the late embedding generator model outperforms the other deep learning models.
The other deep learning models mostly have problems in achieving good extraction
accuracy with increased capacity.
Table 8.6 Performance of late embedding generator model and deep learning models
Model | Capacity (bpp) | Extraction accuracy | Steganalyzer | Error rate of steganalyzer (%)
GSIVAT (Hayes and Danezis [4]) | 0.4 | 100 | Self-defined | 21
SsteGAN (Wang et al. [8]) | 0.4 | 98.8 | Self-defined | 40.84
SWE-GAN (Hu et al. [11]) | 0.0732 | 89.3 | CRM | 44
Generative sampling (Zhang et al. [1]) | 0.05 | 91 | SPAM | <50
Late embedding generator model | 1 | 99.99 | XuNET | 50
Late embedding generator model | 2 | 97.96 | XuNET | 50
8.5 Summary
This chapter addressed the challenges of embedding multiple secret messages within
a cover image, primarily due to the degradation of stego image quality and the
complexity of accurately extracting the secret messages. Four generator models using
traditional, early, mid, and late embedding were discussed and their performances
were compared. The late embedding generator model produced the best results. The
performance of the late embedding generator model was also compared with other
steganography methods and deep-learning models used for steganography. The late
embedding generator model outperformed all these models.
Bibliography
9.1 Introduction
Protein secondary structure prediction (PSSP) is a key area of study within compu-
tational biology and bioinformatics. It is essentially a sequence-to-sequence classi-
fication problem where the input is a protein’s amino acid sequence, and the output
is the projected secondary structure (SS) sequence. Each residue in the protein is
assigned a secondary structure label. Over the past several decades, this area of
research has grown significantly, making substantial strides in accuracy. The impor-
tance of PSSP lies in its ability to enhance our understanding of protein structure
and function and its pivotal role in predicting protein tertiary structure. After all,
SS elements significantly influence protein stability, folding, and interaction with
other molecules. Although protein structures can be manually determined in exper-
imental labs using various techniques like X-ray crystallography, this process can
be slow, difficult, and costly. Given the exponential growth in the number of protein
sequences, there is an evident gap as the number of solved structures is considerably
lower. As of January 2023, the number of determined structures according to PDB
statistics was 203,084. This gap necessitates the use of computational methods as an
alternate route for predicting protein structures. While significant research has been
devoted to predicting the three-dimensional (3D) structures of proteins from primary
sequences, this task remains one of the most challenging in computational biology
and bioinformatics. Therefore, significant focus has been to develop methods to
predict certain aspects of the structure with increasing accuracy, such as identifying
regular spatial arrangements of amino acids like alpha helices and beta sheets.
PSSP is an important area of study in protein research due to its impact on stability
and function. It has garnered considerable attention as it serves as a crucial stepping
stone in predicting 3D structures. The objective of this prediction is to infer the SS
of proteins, given the knowledge of their primary structure, i.e., their amino acid
sequence. The assembly and organization of SSs within the protein significantly
influence its final conformation. The general outline of the PSSP problem can be
viewed in the graphical abstract depicted in Fig. 9.1.
To pave the way for the development of AI-based prediction methods, it is of
utmost importance to meticulously annotate protein sequences with their respective
SSs. These SSs are derived experimentally utilizing a multitude of techniques. Of
these, the Dictionary of SS of Proteins (DSSP) is preeminent, serving as the standard
tool for attributing SS annotations to amino acids. This algorithm employs hydrogen
bond energy calculations and assigns eight states (Q8), which are commonly reduced
to three states (Q3) (Kabsch and Sander 1983) (see Table 9.1). DSSP remains the standard
technique for assigning SSs to amino acids using experimental 3D coordinates.
Protein SSs can be used to provide valuable information for predicting the protein’s
three-dimensional structure. Moreover, it can help to identify specific regions within
the protein that are important for its function, which can be used to guide experiments
where specific amino acids are mutated to study the impact on the protein’s function.
Thus, accurately predicting the SS of a protein can have a significant impact on
predicting its three-dimensional structure and solvent accessibility (how easily it can
be accessed by other molecules). Remarkably, protein SSs reduce the degrees of
freedom in a protein, which helps to limit the possible ways in which the protein can
fold, leading to more accurate predictions of the protein's final, three-dimensional
structure. Three-state SS prediction categorizes residues into alpha-helix, beta-sheet,
or coil, providing a simplified view. In contrast, eight-state prediction captures more
detailed structural features within alpha-helices and beta-sheets, offering a higher-
resolution representation. Figure 9.2 illustrates the 3-state and 8-state SS of the 1AKD
protein from the Protein Data Bank (PDB) (Berman 2002).
Deep learning has had a significant impact on PSSP, offering improved accuracy
and speed. Traditional methods relied on statistical and machine learning techniques.
Fig. 9.2 3-State and 8-state secondary structures; Q3 (left), Q8 (right) of 1AKD protein
For example, BRCA1 and BRCA2 gene mutations amplify the likelihood of breast and ovarian cancers,
whereas Lynch syndrome genes are associated with escalated risks of colorectal and
various other cancers. Alterations in gene expression can sometimes modify protein
synthesis patterns, consequently disrupting normal cellular functions. Such aberrant
cells undergo rapid division, resulting in tumorous growths within the affected region.
Therefore, delineating these genetic alterations is pivotal as they hold the promise
for targeted therapeutic strategies.
Historically, cancers were classified based on their anatomical origin. However, with
the advent of molecular biology and genomics, there’s a paradigm shift towards a
more holistic classification: the pan-cancer classification. This innovative approach
seeks to categorize cancers based on their genetic and molecular signatures, sidelining
the conventional organ-centric view. Such a classification offers profound insights,
illuminating shared molecular themes across different cancer types, thus refining
therapeutic strategies and bolstering personalized medicine’s efficacy.
The monumental Pan-Cancer Atlas project, spearheaded by TCGA, sought to
meticulously analyze and chronicle the genetic similarities and disparities across
diverse cancer types. With the advent of Next Generation Sequencing, the scrutiny
of human genomics has reached unprecedented levels of precision and efficiency.
TCGA harnessed this power to sequence an impressive repertoire of tumor tissues,
eventually analyzing over 11,000 tumors from 33 predominant cancer forms. The
culmination of these Herculean efforts is the Pan-Cancer Atlas, a goldmine of infor-
mation for researchers worldwide. However, the voluminous data generated, espe-
cially by whole-genome sequencing, presents an analytical challenge. Traditional
manual and experimental methodologies for cancer classification are not only time-
consuming but are susceptible to human error, especially in the backdrop of the
multifactorial nature of cancer. As cancer research evolves, the complexity of data
interpretation escalates, rendering manual methods less feasible and more error-
prone. Consequently, there’s a pressing need for sophisticated prediction methods
for pan-cancer classification. Here, deep learning, a subset of machine learning,
emerges as a promising solution. By training on expansive genetic and molecular
datasets, deep learning algorithms can classify new samples with remarkable accu-
racy. These algorithms can discern intricate patterns in gene expression data, often
elusive to manual inspection, paving the way for an enhanced and precise pan-cancer
classification paradigm.
The integration of computational techniques in pan-cancer classification has
emerged as a promising avenue, with the potential to refine clinical outcomes. This
is achieved by facilitating enhanced diagnostic precision, tailoring treatments, and
unearthing novel therapeutic avenues. Consequently, a myriad of machine learning
(ML) methodologies has been employed to address challenges related to cancer
detection and prognosis using gene-expression data. However, these methodologies have notable limitations.
Brain tumors pose a substantial challenge in the medical field due to their intri-
cate and potentially life-threatening nature. In the United States alone, the diagnosis
of malignant brain tumors in 2022 reached 25,050 cases, highlighting the severity
of this condition. Despite being a relatively small subset among primary central
nervous system (CNS) tumors, brain tumors account for a significant percentage,
ranging from 85–90%, emphasizing their clinical importance. On a global scale,
an estimated 308,102 individuals were diagnosed with primary brain or spinal cord
tumors in 2020, with around 4,170 cases identified among children under 15 years
old, accentuating the widespread impact of this medical issue. Hence, early detec-
tion of brain tumors plays a pivotal role in improving treatment outcomes and patient survival.
In the realm of Deep Learning (DL) applications within healthcare and computa-
tional biology, we’re confronted with a series of challenges that span across protein
secondary structure prediction (PSSP), pan-cancer classification using gene expres-
sion data, and brain tumor prediction through MRI scans analysis. These challenges
underscore the complexity and dynamic nature of biological systems and the need
for advanced computational methodologies to decipher them effectively.
A primary challenge in the field is the accurate prediction and classification of
protein secondary structures (PSSP) from amino acid sequences. This task is compli-
cated by the vast diversity and complexity of proteins, making it difficult to capture
the intricate patterns and interactions that dictate protein folding and function. Devel-
oping hybrid deep learning architectures that can integrate multiple types of data and
leverage the strengths of various DL models is crucial for improving prediction
accuracy and understanding protein dynamics on a deeper level. In the domain of
pan-cancer classification, a significant hurdle is the heterogeneity and high dimen-
sionality of cancer genomic data. Traditional DL models often struggle to cope
with the vast array of genetic alterations that characterize different cancer types. An
ensemble approach that combines multiple DL models could potentially enhance
the ability to classify cancers more accurately. However, this introduces the chal-
lenge of effectively integrating diverse models to handle the complexities of cancer
genomics, requiring innovative strategies to balance the contributions of different
models and manage the computational complexity. For brain tumor prediction using
MRI scans, the challenge lies in processing and analyzing high-resolution imaging
data efficiently while maintaining high accuracy in tumor detection and classification.
The adoption of transfer learning with lightweight architectures presents a promising
avenue, yet optimizing these models for medical imaging tasks and ensuring they are
adaptable to the nuances of brain tumor characteristics requires careful consideration.
This challenge is compounded by the need for models that can operate efficiently on
limited computational resources, making them accessible for widespread clinical use.
9.6 Summary
In this chapter, we have delved into the multifaceted applications of deep learning
within the realms of bioinformatics and medical diagnostics. Through a targeted
exploration, we have highlighted the profound impact of deep learning in addressing
critical challenges across three distinct areas: protein secondary structure prediction,
pan-cancer classification using gene expression data, and brain tumor prediction
through the analysis of MRI scans. The first application focuses on leveraging deep
learning techniques for protein secondary structure prediction, demonstrating the
ability to unravel complex sequence data with high accuracy. Next, we explored
the realm of pan-cancer classification, where deep learning algorithms analyze
gene expression data to classify different cancer types, showcasing the potential
for personalized medicine and treatment strategies. Finally, we examined the use of
deep learning in brain tumor prediction via the analysis of MRI scans, highlighting
its role in aiding clinicians in early detection and treatment planning.
Each of these case studies represents a distinct facet of the intersection between
deep learning, bioinformatics, and medical diagnostics. In the subsequent chapters,
we delve deeper into each of these applications, showcasing the specific deep learning
methodologies employed and how they have contributed to improving performance
metrics such as accuracy, sensitivity, and specificity. Through these detailed explo-
rations, we aim to underscore the nuanced contributions of deep learning in revo-
lutionizing the field of bioinformatics and medical diagnostics, paving the way for
more efficient and effective disease detection, diagnosis, and treatment.
Bibliography
1. B. Alberts et al., Molecular biology of the cell. Biochem. Mol. Biol. Educ. 36(4), 317–318
(2008). https://doi.org/10.1002/bmb.20192
2. H.M. Berman, The protein data bank. Acta Crystallogr. Sect. D Biol. Crystallogr. 58(6 I), 899–907
(2002). https://www.rcsb.org
3. Q. Jiang, X. Jin, S.J. Lee, S. Yao, PSSP: a survey of the state of the art. J. Mol. Graph. Model.
76, 379–402 (2017). https://doi.org/10.1016/j.jmgm.2017.07.015
4. Evaluation of machine learning algorithm utilization for lung cancer classification based on gene
expression levels. Asian Pac. J. Cancer Prev. 17(2), 835–838. https://doi.org/10.7314/APJCP.
2016.12.835
5. Z. Wang, M.A. Jensen, J.C. Zenklusen, A practical guide to the cancer genome atlas (TCGA), in
Statistical Genomics (Humana Press, New York, 2016), pp. 111–141. https://doi.org/10.1007/
978-1-4939-3578-9_6
6. E.H. Yau, I.R. Kummetha, G. Lichinchi, R. Tang, Y. Zhang, T.M. Rana, Genome-wide CRISPR
screen for essential cell growth mediators in mutant KRAS colorectal cancers. Can. Res. 77(22), 6330–6339 (2017). https://
doi.org/10.1158/0008-5472.can-17-2043
7. M.A. Sofi, M.A. Wani, IRNN-SS: deep learning for optimised protein secondary structure predic-
tion through PROMOTIF and DSSP annotation fusion. Int. J. Bioinfor. Res. Appl. 20(6), 608-626
(2024).
Chapter 10
Selected Deep Learning Architectures
for Medical Applications
10.1 Introduction
Deep Learning has become a highly popular technology in the last decade due to
its ability to process massive amounts of data with state-of-the-art computational
resources. Deep Learning has tremendous potential to be used in medical appli-
cations. Architectures such as Convolutional neural networks (CNN) and Recurrent
neural networks (RNN) have achieved significant success in tasks such as image clas-
sification, text classification, and sequence-to-sequence classification (Jurtz et al.
2017). In this chapter, we discuss selected deep learning architectures that are
commonly used for sequence and image data. These models were selected because
they have been shown to be particularly effective in learning complex relationships
present in datasets related to medical applications.
A convolutional neural network (CNN) is a popular deep artificial neural network
made up of learnable weights and biases. CNNs have been specifically
designed for visual tasks such as object recognition and image classification. A
CNN comprises one or more convolutional
layers followed by one or more fully connected layers, as in a standard
multilayer neural network. It receives an input and the features from that input are
extracted through the convolution operation (Saha 2018). The convolution operation
is mathematically defined as

$$ p(x) = (w * y)(x) = \sum_{a=-s}^{s} w(a)\, y(x - a) $$
For a 1D input sequence, the output of the convolution at position i can be written as

$$ Z_i = f\left(W * a_{i:i+k-1}\right) + b $$

where '∗' denotes the convolution operation, W is the weight (kernel) matrix, $a_i$ denotes the element at the i-th position, k is the 1D kernel size, and b is the bias term.
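A small NumPy sketch of this 1D convolution is given below, using ReLU as the activation f; the kernel, bias, and input values are illustrative only.

```python
# Valid 1D convolution Z_i = f(W * a_{i:i+k-1}) + b with ReLU as f; toy values only.
import numpy as np

def conv1d(a, W, b, f=lambda z: np.maximum(z, 0.0)):
    """Slide kernel W over sequence a, apply activation f, then add bias b."""
    k = len(W)
    return np.array([f(np.dot(W, a[i:i + k])) + b for i in range(len(a) - k + 1)])

a = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # input sequence
W = np.array([0.5, 0.0, 0.5])             # kernel of size k = 3
print(conv1d(a, W, b=0.1))                # -> [1.1 2.1 3.1]
```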
Multilayer perceptrons and convolutional neural networks have been widely used
for vector and image data. However, for natural language processing, sequence,
and time series data, the sequence information is extremely important, and conventional
techniques such as Bag of Words, TF-IDF, etc., completely discard
this sequence information. Therefore, the core idea was to develop a new type of
neural network that gives importance to the sequence information and leverages
it to perform better than non-sequence-based approaches.
Recurrent Neural Networks (RNNs) (Medsker 2001) are a popular technique built from
repeating structures that use the sequence information for training the models.
Over the years, RNNs have received immense attention and have been widely applied
to time series prediction, machine translation, speech recognition, and
other sequence-based problems. Instead of breaking the input into intervals
or windows, RNNs work on the entire input at once. In addition, RNNs can
efficiently handle input data of varying lengths. A simple representation of an RNN
is shown in Fig. 10.3.
To connect the input vectors $x_{i1}, x_{i2}, x_{i3}, \ldots, x_{it}$ to the activation unit (see
Fig. 10.3), the same weight is used in this repeated structure. Similarly, to use the
sequence information, we have another weight matrix W which connects the output
from a previous layer as an input to the next layer. After processing the whole
sequence, a softmax or logistic layer is used to get the final output. The weight
matrix W used at the softmax layer is a $d \times i$ dimensional matrix, where d is the dimensionality
of the input vector and i represents the number of output classes. In the case of
a simple RNN, the output at time step t can be written as

$$ o_t = f\left(w\, x_t + w'\, o_{t-1}\right) $$

where f is the activation unit, $o_t$ is the output, w and w′ are the weight matrices, and
$o_{t-1}$ represents the output of the previous layer.
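A minimal NumPy sketch of this recurrence, with tanh as the activation and the same weights reused at every time step; dimensions and values are illustrative.

```python
# Simple RNN forward pass o_t = f(w x_t + w' o_{t-1}) with tanh activation.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, T = 4, 3, 5                             # input size, hidden size, sequence length
w  = rng.normal(scale=0.5, size=(d_hidden, d_in))       # input-to-hidden weights
wp = rng.normal(scale=0.5, size=(d_hidden, d_hidden))   # hidden-to-hidden weights (w')

x = rng.normal(size=(T, d_in))              # an input sequence x_1 ... x_T
o = np.zeros(d_hidden)                      # o_0, the initial output/state
for t in range(T):
    o = np.tanh(w @ x[t] + wp @ o)          # the same weights are reused at every time step
print(o)                                    # output after processing the whole sequence
```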
Simple RNNs have a major limitation known as the vanishing gradient problem,
which occurs when gradients become very small during back-propagation, making
it difficult to learn long-term dependencies (Bengio et al. 2014). In addition, simple
RNNs can only use information from the past to predict the future, which limits
their ability to capture complex patterns in sequential data. In problems that require
knowledge of the elements preceding and following the current element, simple or
unidirectional Recurrent Neural Networks (RNNs) have limitations as they can only
move forward in the sequence of input elements (Schuster 1997). Bidirectional RNNs
(BRNNs) have been developed to address this issue, enabling the capture of informa-
tion in both the forward and reverse directions. The structure of a unidirectional RNN
and BRNN is illustrated in Fig. 10.5a and b, with BRNNs allowing for the integra-
tion of context-dependent information in both forward and backward directions. By
utilizing two time directions, BRNNs are able to extract relevant information from
input sequences and have been utilized in a variety of applications, such as speech
recognition and natural language processing.
Although recurrent neural networks can handle sequence information efficiently,
they suffer from the vanishing and exploding gradient problems. The main reason
is not that the network has a huge number of layers, but that the derivatives
with respect to the weights involve long chains of multiplied partial derivatives as
forward and backward propagation are performed over time, which makes the gradients
vanish or explode (Hochreiter et al. 1997). Therefore, to avoid the
vanishing and exploding gradients of RNNs, several modified and enhanced RNN
methods such as Long Short-Term Memory networks (LSTM) and Gated Recurrent
Units (GRU) have been developed.
Long Short-Term Memory (LSTM) networks are a type of Recurrent Neural Network
(RNN) architecture that are specifically designed to overcome the vanishing gradient
problem, which can occur in standard RNNs. Gradient magnitudes are influenced
by two main factors: weights and activation functions, particularly their deriva-
tives. When either factor is less than 1, gradients can diminish over time (vanishing
gradients), whereas values greater than 1 can cause gradients to grow exponentially
(exploding gradients). For example, the tanh activation function has a derivative less
than 1 for all inputs except zero, and the sigmoid function’s derivative is always less
than or equal to 0.25, exacerbating the vanishing gradient problem. LSTM networks
mitigate these issues with their unique architecture. In the recurrence of LSTM cells,
the activation function is the identity function, which has a derivative of 1.0. This
ensures that the backpropagated gradient remains constant, avoiding both vanishing
and exploding gradients. The effective weight of the recurrent connection in LSTM
is controlled by the forget gate, whose activation typically ranges between 0 and 1.
If the forget gate activation is close to 1.0, the gradient does not vanish. Since the
forget gate activation is never greater than 1.0, the gradient cannot explode either.
Thus, the LSTM architecture effectively stabilizes gradient flow during training.
LSTMs are particularly useful for processing and making predictions on sequen-
tial data, such as time series data, speech, and text. The main advantage of LSTMs
over standard RNNs is their ability to selectively remember or forget information
from previous time steps, which makes them well-suited for modeling long-term
dependencies in sequential data. They do this by incorporating memory cells, which
allow the network to selectively store and retrieve information over multiple time
steps. The architecture of LSTM cell is shown in Fig. 10.6 (source Zarzycki et al.
2021). LSTM architecture is able to handle short-term as well as long-term depen-
dencies efficiently because of short circuit connections (Christopher 2015; Gers et al.
2002).
In the LSTM unit, the calculations are performed sequentially using the following
equations.
The Forget Gate ($f_t$) in the LSTM unit decides what fraction of the previous cell
state $s_{t-1}$ should be forgotten and is computed using the following equation.

$$ f_t = \sigma\left(W_f a_t + r_f s_{t-1} + q_f\right) $$
The Input Gate ($i_t$) determines how much new information to add to the cell state
from the current input ($a_t$) and is computed using the following equation.

$$ i_t = \sigma\left(W_i a_t + r_i s_{t-1} + q_i\right) $$
Fig. 10.6 Architecture of LSTM with 3 gates; $f_t$ represents the forget gate, $i_t$ is the input gate, and $o_t$
is the output gate. $g_t$ is the candidate cell state. $W_f$, $W_i$, $W_g$, $W_o$ are the weight matrices associated
with the input data; $r_f$, $r_i$, $r_g$, $r_o$ are the recursive weights; $q_f$, $q_i$, $q_g$, and $q_o$ are the bias values
The Cell gate generates new candidate values ($g_t$) to potentially add to the cell
state and is computed using the following equation.

$$ g_t = \tanh\left(W_g a_t + r_g s_{t-1} + q_g\right) $$
The new cell state ($s_t$) is updated by combining the retained cell state and the new
candidate values as shown in the following equation, where the symbol ∘ indicates the element-wise product.

$$ s_t = f_t \circ s_{t-1} + i_t \circ g_t $$
The output gate ($o_t$) decides the part of the cell state to output based on the current
input and the previous state.

$$ o_t = \sigma\left(W_o a_t + r_o s_{t-1} + q_o\right) $$
The current hidden state ($h_t$) is computed by applying the output gate to the cell
state using the following equation.

$$ h_t = o_t \circ \tanh(s_t) $$
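A NumPy sketch of a single LSTM step following the equations above is given below. Note that, as in the equations, the recurrent term uses the previous cell state $s_{t-1}$; standard LSTM formulations use the previous hidden state $h_{t-1}$ there instead. Shapes and random values are illustrative.

```python
# One LSTM step following the chapter's equations; shapes and values are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_t, s_prev, P):
    """a_t: current input, s_prev: previous cell state, P: weights W_*, r_*, biases q_*."""
    f_t = sigmoid(P["Wf"] @ a_t + P["rf"] @ s_prev + P["qf"])   # forget gate
    i_t = sigmoid(P["Wi"] @ a_t + P["ri"] @ s_prev + P["qi"])   # input gate
    g_t = np.tanh(P["Wg"] @ a_t + P["rg"] @ s_prev + P["qg"])   # candidate cell values
    s_t = f_t * s_prev + i_t * g_t                              # updated cell state
    o_t = sigmoid(P["Wo"] @ a_t + P["ro"] @ s_prev + P["qo"])   # output gate
    h_t = o_t * np.tanh(s_t)                                    # current hidden state
    return s_t, h_t

# Example with input size 4 and state size 3:
rng = np.random.default_rng(0)
P = {f"W{g}": rng.normal(size=(3, 4)) for g in "figo"}
P.update({f"r{g}": rng.normal(size=(3, 3)) for g in "figo"})
P.update({f"q{g}": np.zeros(3) for g in "figo"})
s_t, h_t = lstm_step(rng.normal(size=4), np.zeros(3), P)
```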
Gated Recurrent Units (GRUs) are a type of Recurrent Neural Network (RNN)
architecture that was developed as an alternative to LSTMs (Cho et al. 2014). GRUs
have several advantages over LSTMs. First, they have a simpler architecture with
fewer parameters, which makes them faster to train and less prone to overfitting.
Second, they have been found to perform well in a variety of sequence modeling
tasks, such as speech recognition, natural language processing, and image captioning,
among others. The architecture of GRU is shown in Fig. 10.7 (source Zarzycki et al.
2021).
In the GRU unit, the calculations are performed sequentially using the following
equations. The reset gate $r_t$ determines how much of the previous hidden state $s_{t-1}$
to forget and is computed using the following equation.

$$ r_t = \sigma\left(W_r a_t + r_r s_{t-1} + q_r\right) $$
Fig. 10.7 Architecture of GRU; $r_t$ is the reset gate, $g_t$ is the candidate-state gate, and $z_t$ is the
update gate. The biases for the reset and update gates are represented by $q_r$ and $q_z$. $s_t$ is the current hidden
state and $a_t$ is the current input. $W_r$, $W_z$, $W_g$ are the weight matrices associated with the reset gate, update gate,
and candidate hidden state; $r_r$, $r_z$ are the recursive weights
The update gate $z_t$ controls the degree to which the previous hidden state $s_{t-1}$ is
retained in the current state. It is computed using the current input $a_t$, the previous
hidden state $s_{t-1}$, and a bias term $q_z$.

$$ z_t = \sigma\left(W_z a_t + r_z s_{t-1} + q_z\right) $$
The Candidate Hidden State: the candidate hidden state $g_t$ provides new potential
values for the current hidden state. It is influenced by the current input $a_t$, the previous
hidden state $s_{t-1}$ adjusted by the reset gate, and the weight matrix $W_g$.

$$ g_t = \tanh\left(W_g a_t + r_g \left(r_t \circ s_{t-1}\right)\right) $$
The hidden state $s_t$ is updated by blending the previous hidden state $s_{t-1}$ and the
candidate hidden state $g_t$, and is computed using the following equation.

$$ s_t = (1 - z_t) \circ s_{t-1} + z_t \circ g_t $$
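A corresponding NumPy sketch of a single GRU step following the equations above ($a_t$ is the current input and $s_{t-1}$ the previous hidden state); shapes and values are illustrative, and the candidate state is taken without a bias term, as in the equation.

```python
# One GRU step following the chapter's equations; shapes and values are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(a_t, s_prev, P):
    r_t = sigmoid(P["Wr"] @ a_t + P["rr"] @ s_prev + P["qr"])   # reset gate
    z_t = sigmoid(P["Wz"] @ a_t + P["rz"] @ s_prev + P["qz"])   # update gate
    g_t = np.tanh(P["Wg"] @ a_t + P["rg"] @ (r_t * s_prev))     # candidate hidden state
    return (1.0 - z_t) * s_prev + z_t * g_t                     # new hidden state

rng = np.random.default_rng(0)
P = {f"W{g}": rng.normal(size=(3, 4)) for g in "rzg"}
P.update({f"r{g}": rng.normal(size=(3, 3)) for g in "rzg"})
P.update({"qr": np.zeros(3), "qz": np.zeros(3)})
s_t = gru_step(rng.normal(size=4), np.zeros(3), P)
```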
Compared with the LSTM, the GRU is a simplified model, developed in 2014, mainly used to train deep recurrent neural networks (Cho et al.
2014). It has two gates: the reset gate ($r_t$) and the update gate ($z_t$). The update gate controls how much
information from the previous time step should be passed to the current time step,
while the reset gate controls how much of the previous hidden state should be
forgotten. A GRU is about as powerful as an LSTM but is most often faster to train, as it
has fewer equations and parameters than an LSTM.
The remarkable success of deep neural networks (DNNs) in artificial intelligence
has fueled a growing desire to deploy these networks in resource-constrained devices,
such as mobile phones and edge devices. These devices require more compact and
efficient models for practical use as these devices are inherently constrained by
limited computational power, storage capacity, and energy resources, all of which
pose substantial obstacles to the seamless deployment of DNNs, especially in the
context of Internet of Things (IoT) and various on-device applications at the edge.
Against this backdrop, light-weight deep learning emerges as an attractive solution.
These new, streamlined architectures are characterized by their enhanced computa-
tional efficiency. Unlike their larger counterparts that comprise millions or billions
of parameters, light-weight architectures are less demanding in terms of processing
power. This makes them an ideal choice for devices with limited computational capa-
bilities, broadening the scope of where and how advanced AI models can be utilized.
Furthermore, light-weight architectures address the substantial memory requirements
of traditional deep learning models. Their compact nature allows them to be stored
and operated on devices such as mobile phones and Internet of Things (IoT) sensors,
which typically have limited storage capacity. This attribute enhances their practical
applicability in today’s increasingly mobile and connected world. Their versatility,
efficiency, and compactness make them particularly well-suited for tasks that require
processing substantial amounts of data in real time, while also being constrained by
device capabilities. Among such tasks, image classification and object detection
stand out as prominent examples. These tasks typically involve processing high-
resolution images and making rapid decisions, making them prime candidates for
the application of light-weight architectures.
Fig. 10.8 Architecture of depth-wise separable convolutional neural networks consisting depth-
wise convolution followed by pointwise convolution
Fig. 10.9 Architecture of depth-wise convolution. $D_n$ is the dimension of the input, $D_k$ is the dimension
of the filter (both width and height), and $D_m$ is the dimension of the output. C represents the number of
channels in the input, and z is the number of filters (typically z = C in depth-wise convolution)
For typical CNN inputs, the 1 × 1 convolution operation is executed across all C chan-
nels, resulting in C multiplications for every spatial location. The process is further
generalized by applying N such 1 × 1 filters, thereby enabling the transformation of
the channel dimensions from C to N. Each filter produces an output channel, making
the depth of the output N. This entails a computational cost of C multiplications per
spatial position. When extended over the entire spatial grid and across all N filters, the
cumulative computational overhead is succinctly captured by the following formula:
$$ C \times D_p^2 \times N $$

where N represents the number of filters, $D_p$ denotes the dimension of the output, and
C is the number of channels.
The depth-wise separable convolution is named as such because the depth-wise
convolution applies to the depth of the input (the channels), and the pointwise convo-
lution separately applies to the output of the depth-wise convolution. This separation
of operations reduces the computational complexity while still maintaining a high
level of model performance.
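As a concrete illustration, the following is a minimal Keras sketch (assuming TensorFlow 2.x) of a depth-wise separable convolution block, i.e., a depth-wise convolution followed by a 1 × 1 pointwise convolution as in Fig. 10.8; the input size, filter count, and kernel size are illustrative assumptions.

```python
# Depth-wise separable convolution: depth-wise conv followed by 1x1 pointwise conv.
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input((32, 32, 3))                                 # C = 3 input channels
x = layers.DepthwiseConv2D(kernel_size=3, padding="same")(inputs)    # one 3x3 filter per channel
x = layers.Conv2D(filters=16, kernel_size=1)(x)                      # pointwise: 3 -> 16 channels
model = tf.keras.Model(inputs, x)

# The same pattern is also available as a single fused layer:
# layers.SeparableConv2D(filters=16, kernel_size=3, padding="same")
```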
Group convolution and channel shuffle are techniques that have been introduced
in the deep learning domain, especially in the design of convolutional neural
networks (CNNs), to optimize computational efficiency and model accuracy. Intro-
duced with AlexNet and popularized by models like ResNeXt, group convolution is
a method of partitioning the input and output channels of a convolutional layer into
smaller groups, and then convolving each group separately. This can greatly reduce
computational cost.
For a standard convolution with $C_{in}$ input channels and $C_{out}$ output channels, each output position requires on the order of $C_{in} \times C_{out}$ channel-pair multiplications. With group convolution, where the channels are split into G groups, each group requires

$$ \frac{C_{in}}{G} \times \frac{C_{out}}{G} $$

such operations, so the total over all G groups is $\frac{C_{in} \times C_{out}}{G}$, which is 1/G of the original cost if each group has the same number of channels.
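To make the savings concrete, the following small Python sketch reproduces the per-layer multiplication counts from the formulas above and from Table 10.1 (Dk = 3, Dm = 10, C = 3, N = 16, G = 2); the helper function names are illustrative.

```python
# Quick arithmetic check of the multiplication counts summarized in Table 10.1.
def standard(Dk, Dm, C, N):        return N * Dk**2 * Dm**2 * C
def depthwise(Dk, Dm, C):          return Dk**2 * Dm**2 * C
def pointwise(Dm, C, N):           return Dm**2 * C * N
def grouped(Dk, Dm, Cin, Cout, G): return (Cin * Cout * Dk**2 * Dm**2) // G

Dk, Dm, C, N, G = 3, 10, 3, 16, 2
print(standard(Dk, Dm, C, N))      # 43200
print(depthwise(Dk, Dm, C))        # 2700
print(pointwise(Dm, C, N))         # 4800
print(grouped(Dk, Dm, C, N, G))    # 21600
```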
Channel shuffle is a technique introduced with ShuffleNet. It’s designed to allow
group convolutions to learn from information across different channel groups. After
group convolution, channels in one group might only be able to communicate with
channels in the same group. Channel shuffle aims to remedy this by reorganizing the
channels so that subsequent layers can mix information across the original groups.
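A minimal sketch of the channel-shuffle operation, shown here in NumPy for clarity: the channel axis is reshaped into (groups, channels per group), the two axes are transposed, and the result is flattened back; the shapes and values are illustrative.

```python
# Channel shuffle (as used in ShuffleNet): interleave channels across groups.
import numpy as np

def channel_shuffle(x, groups):
    """x has shape (batch, height, width, channels); channels must be divisible by groups."""
    b, h, w, c = x.shape
    x = x.reshape(b, h, w, groups, c // groups)   # split channels into groups
    x = x.transpose(0, 1, 2, 4, 3)                # swap the group and per-group axes
    return x.reshape(b, h, w, c)                  # flatten back to the original layout

x = np.arange(8).reshape(1, 1, 1, 8)              # channels 0..7, two groups of four
print(channel_shuffle(x, groups=2).ravel())       # -> [0 4 1 5 2 6 3 7]
```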
Deep learning, while revolutionary, grapples with inherent limitations within indi-
vidual models such as convolutional neural networks (CNNs) and recurrent neural
networks (RNNs), which hinder their effectiveness across diverse data types. CNNs
struggle to efficiently capture long-term dependencies crucial in sequence or genomic
data, while RNNs face challenges such as vanishing gradients and difficulty in
retaining information over long sequences. Moreover, the high computational cost
associated with training and deploying deep learning models exacerbates scalability
issues, necessitating more efficient algorithms and hardware solutions. Additionally,
slow convergence during training and difficulties in parameter tuning impede the
rapid deployment of deep learning solutions in real-world scenarios, highlighting
the need for accelerated learning techniques and automated optimization strategies.
Future research endeavors can focus on overcoming these challenges by charting
innovative pathways to enhance the efficacy and scalability of deep learning tech-
nologies. Combining multiple deep learning models in a hybrid or ensemble approach
efficiently can harness the complementary strengths of individual models, improving
performance across various tasks and data types. Automated hyperparameter opti-
mization techniques and novel architectures tailored to specific applications can
streamline the parameter tuning process and enhance model interpretability. More-
over, advancements in hardware accelerators and distributed computing frameworks
can alleviate the computational burden associated with deep learning, enabling
more widespread adoption and deployment in resource-constrained environments.
By addressing these challenges and leveraging emerging technologies, the field of
deep learning can continue to drive transformative advancements in artificial intelli-
gence, paving the way for more robust and scalable solutions to complex real-world
problems.
Table 10.1 Comparison of various convolution types against filter size and total number of multiplications for the input parameters (Dk = 3, Dm = 10, C = 3, N = 16, G = 2, Cin = 3, Cout = 16)

Convolution type | Description | Filter size | Total multiplications | Example
Standard | Uses multiple filters that process all input channels together. Each filter is applied across all channels, capturing cross-channel features | Dk × Dk × C | N × (Dk)² × (Dm)² × C | 16 × 3² × 10² × 3 = 43,200
Depth-wise | Each input channel is processed independently with its own filter. Reduces the number of multiplications by handling channels separately | Dk × Dk × 1 | (Dk)² × (Dm)² × C | 3² × 10² × 3 = 2,700
Pointwise | Uses 1 × 1 filters to combine information across channels after depth-wise convolution. Adds flexibility and reduces dimensionality | 1 × 1 × C | (Dm)² × C × N | 10² × 3 × 16 = 4,800
Group | Divides input channels into groups and performs convolutions independently within each group, balancing between computational efficiency and cross-channel feature learning | Dk × Dk × (Cin/G) | G × (Cin/G) × (Cout/G) × (Dk)² × (Dm)² = (Cin × Cout × (Dk)² × (Dm)²)/G | (3 × 16 × 3² × 10²)/2 = 21,600
10.9 Summary
Bibliography
1. P. Adarsh, P. Rathi, M. Kumar, YOLO v3-Tiny: object detection and recognition using one
stage improved model, in 2020 6th International Conference on Advanced Computing and
Communication Systems (ICACCS) (2020, March), pp. 687–694. IEEE
2. F. Chollet, Xception: Deep learning with depthwise separable convolutions, in Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 1251–1258.
3. F.N. Iandola, S. Han, M.W. Moskewicz, K. Ashraf, W.J. Dally, K. Keutzer, SqueezeNet: AlexNet-
level accuracy with 50x fewer parameters and <0.5 MB model size (2016). arXiv:1602.07360
4. D. Sinha, M. El-Sharkawy, Thin mobilenet: an enhanced mobilenet architecture, in 2019
IEEE 10th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference
(UEMCON) (2019, October), pp. 0280–0285. IEEE
5. M.A. Sofi, M.A. Wani, Protein secondary structure prediction using data-partitioning combined
with stacked convolutional neural networks and bidirectional gated recurrent units. Int. J. Inf.
Technol. 14(5), 2285–2295 (2022)
6. Y. Yu, X. Si, C. Hu, J. Zhang, A review of recurrent neural networks: LSTM cells and network
architectures. Neural Comput. 31(7), 1235–1270 (2019)
Chapter 11
Hybrid Deep Learning Architecture
for Protein Secondary Structure
Prediction
11.1 Introduction
Traditional deep learning models may struggle to capture all the relevant information
and patterns due to the diverse and intricate nature of protein structures. By integrating
multiple models or techniques, hybrid architectures aim to overcome these limitations
and enhance prediction performance. In this chapter, we propose a novel hybrid deep
learning architecture that combines the strengths of Convolutional Neural Networks
(CNNs) and Recurrent Neural Networks (RNNs) to enhance PSSP accuracy. Our
approach integrates Inception modules with Bidirectional Gated Recurrent Units
(BGRUs) and incorporates attention mechanisms to dynamically focus on the most
relevant features within the protein sequences. By leveraging this hybrid architecture,
we aim to address the limitations of existing methods and improve the prediction
performance across diverse protein datasets.
Preprocessing techniques play a crucial role in PSSP by preparing the input data
so that it is compatible with deep neural networks and analysis. Following are some common
preprocessing steps used in PSSP.
• Sequence cleaning: This step involves removing or handling non-standard char-
acters, gaps, or ambiguous residues in the protein sequence. It ensures that the
sequence is in a consistent and suitable format for further analysis. During this
step, missing or incorrect amino acids are identified and marked as ‘X’ in the
input sequence and the corresponding output label as ‘No sequence’ is assigned
for these positions.
Combining RNN and CNN in PSSP allows for the integration of local and non-local
information. CNNs capture local patterns and spatial relationships, while RNNs
model sequential dependencies and long-range interactions. By combining these
architectures, the CNN component extracts local features from the protein sequence,
which are then fed into the RNN component to capture non-local dependencies. This
integration enables the model to effectively capture both local and non-local informa-
tion, leading to improved accuracy in predicting the protein SS. This hybrid approach
has been widely adopted and contributes to advancing protein structure prediction.
Figure 11.1 illustrates a network comprising two Inception blocks, followed by
convolutional, recurrent, and dense layers. Convolving over the data with multiple
filter sizes ensures effective extraction of local and non-local interactions among
residues across a diverse range. A convolution layer, such as the 'Conv(3)' operation, involves
four consecutive steps: first, a one-dimensional convolution is applied using a kernel
size of three; next, Batch Normalization is used to speed up training and improve regu-
larization; then, the ReLU activation function is applied; finally, Dropout is employed to
prevent overfitting by randomly deactivating neurons during training. The proposed
network is developed, trained, and tested using TensorFlow and Keras, with various
parameters explored. A dropout rate of 0.4 was set, and a learning rate scheduler
controlled the learning rate, decreasing it gradually every 40 epochs. Early stopping
criteria were employed to halt training when validation metrics ceased to improve,
with TensorBoard used for dynamic visualization of training and validation metrics.
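A compact Keras sketch (assuming TensorFlow 2.x) in the spirit of the Inception-BGRU network described above is given below; the filter counts, kernel sizes, assumed window length of 700 residues with 21 features per residue, GRU width, and number of output classes are illustrative assumptions, while the dropout rate of 0.4 follows the text.

```python
# Sketch of an Inception-style 1D block feeding a bidirectional GRU for per-residue SS prediction.
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, kernel_size):
    """Conv1D -> BatchNorm -> ReLU -> Dropout, as in the 'Conv(k)' operation described above."""
    x = layers.Conv1D(filters, kernel_size, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return layers.Dropout(0.4)(x)

def inception_block(x, filters=42):
    """Parallel convolutions with several kernel sizes, concatenated along the channel axis."""
    branches = [conv_block(x, filters, k) for k in (3, 5, 7)]
    return layers.Concatenate()(branches)

inputs = tf.keras.Input((700, 21))            # residues x per-residue features (assumed sizes)
x = inception_block(inputs)
x = inception_block(x)
x = layers.Bidirectional(layers.GRU(128, return_sequences=True))(x)
outputs = layers.Dense(8, activation="softmax")(x)   # one 8-state SS label per residue
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```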
Fig. 11.2 The architecture of the inception-BGRU method is integrated with the attention
mechanism. The attention block is shown in Fig. 11.3
Fig. 11.3 Attention module for PSSP. s denotes the input features and $s \in \mathbb{R}^{d_{protein} \times d_{features}}$
In the self-attention mechanism, each vector in the input sequence is
converted into three different vectors: query, key, and value. The output vectors are
created as a weighted sum of the value vectors, where the weights are determined
based on how well the query vectors match with the key vectors, using a special
function known as the compatibility function.
The input features s are transformed into three distinct feature spaces: Query (Q),
Key (K), and Value (V). These spaces are essential for computing the scaled dot-
product (SDP) attention. The SDP score $m_{i,j}$ for two vectors $s_i$ and $s_j$ is
computed using the following equation

$$ m_{i,j} = \frac{Q(s_i) \cdot K(s_j)^T}{\sqrt{d_K}} $$
where $d_K$ is the dimensionality of the feature space K and $\sqrt{d_K}$ is the scaling
factor, which ensures that the result of the dot-product does not get prohibitively
large for very long sequences. The numerator $Q(s_i) \cdot K(s_j)^T$ represents the dot product
of the query vector $Q(s_i)$ and the key vector $K(s_j)$, which measures the similarity
between the two vectors. To compute the attention weights $a_{j,i}$, as shown in the equation
below, the weights are obtained by taking the exponential of the similarity score $m_{i,j}$
and normalizing it by the sum of exponentials of all similarity scores for the input
sequence. This normalization ensures that the attention weights sum up to 1.

$$ a_{j,i} = \frac{\exp\left(m_{i,j}\right)}{\sum_{n=1}^{d_{protein}} \exp\left(m_{i,n}\right)} $$
The attention weights a are then used to weight the feature vectors V(s). To
mitigate internal covariate shift (changes in the input distribution of each layer), the
result is normalized using batch normalization (BN) as shown in the equation below. This
process results in $r_j$, which is the output of the scaled dot-product attention mechanism.

$$ r_j = BN\left(\sum_{i=1}^{d_{protein}} a_{j,i} \cdot V(s_i)\right) $$
The attended features are finally combined with the original input features through a weighted residual connection:

$$ y_i = \alpha\, r_i + (1 - \alpha)\, x_i $$
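The following NumPy sketch traces the computations above (scores, softmax-normalized weights, attended features, and the weighted residual combination); the projection matrices, their sizes, and α = 0.5 are illustrative assumptions, and the batch-normalization step is omitted for brevity.

```python
# Scaled dot-product attention with a weighted residual connection (BN step omitted).
import numpy as np

def softmax(m, axis=-1):
    e = np.exp(m - m.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_block(s, Wq, Wk, Wv, alpha=0.5):
    """s: (d_protein, d_features); Wv is square so the residual combination is well-defined."""
    Q, K, V = s @ Wq, s @ Wk, s @ Wv           # query, key, and value spaces
    m = Q @ K.T / np.sqrt(Wk.shape[-1])        # scores m_{i,j}, scaled by sqrt(d_K)
    a = softmax(m, axis=-1)                    # attention weights
    r = a @ V                                  # weighted sum of value vectors
    return alpha * r + (1.0 - alpha) * s       # y_i = alpha * r_i + (1 - alpha) * x_i

rng = np.random.default_rng(0)
d_protein, d_feat, d_k = 6, 8, 4
s = rng.normal(size=(d_protein, d_feat))
y = attention_block(s, rng.normal(size=(d_feat, d_k)),
                    rng.normal(size=(d_feat, d_k)),
                    rng.normal(size=(d_feat, d_feat)))
```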
• Accuracy: Accuracy is one of the most popular and simplest measures to assess
the quality of the model. It is the percentage of data points that our model predicted or
classified correctly and is calculated as:
$$ Q_3\ Acc = \frac{N_H + N_E + N_C}{N} \times 100 $$

$$ Q_8\ Acc = \frac{N_H + N_E + N_G + N_I + N_B + N_T + N_S + N_L}{N} \times 100 $$
where H, E, G, I, B, T, S, and L represent the 8-state SSs and H, E, and C represent
the 3-state SSs predicted correctly, and N denotes the total number of residues.
• Precision: Since Accuracy does not fit well for models trained on imbalanced
data, Precision-Recall metrics give a more generalized assessment of the model.
Precision is quantified as the proportion of true positives relative to the total
number of positive predictions made by the model, as formulated below.
$$ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $$
• Recall: Recall is computed as the proportion of true positives in relation to the
total number of actual positive data points in the given dataset, as represented
below.
$$ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $$
• F1 score: The F1 score is the harmonic mean of precision and recall:

$$ F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
• Macro and micro average: Macro averaging treats each category with equal importance, while micro averaging assigns equal significance to each individual sample. Both follow the standard formulae; a small computational sketch is given after this list.
• Segment overlap score (SOV): It is a metric used in PSSP to evaluate the accuracy
of predicted SSs. It measures the degree of overlap between the predicted
SS segments and the corresponding segments in the ground truth or reference
structure. SOV is calculated using the equation

$$ SOV = \frac{100}{N} \sum_{i} \frac{minov\left(z_1^i, z_2^i\right) + \delta\left(z_1^i, z_2^i\right)}{maxov\left(z_1^i, z_2^i\right)} \times len\left(z_1^i\right), \qquad i \in \{H, E, G, I, B, T, S, L\}\ \text{or}\ \{H, E, C\} $$

where $z_1^i$ and $z_2^i$ signify a pair of overlapping segments within the predicted and
actual assignments, respectively; $minov(z_1^i, z_2^i)$ denotes the length of the actual overlap of the two segments $z_1^i$ and $z_2^i$; the total
extent of the overlapping pair is computed using $maxov(z_1^i, z_2^i)$; and N signifies the total number of
protein residues. $\delta(z_1^i, z_2^i)$ is calculated using the equation

$$ \delta\left(z_1^i, z_2^i\right) = \min\left\{ maxov\left(z_1^i, z_2^i\right) - minov\left(z_1^i, z_2^i\right),\ minov\left(z_1^i, z_2^i\right),\ \tfrac{1}{2} len\left(z_1^i\right),\ \tfrac{1}{2} len\left(z_2^i\right) \right\} $$
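A small NumPy sketch of the residue-level metrics above (Q-accuracy and per-class precision/recall with macro and micro averages) is given below; the toy labels are illustrative, and SOV is omitted because it operates on segments rather than individual residues.

```python
# Residue-level evaluation metrics: Q accuracy, per-class precision/recall, macro and micro averages.
import numpy as np

def q_accuracy(y_true, y_pred):
    return 100.0 * np.mean(y_true == y_pred)

def precision_recall(y_true, y_pred, classes):
    per_class, tp_sum, fp_sum, fn_sum = {}, 0, 0, 0
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        per_class[c] = (tp / max(tp + fp, 1), tp / max(tp + fn, 1))   # (precision, recall)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    macro = tuple(np.mean([v[i] for v in per_class.values()]) for i in (0, 1))
    micro = (tp_sum / max(tp_sum + fp_sum, 1), tp_sum / max(tp_sum + fn_sum, 1))
    return per_class, macro, micro

y_true = np.array(list("HHHEECCCEH"))   # reference 3-state labels for ten residues
y_pred = np.array(list("HHEEECCCCH"))   # predicted labels
print(q_accuracy(y_true, y_pred))                      # Q3 accuracy in percent (80.0 here)
print(precision_recall(y_true, y_pred, classes="HEC"))
```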
Fig. 11.4 Number of amino acids of each SS type in Training (CB6133) and testing (CB513,
CASP10, and CASP11) datasets
The CB6133 dataset is used for training, while the CB513, CASP10, and CASP11 datasets are used for evaluating the perfor-
mance of deep learning methods. The amino acid frequency (structure wise) is shown
in Fig. 11.4.
Several experiments have been conducted to evaluate the performance of various
deep learning models individually as well as in hybrid. The first set of experiments are
conducted to evaluate the performance of CNN with different model configurations
for PSSP. The results are shown in Table 11.1. The 1D-CNN, configured with 2
layers and 42 filters for each of the filter sizes 3 and 5, achieved an accuracy of
76.9%, effectively capturing essential structural patterns. When an additional convolution layer
with the same filter sizes and number of filters is added, the accuracy improved to
77.40%. We also tested the effect of filter sizes and observed that increasing the
filter size beyond 7 adds more to the computational cost than to the accuracy. For further
improvements, we added 2D-CNN with filter size of 3 × 1 and 5 × 1 on the output
produced by the 1D-CNN with 4 layers to capture the patterns along the feature
dimension and we observed that accuracy improved with a combination of both 1D
and 2D CNNs as shown in Table 11.1.
In addition, we conducted experiments using various window sizes and evaluated
the model’s performance. Figure 11.5 illustrates the results obtained from testing the
model with 14 different window sizes. It was observed that the model’s performance
Table 11.1 Performance comparison of CNN with different model configurations for PSSP
Method Filter size Accuracy (%)
1D-CNN with 2 layers 3, 5 Q3–76.91
1D-CNN with 3 layers 3, 5 Q3–77.40
1D-CNN with 3 layers 3, 5, 7 Q3–77.42
1D-CNN with 4 layers 3, 5 Q3–82.03
Cascaded 1D-CNN and 2D-CNN with 2 layers 3, 5 and 3 × 1, 5 × 1 Q3–83.48
Cascaded 1D-CNN and 2D-CNN with 3 layers 3, 5 and 3 × 1, 5 × 1 Q3–84.55
Fig. 11.5 Performance (Q3 /Q8 accuracy) of CNN on different window sizes
remained consistent for window sizes larger than 20. Using window sizes greater
than 20 increased computational requirements without significantly improving the
accuracy of the predicted SSs.
We conducted several experiments to determine the optimal dropout rate for both
local and non-local blocks. Figure 11.6 (left) illustrates the results, indicating that a
dropout rate of 0.5 achieved the highest accuracy. Additionally, we tested different
batch sizes to identify the optimal value for training our model for PSSP.
Figure 11.6 (right) displays the results, revealing that a batch size of 64 yielded the
highest accuracy.
Table 11.3 Performance comparison of various hybrid deep learning methods for prediction of
protein SSs
Methods Q3 (%) Q8 (%)
2D-CNN and bidirectional LSTM 85.4 73.0
1D-CNN and bidirectional LSTM 83.7 72.9
Inception inside Inception networks with 1D-CNN 85.1 70.5
Dilated 1D-CNN 83.6 71.1
1D-CNN + 2D-CNN and bidirectional LSTM 85.7 70.5
1D-CNN-2D-CNN-BGRU and data partitioning 85.8 74.1
Inception-BGRU 86.4 74.2
Attention enhanced inception-BGRU 87.5 75.4
PSSP remains a challenging task, with the current methods still lacking the desired
accuracy for practical use. Over the past two decades, significant progress has been
made in developing deep learning models for predicting protein SSs. However, the
accuracy of these models is still below the desired level, with the best reported
accuracies ranging from 70 ± 2% for eight-state predictions to 83 ± 2% for
three-state predictions. Despite the challenges, there has been a slow but steady
improvement in prediction accuracy, driven by the increasing availability of protein
sequences and solved structures. One key challenge in PSSP is the selection and
extraction of important input features. Since the raw protein sequence does not
provide sufficient information for prediction, various numerical features such as
amino acid composition, dipeptide composition, sequence profiles, and position-
specific scoring matrices (PSSMs) are used to represent the sequence information.
The choice and representation of these features greatly impact the performance of
prediction models, and careful consideration is required in their selection. Capturing
non-local interactions between amino acids is another critical challenge in PSSP.
While the primary sequence plays a role, the SS is also influenced by spatial relation-
ships between residues. Existing techniques, such as sliding windows and recurrent
neural networks (RNNs), have limitations in effectively capturing non-local interac-
tions. More research attention is needed to develop new models that can efficiently
capture complex non-local interactions for accurate prediction. Protein length is
another significant factor that poses a challenge in SS prediction. Proteins can vary
widely in length, and their shape and function are impacted by their sequence length.
However, current methods often require fixed-length proteins as input, resulting in
accuracy issues at boundary regions. Excessive zero-padding of shorter sequences
has been shown to negatively impact deep learning-based prediction methods. Further
research is needed to address this challenge and improve accuracy, particularly in
boundary regions. Another challenge in PSSP is the separate prediction of regular
and non-regular SSs. Regular SSs, such as alpha-helices and beta-sheets, are distinct from
non-regular structures like turns and coils. Current methods often develop separate
prediction models for regular and non-regular structures, which makes their use for
structural analysis and function prediction complex and challenging.
11.8 Summary
This work proposed a novel hybrid deep learning architecture, integrating Inception
modules with Bidirectional Gated Recurrent Units (BGRUs) and attention mech-
anisms, aimed at enhancing protein secondary structure prediction (PSSP) accu-
racy. Extensive experiments were conducted across various CNN, RNN, and hybrid
configurations to evaluate and compare their effectiveness. The results demonstrated
significant improvements over traditional models. Specifically, the Inception-BGRU
configuration achieved Q3 and Q8 accuracies of 86.4% and 74.2%, and the attention-enhanced
Inception-BGRU further improved these to 87.5% and 75.4%, respectively (Table 11.3).
Bibliography
1. B. Alberts et al., Molecular biology of the cell. Biochem. Mol. Biol. Educ. 36(4), 317–318
(2008). https://doi.org/10.1002/bmb.20192
2. H.M. Berman, The protein data bank. Acta Crystallogr. Sect. D Biol. Crystallogr 58(6 I),
899–907 (2002). https://www.rcsb.org
3. Q. Jiang, X. Jin, S.J. Lee, S. Yao, Protein secondary structure prediction: a survey of the state of the art. J. Mol. Graph. Model.
76, 379–402 (2017). https://doi.org/10.1016/j.jmgm.2017.07.015
4. W. Kabsch, C. Sander, Dictionary of protein secondary structure: pattern recognition of
hydrogen-bonded and geometrical features. Biopolymers 22(12), 2577–2637 (1983). https://
doi.org/10.1002/bip.360221211
5. M.A.R. Ratul, M.T. Elahi, M.H. Mozaffari, W.S. Lee, PS8-Net: a deep convolutional neural
network to predict the eight-state protein secondary structure. 2020 Digital image computing:
techniques and applications. DICTA 2020, 8–10 (2020)
6. M.A. Sofi, M.A. Wani, Protein secondary structure prediction using data-partitioning combined
with stacked convolutional neural networks and bidirectional gated recurrent units. Int. J. Inf.
Technol. 1–11 (2022). https://doi.org/10.1007/s41870-022-00978-x
7. M.A. Sofi, M.A. Wani, Improving prediction of protein secondary structures using attention-
enhanced deep neural networks, in 2022 9th International Conference on Computing for
Sustainable Global Development (INDIACom) (2022, March), pp. 664–668. IEEE. https://
doi.org/10.23919/INDIACom54597.2022.9763114
8. M.A. Sofi, M.A. Wani, RiRPSSP: a unified deep learning method for prediction of regular and
irregular protein secondary structures. J. Bioinform. Comput. Biol. 21(01), 2350001 (2023)
9. M.A. Wani, F.A. Bhat, S. Afzal, A.I. Khan, Advances in Deep Learning (Springer, 2020)
10. Y. Yang, J. Gao, J. Wang, R. Heffernan, J. Hanson, K. Paliwal, Y. Zhou, Sixty-five years of the
long march in protein secondary structure prediction: the final stretch? Brief. Bioinform. 19(3), 482–494 (2018). https://doi.
org/10.1093/bib/bbw129
11. B. Zhang, J. Li, Q. Lü, Prediction of 8-state protein secondary structures by a novel deep
learning architecture. BMC Bioinform. 19(1), 1–13 (2018). https://doi.org/10.1186/s12859-
018-2280-5
12. J. Zhou, O.G. Troyanskaya, Deep supervised and convolutional generative stochastic network
for protein secondary structure prediction. 31st Int. Conf. Mach. Learn. (ICML 2014) 2, 1121–
112 (2014)
Chapter 12
Enhanced Accuracy in Pan-Cancer
Classification Using Ensemble Deep
Learning
12.1 Introduction
$$y_i = f(z_i) = G(d_1, d_2, \ldots, d_h)$$
where $y_i$ is the predicted output for the $i$th sample, $z_i$ is the input feature vector
for the $i$th sample, $f(z_i)$ is the prediction function, and $G$ is the aggregation function for
combining the outputs of the base classifiers.
We utilized the normalized gene expression data for 33 cancer types, obtained from
the Pan-cancer atlas. The dataset comprises 10,267 cancer samples corresponding
to 20,531 genes. Due to the presence of non-essential information in the dataset, we
combined data from 33 different cancer types, carefully curated it, and removed any
genes with low variability. Additionally, because genes located close to each other are
more likely to interact, they were organized based on their chromosomal positions.
After curating the data, the high-dimensional gene expression data, originally in a
one-dimensional format of 10,381 × 1, was converted into a two-dimensional form
with dimensions of 102 × 102 (see Fig. 12.2 for sample data). This transformed data
serves as the input for the ensemble deep learning architecture.
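A minimal sketch of this 1-D to 2-D conversion is shown below; padding the 10,381-value vector with zeros up to 10,404 (= 102 × 102) before reshaping is an assumption made here for illustration, not necessarily the authors' exact procedure.

```python
# A small sketch (not the authors' exact code) of the 1-D to 2-D conversion:
# one curated 10,381-gene expression vector is zero-padded to 10,404 values
# and reshaped into a 102 x 102 image.
import numpy as np

def to_image(expression_vector, side=102):
    """Convert a 1-D gene expression vector into a (side x side) 2-D array."""
    padded = np.zeros(side * side, dtype=np.float32)        # 102 * 102 = 10,404
    padded[: expression_vector.shape[0]] = expression_vector
    return padded.reshape(side, side)

sample = np.random.rand(10381).astype(np.float32)           # one curated sample
image = to_image(sample)
print(image.shape)                                           # (102, 102)
```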
This section introduces a new ensemble method for classifying 33 different types
of cancer. The proposed method integrates advanced deep learning architectures,
including DenseNet, ResNet, Inception-V3, and the Xception model. The initial
phase of the approach involves preprocessing the pan-cancer dataset to convert the
raw gene expression data into formats suitable for deep learning models. Following
this, the preprocessed data is input into the proposed method for cancer type clas-
sification. The 33-class outputs computed by the individual models are fused, and
the final prediction is obtained using different aggregation methods. The structure of
the ensemble method for pan-cancer subtype classification is illustrated in Fig. 12.3.
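The following sketch, based on tf.keras.applications, illustrates one way the four base classifiers could be instantiated on the 102 × 102 inputs. Tiling the single-channel input to three channels, the DenseNet-201 variant, and the interpretation of "default TensorFlow weights" as ImageNet initialization are assumptions for illustration; the chapter's exact instantiation may differ.

```python
# A minimal sketch of the four-backbone ensemble members. Each member is
# trained separately; their 33-way softmax outputs are later fused with one of
# the aggregation rules described below.
from tensorflow.keras import layers, Model, applications

NUM_CLASSES = 33

def build_member(backbone_fn, name):
    inp = layers.Input(shape=(102, 102, 1), name=f"{name}_input")
    x = layers.Concatenate()([inp, inp, inp])        # 1 channel -> 3 channels
    backbone = backbone_fn(include_top=False, weights="imagenet",
                           input_shape=(102, 102, 3), pooling="avg")
    x = backbone(x)
    out = layers.Dense(NUM_CLASSES, activation="softmax", name=f"{name}_out")(x)
    return Model(inp, out, name=name)

members = [
    build_member(applications.DenseNet201, "densenet"),
    build_member(applications.ResNet50, "resnet50"),
    build_member(applications.InceptionV3, "inception_v3"),
    build_member(applications.Xception, "xception"),
]
```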
Fig. 12.2 Representation of size 102 × 102 samples of 33 cancer types transformed from gene
expression data
• Max Voting
Max Voting is a popular ensemble technique used to combine the predictions of
multiple classifiers. It comes in two forms: Hard Voting and Soft Voting. Hard
(Majority) voting involves selecting the class label that receives the most votes among
all classifiers. Suppose classifiers $d_1, d_2, \ldots, d_k$ give class predictions for an input
$z$. The final output $y^{*}$ is determined by the most frequent prediction:
$$y^{*} = \text{mode}\left(d_1(z),\, d_2(z),\, \ldots,\, d_k(z)\right)$$
where $y^{*}$ is the predicted class label and $d_1(z), d_2(z), \ldots, d_k(z)$ are the predictions from
the individual classifiers.
In soft voting, the predicted probabilities from each base classifier dj are aggregated
to determine the final prediction. The final class label y∗ is chosen based on the
highest combined weighted probability.
$$y^{*} = \arg\max_{c} \sum_{j=1}^{k} w_j \, d_j(z)$$
where $y^{*}$ is the predicted class label, $\arg\max_c$ selects the class $c$ that maximizes
the sum of weighted predictions, $w_j$ is the weight assigned to the $j$th classifier, and $d_j(z)$
is the predicted probability (or score) for class $c$ from the $j$th classifier for the
input feature vector $z$.
• Average Voting
Averaging Voting is a method where predictions are gathered from multiple classi-
fiers, and their average is used to make the final prediction. This approach calculates
the arithmetic mean of the predictions, where the mean is the sum of all predic-
tions divided by the number of predictions. This method is powerful in terms of
predictive strength and is generally more accurate than majority voting. However, it
is more computationally intensive because it requires averaging the results from all
classifiers. The mathematical equation for averaging voting is:
$$y^{*} = \arg\max_{c}\left(\frac{1}{k} \sum_{j=1}^{k} d_j(z)\right)$$
where $y^{*}$ is the predicted class label, $\arg\max_c$ selects the class $c$ having the highest average
score, $k$ is the total number of classifiers in the ensemble, and $d_j(z)$ is the predicted
probability (or score) for class $c$ from the $j$th classifier for the input feature vector $z$.
• Weighted Average Method
It is an advanced version of the averaging voting method where different weights
are assigned to each classifier in the ensemble. These weights indicate the relative
importance of each model in making the final prediction. To compute the weighted
average, each prediction from the classifiers is multiplied by its corresponding weight.
The sum of these weighted predictions is then divided by the sum of the weights.
The equation for weighted average voting can be expressed as:
$$y^{*} = \arg\max_{c} \frac{\sum_{j=1}^{k} w_j \, d_j(z)}{\sum_{j=1}^{k} w_j}$$
where $y^{*}$ is the predicted class label, $\arg\max_c$ selects the class $c$ having the highest weighted
average score, $k$ is the total number of classifiers in the ensemble, $w_j$ is the weight assigned
to the $j$th classifier, and $d_j(z)$ is the predicted probability (or score) for class $c$ from
the $j$th classifier for the input feature vector $z$.
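The three aggregation rules above can be implemented in a few lines once the per-model class probabilities are stacked into a single array. The sketch below is an assumed NumPy implementation with placeholder model weights, not the authors' code.

```python
# Assumed NumPy implementations of the three aggregation rules.
# `probs` stacks per-model class probabilities: (n_models, n_samples, n_classes).
import numpy as np

def hard_voting(probs):
    # Majority vote over each model's argmax prediction (ties broken by class index).
    votes = probs.argmax(axis=-1)                            # (n_models, n_samples)
    n_classes = probs.shape[-1]
    counts = np.stack([np.bincount(votes[:, i], minlength=n_classes)
                       for i in range(votes.shape[1])], axis=1)
    return counts.argmax(axis=0)                             # (n_samples,)

def average_voting(probs):
    # Arithmetic mean of the class probabilities, then argmax over classes.
    return probs.mean(axis=0).argmax(axis=-1)

def weighted_average_voting(probs, weights):
    # Weighted mean of the class probabilities, normalized by the weight sum.
    w = np.asarray(weights, dtype=float).reshape(-1, 1, 1)
    return ((probs * w).sum(axis=0) / w.sum()).argmax(axis=-1)

# Toy example: 4 models, 5 samples, 33 classes, with placeholder model weights.
probs = np.random.dirichlet(np.ones(33), size=(4, 5))
print(hard_voting(probs))
print(average_voting(probs))
print(weighted_average_voting(probs, [0.2, 0.25, 0.3, 0.25]))
```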
For the classification of 33 cancer types using the proposed deep learning archi-
tecture, several evaluation metrics are essential to assess the model’s performance
accurately. Accuracy measures the percentage of correctly classified samples out
of the total, providing a basic but sometimes insufficient metric, especially with
imbalanced datasets; it is computed using the following equation:
$$\text{Accuracy} = \frac{\text{Correctly Classified Samples}}{\text{Total Samples}}$$
Precision focuses on the proportion of true positives among the instances predicted
as positive, which is crucial in determining the reliability of positive predictions.
Recall assesses the ratio of true positives against all actual positive instances in the
dataset, highlighting the model’s ability to capture all relevant cases. To compute
precision and recall, the following equations are used:
$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$
$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$
The F1-score, a harmonic mean of Precision and Recall, offers a balanced metric
when dealing with models where either metric alone could be misleading. It is
computed using the equation below.
$$\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
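In practice these metrics can be computed directly with scikit-learn, as in the short sketch below; macro averaging over the 33 classes is assumed here, since the averaging scheme is not stated explicitly.

```python
# A short sketch of computing the four metrics with scikit-learn (macro averaging assumed).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 2, 1, 2, 0, 1]          # toy ground-truth class indices
y_pred = [0, 2, 1, 1, 0, 1]          # toy predicted class indices

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"Acc={accuracy:.3f}  P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")
```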
All experiments for this study were carried out on an NVIDIA DGX A100 server
equipped with 8 GPUs, each with 40 GB of dedicated memory, and 1 TB of RAM.
The architectures for pan-cancer classification were developed and trained using the
TensorFlow and Keras deep learning frameworks, and the ensemble model was initialized
with the default TensorFlow weights. To counter overfitting in the ensemble model,
L2 regularizers were employed together with an early stopping strategy set to a
monitoring patience of 7 epochs. Training used a learning rate of 0.0001 with a decay
factor of 0.5 applied after intervals of 40 epochs, and continued for up to 300 epochs.
After extensive experimentation, the optimal batch size was determined to be 128.
The output layer used a softmax activation function. To further reduce variance,
which often manifests as overfitting, a dropout rate of 0.5 was applied; dropout
randomly deactivates nodes during training so that the network does not rely too
heavily on any particular subset of nodes.
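A sketch of this training configuration in TensorFlow/Keras is given below. The tiny stand-in network, the Adam optimizer, and the monitored quantity for early stopping are assumptions for illustration; the learning rate, decay factor, patience, dropout rate, batch size, and epoch budget follow the values stated above.

```python
# A sketch of the training configuration: L2 regularization, early stopping
# (patience 7), a 1e-4 learning rate halved every 40 epochs, dropout 0.5,
# batch size 128, and up to 300 epochs. The small model below is only a
# stand-in for the DenseNet/ResNet/Inception-V3/Xception members.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    layers.Input(shape=(102, 102, 1)),
    layers.Conv2D(16, 3, activation="relu",
                  kernel_regularizer=regularizers.l2(1e-4)),
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.5),
    layers.Dense(33, activation="softmax"),
])

def halve_every_40_epochs(epoch, lr):
    # Learning-rate decay factor of 0.5 applied after each 40-epoch interval.
    return lr * 0.5 if epoch > 0 and epoch % 40 == 0 else lr

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=7,
                                     restore_best_weights=True),
    tf.keras.callbacks.LearningRateScheduler(halve_every_40_epochs),
]

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, validation_split=0.1,
#           epochs=300, batch_size=128, callbacks=callbacks)
```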
In the realm of cancer genomics and molecular profiling, the concept of a pan-
cancer analysis aims to discern overarching patterns and insights across multiple
cancer types. While this holistic approach promises a comprehensive understanding
of oncogenic processes, it also introduces a series of challenges, chief among them
being dataset imbalance. Given the conspicuous imbalance in pan-cancer data, a
class weighting strategy becomes indispensable. This approach, often referred to as
cost-sensitive learning, assigns larger weights to under-represented cancer types so
that their misclassification contributes more to the training loss (Fig. 12.4).
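One common way to derive such class weights, assumed here for illustration rather than taken from the authors' code, is to weight each class inversely to its sample count:

```python
# Inverse-frequency class weights for the 33 classes (illustrative assumption).
import numpy as np

def inverse_frequency_weights(labels, n_classes=33):
    counts = np.bincount(labels, minlength=n_classes)
    # Classes with few samples receive proportionally larger weights.
    return len(labels) / (n_classes * np.maximum(counts, 1))

labels = np.random.randint(0, 33, size=10267)            # toy label vector
class_weight = dict(enumerate(inverse_frequency_weights(labels)))
# `class_weight` can be passed directly to Keras model.fit(class_weight=...).
```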
Fig. 12.4 Magnitude of class weight value on major and minor classes of pan-cancer (per-class sample percentage and class weight shown for each of the 33 classes)
12.3.2 Results
Table 12.1 Comparative performance metrics of individual deep learning architectures and the
ensemble method on the pan-cancer dataset. Metrics include Accuracy, Precision, Recall, and F-
score for each method
Method Accuracy Precision Recall F-score
DenseNet 87.32 87.32 87.67 87.44
ResNet50 90.45 91.05 90.77 91.22
Inception-V3 93.10 93.53 93.05 93.15
Xception 92.54 93.76 93.45 93.71
Ensemble method 96.88 96.94 96.90 96.93
The empirical outcomes, delineated in Table 12.1, show that DenseNet obtained an
accuracy of 87.32%. In contrast, ResNet50, Inception-V3, and the Xception method
achieved accuracies of 90.45%, 93.10%, and 92.54%, respectively. Among the
standalone results, Inception-V3 attains the highest accuracy, while the Xception
architecture records the best precision (93.76%), recall (93.45%), and F-score (93.71%). The differential
performance can be attributed to the intrinsic architectural distinctions of each model.
For instance, Xception’s depth-wise separable convolutions likely offer enhanced
feature extraction capabilities for this specific dataset compared to the other archi-
tectures. In pursuit of augmenting classification efficacy, an ensemble strategy was
deployed, amalgamating the strengths of all aforementioned architectures. The resul-
tant ensemble model leveraged an averaging mechanism for its final output determination.
Notably, as presented in Table 12.1, the ensemble method outperformed the individual
models, registering an accuracy of 96.88%, complemented by a
precision of 96.94%, a recall of 96.90%, and an F-score of 96.93%. The augmented
performance of the ensemble can be rationalized by its ability to harness diverse
feature extraction proficiencies from each individual model. This diversity, when
combined, offers a more holistic representation, mitigating individual model biases
and capturing a broader spectrum of patterns intrinsic to the Pan-cancer dataset.
We evaluated the performance of our ensemble architecture on the classification
of 33 cancer types using different voting methods, as shown in Fig. 12.5. Max Voting,
which relies solely on majority rule, achieved the lowest accuracy, precision, recall,
and F-score, as it doesn’t account for the varying confidence levels of the models.
Average Voting improved performance across all metrics by incorporating the prob-
ability distributions of each prediction, reflecting a more balanced and informed
aggregation. However, the best results were obtained using Weighted Average Voting,
which assigns different importance to each classifier based on their reliability. This
approach capitalized on the strengths of the more accurate models, leading to the
highest scores in accuracy (97.15%), precision (97.2%), recall (97.1%), and F-score
(97.15%). The superior performance of Weighted Average Voting underscores the
importance of tailoring the aggregation strategy to the varying contributions of
individual classifiers, especially in complex tasks like pan-cancer classification.
To further evaluate the ensemble method, a confusion matrix was employed,
providing a cross-tabulation of true versus predicted classes. As depicted in Fig. 12.6,
Fig. 12.5 Performance metrics (Accuracy, Precision, Recall, F-score) for different voting methods in the ensemble model for pan-cancer classification
each column of the matrix signifies the predicted class instances, while each row
indicates actual class instances. This matrix elucidates both the general performance
of our classifier and specific error types.
The intricacies of cancer classification are underscored by the multitude of unique
malignancies, each exhibiting its own molecular and histological signature. In this
realm, the efficacy of a predictive model is truly gauged by its ability to discern
between these subtle variations and categorize them accurately. Table 12.2 provides
an in-depth exposition into the class-wise performance of the Ensemble method,
offering insights into its diagnostic precision across a diverse spectrum of cancer
types. As depicted in Table 12.2, each cancer type is denoted by both its full name
Fig. 12.6 Confusion matrix for the Ensemble method applied to the pan-cancer test dataset
and an associated code. For each cancer type, the precision, recall, and f1-score
are meticulously documented. Metrics such as precision capture the proportion of
true positive predictions among all positive predictions, while recall evaluates the
proportion of true positive predictions among all actual positives. The F1-score offers
a harmonized measure, considering both precision and recall. The table reveals
nuanced insights into how effectively the ensemble model classifies each cancer
type, providing a granular perspective on its diagnostic capabilities across a diverse
spectrum of malignancies. Through this granular breakdown, we aim to underscore
the model’s strengths and areas of improvement in distinguishing between specific
cancer classes.
The table delineates detailed classification metrics for a wide array of cancer types
using the ensemble of deep learning method. Each cancer type is identified both by
its full designation and a concise code. Across the spectrum of malignancies:
• Breast invasive carcinoma (BRCA) exhibits a precision of 1, coupled with a recall
of 0.88, resulting in an F1-score of 0.93.
• Brain Lower Grade Glioma (LGG) stands out with perfect precision and an
impressive recall of 0.96, culminating in an F1-score of 0.98.
• The table also highlights certain cancer types, such as Liver hepatocellular carci-
noma (LIHC), Colon adenocarcinoma (COAD), and Cervical and endocervical
cancers (CESC), that achieved exemplary performance metrics, registering a
perfect score across all three categories.
• On the other hand, Adrenocortical carcinoma (ACC) has a slightly lower precision,
recall, and F1-score, all at 0.75.
The majority of cancer types, like Lung adenocarcinoma (LUAD) and Head and
Neck squamous cell carcinoma (HNSC), showcase metrics in the high nineties,
underlining the method’s efficacy. The comparative analysis of F-scores for various
cancer types across individual and ensemble deep learning models is shown in
Fig. 12.7.
12.4 Challenges and Issues
While methods tailored for various data types, including RNASeq, show versa-
tility, they underscore challenges in broad applicability. The inherent preprocessing
complexities, spanning both filter and wrapper approaches, emphasize the intrica-
cies involved in data preparation for efficient model learning. Genomic datasets,
due to their high-dimensional nature, present substantial challenges for traditional
deep learning architectures, primarily tailored for 2-D imagery. Naively mapping
genomic data onto these images, while innovative, may miss capturing the nuanced
relationships intrinsic to genomic sequences, potentially leading to classification
inaccuracies. A significant concern in pan-cancer classification is the stark imbal-
ance in data samples across tumor types. Some tumors are well-represented with
thousands of samples, whereas rarer ones might only feature sparsely. This disparity
Table 12.2 Class-wise precision, recall, and F1-score of the proposed ensemble method for the 33 cancer types
Cancer Type Code Precision Recall F1-score
Breast invasive carcinoma BRCA 1 0.88 0.93
Brain lower grade glioma LGG 1 0.96 0.98
Uterine corpus endometrial carcinoma UCEC 0.78 0.78 0.78
Lung adenocarcinoma LUAD 0.97 0.97 0.97
Head and neck squamous cell carcinoma HNSC 0.97 0.97 0.97
Thyroid carcinoma THCA 1 1 1
Prostate adenocarcinoma PRAD 0.98 0.98 0.98
Lung squamous cell carcinoma LUSC 0.98 0.98 0.98
Bladder urothelial carcinoma BLCA 0.95 0.98 0.97
Stomach adenocarcinoma STAD 0.96 0.96 0.96
Skin cutaneous melanoma SKCM 1 0.89 0.94
Kidney renal clear cell carcinoma KIRC 0.96 1 0.98
Liver hepatocellular carcinoma LIHC 1 1 1
Colon adenocarcinoma COAD 1 1 1
Cervical and endocervical cancers CESC 1 1 1
Kidney renal papillary cell carcinoma KIRP 1 1 1
Sarcoma SARC 1 0.7 0.82
Ovarian serous cystadenocarcinoma OV 0.96 0.96 0.96
Esophageal carcinoma ESCA 1 1 1
Pheochromocytoma and paraganglioma PCPG 0.89 0.91 0.9
Pancreatic adenocarcinoma PAAD 1 1 1
Testicular germ cell tumors TGCT 1 1 1
Glioblastoma multiforme GBM 1 0.99 1
Thymoma THYM 1 1 1
Rectum adenocarcinoma READ 0.87 1 0.93
Acute myeloid leukemia LAML 1 0.83 0.91
Mesothelioma MESO 1 1 1
Uveal melanoma UVM 0.97 0.91 0.94
Adrenocortical carcinoma ACC 0.75 0.75 0.75
Kidney chromophobe KICH 0.92 1 0.96
Uterine carcinosarcoma UCS 1 1 1
Lymphoid neoplasm diffuse large B-cell lymphoma DLBC 0.8 0.8 0.8
Cholangiocarcinoma CHOL 0.94 0.94 0.94
Fig. 12.7 Comparative analysis of F-scores for various cancer types across individual and ensemble
deep learning models. Each subplot represents a specific cancer type, with the F-score of each model
depicted as bars. The models under evaluation include Inception-V3, DenseNet-201, Xception,
ResNet50, and the ensemble model. Overall, the ensemble model consistently showcases
competitive or superior performance across most cancer types
can bias deep learning models, potentially skewing them towards overrepresented
classes. The current focus on highly expressed genes also neglects those with lower
expression levels. While this strategy minimizes false positives, it may omit key
biomarkers that are expressed more subtly. An added dimension to this challenge
is the primary concentration on overexpressed genes, possibly sidelining markers
downregulated in specific cancer types.
Deep learning models, by nature, strive to optimize specific objectives. In this
context, they might bypass genes showcasing particular cancer type expressions if
other genes better suit their objective, potentially missing out on pivotal biomarkers.
For instance, the NAT1 gene’s higher expression in breast cancer was notably over-
looked. Moreover, while neural network visualization methods promise insights into
influential genes, ensuring that these visual insights resonate with biological signif-
icance remains a challenge. Given the noted advancements over previous methods,
there’s an imperative for continual benchmarking against emergent models in tumor
type classification. Bridging the gap between the swift progression in computer vision
methodologies and the unique challenges posed by genomic data becomes essential
in this nascent field of genomic deep learning. Addressing these concerns will help
unlock profound insights and refine diagnostic tools in pan-cancer classification.
12.5 Summary
This chapter presented an ensemble deep learning method for classifying 33 cancer
types from gene expression data transformed into 102 × 102 two-dimensional inputs.
Individually trained DenseNet, ResNet50, Inception-V3, and Xception models were
combined through max, average, and weighted average voting. The ensemble clearly
outperformed each standalone network, and the weighted average strategy yielded the
best overall results, with accuracy, precision, recall, and F-score all above 97%. The
class-wise analysis showed consistently strong performance across most malignancies,
while a few under-represented types, such as adrenocortical carcinoma, remain harder
to classify.
Bibliography
1. F. Bray, J. Ferlay, I. Soerjomataram, R.L. Siegel, L.A. Torre, A. Jemal, Global cancer statistics
2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185
countries. CA: Cancer J. Clin. 68(6), 394–424 (2018). https://doi.org/10.3322/caac.21492
2. D. Crosby, S. Bhatia, K.M. Brindle, L.M. Coussens, C. Dive, M. Emberton, S. Balasubramanian,
Early detection of cancer. Science 375(6586), eaay9040 (2022). https://doi.org/10.1126/science.
aay9040
3. A. Cruz-Roa, H. Gilmore, A. Basavanhally, M. Feldman, S. Ganesan, N.N. Shih, J. Tomaszewski,
F.A. González, A. Madabhushi, Accurate and reproducible invasive breast cancer detection in
whole-slide images: a Deep Learning approach for quantifying tumor extent. Sci. Rep. 7(1),
1–14 (2017). https://doi.org/10.1038/srep46450
4. B. Lyu, A. Haque, Deep learning based tumor type classification using gene expression data,
in Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational
Biology, and Health Informatics (2018, August), pp. 89–96. https://doi.org/10.1155/2022/471
5998
5. M.D. Podolsky, A.A. Barchuk, V.I. Kuznetcov, N.F. Gusarova, V.S. Gaidukov, S.A. Tarakanov,
Evaluation of machine learning algorithm utilization for lung cancer classification based on gene
expression levels. Asian Pac. J. Cancer Prev. 17(2), 835–838 (2016). https://doi.org/10.7314/
apjcp.2016.17.2.835
6. Z. Wang, M.A. Jensen, J.C. Zenklusen, A practical guide to the cancer genome atlas (TCGA), in
Statistical Genomics (Humana Press, New York, 2016), pp. 111–141. https://doi.org/10.1007/
978-1-4939-3578-9_6
7. E.H. Yau, I.R. Kummetha, G. Lichinchi, R. Tang, Y. Zhang, T.M. Rana, Genome-wide CRISPR
screen for essential cell growth mediators in mutant KRAS colorectal cancers. Cancer Res. 77(22), 6330–6339 (2017).
https://doi.org/10.1158/0008-5472.CAN-17-2043
8. S. Zuo, G. Dai, X. Ren, Identification of a 6-gene signature predicting prognosis for colorectal
cancer. Cancer Cell Int. 19(1), 1–15 (2019). https://doi.org/10.1186/s12935-018-0724-7
Chapter 13
Brain Tumor Prediction Using Transfer
Learning and Light-Weight Deep
Learning Architectures
13.1 Introduction
In recent times, brain cancer has been recognized as one of the deadliest diseases
affecting the brain. Gliomas are the most common type of brain tumor and have a high mortality
rate (Bauer et al., 2013). These tumors develop in the brain or spinal cord and are clas-
sified into low-grade (LGGs) and high-grade gliomas (HGGs). High-grade gliomas
are highly aggressive and typically result in a life expectancy of around two years
after diagnosis. Meningiomas, another type of brain tumor, arise from the meninges,
while tumors in the pituitary gland are called pituitary tumors. Figure 13.1 shows
the various types of brain tumors. Several imaging techniques like MRI, CT, and
PET are frequently used to examine brain tumors and other brain conditions. MRI,
in particular, is a powerful tool for detecting and treating brain cancer due to its
ability to provide detailed images, high contrast in soft tissues, and multi-directional
imaging.
Previous approaches in brain tumor detection often faced limitations, particu-
larly in terms of accuracy and computational efficiency. Manual methods struggled
with precise tumor type identification, often leading to delayed or less-targeted treat-
ment interventions. Computational methodologies employed in brain tumor predic-
tion involve the utilization of techniques like machine learning, deep learning, and
image processing algorithms specifically applied to Magnetic Resonance Imaging
(MRI) scans. These methods are designed to automatically detect, classify, and
predict brain tumors by extracting patterns and relevant features from these images.
Utilizing algorithms such as Convolutional Neural Networks (CNNs), or other clas-
sification models, these computational techniques strive to achieve early and precise
identification of brain tumors. CNNs have proven effective in analyzing complex
image data, recognizing patterns, and facilitating classification tasks, making them
valuable in medical image interpretation. However, these computational approaches
have inherent limitations. Brain tumor prediction has heavily relied on computationally
intensive models, posing challenges in their practical deployment on devices with
limited computational resources.
Fig. 13.1 MRI images showcasing different brain tumor types: a Glioma, b Glioma, c Pituitary
tumor, and d Meningioma
The input MRI images are resized to fit the input size required by MobileNet (224 × 224 pixels) and are normal-
ized to ensure consistency in the input data. To enhance the dataset’s diversity and
robustness, data augmentation techniques such as rotation and flipping as shown in
Fig. 13.3 are employed. These techniques help in addressing the class imbalance and
minimizing overfitting during model training. Additionally, denoising methods like
Wiener and median filtering are applied to remove common noise types (Gaussian,
salt-and-pepper, and speckle noise) from the MRI scans, ensuring that the input data
is clean and reliable.
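A simple sketch of this pre-processing pipeline is shown below using OpenCV and NumPy; the exact filter kernel sizes and augmentation parameters are assumptions, and Wiener filtering (available, for example, in scipy.signal) is omitted for brevity.

```python
# An assumed sketch of the pre-processing described above: median filtering,
# resizing to the 224 x 224 MobileNet input size, normalization, and simple
# rotation/flip augmentation.
import cv2
import numpy as np

def preprocess(mri_slice):
    """Denoise, resize and normalize a single-channel MRI slice."""
    denoised = cv2.medianBlur(mri_slice, 3)                  # median filtering
    resized = cv2.resize(denoised, (224, 224))               # MobileNet input size
    return resized.astype(np.float32) / 255.0                # [0, 1] normalization

def augment(image):
    """Return simple rotated and flipped variants of one pre-processed image."""
    return [image,
            cv2.rotate(image, cv2.ROTATE_90_CLOCKWISE),      # 90-degree rotation
            cv2.flip(image, 1)]                              # horizontal flip

slice_ = (np.random.rand(512, 512) * 255).astype(np.uint8)   # toy MRI slice
batch = augment(preprocess(slice_))
print(len(batch), batch[0].shape)                            # 3 (224, 224)
```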
The input images then pass through several stages of the MobileNet V3 archi-
tecture. Initially, the images are processed by a standard convolution layer, followed
by multiple layers of depth-wise separable convolutions. These convolutions are
essential in extracting spatial features while maintaining computational efficiency.
Residual connections are integrated into the architecture to facilitate the flow of infor-
mation and prevent vanishing gradients in deeper networks. Bottleneck blocks are
also employed, consisting of depth-wise and pointwise convolutions, which further
reduce the computational load by compressing the output channels. Stride and expan-
sion operations are used within these blocks to balance the resolution and the number
of output channels, ensuring that the model remains efficient without sacrificing accu-
racy. As the data flows through the network, it undergoes global average pooling to
reduce the spatial dimensions and is then passed through fully connected layers that
transform the feature maps into class probabilities. The final output layer utilizes
a softmax activation function to classify the input MRI scans into one of three
categories: glioma, pituitary tumor, or meningioma. The model’s performance is
evaluated using accuracy, precision, recall, and F1 score, ensuring that it effectively
distinguishes between the different types of brain tumors. The evaluation metrics are
calculated as follows:
$$\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Instances}}$$
$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$
$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$
$$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
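A minimal transfer-learning sketch of the pipeline described above is given below using tf.keras.applications.MobileNetV3Large. Freezing the backbone, the ImageNet initialization, and the small classification head are illustrative assumptions; MRI slices are assumed to be replicated to three channels to match the expected input.

```python
# A minimal MobileNet V3 transfer-learning sketch for the three tumor classes.
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_CLASSES = 3                                  # glioma, pituitary, meningioma

backbone = tf.keras.applications.MobileNetV3Large(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3),
    pooling="avg")
backbone.trainable = False                       # transfer learning: freeze stem

inputs = layers.Input(shape=(224, 224, 3))
x = backbone(inputs, training=False)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```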
13.3 Results
Fig. 13.4 Class-wise performance (Precision, Recall, and F-Score) of brain tumor classification for the Glioma, Meningioma, and Pituitary classes
One of the primary challenges identified in this work is the computational efficiency
required for deploying brain tumor classification models on devices with limited
resources. While traditional deep learning models like CNNs have proven effective
in tumor detection, their heavy computational demands make them less suitable for
real-time applications on mobile or edge devices. The use of light-weight architec-
tures like MobileNet V3 is a step towards addressing this challenge, but even with
these optimizations, balancing accuracy and computational efficiency remains a crit-
ical issue. The need for more efficient data pre-processing techniques, especially
in managing noise and performing effective data augmentation, further complicates
this task. The pre-processing stage is vital for ensuring the robustness of the model,
yet it adds to the computational burden, which is already a significant challenge in
resource-constrained environments.
Future work could explore the development of even more optimized architectures
that can maintain high accuracy while further reducing computational requirements.
This could involve experimenting with novel neural network designs or integrating
more advanced transfer learning techniques to leverage existing models without
extensive retraining. Another potential direction is enhancing the model’s ability to
handle a broader range of brain tumor types beyond the three categories currently
addressed, which would involve expanding the dataset and refining the model to
ensure it can generalize across different conditions. Furthermore, continued advance-
ments in noise reduction and data augmentation strategies could improve model
robustness, particularly in dealing with the diverse quality of MRI scans encountered
in clinical settings.
13.5 Summary
This work presents a robust approach to brain tumor classification using the
MobileNet V3 architecture, demonstrating the feasibility of deploying lightweight
models for medical image analysis on resource-constrained devices. The proposed
system effectively addresses the challenges of computational efficiency and accu-
racy, achieving impressive results on the Figshare dataset, which consists of 3064
T1-weighted MRI scans categorized into glioma, pituitary tumor, and meningioma.
Specifically, the model attained precision scores of 95.02% for gliomas, 96.77% for
meningiomas, and 89.93% for pituitary tumors, with corresponding recall scores
of 96.87%, 91.6%, and 97.81%, respectively. The balanced F-scores of 95.42% for
gliomas, 90.56% for meningiomas, and 92.80% for pituitary tumors underscore the
effectiveness of the MobileNet V3 architecture in accurately classifying these tumor
types. This performance not only surpasses existing methods but also highlights the
potential of transfer learning and light-weight architectures in advancing medical
diagnostics. The system’s high accuracy and efficiency make it a promising tool
for real-time applications, particularly in clinical settings where quick and reliable
decision-making is crucial.
Bibliography
1. S. Deepak, P.M. Ameer, Brain tumor classification using deep CNN features via transfer learning.
Comput. Biol. Med. 111, 103345 (2019)
2. M.A. Khan, I. Ashraf, M. Alhaisoni, R. Damaševičius, R. Scherer, A. Rehman, S.A.C. Bukhari,
Multimodal brain tumor classification using deep learning and robust feature selection: a machine
learning application for radiologists. Diagnostics 10(8), 565 (2020)
3. H. Mohsen, E.S.A. El-Dahshan, E.S.M. El-Horbaty, A.B.M. Salem, Classification using deep
learning neural networks for brain tumors. Futur. Comput. Inform. J. 3(1), 68–71 (2018)
4. J.S. Paul, A.J. Plassard, B.A. Landman, D. Fabbri, Deep learning for brain tumor classification,
in Medical Imaging 2017: Biomedical Applications in Molecular, Structural, and Functional
Imaging, vol. 10137 (2017, March), pp. 253–268. SPIE
5. M.I. Sharif, M.A. Khan, M. Alhussein, K. Aurangzeb, M. Raza, A decision support system for
multimodal brain tumor classification using deep learning. Complex & Intell. Syst. 1–14 (2021)