Document IA
Nael Y. A. Al-Fasfous
Vollständiger Abdruck der von der Fakultät für Elektrotechnik und Informationstechnik der
Technischen Universität München zur Erlangung des akademischen Grades eines
genehmigten Dissertation.
Vorsitzender:
Prof. Dr.-Ing. Georg Sigl
Prüfende der Dissertation:
1. apl. Prof. Dr.-Ing. Walter Stechele
2. Prof. Dr.-Ing. Dr. h. c. Jürgen Becker,
Karlsruher Institut für Technologie (KIT)
Die Dissertation wurde am 04.05.2022 bei der Technischen Universität München eingereicht und
durch die Fakultät für Elektrotechnik und Informationstechnik am 18.08.2022 angenommen.
Two truths cannot contradict one another.
GALILEO GALILEI
Acknowledgement
My deepest gratitude goes to my supervisor, mentor, and role model, Prof. Dr.-Ing. Walter
Stechele, for his continuous support throughout this work. Walter’s kindness, guidance, and calm
demeanor gave me the confidence and reassurance I needed during the highs and lows of this
journey. I am eternally indebted to Walter for his impact on this chapter of my life. I also have to
express how grateful I am to have undertaken this journey with Dr.-Ing. Alexander Frickenstein
and Manoj-Rohit Vemparala. The three musketeers. A key motivator to keep going when times
are hard is to see that others have not stopped. We were the perfect sparring partners for each
other. We celebrated each other’s successes and gave comfort to each other during the hard times.
Even being away from friends, family, and most colleagues during the pandemic years, I never
felt lonely in this endeavor, always having them by my side. To my mother, my father, my sister,
and my late brother, thank you for your continuous support throughout my life; some things
cannot be expressed in words on paper. To my friends, Ramzi, Serina, Endri, Jose, Jorge, and
Anne, thank you for taking care of me like my own family would. I would like to thank my
colleagues at the Chair of Integrated Systems at the Technical University of Munich, BMW AG,
Politecnico di Torino, Karlsruhe Institute of Technology, and the FINN team at AMD Xilinx for
being great academic and industry partners and nurturing strong collaborations that will go on
beyond the scope of this work.
Abstract
Over the past decade, deep neural network (DNN) algorithms have become increasingly popular in
the field of machine learning (ML). Year-on-year improvements in off-the-shelf parallel computing
hardware and the accessibility of big data have democratized the training, optimization, and
development of DNNs. After rapidly surpassing classical algorithms in many domains, such as
autonomous driving and robotics, DNNs solidified their state-of-the-art status for a wide range
of classification and forecasting problems. Along with their popularity, new use-cases emerged
to incorporate them in more deployment scenarios, ranging from constrained edge deployment
to safety-critical settings. These presented several challenges in hardware and software design,
where tight latency, energy, and resource budgets are typically set. This work reinterprets
concepts from the mature discipline of hardware-software (HW-SW) co-design, which provides
processes for finding synergies when deploying complex algorithms on hardware with precise
execution targets and deployment costs. Handcrafted, semi-automated, and fully-automated
methodologies are proposed to introduce co-design at different stages of development with
varying design challenges. Hardware models in the form of analytical schedulers and mappers,
look-up tables, hardware-in-the-loop setups, and differentiable regression models are developed to
inject hardware-awareness into co-design problems for general-purpose or customized platforms,
and spatial or dataflow architectures. Abstraction levels are exploited to enable divide-and-
conquer approaches that tackle design challenges throughout the HW-SW development life cycle.
The contributions shed light on the benefits of bringing together algorithm and hardware design
to achieve the targets set in both worlds, while reducing the independent development effort on
both sides and avoiding incoherent design compromises. Over the course of this work, hardware
components were handcrafted to suit different types of neural network computations. Genetic
and gradient descent algorithms, autoencoders, and reinforcement learning agents were used to
compress neural networks. Fast analytical hardware models were developed for evaluation and
automated hardware design. Neural networks were made safer by analyzing threats of adversarial
attacks and hardware errors on their function, and training them for joint efficiency, robustness,
and accuracy preservation. The resulting co-designed algorithms tackled autonomous driving
problems with high efficiency, enabled power-forecasting on multiprocessor chips, provided mask
detection and correction during the COVID-19 pandemic, and empowered semi-autonomous
prostheses for amputees. The contributions of this work in HW-SW co-design of DNNs brought
applications with societal impact to edge devices.
Zusammenfassung
Im Verlauf der letzten zehn Jahre sind tiefe künstliche neuronale Netze (engl. deep neural
networks (DNNs)), eine Kategorie selbstlernender Algorithmen, im Bereich des maschinellen
Lernens (ML) immer beliebter geworden. Jährliche Verbesserungen bei handelsüblicher par-
alleler Computerhardware und die Zugänglichkeit von “Big Data” haben das Training, die
Optimierung und die Entwicklung von neuronalen Netzen demokratisiert. Nachdem sie die
klassischen Algorithmen in vielen Bereichen wie dem autonomen Fahren und der Robotik schnell
überholt hatten, festigten DNNs zunehmend ihren Status als Stand der Technik für ein breites
Spektrum von Klassifizierungs- und Antizipationsaufgaben. Durch ihre Popularität entstanden
neuartige Anwendungsfälle und eine Vielzahl an Einsatzszenarien, welche von Anwendungen
in stark eingeschränkten eingebetteten Systemen bis hin zu sicherheitskritischen Anwendungen
reichen. Dies stellte eine Reihe von Herausforderungen für das Hardware- und Softwaredesign
dar, bei denen in der Regel enge Latenz-, Energie- und Ressourcenbudgets vorgegeben sind. In
dieser Arbeit werden Konzepte aus der ausgereiften Disziplin des Hardware-Software (HW-SW)
Co-designs neu interpretiert, die Prozesse zur Erzielung von Synergien bei der Bereitstellung
komplexer Algorithmen auf Hardware mit präzisen Ausführungszielen und Bereitstellungskosten
bieten. Es werden im Rahmen dieser Arbeit handgefertigte, halbautomatische und vollau-
tomatische Methoden vorgeschlagen, um das Co-design in verschiedenen Entwicklungsstadien
mit unterschiedlichen Designherausforderungen einzuführen. Hardware-Modelle in Form von
analytischen Schedulern, Umsetzungstabellen, “Hardware-in-the-Loop” Simulationen und dif-
ferenzierbaren Regressionsmodellen werden entwickelt, um ein Hardwareverständnis in die
Co-design Aufgabe für Allzweck- oder kundenspezifische Plattformen sowie in räumliche oder
Datenfluss-Architekturen einzubringen. Abstraktionsebenen wurden ausgenutzt, um Teile-und-
herrsche (engl. divide-and-conquer) Verfahren zur Bewältigung von Designherausforderungen
während des gesamten HW-SW-Entwicklungszyklus zu ermöglichen. Die Beiträge beleuchten
die Vorteile der Zusammenführung des Algorithmen- und Hardwaredesigns, um die in beiden
Welten gesetzten Ziele zu erreichen und gleichzeitig den unabhängigen Entwicklungsaufwand
zu reduzieren und inkohärente Designkompromisse zu vermeiden. Im Rahmen dieser Arbeit
werden Hardwarekomponenten zur Berechnung von verschiedenen Arten von neuronalen Net-
zen entwickelt. Genetische Algorithmen und Gradientenabstiegsalgorithmen, Autocodierer und
Agenten für bestärkendes Lernen (engl. reinforcement learning) werden zur Komprimierung
neuronaler Netze eingesetzt. Es werden schnelle analytische Hardwaremodelle für die Bewertung
und den automatischen Entwurf von Hardware entwickelt. Zudem werden neuronale Netze
sicherer gemacht, indem die Bedrohungen durch gegnerische Angriffe und Hardwarefehler auf
ihre Funktion analysiert und sie für eine gemeinsame Effizienz, Robustheit und Erhaltung der
Genauigkeit trainiert werden. Die daraus resultierenden, gemeinsam entwickelten Algorith-
men bewältigten Herausforderungen im Bereich des autonomen Fahrens mit hoher Effizienz,
ermöglichten Leistungsvorhersagen auf Multiprozessor-Chips, boten eine Maskenerkennung und
-korrektur während der COVID-19-Pandemie und ermöglichten halbautonome Prothesen für Am-
putierte. Die Beiträge dieser Arbeit zum HW-SW-Co-design von DNNs brachten Anwendungen
mit gesellschaftlicher Bedeutung auf eingebettete Systeme und Edge-Devices.
Contents
Abstract vii
Zusammenfassung ix
List of Figures xv
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Academic Work and Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Copyright Notice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Background 7
2.1 Fundamentals of Artificial Neural Networks . . . . . . . . . . . . . . . . . . . 7
2.1.1 Dense Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Activation Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.3 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . 9
2.1.3.1 Convolutional Layers . . . . . . . . . . . . . . . . . . . . . 10
2.1.3.2 Pooling Layers . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.3.3 Batch Normalization . . . . . . . . . . . . . . . . . . . . . . 12
2.1.3.4 Dilated Convolution Layers . . . . . . . . . . . . . . . . . . 12
2.1.4 Learning and Classifying . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Compression and Optimization of Deep Neural Networks . . . . . . . . . . . . 13
2.2.1 Data Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Parameter Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.3 Neural Architecture Search . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.4 Adversarial Training . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Hardware Acceleration of Deep Neural Networks . . . . . . . . . . . . . . . . 20
2.3.1 Deep Neural Networks on General-Purpose Hardware . . . . . . . . . 20
4 Handcrafted Co-Design 37
4.1 OrthrusPE: Runtime Reconfigurable Processing Elements for Binary Neural
Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1.1 BNN Training Challenges and Motivation for Reconfigurable PEs . . . 38
4.1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.3 Accurate Binary Convolutional Neural Networks . . . . . . . . . . . . 39
4.1.4 OrthrusPE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1.4.1 SIMD Binary Hadamard Product in Binary Mode . . . . . . 42
4.1.4.2 Arithmetic Operations in Fixed-Precision Mode . . . . . . . 43
4.1.4.3 Mode Switching and Partial Sum Accumulation . . . . . . . 44
4.1.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.1.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 46
4.1.5.2 Resource Utilization Analysis . . . . . . . . . . . . . . . . . 47
4.1.5.3 Dynamic Power Analysis . . . . . . . . . . . . . . . . . . . 48
4.1.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Mind the Scaling Factors: Resilience Analysis of Quantized Adversarially Robust
CNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.1 Hardware Fault Resilience and Adversarial Robustness . . . . . . . . . 50
4.2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.2.1 Hardware Fault Resilience Analysis . . . . . . . . . . . . . . 51
4.2.2.2 Fault Resilient Training and Adversarial Robustness . . . . . 52
4.2.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2.3.1 Problem Formulation: Quantization and Bit-Flips . . . . . . 52
4.2.3.2 Error Model and Benchmark Phases . . . . . . . . . . . . . 55
4.2.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.4.1 Large Scale Resilience Analysis . . . . . . . . . . . . . . . . 57
4.2.4.2 In-depth Analysis of Adversarially Trained CNNs . . . . . . 59
4.2.4.3 Results and Conclusions . . . . . . . . . . . . . . . . . . . . 59
4.2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5 Semi-Automated Co-Design 63
5.1 Binary-LoRAX: Low-power and Runtime Adaptable XNOR Classifier for Pros-
thetic Hands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.1.1 HW-DNN Co-design for Intelligent Prosthetics . . . . . . . . . . . . . 64
5.1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.1.2.1 Efficient Intelligent Prosthetics . . . . . . . . . . . . . . . . 65
5.1.2.2 Binary Neural Networks for Intelligent Prosthetics . . . . . . 65
5.1.2.3 The XILINX FINN Framework . . . . . . . . . . . . . . . . 66
5.1.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.1.3.1 Training and Inference of Simple BNNs . . . . . . . . . . . 66
5.1.3.2 Hardware Architecture . . . . . . . . . . . . . . . . . . . . 68
5.1.3.3 Runtime Dynamic Frequency Scaling . . . . . . . . . . . . . 68
5.1.3.4 SIMD Binary Products on DSP Blocks . . . . . . . . . . . . 69
5.1.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 69
5.1.4.2 Design Space Exploration . . . . . . . . . . . . . . . . . . . 70
5.1.4.3 Runtime Dynamic Frequency Scaling . . . . . . . . . . . . . 70
5.1.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 BinaryCoP: BNN COVID-19 Face-Mask Wear and Positioning Predictor . . . 74
5.2.1 Efficient Deployment of CNNs for Mask Detection . . . . . . . . . . . 74
5.2.2 COVID-19 Face-Mask Wear and Positioning . . . . . . . . . . . . . . 75
5.2.3 BNN Interpretability with Grad-CAM . . . . . . . . . . . . . . . . . . 75
5.2.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.2.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 77
5.2.4.2 Design Space Exploration . . . . . . . . . . . . . . . . . . . 78
5.2.4.3 Grad-CAM and Confusion Matrix Analysis . . . . . . . . . 79
5.2.4.4 Comparison with Other Works . . . . . . . . . . . . . . . . 82
5.2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6 Fully-Automated Co-Design 85
6.1 HW-FlowQ: A Multi-Abstraction Level HW-CNN Co-Design Quantization
Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.1.1 The Tripartite Search Space . . . . . . . . . . . . . . . . . . . . . . . 86
6.1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.1.2.1 Quantization Methods . . . . . . . . . . . . . . . . . . . . . 87
6.1.2.2 Quantization & Search Schemes . . . . . . . . . . . . . . . 87
6.1.2.3 Hardware Modeling . . . . . . . . . . . . . . . . . . . . . . 87
6.1.2.4 Hardware-Software Co-Design . . . . . . . . . . . . . . . . 88
6.1.3 HW-FlowQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.1.3.1 HW-Model Abstraction Levels . . . . . . . . . . . . . . . . 89
6.1.3.2 Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . 91
6.1.3.3 Fitness Evaluation . . . . . . . . . . . . . . . . . . . . . . . 93
6.1.3.4 Genetic Operators . . . . . . . . . . . . . . . . . . . . . . . 94
6.1.3.5 Modeling Mixed-Precision Inference . . . . . . . . . . . . . 94
Bibliography 131
A Appendix 147
List of Figures
4.1 Binary bases can differentiate the values of the full-precision kernel more accu-
rately by preserving more information through linear transformations. . . . . . 40
4.2 Preconditioning signals A, B and C to compute five 3× 3 Hadamard products.
Pixels represented with an X are not relevant for this cycle of operation. . . . . 42
4.3 The DSP48E1 Slice [1]. Appended bold paths illustrate the relevant signals for
our operating modes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 SIMD register utilization of the DSP48 in OrthrusPE, with and without partial
operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5 Block diagram showing the main components of the OrthrusPE . . . . . . . . . 45
4.6 Switch count and partial result memory analysis for a single input channel from
different convolutional layers of binary ResNet18, with M = 3, N = 3. Each
point represents a different configuration of P . . . . . . . . . . . . . . . . . . . 46
4.7 Synthesis results for look-up table (LUT) utilization across different design target
frequencies. Each plot point represents a different synthesis run. . . . . . . . . 47
4.8 Dynamic power estimation at different design target frequencies. Each plot point
represents a different synthesis run. . . . . . . . . . . . . . . . . . . . . . . . . 48
4.9 Batch-norm limits activation range at training time, effectively lowering v and c
of the subsequent convolutional layer at deployment time (on hardware). Errors
in the convolutional layer can at most grow in magnitude to the defined clip c of
the next layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.10 Layer-wise scaling factors v of ResNet20 CNNs trained on CIFAR-10, with
and without batch-norm. Works investigating bit-flips on aged CNNs (without
batch-norm after every layer) cannot be extended to modern CNNs. . . . . . . 54
4.11 Adversarial attacks apply input perturbations to cause incorrect classifications.
Training for such attacks implies training for pixel value distributions outside of
the natural dataset. Differently, hardware faults can occur at any point within the
CNN, and are not limited to the input of the network. . . . . . . . . . . . . . . 54
4.12 Parameters to determine bit-flip characteristics of the benchmark. . . . . . . . . 56
4.13 Bit-flip experiments following algorithm 4.1 on vanilla, pruned and adversarially
trained ResNet20 and ResNet56. Each bar represents the failure rate of a par-
ticular bit-flip setting {f, t, b, m} tested over 10K test images. Each sub-figure
comprises 900K bit-flip experiments. . . . . . . . . . . . . . . . . . . . . . . . 58
4.14 Convolutional layer scaling factors for vanilla trained and adversarially robust
variants of ResNet20 and ResNet56. High weight decay (αd = 0.05) brings the
high scaling factors v of FastAT back to vanilla levels. . . . . . . . . . . . . . 60
5.1 KIT Prosthetic Hand (50th percentile female) with Zynq Z7010-based processing
system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2 Overview of Binary-LoRAX: binary neural network (BNN) tensor slices are
fed into digital signal processing (DSP) blocks which perform high-throughput
XNOR operations. DSP results are forwarded to the PEs of a matrix-vector-
threshold unit (MVTU). A single MVTU of the pipeline is shown for com-
pactness. Runtime frequency scaling allows high-performance functions, or
power-saving mode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.3 The large input image is sliced into smaller images and reclassified. High
confidence classifications are bounded. . . . . . . . . . . . . . . . . . . . . . . 72
5.4 Runtime frequency scaling ranging from 2MHz to 111MHz for the v-CNV
prototype. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.5 Runtime change in operation mode based on application scenario, e.g. motion,
delicate object or low battery. . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.6 Main components of BinaryCoP. The BNN requires low memory and provides
good generalization. The FINN-based accelerator allows for privacy-preserving
edge deployment of the algorithms without sacrificing performance. The synthetic
data helps in maintaining a diverse set of subjects and gradient-weighted class
activation mapping (Grad-CAM) can be used to assert the features being learned. 76
5.7 The Grad-CAM approach used to assert that correct and reasonable features are
being learned from the synthetic data. . . . . . . . . . . . . . . . . . . . . . . 77
5.8 Binary operations and layer-wise latency estimates based on PE/single instruction
multiple data (SIMD) choices for BinaryCoP-n-CNV. . . . . . . . . . . . . . . 79
5.9 Confusion matrix of BinaryCoP-CNV on the test set. . . . . . . . . . . . . . . 80
5.10 Grad-CAM output of two BinaryCoP variants and a single-precision floating-
point (FP32) CNN. Results are collected for all four wearing positions on a
diverse set of individuals. Binarized models show distinct regions of interest
which are focused on the exposed part of the face rather than the mask. The FP32
model is difficult to interpret in some cases. It is recommended to view this
figure in color. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.11 Grad-CAM results for age generalization. It is recommended to view this
figure in color. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.12 Grad-CAM results for hair/headgear generalization. It is recommended to view
this figure in color. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.13 Grad-CAM results for face manipulation with double-masks, face paint and
sunglasses. It is recommended to view this figure in color. . . . . . . . . . . 83
6.7 2-D projections of three 3-D Pareto-fronts for ResNet56 quantization: left to
right (|P|, generations) = (25, 25), (25, 50), (50, 50). Grey to black shades
represent Pareto-fronts of older to newer generations, red points belong to the
final Pareto-front. It is recommended to view this figure in color. . . . . . . 103
6.8 2-D projections of 3-D Pareto-fronts of 3 exploration experiments on ResNet20
for CIFAR-10 for hardware dimensioning, bit-serial processing and dataflow
variants. It is recommended to view this figure in color. . . . . . . . . . . . . 105
6.9 Layer-wise bitwidth strategy for BS-256 hardware. Batch size 1 (left) and 4
(right). non-dominated sorting genetic algorithm (NSGA-II) compensates for
larger activations (batch=4) by lowering bA and maintains accuracy by increasing
bW , when compared to batch=1 inference. . . . . . . . . . . . . . . . . . . . . 107
6.10 Layer-wise bitwidths (bW =bA ) of a DeepLabv3 Pareto-choice strategy with
67.3% mean intersection over union (mIoU) on Cityscapes. Short and parallel
layers have bA equal to their respective bottom layer. . . . . . . . . . . . . . . 108
6.11 Qualitative results of DeepLabv3 quantization on Cityscapes scenarios. Black re-
gions have no ground-truth labels. Pareto-choice has 21.6% fractional operations
(Frac. OPs) compression compared to uniform 8-bit PACT. It is recommended
to view this figure in color. . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.12 High-level abstraction of a bit-serial accelerator [6]: The dimensions Dm , Dn , Dk
determine the tiling degree of matrices RHS and LHS. . . . . . . . . . . . . . 116
6.13 Validation of the HW-model vs. real HW measurements for compute cycles
and DRAM accesses on three BISMO configurations (HW1-3). Small and large
workloads are verified from ResNet20-CIFAR-10 (left) and ResNet18-ImageNet
(right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.14 AnaCoNGA: Each individual from quantization strategy search (QSS) executes
its own hardware architecture search (HAS) multi-objective genetic algorithm
(MOGA). Any QSS individual can prove itself efficient on its own hardware
design to get a chance for its accuracy to be evaluated. QSS is relieved from
optimizing hardware and is transformed to a single objective genetic algorithm
(SOGA) (i.e. accuracy focused). . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.15 HAS: 2-D projections of a 4-D Pareto-front in a multi-objective search space.
The genetic algorithm (GA) optimizes for hardware resources (LUTs, block
random-access memory (BRAM)) and performance metrics (dynamic random-
access memory (DRAM) accesses, execution cycles) for ResNet20 (top) and
ResNet18 (bottom). The proposed analytical model allows for fast exploration
and evaluation of solutions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.16 Breakdown of execution on synthesized hardware. Higher DRAM accesses are
correlated with lower compute efficiency and stalls. AnaCoNGA reduces latency
and DRAM accesses while maintaining high accuracy. . . . . . . . . . . . . . 124
A.1 QSS: 2-D projections of a 3-D Pareto-front for optimal quantization with respect
to accuracy, compute cycles, and DRAM accesses on HW3. Compute cycles
and DRAM accesses are normalized to an 8-bit execution on HW3. “Reward
Accuracy” is with minimal fine-tuning (not fully trained). It is recommended to
view this figure in color. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
A.2 Comparison of a HAS solution (Dm , Dn , Dk = 8, 14, 96) found for ResNet18-
ImageNet 4-bit against the larger standard symmetric hardware configuration
HW3. The CONV1 layer follows the same trend but is not shown to maintain
plot scale. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
List of Tables
4.1 Requirements of most common binary neural networks and the respective hard-
ware operations for execution. . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Resource Utilization results of the tested implementations. . . . . . . . . . . . 47
4.3 Summary of results on shallow (ResNet20) and deep (ResNet56) CNNs as vanilla,
pruned, and adversarially trained variants. Percentage improvement shown for
FastAT αd = 0.05 over regular FastAT. . . . . . . . . . . . . . . . . . . . . . . 61
5.1 Hardware results of design space exploration. Power is averaged over a period of
100 seconds of operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2 Hardware results of design space exploration. Power is averaged over 100s of
operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.1 Hardware configurations and normalized access energy costs used for experiments
and validation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2 ResNet20 for CIFAR-10 quantized at different abstraction levels of the Spatial-
256 hardware with SOGA and NSGA-II. . . . . . . . . . . . . . . . . . . . . . 101
6.3 Quantization of ResNet20 for CIFAR-10 on different hardware dimensions. . . 106
6.4 Quantization of ResNet20 for CIFAR-10 on bit-serial accelerators. . . . . . . . 106
6.5 Quantization of ResNet20 for CIFAR-10 on different dataflows. . . . . . . . . 107
6.6 Comparison of HW-FlowQ with state-of-the-art quantization methods on Eyeriss-
256 Vectorized. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.7 Classification of HW-CNN optimization methods. . . . . . . . . . . . . . . . . 114
6.8 Hardware configurations used for model validation. . . . . . . . . . . . . . . . 117
6.9 Hardware and quantization search space. . . . . . . . . . . . . . . . . . . . . . 121
6.10 Quantization and hardware design experiments. Uniform and standalone QSS
are executed on a standard edge variant (HW3) used in [7]. Latency and DRAM
are measured on hardware. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
List of Abbreviations
AI artificial intelligence.
CTC computation-to-communication.
EMG electromyographic.
FC fully-connected.
GA genetic algorithm.
HIL hardware-in-the-loop.
HV hypervolume.
HW hardware.
HW-SW hardware-software.
IS input-stationary.
MAC multiply-accumulate.
ML machine learning.
OP operation.
OS output-stationary.
PE processing element.
PL programmable logic.
PS processing system.
RF register file.
RL reinforcement learning.
RS row-stationary.
SoC system-on-chip.
WS weight-stationary.
List of Symbols
C Effective capacitance.
Dk PE SIMD lanes.
Dm PE array height.
Dn PE array width.
F Fitness.
I Input image.
P Predictions.
Γ Learning rate.
αd Weight decay.
δ Adversarial perturbation.
E Expected loss.
A Activation tensor/matrix.
D Dataset.
G Genetic algorithm.
L Loss.
N Neural network.
P Population.
T Set of datatypes.
xb Binary variable.
xq Quantized variable.
µ Hardware model.
ψ Task-related accuracy.
ρ Individual/Genome.
f Clock frequency.
m Matrix rows.
n Matrix columns.
s Convolution stride.
1 Introduction
ALGORITHMS are, in their simplest form, operations carried out in a defined sequence. Fundamentally, algorithms are omnipresent, both in natural and artificial forms, simple and complex, explainable and emergent. The seemingly miraculous
existence of animate, biological organisms is, at its core, a composition of looped algorithms
in the form of chemical reactions involving inanimate material. The elegance of algorithms
can be compelling enough to convince an observer that the whole is greater than the sum of its parts.
Consciousness and intelligence are examples of such phenomena.
Algorithms can be thought of as blueprints, existing as abstract concepts. Without an imple-
mentation in the real world, they cannot interact with or affect real systems. In biological systems,
algorithms are implemented through matter and chemical processes. The existence of matter
in particular amounts under specific conditions results in sequences of chemical processes and
reactions, based on the fundamental physical properties of the matter. The algorithm is, therefore,
more of a description of what happens in such settings than a planned sequence of operations
to be executed in a particular manner. Scaling up in abstraction from fundamental chemical
processes, we can consider biological algorithms at the neuronal level in the nervous systems of
complex organisms. The placement, positioning, firing rate, and other properties of the neurons in
the context of an organism’s brain produce its reasoning and interaction with the physical world.
The algorithm resulting in how an individual behaves is not designed beforehand, but emerges
from the neural networks and the biological processes of the individual. The algorithm and the
biological matter are one and the same.
For artificial algorithms, a human-designed set of operations is planned. This abstract artificial
algorithm must then be executed in some form to be tested in the real world. Here, a second stage
of design takes place, where the execution medium must be decided. Hence, artificial algorithms
require two phases of development, one for the algorithm itself, and one for its execution medium.
The difference between algorithms emerging in nature and algorithms developed by humans
is that the former fundamentally leads to one holistic manifestation of algorithm and medium,
whereas the latter is a two-step process, which decouples the planned artificial algorithm and
the design of the medium through which it interacts with the real world. Co-designing artificial
algorithms and their execution medium is therefore not only logical, but also the most natural
approach to achieving efficient, performant, and seemingly miraculous algorithms, like those
we observe in nature.
The work presented in this dissertation focuses on co-designing hardware and artificial deep
neural network algorithms. Hardware components were handcrafted to suit different types of
neural network computations [8]. Genetic and gradient descent algorithms, autoencoders, and
reinforcement learning agents were used to compress neural networks, while fast analytical
hardware models were developed for evaluation and automated hardware design [9, 10, 11, 12,
13, 14, 15]. Neural networks were made safer by analyzing threats of adversarial attacks and
hardware errors on their function, and training them for joint efficiency, robustness, and accuracy
preservation [16, 17, 18]. The resulting co-designed algorithms tackled autonomous driving
problems with high efficiency [19, 20], enabled power-forecasting on multiprocessor chips [21,
22, 23], provided mask detection and correction during the COVID-19 pandemic [24], and
empowered semi-autonomous prosthetics to help amputees [25].
In the following sections 1.1-1.4, the motivation of the work is elaborated further, the objectives
are listed, and the scope of the dissertation is defined. In chapter 2, a coarse background
and literature review of relevant related topics is presented. Chapter 3 covers the pitfalls of
sub-optimal deployments, incoherent co-design, and other challenges faced by machine learning
(ML) and hardware (HW) engineers in this field. This chapter also introduces the paths proposed
to achieve hardware-software (HW-SW) co-design for deep neural network (DNN) deployments.
Chapters 4, 5, and 6 elaborate the proposed paths towards HW-DNN co-design by presenting
six design challenges tackled by handcrafted, semi-automated, and fully-automated co-design
techniques. Finally, chapter 7 concludes the dissertation and presents the outlook and future work
in this field.
1.1 Motivation
Over the past decade, DNN algorithms have become increasingly popular in the field of ML.
Year-on-year improvements in off-the-shelf parallel computing hardware and the accessibility
of big data have democratized the training, optimization, and development of DNNs [26, 27].
After rapidly surpassing classical algorithms in many domains, such as autonomous driving
and robotics, DNNs solidified their state-of-the-art status for a wide range of classification and
forecasting problems [28, 29, 30].
Along with the popularity of DNNs, new use-cases emerged to incorporate them in more
deployment scenarios, ranging from constrained edge deployment to safety-critical settings.
These present several challenges in hardware and software design, where tight latency, energy,
and resource budgets are typically set. In essence, the challenge of developing efficient DNN
deployments necessitates searching multiple design spaces, from hardware designs to neural
network architectures and the compression space. Several high-impact works in this field have
investigated one design space at a time, with the assumption that solutions from other design
spaces are static and/or already provided [7, 31, 32, 33, 34, 35, 36, 37]. A large co-design
opportunity is often missed, where multiple search spaces are open for co-exploration.
HW-SW co-design is a mature discipline which provides processes for finding synergies
when deploying complex algorithms on hardware with precise execution targets and deployment
costs [38, 2]. These processes heavily rely on divide-and-conquer approaches to achieve near-
optimal solutions in prohibitively large design spaces. Techniques from this field can produce
solutions to new problems emerging in the field of edge DNN deployment. In this work, HW-SW
co-design methods are reinterpreted and applied to DNN deployment challenges.
1.2 Objectives
This work sets out to identify the challenges of HW-DNN co-design at different stages of
development, for different use-cases, and different deployment goals. The objectives can be
summarized in the following:
• Identifying key characteristics to classify DNN design and optimization problems, which
help in planning and choosing the correct methods for search and metaheuristics, design
automation, and handcrafted design.
• Analyzing the design challenges that occur throughout the development life cycle of DNN
deployments and breaking down the complex co-design paradigm into stages that can be
addressed independently with less effort.
1.3 Contributions
Inspired by the discipline of HW-SW co-design, this work studies the holistic formulation of
hardware design, DNN design and training, and optimization techniques, through a combination
of models, metaheuristics, and expert knowledge. Different paths towards co-design are proposed
based on the properties of the design problem at hand. Enabled by these paths, multiple design
challenges which necessitate handcrafted to fully-automated solutions are presented.
The contributions of this work can be summarized in the following:
1.4 Academic Work and Scope
The following papers were published in peer-reviewed conferences and journals during the
course of this work, but are out of the scope of this thesis:
2 Background
ARTIFICIAL INTELLIGENCE is a broad term for algorithms which implement functions that exhibit a general perception of intelligence. Although
defining intelligence itself is complex, artificial intelligence (AI) typically refers to
algorithms which perform classification, forecasting, decision-making, and generative tasks.
Machine learning (ML) algorithms are an approach to realize artificial intelligence. Particularly,
ML algorithms are able to learn the task at hand without being explicitly programmed for it.
Exposure to data allows such algorithms to improve their performance in executing the desired
task. Deep neural networks (DNNs) are one such class of algorithms, loosely inspired by
biological neural networks, and are the focus of this dissertation. In the following sections, the
fundamental components of DNNs are presented. Following that, a specialized form of DNNs
suited for computer vision tasks is presented, namely convolutional neural networks (CNNs).
The procedure through which such algorithms learn and perform their tasks is elaborated. The
challenges of deploying these algorithms in resource constrained settings are discussed, followed
by an overview of optimization and hardware acceleration techniques applied to DNNs.
(Figure 2.1: a layered network from inputs to outputs; a single neuron combines its inputs $a^{in}_j$ through weights $w^l_j$, a bias $b^l$, and a non-linearity $\alpha$ into an output $a^{out}$.)
The weights of a neuron scale the information from each preceding neuron with respect to the current neuron’s sub-function. This
creates paths through the network which get activated when certain data patterns are observed.
The phenomenon is similar to strong and weak neural firing paths in biological brains. Following
the weighted sum of inputs, a bias term b is added to the resulting value. The composition of
linear operations till this point cannot represent a non-linear function, which would limit the
neural network’s representation capability. For this reason, the neuron’s output is finally activated
with a non-linear function α, typically referred to as the activation function.
\[ a^{out} = \alpha\Big(\sum_{j} w_j\, a^{in}_j + b\Big) \tag{2.1} \]
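To make equation 2.1 concrete, the following minimal NumPy sketch (illustrative, not code from this work) evaluates a single neuron; the choice of ReLU as the non-linearity α is an assumption for the example.

```python
import numpy as np

def neuron_forward(a_in, w, b, alpha=lambda z: np.maximum(z, 0.0)):
    """Equation 2.1: weighted sum of inputs plus bias, passed through a non-linearity.
    a_in: vector of input activations a_j^in
    w:    vector of weights w_j
    b:    scalar bias
    alpha: activation function (ReLU here, as an example)
    """
    return alpha(np.dot(w, a_in) + b)

# Example: a neuron with three inputs.
print(neuron_forward(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, -0.2]), b=0.05))
```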
Generally, artificial neural networks have their neurons organized in layers, as shown in
figure 2.1. In the simplest form, a feed-forward network has layers in a sequential order, one
feeding its output to the next. Other networks exist where recurrent reuse of activations among
the layers takes place, or bypass and parallel paths are incorporated in the network [28]. A
network is typically referred to as a DNN when it is composed of more than three layers. The depth of a neural
network is simply one parameter in deciding its architecture, among other important parameters
such as the number of neurons, their organization, the reuse and bypass of activations, etc. Deeper
neural networks tend to exhibit better performance on complex tasks, as they have more layers
to aggregate simple features into complex ones [28, 40]. In the following subsections, common
neural network layers and operations relevant to this dissertation are introduced.
2.1.1 Dense Layers
In a dense (fully-connected) layer, every neuron is connected to all activations of the preceding layer, so the layer can be written compactly as a matrix operation:
\[ \mathbf{A}^{l} = \alpha\big(\mathbf{A}^{l-1}\,\mathbf{W}^{l} + b\big) \tag{2.2} \]
Dense layers remain prevalent in modern neural network architectures, such as transformers
and convolutional neural networks [40, 28, 41]. They are typically memory-bound due to the
high number of unique weights that are needed for each fully-connected neuron. For this reason,
most modern neural network architectures employ dense layers only after the input activation
dimensions have been reduced by other preceding layers in the network. Their main function in
CNNs is to combine features extracted from preceding layers into classification logits [42].
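As a sketch of equation 2.2 (illustrative, not the implementation used in this work), a dense layer reduces to one matrix multiplication; the batch size, layer widths, and the tanh activation below are assumed example values.

```python
import numpy as np

def dense_layer(A_prev, W, b, alpha=np.tanh):
    """Equation 2.2: A^l = alpha(A^{l-1} W^l + b).
    A_prev: (N, F_in) batch of input activations
    W:      (F_in, F_out) weight matrix, one column of weights per neuron
    b:      (F_out,) bias vector
    """
    return alpha(A_prev @ W + b)

# A 3-neuron dense layer applied to a batch of two 4-dimensional inputs.
A = dense_layer(np.random.rand(2, 4), np.random.rand(4, 3), np.zeros(3))
print(A.shape)  # (2, 3)
```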
2.1.2 Activation Functions
(Figure 2.2 panels: (a) Sigmoid activation function. (b) Hyperbolic tangent activation function. (c) Rectified linear unit (ReLU). (d) Leaky ReLU.)
Figure 2.2: Activation functions used to introduce non-linearities in DNNs. The simple ReLU function is
the most commonly used in modern DNNs due to its computational simplicity and effective
training results.
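For reference, the four activations plotted in figure 2.2 can be written in a few lines of NumPy (an illustrative sketch; the leaky-ReLU slope of 0.01 is an assumed, commonly used default).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, negative_slope=0.01):
    # Pass negative inputs through with a small slope instead of clamping them to zero.
    return np.where(x >= 0.0, x, negative_slope * x)
```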
2.1.3 Convolutional Neural Networks
CNNs draw inspiration from early experiments on the visual cortex, in which researchers found that individual neurons respond to stimuli in restricted regions of the visual field [44]. Their discoveries also classified neurons of the visual cortex into
simple and complex ones, based on the features they react to. For example, simple neurons react
to lines and edges of specific orientation, whereas complex neurons identify sub-features and
patterns irrespective of orientation. This inspired Fukushima to develop the first neural networks
with such inductive biases [39], followed by the modernization of the concept by LeCun et al.
who created LeNet [42], laying the groundwork for today’s CNNs.
The weights of a convolutional layer are organized as a 4-D tensor in RKx ×Ky ×Ci ×Co . A set
of weights Kx × Ky form a kernel, which defines the 2-D receptive field of the convolution.
All kernels along the input channel dimension Ci represent a single filter. All filters along the
output channel dimension Co compose the weights of the convolutional layer. Each neuron in a
convolutional layer reacts to a Kx × Ky × Ci region of the input. The subsequent convolutional
layer would effectively have a larger receptive field, as it aggregates simpler features detected by
the preceding convolutional layer [40]. This inductive bias not only makes CNNs perform better
on localized visual data, but also lowers their weight count compared to a fully-connected DNN.
For a 32×32 pixel image with 3 color channels, a single fully-connected neuron would require
3072 weights. In contrast, a typical 3×3×3 convolution filter would have 27 weights, which are
then shared among multiple neurons reacting to different regions of the image as the filter moves
Figure 2.3: Visualization of the convolution operation in CNNs. The computation of a single output pixel
is highlighted.
over the input by a stride of s. For completeness, equation 2.3 represents the computation of
a single neuron in a convolutional layer (i.e. a single output pixel), which is also visualized in
figure 2.3. The input activation tensor of layer l is denoted by Al−1 ∈ RXi ×Yi ×Ci and the output
tensor is Al ∈ RXo ×Yo ×Co . Xi and Yi are the spatial width and height of the input activation,
whereas Xo and Yo correspond to the width and height of the output activation. Activation tensors
in CNNs can also be referred to as feature maps. The two tensors Al−1 and Wl represent the
input feature maps and the weights of the convolutional layer, respectively. The bias addition and
batch dimension are not shown for simplicity. Lastly, s represents the stride of the weight kernels
over the input feature map.
\[ \mathbf{A}^{l}[c_o][x_o][y_o] = \sum_{c_i}^{C_i} \sum_{k_x}^{K_x} \sum_{k_y}^{K_y} \mathbf{A}^{l-1}[c_i][x_o \cdot s + k_x][y_o \cdot s + k_y] \cdot \mathbf{W}^{l}[c_o][c_i][k_x][k_y] \tag{2.3} \]
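The index arithmetic of equation 2.3 can be spelled out as a naive loop nest (an illustrative NumPy sketch, not an optimized or hardware implementation); variable names mirror the notation above, and padding and bias are omitted as in the equation.

```python
import numpy as np

def conv2d_naive(A_prev, W, s=1):
    """Direct implementation of equation 2.3 (no padding, no bias).
    A_prev: input feature map of shape (Ci, Xi, Yi)
    W:      weights of shape (Co, Ci, Kx, Ky)
    s:      stride of the kernel over the input
    """
    Ci, Xi, Yi = A_prev.shape
    Co, _, Kx, Ky = W.shape
    Xo, Yo = (Xi - Kx) // s + 1, (Yi - Ky) // s + 1
    A = np.zeros((Co, Xo, Yo))
    for co in range(Co):
        for xo in range(Xo):
            for yo in range(Yo):
                # One output pixel: reduce over the input channels and the Kx x Ky window.
                for ci in range(Ci):
                    for kx in range(Kx):
                        for ky in range(Ky):
                            A[co, xo, yo] += A_prev[ci, xo * s + kx, yo * s + ky] * W[co, ci, kx, ky]
    return A
```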
Pooling layers downsample the spatial dimensions of the intermediate activation maps in the
network by applying a striding window operation which collapses the covered region into a single
output. This single output pixel is often the maximum value that is present in the pooling window
(max-pool), or the average of all values in the window (average-pool) [45]. Pooling layers
offer many advantages, from reducing the computational complexity and memory demands to
regularization effects which control overfitting.
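A minimal sketch of max-pooling on a single 2-D feature map (illustrative; a square, non-overlapping window of size p is assumed):

```python
import numpy as np

def max_pool2d(A, p=2):
    """Collapse each non-overlapping p x p window of a (X, Y) feature map to its maximum."""
    X, Y = A.shape
    Xo, Yo = X // p, Y // p
    out = np.zeros((Xo, Yo))
    for xo in range(Xo):
        for yo in range(Yo):
            out[xo, yo] = A[xo * p:(xo + 1) * p, yo * p:(yo + 1) * p].max()
    return out
```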
Batch normalization layers condition the activations to have zero mean and unit variance [46].
This speeds up the training process and generally improves the accuracy of DNNs. Although a
definitive explanation for these improvements has not been found, it is thought that the reason
lies in the network not needing to learn widely different input distributions for each batch of
inputs during training, mitigating the problem of internal covariate shift [46]. Equation 2.4 shows
the batch normalization operation applied to one activation pixel a. µbn is the mean of the batch
activations and σbn is the standard deviation. ϵstab is an arbitrarily small value added to
maintain numerical stability. γbn and βbn are scale and shift parameters learned to improve the
representation capability of the normalized tensor.
\[ a_{norm} = \frac{a - \mu_{bn}}{\sqrt{\sigma_{bn}^{2} + \epsilon_{stab}}}\,\gamma_{bn} + \beta_{bn} \tag{2.4} \]
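Equation 2.4 at inference time amounts to a normalize-scale-shift per activation. The sketch below is illustrative; the statistics and the learned parameters are taken as given inputs, and the default ϵstab value is an assumption.

```python
import numpy as np

def batch_norm(a, mu_bn, sigma_bn, gamma_bn, beta_bn, eps_stab=1e-5):
    """Equation 2.4: normalize an activation, then scale and shift it."""
    a_hat = (a - mu_bn) / np.sqrt(sigma_bn ** 2 + eps_stab)
    return gamma_bn * a_hat + beta_bn
```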
Batch normalization also allows for high learning rates during training, reduces the emphasis
on weight initialization, and improves generalization. These advantages have made batch normal-
ization a fundamental layer in modern CNNs as well as other DNN architectures. Nevertheless,
it has some disadvantages, most prominent of which is the increase in computational overhead
at run-time, as well as introducing a discrepancy between the model’s performance on training
and test samples versus real-world data. These reasons have motivated researchers to investi-
gate normalization-free neural networks [47]; however, batch normalization still remains prevalent in
state-of-the-art DNNs to date.
Early CNNs were used for classification tasks, which involve predicting the presence of objects in
a scene. With further development, researchers were able to reuse the inductive biases of CNNs for
other vision tasks, such as object detection, localization, and semantic segmentation. In semantic
segmentation, each pixel of the input image is given a classification [48]. This requires far more
information about locality than simple classification tasks, since the network needs to precisely
segment the object in the scene by classifying each pixel belonging to it. In classification-based
CNNs, the input spatial dimension is reduced throughout the network as features get aggregated.
However, for semantic segmentation CNNs, the input’s spatial dimensions must be preserved
to produce the desired pixel-wise output. This was initially achieved by using de-convolution
layers, where zeroes were introduced into the feature maps for upsampling [48]. More recently,
dilated convolutions offered an alternative method of capturing contextual information, without
diminishing the input resolution [29]. For dilated convolution, zeroes are inserted into the
Kx × Ky kernel, which increases the receptive field of the convolution without increasing the
number of parameters. With dilated convolution, the input spatial dimension can remain large,
and the CNN can capture contextual information by spreading out its receptive field. As an added
benefit, adding zeroes into the kernel allows for optimization possibilities on hardware, where
these computations can be skipped.
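The growth of the receptive field with dilation can be checked with the standard formula for the effective kernel extent, k_eff = k + (k − 1)(d − 1), for dilation rate d (a short illustrative calculation, not taken from this work):

```python
def effective_kernel_size(k, d):
    """Effective 1-D extent of a k-tap kernel with dilation rate d (zeros inserted between taps)."""
    return k + (k - 1) * (d - 1)

for d in (1, 2, 4):
    k_eff = effective_kernel_size(3, d)
    print(f"3x3 kernel, dilation {d}: covers {k_eff}x{k_eff} pixels with only 9 weights")
# dilation 1 -> 3x3, dilation 2 -> 5x5, dilation 4 -> 9x9
```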
2.1.4 Learning and Classifying
During training, each weight of the network is nudged in the direction that reduces the loss L, scaled by the learning rate Γ:
\[ w_i^{updated} = w_i - \Gamma\Big(\frac{\partial \mathcal{L}}{\partial w_i}\Big) \tag{2.5} \]
Since the weights of the network contribute to the loss at the output of the computation graph, the
chain rule can be applied to find the gradients $\frac{\partial \mathcal{L}}{\partial w_i}\ \forall i$, where $i$ refers to the index of a particular
weight in the neural network. Computing the gradients and nudging the weights to better values
is referred to as backpropagation, shown in equation 2.5. In practice, training on complex, large
datasets cannot be performed with the standard gradient descent approach. This would imply that
the gradients computed for backpropagation result from the entire training dataset, which can
surpass millions of samples for many common datasets [26]. For this reason, an approximation
of the gradient descent approach is typically applied, namely the SGD approach. Here, only a
sub-set of the training dataset is considered in each training step, based on which the weights
are updated. Once enough training steps cover the entire training dataset, a training epoch is
complete.
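The mini-batch SGD procedure described above can be illustrated on a deliberately tiny problem; the sketch below fits a single linear neuron with a mean-squared-error loss (a toy example with assumed hyperparameters, not the training setup used for the DNNs in this work).

```python
import numpy as np

# Toy example of mini-batch SGD (equation 2.5): fit y = w.x + b to synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                       # training dataset
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3             # ground-truth targets
w, b, lr, batch_size = np.zeros(3), 0.0, 0.05, 32    # lr plays the role of the learning rate Γ

for epoch in range(20):                              # one epoch = all mini-batches seen once
    for start in range(0, len(X), batch_size):
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        err = xb @ w + b - yb                        # forward pass and error on the mini-batch
        grad_w = 2 * xb.T @ err / len(xb)            # dL/dw for the mean-squared-error loss
        grad_b = 2 * err.mean()
        w, b = w - lr * grad_w, b - lr * grad_b      # equation 2.5: weight update per step
print(w.round(2), round(b, 2))                       # approaches [1.5, -2.0, 0.5] and 0.3
```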
After training, the DNN can be deployed to perform the intended task. When deployed
for an inference task, the network is typically only executed in a forward-pass, as a standard
computation graph. This process is less computationally intensive compared to training, as no
gradient computation or weight updates are necessary. Nevertheless, as the size and computational
diversity of modern DNNs grow, inference can still create challenges in real-time, embedded
deployment settings.
Figure 2.4: Reduction in error rate during the ILSVRC. Models needed more layers and parameters to
push the boundaries each year. In 2010 and 2011, classical computer-vision algorithms were
used.
2.2 Compression and Optimization of Deep Neural Networks
As figure 2.4 illustrates, DNNs have required more layers and parameters each year to push the accuracy boundaries of the ILSVRC. This growth extends beyond computer vision to
other domains, such as large-scale natural language processing, where transformers have already
surpassed 530 billion parameters in model size [49]. In addition to compute complexity and model
size, the variety of layer types and dimensions in modern DNNs introduces algorithmic diversity,
which requires flexible hardware components for optimal execution across the network. As a
consequence of these software requirements, hardware design becomes more difficult, given
the tight area, power, latency, throughput, and safety thresholds typically defined in modern
edge applications. In this section, methods for compressing and optimizing DNN algorithms for
efficient deployment are presented.
2.2.1 Data Quantization
DNNs are typically trained with single-precision floating-point (FP32) operands, which can represent the wide numerical distributions
of weights and activations accurately, as well as the fine gradients computed and applied during
backpropagation. The FP32 neural network can then be quantized to a simpler representation,
such as the fixed-point or integer representations, before deploying it on constrained, edge
hardware. The process of converting a fully-trained FP32 DNN to a quantized, lower precision
representation is referred to as post-training quantization [57, 58]. Equation 2.6 shows the
basic principle of linear quantization for an arbitrary FP32 operand xf into a more constrained
numerical representation xq .
Another form of non-linear quantization can be a simple look-up table of $2^b$ values that should be
represented [56]. The bits b are then used only as indices to read the true value of the operand
from the look-up table, which can be stored at a higher bitwidth than b. The true values stored in
the look-up table can be distributed across the number line arbitrarily and learned by the network
during training.
(a) Linear quantization. (b) Log quantization.
Figure 2.5: Linear quantization represents the numerical distribution with uniformly spaced quantization
levels. Log-based quantization represents the more frequently occurring values more finely,
while less frequent values are more sparsely represented, with higher rounding error.
Recent works have also investigated mixed-precision DNNs, where substructures, e.g. layers,
filters, and/or datatypes, can have different quantization levels [7, 15, 10, 63, 35]. The numerical
distribution in substructures can vary largely from one part of the DNN to another. This makes a
single quantization scheme sub-optimal for many parts of the network. For example, in CNNs, the
fully-connected layers at the end of the network can be represented with much less precision than
the feature-extracting convolutional layers [64]. Nevertheless, developing mixed-precision DNNs
can be challenging. First, finding the optimal mixed-precision configuration can be formulated as
a search problem. The search space for such problems is typically very large; for datatype-wise,
layer-wise mixed quantization of an $L$-layer CNN, $Q^{2L}$ solutions exist, where $Q$ is the set of
possible quantization levels, i.e. supported bitwidths [9]. Some works propose searching this
space using a reinforcement learning (RL) agent [7] or a genetic algorithm (GA) [10], while
other works try to find the optimal bitwidth for each layer at training time, without adding any
overhead of metaheuristic search agents [15]. Second, the hardware deployment platform must
support and gain a speed-up from the proposed Q levels, which typically requires more complex,
non-standard arithmetic units and dynamic memory alignment [65, 66, 67, 68, 69].
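The size of this layer-wise, datatype-wise search space grows very quickly, as a two-line check with assumed example values (four candidate bitwidths, a 50-layer CNN) shows:

```python
Q = {2, 4, 8, 16}               # candidate bitwidths for weights and activations
L = 50                          # number of layers
solutions = len(Q) ** (2 * L)   # one choice per layer and per datatype (weights, activations)
print(f"{solutions:.2e}")       # ~1.61e+60 possible mixed-precision configurations
```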
Figure 2.6: Different example pruning regularities showing structured to unstructured parameter removal.
Structured pruning can bring benefits to hardware acceleration without any specialized zero-
detectors or complex memory management.
2.2.2 Parameter Pruning
Many heuristics emerged to decide which DNN substructures can be pruned [70, 71, 72, 73,
31, 74, 75, 76, 77]. L1-norm pruning is a common technique where the norm guides the pruning
algorithm to remove substructures of low magnitude parameters, and thereby, low influence on
the function of the DNN [74]. Other works identified different heuristics, such as geometric
median [75] and lasso regression [76], similarly determining the saliency of neurons based on
the guiding metrics. The pruning problem can also be formulated as a search problem, where an
algorithm must search for the optimal set of substructures to be removed. Works involving RL
agents [31], GAs [12, 11], and other metaheuristics have shown the effectiveness of combining
guiding metrics with search algorithms.
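As an illustration of a magnitude-based heuristic, the sketch below ranks the filters of one convolutional layer by their L1-norm and drops the weakest ones (a simplified example; practical methods combine such scores with fine-tuning and per-layer sparsity budgets):

```python
import numpy as np

def l1_prune_filters(W, prune_ratio=0.5):
    """Structured pruning: drop the filters (output channels) with the smallest L1-norm.
    W: convolutional weights of shape (Co, Ci, Kx, Ky).
    Returns the pruned tensor and the indices of the surviving filters."""
    l1 = np.abs(W).sum(axis=(1, 2, 3))                          # saliency score per output channel
    keep = np.sort(np.argsort(l1)[int(prune_ratio * len(l1)):]) # keep the highest-scoring filters
    return W[keep], keep

W = np.random.randn(64, 32, 3, 3)
W_pruned, kept = l1_prune_filters(W, prune_ratio=0.75)
print(W.shape, "->", W_pruned.shape)                            # (64, 32, 3, 3) -> (16, 32, 3, 3)
```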
Other works perform pruning without the use of heuristics, but instead try to learn the saliency
of neurons [37, 13, 17]. For example, an autoencoder attached to the target layer can produce
a pruning mask during training to decide which neurons can have an effect on the DNN’s
output [13]. At the end of the training, the produced mask translates to the pruning configuration
which can be applied before deployment. Differently, the in-train pruning approach proposed
in [17] updates pruning masks through SGD during the training process. An in-train approach
has the advantage of allowing the network to learn the task, the optimal pruning masks, as well as
other targets such as robustness against adversarial examples, within the training process at no
extra GPU-hour costs.
Another important aspect to consider during parameter pruning is the regularity of the sub-
structures being removed. For a given sparsity ratio, this has a direct impact on the accuracy
degradation to be expected from the pruning procedure, as well as the hardware benefits that can
be exploited at deployment time [31]. Generally, fine-grain, irregular weight pruning results in
high compression and maintains high task-related accuracy [73]. However, the irregularity breaks
the structured parallelism in the DNN’s computational workloads. Identifying which weights to
skip and which ones to execute prohibits general-purpose computation platforms and standard
accelerators from achieving a speed-up with this pruning regularity and its irregular memory
access patterns [78, 79]. More coarsely, the pruning algorithm may remove larger structures
such as entire neurons of dense layers, kernels and channels of convolutional layers, or attention
heads in transformers. Pruning large structures generally translates to a change in the tensor’s
dimension. For example, a 4-D weight tensor of a convolutional layer in RKx ×Ky ×Ci ×Co would
maintain the same dimensions if individual weights were pruned. However, removing an input
channel would shrink the Ci dimension. The same applies for pruning an output channel and the
Co dimension. Most DNN accelerators are essentially tensor processing units, resulting in a direct
improvement in hardware performance when tensors are shrunk in this manner. The downside to
coarse, structured pruning is that the task-related accuracy can quickly degrade at high pruning
rates [31]. Structured and unstructured pruning examples are visualized in figure 2.6. It is worth
mentioning that extracting benefits from irregular parallelism is nevertheless an important field
of research in DNN hardware design [78, 79]. Recently, the Ampere general-purpose graphics
processing unit (GPU) architecture by NVIDIA provided the means for semi-structured pruning,
where specialized hardware units can detect up to 2 pruned values out of each 4 elements, offering
a trade-off between irregular and structured pruning [80].
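As a rough illustration of the semi-structured pattern, the sketch below zeroes the two smallest-magnitude weights in every group of four, which is one common software-side way to produce a 2:4-sparse tensor; the grouping along the last dimension is an assumption of this sketch.

import numpy as np

def enforce_2_of_4_sparsity(w: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude values in every group of four weights.

    A minimal sketch of the semi-structured pattern supported by Ampere-class
    GPUs; assumes the last dimension is divisible by 4.
    """
    flat = w.reshape(-1, 4).copy()
    # Indices of the two smallest |w| per group of four
    drop = np.argsort(np.abs(flat), axis=1)[:, :2]
    np.put_along_axis(flat, drop, 0.0, axis=1)
    return flat.reshape(w.shape)

w = np.random.randn(8, 16)
w_sparse = enforce_2_of_4_sparsity(w)
assert (np.count_nonzero(w_sparse.reshape(-1, 4), axis=1) <= 2).all()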
Figure 2.7: Example handcrafted DNN architectural blocks developed to improve training and reduce total computations and parameters.
Such blocks introduce inference challenges, as activations from past layers need to be stored in memory and reused at
deeper stages of the DNN before being discarded. Figure 2.7 shows some popular DNN blocks
that were designed through handcrafted NAS.
Depending on how granularly the search space is defined, NAS can inherently perform compression through quantization and pruning [84, 88]. For example, if two layers with equal dimensions
but different quantization degrees are considered unique solutions in the NAS space, then the
search is jointly finding the architecture and its quantization in the same process. Consequently,
the efficiency of the architectures being considered can be measured with respect to a target
hardware platform. This can be done using hardware-in-the-loop (HIL) setups, differentiable
and analytical hardware models or look-up table (LUT) approaches, where measurements are
collected on the hardware ahead of the search experiment.
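A LUT-based hardware model can be as simple as a dictionary of pre-measured layer latencies that the search sums per candidate, as in the hedged sketch below; the layer signature format and the latency values are purely illustrative.

# Hypothetical per-layer measurements collected on the target hardware
# ahead of the search (layer signature -> latency in ms).
latency_lut = {
    ("conv3x3", 64, 64, 8): 1.9,   # (op, C_in, C_out, bitwidth)
    ("conv3x3", 64, 64, 4): 1.1,
    ("conv1x1", 64, 128, 8): 0.7,
}

def estimate_latency(candidate):
    """Sum pre-measured layer latencies for a candidate architecture.

    `candidate` is a list of layer signatures; unknown signatures would need
    a fresh measurement or an analytical fallback model.
    """
    return sum(latency_lut[sig] for sig in candidate)

cand = [("conv3x3", 64, 64, 4), ("conv1x1", 64, 128, 8)]
print(estimate_latency(cand))  # hardware-aware cost fed back to the search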
A set of randomly sampled images from the dataset D is chosen, where the expected loss
E on the random samples is minimized through an adversarial training scheme. A commonly
used attack to introduce imperceptible adversarial perturbations is the fast gradient sign method
(FGSM) [91], which is one of the first white-box attacks to be developed. The advantage of
FGSM is that generating an adversarial example is faster than with other attack methods, such
as projected gradient descent (PGD) [92]. FGSM in combination with random initialization is
particularly effective to incorporate into the training loop to obtain adversarial training with a
small overhead of GPU-hours, as presented in fast adversarial training (FastAT) [93]. For the
final evaluation of adversarial robustness, the neural network is typically exposed to an unseen
adversarial attack method.
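For reference, a minimal PyTorch-style sketch of the FGSM step is given below; the epsilon value and the [0, 1] input range are assumptions, and FastAT would additionally start from a random perturbation inside the epsilon-ball before the sign step.

import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, eps=8 / 255):
    """One-step FGSM in the spirit of [91] (assumed eps and [0, 1] input range)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    with torch.no_grad():
        # Move each pixel by eps in the direction that increases the loss
        x_adv = x_adv + eps * x_adv.grad.sign()
        x_adv = torch.clamp(x_adv, 0.0, 1.0)
    return x_adv.detach()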
2.3 Hardware Acceleration of Deep Neural Networks
Figure 2.8: Abstract visualization of spatial and dataflow architectures. Spatial accelerators use an array of PEs to perform the parallel operations of a DNN. Dataflow architectures reflect the neural network architecture in hardware and process the layers as a classical dataflow graph.
Despite not being the most efficient platform, the execution of DNNs on CPUs is inevitable in some cases. CPUs can be a practical
option for smaller DNNs, latency-relaxed applications, or in constrained embedded systems where
a dedicated DNN accelerator is not feasible. ML software libraries have also been optimized to
exploit the capabilities of modern CPUs, such as hyper-threading and vectorized instructions [98].
For example, CPUs with support for advanced vector extensions (AVX) use the wider SIMD
registers to pack more operations when DNNs are quantized to lower bitwidths [99].
Dataflow architectures process the DNN as a classical dataflow graph, i.e. a pipeline of nodes communicating over first in, first out (FIFO) buffers, as
shown in figure 2.8. Such architectures are well-suited for reconfigurable fabric such as field
programmable gate arrays (FPGAs), where a new computation graph can be flashed onto the
fabric whenever the DNN needs to be changed. Another advantage here is that no communication
is necessary with off-chip memory during computation as the entire graph is on-chip and the
intermediate results are passed from one computation node to the next directly. The architecture
also unrolls the layers of the DNN, allowing multiple inputs to be processed in different parts
of the graph to improve throughput [103]. For example, while input i is being processed by the
graph node for layer l, the next input i+1 can already be processed by the preceding graph node
for layer l-1. With a well-dimensioned pipeline, this architecture can achieve a high-throughput,
low-latency execution. However, there are some disadvantages to an in-hardware graph-based
implementation. For large DNNs, it might not be feasible to fully unroll the graph and all its
synaptic weights onto the fabric of an FPGA [104]. Additionally, for DNNs with residual paths, a large amount of memory might be required to store the intermediate results of earlier nodes of the graph until they can be used by the deeper layers of the graph. This can ultimately stall the
pipeline and deplete the memory resources of the programmable logic.
More exotic accelerators have also appeared in research, particularly in the field of neuro-
morphic computing [105, 106]. However, in this work, the focus remains on more classical
graph-based and parallel computing architectures.
In the following subsections, a more detailed discussion on spatial accelerators is presented, as
well as an elaboration of the challenge of finding the optimal schedule for these accelerators with
respect to different DNN workloads.
Algorithm 2.1 Nested loop representation of the convolutional layer execution from equation 2.3
Input: Al−1[Ci][Xi][Yi]
Weights: Wl[Co][Ci][Kx][Ky], Stride: s
Output: Al[Co][Xo][Yo]                            ▷ Required tensors for the convolution operation
for co = 0; co < Co; co++ do                      ▷ Output channel iterator
    for ci = 0; ci < Ci; ci++ do                  ▷ Input channel iterator
        for xo = 0; xo < Xo; xo++ do              ▷ Output horizontal spatial iterator
            for yo = 0; yo < Yo; yo++ do          ▷ Output vertical spatial iterator
                for kx = 0; kx < Kx; kx++ do      ▷ Kernel horizontal iterator
                    for ky = 0; ky < Ky; ky++ do  ▷ Kernel vertical iterator
                        w = Wl[co][ci][kx][ky]                  ▷ Iterators as tensor indices
                        al−1 = Al−1[ci][xo·s + kx][yo·s + ky]
                        Al[co][xo][yo] += w · al−1              ▷ Core MAC, write to output tensor Al
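A direct Python equivalent of algorithm 2.1 (assuming unit batch size, no padding, and NumPy tensors) makes the loop nest and its index arithmetic explicit:

import numpy as np

def conv_layer(a_prev, w, s=1):
    """Direct nested-loop convolution mirroring algorithm 2.1 (no padding).

    a_prev: (C_i, X_i, Y_i) input activations, w: (C_o, C_i, K_x, K_y) weights.
    """
    c_o, c_i, k_x, k_y = w.shape
    _, x_i, y_i = a_prev.shape
    x_o, y_o = (x_i - k_x) // s + 1, (y_i - k_y) // s + 1
    a = np.zeros((c_o, x_o, y_o))
    for co in range(c_o):                      # output channels
        for ci in range(c_i):                  # input channels
            for xo in range(x_o):              # output rows
                for yo in range(y_o):          # output columns
                    for kx in range(k_x):      # kernel rows
                        for ky in range(k_y):  # kernel columns
                            a[co, xo, yo] += (w[co, ci, kx, ky]
                                              * a_prev[ci, xo * s + kx, yo * s + ky])
    return a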
Lower-level memories, such as PE registers, are usually a limited, precious resource due to manufacturing costs and
on-chip area constraints. Nevertheless, data reuse is still possible to some extent through clever
scheduling techniques [108, 109, 3]. The main computation is at the core of the inner-most loop,
where many elements are accessed in multiple iterations of the higher loops. Specifically, reuse
occurs when the indices of the parameters involved in the inner-most computation remain fixed
for some loops before iterating in others. In hardware, this translates to a single element being
stored at a lower-level memory for multiple iterations before being purged to make space for
new data. For optimal reuse to occur, no single element should be read more than once from a
higher-level memory.
Loop-tiling is an approach to efficiently exploit the entire memory hierarchy by dividing the
nested-loop into shallower loops, which can fit on multiple memory levels. The loop-tiling
strategy effectively decides which tiles of the CNN computation will take place in one round
of communication with a lower-level memory (on-chip buffer). An example of the Co loop in
algorithm 2.1 being tiled is given in algorithm 2.2. Tiles of size T Co are sent by the outer loop
(off-chip memory) to the inner loop (on-chip memory). Another visual example of loop-tiling is
presented in figure 2.9.
The order of the loops can also be manipulated without affecting the algorithm through
loop-reordering. For example, in algorithm 2.1, the execution iterates over the yo index before
incrementing the xo index, allowing a set of elements with index xo to reside longer on the
lower-level memory while iterating over all possible elements yo ∈ Yo . Swapping these two loops
would result in yo elements residing longer on the lower-level memories. This essentially helps
in extracting improved reuse opportunities, since the upper-level loops remain on the lower-level
memories of the hardware architecture, thus closer to the compute units.
Finally, loop-unrolling is the third loop optimization technique, which can be applied once a
memory level is distributed spatially. The degree of unrolling is limited by the parallelism offered
by the hardware architecture and the on-chip interconnect. For example, in algorithm 2.1, the
kernel’s elements Ky can be assigned to py spatially distributed PEs, effectively executing several
py loop iterations in parallel as shown in algorithm 2.2. Another visual example of loop-unrolling
Figure 2.9: Loop-tiling and loop-unrolling example. A computation tile is sent from off-chip memory (e.g. DRAM) to on-chip memory; filters 2 and 3 are shared across the first and second PE rows, which compute three output pixels of output channels 2 and 3, respectively, in parallel (six unrolled output pixels in total).
is presented in figure 2.9, where six output pixels from two different channels are computed in
parallel by six PEs, after the required filters and input pixels are unrolled over them.
Algorithm 2.2 Output channel tiling and weight kernel unrolling example based on algorithm 2.1
Input: Al−1[Ci][Xi][Yi]
Weights: Wl[Co][Ci][Kx][Ky], Stride: s
Output: Al[Co][Xo][Yo]                            ▷ Required tensors for the convolution operation
for co = 0; co < Co; co += TCo do                 ▷ Off-chip tile iterator with tile size TCo
    ...                                           ▷ Other nested loops
    for tco = 0; tco < TCo; tco++ do              ▷ On-chip output channel tile iterator
        for ky = 0; ky < Ky; ky += py do          ▷ Unrolling py operations in parallel
            w = Wl[tco][ci][kx][ky : ky + py]     ▷ w and a vectors for py parallel operations
            al−1 = Al−1[ci][xo·s + kx][yo·s + ky : yo·s + ky + py]
            Al[tco][xo][yo] += w · al−1           ▷ Core MAC, write to output tensor Al
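A Python sketch of the same idea is shown below, tiling the output channels and replacing the innermost K_y loop with a vector operation; in hardware the vectorized loop would map to p_y parallel PEs, and the tile size used here is an arbitrary assumption.

import numpy as np

def conv_tiled_unrolled(a_prev, w, t_co=16, s=1):
    """Output-channel tiling (t_co) with the K_y loop vectorized, loosely
    mirroring algorithm 2.2; assumes no padding."""
    c_o, c_i, k_x, k_y = w.shape
    _, x_i, y_i = a_prev.shape
    x_o, y_o = (x_i - k_x) // s + 1, (y_i - k_y) // s + 1
    a = np.zeros((c_o, x_o, y_o))
    for co in range(0, c_o, t_co):                   # off-chip tile iterator
        for tco in range(co, min(co + t_co, c_o)):   # on-chip tile members
            for ci in range(c_i):
                for xo in range(x_o):
                    for yo in range(y_o):
                        for kx in range(k_x):
                            # "unrolled" K_y loop: one vector MAC over K_y elements
                            a_vec = a_prev[ci, xo * s + kx, yo * s: yo * s + k_y]
                            a[tco, xo, yo] += np.dot(w[tco, ci, kx, :], a_vec)
    return a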
The schedule and mapping search space naturally depends on the workload being scheduled as
well as the hardware dimensions (memory and PE array size) being considered. Constraints are
introduced by disallowing certain communication patterns among PEs, or limiting the memory
hierarchy size, which shrinks the search space of valid schedules and simplifies the hardware
design.
In terms of loop reordering, three ordering schemes were identified in [110], each maximizing the reuse of one datatype before it is accessed again from the off-chip memory. These can be categorized into output reuse oriented (ORO), input reuse oriented (IRO), and weight reuse oriented (WRO). For tiling, every combination of weight, input activation, and output tile sizes that fits in the available on-chip memory results in a legal tiling scheme. Finally, for
unrolling, the computations are mapped onto parallel PEs to exploit the parallelism in DNN
computations. When unrolling computations onto the PE array, some on-chip data movement
considerations can help squeeze more efficiency out of the compute array by sharing the data
which is already on the array [109, 3]. For example, PEs operating on the same input feature
map pixels, but different weights may share the input pixels among themselves over the on-chip
interconnect, rather than individually accessing the more expensive and larger on-chip SRAM
memory.
The common taxonomy for on-chip computation mapping and data movements defines which
datatype remains stationary in the PEs, and which datatypes traverse the array or are called from
the on-chip SRAM buffer. For example, the weight-stationary (WS) dataflow allows each PE to
have a unique set of weights in its registers, while input feature map pixels are provided over the
NoC and flow through the PEs which require them. In an output-stationary (OS) dataflow, the
weights and input feature maps can flow through the array while each PE maintains the PSUM of
the output pixels it is responsible for, until they are fully accumulated. For input-stationary (IS)
dataflows, the input activations remain on the PE while weights flow through the array and output
pixels are computed in a distributed manner. For each of the WS, OS, and IS dataflows, there
can be a variety of implementations possible, defining which parts of the data are stationary. For
example, a variant of OS may allow each PE to work on a subset of output pixels of the same
output feature map. Alternatively, each PE may handle an entire output channel on its own [56].
In both cases, the dataflow falls under the OS classification, but is implemented differently.
More complex dataflows also exist in literature, most prominent of which is the row-stationary
(RS) dataflow, proposed for the Eyeriss accelerator [3]. Each PE is responsible for a 1-D
convolution of one row of input activations against one row of the weight kernel. Vertically,
the PEs can share their PSUMs to accumulate the results of their 1-D convolutions, essentially
achieving the accumulation across the 2-D kernel’s window. Weights, output activations and input
activations are shared across the on-chip interconnect horizontally, vertically, and diagonally,
respectively. Within each PE, the registers may contain the weights and activations of different
channels, allowing for more efficient use of the lowest level memory, and performing more
accumulations within the same PE with less data movement.
3 HW-SW Co-Design of Deep Neural Networks
CO-DESIGNING hardware and software might seem intuitive, yet the challenge of truly
and certainty are also critical to such algorithms in safety critical settings, contributing to more
complex algorithms which have redundancy through batch-processing from multiple sensors
and/or having an ensemble of classifiers with a voting mechanism to give the final prediction.
Naturally, these also increase the computational overhead required for the application. With equal
importance, the robustness of the algorithms against adversarial attacks is yet another concern
for the ML-engineer, which motivates the introduction of preprocessing stages to the algorithm,
adversarial training, redundancy, and multiple input processing, similar to the measures taken for
predictability. Counterintuitively, adversarial training and input filtering for improved robustness against adversarial attacks harm the natural task-related accuracy of the DNN.
In contrast, a HW-engineer seeks other targets. The fabrication cost of an application specific
integrated circuit (ASIC) is correlated with the complexity and area utilization of the hardware
design. Typically, the larger share of an integrated chip’s area is consumed by memory cells,
where 6 transistors are required per 1-bit of SRAM. This incentivizes designs with smaller on-chip
buffers. Smaller and simpler compute logic with fewer registers also helps in lowering the area
cost. Hardware design on FPGA has similar challenges, as the utilization of the design may
not exceed the LUT, digital signal processing (DSP) blocks, and block random-access memory
(BRAM) resources available on the programmable logic. Along with shrinking the on-chip
memory to save area, a contradicting objective is to reduce power consumption. Smaller on-chip
buffers would require more frequent communication with off-chip DRAM, which is costly in
terms of energy consumption and potentially latency, when off-chip communication results in
stalls. Lastly, the HW-engineer also has their concerns with respect to safety and security. For
example, the HW-engineer may employ dynamic voltage and frequency scaling (DVFS) to save
power, but low-voltage operation could result in bit-flips in logic and memory. This could lead
to critical errors at the task level. Bit-flips may also be caused by ionization or aging, neither of which the designer can rule out. Therefore, precautions must be taken through
redundancy approaches, which again stress the challenge of minimizing hardware area and power
consumption.
From this brief look at the targets of HW and ML engineers, it is clear that each domain
is complicated in its own right, with design decisions affecting contradicting targets within
the domain itself. A deeper look further reveals cross-domain impacts of the design decisions.
Employing an ensemble of classifiers for algorithmic redundancy could lead to increased latency,
area, and power consumption on hardware. In cases where DVFS, aging, or ionization cause
bit-flips, the accuracy and predictability of the algorithm cannot be guaranteed. Finally, large, complex DNNs executing on hardware with small on-chip buffers increase the amount of computation tiling and off-chip communication required, thereby increasing the energy cost per
classification. There is a clear motivation for the two domains to consider their own challenges
alongside the targets of the other domain, in order to reach solutions that ease the design effort in
both hardware and software.
Figure 3.1: Hardware methods (architecture parameterization, compute logic, memory banking/hierarchy) and software methods (efficient layers, e.g. FFT-, Winograd-, and GEMM-based implementations, quantization, and efficient algorithms).
Loop-tiling, reordering, and unrolling are some techniques that achieve improved deployments of DNNs.
However, optimization techniques may be performed without fully extracting the expected
benefits.
ML and HW engineers can potentially compromise their own targets to help each other. Yet without properly integrating their optimization techniques, the result would be compromises in both domains with little to no benefit. To elaborate this point, three basic examples can be considered.
Pruning and Execution Schedules. Removing parameters from a DNN can be performed at
different regularities and with different sparsity targets for each layer. For a CNN, convolutional
layers may be pruned in element-wise, channel-wise, or filter-wise regularities. Pruning at a fine
regularity which is not supported by the memory access patterns on hardware would result in
task-related accuracy degradation without improving the hardware execution metrics. Deciding
which parameters to prune can be done based on a heuristic, while a search algorithm can decide
how much to prune each layer, i.e. find the sparsity ratios of each layer. In the hardware domain,
the dataflow supported by a generic spatial accelerator may be optimized for weight-dominated
or activation-dominated layers, i.e. output-stationary or weight-stationary. A typical image
classification CNN has more activations in the initial layers and fewer in the deeper layers. The
opposite is true for weights, where initial layers have fewer filters, and deeper layers tend to have
more filters and input channels. Overall, the decision on whether to prune initial layers or deeper
layers should not only be done to maintain task-related prediction accuracy, but also to benefit
the dataflow which the hardware supports. If the dataflow efficiently unrolls the computations
of the initial layers, and cannot effectively map the computations of the deeper layers, then
the pruning algorithm must take such subtleties into consideration and prune the deeper layers.
Pruning layers which are already efficiently executed on the hardware might lead to task-related
accuracy degradation without any improvements in execution latency. Pruning a layer can also
change the scheduling and mapping search space. Compute workloads which previously did not
fit on the on-chip memory may become possible after pruning, allowing further mapping options
on hardware. Therefore, the execution reward returned from the scheduler and mapper of the
hardware is also important for the pruning algorithm to decide how much to prune a particular
layer. For example, pruning 10 channels or 15 channels may result in the same schedule on a
particular hardware design, which does not change the latency of the execution. However, pruning
one more channel, e.g. 16 channels, may result in a new tiling scheme becoming possible, which
drastically reduces the latency of the execution.
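The feedback loop can be sketched as follows, where a hypothetical latency_model stands in for the scheduler/mapper discussed above and acc_model for an accuracy proxy; the scalarized reward is only one possible way to combine the two signals.

def choose_channels_to_keep(layer, candidates, latency_model, acc_model):
    """Pick a channel count for `layer` using hardware feedback, not accuracy alone."""
    best, best_score = None, float("-inf")
    for c in candidates:                        # e.g. pruning down to 48, 50, 54, or 64 channels
        lat = latency_model(layer, c)           # a smaller tile may suddenly fit on-chip
        acc = acc_model(layer, c)               # proxy for task-related accuracy
        score = acc - 0.1 * lat                 # simple scalarized trade-off
        if score > best_score:
            best, best_score = c, score
    return best

# Illustrative stand-ins: latency drops sharply once 48 or fewer channels fit one tile
toy_latency = lambda layer, c: 1.0 if c <= 48 else 2.5
toy_accuracy = lambda layer, c: 0.90 + 0.001 * c
print(choose_channels_to_keep("conv3", [48, 50, 54, 64], toy_latency, toy_accuracy))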
Quantization, Memory and Compute Logic. Similar to pruning, quantization can be decided
independently for each layer to provide more numerical precision for some critical layers of the
DNN, and low-precision, fast execution for other layers. The numerical precision used to represent
weight and activation data can also be different within the same layer. A search algorithm can be
applied to this problem as well, tasked with finding the optimal bit allocation for each datatype in
each layer. Assuming the search algorithm is given the freedom to choose between 1 to 16 bits
for each datatype in each layer, the hardware must equivalently be able to extract the benefits
at each of those quantization levels. This might involve designing bit-serial PEs, and flexible
data-packing in the memory’s word-length. The interconnect must also flexibly transport the
necessary data at all supported bitwidths, without under-reads or over-reads. If data-packing
and memory alignment allocates 32 bits for 8-bit parameters, the benefits of quantization for
memory movement are not achieved. Similarly, if the arithmetic unit performs the same operation
with the same latency and throughput for all bitwidths, no computation speed-up is achieved.
Finally, similar to pruning, quantization might unlock new legal schedules which fit in the on-chip memory, thereby directly influencing the latency and power due to data movement, as well as the expected benefits at the compute level. For a vectorized PE, which might support only a subset
of quantization levels, the search algorithm must be constrained to make decisions which are
supported by the hardware. The effects of each decision on the dataflow, mapping, and schedule
must be fed back as signals to the search algorithm, guiding it to choose strategies which truly
benefit the target hardware’s capabilities as well as maintain a high task-related accuracy.
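The memory side of this argument can be illustrated with a small sketch that rejects bitwidths the PE does not support and computes the footprint of densely packed values; the supported set and the 32-bit word length are assumptions of the sketch.

SUPPORTED_BITWIDTHS = {2, 4, 8, 16}   # assumed capabilities of the vectorized PE
WORD_BITS = 32                        # assumed memory word length

def packed_layer_bytes(num_params, bits):
    """Memory footprint if `bits`-wide values are densely packed into words."""
    if bits not in SUPPORTED_BITWIDTHS:
        raise ValueError(f"{bits}-bit is not supported by the target PE")
    values_per_word = WORD_BITS // bits
    words = -(-num_params // values_per_word)   # ceiling division
    return words * WORD_BITS // 8

# Without packing, an 8-bit weight stored in a 32-bit slot wastes 75% of the
# memory bandwidth; with packing the same layer moves 4x fewer words.
print(packed_layer_bytes(num_params=1_000_000, bits=8))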
Adversarial Robustness and Bit-Flip Resilience. Considering a safety-critical deployment
of DNNs, different threat models must be anticipated by ML and HW engineers. Adversarial
attacks can be seen as algorithmic threats which exploit internal paths of a DNN to produce
high-confidence, incorrect predictions. To improve the robustness of a DNN against adversarial
input perturbations, the ML engineer can introduce adversarial examples during the training,
allowing the DNN to learn such input-based threats. Differently, the HW-engineer must consider
the consequences of hardware-based errors, which may occur due to low-voltage operation, aging,
or exposure to radiation. Protection against such hardware errors or bit-flips can be achieved by
introducing redundant hardware modules and/or redundant computations, both of which are very
costly in terms of latency, power, and potentially area. In essence, both sides are attempting to
achieve the same goal, which is the correct operation of the DNN in the presence of errors or
perturbations. However, by improving the robustness of the DNN against adversarial examples,
its behavior under hardware-based bit-flip errors is affected. The HW-engineer’s target is not
necessarily to eliminate all errors, but to provide sufficient redundancy such that a reasonable
amount of computation errors can be overcome by the DNN’s internal algorithmic redundancy.
If the ML-engineer considers the effect of bit-flips during the adversarial training scheme, a
solution which reduces the effort in hardware redundancy might be achieved. In contrast, a
DNN which is adversarially trained without any consideration to hardware errors might be too
sensitive to internal computation bit-flips, making the targets of the HW-engineer significantly
more challenging to achieve.
The three examples represent cases where working on two domains separately can lead to
sub-optimal compromises, but working jointly on solutions makes the overall deployment meet
its targets while reducing the individual efforts of HW and ML engineers. Figure 3.1 shows more
dependencies that exist between hardware and software DNN optimization techniques.
In this dissertation, three core concepts inspired by the field of very-large-scale integration (VLSI) design are presented and adapted to different HW-DNN co-design problems.
The three concepts are essential to the works presented in the next chapters. A summary of all
works under the scope of this thesis and their use of the three VLSI-inspired concepts is shown in
figure 3.2.
Figure 3.2: Works published under the scope of this thesis categorized with respect to concepts used from the VLSI design domain.
are available. In chapter 6, HW-FlowQ [10] and AnaCoNGA [9] are discussed as examples of
fully-automated design frameworks in the scope of this work.
For a hardware model guiding an exploration, correctly ranking candidates and identifying the best solution is more important than measuring the performance of a single solution. Particularly for automated design agents and metaheuristic search techniques, a model with low fidelity might heavily misguide the direction of search space traversal, ending up in solutions that are optimal with respect to the model, but sub-optimal implementations on real hardware. The speed of the model
execution is also important. Design space exploration loops for DNNs already suffer from long
GPU-hours for training and accuracy evaluation of neural network configurations. Adding further
delays to consider the hardware design exacerbates this issue and would severely extend the
search time. However, a fast executable model might be used to eliminate hardware-inefficient
DNN configurations early, before GPU training and evaluation, thereby shortening the overall
search time and injecting hardware-awareness to the design exploration loop [9].
Figure 3.3: The Gajski-Kuhn diagram (left) and the possible transitions to traverse the views and abstraction levels (right).
Constraints such as the on-chip buffer size may be introduced after refinement, helping the designer understand the loop-tiling
characteristics of the DNN workloads with respect to said constraints. Finally, a more refined
abstraction level can further consider the PEs’ arithmetic capabilities, register sizes, and communi-
cation patterns, and implement the entire execution schedule, returning more accurate estimates of
latency and energy consumption. The next chapters showcase six examples where these concepts
are used in handcrafted, semi-automated, and fully-automated design methodologies.
4 Handcrafted Co-Design
DEEP knowledge, expertise, and creativity can lead to handcrafted solutions for complex co-design problems. This is particularly useful in cases where
a design challenge is conceptual and hard to concretely define mathematically or
formulate into a search space for algorithms or metaheuristics to solve. The designers’ conceptual
understanding of the challenge is itself the problem formulation, which can be solved by applying
their knowledge, expertise, and creativity. In this chapter, two examples of handcrafted co-
design are presented. In OrthrusPE [8], a complex form of BNNs is considered, which replaces
expensive multiplication and addition operations with hardware-friendly XNOR and popcount
operations. However, to maintain high task-accuracy, the BNNs still require some fixed-point
arithmetic operations. This motivates the conception of a PE which supports both fixed-point
and binary operations, with minimal hardware overhead. In Mind the Scaling Factors [16],
hardware and software threat models are investigated in the form of on-chip bit-flips and input-
based adversarial attacks. By understanding the theoretical worst-case effect of a bit-flip in the
numerical representation on hardware, the neural network training hyperparameters are tuned
to improve adversarial robustness and bit-flip error resilience. In both works, understanding
the conceptual design challenge or the theoretical aspects of the execution led to the human-
engineered, handcrafted co-design solutions.
4.1 OrthrusPE: Runtime Reconfigurable Processing Elements for Binary Neural Networks
• Developing a flexible computation unit to accelerate a wide range of BNNs (e.g. table 4.1)
and executing SIMD-based binary Hadamard product operations on FPGA hard blocks.
² All 7-series, UltraScale, and UltraScale+ FPGAs, Zynq SoCs (DSP48E1 and DSP48E2), and the recent Versal platform (DSP58). The entry-level Spartan-6 presents the only exception (DSP48A1).
• Reusing FPGA hard blocks to create novel, runtime reconfigurable processing elements,
which dynamically support binary and fixed-point computations.
• Formalizing the relationship between computation mode switching and partial result
memory for BNN layers with multiple binary bases.
Figure 4.1: Binary bases can differentiate the values of the full-precision kernel more accurately by preserving more information through linear transformations.
inference. Throughout this section, this affine transformation is considered and the binary
numerical space is denoted with B = {0,1}.
Accurate BNNs use multiple weight and activation bases to approximate a full-precision layer,
which reduces the gap in prediction accuracy between the two implementations. Different to
simpler BNNs where values are binarized using the sign() function (recall equation 2.7), Lin
et al. [62] presented a solution where each base is produced by scaling and shifting the original
values to different degrees before binarization. This produces multiple unique binary bases
which preserve more information collectively due to the additional operations performed before
obtaining them. Figure 4.1 shows a simple example of a 3×3 kernel binarized into 3 bases.
Within one base, it is possible to differentiate whether a value is positive after a shift or not.
Shifting 0.1 and 10 by -5, leads to 0 and 1 binarization respectively, indicating 0.1 is smaller than
5 and 10 is larger. Combining two bases captures information on how certain values fall between other values: for example, 3 and 4 must lie between 0.1 and 10, since 0.1 and 10 remained 0 and 1, respectively, after shifting by -5 and -3, while 3 and 4 were binarized to 0 after shifting by -5 and turned to 1 in the binary base where they were shifted by -3. Finally, combining all three bases, we can infer that the difference between kernel elements 4 and 3 lies between 1 and 2, as both were binarized to 0 when the shift was -5 and both to 1 when the shift was -3, but they had different binarizations when the shift was -4.
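A simplified sketch of this multi-base binarization is given below, reproducing the shifts of the example in figure 4.1; the per-base scaling factors α_m used in [62] are omitted for brevity, and sign(0) is treated as +1.

import numpy as np

def binarize_bases(w, shifts):
    """Produce one binary base per shift: base_m = (w + shift_m >= 0).

    A simplified sketch in the spirit of [62]; the per-base scales alpha_m
    are omitted here.
    """
    return [(w + s >= 0).astype(np.uint8) for s in shifts]

kernel = np.array([[0.1, 10.0, 3.0],
                   [4.0, -2.0, 6.0],
                   [7.0, 0.5, -1.0]])
bases = binarize_bases(kernel, shifts=[-5.0, -4.0, -3.0])
# Element 10.0 maps to 1 in every base and 0.1 to 0; elements 3.0 and 4.0
# differ only in the base shifted by -4, mirroring the walkthrough above.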
To discuss the complexity of a convolution operation with binary bases, we recall the notation presented in section 2.1. $\mathbf{A}_{l-1} \in \mathbb{R}^{X_i \times Y_i \times C_i}$ is an activation tensor of a convolutional layer $l \in [1, L]$ in an $L$-layer CNN, where $X_i$ and $Y_i$ indicate the input spatial dimensions, and $C_i$ is the number of input channels. The weights $\mathbf{W}_l \in \mathbb{R}^{K_x \times K_y \times C_i \times C_o}$ are the trainable weights of the layer. The sign, scale, and shift functions are used to find an appropriate binarization for $\mathbf{A}_{l-1}$ and approximate it into $\mathbf{H}_{l-1} \in \mathbb{B}^{X_i \times Y_i \times C_i \times N}$, having $N$ binary bases. Similarly, $\mathbf{W}_l$ is approximated as $\mathbf{B}_l \in \mathbb{B}^{K \times K \times C_i \times C_o \times M}$, where $M$ is the number of weight bases. Equation 4.1 represents the multi-base binary convolution.
Table 4.1: Requirements of most common binary neural networks and the respective hardware operations for execution.

Method             | Binary Weights\Activations | Weight Scale αm | Activation Scale βn | Batch-Norm           | Multiple Weight Bases M | Multiple Activation Bases N
BNN [60]           | ✓                          | ✗               | ✗                   | ✓                    | ✗                       | ✗
XNOR-Net [34]      | ✓                          | ✓               | ✓                   | ✓                    | ✗                       | ✗
CompactBNN [121]   | ✓                          | ✗               | ✓                   | ✓                    | ✗                       | ✓
ABC-Net [62]       | ✓                          | ✓               | ✓                   | ✓                    | ✓                       | ✓
Hardware Operation | XNOR-Popcount              | Multiplication  | Multiplication      | Multiplication-Shift | MAC                     | MAC
Consider a binary weight tensor slice $\mathbf{b}_l \subset \mathbf{B}_l^m$, where $\mathbf{b}_l \in \mathbb{B}^{K_x \times K_y}$. The activation slice $\mathbf{h}_{l-1} \subset \mathbf{H}_{l-1}^n$, where $\mathbf{h}_{l-1} \in \mathbb{B}^{X_i \times Y_i}$, is defined accordingly. Equation 4.3 shows the XNOR operation performed on $\mathbf{b}_l$ and $\mathbf{h}_{l-1}$, which is the core binary operation. The XNOR operations are grouped as binary Hadamard products of the sliding kernel windows. Next, the partial sum $p_{m,n,c_i}$ is the accumulation of the intermediate XNOR results over the kernel dimensions $K_x \times K_y$.
Finally, to compute a single output pixel $a_{m,n}$ relative to bases $m$ and $n$, the popcount values $p_{m,n,c_i}$ need to be accumulated across the input channels $C_i$, as shown in equation 4.4.

$$a_{m,n} = \sum_{c_i=1}^{C_i} p_{m,n,c_i} \qquad (4.4)$$
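The following sketch computes one output pixel via XNOR and popcount for a single pair of bases, following equations 4.3 and 4.4; stride 1, no padding, and {0, 1}-valued NumPy arrays are assumptions of the sketch.

import numpy as np

def binary_conv_pixel(h, b, xo, yo):
    """Compute one output pixel a_{m,n} via XNOR and popcount (eqs. 4.3-4.4).

    h: (C_i, X_i, Y_i) binary activations of base n, b: (C_i, K_x, K_y) binary
    weights of base m, values in {0, 1}; stride 1, no padding assumed.
    """
    c_i, k_x, k_y = b.shape
    acc = 0
    for ci in range(c_i):
        window = h[ci, xo:xo + k_x, yo:yo + k_y]
        xnor = 1 - np.bitwise_xor(window, b[ci])  # XNOR on {0, 1} values
        acc += int(xnor.sum())                    # popcount over K_x x K_y
    return acc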
4.1.4 OrthrusPE
OrthrusPE is a processing element which operates in two modes, a binary and a fixed-precision
mode. In the binary mode, OrthrusPE executes SIMD binary Hadamard products, whose results are passed on to popcount logic.
Figure 4.2: Preconditioning signals A, B and C to compute five 3×3 Hadamard products. Pixels represented with an X are not relevant for this cycle of operation.
Figure 4.3: The DSP48E1 Slice [1]. Appended bold paths illustrate the relevant signals for our operating modes.
For 3×3 kernels, each Hadamard product requires 9 of the 48 bits in signal C and signal A:B, leaving 3 unused bits after the calculation of the 5 results. However, this still presents a high utilization of 94%, since only $\lfloor 48 / (K_x \times K_y) \rfloor$ whole Hadamard products fit in a single DSP cycle's output, where $K_x \times K_y$ is the number of bits per Hadamard product. The solution to
always retain a utilization of 100% is to allow partial operations to take place in each cycle, while
small additional logic rearranges the successive results before being processed by the popcount
logic. In this manner, any arbitrary window size can also be implemented with 100% utilization
of the DSP’s processing bitwidth. Intuitively, for FC layers, the utilization is always 100%.
Figure 4.4 shows the DSP utilization for two possible configurations of OrthrusPE.
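The utilization figures quoted above can be reproduced with a few lines; the 48-bit datapath width follows the DSP48 description, and the 100% value under partial operations assumes the small rearrangement logic described above.

def simd_utilization(kx, ky, partial_ops=False, width=48):
    """SIMD utilization of the 48-bit DSP datapath for K_x x K_y windows.

    Without partial operations, only floor(width / (kx*ky)) whole windows fit
    per cycle; with partial operations, the leftover bits start the next
    window, keeping utilization at 100%.
    """
    if partial_ops:
        return 1.0
    window = kx * ky
    return (width // window) * window / width

print(simd_utilization(3, 3))   # 45/48 = 0.9375, the ~94% quoted above
print(simd_utilization(5, 5))   # 25/48 ~= 0.52 without partial operations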
Preconditioning the signals, concatenating them, and performing the wide XNOR operation
does not infer a DSP slice in the synthesis tool, but rather generates a regular LUT solution.
Therefore, the DSP slice was instantiated manually and the signals were explicitly passed to the
module.
Figure 4.4: SIMD register utilization of the DSP48 in OrthrusPE, with and without partial operations.
Some layers, such as the input layer and batch normalization layers, require fixed-point arithmetic. The
additional scaling operations in accurate BNNs also introduce fixed-point multiplications, which
are most efficiently executed on DSPs. This is the motivation for reconfiguring the DSP back to
its regular operation mode, by resetting the ALUMODE and OPMODE signals at runtime. With
this solution, we exploit the same hardware resource for two distinct modes of operation.
The work in Double MAC [118] can be appended to OrthrusPE, giving it a further mode to
operate in. This is particularly useful for accumulating popcounts, which have a smaller bitwidth
compared to the fixed-point values used for the non-binarized inputs of the network and the batch
normalization layers.
Figure 4.5 shows a schematic of OrthrusPE’s internal components. The Bin mode register holds
a flag indicating the mode of operation. The value of Bin mode influences the DSP Reconfig
signals ALUMODE and OPMODE, which are fed into the DSP to reprogram it as shown
in figure 4.3. Bin mode also functions as a selector for 3 multiplexers (A MUX, B MUX
and C MUX), allowing the input feature map pixels, weight pixels and partial sums to be
passed directly to the DSP (fixed-point mode) or taken after preconditioning them for the binary
Hadamard product operation as shown in figure 4.2 (binary mode). The result produced from the
DSP is passed to another multiplexer, where it can be written directly to the partial sum register
or postprocessed with a popcount operation, then written to the register.
While in binary Hadamard SIMD mode, OrthrusPE can generate five 3× 3 Hadamard products
per cycle. These products are passed to popcount logic, generating five integer values. Referring
back to section 4.1.3, each of the M binary weight bases needs to be convolved with each of the
N binary activation bases. A single input activation channel and a single weight filter channel
produce M × N partial sum maps. Those M × N maps need to be scaled by αm and βn , then
Figure 4.5: Block diagram showing the main components of the OrthrusPE.
collapsed into a single partial sum map. The M × N maps represent parasitic partial sums, which require a considerable amount of memory if they are left unaccumulated for an extended time during execution. To minimize this, OrthrusPE can perform P × M × N Hadamard products, where
P is a set of pm,n,ci pixels (recall equation 4.4), then switch to its MAC mode for scaling and
accumulation, before moving on to another spatial region of the map. Decreasing P reduces the
required memory for parasitic partial sums as shown in equation 4.5. In the last term, the kernel
dimensions dictate the popcount’s bitwidth along with an added sign-bit.
The trade-off is that accumulating the P × M × N popcounts requires switching the mode
of OrthrusPE more often. The number of mode switches per input channel map is expressed in
equation 4.6. Xo and Yo are the dimensions of a single output channel. Since it is possible to
switch the ALUMODE after 1 cycle of operation for non-pipelined DSPs, this trade-off does not
represent a large overhead and can be exploited to reduce partial result memory requirements in
an accelerator.
$$\text{SwitchCount} = 2 \times \frac{X_o \cdot Y_o}{P} - 1 \qquad (4.6)$$
P can be chosen with some analysis using equation 4.5 and equation 4.6. Figure 4.6 shows
the effect of P on partial result memory and switch count for a single input channel of binary
ResNet18 [28] in each of its convolutional layers, with M = N = 3. The layers are grouped
based on their output spatial dimensions.
Layer 1 is not considered as it is not binarized. The actual scratchpad size depends on other
factors such as dataflow, loop unrolling and loop interleaving. The analysis shown is only with
respect to the minimum requirement necessary for a basic dataflow which maintains the partial
results within a PE until a single input channel is completely processed against a single filter
kernel. Layers 2 − 5 should dictate the memory requirements as they generate the highest volume
of parasitic partial results. This is due to their output dimensions requiring 56 × 56 pixels per
channel.
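The trade-off can be explored numerically with the sketch below; the switch count follows equation 4.6, while the partial-result memory term is an assumed reading of equation 4.5 (not reproduced here), with one popcount needing ⌈log2(Kx·Ky + 1)⌉ bits plus a sign bit.

import math

def switch_count(x_o, y_o, p):
    """Mode switches per input-channel map, per equation 4.6."""
    return 2 * (x_o * y_o) / p - 1

def partial_sum_bits(p, m, n, kx, ky):
    """Rough partial-result memory in bits for P x M x N popcounts.

    Assumed reading of equation 4.5: each popcount occupies
    ceil(log2(Kx*Ky + 1)) bits plus a sign bit.
    """
    return p * m * n * (math.ceil(math.log2(kx * ky + 1)) + 1)

# A 56x56 output map with M = N = 3 and 3x3 kernels, for a few choices of P
for p in (196, 784, 3136):
    kb = partial_sum_bits(p, 3, 3, 3, 3) / 8 / 1024
    print(p, switch_count(56, 56, p), round(kb, 2), "KB")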
Figure 4.6: Switch count and partial result memory analysis for a single input channel from different convolutional layers of binary ResNet18, with M = 3, N = 3. Each point represents a different configuration of P. Layer groups by output resolution: L2−5: 56×56, L6−9: 28×28, L10−13: 14×14, L14−17: 7×7.
4.1.5 Evaluation
4.1.5.1 Experimental Setup
The proposed implementations in section 4.1.4 (OrthrusPE and OrthrusPE-DS) are compared to
two implementations with equivalent functionality. Typically, BNN processing elements employ
two or more distinct types of resources for the operations described in table 4.1. On FPGAs,
the straightforward approach is to map all binary operations to LUTs and execute the supported
fixed-point operations on DSPs. This translates to a single PE execution spanning two different
types of hardware resources. We refer to this implementation as the “Hybrid” implementation.
For completeness, we compare a fourth implementation that restricts execution of the operations
to the FPGA’s LUT resources.
All four implementations were synthesized and implemented using the Xilinx Vivado 2018.1
synthesis tool targeting the Zynq UltraScale+ MPSoC ZCU102. Correct functionality of Or-
thrusPE was confirmed by the Xilinx Vivado Simulator. Power estimates are obtained using the
Xilinx Power Estimator and the Vivado Power Analysis tool, built into the Vivado Design Suite.
In order to fairly compare the four implementations, we fix the throughput to 1 MAC per
cycle or 48 XNORs per cycle, i.e. a single OrthrusPE’s throughput. Higher performance
of all implementations is possible by replicating the structures. The processing elements were
synthesized across multiple target frequencies to show compatibility with any potential accelerator
which might utilize them. A BNN accelerator would typically require hundreds of PEs, therefore,
the presented results scale gracefully based on the underlying accelerator.
Implementation of the individual PEs yielded the utilization results presented in table 4.2.
Figure 4.7: Synthesis results for LUT utilization across different design target frequencies. Each plot point represents a different synthesis run.
The results show that OrthrusPE can operate at the maximum frequency of the DSP48 block, i.e.
the added functionality comes without any latency cost. In practice, BNN accelerators operate at lower frequencies; therefore, OrthrusPE can be integrated into any BNN FPGA accelerator.
The LUT utilization differences within each implementation are minimal when synthesizing at
frequencies above 500 MHz, as shown in figure 4.7. However, from 500 MHz down to 400 MHz,
three of the implementations enjoy some relaxation in the parallelism required to meet the timing
constraints. Another such relaxation occurs when the target frequency is lowered from 400 MHz
Figure 4.8: Dynamic power estimation at different design target frequencies. Each plot point represents a different synthesis run.
down to 333 MHz. Overall, our OrthrusPE and OrthrusPE-DS implementations, both executing
SIMD binary Hadamard product on DSPs, result in the lowest LUT utilization cost.
A further plot point is added showing the resource utilization quoted in FINN [103] for
popcount-accumulation of 128-bits at a target frequency of 200 MHz. The closest matching
OrthrusPE implementation, in terms of bitwidth, provides 16 more bit accumulations and 3
parallel MAC operations (through runtime reconfigurability), while requiring 32% fewer LUTs.
Figure 4.8 shows the power estimates for the implementations at different design target frequen-
cies. The results demonstrate that using a single OrthrusPE for MAC operations and binary
Hadamard products presents the most efficient solution among the evaluated implementations.
The OrthrusPE-DS solution also offers the second best power efficiency among the 4 configura-
tions. In practice, OrthrusPE-DS can execute both types of operations concurrently which makes
it well-suited for a pipelined accelerator.
The results show that DSPs present a good choice for accelerating binary operations in
OrthrusPE and OrthrusPE-DS. Default implementations relying purely on LUTs or hybrids of
LUTs and DSPs were less efficient in all of our experiments. Exploiting DSPs as in OrthrusPE
improves the utilization of hard blocks already employed by accurate BNN accelerators. This
does not prevent the design from employing further LUTs for further binary operations, yet it
allows hard blocks to contribute to more types of computations.
4.1.6 Discussion
The development of OrthrusPE was aimed at achieving an efficient execution of accurate BNNs
at the compute level. Analyzing the computational complexity of accurate BNNs allowed the
hardware designer to use their conceptual understanding of the capabilities of the DSP block on
FPGA to find a creative solution for the problem of supporting two types of numerical operations
in the binary and fixed-point domains. To implement the solution, a handcrafted reprogramming of
the DSP block was required to enable the desired functionality. This led to the HDL description of
OrthrusPE, which wraps around the DSP block and allows the user to switch between the functions,
and access the scratchpads accordingly. The runtime reconfigurable PE satisfies all the functions
required by accurate BNNs, while capitalizing on resource reuse. Accurate BNNs cannot be
achieved without fixed-point operations and reliance on DSP blocks. Instead of separating
binary and fixed-point computations to two types of hardware resources, OrthrusPE improves
the efficiency of the computation by executing both on FPGA hard blocks. Two configurations
were evaluated, OrthrusPE and OrthrusPE-DS, across multiple target accelerator frequencies.
Both solutions achieved improved resource utilization and power efficiency compared to typical
BNN accelerator processing elements. Accurate BNNs solve many of the computation and
memory challenges for deep neural network workloads on edge devices. Efficiently executing
their mixed-precision computations can further exploit the advantages they offer at the hardware
level.
4.2 Mind the Scaling Factors: Resilience Analysis of Quantized Adversarially Robust CNNs
effect for bit-flips [126], while others using targeted bit-flip attacks (BFAs) construct network-
specific attacks which are extremely unlikely to happen at random [127, 128, 123]. This work
holistically investigates hardware fault resilience and adversarial robustness with large-scale
resilience analysis on differently trained CNNs and identifies clear relationships between training-
time CNN statistics and their deployment-time effect on scaling factors and clipping limits. The
results of this work show that the common denominator for all resilient CNNs is narrow inter-layer data distributions, which result in smaller scaling factors at deployment; small scaling factors naturally introduce resilience by attenuating the largest possible perturbation.
The contributions of this work can be summarized as follows:
• Across ∼10M bit-flip experiments, regularly trained, adversarially trained, batch-norm free,
weight decayed and pruned CNNs are considered. The hardware bit-flip module allows for
testing a wide range of bit-flip patterns to analyze the effect of training/compression on
hardware fault resilience.
• Weaknesses in adversarially trained CNNs are identified, which open a backdoor for
injecting faults of large magnitude. A simple weight decay remedy is proposed to shrink
the quantization scaling factors, which improves resilience against faults in activation
pixels by 25% on FastAT ResNet56, while preserving natural accuracy and adversarial
robustness.
Transient errors may happen in any part of the logic, including input pixels, partial sums, or output activations [125].
Hoang et al. [126] proposed to improve error resilience of CNNs by clipping activations. The
investigations were limited to memory-based bit-flips on weights and only aged CNN archi-
tectures were tested, which have no batch normalization after each convolutional layer. Errors
in such CNNs are typically exaggerated compared to modern CNNs, as batch normalization
naturally reduces the activation distribution and scaling factors (as shown in figure 4.10). In an
adversarial attack scenario, input noise is propagated and amplified through the layers causing
a misclassification. Liao et al. [129] proposed to mitigate the amplification error by using a
denoiser to reduce input perturbations. Lin et al. [124] applied Lipschitz regularization to limit
the error amplification in quantized CNNs. Both works focused on mitigating attacks injected
at the input, but did not consider inter and intra-layer faults (depicted in figure 4.11). Zahid
et al. [122] introduced a fault-injection layer at training time. The work focused on a class of
permanent errors and did not consider adversarial attacks. A defense method against targeted
DRAM bit-flip attacks was proposed by Li et al. [128], where weights were preprocessed to limit
their change of value. The method was limited to weight-based, memory-only, targeted BFA and
did not consider input-based adversarial attacks.
4.2.3 Methodology
4.2.3.1 Problem Formulation: Quantization and Bit-Flips
Recalling equation 2.3, a single weight value w multiplied by an activation pixel a produces a
partial result in the convolution operation of the weight tensor Wl and input activation tensor
Al−1 in layer l of an L-layer CNN. At training time, Al−1 and Wl ∀l ∈ L are represented by high-
precision FP32 values to maintain smooth training and fine adjustments through backpropagation.
During inference, the values are quantized to reduce their memory footprint and arithmetic
computation complexity on embedded HW. The 8-bit signed integer (INT8) representation is
one of the most common numerical representation formats for lean deployment on resource
constrained devices. Equation (2.6) showed the basic principle of linear quantization of xf (either
w or a) into a more constrained numerical representation xq .
The scaling factor v projects the quantized range of INT8 [−128, 127] onto the real range of
values which xf ∈ Xf can take with respect to the clip operator. Note that Xf is either Wl or
Al−1 . The round operation pushes the smooth values of Xf into the limited 256 integer values
of INT8. As mentioned previously, the clip operator cuts off values of the Xf range beyond
[−c, c], maintaining symmetric linear quantization, even in cases where layers such as ReLU
leave only positive activations, and weights use only a small portion of the negative number
scale. By observing the statistics of weight and activation distributions of a layer, the calibration
process sets c and v, such that the range of values that appear in a certain layer can be covered by
the INT8 static range [58]. Therefore, c and v are directly influenced by the dataset, the weight
values of the CNN (e.g., learned through vanilla or adversarial training, regularized or not) and
Figure 4.9: Batch-norm limits activation range at training time, effectively lowering v and c of the subsequent convolutional layer at deployment time (on hardware). Errors in the convolutional layer can at most grow in magnitude to the defined clip c of the next layer.
its structure (e.g., existence of batch-norm layers). The described quantization of Xf to INT8 is
visualized in figure 4.9.
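A minimal sketch of this symmetric INT8 quantization, with the clip limit c and scaling factor v fixed at calibration time, is given below; the exact rounding convention is an assumption in the spirit of equation 2.6.

import numpy as np

def quantize_int8(x_f, c, v):
    """Symmetric linear INT8 quantization with clipping, as described above.

    c: clip limit and v: scaling factor, both set at calibration time from
    layer statistics; x_q lies in the static range [-128, 127].
    """
    x_clipped = np.clip(x_f, -c, c)
    return np.clip(np.round(x_clipped / v), -128, 127).astype(np.int8)

def dequantize(x_q, v):
    """Project the INT8 value back onto the real range (x_f is roughly v * x_q)."""
    return v * x_q.astype(np.float32)

# A flip of the n-th bit perturbs x_q by 2**n, so its real-valued severity is
# at most v * 2**n and can never exceed the next layer's clip limit c.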
A runtime reconfigurable bit-flip module is implemented to change the value of any position
in the 8-bit representation, for weights and activations, and for any subset of multipliers in a
standard spatial DNN accelerator [96]. Flipping the n-th bit of an operand at the input of any
affected multiplier translates to a $2^n$ absolute change in magnitude within the static INT8 range [−128, 127]. However, it is more important to analyze the precise severity of a $2^n$ flip with
respect to the values of the projected real range of Xf , i.e. after applying scaling factors v.
With this conceptual understanding of quantization and bit-flips, some general insights can be
made:
• Quantization naturally improves bit-flip resilience. Quantization clips the largest possible perturbation when projecting a larger, dynamic representation, such as FP32, into a more constrained range such as INT8. As the clip limits c and scaling factors v are decided based
on statistics before deployment on hardware, a single or multiple bit-flips on hardware
cannot perturb the network beyond c of the next layer (figure 4.9). This is an inherent
improvement in bit-flip resilience over float/dynamic numerical representations.
Figure 4.10: Layer-wise scaling factors v of ResNet20 CNNs trained on CIFAR-10, with and without batch-norm. Works investigating bit-flips on aged CNNs (without batch-norm after every layer) cannot be extended to modern CNNs.
Figure 4.11: Adversarial attacks apply input perturbations to cause incorrect classifications. Training for such attacks implies training for pixel value distributions outside of the natural dataset. Differently, hardware faults can occur at any point within the CNN, and are not limited to the input of the network.
Without batch-norm, aged CNNs will have large scaling factors v to accommodate the activations that appear in
the convolutional layers, resulting in a much larger true magnitude error for any bit-flip.
The scaling factors v of ResNet20 with and without batch-norm are shown in figure 4.10 to
visualize this problem. Errors in aged CNNs can also propagate and get amplified, as the
scaling factors grow in deeper layers.
Bit-flips at the compute level fall under logic transient errors [125], and capture a broader range
of error patterns compared to memory-based faults. A memory-based fault on a weight parameter
w implies all computations using w are affected. With logic transient errors, memory-based faults
can be replicated, as well as every other case where a subset of w’s computations are affected.
This provides finer granularity in error injection control for the large-scale bit-flip benchmarks
planned in this work. The large-scale bit-flip benchmarks are made possible by exploiting
the flexibility of a run-time reconfigurable bit-flip injection hardware module implemented on
the accelerator as part of this work. Large-scale statistical fault injection is an established
approach to analyzing errors in logic [125]. However, it is often infeasible due to slow RTL
simulations. RTL simulations are circumvented in this work by directly implementing the bit-flip
module on NVDLA [96] and injecting the desired bit-flip patterns on the running hardware. The
benchmarks are developed with well-defined bit-flip patterns, to better understand the effect of
bit-flip characteristics such as position in numerical representation, frequency of occurrence,
affected datatype, and affected percentage of multipliers.
The benchmark is defined in steps, where each successive step changes one aspect of the bit-flip pattern. Within a step, the bit-flip pattern is held fixed while the accelerator performs inference over an entire test set of input images. Once the test set is exhausted, the next step begins with a new bit-flip
pattern and the test set is passed once more. The benchmark steps are shown as a nested-loop in
algorithm 4.1.
First, the frequency f of bit-flip occurrence is set. The frequency indicates the rate of bit-flip
injection per computation, i.e., if f is set to 0.1, a bit-flip is introduced at every 10-th computation
of the affected hardware component. Next, the affected datatype t is set, as in activations A or
weights W. Third, the loop goes over the bit-flip position b, indicating the severity in magnitude
change for the value of the input operand of the affected computation. Finally, the inner-most
loop chooses the number of affected multipliers m, as a percentage of the accelerator’s total
MAC units. Figure 4.12 visualizes these bit-flip characteristic parameters. At the core of the
nested-loop in algorithm 4.1, the characteristics are programmed into the bit-flip module, then
the accelerator is allowed to perform inference over the entire test set. Here, system failures are
defined as those cases when the prediction with hardware errors disagrees with that of the same
CNN without any bit-flips. Therefore, failures are not counted based on the accuracy of the model
or the true label of the input image. This definition aligns with existing work [125], and is fair
when comparing different networks, as their underlying baseline accuracy is orthogonal to their
resilience against hardware errors.
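The benchmark structure described above can be summarized in a short sketch. The sets F, T, B, and M below match those listed in the evaluation, while program_bitflip_module() and run_testset() are hypothetical wrappers around the injection hardware module and the NVDLA inference flow.

```python
# Sketch of the nested benchmark loop (algorithm 4.1-style); not the actual test harness.
F = [0.1, 0.02, 0.01, 0.005, 0.002]   # injection rate per computation
T = ["A", "W"]                         # affected datatype: activations or weights
B = [5, 6, 7]                          # flipped bit position (7 = INT8 sign bit)
M = [0.25, 0.50, 1.00]                 # fraction of affected MAC units

def run_benchmark(testset, golden_predictions):
    failures = {}
    for f in F:
        for t in T:
            for b in B:
                for m in M:
                    program_bitflip_module(freq=f, dtype=t, bit=b, mac_fraction=m)
                    preds = run_testset(testset)          # inference with faults active
                    # a failure is a disagreement with the fault-free prediction
                    fails = sum(p != g for p, g in zip(preds, golden_predictions))
                    failures[(f, t, b, m)] = fails / len(testset)
    return failures
```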
4.2.4 Evaluation
The experiments are performed on the CIFAR-10 dataset, using 50K images for training and 10K
test images for evaluation. The test set also serves as the hardware fault test set in algorithm 4.1.
ResNet20 and ResNet56 represent shallow and deep baseline models for the CIFAR-10 dataset. If
not otherwise mentioned, all hyper-parameters specifying the task-related training were adopted
from ResNet’s base implementation [28]. Pruned variants are obtained by re-implementing the
reinforcement-learning-based pruning agent proposed in AMC [31]. The fault resilience of pruned CNNs is investigated with 50%-60% fewer operations than their unpruned variants. For defensive training against adversarial attacks, we use the popular FastAT [93] approach and the training hyper-parameters described in the paper. To evaluate adversarial robustness, we apply a strong, unseen PGD [92] adversarial attack on all considered CNNs, with 20 iterations and a perturbation budget of ε = 2. The entropy-based calibrator of TensorRT is used to find the optimal v and c for each layer of the full-precision CNNs before INT8 execution. This consistently gave better accuracy than the naive min-max calibrator. As an added benefit for the bit-flip experiments, the entropy-based calibrator provides smaller clip ranges than the naive min-max method, which benefits the fault resilience of all considered CNNs. All CNNs are calibrated on the same dataset, i.e., the same images are passed to compute v and c for each layer of each CNN before deployment on hardware.
[Figure 4.12 diagram: the bit-flip characteristics programmed into the injection module: the affected datatype t (weight or activation operand bits), the bit position b, the injection rate f, and the affected subset m of MAC units.]
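The role of the per-layer scale v and clip value c can be made concrete with a hedged sketch of symmetric INT8 calibration. Only the naive min-max variant is shown (the entropy/KL-based calibrator of TensorRT is more involved and is not reproduced here), and all function names are illustrative.

```python
import numpy as np

def minmax_scale(x: np.ndarray):
    """Naive min-max calibration: the clip value c is the largest magnitude observed."""
    c = np.abs(x).max()
    v = c / 127.0                     # symmetric INT8 scale
    return v, c

def quantize_int8(x: np.ndarray, v: float):
    """Symmetric INT8 quantization with clipping: q = clip(round(x / v), -127, 127)."""
    return np.clip(np.round(x / v), -127, 127).astype(np.int8)

# Smaller clip ranges (as produced by an entropy-based calibrator) yield a smaller v,
# and therefore a smaller dequantized perturbation v * 2**b for the same bit-flip.
```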
We synthesize a 64 MAC unit variant of the NVDLA accelerator on the Xilinx ZCU102 board.
The bit-flip module is written in Verilog and wraps around the MAC units without adding any
delays to any critical paths of the accelerator design. The sets in algorithm 4.1 are F = {0.1,
0.02, 0.01, 0.005, 0.002}, T = {A, W }, B = {5, 6, 7}, and M = {25%, 50%, 100%}, where
the benchmark loops over the elements in the order they are presented here. Bit-position b =
7 ∈ B indicates a flip in the sign-bit of INT8. The sets F, T , B, and M were chosen after an
ablation study on the considered networks. The ranges for each bit-flip characteristic adequately
represent weak-to-strong influence on CNN fault rate for the purpose of our analysis.
• Activation sensitivity. Flipping bits of input activations A is more likely to cause failures
compared to flipping weight bits in W at any bit-flip position, on any number of multipliers
and any frequency of bit-flip injection. Many memory-based and targeted bit-flip works
only flip the weights of the CNN, without investigating input activations [126, 128], which
are persistently more vulnerable in all our tested CNNs, and all bit-flip patterns of the
benchmark.
• Sign-bit sensitivity. An expected and common observation is the high impact of the
sign-bit in deciding the probability of failure. However, it is interesting to note the degree
of its importance; in almost all cases, flipping the sign-bit in 25% of the multipliers is more
potent than flipping the 6-th bit on 100% of the multipliers, at any given frequency, for
both weights W and activations A. Flips on the 5-th bit (or lower, based on observations
not shown for brevity) are almost negligible at low injection rates, even on 100% of the
MAC units.
• Adversarially robust CNNs are vulnerable to hardware errors. There is a clear degra-
dation in fault resilience for adversarially robust CNNs, particularly for activation-based
bit-flips. We address this observation more closely in the next section. Pruned CNNs
exhibit resilience properties close to their unpruned counterparts. This is justified as their
scaling factors v are similar to the original (vanilla) unpruned network. However, spikes of
high failure rates (marked in figure 4.13) occur when m = 100%, indicating that injecting
many perturbations in a CNN with fewer computations (due to pruning), leads to slightly
weaker fault resilience.
• Deep CNNs with batch normalization are resilient. Deeper CNNs (56-layers) have
improved fault resilience over their shallow (20-layers) counterparts for vanilla, pruned,
and adversarially robust variants. The errors introduced in the early layers of the network
do not grow with the depth of the CNN. This can be credited in part to the batch normalization layers which follow every convolutional layer and regulate the maximum possible perturbation that can pass to the next layer, (1) through calibration-time statistics (which help in lowering the layers' scaling factors v) and (2) through run-time normalization. He et al. [123]
show benefits of batch normalization against targeted (search-based) bit-flip attacks. We
further show the benefits of batch normalization more generally against any hardware-based
faults (non-targeted). The two ResNet20 variants presented in figure 4.10 (Vanilla and No
Batch-Norm) are evaluated in table 4.3. The overall mean failure rate is doubled in the
variant without normalization, due to its high scaling factors which amplify errors in the
CNN.
The results in figure 4.13 can shed light on parsimonious hardware-error resilience options. For example, the designer may apply a redundancy method on the computations against the sign-bit or allocate resilient memory holding activation bits (e.g. 8T-SRAM). More conservatively, the designer may apply that redundancy to only a subset of multipliers, e.g. 50% of the MAC array, further saving resources and area-on-chip. Such design decisions can be made based on large-scale resilience experiments, and would not be possible based on targeted bit-flip attacks [127, 123, 128].
[Figure 4.13 plots: failure rate (bars) and accuracy (line) over the benchmark steps of algorithm 4.1, grouped by injection rate f = 0.1, 0.02, 0.01, 0.005, 0.002 and split into t = A and t = W; panels: (a) ResNet20 Benchmark, (b) ResNet56 Benchmark, (c) ResNet20 AMC-Pruned Benchmark, (d) ResNet56 AMC-Pruned Benchmark, (e) ResNet20 FastAT Benchmark, (f) ResNet56 FastAT Benchmark; panels (c) and (d) mark spikes occurring when M = 100%.]
Figure 4.13: Bit-flip experiments following algorithm 4.1 on vanilla, pruned and adversarially trained ResNet20 and ResNet56. Each bar represents the failure rate of a particular bit-flip setting {f, t, b, m} tested over 10K test images. Each sub-figure comprises 900K bit-flip experiments.
[Figure 4.14 plots: convolutional layer scaling factors v per layer; (a) ResNet20 layer-wise convolutional layer scaling (layers 1-20), (b) ResNet56 layer-wise convolutional layer scaling (layers 1-55).]
Figure 4.14: Convolutional layer scaling factors for vanilla trained and adversarially robust variants of
ResNet20 and ResNet56. High weight decay (αd = 0.05) brings the high scaling factors v of
FastAT back to vanilla levels.
The mean failure rate (MFR) is provided for each CNN over the entire benchmark in algorithm 4.1 (Overall). Additionally, to help in understanding the effect of individual characteristics of the bit-flip patterns, one bit-flip characteristic is fixed (f, b, m, or t) and the MFR over all steps varying the other bit-flip parameters is measured.
The observations made in section 4.2.4.1 are supported by the MFR presented in table 4.3.
For FastAT CNNs, a 62% and 95% degradation in overall MFR can be observed for ResNet20
and ResNet56, respectively, compared to their vanilla-trained variants. When increasing αd ,
the FastAT CNNs improve by up to 16% in overall MFR. More specifically, the fault resilience
against activation bit-flips t = A is improved by 27% and 25% for the high weight decay FastAT
ResNet20 and ResNet56, compared to the regular FastAT implementation. Although baseline
accuracy is considered orthogonal to fault resilience analysis (explained in section 4.2.3.2), it is
interesting to discuss the trade-offs that can be achieved in fault resilience, adversarial robustness,
and natural accuracy. In general, adversarial training techniques in literature incur a degradation
in natural accuracy when trying to learn adversarial attacks as well as their target classification
task [93]. The smaller FastAT ResNet20 suffers a further drop of 3.8 p.p. in accuracy after applying the high αd. However, the larger ResNet56 has a slightly
improved accuracy after weight decay compared to the regular FastAT implementation. Weight
decay can be harsh, particularly on smaller CNNs, as more weights approach zero and lose
their feature representation capability. ResNet56 has sufficient redundancy to compensate for
this (and even benefits through regularization); however, the smaller ResNet20 loses some of
its natural accuracy. Although weight decay is proposed as an initial, simple remedy for the
Table 4.3: Summary of results on shallow (ResNet20) and deep (ResNet56) CNNs as vanilla, pruned, and adversarially trained variants. Percentage improvement shown for FastAT αd = 0.05 over regular FastAT. Mean Failure Rate (MFR): lower is better.

Model | Train/Config | Baseline (INT8) Acc. [%] | PGD-20 Atk. Acc. [%] | MFR Overall | MFR f=0.005 | MFR f=0.1 | MFR b=5 | MFR b=7 | MFR m=25% | MFR m=100% | MFR t=W | MFR t=A
ResNet20 (CIFAR-10) | Vanilla | 92.03 | 1.04 | 0.29 | 0.21 | 0.53 | 0.09 | 0.55 | 0.19 | 0.37 | 0.19 | 0.39
ResNet20 (CIFAR-10) | No BatchNorm | 79.12 | 5.01 | 0.60 | 0.57 | 0.69 | 0.48 | 0.72 | 0.53 | 0.70 | 0.40 | 0.81
ResNet20 (CIFAR-10) | 60% Pruned | 89.59 | 1.21 | 0.27 | 0.17 | 0.56 | 0.11 | 0.47 | 0.15 | 0.41 | 0.19 | 0.35
ResNet20 (CIFAR-10) | FastAT [93] | 81.58* | 72.85 | 0.47 | 0.40 | 0.67 | 0.27 | 0.70 | 0.39 | 0.57 | 0.30 | 0.64
ResNet20 (CIFAR-10) | FastAT αd=0.05 | 77.72 | 70.36 | 0.40 (15%) | 0.31 (23%) | 0.65 (3%) | 0.20 (26%) | 0.62 (11%) | 0.24 (38%) | 0.55 (4%) | 0.33 (-10%) | 0.47 (27%)
ResNet56 (CIFAR-10) | Vanilla | 92.94 | 4.53 | 0.22 | 0.13 | 0.49 | 0.06 | 0.46 | 0.13 | 0.32 | 0.17 | 0.28
ResNet56 (CIFAR-10) | 50% Pruned | 92.04 | 2.66 | 0.28 | 0.19 | 0.54 | 0.10 | 0.47 | 0.15 | 0.44 | 0.19 | 0.37
ResNet56 (CIFAR-10) | FastAT [93] | 82.71* | 72.72 | 0.43 | 0.35 | 0.66 | 0.21 | 0.69 | 0.31 | 0.54 | 0.30 | 0.56
ResNet56 (CIFAR-10) | FastAT αd=0.05 | 83.37 | 74.72 | 0.36 (16%) | 0.25 (29%) | 0.65 (2%) | 0.17 (19%) | 0.63 (9%) | 0.25 (19%) | 0.48 (11%) | 0.31 (-3%) | 0.42 (25%)
*: Accuracy degradation from vanilla-training is common in state-of-the-art adversarial training to achieve high adv. robustness (see accuracy after PGD attack)
adversarial training and fault resilience problem, the analysis provided in this work identifies
a larger challenge in bringing robustness of both domains (adversarial attacks and hardware
faults) in the same CNN. It is also important to note that adversarially trained CNNs, even with
the proposed high αd, are still less fault resilient than vanilla CNNs. The reason is that weight decay shrinks the convolutional layers' scaling factors, but the batch normalization trainable parameters (γ_bn, β_bn) are not directly affected by it, leaving their scaling factors large due to adversarial training.
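One way the situation described above arises is the common practice of excluding normalization parameters from the decayed parameter group. Below is a minimal PyTorch sketch of such an optimizer setup; it is illustrative and not necessarily the exact configuration used for the FastAT runs.

```python
import torch
from torch import nn

def make_optimizer(model: nn.Module, lr: float = 0.1, weight_decay: float = 0.05):
    """Apply weight decay to conv/linear weights only; leave batch-norm parameters
    (gamma, beta) and biases undecayed, as is common practice."""
    decay, no_decay = [], []
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d)):
            no_decay.extend(module.parameters(recurse=False))
        else:
            for name, p in module.named_parameters(recurse=False):
                (no_decay if name == "bias" else decay).append(p)
    return torch.optim.SGD(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr, momentum=0.9)
```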
4.2.5 Discussion
This work highlighted the importance of scaling factors for maintaining hardware-fault resilience
of efficient, quantized CNNs. The importance of scaling factors was verified by performing
large-scale bit-flip experiments on regularly trained, adversarially trained, batch-norm free, weight
decayed, pruned, deep and shallow CNNs. Extracting key insights from the results generated by
the large-scale experiments required human expert-knowledge in ML and hardware concepts,
such as adversarial training, quantization, calibration, and neural network data distributions.
Only by conceptually understanding both sides, the execution on hardware and the network's training properties, could the results be made interpretable and used to develop an intermediate solution based on this understanding. If an automated agent were provided access to all training
parameters (learning rate policy, scheduler and value, momentum, batch-size, epochs, loss
formulation, optimizer type, regularization type, weight decay, etc.) as well as the large-scale
resilience benchmark, it would take a prohibitively long time for it to find out how each training
hyper-parameter affects bit-flip resilience. Each time the automated agent reconfigures the training setup to test a different training hyper-parameter configuration, an entire, costly, GPU-based training run is required before deployment, followed by large-scale bit-flip experiments, to collect the reward/result for that particular training configuration. On
the other hand, the human expert required only two experiments (ResNet20 with and without
batch normalization) to prove their hypothesis which connects scaling factors to bit-flip resilience.
After confirming the hypothesis, the same idea extended itself to adversarially-trained CNNs,
proving that their large scaling factors open a backdoor for bit-flips with large true magnitude
perturbations. The human designer then used their theoretical understanding of how each training
hyper-parameter affects data distributions in a CNN, which indirectly affects the scaling factors
at deployment time on a quantized accelerator. The relevant hyper-parameter, weight decay, was
tweaked to improve the resilience of an adversarially trained ResNet56 by 25% on activation
faults. This succinctly captures the process of handcrafted HW-CNN co-design.
5 Semi-Automated Co-Design
Crafting parts of the solution is challenging, and formulating the whole problem into a feasible, traversable, and well-defined search space is not possible. In such cases, certain
computation models may be used to aid the human designer in optimizing some components.
Additionally, low-level, handcrafted components may be integrated in an automated manner into
a larger system, which is then optimized by an agent or a model of computation (MoC). In this
chapter, two examples are introduced, where neural network accelerators reuse the handcrafted
components from chapter 4 in a larger hardware design, which can be represented as a data flow
graph (DFG). When implementing the neural network as a computation graph on a dataflow
hardware architecture (recall figure 2.8), the architecture of the neural network defines the com-
plexity of the graph and the computation effort in each node (layer). To a large extent, the neural
network is itself the hardware design. This context forms the HW-CNN co-design problem for
this chapter. Based on the layer-wise computation effort, the designer must accordingly specify
the resources to be allocated in different parts of the graph. Here, the allocation not only has
to respect the resources available on the target embedded FPGA platform, but also consider
the throughput and efficiency of the computation pipeline resulting from the synthesized graph.
Dataflow architectures can be optimized in a semi-automated manner when compiling the graph
in HLS; the human designer must specify the resources for nodes in the graph, but the allocation
of FIFO communication buffers and the computation pipeline is automatically generated. This
form of co-design was used to fit highly efficient BNNs on a semi-autonomous prosthetic hand in
Binary-LoRAX [25], and enabled accurate, privacy-preserving, edge-based face-mask wear and
positioning detectors during the COVID-19 pandemic in BinaryCoP [24].
• Training BNNs for the graspable object classification task, enabling the efficient deploy-
ment of neural networks on intelligent prostheses with a task-related accuracy of 99.82%
on a 25-class problem from the YCB object dataset [130], adding 12 classes compared to
existing work [131].
• Achieving low-latency classifications of 0.45 ms, consuming <1% of the optimal controller
delay [132] and achieving a 99.7% reduction in latency compared to existing work [131].
5.1 Binary-LoRAX: Low-power and Runtime Adaptable XNOR Classifier for Prosthetic Hands
Figure 5.1: KIT Prosthetic Hand (50th percentile female) with Zynq Z7010-based processing system
Works such as [34, 62, 142, 143] have focused on adding algorithmic or structural complexity to
BNNs to achieve classification performance close to full-precision CNNs on complex tasks [144].
However, simpler tasks with lower scene complexity can be handled with more efficient BNNs [60,
24].
In the context of semi-autonomous prosthetic hands, the camera input at the instance before
the grasp operation takes place is expected to have one central object in the field-of-view. In
that regard, the task’s complexity resembles that of popular datasets, such as the German Traffic
Sign Recognition Benchmark (GTSRB) [145], Street View House Numbers (SVHN) [146]
or CIFAR-10 [27], all of which have the object of interest in the foreground of the scene, with
minimal random background complexity when compared to autonomous driving scenes such
as Cityscapes [147]. It is important to note that BNNs have shown high accuracy and good
generalization on the mentioned datasets [60, 103]. Considering the power, memory, accuracy,
and latency requirements of the target application, along with the limited battery-life and compute
capabilities of the small, edge compute device on the prosthesis, BNNs represent good candidates
for the graspable object classification problem.
5.1.3 Methodology
5.1.3.1 Training and Inference of Simple BNNs
For efficient approximation of weights and activations to single-bit precision, the BNN method by
Courbariaux et al.[60] is used. A brief recap of these simple BNNs is provided in this section. At
training time, the network parameters are represented by full-precision latent weights W allowing
for a smoother convergence of the model [59]. It is important to note that the input and output
layers in this implementation are not binarized, to avoid a drop in classification accuracy.
During the forward-pass for loss calculation or deployment, the weights w ∈ W are transformed
into the binary domain b ⊂ B ∈ B^{Kx×Ky×Ci×Co}, where B = {−1, 1}. Note that this simple form
of binarization does not involve multiple binary bases (M, N ) as those discussed in section 4.1.3.
In the hardware implementation, the −1 is represented as 0 to perform multiplications as XNOR
logic operations. The weight and input feature maps are binarized by the sign() function (recall
equation 2.7).
[Figure 5.2 diagram: binarized weight and activation slices are fed as operands into DSP blocks performing XNOR operations; the DSP results are written back to the PEs (accumulator, threshold, popcount and shift) of a matrix-vector-threshold unit, while runtime frequency scaling switches between a high-performance and a power-saving mode.]
Figure 5.2: Overview of Binary-LoRAX: BNN tensor slices are fed into DSP blocks which perform
high-throughput XNOR operations. DSP results are forwarded to the PEs of a matrix-vector-
threshold unit (MVTU). A single MVTU of the pipeline is shown for compactness. Runtime
frequency scaling allows high-performance functions, or power-saving mode.
The sign() function blocks the flow of gradients during training due to its derivative, which
is zero almost everywhere. To overcome the gradient flow problem, the sign() function is
approximated during back-propagation by an STE [59].
In the simplest case, the estimated gradient gb could be obtained by replacing the derivative of
sign() with the hard tanh, which is equivalent to the condition gw = gb when |w| ≤ 1 [60], as
shown in equation 5.1.
g_w = g_b · 1_{|w| ≤ 1}    (5.1)
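A minimal PyTorch-style sketch of the binarization forward pass with the STE gradient of equation 5.1; the class and function names are illustrative and not taken from the actual training code.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """sign() in the forward pass; straight-through estimator in the backward pass,
    passing gradients only where |w| <= 1 (equation 5.1)."""

    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        # map to {-1, +1}; zeros are nudged to +1 to avoid a third value
        return torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        return grad_output * (w.abs() <= 1).to(grad_output.dtype)

binarize = BinarizeSTE.apply   # usage: b = binarize(latent_weights)
```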
As mentioned previously, batch normalization of the input elements a^{l−1} ⊂ A^{l−1}, before the approximation into the binary representation h^{l−1} ⊂ H^{l−1} ∈ B^{Xi×Yi×Ci}, is crucial to achieve
effective training. An advantage of BNNs is that the result of the batch normalization operation
will always be followed by a sign() operation (as shown in figure 5.2). The result after applying
both functions is always constrained to two values, {−1, 1}, irrespective of the input. This
makes the precise calculation of batch normalization wasteful on embedded hardware. Based
on the batch normalization statistics collected at training time, a threshold point τ_thold can be defined, where an activation value a^{l−1} ≥ τ_thold results in 1, otherwise −1 [103]. This allows the
implementation of the typically costly normalization operation as a simple magnitude comparison
operation on hardware.
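A hedged sketch of folding batch normalization and sign() into a single threshold comparison, assuming per-channel statistics and positive γ (channels with negative γ reverse the comparison and are omitted here for brevity).

```python
import numpy as np

def bn_sign_threshold(gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm followed by sign() into one per-channel threshold.
    For gamma > 0: sign(gamma*(a - mean)/sigma + beta) = +1  <=>  a >= mean - beta*sigma/gamma."""
    sigma = np.sqrt(var + eps)
    return mean - beta * sigma / gamma

def binarize_activation(a, thold):
    """+1 if the activation is at or above the threshold, else -1."""
    return np.where(a >= thold, 1, -1)
```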
The baseline hardware architecture is provided by the Xilinx FINN framework [103]. The
hardware design space has many degrees of freedom for compute resources, pipeline structure,
number of PEs and SIMD-lanes, among other parameters. The streaming architecture is composed
of a series of matrix-vector-threshold units (MVTUs) to perform the XNOR, popcount and
threshold operations mentioned in section 5.1.3.1. In figure 5.2, a single MVTU is shown in
detail, containing two PEs with 32 SIMD-lanes each. A detailed view of a single PE is also
provided in the same figure. For convolutional layers, a sliding-window unit (SWU) reshapes
the binarized activation maps H^{l−1} ∈ B^{Xi×Yi×Ci} into interleaved channels of h^{l−1} ⊂ H^{l−1}, to
create a single wide input feature map memory, that can efficiently be accessed by the subsequent
MVTU and operated upon in a parallel manner. Max-pool layers are implemented as Boolean
OR operations, since a single binary “1” value suffices to make the entire pool window output
equal to 1.
A single MVTU is solely responsible for a single layer in the BNN, and is composed of single
or multiple PEs, each having their own SIMD-lanes. The SIMD-lanes determine the throughput
of each PE for the XNOR operation.
The choice of PEs and SIMD-lanes determines the latency and hardware resource utilization
of each layer (i.e. MVTU) on the hardware architecture. Instantiating too many PEs can result in
many underutilized FPGA BRAMs, while too few PEs result in a slower processing rate with
better BRAM utilization. Increasing the number of PEs beyond a certain number causes the
synthesis tool to map the memories to LUTs instead of BRAMs, since each PE gets a smaller
slice of the total weights B. This adds another dimension of design complexity, as the target
FPGA’s LUT and BRAM count can be balanced against throughput and utilization efficiency. A
layer’s poorly dimensioned MVTU can result in an inefficient pipeline, leading to poor overall
throughput. Throughput in a streaming architecture is heavily influenced by the slowest MVTU
of the accelerator, as it throttles the rate at which results are produced when the pipeline is full.
On the other hand, latency is dependent on the time taken by all the MVTUs of the architecture
as well as the intermediate components between them (e.g. SWU, pooling unit, etc.).
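The interplay of PEs, SIMD-lanes, throughput and latency can be approximated with a first-order folding estimate. The sketch below ignores pipeline fill, SWU reshaping and memory stalls, and is not the exact FINN cost model.

```python
import math

def mvtu_cycles(xo, yo, k, ci, co, pe, simd):
    """First-order cycle estimate for one MVTU processing a conv layer:
    each of the xo*yo output pixels needs ceil(co/pe) * ceil(k*k*ci/simd)
    folded XNOR-popcount cycles."""
    folds = math.ceil(co / pe) * math.ceil(k * k * ci / simd)
    return xo * yo * folds

def pipeline_estimates(layers, clk_mhz=100.0):
    """Throughput is set by the slowest MVTU; latency roughly sums all MVTUs."""
    cycles = [mvtu_cycles(**layer) for layer in layers]
    fps = clk_mhz * 1e6 / max(cycles)
    latency_ms = sum(cycles) / (clk_mhz * 1e6) * 1e3
    return fps, latency_ms
```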
Choosing the correct number of PEs and SIMD-lanes for each layer becomes a design problem
of balancing the FPGA’s resources, the pipeline’s efficiency (throughput and latency), and
potentially the choice of layers in the BNN (i.e. task-related accuracy). The number of resources
on the FPGA is limited, especially in the context of low-power prosthetics, making these aspects
important in planning the deployment with a HW-DNN co-design approach.
In the previous section, the importance of defining the number of layers (BNN design) and
PE/SIMD-lanes per MVTU (HW design) was outlined. To enable efficient performance of the
semi-autonomous prosthesis, a further aspect must be considered next to resource utilization
and latency, namely the power consumption of the classifier. Prosthetic devices are meant to
be used on a day-to-day basis, making high power consumption a prohibitive aspect to their
practicality. For this reason, the classifier is adapted with the ability to change its operating
frequency dynamically at runtime. The purpose is to avoid running the classifier continually at its
full capacity, but rather scale down its performance (in terms of latency) for more efficient use
of the available energy supply. Dynamic power in complementary metal-oxide-semiconductor
(CMOS) scales roughly with frequency following P_dyn ≈ α_sw · f · C · V_dd², where α_sw is the switching activity, f is the frequency, C the effective capacitance and V_dd the supply voltage.
In case of our target Xilinx Zynq system-on-chip (SoC) boards, the programmable logic (PL),
on which the hardware acceleration is implemented, is clocked through phase-locked loops
(PLLs) controlled by a CPU-based processing system (PS). The PS can manipulate the PL’s
clock by writing into special registers, whose values act as frequency dividers to the PLLs. As an
example, the motion of the prosthetic hand can be captured through simple sensors which are
monitored by the PS. Based on this motion, the PS can drive up the frequency of the classifier
and prepare for a low-latency, high accuracy classification (based on a mean classification of a
batch of frames). In case of a fragile or perilous object, the lower risk of a false classification
can reduce the chances of an improper grasp. The PS can also trigger the object localization
task by splitting the view into multiple small images and classifying them with high throughput.
This is elaborated in section 5.1.4.3. These high-performance features may extend the use of
Binary-LoRAX to other semi-autonomous prostheses and/or applications. Conversely, the PS
may monitor the remaining battery power or system temperature and switch the classifier to
low-power mode.
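On a PYNQ-based PS, switching between the two operating points can be as simple as writing the PL clock frequency. The sketch below assumes the accelerator is clocked from FCLK0; the exact clock source, register mapping, and attainable frequencies depend on the board and PLL configuration.

```python
# Hedged sketch of runtime frequency scaling from the PS using the PYNQ API.
from pynq import Clocks

def set_power_saving():
    Clocks.fclk0_mhz = 2        # low-power mode (~80 ms latency for the prototypes)

def set_high_performance():
    Clocks.fclk0_mhz = 111      # high-performance mode of the v-CNV prototype
```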
5.1.4 Evaluation
5.1.4.1 Experimental Setup
Binary-LoRAX is evaluated on 25 objects from the YCB dataset [130], improving upon previous
work by 12 objects [131]. The dataset is augmented through scale, crop, flip, rotate and contrast
operations. The masks provided with the dataset are used to augment the background with random
Gaussian noise. The dataset is expanded to 105K images for the 25 classes. The images are
resized to 32 × 32 pixels similar to the CIFAR-10 [27] dataset. The BNNs are trained up to 300
epochs, unless learning saturates earlier. Evaluation is performed on a 17.5K test set. The BNN
architectures v-CNV, m-CNV, and µ-CNV, detailed in table A.1, are trained according to the
method in [60]. Each convolutional and fully-connected layer is followed by batch normalization
and activation layers except for the final layer. Convolution groups “1” and “2” are followed
by a max-pool layer. The target SoC platforms for the experiments are the XC7Z020 (Z7020)
for v-CNV and m-CNV prototypes, and XC7Z010 (Z7010) for µ-CNV. All prototypes are
finally deployed on the Z7020 SoC. Power, latency and throughput measurements are taken
directly on a running system. The power is measured at the power supply of the board (includes
both PS and PL). Latency measurements are performed end-to-end on the accelerator covering
the classifier’s total time for an inference, while throughput is the classification rate when the
accelerator’s pipeline is full. Note that throughput is higher than the latency rate due to the
streaming architecture working on multiple images concurrently in different parts of its pipeline
when it is full.
Table 5.1: Hardware results of design space exploration. Power is averaged over a period of 100 seconds of operation.

Configuration (W,A)-bits - BNN | Freq. [MHz] | LUT | BRAM | DSP | Power [W] | Latency [ms] | Throughput [FPS] | Acc. [%]
(8,8) - [131]* | 400 | - | - | - | 0.446 | 115 | 9 | 96.51*
(2,2) - CNV** | 100 | 35718 | 140 | 32 | 2.217 | 4.87 | 860 | 99.91
(1,2) - CNV | 100 | 40328 | 131.5 | 26 | 2.241 | 1.63 | 3049 | 99.89
(1,1) - CNV [103] | 100 | 26060 | 124 | 24 | 2.212 | 1.58 | 3049 | 99.82
Binary-LoRAX: DSP XNOR + Frequency Scaling:
(1,1) - v-CNV | 2 to 111 | 23675 | 124 | 72 | 1.857 to 2.172 | 78.93 to 1.42 | 61 to 3388 | 99.82
(1,1) - m-CNV | 0.7 to 125 | 21972 | 44.5 | 66 | 1.879 to 2.157 | 80.22 to 0.45 | 28 to 4999 | 98.99
(1,1) - µ-CNV | 1 to 100 | 11738 | 14 | 27 | 1.824 to 2.028 | 80.64 to 0.81 | 16 to 1646 | 90.58
*: Running on ARM Cortex M7 (CPU frequency reported), accuracy for 13 classes, 72×72 input
**: Fewer PEs and SIMD lanes to fit the SoC
slack for post-processing, actuators and other parts of the system. In power-saving mode, the
Binary-LoRAX prototypes run at 0.7-2 MHz and achieve an ∼80 ms latency, still leaving more
than 36% of the allocated delay for the controller. It is important to note that in all the reported
power measurements, roughly 1.65 W of power is consumed by the Z7020’s ARM-Cortex A9
processor (PS) and the board. This leaves the isolated accelerator’s power at roughly 0.2 W in
power-saving mode for all configurations, making it very energy efficient. However, we report
the overall power since the accelerator is still dependent on processor calls and preprocessing
operations on the CPU. In future work, the PS power consumption can also be optimized to
further reduce the classifier’s overall power requirement.
In addition to the low latency of the high-performance mode, the high throughput of up to
4999 FPS can be used to improve the quality of the application. Instead of providing a single
classification, the accelerator can pipeline the inference of many images (potentially from different
sensors) and perform batch classification. The batch classification result represents the highest-scoring class over all classifications, which in practice stem from slightly different angles, lighting conditions and distances to the object, improving the chances of a correct classification. Multi-camera prosthetics
proposed in [148] can benefit from the high throughput, as more data is gathered through the
multiple camera setup.
Another use of the high-performance mode is object localization in multi-object scenes. A
large input image can be sliced into several smaller images and reclassified [103]. The image
can be reconstructed with bounded high confidence classifications. Figure 5.3 demonstrates the
described function on Binary-LoRAX. This can help the prosthesis predetermine the location
of different objects in a far scene, when the hand is not yet close to the graspable object. The
approach also fits the training scheme, as the BNNs are trained on up-close images of the object
Figure 5.3: The large input image is sliced into smaller images and reclassified. High confidence classifi-
cations are bounded.
[Figure 5.4 plot: power (W), latency (ms) and throughput (FPS) of the v-CNV prototype versus clock frequency from 0 to 120 MHz.]
Figure 5.4: Runtime frequency scaling ranging from 2MHz to 111MHz for the v-CNV prototype.
(soon before the grasp), while far scenes with no central object would be unrecognizable to the
BNN. The individual slices of a far scene are similar to the up-close train images.
In figure 5.4, we perform a frequency sweep on the v-CNV prototype, identifying different
points of operation for different application requirements. The low-power region is considered to
be below 1.90 W, while localization would require classification rates of above 2250 FPS for an
input resolution of 320×240. Batch classification can be triggered in critical scenarios where a
latency of <10 ms is needed.
We demonstrate the application of runtime frequency scaling in figure 5.5. The total power
of the chip is measured for a duration of 80 seconds. At time = 15 s, we introduce a stimulus
representing a dangerous object or similarly a signal from a motion sensor on the hand. The event
triggers the classifier to high-performance mode for an observation period of 35 seconds. If no
further event occurs, the classifier winds down to low-power mode at time = 50 s. Naturally, the
intermediate frequencies shown in figure 5.4 can all be triggered for other scenarios or operating
modes.
[Figure 5.5 plot: total power (W, roughly 1.8 to 2.4) over time; a prosthetic-movement stimulus switches the classifier to high-performance mode (3388 FPS, 1.42 ms latency), and it returns to power-saving mode (61 FPS, 78.93 ms latency) when the prosthetic is idle or the battery is low.]
Figure 5.5: Runtime change in operation mode based on application scenario, e.g. motion, delicate object
or low battery.
5.1.5 Discussion
A daily-used device, such as a prosthetic hand, must operate in different modes to suit daily
application scenarios. This work presented a low-latency runtime adaptable XNOR classifier
for semi-autonomous prosthetic hands. The high-performance and power-saving modes were
enabled through runtime adaptable frequency scaling. Binary-LoRAX prototypes achieved over 99% accuracy on a 25-class problem from the YCB dataset, a maximum of 4999 FPS, and
a latency of 0.45 ms. The low-power mode can potentially improve the battery-life of the
classifier by 19% compared to an equivalent accelerator running continuously at full-power.
This work demonstrated the use of expert knowledge and automation in a semi-automated HW-
DNN co-design formulation. Particularly for the µ-CNV prototype, which can be synthesized
on the heavily constrained Z7010’s FPGA, the neural network had to be dimensioned such
that the total number of MVTUs resulting from the layers did not consume LUT resources
beyond those available on the PL, but still maintained high-accuracy for the graspable object
classification task. Then, the PEs and SIMD-lanes of those MVTUs had to be chosen carefully
to maintain the performance of the pipeline, but fit on the constrained FPGA. HLS performs
the automated optimizations, based on the generated pipeline, to create the HDL components.
Handcrafted, reconfigured DSPs were injected into each MVTU, to execute the highly parallel
XNOR operations of the BNN and further reduce the total LUTs required by the accelerator. The
injection of handcrafted DSPs into MVTUs was performed by an automated HDL parser script.
This combination of handcrafted design of the BNN and the DSP, along with the automated
pipeline optimizations of HLS and HDL parsers, led to a highly effective, co-designed solution
which brought high-performance, intelligent classifiers to the semi-autonomous prosthetic hand.
5.2 BinaryCoP: BNN COVID-19 Face-Mask Wear and Positioning Predictor
To provide equivalent classification accuracy for all face structures, skin-tones, hair types, and mask types, the algorithms must be able to generalize the relevant features over all individuals.
The deployment scenarios for the CNN should also be taken into consideration. A face-mask
detector can be set at the entrance of corporate buildings, shopping areas, airport checkpoints,
and speed gates. These distributed settings require cheap, battery-powered, edge devices which
are limited in memory and compute power. To maintain security and data privacy of the public,
all processing must remain on the edge-device without any communication with cloud servers.
Minimizing power and resource utilization while maintaining a high classification accuracy
is yet another HW-DNN co-design challenge which is tackled in this work. In this context,
BinaryCoP (Binary COVID-mask Predictor) is an efficient BNN-based real-time classifier of
correct face-mask wear and positioning. The challenges of the described application are tackled
through the following contributions:
• Training BNNs on synthetically generated data to cover a wide demographic and generalize
relevant task-related features. A high accuracy of ∼98% is achieved for a 4-class problem
of mask wear and positioning on the MaskedFace-Net dataset [153].
• The BNNs are analyzed through Grad-CAM to improve interpretability and study the
features being learned.
[Figure 5.6 diagram: the main components of BinaryCoP: a BNN layer (binary convolution, batch-norm, sign), synthetic-data training with interpretability checks (maintain subject diversity, avoid region-local data, assert the correct features are being learned for the classes Correctly Worn, Nose Exposed, Chin Exposed), and a FINN-based accelerator built from MVTUs (PEs, SIMD lanes, pipeline buffers, SWUs) with DSP-rewired XNOR operations, evaluated on latency, resource utilization and throughput.]
Figure 5.6: Main components of BinaryCoP. The BNN requires low memory and provides good general-
ization. The FINN-based accelerator allows for privacy-preserving edge deployment of the
algorithms without sacrificing performance. The synthetic data helps in maintaining a diverse
set of subjects and Grad-CAM can be used to assert the features being learned.
information can be captured using class activation mapping (CAM) [157] and Grad-CAM [158]
techniques. To apply CAM, the model must end with a global average pooling layer followed by
a fully-connected layer, providing the logits of a particular input. The BNN models investigated
in this work operate on a small input resolution of 32×32, and achieve a high reduction of spatial
information without incorporating a global average pooling layer. For this reason, the Grad-CAM
approach is better-suited to obtain visual interpretations of BinaryCoP’s attention and determine
the important regions for its predictions of different inputs and classes.
To obtain the class-discriminative localization map, we consider the activations and gradients
for the output of the Conv 2 2 layer (shown in table A.1), which has spatial dimensions of 5×5.
We use average pooling for the corresponding gradients and reduce the channels by performing
Einstein summation as specified in [158]. With this approach the base networks do not need
any modifications or retraining. Due to the synthetically generated dataset used for training, we
expect BinaryCoP models to generalize well against domain shifts.
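A hedged PyTorch sketch of the Grad-CAM computation described above, assuming the activations and gradients of the chosen intermediate layer have already been captured (e.g. via forward/backward hooks) for the logit of the class of interest.

```python
import torch
import torch.nn.functional as F

def grad_cam(activations: torch.Tensor, gradients: torch.Tensor, out_size=(32, 32)):
    """activations, gradients: (N, C, H, W) tensors from the chosen layer (e.g. 5x5 output)."""
    # neuron-importance weights: global average pooling of the gradients
    alpha = gradients.mean(dim=(2, 3))                        # (N, C)
    # channel reduction via Einstein summation, followed by ReLU
    cam = torch.einsum("nc,nchw->nhw", alpha, activations)    # (N, H, W)
    cam = F.relu(cam)
    # upsample to input resolution and normalize to [0, 1] for overlaying on the image
    cam = F.interpolate(cam.unsqueeze(1), size=out_size,
                        mode="bilinear", align_corners=False).squeeze(1)
    lo = cam.amin(dim=(1, 2), keepdim=True)
    hi = cam.amax(dim=(1, 2), keepdim=True)
    return (cam - lo) / (hi - lo + 1e-8)
```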
[Figure 5.7 diagram: Grad-CAM on an intermediate convolutional layer: neuron-importance weights α_c^Correct, obtained by globally averaging the gradients ∂y^Correct/∂A_c, scale the corresponding activation channels, which are then aggregated, activated and upsampled into an attention map for the classes Correct, Nose, Chin, and Nose + Mouth.]
Figure 5.7: The Grad-CAM approach used to assert that correct and reasonable features are being learned
from the synthetic data.
5.2.4 Evaluation
BinaryCoP is able to detect the presence of a mask, as well as its position and correctness. This
level of classification detail is possible through the more detailed split of the MaskedFace-Net
dataset [153] from 2 classes, namely correctly masked face dataset (CMFD) and incorrectly
masked face dataset (IMFD), to 4 classes of CMFD, IMFD Nose, IMFD Chin, and IMFD Nose
and Mouth. The dataset suffers from high imbalance in the number of samples per class. From
the total 133,783 samples, roughly 5% of the samples are IMFD Chin, and another 5% samples
are IMFD Nose and Mouth. CMFD samples make up 51% of the total dataset while IMFD Nose
makes up 39%. The dataset in its raw distribution would heavily bias the training towards the
two dominant classes. To counter this, we randomly sample the larger classes CMFD and IMFD
Nose to collect a comparable number of examples to the two remaining classes, IMFD Chin and
IMFD Nose and Mouth. The evenly balanced dataset is then randomly augmented with a varying
combination of contrast, brightness, Gaussian noise, flip and rotate operations. The final size of
the balanced dataset is 110K train and validation examples and 28K test samples. The images are
resized to 32×32 pixels, similar to the CIFAR-10 [27] dataset. The BNNs are trained up to 300
epochs, unless learning saturates earlier. The FP32 variant used for the Grad-CAM comparison is
trained for 175 epochs due to early learning saturation (98.6% final test accuracy). We trained the
BNN architectures shown in table A.1 according to the method described in section 5.1.3.1. The
target SoC platform for the experiments is the Xilinx XC7Z020 (Z7020) chip on the PYNQ-Z1
board. The µ-CNV design can also be synthesized for the more constrained XC7Z010 (Z7010)
chip, when XNOR operations are offloaded to the DSP blocks as described in section 4.1.4.
Table 5.2: Hardware results of design space exploration. Power is averaged over 100 seconds of operation.
Power and throughput measurements are taken directly on a running system, in the same manner
described in section 5.1.4.1.
Three BinaryCoP prototypes are evaluated, namely CNV, n-CNV and µ-CNV. Architectural
details of the networks can be found in table A.1. The CNV network is based on the architecture
in [103] inspired by VGG-16 [40] and BinaryNet [60]. n-CNV is a downsized version for a
smaller memory footprint, and µ-CNV has fewer layers to reduce the size of the synthesized
design. All designs are synthesized with a target clock frequency of 100MHz.
Referring back to table A.1, the PE counts and SIMD-lanes for each layer (i.e. MVTU) are
shown in sequence. For BinaryCoP-n-CNV, the most complex layer is Conv 1 2 with 3.6M
XNOR and popcount operations. In figure 5.8, this layer is marked as the throughput setter, due
to its heavy influence on the final throughput of the accelerator. Allocating more PEs for this
layer’s MVTU increases the overall throughput of the pipeline, so long as no other layer becomes
the bottleneck. Enough resources are allocated for Conv 1 1 to roughly match Conv 1 2’s
latency. The FINN architecture employs a weight-stationary dataflow, since each PE has its own
pre-loaded weight memory. When the total number of parameters of a given layer increases,
it becomes important to map these parameters to BRAM units instead of logic. The deeper
layers have several orders of magnitude fewer operations (OPs), but more parameters. For these
layers, increasing the number of PEs fragments the total weight memory, leading to worse BRAM
utilization and no benefit in terms of throughput. Here, choosing fewer PEs, with larger unified
weight memories, leads to improved memory allocation, while maintaining rate-matching with the
shallow layers (as seen figure 5.8), leaving the throughput gains from the initial PEs unhindered.
The CNV architecture in [103] follows the same reasoning for PE and SIMD allocation. For
µ-CNV, fewer PEs are allocated for the throughput-setters, as this prototype is meant to fit on
embedded FPGAs with less emphasis on high frame rates.
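The rate-matching reasoning above can be written down as a rough allocation heuristic. The sketch below only matches cycle counts against the throughput setter and ignores the LUT/BRAM limits and legal folding factors that the real design must also respect.

```python
def rate_match_pes(layer_ops, setter_cycles, simd=32, pe_choices=(1, 2, 4, 8, 16, 32)):
    """Pick, per layer, the smallest PE count whose estimated cycle count does not
    exceed the throughput setter's cycle count (a rough heuristic only)."""
    allocation = {}
    for name, ops in layer_ops.items():
        for pe in pe_choices:
            cycles = ops / (pe * simd)       # XNOR-popcount ops folded onto PE*SIMD lanes
            if cycles <= setter_cycles:
                allocation[name] = pe
                break
        else:
            allocation[name] = pe_choices[-1]  # cannot match: allocate the maximum
    return allocation
```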
In table 5.2, the hardware utilization for the BinaryCoP prototypes is provided. With µ-
CNV, a significant reduction in LUTs is achieved, which makes the design synthesizable on the
heavily constrained Z7010 SoC. The trade-off is a slight increase in the memory footprint of
the BNN, as the shallower network has a larger spatial dimension before the fully-connected
layers, increasing the total number of parameters after the last convolutional layer. The choice of
[Figure 5.8 plot: binary operations (10^6, bars) and layer-wise latency estimates for the six convolutional and three fully-connected layers of BinaryCoP-n-CNV; the layer with the highest OPs acts as the throughput setter, while deeper layers with low OPs but high memory are allocated fewer PEs.]
Figure 5.8: Binary operations and layer-wise latency estimates based on PE/SIMD choices for BinaryCoP-
n-CNV.
PE count and SIMD lanes for the n-CNV prototype allow it to reach a maximum throughput of
∼6400 classifications per second when its pipeline is full. This high performance can be used
to classify images from multiple cameras in multi-gate settings. The inference power values
reported in table 5.2 show a total power requirement of around 2 W for all prototypes. For single
entrance/gate classifications, all prototypes have an idle power of around 1.65 W. In this setting, a
classification needs to be triggered only when a subject is attempting to pass through the entrance
where BinaryCoP is deployed. The idle power is required mostly by the processor (ARM-Cortex
A9) on the SoC and the board (PYNQ-Z1). This can potentially be reduced further by choosing a
smaller processor to pair with the proposed hardware accelerator. Although the PYNQ-Z1 board
has no power measurement bus (PMBus) to isolate the power measurements of the FPGA from
the rest of the components, we can infer that the hardware accelerator requires roughly 0.4 W for
the inference task from the two measured power values in table 5.2. The current design is still
dependent on the processor for pre- and post-processing, therefore the joint power is reported for
fairness.
[Figure 5.9 (confusion matrix of BinaryCoP on the test set; rows: true class, columns: predicted class Correct / Nose / N+M / Chin):
Correct: 7125 (98%) | 41 (1%) | 1 (0%) | 90 (1%)
Nose: 26 (0%) | 7042 (98%) | 94 (2%) | 26 (0%)
N+M: 4 (0%) | 79 (1%) | 5651 (98%) | 9 (0%)
Chin: 107 (1%) | 41 (1%) | 7 (0%) | 7363 (98%)]
The Grad-CAM attention maps are overlaid on the raw input images for better visualization. All raw images chosen have been classified correctly by all the networks, for fair interpretation of feature-to-prediction correlation.
In figure 5.10-a, the region of interest (RoI) for the correctly masked class is shown. Bina-
ryCoP’s learning capacity allows it to focus on key facial lineaments of the human wearing the
mask, rather than the mask itself. This potentially helps in generalizing on other mask types. For
the child example shown in the first row, the focus of BinaryCoP lies on the nose area, asserting
that it is fully covered to result in a correctly masked prediction. Similarly, for the adult in row 2,
BinaryCoP-CNV focuses on the upper edge of the mask to predict its coverage of the face. This
also holds for our small version of BinaryCoP, with significantly reduced learning capacity. The
RoI traces a thin curve just above the mask, following the exposed region of the face. In the third-row example,
BinaryCoP-CNV falls back to focusing on the mask, whereas BinaryCoP-n-CNV continues to
focus on the exposed features. Both models achieve the same prediction by focusing on different
parts of the raw image. In contrast to the BinaryCoP variants, the FP32 model seems to focus on
a combination of several different features on all three examples. This can be attributed to its
larger learning capacity and possible overfitting.
In figure 5.10-b, we analyze the Grad-CAM output of the uncovered nose class. BinaryCoP-
CNV and BinaryCoP-n-CNV focus specifically on two regions, namely the nose and the straight
upper edge of the mask. These clear characteristics cannot be observed with the oversized FP32
CNN. In figure 5.10-c, the results show the RoI for predicting the exposed mouth and nose
class. All models seem to distribute their attention onto several exposed features of the face.
figure 5.10-d shows Grad-CAM results for chin exposed predictions. Although the top region of
the mask points upwards, similar to the correctly worn mask, the BNNs pay less attention to this
region and instead focus on the neck and chin. With the full-precision FP32 model, it is difficult
Figure 5.10: Grad-CAM output of two BinaryCoP variants and a FP32 CNN. Results are collected for
all four wearing positions on a diverse set of individuals. Binarized models show distinct
regions of interest which are focused on the exposed part of the face rather than the mask.
The FP32 model is difficult to interpret in some cases. It is recommended to view this
figure in color.
to interpret the reason for the correct classification, as little to no focus is given to the chin region,
again hinting at possible overfitting.
Beyond studying the BNNs’ behavior on different class predictions, the attention heat maps can
be used to understand the generalization behavior of the classifier. In figure 5.11 to figure 5.13,
BinaryCoP’s generalization over ages, hair colors and head gear is tested, as well as complete
face manipulation with double-masks, face paint and sunglasses. In figure 5.11, the smaller
eyes of infants and elderly do not hinder BinaryCoP’s ability to focus on the top region of the
correctly worn masks. In figure 5.12, BinaryCoP-CNV shows resilience to differently colored
hair and head-gear, even when having a similar light-blue color as the face-masks (row 2 and
3). In contrast, the FP32 model’s attention seems to shift towards the hair and head-gear for
these cases. Finally, in figure 5.13, both BinaryCoP variants focus on relevant features of the
corresponding label, irrespective of the obscured or manipulated faces. This qualitatively shows that the complex training of BNNs, along with their lower information capacity, constrains them to focus on a smaller set of relevant features, thereby generalizing well to unseen cases.
Figure 5.11: Grad-CAM results for age generalization. It is recommended to view this figure in color.
Figure 5.12: Grad-CAM results for hair/headgear generalization. It is recommended to view this figure
in color.
Figure 5.13: Grad-CAM results for face manipulation with double-masks, face paint and sunglasses. It is
recommended to view this figure in color.
processing them. This application makes use of the high-throughput results presented in table 5.2.
Another approach proposed by Agarwal et al. [160] achieves the task of detecting a range of
personal protective equipment (PPE). Processing takes place on cloud servers, which could raise
privacy and data safety concerns in public settings. Wang et al. [161] propose an in-browser
serverless edge computing method, with object detection models. The browser-enabled device
must support the WebAssembly instruction format. The authors benchmarked their approach on
an iPad Pro (A9X), an iPhone 11 (A13) and a MacBook pro (Intel i7-9750H), achieving 5, 10
and 20 FPS respectively. Needless to say, these devices (or similar) are expensive and cannot
be placed in abundance in public areas. Similarly, [162] offers an Android application solution,
which is suitable for users self-checking their masks. In this case, low-power, edge-hardware,
and continuous surveillance are not emphasized.
BinaryCoP offers a unique, low-power, high-throughput solution, which is applicable to cheap,
embedded FPGAs. Moreover, the BinaryCoP solution is not constrained to FPGA platforms.
Software-based inference of BinaryCoP is also possible on other low-power microcontrollers,
with binary instructions. Training on synthetic data enables generating more samples with
different mask colors, shapes, and sizes [163], further improving the generalizability of the BNNs,
while keeping real-world data available for fine-tuning stages.
5.2.5 Discussion
Applying BNNs to face-mask wear and positioning prediction solves several challenges such
as maintaining data privacy of the public by processing data on the edge-device, deploying
the classifier on an efficient XNOR-based accelerator to achieve low-power computation, and
minimizing the neural network’s memory footprint by representing all parameters in the binary
domain, enabling deployment on low-cost, embedded hardware. The accelerator requires only
∼1.65 W of power when idling on single gates/entrances. Alternatively, high-performance
is possible, providing fast batch classification on multiple gates and entrances with multiple
cameras, at ∼6400 FPS and 2 W of power. An accuracy of up to 98% for four wearing positions
of the MaskedFace-Net dataset was achieved. The Grad-CAM approach was used to study the
features learned by the classifier. The results showed the classifier’s high generalization ability,
allowing it to perform well on different face structures, skin-tones, hair types, and age groups.
BinaryCoP reused many of the semi-automated HW-DNN co-design concepts first introduced
in Binary-LoRAX. Additionally, semi-automated design was incorporated in the training loop,
where synthetic data was generated in an automated manner, and interpretability tools, such
as Grad-CAM, allowed the human to verify the features being learned. Optimizations from
compilers coupled with human-engineered DNN architectures resulted in high FPS performance
and task-accuracy, while maintaining low power in a privacy-preserving, edge setting. Both
BinaryCoP and Binary-LoRAX have shown how semi-automated HW-DNN co-design can
combine human-expertise and compiler optimizations to achieve highly efficient AI that can
impact human lives in a positive way.
6 Fully-Automated Co-Design
The hardware component libraries are fixed, the supported layers are known, the schedule of computations can be modeled accurately, and the execution metrics (power/latency) can be estimated with high fidelity. Now, all that is left to do is find the right configuration in a search space with over ∼10^34 solutions [9]. Calling this solution a “needle in a haystack”
is an understatement, unless the haystack in question is as big as the observable universe and has
as much hay as we have stars. Fortunately, problems of this size emerge often in engineering and
mathematical optimization, giving purpose to the well-established field of metaheuristics. In fully-
automated co-design, search agents involving methods like genetic algorithms (GAs), Bayesian
optimizers, and reinforcement learning (RL) are exploited to traverse such multi-dimensional,
noisy search spaces and return sufficiently good solutions for the target application, in a reasonable
amount of time. For these search agents to function properly, the prerequisites of a well-defined
search space must be fulfilled, namely a finite set of design hyper-parameters and optimization
criteria, as well as an evaluation function, model, or implementation. The evaluation function
is necessary to assess the fitness of the design decisions in each step of the search algorithm.
An evaluation model plays a crucial role here, as it must be accurate, high in fidelity, and fast
enough to quickly provide the reward/fitness value of the design decisions, such that the search
agent can quickly and correctly traverse to the next step. In this chapter, two works are discussed
where this type of HW-DNN co-design is used. In HW-FlowQ [10], a multi-abstraction level
HW-DNN co-design methodology is presented, where a tripartite search space is traversed and
three levels of abstraction are proposed to converge to the final solution in a controlled manner. In
AnaCoNGA [9], two nested GAs jointly search the hardware and neural network design spaces,
while reducing the overall search time compared to a single GA searching the neural network
design space. This novel, fully-automated co-design methodology was nominated for the best
paper award at the Design, Automation and Test in Europe (DATE) conference in 2022.
HW-FlowQ enables the co-design of the target hardware platform and the compressed CNN model through quantization. The search space is viewed at three
levels of abstraction, allowing for an iterative approach for narrowing down the solution space
before reaching a high-fidelity CNN hardware modeling tool, capable of capturing the effects
of mixed-precision quantization strategies on different hardware architectures (processing unit
counts, memory levels, cost models, dataflows) and two types of computation engines (bit-parallel
vectorized, bit-serial). To combine both worlds, a multi-objective non-dominated sorting genetic
algorithm (NSGA-II) is leveraged to establish a Pareto-optimal set of quantization strategies
for the target HW-metrics at each abstraction level. HW-FlowQ detects optima in a discrete
search space and maximizes the task-related accuracy of the underlying CNN while minimizing
hardware-related costs. The Pareto-front approach keeps the design space open to a range of
non-dominated solutions before refining the design to a more detailed level of abstraction. With
equivalent prediction accuracy, energy and latency are improved by 20% and 45% respectively
for ResNet56, compared to existing mixed-precision search methods.
The main contributions of this work can be summarized as follows:
• Exploring single and multi-objective genetic algorithms (SOGA, MOGA) for finding
Pareto-optimal quantization strategies with respect to the underlying hardware platform.
• Modeling vectorized and bit-serial accelerators, with varying resources and dataflows for
mixed-precision quantization, enabling HW-design exploration during CNN optimization.
design parameters, as described in sections 2.3.2.1 and 3.3.2. Interstellar [100] proposes formal dataflow definitions. Unlike Timeloop, Interstellar uses the Halide programming language to
represent the memory hierarchy and data movement constraints. Tetris [108] and Tangram [109]
make use of a dataflow scheduler for DNN workloads on spatial accelerators to test the potential
of other manipulations possible for their memory hierarchies. The mentioned works have proven
the effectiveness of HW-modeling of DNN workloads. Nevertheless, other aspects have not been
explored as thoroughly, such as adding the effects of layer-wise mixed-precision quantization
on the resulting dataflow or supporting mixed-precision computation units. There is a need to
extensively integrate hardware models with CNN optimization algorithms to aid the exploration
of mixed-precision CNNs with respect to the hardware model under consideration, particularly
when multiple hardware architectures are being considered as potential fabrication candidates.
6.1.3 HW-FlowQ
HW-FlowQ is a HW-CNN co-design methodology which facilitates a top-down design approach, iteratively going through different levels of abstraction and performing some iterations of exploration before fixing some parameters and refining the design to one abstraction level lower. Moving away from the hardware-in-the-loop and look-up table approaches of existing works, a model-in-the-loop design flow is implemented here.

Figure 6.1: Overview of the HW-FlowQ methodology. Population P is evaluated on task-related accuracy ψ and hardware metrics ϕ. The three proposed HW-modeling abstraction levels: Coarse, Mid and Fine, enable the genetic algorithm G to consider the hardware metrics of the CNN relative to the current design phase.
HW-FlowQ is based on an interaction between the individuals ρ of population P and the
hardware model µ in the context of a genetic search algorithm G (figure 6.1). In detail, the
genotype of an individual ρ encodes the layer-wise quantization levels for weights and activations
(bW , bA ) of the CNN. The individual’s fitness F is measured through the HW-model estimations
ϕ and CNN accuracy term ψ, computed w.r.t. the images and labels of a validation set. When G
is a SOGA, the objectives ϕ and ψ are combined into a single cost function for fitness evaluation
(section 6.1.3.2). G can also take the form of a MOGA, such as NSGA-II.
To enable the design steps of HW-SW co-design, different representations of the target HW
platform need to be accessible based on the design phase. Starting from an abstracted, high-level
representation makes it possible to coarsely search for HW-CNN combinations that may suit
the application at hand. After some high-level parameters are fixed, a step of refinement can
take place. This brings the exploration to a finer level of detail, but within the scope of the fixed
parameters of the previous abstraction level (figure 6.2). More implementation-specific aspects
can be considered after each refinement iteration. At any stage, if the exploration fails to find any
suitable solutions, an abstraction step can take the design back to a coarser level and re-evaluate
the higher-level parameters which were set. This type of design flow is commonly used in VLSI
design, where complex, large design spaces must be explored at different levels of abstraction,
from system-level down to transistor logic [38, 2].
Three levels of abstraction are offered in HW-FlowQ, namely Coarse, Mid and Fine. Starting
with Coarse-level optimization, the framework can be used to test the effect of quantization
on differently shaped/sized CNNs, given as an input. The total computations required and the
task-related accuracy can be evaluated. The CNN parameters at this level heavily influence the
start of the co-design process, as they set the upper-bound of task-related accuracies possible, as
well as the range of fractional operations and on-chip memory the hardware must accommodate.
After quantization, if the target compression and/or task-related accuracy cannot be met, support
for lower quantization levels needs to be considered and/or new CNN architectures need to be
provided. The quantization training scheme can also be decided at this stage (e.g. DoReFa, PACT,
QNN). It is important to note that HW-FlowQ does not constitute a NAS methodology, but is
rather complementary to such techniques. As an example, a NAS framework can provide HW-
FlowQ with high-accuracy CNN architectures as inputs at the Coarse-level. Then, HW-FlowQ can
quantize them optimally for target hardware designs, as well as facilitate designing customized
hardware for the proposed CNNs. Once the CNN(s) meets the high-level requirements, the
Mid-level evaluates the feasibility of different memory hierarchies to buffer and move data
before reaching the on-chip computation units. Parameters such as data transfer volumes,
computation-to-communication (CTC) ratio and off-chip memory accesses can be searched.
This information can help in deciding which off-chip to on-chip communication infrastructure
and bandwidth is suitable to meet the application constraints. The CNN can further be quantized
with this HW-model-in-the-loop setup, in order to close the gap between the hardware constraints
and the computation/communication requirements, while maintaining the task-accuracy goals.
Performing one more iteration of refinement takes the design to the Fine-level. At this stage,
the hardware computation architecture can be defined. Precise information on the supported
quantization levels, number of computation units available, register file sizes, supported data
movements, and more, can be provided. HW-FlowQ provides high-fidelity estimates of the
benefits that can be achieved on the prospective hardware design, for a particular quantization
strategy (figure 6.5). Details on the Fine-level modeling are provided in the next sections.
Coarse → Mid → Fine → Compile/Synthesize
Figure 6.2: Iterative refinement increases the likelihood of finding the global optimum. Flow inspired by [2].

Coarse (Phase 1). Inputs: baseline CNN, quantization support, quantization method, computation method. Estimates ϕ: fractional OPs, memory footprint, quantization accuracy. Optimizations: fractionalize OPs, compression.
Mid (Phase 2). Inputs: off-chip to on-chip communication, memory hierarchy, memory size/partitioning. Estimates ϕ: CTC ratio, off-chip accesses, tile volumes. Optimizations: loop tiling, loop reordering.
Fine (Phase 3). Inputs: compute architecture, PE specification, supported dataflows. Estimates ϕ: total energy, per-datatype energy, total latency, HW-utilization. Optimizations: loop unrolling, interleaving, folding/mapping, detailed execution schedule.
Figure 6.3: Input, output and optimization details of the HW-model µ abstraction levels used at each phase. After refinement, the inputs of the preceding phase are inherited to the next.

Considering all design parameters holistically would imply searching all possible quantization strategies for all candidate CNNs (Coarse), on all possible on-chip/off-chip communication and on-chip memory sizes (Mid), for all possible dataflows, compute array sizes, multiplier types and register dimensions (Fine). This would ultimately waste an immense amount of GPU
hours, searching for solutions which could have been eliminated at the Coarse-level already.
Additionally, with so many search parameters, the convergence of the search algorithm becomes
more difficult to guarantee, potentially leading to sub-optimal results. To address this challenge,
the step-wise optimization in HW-FlowQ’s Coarse, Mid and Fine levels along with the Pareto-
front-based quantization approach (NSGA-II) promotes a design-flow which leads to improved
synergies in the final HW-CNN implementation and a more practical approach to searching the
three large search spaces of CNN architecture, quantization strategy, and hardware design.
Figure 6.3 summarizes the inputs required at each level, the optimization that the HW-model
µ can perform internally at each phase and the output estimates which can be used to evaluate
the hardware-related fitness Fρ of different individuals in population P. Traversal between the
levels is indicated by refinement and abstraction arrows. The decision on whether the search
takes a step of refinement can be inferred from a list of application constraints. For example, a
maximum number of fractional operations needs to be met, before a transition between Coarse to
Mid can take place. Similarly, a desired off-chip communication constraint can be set, before the
design transitions between Mid to Fine. If a certain constraint cannot be met, the framework must
reconsider the inputs of the current level (e.g. at Coarse reconsider baseline CNN architecture, at
Mid reconsider memory hierarchy, etc.). If changing the inputs of the level does not meet the
targets, the inputs of the level above are reconsidered (abstraction). Through this progressive
filtering of design decisions at each level, the output of the overall framework meets the desired
application targets at the end of the flow.
Finding the correct layer-wise quantization strategy for both weights and activations with respect
to a target HW-model is a complex problem which would benefit from gradient-free optimization
due to the discrete nature of the solution space. The search space for the quantization strategy
alone consists of Q^{2L} solutions, where Q is the set of possible quantization levels and L is
the number of layers. Quantizing some layers leads to larger drops in accuracy than others,
Figure 6.4: Layer-wise genome encoding allows for intuitive use of genetic operators (crossover, mutation)
to capture and maintain good localities of bitwidth-to-layer encodings from two fit parents
into their offspring.
and different accuracy drops can take place at different quantization levels for the same layer.
Moreover, quantization strategies change the mapping and scheduling space of the CNN on the
hardware, as explained in section 3.2. For example, a quantization strategy might make new
schedules possible, which lead to sudden drops in latency and energy, as soon as a particular
computation tile fits the on-chip memory after quantization. In this work, GAs are leveraged to
tackle the quantization strategy search problem, as they are known to be resilient to noisy search
spaces, quick to prototype, and do not need smooth, continuous search spaces to perform well.
Explicit, bijective encoding is used to create the genomes of potential solutions as shown in
figure 6.4. A single genome represents a potential CNN quantization strategy and has as many
genetic loci as there are layers in the CNN. Each genetic locus encapsulates a tuple of integer
bitwidth values for weights and activations (bW , bA ) at the corresponding layer. The set of
possible alleles at each genetic locus is defined by the bitwidths supported by the HW-model,
i.e. Q. Bitwidth-to-layer encoding can be captured intuitively in sequential genomes, which
leads to a sensible use of GA operators, such as single-point crossover (example in figure 6.4).
Neighboring CNN layers have higher feature correlation than distant layers. Therefore, quantized
layer relationships encoded in neighboring genetic loci can survive in a population and be reused
through single-point crossover to create more efficient offspring. The more fit the parents become
throughout the generations, the better the genetic localities they pass on to create fitter individuals.
Mutation further allows offspring to escape local minima of their parents.
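A minimal sketch of this encoding and of the genetic operators of figure 6.4 is given below (Python). The bitwidth set Q, the layer count, and the mutation probability are illustrative assumptions, not values taken from the HW-FlowQ implementation.

import random

Q = [2, 4, 6, 8, 16]   # assumed set of bitwidths supported by the HW-model
NUM_LAYERS = 20        # illustrative CNN depth

def random_genome(num_layers=NUM_LAYERS):
    # One genetic locus per layer, each holding a (bW, bA) tuple drawn from Q.
    return [(random.choice(Q), random.choice(Q)) for _ in range(num_layers)]

def single_point_crossover(parent_a, parent_b):
    # Swap the tails of two parents at a random locus, preserving neighbouring
    # bitwidth-to-layer localities on either side of the cut.
    cut = random.randint(1, len(parent_a) - 1)
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]

def mutate(genome, p_mut=0.4):
    # With probability p_mut, replace one allele at a random locus by another
    # value from the supported set Q.
    genome = list(genome)
    if random.random() < p_mut:
        locus = random.randrange(len(genome))
        genome[locus] = (random.choice(Q), random.choice(Q))
    return genome

parent_1, parent_2 = random_genome(), random_genome()
child_1, child_2 = single_point_crossover(parent_1, parent_2)
print(mutate(child_1))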
Referring back to figure 6.1, on the top-left an initial population P0 is randomly generated at
the start of the genetic algorithm G, with each individual encoding the quantization levels of each
layer of the CNN in its genes. The individuals of P0 are briefly fine-tuned and evaluated based
on their task-accuracy ψ on a validation set (figure 6.1 top-right), as well as hardware estimates
ϕ of the HW-model through inference simulation (figure 6.1 bottom-right). Based on the GA
configuration, ψ and ϕ define the fitness of each individual ρ0 ∈ P0 . As depicted in figure 6.1, ψ
and ϕ are fed back to a selection phase in G, to constrain the cardinality of the population |P|.
Individuals survive this phase based on their fitness. Survivors are allowed to mate and produce
offspring in P1 , which inherit alleles from two survivor parents through crossover. A round of
mutation takes place, altering alleles of the offspring in P1 . The population goes through the
same phases of fitness evaluation, selection and crossover for the subsequent generations.
Single and multi-objective genetic algorithms are explored in this work. Both GAs share the same
evolutionary flow described earlier, but are different in their observation of fitness. By definition,
SOGA maximizes a single reward. Since our problem inherently involves multiple objectives (ψ
and ϕ), a balanced reward function must be defined to combine them into a single fitness value F
to apply SOGA.
\[
\mathcal{F}_\rho =
\begin{cases}
\left(1 - \dfrac{\psi^{*}-\psi}{t}\right)\cdot \log\!\left(\dfrac{\phi^{*}}{\phi}\right), & \text{if SOGA}\\[6pt]
(\psi,\ \phi), & \text{otherwise (NSGA-II)}
\end{cases}
\tag{6.1}
\]
The fitness definition of SOGA in equation 6.1 is inspired by the cost function proposed
in [37]. ψ ∗ and ϕ∗ are the task-related accuracy and the hardware estimates of the uncompressed
CNN, respectively. The function balances the improvements in hardware efficiency log(ϕ∗ /ϕ)
while trying to maintain task-related accuracy through the term (1 − (ψ ∗ − ψ)/t). t sets a
threshold on accuracy degradation: once the difference between ψ∗ and ψ reaches or exceeds t, the accuracy term drops to zero or below, rendering the fitness Fρ of individual ρ unacceptable.
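For illustration, the SOGA branch of equation 6.1 translates directly into the following helper; the threshold t and the example values are assumptions for demonstration only.

import math

def soga_fitness(psi, phi, psi_ref, phi_ref, t):
    # Equation 6.1 (SOGA case): balance the hardware improvement log(phi*/phi)
    # against the accuracy degradation relative to the threshold t.
    accuracy_term = 1.0 - (psi_ref - psi) / t
    hw_term = math.log(phi_ref / phi)
    return accuracy_term * hw_term

# Example: a 1.5 p.p. accuracy drop with t = 5 p.p. and a 2x energy reduction.
print(soga_fitness(psi=91.0, phi=16.4, psi_ref=92.5, phi_ref=32.8, t=5.0))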
In the case of NSGA-II optimization, the algorithm evaluates the Pareto optimality of each
individual with respect to the population P. This relieves the burden of crafting a single fitness
function, which may not always guarantee a fair balance between multiple objectives. Addition-
ally, having an array of potential solutions in a Pareto-front is a better approach for design space
exploration, compared to having a single solution suggested by the search algorithm. Design
space exploration is a fundamental part of HW-SW co-design making NSGA-II an attractive
alternative to SOGA.
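The core of the NSGA-II alternative is Pareto dominance. A compact sketch of extracting the non-dominated set from a population of objective tuples is shown below; both objectives are formulated as maximization and the values are illustrative.

def dominates(a, b):
    # a dominates b if it is no worse in every objective and strictly better
    # in at least one (all objectives maximized here).
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated(population):
    # Return the Pareto-front of a population of objective tuples.
    return [p for p in population
            if not any(dominates(q, p) for q in population if q is not p)]

# Illustrative (accuracy, hardware-reward) pairs, both maximized.
print(non_dominated([(90.1, 65.6), (89.2, 67.8), (88.4, 78.7), (90.0, 60.0), (89.4, 51.0)]))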
Considering the accuracy-related fitness term ψ, the quantization strategies of a population P
need to be evaluated in a reasonable amount of time to avoid a bottleneck in the search process.
To address this problem, quantized networks in P are not fully trained during the search. Instead,
the CNN model is instantiated and loaded with pre-trained floating-point weights, then quantized
according to the genome of ρ and briefly fine-tuned to recover from the accuracy loss introduced
by the direct quantization process. This process can also be parallelized, as the individuals within
a population can be fine-tuned at the same time, on a single or multiple GPUs. The learning
behavior of 2-bit, 4-bit and 6-bit networks was analyzed, to see how early the training curves can
be differentiated. This gives an early indication of the accuracy that can be expected at the end of a full training cycle. The accuracy fitness evaluation epochs in section 6.1.6 were chosen accordingly.
The GA essentially evaluates the learning capacity of the individual, not its final fully-trained
accuracy. At the end of the search, when a solution is chosen, it is trained from scratch, without
loading any pre-trained weights. It is worth mentioning that fast accuracy predictors, such as the
ones proposed in [84, 170], could also be used for the purpose of fast accuracy-related fitness
evaluation in HW-FlowQ’s GA.
The mutation, crossover and selection operations are pivotal to the GA’s efficacy. Single-point
crossover is applied, which intuitively has a high probability of capturing attractive bitwidth-to-
layer encodings of two fit individuals and maintains inter-layer dependencies across segments of
the CNN, as shown in figure 6.4. With mutation probability pmut a single allele at a randomly
selected genetic locus is replaced by another from the set of possible alleles, Q. All individuals
conform with the CNN and the quantization levels supported on the hardware. Tournament
selection is used for SOGA, where tournaments take place to decide all the survivors. NSGA-II
selection is based on the crowded-comparison-operator [171].
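Putting the operators together, a skeletal single-objective generational loop with tournament selection could look as follows. The fitness function is a deterministic placeholder standing in for the fine-tuned accuracy and HW-model terms of equation 6.1, and all hyper-parameters are illustrative assumptions.

import random

Q = [2, 4, 6, 8, 16]                      # assumed supported bitwidths
NUM_LAYERS, POP_SIZE, GENERATIONS = 20, 25, 50
P_MUT = 0.4

def random_genome():
    return [(random.choice(Q), random.choice(Q)) for _ in range(NUM_LAYERS)]

def fitness(genome):
    # Placeholder for the combined reward of equation 6.1: reward low bitwidths
    # (hardware term) but penalize very aggressive quantization as a crude
    # stand-in for the accuracy term.
    hw_reward = sum(32 - (b_w + b_a) for b_w, b_a in genome)
    acc_penalty = sum(4 - min(b_w, b_a) for b_w, b_a in genome if min(b_w, b_a) < 4)
    return hw_reward - 10 * acc_penalty

def tournament(population, k=3):
    # Tournament selection: the fittest of k randomly drawn individuals survives.
    return max(random.sample(population, k), key=fitness)

def crossover(a, b):
    cut = random.randint(1, NUM_LAYERS - 1)
    return a[:cut] + b[cut:]

def mutate(genome):
    genome = list(genome)
    if random.random() < P_MUT:
        genome[random.randrange(NUM_LAYERS)] = (random.choice(Q), random.choice(Q))
    return genome

population = [random_genome() for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    survivors = [tournament(population) for _ in range(POP_SIZE // 2)]
    offspring = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                 for _ in range(POP_SIZE - len(survivors))]
    population = survivors + offspring

print(max(population, key=fitness))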
This work focuses on modeling spatial architectures similar to [172, 4, 100, 3], with an on-chip
buffer and a compute core with an array of PEs, as depicted in figure 6.1. The energy cost
of data accesses depends on the technology and the size of the memory. HW-FlowQ supports
independent read-write costs for off-chip communication, memory blocks, as well as the register
files (RFs) of the PEs on the compute blocks. The cost models are inspired by the approach
proposed in [3, 173, 4, 100]. A normalized energy cost is set for each operation that can take
place on the architecture. The HW-model attempts to map the computations of a particular CNN
workload efficiently onto the HW-model. For each schedule, the HW-model is able to extract the
number of actions (reads, writes, MACs) required at each level of the accelerator as explained
in section 3.3.2. The number of actions is multiplied by the cost of each action on each type of
memory/compute unit. The exact normalized energy costs chosen in this work align with the
Timeloop [4] framework and the Eyeriss model in [3], as shown in table 6.1. HW-FlowQ also
supports manually setting costs for each action, based on different fabrication technologies.
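A sketch of this accounting step is shown below; the cost values loosely follow the ordering of table 6.1 (DRAM, SRAM, array, register file), the MAC cost is an added assumption, and the action counts are illustrative.

# Hypothetical normalized energy cost per action type.
COSTS = {"dram": 200.0, "sram": 13.84, "array": 2.0, "rf": 1.0, "mac": 1.0}

def normalized_energy(action_counts, costs=COSTS):
    # Multiply the number of actions extracted by the scheduler at each level
    # of the accelerator by the normalized cost of one such action.
    return sum(count * costs[level] for level, count in action_counts.items())

# Illustrative action counts for one layer's schedule.
print(normalized_energy({"dram": 1.2e6, "sram": 9.5e6, "rf": 4.1e7, "mac": 3.8e7}))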
Scheduler and Mapper. Modern compute architectures allocate a considerable amount of
their power budget for memory accesses and data movement [173]. Moreover, redundant data
movement can have a significant impact on latency. This has made efficient scheduling of CNNs
on spatial hardware an active field of research [3, 109, 108, 172, 100, 4].
The three main techniques commonly used to optimize a nested loop’s execution on hardware
were introduced in section 2.3.2, namely loop tiling, reordering and unrolling. HW-FlowQ’s
scheduler and mapper components handle loop optimization techniques largely in a similar
manner to the popular frameworks Interstellar [100] and Timeloop [4]. Here, the additional
considerations to capture the effect of mixed-precision quantization are discussed.
Quantization shrinks the bitwidth of datatypes allowing larger computation tiles to fit in a given
lower-level memory. This increases the number of possible loop tiling and reordering schedules.
Loop optimization through unrolling is handled by HW-FlowQ’s mapper component and is
dependent on both the dataflow supported by the accelerator and the mixed-precision computation
technique. It is important to note that when unrolling fractionalized (quantized) computations on
a vectorized or multi-lane bit-serial PE-array, a single PE may handle more spatially distributed
computations, as long as its register files fit the operands/partial sums needed/generated by said
computations. This can be exploited by HW-FlowQ’s mapper to find more efficient schedules
which fit on a smaller physical computation array and require less PE-to-PE data movement.
Depending on the defined HW-model, the scalar or SIMD vector-engines in the PE can be word-
aligned, making some quantization degrees less attractive than others. An example of sub-optimal
SIMD-register usage is marked with a red-cross in figure 6.1 (middle-right). HW-FlowQ can also
model bit-serial compute units such as the ones in [66], in which case, a relative improvement for
any quantization level for weights and/or activations can be achieved on the compute block. The
word alignment on the compute block can be set differently to that of the outer memory blocks
and the off-chip memory interconnect.
In the convolution operation, PSUMs can grow after each accumulation to a maximum of
2bW/A +Ci , where bW/A is the bitwidth of the operands. The HW-model considers instances of the
largest possible PSUM, according to the maximum bitwidth bmax supported by the accelerator.
The increase in vector throughput due to quantization of Wl and Al−1 is constrained to the
maximum amount of PSUM RF memory available on the PE. After complete accumulation, a
speed-up can be achieved in writing back the output pixels at the quantization level b^l_A of the following layer's input Al.
Vectorized and Bit-Serial Computation. To estimate the benefits in the computation of
low-bitwidth and mixed-precision CNNs, vectorized and bit-serial compute units are modeled [8,
174, 66]. The choice of the computation unit has a direct influence on the schedule, as it affects
how many computation cycles are required for a particular operation and how many unique
computations can be assigned to the same hardware at different bitwidths.
For vectorized accelerators, an aligned SIMD-MAC unit is modeled, which has a maximum
bitwidth of bmax for both weights and activations. A speed-up through data-level parallelism at
the PE-level can happen at Vspeedup integer steps, as shown in equation 6.2.
\[
V_{\text{speedup}} = \left\lfloor \frac{b_{\max}}{\max(b_W,\, b_A)} \right\rfloor
\tag{6.2}
\]
Vspeedup is the vectorization degree aligned with the wider operand between bW and bA . This
not only allows for more parallel computations in the same cycle, but also reduces the memory
access cost at the register file level, which would now access Vspeedup data that fit into the SIMD-
register with bitwidth bmax in a single read operation. The limitation of vectorized computation
units is that they can only perform complete operations, and therefore cannot always fully
exploit any arbitrary bitwidth. For example, a 16-bit vector unit can perform 2 complete 8-bit
computations, however, if the operands were 6-bits each, a non-integer speed-up of ∼2.67 would
not be possible. Another limitation is that variable bitwidths of bW ≠ bA cannot be exploited for
higher parallelism even if both are aligned, due to the max(bA , bW ) term in equation 6.2. The
wider of the two operands dictates the number of parallel computations which fit in the vector
engine.
Bit-serial computation units can fully exploit any level of quantization for both operands. Their
performance is enhanced with respect to a bmax computation according to
\[
BS_{\text{speedup}} = \frac{b_{\max}^{2}}{b_W \times b_A}
\tag{6.3}
\]
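For illustration, equations 6.2 and 6.3 translate directly into the following helpers; b_max = 16 matches the 16-bit baseline used in this chapter, and the example bitwidths are illustrative.

def vectorized_speedup(b_w, b_a, b_max=16):
    # Equation 6.2: integer data-level parallelism of an aligned SIMD-MAC unit;
    # the wider operand dictates how many operand pairs fit into b_max bits.
    return b_max // max(b_w, b_a)

def bit_serial_speedup(b_w, b_a, b_max=16):
    # Equation 6.3: relative speed-up of a bit-serial unit over a b_max x b_max
    # computation; cycles scale with b_w * b_a.
    return (b_max ** 2) / (b_w * b_a)

# 6-bit operands on a 16-bit engine: vectorization cannot exploit the
# fractional ~2.67x (only 2x), while bit-serial execution reaches ~7.1x.
print(vectorized_speedup(6, 6), bit_serial_speedup(6, 6))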
It is important to note that BSspeedup cannot be directly compared to Vspeedup due to the inherent
differences between how both architectures perform a single bmax computation. Bit-serial units
require bW × bA cycles to complete a single computation, whereas bit-parallel units (irrespective of the operand bitwidths) complete one computation per cycle.
Figure 6.5: 16-bit AlexNet validation of HW-FlowQ with Eyeriss [3] and Timeloop [4] (top-left), as well as 8, 4 and 2-bit vectorized execution. Validation with Timeloop on DeepBench workloads [5] (top-right). Bit-serial execution of AlexNet (bottom-left) and DeepBench (bottom-right).
interconnect (Mid) or the on-chip memory size (Mid) in between. The criteria are separated to
divide the complexity of CNN structure search (Coarse), interconnect/memory hierarchy search
(Mid), and compute architecture search (Fine).
Figure 6.6: Analysis of DRAM accesses and computational throughput as a function of on-chip buffer size at different levels of hardware abstraction and quantization.
In figure 6.6, the effect of changing the numerical precision of ResNet18 for ImageNet on
off-chip (DRAM) accesses and computational throughput is investigated. These metrics can
be evaluated at all three abstraction levels, which makes them useful in highlighting the flow
between the levels. At the Coarse-level, the DRAM accesses are estimated as the CNN’s total
necessary reads and writes for all datatypes (inputs, weights and outputs). Since the Coarse-level
abstraction is agnostic to the on-chip memory details, all its corresponding dashed curves are
constant. However, among the Coarse-level curves, the difference in read/write volumes at each
quantization level (left plot), as well as the speed-up possible through vectorization (right plot) is
still captured.
Moving on to the Mid-level, the model can capture more details of the hardware. In this
case, the limitations of an under-dimensioned on-chip memory or an insufficient off-chip to
on-chip communication bandwidth can be detected. The Mid-level estimates are sensitive to
on-chip memory and communication, but semi-agnostic to the compute architecture. For this
example, the bandwidth of off-chip to on-chip communication is set to 8 bytes/cycle. On the
DRAM accesses plot, the solid lines approach the dashed (ideal) curves, as the on-chip memory
grows. More importantly, at lower numerical precisions, the Mid-level estimates meet the
corresponding Coarse-level estimates at smaller on-chip memory sizes. Moreover, the Mid-
level’s limited information on the computation architecture can still be used to detect bottlenecks
in communication and/or on-chip memory size. The Mid-level abstracts the details of the
compute architecture through the assumption that all PEs are fully utilized and can always
perform computations, if sufficient data is available. In the throughput plot, communication
bottlenecks for small on-chip memory sizes can be observed, which are not able to provide
the ideal computation architecture with enough data to fully utilize it. These communication
bottlenecks are more evident for CNN models consisting of multiple fully-connected layers
(AlexNet and VGG-16). This behavior does not change with numerical precision, since smaller
bitwidths also increase the ideal computation throughput of vectorized PEs (i.e. Coarse-level
estimates get better).
At the Fine abstraction level, the model considers the CNN, the memory hierarchy, and the
compute architecture details (register files, dataflow, mapping, etc.). For this example, the
validated Eyeriss-256-DB from table 6.1 is used, with varying SRAM sizes. The Fine-level dotted
curves approach the Mid and Coarse curves at a slower rate, as the on-chip memory increases.
This is due to the other limitations of the computation architecture, which the Fine-level takes into
account (e.g. sub-optimal unrolling, limited register file sizes, etc.). The Fine-level provides much
more information (as shown in figure 6.2), but for the purpose of highlighting the cross-abstraction
level interactions, these are not discussed in this section.
The different bitwidth ResNet18s have different Coarse lines (dashed), which limit the the-
oretical optimum DRAM accesses and throughput. The Mid and Fine lines (solid and dotted),
which capture more hardware details, never surpass their respective dashed lines which are
defined at the Coarse stage. The search at the Coarse level provides these theoretical optimal
performance levels for a range of mixed-precision CNN quantization strategies, while subsequent
levels try to reach that optimum, by parameterizing the hardware. For example, if the target was
to achieve the theoretical best performance with respect to 4-bit Coarse, either the hardware can
be over-dimensioned, (dotted-diamond line at 512KB of on-chip memory) or the CNN can be
quantized down to 2-bit with an on-chip memory of 32KB (Mid and Fine triangle lines of 2-bit
touch/surpass the 4-bit Coarse line). Both options allow us to reach the theoretical optimum
set by Coarse for 4-bit, but each option would have a different effect on accuracy, where the
over-dimensioned hardware would achieve higher accuracy due to larger bitwidths, while the
2-bit CNN would have lower accuracy but a smaller on-chip memory design.
From figure 6.6, a multi-abstraction flow can be deduced, which can help the designer eliminate
hardware and CNN candidates at early stages of the co-design, without having to spend costly
GPU hours on training or synthesis and HIL-based testing.
6.1.6 Evaluation
HW-FlowQ is evaluated based on CIFAR-10 [27] and ImageNet [26] datasets for the classification
task and Cityscapes [147] for the semantic segmentation task. The 50K train images of CIFAR-10
are used for training and accuracy fitness ψ evaluation, while the 10K test images are used for
final accuracy evaluation at the end of the search. The images have a resolution of 32×32 pixels.
ImageNet consists of ∼1.28M train and 50K validation images with a resolution of 256×256
pixels. The Cityscapes dataset consists of 2975 training images and 500 test images. The images
of size 2048×1024 show German street scenes along with their pixel-level semantic labels of 19
classes.
Section 6.1.6.1 and section 6.1.6.3 highlight the flexibility of the HW-modeling tool and
search approach. To isolate and identify the effects of changing the HW-model on the resultant
quantization strategy, all other variables of the experiment are fixed, including the CNN workload
(ResNet20). In section 6.1.6.2, the hyper-parameters of NSGA-II and its convergence are studied.
Here, the task is made more complex by enlarging the quantization search space, and employing a
deeper 56-layer CNN. In section 6.1.6.4, HW-FlowQ is applied to a different task domain, namely
semantic segmentation. The DeepLabv3 [175] model is used to study the effects of layer-wise
quantization on the encoder, bottleneck layers (including the atrous spatial pyramid pooling
(ASPP) block), and the decoder layers of the segmentation network. Finally, in section 6.1.6.5,
HW-FlowQ is compared with state-of-the-art methods of uniform and variable quantization,
further testing on wide and high resolution CNNs (ResNet18 for ImageNet). If not otherwise
mentioned, all hyper-parameters specifying the task-related training were adopted from the
CNN’s base model and its corresponding quantization method. The first and last layers are
kept at 16-bits, following the heuristic of other quantization works [35, 34, 36]. The hardware
metrics are generated based on the hardware configurations described in table 6.1. The vectorized
Spatial-256 HW-model with row-stationary dataflow is used in section 6.1.6.5 with additional
support for 1-bit (XNOR-Net). As a Coarse-level metric, fractional operations (Frac. OPs) is
used as a measure of CNN computation compression, with respect to the hardware it is executed
on. For example, Frac. OPs of a vectorized accelerator are computed as the layer-wise sum of
operations over the speed-up due to the respective layer’s quantization:
\[
\mathrm{Frac.\,OPs} = \sum_{l=0}^{L} \frac{\mathrm{OPs}_l}{V^{\,l}_{\text{speedup}}}
\tag{6.4}
\]
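A minimal sketch of equation 6.4 for a vectorized accelerator, reusing the integer speed-up of equation 6.2, is given below; the layer operation counts and bitwidths are illustrative.

def frac_ops(layer_ops, bitwidths, b_max=16):
    # Sum the per-layer operation counts, each divided by that layer's
    # vectorization speed-up V_speedup (equation 6.2).
    total = 0.0
    for ops, (b_w, b_a) in zip(layer_ops, bitwidths):
        total += ops / (b_max // max(b_w, b_a))
    return total

# Three illustrative layers quantized to 8, 4 and 2 bits.
print(frac_ops([41e6, 20e6, 10e6], [(8, 8), (4, 4), (2, 2)]))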
Table 6.1: Hardware configurations and normalized access energy costs used for experiments and validation.

HW-Model         | PE Array | DRAM Cost | SRAM Size | SRAM Cost | Array Cost | Registers Size (filter, ifmap, psums) | Registers Cost
Spatial-168*     | 12×14    | 200       | 128KB     | 6         | 2          | 224, 12, 16 Words                     | 1
Spatial-256*     | 16×16    | 200       | 256KB     | 13.84     | 2          | 224, 12, 16 Words                     | 1
Spatial-1024*    | 32×32    | 200       | 3072KB    | 155.35    | 2          | 224, 12, 16 Words                     | 1
Eyeriss-1024     | 32×32    | 200       | 3072KB    | 155.35    | 2          | 224, 37, 16 Words                     | 1
Eyeriss-256 - DB | 16×16    | 200       | 128KB     | 7.41      | 0          | 192, 12, 16 Words                     | 0.99
*: Same dimensioning for bit-serial (BS) and other dataflows (RS, OS, WS)
For experiments on CIFAR-10, the population size |P| is set to 25 and 50 for exploration
and comparison with state-of-the-art experiments respectively. The number of generations is
fixed to 50 for all CIFAR-10 experiments. Probabilities for mutation and crossover are set to
0.4 and 1.0 respectively. For ImageNet experiments, |P| and the number of generations are
scaled down to 10, while Cityscapes experiments use |P|=25 for 10 generations. The CNNs
trained on CIFAR-10 are fine-tuned for 2 epochs and their accuracy fitness is evaluated on 10K
random samples from the train-set during the search. For ImageNet, 0.4 epochs of fine-tuning
are performed before evaluating on the valid-set. For Cityscapes, 10 epochs are necessary to
evaluate the candidate population. As explained in section 6.1.3.2, the accuracy fitness (ψ) is the
GA’s measure of the learning capacity of an individual. To avoid artificially biasing the search
algorithm towards individuals that perform well on the test set, the accuracy fitness is restricted
to the train or validation set. This way, the framework does not indirectly “see” the test set during the
search. After the search concludes, we fully train the chosen individual from scratch and report
its test set accuracy as “Accuracy Top-1” in the result tables.
Table 6.2: ResNet20 for CIFAR-10 quantized at different abstraction levels of the Spatial-256 hardware with SOGA and NSGA-II.

Configuration (<ϕ>; <level>; <G>) | Accuracy Top-1 [%] | Accuracy ψ Fitness [%] | HW-ϕ Fitness [%]* | N. Energy [×10^7] | Latency [×10^3 cyc.]
Baseline (16 bit)                 | 92.47 | -     | -     | 32.84 | 191
Frac. OPs; Coarse; SOGA           | 89.28 | 88.44 | 79.64 | -     | -
Frac. OPs; Coarse; NSGA-II        | 90.09 | 92.80 | 73.09 | -     | -
DRAM acc.; Mid; SOGA              | 89.18 | 91.93 | 67.79 | -     | -
DRAM acc.; Mid; NSGA-II           | 90.00 | 95.33 | 65.56 | -     | -
N. Energy; Fine; SOGA             | 89.45 | 91.18 | 51.05 | 16.07 | 52
N. Energy; Fine; NSGA-II          | 90.09 | 94.75 | 48.12 | 17.04 | 61
Latency; Fine; SOGA               | 88.44 | 86.21 | 78.71 | 14.94 | 41
Latency; Fine; NSGA-II            | 89.99 | 94.78 | 68.77 | 17.05 | 59
*: Measured as (1 - (Compressed Metric / Baseline Metric)) × 100
Once satisfied with the CNN’s compression potential, the search can be refined to take the
off-chip to on-chip memory movement into consideration. The number of processing passes
(rounds of communication between off-chip to on-chip) necessary to complete all the compu-
tations of a CNN can be estimated. The layer tiling and loop ordering can be searched for
different quantization strategies. For this example, the Mid-level estimates are based on DRAM
accesses when the on-chip buffer is dimensioned to 256KB. The CNN’s DRAM accesses can be
reduced by around 65% with respect to the 16-bit baseline CNN, while maintaining the same
accuracy that was targeted at the Coarse-level. Based on the bandwidth of the off-chip to on-chip
communication infrastructure considered, this can confirm that our dimensioning of the on-chip
buffer is in a good range to achieve a significant reduction of DRAM accesses, without having to
over-quantize our CNN and lose the task-related accuracy goal. Finally, the Fine-level estimates
give us a better understanding of how our CNN can be scheduled on a particular HW-model.
For this example, we proceed with the Spatial-256 configuration presented in table 6.1, with
row-stationary dataflow. Normalized energy can be reduced by around 50%, while maintaining
the target Top-1 accuracy from the higher abstraction levels. When considering latency, we
observe the drawback of the SOGA approach, not being able to decently balance accuracy and
the hardware-reward. Although the emerging solution maximizes the latency reward signifi-
cantly (78.71%), it leads to a considerable accuracy degradation (88.44%). The reward function
(equation 6.1) was designed to balance both accuracy and hardware-rewards, however, due to
the high potential of improving latency through quantization, we find that the SOGA algorithm
was willing to sacrifice the train reward ψ (down to 86.21%) to get a much larger overall reward
through latency ϕL . This highlights the weakness of handcrafting reward functions to achieve
multi-criteria optimization. On the other hand, NSGA-II offers a range of solutions, from which
a well-balanced solution is shown in table 6.2, reducing latency by 68.77% and maintaining
the task-related accuracy since the Coarse-level. The Pareto-optimal solutions for latency vs.
accuracy range from Top-1: 90.63% at 60.00% ϕL reduction, to Top-1: 89.37% at 72.74% ϕL
reduction. The set of all Pareto-front solutions is not shown in the table.
Figure 6.7: 2-D projections of three 3-D Pareto-fronts for ResNet56 quantization: left to right (|P|, generations) = (25, 25), (25, 50), (50, 50). Grey to black shades represent Pareto-fronts of older to newer generations, red points belong to the final Pareto-front. It is recommended to view this figure in color.
The solutions that are most attractive are those which offer a trade-off among the optimization criteria. For (|P|,
generations) = (25, 50) and (50, 50), the points which contribute the most to the total Pareto-front
hypervolume (lie at the apex of the convex Pareto-front) are comparable in hardware metrics
and accuracy fitness. The (|P|, generations) = (25, 25) configuration has solutions of equivalent
accuracy fitness, however, their hardware metrics are worse. With these insights, the number of
generations is fixed to 50 for all CIFAR-10 experiments to get better convergence. |P| is set to 25
for exploration experiments and 50 for comparison with state-of-the-art experiments.
The hypervolume occupied by the 3-D frontier can be measured at each generation to derive the
search convergence. As a decision-making technique, a reference point is extrapolated from the
polar solutions (worst in each dimension) of the final Pareto-front, and the farthest solution from
it in the frontier is found, based on Euclidean distance. In this work, this solution is referred to as
the hypervolume-leader (HV-leader), which offers a balanced trade-off among the Pareto-points
of the frontier.
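A sketch of this decision rule is shown below, assuming all objectives have already been normalized and sign-adjusted so that larger is better; the point values are illustrative.

import math

def hv_leader(pareto_points):
    # Build a reference point from the worst value of each objective over the
    # final Pareto-front, then return the point farthest from it (Euclidean).
    dims = range(len(pareto_points[0]))
    reference = [min(p[d] for p in pareto_points) for d in dims]
    def distance(p):
        return math.sqrt(sum((p[d] - reference[d]) ** 2 for d in dims))
    return max(pareto_points, key=distance)

# Three normalized (accuracy, energy-reward, latency-reward) points; the
# balanced individual (0.8, 0.8, 0.8) is selected as the HV-leader.
print(hv_leader([(1.0, 0.1, 0.1), (0.8, 0.8, 0.8), (0.1, 1.0, 0.2)]))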
Figure 6.8: 2-D projections of 3-D Pareto-fronts of 3 exploration experiments on ResNet20 for CIFAR-10 for hardware dimensioning, bit-serial processing and dataflow variants. It is recommended to view this figure in color.
The Pareto-fronts of the bit-serial accelerators are more diverse, breaking the almost linear relationship between optimal energy and latency mapping observed
for vectorized accelerators (figure 6.8-a). This can be attributed to both the change in compute
architecture and the variations possible for both bA and bW . In table 6.4, similar latency and
energy trends can be observed for bit-serial computation, as with vectorized computation for
batch-size 1. A Pareto-optimal solution is chosen for each hardware and trained to achieve an
accuracy above 90%. The smallest BS-168 is the slowest, yet the most energy efficient, while
BS-1024 significantly reduces the latency at the cost of more energy for data movement. The
benefit of batch processing is more prominent for bit-serial accelerators. The 256-PE
bit-serial accelerator with batch processing offers a significant improvement in terms of energy,
bringing the 256-PE configuration to better energy efficiency than the smaller 168-PE counterpart
executing batch-size 1 inputs. Additionally, latency also gets a decent improvement of 13.5%.
This shows the advantages of relaxing the bA = bW constraint on the hardware and the GA
search.
To further analyze this aspect, the layer-wise quantization strategy chosen by the GA for batch
sizes 1 and 4 on bit-serial accelerators are presented in figure 6.9. Layers with large activation
volumes can have lower bitwidth activations (low bA ), while the weights can be kept at a slightly
higher bitwidth (higher bW ). The opposite can be done for layers with large filter volumes. This
extends the improvements to be gained on mixed-precision accelerators and larger batch sizes
(i.e. larger activation volumes). In figure 6.9, the quantization strategy chosen for batch size of 4
reflects the GA’s attempt to compress the large activations more aggressively than for batch size
of 1, particularly for the first half of the CNN. To compensate for the potential accuracy loss, the
GA maintains larger weight bitwidths bW for batch = 4. The resulting quantized CNNs of both
batch 1 and batch 4 have an equivalent accuracy (∼90.3%), but with a noticeable improvement in
hardware metrics for batch size of 4, due to the GA taking the capabilities of the hardware into
account.
Quantization on Different Dataflows. To demonstrate the effect of quantization on dataflows,
a WS dataflow and an OS dataflow are presented. WS unrolls computations in dimensions Co and
Ci over the processing element array, while OS unrolls Xo and Yo , and replicates the unrolling
over Co . Both WS and OS support channel interleaving in order to maximize their register
utilization, similar to the RS dataflow.
Figure 6.9: Layer-wise bitwidth strategy for BS-256 hardware. Batch size 1 (left) and 4 (right). NSGA-II compensates for larger activations (batch=4) by lowering bA and maintains accuracy by increasing bW, when compared to batch=1 inference.
The baselines in table 6.5 show RS is the most energy-efficient, while OS offers the best
latency. WS is placed in the middle in terms of latency but has worse energy efficiency when
compared to the other considered dataflows. The Pareto-fronts of quantization strategies in
figure 6.8-c demonstrate the effect of dataflows on three accelerators, which are otherwise
identical in dimensioning. WS proves to be highly sensitive to quantization, having many unique
non-dominated combinations of ϕE , ϕL and ψ. Generally, WS is the least efficient in terms of
latency and energy, for a particular train accuracy ψ. OS dataflow enjoys its lead in latency, due
to a higher potential of unrolling as a result of quantization over vectorized PEs (each vectorized
PE acts as Vspeedup virtual PEs). Consequently, its energy rivals that of RS. The higher parallelism
degree on a single SIMD-vector engine reduces the total cost of MAC operations. Furthermore,
since the loop unrolling is taking place across the array as well as within the vectorized PEs,
fewer PE-to-PE hops are required to achieve the unrolling of the mapper, resulting in less array
data movement energy.
The semantic segmentation task is critical to applications in robotics and autonomous driving.
High-quality segmentation can be more computationally complex by several orders of magnitude,
when compared to classification tasks (e.g. table 6.6). This is related to both the typically larger input image resolution and the additional layers needed for semantic segmentation (bottleneck, ASPP block and decoder layers).

Figure 6.10: Layer-wise bitwidths (bW = bA) of a DeepLabv3 Pareto-choice strategy with 67.3% mIoU on Cityscapes. Short and parallel layers have bA equal to their respective bottom layer.
For the DeepLabv3 network executing on Eyeriss-1024 (details in table 6.1), HW-FlowQ must
adapt to the task’s training challenges, particularly on low-bitwidth (≤4-bit) configurations for
PACT quantization, which often lead to exploding gradients. Despite this difficultly of PACT,
HW-FlowQ produced the Pareto-choice candidate shown in figure 6.10, which achieved 67.3%
mean intersection over union (mIoU) with a 21.6% reduction in fractional operations over uniform
8-bit PACT quantization of DeepLabv3 (shown in table 6.6). In figure 6.11, qualitative semantic
segmentation results are shown for uniform 8-bit PACT and the HW-FlowQ Pareto-choice, for
three example scenes in the Cityscapes dataset. These results show the strong potential of mixed-precision low-bitwidth quantization on complex semantic segmentation tasks, which could be pushed further with more advanced quantization techniques under HW-FlowQ in future work.
Figure 6.11: Qualitative results of DeepLabv3 quantization on Cityscapes scenarios. Black regions have
no ground-truth labels. Pareto-choice has 21.6% Frac. OPs compression compared to
uniform 8-bit PACT. It is recommended to view this figure in color.
Table 6.6: Comparison of HW-FlowQ with state-of-the-art quantization methods on Eyeriss-256 Vectorized.

Model/Dataset       | Method                  | Accuracy Top-1 [%] | F.Ops [×10^6] | N. Energy [×10^7] | Latency [×10^3 cycles]
ResNet20 / CIFAR-10 | Baseline (16-bit)       | 92.47              | 41            | 33                | 191
ResNet20 / CIFAR-10 | DoReFa-Net (4-bit) [35] | 89.75              | 10            | 16                | 51
*: Executed on Eyeriss-1024
The MOGA approach through NSGA-II inherently supports multi-criteria optimization. The designer does not need to handcraft a reward function which fairly captures all the optimization targets in one reward value. For the ImageNet experiment, a HV-leader with an accuracy of 67.02% and hardware estimates comparable to PACT (4-bit) was achieved in only 10 generations and |P|=10.
6.1.7 Discussion
HW-FlowQ optimizes CNNs by finding quantization strategies based on high-fidelity HW-model-
in-the-loop setups. Abstraction levels and design phases inspired by VLSI design flows help
in systematically narrowing down hyper-parameters for both the CNN and hardware design,
exposing HW-CNN co-design synergies. Exploring vectorized and bit-serial compute engines,
the performance trade-offs for different mixed-precision workloads can be exploited by the GA.
The effectiveness of NSGA-II was demonstrated, offering a Pareto-optimal set of quantization
strategies for different HW-models during the optimization process. The HW-models introduced
in this work provide the metaheuristic method, i.e. the genetic algorithm, with all the necessary
details to autonomously make decisions on CNN design and compression. Transitions between
the design abstraction levels can also take place in an automated manner, whenever target
application constraints are met. The effectiveness of models, abstraction levels, and metaheuristics
is demonstrated in this work, providing a comprehensive example of fully-automated HW-CNN
co-design.
6.2 AnaCoNGA: Analytical HW-CNN Co-Design using Nested Genetic Algorithms
6.2.1 Introduction
An example of a HW-DNN co-design scenario is the numerical quantization of a CNN and the
hardware design of a bit-serial accelerator. As mentioned in previous chapters, CNNs benefit from
layer-wise and datatype variable numerical precision [7]. To extract the mentioned benefits of
variable numerical precision, a hardware accelerator can employ bit-serial computation units [6].
Such an accelerator can have an array of spatially distributed computation units and a distributed
on-chip buffer to efficiently provide the computation array with data.
For an automated co-design framework to effectively arrive at a solution that meets an appli-
cation’s constraints, a large and complex solution space must be explored. This motivates the
development of lightweight, easily reconfigurable HW-models, which can be used to evaluate
design choices in this complex space [101, 4]. Harnessing the speed and flexibility of such
HW-models can enable the parallel co-design of HW and CNN, without prohibitive synthesis or
cycle-accurate simulation bottlenecks.
AnaCoNGA embeds hardware architecture search (HAS) into quantization strategy search (QSS) in a nested GA formulation. The main contributions of
this work can be summarized as follows:
• We insert the HAS loop into the QSS loop. For each potential QSS, the HAS loop efficiently
evaluates a 4-D HW design Pareto-front. After synthesis, the HW-CNN co-designed pair
achieve a 35% and 37% reduction in latency and DRAM accesses, while achieving 2.88
p.p. higher accuracy compared to a 2-bit ResNet20-CIFAR-10 executing on a standard
edge variant of the accelerator.
NHAS [88] aims to find an optimal quantized CNN architecture using an evolutionary algorithm.
An efficient hardware dimensioning for the compute array and on-chip memory is searched to
accelerate a pool of CNN workloads used as a benchmark. After the hardware is configured,
the CNN search space is explored. The hardware evaluation follows a look-up table approach
due to the smaller quantization search space considered. This sequential approach of co-design
can be enhanced by including the hardware design search within the CNN search loop. Other
works which target joint HW-CNN co-design are [167] and [176], both of which include the
hardware’s performance in the reward function of an RL-agent and iteratively tune both the CNN
and hardware architectures. Fine HW-level details, such as scheduling schemes and quantized
execution, are not explored in [167], as the optimization loop targets optimally partitioning the
CNN workload over a pool of FPGAs. In [176], layer-wise quantization is not supported.
AnaCoNGA is a nested co-design approach which performs hardware design in parallel with the quantization search, leading to a tight coupling between the HW and CNN, without iteratively or sequentially switching between the two domains. The classification of the mentioned works is shown in table 6.7.
6.2.3 Methodology
In this section, the three main components of AnaCoNGA are presented, namely the analytical
accelerator model, the QSS algorithm, and the HAS algorithm, followed by integration of the
components into the framework.
Figure 6.12: High-level abstraction of a bit-serial accelerator [6]: the dimensions Dm, Dn, Dk determine the tiling degree of matrices RHS and LHS.
With this analysis, workload execution metrics can be evaluated with respect to hardware pa-
rameters such as compute array dimensions Dm , Dn , Dk , as well as on-chip buffer sizes for LHS
and RHS without having to synthesize the HW each time. The analytical model introduced
in this section is used to perform fast exploration and design of the hardware as an example.
It is important to note that AnaCoNGA is not limited to this hardware analytical model; the
optimization loops introduced in the next sections can potentially be used to harness the speed and
flexibility of more advanced fast analytical hardware models in literature, such as CoSA [101],
GAMMA [102] or ZigZag [177].
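To give an idea of what such a lightweight analytical model can look like, the toy latency estimator below tiles an M×K by K×N matrix multiplication onto a Dm×Dn array with dot-product depth Dk and charges b_lhs × b_rhs bit-serial passes per tile. It is an illustrative sketch under simplified assumptions (no memory stalls, perfect utilization), not the model used in AnaCoNGA, and all dimensions are placeholders.

import math

def bit_serial_cycles(M, K, N, b_lhs, b_rhs, Dm=8, Dk=256, Dn=8):
    # Number of array-sized tiles needed to cover the LHS (MxK) and RHS (KxN)
    # matrices, each tile requiring b_lhs * b_rhs bit-serial passes.
    tiles = math.ceil(M / Dm) * math.ceil(K / Dk) * math.ceil(N / Dn)
    return tiles * b_lhs * b_rhs

# A 2-bit x 2-bit layer finishes 16x faster than its 8-bit x 8-bit version
# on the same array dimensions under this toy model.
print(bit_serial_cycles(256, 1152, 196, 8, 8), bit_serial_cycles(256, 1152, 196, 2, 2))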
The quantization strategy search (QSS) is essentially the search space introduced in HW-FlowQ
(section 6.1.3.2). To recap, for an L-layer CNN, there are Q^{2L} solutions, where Q is the set of
possible quantization levels for weights and activations. It is important to note that QSS can be
applied to any quantization technique (DoReFa [35], PACT [36], or others), as it only tries to
find the best bitwidths for each layer and datatype. A MOGA is used to tackle the multi-criteria
optimization problem of maximizing accuracy and minimizing hardware execution complexity.
No hardware design takes place in this search.
An initial population P0 is randomly generated, with each genome encoding a quantization
strategy, i.e. a quantization tuple (W^bits_l, A^bits_{l−1}) for each layer of the CNN (explicit, bijective
encoding). The genomes of P are briefly fine-tuned and evaluated based on their task accuracy on a validation set. When using standalone QSS, the GA must additionally consider the fitness of the quantized CNNs on hardware estimates (DRAM accesses and computation cycles). Based on these three fitness metrics, the Pareto optimality of each individual is identified with respect to the population P. The population then goes through phases of selection, crossover, and mutation for the subsequent generations, producing Pareto-optimal CNN quantization strategies.
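As an illustration of the Pareto-ranking step, a minimal dominance check over the three objectives could look as follows; the function names and the tuple convention (all objectives minimized, accuracy negated) are assumptions of this sketch, and a full NSGA-II [171] implementation additionally uses non-dominated sorting ranks and crowding-distance selection.

```python
def dominates(a, b):
    """a dominates b if it is no worse in every objective and strictly
    better in at least one (all objectives are to be minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def first_pareto_front(population, fitness):
    """Return the non-dominated quantization genomes of the population.
    `fitness(genome)` yields e.g. (-accuracy, dram_accesses, compute_cycles)."""
    return [p for p in population
            if not any(dominates(fitness(q), fitness(p))
                       for q in population if q is not p)]
```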
Figure 6.13: Validation of the HW-model vs. real HW measurements for compute cycles and DRAM
accesses on three BISMO configurations (HW1-3). Small and large workloads are verified
from ResNet20-CIFAR-10 (left) and ResNet18-ImageNet (right).
this challenge sequentially or iteratively. To perform true co-design, both hardware and CNN
need to be jointly and concurrently considered.
One approach is to combine HAS genomes with QSS genomes into one GA. However, this would result in a prohibitively complex and large search space (4.77 × 10^37 for ResNet20 on the considered bit-serial accelerator), with many direct and indirect relationships between the hardware and quantization parameters. The complex, joint search space would also necessitate larger populations and more generations for the GA, leading to excessive GPU hours. Another approach could be to iterate between the two search spaces [88, 167, 176]. An iterative approach leads to the same dilemma: the hardware is initially biased towards one quantization strategy, so a newly found HW-CNN combination remains sub-optimal with respect to a combination derived from a different prior quantization strategy.
To tackle this challenge, the two genetic algorithms are nested in AnaCoNGA, as shown in figure 6.14. On the one hand, the HAS GA requires roughly 1.5 minutes to execute for 200 generations with 200 hardware genomes and can be parallelized. This is due to the fast analytical HW-model in section 6.2.3.1 and the LUT/BRAM utilization models proposed in [65]. On the other hand, the QSS genetic algorithm requires a few epochs of fine-tuning to evaluate the accuracy of a potential quantization genome. This can be a costly fitness evaluation process for larger networks and datasets. When nesting the HAS GA into the QSS GA, we can exploit the speed of the HAS loop to evaluate the hardware design Pareto-front for each considered quantization genome (parallel HAS blocks in figure 6.14). In each HAS experiment, a 4-D Pareto-front of hardware designs is generated for the respective quantization genome. The 4-D hardware Pareto-front is checked for solutions that meet our target hardware constraints. If no solution in the HAS Pareto-front satisfies our hardware requirements, then the QSS receives a signal to skip the genome's fine-tuning step and assign it a null accuracy, without wasting any GPU training time (feedback line from HAS to QSS in figure 6.14). With this approach, the QSS is relieved from optimizing hardware metrics and can now be reformulated into a single-objective
genetic algorithm (SOGA), which is solely focused on improving the accuracy of the quantized CNNs.
[Figure 6.14 diagram: the QSS loop (SOGA selection, crossover, mutation) spawns a separate, parallelized, fast analytical HAS MOGA (NSGA-II selection, crossover, mutation; roughly 1.5 minutes) for each CNN quantization strategy of a generation.]
Figure 6.14: AnaCoNGA: Each individual from QSS executes its own HAS MOGA. Any QSS individual can prove itself efficient on its own hardware design to get a chance for its accuracy to be evaluated. QSS is relieved from optimizing hardware and is transformed into a SOGA (i.e. accuracy-focused).
The nesting essentially allows each quantization genome to evaluate its own hardware design space before it is accepted into the population. Therefore, two radically different QSS genomes could both meet the target hardware constraints (DRAM accesses, computation cycles, BRAM, and LUTs) by finding specialized hardware designs in their respective HAS explorations. This way, the hardware design remains flexible (undefined) on the scale of the overall experiment, but is guaranteed to exist for any genome eventually chosen by the QSS at the end of the search. AnaCoNGA's design loops enable the use of analytical HW-models such as [101, 102, 177], harnessing their speed and flexibility to achieve parallel, fully-automated HW-CNN co-design.
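The nesting described above can be summarized by the following simplified sketch of one QSS generation; `has_search`, `finetune_and_eval`, the `HWTargets` fields, and the selection and variation operators are illustrative placeholders rather than the exact implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class HWTargets:
    dram: float     # maximum DRAM accesses
    cycles: float   # maximum computation cycles
    bram: int       # maximum BRAM blocks
    luts: int       # maximum LUTs

def meets(hw, t: HWTargets) -> bool:
    return (hw.dram <= t.dram and hw.cycles <= t.cycles
            and hw.bram <= t.bram and hw.luts <= t.luts)

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(genome, alleles=(2, 4, 8), p=0.5):
    return [random.choice(alleles) if random.random() < p else g for g in genome]

def qss_generation(population, has_search, finetune_and_eval, targets: HWTargets):
    """One generation of the accuracy-only (SOGA) QSS with a nested HAS.
    `has_search(genome)` returns the 4-D hardware Pareto-front found by the
    fast analytical HAS MOGA for that quantization genome;
    `finetune_and_eval(genome)` is the costly GPU-based accuracy evaluation."""
    scored = []
    for genome in population:
        hw_front = has_search(genome)          # minutes, parallelizable
        if not any(meets(hw, targets) for hw in hw_front):
            scored.append((genome, 0.0))       # null accuracy, no GPU time spent
        else:
            scored.append((genome, finetune_and_eval(genome)))
    # Accuracy-only selection: hardware feasibility is already guaranteed.
    scored.sort(key=lambda item: item[1], reverse=True)
    parents = [g for g, _ in scored[: max(2, len(scored) // 2)]]
    return [mutate(crossover(*random.sample(parents, 2))) for _ in population]
```

The key property of this loop is that the expensive fine-tuning call is only spent on genomes that have already proven a feasible hardware design in their own HAS exploration.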
6.2.4 Evaluation
AnaCoNGA is evaluated on CIFAR-10, CIFAR-100, and ImageNet datasets. The 50K train and
10K test images of CIFAR-10 and CIFAR-100 are used to train and evaluate the quantization
strategies. ImageNet consists of ∼1.28M train and 50K validation images. After an ablation
study, we set the population size and number of generations to 50 for QSS GAs on ResNet20.
Probabilities for mutation and crossover are set to 0.5 and 1.0, respectively. For ResNet56, we
reduce the running population size |P| to 25. For ResNet18-ImageNet experiments, |P| is set to 25
and the number of generations is reduced to 25. The CNNs trained on CIFAR-10 are fine-tuned for
3 epochs and evaluated on 10K random samples during the search. For ImageNet, we fine-tune for
1.5 epochs before evaluating on the validation set. The quantization method for ResNet20 experiments
is DoReFa [35], while deeper (ResNet56) and higher resolution (ResNet18) experiments use the
PACT method [36]. Results denoted with (2, 4-bit) indicate 2-bit weights and 4-bit activations.
For comparison, binarized variants are trained using the XNOR-Net method [34].
The Xilinx Z7020 SoC on the PYNQ-Z1 board is used as the target platform for all hardware
experiments in table 6.10 and figure 6.16, with all designs synthesized at a 200 MHz target
clock frequency. For HAS experiments, both the population size and generations are set to 200,
since no significant improvement was observed for larger experiments. Mutation and crossover
probabilities are set to 0.4 and 1, respectively. In table 6.10, AnaCoNGA’s nested HAS GA uses
the respective (2, 4-bit) configuration’s hardware performance as its hardware constraint/target.
Table 6.9 shows the valid alleles which can be used in genomes. The ≢ symbol indicates the parameters that can take different values within a genome. Valid alleles are chosen within ranges
of dimensions synthesizable on the Z7020. Larger designs (larger search space) are feasible on
larger FPGAs [65].
strategies. Another observation is that the shift is not only due to the more efficient execution
metrics of 4-bit vs. 8-bit, but also due to new legal scheduling options on differently dimensioned
accelerators. This can be seen in the non-overlapping circle and cross markers of the BRAM vs.
LUTs 2-D projection plot, indicating different hardware dimensions being optimal for the 4-bit
and 8-bit CNNs, while respecting the resource limitations of the Z7020 FPGA. The 2-D projection
of computation cycles against LUTs for ResNet18 shows the effect of legality checks in the
analytical model on the shape of the Pareto-front. The cross marks (8-bit) do not extend beyond
28K LUTs of logic, indicating that larger computation arrays cannot allocate the associated
BRAM requirements to fit a single minimum-sized tile of computation for one or more of the
ResNet18 layers (i.e., design not synthesizable, or computation not possible). Therefore, the size
of synthesizable computation arrays with sufficient BRAM to feed the array with minimum tile
sizes is restricted for the 8-bit CNN. On the other hand, the circle markers (4-bit) of the same
plot extend to larger LUT utilization, indicating the existence of large synthesizable compute
arrays, with sufficient BRAM to load smaller tiles of the smaller 4-bit ResNet18.
Further HAS results are presented in table 6.10, applied to the strategies found in the QSS
experiments of the previous section (labeled QSS+HAS). From the resulting Pareto-fronts,
candidates with the lowest execution and DRAM access cycles are chosen for synthesis, without
exceeding the resource utilization of the HW3 BISMO choice from table 6.8. The GA finds
non-trivial asymmetric hardware configurations (Dm ≠ Dn), which exploit the placement of the tensors W^l and A^(l-1) in either the LHS or RHS matrix. The asymmetric hardware allows the
scheduler to swap the position of weights and activations in the middle of the CNN execution, to
maintain high computation efficiency and low DRAM accesses, by reusing the datatype placed in
the RHS matrix of the computation. This naturally reduces the LUT and BRAM requirements
of the design. Figure A.2 shows the layer-wise execution details of a 4-bit ResNet18-ImageNet
on an asymmetric Dm × Dn × Dk = 8×14×96 hardware configuration found through HAS. For
comparison, the same workload is executed on the symmetric HW3, which has higher theoretical
peak binary trillion operations per second (TOPS) (6.55 binary TOPS vs. 4.30 binary TOPS). The
HAS solution heavily reduces the amount of DRAM accesses for all the layers. This indicates
better tiling dimensions and compute efficiency η_OPs with respect to the workloads, which naturally
brings down the computation cycles and reduces the chances of stalls. For the HAS asymmetric
solution, layers 1-5 and 12-17 are executed with weights on the LHS, while other portions of the
CNN are executed with weights on the RHS. The layers are not schedulable otherwise, indicating that the HAS solution is tightly coupled with the schedule and the legality checks of the analytical model: it allocates just enough resources to guarantee that at least one efficient and legal schedule exists for each workload of the CNN, thereby reducing the FPGA's resource utilization. In table 6.10, the real hardware measurements of sequential co-design (QSS+HAS) show a clear advantage over all standalone QSS CNNs, dramatically lowering their DRAM accesses and latency to below or on par with a 1-bit strategy executing on standard BISMO dimensions from [7], with less LUT and BRAM required for the design.
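The layer-wise swap can be sketched as a simple legality-and-cost check on top of the analytical model of section 6.2.3.1; the `Layer` fields and the ranking by compute cycles and then DRAM traffic are assumptions of this sketch, reusing the illustrative estimate_layer function from the earlier listing.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    M: int        # output rows of the layer's GEMM
    N: int        # output columns
    K: int        # reduction dimension
    w_bits: int   # weight bitwidth
    a_bits: int   # activation bitwidth

def choose_placement(hw, layer: Layer, estimate_layer):
    """Evaluate both operand placements of a layer and return the cheaper
    legal one as (placement, (cycles, dram_bytes)), or None if neither
    placement is schedulable on this hardware configuration."""
    candidates = {}
    # Weights on LHS, activations on RHS.
    cost = estimate_layer(hw, layer.M, layer.N, layer.K,
                          lhs_bits=layer.w_bits, rhs_bits=layer.a_bits)
    if cost is not None:
        candidates["weights->LHS"] = cost
    # Swapped placement: activations on LHS, weights on RHS (transposed GEMM).
    cost = estimate_layer(hw, layer.N, layer.M, layer.K,
                          lhs_bits=layer.a_bits, rhs_bits=layer.w_bits)
    if cost is not None:
        candidates["weights->RHS"] = cost
    if not candidates:
        return None
    # Rank primarily by compute cycles, then by DRAM traffic.
    return min(candidates.items(), key=lambda kv: kv[1])
```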
[Figure 6.15 plots: 2-D projections with axes compute cycles, LUTs, BRAM (bytes), and DRAM accesses. (a) HAS for ResNet20-CIFAR-10, 4-bit (blue) and 8-bit (red) uniform; the bottom row shows the corresponding projections for ResNet18, annotated with "Larger Compute Array for 4-bit Workloads" and "BRAM limitation for 8-bit Workloads".]
Figure 6.15: HAS: 2-D projections of a 4-D Pareto-front in a multi-objective search space. The GA optimizes for hardware resources (LUTs, BRAM)
and performance metrics (DRAM accesses, execution cycles) for ResNet20 (top) and ResNet18 (bottom). The proposed analytical model
allows for fast exploration and evaluation of solutions.
edge BISMO variant used in [7]. An improvement is observed in task-related accuracy for
all AnaCoNGA solutions over sequential co-design (QSS+HAS). This can be attributed to the
accuracy-focused SOGA implemented in the QSS of AnaCoNGA, which leaves the HAS to be
handled by the nested MOGA (recall figure 6.14). Furthermore, the nested HAS allows more
diverse, high-accuracy quantization individuals to survive through QSS, as each QSS individual
can find its own hardware design to meet the application constraints.
[Figure 6.16 bar charts: accuracy (%), compute cycles, DRAM accesses [MB], and compute/memory overlap for 1-bit, 2-bit, 2-4-bit, 4-bit, and AnaCoNGA variants. (a) ResNet20-CIFAR-10, (b) ResNet56-CIFAR-100, (c) ResNet18-ImageNet.]
Figure 6.16: Breakdown of execution on synthesized hardware. Higher DRAM accesses are correlated
with lower compute efficiency and stalls. AnaCoNGA reduces latency and DRAM accesses
while maintaining high accuracy.
Table 6.10: Quantization and hardware design experiments. Uniform and standalone QSS are executed on a standard edge variant (HW3) used in [7]. Latency and DRAM are measured on hardware.

| Model | Work | Acc [%] | LUT Util | BRAM Blocks | Latency [K cycles] | DRAM Acc. [MB] | HW Config. Dm×Dn×Dk, LHS/RHS Buf | Peak Bin. TOPS |
|---|---|---|---|---|---|---|---|---|
| ResNet20 (CIFAR-10) | XNOR (1-bit) [34] | 83.98 | 32639 | 135 | 501 | 2.26 | 8×8×256, 256 KB / 256 KB | 6.55 |
| ResNet20 (CIFAR-10) | DoReFa (2-bit) [35] | 87.16 | 32639 | 135 | 659 | 3.29 | 8×8×256, 256 KB / 256 KB | 6.55 |
| ResNet20 (CIFAR-10) | DoReFa (2,4-bit) [35] | 88.98 | 32639 | 135 | 817 | 4.43 | 8×8×256, 256 KB / 256 KB | 6.55 |
| ResNet20 (CIFAR-10) | DoReFa (4-bit) [35] | 89.75 | 32639 | 135 | 944 | 5.35 | 8×8×256, 256 KB / 256 KB | 6.55 |
| ResNet20 (CIFAR-10) | QSS (standalone) | 89.44 | 32639 | 135 | 798 | 4.17 | 8×8×256, 256 KB / 256 KB | 6.55 |
| ResNet20 (CIFAR-10) | QSS+HAS | 89.44 | 29687 | 55 | 422 | 1.99 | 8×16×64, 32 KB / 32 KB | 3.28 |
| ResNet20 (CIFAR-10) | AnaCoNGA | 90.04 | 29671 | 55 | 428 | 2.08 | 8×16×64, 16 KB / 32 KB | 3.28 |
| ResNet56 (CIFAR-10) | XNOR (1-bit) [34] | 85.61 | 32639 | 135 | 1212 | 5.73 | 8×8×256, 256 KB / 256 KB | 6.55 |
| ResNet56 (CIFAR-10) | PACT (2-bit) [36] | 90.28 | 32639 | 135 | 1710 | 8.93 | 8×8×256, 256 KB / 256 KB | 6.55 |
| ResNet56 (CIFAR-10) | PACT (2,4-bit) [36] | 92.97 | 32639 | 135 | 2172 | 12.41 | 8×8×256, 256 KB / 256 KB | 6.55 |
| ResNet56 (CIFAR-10) | PACT (4-bit) [36] | 93.27 | 32639 | 135 | 2585 | 15.37 | 8×8×256, 256 KB / 256 KB | 6.55 |
| ResNet56 (CIFAR-10) | QSS (standalone) | 91.89 | 32639 | 135 | 2120 | 11.98 | 8×8×256, 256 KB / 256 KB | 6.55 |
| ResNet56 (CIFAR-10) | QSS+HAS | 91.89 | 29643 | 79 | 1242 | 5.44 | 4×32×64, 16 KB / 64 KB | 3.28 |
| ResNet56 (CIFAR-10) | AnaCoNGA | 92.31 | 29638 | 79 | 1315 | 5.83 | 4×32×64, 8 KB / 64 KB | 3.28 |
| ResNet56 (CIFAR-100) | XNOR (1-bit) [34] | 57.70 | 32639 | 135 | 1212 | 5.73 | 8×8×256, 256 KB / 256 KB | 6.55 |
| ResNet56 (CIFAR-100) | PACT (2-bit) [36] | 64.66 | 32639 | 135 | 1710 | 8.93 | 8×8×256, 256 KB / 256 KB | 6.55 |
| ResNet56 (CIFAR-100) | PACT (2,4-bit) [36] | 70.91 | 32639 | 135 | 2172 | 12.41 | 8×8×256, 256 KB / 256 KB | 6.55 |
| ResNet56 (CIFAR-100) | PACT (4-bit) [36] | 71.65 | 32639 | 135 | 2585 | 15.37 | 8×8×256, 256 KB / 256 KB | 6.55 |
| ResNet56 (CIFAR-100) | QSS (standalone) | 69.52 | 32639 | 135 | 2054 | 11.60 | 8×8×256, 256 KB / 256 KB | 6.55 |
| ResNet56 (CIFAR-100) | QSS+HAS | 69.52 | 29638 | 79 | 1240 | 5.45 | 4×32×64, 8 KB / 32 KB | 3.28 |
| ResNet56 (CIFAR-100) | AnaCoNGA | 70.68 | 29643 | 79 | 1420 | 6.23 | 4×32×64, 16 KB / 64 KB | 3.28 |
| ResNet18 (ImageNet) | XNOR (1-bit) [34] | 52.51 | 32639 | 135 | 14090 | 64.93 | 8×8×256, 256 KB / 256 KB | 6.55 |

The latency and DRAM accesses of AnaCoNGA and QSS+HAS variants are comparable to or better than a single-bit network executing on the handcrafted accelerator. As mentioned in section 6.2.4.3, the HAS MOGA finds asymmetric hardware designs and relies on the scheduler
to switch the order of weights and activations in LHS or RHS, depending on the layer being
executed. This leads to lower LUT and BRAM requirements for all the HAS-based designs, while
executing more efficiently than the oversized HW3. All AnaCoNGA-based hardware designs
are smaller (fewer peak binary TOPS) than HW3, but achieve better performance due to their
tightly-coupled dimensioning, which improves their compute efficiency. To better understand
AnaCoNGA’s hardware performance, the total execution time is split and the cycles are measured
with respect to compute, as well as the non-overlapping cycles spent on other parts of the pipeline
(stall cycles). This data is presented in figure 6.16. Although the HAS genetic algorithm is not aware of pipeline stalls, it optimizes for minimal compute cycles and lower DRAM accesses, and the latter in particular is correlated with fewer pipeline stalls. Hardware designs with these traits naturally bring down stall cycles, leading to a higher overlap of compute and memory accesses. In figure 6.16-a, the AnaCoNGA solution indeed has higher compute cycles than the 1-bit variant due to its higher bitwidths, which results in a higher-accuracy CNN. However, its DRAM accesses are well-optimized, since HAS designs an accelerator with efficient compute tiles and high compute efficiency η_OPs. This results in fewer stall cycles and ultimately brings the total execution latency below that of the 1-bit network on the HW3 edge BISMO design, while maintaining a task-related accuracy higher than that of a uniform 4-bit solution. Similar trends can be observed in figure 6.16
for ResNet56 and ResNet18 as well, achieving lower execution metrics than 2-bit CNNs and
maintaining high task-related accuracies.
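A rough way to reproduce the breakdown shown in figure 6.16 from measured counters is sketched below; the assumption that stall cycles are simply the non-compute portion of the total latency is a simplification of the measurement described in the text.

```python
def execution_breakdown(total_cycles: int, compute_cycles: int):
    """Split a measured execution into compute cycles and stall cycles
    (cycles not overlapped with computation), and report the fraction of
    the total latency covered by useful compute."""
    stall_cycles = max(0, total_cycles - compute_cycles)
    compute_fraction = compute_cycles / total_cycles if total_cycles else 0.0
    return {"compute": compute_cycles,
            "stall": stall_cycles,
            "compute_fraction": compute_fraction}
```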
AnaCoNGA also brings benefits in terms of reduced GPU hours. For ResNet20 and ResNet56
on CIFAR-10, QSS and AnaCoNGA were run on a single NVIDIA Titan RTX GPU. The
search took 14 hours for ResNet20 with AnaCoNGA, which is a 51% reduction with respect
to standalone QSS. For ResNet56, a 24% reduction in GPU hours was achieved, leading to 34
hours of search time. Overall, by checking the 4-D Pareto-fronts of all QSS genomes against the hardware constraints, the nested HAS allowed the SOGA to skip the accuracy evaluation of genomes that had no promising hardware designs.
6.2.5 Discussion
AnaCoNGA is a HW-CNN co-design framework using two GAs, QSS and HAS, combined in
a novel nested scheme to eliminate handcrafted reward functions, iterative switching between
the two domains, and fine-tuning CNN genomes with sub-optimal HW-design spaces. The
speed and flexibility of analytical HW-models were harnessed to achieve true parallel co-design,
while reducing the overall search time when compared to iterative or sequential approaches.
This fully-automated co-design approach requires no intervention of human experts during the
optimization process and achieves non-trivial synergies which would be very difficult for an
expert to predict. Counterintuitively, by searching both the hardware and neural network design
spaces, the optimization was faster than only searching one design space. This lowers the effort
for both the ML and the HW engineer and achieves better results in both domains. The accuracy of ResNet20 on CIFAR-10 was improved by 2.88 p.p. compared to a uniform 2-bit CNN, while latency and DRAM accesses were improved by 35% and 37%, and LUT and BRAM resources were reduced by 9% and 59%, respectively, compared to an edge variant of the accelerator. AnaCoNGA is a prime example of how metaheuristic techniques and well-defined,
parameterizable analytical models can provide fully-automated co-design in the final development
stages of a DNN deployment.
7 Conclusion & Outlook
Complexities of abstract artificial algorithms are only truly understood when their implementation in the real world is needed. This is exacerbated when the algorithm and
the execution medium are designed in a segregated manner. Co-design brings algorithm
and medium under the same scope, manifesting a real-world implementation with synergies that
bring them closer to algorithms observed in nature.
This dissertation presented several challenges in implementing DNNs on hardware. These
challenges were hard to resolve without compromises in DNN and/or hardware design targets.
The pitfalls of incoherent co-design were highlighted with examples of how compromises in DNN
targets do not result in benefits on the target hardware and vice versa. This problem statement was
addressed by applying classical concepts from the field of VLSI design and HW-SW co-design.
These included different methodologies, executable models, and design abstraction levels.
Handcrafted methodologies were presented, where the designer’s conceptual understanding of
the design challenge is itself the problem formulation [8, 16]. Semi-automated methods were used
when parts of the design challenge were solvable with computational models but still guided by human
designers [24, 25]. Fully-automated methods tackled challenges with prohibitively large search
spaces, but well-defined models and evaluation criteria [9, 10, 11]. The designer essentially takes
their hands off the wheel and allows metaheuristic methods to search for the optimal parameters for
both hardware and DNN design. Executable models were introduced in several forms, facilitating
the injection of hardware-awareness into DNN optimization loops. Look-up tables and regression-
based models were developed for off-the-shelf hardware platforms [12, 15, 17, 19, 20], classical
SDF-style models were used to parameterize dataflow hardware architectures [21, 22, 23, 24, 25],
and analytical models were used to explore mapping and scheduling schemes on differently
dimensioned spatial accelerators [9, 10, 11, 12, 13]. Going a step further, differentiable hardware
models were developed, proving that hardware optimization does not need to break the smooth,
gradient-based training operation of DNNs, but can even be directly injected as part of the
learning and backpropagation algorithm of the DNN [15]. This allows the DNN to learn the task
at hand as well as learn how to run efficiently on hardware. Finally, the large search spaces and
costly evaluation and training times motivated the use of divide-and-conquer approaches to tackle
complex design challenges. The introduction of abstraction levels into the design flow allowed
the human and/or the metaheuristic agent to focus on solving sub-parts of the problem instead of
tackling it once as a whole and landing in incoherent co-design neighborhoods [8, 24, 25, 10, 11].
Focusing on a limited set of design details reduces the development effort, similar to how VLSI
engineers’ work is integrated at different levels of abstraction, allowing them to consider less
complex problems at each stage. The works published under the scope of this dissertation
and automated methodologies will be followed, accurate executable models will be critical, and
abstraction levels will be introduced to understand the design problem with reasonable detail at
each stage of development.
Human-engineered algorithms will behave like biological ones at some point. They will very
likely surpass the intelligence observed in biological beings in nature. To take the first steps
towards that goal, the fundamental way in which algorithms in nature and their execution medium
interact with the real world must be understood. The hardware and the algorithm must become
one and the same.
Bibliography
[33] P. Xu, X. Zhang, C. Hao, Y. Zhao, Y. Zhang, Y. Wang, C. Li, Z. Guan, D. Chen, and
Y. Lin. AutoDNNchip: An automated DNN chip predictor and builder for both FPGAs and ASICs. In International Symposium on Field-Programmable Gate Arrays. Association for
Computing Machinery, 2020. doi:10.1145/3373087.3375306.
[35] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. DoReFa-Net: Training low
bitwidth convolutional neural networks with low bitwidth gradients. In arXiv, 2016.
doi:10.48550/ARXIV.1606.06160.
[37] Q. Huang, K. Zhou, S. You, and U. Neumann. Learning to prune filters in convolutional
neural networks. In Winter Conference on Applications of Computer Vision. IEEE, 2018.
doi:10.1109/WACV.2018.00083.
[40] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image
recognition. In arXiv, 2014. doi:10.48550/ARXIV.1409.1556.
[43] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal
approximators. Neural Networks, 1989. doi:10.1016/0893-6080(89)90020-8.
[44] D. Hubel and T. Wiesel. Brain and Visual Perception: The Story of a 25-year
Collaboration. Oxford University Press, 2012. doi:10.1093/acprof:oso/
9780195176186.001.0001.
[45] J. Weng, N. Ahuja, and T. Huang. Learning recognition and segmentation of 3-d objects
from 2-d images. In International Conference on Computer Vision. IEEE, 1993. doi:
10.1109/ICCV.1993.378228.
[46] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. In International Conference on Machine Learning.
Association for Computing Machinery, 2015.
[47] A. Brock, S. De, S. L. Smith, and K. Simonyan. High-performance large-scale image recog-
nition without normalization. In arXiv, 2021. doi:10.48550/ARXIV.2102.06171.
[48] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic seg-
mentation. In Conference on Computer Vision and Pattern Recognition. IEEE, 2015.
doi:10.1109/CVPR.2015.7298965.
[50] Y. Lin, F. Lv, S. Zhu, M. Yang, T. Cour, K. Yu, L. Cao, and T. Huang. Large-scale image
classification: Fast feature extraction and svm training. In Computer Vision and Pattern
Recognition. IEEE, 2011. doi:10.1109/CVPR.2011.5995477.
[51] F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier. Large-scale image retrieval with
compressed fisher vectors. In Computer Vision and Pattern Recognition. IEEE, 2010.
doi:10.1109/CVPR.2010.5540009.
[52] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolu-
tional neural networks. Communications of the ACM, 2017. doi:10.1145/3065386.
[56] V. Sze, Y. Chen, T. Yang, and J. S. Emer. Efficient processing of deep neural net-
works: A tutorial and survey. Proceedings of the IEEE, 2017. doi:10.1109/
JPROC.2017.2761740.
[57] R. Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A
whitepaper. In arXiv, 2018. URL: https://arxiv.org/abs/1806.08342, doi:
10.48550/ARXIV.1806.08342.
[58] NVIDIA Corporation. NVDLA TensorRT Documentation. URL: https:
//docs.nvidia.com/deeplearning/tensorrt/developer-guide/
index.html.
[59] Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through
stochastic neurons for conditional computation. In arXiv, 2013. doi:10.48550/
ARXIV.1308.3432.
[60] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural
networks. In Neural Information Processing Systems. Curran Associates, Inc., 2016.
[61] S. Darabi, M. Belbahri, M. Courbariaux, and V. P. Nia. Regularized binary network
training. In arXiv, 2018. doi:10.48550/ARXIV.1812.11800.
[62] X. Lin, C. Zhao, and W. Pan. Towards accurate binary convolutional neural network. In
Neural Information Processing Systems. Curran Associates Inc., 2017.
[63] Q. Lou, F. Guo, M. Kim, L. Liu, and L. Jiang. AutoQ: Automated kernel-wise neural
network quantization. In International Conference on Learning Representations, 2020.
URL: https://openreview.net/forum?id=rygfnn4twS.
[64] M.-R. Vemparala, A. Frickenstein, and W. Stechele. An efficient FPGA accelerator design for optimized CNNs using OpenCL. In Architecture of Computing Systems. Springer, 2019. doi:10.1007/978-3-030-18656-2_18.
[65] Y. Umuroglu, D. Conficconi, L. Rasnayake, T. B. Preusser, and M. Själander. Opti-
mizing bit-serial matrix multiplication for reconfigurable computing. Transactions on
Reconfigurable Technology and Systems, 2019. doi:10.1145/3337929.
[66] S. Sharify, A. D. Lascorz, K. Siu, P. Judd, and A. Moshovos. Loom: Exploiting weight
and activation precisions to accelerate convolutional neural networks. In Proceedings of
the 55th Annual Design Automation Conference. Association for Computing Machinery,
2018. doi:10.1145/3195970.3196072.
[67] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo. UNPU: An energy-efficient deep
neural network accelerator with fully variable weight bit precision. Journal of Solid-State
Circuits, 2019. doi:10.1109/JSSC.2018.2865489.
[68] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos. Stripes: Bit-serial
deep neural network computing. In International Symposium on Microarchitecture. IEEE,
2016. doi:10.1109/MICRO.2016.7783722.
[69] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, V. Chandra, and H. Esmaeilzadeh. Bit fusion:
Bit-level dynamically composable architecture for accelerating deep neural networks. In
International Symposium on Computer Architecture. IEEE Press, 2018. doi:10.1109/
ISCA.2018.00069.
[70] J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation.
Addison-Wesley Longman Publishing Co., Inc., 1991.
[71] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Neural Information
Processing Systems. Morgan-Kaufmann, 1990.
[72] B. Hassibi, D. Stork, and G. Wolff. Optimal brain surgeon and general network prun-
ing. In International Conference on Neural Networks. IEEE, 1993. doi:10.1109/
ICNN.1993.298572.
[73] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks
with pruning, trained quantization and huffman coding. In arXiv, 2015. doi:10.48550/
ARXIV.1510.00149.
[74] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient
convnets. In arXiv, 2016. URL: https://arxiv.org/abs/1608.08710, doi:
10.48550/ARXIV.1608.08710.
[75] Y. He, P. Liu, Z. Wang, et al. Filter pruning via geometric median for deep convolutional
neural networks acceleration. In Conference on Computer Vision and Pattern Recognition.
IEEE, 2019. doi:10.1109/CVPR.2019.00447.
[76] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural net-
works. In International Conference on Computer Vision. IEEE, 2017. doi:10.1109/
ICCV.2017.155.
[77] T. Yang, Y. Chen, and V. Sze. Designing energy-efficient convolutional neural networks
using energy-aware pruning. In Conference on Computer Vision and Pattern Recognition.
IEEE, 2017. doi:10.1109/CVPR.2017.643.
[78] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: Efficient
inference engine on compressed deep neural network. In International Symposium on
Computer Architecture. IEEE Press, 2016. doi:10.1109/ISCA.2016.30.
[81] M. Tan and Q. Le. Efficientnet: Rethinking model scaling for convolutional neural
networks. In International Conference on Machine Learning. PMLR, 2019. URL: https:
//proceedings.mlr.press/v97/tan19a.html.
[83] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: In-
verted residuals and linear bottlenecks. In Conference on Computer Vision and Pattern
Recognition. IEEE, 2018. doi:10.1109/CVPR.2018.00474.
[84] T. Wang, K. Wang, H. Cai, J. Lin, Z. Liu, H. Wang, Y. Lin, and S. Han. APQ: Joint search
for network architecture, pruning and quantization policy. In Conference on Computer Vi-
sion and Pattern Recognition. IEEE, 2020. doi:10.1109/CVPR42600.2020.00215.
[85] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le. Mnasnet:
Platform-aware neural architecture search for mobile. In Conference on Computer Vision
and Pattern Recognition. IEEE, 2019. doi:10.1109/CVPR.2019.00293.
[86] H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han. Once for all: Train one network and special-
ize it for efficient deployment. In International Conference on Learning Representations,
2020. URL: https://openreview.net/forum?id=HylxE1HKwS.
[87] H. Cai, L. Zhu, and S. Han. ProxylessNAS: Direct neural architecture search on target
task and hardware. In International Conference on Learning Representations, 2019. URL:
https://openreview.net/forum?id=HylVB3AqYm.
[88] Y. Lin, D. Hafdi, K. Wang, Z. Liu, and S. Han. Neural-hardware architecture search. In
Neural Information Processing Systems Workshops, 2019.
[92] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning
models resistant to adversarial attacks. In International Conference on Learning Represen-
tations, 2018. URL: https://openreview.net/forum?id=rJzIBfZAb.
[93] E. Wong, L. Rice, and J. Z. Kolter. Fast is better than free: Revisiting adversarial
training. In International Conference on Learning Representations, 2020. URL: https:
//openreview.net/forum?id=BJx040EFvH.
[95] NVIDIA Corporation. NVIDIA Tesla V100 GPU Architecture, 2017. URL:
https://images.nvidia.com/content/volta-architecture/pdf/
volta-architecture-whitepaper.pdf.
[96] NVIDIA Corporation. NVDLA Open Source Project - Primer. URL: http://
nvdla.org/primer.html.
[99] Intel Corporation. Deep learning with Intel AVX-512 and Intel DL BOOST. URL:
https://www.intel.com/content/www/us/en/developer/articles/
guide/deep-learning-with-avx512-and-dl-boost.html.
[100] X. Yang, M. Gao, Q. Liu, J. Setter, J. Pu, A. Nayak, S. Bell, K. Cao, H. Ha, P. Raina,
C. Kozyrakis, and M. Horowitz. Interstellar: Using halide’s scheduling language to analyze
dnn accelerators. In International Conference on Architectural Support for Programming
Languages and Operating Systems. Association for Computing Machinery, 2020. doi:
10.1145/3373376.3378514.
[102] S.-C. Kao and T. Krishna. GAMMA: Automating the HW mapping of DNN models on
accelerators via genetic algorithm. In International Conference on Computer-Aided Design.
Association for Computing Machinery, 2020. doi:10.1145/3400302.3415639.
[106] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi,
N. Imam, S. Jain, Y. Liao, C.-K. Lin, A. Lines, R. Liu, D. Mathaikutty, S. McCoy, A. Paul,
J. Tse, G. Venkataramanan, Y.-H. Weng, A. Wild, Y. Yang, and H. Wang. Loihi: A
neuromorphic manycore processor with on-chip learning. IEEE Micro, 2018. doi:
10.1109/MM.2018.112130359.
[107] H. Kwon, A. Samajdar, and T. Krishna. MAERI: Enabling flexible dataflow mapping
over dnn accelerators via reconfigurable interconnects. ACM SIGPLAN Notices, 2018.
doi:10.1145/3296957.3173176.
[108] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis. TETRIS: Scalable and efficient
neural network acceleration with 3d memory. In International Conference on Architectural
Support for Programming Languages and Operating Systems. Association for Computing
Machinery, 2017. doi:10.1145/3037697.3037702.
[109] M. Gao, X. Yang, J. Pu, M. Horowitz, and C. Kozyrakis. TANGRAM: Optimized coarse-
grained dataflow for scalable nn accelerators. In International Conference on Architectural
Support for Programming Languages and Operating Systems. Association for Computing
Machinery, 2019. doi:10.1145/3297858.3304014.
[110] J. Li, G. Yan, W. Lu, S. Jiang, S. Gong, J. Wu, and X. Li. SmartShuttle: Optimizing off-chip
memory accesses for deep learning accelerators. In Conference on Design, Automation
and Test in Europe. IEEE, 2018. doi:10.23919/DATE.2018.8342033.
[111] D. D. Gajski, S. Abdi, A. Gerstlauer, and G. Schirner. Embedded system design modeling,
synthesis and verification. Springer, 2009.
[112] M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect: Training deep neural networks
with binary weights during propagations. In Neural Information Processing Systems.
Curran Associates, Inc., 2015.
[113] R. Andri, L. Cavigelli, D. Rossi, and L. Benini. YodaNN: An architecture for ultralow
power binary-weight cnn acceleration. Transactions on Computer-Aided Design of Inte-
grated Circuits and Systems, 2018. doi:10.1109/TCAD.2017.2682138.
[115] R. Zhao, W. Song, W. Zhang, T. Xing, J.-H. Lin, M. Srivastava, R. Gupta, and Z. Zhang.
Accelerating binarized convolutional neural networks with software-programmable fp-
gas. In International Symposium on Field-Programmable Gate Arrays. Association for
Computing Machinery, 2017. doi:10.1145/3020078.3021741.
[116] L. Yang, Z. He, and D. Fan. A fully onchip binarized convolutional neural network
FPGA implementation with accurate inference. In International Symposium on Low Power
Electronics and Design. Association for Computing Machinery, 2018. doi:10.1145/
3218603.3218615.
[117] S. Liang, S. Yin, L. Liu, W. Luk, and S. Wei. FP-BNN: Binarized neural network on FPGA. Neurocomputing, 2018. doi:
10.1016/j.neucom.2017.09.046.
[118] D. Nguyen, D. Kim, and J. Lee. Double MAC: Doubling the performance of convolutional
neural networks on modern fpgas. In Conference on Design, Automation and Test in
Europe. IEEE, 2017. doi:10.23919/DATE.2017.7927113.
[120] Xilinx, Inc. Versal: The First Adaptive Compute Acceleration Platform (ACAP), 2018.
v1.0.
[121] W. Tang, G. Hua, and L. Wang. How to train a compact binary neural network with high
accuracy? In Conference on Artificial Intelligence. AAAI Press, 2017.
[122] U. Zahid, G. Gambardella, N. J. Fraser, M. Blott, and K. Vissers. FAT: Training neural
networks for reliable inference under hardware faults. In International Test Conference.
IEEE, 2020. doi:10.1109/ITC44778.2020.9325249.
[123] Z. He, A. S. Rakin, J. Li, C. Chakrabarti, and D. Fan. Defending and harnessing the
bit-flip based adversarial weight attack. In Conference on Computer Vision and Pattern
Recognition. IEEE, 2020. doi:10.1109/CVPR42600.2020.01410.
[124] J. Lin, C. Gan, and S. Han. Defensive quantization: When efficiency meets ro-
bustness. In International Conference on Learning Representations, 2019. URL:
https://openreview.net/forum?id=ryetZ20ctX.
[125] Y. He, P. Balaprakash, and Y. Li. FIdelity: Efficient resilience analysis framework for
deep learning accelerators. In International Symposium on Microarchitecture. IEEE, 2020.
doi:10.1109/MICRO50266.2020.00033.
[126] L.-H. Hoang, M. A. Hanif, and M. Shafique. FT-ClipAct: Resilience analysis of deep
neural networks and improving their fault tolerance using clipped activation. In Conference
on Design, Automation and Test in Europe. EDA Consortium, 2020.
[127] A. S. Rakin, Z. He, and D. Fan. Bit-flip attack: Crushing neural network with progressive
bit search. In International Conference on Computer Vision. IEEE, 2019. doi:10.1109/
ICCV.2019.00130.
[128] J. Li, A. S. Rakin, Y. Xiong, L. Chang, Z. He, D. Fan, and C. Chakrabarti. Defending
bit-flip attack through dnn weight reconstruction. In Design Automation Conference. IEEE,
2020. doi:10.1109/DAC18072.2020.9218665.
[129] F. Liao, M. Liang, Y. Dong, T. Pang, X. Hu, and J. Zhu. Defense against adversarial attacks
using high-level representation guided denoiser. In Conference on Computer Vision and
Pattern Recognition. IEEE, 2018. doi:10.1109/CVPR.2018.00191.
[130] B. Calli, A. Singh, A. Walsman, S. Srinivasa, P. Abbeel, and A. M. Dollar. The YCB object
and model set: Towards common benchmarks for manipulation research. In International
Conference on Advanced Robotics. IEEE, 2015. doi:10.1109/ICAR.2015.7251504.
[132] T. R. Farrell and R. F. Weir. The optimal controller delay for myoelectric prostheses.
Transactions on Neural Systems and Rehabilitation Engineering, 2007. doi:10.1109/
TNSRE.2007.891391.
[134] M. Markovic, S. Dosen, C. Cipriani, D. Popovic, and D. Farina. Stereovision and aug-
mented reality for closed-loop control of grasping in hand prostheses. Journal of neural
engineering, 2014. doi:10.1088/1741-2560/11/4/046001.
[136] J. DeGol, A. Akhtar, B. Manja, and T. Bretl. Automatic grasp selection using a camera in
a hand prosthesis. In International Conference of the IEEE Engineering in Medicine and
Biology Society. IEEE, 2016. doi:10.1109/EMBC.2016.7590732.
[137] P. Weiner, J. Starke, F. Hundhausen, J. Beil, and T. Asfour. The KIT prosthetic hand:
design and control. In International Conference on Intelligent Robots and Systems. IEEE,
2018. doi:10.1109/IROS.2018.8593851.
[138] M. Esponda and T. M. Howard. Adaptive grasp control through multi-modal interactions
for assistive prosthetic devices. In arXiv, 2018. doi:10.48550/ARXIV.1810.07899.
[140] C. Shi, D. Yang, J. Zhao, and H. Liu. Computer vision-based grasp pattern recogni-
tion with application to myoelectric control of dexterous hand prosthesis. IEEE Trans-
actions on Neural Systems and Rehabilitation Engineering, 2020. doi:10.1109/
TNSRE.2020.3007625.
[142] A. Frickenstein, M.-R. Vemparala, J. Mayr, N.-S. Nagaraja, C. Unger, F. Tombari, and
W. Stechele. Binary DAD-Net: Binarized driveable area detection network for autonomous
driving. In International Conference on Robotics and Automation. IEEE, 2020. doi:
10.1109/ICRA40945.2020.9197119.
[143] B. Zhuang, C. Shen, M. Tan, L. Liu, and I. Reid. Structured binary neural networks for
accurate image classification and semantic segmentation. In Conference on Computer
Vision and Pattern Recognition. IEEE, 2019. doi:10.1109/CVPR.2019.00050.
[144] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale
hierarchical image database. In Conference on Computer Vision and Pattern Recognition.
IEEE, 2009. doi:10.1109/CVPR.2009.5206848.
[145] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. The german traffic sign recognition
benchmark: A multi-class classification competition. In International Joint Conference on
Neural Networks. IEEE, 2011. doi:10.1109/IJCNN.2011.6033395.
[146] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural
images with unsupervised feature learning. In Neural Information Processing Systems
Workshops, 2011.
[148] F. Hundhausen, J. Starke, and T. Asfour. A soft humanoid hand with in-finger visual
perception. In arXiv, 2020. doi:10.48550/ARXIV.2006.03537.
[149] L. Wang, Z. Q. Lin, and A. Wong. Covid-net: A tailored deep convolutional neural network
design for detection of covid-19 cases from chest x-ray images. Scientific Reports, 2020.
doi:10.1038/s41598-020-76550-z.
[150] A. I. Khan, J. L. Shah, and M. M. Bhat. Coronet: A deep neural network for detection
and diagnosis of covid-19 from chest x-ray images. Computer Methods and Programs in
Biomedicine, 2020. doi:10.1016/j.cmpb.2020.105581.
[154] T. Mitze, R. Kosfeld, J. Rode, and K. Wälde. Face masks considerably reduce covid-
19 cases in germany. Proceedings of the National Academy of Sciences, 2020. doi:
10.1073/pnas.2015954117.
[155] S. Ge, J. Li, Q. Ye, and Z. Luo. Detecting masked faces in the wild with lle-cnns. In
Conference on Computer Vision and Pattern Recognition. IEEE, 2017. doi:10.1109/
CVPR.2017.53.
[156] Z. Wang, G. Wang, B. Huang, Z. Xiong, Q. Hong, H. Wu, P. Yi, K. Jiang, N. Wang, Y. Pei,
H. Chen, Y. Miao, Z. Huang, and J. Liang. Masked face recognition dataset and application.
In arXiv, 2020. doi:10.48550/ARXIV.2003.09093.
[157] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for
discriminative localization. In Conference on Computer Vision and Pattern Recognition.
IEEE, 2016. doi:10.1109/CVPR.2016.319.
[161] Z. Wang, P. Wang, P. C. Louis, L. E. Wheless, and Y. Huo. Wearmask: Fast in-browser
face mask detection with serverless edge computing for covid-19. In arXiv, 2021. doi:
10.48550/ARXIV.2101.00784.
[162] K. Hammoudi, A. Cabani, H. Benhabiles, and M. Melkemi. Validating the correct wearing
of protection mask by taking a selfie: Design of a mobile application “checkyourmask”
to limit the spread of covid-19. Computer Modeling in Engineering & Sciences, 2020.
doi:10.32604/cmes.2020.011663.
[163] A. Anwar and A. Raychowdhury. Masked face recognition for secure authentication. In
arXiv, 2020. doi:10.48550/ARXIV.2008.11104.
[165] Z. Dong, Z. Yao, A. Gholami, M. Mahoney, and K. Keutzer. HAWQ: Hessian aware
quantization of neural networks with mixed-precision. In International Conference on
Computer Vision. IEEE, 2019. doi:10.1109/ICCV.2019.00038.
[166] B. Wu, Y. Wang, P. Zhang, Y. Tian, P. Vajda, and K. Keutzer. Mixed precision quantization
of convnets via differentiable neural architecture search. In arXiv, 2018. doi:10.48550/
ARXIV.1812.00090.
[167] W. Jiang, L. Yang, E. H.-M. Sha, Q. Zhuge, S. Gu, S. Dasgupta, Y. Shi, and J. Hu. Hard-
ware/software co-exploration of neural architectures. Transactions on Computer-Aided De-
sign of Integrated Circuits and Systems, 2020. doi:10.1109/TCAD.2020.2986127.
[170] X. Dai, P. Zhang, B. Wu, H. Yin, F. Sun, Y. Wang, M. Dukhan, Y. Hu, Y. Wu, Y. Jia,
P. Vajda, M. Uyttendaele, and N. K. Jha. Chamnet: Towards efficient network design
through platform-aware model adaptation. In Conference on Computer Vision and Pattern
Recognition. IEEE, 2019. doi:10.1109/CVPR.2019.01166.
[171] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A fast and elitist multi-objective
genetic algorithm: NSGA-II. Transactions on Evolutionary Computation, 2002. doi:
10.1109/4235.996017.
[172] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo. Optimizing the convolution operation to
accelerate deep neural networks on fpga. Transactions on Very Large Scale Integration
(VLSI) Systems, 2018. doi:10.1109/TVLSI.2018.2815603.
[173] M. Horowitz. 1.1 computing’s energy problem (and what we can do about it). In
International Solid-State Circuits Conference Digest of Technical Papers. IEEE, 2014.
doi:10.1109/ISSCC.2014.6757323.
[175] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for
semantic image segmentation. In arXiv, 2017. doi:10.48550/ARXIV.1706.05587.
[176] W. Jiang, L. Yang, S. Dasgupta, J. Hu, and Y. Shi. Standing on the shoulders of giants: Hard-
ware and neural architecture co-search with hot start. Transactions on Computer-Aided De-
sign of Integrated Circuits and Systems, 2020. doi:10.1109/TCAD.2020.3012863.
[177] L. Mei, P. Houshmand, V. Jain, S. Giraldo, and M. Verhelst. Zigzag: Enlarging joint
architecture-mapping design space exploration for dnn accelerators. Transactions on
Computers, 2021. doi:10.1109/TC.2021.3059962.
A Appendix
Table A.1: Network architectures and hardware dimensioning used in Binary-LoRAX and BinaryCoP.
Pool, ReLU and batch normalization layers not shown. FC 3 | [25/4] indicates YCB (25
classes) or MaskedFace-Net (4 classes).
Figure A.1: QSS: 2-D projections of a 3-D Pareto-front for optimal quantization with respect to accuracy,
compute cycles, and DRAM accesses on HW3. Compute cycles and DRAM accesses are
normalized to an 8-bit execution on HW3. “Reward Accuracy” is with minimal fine-tuning
(not fully trained). It is recommended to view this figure in color.
Figure A.2: Comparison of a HAS solution (Dm , Dn , Dk = 8, 14, 96) found for ResNet18-ImageNet
4-bit against the larger standard symmetric hardware configuration HW3. The CONV1 layer
follows the same trend but is not shown to maintain plot scale.