
sensors

Article

Quantization and Deployment of Deep Neural Networks on Microcontrollers

Pierre-Emmanuel Novac 1,*, Ghouthi Boukli Hacene 2,3, Alain Pegatoquet 1, Benoît Miramond 1, Vincent Gripon 2

1 Université Côte d’Azur, CNRS, LEAT, 06903 Sophia Antipolis, France; Alain.Pegatoquet@univ-cotedazur.fr (A.P.); Benoit.Miramond@univ-cotedazur.fr (B.M.)
2 IMT Atlantique, 29200 Brest, France; Ghouthi.BoukliHacene@imt-atlantique.fr (G.B.H.); vincent.gripon@imt-atlantique.fr (V.G.)
3 MILA, Montreal, QC H2S 3H1, Canada
* Correspondence: Pierre-Emmanuel.Novac@univ-cotedazur.fr

Abstract: Embedding Artificial Intelligence into low-power devices is a challenge that has been partly
overcome with recent advances in machine learning and hardware design. Presently, deep neural
networks can be deployed on embedded targets to perform different tasks such as speech recognition,
object detection or human activity recognition. However, there is still room for optimization of deep
neural networks in embedded devices. These optimizations mainly address power consumption,
memory and real-time constraints, but also an easier deployment at the edge. Moreover, there is still
a need to better understand what can be achieved for different use cases. This work focuses on the
quantization and deployment of deep neural networks on low-power 32-bit microcontrollers. First,
we outline quantization methods relevant in the context of embedded execution on a microcontroller.
Then we present a new framework for end-to-end deep neural network training, quantization and
deployment. This open-source framework, called MicroAI, is designed as an alternative to existing
inference engines (TensorFlow Lite for Microcontrollers and STM32Cube.AI). Our framework can easily
be adjusted and/or extended for specific use cases. Executions using single-precision 32-bit floating-point
as well as fixed-point on 8- and 16-bit integers are supported. The proposed quantization method
is evaluated with three different datasets (UCI-HAR, Spoken MNIST and GTSRB). Finally, a comparison
study between MicroAI and both existing embedded inference engines is provided in terms of memory
and power efficiency. On-device evaluation was done using ARM Cortex-M4F-based microcontrollers
(Ambiq Apollo3 and STM32L452RE).

Keywords: embedded systems; artificial intelligence; machine learning; quantization; power consumption; microcontrollers

Citation: Novac, P.-E.; Boukli Hacene, G.; Pegatoquet, A.; Miramond, B.; Gripon, V. Quantization and Deployment of Deep Neural Networks on Microcontrollers. Sensors 2021, 21, 2984. https://doi.org/10.3390/s21092984

Academic Editor: Alexander Wong
Received: 29 March 2021; Accepted: 20 April 2021; Published: 23 April 2021
1. Introduction
Deep Neural Networks (DNN) are widely used presently to solve a range of problems, in-
cluding classification. DNN can classify all sorts of data such as audio, images or accelerometer
samples for tasks such as speech recognition, object recognition or human activity recognition
(HAR).
A well-known downside of DNNs is their high energy consumption. In particular, the training
phase is usually based on a large amount of data processed by costly algorithms. Although the
inference phase requires less processing power, it is still a costly process. Therefore, GPUs and
ASICs are often used to perform such computations in the cloud [1].



However, cloud computing requires transmitting the collected data to a network server
to process it and fetch the result, thus requiring permanent connectivity, causing privacy
concerns as well as non-deterministic latency. As an alternative, computations can be done at
the edge on the device itself. By doing so, data do not need to be sent by the device to the cloud
anymore. However, running DNN on resource-constrained devices such as a microcontroller
used in Internet of Things (IoT) devices or wearables is a challenging task [2–4].
These devices have only a very small amount of memory, often less than 1 MiB. They
also run DNN algorithms several orders of magnitude more slowly than GPUs or even CPUs
(see Appendix A). The reason is that microcontrollers generally rely on a general-purpose
processing core that does not implement parallelization techniques such as thread-level
parallelism or advanced vectorization. Moreover, microcontrollers typically run at a much
lower frequency than GPUs (8 MHz to 80 MHz compared to 1 GHz to 2 GHz). Microcontrollers
can also be coupled with tiny battery cells. In some cases, for example when data are collected
in remote areas, they cannot even be recharged in the field. Therefore, performing inference at
the edge faces major issues in terms of real-time constraints, power consumption and memory
footprint. To meet these constraints, the deployment of a DNN must respect an upper bound
for one inference response time as well as an upper bound for the number of parameters of
the network.
As a result, a DNN must be limited in width and depth to be deployable on a micro-
controller. As has been observed, deeper and/or wider networks are often able to solve
more complex tasks with better accuracy [5]. As such, there is always a trade-off between
memory footprint, response time, power consumption and accuracy of the model. In a previ-
ous work [6], we presented a trade-off between memory footprint, power consumption and
accuracy when performing HAR on smart glasses. This work showed that HAR is feasible in
real time on a low-power Cortex-M4F-based microcontroller. However, we also concluded
that there was room for improvement in the memory footprint and power consumption.
A technique that can provide a significant decrease in the memory footprint is based on
network quantization. Quantization consists of reducing the number of bits used to encode
each weight of the model, so that the total memory footprint is reduced by the same factor.
Quantization also enables the use of fixed-point rather than floating-point encoding. In other
words, operations can be performed using integer rather than floating-point data types. This
is of interest because integer operations require considerably fewer computations on most
processor cores, including microcontrollers. Without a floating-point unit (FPU), floating-
point instructions must be emulated in software, creating a large overhead, as was illustrated
in [7]. In that study, a comparison between software, hardware and custom hybrid FPU
implementations was provided.
In this paper, we present an open-source [8] framework, called MicroAI, to perform
end-to-end training, quantization and deployment of deep neural networks on microcon-
trollers. The training phase relies on the well-known TensorFlow and PyTorch deep learning
frameworks. Our objective is to provide a framework that is easy to adapt and extend, while
maintaining a good compromise between accuracy, energy efficiency and memory footprint.
As a second contribution, we provide some comparative results using two different
microcontrollers (STM32L452RE and Ambiq Apollo3) and three different inference engines
(TensorFlow Lite for Microcontrollers, STM32Cube.AI and our own MicroAI). Results are
compared in terms of memory footprint, inference time and power efficiency. Finally, we
propose to apply 8-bit and 16-bit quantization methods on three datasets dealing with different
modalities: acceleration and angular velocity from body-worn sensors for UCI-HAR, speech
for Spoken MNIST and images for GTSRB. These datasets are light enough to be handled by a
deep neural network running on a microcontroller, but still relevant for applications relying
on embedded artificial intelligence.

Section 2 presents the challenges of running deep neural networks on microcontrollers
and recent advances in this field. Section 3 describes the two common ways of representing
real numbers on modern computers: floating point and fixed point. Section 4 presents the
methodology implemented in our MicroAI framework for deep neural network quantization.
Section 5 details our MicroAI framework and compares it to existing solutions. In Section
6, some comparative results between our framework MicroAI and two popular embedded
neural network frameworks (TensorFlow Lite for Microcontrollers and STM32Cube.AI) are
given for two microcontroller platforms (Nucleo-L452RE-P and SparkFun Edge) in terms of
inference time and power efficiency. The impact of our 8- and 16-bit quantization methods for
three different datasets (UCI-HAR, Spoken MNIST and GTSRB) is also presented. In Section 7
the results obtained are discussed. Finally, Section 8 concludes this work and discusses future
perspectives.

2. The State of the Art in Embedded Execution of Quantized Neural Networks


A first family of optimization methods for embedded deep neural networks, based
on network quantization, proposes to reduce the precision (i.e., the number of bits used to
represent a value). Numerous works propose to use low-bit quantization (i.e., 2 or 3 bits to
quantize both weights and activations) such as PArameterized Clipping acTivation (PACT)
combined with Statistics-Aware Weight Binning (SAWB) [9], Learned Step Size Quantization
(LSQ) [10], Bit-Pruning [11] or Differentiable Quantization of Deep Neural Networks [12]
(DQDNN). In the extreme case, this amounts to binarizing the network parameters [13,14].
Nevertheless, it is usually possible to find a good compromise between the precision and the
performance of a given architecture.
Another family of methods focuses on low-cost architectures. This is notably the case of
the well-known MobileNet [15] and Squeeze-Net [16] networks. Finding a good architecture
for a given problem is difficult, and remains an open issue even if hardware constraints are
not taken into account. Indeed, the combinatorial explosion of hyperparameters makes the
exploration of an architecture very expensive. These networks are often tailored to solve
computer vision problems such as ImageNet [17] classification, and are therefore not well
suited for general use cases or simpler problems. In practice, many applications use a generic
network such as ResNet [18]. This is also the approach adopted in this work. The reason is
that we want to easily make use of the same deep neural network on different kinds of data
(time series, audio spectrum and image) and also simplify the implementation.
In a third family of methods, some authors have explored the possibility of reducing
the number of parameters in neural networks by identifying parts of the network that are
not very useful for decision making. These parts can then be removed from the architecture.
These pruning techniques make it possible to considerably reduce the number of parameters.
Hence, in [19], the authors managed to remove up to 90% of the parameters. However, the
unstructured nature of the removed parameters makes the resulting sparsity difficult to exploit in practice.
Recently, several works have proposed pruning in a structured way where a whole kernel, a
filter or even a layer is pruned according to a specific criterion [20–23].
Finally, a last family of methods consists of identifying similar parts at various points in
the architectures in order to factorize them. For example, in [24] the authors demonstrate that
it is possible to reduce the number of parameters by replacing them with memory pointers.
Some works propose to improve accuracy by adding some learning steps and algorithms to
get a more representative factorization [25,26].
All these compression techniques make it possible to reduce the size of architectures
considerably, but usually at the cost of a reduction in performance. In practice, it is also
common to reduce by half or even two-thirds the memory used to store the parameters while
maintaining a similar level of performance.

In this work we will focus on quantization-based compression. To reduce the number of
bits used to represent a value, quantization maps values from one set to another smaller set.
The smaller set can have a constant step between its elements, in which case the quantization
scheme is said to be uniform. In the case of convolutional neural networks, the authors in
[27] show that the weights of convolutional layers typically follow a Gaussian distribution
when weight decay is applied. More generally, it has been shown that weights can closely fit
Gaussian Mixture Models [28]. Therefore, choosing a non-uniform quantization scheme to
better represent the values of the non-uniform distribution of weights would lead to a lower
quantization error.
A non-uniform quantization scheme was implemented in [29] on an FPGA device. In
this work, instead of coding the value in a fixed-point format, only the nearest power of
two is coded. Using this approach, it is possible to obtain a better resolution compared to
a fixed-point representation for numbers near 0. This approach also allows large values to
be represented, but at the cost of a lower resolution. The quantization step is determined
by minimizing the quantization error at the output of the layer, thus balancing the precision
and the dynamic range. As the implementation relies on bit shifts rather than on integer
multiplications, this solution has some benefits in terms of resource usage and latency for an
FPGA target. Additionally, results show that there is a slight degradation of accuracy when
using the proposed non-uniform quantization versus a uniform quantization.
Using lower-precision computation for deep neural networks has been explored in [30]
where the authors compare the test error rates on various image datasets for single-precision
floating point, half-precision floating point, 20-bit fixed point and their own dynamic fixed-
point approach with 10 bits for activations and 12 bits for weights. In their work, it is worth
noting that the training is also performed using lower-precision arithmetic. Training with
fixed-point arithmetic was presented in [31] with 16-bit weights and 8-bit inputs, causing
an accuracy loss of a few percent in the evaluation of text-to-speech, parity bit computation,
protein structure prediction and sonar signal classification problems. In [32], the authors
showed that on an Intel® E5640 microprocessor with an x86 architecture, using 8-bit integer
instructions instead of floating-point instructions provided an execution speedup of more
than 2, without a loss of accuracy, on a speech recognition problem. In this case the training
was performed using single-precision floating-point arithmetic, and the evaluation was done
after the quantization of the network parameters.
These prior works were however mostly not concerned with embedded computing on
microcontrollers. Running deep neural networks on microcontrollers began to be popular in
the last few years thanks to the rise of the Internet of Things and the improved efficiency of
deep neural networks.
In [33], the authors emphasize that the instruction set architecture (ISA) of available
microcontrollers can be a great limitation to running quantized neural networks. Indeed, most
microcontroller architectures do not exhibit any kind of SIMD instructions. On the other hand,
most microcontrollers rely on 32-bit registers. Thus, even if the neural network parameters
and the input data use a lower precision representation, they have to be computed one by one
using the register width of 32 bits. Some more advanced microcontroller architectures offer
instructions able to handle 4 × 8-bit or 2 × 16-bit data packed in 32-bit registers. However, such
advances do not allow working with intermediate precision or sub-byte precision, and not
all arithmetic and logic instructions are covered. Intermediate or sub-byte precision requires
manually packing and unpacking data, thus inducing a noticeable computation overhead,
even though it helps further reduce the memory footprint.
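For instance, packing two signed 4-bit weights into a single byte halves the weight storage
compared to int8, but every access then pays for extra shift and mask work, as in the
illustrative sketch below (these helper functions are an example written for this explanation,
not taken from any particular library):

#include <stdint.h>

/* Pack two signed 4-bit values into one byte. */
uint8_t pack_int4(int8_t lo, int8_t hi) {
    return (uint8_t)((lo & 0x0F) | ((hi & 0x0F) << 4));
}

/* Unpack the 4-bit value at position index (0 or 1) and sign-extend it. */
int8_t unpack_int4(uint8_t packed, int index) {
    int8_t nibble = (int8_t)((packed >> (4 * index)) & 0x0F);
    return (nibble & 0x08) ? (int8_t)(nibble - 16) : nibble;
}

This unpacking overhead is the computation penalty mentioned above: each multiply-accumulate
on sub-byte weights is preceded by a shift, a mask and a sign extension.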
Moreover, microcontrollers used in IoT devices mostly rely on ARM Cortex-M cores with
the associated ISA. However, ARM cores are not open, meaning that modifying and extending
the ISA is not possible. To overcome these limitations, the authors in [33] proposed an
extension to the RISC-V ISA, which is open, with instructions to handle sub-byte quantization.
Unfortunately, as microcontrollers implementing RISC-V are still scarce on the market, and
not readily available with the proposed extension, this approach cannot be reasonably used to
deploy IoT devices since it requires manufacturing a custom microcontroller. Manufacturing a
custom microcontroller is not feasible when the goal is to release an IoT product on the market,
due to large costs, time and the required level of expertise. As a result, only off-the-shelf
microcontrollers are considered in this work. Only 8-bit, 16-bit and 32-bit precision will
therefore be studied.
Deep neural networks have already been deployed on 8-bit microcontrollers. One of
the first methods was proposed in [34]. Although interesting, this method requires a lot of
work to implement pseudo-floating-point coding, a custom multiplication algorithm over 16
bits, as well as a hyperbolic tangent approximation for the activation function, all in assembly
language. Over the last few years, implementations have relied on 32-bit microcontrollers
with either a hardware FPU or fixed-point computations. In addition, the Rectified Linear
Unit (ReLU) [35] has become widely used as an activation function and has the benefit of
being easily computed as a max between 0 and the layer’s output, thus being much less
complex than a hyperbolic tangent. In the meantime, neural network architectures and
training methods have continued to evolve to provide more and more efficient models. As a
result, applications such as spoken keyword spotting [36] and human activity recognition [6]
can now be performed in real time on IoT devices relying on low-power microcontrollers.

3. Representation of Real Numbers


3.1. Floating-Point
In modern computation systems, real numbers typically use a floating-point repre-
sentation. Floating-point representation relies on the encoding of three different pieces of
information: the sign, the significand and the exponent. Coding the significand and the
exponent separately makes it possible to represent values with a very large dynamic range,
while at the same time providing increasing precision as the numbers approach 0.
Most floating-point implementations follow the IEEE754 [37] standard which defines
how the sign, significand and exponent are coded in a binary format. Floating-point numbers
can be coded in half, single or double precision requiring 16, 32 or 64 bits, respectively.
Obviously, the more bits allocated to code a value, the more precise it is. A higher number
of bits allocated to the exponent also allows for a larger dynamic range. In deep neural
networks, single precision is more than enough for training and inference. Double precision
requires more computing resources, so it is generally not used. Recently, it has been shown
that half-precision can further accelerate the training and inference without a significant drop
in the accuracy of the deep neural network [38].
However, the choice is much more restricted for low-power microcontrollers. When
present, the hardware floating-point unit often only supports single-precision computation.
Double-precision computations must be performed in software and are therefore significantly
slower than single precision. Half-precision data are converted to single precision before
the computation. In 2019, ARM released the ARMv8.1-M ISA which includes instructions
for half-precision support. Even though the Cortex-M55 core is planned to implement these
instructions, there is so far no microcontroller with this core available on the market. As a
result, when floating point is used on a microcontroller, only single precision is considered.
The binary representation of single-precision floating-point numbers is called binary32
and is represented with 1 bit for the sign, 8 bits for the exponent and 23 bits for the significand
(see Table 1). It allows a dynamic range of roughly [−10^38, 10^38], far beyond the values
typically seen in a deep neural network, while increasing the resolution for numbers close to
0. The closest possible numbers to 0 are approximately ±1.4 × 10^(−45).

Table 1. Single-precision floating-point binary32 representation.

Bit(s):   31      30–23       22–0
Field:    sign    exponent    significand
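For illustration (this snippet is not part of the original paper), the three fields of Table 1 can
be extracted from a binary32 value with a few masks and shifts:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float x = -6.25f;
    uint32_t bits;
    memcpy(&bits, &x, sizeof(bits));             /* reinterpret the binary32 encoding */

    uint32_t sign        = bits >> 31;           /* bit 31                            */
    uint32_t exponent    = (bits >> 23) & 0xFF;  /* bits 30-23, biased by 127         */
    uint32_t significand = bits & 0x7FFFFF;      /* bits 22-0, implicit leading 1     */

    printf("sign=%u exponent=%u (unbiased %d) significand=0x%06X\n",
           (unsigned)sign, (unsigned)exponent, (int)exponent - 127, (unsigned)significand);
    return 0;
}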

3.2. Fixed-Point
Fixed-point is another way to represent real numbers. In this representation, the integer
part and the fractional part have a fixed length. As a result, the dynamic range and the
resolution are directly limited by the length of the integer part and the length of the fractional
part, respectively. The resolution is constant across the whole dynamic range. In binary, the
Q notation is often used to specify the number of bits associated with each part. Qm.n is a
number where m bits are allocated to the integer part and n bits to the fractional part [39]. It
is important to note that we consider signed numbers in two’s complement representation,
the sign bit being included in the integer part. The number of bits for the integer part can be
increased to obtain a larger dynamic range, but it will conversely reduce the number of bits
allocated to the fractional part, thus reducing its precision.
Given a Qm.n signed number, its dynamic range is [−2^(m−1), 2^(m−1) − 2^(−n)] and its resolution
is 2^(−n).
As an example, in Table 2, a signed Q16.16 number stored in a 32-bit register has 16 bits
for the integer part including 1 bit for the sign and 16 bits for the fractional part. This translates
to a dynamic range of [−32768, 32767.9999847], much smaller than the equivalent floating-
point representation, and a constant resolution of 1.5259 × 10^(−5) across the whole range, less
precise than the floating-point representation near 0.

Table 2. Fixed-point Q16.16 on 32-bit representation.

Bit(s):   31–16                                15–0
Field:    integer part (including sign bit)    fractional part
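As an illustration (these helper functions are an example, not code from the paper), a Q16.16
value can be encoded from and decoded back to a float as follows; the resolution printed at the
end is the constant step of 2^(−16) discussed above:

#include <stdio.h>
#include <stdint.h>
#include <math.h>

#define Q16_16_FRAC_BITS 16

/* Encode a float into Q16.16 (rounding towards negative infinity). */
static int32_t q16_16_from_float(float x) {
    return (int32_t)floorf(x * (float)(1 << Q16_16_FRAC_BITS));
}

/* Decode a Q16.16 value back to a float. */
static float q16_16_to_float(int32_t q) {
    return (float)q / (float)(1 << Q16_16_FRAC_BITS);
}

int main(void) {
    int32_t q = q16_16_from_float(3.14159f);
    printf("3.14159 -> 0x%08X -> %f\n", (unsigned)q, q16_16_to_float(q));
    printf("resolution = %.10f, range = [%f, %f]\n",
           1.0 / (1 << Q16_16_FRAC_BITS),
           q16_16_to_float(INT32_MIN), q16_16_to_float(INT32_MAX));
    return 0;
}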

4. Training and Quantization of Deep Neural Networks


In this work, the training is always performed using single-precision floating-point
computation, that is, in the binary32 format. As training is done offline on a workstation,
there is no need to perform it in fixed point. Despite this being feasible, it would come with
additional challenges regarding gradient computation.

4.1. Floating-Point to Fixed-Point Quantization of a Deep Neural Network


Since the training relies on floating-point computation, a conversion from a floating-point
to a fixed-point representation must be performed before the deep neural network is deployed
on the target. As the set of possible values is different between floating-point and fixed-point
representations, this involves quantizing the weights of the deep neural network.
Floating-point to fixed-point conversion requires determining a scale factor, so that the
floating-point number can be represented as an integer multiplied by a scale factor. The scale
factor is a positive or negative power of two so that it can be computed using only left or right
shifts. In the case of the Cortex-M4 architecture, both multiplication and shift instructions
take one cycle. However, a division requires 2 to 12 cycles. Therefore, divisions should be
avoided as much as possible.
The scale factor must be chosen to represent the whole range of values while avoiding
any risk of data overflow, but at the cost of a lower precision for smaller numbers.

4.1.1. Uniform and Non-Uniform


Similarly to the works presented in Section 2, in our experiments we also observed that
convolutional layer weights are close to Gaussian distributions, with a mean close to 0 when
inputs are normalized. Such a distribution of weights for a convolutional layer kernel is
shown in Figure 1. As a result, convolutional layer weights can be better represented using a
non-uniform distribution of numbers. This is what floating-point numbers originally do to
get a better precision around 0.
However, as the goal is to perform fast computations, a uniform quantization is pre-
ferred. Non-uniform quantization would require performing additional transformations
before using the microcontroller’s instructions. This overhead can be non-negligible and lead
to quantization performance lower than floating-point computations. To obtain a nonconstant
quantization step, a nonlinear function must be computed, either online or offline, to generate
a lookup table where operands for each operation are stored. In contrast, uniform quantization
is based on a constant quantization step. Furthermore, coding only the nearest power of two
does not bring an improvement on Cortex-M4-based microcontrollers. Multiplications and
shifts are implemented in hardware and take only 1 cycle. In consequence, the benefits of this
kind of approach are limited. For these reasons, we will rely on uniform quantization in this
work.
[Figure 1 shows the density histogram of the weights of a conv1d kernel; for this example, the
mean is about −0.0014, the standard deviation about 0.137, the minimum about −0.617 and the
maximum about 0.487.]

Figure 1. Example of the distribution of weights for a convolutional layer kernel.

4.1.2. Scaling and Offset


An alternative consists of finding a scale factor that is not necessarily a power of two,
but which scales values in [−1; +1[. Using this technique, all the bits (except the sign bit) are
used to represent the fractional parts of numbers. As an example, there is the Q1.15 format for
16-bit numbers.
The scale factor also uses a fixed-point representation but with its own power of two
scale factor. This allows for a slightly lower quantization error since the quantization step
is not necessarily a power of two. However, scaling a number adds computations: instead
of performing only a shift, a multiplication with the scale factor followed by a shift with the
scale factor’s power-of-two scale factor must be performed.
In addition, the range could be made asymmetric by introducing an offset value. This
would also enable a slightly lower quantization error when the distribution is not centered
around 0. However, this requires one more addition when scaling a number. It is worth noting
that using unsigned numbers for the ReLU activation could help recover one more bit for the
fractional part if fully merged with the previous layer. Nevertheless, it also comes with
additional implementation complexity to perform operations between signed and unsigned
numbers. All these alternatives imply a higher complexity and additional computations, with
only a slight improvement of the quantization error. For these reasons, we decided to use a
scale factor that is a power of two and a symmetric range.
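For comparison, a minimal sketch of the scale-and-offset (affine) alternative discussed above
follows. It is illustrative only (the names, the rounding choice and the int8 target are
assumptions) and it is not the power-of-two, symmetric scheme retained in this work:

#include <stdint.h>
#include <math.h>

/* Affine quantization: q = round(x / scale) + zero_point, saturated to int8. */
typedef struct {
    float scale;        /* real-valued quantization step (not a power of two) */
    int32_t zero_point; /* offset making the range asymmetric                 */
} affine_params_t;

int8_t affine_quantize(float x, affine_params_t p) {
    int32_t q = (int32_t)lroundf(x / p.scale) + p.zero_point;
    if (q < -128) q = -128;  /* saturate to the int8 range */
    if (q > 127)  q = 127;
    return (int8_t)q;
}

float affine_dequantize(int8_t q, affine_params_t p) {
    return p.scale * (float)((int32_t)q - p.zero_point);
}

Compared to a power-of-two scale factor, rescaling intermediate results then requires an integer
multiplication followed by a shift instead of a single shift, which is the extra cost mentioned above.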

4.1.3. Per-Network, Per-Layer and Per-Filter Scale Factor


To reach the best quantization, the scale factor should in theory be chosen for each weight
value. However, storing a scale factor for each value is obviously not a reasonable approach
since it leads to an important memory overhead, defeating the purpose of implementing
quantization to reduce memory usage. On the other hand, using a scale factor for the whole
network is too coarse to achieve a good overall quantization error. Instead, the scale factor
can be made different for each layer. Another solution consists of using a scale factor for each
filter of each layer. Although more complex to implement and introducing some overhead
(scale factors of the layers must be stored in memory), this approach can slightly decrease the
quantization error. So far, our implementation only allows a per-network and a per-layer scale
factor. The parameters and the activations can have different scale factors.

4.1.4. Conversion Method


To convert from floating-point to fixed-point, the method starts with finding the required
number of bits m to represent the unsigned integer part:

m = 1 + \lfloor \log_2 ( \max_{1 \le i \le N} |x_i| ) \rfloor    (1)

where x_i is an element of the floating-point vector x of length N. A positive value of m means
that m bits are required to represent the absolute value of the integer part, while a negative
value of m means that the fractional part has |m| leading unused bits. This enables a greater
precision to be obtained for vectors with numbers smaller than 2^(−1), since the leading unused
bits can be removed and replaced instead by more trailing bits for precision.
From this we can compute the number of remaining bits n for the fractional part:

n = w − m − 1    (2)

where w is the data type width (e.g., 16 bits).


In this equation, 1 is subtracted to take into account the additional bit required to represent
signed numbers.
A positive value of n means that n bits are available to represent the fractional part.
A negative value of n means that the fractional part cannot be represented, and the
integer part cannot be represented to its full precision.
An element x_{fixed_i} of the fixed-point vector x_{fixed} is computed from the element x_i of
the floating-point vector x as:

x_{fixed_i} = \lfloor x_i \times 2^n \rfloor    (3)

where \lfloor y \rfloor rounds a real number y down to the nearest integer.

The scale factor s is then defined as:

s = 2^{-n}    (4)
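As a concrete illustration of Equations (1)-(4), the sketch below quantizes a float vector to
16-bit fixed point. It is a minimal example written for this explanation, not the actual
MicroAI/KerasCNN2C code, and it does not handle the all-zero vector case:

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <math.h>

/* Quantize a float vector to w-bit fixed point and return n (Equations (1)-(4)). */
static int quantize_vector(const float *x, int16_t *x_fixed, size_t len, int w) {
    float max_abs = 0.0f;
    for (size_t i = 0; i < len; i++) {
        float a = fabsf(x[i]);
        if (a > max_abs) max_abs = a;
    }
    int m = 1 + (int)floorf(log2f(max_abs));           /* Equation (1): integer-part bits    */
    int n = w - m - 1;                                 /* Equation (2): fractional-part bits */
    for (size_t i = 0; i < len; i++)
        x_fixed[i] = (int16_t)floorf(ldexpf(x[i], n)); /* Equation (3): floor(x_i * 2^n)     */
    return n;                                          /* scale factor s = 2^(-n), Eq. (4)   */
}

int main(void) {
    const float x[] = { 0.42f, -0.617f, 0.137f, 0.487f };
    int16_t q[4];
    int n = quantize_vector(x, q, 4, 16);
    printf("n = %d, q[0] = %d (~%f)\n", n, q[0], ldexpf((float)q[0], -n));
    return 0;
}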

Two methods can be used to get the quantized weights of a deep neural network. These
methods are detailed in the following.

4.2. Post-Training Quantization


In post-training quantization, the neural network is entirely trained using floating-point
computation (a binary32 format is assumed here). Once the training is over, the neural network
is frozen, and the parameters are then quantized. The quantized neural network is then used
to perform the inference, without any adjustments of the parameters.

The quantization phase introduces a quantization error on each parameter as well as
on the input, thus leading to a quantization error on the activations. The accumulation of
quantization errors at the output of the neural network can cause the classifier to incorrectly
predict the class of the input data, creating an accuracy drop compared to the non-quantized
version. As the bit width of the values decreases, the quantization error increases, and the
resulting accuracy typically decreases as well. In some situations, a slight increase in the
quantization error can help the network generalize better over new data, inducing a slight
increase in the accuracy over test data.

4.3. Quantization-Aware Training


The objective of quantization-aware training (QAT) is to compensate for the quantization
error by training the deep neural network using the quantized version during the forward
pass. This should help mitigate the accuracy drop to some extent. The backpropagation still
relies on non-quantized values. To stabilize the learning phase with the quantized version,
and thus obtain better results on average, the DNN can be pre-trained using a floating-point
representation in order to initialize the parameters to sensible values.
In this work we decided to perform all the computations using a floating-point repre-
sentation. As shown in Figure 2, the inputs, weights and biases of each layer are quantized
(but kept in floating-point representation) before actually performing the layer’s computation.
The layer’s output is quantized after the computation, before reaching the next layer. The
quantization operation is done following the method presented in Section 4.1. During the
training phase, the range of values is reassessed each time to adjust the scale factor before
performing the layer’s computation. When doing inference only, the scale factor is frozen.

[Figure 2 depicts the quantization-aware training flow: the inputs (previous layer outputs) and
the layer parameters (weights and bias) have their scale factors updated (during training only)
and are quantized before the layer operation (convolution, fully connected, ...); the output scale
factor is then updated and the output is quantized before being passed to the next layer. The
forward pass is quantized, the backward pass is not, and the scale factor updates are skipped
during inference.]

Figure 2. Quantization-aware training.

In case of a convolutional neural network, the convolutional and fully connected layers
require a quantization-aware training for the weights. Please note that batch normalization lay-
ers also require quantization-aware training. However, as we do not use batch normalization
in our experiments, it has not been implemented. For max-pooling layers, quantization-aware-
training is not required as they do not have weights. Moreover, as max-pooling only consists
of an element-wise max, there is no need to quantize: inputs are already quantized from the
previous layer and the dynamic range cannot be expanded. Therefore, no quantization is
done on the max-pooling layers. It is similar for the ReLU activation which is considered
to be a separate layer. Conversely, the element-wise addition layer requires quantization. It
does not have trainable weights; however, the dynamic range of the output can increase after
adding two large numbers. Therefore, the same quantization process is applied to compute
the output scale factor.
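To make this rescaling concrete, the following sketch (illustrative names, not the generated
MicroAI code) adds two int16 fixed-point tensors that share the same input scale factor and
saturates the result to a coarser output scale:

#include <stdint.h>

/* Element-wise addition: widen to 32 bits, shift down to the output format, saturate.
 * shift = in_frac_bits - out_frac_bits, with out_frac_bits <= in_frac_bits so that the
 * larger dynamic range produced by the addition still fits in 16 bits. */
void add_fixed(const int16_t *a, const int16_t *b, int16_t *out, int len, int shift) {
    for (int i = 0; i < len; i++) {
        int32_t acc = (int32_t)a[i] + (int32_t)b[i];
        acc >>= shift;
        if (acc > INT16_MAX) acc = INT16_MAX;
        if (acc < INT16_MIN) acc = INT16_MIN;
        out[i] = (int16_t)acc;
    }
}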

5. Deployment of the Quantized Neural Network


After the network has been trained and optionally quantized, it is deployed onto a
microcontroller to perform the inference on the target platform. Deployment involves the
following phases:
• exporting the weights of the deep neural network and encoding them into a format
suitable for on-target inference,
• generating the inference program according to the topology of the deep neural network,
• compiling the inference program, and
• uploading the inference program with the weights onto the microcontroller’s ROM.

5.1. Existing Embedded AI Frameworks


Several embedded AI frameworks are already available. Among them, the most popular
ones are TensorFlow Lite for Microcontrollers [40] and STM32Cube.AI [41]. Other frameworks
stemming from research projects also exist and are discussed in the following.

5.1.1. TensorFlow Lite for Microcontrollers


TensorFlow Lite for Microcontrollers (or TFLite Micro) is a project derived from Tensor-
Flow Lite. Originally focused on deep neural network deployment on smartphones, it has been
made available for microcontrollers. TFLite Micro supports a wide range of operations [42],
enabling the deployment of a variety of deep neural networks, such as multi-layer percep-
trons and convolutional neural networks, including residual neural networks. Deep neural
networks are developed and trained using TensorFlow/Keras, and can then be deployed
semi-automatically onto a microcontroller.
TFLite Micro is intended to be generic enough to be deployed on any kind of 32-bit
microcontroller. The inference library is therefore portable, but it also means there is no
integration with specific microcontroller families and vendor tools. The trained deep neural
network (topology and weights) can be automatically converted to a format understandable
by the inference library, but there are no tools to generate and deploy the application code.
Moreover, the test application must be written by hand. Nevertheless, a template source code
for a few development boards (e.g., the SparkFun Edge) as well as a few demo applications
(e.g., keyword spotting) are available. Finally, TFLite Micro does not come with tools to
measure metrics such as the inference time or the RAM and ROM usage.
TFLite Micro supports computation in both floating point in binary32 format and fixed
point on 8-bit integers. The quantization technique uses a non-power-of-two scale factor, a
symmetric range for the weights and an asymmetric range for the activations. Biases are
quantized on 32-bit integers. Convolution operations can make use of a per-filter scale factor
and offset, while other operations use a per-tensor (i.e., per-layer) scale factor and offset [43,44].
There is no support for fixed point on 16-bit integers.
Inference with 8-bit integers can be accelerated using low-level optimizations pro-
vided by the CMSIS-NN [3] library from ARM. This library uses SIMD-like instructions
(from the ARMv7E-M instruction set architecture of Cortex-M4 cores) to perform two multiply–accumulate
(MACC) operations on 16-bit operands with a single 32-bit accumulator in one cycle.
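The pattern being accelerated can be written in portable C as the scalar loop below
(illustrative code; CMSIS-NN maps it to packed dual multiply-accumulate DSP instructions
such as SMLAD instead of two separate multiplications):

#include <stdint.h>

/* Dot product of two int16 vectors accumulated in a single int32. */
int32_t dot_q15(const int16_t *a, const int16_t *b, int len) {
    int32_t acc = 0;
    for (int i = 0; i + 1 < len; i += 2) {
        acc += (int32_t)a[i]     * (int32_t)b[i];      /* first 16-bit MACC  */
        acc += (int32_t)a[i + 1] * (int32_t)b[i + 1];  /* second 16-bit MACC */
    }
    if (len & 1)
        acc += (int32_t)a[len - 1] * (int32_t)b[len - 1];
    return acc;
}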
Although TFLite Micro is entirely free/open-source, the complexity of its software architecture
makes it quite difficult to manipulate and extend. This is a substantial drawback in a research
environment, and it also comes with additional overhead. The deep neural network topology
is deployed as a sort of microcode that is interpreted at runtime instead of being statically
compiled. This process makes it more difficult for the compiler to perform optimizations and
causes a larger memory usage.

5.1.2. STM32Cube.AI
STM32Cube.AI is a software suite from STMicroelectronics that enables the deployment
of deep neural networks onto their STM32 family of microcontrollers. STM32Cube.AI supports
deployment of trained deep neural network models from several frameworks including Keras
and TensorFlow Lite. A wide range of operations are supported [45], allowing the deployment
of several deep neural network architectures, such as multi-layer perceptron and convolutional
neural networks, including residual neural networks.
STM32Cube.AI is a software suite fully integrated with other STMicroelectronics develop-
ment and deployment tools such as STM32CubeMX, STM32CubeIDE and
STM32CubeProgrammer. This provides a very straightforward and easy to use flow. More-
over, a test application is included to evaluate the model on target with a real test dataset,
providing metrics on inference time, and ROM and RAM usage, without having to write a
single line of code.
Like TFLite Micro, STM32Cube.AI supports computations in floating-point binary32
format and fixed point on 8-bit integers. In fact, the quantization on 8-bit integers comes from
TFLite. There is no support for fixed point on 16-bit integers.
STM32Cube.AI also has an optimized inference engine that seems to be partially based
on CMSIS-NN. However, as the source code of the inference engine is not freely available, it is
not clear what is optimized and how.
The inference library is entirely proprietary/closed-source, therefore it is not possible to
manipulate and extend this library. This represents a major drawback in a research environ-
ment. It is also not possible to use STM32Cube.AI on microcontrollers which are not part of
the STMicroelectronics portfolio. The inference process and optimizations are not detailed,
but unlike TFLite Micro, the network topology is compiled into a set of function calls to the
closed-source library rather than being interpreted at runtime.

5.1.3. Other Frameworks


Some other frameworks have been developed as part of research projects. These frame-
works mainly focus on “classical” machine learning (SVM, Decision Tree, etc.), for example,
emlearn [46] and Micro-LM [47], or multi-layer perceptron, for example, Gravity [48] and
FANN-on-MCU [49]. These frameworks do not support convolutional neural networks with
residual connections. The microTVM project also targets microcontrollers, but at the time of
this work it seems less mature and popular than TensorFlow Lite for Microcontrollers and
STM32Cube.AI. It is, therefore, not studied in this work.

5.2. MicroAI: our Framework Proposition


As mentioned above, existing tools for quantized neural networks have some drawbacks
that motivated the development of our own framework. This framework addresses the
following issues:

• open-source frameworks do not support convolutional neural networks with non-sequential
topologies,
• frameworks that support convolutional neural networks are proprietary or too complex
to be modified and extended easily,
• other frameworks do not provide 16-bit quantization,
• some frameworks are dedicated to a limited family of hardware targets.
In this work, we aim at providing a framework that is easy to extend and modify, and
that allows for a complete pipeline from the neural network training to the deployment and
evaluation on the microcontroller. Additionally, this framework must provide a lightweight
runtime on the microcontroller to reduce the overhead. Finally, our objective is to achieve a
performance close to existing solutions.
Our framework, called MicroAI, is built in two parts:
1. A neural network training code that relies on Keras or PyTorch.
2. A conversion tool (called KerasCNN2C) that takes a trained Keras model and produces
portable C code for inference.
Both parts are written in Python since it is the most popular programming language to
build deep neural networks and it easily interfaces with existing frameworks, libraries and
tools.

5.3. MicroAI: General Flow


As seen in Figure 3, MicroAI provides an interface to automatically train, deploy and
evaluate an artificial neural network model. A configuration file written in TOML [50] is used
to describe the whole flow of an experiment:
• The number of iterations for the experiment (for statistical purposes)
• The dataset to use for training and evaluation
• The preprocessing steps to apply to the dataset
• The framework used for training
• The various model configurations to train, deploy and evaluate
• The configuration of the optimizer
• The post-processing steps to apply to the trained model
• The target configuration for deployment and evaluation
The three main steps, training, deployment and evaluation, are described in the following.
The commands used to trigger them are available in Appendix D.

[Figure 3 summarizes the flow: data acquisition and windowing, then float32 training, followed
either by int8 quantization-aware training with PyTorch and PyTorch to Keras model conversion,
or by int16 post-training quantization; C inference code is then generated from the Keras model,
deployed on the microcontroller and evaluated on the microcontroller.]

Figure 3. MicroAI general flow for neural network quantization and evaluation on embedded target.

5.4. MicroAI: Training


For the training phase, MicroAI is simply a wrapper around Keras or PyTorch.
A dataset requires an importation module to be loaded into an appropriate data model.
The training process expects a RawDataModel instance, which gathers the training and test sets.
This instance contains numpy arrays for the data and the labels. A higher-level data model
HARDataModel is also available for human activity recognition to process subjects and activities
more easily. This model is then converted to a RawDataModel using the DatamodelConverter
in the preprocessing phase. The preprocessing phase also includes features such as normaliza-
tion. Dataset importation modules for UCI-HAR, SMNIST and GTSRB (described in Section
6) are included and can easily be extended to new datasets.
To make use of a deep neural network architecture in the model configuration, it must be
first described according to the training framework API in use. The description of the model is
a template where parameters can be set by the user in the configuration file. MicroAI provides
the following neural network architectures for both Keras and PyTorch:
• MLP: a simple multi-layer perceptron with a configurable number of layers and neurons
per layer.
• CNN: a 1D or 2D convolutional neural network with configurable number of layers,
filters per layer, kernel and pool size, and number of neurons per fully connected layer
for the classifier
• ResNet: a 1D or 2D residual neural network (v1) with convolutional layers. The num-
ber of blocks and filters per layer, stride, kernel size, and optional BatchNorm can be
configured.
All architectures use ReLU activation units.
In the configuration file, several model settings can be described, each inside their own
[[model]] block. Each model will be trained sequentially. A common configuration for all
the models can be specified in a [model_template] block. Model configuration also includes
optimizer configuration and other parameters such as the batch size and the number of epochs.
Once the model is trained, some post-processing can be applied. It is for instance possible
to remove the SoftMax layer for Keras models with the RemoveKerasSoftmax module. This
layer is indeed useless when only inference is performed.
Even though it also performs model training, the quantization-aware training described
in Section 4.3 is also included for PyTorch as a post-processing step in the
QuantizationAwareTraining module. The actual training step before post-processing is seen
as a general training, before optionally performing post-training quantization or quantization-
aware training. The quantization-aware training can be seen as a fine-tuning on top of the
more general training (which can also be skipped if necessary). The quantization-aware
training does not actually convert the weights from a floating-point data type to an inte-
ger data type with fixed-point representation. This conversion is instead performed by the
KerasCNN2C conversion tool.
Support for additional learning frameworks can be added by creating a new class imple-
menting the LearningFramework interface and by supplying compatible model templates.

5.5. MicroAI: Deployment


MicroAI can deploy a trained model to perform inference on a target using either
STM32Cube.AI, TensorFlow Lite for Microcontrollers or our own tool, KerasCNN2C.
STM32Cube.AI can be used for all STM32 platforms, and support for the Nucleo-L452RE-
P with an STM32L452RE microcontroller is included. Support for other platforms using a
STM32 microcontroller can be added by providing a sample STM32CubeIDE project including
the X-CUBE-AI package. STM32Cube.AI does not support microcontrollers outside the STM32
family.

TensorFlow Lite for Microcontrollers is a portable library that can be included in any
project. Therefore, it could be used for any 32-bit microcontroller. However only integration
with the SparkFun Edge platform with an Ambiq Apollo3 microcontroller is included in our
framework so far.
Similarly, KerasCNN2C produces a portable library that can be included in any project.
So far, only integration with the Nucleo-L452RE-P and the SparkFun Edge boards has been
performed. Support for other platforms can be added by providing project files that call
the inference code and a module that interfaces with the build and deployment tools for
that platform.
Please note that none of these tools can take a trained PyTorch model as an input to
deploy onto a microcontroller. The trained PyTorch model must therefore be converted to
a Keras model prior to the deployment. Our framework provides a module to perform
semi-automatic conversion from a PyTorch model to a Keras model. A Keras model that
matches the structure of the PyTorch model must be programmed and the matching between
the PyTorch model and Keras model layer names also must be specified. The semi-automatic
conversion module can then automatically copy the weights from the PyTorch model to the
Keras model and export it for use by one of the deployment tools.

5.6. KerasCNN2C: Conversion Tool from Trained Keras Model to Portable C Code
KerasCNN2C is a tool that we developed to automatically generate, from a trained Keras
model exported as an HDF5 file, a C library for inference. It can also be used independently of
the MicroAI framework.
In this work, only 1D models are evaluated on target. Work is underway for full support
of 2D-model deployment. Training and quantization are already supported, therefore 2D
models are evaluated offline. Here are the supported layers so far:
• Add
• AveragePooling1D
• BatchNormalization
• Conv1D
• Dense
• Flatten
• MaxPooling1D
• ReLU
• SoftMax
• ZeroPadding1D
Layers can have multiple inputs such as the Add layer, thus allowing residual neural
networks (ResNet) to be built. Sequential convolutional neural networks or multi-layer
perceptron models are also supported.
The generated library exposes a function in the model.h header to run the inference
process with the following signature:
void cnn(
    const number_t input[MODEL_INPUT_CHANNELS][MODEL_INPUT_SAMPLES],
    output_layer_type output);

where number_t is the data type used during inference defined in the number.h header, and
MODEL_INPUT_CHANNELS and MODEL_INPUT_SAMPLES are the dimensions of the input defined
in the generated model.h header. The input and output arrays must be allocated by the caller.
The model inference function does not convert the input from floating-point to fixed-point
representation when using a fixed-point inference code. The caller must perform the conversion
before feeding the buffer to the model inference function (see Section 5.8).

To this aim, a floating-point number x_float can be converted to a fixed-point number
x_fixed with the following call:

x_fixed = clamp_to_number_t((long_number_t)floor(x_float * (1 << INPUT_SCALE_FACTOR)));

where long_number_t is a type twice the size of number_t and clamp_to_number_t saturates
and converts to number_t. Both are defined in the number.h header.
INPUT_SCALE_FACTOR is the scale factor for the first layer, defined in the model.h header.
The output array corresponds to the output of the model’s last layer, which is typically
a fully connected layer when solving a classification problem. If the purpose is to predict a
single class, the caller must find the index of the max element in the output array.
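Putting these pieces together, a caller could look like the sketch below. It is illustrative only:
cnn(), number_t, long_number_t, clamp_to_number_t, INPUT_SCALE_FACTOR and the MODEL_INPUT_* macros
come from the generated headers, whereas N_CLASSES and the assumption that the output buffer
behaves as an array of number_t with one element per class are ours:

#include <math.h>
#include "model.h"   /* generated: cnn(), MODEL_INPUT_*, INPUT_SCALE_FACTOR */
#include "number.h"  /* generated: number_t, long_number_t, clamp_to_number_t() */

#define N_CLASSES 6  /* assumed number of classes of the task at hand */

int predict(const float raw[MODEL_INPUT_CHANNELS][MODEL_INPUT_SAMPLES]) {
    static number_t input[MODEL_INPUT_CHANNELS][MODEL_INPUT_SAMPLES];
    static number_t output[N_CLASSES];

    /* Convert the floating-point input to the model's fixed-point format. */
    for (int c = 0; c < MODEL_INPUT_CHANNELS; c++)
        for (int s = 0; s < MODEL_INPUT_SAMPLES; s++)
            input[c][s] = clamp_to_number_t(
                (long_number_t)floor(raw[c][s] * (1 << INPUT_SCALE_FACTOR)));

    /* The cast only adds the inner const qualifier expected by the prototype. */
    cnn((const number_t (*)[MODEL_INPUT_SAMPLES])input, output);

    /* The predicted class is the index of the maximum output. */
    int best = 0;
    for (int i = 1; i < N_CLASSES; i++)
        if (output[i] > output[best]) best = i;
    return best;
}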

5.7. KerasCNN2C: Conversion Process


This tool first parses the model using the Keras API from TensorFlow 2.4 and generates
an internal representation of the topology (i.e., a graph), with each node corresponding to a
layer.
Then, a series of transformations is performed to produce a graph better suited for
deployment on a microcontroller:
• combine ZeroPadding1D layers (if they exist) with the next Conv1D layer,
• combine ReLU activation layers with the previous Conv1D, MaxPooling1D, Dense or
Add layer,
• convert BatchNorm [51] weights from the mean \mu, the variance V, the scale \gamma, the offset
\beta and the constant \epsilon to a multiplicand w and an addend b using the following formulas
(see the sketch after this list):

w = \gamma / \sigma    (5)

\sigma = \sqrt{V + \epsilon}    (6)

b = \beta - (\gamma \times \mu) / \sigma    (7)

so that the output of the BatchNorm layer can be computed as y = w \times x + b. It could be
folded into the previous convolutional layer, but this is not implemented yet.
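A direct transcription of Equations (5)-(7) into C (an illustrative sketch, not the KerasCNN2C
implementation) is:

#include <math.h>

/* Fold per-channel BatchNormalization parameters into a multiplicand w and an
 * addend b so that the layer output becomes y = w * x + b (Equations (5)-(7)). */
void batchnorm_fold(const float *gamma, const float *beta,
                    const float *mean, const float *variance,
                    float epsilon, float *w, float *b, int channels) {
    for (int c = 0; c < channels; c++) {
        float sigma = sqrtf(variance[c] + epsilon);  /* Equation (6) */
        w[c] = gamma[c] / sigma;                     /* Equation (5) */
        b[c] = beta[c] - gamma[c] * mean[c] / sigma; /* Equation (7) */
    }
}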
Then, for each node in the graph, the weights of the layer go through the quantization and
conversion module if the conversion to fixed-point representation is enabled. The C inference
function is generated from a Jinja2 [52] template file using the layer’s configuration. Similarly,
the layer’s weights are converted into a C array from a Jinja2 template file. Code generation
is used to avoid runtime overhead of an interpreter such as the one used in TensorFlow Lite
for Microcontrollers. Additionally, it allows the compiler to perform better optimizations.
In fact, the layer’s configuration is generated as constants or literals in the code, allowing
the compiler to perform appropriate optimizations such as loop unrolling, using immediates
when relevant and doing better register allocation. By default, GCC’s -Ofast optimization
level is enabled. Moreover, the code is written in a simple and straightforward way. So far, no
special effort has been made to further optimize the source code for faster execution.
The allocator module aims to reduce RAM usage. To do so, it allocates the layer’s
output buffers in the smallest number of pools without conflicts. For each layer of the model,
its output buffer is allocated to the first pool that satisfies two conditions: it must neither
overwrite its input, nor the output of a layer that has not already been consumed. If there
is no such available pool, a new one is created. It is worth noting that the allocator module
does not yet try to optimize the allocation to minimize the size of each pool (this is a harder
problem to solve). In consequence, the total RAM usage is not optimized.
Finally, the main function cnn(...) is generated. This function only contains the
allocation of the buffers done by the allocator module and a sequence of calls to each of the
layers’ inference functions. The correct input and output buffers are passed to each layer
according to the graph of the model.

5.8. KerasCNN2C: Quantization and Fixed-Point Computation


The post-training quantization is performed by the quantization module itself. The scale
factor for each layer is found according to the method in Section 4.1.4, but it can also be
specified manually for the whole network. The fixed-point coding used for all the weights is
computed according to this method as well, and the data type is converted from float to an
integer data type, such as int8_t for 8-bit quantization or int16_t for 16-bit quantization.
When doing quantization-aware training, the scale factors are found during the training
phase (also according to the method in Section 4.1.4). Therefore, the quantization module
reuses them. However, the weights are still in floating-point representation since the training
phase only relies on floating-point computation. In consequence, the quantization module
must perform a data type conversion similar to the one performed for post-training quantiza-
tion.
Once the model is deployed and running on the target, the fixed-point computation
can be done using a regular integer arithmetic and logic unit. The data type for the input
and output of a layer is the same as the one used to store the weights. To avoid overflows,
computation is done using a data type twice the width of the operands’ data type. For example,
if the data type of the weights and inputs is int16_t, then the intermediate results in a layer
are computed and stored with an int32_t data type. The result is then scaled back to the
correct output scale factor before saturating and converting it back to the original operands’
data type.
Before performing an addition or a subtraction, operands must be represented with the
same number of integer and fractional bits. This is not required for multiplication, but the
number of bits allocated for the fractional part of the result is the sum of the number of bits
for the fractional part of the two operands. Therefore, after a multiplication, the result must
be scaled to the required format by shifting the result to the right by the appropriate number
of bits.
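The inner loop of a quantized layer therefore follows the pattern sketched below (simplified,
illustrative code rather than the generated MicroAI kernels), here for 16-bit operands:

#include <stdint.h>

/* One output value of a fixed-point layer: int16 operands, int32 accumulator,
 * right shift down to the output scale, then saturation back to int16.
 * shift = (input fractional bits + weight fractional bits) - output fractional bits;
 * the bias is assumed to already be expressed in the accumulator's format. */
int16_t fixed_point_neuron(const int16_t *inputs, const int16_t *weights,
                           int32_t bias, int len, int shift) {
    int32_t acc = bias;
    for (int i = 0; i < len; i++)
        acc += (int32_t)inputs[i] * (int32_t)weights[i]; /* widened multiply-accumulate  */
    acc >>= shift;                                       /* rescale to the output format */
    if (acc > INT16_MAX) acc = INT16_MAX;                /* saturate to the operand width */
    if (acc < INT16_MIN) acc = INT16_MIN;
    return (int16_t)acc;
}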
In Appendix B, the number of operations required for the main layers of a residual neural
network in our implementation are provided, along with the number of cycles taken for these
operations. Enabling compiler optimizations generates some ARMv7E-M instructions, namely
SMLABB that performs a multiply–accumulate operation in one cycle (instead of two cycles).
However, the compiler does not make use of the SSAT operation that could allow saturating in
one cycle. Instead, it uses the same instructions as a regular max operation, that is, a compare
instruction and a conditional move instruction requiring a total of two cycles.

6. Results
All the results presented in this section rely on the same model architecture, a ResNetv1-6
network with the layers shown in Figure 4. The number of filters per layer f is the same for
all layers, but is modified to adjust the number of parameters of the model. The convolutional
and pooling layers are one-dimensional except when handling the GTSRB dataset, for which
they are two-dimensional.

[Block diagram: the input of dimensions (x, y, c) goes through 3 × 3 convolutions (stride 1, padding 1, f filters) with ReLU activations and 2 × 2 max-pooling; the shortcut branches use 1 × 1 convolutions (stride 1, no padding, f filters) and 2 × 2 max-pooling; a final max-pooling of size (x/2, y/2) is followed by Flatten and a FullyConnected layer with n_classes outputs.]
Figure 4. ResNet model architecture.

For each experiment, the residual neural network is initially trained using 32-bit floating-
point numbers (i.e., without quantization), and then evaluated over the testing set. This
baseline version is depicted as float32 in the following figures.

The float32 neural network is quantized for inference with fixed-point on 16-bit integers
and is then evaluated without additional training. This version is depicted as int16 in the
figures shown hereafter. Quantization is performed using the Q7.9 format for the whole
network, meaning the number of bits n for the fractional part is fixed to 9.
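As a reminder of this coding, a real value w is stored in Q7.9 as the 16-bit integer w_q = round(w × 2^9) and recovered as w ≈ w_q / 2^9; for instance, w = 0.5 is stored as w_q = 256.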
The float32 neural network is also trained and evaluated for inference with fixed-point
on 8-bit integers using quantization-aware training. This version is indicated as int8 in the
figures. In this case the fixed-point precision can vary from layer to layer and is determined
using the method introduced in Section 4.1.4.
The SGD optimizer is used for all experiments. The stability of the SGD optimizer
motivated this choice, especially for the quantization-aware training. Training parameters are
described below for each dataset. Additionally, training and testing sets are normalized using
the z-score of the training set. It is worth noting that Mixup [53] is also used during training.
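For clarity, the z-score normalization maps each input x to (x − µ)/σ, where µ and σ are the mean and standard deviation computed on the training set, while Mixup builds virtual training samples x̃ = λx_i + (1 − λ)x_j with labels ỹ = λy_i + (1 − λ)y_j, λ being drawn from a Beta(α, α) distribution [53].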
Accuracy is not evaluated directly on the target due to the amount of time it would
require. Only inference time for the UCI-HAR dataset is measured on the target.
In the figures, each point represents an average over 15 runs.

6.1. Evaluation of the MicroAI Quantization Method


6.1.1. Human Activity Recognition dataset (UCI-HAR)
The Human Activity Recognition dataset (UCI-HAR) [54], hosted by the University of California,
Irvine, contains recordings of activities of daily living captured with the accelerometer and
gyroscope sensors of a smartphone. In this experiment, we use the raw data from the sensors
divided into fixed time windows, rather than the precomputed features. The reason is that
we want to perform real-time embedded recognition. To do so, it is necessary to avoid the
overhead of computing the features for each inference before entering the deep neural network.
Instead, the features are extracted by the convolutional neural network itself.
The dataset is divided into a training set of 7352 vectors and a testing set of 2947 vectors.
Each vector is a one-dimensional time series of 2.56 s composed of 128 samples sampled
at 50 Hz, with 50% overlap between vectors. Each sample has 9 channels: 3 axes of total
acceleration, 3 axes of angular velocity and 3 axes of body acceleration. Six different classes
are available in the dataset: walking, walking upstairs, walking downstairs, sitting, standing
and lying.
The initial training without quantization is performed using a batch size of 64 over 300
epochs. The initial learning rate is set to 0.05, the momentum is set to 0.9 and the weight
decay is set to 5 × 10−4 . The learning rate is multiplied by 0.13 at epochs 100, 200 and 250.
The quantization-aware training for fixed-point on 8-bit integers uses the same parameters.
As can be seen in Figure 5, for the UCI-HAR dataset, the same accuracy is obtained using
a 16-bit quantization (UCI-HAR int16) or 32-bit floating-point (i.e., the baseline UCI-HAR
float32), whatever the number of filters per convolution.
On the other hand, we observe that the 8-bit quantization causes a drop in accuracy that
increases in magnitude up to 0.81% when the number of filters per convolution grows, even
though quantization-aware training is used to mitigate this issue.
In Figure 6, we observe that the accuracy obtained using 8-bit and 16-bit quantization
is similar only for deep neural networks exhibiting a reduced number of parameters, in
other words a low memory footprint. As an example, for 16 filters per convolution, an 8-bit
quantization leads to an accuracy of 92.41% while requiring 3958 memory bytes to store the
parameters. When a 16-bit quantization is used, an accuracy of 92.46% can be achieved, but at
the cost of an increase in the required memory for storing the parameters (7916 bytes).
As can be seen, when more than 24 filters per convolution are used, the 16-bit quantization
clearly exhibits the best accuracy vs. memory ratio. For more than 48 filters per convolution,
the 8-bit quantization provides an even worse ratio than the baseline.

Figure 5. Human Activity Recognition dataset (UCI-HAR): accuracy vs. filters.

Figure 6. Human Activity Recognition dataset (UCI-HAR): accuracy vs. parameter memory.

6.1.2. Spoken digits dataset (SMNIST)


Spoken MNIST is the spoken digits part of the written and spoken digits database for
multi-modal learning [55].
This dataset is made of spoken digits extracted from the Google Speech Commands
[56] dataset. The audio signal is preprocessed to obtain 12 MFCC plus an energy coefficient
using a window of 50 ms with 50% overlap over the audio files of approximately 1 s each,
generating one-dimensional series of 39 samples with 13 channels. The dataset is divided into
training and testing sets of 34,801 and 4107 vectors, respectively. Some samples are duplicated
to obtain 60,000 training vectors and 10,000 testing vectors. There are 10 different classes for
each digit, from 0 to 9.
The initial training, without quantization, uses a batch size of 256 over 120 epochs. The
initial learning rate is set to 0.05, the momentum is set to 0.9 and the weight decay is set to
5 × 10−4 . The learning rate is multiplied by 0.1 at epochs 40, 80 and 100.

The quantization-aware training for fixed-point on 8-bit integers uses a batch size of 1024
over 140 epochs. Initial learning rate, momentum and weight decay are the same as for the
initial training. Learning rate is multiplied by 0.1 at epochs 40, 80, 100 and 120.
As can be observed in Figure 7 and regardless of the number of filters, the 16-bit quantization
(SMNIST int16) provides overall a similar accuracy compared to the floating-point
baseline (SMNIST float32). On the other hand, the accuracy drops by up to 1.07% when the
8-bit quantization is used. However, the accuracy drop slightly decreases when 48 filters per
convolution are used, and then stays around 0.5% or 0.6% for a higher number of filters.
In Figure 8, we can see that the 16-bit quantization is still the best solution in terms of
memory footprint. Despite the fact that the 8-bit quantization stays closer to 16-bit quantization
on SMNIST than on UCI-HAR, the 8-bit quantization does not provide any benefit over 16-bit
quantization in terms of accuracy vs. memory ratio, even for small neural networks.

Figure 7. Spoken digits dataset (SMNIST): accuracy vs. filters.

Figure 8. Spoken digits dataset (SMNIST): accuracy vs. parameter memory.



6.1.3. The German Traffic Sign Recognition Benchmark (GTSRB)


The German Traffic Sign Recognition Benchmark (GTSRB [57]) is a dataset containing
various color pictures of road signs. Image sizes vary from 15 × 15 to 250 × 250 pixels.
In this experiment, the two-dimensional images were scaled to 32 × 32 pixels using bilinear
interpolation and anti-aliasing, while keeping the 3 color channels (red, green, blue). The
dataset is divided into training and testing sets of 39,209 and 12,630 vectors, respectively.
There are 43 different classes, one for each type of road sign in the dataset.
The initial training without quantization uses a batch size of 128 over 120 epochs. The
initial learning rate is set to 0.01, the momentum is set to 0.9 and the weight decay is set to
5 × 10−4 . The learning rate is multiplied by 0.1 at epochs 40, 80 and 100.
The quantization-aware training for fixed-point on 8-bit integers uses a batch size of 512
over 120 epochs. The initial learning rate, momentum and weight decay are the same as for
the initial training. The learning rate is multiplied by 0.1 at epochs 20, 60, 80 and 100.
The accuracy results obtained for 8- and 16-bit quantization and the 32-bit floating-point
versions are shown in Figure 9 for different numbers of filters. As can be seen, the 16-bit
quantization (GTSRB int16) provides an accuracy similar to the one obtained with the baseline
(GTSRB float32). In the meantime, a drop in accuracy of up to 1.1% can be observed when the
8-bit quantization is used with this GTSRB dataset. However, as was observed with the
SMNIST dataset, the accuracy approaches that of the baseline when the network has more
filters (a drop of only 0.33% for 64 filters).

Figure 9. German Traffic Sign Recognition Benchmark: accuracy vs. filters.

Moreover, even though the 8-bit quantization does not outperform the results obtained
with the 16-bit quantization, Figure 10 shows that the 8-bit quantization can represent an
interesting solution when a two-dimensional network is used on an image dataset.

Figure 10. German Traffic Sign Recognition Benchmark: accuracy vs. parameter memory.

6.2. Evaluation of Frameworks and Embedded Platforms


In our experiments, two different targets have been used to deploy a deep neural network
on a microcontroller: the SparkFun Edge and the Nucleo-L452RE-P. Both platforms are set to
run at 48 MHz on a 3.3 V supply and their main specifications are summarized in Table 3.

Table 3. Embedded platforms.

Board Nucleo-L452RE-P SparkFun Edge


MCU STM32L452RE Ambiq Apollo3
Core Cortex-M4F Cortex-M4F
Max Clock 80 MHz 48 MHz (96 MHz “Burst Mode”)
RAM 128 kiB 384 kiB
Flash 512 kiB 1024 kiB
CoreMark/MHz 3.42 2.479
Run current @3.3 V, 48 MHz 4.80 mA 0.82 mA *
* After removing peripherals (Mic1&2, accelerometer . . . )

VDD_MCU is set to 1.8 V for the Nucleo-L452RE-P platform and current measurement
is taken from the IDD jumper. It does not have any on-board peripherals. On the SparkFun
Edge board, the measure of the current is done using the power input pin of the board (after
the programmer). The built-in peripherals were unsoldered from the board to eliminate
their power consumption. The current consumption was measured using a Brymen BM857s
auto-ranging digital multimeter configured in max mode. The energy results are based on
this maximum observed current consumption and the supply voltage of 3.3 V.
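The per-inference energy figures reported hereafter therefore correspond to E = I_max × V × t_inf, with I_max the maximum observed current, V = 3.3 V and t_inf the measured inference time. For instance, an inference of 2087 ms at 0.82 mA on the SparkFun Edge board gives E = 0.82 mA × 3.3 V × (2087/3,600,000) h ≈ 1.57 µWh.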
As can be seen in Table 3, and even though both platforms are built around a Cortex-M4F
core running at the same frequency, thanks to its subthreshold operation the SparkFun Edge
board consumes considerably less power than the Nucleo-L452RE-P, while also having more
Flash and RAM memory. However, results obtained with the CoreMark benchmark show that
the Ambiq Apollo3 microcontroller is slower than the STM32L452RE. It is worth noting that
the CoreMark results have been measured on the Ambiq Apollo3 microcontroller, while they
have been taken from the datasheet for the STM32L452RE microcontroller.

The deep neural network used in our experiments is the residual neural network described
in Section 6. This network has been trained on the UCI-HAR dataset presented in
Section 6.1.1. Inference time is measured from 50 test vectors from the testing set of UCI-
HAR on the microcontrollers. TensorFlow Lite for Microcontrollers version 2.4.1 has been
used to deploy the deep neural network on the SparkFun Edge board, while STM32Cube.AI
version 5.2.0 has been used to deploy it on the Nucleo-L452RE-P board, both for the 32-bit
floating-point and fixed-point on 8-bit integers inference. Our framework is used to deploy
the deep neural network on both platforms for 32-bit floating-point, fixed-point on 16-bit
integers and fixed-point on 8-bit integer inference. It is worth noting that optimizations for the
Cortex-M4F provided by CMSIS-NN are enabled for both TensorFlow Lite for Microcontrollers
and STM32Cube.AI tools. Our framework does not make use of these optimizations yet. The
main characteristics of the frameworks are summarized in Table 4.

Table 4. Embedded AI frameworks.

Framework STM32Cube.AI TFLite Micro MicroAI


Source Keras, TFLite, . . . Keras, TFLite Keras, PyTorch *
Validation Integrated tools None Integrated tools
Metrics RAM/ROM footprint, inference time, MACC None ROM footprint, inference time
Portability STM32 only Any 32-bit MCU Any 32-bit MCU
Built-in platform support STM32 boards (Nucleo, . . . ) 32F746GDiscovery, SparkFun Edge, . . . SparkFun Edge, Nucleo-L452-RE-P
Sources Private Public Public
Data type float, int8_t float, int8_t float, int8_t, int16_t
Quantized data Weights, activations Weights, activations Weights, activations
Quantizer Uniform (from TFlite) Uniform Uniform
Quantized coding Offset and scale Offset and scale Fixed-point Qm.n
* PyTorch models must be semi-automatically converted to Keras model prior to deployment

To compare software and hardware platforms, only the results with 80 filters per convo-
lution are analyzed below. Nevertheless, results with less than 80 filters are still available in
the tables of Appendix E to highlight how fast and efficient a small deep neural network can
be when deployed on a constrained embedded target. They also highlight a higher overhead
for very small neural networks, especially for TensorFlow Lite for Microcontrollers compared
to our framework.
In Figure 11, we can observe that TFLite Micro has a higher overhead than
STM32Cube.AI, while MicroAI exhibits a slightly lower overhead than STM32Cube.AI. As
outlined in Table A4 of Appendix E, when the number of filters per convolution increases,
most of the ROM is used by the model’s weights.
The inference time obtained for both platforms and the different deployment tools is
illustrated in Figure 12. As can be seen, the STM32Cube.AI with the 8-bit inference provides
the best solution as it requires only 352 ms for one inference. In the same configuration,
TensorFlow Lite for Microcontrollers requires 592 ms for one inference. Finally, 1034 ms and
1003 ms are required for one inference using our framework on the Nucleo-L452RE-P board
and the SparkFun Edge board, respectively.
Figure 11. ROM footprint for TFLite Micro, STM32Cube.AI and MicroAI with 80 filters per convolution.

Figure 12. Inference time for 1 input for TFLite Micro, STM32Cube.AI and MicroAI with 80 filters per convolution.

When using fixed-point on 16-bit integers for the inference, our framework provides
approximately the same performance as with 8 bits. The reason is that the inference code is
the same: similar instructions are generated, and computations are performed using 32-bit
registers. On the Nucleo-L452RE-P, we can observe that the inference time for one input is
1223 ms, while it is only 1042 ms on the SparkFun Edge board. We assume this improvement
comes from differences in the memory subsystem around the core, especially the cache in
front of the Flash memory.
Figure 12 also shows that, whatever the tool and target, the 32-bit floating-point inference
is slower than with 16- or 8-bit quantization. We can also observe that our framework requires
1561 ms and 1512 ms for one inference on the SparkFun Edge and the Nucleo-L452RE-P boards,
respectively. The STM32Cube.AI requires 1387 ms for one inference on the Nucleo-L452RE-P
board. Our framework therefore exhibits a comparable performance to the STM32Cube.AI.
Finally, we can see that TensorFlow Lite for microcontrollers on the SparkFun Edge board
provides lower performance, requiring 2087 ms to perform one inference.
To conclude, and as outlined in Figure 13, we can say the SparkFun Edge board provides
the best power efficiency in all situations. The reason is that the SparkFun Edge board power
consumption is approximately 6 times lower than the Nucleo-L452RE-P. Using the SparkFun
Edge board and TensorFlow Lite for Microcontrollers with fixed-point on 8-bit integers, one
inference requires 0.45 µWh of energy consumption. In contrast, our framework requires
0.75 µWh and 0.78 µWh on the SparkFun Edge board for inference with fixed-point on 8-bit
and 16-bit integers, respectively. When 32-bit floating-point is used for inference on the
SparkFun Edge board, our framework provides a better energy efficiency than TensorFlow
Lite for Microcontrollers as it requires 1.17 µWh instead of 1.57 µWh.

Figure 13. Energy consumption for 1 input for TFLite Micro, STM32Cube.AI and MicroAI with 80 filters per convolution.

Concerning the energy consumed on the Nucleo-L452RE-P board, our framework requires
4.58 µWh, 5.42 µWh and 6.70 µWh for one inference using fixed-point on 8-bit integers, on
16-bit integers and 32-bit floating-point, respectively. In comparison, only 6.15 µWh of energy
is required for one inference when the STM32Cube.AI framework is used with 32-bit floating-
point. Finally, we can see that the required energy for one inference when using STM32Cube.AI
with fixed-point on 8-bit integers is 1.56 µWh on the Nucleo-L452RE-P. This amount of energy
is similar to the one obtained with TensorFlow Lite for Microcontrollers on the SparkFun Edge
board when performing floating-point inference.

7. Discussion
First, a high variance is observable when comparing the accuracy results obtained on the
three datasets as a function of the model size. This variability makes it difficult to draw
definitive conclusions. However, clear trends emerge from our results and provide some
insights into the performance achieved in each experiment.
As has been shown, execution using fixed-point on 8-bit and 16-bit integers provides a
significant decrease in the inference time, thus also reducing the average power consumption.
As power consumption is a key parameter in embedded systems, shorter inference times are
interesting as they make it possible either to reduce the microcontroller’s operating frequency
or to put the microcontroller in sleep mode for a longer period between two inferences. In
addition, execution using 8-bit and 16-bit integers also provides a significant reduction in
memory footprint. The memory required for the model parameters is divided by 4 and 2 for
8-bit and 16-bit quantization, respectively. It is worth noting that the RAM usage, which is
not illustrated here, is also reduced.
Our results also show that performing inference using quantization with fixed-point on
16-bit integers does not lead to a drop in accuracy, whatever test case is considered. Moreover,
inference using 16 bits does not require quantization-aware training to achieve such results.
As both the power consumption and the memory footprint can be decreased, fixed-point
quantization on 16-bit integers is therefore always preferable to 32-bit floating-point inference.
Conversely, 8-bit quantization does not provide a substantial improvement over 16-bit
quantization. Moreover, 8-bit quantization requires performing quantization-aware training.
It is worth noting that quantization-aware training for 8-bit quantization introduces more
variance in the results over the baseline, and is also more sensitive to a change in the training
parameters. As it is quite difficult to achieve a stable training, it is preferable to use an
optimizer such as SGD with conservative parameters, instead of optimizers such as Adam
or RAdam, to reduce the variance of the results, even though it means achieving a lower
maximum accuracy.
During our experiments, it was also observed that the 8-bit post-training quantization
of TensorFlow Lite achieved better results compared to the 8-bit quantization-aware training
provided by our framework. This is likely due to the combination of per-filter quantization,
asymmetric range and non-power-of-two scale factor, as well as optimizations of TensorFlow
Lite to avoid unnecessary truncation and thus loss of precision. We also observed that using
9 bits instead of 8 bits during the post-training quantization allows us to outperform the
TensorFlow Lite quantization performance. Some results showing this improvement are
available in Appendix C for the UCI-HAR dataset. From these results, we can conclude
that the slight additional precision brought by the combination of per-filter quantization,
asymmetric range and non-power-of-two scale factor does in fact matter. Implementing these
methods in our framework seems therefore required to reduce the accuracy loss of our 8-bit
quantization.
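For reference, this “offset and scale” coding (see Table 4) represents a real value r as r ≈ S × (q − Z), where q is the stored integer, S a floating-point scale factor and Z an integer offset (zero-point) [43,44]; per-filter quantization means that the convolution weights get one scale factor per output filter, and our Qm.n coding corresponds to the special case S = 2^−n and Z = 0.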
Another benefit of 8-bit quantization is that SIMD instructions can be used (with some
classes of microcontrollers) to improve the inference time and thus further reduce the power
consumption. Such instructions allow performing in a single cycle either 2 multiply–accumulate
operations with 16-bit operands and a common accumulator (SMLAD), or 2 additions of 16-bit
operands (QADD16), or a shift and saturation operation (SSAT). The SMLAD, QADD16 and
SSAT instructions are not yet used in our framework, but this work is in progress. Nonetheless,
a 16-bit quantization scheme can be used with our framework, which is not the case with
either TensorFlow Lite for Microcontrollers or STM32Cube.AI. As presented in the results, the
16-bit quantization from our framework provides a good compromise between accuracy, inference
time and memory footprint, without requiring additional work on quantization-aware
training.
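As a sketch of what such an optimization could look like (and not of what our framework currently generates), the loop below uses the CMSIS-Core __SMLAD intrinsic to perform two 16-bit multiply–accumulate operations per iteration on a Cortex-M4. It assumes a CMSIS device header is available, an even vector length and 32-bit-aligned buffers.

#include <stdint.h>
/* Assumes a CMSIS-Core device header providing __SMLAD (Cortex-M4 DSP extension). */

/* Dot product of two Qm.n vectors stored on 16 bits, two MACs per __SMLAD call.
   The 32-bit accumulator still has to be rescaled and saturated to the output
   format afterwards, as described in Section 5.8. */
static int32_t dot16_simd(const int16_t *a, const int16_t *b, uint32_t len)
{
    const uint32_t *pa = (const uint32_t *)a;  /* two packed 16-bit values per word */
    const uint32_t *pb = (const uint32_t *)b;
    int32_t acc = 0;
    for (uint32_t i = 0; i < len / 2U; i++) {
        acc = (int32_t)__SMLAD(pa[i], pb[i], (uint32_t)acc);
    }
    return acc;
}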
The results obtained on inference time clearly show that both the software and hardware
platforms have a substantial impact on energy efficiency. STM32Cube.AI offers the most
optimized inference engine in terms of execution time, both in floating-point and fixed-
point on integers. Our results show that TensorFlow Lite for Microcontrollers is slower
than STM32Cube.AI in both conditions. For the floating-point inference, our framework is
in between these two software platforms and is only slightly slower than STM32Cube.AI.
However, as optimizations using SIMD instructions have not been implemented yet in our
framework, inference using 8-bit integers still provides lower performance than TensorFlow
Lite for Microcontrollers and STM32Cube.AI.
Regardless of the software performance, running STM32Cube.AI on a Nucleo-L452RE-P
board is only competitive with inference using 8-bit integers when compared to TensorFlow
Lite for Microcontrollers in 32-bit floating-point inference on the SparkFun Edge board. The
reason is that the Ambiq Apollo3 microcontroller on the SparkFun Edge board is much more
energy efficient. In all remaining cases, running TensorFlow Lite for Microcontrollers or our
framework on the SparkFun Edge board provides much better energy efficiency figures than
running STM32Cube.AI or our framework on the Nucleo-L452RE-P board.

8. Conclusions
In this work, we presented a framework to perform quantization and then deployment
of deep neural networks on microcontrollers. This framework represents an alternative to
the STM32Cube.AI proprietary solution and TensorFlow Lite for Microcontrollers, an open-
source but complex environment. Inference time and energy efficiency measured on two
different embedded platforms demonstrated that our framework is a viable alternative to the
aforementioned solutions to perform deep neural network inference. Our framework also
introduces a fixed-point on 16-bit integer post-training quantization which is not available
with the two other frameworks. We have shown that this 16-bit fixed-point quantization
provides an improvement over a 32-bit floating-point inference, while being competitive with
fixed-point on 8-bit integer quantization-aware training. It provides a reduced inference time
compared to floating-point inference. Moreover, the memory footprint is divided by two
while keeping the same accuracy. The 8-bit quantization provides further improvements in
inference time and memory footprint but at the cost of a slight decrease in accuracy and a
more complex implementation.
Work is still in progress to implement some optimization techniques for fixed-point on
8-bit integer inference. Three optimizations are especially targeted: per-filter quantization,
asymmetric range and non-power-of-two scale factor. In addition, using SIMD instructions
in the inference engine should help further decrease the inference time. These optimizations
would therefore make our framework more competitive in terms of inference time and
accuracy. Another possible improvement for fixed-point on integers inference consists of
using 8-bit quantization for the weights and 16-bit quantization for the activations. TensorFlow
Lite for Microcontrollers is currently in the process of implementing this technique. Mixed
precision can indeed provide a way to reduce the memory footprint of layers that do not need a
high-precision representation (using 8 bits for weights and activations), while keeping a higher
precision (16-bit representation) for layers that need it. The CMix-NN [58] library already
provides an implementation of convolution functions for various data type configurations (in
2, 4 and 8 bits). To further improve power consumption and memory footprint, binary neural
networks can also be considered. However, to run them efficiently on microcontrollers, binary
neural networks would need to be implemented using bit-wise operations on 32-bit registers.
This way, as many as 32 computations could be performed in parallel.
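As a generic illustration of this bit-wise formulation (not an implementation from our framework), the dot product of two binary vectors whose ±1 values are packed as bits in 32-bit words reduces to one XOR and one population count per word; __builtin_popcount is the GCC/Clang builtin.

#include <stdint.h>

/* Binary dot product: each bit encodes a value in {-1, +1} (bit set = +1).
   Within a 32-bit word, equal bits contribute +1 and differing bits -1, hence
   the partial dot product is 32 - 2 * popcount(a XOR b). */
static int32_t binary_dot(const uint32_t *a, const uint32_t *b, uint32_t n_words)
{
    int32_t dot = 0;
    for (uint32_t i = 0; i < n_words; i++) {
        dot += 32 - 2 * (int32_t)__builtin_popcount(a[i] ^ b[i]);
    }
    return dot;
}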
Apart from quantization, other techniques can be used to improve the execution of deep
neural networks on embedded targets. One of these techniques is the big/LITTLE DNN
approach [59] where the inference is first done on a very small deep neural network. Then, if
the confidence is too low, inference is done using a larger deep neural network to reduce the
confusion of the classification task. This technique allows a fast inference response time for
most inputs, thus lowering the power consumption. In fact, it has been shown that the set
of inputs that are difficult to classify and so require running the bigger deep neural network
is small. However, this approach does not lower the memory footprint. Other techniques
such as pruning can also be used to obtain a smaller deep neural network while keeping
the same accuracy. When structured pruning [60] is used, for instance, entire filters are
completely removed from the convolutional neural network model. This reduces both the
memory footprint and the power consumption. Finally, other optimization techniques also
consider new neural network architectures. One can cite for example the recently published
MCUNet [2] framework with its TinyNAS tool that aims to identify the neural network model
that will best perform on the target.
Future work will also be dedicated to the deployment of neural network architectures
on FPGA using high-level synthesis tools such as Vivado. In fact, a feasibility study has
already been performed and has shown that our framework can be also used for deployment
on FPGA. Moreover, work is in progress to natively support automatic PyTorch deployment.
To do so, the features provided by the torch.fx module of the newly released PyTorch 1.8.0
are used.
Finally, we are currently working on a real application of our framework that consists of
integrating artificial intelligence into smart glasses [61] to perform, among other tasks, human
activity recognition in the context of elder care. Preliminary results have been published in [6].

Supplementary Materials: The open-source MicroAI software framework [8] is available online at
https://bitbucket.org/edge-team-leat/microai_public.
Author Contributions: Investigation, P.N.; methodology, P.N. and G.B.H.; software, P.N. and G.B.H.;
supervision, A.P., B.M. and V.G.; writing—original draft preparation, P.N.; writing—review and editing,
G.B.H., A.P., B.M., V.G. All authors have read and agreed to the published version of the manuscript.
Funding: This research is funded by “Université Côte d’Azur”, “CNRS”, “Région Sud Provence-Alpes-
Côte d’Azur, France” and “Ellcie Healthy”.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design
of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in
the decision to publish the results.

Appendix A. Comparison of the Inference Times of a Microcontroller, a CPU and a GPU

Table A1. Microcontroller (STM32L452RE), CPU (Intel Core i7-8850H) and GPU (NVidia Quadro
P2000M) platforms. Power consumption figures for the GPU and the CPU are the TDP values from the
manufacturer and do not reflect the exact power consumption of the device.

Platform Model Framework Power Consumption


MCU STM32L452RE STM32Cube.AI 0.016 W
CPU Intel Core i7-8850H TensorFlow 45 W
GPU NVidia Quadro P2000M TensorFlow 50 W

Table A2. Comparison of 32-bit floating-point inference time for a single input on a microcontroller, a
CPU and a GPU. The neural network architecture is described in Section 6 with the number of filters per
convolution layer varying from 16 to 80, and the dataset is described in Section 6.1.1. For the CPU and
the GPU, the inference batch size is set to 512 and the dataset is repeated 104 times to try to compensate
for the large startup overhead compared to the total inference time. Measurements are averaged over at
least 5 runs.

Inference Time (ms)
Platform 16 Filters 24 Filters 32 Filters 40 Filters 48 Filters 64 Filters 80 Filters
MCU 85 174 271 404 544 921 1387
CPU 0.0396 0.0552 0.0720 0.0937 0.1134 0.1538 0.2046
GPU 0.0227 0.0197 0.0223 0.0284 0.0317 0.0395 0.0515

Appendix B. Number of Integer ALU Operations for a Fixed-Point Residual Neural Network

Table A3. Number of arithmetic and logic operations with fixed-point on integers inference for the main
layers of a residual neural network with f the number of filters (output channels), s the number of input
samples, c the number of input channels, k the kernel size, n the number of neurons and i the number of
input layers to the residual Add layer. Conv1D is assumed to be without padding and with a stride of 1.

MACC (1 Cycle) Add (1 Cycle) Shift (1 Cycle) Max/Saturate (2 Cycles)


Conv1D f ×s×c×k N/A 2× f ×s f ×s
ReLU N/A N/A N/A c×s
Maxpool N/A N/A N/A c×s×k
Add N/A s × c × ( i − 1) s×c×i c×s
FullyConnected n×s N/A 2×n n

Appendix C. Comparison of TensorFlow Lite for Microcontrollers and MicroAI Quantizations

Figure A1. Accuracy vs. filters for baseline (float32), 8-bit post-training quantization from TensorFlow
Lite (int8 TFLite PTQ), 8-bit quantization-aware training from our framework (int8 MicroAI QAT), and
9-bit post-training quantization from our framework (int9 MicroAI PTQ). The neural network architecture
is described in Section 6 with the number of filters per convolution layer varying from 32 to 48, and the
dataset is described in Section 6.1.1.

Appendix D. MicroAI Commands to Run for Automatic Training and Deployment of Deep Neural Networks
Data can be preprocessed (e.g., to apply normalization) from the source dataset and
serialized to an intermediate dataset file with the following command:
microai <config.toml> preprocess_data

The training phase is started by running the following command:


microai <config.toml> train

Before being deployed and evaluated, the appropriate code must be generated and built
for the targeted platform by running the following command:
microai <config.toml> prepare_deploy

Once the binaries are generated, they can be deployed, and the model can be evaluated
on the target by running the following command:
microai <config.toml> deploy_and_evaluate

Appendix E. Detailed Results of the Evaluation of Frameworks and Embedded Platforms

Table A4. ROM footprint vs. filters for TFLite Micro, STM32Cube.AI and MicroAI.

ROM Footprint (kiB)


Framework Target Data Type 16 Filters 24 Filters 32 Filters 40 Filters 48 Filters 64 Filters 80 Filters
TFLiteMicro SparkFunEdge float32 116.520 133.988 157.957 188.426 225.395 318.926 438.363
MicroAI SparkFunEdge float32 54.316 67.066 91.035 121.512 158.473 251.863 371.332
MicroAI NucleoL452REP float32 55.770 68.145 92.129 122.582 159.559 253.004 372.434
STM32Cube.AI NucleoL452REP float32 61.965 79.449 103.410 133.898 170.859 264.289 383.742
MicroAI SparkFunEdge int16 46.952 50.629 62.629 77.832 96.355 142.973 202.699
MicroAI NucleoL452REP int16 48.129 51.629 63.613 78.855 97.340 144.051 203.770
TFLiteMicro SparkFunEdge int8 111.051 117.066 124.691 133.957 144.832 171.473 204.613
MicroAI SparkFunEdge int8 43.256 42.249 48.229 55.854 65.089 88.343 118.202
MicroAI NucleoL452REP int8 45.038 43.474 49.464 57.078 66.322 89.683 119.541
STM32Cube.AI NucleoL452REP int8 72.742 77.746 84.336 92.582 102.430 126.996 158.098

Table A5. Inference time for one input vs. filters for TFLite Micro, STM32Cube.AI and MicroAI.

Response Time (ms)


Framework Target Data Type 16 Filters 24 Filters 32 Filters 40 Filters 48 Filters 64 Filters 80 Filters
TFLiteMicro SparkFunEdge float32 179.633 294.157 438.541 624.172 860.835 1406.945 2087.241
MicroAI SparkFunEdge float32 53.247 153.732 259.212 394.494 569.852 1017.118 1561.264
MicroAI NucleoL452REP float32 55.762 152.426 259.160 395.721 559.249 976.732 1512.143
STM32Cube.AI NucleoL452REP float32 85.359 174.082 271.362 403.898 544.406 921.646 1387.083
MicroAI SparkFunEdge int16 40.867 113.035 191.439 287.655 389.450 667.547 1041.617
MicroAI NucleoL452REP int16 44.915 120.308 205.499 318.310 459.880 796.310 1223.513
TFLiteMicro SparkFunEdge int8 92.529 130.760 172.673 225.092 280.942 418.198 591.785
MicroAI SparkFunEdge int8 39.417 101.704 172.551 259.830 375.840 658.441 1003.365
MicroAI NucleoL452REP int8 43.003 107.705 180.830 272.986 383.761 659.996 1034.033
STM32Cube.AI NucleoL452REP int8 32.297 53.871 80.388 111.635 146.022 242.002 352.079

Table A6. Energy consumption for 1 input vs. filters for TFLite Micro, STM32Cube.AI and MicroAI.

Energy (µWh)
Framework Target Data Type 16 Filters 24 Filters 32 Filters 40 Filters 48 Filters 64 Filters 80 Filters
TFLiteMicro SparkFunEdge float32 0.135 0.221 0.330 0.469 0.647 1.058 1.569
MicroAI SparkFunEdge float32 0.040 0.116 0.195 0.297 0.428 0.765 1.174
MicroAI NucleoL452REP float32 0.247 0.675 1.148 1.753 2.478 4.327 6.700
STM32Cube.AI NucleoL452REP float32 0.378 0.771 1.202 1.789 2.412 4.083 6.146
MicroAI SparkFunEdge int16 0.031 0.085 0.144 0.216 0.293 0.502 0.783
MicroAI NucleoL452REP int16 0.199 0.533 0.910 1.410 2.038 3.528 5.421
TFLiteMicro SparkFunEdge int8 0.070 0.098 0.130 0.169 0.211 0.314 0.445
MicroAI SparkFunEdge int8 0.030 0.076 0.130 0.195 0.283 0.495 0.754
MicroAI NucleoL452REP int8 0.191 0.477 0.801 1.209 1.700 2.924 4.581
STM32Cube.AI NucleoL452REP int8 0.143 0.239 0.356 0.495 0.647 1.072 1.560

References
1. Wang, Y.; Wei, G.; Brooks, D. Benchmarking TPU, GPU, and CPU Platforms for Deep Learning. arXiv 2019, arXiv:1907.10701.
2. Lin, J.; Chen, W.M.; Lin, Y.; Cohn, J.; Gan, C.; Han, S. MCUNet: Tiny Deep Learning on IoT Devices. In Proceedings of the 34th
Conference on Neural Information Processing Systems (NeurIPS 2020), Online, 6–12 December 2020.
3. Lai, L.; Suda, N. Enabling Deep Learning at the IoT Edge. In Proceedings of the International Conference on Computer-Aided
Design (ICCAD’18), San Diego, CA, USA, 5–8 November 2018; Association for Computing Machinery: New York, NY, USA, 2018;
doi:10.1145/3240765.3243473.
4. Kromes, R.; Russo, A.; Miramond, B.; Verdier, F. Energy consumption minimization on LoRaWAN sensor network by using an
Artificial Neural Network based application. In Proceedings of the 2019 IEEE Sensors Applications Symposium (SAS), Sophia
Antipolis, France, 11–13 March 2019; pp. 1–6, doi:10.1109/SAS.2019.8705992.
5. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International
Conference on Machine Learning (PMLR 2019), Long Beach, CA, USA, 9–15 June 2019; Chaudhuri, K., Salakhutdinov, R., Eds.;
Volume 97, pp. 6105–6114.
6. Novac, P.E.; Russo, A.; Miramond, B.; Pegatoquet, A.; Verdier, F.; Castagnetti, A. Toward unsupervised Human Activity Recognition
on Microcontroller Units. In Proceedings of the 2020 23rd Euromicro Conference on Digital System Design (DSD), 2020, Kranj,
Slovenia, 26–28 August 2020; pp. 542–550, doi:10.1109/DSD51259.2020.00090.
7. Pimentel, J.J.; Bohnenstiehl, B.; Baas, B.M. Hybrid Hardware/Software Floating-Point Implementations for Optimized Area and
Throughput Tradeoffs. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2017, 25, 100–113, doi:10.1109/TVLSI.2016.2580142.
8. Novac, P.E.; Pegatoquet, A.; Miramond, B. MicroAI, a software framework for end-to-end deep neural networks training, quantization
and deployment onto embedded devices. Version 1.0. 2021. doi:10.5281/zenodo.5507397.
9. Choi, J.; Chuang, P.I.J.; Wang, Z.; Venkataramani, S.; Srinivasan, V.; Gopalakrishnan, K. Bridging the accuracy gap for 2-bit quantized
neural networks (qnn). arXiv 2018, arXiv:1807.06964.
10. Esser, S.K.; McKinstry, J.L.; Bablani, D.; Appuswamy, R.; Modha, D.S. Learned step size quantization. arXiv 2019, arXiv:1902.08153.
11. Nikolić, M.; Hacene, G.B.; Bannon, C.; Lascorz, A.D.; Courbariaux, M.; Bengio, Y.; Gripon, V.; Moshovos, A. Bitpruning: Learning
bitlengths for aggressive and accurate quantization. arXiv 2020 arXiv:2002.03090.
12. Uhlich, S.; Mauch, L.; Yoshiyama, K.; Cardinaux, F.; Garcia, J.A.; Tiedemann, S.; Kemp, T.; Nakamura, A. Differentiable quantization
of deep neural networks. arXiv 2019, arXiv:1905.11452.
13. Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Binarized Neural Networks. In Advances in Neural Information
Processing Systems; Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: 2016, Barcelona, Spain,
5–10 December 2016; Volume 29.
14. Rastegari, M.; Ordonez, V.; Redmon, J.; Farhadi, A. Xnor-net: Imagenet classification using binary convolutional neural networks. In
Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Cham,
Switzerland, 2016; pp. 525–542.
15. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient
Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
16. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer
parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360.
17. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceed-
ings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255,
doi:10.1109/CVPR.2009.5206848.
18. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778, doi:10.1109/CVPR.2016.90.
19. Han, S.; Pool, J.; Tran, J.; Dally, W. Learning both weights and connections for efficient neural network. In Proceedings of the 28th
International Conference on Neural Information Processing Systems - Volume 1, Montreal, Canada, 7–10 December 2015; MIT Press:
Cambridge, MA, USA, 2015; pp. 1135–1143.
20. Yamamoto, K.; Maeno, K. PCAS: Pruning Channels with Attention Statistics. arXiv 2018, arXiv:1806.05382.
21. Hacene, G.B.; Lassance, C.; Gripon, V.; Courbariaux, M.; Bengio, Y. Attention based pruning for shift networks. arXiv 2019,
arXiv:1905.12300.
22. Ramakrishnan, R.K.; Sari, E.; Nia, V.P. Differentiable Mask for Pruning Convolutional and Recurrent Networks. In Proceedings of the
2020 17th Conference on Computer and Robot Vision (CRV), Ottawa, ON, Canada, 13–15 May 2020; pp. 222–229.
23. He, Y.; Ding, Y.; Liu, P.; Zhu, L.; Zhang, H.; Yang, Y. Learning Filter Pruning Criteria for Deep Convolutional Neural Networks
Acceleration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19
June 2020; pp. 2009–2018.
24. Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman
coding. arXiv 2015, arXiv:1510.00149.

25. Fard, M.M.; Thonet, T.; Gaussier, E. Deep k-means: Jointly clustering with k-means and learning representations. Pattern Recognit.
Lett. 2020, 138, 185–192.
26. Cardinaux, F.; Uhlich, S.; Yoshiyama, K.; García, J.A.; Mauch, L.; Tiedemann, S.; Kemp, T.; Nakamura, A. Iteratively training look-up
tables for network quantization. IEEE J. Sel. Top. Signal Process. 2020, 14, 860–870.
27. He, Z.; Fan, D. Simultaneously Optimizing Weight and Quantizer of Ternary Neural Network Using Truncated Gaussian Approxima-
tion. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA,
16–20 June 2019; pp. 11430–11438, doi:10.1109/CVPR.2019.01170.
28. Lee, E.; Hwang, Y. Layer-Wise Network Compression Using Gaussian Mixture Model. Electronics 2021, 10, 72,
doi:10.3390/electronics10010072.
29. Vogel, S.; Raghunath, R.B.; Guntoro, A.; Van Laerhoven, K.; Ascheid, G. Bit-Shift-Based Accelerator for CNNs with Selectable
Accuracy and Throughput. In Proceedings of the 2019 22nd Euromicro Conference on Digital System Design (DSD), Kallithea, Greece,
28–30 August 2019; pp. 663–667, doi:10.1109/DSD.2019.00106.
30. Courbariaux, M.; Bengio, Y.; David, J.P. Training deep neural networks with low precision multiplications. arXiv 2015. arXiv:1412.7024.
31. Holt, J.L.; Baker, T.E. Back propagation simulations using limited precision calculations. In Proceedings of the IJCNN-
91-Seattle International Joint Conference on Neural Networks, Seattle, WA, USA, 8–12 July 1991; Volume ii, pp. 121–126,
doi:10.1109/IJCNN.1991.155324.
32. Vanhoucke, V.; Senior, A.; Mao, M.Z. Improving the speed of neural networks on CPUs. In Proceedings of the Deep Learning and
Unsupervised Feature Learning Workshop (NIPS 2011), Granada, Spain, 12-17 December 2011
33. Garofalo, A.; Tagliavini, G.; Conti, F.; Rossi, D.; Benini, L. XpulpNN: Accelerating Quantized Neural Networks on RISC-V Processors
Through ISA Extensions. In Proceedings of the 2020 Design, Automation & Test in Europe Conference & Exhibition, DATE 2020,
Grenoble, France, 9–13 March 2020; IEEE: New York, NY, USA, 2020; pp. 186–191, doi:10.23919/DATE48585.2020.9116529.
34. Cotton, N.J.; Wilamowski, B.M.; Dundar, G. A Neural Network Implementation on an Inexpensive Eight Bit Microcontroller.
In Proceedings of the 2008 International Conference on Intelligent Engineering Systems, Miami, FL, USA, 25–29 February 2008;
pp. 109–114, doi:10.1109/INES.2008.4481278.
35. Nair, V.; Hinton, G.E. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International
Conference on International Conference on Machine Learning (ICML’10), Haifa, Israel, 21–24 June 2010; Omnipress: Madison, WI,
USA, 2010; pp. 807–814.
36. Zhang, Y.; Suda, N.; Lai, L.; Chandra, V. Hello Edge: Keyword Spotting on Microcontrollers. arXiv 2018, arXiv:1711.07128.
37. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019 (Revision of IEEE 754-2008); IEEE: Piscataway, NJ, USA, 2019; pp.1–84,
doi:10.1109/IEEESTD.2019.8766229.
38. Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; Venkatesh, G.; et al.
Mixed Precision Training. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada, 30
April–3 May 2018.
39. ARM. ARM Developer Suite AXD and armsd Debuggers Guide, 4.7.9 Q-Format; ARM DUI 0066D Version 1.2; Arm Ltd.: Cambridge, UK,
2001.
40. David, R.; Duke, J.; Jain, A.; Reddi, V.; Jeffries, N.; Li, J.; Kreeger, N.; Nappier, I.; Natraj, M.; Regev, S.; et al. TensorFlow Lite Micro:
Embedded Machine Learning on TinyML Systems. arXiv 2020, arXiv:2010.08678.
41. STMicroelectronics. STM32Cube.AI. Available online: https://www.st.com/content/st_com/en/stm32-ann.html (accessed on 19
March 2021).
42. Google. TensorFlow Lite for Microcontrollers Supported Operations. Available online: https://github.com/tensorflow/tensorflow/
blob/master/tensorflow/lite/micro/kernels/micro_ops.h (accessed on 22 March 2021).
43. Google. TensorFlow Lite 8-Bit Quantization Specification. Available online: https://www.tensorflow.org/lite/performance/
quantization_spec (accessed on 19 March 2021).
44. Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and Training of Neural
Networks for Efficient Integer-Arithmetic-Only Inference. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and
Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2704–2713, doi:10.1109/CVPR.2018.00286.
45. STMicroelectronics. Supported Deep Learning toolboxes and layers, Documentation embedded in X-CUBE-AI Expansion Package
5.2.0, 2020. Available online: https://www.st.com/en/embedded-software/x-cube-ai.html (accessed on 19 March 2021).
46. Nordby, J. emlearn: Machine Learning inference engine for Microcontrollers and Embedded Devices. 2019. Available online:
https://doi.org/10.5281/zenodo.2589394 (accessed on 18 February 2021).
47. Sakr, F.; Bellotti, F.; Berta, R.; De Gloria, A. Machine Learning on Mainstream Microcontrollers. Sensors 2020, 20, 2638,
doi:10.3390/s20092638.
48. Givargis, T. Gravity: An Artificial Neural Network Compiler for Embedded Applications. In Proceedings of the 26th Asia and South
Pacific Design Automation Conference (ASPDAC’21), Tokyo, Japan, 18–21 January 2021; Association for Computing Machinery: New
York, NY, USA, 2021; pp. 715–721, doi:10.1145/3394885.3431514.

49. Wang, X.; Magno, M.; Cavigelli, L.; Benini, L. FANN-on-MCU: An Open-Source Toolkit for Energy-Efficient Neural Network Inference
at the Edge of the Internet of Things. IEEE Internet Things J. 2020, 7, 4403–4417.
50. Tom’s Obvious Minimal Language. Available online: https://toml.io/ (accessed on 19 March 2021).
51. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings
of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; Bach, F., Blei, D., Eds.; PMLR: Lille, France,
2015; Volume 37, pp. 448–456.
52. Jinja2. Available online: https://palletsprojects.com/p/jinja/ (accessed on 19 March 2021).
53. Zhang, H.; Cissé, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. In Proceedings of the 6th
International Conference on Learning Representations, Vancouver, Canada, 30 April–3 May 2018.
54. Davide, A.; Alessandro, G.; Luca, O.; Xavier, P.; Jorge, L.R.O. A Public Domain Dataset for Human Activity Recognition using
Smartphones. In Proceedings of the ESANN, Bruges, Belgium, 24–26 April 2013.
55. Khacef, L.; Rodriguez, L.; Miramond, B. Written and Spoken Digits Database for Multimodal Learning. 2019. Available online:
https://doi.org/10.5281/zenodo.3515935 (accessed on 18 February 2021).
56. Warden, P. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv 2018, arXiv:1804.03209.
57. Stallkamp, J.; Schlipsing, M.; Salmen, J.; Igel, C. The German Traffic Sign Recognition Benchmark: A multi-class classification
competition. In Proceedings of the 2011 International Joint Conference on Neural Networks, San Jose, CA, USA, 31 July–5 August
2011; pp. 1453–1460, doi:10.1109/IJCNN.2011.6033395.
58. Capotondi, A.; Rusci, M.; Fariselli, M.; Benini, L. CMix-NN: Mixed Low-Precision CNN Library for Memory-Constrained Edge
Devices. IEEE Trans. Circuits Syst. II Express Briefs 2020, 67, 871–875, doi:10.1109/TCSII.2020.2983648.
59. Park, E.; Kim, D.; Kim, S.; Kim, Y.; Kim, G.; Yoon, S.; Yoo, S. Big/little deep neural network for ultra low power inference.
In Proceedings of the 2015 International Conference on Hardware/Software Codesign and System Synthesis (CODES + ISSS),
Amsterdam, The Netherlands, 4–9 October 2015; pp. 124–132, doi:10.1109/CODESISSS.2015.7331375.
60. Anwar, S.; Hwang, K.; Sung, W. Structured Pruning of Deep Convolutional Neural Networks. J. Emerg. Technol. Comput. Syst. 2017,
13, 1–18, doi:10.1145/3005348.
61. Arcaya-Jordan, A.; Pegatoquet, A.; Castagnetti, A. Smart Connected Glasses for Drowsiness Detection: a System-Level Modeling
Approach. In Proceedings of the 2019 IEEE Sensors Applications Symposium (SAS), Sophia Antipolis, France, 11–13 March 2019; pp.
1–6, doi:10.1109/SAS.2019.8706022.
