Neural Network for Real-Time Object Detection on FPGA
Conference Paper · May 2021
DOI: 10.1109/ICIEAM51226.2021.9446384
 All content following this page was uploaded by Edward Rzaev on 17 October 2021.
               2021 International Conference on Industrial Engineering, Applications and Manufacturing (ICIEAM)
          Neural Network for Real-Time Object Detection on FPGA
              Edward Rzaev                                Anton Khanaev                              Aleksandr Amerikanov
             HSE University                               HSE University                                HSE University
        Moscow, Russian Federation                   Moscow, Russian Federation                    Moscow, Russian Federation
           errzaev@edu.hse.ru                          askhanaev@edu.hse.ru                           aamerikanov@hse.ru
Abstract: Object detection is one of the most active research and application areas of neural networks. In this article we combine FPGA and neural network technologies to solve the problem of real-time object recognition. The article discusses the integration of the YOLOv3 neural network on the DE10-Nano FPGA board. The main metrics (mAP, FPS, inference time) obtained on the DE10-Nano are slightly worse than those of more expensive GPU-based solutions, but this is offset by the lower cost and smaller dimensions of the FPGA board. Based on a study of various methods for converting neural networks to FPGA, we conclude that this architecture is applicable to detecting objects in a video stream in real time.

Keywords: FPGA, neural networks, YOLOv3, object detection, object recognition, CNN

                        I.   INTRODUCTION

The aim of this work is to build a smart system that works in real time and is able to analyze the surrounding space. Using this project as an example, we want to demonstrate the possibilities of using FPGAs for processing a video data stream. We offer a lightweight neural network in an HDL implementation, which can be used to solve a wide range of tasks, for example, to detect and recognize people, animals, vehicles and other objects with a computer vision algorithm. To solve the problem, we use a board with a chip of the Cyclone V family (DE10-Nano). The FPGA device makes it possible to parallelize all the necessary calculations, thereby fully utilizing its hardware resources [1]. The FPGA also combines low power consumption with high performance. Because of this, FPGAs are an excellent tool for solving these kinds of problems in embedded systems.

Significant advantages of the DE10-Nano are its low price and the presence of an ARM core, which reduces the development time of the project because peripherals can be connected and the board controlled at a higher level. Thus, most of the time can be devoted directly to the development and testing of the neural network. In addition, the DE10-Nano consumes significantly less power than, for example, Nvidia video cards, which are inferior to FPGAs in terms of computing power per unit of electricity.

For processing images in real time, a computing base was chosen that offers low power consumption and high data-processing speed, which makes it possible to run a neural network. The relatively small size of the DE10-Nano board also makes it possible to integrate it into various embedded systems.

The developed neural network is able to determine the boundaries of identified objects. Because the HPS core can implement powerful data processing algorithms locally and parallelize their execution at the hardware level, it was decided, for stable operation of the robot, to create a server that performs this computer vision task. This also makes it possible to combine several neural networks on one platform, for example, to use the detector in conjunction with a neural network that recognizes speech commands [2]. An FPGA implementation most accurately conveys the parallel architecture of neural layers and provides the flexibility to reconfigure the entire neural network and its components, the artificial neurons. In addition, the configuration of FPGA-based neural networks is easy to change.

The main goal of this project is therefore multiclass recognition of objects on the FPGA. The possible applications vary depending on the requirements of the customer: when the target data is changed, the system adapts to the new task without any change to the hardware basis. Examples of tasks include institution security, counting the number of people in a queue, counting cars in a stream, detecting non-standard behavior of people in public places, and detecting animals and birds in dangerous places. In light of recent events, proposals for identifying coronavirus in potentially infected people based on X-ray images of their respiratory tract, or on analysis of data from thermal imagers in public places, will also be relevant. With cameras placed throughout a particular country, it is possible to search for wanted people. In addition, this development may be useful in production: intelligent video surveillance systems are able to recognize in advance the signs of an impending accident in a factory or warehouse, so that its causes can be corrected before it occurs.

Based on all of the above, it can be concluded that success in real-time object detection problems is limited only by the collection of data for a specific task. If the necessary data is available, the neural network can be trained, and the necessary settings can be adjusted by changing the hyperparameters.

The result of the project is an FPGA board that recognizes the surrounding space through a camera; the results are displayed on a laptop.
978-1-7281-4587-7/21/$31.00 ©2021 IEEE
                     II. RELATED WORKS

Because it needs comparatively few resources, YOLO makes real-time detection from a camera feasible on a CPU [3] as well as on a GPU. It uses a reduced number of layers, which significantly increases the speed of the neural network.

Many neural networks designed for detecting objects in an image repurpose classifiers or localizers to perform detection. They apply the model to the image at multiple locations and scales, and areas of the image with a high score are considered detections.

The article [4] provides a comparison of different meta-architectures, which reflects the advantage of YOLOv3 over its analogs. Table II in the article shows that YOLOv3, with the same or better recognition quality (metric mAP@50), has a significantly shorter image processing time (about 4-5 times). For the task of detecting objects in a video stream, high image processing speed is one of the key advantages of the YOLOv3 architecture compared to other architectures.

In the proposed neural network, a completely different approach is used: one neural network is applied to the complete image. This network divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities. This model has several advantages over classifier-based systems. The neural network processes the entire image during testing, so its predictions are informed by the global context of the image. It also makes predictions with a single network evaluation, as opposed to systems like the Region-based Convolutional Network (R-CNN), which require thousands of evaluations for a single image. All of the above combined makes it extremely fast, over 1000 times faster than R-CNN and 100 times faster than Fast R-CNN [5].

Figure 1 graphically depicts the process of building bounding boxes.

Fig. 1. The process of detecting objects in the image [6].

The article [7] introduces the REQ-YOLO architecture, which is based on the YOLO architecture. In fact, REQ-YOLO is a highly compressed version of the YOLO architecture for improving FPGA performance. A special feature of REQ-YOLO is its simplicity at the software and hardware levels when detecting objects. Both REQ-YOLO and our project use quantization of weights, which makes it possible to significantly reduce the number of calculations and, therefore, the amount of memory used by the neural network. Unlike the work [7], our project provides an accurate assessment of the quality of recognition and the speed of image processing by the neural network.

Table I compares state-of-the-art (SoTA) YOLO neural networks. It can be seen that the Titan X processes images at a speed of 40-90 frames per second, with a mAP (mean Average Precision) of 78.6% for Visual Object Classes (VOC) 2007 and a mAP of 48.1% for COCO test-dev.

TABLE I.   COMPARISON OF NEURAL NETWORKS ON VARIOUS DATA ON THE TITAN X GRAPHICS CARD.

Model              Train            mAP    FPS
Old YOLO           VOC 2007+2012    63.4   45
SSD300             VOC 2007+2012    74.3   46
SSD500             VOC 2007+2012    76.8   19
YOLOv2             VOC 2007+2012    76.8   67
YOLOv2 544x544     VOC 2007+2012    78.6   40
Tiny YOLO          VOC 2007+2012    57.1   207
SSD300             COCO trainval    41.2   46
SSD500             COCO trainval    46.5   19
YOLOv2 608x608     COCO trainval    48.1   40
Tiny YOLO          COCO trainval    -      200

                     III.    METHODOLOGY

The diagram of connected devices is shown in Figure 2.

Fig. 2. The block diagram of the project.

YOLOv3 is a rather heavyweight neural network and requires a large amount of video memory and computing resources to recognize objects with high accuracy and quality. Therefore, given the limited resources of the DE10-Nano, it was decided to use a lighter version, Tiny YOLOv3. Reducing the resolution of the input image and reducing the number of layers, both for feature extraction and for object classification and location regression, made the neural network significantly lighter; however, the quality of object detection also deteriorated.
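How the reduced layer stack shrinks the feature maps can be traced with a short sketch. It assumes the usual Tiny YOLOv3 backbone (layers 0 to 15 of the architecture given in [8]) with "same" padding, so that only the stride changes the spatial size. The function names are illustrative and are not part of the project code. The `head_filters` helper applies the standard YOLOv3 convention of three anchors per scale; the source does not state which head size was actually used for the 18-class subset.

```python
def conv(shape, filters, size, stride=1):
    """'Same'-padded convolution: the kernel size does not change the
    spatial dimensions, only the stride does (ceiling division)."""
    h, w, _ = shape
    return (-(-h // stride), -(-w // stride), filters)


def maxpool(shape, stride):
    """'Same'-padded max pooling: channels are kept, size shrinks by stride."""
    h, w, c = shape
    return (-(-h // stride), -(-w // stride), c)


def trace_backbone(input_shape=(416, 416, 3)):
    """Return the output shape of layers 0..15 of the assumed backbone."""
    shapes = []
    x = input_shape
    # Layers 0-9: five (3x3 conv, stride-2 maxpool) pairs
    for filters in (16, 32, 64, 128, 256):
        x = conv(x, filters, 3)
        shapes.append(x)
        x = maxpool(x, 2)
        shapes.append(x)
    # Layers 10-11: 3x3 conv followed by a stride-1 maxpool (size unchanged)
    x = conv(x, 512, 3)
    shapes.append(x)
    x = maxpool(x, 1)
    shapes.append(x)
    # Layers 12-15: convolutions leading to the first detection layer
    for filters, size in ((1024, 3), (256, 1), (512, 3), (255, 1)):
        x = conv(x, filters, size)
        shapes.append(x)
    return shapes


def head_filters(num_classes, anchors_per_scale=3):
    """Standard YOLOv3 convention: (classes + 4 box values + 1 objectness)
    per anchor. An assumption here, not a value taken from the source."""
    return (num_classes + 5) * anchors_per_scale


if __name__ == "__main__":
    for index, shape in enumerate(trace_backbone()):
        print(index, shape)
    print("128x128 input, layer 15:", trace_backbone((128, 128, 3))[15])
    print("head filters for 18 classes:", head_filters(18))
```

Under these assumptions a 416 × 416 input is reduced to a 13 × 13 grid by layer 9, while a 128 × 128 input ends in a 4 × 4 grid, which illustrates why lowering the input resolution cuts the amount of computation so sharply.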
             TABLE II.      TINY YOLOV3 ARCHITECTURE [8].

Layer   Type            Filters   Size/Stride   Input              Output
0       Convolutional   16        3×3/1         416 × 416 × 3      416 × 416 × 16
1       Maxpool                   2×2/2         416 × 416 × 16     208 × 208 × 16
2       Convolutional   32        3×3/1         208 × 208 × 16     208 × 208 × 32
3       Maxpool                   2×2/2         208 × 208 × 32     104 × 104 × 32
4       Convolutional   64        3×3/1         104 × 104 × 32     104 × 104 × 64
5       Maxpool                   2×2/2         104 × 104 × 64     52 × 52 × 64
6       Convolutional   128       3×3/1         52 × 52 × 64       52 × 52 × 128
7       Maxpool                   2×2/2         52 × 52 × 128      26 × 26 × 128
8       Convolutional   256       3×3/1         26 × 26 × 128      26 × 26 × 256
9       Maxpool                   2×2/2         26 × 26 × 256      13 × 13 × 256
10      Convolutional   512       3×3/1         13 × 13 × 256      13 × 13 × 512
11      Maxpool                   1×1/1         13 × 13 × 512      13 × 13 × 512
12      Convolutional   1024      3×3/1         13 × 13 × 512      13 × 13 × 1024
13      Convolutional   256       1×1/1         13 × 13 × 1024     13 × 13 × 256
14      Convolutional   512       3×3/1         13 × 13 × 256      13 × 13 × 512
15      Convolutional   255       1×1/1         13 × 13 × 512      13 × 13 × 255
16      YOLO
17      Route 13
18      Convolutional   128       1×1/1         13 × 13 × 256      13 × 13 × 128
19      Up-sampling               2×2/1         13 × 13 × 128      26 × 26 × 128
20      Route 19 8
21      Convolutional   256       3×3/1         26 × 26 × 384      26 × 26 × 256
22      Convolutional   255       1×1/1         26 × 26 × 256      26 × 26 × 255
23      YOLO

There are many different factors to take into account when training a neural network. The number of classes, the specifics of the problem, the size of the bounding rectangles and other properties of the data must be considered. Data-independent factors also have a big impact, for example, the choice of the training step (learning rate), the algorithm used for backpropagation of the error, and the number of images processed per weight update.

Quantization was applied to the layers of the neural network, that is, the number of bits allocated to represent one network parameter was reduced: instead of 32 bits for one floating-point number, 8 bits are allocated per parameter. As a result, the model weights occupy approximately 2–3 times less RAM and the calculations take approximately 2–2.5 times less execution time, which made it possible to significantly accelerate the neural network and reduce energy consumption, albeit at the cost of a 15–20% reduction in model accuracy [9].

The OpenImagesV4 [10] image set from Google was selected as the training dataset. This is an open dataset with almost 2 million tagged images and a hierarchical structure of classes (about 600 in total). To train the neural network, a subset with 18 classes was used, including people, various types of furniture, transportation, and various office, kitchen and other accessories. In total, the subset contains about 28,600 images. They were downloaded using the OIDv4 Toolkit [11].

The neural network is trained with the Blueoil [12] framework, which makes it possible to solve various machine learning problems using FPGAs.

The first step is to prepare a server with a GPU for training the neural network. Newer generations of Nvidia GPUs are preferred for this, since the vast majority of libraries for developing and training neural networks are written specifically for CUDA in C or C++. The server can be either a local computer or a remote machine running a Linux operating system. The project was developed on the Ubuntu 18.04 distribution. GPU drivers higher than 410 are also needed, and about 50 GB of free disk space is recommended. Docker must be installed on the server to get started with the project. The hardware used for training was a local computer with an Nvidia GeForce 940MX video card with 2 GB of video memory.

It is especially convenient to develop and train neural networks without first having to build an environment in which the many software components do not conflict because of differing module and library versions, and to keep the work portable. For this reason, a Docker container was created for the whole project; it allows the project to be reproduced even on a completely new device or server. The container is based on the Ubuntu 18.04 Linux distribution and contains both the hardware and software modules necessary for project development.

The architecture of the neural network trained using Blueoil was converted into a binary file for the firmware of the DE10-Nano board, and a configuration file with the neural network weights was added to it. After that, the SD card image was modified to provide Blueoil support. The resulting files were then added to Ubuntu on the board, the board was reprogrammed, and the necessary Python packages were installed on it. With the camera and the other peripherals properly connected, this made it possible to launch the neural network to detect objects on the DE10-Nano.

Based on our experience of working with the DE10-Nano board, it was decided to develop a cooling system for the board's chip. The board has an industrial Cyclone V chip that requires additional cooling or overheating protection, which the developers did not implement when creating the board.
Thus, it was decided to use the development from our previous project [13], which solves the problem.

Fig. 3. Board cover with cooler.

In our experiments, the DE10-Nano gives on average the following indicators on the OpenImagesV4 sample from Google: FPS ≈ 28–33; mAP ≈ 29.1%. Based on these data, we can conclude that the DE10-Nano copes well with the task when compared with the top-class Titan X video card, which costs more than 10 times as much as the DE10-Nano board used.

The computing power of the DE10-Nano [14] available for solving the problem is:

    •    Cyclone V FPGA: 0.16 GFLOPS;
    •    Dual Core ARM Cortex-A9 MPCore: 2 GFLOPS.

Table III shows the use of the computing resources of the board.

            TABLE III.      RESOURCES USED BY DE10-NANO.

                   Estimated Resource Usage Summary
          Resource                       Usage
          Logic utilization              59%
          ALUTs                          39%
          Dedicated logic registers      25%
          Memory blocks                  57%
          DSP blocks                     43%

          IV.    EXPERIMENTAL RESULTS AND ANALYSIS

A. Hyperparameter Tuning

The neural network was trained on a GPU with 2 GB of GPU memory. On average, it took about 8 hours to train the model. The GPU memory was sufficient for a batch of at most 4 images per weight update. A smaller batch made little sense, since the training time of the neural network increases significantly and gradient computation becomes less stable. Varying the input resolution of the images gave the following results:

   TABLE IV.       ERROR FUNCTION VALUE FOR DIFFERENT INPUT DATA FORMAT

                 Input          Images per weight update
                 resolution     1      2      4
                 96×96          3.28   2.95   2.8
                 128×128        3.1    2.78   2.6
                 168×168        2.91   2.72   MemoryError

The columns give the number of images per weight update, and the rows give the input resolution of the images. The rest of the training parameters were the same in each experiment.

As a result, the best value was a resolution of 128×128 pixels, as it offers a good balance between learning speed and preserving as much information as possible during data preprocessing.

The training step (learning rate) was 0.003 and was decreased by a factor of 10 every 1000 parameter updates. There were 20,000 iterations in total.

B. Converting a model as a firmware to an FPGA

The process of transferring the developed and trained neural network to the DE10-Nano can be conditionally divided into 2 parts. The first is the computational graph, in which the entire object recognition algorithm is described, and the second is the parameters of the neural network involved in the calculations. It is advisable to load the model parameters once from the main memory of the board, while the computational graph can be represented as a binary FPGA firmware file.

The computations of the neural network run directly on the FPGA chip of the DE10-Nano, while the ARM core is used for high-level control of the board and for connecting and configuring peripherals. The board configuration can be updated directly from a terminal, without powering the board off, using Bash and Python scripts.

C. Testing and debugging the project in real time

During the development of the project, its debugging and testing were successfully carried out. The mAP metric is 29.4%. The average FPS fluctuates around 30 frames per second, which makes it possible to analyze the environment in real time. With an input resolution of 224×224, the FPS dropped to 10 frames per second. Conversely, there was no point in going below a resolution of 128×128 pixels: frames are processed faster, but the quality of object recognition drops to 19.7%.

                              V.    RESULTS

This project shows that Cyclone V chips are able to handle the processing of a 128×128 video stream by a neural network in real time.

The chip cooling system developed in one of our previous projects [13] proved well suited to improving the chip's performance when processing a video stream. It makes it possible to obtain stable FPS over a long period of time while the FPGA is in active use.

This project is notable for being much cheaper than similar solutions [15]–[17] that use expensive video cards, while the important indicators (mAP, FPS, inference time) remain acceptable for the object detection problem. In other words, the project has an applied character and is workable in real-life tasks.

As a result of the project, the following metrics were achieved: mAP = 29.4% and FPS in the range [28.3, 33.4].
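As a rough illustration of what the reported FPS range means per frame, the frame rate can be converted into an average time budget per frame. This is simple arithmetic on the figures above and not a measurement from the project; the function name is illustrative, and the reciprocal of FPS is an average frame period, which may differ from the latency of a single inference if stages are pipelined.

```python
def frame_time_ms(fps):
    """Average time per frame in milliseconds for a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# FPS range reported for the DE10-Nano in this section
fps_low, fps_high = 28.3, 33.4

slowest = frame_time_ms(fps_low)   # frame period at the lowest frame rate
fastest = frame_time_ms(fps_high)  # frame period at the highest frame rate

print(f"{fastest:.1f} ms to {slowest:.1f} ms per frame")
```

Under this reading, the board spends roughly 30 to 35 ms on each frame on average.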
Below are examples of how the neural network operates on the DE10-Nano board.

Fig. 4. An example of a demonstration of the operation of a neural network.

Fig. 5. Different examples of the operation of a neural network.

The video stream from the camera is used as input to the neural network.

To demonstrate the operation of the neural network, the processed video stream is transmitted via SSH to the working machine in real time.

Depending on the initial set of classes of objects to be detected, the recognition quality and the speed of the algorithms change. An increase in the complexity of the detection task leads to a deterioration in both characteristics.

                           VI. CONCLUSIONS

In this paper, the implementation of a lightweight neural network in HDL is presented. It is applicable to a wide range of tasks, such as detecting and recognizing people, animals, vehicles and other objects, based on computer vision algorithms. The development runs on the DE10-Nano FPGA board and achieves good FPS and mAP values, which make it efficient and applicable in various tasks.

                              REFERENCES

[1]    T. V. Huynh, “Deep neural network accelerator based on FPGA,” 2017 4th NAFOSTED Conf. on Information and Computer Science, pp. 254–257, 2017. DOI: 10.1109/NAFOSTED.2017.8108073.
[2]    R. A. Solovyev, “Deep Learning Approaches for Understanding Simple Speech Commands,” 2020 IEEE 40th Int. Conf. on Electronics and Nanotechnology, pp. 688–693, 2020. DOI: 10.1109/ELNANO50318.2020.9088863.
[3]    M. B. Ullah, “CPU Based YOLO: A Real Time Object Detection Algorithm,” 2020 IEEE Region 10 Symposium, pp. 552–555, 2020. DOI: 10.1109/TENSYMP50017.2020.9230778.
[4]    J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” Tech Rep., pp. 1–6, 2018.
[5]    R. Girshick, Fast R-CNN, 2015.
[6]    YOLO: Real-Time Object Detection.
[7]    C. Ding, S. Wang, N. Liu, K. Xu, Y. Wang, and Y. Liang, “REQ-YOLO: A resource-aware, efficient quantization framework for object detection on FPGAs,” FPGA 2019 – Proc. 2019 ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, pp. 33–42, 2019. DOI: 10.1145/3289602.3293904.
[8]    W. He, Z. Huang, Z. Wei, C. Li, and B. Guo, “TF-YOLO: An improved incremental network for real-time object detection,” Appl. Sci., vol. 9, no. 16, 2019. DOI: 10.3390/app9163225.
[9]    B. Jacob, Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, 2018.
[10]   A. Kuznetsova, “The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale,” Int. J. Comput. Vis., vol. 128, no. 7, pp. 1956–1981, 2020. DOI: 10.1007/s11263-020-01316-z.
[11]   GitHub - EscVM/OIDv4_ToolKit: Download and visualize single or multiple classes from the huge Open Images v4 dataset.
[12]   GitHub - blue-oil/blueoil: Bring Deep Learning to small devices.
[13]   InnovateFPGA|EMEA|EM029 - Anthropomorphic robot on FPGA.
[14]   N. Rajovic, L. Vilanova, C. Villavieja, N. Puzovic, and A. Ramirez, “The low power architecture approach towards exascale computing,” J. Comput. Sci., vol. 4, no. 6, pp. 439–443, 2013. DOI: 10.1016/j.jocs.2013.01.002.
[15]   A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “YOLOv4: Optimal Speed and Accuracy of Object Detection,” arXiv, 2020.
[16]   W. Liu, “SSD: Single Shot MultiBox Detector,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 9905 LNCS, pp. 21–37, 2015. DOI: 10.1007/978-3-319-46448-0_2.
[17]   S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, Aggregated Residual Transformations for Deep Neural Networks.