
2017 First New Generation of CAS

A Convolutional Neural Network Fully Implemented on FPGA for Embedded Platforms

Marco Bettoni∗, Gianvito Urgese∗, Yuki Kobayashi†, Enrico Macii∗, and Andrea Acquaviva∗
∗ Politecnico di Torino, Torino, Italy, 0039 011 090 7042. Email: gianvito.urgese@polito.it
† NEC Corporation, Kawasaki, Japan. Email: y-kobayashi@hq.jp.nec.com

Abstract: Convolutional Neural Networks (CNNs) allow fast and precise image recognition. This capability is now in high demand in the embedded-system domain for video processing applications such as video surveillance and homeland security. Moreover, with the increasing requirement of portable and ubiquitous processing, power consumption is a key issue to be accounted for.

In this paper, we present an FPGA implementation of a CNN designed to address portability and power efficiency. Performance characterization results show that the proposed implementation is as power efficient as a general purpose 16-core CPU, and almost 15 times faster than a SoC GPU for mobile applications. Moreover, the external memory footprint is reduced by 84% with respect to a standard CNN software application.

I. INTRODUCTION

In this paper, we propose the design of a hardware architecture implementing a customizable Convolutional Neural Network (CNN) framework where several CNN schemas can be configured and executed. We analyzed the CNN computational flow to identify the most critical points to be parallelized in the FPGA implementation. We described the CNN framework architecture using a High Level Synthesis (HLS) language and tested the new HW-CNN module on an Altera Stratix V FPGA embedded in a Terasic DE-5-Net board.

The CNN algorithm performs fast and precise image recognition, which is a highly requested feature in the context of embedded systems. This type of algorithm is most widely used in the Artificial Intelligence field, contributing to numerous applications, such as forest fire detection [1], robotics [2], autonomous driving [3] and mobile applications [4]. In this latter domain, battery lifetime and memory resources are a serious concern for CNN implementations.

Several CNN models are available in the literature for general purpose applications. In 2012, the Alex-Net model [5] was the first efficient application of the CNN, and more accurate models have since been proposed, such as GoogLe-Net [6], featuring the Inception concept, and Faster R-CNN [7], with advanced capabilities for detecting the position of the subject in the picture.

To teach the CNN to recognize defined objects, the network needs to be trained. During the Training Phase, a set of labeled images is used to generate the set of parameters to be applied in the neural network. In the Test Phase, the capability of the network to identify and classify the pictures is evaluated.

The chosen model defines how to perform the training and the computation involved during the test, specifying the parameters and functions to be used for CNN recognition. The CNN flow is summarized as follows:

• Convolution: The matrix convolution operator is applied over the feature maps of the Input image, such as the RGB color channels. The computation is shown in Equation 1, where O is the number of Output feature maps of size H × W, I is the number of Input feature maps, and K × K is the size of the Kernel, which is the convolution operand obtained from the training. The lowercase o, h and w index an Output feature map and a pixel within it.

$$Out[o][h][w] = \sum_{i=0}^{I-1} \sum_{kh=0}^{K-1} \sum_{kw=0}^{K-1} In[i][h+kh][w+kw] \times Kernel[o][i][kh][kw] \tag{1}$$

• Activation: A threshold function applied to the convolution output. The ReLU(x) = max(0, x) function is widely adopted, but others are common as well, such as Tanh and Sigmoid.

• Pooling: The average or maximum value over an input region is evaluated, generating a resized image representative of the pool values. Equation 2 shows the Pooling by Average operation, where Ph × Pw is the pooling window size.

$$Out[o][h][w] = \frac{1}{Ph \times Pw} \sum_{ph=0}^{Ph-1} \sum_{pw=0}^{Pw-1} In[o][h+ph][w+pw] \tag{2}$$

• Fully-Connected (FC): Implemented at the end of a CNN, the FC layers provide the classification of the features extracted by convolution. The FC layers are implemented as in Equation 3:

$$Out[o] = \sum_{i=0}^{I-1} In[i] \times Kernel[o][i] \tag{3}$$

Fig. 1: Convolution process representation. The magnifications are representative of the CNN edge-detection.
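As an illustration of the CNN flow, Equations 1 to 3 and the ReLU activation can be written as a minimal NumPy sketch. This is a software reference under stated assumptions, not the HW-CNN implementation: the function names, the absence of padding, and the `stride` parameter of the pooling (with `stride=1` reproducing Equation 2 as written) are choices made here for illustration only.

```python
import numpy as np

def convolution(inp, kernel):
    """Equation 1. inp has shape (I, H_in, W_in), kernel has shape (O, I, K, K).
    No padding is applied (assumption), so the output is (O, H_in-K+1, W_in-K+1)."""
    O, I, K, _ = kernel.shape
    H = inp.shape[1] - K + 1
    W = inp.shape[2] - K + 1
    out = np.zeros((O, H, W))
    for o in range(O):
        for h in range(H):
            for w in range(W):
                # Sum over the input maps i and the kernel offsets kh, kw.
                out[o, h, w] = np.sum(inp[:, h:h + K, w:w + K] * kernel[o])
    return out

def relu(x):
    """Activation: ReLU(x) = max(0, x), applied element-wise."""
    return np.maximum(0, x)

def average_pooling(inp, ph, pw, stride=1):
    """Equation 2. inp has shape (O, H_in, W_in); the window is ph x pw.
    The stride parameter is an assumption added for illustration."""
    O, H_in, W_in = inp.shape
    H = (H_in - ph) // stride + 1
    W = (W_in - pw) // stride + 1
    out = np.zeros((O, H, W))
    for h in range(H):
        for w in range(W):
            hs, ws = h * stride, w * stride
            out[:, h, w] = inp[:, hs:hs + ph, ws:ws + pw].sum(axis=(1, 2)) / ph / pw
    return out

def fully_connected(inp, kernel):
    """Equation 3. inp has shape (I,), kernel has shape (O, I)."""
    return kernel @ inp

# Toy example: 2 input maps of 4 x 4 pixels, 3 output maps, 3 x 3 kernels.
image = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
features = relu(convolution(image, np.ones((3, 2, 3, 3))))   # shape (3, 2, 2)
pooled = average_pooling(features, 2, 2)                     # shape (3, 1, 1)
scores = fully_connected(pooled.reshape(-1), np.ones((4, 3)))  # shape (4,)
```

The explicit loops mirror the index structure of the equations rather than aiming for speed; they make visible the multiply-accumulate pattern that dominates the computation.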

978-1-5090-6447-2/17 $31.00 © 2017 IEEE
DOI 10.1109/NGCAS.2017.16
The Alex-Net model is used as a reference in this work, since it performs accurate recognition (84.7% Top-5 accuracy) and is generally used as a benchmark for CNN implementations. This model includes 5 Convolutional Layers followed by 3 FC Layers, and makes use of both Activation and Pooling, requiring in total nearly 1.5 billion operations.
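The figure of nearly 1.5 billion operations can be checked with a short back-of-the-envelope count. The sketch below is an assumption-laden estimate, not a computation taken from this paper: the layer shapes are the commonly reported Alex-Net dimensions (including the two-group split of the second, fourth and fifth convolutional layers), each multiply-accumulate is counted as two operations, and Activation and Pooling are ignored.

```python
# Commonly reported Alex-Net layer shapes (assumption, not stated in this paper).
# Convolution: (name, output maps, input maps per group, kernel size K,
#               output height, output width)
CONV_LAYERS = [
    ("conv1", 96, 3, 11, 55, 55),
    ("conv2", 256, 48, 5, 27, 27),
    ("conv3", 384, 256, 3, 13, 13),
    ("conv4", 384, 192, 3, 13, 13),
    ("conv5", 256, 192, 3, 13, 13),
]
# Fully-Connected: (name, outputs, inputs)
FC_LAYERS = [
    ("fc6", 4096, 9216),
    ("fc7", 4096, 4096),
    ("fc8", 1000, 4096),
]

def conv_macs(outputs, inputs, k, height, width):
    """MACs of one convolutional layer: one K x K x inputs sum per output pixel."""
    return outputs * inputs * k * k * height * width

def fc_macs(outputs, inputs):
    """MACs of one Fully-Connected layer: one product per weight."""
    return outputs * inputs

conv_total = sum(conv_macs(*layer[1:]) for layer in CONV_LAYERS)
fc_total = sum(fc_macs(*layer[1:]) for layer in FC_LAYERS)
total_macs = conv_total + fc_total
total_ops = 2 * total_macs  # one multiply and one add per MAC (assumption)

print(f"convolution MACs:     {conv_total:,}")
print(f"fully-connected MACs: {fc_total:,}")
print(f"total operations:     {total_ops / 1e9:.2f} GOP")
```

Under these assumptions the convolutional layers account for most of the work, which is consistent with the focus on accelerating the convolution step.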
Figure 1 shows an example of the Alex-Net model used to set up our HW-CNN architecture running on the FPGA. The RGB channels of an image are provided as inputs and processed by the following 8 layers. The intermediate images are shown, highlighting the edge-detection capability of the CNN.
A common approach for CNN acceleration exploits GPU cards, which are able to perform recognition over several hundred images per second [8]. An alternative state-of-the-art approach leverages FPGAs for matrix convolution acceleration, while the other computation steps are performed on a general purpose CPU [9].

In this paper we propose a CNN fully implemented in FPGA, which executes the Convolution, Activation, Pooling and FC layers. More specifically, the proposed solution, named HW-CNN, has the following characteristics:
• Standalone implementation (no GPU, no CPU);
• Power efficient CNN computation;
• Low FPGA resources and memory dependency;
• Software reprogramming for CNN model compliance.

To characterize the proposed implementation, we performed comparative performance and power evaluations against a software version running on a general purpose CPU and a mobile SoC GPU. Memory utilization results are also reported. Overall, the results show that the proposed implementation is power efficient and lends itself to adoption in mobile applications with stringent power and resource requirements.

The rest of the paper is organized as follows: a description of the internal implementation (Section II), the performance and obtained results (Section III) and finally the conclusion (Section IV).

II. IMPLEMENTATION

The developed architecture can be configured to execute all the steps of a CNN model defined by users. Thus, it can be used to perform recognition on a picture or a video stream. The HW-CNN allows great compatibility with any CNN model because the CNN parameters can be reconfigured in software. To compute a classification, the input image is passed from an external communication interface (LAN or Serial connection). Then, the HW-CNN computes all the CNN steps, generating a list of recognition decisions that is sent back to the host. The recognition is completely performed on the FPGA, which requires a DDR-RAM to store the raw input image and the intermediate results.

A. FPGA Implementation

We developed the HW-CNN implementation using the NEC CyberWorkBench HLS compiler (CWB) [10], exporting the component described in pseudo-C to an RTL format.

We implemented a parallel version of a general and customizable CNN architecture. For this purpose, we used two parallelization techniques: the Tiling Technique designed by Zhang et al. [9] and the Pipelining Technique commonly implemented for data stream processing.

The Tiling Technique has been exploited to overcome the data dependency of the CNN calculation, which, due to the massive recurrence of the data, does not fit in the FPGA internal memory. The input image is therefore fragmented into tiles smaller than the original picture, and the CNN can be performed by computing each tile individually. The biggest advantage of performing the tiling technique on an FPGA is the significant degree of parallelization achievable by computing several tiles at once on the hardware logic. In our HW-CNN implementation, up to 8 tiles of size 32 × 32 pixels are computed in parallel.

The Pipeline is composed of 5 units: Download (DL), Convolution (CV), Activation (AT), Pooling (PL) and Upload (UL), which perform the homonymous functions. The scheduling of the functions is managed by the Control Unit (CU), which generates the parameters for each unit depending on the CNN model size and structure.

This configuration is shown in Figure 2. The computation flow passes through all the units from DL to UL, where the two extremes are dedicated to the data transfer with the on-board RAM. In order to avoid the data hazard existing between two consecutive units, a double buffering [11] system has been adopted, implementing the Ping-Pong Technique already exploited in [9]. The buffer duplication prevents the units from reading incomplete data, or from updating data which has not been processed yet. At the next pipeline stage, the CU swaps the data by means of the control logic depicted in the Ping Pong (PP) Buffer of Figure 2.

Fig. 2: CNN Pipeline. The data flow passes through the DL, CV, AT, PL and UL Units, all controlled by the CU. A double buffering system is implemented as PP buffers, allowing concurrent operation on the picture tiles.

B. Convolution Unit

The core of the CNN computation is the CV Unit, where both Convolution and Fully-Connected layers are computed.
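The Tiling Technique of Section II-A, which determines the tiles that feed the CV Unit, can be sketched in software as follows. This is a minimal single-feature-map illustration, not the HW-CNN code: the function names are hypothetical, the tile size is applied to the Output map, and each Input tile is assumed to carry K − 1 extra rows and columns so that every Output tile can be computed independently. The tiles are processed one after the other here, whereas the hardware computes up to 8 of them in parallel.

```python
import numpy as np

def conv2d(inp, kernel):
    """Direct convolution of one feature map with a K x K kernel, no padding."""
    K = kernel.shape[0]
    H, W = inp.shape[0] - K + 1, inp.shape[1] - K + 1
    out = np.zeros((H, W))
    for h in range(H):
        for w in range(W):
            out[h, w] = np.sum(inp[h:h + K, w:w + K] * kernel)
    return out

def tiled_conv2d(inp, kernel, tile=32):
    """Same result as conv2d, computed one Output tile at a time."""
    K = kernel.shape[0]
    H, W = inp.shape[0] - K + 1, inp.shape[1] - K + 1
    out = np.zeros((H, W))
    for h0 in range(0, H, tile):
        for w0 in range(0, W, tile):
            th = min(tile, H - h0)  # tiles on the border may be smaller
            tw = min(tile, W - w0)
            # The Input tile overlaps its neighbours by K - 1 rows/columns
            # (assumption), so it can be processed on its own.
            in_tile = inp[h0:h0 + th + K - 1, w0:w0 + tw + K - 1]
            out[h0:h0 + th, w0:w0 + tw] = conv2d(in_tile, kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((70, 70))
kernel = rng.random((3, 3))
full = conv2d(image, kernel)
tiled = tiled_conv2d(image, kernel, tile=32)
print(np.allclose(full, tiled))
```

Because each tile depends only on its own slice of the Input, the loop body has no dependency between iterations, which is the property that allows several tiles to be mapped onto parallel hardware.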
The convolution operator requires a considerable number of MAC operations, which is proportional to the Neural Network size and the image resolution. The HW-CNN implementation optimizes this computation by parallelizing the MAC operator 24 times.

Figure 3 represents the CV schema of the HW-CNN implementation, which shows the MAC parallelization over I Input tiles and O Output tiles.

Fig. 3: Convolution Unit (CV) schema. The i_I matrices on the left side are the Input tiles, the o_O matrices on the right side are the Output tiles and the K_{o,i} are the Kernel matrices.

The CV Unit has been optimized by an internal scheduling which avoids idleness among the CV components. For the MAC computation, data must be acquired from the input buffer and, after the operation, stored in the output buffer. The process has been internally pipelined to guarantee that the address calculation, MAC execution, buffer reading and writing are performed in a single clock cycle. This internal pipelining has been coded by means of the CWB automatic pipelining feature, and verified on the real hardware with the scheduling shown in Figure 4.

Fig. 4: Convolution Unit internal pipeline.

The CV operation is repeated for each pixel of the Output tile. The loop logic has been hard-coded for the T_width × T_height dimension of the Output tile. Finally, Equation 4 gives the number of clock cycles the CV Unit needs to complete a CNN stage, which depends on the kernel size K × K, the Output tile size and the CV internal pipeline latency T_pl.

$$CV_{time} = T_{pl} + K^2 \times T_{width} \times T_{height} \tag{4}$$

III. RESULTS AND DISCUSSIONS

We configured the HW-CNN with the Alex-Net model structure and parameters used in the work of Zhang et al. [9]. The following evaluation is compared with software implementations, since the FPGA-based implementation [9] found in the literature reports only the performance of the CNN steps computed on the FPGA, without considering the other steps executed on the host, which makes a direct comparison between the two architectures impractical. We tested the performance by comparing different CNN software implementations: an optimized C code designed by Zhang et al. [9], the Caffe Python Library working in CPU-only mode (without GPU parallelization) [12] and, for the mobile GPU comparison, the clBLAS OpenCL library, which has been evaluated by Lokhmotov et al. [13].

A. Timing and Power Results

For the timing performance, we considered the time required to compute an image recognition. The time required by our HW-CNN is reported in Figure 5a, where it is compared with the Caffe execution on an Intel Xeon CPU E5-2630 v3 @ 2.40GHz 32-CPU and with the clBLAS library tested on an ARM Mali-T628 GPU.

For the power comparison, the Performance per Watt unit has been used. This value is obtained from Equation 5, which considers the number of operations executed by the device together with the Thermal Design Power in Watt.

$$\text{Performance per Watt} = \frac{Operations}{Time \times Power} \tag{5}$$

The aim of this comparison is to give an idea of the different timing performances, considering the discrepancies in terms of hardware and level of portability. We designed the HW-CNN module with the clear intent of reducing the hardware requirements to a minimum, while relaxing the timing constraint to a reasonable value, still bearable for the mobile user. On the other hand, the FPGA adoption brings a speed-up of almost 15 times over the SoC GPU.

Fig. 5: (a) Time Performances: recognition times on a CPU, on the FPGA implementation and on a SoC GPU, highlighting the performance cost paid for the sake of portability. (b) Memory Reduction: RAM memory requirements, where the FPGA makes efficient use of on-chip data compared with the software implementations.

Table I reports the CNN power efficiencies of the considered implementations. The general purpose CPU values have been extracted from [9], where the results have been obtained by executing an optimized CNN code on an Intel Xeon CPU E5-2430 (@2.20GHz).

This comparison shows that running the Alex-Net Model on the FPGA using our HW-CNN architecture is as power efficient as the CPU implementation running on 16 threads and almost 3 times more efficient than the software execution without parallelization. The comparison with mobile
hardware shows that current mobile implementations of CNNs are outperformed by the FPGA by a factor of more than 19×. This demonstrates the effectiveness of the low-power FPGA computation, and the possibility of adopting similar CNN implementations for mobile applications where battery life is the major constraint.

TABLE I: Power performances.
Device                       OP/s [GOP/s]   Power [W]   Perf. per Watt [GOP/s/W]
GPU Mobile OpenCL [13]       0.02           3           0.007
CPU single thread [9]        3.54           95          0.037
CPU 16-threads [9]           12.87          95          0.135
FPGA HW-CNN [This work]      0.75           5.54        0.135

B. Resource usage

The percentage of required FPGA resources is reported in Table II. The synthesis report shows that few resources are used, allowing the architecture to fit more compact FPGAs, such as the Xilinx Zynq. The most heavily used resource is the on-chip memory, which has been exploited for the main purpose of caching the intermediate CNN results on the FPGA.

TABLE II: FPGA Resources.
Resource    Stratix V FPGA chip usage
Logic       65,463 ALMs (28%)
Register    3.5kB (3%)
DSP         104 blocks (41%)
Memory      4,752kB (73%)

The memory required on the on-board DDR RAM is greatly reduced, as shown in Figure 5b. The chart compares the RAM memory requirements for running the Alex-Net model on both FPGA and software. The values are explained by the fact that, during the CNN execution, the intermediate results are stored in the PP buffers along the pipeline, rather than transferred to the RAM memory.

The reduced amounts of external memory and FPGA resources are significant figures that encourage the adoption of HW-CNN-like architectures in mobile applications. While this allows the implementation to be placed on a smaller FPGA or silicon component, the architecture can also be easily adapted to fit the smallest FPGA devices, exploiting the versatility of HLS synthesis coupled with the modular programming technique.

IV. CONCLUSION

In this paper we propose a Convolutional Neural Network (CNN) fully implemented in FPGA, which enables image recognition in low-power embedded systems with limited resources. These features have been made feasible by extending state-of-the-art implementations, where only the convolution step is accelerated on FPGA, with the modular addition of extra functionalities. This modularity guarantees compliance with existing CNN models, but also the possibility of easily introducing new functionalities.

The experiments show that our HW-CNN can quickly perform image recognition, outperforming the reference SoC GPU in terms of both timing (15×) and power (16×). The proposed implementation is even 3 times more power efficient than the reference CPU, and as efficient as the same CPU parallelized 16 times. Moreover, the requirement for external memory has been reduced by 83% when compared to the software version of the CNN.

Finally, this architecture has been designed to allow software reconfiguration, which allows the user to apply various CNN models and to efficiently test the same picture against different trained data and recognized classes.

ACKNOWLEDGMENTS

The HLS compiler and the technical support were provided by NEC Corporation, Japan.

REFERENCES

[1] Qingjie Zhang et al. “Deep Convolutional Neural Networks for Forest Fire Detection”. In: 2016 International Forum on Management, Education and Information Technology Application. Atlantis Press. 2016.
[2] Lei Tai and Ming Liu. “Deep-learning in Mobile Robotics-from Perception to Control Systems: A Survey on Why and Why not”. In: arXiv preprint arXiv:1612.07139 (2016).
[3] Mariusz Bojarski et al. “End to end learning for self-driving cars”. In: arXiv preprint arXiv:1604.07316 (2016).
[4] Ryosuke Tanno and Keiji Yanai. “Caffe2C: A Framework for Easy Implementation of CNN-based Mobile Applications”. In: Adjunct Proceedings of the 13th International Conference on Mobile and Ubiquitous Systems: Computing Networking and Services. ACM. 2016, pp. 159–164.
[5] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep convolutional neural networks”. In: Advances in neural information processing systems. 2012, pp. 1097–1105.
[6] Christian Szegedy et al. “Going deeper with convolutions”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 1–9.
[7] Shaoqing Ren et al. “Faster r-cnn: Towards real-time object detection with region proposal networks”. In: Advances in neural information processing systems. 2015, pp. 91–99.
[8] NVIDIA. GPU-Based Deep Learning Inference. URL: https://www.nvidia.com/content/tegra/embedded-systems/pdf/jetson_tx1_whitepaper.pdf (visited on 01/18/2017).
[9] Chen Zhang et al. “Optimizing fpga-based accelerator design for deep convolutional neural networks”. In: Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM. 2015, pp. 161–170.
[10] Kazutoshi Wakabayashi. “CyberWorkBench: Integrated design environment based on C-based behavior synthesis and verification”. In: VLSI Design, Automation and Test, 2005 (VLSI-TSA-DAT). 2005 IEEE VLSI-TSA International Symposium on. IEEE. 2005, pp. 173–176.
[11] Wikipedia. Multiple buffering. URL: https://en.wikipedia.org/wiki/Multiple_buffering (visited on 02/28/2017).
[12] Evan Shelhamer Yangqing Jia. Caffe - Deep learning framework by the BVLC. URL: http://caffe.berkeleyvision.org (visited on 01/20/2017).
[13] Anton Lokhmotov and Grigori Fursin. “Optimizing convolutional neural networks on embedded platforms with OpenCL”. In: Proceedings of the 4th International Workshop on OpenCL. ACM. 2016, p. 10.