

Image and Vision Computing 104 (2020) 104046


Deep learning-based object detection in low-altitude UAV datasets: A survey

Payal Mittal a, Raman Singh b, Akashdeep Sharma c,⁎
a Research Scholar, UIET, Panjab University, Chandigarh, India
b Assistant Professor, CSED, Thapar Institute of Engineering and Technology, Patiala
c Assistant Professor, UIET, Panjab University, Chandigarh

Article history: Received 26 September 2020; Accepted 9 October 2020; Available online 11 October 2020

Keywords: Deep learning; Object detection; Unmanned aerial vehicles; Computer vision; Low-altitude aerial datasets

Abstract: Deep learning-based object detection solutions that emerged from computer vision have captivated full attention in recent years. The growing UAV market and interest in potential applications such as surveillance, visual navigation, object detection, and sensor-based obstacle avoidance planning hold good promise for the area of deep learning. Object detection algorithms implemented in deep learning frameworks have rapidly become the method of choice for processing moving images captured from drones. The primary objective of the paper is to provide a comprehensive review of state-of-the-art deep learning-based object detection algorithms and to analyze the recent contributions of these algorithms to low-altitude UAV datasets. The core focus of the study is low-altitude UAV datasets, because relatively little contribution was seen in the literature when compared with standard or remote-sensing-based datasets. The paper groups Faster RCNN, Cascade RCNN, R-FCN, etc. under two-stage detectors; YOLO and its variants, SSD and RetinaNet under one-stage detectors; and CornerNet, Objects as Points, etc. under advanced deep learning-based detectors. Further, one-stage, two-stage and advanced detectors are studied in detail with a focus on low-altitude UAV datasets. The paper provides a broad summary of low-altitude datasets, along with their respective detection literature, for the potential use of researchers. Various research gaps and challenges for object detection and classification in UAV datasets that need to be dealt with to improve performance are also listed.

© 2020 Elsevier B.V. All rights reserved.

1. Introduction

An Unmanned Aerial Vehicle (UAV), commonly known as a drone, is a flying device controlled either by a human operator or by autonomously operating onboard workstations [1]. Depending on the purpose, drones collect on-demand images from low-altitude airspace, which provides vast support for tasks such as emergency item deliveries, border patrolling, emergency rescue in case of disaster and visual surveillance for crowd safety [2]. The market growth opportunity for vision processing in drones or consumer aerial vehicles expands the total number of vehicle owners. Further, it is notable that different countries encourage existing drone users to upgrade their hardware for better computation [3]. Recently, law enforcement agencies of several countries have issued guidelines to fly UAVs in a restricted manner so that they cannot harm intruder privacy [4]. Drone flight regulations in several countries' national legislation, especially in European countries, require that drones not fly over crowds, and several laws further define the minimum distance a drone may fly near a crowd [5]. The recent advent of UAVs in a wide range of applications, such as visual surveillance, rescue and entertainment, is accompanied by the demand for safety. A report by Goldman Sachs forecasts the global drone industry to reach $100 billion by 2020. Further, the world retail/consumer drone market is depicted in Fig. 1a. The report also concludes that the drone market is ready for strong growth in the commercial space with the evolution of regulations [6]. Most drone-enabled services rely on onboard imaging capabilities, and the significant applications include detection, classification, environmental monitoring, transportation systems, and aerial assessments such as disaster response and building inspection, as depicted in Fig. 1b [7]. UAV-captured images and their post-analysis are the two major categories that fall under commercial applications of aerial vehicles. Applications of aerial images include landslide mapping, search and rescue, wildlife monitoring, the creation of digital elevation maps and the utilization of a mounted camera for a multitude of purposes. The technology behind innovation in aerial applications is responsible for digital video stabilization, autonomous navigation and terrain analysis [8]. The analysis utilizes post-flight data to produce fine-grained information, and anything from crop quantity measures to water quality can be accessed in a fraction of the time [10]. Fortunately, the characteristics of aerial vehicles, such as cost effectiveness, high performance and low power consumption, made it possible to incorporate fully functional vision capabilities [11] into UAVs.

⁎ Corresponding author. E-mail addresses: payalmittal6792@gmail.com (P. Mittal), raman.singh@thapar.edu (R. Singh), akashdeep@pu.ac.in (A. Sharma).
https://doi.org/10.1016/j.imavis.2020.104046

Fig. 1. a) World retail/consumer drone market $ billions [6]. b) Distribution of applications in UAV datasets [9].

The rapid proliferation of aerial vehicle technology is already well underway in the computing world. In computer vision, the most popular way to localize or detect an object in an image is to represent its location with the help of bounding boxes [12]. Object detection in low-altitude UAV datasets has been performed using deep learning, and some detection examples are displayed in Fig. 2. Object detection is a technique of identifying variable objects in a given image and inserting a boundary around them to provide localization coordinates. Object detection in aerial images has gained the attention of researchers working in this field, as aerial vehicles provide stereo views from a camera mounted on them. Deep learning-based approaches for object detection are robustly revolutionizing the capabilities of autonomous navigation vehicles [15]. The work presented in this paper is intended to offer a wide-ranging account of the use of deep learning-based object detection approaches specifically on low-altitude aerial datasets. It will serve as a repository of all current progress made in deep learning-based object detection on low-altitude datasets and will also help young researchers identify research issues for further pursuit in this field.

1.1. Motivation and contribution

Our study is motivated by the need to collect the several convolutional-network-based methods for object detection in low-altitude UAV datasets and group them together for the assistance of the research community. The proposed survey differs from already published surveys in that the studies included are based only on low-altitude aerial datasets, rather than focusing on standard [16,17] or remote-sensing-based datasets [18]. Further, our study provides a complete account of state-of-the-art object detection algorithms for low-altitude aerial images, covering single-stage, two-stage and advanced detectors. The categorization of object detectors into stages is discussed in the next sections. Moreover, Table 1 lists the major surveys related to general and UAV-based object detection and points out the sharp disparities between the current survey and these published works. [12] presented a review of state-of-the-art algorithms in application domains of deep learning-based object detection. Generic and salient object detection were discussed for both face and pedestrian detection by adopting the methodology of multi-scale and multi-feature boosting forests. That paper focused on state-of-the-art deep learning-based object detectors but did not include any information pertinent to low-altitude UAV datasets. [19] provided a detailed review of vision-based systems in UAVs that fostered the development of UAVs in advanced and modern tasks; this paper elaborated more on concepts related to UAVs, with less focus on the specific area of object detection. [20,21] presented surveys of application-based studies in UAVs, specifically about disaster response and traffic surveillance respectively. Both surveys lack a detailed discussion of deep learning-based object detectors. [22] provided a comprehensive survey of the progress made in object detection since 2012. It also includes one-stage, two-stage and advanced deep object detection, but only for general image datasets. [23] avoided the problem of inherent class imbalance and improved detection capability by introducing a novel deep network for low-altitude aerial images. However, it does not provide a survey of object detection, nor does it list the current state of the art for low-altitude aerial images.

From these studies, it is quite evident that most survey papers summarized deep learning-based object detection algorithms in a general manner or targeted specific aerial applications. In our proposed work, a comprehensive survey of deep learning-based object detection algorithms on low-altitude UAV datasets is presented. There is a need to summarize the contents all under one umbrella for budding researchers, academicians, industry and end users. This paper is aimed at researchers new to the field of object detection tasks related to low-altitude UAV datasets.

The primary objectives of the research paper are as follows:

• To review the current taxonomy of deep learning-based object detection algorithms with respect to aerial data.
• To provide a comprehensive list of low-altitude UAV datasets present in the literature and analyze the current state-of-the-art object detection algorithms on these datasets.
• To report literature findings about why advanced deep detectors work better than popular one-stage and two-stage deep detectors.
• To summarize their comparative performances by analyzing results on low-altitude benchmark datasets.

Fig. 2. Examples of object detection in UAV datasets [13,14].

The organization of the paper is as follows: Section II elaborates the work related to the development of detectors with respect to low-altitude aerial datasets. Section III describes a comprehensive analysis of the deep learning-based taxonomy for object detection in low-altitude UAV datasets; a detailed review is given of one-stage and two-stage object detection algorithms, and other recent advanced approaches based on current developments in the deep learning area are also covered. Section IV discusses available low-altitude UAV datasets for object detection algorithms, and studies using benchmark datasets are also considered. The performance issues in low-altitude aerial data, which lead to research gaps and challenges in the field of object detection, are listed in Section VI. The last section concludes the whole study and lists some critical consequences for further investigation.


Table 1
Comparison of past surveys and proposed work.

Studies | Description | Deep learning based object detection | Low-altitude UAV datasets | Multi-application | One and two stage detectors | Advanced object detection methods
[12] | Review of object detection in a multi-application environment | √ | × | √ | √ | ×
[19] | Analysis of computer vision based object recognition algorithms | × | × | √ | × | ×
[24] | Described deep learning based object detection methods, particularly on remote sensing datasets | √ | × | √ | × | ×
[25] | Highlighted deep learning based object detection with region proposal generation | √ | × | × | × | ×
[26] | Discussed deep learning based object detection, particularly for pedestrians | √ | × | × | √ | ×
[27] | Focused on typical generic object detection architectures | √ | × | √ | √ | ×
[20] | Survey of UAV imagery acquisition for disaster research | × | √ | × | × | ×
[21] | Survey of UAVs for traffic surveillance | × | × | √ | × | ×
[22] | Provided a comprehensive survey of object detection since 2012 | √ | × | × | √ | √
[28] | Aspects of generic object detection | √ | × | × | √ | √
[29] | Survey about milestone detectors and detection datasets | √ | × | √ | √ | ×
[30] | Presented computer vision based UAV concepts of object recognition | × | × | √ | × | ×
[23] | Avoids the problem of inherent class imbalance to improve detection capability | √ | √ | × | √ | ×
[31] | Recent advances in deep learning for object detection | √ | × | √ | √ | √
Proposed | Summarizes multiple application based deep learning object detection algorithms on low-altitude UAV datasets | √ | √ | √ | √ | √

2. Related work

UAVs have been widely exploited in application areas such as search and rescue [32], security and monitoring [33], disaster management [20], crop management [34] and communication missions [35]. Aerial vehicles have the ability to fly at different speeds, to hover over a target, to perform flight outdoors and to maneuver in close proximity to objects over a point of interest [36]. These features make them fit to replace humans in operations where human intervention becomes difficult or exhausting. There exist major challenges in low-altitude UAV-based object detection when compared with standard images: huge scale variations, dense distribution of objects, arbitrary orientations, object relative motion, and turbulence of atmospheric conditions leading to blurring of objects [37]. All these challenges led to the development of object detection approaches in low-altitude aerial images that use low-level scene features as well as deep features for processing. There exist some other important critical issues in object detection on drone platforms, due to which differences in mAP can be seen [38].

• Small object detection
Objects are usually small in size in aerial scenes, so it is necessary to extract more contextual semantic information for a discriminative representation of small objects.
• Occlusion
Occlusion is another acute issue that limits detection performance, especially in drone-based scenes where objects are usually occluded by other objects or background obstacles. It is essential to handle occlusions using context or semantic information.
• Large scale variations
Objects have substantial differences in scale, even for objects in the same category. Fusing multi-level convolutional features to integrate contextual semantic information is effective in handling scale variations, just like the architecture in FPN. In addition, multi-scale testing and model ensembles are effective ways to deal with scale variations.
• Class imbalance
Class imbalance is another issue in object detection. The most straightforward approach is using a sampling strategy to balance the samples across classes. Meanwhile, some methods integrate the weights of different object classes into the loss function to handle this issue, such as focal loss [39]. How to solve the class imbalance issue is still an open problem.

It is quite evident that in recent years a boost in research publications happened due to the emergence of deep learning-based object detection, but high accuracy values cannot be achieved in the case of low-altitude UAVs. The domain of object detection is infinite in nature if we consider each and every development, so we strictly stick to algorithms which have scope in low-altitude aerial images. The literature on object detection in aerial images has been classified into two categories: classical and modern object detection approaches. The classical category includes conventional techniques, covering vision-based as well as machine-classifier-based approaches, whereas the modern category deals with deep learning-based algorithms, which are our focus area. Classical approaches to object detection include all major developments made in the field of aerial images using handcrafted-feature-based machine learning approaches. Classical approaches include vision technologies in aerial images such as inertial optical flow [40,41], shape-based descriptors [42,43], online boosting with features based on histograms of oriented gradients (HOG) [44], deformable part based (DPM) descriptors [45], multiple trained cascaded Haar classifiers [46] and Markov random field descriptors [47]. Machine-learning-based approaches in aerial images make use of automated classifiers on handcrafted features to boost the performance of algorithms. These algorithms include Bayesian networks [48], graph cut methods [49], HOG with an SVM classifier [50], a hybrid of Viola-Jones and SVM [51], multi-scale HOG [52], AdaBoost classifiers [53], SIFT descriptors with an SVM classifier [35,54] and stochastic-constraint-based detection methods [55].


Inertial optical flow based evaluation was used to search for moving objects and to differentiate between linear and angular velocities. The shape-based feature extraction approach was used for detecting stationary and moving objects in cluttered scenes. Some detection systems were based on a Bayesian network which combined several learned features and made use of gradient mask filters as features for the low-resolution and blur-noised aerial data, but failed miserably. The mentioned classical approaches to object detection, such as inertial optical flow based evaluation, shape context feature descriptors and boosting frameworks with HOG features, were found to struggle to learn discriminative object-specific features such as contour size and hierarchical shape dynamics in aerial images. These methods were prone to producing more errors in practical conditions because they were not robust to moving objects, dynamic backgrounds or illumination variability, which are inherent characteristics of aerial images. As a result, they failed to attain considerable accuracy, because these models were limited by the resolution of the captured images and the insufficiency of the existing feature descriptions. For example, HOG, SIFT, Markov random field and other such feature descriptors are inefficient: feeding a vector with millions of numbers to an algorithm takes a large amount of time and also carries noisy information.

In recent years, to reduce human effort in processing and increase the efficiency of algorithms, deep learning-based object detection in the form of modern approaches came into existence. Deep object detection algorithms such as faster RCNN [56], Mask RCNN [57], FPN [58], R-FCN [59], YOLO [60], SSD [61] and RetinaNet [39] use more complex and deep visual features extracted from the image to generate image regions. The modern approaches of object detection have better accuracy than classical approaches due to greater computational processing capabilities. The modern approaches can further be divided into traditional deep and advanced object detectors. In the case of low-altitude aerial images, we will study why traditional deep learning-based algorithms do not perform well, so that more recent advanced algorithms such as Cascade RCNN [62], CornerNet [63], CenterNet [64] and RefineDet [65] need to be implemented to achieve better results.

• Modern approaches of object detection
Deep learning-based object detection algorithms are dominant and proven tools for enabling intelligent solutions to detection problems. Deep learning approaches automatically learn features of objects at multiple abstraction levels without depending on handcrafted features. Recent advancements in deep learning-based models have made object detection applications easier to develop than ever before. Besides, with current deep approaches focusing on full end-to-end detection and classification pipelines, performance has also improved significantly. Generally, in the object recognition process, a classifier takes an input image and produces a single output in the form of a probability distribution of class scores over multiple classes; but when the image has multiple objects of interest, classification produces less impressive results. A classifier might place the image into fewer positive categories, but it cannot locate objects in the picture, whereas an object detection technique gives much more confident predictions for the likeliness of multiple objects. Deep architectures such as LeNet [66], AlexNet [67], Inception [68], ResNet [69], DenseNet [70], Inception-ResNet [71] etc. were successful in achieving classification accuracy, but less effort was seen in object detection. The central issue debated during the ILSVRC workshop was whether the CNN classification results carried over to detection problems. The vastly deployed classification-based ImageNet [72] dataset cannot be directly employed for detection problems; as a result, the challenging PASCAL VOC dataset came into the picture for calculating detection results [16]. This debate motivated many researchers working in this direction to apply pre-trained models to the detection task. All existing object detection techniques in deep learning have been divided into two broad categories: region-proposal (two-pass) algorithms, and non-region-proposal (single-shot) detector algorithms. In region-proposal methods, input images are passed into a CNN after category-independent regions are proposed through proposal techniques. An image region is the area of interest in the processed image which a user wants to detect and classify. These regions are generated automatically by using specific techniques such as edge boxes [73], selective search [74] or a deep Region Proposal Network (RPN) [56], without considering the image features. From the extracted regions, a feature vector is extracted using a deep network, which is then used by an SVM for classification into a particular category. In contrast, single-pass detectors do not generate multiple regions of an image; they pass the whole image at once into a fixed-grid-based CNN. Bounding box coordinates are placed around the specified region in an image for detection. These methods are speedy and do not need a complex pipeline compared with region-based approaches.
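To make the two categories concrete, the following illustrative pseudocode sketches both pipelines; every function name here is a hypothetical placeholder rather than an API from any of the cited works:

```python
# Illustrative sketch of the two detection paradigms described above;
# all callables passed in are hypothetical placeholders, not a real API.

def two_stage_detect(image, propose_regions, region_features, classify):
    """Region-proposal (two-pass) pipeline: propose regions, then classify each."""
    detections = []
    for region in propose_regions(image):       # e.g. selective search or an RPN
        feats = region_features(image, region)  # per-region deep feature vector
        label, score = classify(feats)          # e.g. an SVM or a softmax head
        detections.append((region, label, score))
    return detections

def one_stage_detect(image, grid_cnn):
    """Single-shot pipeline: one forward pass over a fixed grid."""
    # The network directly emits box coordinates and class scores for
    # every grid cell, with no separate proposal step.
    return grid_cnn(image)
```

The sketch highlights the structural difference: the two-stage loop pays a per-region cost, while the single-shot variant amortizes all computation into one pass.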
The modern approaches of object detection have practical, real-time uses in an efficient manner through the ability to count people [75], detect faces [76], index in visual search engines and analyze aerial images [77]. The advantages of modern object detection over classical approaches lie in enhanced accuracy due to large processing capabilities and in automatically extracting features without performance overhead. But at the same time, there are disadvantages such as larger amounts of labeled data, computation overhead and hardware requirements. The overall performance of modern approaches makes them suitable for object detection in low-altitude aerial images. In the next section, a comprehensive analysis of several deep learning-based algorithms is presented with reference to UAV datasets, evaluating how object detection methods find classified regions and predict class scores.

3. Deep learning-based object detection algorithms

Aerial imaging through UAVs is used in numerous applications such as entertainment, detection and classification studies, wildlife observation, and other intriguing purposes. In the recent era, unlike aircraft, UAVs are affordable to end users looking for aerial imaging systems within a confined budget. The advanced approaches of deep learning-based object detection have a bright future in terms of efficiency. Among deep learning-based detectors, several innovations in object detection algorithms for low-altitude UAVs have been witnessed in recent years. Viewpoint variation is one of the biggest challenges in images captured from drones, since the dataset distribution contains images captured at a top-view angle, while other images might be captured from a lower view angle. The features learned from an object at different angles are not transferable, so it becomes mandatory to detect aerial objects with powerful detectors. Deep learning-based object detection algorithms have been categorized into two-stage, one-stage and advanced methods for aerial images, as highlighted in Fig. 3. Algorithms such as faster RCNN [56], Mask RCNN [57], Cascade RCNN [62], FPN [58] and R-FCN [59] fall under the taxonomy of two-stage detectors, whereas YOLO [78], SSD [61], RefineDet [65] and RetinaNet [39] fall under one-stage detectors. Recent advancements in object detection which are also quite popular for aerial data, such as CornerNet [63], Objects as Points [79] and FoveaBox [80], which are based on an anchorless methodology, are also listed in Fig. 3. Moreover, a brief description of each deep learning-based detector, the category to which it belongs, the backbone network, input sizes, GitHub code repositories and loss descriptions is given in Table 2. The loss description contains the classification and localization loss with respect to each detector. The overall objective loss function is a weighted sum of the localization loss and the confidence loss, where the localization loss measures the mismatch between the predicted bounding box and the ground truth box, and the classification loss is the loss in assigning class labels to predicted boxes. Moreover, a combination of L1 and L2 loss is known as smooth L1 loss, which is suboptimal for accurate object localization.
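For reference, a minimal NumPy sketch of the smooth L1 localization loss and the weighted overall objective just described; the weighting factor `alpha` is a free hyperparameter here, not a value prescribed by any specific detector:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss: quadratic for small residuals, linear for large ones."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta).sum()

def detection_loss(cls_loss, loc_loss, alpha=1.0):
    """Overall objective: weighted sum of confidence and localization terms."""
    return cls_loss + alpha * loc_loss
```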


In the next section, we briefly discuss all the developments made in object detectors with respect to low-altitude aerial images.

Fig. 3. Taxonomy of deep learning based object detection methods.

3.1. Two stage based object detection algorithms

Two-stage object detection algorithms simply mean detecting objects in two passes. The different stages generate a sparse set of regions of interest (RoIs) and classify each of them with a network. Early object detection models such as OverFeat [81] showed that the different tasks of localizing, classifying and predicting bounding boxes could be learned using a unified shared deep network. These approaches work inside a combined framework by using convolutional networks for detection. The multiscale sliding-window approach in the OverFeat algorithm can be implemented efficiently. One of the first advanced algorithms using deep learning for object detection, named RCNN [82], was published in 2014 and presented an almost 50% improvement on the object detection challenge [72]. RCNN computes object locations from a large set of region candidates, crops them, and classifies each using a deep network. Meanwhile, [83] proposed a deep CNN based on multi-scale spatial pyramid pooling to sample vehicle detections from aerial imagery at different sizes and learn multi-scale characteristics of objects. This advanced technique restores the edges of detected objects disturbed by environmental clutter, improving detections by avoiding the cropping-induced deformation of input images of different sizes. To reduce the expensive training time consumed in object detection through RCNN, the fast RCNN algorithm was proposed in [84], based on a box-regression approach comprising an end-to-end training algorithm which performs classification of object proposals and identifies spatial locations to obtain bounding box coordinates. The fast RCNN algorithm had several advantages when compared with RCNN:

• Quality of detection in terms of the performance metric mean average precision (mAP) was higher than RCNN.
• Training was done in a single stage by implementing a multi-task (classification as well as regression) loss.
• The training process simultaneously updates all deep layers.
• No memory storage is required for feature extraction.

Another major advancement in the case of fast RCNN is that the whole network can be trained with multi-task losses, which significantly improves accuracy. [56] again updated fast RCNN by introducing the algorithm named faster RCNN, incorporating the following advancements (a sketch of the anchor mechanism follows this list):

• Designing almost cost-free regions from an RPN, instead of the explicit region-proposal techniques used in RCNN and fast RCNN, produced a unified pipeline of fast RCNN and RPN as a single network.
• This method enabled an integrated object detection system to run at real-time frame rates.
• The RPN takes the output feature maps from the same deep network used in RCNN and slides filters over them to form region proposals, resulting in 4*k coordinates and 2*k scores per location in the output.
• It predicts offsets relative to the corners of reference boxes called anchors. These anchors are pre-selected with multiple scales and aspect ratios at each location.
• The learned RPN also enhanced region proposal quality and the cumulative object detection accuracy.
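A minimal sketch of pre-selecting k reference box shapes from scales and aspect ratios, as described in the list above (the per-location 4*k offsets and 2*k objectness scores are then regressed by the RPN); the scale and ratio values shown are illustrative defaults, not the paper's exact configuration:

```python
import numpy as np

def generate_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Enumerate k = len(scales) * len(ratios) reference (w, h) shapes per location."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)   # wider boxes for larger aspect ratios
            h = s / np.sqrt(r)
            anchors.append((w, h))
    return np.array(anchors)     # shape (k, 2); the RPN regresses 4*k box
                                 # offsets and 2*k scores at every map location

print(generate_anchors().shape)  # (9, 2) for 3 scales x 3 ratios
```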

Table 2
A brief description of deep learning based detectors.

Category | Backbone | Input size | Object detection algorithm | GitHub code repository | Classification loss | Localization loss
2-stage | VGG16 | 1000*600 | Faster RCNN (2015) | https://github.com/smallcorgi/Faster-RCNN_TF | Log loss over 2 classes | Smooth L1 loss
2-stage | Inc-Res-v2 | 1000*600 | Deformable R-FCN (2017) | https://github.com/msracver/Deformable-ConvNets/tree/master/rfcn | Cross entropy | Smooth L1 loss
2-stage | ResNeXt-101 | 1280*800 | Mask RCNN (2017) | https://github.com/matterport/Mask_RCNN | Categorical cross entropy | Smooth L1 loss
2-stage | Res101-FPN | 1280*800 | Cascade RCNN (2018) | https://github.com/zhaoweicai/Detectron-Cascade-RCNN | Categorical cross entropy | Smooth L1 loss
1-stage | VGG16 | 300*300 | SSD (2016) | https://github.com/balancap/SSD-Tensorflow | softmax_cross_entropy_with_logits | Smooth L1 loss
1-stage | ResNet-101 | 640*400 | RetinaNet (2017) | https://github.com/fizyr/keras-retinanet | Focal loss (α = 0.25) | Smooth L1 loss
1-stage | DarkNet-53 | 608*608 | YOLOv3 (2018) | https://github.com/pjreddie/darknet | Binary cross entropy | Sum of squared error
1-stage | VGG16 | 320*320 | RefineDet (2018) | https://github.com/sfzhang15/RefineDet | Cross entropy/log loss, softmax loss | Smooth L1 loss
1-stage | Hourglass | 512*512 | CornerNet (2018) | https://github.com/princeton-vl/CornerNet, https://github.com/princeton-vl/CornerNet-Lite | Focal loss (α = 2, β = 4) | Smooth L1 loss
1-stage | VGG16 | 512*512 | M2Det (2019) | https://github.com/qijiezhao/M2Det | softmax_cross_entropy_with_logits | Smooth L1 loss
1-stage | Hourglass | 512*512 | CenterNet (2019) | https://github.com/xingyizhou/CenterNet, https://github.com/Duankaiwen/CenterNet | Focal loss (α = 2) | L1 loss


The detection of objects in an image at multiple scales is a fundamental challenge in computer vision, and scale-invariant Feature Pyramid Networks (FPN) [58] seem to be a standard solution. The main objective of FPN is to produce a multi-scale feature representation at high-resolution levels. The principal advantage of featurizing each level of an image pyramid is that it produces a multi-scale feature representation in which all levels are semantically strong, including the high-resolution levels. A remarkable increase can be seen in average precision on the COCO dataset [17] by 2.3 points and the PASCAL dataset [16] by 3.8 points over the baseline of faster RCNN on ResNets [56]. FPN is easily extended to mask proposals and further improves average recall and speed significantly for object detection tasks and even in semantic segmentation methods [85]. FPNs can be utilized in many applications other than object detection, such as generating segmentation proposals. SharpMask [86] used FPNs to generate proposals, as they were trained on image crops, for predicting instance segments and respective class scores. Based on FPN, Mask R-CNN [57] further extends a mask predictor by adding an extra branch in parallel with bounding box recognition. Moreover, Cascade R-CNN [62] trains multi-stage R-CNNs with increasing IoU thresholds stage-by-stage, and thus the multi-stage R-CNNs are sequentially more powerful for accurate localization.
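The top-down, laterally connected fusion that FPN performs can be summarized in a few lines. The sketch below is illustrative only: `lateral` and `upsample` are placeholder callables standing in for the paper's 1x1 projection and 2x upsampling layers.

```python
import numpy as np

def fpn_top_down(backbone_maps, lateral, upsample):
    """Fuse backbone feature maps (ordered fine -> coarse) into a pyramid.

    `lateral` projects each backbone map to a common channel width and
    `upsample` doubles spatial resolution; both are placeholders here.
    """
    pyramid = [lateral(backbone_maps[-1])]           # start from the coarsest level
    for feat in reversed(backbone_maps[:-1]):
        top_down = upsample(pyramid[0])              # upsample the coarser fused map
        pyramid.insert(0, lateral(feat) + top_down)  # add the lateral connection
    return pyramid                                   # semantically strong at all scales

# Toy usage with identity projection and nearest-neighbor upsampling:
fine, coarse = np.ones((64, 64, 8)), np.ones((32, 32, 8))
p = fpn_top_down([fine, coarse], lambda x: x,
                 lambda x: x.repeat(2, axis=0).repeat(2, axis=1))
print([m.shape for m in p])  # [(64, 64, 8), (32, 32, 8)]
```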
As a result, the last-stage R-CNN can produce detections with the most accurate localization. Lastly, R-FCN [59] has been proposed for accurate and efficient object detection. In contrast to previous two-pass detectors such as fast and faster RCNN, which applied a costly per-region subnetwork, the R-FCN detector is fully convolutional, with almost all computation shared over the entire image. Region-based feature maps, or position-sensitive score maps, were proposed in R-FCN to address the tension between translation invariance in classification and translation variance in detection. This method adopts fully convolutional image classifier backbones such as ResNets [69] for object detection. Recently, two-stage deep learning detectors (R-CNNs) have achieved state-of-the-art detection performance in computer vision. But our focus is on detecting low-altitude aerial objects, and significant object detection accuracy cannot be achieved with the above-discussed two-stage methods, as they rely on sliding-window search and shallow-learning-based features with heavy computational costs and limited representation power. Several challenges limit the applications of R-CNNs in object detection from low-altitude aerial images [87]:

• The vehicles in large-scale aerial images are relatively small in size, and R-CNNs have poor localization performance with small objects.
• R-CNNs are particularly designed for detecting the bounding box of the targets without extracting attributes.
• Manual annotation is generally expensive, and the available manual annotations of vehicles for training R-CNNs are not sufficient in number.
• Faster R-CNN involves two fully connected layers for RoI recognition, while R-FCN produces large score maps.

Thus, the speed of these networks is slow due to the heavy-head design of the architecture. Even if we significantly reduce the base model, the computation cost cannot be decreased accordingly. Recently, a new two-stage detector, light-head R-CNN [88], addresses the shortcomings present in faster RCNN by making the head of the network as light as possible, using a thin feature map and a cheap R-CNN subnet. Further, [89] proposed Deformable Convolutional Networks to model geometric transformations by learning additional offsets without supervision.

Recent methods such as Cascade RCNN and light-head RCNN make advancements over existing deep learning-based object detectors in the case of low-altitude aerial images, but some modifications still have to be made, such as the use of attention mechanisms [90] in deep networks to detect objects of interest. Aerial images are of higher resolution in nature, so a larger receptive field is needed.

3.2. One stage object detection algorithms

Early success in the deep learning-based object detection field was achieved through two-stage detectors, but speed was a real challenge in those approaches. The higher efficiency of one-stage detectors over two-stage detectors makes them deployable in low-altitude object detection scenarios. Researchers eventually shifted towards one-stage detectors due to their adaptability in meeting challenges such as providing high speed and lower memory requirements. Single-stage algorithms have a different concept than two-stage detectors: the whole image is passed at once into a fixed-grid-based CNN rather than in patches. In the initial days of one-stage detection algorithms, [60] suggested a real-time single-pass detection algorithm named YOLO, which produced better results, i.e. higher mAP than two-stage detectors, in a short time. The key idea was to look at an image once to predict the number of objects and identify their locations. The YOLO approach trained on complete images and directly boosted detection performance. This integrated model had numerous benefits over established methods of object detection:

• Fast speed of the base network, which runs at 45 fps on a high-performance GPU.
• Learned generalizable representations of objects, meaning less chance of breaking down when applied to new domains.

The one-object rule of YOLO limits closely spaced detections, and high localization errors and lower recall values forced the development of YOLOv2, the second version of YOLO, which aims at improving accuracy significantly while maintaining speed. YOLOv2 pushes mAP up by adding batch normalization, a high-resolution classifier and convolution with anchor boxes. YOLO applied a softmax activation function for conversion of class scores into probabilities. YOLOv3 [78] replaced the softmax function with logistic classifiers to enable multi-label classification, and used binary cross-entropy loss to reduce computational complexity. YOLOv3 makes 3 predictions at 3 different scales per location, and to determine anchors it applies a k-means clustering process. This efficient development, YOLOv3, achieves comparable results to previous versions on low-altitude aerial datasets. This advanced version made significant improvements in detecting small-sized objects, but higher localization errors still exist.
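The k-means anchor-selection step just mentioned can be sketched as follows. This follows the commonly described 1 − IoU distance over ground-truth box sizes; it is an illustrative reconstruction, not the YOLOv3 source:

```python
import numpy as np

def iou_wh(box, centroids):
    """IoU between a (w, h) box and centroid boxes, all anchored at the origin."""
    inter = np.minimum(box[0], centroids[:, 0]) * np.minimum(box[1], centroids[:, 1])
    union = box[0] * box[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster ground-truth (w, h) pairs into k anchor priors with 1 - IoU distance."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.array([np.argmax(iou_wh(b, centroids)) for b in boxes])
        centroids = np.array([boxes[assign == i].mean(axis=0)
                              if np.any(assign == i) else centroids[i]  # keep empty clusters
                              for i in range(k)])
    return centroids
```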
There is also the development of YOLOv4 [91], released just a few days back, which provides efficient results at optimal speed. YOLOv4 consists of CSPDarknet53 [92] as the backbone, in which CutMix [93] and Mosaic data augmentation, DropBlock regularization [94] and class label smoothing [95] are utilized for the functioning of the backbone network. It achieved state-of-the-art performance with 43.0% average precision on the MS-COCO dataset. The SSD algorithm [61], a more advanced single-shot detector, proved more accurate when compared with two-stage detectors that performed region proposals. The feature detection of SSD provided significant improvement over previous detectors, computed by running a convolutional network on the input only once. Further, it utilized the anchor-box concept for learning the coordinates of bounding boxes. The detection results of SSD showed a significant improvement, with a mAP of 31.2% on the MS-COCO test dataset as compared to 21.6% for YOLO. The improvement in speed comes from changes such as eliminating bounding box proposals and the feature resampling stage, using separate filters for the suggested aspect ratios, and a small filter for predicting class scores and offsets in bounding box locations.

RefineDet [65] improves the one-stage detector with two-step cascaded regression. Its two inter-connected modules imitate the two-stage structure to produce accurate detection results with high efficiency. RefineDet achieves current state-of-the-art results on generic object detection (i.e., PASCAL VOC 2007, PASCAL VOC 2012 and MS COCO [17]). Some literature [96] has introduced an attention mechanism into RefineDet to further improve performance specifically for aerial images.

RetinaNet [39] is another FPN-based single-stage detector, which involves focal loss to address the class imbalance issue caused by the extreme foreground-background ratio. The loss contribution of the large number of background examples, which would otherwise produce a degenerate model, is handled in RetinaNet by introducing focal loss [39], a new dynamic loss function used to alter the weights between positive and negative examples of training data. Through the novel focal loss, the cross-entropy loss is reshaped to down-weight correctly classified training examples. It also prevents a large number of easy negative examples from flooding the object detector during the training process. To evaluate the effectiveness of focal loss, a simple dense detector was designed and trained. RetinaNet is able to match, or even achieve a better result of, nearly 39.1% AP on the MS-COCO test dataset when compared with YOLO or SSD.


The highest-accuracy object detectors are based on region proposal methods, i.e. the RCNN series, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-pass detectors have the potential to be faster and simpler but have trailed the accuracy of two-pass detectors. The central cause for this lack of accuracy is the extreme foreground-background class imbalance encountered during training of dense detectors.
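A minimal NumPy sketch of the focal loss idea described above, using the commonly reported defaults α = 0.25 and γ = 2 from Table 2 (illustrative, not the RetinaNet reference implementation):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss for binary labels y in {0, 1} and predicted probabilities p.

    The (1 - p_t)**gamma factor down-weights easy, well-classified examples
    so that the many easy negatives do not dominate training.
    """
    p = np.clip(p, 1e-7, 1 - 1e-7)          # avoid log(0)
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -(alpha_t * (1 - p_t) ** gamma * np.log(p_t)).sum()
```

With γ = 0 this reduces to ordinary (α-balanced) cross entropy; raising γ shrinks the loss of examples the model already classifies confidently.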
hard-mining strategies to tackle the detection of small objects. It detects
3.3. Advanced approaches in deep learning based object detection

In the previous sections, a detailed analysis was presented of the one-stage and two-stage deep learning-based object detectors in low-altitude UAV images. If we consider the inherent characteristics of aerial images, which are definitely more challenging than standard images, then technologically powerful detectors will be needed. In recent times, researchers have made significant contributions towards building detectors which are powerful yet efficient. The current state of the art utilizes the anchorless concept in one-stage detectors, which will also help aerial datasets obtain better results than one-stage and two-stage detectors. Although the anchor-based detectors discussed above as one-stage and two-stage detectors have achieved much progress in object detection, it is still difficult to select optimal anchor parameters. A few drawbacks of using anchor boxes in a one-stage detector are:

• A one-stage detector places anchor boxes densely over an image and generates final box predictions by scoring anchor boxes and refining their coordinates through regression.
• Anchor boxes for training create a huge imbalance between positive and negative anchor boxes and slow down training [39].
• They become very complicated when combined with multiscale architectures, where a single network makes separate predictions at multiple resolutions, with each scale using its own set of anchor boxes [39,61,97].

To guarantee high recall, more anchors are essential but introduce high computational complexity; moreover, different datasets correspond to different optimal anchors. To solve these issues, anchor-free detectors have attracted much research in recent times and have achieved significant advances with complex backbone networks. CornerNet [63] performs object detection by detecting objects as paired keypoints, eliminating the need to design a set of anchor boxes as commonly used in prior single-stage detectors. In addition, a new type of corner pooling layer was introduced that helps the network better localize corners. It achieves a 42.2% average precision on MS COCO, outperforming all existing one-stage detectors. To decrease the high processing cost, CornerNet-Lite [98] was introduced, which is a combination of two efficient variants of CornerNet: CornerNet-Saccade with an attention mechanism and CornerNet-Squeeze with a new compact backbone architecture. Moreover, [64] detects the object as a triplet, rather than a pair, of keypoints, which improves both precision and recall. [79] models the detected object as the center point of its bounding box and regresses all other object properties, such as size, 3D location, orientation, and even pose. On the other hand, [80] proposed an accurate, flexible and completely anchor-free framework. It predicts category-sensitive semantic maps for the object and a category-agnostic bounding box for each position that potentially contains an object.
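As an illustration of this center-point formulation, here is a simplified, hypothetical decoding sketch (peak picking on a single-class heatmap with a regressed size map); it mirrors the idea, not any specific repository's implementation:

```python
import numpy as np

def decode_centers(heatmap, wh, thresh=0.3):
    """Decode boxes from a center-point heatmap and a per-pixel (w, h) map.

    A pixel is a detection if it is a local maximum in its 3x3 neighborhood
    and above `thresh`; the box is centered there with the regressed size.
    """
    boxes = []
    H, W = heatmap.shape
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            p = heatmap[y, x]
            if p >= thresh and p == heatmap[y-1:y+2, x-1:x+2].max():
                w, h = wh[y, x]
                boxes.append((x - w / 2, y - h / 2, x + w / 2, y + h / 2, p))
    return boxes
```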
The related literature on advanced detectors with respect to low-altitude images is at a development stage, as only a few of them were tested, and only on the VisDrone aerial dataset [99]. The maximum value of mAP in the case of 10 classes is around 33%, which needs more input from researchers around the globe to achieve high accuracy in low-altitude aerial images.

Now, after knowing the taxonomy of deep learning-based object detection algorithms for low-altitude aerial images, we can say that the traditional detectors include faster RCNN, YOLOv2 and SSD, which fall short of large mAP, particularly when trained on challenging low-altitude aerial datasets. But if we have a look at some of the aerial literature studies, a jump in accuracies can be seen. The detection accuracy values in aerial images are relatively good when detectors are trained on real-time UAV-captured images or on older aerial datasets such as VEDAI and Munich Vehicle, and when the number of training images in the aerial datasets is large. The categorization into two-stage and one-stage detectors is done to analyze them in a better way. [100] focused on developing hard-mining strategies to tackle the detection of small objects. It detects vehicles using the faster RCNN algorithm in infrared images of the VEDAI dataset and obtained an average precision of 77.8 and recall of 31.04 in detection results. These results were inspired by the LeNet-5 network and were trained using SGD with a dropout of 0.5 to get 256*256 heatmaps from 1064*1064 pixel images. [101] proposed an enhanced detection technique based on faster RCNN to tackle challenges imposed by baseline RPNs, i.e., faster RCNN has poor localization performance, especially for small-sized objects, due to coarse feature maps. To improve the recall metric, a hyper RPN (HRPN) was employed to classify small-sized objects with a grouping of hierarchical feature maps on the Munich Vehicle dataset. The classifier was also altered by boosting classifiers to validate candidate regions. The mAP obtained under the 1 scale and 1 ratio setting was 0.7624; for 1 scale and 3 ratios it was 0.7950; for 3 scales and 1 ratio it was 0.7624; and for 3 scales and 3 ratios it was 0.7954. [102] introduced a deep network named the rotatable residual network, based on region proposals, to find multi-oriented objects in aerial images. This deep network used a rotatable RPN to generate rotatable RoIs from feature maps, and a strategy of batch-averaging rotatable anchors was employed to initialize the shape of vehicle objects. Further, a rotatable position-sensitive pooling layer was also designed to keep the orientation as well as position information on the DLR 3K Munich and VEDAI datasets. Recall values were provided with variable loss weights in ResNet-101 under an Intersection-over-Union (IoU) of 0.5: on DLR it was 0.733 and on VEDAI 0.528. [103] trained faster RCNN for strong performance in small unmanned aerial systems. To provide robust training data to train a CNN, a combination of publicly available datasets such as VEDAI, DLR 3K, SUSEX Avon Park, SUSEX Camp Atterbury and the DARPA TAILWIND dataset was used for computing performance. The VGG16 network achieved a confidence threshold of 0.627 for 200 iterations at an NMS threshold of 0.4, whereas the ZF network obtained a 0.686 threshold for 100 iterations at the same NMS threshold.
nificant advances with complex backbone networks. CornerNet [63] on MIT and Caltech car dataset to detect different types of vehicles
performs object detection by detecting objects as paired key points to which are common in the traffic scene. For car model, average accuracy
eliminate the need for designing a set of anchor boxes commonly used was found to be 79.9% for ZF network and 82.3% for VGG16 whereas for
in prior single-stage detectors. In addition, a new type of corner pooling minibus model, 73.9% and 74.8% respectively and SUV model, 68.3% and
layer was introduced that helped the network to better localize corners. 70.1% accuracy using improved faster RCNN network. [105] proposed an
It achieves a 42.2% average precision on MS COCO, outperforming all innovative bird detection framework in low-resolution aerial data im-
existing one-stage detectors. To decrease the high processing cost, ages using a deep network. The processing of aerial data images was
CornerNet-Lite [98] introduced which is a combination of two efficient done through converting low-resolution to high-resolution images by
variants of CornerNet: CornerNet-Saccade with an attention mechanism super-resolution CNN (SRCNN) and very deep super-resolution
and CornerNet-Squeeze with a new compact backbone architecture. (VDSR) techniques and then implemented faster RCNN algorithm to lo-
Moreover, [64] detected the object as a triplet, rather than a pair, of calize birds. Faster RCNN achieved mAP of 94.81% on BIRD-50 dataset
key points, which improves both precision and recall. [79] models the and 95.51% on CUB-200 dataset whereas YOLO obtained 96.77% and
detecting object as the center point of its bounding box, and localizes 97.71% respectively. The above discussed studies achieved significant
to all other object properties, such as size, 3D location, orientation, mAP due to the aerial datasets which have utilized are older and not
and even pose. On the other hand, [80] proposed an accurate, flexible in current use by researchers.
and completely anchor-free framework. It predicts category-sensitive Meanwhile, [106] proposed new dataset to localize waste plastic
semantic maps for the object and category-agnostic bounding box for bottles in the wild which has 25,407 UAV captured images with diverse
each position that potentially contains an object. The related literature backgrounds. The oriented bounding boxes were used to annotate for
of advanced detectors with respect to low-altitude images is in develop- achieving detailed information and several other object detection algo-
ment stage as only few of them were tested on only VisDrone aerial rithms were evaluated on UAV-BD dataset such as faster RCNN, SSD,
dataset [99]. The maximum value of mAP in case of 10 classes is around YOLOv2, and RRPN. The average precision values of RRPN (88.6%), SSD


In this case, the dataset obtained was of only a single class and very large in nature. Further, [107] used faster RCNN, based on the Caffe framework, for vehicle target detection and for drawing the moving trace of each vehicle. A crowdsourcing marking platform was used to mark vehicle targets in frames by adding class and location labels. The accuracy rate achieved was 96.5% on vehicle detection, and traffic flow statistics showed an average of 92.7% on total traffic conditions. [108] presented a novel dataset from the Microsoft open-source simulator named AirSim, based on wildlife monitoring, and applied the faster RCNN algorithm to 70 thermal infrared videos captured from the simulator. Performance metrics such as precision of 0.4690 and recall of 0.0925 were obtained when tested against models from the popular SPOT animal-poacher dataset. [109] implemented two main modules: detection by faster RCNN and a Hungarian-method-based tracking method using a deep CNN in UAV images. Training was performed on the VIVID and CAVIAR datasets while testing on real-time drone images, achieving a precision of 0.87. [110] designed concentric-circle and pentagon shaped UAV landing signs and employed faster RCNN to identify landing marks for UAVs. The experiments demonstrated that a speed of nearly 81 milliseconds per frame and 97.8% accuracy were achieved by using faster RCNN for classification and detection. [111] investigated the use of a regression-based single CNN named YOLOv2 for detection of vehicles in UAV-captured images, as well as the CSK tracking method as a novel data annotation method for a real-time UAV feed and the Stanford drone dataset. IoU was used as the performance score, and predictions overlapping the ground truth box by more than 0.7 were considered positive samples for training. All implementations were based on the Caffe framework and achieved higher mAP of 77.12%, 67.99% and 72.95% for SSD, YOLO and YOLOv2 respectively on the real-time dataset, as compared with 56.25%, 42.31% and 68.0% respectively on the standard drone dataset [112].
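[111] treats predictions overlapping the ground truth with IoU above 0.7 as positives; a minimal sketch of that overlap computation:

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# e.g. a predicted box counts as a positive training sample when
# iou(pred, ground_truth) > 0.7, as in [111].
```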
[113] presented a holistic approach to designing CNN networks for UAV purposes on low-power embedded processors. Real-time images were collected for training a single-pass detector, and the structure of the Tiny-YOLO model was used as the baseline, with 9 convolutional layers and between 4 and 6 max-pooling layers. Further, a comprehensive assessment of the proposed TinyYoloVoc, TinyYoloNet, SmallYolov3 and DroNet variants against the baseline network was performed, and DroNet achieved significant detection accuracy at 5 to 18 fps. [114] examined state-of-the-art convolutional detectors on 9525 labeled images of 11 multi-rotor drones captured using a Pan-Tilt-Zoom (PTZ) camera. PR curves were reported, and the detection speed in fps was 20.8 for SSD MobileNet, 12.0 for SSD Inception V2, 3.1 for RFCN ResNet 101, 2.4 for FRCNN ResNet 101, 0.7 for FRCNN Inception ResNet and 13.0 for YOLOv2. The highest accuracy was obtained by faster RCNN, which was nonetheless the slowest in detection and training, while YOLOv2 was much faster than the other models, highlighting the speed-accuracy trade-off.
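
To make the Tiny-YOLO-style design concrete, the sketch below assembles a backbone with 9 convolutional layers and 5 max-pooling stages, which falls within the 4-6 pooling range reported for [113]. It assumes PyTorch; the channel widths and the single prediction head are illustrative choices, not the exact configuration of the cited networks:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # 3x3 convolution + batch norm + leaky ReLU, the standard Tiny-YOLO unit.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True),
    )

class TinyYoloLikeBackbone(nn.Module):
    """9 convolutional layers with max-pooling after the first five blocks."""
    def __init__(self, num_outputs=125):  # e.g. 5 anchors x (20 classes + 5) for VOC
        super().__init__()
        widths = [16, 32, 64, 128, 256, 512, 1024, 1024]
        layers, c_in = [], 3
        for i, c_out in enumerate(widths):
            layers.append(conv_block(c_in, c_out))     # conv layers 1-8
            if i < 5:
                layers.append(nn.MaxPool2d(2, 2))      # 5 pooling stages: /32 overall
            c_in = c_out
        layers.append(nn.Conv2d(c_in, num_outputs, 1)) # 9th conv: the prediction head
        self.features = nn.Sequential(*layers)

    def forward(self, x):
        return self.features(x)

grid = TinyYoloLikeBackbone()(torch.zeros(1, 3, 416, 416))  # shape: (1, 125, 13, 13)
```

On a 416 × 416 input this yields the 13 × 13 prediction grid typical of Tiny-YOLO-style detectors.
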
[115] evaluated the YOLOv2 algorithm on a custom dataset of 500 manually labeled images retrieved from a Parrot A.R. Drone. The main goal was to detect falling persons, who were then tracked with a Kalman filter using vision-based drone control. The algorithm was trained in mini-batches of 2000 with a learning rate of 0.001 and momentum of 0.99 over 2000 epochs. Out of 86.24% positive detection results, false positives and false negatives amounted to 5.97% and 13.57% respectively, and manually captured real-time UAV images also achieved high detection accuracy. A large artificial aerial dataset produced good results as well: [116] proposed an end-to-end YOLOv2 object detection model trained on an artificial dataset of background-subtracted real images. The dataset contained around 676,534 images with diverse backgrounds, generated by a dedicated dataset-preparation algorithm, and the network was trained and fine-tuned for 10k iterations with a batch size of 128, with batch normalization placed after the convolutional layers.
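
As a rough illustration of the training regimes these studies report, the sketch below wires the stated hyperparameters (learning rate 0.001, momentum 0.99) into an SGD optimizer. It assumes PyTorch, and the model, loss function and data loader are hypothetical placeholders rather than the authors' actual Darknet pipelines:

```python
import torch

# Hypothetical stand-ins for the detector and the labeled UAV images.
model = torch.nn.Conv2d(3, 8, 3)                                  # placeholder network
loader = [(torch.randn(8, 3, 64, 64), torch.randn(8, 8, 62, 62))] * 10

# Hyperparameters as reported in [115]: learning rate 0.001, momentum 0.99.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.99)
loss_fn = torch.nn.MSELoss()                                      # placeholder loss

for epoch in range(3):          # the cited study ran 2000 epochs; 3 keeps this quick
    for images, targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), targets)
        loss.backward()
        optimizer.step()
```
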
[117] employed SSD, a deep object detector, to produce object regions of interest from low-altitude aerial images. An alternative deep network capable of handling the pedestrian action labels offered by human sources was also used to acquire a common sub-space; together, the two networks form the proposed two-step framework. A summary of these studies on challenging low-altitude UAV datasets is presented in Table 3, which describes the deep detector approach, dataset, training details and performance metric of each. The reported mAP values are low and need sincere improvement through closer attention to the detection of aerial objects.

4. Low-altitude UAV datasets

Most deep learning-based object detection algorithms have been trained on the PASCAL VOC dataset to detect different objects in dynamic environments. The dataset consists of 20 categories closely related to human life, including humans and animals, vehicles, and indoor items. From these object categories, one can see that the actual size of most objects in the dataset is large; therefore, a detection model trained on a dataset composed of large objects will not effectively detect the small objects encountered in reality. To achieve better detection accuracy, aerial datasets should be utilized in an effective manner. In recent times, a significant number of low-altitude UAV datasets have been made open source for researchers and developers to analyze the performance of deep learning-based object detection algorithms, and the true capability of a detection algorithm is judged by choosing a benchmark dataset suited to the specific problem. We have collected datasets from heterogeneous resources to form a list of all available low-altitude UAV datasets for the evaluation of detection algorithms, as depicted in Table 4, considering aerial images captured by drones flying approximately 120 m or less above the ground. Snapshots of standard UAV datasets such as CARPK [75], UAVBD [106], Okutama [122] and VEDAI [123], with varying scales and orientations, are presented in Fig. 4. Low-altitude UAV datasets are inherently different from ground-video detection datasets such as VOT2015 [124]: the small scale and multiple orientations of objects at different elevations, when recording from a high perspective, make it difficult to detect all the objects in aerial images. UAV123 has 123 video sequences and 110k frames; it can be used for detection of multiple object classes such as car, bike, person, ball, bird, etc., and can easily be embedded into the visual tracker benchmark by merging the respective configuration and sequence files, the main aim of this benchmark being to support trajectory forecasting in object tracking problems. The Stanford Campus dataset [112] compiles around 50 videos recorded in the real-time outdoor environment of a university campus, where movement follows social-etiquette ground rules, and it includes not only pedestrian object classes but also bicyclists, skateboarders, carts, cars, and buses. A comprehensive list of specifically low-altitude UAV datasets for the evaluation of deep learning-based object detection and tracking is shown in Table 4, and some of these datasets also provide source code for their implemented methods so that users can get an idea of how to approach the problem.


Table 3
A summary of object detection mAP values when trained on low-altitude aerial datasets.

Studies | Description | Dataset | Training | mAP
[118] | Introduced an aerial dataset from a drone and a processing method to categorize human pose estimation as normal or abnormal; owing to perspective projection in in-flight images, people look tilted. | Aerial dataset of around 1350 images formed using a DJI Phantom4 drone | Resolutions of 416*416, 480*480 and 544*544 were considered for 45k iterations with 0.9 momentum and a weight decay of 0.0005. | 38.72
[117] | Proposed a two-step framework to generate object proposals and fuse resolution proposals with different possible actions via a reinitialized VGG16 network. | Okutama-Action dataset | Of a total of 33 videos, 24 were selected for training and the remainder for the test phase. | 18.80
[119] | A scale-aware network is proposed to determine the scale of predefined anchors, which can effectively reduce the scale search range, reduce the risk of overfitting, and improve detection accuracy and speed in aerial images. | VisDrone dataset | Open-source code of mask RCNN with Pytorch. | 33.9
[120] | The input image is fed to a ResNet50 backbone implemented with deformable convolution layers; the feature maps are further refined with FPN, after which an RPN extracts Regions of Interest (RoIs). | VisDrone dataset | Each image is segmented into 4 × 4 blocks on average and merged into the training set with the original images, increasing the training set to 5x the original. | 22.61
[77] | The Clustered object Detection (ClusDet) network consists of three key components: (1) a cluster proposal subnet (CPNet); (2) a scale estimation subnet (ScaleNet); and (3) a dedicated detection network (DetecNet). | VisDrone dataset | Each image is uniformly divided into 6 and 4 chips without overlap. | 32.4
[23] | Introduces a Deep Feature Pyramid Network (DFPN) architecture. Similar to FPN, the goal is to leverage a ConvNet's pyramidal feature hierarchy, which has semantics from low to high levels, and build deep feature pyramids with high-level semantics throughout. | VisDrone dataset | Training is performed using SGD with an initial learning rate of 0.0001. | 30.6
[96] | Different from ClusDet, this method considers the regions where difficult targets are concentrated, and it abandons the ScaleNet of ClusDet to streamline the entire process. | VisDrone dataset | Cropped images are generated for each image, making the entire training dataset four times larger than the original. | 30.3
[121] | Proposes the novel PENet structure to detect objects in aerial images. PENet has three components: a Mask Re-sampling Module (MRM), Coarse-PENet (CPEN), and Fine-PENet (FPEN). | VisDrone dataset | Three additional classes (human, non-vehicles, and vehicles) are added to the existing 10 classes of the VisDrone dataset. | 41.1
[37] | ResNeXt-101 based multi-scale inference and bounding box voting. | VisDrone dataset | Networks are trained on 8 NVIDIA GTX 1080Ti GPUs, using mini-batch SGD as the optimization method. | 35.69

Table 4 mentions the year in which each dataset was made public, along with its format and annotation classes, from which all the required information about its application areas can be gathered. As observed from Table 4, the Okutama dataset [122] is specifically dedicated to detecting human actions involving different humans and objects, whereas UAV-Gesture is dedicated mainly to recognizing gestures of humans captured by a low-altitude flying drone. The CARPK dataset [75] provides localization and counting of car objects in parking lots to gather free-space information for new entrants. The UAV-BD dataset [106] is dedicated to procuring waste plastic bottles from mountains and wild grasses for recycling from a drone's view, and the VIRAT dataset [125] is confined to the detection of vehicles and pedestrians in complex visual events. The recent and challenging aerial datasets include UAV-Gesture and VisDrone. The VisDrone dataset was collected using various drone platforms and under various weather and lighting conditions; its frames are manually annotated with more than 2.6 million bounding boxes of targets of frequent interest, such as pedestrians, cars, bicycles, and tricycles, and some important attributes, including scene visibility, object class and occlusion, are also provided for better data utilization.

Table 4
Comprehensive list of low-altitude UAV datasets.

Year | Dataset | Height | Format | Description | Annotation classes | Task | Reference link
2011 | VIRAT | Not specified | 25 videos | event recognition in surveillance | Horizontal BB, multi class | Detection, Tracking | https://www.crcv.ucf.edu/data/VIRAT.php
2016 | UAV123 | 10-50 m | 110k frames | tracking objects in diverse scenes | Horizontal BB, multi class | Tracking | https://ivul.kaust.edu.sa/Pages/Dataset-UAV123.aspx
2016 | Stanford Campus | Not specified | 50 videos | trajectory forecasting | Horizontal BB, multi class | Tracking | http://cvgl.stanford.edu/projects/uav_data/
2016 | VEDAI | Not specified | 1200 images | small vehicle detection | Oriented BB, multi class | Detection | https://downloads.greyc.fr/vedai/
2017 | CARPK | 40 m | 90K images | counts and localizes target cars in videos | Horizontal BB, single class | Detection, Counting | https://lafi.github.io/LPN/
2017 | UAVDT | 10-70 m | 80k frames | detection of vehicles in complex backgrounds | Horizontal BB, multi class | Detection, Tracking | https://sites.google.com/site/daviddo0323/projects/uavdt
2017 | Okutama | 10-45 m | 77.4k frames | concurrent human action detection | Horizontal BB, single class, multi actions | Detection | http://okutama-action.org
2018 | UAV-BD | 10-30 m | 25k images | finding and localizing plastic bottles in the wild | Oriented BB, single class | Detection | http://jwwangchn.cn/UAV-BD/
2018 | UMCD | 6-15 m | 50 videos | UAV mosaicking and change detection | Attributes (object type, shape, w.r.t. background), multi class | Object and change detection | http://www.umcd-dataset.net
2018 | UAV-GESTURE | 3-5 m | 119 videos | gestures of humans | Horizontal BB, single class, multi action | Detection, Tracking | https://github.com/asankagp/UAV-GESTURE
2018 | VisDrone | Not specified | 179,264 frames | detect and track multiple object categories | Horizontal BB, multi class | Detection, Tracking | http://www.aiskyeye.com


Fig. 4. Snapshots of the CARPK (a), Okutama (b) and VEDAI (c) aerial datasets [75,122,123].

The majority of datasets provide refined ground-truth annotations of the images, such as region-of-interest coordinates, the class to which each image belongs, track ids, etc. Although very little work has yet been done on the available low-altitude UAV datasets, it can be observed that the detection accuracies achieved on aerial datasets are lower than those on standard image datasets.
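
Such annotations typically come as plain-text files with one comma-separated box per line. The sketch below parses records in a VisDrone-style layout; the assumed field order (left, top, width, height, score, category, truncation, occlusion) is an illustration and should be checked against each dataset's own toolkit before use:

```python
from dataclasses import dataclass

@dataclass
class Box:
    left: int
    top: int
    width: int
    height: int
    score: int
    category: int
    truncation: int
    occlusion: int

def parse_annotation_file(path):
    # Read one comma-separated box per line; the field order assumed here is
    # (left, top, width, height, score, category, truncation, occlusion).
    boxes = []
    with open(path) as f:
        for line in f:
            fields = line.strip().strip(',').split(',')
            if len(fields) >= 8:
                boxes.append(Box(*[int(v) for v in fields[:8]]))
    return boxes
```
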
Additionally, the use of the standard low-altitude UAV datasets by the various deep detectors is depicted in Fig. 5, and the cumulative number of publications in categories such as one-pass, two-pass and advanced approaches is discussed. Over the years 2013–2020, a number of publications appeared across these categorizations of object detection algorithms; the cumulative sum for each algorithm is listed in brackets, and datasets such as CARPK, VEDAI, Stanford Drone and VisDrone display a greater contribution than the other datasets. It can be concluded that, in recent years, one-pass and other advanced detectors attain better mAP when trained on the above-discussed aerial datasets.

5. Discussions

Deep learning-based aerial object detection has proven successful in ensuring public safety in real-time, crucial applications such as motor vehicle accidents, ship collisions, border and power line surveillance, and solar farm energy inspection [126]. We have broadly discussed two categories of object detection methods for low-altitude aerial images, i.e. one-pass and two-pass detectors. The two-stage algorithms achieve significant results but at slower speed, whereas recent advanced approaches such as CenterNet, RefineDet and CornerNet offer better accuracy on aerial images at faster speed. The current state-of-the-art two-stage methods, such as Faster R-CNN [56], R-FCN [59], FPN [58] and Cascade RCNN [62], have three benefits over the one-stage detectors, which are as follows:

(1) Applying a two-stage structure with sampling heuristics to handle class imbalance;
(2) Utilizing a two-step cascade to regress the object box parameters;
(3) Using two-stage features to describe the objects.

The performances of the various discussed algorithms on low-altitude UAV datasets can thus be evaluated. This section provides a few useful comprehensions about the growth of the various object detection techniques for the assistance of researchers, academicians and end users. Different findings have been identified through the literature survey that support further research into low-altitude UAV processing capabilities. Effective deep learning techniques such as one-pass and two-pass detectors lag behind when compared with advanced techniques, as the success achieved by recent detectors in low-altitude UAV object detection is remarkable. In our observation, a substantial proliferation in the number of deep learning approaches for object detection on low-altitude UAV datasets can be seen in recent publications, as shown in Fig. 6: after 2016, one- and two-stage detectors drew publication interest among researchers, and recently, advanced detectors have shown more progress on low-altitude UAV images.

5.1. Performance-based comparisons in object detection methods

It is very important to compare performance accuracies as well as computational complexities on standard image recognition datasets as well as on low-altitude aerial images. A good algorithm balances the trade-off between accuracy and inference time. The accuracy metric used is called Mean Average Precision, or mAP. Average Precision (AP) is calculated for each class from the area under the precision-recall curve: predictions are sorted by their confidence score from highest to lowest, and 11 confidence thresholds, called ranks, are chosen such that the recall at those thresholds takes the 11 values from 0 to 1 in steps of 0.1 (i.e., 0, 0.1, 0.2, ..., 0.9, 1.0). AP is then computed as the average of the maximum precision values at these 11 chosen recall values. We use the evaluation protocol of MS COCO [17] to evaluate the results of detection algorithms, including the AP, AP50 and AP75 metrics. Specifically, AP is computed by averaging over all 10 Intersection over Union (IoU) thresholds (i.e., the range [0.50:0.95] with uniform step size 0.05) over all categories, and is used as the primary metric for ranking; AP50 and AP75 are computed at the single IoU thresholds 0.5 and 0.75 over all categories.
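
The 11-point scheme just described, and the COCO-style averaging over IoU thresholds, can both be written down compactly. The sketch below is a minimal illustration that assumes a precomputed precision-recall curve; it is not a full matching/evaluation pipeline such as pycocotools:

```python
import numpy as np

def eleven_point_ap(recall, precision):
    # VOC-style 11-point interpolated AP: average the maximum precision
    # attained at recall >= r, for r in {0.0, 0.1, ..., 1.0}.
    recall, precision = np.asarray(recall), np.asarray(precision)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        ap += precision[mask].max() if mask.any() else 0.0
    return ap / 11.0

def coco_style_ap(ap_at_iou):
    # COCO-style AP: mean of the per-threshold APs over the ten IoU cut-offs
    # 0.50, 0.55, ..., 0.95; the entries at 0.50 and 0.75 correspond to the
    # AP50 and AP75 metrics reported in Tables 5 and 6.
    return sum(ap_at_iou.values()) / len(ap_at_iou)

pr_ap = eleven_point_ap([0.0, 0.5, 1.0], [1.0, 0.8, 0.4])        # ~0.636
overall = coco_style_ap({t: pr_ap for t in np.arange(0.5, 1.0, 0.05)})
```
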

Fig. 5. Use of standard UAV datasets by various object detection algorithms.

Fig. 6. A summary of papers published in deep learning-based UAV object detection on low-altitude datasets.


Table 5
Detection results on MS-COCO test dataset [63].

Method | Backbone | AP | AP50 | AP75
Two-pass detectors:
Faster RCNN | ResNet101 | 36.2 | 59.1 | 39.0
Mask-RCNN | ResNeXt101 | 39.8 | 62.3 | 43.4
Cascade RCNN | ResNet101 | 42.8 | 62.1 | 46.3
One-pass detectors:
Yolov2 | DarkNet-19 | 21.6 | 44.0 | 19.2
SSD513 | ResNet101 | 31.2 | 50.4 | 33.3
RetinaNet | ResNet101 | 39.1 | 59.1 | 42.3
RefineDet | ResNet101 | 41.8 | 62.9 | 45.7
CornerNet | Hourglass-104 | 40.6 | 56.4 | 43.2

Detection results on the challenging MS-COCO dataset are reported in Table 5, and even strong detectors attain much lower accuracy on this challenging dataset of 80 classes. The recent anchorless detector CornerNet, which is built on a novel Hourglass-104 backbone, outperforms the existing detectors. Table 5 is divided into two-pass and one-pass detectors according to their respective AP scores, and it is quite clear that recent approaches such as RetinaNet, Cascade RCNN, RefineDet and CornerNet perform better than the faster RCNN, YOLO and SSD algorithms. However, on the challenging aerial VisDrone dataset, even the highest-performing detector on the MS-COCO test-dev dataset, CornerNet, achieves an average mAP of only 17.41, which is very low compared with its 40.6 on MS-COCO, as depicted in Table 6. The aerial images are high-resolution, about 2,000 × 1,500 pixels, whereas most images in standard datasets such as MS-COCO are less than 500 × 500 pixels.

Multiple deep learning detectors for object detection tasks have been studied for low-altitude UAVs in the previous sections. The relative share of the major deep learning-based object detection algorithms in low-altitude UAV work is depicted in Fig. 7, presented to create awareness among readers about the growth of the detection techniques with respect to each other in terms of publication share. There exist several object detection algorithms based on advanced deep learning, and choosing the right method is crucial and depends on the specific problem chosen by the user. It can be seen that one-stage and two-stage detectors have a relatively larger share than advanced detectors, as they enjoy more implementation support; advanced detectors are, however, more successful at detecting low-altitude aerial objects. The current mAP values can be improved if the research focus is drawn in the right direction. Additionally, a few recommendations for future research are suggested by emphasizing some of the open issues prevalent in the domain of object detection. Some recommended solutions include:

a. To prevent large-scale features from covering the small-scale features of aerial images, the feature tensors output from the different RoI pooling operations should be normalized before those tensors are concatenated (see the sketch after this list).
b. In order to obtain abstract object features, ensure that there are enough pixels to describe small objects, so that a combination of features at different scales represents the local details of the object.
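
Recommendation (a) can be sketched directly. Assuming PyTorch, the snippet below L2-normalizes each RoI-pooled feature tensor along its channel dimension before concatenation, so that large-magnitude scales do not dominate the fused descriptor; the tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def normalize_and_concat(roi_features):
    # L2-normalize each RoI-pooled tensor along the channel axis before
    # concatenation; `roi_features` is a list of (num_rois, channels_i, h, w)
    # tensors sharing num_rois and spatial size.
    normalized = [F.normalize(f, p=2, dim=1) for f in roi_features]
    return torch.cat(normalized, dim=1)

fused = normalize_and_concat([torch.randn(4, 256, 7, 7), torch.randn(4, 512, 7, 7)])
# fused.shape -> (4, 768, 7, 7)
```
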
The technological interventions and a host of applications have influenced the aerial imaging market, which is expected to expand at a rate of 14.2% in the coming years. Altogether, the pace of technological advancement indicates that aerial imaging techniques will truly evolve in the coming years, and sincere efforts are needed in the object detection field for low-altitude aerial images.

Fig. 7. Relative share of various deep learning based object detection algorithms.

5.2. Conclusion and future scope

Object detection has always been a fundamental but challenging issue in computer vision. To the best of our knowledge, this is the first survey in the literature that focuses on object detection using deep learning in low-altitude UAV datasets. The reviewed studies of object detection algorithms on low-altitude UAV datasets show that the inherent characteristics of aerial images pose serious challenges to algorithm performance. In the present study, a comprehensive survey of deep learning-based object detection algorithms has been carried out, particularly on aerial datasets. Our work shows that, on aerial datasets, recent advanced deep learning detectors such as RetinaNet, Cascade RCNN and CornerNet achieve better mAP than previous state-of-the-art detectors, among which faster RCNN, YOLO and SSD are listed. We have also highlighted the pros and cons of one-pass and two-pass detection algorithms with respect to aerial images, and this study will be helpful for researchers interested in exploring object detection from low-altitude aerial images. For instance, on the challenging aerial VisDrone dataset, even the highest-performing detector on the MS-COCO test dataset, CornerNet, achieves an average mAP of only 17.41, which is very low compared with its 40.6 on standard images. The aerial objects of interest are often too small and too dense relative to the images; in addition, objects of interest often appear at different relative sizes, which makes them difficult to detect with standard algorithms. It is evident that sincere research focus is needed on the application of deep learning-based object detection algorithms to low-altitude aerial images.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Table 6
Detection results on test dataset of VisDrone 2019 [99].

Method | AP | AP50 | AP75
Faster RCNN | 3.55 | 8.75 | 2.43
R-FCN | 7.20 | 15.17 | 6.38
Cascade RCNN | 16.09 | 31.91 | 15.01
Yolov3 | 10.25 | 21.56 | 8.70
SSD | 2.52 | 4.78 | 2.47
RetinaNet | 11.81 | 21.37 | 11.62
RefineDet | 14.90 | 28.76 | 14.08
CornerNet | 17.41 | 34.12 | 15.78

References

[1] E. Semsch, M. Jakob, D. Pavlicek, M. Pechoucek, Autonomous UAV surveillance in complex urban environments, Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, 2, 2009, pp. 82–85.
[2] M. Tzelepi, A. Tefas, Human crowd detection for drone flight safety using convolutional neural networks, Signal Processing Conference (EUSIPCO), 2017 25th European, 2017, pp. 743–747.
[3] B. Dipert, Vision Processing Opportunities in Drones, [Online]. Available: https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-visiiontraining/documents/pages/drones 2017 (Accessed: 24-Jan-2019).
Jan-2019).


[4] G. McNeal, Drones and Aerial Surveillance: Considerations for Legislatures, [Online]. Available: https://www.brookings.edu/research/drones-and-aerial-surveillance-considerations-for-legislatures/ 2014 (Accessed: 16-Dec-2018).
[5] L. Ruth, Regulation of Drones: Comparative Analysis, 2019.
[6] K. Haye, Drones-Reporting for Work, 2019.
[7] P. Cohn, A. Green, M. Langstaff, M. Roller, Commercial Drones are Here: The Future of Unmanned Aerial Systems, [Online]. Available: https://www.mckinsey.com/industries/capital-projects-and-infrastructure/our-insights/commercial-drones-are-here-the-future-of-unmanned-aerial-systems 2017 (Accessed: 16-Feb-2019).
[8] I. Colomina, P. Molina, Unmanned aerial systems for photogrammetry and remote sensing: a review, ISPRS J. Photogramm. Remote Sens. 92 (2014) 79–97.
[9] Interact Analysis-A New Version of Intelligent Automation, [Online]. Available: https://www.interactanalysis.com/drones-market-2022-predictions/ 2014.
[10] Z. Zhou, J. Irizarry, Y. Lu, A multidimensional framework for unmanned aerial system applications in construction project management, J. Manag. Eng. 34 (3) (2018) 4018004.
[11] S. Todorovic, M.C. Nechyba, A vision system for intelligent mission profiles of micro air vehicles, IEEE Trans. Veh. Technol. 53 (6) (2004) 1713–1725.
[12] Z.-Q. Zhao, P. Zheng, S. Xu, X. Wu, Object detection with deep learning: a review, arXiv Prepr. arXiv1807.05511 11 (2018) 3212–3222.
[13] K. Mok, Deep Learning Drone Detects Fights, Bombs, Shootings in Crowds, [Online]. Available: https://thenewstack.io/deep-learning-drone-detects-fights-bombs-shootings-in-crowds/ (Accessed: 13-Jan-2019).
[14] Detection and Counting of Arabian Oryx from Aerial Image, [Online]. Available: https://blogs.flytbase.com/ai-drones/ (Accessed: 18-Sep-2018).
[15] A.R. Pathak, M. Pandey, S. Rautaray, Application of deep learning for object detection, Procedia Comput. Sci. 132 (2018) 1706–1717.
[16] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, A. Zisserman, The pascal visual object classes (voc) challenge, Int. J. Comput. Vis. 88 (2) (2010) 303–338.
[17] T.-Y. Lin, et al., Microsoft coco: common objects in context, European Conference on Computer Vision 2014, pp. 740–755.
[18] G.-S. Xia, et al., DOTA: a large-scale dataset for object detection in aerial images, Proc. CVPR, 2018.
[19] A. Al-Kaff, D. Martin, F. Garcia, A. de la Escalera, J.M. Armingol, Survey of Computer Vision Algorithms and Applications for Unmanned Aerial Vehicles, Expert Syst. Appl., 2017.
[20] S.M. Adams, C.J. Friedland, A survey of unmanned aerial vehicle (UAV) usage for imagery collection in disaster research and management, 9th International Workshop on Remote Sensing for Disaster Response, 8, 2011.
[21] A. Puri, A survey of unmanned aerial vehicles (UAV) for traffic surveillance, Dep. Comput. Sci. Eng. Univ. South Florida (2005) 1–29.
[22] S. Agarwal, J.O. Du Terrail, F. Jurie, Recent advances in object detection in the age of deep convolutional neural networks, arXiv Prepr. arXiv1809.03193 (2018) 1–104.
[23] S. Vaddi, C. Kumar, A. Jannesari, Efficient object detection model for real-time UAV applications, arXiv Prepr. arXiv1906.00786 (2019) 1–10.
[24] G. Cheng, J. Han, A survey on object detection in optical remote sensing images, ISPRS J. Photogramm. Remote Sens. 117 (2016) 11–28.
[25] J. Hosang, R. Benenson, P. Dollár, B. Schiele, What makes for effective detection proposals? IEEE Trans. Pattern Anal. Mach. Intell. 38 (4) (2016) 814–830.
[26] A. Brunetti, D. Buongiorno, G.F. Trotta, V. Bevilacqua, Computer vision and deep learning techniques for pedestrian detection and tracking: a survey, Neurocomputing 300 (2018) 17–33.
[27] Z.-Q. Zhao, P. Zheng, S. Xu, X. Wu, Object detection with deep learning: a review, IEEE Trans. Neural Networks Learn. Syst. 30 (11) (2019) 3212–3232.
[28] L. Liu, et al., Deep learning for generic object detection: a survey, Int. J. Comput. Vis. 128 (2) (2020) 261–318.
[29] Z. Zou, Z. Shi, Y. Guo, J. Ye, Object detection in 20 years: a survey, arXiv Prepr. arXiv1905.05055 (2019) 1–39.
[30] C. Kanellakis, G. Nikolakopoulos, Survey on computer vision for UAVs: current developments and trends, J. Intell. Robot. Syst. 87 (1) (2017) 141–168.
[31] X. Wu, D. Sahoo, S.C.H. Hoi, Recent Advances in Deep Learning for Object Detection, Neurocomputing, 2020.
[32] S. Rathinam, et al., Autonomous searching and tracking of a river using an UAV, 2007 American Control Conference 2007, pp. 359–364.
[33] P.-M. Olsson, J. Kvarnström, P. Doherty, O. Burdakov, K. Holmberg, Generating UAV communication networks for monitoring and surveillance, 2010 11th International Conference on Control Automation Robotics & Vision 2010, pp. 1070–1077.
[34] E.A. George, G. Tiwari, R.N. Yadav, E. Peters, S. Sadana, UAV systems for parameter identification in agriculture, 2013 IEEE Global Humanitarian Technology Conference: South Asia Satellite (GHTC-SAS) 2013, pp. 270–273.
[35] B. Xu, X. Xu, C.-M. Own, On the feature detection of nonconforming objects with automated drone surveillance, Proceedings of the 3rd International Conference on Communication and Information Processing 2017, pp. 484–489.
[36] S. Campbell, W. Naeem, G.W. Irwin, A review on improving the autonomy of unmanned surface vehicles through intelligent collision avoidance manoeuvres, Annu. Rev. Control. 36 (2) (2012) 267–283.
[37] J. Zhou, C.-M. Vong, Q. Liu, Z. Wang, Scale adaptive image cropping for UAV object detection, Neurocomputing 366 (2019) 305–313.
[38] P. Zhu, L. Wen, X. Bian, L. Haibin, Q. Hu, Vision meets drones: a challenge, arXiv Prepr. arXiv1804.07437 (2018) 1–11.
[39] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, arXiv Prepr. arXiv1708.02002 (2017) 2980–2988.
[40] A. Miller, P. Babenko, M. Hu, M. Shah, Person tracking in UAV video, Multimodal Technologies for Perception of Humans, Springer 2008, pp. 215–220.
[41] D. Meier, R. Brockers, L. Matthies, R. Siegwart, S. Weiss, Detection and characterization of moving objects with aerial vehicles using inertial-optical flow, 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2015, pp. 2473–2480.
[42] S. Wang, Vehicle detection on aerial images by extracting corner features for rotational invariant shape matching, 2011 IEEE 11th International Conference on Computer and Information Technology 2011, pp. 171–175.
[43] S.M. Thornton, M. Hoffelder, D.D. Morris, Multi-sensor detection and tracking of humans for safe operations with unmanned ground vehicles, Proceedings 1st Workshop on Human Detection from Mobile Robot Platforms, IEEE ICRA, IEEE, 2008.
[44] A. Su, X. Sun, H. Liu, X. Zhang, Q. Yu, Online cascaded boosting with histogram of orient gradient features for car detection from unmanned aerial vehicle images, J. Appl. Remote. Sens. 9 (1) (2015) 96063.
[45] M. Andriluka, et al., Vision based victim detection from unmanned aerial vehicles, 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems 2010, pp. 1740–1747.
[46] A. Gaszczak, T.P. Breckon, J. Han, Real-time people and vehicle detection from UAV imagery, Intelligent Robots and Computer Vision XXVIII: Algorithms and Techniques, 7878, 2011, p. 78780B.
[47] Z. Li, W. Shi, P. Lu, L. Yan, Q. Wang, Z. Miao, Landslide mapping from aerial photographs using change detection-based Markov random field, Remote Sens. Environ. 187 (2016) 76–90.
[48] T. Zhao, R. Nevatia, Car detection in low resolution aerial images, Image Vis. Comput. 21 (8) (2003) 693–703.
[49] S. Sun, C. Salvaggio, Aerial 3D building detection and modeling from airborne LiDAR point clouds, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 6 (3) (2013) 1440–1449.
[50] X. Cao, C. Wu, P. Yan, X. Li, Linear SVM classification using boosting HOG features for vehicle detection in low-altitude airborne videos, 2011 18th IEEE International Conference on Image Processing 2011, pp. 2421–2424.
[51] Y. Xu, G. Yu, Y. Wang, X. Wu, Y. Ma, A hybrid vehicle detection method based on viola-jones and HOG + SVM from UAV images, Sensors 16 (8) (2016) 1325.
[52] D. Sugimura, T. Fujimura, T. Hamamoto, Enhanced cascading classifier using multi-scale HOG for pedestrian detection from aerial images, Int. J. Pattern Recognit. Artif. Intell. 30 (03) (2016) 1655009.
[53] M. Teutsch, W. Krüger, J. Beyerer, Evaluation of object segmentation to improve moving vehicle detection in aerial videos, 2014 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) 2014, pp. 265–270.
[54] T. Moranduzzo, F. Melgani, Automatic car counting method for unmanned aerial vehicle images, IEEE Trans. Geosci. Remote Sens. 52 (3) (2014) 1635–1647.
[55] C. Huang, P. Chen, X. Yang, K.-T.T. Cheng, REDBEE: a visual-inertial drone system for real-time moving object detection, Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on 2017, pp. 1725–1731.
[56] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: towards real-time object detection with region proposal networks, Advances in Neural Information Processing Systems 2015, pp. 91–99.
[57] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, 2017.
[58] T.-Y. Lin, P. Dollár, R.B. Girshick, K. He, B. Hariharan, S.J. Belongie, Feature Pyramid networks for object detection, CVPR 1 (2) (2017) 4.
[59] J. Dai, Y. Li, K. He, J. Sun, R-fcn: object detection via region-based fully convolutional networks, Advances in Neural Information Processing Systems 2016, pp. 379–387.
[60] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016, pp. 779–788.
[61] W. Liu, et al., SSD: single shot multibox detector, European Conference on Computer Vision 2016, pp. 21–37.
[62] Z. Cai, N. Vasconcelos, Cascade r-cnn: delving into high quality object detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018, pp. 6154–6162.
[63] H. Law, J. Deng, Cornernet: detecting objects as paired keypoints, Proceedings of the European Conference on Computer Vision (ECCV) 2018, pp. 734–750.
[64] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, Q. Tian, Centernet: keypoint triplets for object detection, Proceedings of the IEEE International Conference on Computer Vision 2019, pp. 6569–6578.
[65] S. Zhang, L. Wen, X. Bian, Z. Lei, S.Z. Li, Single-shot refinement neural network for object detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018, pp. 4203–4212.
[66] Y. LeCun, Others, LeNet-5, convolutional neural networks, URL http://yann.lecun.com/exdb/lenet 2015, p. 20.
[67] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems 2012, pp. 1097–1105.
[68] W. Ouyang, et al., Exploit all the layers: fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2014, pp. 346–361.
[69] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016, pp. 770–778.
[70] F. Iandola, M. Moskewicz, S. Karayev, R. Girshick, T. Darrell, K. Keutzer, Densenet: implementing efficient convnet descriptor pyramids, arXiv Prepr. arXiv1404.1869 (2014) 1–11.
[71] C. Szegedy, S. Ioffe, V. Vanhoucke, A.A. Alemi, Inception-v4, inception-resnet and the impact of residual connections on learning, Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[72] O. Russakovsky, et al., Imagenet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (3) (2015) 211–252.


[73] C.L. Zitnick, P. Dollár, Edge boxes: locating object proposals from edges, European Conference on Computer Vision 2014, pp. 391–405.
[74] J.R.R. Uijlings, K.E.A. Van De Sande, T. Gevers, A.W.M. Smeulders, Selective search for object recognition, Int. J. Comput. Vis. 104 (2) (2013) 154–171.
[75] M.-R. Hsieh, Y.-L. Lin, W.H. Hsu, Drone-based object counting by spatially regularized regional proposal network, The IEEE International Conference on Computer Vision (ICCV), 1, 2017.
[76] H.-J. Hsu, K.-T. Chen, Face recognition on drones: issues and limitations, Proceedings of the First Workshop on Micro Aerial Vehicle Networks, Systems, and Applications for Civilian Use 2015, pp. 39–44.
[77] F. Yang, H. Fan, P. Chu, E. Blasch, H. Ling, Clustered object detection in aerial images, Proceedings of the IEEE International Conference on Computer Vision 2019, pp. 8311–8320.
[78] J. Redmon, A. Farhadi, Yolov3: an incremental improvement, arXiv Prepr. arXiv1804.02767 (2018) 1–6.
[79] X. Zhou, D. Wang, P. Krähenbühl, Objects as points, arXiv Prepr. arXiv1904.07850 (2019) 1–12.
[80] T. Kong, F. Sun, H. Liu, Y. Jiang, J. Shi, Foveabox: beyond anchor-based object detector, arXiv Prepr. arXiv1904.03797 29 (2019) 7389–7398.
[81] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, Y. LeCun, Overfeat: integrated recognition, localization and detection using convolutional networks, arXiv Prepr. arXiv1312.6229 (2013) 1–16.
[82] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2014, pp. 580–587.
[83] T. Qu, Q. Zhang, S. Sun, Vehicle detection from high-resolution aerial images using spatial pyramid pooling-based deep convolutional neural networks, Multimed. Tools Appl. 76 (20) (2017) 21651–21663.
[84] R. Girshick, Fast r-cnn, Proceedings of the IEEE International Conference on Computer Vision 2015, pp. 1440–1448.
[85] K. Chen, K. Fu, M. Yan, X. Gao, X. Sun, X. Wei, Semantic segmentation of aerial images with shuffling convolutional neural networks, IEEE Geosci. Remote Sens. Lett. 15 (2) (2018) 173–177.
[86] P.O. Pinheiro, T.-Y. Lin, R. Collobert, P. Dollár, Learning to refine object segments, European Conference on Computer Vision 2016, pp. 75–91.
[87] C.-J. Seo, Vehicle Detection using Images taken by Low-Altitude Unmanned Aerial Vehicles (UAVs), Indian J. Sci. Technol. 9 (2016).
[88] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, J. Sun, Light-head r-cnn: in defense of two-stage object detector, arXiv Prepr. arXiv1711.07264 (2017) 1–8.
[89] W. Ouyang, et al., Deepid-net: deformable deep convolutional neural networks for object detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2015, pp. 2403–2412.
[90] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018, pp. 7132–7141.
[91] A. Bochkovskiy, C.-Y. Wang, H.-Y.M. Liao, YOLOv4: optimal speed and accuracy of object detection, arXiv Prepr. arXiv2004.10934 (2020) 1–17.
[92] C.-Y. Wang, H.-Y.M. Liao, I.-H. Yeh, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, CSPNet: a new backbone that can enhance learning capability of CNN, arXiv Prepr. arXiv1911.11929 (2019) 390–391.
[93] S. Yun, D. Han, S.J. Oh, S. Chun, J. Choe, Y. Yoo, Cutmix: regularization strategy to train strong classifiers with localizable features, Proceedings of the IEEE International Conference on Computer Vision 2019, pp. 6023–6032.
[94] G. Ghiasi, T.-Y. Lin, Q.V. Le, Dropblock: a regularization method for convolutional networks, Advances in Neural Information Processing Systems 2018, pp. 10727–10737.
[95] R. Müller, S. Kornblith, G.E. Hinton, When does label smoothing help? Advances in Neural Information Processing Systems 2019, pp. 4696–4705.
[96] J. Zhang, J. Huang, X. Chen, D. Zhang, How to fully exploit the abilities of aerial image detectors, Proceedings of the IEEE International Conference on Computer Vision Workshops 2019, p. 0.
[97] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, A.C. Berg, DSSD: deconvolutional single shot detector, arXiv Prepr. arXiv1701.06659 (2017) 1–11.
[98] H. Law, Y. Teng, O. Russakovsky, J. Deng, Cornernet-lite: efficient keypoint based object detection, arXiv Prepr. arXiv1904.08900 (2019) 1–15.
[99] D.R. Pailla, VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results, 2019.
[100] J.O. Du Terrail, F. Jurie, On the use of deep neural networks for the detection of small vehicles in ortho-images, Image Processing (ICIP), 2017 IEEE International Conference on 2017, pp. 4212–4216.
[101] T. Tang, S. Zhou, Z. Deng, H. Zou, L. Lei, Vehicle detection in aerial images based on region convolutional neural networks and hard negative example mining, Sensors 17 (2) (2017) 336.
[102] Q. Li, L. Mou, Q. Xu, Y. Zhang, X.X. Zhu, R^3-Net: a deep network for multi-oriented vehicle detection in aerial images and videos, arXiv Prepr. arXiv1808.05560 (2018) 1–14.
[103] J. Kaster, J. Patrick, H.S. Clouse, Convolutional neural networks on small unmanned aerial systems, Aerospace and Electronics Conference (NAECON), 2017 IEEE National 2017, pp. 149–154.
[104] L. Suhao, L. Jinzhao, L. Guoquan, B. Tong, W. Huiqian, P. Yu, Vehicle type detection based on deep learning in traffic scene, Procedia Comput. Sci. 131 (2018) 564–572.
[105] C. Li, B. Zhang, H. Hu, J. Dai, Enhanced bird detection from low-resolution aerial image using deep neural networks, Neural. Process. Lett. (2018) 1–19.
[106] J. Wang, W. Guo, T. Pan, H. Yu, L. Duan, W. Yang, Bottle detection in the wild using low-altitude unmanned aerial vehicles, 2018 21st International Conference on Information Fusion (FUSION) 2018, pp. 439–444.
[107] J.-S. Zhang, J. Cao, B. Mao, Application of deep learning and unmanned aerial vehicle technology in traffic flow monitoring, Machine Learning and Cybernetics (ICMLC), 2017 International Conference on, 1, 2017, pp. 189–194.
[108] E. Bondi, et al., AirSim-W: a simulation environment for wildlife conservation with UAVs, Proceedings of the 1st ACM SIGCAS Conference on Computing and Sustainable Societies 2018, p. 40.
[109] H.D. Nguyen, I.S. Na, S.H. Kim, G.S. Lee, H.J. Yang, J.H. Choi, Multiple human tracking in drone image, Multimed. Tools Appl. (2018) 1–15.
[110] J. Chen, X. Miao, H. Jiang, J. Chen, X. Liu, Identification of autonomous landing sign for unmanned aerial vehicle based on faster regions with convolutional neural network, Chinese Automation Congress (CAC), 2017, pp. 2109–2114.
[111] T. Tang, Z. Deng, S. Zhou, L. Lei, H. Zou, Fast vehicle detection in UAV images, Remote Sensing with Intelligent Processing (RSIP), 2017 International Workshop on 2017, pp. 1–5.
[112] A. Robicquet, A. Sadeghian, A. Alahi, S. Savarese, Learning social etiquette: human trajectory understanding in crowded scenes, European Conference on Computer Vision 2016, pp. 549–565.
[113] C. Kyrkou, G. Plastiras, T. Theocharides, S.I. Venieris, C.-S. Bouganis, DroNet: efficient convolutional neural network detector for real-time UAV applications, Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018, pp. 967–972.
[114] J. Park, D.H. Kim, Y.S. Shin, S. Lee, A comparison of convolutional object detectors for real-time drone tracking using a PTZ camera, Control, Automation and Systems (ICCAS), 2017 17th International Conference on 2017, pp. 696–699.
[115] C. Iuga, P. Drăgan, L. Bușoniu, Fall monitoring and detection for at-risk persons using a UAV, IFAC-PapersOnLine 51 (10) (2018) 199–204.
[116] C. Aker, S. Kalkan, Using deep networks for drone detection, arXiv Prepr. arXiv1706.05726 (2017) 1–16.
[117] A. Soleimani, N.M. Nasrabadi, Convolutional neural networks for aerial multi-label pedestrian detection, 2018 21st International Conference on Information Fusion (FUSION) 2018, pp. 1005–1010.
[118] H.-Y. Wang, Y.-C. Chang, Y.-Y. Hsieh, H.-T. Chen, J.-H. Chuang, Deep learning-based human activity analysis for aerial images, Intelligent Signal Processing and Communication Systems (ISPACS), 2017 International Symposium on 2017, pp. 713–718.
[119] R. Jin, D. Lin, Adaptive Anchor for Fast Object Detection in Aerial Image, IEEE Geosci. Remote Sens. Lett., 2019.
[120] X. Zhang, E. Izquierdo, K. Chandramouli, Dense and small object detection in UAV vision based on cascade network, Proceedings of the IEEE International Conference on Computer Vision Workshops 2019, p. 0.
[121] Z. Tang, X. Liu, G. Shen, B. Yang, PENet: object detection using points estimation in aerial images, arXiv Prepr. arXiv2001.08247 (2020) 1–7.
[122] M. Barekatain, et al., Okutama-Action: an aerial view video dataset for concurrent human action detection, 1st Joint BMTT-PETS Workshop on Tracking and Surveillance, CVPR 2017, pp. 1–8.
[123] S. Razakarivony, F. Jurie, Vehicle detection in aerial imagery: a small target detection benchmark, J. Vis. Commun. Image Represent. 34 (2016) 187–203.
[124] M. Kristan, et al., The visual object tracking vot2015 challenge results, Proceedings of the IEEE International Conference on Computer Vision Workshops 2015, pp. 1–23.
[125] S. Oh, et al., A large-scale benchmark dataset for event recognition in surveillance video, Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on 2011, pp. 3153–3160.
[126] O. Gusikhin, D. Filev, N. Rychtyckyj, Intelligent vehicle systems: applications and new trends, Informatics in Control Automation and Robotics, Springer 2008, pp. 3–14.


