Recent Advances in Deep Learning For Object Detection

Neurocomputing 396 (2020) 39–64

Contents lists available at ScienceDirect

Neurocomputing
journal homepage: www.elsevier.com/locate/neucom

Recent advances in deep learning for object detection


Xiongwei Wu a,∗, Doyen Sahoo b, Steven C.H. Hoi a,b
a School of Information System, Singapore Management University, Singapore
b Salesforce Research Asia

Article info

Article history:
Received 11 August 2019
Revised 9 January 2020
Accepted 21 January 2020
Available online 25 January 2020
Communicated by Dr Zenglin Xu

Keywords:
Object detection
Deep learning
Deep convolutional neural networks

Abstract

Object detection is a fundamental visual recognition problem in computer vision and has been widely studied in the past decades. Visual object detection aims to find objects of certain target classes with precise localization in a given image and assign each object instance a corresponding class label. Due to the tremendous successes of deep learning based image classification, object detection techniques using deep learning have been actively studied in recent years. In this paper, we give a comprehensive survey of recent advances in visual object detection with deep learning. By reviewing a large body of recent related work in literature, we systematically analyze the existing object detection frameworks and organize the survey into three major parts: (i) detection components, (ii) learning strategies, and (iii) applications & benchmarks. In the survey, we cover a variety of factors affecting the detection performance in detail, such as detector architectures, feature learning, proposal generation, sampling strategies, etc. Finally, we discuss several future directions to facilitate and spur future research for visual object detection with deep learning.
© 2020 Elsevier B.V. All rights reserved.

∗ Corresponding author.
E-mail addresses: xwwu.2015@phdis.smu.edu.sg (X. Wu), dsahoo@salesforce.com (D. Sahoo), chhoi@smu.edu.sg, shoi@salesforce.com (S.C.H. Hoi).
https://doi.org/10.1016/j.neucom.2020.01.085
0925-2312/© 2020 Elsevier B.V. All rights reserved.

1. Introduction

In the field of computer vision, there are several fundamental visual recognition problems: image classification [1], object detection and instance segmentation [2,3], and semantic segmentation [4] (see Fig. 1). In particular, image classification (Fig. 1(a)) aims to recognize semantic categories of objects in a given image. Object detection not only recognizes object categories, but also predicts the location of each object by a bounding box (Fig. 1(b)). Semantic segmentation (Fig. 1(c)) aims to learn pixel-wise classifiers that assign a specific category label to each pixel, thus providing an even richer understanding of an image. However, in contrast to object detection, semantic segmentation does not distinguish between multiple objects of the same category. A relatively new setting at the intersection of object detection and semantic segmentation, named “instance segmentation” (Fig. 1(d)), is proposed to identify different objects and assign each of them a separate categorical pixel-level mask. In fact, instance segmentation can be viewed as a special setting of object detection, where instead of localizing an object by a bounding box, pixel-level localization is desired. In this survey, we direct our attention to reviewing the major efforts in deep learning based object detection. A good detection algorithm should have a strong understanding of semantic cues as well as the spatial information about the image. In fact, object detection is the basic step towards many computer vision applications, such as face recognition [5–7], pedestrian detection [8–10], video analysis [11,12], and logo detection [13–15].

Fig. 1. Comparison of different visual recognition tasks in computer vision. (a) “Image classification” only needs to assign categorical class labels to the image; (b) “Object detection” not only predicts categorical labels but also localizes each object instance via bounding boxes; (c) “Semantic segmentation” aims to predict categorical labels for each pixel, without differentiating object instances; (d) “Instance segmentation”, a special setting of object detection, differentiates different object instances by pixel-level segmentation masks.

In the early stages, before the deep learning era, the pipeline of object detection was divided into three steps: (i) proposal generation; (ii) feature vector extraction; and (iii) region classification. During proposal generation, the objective was to search locations in the image which may contain objects. These locations are also called regions of interest (RoI). An intuitive idea is to scan the whole image with sliding windows [16–20]. In order to capture objects at multiple scales and with different aspect ratios, input images were resized into different scales and multi-scale windows were used to slide through these images. During the second step, on each location of the image, a fixed-length feature vector was obtained from the sliding window, to capture discriminative semantic information of the region covered. This feature vector was commonly encoded by low-level visual descriptors such as SIFT (Scale Invariant Feature Transform) [21], Haar [22], HOG (Histogram of Oriented Gradients) [19] or SURF (Speeded Up Robust Features) [23], which showed a certain robustness to scale, illumination and rotation variance. Finally, in the third step, the region classifiers were learned to assign categorical labels to the covered regions. Commonly, support vector machines (SVM) [24] were used here due to their good performance on small scale training data. In addition, some classification techniques such as bagging [25], cascade learning [20] and AdaBoost [26] were used in the region classification step, leading to further improvements in detection accuracy.

Most of the successful traditional methods for object detection focused on carefully designing feature descriptors to obtain an embedding for a region of interest. With the help of good feature representations as well as robust region classifiers, impressive results [27,28] were achieved on the Pascal VOC dataset [29] (a publicly available dataset used for benchmarking object detection). Notably, deformable part-based models (DPMs) [30], a breakthrough detection algorithm, were 3-time winners on VOC challenges in 2007, 2008 and 2009. DPMs learn and integrate multiple part models with a deformable loss and mine hard negative examples with a latent SVM for discriminative training. However, from 2008 to 2012, the progress on Pascal VOC based on these traditional methods had become incremental, with minor gains from building complicated ensemble systems. This showed the limitations of these traditional detectors. Most prominently, these limitations included: (i) during proposal generation, a huge number of proposals were generated, and many of them were redundant; this resulted in a large number of false positives during classification. Moreover, window scales were designed manually and heuristically, and could not match the objects well; (ii) feature descriptors were hand-crafted based on low level visual cues [23,31,32], which made it difficult to capture representative semantic information in complex contexts; (iii) each step of the detection pipeline was designed and optimized separately, and thus could not obtain a globally optimal solution for the whole system.

After the success of applying deep convolutional neural networks (DCNN) for image classification [1,33], object detection also achieved remarkable progress based on deep learning techniques [2,34]. The new deep learning based algorithms outperformed the traditional detection algorithms by huge margins. A deep convolutional neural network is a biologically-inspired structure for computing hierarchical features. An early attempt to build such a hierarchical and spatial-invariant model for image classification was the “neocognitron” [35] proposed by Fukushima. However, this early attempt lacked effective optimization techniques for supervised learning. Based on this model, LeCun et al. [36] optimized a convolutional neural network by stochastic gradient descent (SGD) via back-propagation and showed competitive performance on digit recognition. After that, however, deep convolutional neural networks were not heavily explored, with support vector machines becoming more prominent. This was because deep learning had some limitations: (i) lack of large scale annotated training data, which caused overfitting; (ii) limited computation resources; and (iii) weak theoretical support compared to SVMs. In 2009, Deng et al. [37] collected a large scale annotated image dataset, ImageNet, which contained 1.2M high resolution images, making it possible to train deep models with large scale training data. With the development of computing resources on parallel computing systems (such as GPU clusters), in 2012 Krizhevsky et al. [33] trained a large deep convolutional model with the ImageNet dataset and showed significant improvement on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) compared to all other approaches. After the success of applying DCNN for classification, deep learning techniques were quickly adapted to other vision tasks and showed promising results compared to the traditional methods.

In contrast to hand-crafted descriptors used in traditional detectors, deep convolutional neural networks generate hierarchical feature representations from raw pixels to high level semantic information, which are learned automatically from the training data and show more discriminative expression capability in complex contexts. Furthermore, benefiting from the powerful learning capacity, a deep convolutional neural network can obtain a better feature representation with a larger dataset, while the learning capacity of traditional visual descriptors is fixed, and cannot improve when more data becomes available. These properties made it possible to design object detection algorithms based on deep convolutional neural networks which could be optimized in an end-to-end manner, with more powerful feature representation capability.

Currently, deep learning based object detection frameworks can be primarily divided into two families: (i) two-stage detectors, such as Region-based CNN (R-CNN) [2] and its variants [34,38,39] and (ii) one-stage detectors, such as YOLO [40] and its variants [41,42]. Two-stage detectors first use a proposal generator to generate a sparse set of proposals and extract features from each proposal, followed by region classifiers which predict the category of the proposed region. One-stage detectors directly make categorical predictions of objects on each location of the feature maps without the cascaded region classification step. Two-stage detectors commonly achieve better detection performance and report state-of-the-art results on public benchmarks, while one-stage detectors are significantly more time-efficient and have greater applicability to real-time object detection. Fig. 2 also illustrates the major developments and milestones of deep learning based object detection techniques after 2012. We will cover basic ideas of these key techniques and analyze them in a systematic manner in the survey.

Fig. 2. Major milestones in object detection research based on deep convolutional neural networks since 2012. The trend in the last year has been designing object detectors based on anchor-free (in red) and AutoML (in green) techniques, which are potentially two important research directions in the future. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

The goal of this survey is to present a comprehensive understanding of deep learning based object detection algorithms. Fig. 3 shows a taxonomy of key methodologies to be covered in this survey. We review various contributions in deep learning based object detection and categorize them into three groups: detection components, learning strategies, and applications & benchmarks. For detection components, we first introduce two detection settings: bounding box level (bbox-level) and pixel mask level (mask-level) localization. Bbox-level algorithms are required to localize objects by rectangle bounding boxes, while more precise pixel-wise masks are required to segment objects in mask-level algorithms. Next, we summarize the representative frameworks of two detection families: two-stage detection and one-stage detection. Then we give a detailed survey of each detection component, including backbone architecture, proposal generation and feature learning. For learning strategies, we first highlight the importance of the learning strategy for detection due to the difficulty of training detectors, and then introduce the optimization techniques for both training and testing stages in detail. Finally, we review some real-world object detection based applications including face detection, pedestrian detection, logo detection and video analysis. We also discuss publicly available and commonly used benchmarks and evaluation metrics for these detection tasks. Finally we show the state-of-the-art results of generic detection on public benchmarks over the recent years.

Fig. 3. Taxonomy of key methodologies in this survey. We categorize various contributions for deep learning based object detection into three major categories: Detection Components, Learning Strategies, Applications and Benchmarks. We review each of these categories in detail.

We hope our survey can provide a timely review for researchers and practitioners to further catalyze research on detection systems. The rest of the paper is organized as follows: in Section 2, we give a standard problem setting of object detection. The details of detector components are listed in Section 3. Then the learning strategies are presented in Section 4. Detection algorithms for real-world applications and benchmarks are provided in Sections 5 and 6. State-of-the-art results of generic detection, face detection and pedestrian detection are listed in Section 7. Finally, we conclude and discuss future directions in Section 9. The code is available at https://github.com/XiongweiWu/Awesome-Object-Detection.

2. Problem settings

In this section, we present the formal problem setting for object detection based on deep learning. Object detection involves both recognition (e.g., “object classification”) and localization (e.g., “location regression”) tasks. An object detector needs to distinguish objects of certain target classes from backgrounds in the image with precise localization and correct categorical label prediction for each object instance. Bounding boxes or pixel masks are predicted to localize these target object instances.

More formally, assume we are given a collection of $N$ annotated images $\{x_1, x_2, \ldots, x_N\}$, and for the $i$th image $x_i$, there are $M_i$ objects belonging to $C$ categories with annotations:

$$y_i = \left\{ (c_1^i, b_1^i), (c_2^i, b_2^i), \ldots, (c_{M_i}^i, b_{M_i}^i) \right\} \qquad (1)$$
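As a concrete reading of Eq. (1), the following minimal Python sketch stores each annotation as a (class label, box) pair and adds the box overlap measure and positive/negative test that this section defines in Eqs. (4) and (5). The names `ObjectAnnotation`, `iou` and `is_positive`, the corner-coordinate box convention, and the restriction to bounding boxes (no masks) are assumptions made for illustration, not part of the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box convention for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


@dataclass
class ObjectAnnotation:
    """One (c, b) pair as in Eq. (1): a class label and a bounding box."""
    label: int
    box: Box


def box_area(box: Box) -> float:
    """Area of a box; boxes with non-positive width or height count as empty."""
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes, as in Eq. (4)."""
    intersection_box = (
        max(box_a[0], box_b[0]),
        max(box_a[1], box_b[1]),
        min(box_a[2], box_b[2]),
        min(box_a[3], box_b[3]),
    )
    intersection = box_area(intersection_box)
    union = box_area(box_a) + box_area(box_b) - intersection
    return intersection / union if union > 0 else 0.0


def is_positive(pred: ObjectAnnotation, gt: ObjectAnnotation,
                threshold: float = 0.5) -> bool:
    """Positive if the labels agree and the IoU reaches the threshold.

    Uses >=, following the "IoU >= threshold" wording in the text.
    """
    return pred.label == gt.label and iou(pred.box, gt.box) >= threshold


# Hypothetical annotations y_i for one image with two objects.
y_i: List[ObjectAnnotation] = [
    ObjectAnnotation(label=1, box=(0.0, 0.0, 10.0, 10.0)),
    ObjectAnnotation(label=3, box=(20.0, 5.0, 30.0, 25.0)),
]
```

A prediction `ObjectAnnotation(1, (0.0, 0.0, 10.0, 8.0))` overlaps the first annotation with an IoU of 0.8, so `is_positive` accepts it at the commonly used threshold of 0.5.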

Here, $c_j^i$ ($c_j^i \in C$) and $b_j^i$ (the bounding box or pixel mask of the object) denote the categorical and spatial labels of the $j$th object in $x_i$, respectively. The detector is $f$, parameterized by $\theta$. For $x_i$, the prediction $y^i_{\mathrm{pred}}$ shares the same format as $y_i$:

$$y^i_{\mathrm{pred}} = \left\{ (c^i_{\mathrm{pred}_1}, b^i_{\mathrm{pred}_1}), (c^i_{\mathrm{pred}_2}, b^i_{\mathrm{pred}_2}), \ldots \right\} \qquad (2)$$

Finally, a loss function $\ell$ is set to optimize the detector as:

$$\ell(x, \theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(y^i_{\mathrm{pred}}, x_i, y_i; \theta) + \frac{\lambda}{2} \lVert \theta \rVert_2^2 \qquad (3)$$

where the second term is a regularizer, with trade-off parameter $\lambda$. Different loss functions such as softmax loss [38] and focal loss [43] impact the final detection performance, and we will discuss these functions in Section 4.

At the time of evaluation, a metric called intersection-over-union (IoU) between objects and predictions is used to evaluate the quality of localization (we omit index $i$ here):

$$\mathrm{IoU}(b_{\mathrm{pred}}, b_{\mathrm{gt}}) = \frac{\mathrm{Area}(b_{\mathrm{pred}} \cap b_{\mathrm{gt}})}{\mathrm{Area}(b_{\mathrm{pred}} \cup b_{\mathrm{gt}})} \qquad (4)$$

Here, $b_{\mathrm{gt}}$ refers to the ground truth bbox or mask. An IoU threshold $\Omega$ is set to determine whether a prediction tightly covers the object or not (i.e. $\mathrm{IoU} \geq \Omega$; commonly researchers set $\Omega = 0.5$). For object detection, a prediction with the correct categorical label as well as a successful localization prediction (meeting the IoU criterion) is considered positive; otherwise it is a negative prediction:

$$\mathrm{Prediction} = \begin{cases} \text{Positive} & c_{\mathrm{pred}} = c_{\mathrm{gt}} \text{ and } \mathrm{IoU}(b_{\mathrm{pred}}, b_{\mathrm{gt}}) > \Omega \\ \text{Negative} & \text{otherwise} \end{cases} \qquad (5)$$

For generic object detection problem evaluation, mean average precision (mAP) over $C$ classes is used, and in real world scenarios such as pedestrian detection, different evaluation metrics are used. The details of the evaluation metrics for different detection tasks will be discussed in Section 6. In addition to detection accuracy, inference speed is also an important metric to evaluate object detection algorithms. Specifically, if we wish to detect objects in a video stream (real-time detection), it is imperative to have a detector that can process this information quickly. Thus, detector efficiency is also evaluated in frames per second (FPS), i.e., how many images it can process per second. Commonly, a detector that can achieve an inference speed of 20 FPS is considered to be a real-time detector.

3. Detection components

In this section, we introduce different components of object detection. The first is about the choice of object detection paradigm. We first introduce the concepts of two detection settings: bbox-level and mask-level algorithms. Then, we introduce two major object detection paradigms: two-stage detectors and one-stage detectors. Under these paradigms, detectors can use a variety of deep learning backbone architectures, proposal generators, and feature representation modules.

3.1. Detection settings

There are two settings in object detection: (i) vanilla object detection (bbox-level localization) and (ii) instance segmentation (pixel-level or mask-level localization). Vanilla object detection has been more extensively studied and is considered the traditional detection setting, where the goal is to localize objects by rectangle bounding boxes. In vanilla object detection algorithms, only bbox annotations are required, and in evaluation, the IoU between the predicted bounding box and the ground truth is calculated to measure the performance. Instance segmentation is a relatively new setting and is based on the traditional detection setting. Instance segmentation requires segmenting each object by a pixel-wise mask instead of a rough rectangle bounding box. Due to more precise pixel-level prediction, instance segmentation is more sensitive to spatial misalignment, and thus has higher requirements for processing the spatial information. The evaluation metric of instance segmentation is almost identical to that of bbox-level detection, except that the IoU computation is performed on mask predictions. Though the two detection settings are slightly different, the main components introduced later can mostly be shared by the two settings.

3.2. Detection paradigms

Current state-of-the-art object detectors with deep learning can be mainly divided into two major categories: two-stage detectors and one-stage detectors. For a two-stage detector, in the first stage, a sparse set of proposals is generated; and in the second stage, the feature vectors of generated proposals are encoded by deep convolutional neural networks followed by making the object class predictions. A one-stage detector does not have a separate stage for proposal generation (or learning a proposal generation). They typically consider all positions on the image as potential objects, and try to classify each region of interest as either background or a target object. Two-stage detectors often reported state-of-the-art results on many public benchmark datasets. However, they generally fall short in terms of inference speed. One-stage detectors are much faster and more desirable for real-time object detection applications, but have relatively poor performance compared to the two-stage detectors.

3.2.1. Two-stage detectors

Two-stage detectors split the detection task into two stages: (i) proposal generation; and (ii) making predictions for these proposals. During the proposal generation phase, the detector will try to identify regions in the image which may potentially be objects. The idea is to propose regions with a high recall, such that all objects in the image belong to at least one of these proposed regions. In the second stage, a deep-learning based model is used to classify these proposals with the right categorical labels. The region may either be a background, or an object from one of the predefined class labels. Additionally, the model may refine the original localization suggested by the proposal generator. Next, we review some of the most influential efforts among two-stage detectors.

R-CNN [2] is a pioneering two-stage object detector proposed by Girshick et al. in 2014. Compared to the previous state-of-the-art methods based on a traditional detection framework SegDPM [44] with 40.4% mAP on Pascal VOC2010, R-CNN significantly improved the detection performance and obtained 53.7% mAP. The pipeline of R-CNN can be divided into three components: (i) proposal generation, (ii) feature extraction and (iii) region classification. For each image, R-CNN generates a sparse set of proposals (around 2000 proposals) via Selective Search [45], which is designed to reject regions that can easily be identified as background regions. Then, each proposal is cropped and resized into a fixed-size region and is encoded into a (e.g. 4096 dimensional) feature vector by a deep convolutional neural network, followed by a one-vs-all SVM classifier. Finally the bounding box regressors are learned using the extracted features as input in order to make the original proposals tightly bound the objects. Compared to traditional hand-crafted feature descriptors, deep neural networks generate hierarchical features and capture different scale information in different layers, and finally produce robust and discriminative features for classification. To utilize the power of transfer learning, R-CNN adopts weights of convolutional networks pre-trained on ImageNet. The last fully connected layer (FC layer) is re-initialized for the detection task. The whole detector is then finetuned on the

pre-trained model. This transfer of knowledge from the ImageNet dataset offers significant performance gains. In addition, R-CNN rejects a huge number of easy negatives before training, which helps improve learning speed and reduce false positives.

However, R-CNN faces some critical shortcomings: (i) the features of each proposal were extracted by deep convolutional networks separately (i.e., computation was not shared), which led to heavily duplicated computations. Thus, R-CNN was extremely time-consuming for training and testing; (ii) the three steps of R-CNN (proposal generation, feature extraction and region classification) were independent components and the whole detection framework could not be optimized in an end-to-end manner, making it difficult to obtain a globally optimal solution; and (iii) Selective Search relied on low-level visual cues and thus struggled to generate high quality proposals in complex contexts. Moreover, it is unable to enjoy the benefits of GPU acceleration.

Inspired by the idea of spatial pyramid matching (SPM) [46], He et al. proposed SPP-net [47] to accelerate R-CNN as well as learn more discriminative features. Instead of cropping proposal regions and feeding them into the CNN model separately, SPP-net computes the feature map from the whole image using a deep convolutional network and extracts fixed-length feature vectors on the feature map by a Spatial Pyramid Pooling (SPP) layer. SPP partitions the feature map into an N × N grid, for multiple values of N (thus capturing information at different scales), and performs pooling on each cell of the grid, to give a feature vector. The feature vectors obtained from each N × N grid are concatenated to give the representation for the region. The extracted features are fed into region SVM classifiers and bounding box regressors. In contrast to R-CNN, the SPP layer can also work on images/regions at various scales and aspect ratios without resizing them. Thus, it does not suffer from information loss and unwanted geometric distortion.

SPP-net achieved better results and had a significantly faster inference speed compared to R-CNN. However, the training of SPP-net was still multi-stage and thus it could not be optimized end-to-end (and required extra cache memory to store extracted features). In addition, the SPP layer did not back-propagate gradients to convolutional kernels and thus all the parameters before the SPP layer were frozen. This significantly limited the learning capability of deep backbone architectures. Girshick et al. proposed Fast R-CNN [38], a multi-task learning detector which addressed these two limitations of SPP-net. Fast R-CNN (like SPP-net) also computed a feature map for the whole image and extracted fixed-length region features on the feature map. Different from SPP-net, Fast R-CNN used an ROI Pooling layer to extract region features. The ROI pooling layer is a special case of SPP which only takes a single scale (i.e., only one value of N for the N × N grid) to partition the proposal into a fixed number of divisions, and also backpropagated error signals to the convolution kernels. After feature extraction, feature vectors were fed into a sequence of fully connected layers before two sibling output layers: a classification layer (cls) and a regression layer (reg). The classification layer was responsible for generating softmax probabilities over C+1 classes (C classes plus one background class), while the regression layer encoded 4 real-valued parameters to refine bounding boxes. In Fast R-CNN, the feature extraction, region classification and bounding box regression steps can all be optimized end-to-end, without extra cache space to store features (unlike SPP-net). Fast R-CNN achieved a much better detection accuracy than R-CNN and SPP-net, and had a better training and inference speed.

Despite the progress in learning detectors, the proposal generation step still relied on traditional methods such as Selective Search [45] or Edge Boxes [48], which were based on low-level visual cues and could not be learned in a data-driven manner. To address this issue, Faster R-CNN [34] was developed, which relied on a novel proposal generator: the Region Proposal Network (RPN). This proposal generator could be learned via supervised learning methods. RPN is a fully convolutional network which takes an image of arbitrary size and generates a set of object proposals on each position of the feature map. The network slid over the feature map using an n × n sliding window, and generated a feature vector for each position. The feature vector was then fed into two sibling output branches: an object classification layer (which classified whether the proposal was an object or not) and a bounding box regression layer. These results were then fed into the final layer for the actual object classification and bounding box localization. RPN could be inserted into Fast R-CNN and thus the whole framework could be optimized in an end-to-end manner on training data. This way RPN enabled proposal generation in a data driven manner, and was also able to enjoy the discriminative power of deep backbone networks. Faster R-CNN was able to make predictions at 5 FPS on GPU and achieved state-of-the-art results on many public benchmark datasets, such as Pascal VOC 2007, 2012 and MSCOCO. Currently, there are a huge number of detector variants based on Faster R-CNN for different usages [39,49–51].

Faster R-CNN computed the feature map of the input image and extracted region features on the feature map, which shared feature extraction computation across different regions. However, the computation was not shared in the region classification step, where each feature vector still needed to go through a sequence of FC layers separately. Such extra computation could be extremely large as each image may have hundreds of proposals. Simply removing the fully connected layers would result in a drastic decline of detection performance, as the deep network would have reduced the spatial information of proposals. Dai et al. [52] proposed Region-based Fully Convolutional Networks (R-FCN), which shared the computation cost in the region classification step. R-FCN generated a Position Sensitive Score Map which encoded relative position information of different classes, and used a Position Sensitive ROI Pooling layer (PSROI Pooling) to extract spatial-aware region features by encoding each relative position of the target regions. The extracted feature vectors maintained spatial information and thus the detector achieved competitive results compared to Faster R-CNN without region-wise fully connected layer operations.

Another issue with Faster R-CNN was that it used a single deep layer feature map to make the final prediction. This made it difficult to detect objects at different scales. In particular, it was difficult to detect small objects. In DCNN feature representations, deep layer features are semantically-strong but spatially-weak, while shallow layer features are semantically-weak but spatially-strong. Lin et al. [39] exploited this property and proposed Feature Pyramid Networks (FPN), which combined deep layer features with shallow layer features to enable object detection in feature maps at different scales. The main idea was to strengthen the spatially strong shallow layer features with rich semantic information from the deeper layers. FPN achieved significant progress in detecting multi-scale objects and has been widely used in many other domains such as video detection [53,54] and human pose recognition [55,56].

Most instance segmentation algorithms are extended from vanilla object detection algorithms. Early methods [57–59] commonly generated segment proposals, followed by Fast R-CNN for segment classification. Later, Dai et al. [59] proposed a multi-stage algorithm named “MNC” which divided the whole detection framework into multiple stages and predicted segmentation masks from the learned bounding box proposals, which were later categorized by region classifiers. These early works performed bbox and mask prediction in multiple stages. To make the whole process more flexible, He et al. [3] proposed Mask R-CNN, which predicted bounding boxes and segmentation masks in parallel based on the proposals and reported state-of-the-art results. Based on Mask R-CNN, Huang et al. [60] proposed a mask-quality aware framework,
44 X. Wu, D. Sahoo and S.C.H. Hoi / Neurocomputing 396 (2020) 39–64

Fig. 4. Overview of different two-stage detection frameworks for generic object detection. Red dotted rectangles denote the outputs that define the loss functions. (For
interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

named Mask Scoring R-CNN, which learned the quality of the predicted masks and calibrated the misalignment between mask quality and mask confidence score.

Fig. 4 gives an overview of the detection frameworks for several representative two-stage detectors.

3.2.2. One-stage detectors
Unlike two-stage detection algorithms, which divide the detection pipeline into two parts (proposal generation and region classification), one-stage detectors do not have a separate stage for proposal generation (or for learning a proposal generator). They typically consider all positions on the image as potential objects, and try to classify each region of interest as either background or a target object.

One of the early successful one-stage detectors based on deep learning was OverFeat, developed by Sermanet et al. [61]. OverFeat performed object detection by casting a DCNN classifier into a fully convolutional object detector. Object detection can be viewed as a “multi-region classification” problem, and thus OverFeat extended the original classifier into a detector by viewing the last FC layers as 1 × 1 convolutional layers to allow arbitrary input sizes. The classification network output a grid of predictions on each region of the input to indicate the presence of an object. After identifying the objects, bounding box regressors were learned to refine the predicted regions based on the same DCNN features as the classifier. In order to detect multi-scale objects, the input image was resized into multiple scales which were fed into the network. Finally, the predictions across all the scales were merged together. OverFeat showed a significant speed advantage compared with R-CNN by sharing the computation of overlapping regions using convolutional layers, and only a single forward pass through the network was required. However, the classifiers and regressors were trained separately without being jointly optimized.

Later, Redmon et al. [40] developed a real-time detector called YOLO (You Only Look Once). YOLO considered object detection as a regression problem and spatially divided the whole image into a fixed number of grid cells (e.g. using a 7 × 7 grid). Each cell was considered as a proposal to detect the presence of one or more objects. In the original implementation, each cell was considered to contain the center of (up to) two objects. For each cell, a prediction was made which comprised the following information: whether that location had an object, the bounding box coordinates and size (width and height), and the class of the object. The whole framework was a single network that omitted the proposal generation step, so it could be optimized in an end-to-end manner. Based on a carefully designed lightweight architecture, YOLO could make predictions at 45 FPS, and reach 155 FPS with a more simplified backbone. However, YOLO faced some challenges: (i) it could detect at most two objects at a given location, which made it difficult to detect small objects and crowded objects [40]; (ii) only the last feature map was used for prediction, which was not suitable for predicting objects at multiple scales and aspect ratios.

In 2016, Liu et al. proposed another one-stage detector, Single-Shot MultiBox Detector (SSD) [42], which addressed the limitations of YOLO. SSD also divided images into grid cells, but in each grid cell, a set of anchors with multiple scales and aspect ratios was generated to discretize the output space of bounding boxes (unlike predicting from fixed grid cells as in YOLO). Each anchor was refined by 4-value offsets learned by the regressors and was assigned (C+1) categorical probabilities by the classifiers. In addition, SSD predicted objects on multiple feature maps, and each of these feature maps was responsible for detecting a certain scale of objects according to its receptive field. In order to detect large objects and increase receptive fields, several extra convolutional feature maps were added to the original backbone architecture. The whole network was optimized with a weighted sum of localization loss and classification loss over all prediction maps via an end-to-end training scheme. The final prediction was made by merging all detection results from different feature maps. In order to avoid a huge number of negative proposals dominating the training gradients, hard negative mining was used to train the detector. Intensive data augmentation was also applied to improve detection accuracy. SSD achieved detection accuracy comparable to Faster R-CNN while being able to do real-time inference.

Without proposal generation to filter easy negative samples, the class imbalance between foreground and background is a severe

Fig. 5. Overview of different one-stage detection frameworks for generic object detection. Red rectangles denote the outputs that define the objective functions.

problem in one-stage detectors. Lin et al. [43] proposed a one-stage detector, RetinaNet, which addressed the class imbalance problem in a more flexible manner. RetinaNet used focal loss, which suppressed the gradients of easy negative samples instead of simply discarding them. Further, they used feature pyramid networks to detect multi-scale objects at different levels of feature maps. Their proposed focal loss outperformed the naive hard negative mining strategy by large margins.

Redmon et al. proposed an improved YOLO version, YOLOv2 [41], which significantly improved detection performance but still maintained real-time inference speed. YOLOv2 adopted a more powerful deep convolutional backbone architecture which was pre-trained on higher resolution images from ImageNet (from 224 × 224 to 448 × 448), and thus the weights learned were more sensitive to capturing fine-grained information. In addition, inspired by the anchor strategy used in SSD, YOLOv2 defined better anchor priors by k-means clustering from the training data (instead of setting them manually). This helped in reducing optimization difficulties in localization. Finally, by integrating Batch Normalization layers [62] and multi-scale training techniques, YOLOv2 achieved state-of-the-art detection results at that time.

The previous approaches required designing anchor boxes manually to train a detector. Later, a series of anchor-free object detectors were developed, where the goal was to predict keypoints of the bounding box, instead of trying to fit an object to an anchor. Law and Deng proposed a novel anchor-free framework, CornerNet [63], which detected objects as a pair of corners. On each position of the feature map, class heatmaps, pair embeddings and corner offsets were predicted. Class heatmaps calculated the probabilities of being corners, and corner offsets were used to regress the corner location. The pair embeddings served to group a pair of corners which belong to the same object. Without relying on manually designed anchors to match objects, CornerNet obtained significant improvement on the MSCOCO dataset. Later there were several other variants of keypoint detection based one-stage detectors [64,65].

Fig. 5 gives an overview of different detection frameworks for several representative one-stage detectors.

3.3. Backbone architecture

R-CNN [2] showed that adopting convolutional weights from models pre-trained on large-scale image classification problems could provide richer semantic information to train detectors and enhance detection performance. In later years, this approach became the default strategy for most object detectors. In this section, we will first briefly introduce the basic concept of deep convolutional neural networks and then review some architectures which are widely used for detection.

3.3.1. Basic architecture of a CNN
A deep convolutional neural network (DCNN) is a typical deep neural network and has proven extremely effective in visual understanding [33,36]. Deep convolutional neural networks are commonly composed of a sequence of convolutional layers, pooling layers, nonlinear activation layers and fully connected layers (FC layers). A convolutional layer takes an image input and convolves over it with n × n kernels to generate a feature map. The generated feature map can be regarded as a multi-channel image and each channel represents different information about the image. Each pixel in the feature map (named neuron) is connected to a small portion of adjacent neurons from the previous map, which is called the receptive field. After generating feature maps, a non-linear activation layer is applied. Pooling layers are used to summarize the signals within the receptive fields, to enlarge receptive fields as well as reduce computation cost.

With the combination of a sequence of convolutional layers, pooling layers and non-linear activation layers, the deep convolutional neural network is built. The whole network can be optimized via a defined loss function by a gradient-based optimization method (stochastic gradient descent [66], Adam [67], etc.). A typical convolutional neural network is AlexNet [33], which contains five convolutional layers, three max-pooling layers and three fully connected layers. Each convolutional layer is followed by a ReLU [68] non-linear activation layer.

3.3.2. CNN Backbone for object detection
In this section, we will review some architectures which are widely used in object detection tasks with state-of-the-art results, such as VGG16 [34,38], ResNet [1,52], ResNeXt [43] and Hourglass [63].

VGG16 [69] was developed based on AlexNet. VGG16 is composed of five groups of convolutional layers and three FC layers. There are two convolutional layers in the first two groups and three convolutional layers in the next three groups. Between each group, a Max Pooling layer is applied to decrease spatial dimension. VGG16 showed that increasing depth of networks by stacking

convolutional layers could increase the model’s expression capability, and led to a better performance. However, increasing model depth to 20 layers by simply stacking convolutional layers led to optimization challenges with SGD. The performance declined significantly and was inferior to shallower models, even during the training stages. Based on this observation, He et al. [1] proposed ResNet which reduced optimization difficulties by introducing shortcut connections. Here, a layer could skip the nonlinear transformation and directly pass the values to the next layer as is (thus giving us an implicit identity layer). This is given as:

$$x_{l+1} = x_l + f_{l+1}(x_l, \theta) \tag{6}$$

where $x_l$ is the input feature in the $l$-th layer and $f_{l+1}$ denotes operations on input $x_l$ such as convolution, normalization or non-linear activation. $f_{l+1}(x_l, \theta)$ is the residual function to $x_l$, so the feature map of any deep layer can be viewed as the sum of the activation of a shallow layer and the residual function. The shortcut connection creates a highway which directly propagates the gradients from deep layers to shallow units and thus significantly reduces training difficulty. With residual blocks effectively training networks, the model depth could be increased (e.g. from 16 to 152), allowing us to train very high capacity models. Later, He et al. [70] proposed a pre-activation variant of ResNet, named ResNet-v2. Their experiments showed that an appropriate ordering of the Batch Normalization [62] layers could further improve performance over the original ResNet. This simple but effective modification of ResNet made it possible to successfully train a network with more than 1000 layers, and still enjoy improved performance due to the increase in depth. Huang et al. argued that although ResNet reduced the training difficulty via shortcut connections, it did not fully utilize features from previous layers. The original features in shallow layers were lost in the element-wise operation and thus could not be directly used later. They proposed DenseNet [71], which retained the shallow layer features and improved information flow by concatenating the input with the residual output instead of element-wise addition:

$$x_{l+1} = x_l \circ f_{l+1}(x_l, \theta) \tag{7}$$

where $\circ$ denotes concatenation. Chen et al. [72] argued that in DenseNet, the majority of newly exploited features from shallow layers were duplicated and incurred high computation cost. Integrating the advantages of both ResNet and DenseNet, they proposed a Dual Path Network (DPN) which divides the channels of $x_l$ into two parts: $x^d_l$ and $x^r_l$. $x^d_l$ was used for dense connection computation and $x^r_l$ was used for element-wise summation, with unshared residual learning branches $f^d_{l+1}$ and $f^r_{l+1}$. The final result was the concatenated output of the two branches:

$$x_{l+1} = \left(x^r_l + f^r_{l+1}(x^r_l, \theta^r)\right) \circ \left(x^d_l \circ f^d_{l+1}(x^d_l, \theta^d)\right) \tag{8}$$

Based on ResNet, Xie et al. [73] proposed ResNeXt which considerably reduced computation and memory cost while maintaining comparable classification accuracy. ResNeXt adopted group convolution layers [33] which sparsely connect feature map channels to reduce computation cost. By increasing the group number to keep computation cost consistent with the original ResNet, ResNeXt captures richer semantic feature representation from the training data and thus improves backbone accuracy. Later, Howard et al. [74] set the group number equal to the number of channels of each feature map (i.e., depthwise convolution) and developed MobileNet. MobileNet significantly reduced computation cost as well as the number of parameters without significant loss in classification accuracy. This model was specifically designed for usage on a mobile platform.

In addition to increasing model depth, some efforts explored benefits from increasing model width to improve the learning capacity. Szegedy et al. proposed GoogLeNet with an inception module [75] which applied convolution kernels of different scales (1 × 1, 3 × 3 and 5 × 5) on the same feature map in a given layer. This way it captured multi-scale features and summarized these features together as an output feature map. Better versions of this model were developed later with different choices of convolution kernels [76], and by introducing residual blocks [77].

The network structures introduced above were all designed for image classification. Typically these models trained on ImageNet are adopted as initialization of the model used for object detection. However, directly applying a pre-trained model from classification to detection is sub-optimal due to a potential conflict between classification and detection tasks. Specifically, (i) classification requires large receptive fields and wants to maintain spatial invariance. Thus multiple downsampling operations (such as pooling layers) are applied to decrease feature map resolution. The feature maps generated are low-resolution and spatially invariant and have large receptive fields. However, in detection, high-resolution spatial information is required to correctly localize objects; and (ii) classification makes predictions on a single feature map, while detection requires feature maps with multiple representations to detect objects at multiple scales. To bridge the gap between the two tasks, Li et al. introduced DetNet [78] which was designed specifically for detection. DetNet kept high resolution feature maps for prediction with dilated convolutions to increase receptive fields. In addition, DetNet detected objects on multi-scale feature maps, which provided richer information. DetNet was pre-trained on a large scale classification dataset while the network structure was designed for detection.

Hourglass Network [79] is another architecture which was not designed specifically for image classification. Hourglass Network first appeared in the human pose recognition task [79], and was a fully convolutional structure with a sequence of hourglass modules. An hourglass module first downsampled the input image via a sequence of convolutional or pooling layers, and then upsampled the feature map via deconvolutional operations. To avoid information loss in the downsampling stage, skip connections were used between downsampling and upsampling features. The hourglass module could capture both local and global information and thus was very suitable for object detection. Currently Hourglass Network is widely used in state-of-the-art detection frameworks [63–65].

3.4. Proposal generation

Proposal generation plays a very important role in the object detection framework. A proposal generator generates a set of rectangular bounding boxes, which are potentially objects. These proposals are then used for classification and localization refinement. We categorize proposal generation methods into four categories: traditional computer vision methods, anchor-based supervised learning methods, keypoint based methods and other methods. Notably, both one-stage detectors and two-stage detectors generate proposals. The main difference is that two-stage detectors generate a sparse set of proposals with only foreground or background information, while one-stage detectors consider each region in the image as a potential proposal, and accordingly estimate the class and bounding box coordinates of potential objects at each location.

3.4.1. Traditional computer vision methods
These methods generate proposals in images using traditional computer vision methods based on low-level cues, such as edges, corners, color, etc. These techniques can be categorized into three principles: (i) computing the ‘objectness score’ of a candidate box; (ii) merging super-pixels from original images; (iii) generating multiple foreground and background segments.

Objectness Score based methods predict an objectness score of each candidate box measuring how likely it may contain an object. Arbelaez et al. [80] assigned objectness score to proposals by classification based on visual cues such as color contrast, edge

density and saliency. Rahtu et al. [81] revisited the idea of Arbelaez et al. [80] and introduced a more efficient cascaded learning method to rank the objectness score of candidate proposals.

Superpixels merging is based on merging superpixels generated from segmentation results. Selective Search [45] was a proposal generation algorithm based on merging super-pixels. It computed the multiple hierarchical segments generated by segmentation method [82], which were merged according to their visual factors (color, areas, etc.), and finally bounding boxes were placed on the merged segments. Manen et al. [83] proposed a similar idea to merge superpixels. The difference was that the weight of the merging function was learned and the merging process was randomized. Selective Search is widely used in many detection frameworks due to its efficiency and high recall compared to other traditional methods.
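The merge-then-box idea behind superpixel merging can be illustrated with a short sketch. This is not the Selective Search algorithm of [45]: it assumes each initial segment is already summarized by a bounding box, a pixel count and a mean color, it uses a made-up similarity (color closeness plus a preference for small regions), and it ignores spatial adjacency and texture. The names `Region`, `similarity`, `merge` and `propose` are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Region:
    box: tuple    # (x1, y1, x2, y2)
    size: int     # number of pixels in the segment
    color: tuple  # mean (r, g, b), each channel in [0, 1]


def similarity(a, b, image_area):
    # Illustrative similarity (an assumption, not the measure used in [45]):
    # close mean colors and small combined size both raise the score.
    color_sim = 1.0 - sum(abs(p - q) for p, q in zip(a.color, b.color)) / 3.0
    size_sim = 1.0 - (a.size + b.size) / image_area
    return color_sim + size_sim


def merge(a, b):
    # The merged region's box is the tightest box enclosing both inputs.
    box = (min(a.box[0], b.box[0]), min(a.box[1], b.box[1]),
           max(a.box[2], b.box[2]), max(a.box[3], b.box[3]))
    size = a.size + b.size
    color = tuple((a.size * p + b.size * q) / size
                  for p, q in zip(a.color, b.color))
    return Region(box, size, color)


def propose(regions, image_area):
    """Greedily merge the most similar pair until one region remains.

    Every initial and every merged region contributes one box proposal.
    """
    regions = list(regions)
    proposals = [r.box for r in regions]
    while len(regions) > 1:
        a, b = max(combinations(regions, 2),
                   key=lambda pair: similarity(pair[0], pair[1], image_area))
        regions.remove(a)
        regions.remove(b)
        merged = merge(a, b)
        regions.append(merged)
        proposals.append(merged.box)
    return proposals


if __name__ == "__main__":
    segments = [
        Region((0, 0, 10, 10), 100, (1.0, 0.0, 0.0)),
        Region((10, 0, 20, 10), 100, (0.9, 0.0, 0.0)),
        Region((0, 10, 20, 20), 200, (0.0, 0.0, 1.0)),
    ]
    for box in propose(segments, image_area=400):
        print(box)
```

Because every intermediate merge yields a box, the proposals cover several scales of the same image, which is one plausible reading of why this family of methods reaches high recall.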
Seed segmentation starts with multiple seed regions, and for each seed, foreground and background segments are generated. To avoid building up hierarchical segmentation, CPMC [84] generated a set of overlapping segments initialized with diverse seeds. Each proposal segment was the solution of a binary (foreground or background) segmentation problem. Endres and Hoiem [85] combined the idea of Selective Search [45] and CPMC [84]. Their method started with super-pixels and merged them with newly designed features. These merged segments were used as seeds to generate larger segments, which was similar to CPMC. However, producing high quality segmentation masks is very time-consuming and is not applicable to large scale datasets.

The primary advantage of these traditional computer vision methods is that they are very simple and can generate proposals with high recall (e.g. on medium scale datasets such as Pascal VOC). However, these methods are mainly based on low level visual cues such as color or edges. They cannot be jointly optimized with the whole detection pipeline. Thus they are unable to exploit the power of large scale datasets to improve representation learning. On challenging datasets such as MSCOCO [86], traditional computer vision methods struggled to generate high quality proposals due to these limitations.

3.4.2. Anchor-based methods
One large family of supervised proposal generators is anchor-based methods. They generate proposals based on pre-defined anchors. Ren et al. proposed Region Proposal Network (RPN) [34] to generate proposals in a supervised way based on deep convolutional feature maps. The network slid over the entire feature map using 3 × 3 convolution filters. For each position, k anchors (or initial estimates of bounding boxes) of varying size and aspect ratios were considered. These sizes and ratios allowed for matching objects at different scales in the entire image. Based on the ground truth bounding boxes, the object locations were matched with the most appropriate anchors to obtain the supervision signal for the anchor estimation. A 256-dimensional feature vector was extracted from each anchor and was fed into two sibling branches: a classification layer and a regression layer. The classification branch was responsible for modeling the objectness score while the regression branch encoded four real values to refine the location of the bounding box from the original anchor estimation. Based on the ground truth, each anchor was predicted to either be an object, or just background by the classification branch (see Fig. 6). Later, SSD [42] adopted a similar idea of anchors in RPN by using multi-scale anchors to match objects. The main difference was that SSD assigned categorical probabilities to each anchor proposal, while RPN first evaluated whether the anchor proposal was foreground or background and performed the categorical classification in the next stage.

Fig. 6. Diagram of RPN [34]. Each position of the feature map is connected to a sliding window, followed by two sibling branches.

Despite promising performance, the anchor priors are manually designed with multiple scales and aspect ratios in a heuristic manner. These design choices may not be optimal, and different datasets would require different anchor design strategies. Many efforts have been made to improve the design choice of anchors. Zhang et al. proposed Single Shot Scale-invariant Face Detector (S3FD) [87] based on SSD with carefully designed anchors to match the objects. According to the effective receptive field [88] of different feature maps, different anchor priors were designed. Zhu et al. [89] introduced an anchor design method for matching small objects by enlarging input image size and reducing anchor strides. Xie et al. proposed Dimension-Decomposition Region Proposal Network (DeRPN) [90] which decomposed the dimension of anchor boxes based on RPN. DeRPN used an anchor string mechanism to independently match object width and height. This helped match objects with large scale variance and reduced the search space.

Ghodrati et al. developed DeepProposals [91] which predicted proposals on the low-resolution deeper layer feature map. These were then projected back onto the high-resolution shallow layer feature maps, where they were further refined. Redmon et al. [41] designed anchor priors by learning priors from the training data using k-means clustering. Later, Zhang et al. introduced Single-Shot Refinement Neural Network (RefineDet) [92] which refined the manually defined anchors in two steps. In the first step, RefineDet learned a set of localization offsets based on the original hand-designed anchors and these anchors were refined by the learned offsets. In the second step, a new set of localization offsets was learned based on the refined anchors from the first step for further refinement. This cascaded optimization framework significantly improved the anchor quality and final prediction accuracy in a data-driven manner. Cai et al. proposed Cascade R-CNN [49] which adopted a similar idea as RefineDet by refining proposals in a cascaded way. Yang et al. [93] modeled anchors as functions implemented by neural networks which were computed from customized anchors. Their method MetaAnchor showed comprehensive improvement compared to other manually defined methods, but the customized anchors were still designed manually.

3.4.3. Keypoints-based methods
Another proposal generation approach is based on keypoint detection, which can be divided into two families: corner-based methods and center-based methods. Corner-based methods predict bounding boxes by merging pairs of corners learned from the feature map. DeNet [94] reformulated the object detection problem in a probabilistic way. For each point on the feature map, DeNet modeled the distribution of being one of the 4 corner types of objects (top-left, top-right, bottom-left, bottom-right), and applied a naive Bayes classifier over each corner of the objects to estimate the confidence score of a bounding box. This corner-based algorithm

eliminated the design of anchors and became a more effective receptive fields and thus are more suitable for detecting small ob-
method to produce high quality proposals. Later based on Denet, jects, while semantic-rich features in deep layers are more robust
Law and Deng proposed CornerNet [63] which directly modeled to illumination, translation and have larger receptive fields (but
categorical information on corners. CornerNet modeled informa- coarse resolutions), and are more suitable for detecting large ob-
tion of top-left and bottom-right corners with novel feature em- jects. When detecting small objects, high resolution representa-
bedding methods and corner pooling layer to correctly match key- tions are required and the representation of these objects may not
points belonging to the same objects, obtaining state-of-the-art re- even be available in the deep layer features, making small object
sults on public benchmarks. For center-based methods, the probabil- detection difficult. Some techniques such as dilated/atrous convolu-
ity of being the center of the objects is predicted on each position tions [52,97] were proposed to avoid downsampling, and used the
of the feature map, and the height and width are directly regressed high resolution information even in the deeper layers. At the same
without any anchor priors. Zhu et al. [95] presented a feature- time, detecting large objects in shallow layers are also non-optimal
selection-anchor-free (FSAF) framework which could be plugged without large enough receptive fields. Thus, handling feature scale
into one-stage detectors with FPN structure. In FSAF, an online issues has become a fundamental research problem within object
feature selection block is applied to train multi-level center-based detection. There are four main paradigms addressing multi-scale
branches attached in each level of the feature pyramid. During feature learning problem: Image Pyramid, Prediction Pyramid, In-
training, FSAF dynamically assigned each object to the most suit- tegrated Features and Feature Pyramid. These are briefly illustrated
able feature level to train the center-based branch. Similar to FSAF, in the Fig. 7.
Zhou et al. proposed a new center-based framework [64] based Image pyramid: An intuitive idea is to resize input images into
on a single Hourglass network [63] without FPN structure. Fur- a number of different scales (Image Pyramid) and to train mul-
thermore, they applied center-based method into higher-level tiple detectors, each of which is responsible for a certain range
problems such as 3D-detection and human pose recognition, and of scales [98–101]. During testing, images are resized to different
all achieved state-of-the-art results. Duan et al. [65] proposed scales followed by multiple detectors and the detection results are
CenterNet, which combined the idea of center-based methods and merged. This can be computationally expensive. Liu et al. [101] first
corner-based methods. CenterNet first predicted bounding boxes learned a light-weight scale-aware network to resize images such
by pairs of corners, and then predicted center probabilities of the that all objects were in a similar scale. This was followed by learn-
initial prediction to reject easy negatives. CenterNet obtained sig- ing a single scale detector. Singh et. al. [98] conducted compre-
nificant improvements compared with baselines. These anchor-free hensive experiments on small object detection. They argued that
methods form a promising research direction in the future. learning a single scale-robust detector to handle all scale objects
was much more difficult than learning scale-dependent detectors
3.4.4. Other methods with image pyramids. In their work, they proposed a novel frame-
There are some other proposal generation algorithms which are work Scale Normalization for Image Pyramids (SNIP) [98] which
not based on keypoints or anchors but also offer competitive per- trained multiple scale-dependent detectors and each of them was
formances. Lu et al. proposed AZnet [96] which automatically fo- responsible for a certain scale objects.
cused on regions of high interest. AZnet adopted a search strat- Integrated features: Another approach is to construct a single
egy that adaptively directed computation resources to sub-regions which were likely to contain objects. For each region, AZnet predicted two values: a zoom indicator and adjacency scores. The zoom indicator determined whether to further divide this region, which may contain smaller objects, and the adjacency scores denoted its objectness. The starting point was the entire image, and each divided sub-region was recursively processed in this way until the zoom indicator was too small. AZnet was better at matching sparse and small objects compared to RPN’s anchor-object matching approach.

3.5. Feature representation learning

Feature representation learning is a critical component in the whole detection framework. Target objects lie in complex environments and have large variance in scale and aspect ratio. There is a need to train a robust and discriminative feature embedding of objects to obtain good detection performance. In this section, we introduce feature representation learning strategies for object detection. Specifically, we identify three categories: multi-scale feature learning, contextual reasoning, and deformable feature learning.

3.5.1. Multi-scale feature learning
Typical object detection algorithms based on deep convolutional networks such as Fast R-CNN [38] and Faster R-CNN [34] use only a single layer’s feature map to detect objects. However, detecting objects across a large range of scales and aspect ratios is quite challenging on a single feature map. Deep convolutional networks learn hierarchical features in different layers which capture different scale information. Specifically, shallow layer features with spatial-rich information have higher resolution and smaller

feature map by combining features in multiple layers and making final predictions based on the newly constructed map [50,51,102–105]. By fusing spatially rich shallow layer features and semantic-rich deep layer features, the newly constructed features contain rich information and thus can detect objects at different scales. These combinations are commonly achieved by using skip connections [1]. Feature normalization is required as feature norms of different layers have a high variance. Bell et al. proposed Inside-Outside Network (ION) [51], which cropped region features from different layers via ROI Pooling [38] and combined these multi-scale region features for the final prediction. Kong et al. proposed HyperNet [50], which adopted a similar idea to ION. They carefully designed high resolution hyper feature maps by integrating intermediate and shallow layer features to generate proposals and detect objects. Deconvolutional layers were used to up-sample deep layer feature maps and batch normalization layers were used to normalize input blobs in their work. The constructed hyper feature maps could also implicitly encode contextual information from different layers. Inspired by fine-grained classification algorithms which integrate high-order representations instead of exploiting simple first-order representations of object proposals, Wang et al. proposed a novel framework, Multi-scale Location-aware Kernel Representation (MLKP) [103], which captured high-order statistics of proposal features and generated more discriminative feature representations efficiently. The combined feature representation was more descriptive and provided both semantic and spatial information for both classification and localization.
Prediction pyramid: Liu et al.’s SSD [42] combined coarse and fine features from multiple layers together. In SSD, predictions were made from multiple layers, where each layer was responsible for a certain scale of objects. Later, many efforts [106–108] followed this principle to detect multi-scale objects.
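
The idea that each prediction layer is responsible for one object scale can be illustrated with a short sketch. It assumes the linear scale rule from the SSD paper, s_k = s_min + (s_max − s_min)(k − 1)/(m − 1), with s_min = 0.2 and s_max = 0.9 taken as illustrative defaults. The function names are hypothetical, and the nearest-scale lookup is a simplification: SSD itself matches default boxes to ground truth by overlap rather than by scale alone.

```python
def ssd_scales(num_maps, s_min=0.2, s_max=0.9):
    """Relative box scale assigned to each of `num_maps` prediction layers.

    Assumption: scales are spaced linearly between s_min and s_max,
    as in the SSD paper; the defaults are illustrative.
    """
    if num_maps == 1:
        return [s_min]
    step = (s_max - s_min) / (num_maps - 1)
    return [s_min + step * k for k in range(num_maps)]


def responsible_map(obj_scale, scales):
    """Index of the layer whose assigned scale is closest to the object scale.

    Simplification for illustration only (hypothetical helper).
    """
    return min(range(len(scales)), key=lambda k: abs(scales[k] - obj_scale))


scales = ssd_scales(6)
small_object_layer = responsible_map(0.25, scales)  # shallow, high-resolution map
large_object_layer = responsible_map(0.85, scales)  # deep, low-resolution map
```

Under these assumptions, small objects fall to the shallow high-resolution maps and large objects to the deep low-resolution ones.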

Fig. 7. Four paradigms for multi-scale feature learning. Top Left: Image Pyramid, which learns multiple detectors from different scale images; Top Right: Prediction Pyramid,
which predicts on multiple feature maps; Bottom Left: Integrated Features, which predicts on single feature map generated from multiple features; Bottom Right: Feature
Pyramid which combines the structure of Prediction Pyramid and Integrated Features.
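
The Integrated Features paradigm (Fig. 7, bottom left) can be sketched in a few lines: a deep, low-resolution map is up-sampled to the resolution of a shallow map, both are normalized so that neither dominates, and the channels are concatenated. This is a minimal sketch under stated assumptions: feature maps are plain nested lists indexed [channel][row][column], nearest-neighbour up-sampling stands in for learned deconvolution, and per-location L2 normalization stands in for whatever normalization layer a given detector uses. All function names are hypothetical.

```python
import math


def upsample_nearest_2x(fmap):
    """Nearest-neighbour 2x up-sampling of a [C][H][W] nested list."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat each column
            rows.append(wide)
            rows.append(list(wide))                    # repeat each row
        out.append(rows)
    return out


def l2_normalize(fmap, eps=1e-12):
    """Scale the channel vector at every spatial location to unit L2 norm."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(c)]
    for y in range(h):
        for x in range(w):
            norm = math.sqrt(sum(fmap[k][y][x] ** 2 for k in range(c))) + eps
            for k in range(c):
                out[k][y][x] = fmap[k][y][x] / norm
    return out


def integrate(shallow, deep):
    """Fuse a shallow map with a deep map of half its resolution."""
    deep_up = upsample_nearest_2x(deep)
    if (len(deep_up[0]), len(deep_up[0][0])) != (len(shallow[0]), len(shallow[0][0])):
        raise ValueError("deep map must have half the resolution of the shallow map")
    # Concatenate along the channel axis after normalizing each source.
    return l2_normalize(shallow) + l2_normalize(deep_up)


shallow = [[[3.0, 0.0], [0.0, 4.0]], [[4.0, 0.0], [0.0, 3.0]]]  # 2 channels, 2x2
deep = [[[10.0]]]                                                # 1 channel, 1x1
fused = integrate(shallow, deep)                                 # 3 channels, 2x2
```

Without the normalization step, the deep map in this toy input (values around 10) would swamp the shallow one (values around 3 to 4), which is the variance problem that motivates feature normalization.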

Yang et al. [100] also exploited appropriate feature maps to generate object proposals of a certain scale, and these feature maps were fed into multiple scale-dependent classifiers to predict objects. In their work, cascaded rejection classifiers were learned to reject easy background proposals in early stages to accelerate detection. Multi-scale Deep Convolutional Neural Network (MSCNN) [106] applied deconvolutional layers on multiple feature maps to improve their resolutions, and later these refined feature maps were used to make predictions. Liu et al. proposed Receptive Field Block Net (RFBNet) [108] to enhance the robustness and receptive fields via a receptive field block (RFB block). The RFB block adopted similar ideas to the inception module [75], which captured features from multiple scales and receptive fields via multiple branches with different convolution kernels and finally merged them together.
Feature pyramid: To combine the advantages of Integrated Features and Prediction Pyramid, Lin et al. proposed Feature Pyramid Network (FPN) [39], which integrated different scale features with lateral connections in a top-down fashion to build a set of scale invariant feature maps, and multiple scale-dependent classifiers were learned on these feature pyramids. Specifically, the deep semantic-rich features were used to strengthen the shallow spatially-rich features. These top-down and lateral features were combined by element-wise summation or concatenation, with small convolutions reducing the dimensions. FPN showed significant improvement in object detection, as well as other applications, and achieved state-of-the-art results in learning multi-scale features. Many variants of FPN were later developed [92,109–119], with modifications to the feature pyramid block (see Fig. 8). Kong et al. [120] and Zhang et al. [92] built scale invariant feature maps with lateral connections. Different from FPN, which generated region proposals followed by categorical classifiers, their methods omitted proposal generation and thus were more efficient than the original FPN. Ren et al. [109] and Jeong et al. [110] developed a novel structure which gradually and selectively encoded contextual information between different layer features. Inspired by super resolution tasks [121,122], Zhou et al. [111] developed high resolution feature maps using a novel transform block which explicitly explored the inter-scale consistency nature across multiple detection scales.

Fig. 8. General framework for feature combination. Top-down features are up-sampled by a factor of 2 and fused with bottom-up features. The fusion methods can be element-wise sum, multiplication, concatenation and so on. Convolution and normalization layers can be inserted into this general framework to enhance semantic information and reduce memory cost.

3.5.2. Region feature encoding
For two-stage detectors, region feature encoding is a critical step to extract features from proposals into fixed length feature vectors. In R-CNN, Girshick et al. [2] cropped region proposals from the whole image and resized the cropped regions into fixed sized patches (224 × 224) via bilinear interpolation, followed by a deep convolution feature extractor. Their method encoded high resolution region features but the computation was expensive.
Later, Girshick et al. [38] and Ren et al. [34] proposed the ROI Pooling layer to encode region features. ROI Pooling divided each region into n × n cells (e.g. 7 × 7 by default) and only the neuron with the maximum signal would go ahead in the feedforward stage. This is similar to max-pooling, but across (potentially) different sized regions. ROI Pooling extracted features from the

down-sampled feature map and as a result struggled to handle small objects. Dai et al. [59] proposed the ROI Warping layer, which encoded region features via bilinear interpolation. Due to the downsampling operation in DCNNs, there can be a misalignment between the object position in the original image and the downsampled feature maps, which ROI Pooling and ROI Warping layers are not able to handle. Instead of quantizing grid borders as ROI Warping and ROI Pooling do, He et al. [3] proposed the ROI Align layer, which addressed the quantization issue by bilinear interpolation at fractionally sampled positions within each grid. Based on ROI Align, Jiang et al. [123] presented Precise ROI Pooling (PrROI Pooling), which avoided any quantization of coordinates and had a continuous gradient on bounding box coordinates.
In order to enhance spatial information of the downsampled region features, Dai et al. [52] proposed Position Sensitive ROI Pooling (PSROI Pooling), which kept relative spatial information of downsampled features. Each channel of the generated region feature map only corresponded to a subset of channels of the input region according to its relative spatial position. Based on PSROI Pooling, Zhai et al. [124] presented feature selective networks to learn robust region features by exploiting disparities among sub-regions and aspect ratios. The proposed network encoded sub-region and aspect ratio information, which were selectively pooled to refine initial region features by a light-weight head.
Later, more algorithms were proposed to better encode region features from different viewpoints. Zhu et al. proposed CoupleNet [125], which extracted region features by combining outputs generated from both the ROI Pooling layer and the PSROI Pooling layer. The ROI Pooling layer extracted global region information but struggled with highly occluded objects, while the PSROI Pooling layer focused more on local information. CoupleNet enhanced features generated from ROI Pooling and PSROI Pooling by element-wise summation and generated more powerful features. Later, Dai et al. proposed Deformable ROI Pooling [97], which generalized aligned ROI pooling by learning an offset for each grid and adding it to the grid center. The method started with a regular ROI Pooling layer to extract initial region features, and the extracted features were used to regress the offsets by an auxiliary network. Deformable ROI Pooling can automatically model the image content without being constrained by fixed receptive fields.

3.5.3. Contextual reasoning
Contextual information plays an important role in object detection. Objects often tend to appear in specific environments and sometimes also coexist with other objects. For example, birds commonly fly in the sky. Effectively using contextual information can help improve detection performance, especially for detecting objects with insufficient cues (small objects, occlusion etc.). Learning the relationship between objects and their surrounding context can improve a detector’s ability to understand the scenario. For traditional object detection algorithms, there have been several efforts exploring context [126], but for object detection based on deep learning, context has not been extensively explored. This is because convolutional networks already implicitly capture contextual information from hierarchical feature representations. However, some recent efforts [1,3,59,106,127–131] still try to exploit contextual information. Some works [132] have even shown that in some cases context information may harm the detection performance. In this section we review contextual reasoning for object detection from two aspects: global context and region context.
Global context reasoning refers to learning from the context in the whole image. Unlike traditional detectors which attempt to classify specific regions in the image as objects, the idea here is to use the contextual information (i.e., information from the rest of the image) to classify a particular region of interest. For example, detecting a baseball in an image can be challenging for a traditional detector (as it may be confused with balls from other sports); but if the contextual information from the rest of the image is used (e.g. baseball field, players, bat), it becomes easier to identify the baseball.
Some representative efforts include ION [51], DeepIDNet [127] and an improved version of Faster R-CNN [1]. In ION, Bell et al. used recurrent neural networks to encode contextual information across the whole image from four directions. Ouyang et al. [127] learned a categorical score for each image, which was used as a contextual feature concatenated with the object detection results. He et al. [1] extracted a feature embedding of the entire image and concatenated it with region features to improve detection results. In addition, some methods [3,59,129,133–136] exploit global contextual information via semantic segmentation. Due to precise pixel-level annotation, segmentation feature maps capture strong spatial information. He et al. [3] and Dai et al. [59] learned unified instance segmentation frameworks and optimized the detector with pixel-level supervision. They jointly optimized detection and segmentation objectives as a multi-task optimization. Though segmentation can significantly improve detection performance, obtaining the pixel-level annotation is very expensive. Zhao et al. [133] optimized detectors with pseudo segmentation annotation and showed promising results. Zhang et al.’s work, Detection with Enriched Semantics (DES) [134], introduced contextual information by learning a segmentation mask without segmentation annotations. It also jointly optimized object detection and segmentation objectives and enriched the original feature map with a more discriminative feature map.
Region context reasoning encodes contextual information surrounding regions and learns interactions between the objects and their surrounding area. Directly modeling the relations of objects at different locations and of different categories with their context is very challenging. Chen et al. proposed Spatial Memory Network (SMN) [130], which introduced a spatial memory based module. The spatial memory module captured instance-level contexts by assembling object instances back into a pseudo “image” representation which was later used for reasoning about object relations. Liu et al. proposed Structure Inference Net (SIN) [137], which formulated object detection as a graph inference problem by considering scene contextual information and object relationships. In SIN, each object was treated as a graph node and the relationships between different objects were regarded as graph edges. Hu et al. [138] proposed a lightweight framework, relation network, which formulated the interaction between different objects in terms of their appearance and image locations. The newly proposed framework did not need additional annotation and showed improvements in object detection performance. Based on Hu et al., Gu et al. [139] proposed a fully learnable object detector which offered a general viewpoint that unified existing region feature extraction methods. Their proposed method removed heuristic choices in ROI pooling methods and automatically selected the most significant parts, including contexts beyond proposals. Another method to encode contextual information is to implicitly encode region features by adding image features surrounding region proposals, and a large number of approaches have been proposed based on this idea [106,131,140–143]. In addition to encoding features from region proposals, Gidaris et al. [131] extracted features from a number of different sub-regions of the original object proposals (border regions, central regions, contextual regions etc.) and concatenated these features with the original region features. Similar to their method, [106] extracted local contexts by enlarging the proposal window size and concatenating these features with the original ones. Zeng et al. [142] proposed Gated Bi-Directional CNN (GBDNet), which extracted features from multi-scale subregions. Notably, GBDNet learned a gated function to control the transmission of different region information because not all contextual information is helpful for detection.

3.5.4. Deformable feature learning
A good detector should be robust to nonrigid deformation of objects. Before the deep learning era, Deformable Part based Models (DPMs) [28] had been successfully used for object detection. DPMs represented objects by multiple component parts using a deformable coding method, making the detector robust to nonrigid object transformation. In order to enable detectors based on deep learning to model deformations of object parts, many researchers have developed detection frameworks to explicitly model object parts [97,127,144,145]. DeepIDNet [127] developed a deformable-aware pooling layer to encode the deformation information across different object categories. Dai et al. [97] and Zhu et al. [144] designed deformable convolutional layers which automatically learned the auxiliary position offsets to augment information sampled in regular sampling locations of the feature map.

4. Learning strategy

In contrast to image classification, object detection requires optimizing both localization and classification tasks, which makes it more difficult to train robust detectors. In addition, there are several issues that need to be addressed, such as imbalance sampling, localization, acceleration etc. Thus there is a need to develop innovative learning strategies to train effective and efficient detectors. In this section, we review some of the learning strategies for object detection.

4.1. Training stage

In this section, we review the learning strategies for training object detectors. Specifically, we discuss data augmentation, imbalance sampling, cascade learning, localization refinement and some other learning strategies.

4.1.1. Data augmentation
Data augmentation is important for nearly all deep learning methods as they are often data-hungry and more training data leads to better results. In object detection, in order to increase training data as well as generate training patches with multiple visual properties, horizontal flipping of training images is used in training the Faster R-CNN detector [38]. A more intensive data augmentation strategy is used in one-stage detectors, including rotation, random crops, expanding and color jittering [42,106,146]. This data augmentation strategy has shown significant improvement in detection accuracy.

4.1.2. Imbalance sampling
In object detection, imbalance of negative and positive samples is a critical issue. That is, most of the regions of interest estimated as proposals are in fact just background images. Very few of them are positive instances (or objects). This results in a problem of imbalance while training detectors. Specifically, two issues arise which need to be addressed: class imbalance and difficulty imbalance. The class imbalance issue is that most candidate proposals belong to the background and only a few proposals contain objects. This results in the background proposals dominating the gradients during training. The difficulty imbalance is closely related to the first issue: due to the class imbalance, most of the background proposals become easy to classify, while the objects become harder to classify. A variety of strategies have been developed to tackle the class imbalance issue. Two-stage detectors such as R-CNN and Fast R-CNN first reject the majority of negative samples and keep 2000 proposals for further classification. In Fast R-CNN [38], negative samples were randomly sampled from these 2k proposals and the ratio of positive to negative samples was fixed as 1:3 in each mini-batch, to further reduce the adverse effects of class imbalance. Random sampling can address the class imbalance issue but is not able to fully utilize information from negative proposals. Some negative proposals may contain rich context information about the images, and some hard proposals can help to improve detection accuracy. To address this, Liu et al. [42] proposed a hard negative sampling strategy which fixed the foreground and background ratio but sampled the most difficult negative proposals for updating the model. Specifically, negative proposals with higher classification loss were selected for training.
To address difficulty imbalance, most sampling strategies are based on carefully designed loss functions. For object detection, a multi-class classifier is learned over C+1 categories (C target categories plus one background category). Assume the region is labeled with ground truth class $u$, and $p$ is the output discrete probability distribution over C+1 classes ($p = \{p_0, \ldots, p_C\}$). The loss function is given by:

$$L_{cls}(p, u) = -\log p_u \tag{9}$$

Lin et al. proposed a novel focal loss [43] which suppressed signals from easy samples. Instead of discarding all easy samples, they assigned an importance weight to each sample w.r.t. its loss value as:

$$L_{FL} = -\alpha (1 - p_u)^{\gamma} \log(p_u) \tag{10}$$

where $\alpha$ and $\gamma$ were parameters to control the importance weight. The gradient signals of easy samples were suppressed, which led the training process to focus more on hard proposals. Li et al. [147] adopted a similar idea to focal loss and proposed a novel gradient harmonizing mechanism (GHM). The newly proposed GHM not only suppressed easy proposals but also avoided the negative impact of outliers. Shrivastava et al. [148] proposed an online hard example mining strategy which was based on a similar principle as Liu et al.’s SSD [42] to automatically select hard examples for training. Different from Liu et al., online hard negative mining only considered difficulty information but ignored categorical information, which meant the ratio of foreground and background was not fixed in each mini-batch. They argued that difficult samples played a more important role than class imbalance in the object detection task.

4.1.3. Localization refinement
An object detector must provide a tight localization prediction (bbox or mask) for each object. To do this, many efforts refine the preliminary proposal prediction to improve the localization. Precise localization is challenging because predictions are commonly focused on the most discriminative part of the objects, and not necessarily the region containing the object. In some scenarios, the detection algorithms are required to make high quality predictions (high IoU threshold). See Fig. 9 for an illustration of how a detector may fail in a high IoU threshold regime. A general approach for localization refinement is to generate high quality proposals (see Section 3.4). In this section, we will review some other methods for localization refinement. In the R-CNN framework, L2 auxiliary bounding box regressors were learned to refine localizations, and in Fast R-CNN, smooth L1 regressors were learned via an end-to-end training scheme as:

$$L_{reg}(t^c, v) = \sum_{i \in \{x, y, w, h\}} \text{SmoothL1}(t^c_i - v_i) \tag{11}$$

$$\text{SmoothL1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \tag{12}$$

where the predicted offset is given by $t^c = (t^c_x, t^c_y, t^c_w, t^c_h)$ for each target class, and $v$ denotes ground truth of object bounding

boxes ($v = (v_x, v_y, v_w, v_h)$). $x, y, w, h$ denote bounding box center, width and height respectively.

Fig. 9. Example of failure case of detection in high IoU threshold. Purple box is ground truth and yellow box is prediction. In a low IoU requirement scenario, this prediction is correct, while in a high IoU threshold it’s a false positive due to insufficient overlap with objects. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Beyond the default localization refinement, some methods learn auxiliary models to further refine localizations. Gidaris et al. [131] introduced an iterative bounding box regression method, where an R-CNN was applied to refine learned predictions. Here the predictions were refined multiple times. Gidaris et al. [149] proposed LocNet, which modeled the distribution of each bounding box and refined the learned predictions. Both these approaches required a separate component in the detection pipeline, and prevented joint optimization.
Some other efforts [150,151] focus on designing a unified framework with modified objective functions. In MultiPath Network, Zagoruyko et al. [150] developed an ensemble of classifiers which were optimized with an integral loss targeting various quality metrics. Each classifier was optimized for a specific IoU threshold and the final prediction results were merged from these classifiers. Tychsen et al. proposed Fitness-NMS [152], which learned a novel fitness score function of IoU between proposals and objects. They argued that existing detectors aimed to find qualified predictions instead of the best predictions, and thus high quality and low quality proposals received equal importance. Fitness-IoU assigned higher importance to highly overlapped proposals. They also derived a bounding box regression loss based on a set of IoU upper bounds to maximize the IoU of predictions with objects. Inspired by CornerNet [63] and DeNet [94], Lu et al. [151] proposed Grid R-CNN, which replaced the linear bounding box regressor with a corner-based mechanism that locates corner keypoints.

4.1.4. Cascade learning
Cascade learning is a coarse-to-fine learning strategy which collects information from the output of the given classifiers to build stronger classifiers in a cascaded manner. The cascade learning strategy was first used by Viola and Jones [17] to train robust face detectors. In their models, a lightweight detector first rejects the majority of easy negatives and feeds hard proposals to train detectors in the next stage. For deep learning based detection algorithms, Yang et al. [153] proposed CRAFT (Cascade Region-proposal-network And FasT-rcnn), which learned RPN and region classifiers with a cascaded learning strategy. CRAFT first learned a standard RPN followed by a two-class Fast RCNN which rejected the majority of easy negatives. The remaining samples were used to build the cascade region classifiers, which consisted of two Fast RCNNs. Yang et al. [100] introduced layer-wise cascade classifiers for different scale objects in different layers. Multiple classifiers were placed on different feature maps and classifiers on shallower layers would reject easy negatives. The remaining samples would be fed into deeper layers for classification. RefineDet [92] and Cascade R-CNN [49] utilized cascade learning methods in refining object locations. They built multi-stage bounding box regressors, and bounding box predictions were refined in each stage, trained with different quality metrics. Cheng et al. [132] observed the failure cases of Faster RCNN, and noticed that even though the localization of objects was good, there were several classification errors. They attributed this to sub-optimal feature representation due to sharing of features and joint multi-task optimization for classification and regression; they also argued that the large receptive field of Faster RCNN induced too much noise in the detection process. They found that vanilla RCNN was robust to these issues. Thus, they built a cascade detection system based on Faster RCNN and RCNN to complement each other. Specifically, a set of initial predictions was obtained from a well trained Faster RCNN, and these predictions were used to train RCNN to refine the results.

4.1.5. Others
There are some other learning strategies which offer interesting directions, but have not yet been extensively explored. We split these approaches into three categories: adversarial learning, training from scratch and knowledge distillation.
Adversarial learning. Adversarial learning has shown significant advances in generative models. The most famous work applying adversarial learning is the generative adversarial network (GAN) [154], where a generator competes with a discriminator. The generator tries to model the data distribution by generating fake images from a noise vector input and uses these fake images to confuse the discriminator, while the discriminator competes with the generator to distinguish the real images from fake images. GAN and its variants [155–157] have shown effectiveness in many domains and have also found applications in object detection. Li et al. [158] proposed a new framework, Perceptual GAN, for small object detection. The learnable generator learned high-resolution feature representations of small objects via an adversarial scheme. Specifically, its generator learned to transfer low-resolution small region features into high-resolution features and competed with the discriminator, which identified real high-resolution features. Finally the generator learned to generate high quality features for small objects. Wang et al. [159] proposed A-Fast-R-CNN, which was trained with generated adversarial examples. They argued that the difficult samples were on the long tail, so they introduced two novel blocks which automatically generated features with occlusion and deformation. Specifically, a learned mask was generated on region features followed by region classifiers. In this case, the detectors could receive more adversarial examples and thus become more robust.
Training from scratch. Modern object detectors heavily rely on classification models pre-trained on ImageNet; however, the bias of loss functions and data distribution between classification and detection can have an adverse impact on the performance. Fine-tuning on the detection task can relieve this issue, but cannot fully get rid of the bias. Besides, transferring a classification model for detection in a new domain can lead to more challenges (from RGB to MRI data etc.). For these reasons, there is a need to train detectors from scratch, instead of relying on pretrained models. The main difficulty of training detectors from scratch is that the training data for object detection is often insufficient, which may lead to overfitting. Different from image classification, object detection requires bounding box level annotations and thus annotating a large scale detection dataset requires much more effort and time (ImageNet has 1000 categories for image classification while only 200 of them have detection annotations).
There are some works [107,160,161] exploring training object detectors from scratch. Shen et al. [107] first proposed a novel framework, DSOD (Deeply Supervised Object Detectors), to train detectors from scratch. They argued that deep supervision with a densely connected network structure could significantly reduce optimization difficulties. Based on DSOD, Shen et al. [162] proposed a gated recurrent feature pyramid which dynamically adjusted supervision intensities of intermediate layers for objects with different scales. They defined a recurrent feature pyramid structure to squeeze both spatial and semantic information into a single prediction layer, which further reduced the number of parameters, leading to faster convergence. In addition, the gate-control structure on feature pyramids adaptively adjusted the supervision at different scales based on the size of objects. Their method was more powerful than the original DSOD. However, later He et al. [160] validated the difficulty of training detectors from scratch on MSCOCO and found that vanilla detectors could obtain competitive performance with at least 10K annotated images. Their findings indicated that no specific structure was required for training from scratch, which contradicted the previous work.
Knowledge distillation. Knowledge distillation is a training strategy which distills the knowledge in an ensemble of models into a single model via a teacher-student training scheme. This learning strategy was first used in image classification [163]. In object detection, some works [132,164] also investigate this training scheme to improve detection performance. Li et al. [164] proposed a light-weight detector whose optimization was carefully guided by a heavy but powerful detector. This light detector could achieve comparable detection accuracy by distilling knowledge from the heavy one, while having faster inference speed. Cheng et al. [132] proposed a Faster R-CNN based detector which was optimized via a teacher-student training scheme. An R-CNN model was used as the teacher network to guide the training process. Their framework showed improvement in detection accuracy compared with the traditional single model optimization strategy.

4.2. Testing stage

Object detection algorithms make a dense set of predictions and thus these predictions cannot be directly used for evaluation due to heavy duplication. In addition, some other learning strate-

Fig. 10. Duplicate predictions are eliminated by NMS operation. The most-confident box is kept, and all other boxes surrounding it will be removed.

box regressors will pull these proposals close to the same object and thus lead to the same problem. The duplicate predictions are regarded as false positives and will receive penalties in evaluation, so NMS is needed to remove these duplicate predictions. Specifically, for each category, the prediction boxes are sorted according to the confidence score and the box with the highest score is selected. This box is denoted as $M$. Then the IoU of other boxes with $M$ is calculated, and if the IoU value is larger than a predefined threshold $\Omega_{test}$, these boxes are removed. This process is repeated for all remaining predictions. More formally, the confidence score of box $B$ which overlaps with $M$ by more than $\Omega_{test}$ will be set to zero:

$$Score_B = \begin{cases} Score_B & IoU(B, M) < \Omega_{test} \\ 0 & IoU(B, M) \geq \Omega_{test} \end{cases} \tag{13}$$

However, if an object just lies within $\Omega_{test}$ of $M$, NMS will result in a missing prediction, and this scenario is very common in clustered object detection. Navaneeth et al. [165] introduced a new algorithm, Soft-NMS, to address this issue. Instead of directly eliminating the prediction $B$, Soft-NMS decayed the confidence score of $B$ as a continuous function $F$ ($F$ can be a linear function or a Gaussian function) of its overlap with $M$. This is given by:

$$Score_B = \begin{cases} Score_B & IoU(B, M) < \Omega_{test} \\ F(IoU(B, M)) & IoU(B, M) \geq \Omega_{test} \end{cases} \tag{14}$$

Soft-NMS avoided eliminating predictions of clustered objects and showed improvement on many common benchmarks. Hosang et al. [166] introduced a network architecture designed to perform NMS based on confidence scores and bounding boxes, which was optimized separately from detector training in a supervised way. They argued the reason for duplicate predictions was that the detector deliberately encouraged multiple high score detections per object instead of rewarding one high score. Based on this, they designed the network following two motivations: (i) a loss penalizing double detections to push detectors to predict exactly one precise detection per object; (ii) joint processing of nearby detections to give the detector information on whether an object is detected more than once. The newly proposed model did not discard detections but instead reformulated NMS as a re-scoring task that sought to decrease the score of detections that cover objects that have already been detected.

4.2.2. Model acceleration
Applying object detection in the real world requires the algorithms to function in an efficient manner. Thus, evaluating detectors on efficiency metrics is important. Although current state-of-the-art algorithms [1,167] can achieve very strong results on public datasets, their inference speeds make it difficult to apply them in real applications. In this section we review several works on accelerating detectors. Two-stage detectors are usually slower than one-stage detectors because they have two stages
gies are required to further improve the detection accuracy. These - one proposal generation and one region classification, which
strategies improve the quality of prediction or accelerate the infer- makes them computationally more time consuming than one-stage
ence speed. In this section, we introduce these strategies in testing detectors which directly use one network for both proposal gener-
stage including duplicate removal, model acceleration and other ef- ation and region classification. R-FCN [52] built spatially-sensitive
fective techniques. feature maps and extracted features with position sensitive ROI
Pooling to share computation costs. However, the number of chan-
4.2.1. Duplicate removal nels of spatially-sensitive feature maps significantly increased with
Non maximum suppression (NMS) is an integral part of ob- the number of categories. Li et al. [168] proposed a new frame-
ject detection to remove duplicate false positive predictions (See work Light Head R-CNN which significantly reduced the number
Fig. 10). Object detection algorithms make a dense set of predic- of channels in the final feature map (from 1024 to 16) instead of
tions with several duplicate predictions. For one-stage detection al- sharing all computation. Thus, though computation was not shared
gorithms which generate a dense set of candidate proposals such across regions, but the cost could be neglected.
as SSD [42] or DSSD (Deconvolutional Single Shot Detector) [112], From the aspect of backbone architecture, a major computa-
the proposals surrounding the same object may have similar confi- tion cost in object detection is feature extraction [34]. A simple
dence scores, leading to false positives. For two-stage detection al- idea to accelerate detection speed is to replace the detection back-
gorithms which generates a sparse set of proposals, the bounding bone with a more efficient backbone, e.g., MobileNet [74,169] was
an efficient CNN model with depth-wise convolution layers which was also adopted into many works such as [170] and [171]. PVANet [104] was proposed as a new network structure with CReLU [172] layers to reduce non-linear computation and accelerate inference. Another approach is to optimize models off-line, for example by applying model compression and quantization [173–179] to the learned models. Finally, NVIDIA Corporation² released an acceleration toolkit, TensorRT³, which optimized the computation of learned models for deployment and thus significantly sped up inference.

² https://www.nvidia.com/en-us/.
³ https://developer.nvidia.com/tensorrt.

4.2.3. Others

Other learning strategies in the testing stage mainly comprise transformations of the input image to improve the detection accuracy. Image pyramids [1,92] are a widely used technique to improve detection results, which build a hierarchical image set at different scales and make predictions on all of these images. The final detection results are merged from the predictions of each image. Zhang et al. [87,92] used a more extensive image pyramid structure to handle objects of different scales. They resized the testing image to different scales, and each scale was responsible for a certain scale range of objects. Horizontal flipping [3,92] was also used in the testing stage and showed improvement. These learning strategies largely improved the capability of detectors to handle objects of different scales and thus were widely used in public detection competitions. However, they also increased computation cost and thus were not suitable for real-world applications.

5. Applications

Object detection is a fundamental computer vision task and there are many real-world applications based on this task. Different from generic object detection, these real-world applications commonly have their own specific properties and thus carefully designed detection algorithms are required. In this section, we will introduce several real-world applications such as face detection and pedestrian detection.

5.1. Face detection

Face detection is a classical computer vision problem of detecting human faces in images, which is often the first step towards many real-world applications involving human beings, such as face verification, face alignment and face recognition. There are some critical differences between face detection and generic detection: (i) the range of scale for objects in face detection is much larger than for objects in generic detection. Moreover, occlusion and blurred cases are more common in face detection; (ii) face objects contain strong structural information, and there is only one target category in face detection. Considering these properties of face detection, directly applying generic detection algorithms is not an optimal solution, as there could be some priors that can be exploited to improve face detection.

In early stages of research before the deep learning era, face detection [20,180–182] was mainly based on sliding windows, and dense image grids were encoded by hand-crafted features followed by training classifiers to find and locate objects. Notably, Viola and Jones [20] proposed pioneering cascaded classifiers using AdaBoost with Haar features for face detection and obtained excellent performance with high real-time prediction speed. After the progress of deep learning in image classification, face detectors based on deep learning significantly outperformed traditional face detectors [183–187].

Current face detection algorithms based on deep learning are mainly extended from generic detection frameworks such as Fast R-CNN and SSD. These algorithms focus more on learning robust feature representations. In order to handle extreme scale variance, the multi-scale feature learning methods discussed before have been widely used in face detection. Sun et al. [183] proposed a Fast R-CNN based framework which integrated multi-scale features for prediction and converted the resulting detection bounding boxes into ellipses, as the regions of human faces are more elliptical than rectangular. Zhang et al. [87] proposed the one-stage S³FD, which found faces on different feature maps to detect faces at a large range of scales. They made predictions on larger feature maps to capture small-scale face information. Notably, a set of anchors was carefully designed according to empirical receptive fields and thus provided a better match to the faces. Based on S³FD, Zhang et al. [188] proposed a novel network structure to capture multi-scale features in different stages. The newly proposed feature agglomerate structure integrated features at different scales in a hierarchical way. Moreover, a hierarchical loss was proposed to reduce the training difficulties. Single Stage Headless Face Detector (SSH) [189] was another one-stage face detector which combined features at different scales for prediction. Hu et al. [99] gave a detailed analysis of small face detection and proposed a lightweight face detector consisting of multiple RPNs, each of which was responsible for a certain range of scales. Their method could effectively handle face scale variance but it was slow for real-world usage. Unlike this method, Hao et al. [190] proposed a Scale Aware Face network which addressed scale issues without incurring significant computation costs. They learned a scale-aware network which modeled the scale distribution of faces in a given image and guided zoom-in or zoom-out operations to make sure that the faces were at a desirable scale. The resized image was fed into a single-scale lightweight face detector. Wang et al. [191] followed RetinaNet [43] and utilized denser anchors to handle faces in a large range of scales. Moreover, they proposed an attention function to account for context information and to highlight the discriminative features. Zhang et al. [192] proposed a deep multi-task face detector with a cascaded structure (MTCNN). MTCNN had three stages of carefully designed CNN models to predict faces in a coarse-to-fine style. Further, they also proposed a new online hard negative mining strategy to improve the result. Samangouei et al. [193] proposed Face-MagNet, which allowed information flow of small faces without any skip connections by placing a set of deconvolution layers before RPN and ROI Pooling to build up finer face representations.

In addition to multi-scale feature learning, some frameworks focused on contextual information. Face objects have strong physical relationships with the surrounding contexts (commonly appearing with human bodies), and thus encoding contextual information became an effective way to improve detection accuracy. Zhang et al. [194] proposed FDNet based on ResNet with larger deformable convolutional kernels to capture image context. Zhu et al. [195] proposed a Contextual Multi-Scale Region-based Convolution Neural Network (CMS-RCNN) in which multi-scale information was grouped both in region proposal and ROI detection to deal with faces across a range of scales. In addition, contextual information around faces was also considered in training detectors. Notably, Tang et al. [185] proposed a state-of-the-art context-assisted single shot face detector, named PyramidBox, to handle the hard face detection problem. Observing the importance of the context, they improved the utilization of contextual information in the following three aspects: (i) first, a novel context anchor was designed to supervise high-level contextual feature learning by a semi-supervised method, dubbed PyramidAnchors; (ii) the Low-level Feature Pyramid Network was developed to combine adequate high-level context semantic features and low-level facial
features together, which also allowed PyramidBox to predict faces at all scales in a single shot; and (iii) they introduced a context-sensitive structure to increase the capacity of the prediction network and improve the final accuracy of the output. In addition, they used the method of data-anchor-sampling to augment the training samples across different scales, which increased the diversity of training data for smaller faces. Yu et al. [196] introduced a context pyramid maxout mechanism to explore image contexts and devised an efficient anchor-based cascade framework for face detection which optimized the anchor-based detector in a cascaded manner. Zhang et al. [197] proposed a two-stream contextual CNN to adaptively capture body part information. In addition, they proposed to filter easy non-face regions in the shallow layers and leave difficult samples to deeper layers.

Beyond efforts on designing scale-robust or context-assisted detectors, Wang et al. [191] developed a framework from the perspective of loss function design. Based on the vanilla Faster R-CNN framework, they replaced the original softmax loss with a center loss which encouraged detectors to reduce the large intra-class variance in face detection. They explored multiple techniques for improving Faster R-CNN, such as fixed-ratio online hard negative mining, multi-scale training and multi-scale testing, which made vanilla Faster R-CNN adaptable to face detection. Later, Wang et al. [198] proposed Face R-FCN, which was based on vanilla R-FCN. Face R-FCN distinguished the contribution of different facial parts and introduced a novel position-sensitive average pooling to re-weight the response on the final score maps. This method achieved state-of-the-art results on many public benchmarks such as FDDB [199] and WIDER FACE [200].

5.2. Pedestrian detection

Pedestrian detection is an essential and significant task in any intelligent video surveillance system. Pedestrian detection has several properties that distinguish it from generic object detection: (i) pedestrian objects are well-structured objects with nearly fixed aspect ratios (about 1.5), but they also lie at a large range of scales; (ii) pedestrian detection is a real-world application, and hence challenges such as crowding, occlusion and blurring are commonly exhibited. For example, in the CityPersons dataset, there are a total of 3157 pedestrian annotations in the validation subset, among which 48.8% overlap with another annotated pedestrian with Intersection over Union (IoU) above 0.1. Moreover, 26.4% of all pedestrians have considerable overlap with another annotated pedestrian with IoU above 0.3. The highly frequent crowd occlusion harms the performance of pedestrian detectors; (iii) there are more hard negative samples (such as traffic lights, mailboxes, etc.) in pedestrian detection due to complicated contexts.

Before the deep learning era, pedestrian detection algorithms [19,201–204] were mainly extended from the Viola-Jones framework [20] by exploiting Integral Channel Features with a sliding window strategy to locate objects, followed by region classifiers such as SVMs. The early works mainly focused on designing robust feature descriptors for classification. For example, Dalal and Triggs [19] proposed the histograms of oriented gradient (HOG) descriptors, while Paisitkriangkrai et al. [204] designed a feature descriptor based on low-level visual cues and spatial pooling features. These methods showed promising results on pedestrian detection benchmarks but were mainly based on hand-crafted features.

Deep learning based methods for pedestrian detection [8–10,205–211] showed excellent performance and achieved state-of-the-art results on public benchmarks. Angelova et al. [10] proposed a real-time pedestrian detection framework using a cascade of deep convolutional networks. In their work, a large number of easy negatives were rejected by a tiny model and the remaining hard proposals were then classified by a large deep network. Zhang et al. [212] proposed a decision tree based framework. In their method, multiscale feature maps were used to extract pedestrian features, which were later fed into boosted decision trees for classification. In contrast to FC layers, boosted decision trees applied a bootstrapping strategy for mining hard negative samples and achieved better performance. Also, to reduce the impact of large variance in scales, Li et al. [8] proposed Scale-aware Fast R-CNN (SAF RCNN), which inserted multiple built-in networks into the whole detection framework. The proposed SAF RCNN detected pedestrian instances of different scales using different sub-nets. Further, Yang et al. [100] inserted Scale Dependent Pooling (SDP) and Cascaded Rejection Classifiers (CRC) into Fast RCNN to handle pedestrians at different scales. According to the height of the instances, SDP extracted region features from a feature map of suitable scale, while CRC rejected easy negative samples in shallower layers. Wang et al. [213] proposed a novel Repulsion Loss to detect pedestrians in a crowd. They argued that detecting a pedestrian in a crowd made the detector very sensitive to the NMS threshold, which led to more false positives and missing objects. The newly proposed repulsion loss pulled proposals toward their target objects while also pushing them away from other objects and the proposals assigned to those objects. Based on their idea, Zhang et al. [214] proposed an Occlusion-aware R-CNN (OR-CNN) which was optimized by an Aggregation Loss. The new loss function encouraged proposals to lie close to their target objects and to other proposals sharing the same target. Mao et al. [215] claimed that properly aggregating extra features into a pedestrian detector could boost the detection accuracy. In their paper, they explored different kinds of extra features useful for improving accuracy and proposed a new method to use these features. The newly proposed component, HyperLearner, aggregated extra features into a vanilla DCNN detector in a jointly optimized fashion, and no extra input was required at the inference stage.

For pedestrian detection, one of the most significant challenges is to handle occlusion [214,216–226]. A straightforward method is to use part-based models which learn a series of part detectors and integrate the results of the part detectors to locate and classify objects. Tian et al. [216] proposed DeepParts, which consisted of multiple part-based detectors. During training, the important pedestrian parts were automatically selected from a part pool composed of parts of the human body (at different scales), and for each selected part, a detector was learned to handle occlusions. To integrate the inaccurate scores of part-based models, Ouyang and Wang [223] proposed a framework which modeled visible parts as hidden variables in training the models. In their work, the visibility relationships of overlapping parts were learned by discriminative deep models, instead of being manually defined or even being assumed independent. Later, Ouyang et al. [225] addressed this issue from another aspect. They proposed a mixture network to capture the unique visual information formed by crowded pedestrians. To enhance the final predictions of single-pedestrian detectors, a probabilistic framework was learned to model the relationship between the configurations estimated by single-pedestrian and multi-pedestrian detectors. Zhang et al. [214] proposed an occlusion-aware ROI Pooling layer which integrated the prior structure information of pedestrians with visibility prediction into the final feature representations. The original region was divided into five parts, and for each part a sub-network enhanced the original region feature via a learned visibility score for better representations. Zhou et al. [222] proposed Bi-box, which simultaneously estimated pedestrian detection as well as visible parts by regressing two bounding boxes, one for the full body and the other for the visible part. In addition, a new positive-instance sampling criterion was proposed to bias positive training examples toward those with large visible area, which showed effectiveness in training occlusion-aware detectors.
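
The sensitivity to the NMS threshold in crowded scenes noted above is the case that the Soft-NMS rescoring of Eq. (14) in Section 4.2.1 targets. The following is a minimal sketch of that rescoring, not the implementation of any cited work. The function names `iou_one_to_many` and `soft_nms`, the parameters `iou_thresh`, `sigma` and `score_floor`, and the choice to multiply the existing score by the decay factor (the form used in the original Soft-NMS formulation) are assumptions made for illustration.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all given as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def soft_nms(boxes, scores, iou_thresh=0.3, method="linear",
             sigma=0.5, score_floor=1e-3):
    """Return (kept indices in selection order, rescored confidences).

    method="hard" reproduces classical NMS; "linear" and "gaussian"
    decay the scores of overlapping boxes instead of discarding them.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float).copy()
    remaining = np.arange(len(scores))
    keep = []
    while remaining.size > 0:
        # M: the highest-scoring box that has not been selected yet.
        top = remaining[np.argmax(scores[remaining])]
        keep.append(int(top))
        remaining = remaining[remaining != top]
        if remaining.size == 0:
            break
        overlaps = iou_one_to_many(boxes[top], boxes[remaining])
        if method == "linear":
            # Decay only boxes whose overlap with M reaches the threshold.
            decay = np.where(overlaps >= iou_thresh, 1.0 - overlaps, 1.0)
        elif method == "gaussian":
            # Smooth decay applied to every remaining box.
            decay = np.exp(-(overlaps ** 2) / sigma)
        elif method == "hard":
            decay = np.where(overlaps >= iou_thresh, 0.0, 1.0)
        else:
            raise ValueError(f"unknown method: {method}")
        scores[remaining] *= decay
        # Drop boxes whose score has decayed below the (assumed) floor.
        remaining = remaining[scores[remaining] > score_floor]
    return keep, scores
```

With two heavily overlapping boxes, `method="hard"` removes the lower-scoring one outright, whereas `method="linear"` keeps it with a reduced score, so a nearby pedestrian can still be recovered at a lower confidence.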
Fig. 11. Some examples of Pascal VOC, MSCOCO, Open Images and LVIS.
5.3. Others

There are some other real applications of object detection techniques, such as logo detection and video object detection. Logo detection is an important research topic in e-commerce systems. Compared to generic detection, logo instances are much smaller and undergo strong non-rigid transformations. Further, there are few logo detection baselines available. To address this issue, Su et al. [15] adopted the learning principle of webly data learning, which automatically mined information from noisy web images and learned models with limited annotated data. Su et al. [14] described an image synthesising method to successfully learn a detector with limited logo instances. Hoi et al. [13] collected a large-scale logo dataset from an e-commerce website and conducted a comprehensive analysis of the logo detection problem.

Existing detection algorithms are mainly designed for still images and are suboptimal when applied directly to videos for object detection. Video object detection differs from generic detection in two major respects: temporal and contextual information. The location and appearance of objects in video should be temporally consistent between adjacent frames. Moreover, a video consists of hundreds of frames and thus contains far richer contextual information compared to a single still image. Han et al. [54] proposed Seq-NMS, which associates detection results of still images into sequences. Boxes of the same sequence are re-scored to the average score across frames, and other boxes along the sequence are suppressed by NMS. Kang et al. proposed Tubelets with Convolutional Neural Networks (T-CNN) [53], which was extended from Faster RCNN and incorporated the temporal and contextual information from tubelets (box sequences over time). T-CNN propagated the detection results to the adjacent frames by optical flow, and generated tubelets by applying tracking algorithms from high-confidence bounding boxes. The boxes along the tubelets were re-scored based on tubelet classification.

There are also many other real-world applications based on object detection such as vehicle detection [227–229], traffic-sign detection [230,231] and skeleton detection [232,233].

6. Detection benchmarks

In this section we will show some common benchmarks of generic object detection, face detection and pedestrian detection. We will first present some widely used datasets for each task and then introduce the evaluation metrics.

6.1. Generic detection benchmarks

Pascal VOC2007 [29] is a mid-scale dataset for object detection with 20 categories. There are three image splits in VOC2007: training, validation and test with 2501, 2510 and 4952 images respectively.

Pascal VOC2012 [29] is a mid-scale dataset for object detection which shares the same 20 categories with Pascal VOC2007. There are three image splits in VOC2012: training, validation and test with 5717, 5823 and 10991 images respectively. The annotation information of the VOC2012 test set is not available.

MSCOCO [86] is a large-scale dataset for object detection with 80 categories. There are three image splits in MSCOCO: training, validation and test with 118,287, 5000 and 40,670 images respectively. The annotation information of the MSCOCO test set is not available.

Open Images [234] contains 1.9M images with 15M objects of 600 categories. The 500 most frequent categories are used to evaluate detection benchmarks, and more than 70% of these categories have over 1000 training samples.

LVIS [235] is a newly collected benchmark with 164,000 images and 1000+ categories. It is a new dataset without any existing results, so we leave the details of LVIS to the future work section (Section 9).

ImageNet [37] is also an important dataset with 200 categories. However, the scale of ImageNet is huge and the object scale range is similar to the VOC datasets, so it is not a commonly used benchmark for detection algorithms.

Evaluation metrics: The details of the evaluation metrics are shown in Table 1; both detection accuracy and inference speed are used to evaluate detection algorithms. For detection accuracy, mean Average Precision (mAP) is used as the evaluation metric for all these challenges. The mAP is the mean value of AP, which is calculated separately for each class based on recall and precision. Assume the detector returns a set of predictions; we sample the top γ predictions by confidence in decreasing order, denoted as D_γ. Next we calculate the number of true positives (TP_γ) and false positives (FP_γ) from the sampled D_γ by the metric introduced in Section 2. Based on TP_γ and FP_γ, recall (R_γ) and precision (P_γ) are easy to obtain. AP is the area under the curve of precision and recall, which is also easy to compute by varying the value of the parameter γ. Finally, mAP is computed by averaging the value of AP across all classes. For VOC2012, VOC2007 and ImageNet, the IoU threshold of mAP is set to 0.5, and for MSCOCO, more comprehensive evaluation metrics are applied. There are six evaluation scores which demonstrate different capabilities of detection algorithms, including performance at different IoU thresholds and on objects of different scales. Some examples of the listed datasets (Pascal VOC, MSCOCO, Open Images and LVIS) are shown in Fig. 11.

6.2. Face detection benchmarks

In this section, we introduce several widely used face detection datasets (WIDER FACE, FDDB and Pascal Face) and the commonly used evaluation metrics.

WIDER FACE [200]. WIDER FACE has a total of 32,203 images with about 400k faces covering a large range of scales. It has three subsets: 40% for training, 10% for validation, and 50% for test. The annotations of the training and validation sets are available online. According
Table 1
Summary of common evaluation metrics for various detection tasks including generic object detection, face detection and pedestrian detection.

| Alias | Meaning | Definition and description |
|---|---|---|
| FPS | Frames per second | The number of images processed per second. |
| Ω | IoU threshold | The IoU threshold to evaluate localization. |
| D_γ | All predictions | Top γ predictions returned by the detectors, ranked by confidence in decreasing order. |
| TP_γ | True positive | Correct predictions from sampled predictions D_γ. |
| FP_γ | False positive | False predictions from sampled predictions D_γ. |
| P_γ | Precision | The fraction of TP_γ out of D_γ. |
| R_γ | Recall | The fraction of TP_γ out of all positive samples. |
| AP | Average Precision | Area under the curve of R_γ and P_γ obtained by varying the value of parameter γ. |
| mAP | mean AP | Average score of AP across all classes. |
| TPR | True Positive Rate | The fraction of positive samples that are correctly detected. |
| FPPI | FP per image | The average number of false positives per image. |
| MR | Log-average miss rate | Average miss rate over different FPPI rates evenly spaced in log-space. |

| Task | Metric | Dataset | Description |
|---|---|---|---|
| Generic object detection | mAP (mean Average Precision) | VOC2007 | mAP at 0.50 IoU threshold over all 20 classes. |
| | | VOC2012 | mAP at 0.50 IoU threshold over all 20 classes. |
| | | OpenImages | mAP at 0.50 IoU threshold over the 500 most frequent classes. |
| | | MSCOCO | AP_coco: mAP averaged over ten Ω: {0.5 : 0.05 : 0.95}; |
| | | | AP_50: mAP at 0.50 IoU threshold; |
| | | | AP_75: mAP at 0.75 IoU threshold; |
| | | | AP_S: AP_coco for small objects of area smaller than 32²; |
| | | | AP_M: AP_coco for objects of area between 32² and 96²; |
| | | | AP_L: AP_coco for large objects of area bigger than 96². |
| Face detection | mAP (mean Average Precision) | Pascal Face | mAP at 0.50 IoU threshold. |
| | | AFW | mAP at 0.50 IoU threshold. |
| | | WIDER FACE | mAP_easy: mAP for easy level faces; |
| | | | mAP_mid: mAP for mid level faces; |
| | | | mAP_hard: mAP for hard level faces. |
| | TPR (True Positive Rate) | FDDB | TPR_dis with 1k FP at 0.50 IoU threshold, at bounding box level; |
| | | | TPR_cont with 1k FP at 0.50 IoU threshold, at ellipse level. |
| Pedestrian detection | mAP (mean Average Precision) | KITTI | mAP_easy: mAP for easy level pedestrians; |
| | | | mAP_mid: mAP for mid level pedestrians; |
| | | | mAP_hard: mAP for hard level pedestrians. |
| | MR (log-average miss rate) | CityPersons | MR: ranging from 1e−2 to 1e0 FPPI. |
| | | Caltech | MR: ranging from 1e−2 to 1e0 FPPI. |
| | | ETH | MR: ranging from 1e−2 to 1e0 FPPI. |
| | | INRIA | MR: ranging from 1e−2 to 1e0 FPPI. |

to the difficulty of detection tasks, it has three splits: Easy, Medium and Hard.

FDDB [199]. The Face Detection Data set and Benchmark (FDDB) is a well-known benchmark with 5171 faces in 2845 images. Commonly, face detectors are first trained on a large-scale dataset (WIDER FACE etc.) and then tested on FDDB.

PASCAL FACE [29]. This dataset was collected from the PASCAL person layout test set, with 1335 labeled faces in 851 images. Similar to FDDB, it is commonly used as a test set only.

Evaluation Metrics. As shown in Table 1, the evaluation metric for WIDER FACE and PASCAL FACE is mean average precision (mAP) with an IoU threshold of 0.5, and for WIDER FACE the results of each difficulty level are reported. For FDDB, the true positive rate (TPR) at 1k false positives is used for evaluation. There are two annotation types to evaluate the FDDB dataset: bounding box level and ellipse level.

6.3. Pedestrian detection benchmarks

In this section we will first introduce five widely used datasets (Caltech, ETH, INRIA, CityPersons and KITTI) for pedestrian detection and then introduce their evaluation metrics.

CityPersons [257] is a new and challenging pedestrian detection dataset built on top of the semantic segmentation dataset CityScapes [258], of which 5000 images are captured in several cities in Germany. A total of 35,000 persons with an additional 13,000 ignored regions are annotated, and both bounding box annotations of all persons and annotations of visible parts are provided.

Caltech [259] is a popular and challenging dataset for pedestrian detection, which comes from approximately 10 hours of 30 Hz VGA video recorded by a car traversing the streets in the greater Los Angeles metropolitan area. The training and testing sets contain 42,782 and 4024 frames, respectively.

ETH [260] contains 1804 frames in three video clips, and it is commonly used as a test set to evaluate the performance of models trained on large-scale datasets (the CityPersons dataset etc.).

INRIA [19] contains high-resolution images of pedestrians collected mostly from holiday photos. It consists of 2120 images, including 1832 images for training and 288 images for testing. Specifically, there are 614 positive images and 1218 negative images in the training set.

KITTI [261] contains 7481 labeled images of resolution 1250 × 375 and another 7518 images for testing. The person class in KITTI is divided into two subclasses, pedestrian and cyclist, both evaluated by mAP. KITTI contains three difficulty levels for evaluation: easy, moderate and hard, which differ in the minimum bounding box height, maximum occlusion level, etc.

Evaluation Metrics. For CityPersons, Caltech, INRIA and ETH, the log-average miss rate (MR) over 9 points ranging from 1e−2 to 1e0 FPPI (False Positives Per Image) is used to evaluate the performance of the detectors (lower is better). For KITTI, standard mean average precision is used as the evaluation metric with a 0.5 IoU threshold.
Table 2
Detection results on the PASCAL VOC datasets. For VOC2007, the models are trained on VOC2007 and VOC2012 trainval sets and tested on the VOC2007 test set. For VOC2012, the models are trained on VOC2007 and VOC2012 trainval sets plus the VOC2007 test set and tested on the VOC2012 test set by default. Because the Pascal VOC datasets are by now well tuned, the number of detection frameworks reporting results on VOC has decreased in recent years.

| Method | Backbone | Proposed year | Input size (test) | mAP (%) VOC2007 | mAP (%) VOC2012 |
|---|---|---|---|---|---|
| Two-stage detectors: | | | | | |
| R-CNN [2] | VGG-16 | 2014 | Arbitrary | 66.0^a | 62.4^b |
| SPP-net [2] | VGG-16 | 2014 | ~600 × 1000 | 63.1^a | – |
| Fast R-CNN [38] | VGG-16 | 2015 | ~600 × 1000 | 70.0 | 68.4 |
| Faster R-CNN [34] | VGG-16 | 2015 | ~600 × 1000 | 73.2 | 70.4 |
| MR-CNN [131] | VGG-16 | 2015 | Multi-scale | 78.2 | 73.9 |
| Faster R-CNN [1] | ResNet-101 | 2016 | ~600 × 1000 | 76.4 | 73.8 |
| R-FCN [52] | ResNet-101 | 2016 | ~600 × 1000 | 80.5 | 77.6 |
| OHEM [148] | VGG-16 | 2016 | ~600 × 1000 | 74.6 | 71.9 |
| HyperNet [50] | VGG-16 | 2016 | ~600 × 1000 | 76.3 | 71.4 |
| ION [51] | VGG-16 | 2016 | ~600 × 1000 | 79.2 | 76.4 |
| CRAFT [153] | VGG-16 | 2016 | ~600 × 1000 | 75.7 | 71.3^b |
| LocNet [149] | VGG-16 | 2016 | ~600 × 1000 | 78.4 | 74.8^b |
| R-FCN w DCN [97] | ResNet-101 | 2017 | ~600 × 1000 | 82.6 | – |
| CoupleNet [125] | ResNet-101 | 2017 | ~600 × 1000 | 82.7 | 80.4 |
| DeNet512(wide) [94] | ResNet-101 | 2017 | ~512 × 512 | 77.1 | 73.9 |
| FPN-Reconfig [115] | ResNet-101 | 2018 | ~600 × 1000 | 82.4 | 81.1 |
| DeepRegionLet [140] | ResNet-101 | 2018 | ~600 × 1000 | 83.3 | 81.3 |
| DCN+R-CNN [132] | ResNet-101+ResNet-152 | 2018 | Arbitrary | 84.0 | 81.2 |
| One-stage detectors: | | | | | |
| YOLOv1 [40] | VGG-16 | 2016 | 448 × 448 | 66.4 | 57.9 |
| SSD512 [42] | VGG-16 | 2016 | 512 × 512 | 79.8 | 78.5 |
| YOLOv2 [41] | Darknet | 2017 | 544 × 544 | 78.6 | 73.5 |
| DSSD513 [112] | ResNet-101 | 2017 | 513 × 513 | 81.5 | 80.0 |
| DSOD300 [107] | DS/64-192-48-1 | 2017 | 300 × 300 | 77.7 | 76.3 |
| RON384 [120] | VGG-16 | 2017 | 384 × 384 | 75.4 | 73.0 |
| STDN513 [111] | DenseNet-169 | 2018 | 513 × 513 | 80.9 | – |
| RefineDet512 [92] | VGG-16 | 2018 | 512 × 512 | 81.8 | 80.1 |
| RFBNet512 [108] | VGG-16 | 2018 | 512 × 512 | 82.2 | – |
| CenterNet [64] | ResNet-101 | 2019 | 512 × 512 | 78.7 | – |
| CenterNet [64] | DLA [64] | 2019 | 512 × 512 | 80.7 | – |

^a This entry reports a model trained with the VOC2007 trainval set only.
^b This entry reports a model trained with the VOC2012 trainval set only.

7. State-of-the-art for object detection

Generic object detection: Pascal VOC2007, VOC2012 and MSCOCO are the three most commonly used datasets for evaluating detection algorithms. Pascal VOC2012 and VOC2007 are mid-scale datasets with 2 or 3 objects per image, and the range of object sizes in the VOC datasets is not large. For MSCOCO, there are nearly 10 objects per image, and the majority of objects are small objects with large scale ranges, which makes detection very challenging. In Tables 2 and 3 we give the benchmarks of VOC2007, VOC2012 and MSCOCO over the recent few years.
Face detection: WIDER FACE is currently the most commonly used benchmark for evaluating face detection algorithms. The high variance of face scales and the large number of faces per image make WIDER FACE the hardest benchmark for face detection, and it is evaluated at three difficulty levels: easy, medium and hard. In Table 4 we give the benchmarks of WIDER FACE over the recent few years.
Pedestrian detection: CityPersons is a new but challenging benchmark for pedestrian detection. The dataset is split into different subsets according to the height and visibility level of the objects, and thus it can evaluate detectors in a more comprehensive manner. The results are listed in Table 5, where MR is used for evaluation (lower is better).

8. Related surveys

Several other surveys have been conducted in parallel with our work [265–269]. Sultana et al. [267] review the existing deep learning based detectors and their training settings. Agarwal et al. [268] review the connection between deep learning and detection algorithms proposed in recent years and explore potential leads by introducing relevant topics such as few-shot detection and life-long detection. Zhao et al. [269] review the existing deep learning based detectors and also provide benchmarks for generic detection and real applications. Jiao et al. [266] cover a series of general detection algorithms and introduce the state-of-the-art methods to explore novel solutions and directions for developing new detectors.
Compared with these surveys, our work not only reviews the existing representative detectors, but also gives a comprehensive analysis of the general components and learning strategies of different detectors. We aim to fully explore the factors which impact detection tasks, which are not covered in most existing surveys. Liu et al. [265] also give a comprehensive understanding of generic object detection as well as an analysis of detector components and learning strategies. However, their work focuses only on generic detection and ignores the importance of detection in real-world applications. In our survey, we also give a comprehensive understanding of the limitations of generic detection algorithms and the strategies for adapting them to real-world applications. Furthermore, we organize the state-of-the-art algorithms for both generic detection and real-world applications to facilitate future research. Finally, based on the trends in the work proposed within the past year, we discuss the future direction of object detection.

9. Concluding remarks and future directions

Object detection has been actively investigated and new state-of-the-art results have been reported almost every few months.

Table 3
Detection performance on the MS COCO test-dev data set. “++” denotes applying inference strategies such as multi-scale testing, horizontal flip, etc.

| Method | Backbone | Year | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|---|
| Two-Stage Detectors: | | | | | | | | |
| Fast R-CNN [38] | VGG-16 | 2015 | 19.7 | 35.9 | – | – | – | – |
| Faster R-CNN [34] | VGG-16 | 2015 | 21.9 | 42.7 | – | – | – | – |
| OHEM [148] | VGG-16 | 2016 | 22.6 | 42.5 | 22.2 | 5.0 | 23.7 | 37.9 |
| ION [51] | VGG-16 | 2016 | 23.6 | 43.2 | 23.6 | 6.4 | 24.1 | 38.3 |
| OHEM++ [148] | VGG-16 | 2016 | 25.5 | 45.9 | 26.1 | 7.4 | 27.7 | 40.3 |
| R-FCN [52] | ResNet-101 | 2016 | 29.9 | 51.9 | – | 10.8 | 32.8 | 45.0 |
| Faster R-CNN+++ [1] | ResNet-101 | 2016 | 34.9 | 55.7 | 37.4 | 15.6 | 38.7 | 50.9 |
| Faster R-CNN w FPN [39] | ResNet-101 | 2016 | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2 |
| DeNet-101(wide) [94] | ResNet-101 | 2017 | 33.8 | 53.4 | 36.1 | 12.3 | 36.1 | 50.8 |
| CoupleNet [125] | ResNet-101 | 2017 | 34.4 | 54.8 | 37.2 | 13.4 | 38.1 | 50.8 |
| Faster R-CNN by G-RMI [167] | Inception-ResNet-v2 | 2017 | 34.7 | 55.5 | 36.7 | 13.5 | 38.1 | 52.0 |
| Deformable R-FCN [52] | Aligned-Inception-ResNet | 2017 | 37.5 | 58.0 | 40.8 | 19.4 | 40.1 | 52.5 |
| Mask-RCNN [3] | ResNeXt-101 | 2017 | 39.8 | 62.3 | 43.4 | 22.1 | 43.2 | 51.2 |
| umd_det [236] | ResNet-101 | 2017 | 40.8 | 62.4 | 44.9 | 23.0 | 43.4 | 53.2 |
| Fitness-NMS [152] | ResNet-101 | 2017 | 41.8 | 60.9 | 44.9 | 21.5 | 45.0 | 57.5 |
| DCN w Relation Net [138] | ResNet-101 | 2018 | 39.0 | 58.6 | 42.9 | – | – | – |
| DeepRegionlets [140] | ResNet-101 | 2018 | 39.3 | 59.8 | – | 21.7 | 43.7 | 50.9 |
| C-Mask RCNN [141] | ResNet-101 | 2018 | 42.0 | 62.9 | 46.4 | 23.4 | 44.7 | 53.8 |
| Group Norm [237] | ResNet-101 | 2018 | 42.3 | 62.8 | 46.2 | – | – | – |
| DCN+R-CNN [132] | ResNet-101+ResNet-152 | 2018 | 42.6 | 65.3 | 46.5 | 26.4 | 46.1 | 56.4 |
| Cascade R-CNN [49] | ResNet-101 | 2018 | 42.8 | 62.1 | 46.3 | 23.7 | 45.5 | 55.2 |
| SNIP++ [98] | DPN-98 | 2018 | 45.7 | 67.3 | 51.1 | 29.3 | 48.8 | 57.1 |
| SNIPER++ [146] | ResNet-101 | 2018 | 46.1 | 67.0 | 51.6 | 29.6 | 48.9 | 58.1 |
| PANet++ [238] | ResNeXt-101 | 2018 | 47.4 | 67.2 | 51.8 | 30.1 | 51.7 | 60.0 |
| Grid R-CNN [151] | ResNeXt-101 | 2019 | 43.2 | 63.0 | 46.6 | 25.1 | 46.5 | 55.2 |
| DCN-v2 [144] | ResNet-101 | 2019 | 44.8 | 66.3 | 48.8 | 24.4 | 48.1 | 59.6 |
| DCN-v2++ [144] | ResNet-101 | 2019 | 46.0 | 67.9 | 50.8 | 27.8 | 49.1 | 59.5 |
| TridentNet [239] | ResNet-101 | 2019 | 42.7 | 63.6 | 46.5 | 23.9 | 46.6 | 56.6 |
| TridentNet [239] | ResNet-101-Deformable | 2019 | 48.4 | 69.7 | 53.5 | 31.8 | 51.3 | 60.3 |
| Single-Stage Detectors: | | | | | | | | |
| SSD512 [42] | VGG-16 | 2016 | 28.8 | 48.5 | 30.3 | 10.9 | 31.8 | 43.5 |
| RON384++ [120] | VGG-16 | 2017 | 27.4 | 49.5 | 27.1 | – | – | – |
| YOLOv2 [41] | DarkNet-19 | 2017 | 21.6 | 44.0 | 19.2 | 5.0 | 22.4 | 35.5 |
| SSD513 [112] | ResNet-101 | 2017 | 31.2 | 50.4 | 33.3 | 10.2 | 34.5 | 49.8 |
| DSSD513 [112] | ResNet-101 | 2017 | 33.2 | 53.3 | 35.2 | 13.0 | 35.4 | 51.1 |
| RetinaNet800++ [43] | ResNet-101 | 2017 | 39.1 | 59.1 | 42.3 | 21.8 | 42.7 | 50.2 |
| STDN513 [111] | DenseNet-169 | 2018 | 31.8 | 51.0 | 33.6 | 14.4 | 36.1 | 43.4 |
| FPN-Reconfig [115] | ResNet-101 | 2018 | 34.6 | 54.3 | 37.3 | – | – | – |
| RefineDet512 [92] | ResNet-101 | 2018 | 36.4 | 57.5 | 39.5 | 16.6 | 39.9 | 51.4 |
| RefineDet512++ [92] | ResNet-101 | 2018 | 41.8 | 62.9 | 45.7 | 25.6 | 45.1 | 54.1 |
| GHM SSD [147] | ResNeXt-101 | 2018 | 41.6 | 62.8 | 44.2 | 22.3 | 45.1 | 55.3 |
| CornerNet511 [63] | Hourglass-104 | 2018 | 40.5 | 56.5 | 43.1 | 19.4 | 42.7 | 53.9 |
| CornerNet511++ [63] | Hourglass-104 | 2018 | 42.1 | 57.8 | 45.3 | 20.8 | 44.8 | 56.7 |
| M2Det800 [116] | VGG-16 | 2019 | 41.0 | 59.7 | 45.0 | 22.1 | 46.5 | 53.8 |
| M2Det800++ [116] | VGG-16 | 2019 | 44.2 | 64.6 | 49.3 | 29.2 | 47.9 | 55.1 |
| ExtremeNet [240] | Hourglass-104 | 2019 | 40.2 | 55.5 | 43.2 | 20.4 | 43.2 | 53.1 |
| CenterNet-HG [64] | Hourglass-104 | 2019 | 42.1 | 61.1 | 45.9 | 24.1 | 45.5 | 52.8 |
| FCOS [241] | ResNeXt-101 | 2019 | 42.1 | 62.1 | 45.2 | 25.6 | 44.9 | 52.0 |
| FSAF [95] | ResNeXt-101 | 2019 | 42.9 | 63.8 | 46.3 | 26.6 | 46.2 | 52.7 |
| CenterNet511 [65] | Hourglass-104 | 2019 | 44.9 | 62.4 | 48.1 | 25.6 | 47.4 | 57.4 |
| CenterNet511++ [65] | Hourglass-104 | 2019 | 47.0 | 64.5 | 50.7 | 28.9 | 49.9 | 58.9 |
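To make the AP50 and AP75 columns of Table 3 concrete: under the MS COCO convention, AP50 and AP75 count a detection as correct when its IoU with a ground-truth box reaches 0.5 and 0.75 respectively, while AP averages over IoU thresholds from 0.5 to 0.95. The minimal Python sketch below shows only the IoU test behind that distinction. The function names, the (x1, y1, x2, y2) box format and the use of `>=` at the threshold are assumptions for illustration; it is not the official evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matches_at(pred_box, gt_box, threshold):
    """Hypothetical helper: does the prediction count as a hit at this IoU?"""
    return iou(pred_box, gt_box) >= threshold


# Hypothetical boxes: the prediction is shifted 2 pixels to the right.
gt = (0, 0, 10, 10)
pred = (2, 0, 12, 10)
print(iou(pred, gt))               # overlap 80 over union 120
print(matches_at(pred, gt, 0.5))   # loose threshold used by AP50
print(matches_at(pred, gt, 0.75))  # strict threshold used by AP75
```

The same slightly shifted box passes the 0.5 threshold but fails the 0.75 one, which is why a detector's AP75 can sit well below its AP50 when its localization is imprecise.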

However, there are still many open challenges. Below we discuss several open challenges and future directions.
(i) Scalable proposal generation strategy. As discussed in Section 3.4, most current detectors are anchor-based methods, and they have some critical shortcomings which limit detection accuracy. Current anchor priors are mostly designed by hand, which makes it difficult to match multi-scale objects, and the IoU-based matching strategy is also heuristic. Although some methods have been proposed to transform anchor-based methods into anchor-free methods (e.g. methods based on keypoints), they still have limitations (high computation cost etc.) and leave considerable room for improvement. As Fig. 2 shows, developing anchor-free methods has become a very active topic in object detection [63,65,95,240,241], and thus designing an efficient and effective proposal generation strategy is potentially a very important research direction in the future.
(ii) Effective encoding of contextual information. Context can either help or hinder visual object detection, since objects in the visual world have strong relationships with one another, and context is critical for understanding the visual world. However, little effort has been devoted to how to use contextual information correctly. How to incorporate context into object detection effectively can be a promising future direction.
(iii) Detection based on Auto Machine Learning (AutoML). Designing an optimal backbone architecture for a given task can significantly improve results but also requires huge engineering effort. Thus, learning the backbone architecture directly from the datasets is a very interesting and important research direction. As Fig. 2 shows, inspired by the pioneering AutoML work on image classification [270,271], more related work has been proposed to address detection problems via AutoML [272,273], such as learning FPN structure [273] and learning data augmentation policies [274],

which show significant improvement over the baselines. However, the computational resources required for AutoML are unaffordable for most researchers (more than 100 GPU cards to train a single model). Thus, developing a low-computation framework would have a large impact on object detection. Further, new structure policies for the detection task (such as proposal generation and region encoding) can be explored in the future.
(iv) Emerging benchmarks for object detection. Currently MSCOCO is the most commonly used detection benchmark. However, MSCOCO has only 80 categories, which is still too few to understand more complicated scenes in the real world. Recently, a new benchmark dataset, LVIS [235], has been proposed in order to collect richer categorical information. LVIS contains 164,000 images with 1000+ categories, and a total of 2.2 million high-quality instance segmentation masks. Further, LVIS simulates the real-world low-shot scenario where a large number of categories are present but per-category data is sometimes scarce. LVIS will open a new benchmark for more challenging detection, segmentation and low-shot learning tasks in the near future.
(v) Low-shot object detection. Training detectors with limited labeled data is dubbed low-shot detection. Deep learning based detectors often have a huge number of parameters and are thus data-hungry, requiring a large amount of labeled data to achieve satisfactory performance. However, labeling objects in images with bounding box level annotation is very time-consuming. Low-shot learning has been actively studied for classification tasks, but only a few studies focus on detection tasks. For example, Multi-modal Self-Paced Learning for Detection (MSPLD) [275] addresses the low-shot detection problem in a semi-supervised learning setting where a large-scale unlabeled dataset is available. RepMet [276] adopts a Deep Metric Learning (DML) structure, which jointly learns the feature embedding space and the data distribution of the training set categories. However, RepMet was only tested on datasets with similar concepts (animals). Low-Shot Transfer Detector (LSTD) [277] addresses low-shot detection with transfer learning, which transfers knowledge from large annotated external datasets to the target set by knowledge regularization. LSTD still suffers from overfitting. There is still much room for improvement in low-shot detection.
(vi) Backbone architecture for detection task. It has become common practice to adopt the weights of classification models pre-trained on a large-scale dataset for detection. However, there still exist conflicts between classification and detection tasks [78], and thus directly adopting a pretrained network may not result in the optimal solution. As Table 3 shows, most state-of-the-art detection algorithms are based on classification backbones, and only a few of them try different choices (such as CornerNet, which is based on Hourglass Net). Thus, developing a detection-aware backbone architecture is also an important research direction for the future.
(vii) Other research issues. In addition, there are some other open research issues, such as large batch learning [278] and incremental learning [279]. Batch size is a key factor in DCNN training but has not been well studied for detection. For incremental learning, detection algorithms still suffer from catastrophic forgetting if adapted to a new task without the initial training data. These open and fundamental research issues also deserve more attention in future work.
In this paper, we give a comprehensive survey of recent advances in deep learning techniques for object detection tasks. The main contents of this survey are divided into three major categories: object detector components, machine learning strategies, and real-world applications and benchmark evaluations. We have reviewed a large body of representative articles in recent literature, and presented the contributions on this important topic in a structured and systematic manner. We hope this survey can give readers a comprehensive understanding of object detection with deep learning and potentially spur more research work on object detection techniques and their applications.

Table 4
Detection results on WIDER FACE dataset. The models are trained on WIDER FACE training sets and tested on WIDER FACE test set.

| Method | Year | Easy mAP (%) | Medium mAP (%) | Hard mAP (%) |
|---|---|---|---|---|
| ACF-WIDER [242] | 2014 | 69.5 | 58.8 | 29.0 |
| Faceness [243] | 2015 | 71.6 | 60.4 | 31.5 |
| Two-stage CNN [200] | 2016 | 65.7 | 58.9 | 30.4 |
| LDCF+ [244] | 2016 | 79.7 | 77.2 | 56.4 |
| CMS-CNN [195] | 2016 | 90.2 | 87.4 | 64.3 |
| MSCNN [106] | 2016 | 91.7 | 90.3 | 80.9 |
| ScaleFace [245] | 2017 | 86.7 | 86.6 | 76.4 |
| HR [99] | 2017 | 92.3 | 91.0 | 81.9 |
| SHH [189] | 2017 | 92.7 | 91.5 | 84.4 |
| Face R-CNN [191] | 2017 | 93.2 | 91.6 | 82.7 |
| S3FD [87] | 2017 | 93.5 | 92.1 | 85.8 |
| Face R-FCN [198] | 2017 | 94.3 | 93.1 | 87.6 |
| FAN [246] | 2017 | 94.6 | 93.6 | 88.5 |
| FANet [188] | 2017 | 94.7 | 93.9 | 88.7 |
| FDNet [247] | 2018 | 95.0 | 93.9 | 87.8 |
| PyramidBox [185] | 2018 | 95.6 | 94.6 | 88.7 |
| SRN [186] | 2018 | 95.9 | 94.8 | 89.6 |
| DSFD [187] | 2018 | 96.0 | 95.3 | 90.0 |
| DFS [248] | 2018 | 96.3 | 95.4 | 90.7 |
| SFDet [249] | 2019 | 94.8 | 94.0 | 88.3 |
| CSP [250] | 2019 | 94.9 | 94.4 | 89.9 |
| PyramidBox++ [251] | 2019 | 95.6 | 95.2 | 90.9 |
| VIM-FD [252] | 2019 | 96.2 | 95.3 | 90.2 |
| ISRN [253] | 2019 | 96.3 | 95.4 | 90.3 |
| RetinaFace [254] | 2019 | 96.3 | 95.6 | 91.4 |
| AlnnoFace [255] | 2019 | 96.5 | 95.7 | 91.2 |
| RefineFace [256] | 2019 | 96.6 | 95.8 | 91.4 |

Table 5
Detection results on CityPersons dataset. The models are trained on CityPersons training sets and tested on CityPersons test set. There are four evaluation metrics: Reasonable (R.), Small (S.), Heavy (H.) and All (A.), which are related to the height and visibility level of the objects.

| Method | Year | R. | S. | H. | A. |
|---|---|---|---|---|---|
| FRCNN [38] | 2015 | 12.97 | 37.24 | 50.47 | 43.86 |
| MS-CNN [106] | 2016 | 13.32 | 15.86 | 51.88 | 39.94 |
| RepLoss [213] | 2017 | 11.48 | 15.67 | 52.59 | 39.17 |
| Ada-FRCN [257] | 2017 | 12.97 | 37.24 | 50.47 | 43.86 |
| OR-CNN [214] | 2018 | 11.32 | 14.19 | 51.43 | 40.19 |
| HBAN [262] | 2019 | 11.26 | 15.68 | 39.54 | 38.77 |
| MGAN [263] | 2019 | 9.29 | 11.38 | 40.97 | 38.86 |
| APD [264] | 2019 | 8.27 | 11.03 | 35.45 | 35.65 |

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Xiongwei Wu: Conceptualization, Methodology, Software, Investigation, Writing - original draft, Writing - review & editing. Doyen Sahoo: Writing - review & editing, Investigation, Validation. Steven C.H. Hoi: Supervision.

References

[1] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the CVPR, 2016.

[2] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the CVPR, 2014.
[3] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, in: Proceedings of the ICCV, 2017.
[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A.L. Yuille, Semantic image segmentation with deep convolutional nets and fully connected CRFs, 2014. arXiv: 1412.7062.
[5] Y. Sun, D. Liang, X. Wang, X. Tang, Deepid3: face recognition with very deep neural networks, 2015. arXiv: 1502.00873.
[6] Y. Sun, Y. Chen, X. Wang, X. Tang, Deep learning face representation by joint identification-verification, in: Proceedings of the NeurIPS, 2014.
[7] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, L. Song, Sphereface: deep hypersphere embedding for face recognition, in: Proceedings of the CVPR, 2017.
[8] J. Li, X. Liang, S. Shen, T. Xu, J. Feng, S. Yan, Scale-aware fast R-CNN for pedestrian detection, in: Proceedings of the IEEE Transactions on Multimedia, 2018.
[9] J. Hosang, M. Omran, R. Benenson, B. Schiele, Taking a deeper look at pedestrians, in: Proceedings of the CVPR, 2015.
[10] A. Angelova, A. Krizhevsky, V. Vanhoucke, A.S. Ogale, D. Ferguson, Real-time pedestrian detection with deep network cascades, in: Proceedings of the BMVC, 2015.
[11] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, L. Fei-Fei, Large-scale video classification with convolutional neural networks, in: Proceedings of the CVPR, 2014.
[12] H. Mobahi, R. Collobert, J. Weston, Deep learning from temporal coherence in video, in: Proceedings of the Annual International Conference on Machine Learning, 2009.
[13] S.C. Hoi, X. Wu, H. Liu, Y. Wu, H. Wang, H. Xue, Q. Wu, Logo-net: large-scale deep logo detection and brand recognition with deep region-based convolutional networks, 2015. arXiv: 1511.02462.
[14] H. Su, X. Zhu, S. Gong, Deep learning logo detection with data expansion by synthesising context, in: Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), 2017.
[15] H. Su, S. Gong, X. Zhu, Scalable deep learning logo detection, 2018. arXiv: 1803.11417.
[16] A. Vedaldi, V. Gulshan, M. Varma, A. Zisserman, Multiple kernels for object detection, in: Proceedings of the ICCV, 2009.
[17] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the CVPR, 2001.
[18] H. Harzallah, F. Jurie, C. Schmid, Combining efficient object localization and image classification, in: Proceedings of the ICCV, 2009.
[19] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Proceedings of the CVPR, 2005.
[20] P. Viola, M.J. Jones, Robust real-time face detection, in: Proceedings of the IJCV, 2004.
[21] D.G. Lowe, Object recognition from local scale-invariant features, in: Proceedings of the ICCV, 1999.
[22] R. Lienhart, J. Maydt, An extended set of haar-like features for rapid object detection, in: Proceedings of the International Conference on Image Processing, 2002.
[23] H. Bay, T. Tuytelaars, L. Van Gool, Surf: speeded up robust features, in: Proceedings of the ECCV, 2006.
[24] M.A. Hearst, S.T. Dumais, E. Osuna, J. Platt, B. Scholkopf, Support vector machines, in: Proceedings of the IEEE Intelligent Systems and their applications, 1998.
[25] D. Opitz, R. Maclin, Popular ensemble methods: an empirical study, in: Proceedings of the Artificial Intelligence Research, 1999.
[26] Y. Freund, R.E. Schapire, et al., Experiments with a new boosting algorithm, in: Proceedings of the ICML, 1996.
[27] Y. Yu, J. Zhang, Y. Huang, S. Zheng, W. Ren, C. Wang, K. Huang, T. Tan, Object detection by context and boosted hog-LBP, in: Proceedings of the PASCAL VOC Challenge, 2010.
[28] P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan, Discriminatively trained mixtures of deformable part models, in: Proceedings of the PASCAL VOC Challenge, 2008.
[29] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, A. Zisserman, The pascal visual object classes (VOC) challenge, in: Proceedings of the IJCV, 2010.
[30] P.F. Felzenszwalb, R.B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, in: Proceedings of the TPAMI, 2010.
[31] D.G. Lowe, Distinctive image features from scale-invariant keypoints, in: Proceedings of the IJCV, 2004.
[32] T. Ojala, M. Pietikainen, T. Maenpaa, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, in: Proceedings of the TPAMI, 2002.
[33] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Proceedings of the NeurIPS, 2012.
[34] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, in: Proceedings of the NeurIPS, 2015.
[35] K. Fukushima, S. Miyake, Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition, in: Proceedings of the Competition and Cooperation in Neural Nets, 1982.
[36] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, in: Proceedings of the IEEE, 1998.
[37] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: a large-scale hierarchical image database, in: Proceedings of the CVPR, 2009.
[38] R. Girshick, Fast R-CNN, in: Proceedings of the ICCV, 2015.
[39] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: Proceedings of the CVPR, 2017.
[40] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object detection, in: Proceedings of the CVPR, 2016.
[41] J. Redmon, A. Farhadi, YOLO9000: better, faster, stronger, in: Proceedings of the CVPR, 2017.
[42] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A.C. Berg, SSD: Single shot multibox detector, in: Proceedings of the ECCV, 2016.
[43] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the ICCV, 2017.
[44] S. Fidler, R. Mottaghi, A. Yuille, R. Urtasun, Bottom-up segmentation for top-down detection, in: Proceedings of the CVPR, 2013.
[45] J.R. Uijlings, K.E. Van De Sande, T. Gevers, A.W. Smeulders, Selective search for object recognition, in: Proceedings of the IJCV, 2013.
[46] J. Kleban, X. Xie, W.-Y. Ma, Spatial pyramid mining for logo detection in natural scenes, in: Proceedings of the 2008 IEEE International Conference on Multimedia and Expo, 2008.
[47] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, in: Proceedings of the ECCV, 2014.
[48] C.L. Zitnick, P. Dollár, Edge boxes: locating object proposals from edges, in: Proceedings of the ECCV, 2014.
[49] Z. Cai, N. Vasconcelos, Cascade R-CNN: delving into high quality object detection, in: Proceedings of the CVPR, 2018.
[50] T. Kong, A. Yao, Y. Chen, F. Sun, Hypernet: towards accurate region proposal generation and joint object detection, in: Proceedings of the CVPR, 2016.
[51] S. Bell, C. Lawrence Zitnick, K. Bala, R. Girshick, Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks, in: Proceedings of the CVPR, 2016.
[52] J. Dai, Y. Li, K. He, J. Sun, R-FCN: object detection via region-based fully convolutional networks, in: Proceedings of the NeurIPS, 2016.
[53] K. Kang, W. Ouyang, H. Li, X. Wang, Object detection from video tubelets with convolutional neural networks, in: Proceedings of the CVPR, 2016.
[54] W. Han, P. Khorrami, T.L. Paine, P. Ramachandran, M. Babaeizadeh, H. Shi, J. Li, S. Yan, T.S. Huang, SEQ-NMS for video object detection, 2016. arXiv: 1602.08465.
[55] M. Rayat Imtiaz Hossain, J. Little, Exploiting temporal information for 3D human pose estimation, in: Proceedings of the ECCV, 2018.
[56] G. Pavlakos, X. Zhou, K.G. Derpanis, K. Daniilidis, Coarse-to-fine volumetric prediction for single-image 3D human pose, in: Proceedings of the CVPR, 2017.
[57] P.O. Pinheiro, T.-Y. Lin, R. Collobert, P. Dollár, Learning to refine object segments, in: Proceedings of the ECCV, 2016.
[58] P.O. Pinheiro, R. Collobert, P. Dollár, Learning to segment object candidates, in: Proceedings of the NeurIPS, 2015.
[59] J. Dai, K. He, J. Sun, Instance-aware semantic segmentation via multi-task network cascades, in: Proceedings of the CVPR, 2016.
[60] Z. Huang, L. Huang, Y. Gong, C. Huang, X. Wang, Mask scoring R-CNN, in: Proceedings of the CVPR, 2019.
[61] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, Y. LeCun, Overfeat: integrated recognition, localization and detection using convolutional networks, 2013. arXiv: 1312.6229.
[62] S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, in: Proceedings of the ICML, 2015.
[63] H. Law, J. Deng, Cornernet: detecting objects as paired keypoints, in: Proceedings of the ECCV, 2018.
[64] X. Zhou, D. Wang, P. Krähenbühl, Objects as points, 2019. arXiv: 1904.07850.
[65] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, Q. Tian, Centernet: keypoint triplets for object detection, 2019. arXiv: 1904.08189.
[66] H. Robbins, S. Monro, A stochastic approximation method, The Annals of Mathematical Statistics, 1951.
[67] D.P. Kingma, J. Ba, Adam: a method for stochastic optimization, 2014. arXiv: 1412.6980.
[68] V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines, in: Proceedings of the ICML, 2010.
[69] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014. arXiv: 1409.1556.
[70] K. He, X. Zhang, S. Ren, J. Sun, Identity mappings in deep residual networks, in: Proceedings of the ECCV, Springer, 2016.
[71] G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the CVPR, 2017.
[72] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, J. Feng, Dual path networks, in: Proceedings of the NeurIPS, 2017, pp. 4467–4475.
[73] S. Xie, R. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual transformations for deep neural networks, in: Proceedings of the CVPR, 2017.
[74] A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, Mobilenets: efficient convolutional neural networks for mobile vision applications, 2017. arXiv: 1704.04861.
[75] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the CVPR, 2015.
[76] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the CVPR, 2016.
[77] C. Szegedy, S. Ioffe, V. Vanhoucke, A.A. Alemi, Inception-v4, inception-resnet and the impact of residual connections on learning, in: Proceedings of the AAAI, 2017.

[78] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, J. Sun, Detnet: a backbone network for object detection, in: Proceedings of the ECCV, 2018.
[79] A. Newell, K. Yang, J. Deng, Stacked hourglass networks for human pose estimation, in: Proceedings of the ECCV, 2016.
[80] B. Alexe, T. Deselaers, V. Ferrari, Measuring the objectness of image windows, in: Proceedings of the TPAMI, 2012.
[81] E. Rahtu, J. Kannala, M. Blaschko, Learning a category independent object detection cascade, in: Proceedings of the ICCV, 2011.
[82] P.F. Felzenszwalb, D.P. Huttenlocher, Efficient graph-based image segmentation, in: Proceedings of the IJCV, 2004.
[83] S. Manen, M. Guillaumin, L. Van Gool, Prime object proposals with randomized prim’s algorithm, in: Proceedings of the CVPR, 2013.
[84] J. Carreira, C. Sminchisescu, CPMC: automatic object segmentation using constrained parametric min-cuts, in: Proceedings of the TPAMI, 2011.
[85] I. Endres, D. Hoiem, Category-independent object proposals with diverse ranking, in: Proceedings of the TPAMI, 2014.
[86] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, Microsoft coco: common objects in context, in: Proceedings of the ECCV, 2014.
[87] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, S.Z. Li, S3FD: single shot scale-invariant face detector, in: Proceedings of the ICCV, 2017.
[88] W. Luo, Y. Li, R. Urtasun, R. Zemel, Understanding the effective receptive field in deep convolutional neural networks, in: Proceedings of the NeurIPS, 2016.
[89] C. Zhu, R. Tao, K. Luu, M. Savvides, Seeing small faces from robust anchor perspective, in: Proceedings of the CVPR, 2018.
[90] L.J.Z.X. Lele Xie, Y. Liu, DERPN: taking a further step toward more general object detection, in: Proceedings of the AAAI, 2019.
[91] A. Ghodrati, A. Diba, M. Pedersoli, T. Tuytelaars, L. Van Gool, Deepproposal: hunting objects by cascading deep convolutional layers, in: Proceedings of the ICCV, 2015.
[92] S. Zhang, L. Wen, X. Bian, Z. Lei, S.Z. Li, Single-shot refinement neural network for object detection, in: Proceedings of the CVPR, 2018.
[93] T. Yang, X. Zhang, Z. Li, W. Zhang, J. Sun, Metaanchor: learning to detect objects with customized anchors, in: Proceedings of the NeurIPS, 2018.
[94] L. Tychsen-Smith, L. Petersson, Denet: scalable real-time object detection with directed sparse sampling, in: Proceedings of the ICCV, 2017.
[95] C. Zhu, Y. He, M. Savvides, Feature selective anchor-free module for single-shot object detection, in: Proceedings of the CVPR, 2019.
[96] Y. Lu, T. Javidi, S. Lazebnik, Adaptive object detection using adjacency and zoom prediction, in: Proceedings of the CVPR, 2016.
[97] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, Y. Wei, Deformable convolutional networks, in: Proceedings of the ICCV, 2017.
[98] B. Singh, L.S. Davis, An analysis of scale invariance in object detection–snip, in: Proceedings of the CVPR, 2018.
[99] P. Hu, D. Ramanan, Finding tiny faces, in: Proceedings of the CVPR, 2017.
[117] Z. Li, F. Zhou, FSSD: feature fusion single shot multibox detector, 2017. arXiv: 1712.00960.
[118] K. Lee, J. Choi, J. Jeong, N. Kwak, Residual features and unified prediction network for single stage detection, 2017. arXiv: 1707.05031.
[119] L. Cui, MDSSD: multi-scale deconvolutional single shot detector for small objects, 2018. arXiv: 1805.07009.
[120] T. Kong, F. Sun, A. Yao, H. Liu, M. Lu, Y. Chen, RON: reverse connection with objectness prior networks for object detection, in: Proceedings of the CVPR, 2017.
[121] B. Lim, S. Son, H. Kim, S. Nah, K. Mu Lee, Enhanced deep residual networks for single image super-resolution, in: Proceedings of the CVPR Workshops, 2017.
[122] W. Shi, J. Caballero, F. Huszár, J. Totz, A.P. Aitken, R. Bishop, D. Rueckert, Z. Wang, Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network, in: Proceedings of the CVPR, 2016.
[123] B. Jiang, R. Luo, J. Mao, T. Xiao, Y. Jiang, Acquisition of localization confidence for accurate object detection, in: Proceedings of the ECCV, 2018.
[124] Y. Zhai, J. Fu, Y. Lu, H. Li, Feature selective networks for object detection, in: Proceedings of the CVPR, 2018.
[125] Y. Zhu, C. Zhao, J. Wang, X. Zhao, Y. Wu, H. Lu, Couplenet: coupling global structure with local parts for object detection, in: Proceedings of the ICCV, 2017.
[126] C. Galleguillos, S. Belongie, Context based object categorization: a critical survey, in: Proceedings of the Computer Vision and Image Understanding, 2010.
[127] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, C.-C. Loy, et al., Deepid-net: deformable deep convolutional neural networks for object detection, in: Proceedings of the CVPR, 2015.
[128] W. Chu, D. Cai, Deep feature based contextual model for object detection, in: Proceedings of the Neurocomputing, 2018.
[129] Y. Zhu, R. Urtasun, R. Salakhutdinov, S. Fidler, SEGDEEPM: exploiting segmentation and context in deep neural networks for object detection, in: Proceedings of the CVPR, 2015.
[130] X. Chen, A. Gupta, Spatial memory for context reasoning in object detection, in: Proceedings of the ICCV, 2017.
[131] S. Gidaris, N. Komodakis, Object detection via a multi-region and semantic segmentation-aware CNN model, in: Proceedings of the ICCV, 2015.
[132] B. Cheng, Y. Wei, H. Shi, R. Feris, J. Xiong, T. Huang, Revisiting RCNN: on awakening the classification power of faster RCNN, in: Proceedings of the ECCV, 2018.
[133] X. Zhao, S. Liang, Y. Wei, Pseudo mask augmented object detection, in: Proceedings of the CVPR, 2018.
[134] Z. Zhang, S. Qiao, C. Xie, W. Shen, B. Wang, A.L. Yuille, Single-shot object detection with enriched semantics, in: Proceedings of the CVPR, 2018.
[135] A. Shrivastava, A. Gupta, Contextual priming and feedback for faster R-CNN, in: Proceedings of the ECCV, 2016.
[100] F. Yang, W. Choi, Y. Lin, Exploit all the layers: Fast and accurate CNN object [136] B. Li, T. Wu, L. Zhang, R. Chu, Auto-context R-CNN, 2018. arXiv: 1807.02842.
detector with scale dependent pooling and cascaded rejection classifiers, in: [137] Y. Liu, R. Wang, S. Shan, X. Chen, Structure inference net: object detection
Proceedings of the CVPR, 2016. using scene-level context and instance-level relationships, in: Proceedings of
[101] Y. Liu, H. Li, J. Yan, F. Wei, X. Wang, X. Tang, Recurrent scale approximation the CVPR, 2018.
for object detection in CNN, in: Proceedings of the ICCV, 2017. [138] H. Hu, J. Gu, Z. Zhang, J. Dai, Y. Wei, Relation networks for object detection,
[102] A. Shrivastava, R. Sukthankar, J. Malik, A. Gupta, Beyond skip connections: in: Proceedings of the CVPR, 2018.
top-down modulation for object detection, 2016. arXiv: 1612.06851. [139] J. Gu, H. Hu, L. Wang, Y. Wei, J. Dai, Learning region features for object detec-
[103] H. Wang, Q. Wang, M. Gao, P. Li, W. Zuo, Multi-scale location-aware kernel tion, in: Proceedings of the ECCV, 2018.
representation for object detection, in: Proceedings of the CVPR, 2018. [140] H. Xu, X. Lv, X. Wang, Z. Ren, R. Chellappa, Deep regionlets for object detec-
[104] K.-H. Kim, S. Hong, B. Roh, Y. Cheon, M. Park, PVANET: deep but lightweight tion, in: Proceedings of the ECCV, 2018.
neural networks for real-time object detection, 2016. arXiv: 1608.08021. [141] Z. Chen, S. Huang, D. Tao, Context refinement for object detection, in: Pro-
[105] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for ceedings of the ECCV, 2018.
biomedical image segmentation, in: Proceedings of the International Confer- [142] X. Zeng, W. Ouyang, B. Yang, J. Yan, X. Wang, Gated bi-directional CNN for
ence on Medical image Computing and Computer-Assisted Intervention, 2015. object detection, in: Proceedings of the ECCV, 2016.
[106] Z. Cai, Q. Fan, R.S. Feris, N. Vasconcelos, A unified multi-scale deep convolu- [143] J. Li, Y. Wei, X. Liang, J. Dong, T. Xu, J. Feng, S. Yan, Attentive contexts for ob-
tional neural network for fast object detection, in: Proceedings of the ECCV, ject detection, in: Proceedings of the IEEE Transactions on Multimedia, 2017.
2016. [144] X. Zhu, S. Lin, H. Hu, J. Dai, Deformable convnets v2: more deformable, better
[107] Z. Shen, Z. Liu, J. Li, Y.-G. Jiang, Y. Chen, X. Xue, Dsod: learning deeply super- results, in: Proceedings of the CVPR, 2019.
vised object detectors from scratch, in: Proceedings of the ICCV, 2017. [145] R. Girshick, F. Iandola, T. Darrell, J. Malik, Deformable part models are convo-
[108] S. Liu, D. Huang, Y. Wang, Receptive field block net for accurate and fast ob- lutional neural networks, in: Proceedings of the CVPR, 2015.
ject detection, in: Proceedings of the ECCV, 2018. [146] B. Singh, M. Najibi, L.S. Davis, Sniper: Efficient multi-scale training, in: Pro-
[109] J. Ren, X. Chen, J. Liu, W. Sun, J. Pang, Q. Yan, Y.-W. Tai, L. Xu, Accurate sin- ceedings of the NeurIPS, 2018.
gle stage detector using recurrent rolling convolution, in: Proceedings of the [147] B. Li, Y. Liu, X. Wang, Gradient harmonized single-stage detector, in: Proceed-
CVPR, 2017. ings of the AAAI, 2019.
[110] J. Jeong, H. Park, N. Kwak, Enhancement of SSD by concatenating feature [148] A. Shrivastava, A. Gupta, R. Girshick, Training region-based object detectors
maps for object detection, 2017. arXiv: 1705.09587. with online hard example mining, in: Proceedings of the CVPR, 2016.
[111] P. Zhou, B. Ni, C. Geng, J. Hu, Y. Xu, Scale-transferrable object detection, in: [149] S. Gidaris, N. Komodakis, Locnet: Improving localization accuracy for object
Proceedings of the CVPR, 2018. detection, in: Proceedings of the CVPR, 2016.
[112] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, A.C. Berg, DSSD: deconvolutional single [150] S. Zagoruyko, A. Lerer, T.-Y. Lin, P.O. Pinheiro, S. Gross, S. Chintala, P. Dollár, A
shot detector, 2017. arXiv: 1701.06659. multipath network for object detection, in: Proceedings of the BMVC, 2016.
[113] S. Woo, S. Hwang, I.S. Kweon, Stairnet: Top-down semantic aggregation for [151] X. Lu, B. Li, Y. Yue, Q. Li, J. Yan, Grid R-CNN, in: Proceedings of the CVPR,
accurate one shot detection, in: Proceedings of the 2018 IEEE Winter Confer- 2019.
ence on Applications of Computer Vision (WACV), 2018. [152] L. Tychsen-Smith, L. Petersson, Improving object localization with fitness
[114] H. Li, Y. Liu, W. Ouyang, X. Wang, Zoom out-and-in network with recursive NMS and bounded IOU loss, 2017. arXiv: 1711.00164.
training for object proposal, 2017. arXiv: 1702.05711. [153] B. Yang, J. Yan, Z. Lei, S.Z. Li, Craft objects from images, in: Proceedings of the
[115] T. Kong, F. Sun, W. Huang, H. Liu, Deep feature pyramid reconfiguration for CVPR, 2016.
object detection, in: Proceedings of the ECCV, 2018. [154] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
[116] Q. Zhao, T. Sheng, Y. Wang, Z. Tang, Y. Chen, L. Cai, H. Ling, M2DET: a sin- A. Courville, Y. Bengio, Generative adversarial nets, in: Proceedings of the
gle-shot object detector based on multi-level feature pyramid network, in: NeurIPS, 2014.
Proceedings of the AAAI, 2019. [155] J.-Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation us-
ing cycle-consistent adversarial networks, in: Proceedings of the ICCV, 2017.
[156] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with [193] P. Samangouei, M. Najibi, L. Davis, R. Chellappa, Face-magnet: magnifying fea-
deep convolutional generative adversarial networks, 2015. arXiv: 1511.06434. ture maps to detect small faces, 2018. arXiv: 1803.05258.
[157] A. Brock, J. Donahue, K. Simonyan, Large scale GAN training for high fidelity [194] C. Zhang, X. Xu, D. Tu, Face detection using improved faster RCNN, 2018.
natural image synthesis, 2018. arXiv: 1809.11096. arXiv: 1802.02142.
[158] J. Li, X. Liang, Y. Wei, T. Xu, J. Feng, S. Yan, Perceptual generative adversarial [195] C. Zhu, Y. Zheng, K. Luu, M. Savvides, CMS-RCNN: contextual multi-scale re-
networks for small object detection, in: Proceedings of the CVPR, 2017. gion-based CNN for unconstrained face detection, in: Proceedings of the Deep
[159] X. Wang, A. Shrivastava, A. Gupta, A-fast-RCNN: hard positive generation via Learning for Biometrics, 2017.
adversary for object detection, in: Proceedings of the CVPR, 2017. [196] B. Yu, D. Tao, Anchor cascade for efficient face detection, 2018. arXiv: 1805.
[160] K. He, R. Girshick, P. Dollár, Rethinking imagenet pre-training, 2018. arXiv: 03363.
1811.08883. [197] K. Zhang, Z. Zhang, H. Wang, Z. Li, Y. Qiao, W. Liu, Detecting faces using inside
[161] R. Zhu, S. Zhang, X. Wang, L. Wen, H. Shi, L. Bo, T. Mei, Scratchdet: explor- cascaded contextual CNN, in: Proceedings of the ICCV, 2017.
ing to train single-shot object detectors from scratch, in: Proceedings of the [198] Y. Wang, X. Ji, Z. Zhou, H. Wang, Z. Li, Detecting faces using region-based fully
CVPR, 2019. convolutional networks, 2017. arXiv: 1709.05256.
[162] Z. Shen, H. Shi, R. Feris, L. Cao, S. Yan, D. Liu, X. Wang, X. Xue, T.S. Huang, [199] V. Jain, E. Learned-Miller, FDDB: A Benchmark for Face Detection in Un-
Learning object detectors from scratch with gated recurrent feature pyramids, constrained Settings, Technical Report UM-CS-2010-009, University of Mas-
2017. arXiv: 1712.00886. sachusetts, Amherst, 2010.
[163] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, [200] S. Yang, P. Luo, C.-C. Loy, X. Tang, Wider face: a face detection benchmark,
2015. arXiv: 1503.02531. in: Proceedings of the CVPR, 2016.
[164] Q. Li, S. Jin, J. Yan, Mimicking very efficient network for object detection, in: [201] J. Han, W. Nam, P. Dollar, Local decorrelation for improved detection, in: Pro-
Proceedings of the CVPR, 2017. ceedings of the NeurIPS, 2014.
[165] N. Bodla, B. Singh, R. Chellappa, L.S. Davis, Soft-NMS – improving object de- [202] P. Dollár, Z. Tu, P. Perona, S. Belongie, Integral channel features, in: Proceed-
tection with one line of code, in: Proceedings of the ICCV, 2017. ings of the BMVC, 2009.
[166] J. Hosang, R. Benenson, B. Schiele, Learning non-maximum suppression, in: [203] P. Dollár, R. Appel, S. Belongie, P. Perona, Fast feature pyramids for object
Proceedings of the CVPR, 2017. detection, in: Proceedings of the TPAMI, 2014.
[167] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wo- [204] C.P. Papageorgiou, M. Oren, T. Poggio, A general framework for object detec-
jna, Y. Song, S. Guadarrama, et al., Speed/accuracy trade-offs for modern con- tion, in: Proceedings of the ICCV, 1998.
volutional object detectors, in: Proceedings of the CVPR, 2017. [205] G. Brazil, X. Yin, X. Liu, Illuminating pedestrians via simultaneous detection
[168] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, J. Sun, Light-head R-CNN: in defense & segmentation, 2017. arXiv: 1706.08564.
of two-stage object detector, 2017. arXiv: 1711.07264. [206] X. Du, M. El-Khamy, J. Lee, L. Davis, Fused DNN: a deep neural network fusion
[169] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, Inverted residuals approach to fast and robust pedestrian detection, in: Proceedings of the IEEE
and linear bottlenecks: mobile networks for classification, detection and seg- Winter Conference on Applications of Computer Vision (WACV), 2017.
mentation, 2018. arXiv: 1801.04381. [207] S. Wang, J. Cheng, H. Liu, M. Tang, PCN: part and context information for
[170] A. Wong, M.J. Shafiee, F. Li, B. Chwyl, Tiny SSD: A tiny single-shot detection pedestrian detection with CNNS, 2018. arXiv: 1804.04483.
deep convolutional neural network for real-time embedded object detection, [208] D. Xu, W. Ouyang, E. Ricci, X. Wang, N. Sebe, Learning cross-modal deep
2018. arXiv: 1802.06488. representations for robust pedestrian detection, in: Proceedings of the CVPR,
[171] Y. Li, J. Li, W. Lin, J. Li, Tiny-DSOD: lightweight object detection for resource- 2017.
restricted usages, 2018. arXiv: 1807.11013. [209] R. Benenson, M. Omran, J. Hosang, B. Schiele, Ten years of pedestrian detec-
[172] D. Almeida, H. Lee, K. Sohn, D. Shang, Understanding and improving convolu- tion, what have we learned? in: Proceedings of the ECCV, 2014.
tional neural networks via concatenated rectified linear units, in: Proceedings [210] Z. Cai, M. Saberian, N. Vasconcelos, Learning complexity-aware cascades for
of the ICML, 2016. deep pedestrian detection, in: Proceedings of the ICCV, 2015.
[173] Y.D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, D. Shin, Compression of deep con- [211] P. Sermanet, K. Kavukcuoglu, S. Chintala, Y. LeCun, Pedestrian detection with
volutional neural networks for fast and low power mobile applications, Com- unsupervised multi-stage feature learning, in: Proceedings of the CVPR, 2013.
puter Science, 2015. [212] L. Zhang, L. Lin, X. Liang, K. He, Is faster R-CNN doing well for pedestrian
[174] Y. He, X. Zhang, J. Sun, Channel pruning for accelerating very deep neural detection? in: Proceedings of the ECCV, 2016.
networks, in: Proceedings of the ICCV, 2017. [213] X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, C. Shen, Repulsion loss: detecting
[175] Y. Gong, L. Liu, M. Yang, L. Bourdev, Compressing deep convolutional net- pedestrians in a crowd, in: Proceedings of the CVPR, 2018.
works using vector quantization, Computer Science, 2014. [214] S. Zhang, L. Wen, X. Bian, Z. Lei, S.Z. Li, Occlusion-aware R-CNN: detecting
[176] Y. Lin, S. Han, H. Mao, Y. Wang, W.J. Dally, Deep gradient compression: Re- pedestrians in a crowd, in: Proceedings of the ECCV, 2018.
ducing the communication bandwidth for distributed training, 2017. arXiv: [215] J. Mao, T. Xiao, Y. Jiang, Z. Cao, What can help pedestrian detection? in: Pro-
1712.01887. ceedings of the CVPR, 2017.
[177] J. Wu, L. Cong, Y. Wang, Q. Hu, J. Cheng, Quantized convolutional neural net- [216] Y. Tian, P. Luo, X. Wang, X. Tang, Deep learning strong parts for pedestrian
works for mobile devices, in: Proceedings of the CVPR, 2016. detection, in: Proceedings of the CVPR, 2015.
[178] S. Han, H. Mao, W.J. Dally, Deep compression: compressing deep neural net- [217] W. Ouyang, X. Wang, Joint deep learning for pedestrian detection, in: Pro-
works with pruning, trained quantization and Huffman coding, in: Proceed- ceedings of the ICCV, 2013.
ings of the Fiber, 2015. [218] M. Mathias, R. Benenson, R. Timofte, L. Van Gool, Handling occlusions with
[179] S. Han, J. Pool, J. Tran, W. Dally, Learning both weights and connections for Franken-classifiers, in: Proceedings of the ICCV, 2013.
efficient neural network, in: Proceedings of the NeurIPS, 2015. [219] W. Ouyang, X. Zeng, X. Wang, Modeling mutual visibility relationship in
[180] E. Osuna, R. Freund, F. Girosi, Training support vector machines: an applica- pedestrian detection, in: Proceedings of the CVPR, 2013.
tion to face detection, in: Proceedings of the CVPR, 1997. [220] G. Duan, H. Ai, S. Lao, A structural filter approach to human detection, in:
[181] M. Rätsch, S. Romdhani, T. Vetter, Efficient face detection by a cascaded sup- Proceedings of the ECCV, 2010.
port vector machine using haar-like features, in: Proceedings of the Joint Pat- [221] M. Enzweiler, A. Eigenstetter, B. Schiele, D.M. Gavrila, Multi-cue pedestrian
tern Recognition Symposium, 2004. classification with partial occlusion handling, in: Proceedings of the CVPR,
[182] S. Romdhani, P. Torr, B. Scholkopf, A. Blake, Computationally efficient face de- 2010.
tection, in: Proceedings of the ICCV, 2001. [222] C. Zhou, J. Yuan, Bi-box regression for pedestrian detection and occlusion es-
[183] X. Sun, P. Wu, S.C. Hoi, Face detection using deep learning: an improved faster timation, in: Proceedings of the ECCV, 2018.
RCNN approach, in: Proceedings of the Neurocomputing, 2018. [223] W. Ouyang, X. Wang, A discriminative deep model for pedestrian detection
[184] Y. Liu, M.D. Levine, Multi-path region-based convolutional neural network for with occlusion handling, in: Proceedings of the CVPR, 2012.
accurate detection of unconstrained "hard faces", in: Proceedings of the 14th [224] S. Tang, M. Andriluka, B. Schiele, Detection and tracking of occluded people,
Conference on Computer and Robot Vision (CRV), 2017. in: Proceedings of the IJCV, 2014.
[185] X. Tang, D.K. Du, Z. He, J. Liu, Pyramidbox: a context-assisted single shot face [225] W. Ouyang, X. Wang, Single-pedestrian detection aided by multi-pedestrian
detector, in: Proceedings of the ECCV, 2018. detection, in: Proceedings of the CVPR, 2013.
[186] C. Chi, S. Zhang, J. Xing, Z. Lei, S.Z. Li, X. Zou, Selective refinement network [226] V.D. Shet, J. Neumann, V. Ramesh, L.S. Davis, Bilattice-based logical reasoning
for high performance face detection, 2018. arXiv: 1809.02693. for human detection, in: Proceedings of the CVPR, 2007.
[187] J. Li, Y. Wang, C. Wang, Y. Tai, J. Qian, J. Yang, C. Wang, J. Li, F. Huang, Dsfd: [227] Y. Zhou, L. Liu, L. Shao, M. Mellor, Dave: a unified framework for fast vehicle
Dual shot face detector, in: Proceedings of the CVPR, 2019. detection and annotation, in: Proceedings of the ECCV, 2016.
[188] J. Zhang, X. Wu, J. Zhu, S.C. Hoi, Feature agglomeration networks for single [228] T. Gebru, J. Krause, Y. Wang, D. Chen, J. Deng, L. Fei-Fei, Fine-grained car de-
stage face detection, 2017. arXiv: 1712.00721. tection for visual census estimation, in: Proceedings of the AAAI, 2017.
[189] M. Najibi, P. Samangouei, R. Chellappa, L. Davis, SSH: single stage headless [229] S. Majid Azimi, Shuffledet: real-time vehicle detection network in on-board
face detector, in: Proceedings of the ICCV, 2017. embedded UAV imagery, in: Proceedings of the ECCV, 2018.
[190] Z. Hao, Y. Liu, H. Qin, J. Yan, X. Li, X. Hu, Scale-aware face detection, in: Pro- [230] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, S. Hu, Traffic-sign detection and
ceedings of the CVPR, 2017. classification in the wild, in: Proceedings of the CVPR, 2016.
[191] H. Wang, Z. Li, X. Ji, Y. Wang, Face R-CNN, 2017. arXiv: 1706.01061. [231] A. Pon, O. Adrienko, A. Harakeh, S.L. Waslander, A hierarchical deep architec-
[192] K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint face detection and alignment using ture and mini-batch selection method for joint traffic sign and light detection,
multi-task cascaded convolutional networks, in: Proceedings of the IEEE Sig- in: Proceedings of the Conference on Computer and Robot Vision (CRV), 2018.
nal Processing Letters, 2016. [232] W. Ke, J. Chen, J. Jiao, G. Zhao, Q. Ye, Srn: side-output residual network for
object symmetry detection in the wild, in: Proceedings of the CVPR, 2017.
[233] W. Shen, K. Zhao, Y. Jiang, Y. Wang, Z. Zhang, X. Bai, Object skeleton extraction in natural images by fusing scale-associated deep side outputs, in: Proceedings of the CVPR, 2016.
[234] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, et al., The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale, 2018. arXiv: 1811.00982.
[235] A. Gupta, P. Dollar, R. Girshick, LVIS: a dataset for large vocabulary instance segmentation, in: Proceedings of the CVPR, 2019.
[236] N. Bodla, B. Singh, R. Chellappa, L.S. Davis, Soft-NMS–improving object detection with one line of code, in: Proceedings of the ICCV, 2017.
[237] Y. Wu, K. He, Group normalization, in: Proceedings of the ECCV, 2018.
[238] S. Liu, L. Qi, H. Qin, J. Shi, J. Jia, Path aggregation network for instance segmentation, in: Proceedings of the CVPR, 2018.
[239] Y. Li, Y. Chen, N. Wang, Z. Zhang, Scale-aware trident networks for object detection, 2019. arXiv: 1901.01892.
[240] X. Zhou, J. Zhuo, P. Krahenbuhl, Bottom-up object detection by grouping extreme and center points, in: Proceedings of the CVPR, 2019.
[241] Z. Tian, C. Shen, H. Chen, T. He, FCOS: fully convolutional one-stage object detection, 2019. arXiv: 1904.01355.
[242] B. Yang, J. Yan, Z. Lei, S.Z. Li, Aggregate channel features for multi-view face detection, in: Proceedings of the Biometrics (IJCB), 2014.
[243] S. Yang, P. Luo, C.C. Loy, X. Tang, From facial parts responses to face detection: a deep learning approach, in: Proceedings of the ICCV, 2015.
[244] E. Ohn-Bar, M.M. Trivedi, To boost or not to boost? On the limits of boosted trees for object detection, in: Proceedings of the ICPR, 2016.
[245] S. Yang, Y. Xiong, C.C. Loy, X. Tang, Face detection through scale-friendly deep convolutional networks, 2017. arXiv: 1706.02863.
[246] J. Wang, Y. Yuan, G. Yu, Face attention network: an effective face detector for the occluded faces, 2017. arXiv: 1711.07246.
[247] X. Xu, C. Zhang, D. Tu, Face detection using improved faster RCNN, 2018. arXiv: 1802.02142.
[248] W. Tian, H. Shen, W. Deng, Z. Wang, Learning better features for face detection with feature fusion and segmentation supervision, 2018. arXiv: 1811.08557.
[249] S. Zhang, L. Wen, H. Shi, Z. Lei, Single-shot scale-aware network for real-time face detection, in: Proceedings of the IJCV, 2019.
[250] W. Liu, S. Liao, W. Ren, W. Hu, High-level semantic feature detection: a new perspective for pedestrian detection, in: Proceedings of the CVPR, 2019.
[251] Z. Li, X. Tang, J. Han, J. Liu, Pyramidbox++: high performance detector for finding tiny face, 2019. arXiv: 1904.00386.
[252] Y. Zhang, X. Xu, X. Liu, Robust and high performance face detector, 2019. arXiv: 1901.02350.
[253] S. Zhang, R. Zhu, X. Wang, H. Shi, T. Fu, S. Wang, T. Mei, S.Z. Li, Improved selective refinement network for face detection, 2019. arXiv: 1901.06651.
[254] J. Deng, J. Guo, Y. Zhou, J. Yu, I. Kotsia, S. Zafeiriou, Retinaface: single-stage dense face localisation in the wild, 2019. arXiv: 1905.00641.
[255] F. Zhang, X. Fan, G. Ai, J. Song, Y. Qin, J. Wu, Accurate face detection for high performance, 2019. arXiv: 1905.00641.
[256] S. Zhang, C. Chi, Z. Lei, S.Z. Li, Refineface: refinement neural network for high performance face detection, 2019. arXiv: 1909.04376.
[257] S. Zhang, R. Benenson, B. Schiele, Citypersons: a diverse dataset for pedestrian detection, in: Proceedings of the CVPR, 2017.
[258] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, The cityscapes dataset for semantic urban scene understanding, in: Proceedings of the CVPR, 2016.
[259] P. Dollar, C. Wojek, B. Schiele, P. Perona, Pedestrian detection: an evaluation of the state of the art, in: Proceedings of the TPAMI, 2012.
[260] A. Ess, B. Leibe, L. Van Gool, Depth and appearance for mobile scene analysis, in: Proceedings of the ICCV, 2007.
[261] A. Geiger, P. Lenz, C. Stiller, R. Urtasun, Vision meets robotics: the kitti dataset, IJRR, 2013.
[262] R. Lu, H. Ma, Semantic head enhanced pedestrian detection in a crowd, 2019. arXiv: 1911.11985.
[263] Y. Pang, J. Xie, M.H. Khan, R.M. Anwar, F.S. Khan, L. Shao, Mask-guided attention network for occluded pedestrian detection, in: Proceedings of the ICCV, 2019.
[264] Y. Hu, J. Xie, J. Zhang, L. Lin, Y. Li, S.C. Hoi, Attribute-aware pedestrian detection in a crowd, 2019. arXiv: 1910.09188.
[265] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, M. Pietikäinen, Deep learning for generic object detection: a survey, 2018. arXiv: 1809.02165.
[266] L. Jiao, F. Zhang, F. Liu, S. Yang, L. Li, Z. Feng, R. Qu, A survey of deep learning-based object detection, 2019. arXiv: 1907.09408.
[267] F. Sultana, A. Sufian, P. Dutta, A review of object detection models based on convolutional neural network, 2019. arXiv: 1905.01614.
[268] S. Agarwal, J.O.D. Terrail, F. Jurie, Recent advances in object detection in the age of deep convolutional neural networks, 2018. arXiv: 1809.03193.
[269] Z.-Q. Zhao, P. Zheng, S.-T. Xu, X. Wu, Object detection with deep learning: a review, in: Proceedings of the IEEE Transactions on Neural Networks and Learning Systems, 2019.
[270] B. Zoph, V. Vasudevan, J. Shlens, Q.V. Le, Learning transferable architectures for scalable image recognition, in: Proceedings of the CVPR, 2018.
[271] M. Tan, Q.V. Le, Efficientnet: rethinking model scaling for convolutional neural networks, 2019. arXiv: 1905.11946.
[272] Y. Chen, T. Yang, X. Zhang, G. Meng, C. Pan, J. Sun, Detnas: neural architecture search on object detection, 2019. arXiv: 1903.10979.
[273] G. Ghiasi, T.-Y. Lin, Q.V. Le, NAS-FPN: learning scalable feature pyramid architecture for object detection, in: Proceedings of the CVPR, 2019.
[274] B. Zoph, E.D. Cubuk, G. Ghiasi, T.-Y. Lin, J. Shlens, Q.V. Le, Learning data augmentation strategies for object detection, 2019. arXiv: 1906.11172.
[275] X. Dong, L. Zheng, F. Ma, Y. Yang, D. Meng, Few-example object detection with model communication, in: Proceedings of the TPAMI, 2018.
[276] E. Schwartz, L. Karlinsky, J. Shtok, S. Harary, M. Marder, S. Pankanti, R. Feris, A. Kumar, R. Giryes, A.M. Bronstein, Repmet: representative-based metric learning for classification and one-shot object detection, in: Proceedings of the CVPR, 2019.
[277] H. Chen, Y. Wang, G. Wang, Y. Qiao, LSTD: a low-shot transfer detector for object detection, in: Proceedings of the AAAI, 2018.
[278] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, J. Sun, Megdet: a large mini-batch object detector, in: Proceedings of the CVPR, 2018.
[279] K. Shmelkov, C. Schmid, K. Alahari, Incremental learning of object detectors without catastrophic forgetting, in: Proceedings of the ICCV, 2017.

Xiongwei WU received the bachelor’s degree in computer science from Zhejiang University, Zhejiang, P.R. China. He is currently a Ph.D. student in the School of Information Systems, Singapore Management University, Singapore, supervised by Prof. Steven Hoi. His research directions mainly focus on object detection and deep learning.

Doyen SAHOO is a Research Scientist at Salesforce Research Asia. Prior to this, he served as Adjunct Faculty at Singapore Management University, and was also a Research Fellow at the Living Analytics Research Center. He works on Online Learning, Deep Learning and related machine learning applications. He obtained his Ph.D. from Singapore Management University, and B.Eng from Nanyang Technological University.

Prof. Steven C.H. HOI is currently Managing Director of Salesforce Research Asia at Salesforce, located in Singapore. He has also been a tenured Associate Professor of the School of Information Systems at Singapore Management University, Singapore. Prior to joining SMU, he was a tenured Associate Professor of the School of Computer Engineering at Nanyang Technological University, Singapore. He received his Bachelor's degree in Computer Science from Tsinghua University, Beijing, China, in 2002, and both his Master's and Ph.D. degrees in Computer Science and Engineering from the Chinese University of Hong Kong, in 2004 and 2006, respectively.
