Fig. 2. The environments represented in different types: (a) a sparse map produced by ORB-SLAM2 [16]; (b) a semi-dense map produced by LSD-SLAM [17]; (c) a dense map produced by DTAM [18]; (d) a semantic map produced by DA-RNN [19].

Perceiving the environment. A good perception and understanding of the surrounding environment is indispensable for autonomous systems. vSLAM algorithms have been widely applied to model the environment into different types of representations according to the actual requirements, including sparse maps [16], semi-dense maps [17], and dense maps [18], as shown in Fig. 2 (a)-(c). Although the geometric structure of the surroundings is clearly perceived and modeled in these representations, high-level information about the objects, such as semantic information, is still lacking.

Perceiving their own state. The state of an autonomous vehicle is described by its position and orientation. Understanding this state in real time is important for autonomous systems and is the main precondition of autonomous control. Although current vSLAM algorithms play a crucial role in self-localization and ego-motion estimation, they still impose some strong assumptions, such as the static-scene hypothesis and the photometric-consistency hypothesis.

Visual navigation. The ability to navigate autonomously is also essential for autonomous systems. When an autonomous vehicle is assigned a destination, it needs the capability of planning a reasonable path or trajectory; poor or untimely planning may lead to severe consequences such as collisions. Therefore, human-like planning is an important direction for future development, and it is possible to achieve this goal with the help of learning frameworks. Since traditional motion planning methods have been well summarized in [9], this review mainly focuses on reinforcement learning-based navigation in autonomous systems.

Learning-based methods for visual perception and navigation. With the development of learning frameworks [20], deep learning and reinforcement learning have demonstrated outstanding performance in image processing [21], [22], natural language processing, etc. The impact of learning frameworks on perception as well as navigation is transformational, and it has enabled significant advances in autonomous systems [13]. Recently, deep learning-based models have been widely used in work related to environment perception, such as monocular depth estimation [28], ego-motion prediction [25], object detection [4], and semantic segmentation [29]. Furthermore, to improve the tracking, localization, and mapping performance of current vSLAM methods in complex environments (e.g., low-light or night-time scenes), attempts have been made to combine vSLAM with deep learning, and satisfactory results have been obtained [30]. For example, some related works [31], [32] incorporated learning-based semantic understanding into vSLAM to reconstruct semantic maps of the surroundings, as shown in Fig. 2 (d), thereby obtaining a high-level understanding of the environment. Moreover, related work in [33] has demonstrated that reinforcement learning performs well in robotic navigation, solving the navigation problem in an end-to-end manner. In addition, reinforcement learning enables robots to learn and imitate human decision making. Unlike some well-written reviews [13], [15], [34], this survey mainly focuses on learning-based perception, including self-state perception and environment perception, as well as representative results for reinforcement learning-based navigation in autonomous systems.

The rest of the paper is organized as follows: Section II introduces related works on visual perception, including a brief review of traditional vSLAM methods, deep learning-based visual perception, and methods combining deep learning with vSLAM. Section III provides an overview of reinforcement learning-based visual navigation. Section IV summarizes the deficiencies and challenges of existing learning systems for visual perception and navigation, and provides some ideas about future directions. Finally, the survey is concluded in Section V.

II. AUTONOMOUS VISUAL PERCEPTION

In autonomous systems, obtaining a comprehensive understanding of the environment and of the system's own state is one of the basic and important perception tasks, and it can be efficiently addressed by vSLAM algorithms or their sub-topics. Some classic SLAM methods are well summarized and discussed in [13], [34]. Cadena et al. [13] reviewed the related work on SLAM over the last 30 years in detail. They revisited and answered several important and meaningful questions related to SLAM and stated that "SLAM is necessary for autonomous robots". Different from previous review papers, in this section we mainly focus on the application of deep learning algorithms in perception, subdividing the related methods into three types.

A. Geometric methods-based visual perception

SLAM is a common perception method in current autonomous systems.
TABLE I. A summary of major geometric vSLAM methods ("Mono." denotes a monocular camera and "Stereo" a stereo camera).
Compared with SLAM systems that use lidar sensors [74], [75], visual sensors such as RGB cameras [49], [76] can provide richer environmental information, and they have been extensively investigated in recent years owing to their portability. Therefore, we first briefly summarize the different types of vSLAM methods in chronological order, as presented in Table I, which enumerates their optimization strategies, map types, and sensors in detail. From Table I, we find that filtering-based vSLAM methods were widely studied in the early stage owing to their low computational burden. With the growth of computing power, optimization-based vSLAM methods have become popular in recent years due to their higher accuracy. Meanwhile, dense maps are usually constructed by direct methods based on RGB-D sensors, such as [44], [45], [77]. In addition, new sensors, such as event cameras, and multi-sensor data fusion are attracting significant attention and offer promising research prospects [50], [54], [78], [79]. In this section, we describe the basic principles of the three classical monocular vSLAM solutions: feature-based methods [52], direct methods [17], and semi-direct methods [49]. The main difference between these three classes lies in whether the pose is optimized by minimizing the reprojection error, the photometric error, or both [76].

Feature-based methods have dominated vSLAM for a long time, and different hand-crafted features (such as SIFT [80], SURF [81], and ORB [82]) have been designed to improve robust tracking and mapping in different scenarios. Feature-based pipelines can be divided into three stages: image input, feature extraction and matching, and tracking and mapping. Most recently, ORB-SLAM3 [73] was proposed to support different kinds of sensors, including monocular, stereo, RGB-D, and IMU configurations, as well as a variety of camera models; the ORB-SLAM3 system is much more versatile, accurate, and robust than previous work. However, the performance of feature-based methods relies on correct matching, and they may fail to initialize and track in low-texture and repeated-texture scenes [17] because of mismatches, divergence of the optimization, and accumulated drift. Direct methods skip feature extraction and matching; instead, the photometric information of pixels is used directly for pose and depth calculation during tracking and mapping [17], [83]. Direct methods regard pose estimation as a nonlinear optimization problem and iteratively refine an initial motion guess by minimizing the photometric error [17]; therefore, they rely heavily on the luminosity-consistency assumption [84]. Semi-direct methods first establish feature correspondences on the basis of direct methods, which is the main difference from the other methods [49], [57]. The epipolar constraint is applied to match the same features along the epipolar line, and after matching, the solved pose is further optimized by minimizing the reprojection error. Therefore, semi-direct methods handle the tracking problem by minimizing both the photometric error and the reprojection error. Similar to direct methods, semi-direct methods have high requirements on image quality and are sensitive to photometric changes.
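To make the distinction between these objectives concrete, the sketch below shows, for a simple pinhole model, the two kinds of residuals minimized by the methods above: a reprojection residual over matched keypoints (feature-based) and a photometric residual over sampled pixel intensities (direct). It is a minimal NumPy illustration with hypothetical helper names and nearest-neighbour image sampling, not the formulation of any particular system.

```python
import numpy as np

def project(K, T, pts_3d):
    """Project Nx3 points into the image using intrinsics K (3x3) and a 4x4 rigid transform T."""
    pts_h = np.hstack([pts_3d, np.ones((len(pts_3d), 1))])
    pts_cam = (T @ pts_h.T).T[:, :3]
    uv = (K @ pts_cam.T).T
    return uv[:, :2] / uv[:, 2:3]

def reprojection_residual(K, T_world_to_cam, landmarks_3d, observed_uv):
    """Feature-based objective: projected landmarks vs. their matched keypoints."""
    return (project(K, T_world_to_cam, landmarks_3d) - observed_uv).ravel()

def photometric_residual(K, T_ref_to_cur, ref_img, cur_img, ref_uv, ref_depth):
    """Direct objective: intensity of reference pixels vs. their warped locations in the current frame."""
    rays = (np.linalg.inv(K) @ np.hstack([ref_uv, np.ones((len(ref_uv), 1))]).T).T
    pts_ref = rays * ref_depth[:, None]                       # back-project with current depth estimates
    uv_cur = project(K, T_ref_to_cur, pts_ref)                # warp into the current frame
    u = np.clip(uv_cur[:, 0].round().astype(int), 0, cur_img.shape[1] - 1)
    v = np.clip(uv_cur[:, 1].round().astype(int), 0, cur_img.shape[0] - 1)
    ref_vals = ref_img[ref_uv[:, 1].astype(int), ref_uv[:, 0].astype(int)]
    return ref_vals.astype(float) - cur_img[v, u].astype(float)
```

In a full system, residuals like these would be passed to a robust non-linear least-squares solver (e.g., Gauss-Newton with a Huber loss) over keyframe poses and map points; semi-direct methods combine both terms.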
Although the architecture of vSLAM algorithms has matured considerably over the past 30 years and the three kinds of above-mentioned approaches achieve good performance in normal indoor/outdoor scenes, their tracking robustness and localization/re-localization accuracy in many complex scenarios (such as highly dynamic, large-scale, or night-time environments, or across-weather and across-season conditions) still need to be further improved [13], [84], [85].

In conclusion, although traditional geometry-based vSLAM methods have achieved impressive results in environmental mapping and self-localization, they still have some shortcomings. For example, feature-based methods cannot adapt to low-texture areas; direct methods need a good initialization; semi-direct methods are sensitive to luminosity; traditional vSLAM/VO methods cannot handle changing lighting/weather/season conditions; and monocular vSLAM/VO methods suffer from scale ambiguity [84], [86], [87]. With the continuous development of deep learning in image processing, applying the latest deep learning systems to existing vSLAM pipelines to handle these problems is evolving into a popular research field.

B. Deep learning-based visual perception

With the development of deep learning, utilizing deep neural networks to address computer vision tasks has become a popular research field in recent years. Many sub-topics of vSLAM relevant to environment perception have been extensively studied with deep learning, such as monocular depth and ego-motion prediction, which are detailed in the following sections.

1) Learning-based monocular depth perception: Depth is one of the most important pieces of information for autonomous systems in scene reconstruction, self-localization, obstacle avoidance, and so on. Although active depth sensors are available for depth perception, image-based techniques are often preferred thanks to the increasing availability of standard cameras on most consumer devices [144]. Structure-from-motion (SfM) [145], [146] and stereo matching [147], [148] are two of the most popular methods for recovering depth from sequential or left-right image pairs [147]; the depth is calculated by triangulation and continuously optimized via projection and matching costs. However, these methods rely on the assumption that multiple observations of the scene are available [99], which means they are not well suited to estimating depth from a single image.

Estimating depth from a single image is an ill-posed problem [88], which requires significant hand-crafted prior knowledge when handled by traditional geometric methods [149], [150]. Deep neural networks can recover pixel-level depth from a single image in an end-to-end manner based on prior knowledge learnt from ground-truth depth labels or from geometric relationships between images [151]. Since both ground-truth-based supervised methods and geometry-based unsupervised methods have been well summarized in [151], in this paper we focus on the latest work, starting with the issues that remain unresolved in monocular depth estimation. As shown in [151], monocular depth estimation has made great progress in recent years, and unsupervised methods are already close to supervised methods in accuracy. After several years of development, the framework of monocular depth estimation has become very mature, and recent work focuses on addressing the deficiencies of the existing unsupervised framework, such as the static-scene assumption [152] and the photometric-consistency assumption [129].

The smoothness loss is one of the most widely used constraints [28], [122], [123]; it promotes smoothness of the surface depth of objects, for example so that the depth of adjacent points on a road surface varies gradually. However, existing methods do not impose the smoothness constraint after distinguishing different targets in the scene, which smooths edge areas of the estimated depth map that should remain sharp. To address this problem, Yin et al. [153] proposed a novel geometric constraint that considers the surface normal to improve both the accuracy of depth estimation and the geometric shape of the predicted depth map. Instead of using additional constraints to obtain a clear geometric structure, the method proposed in [96] predicts a 2D displacement field over the given depth map to re-sample pixels around occlusion boundaries into sharper reconstructions. A recent study [131] showed that incorporating sequence information into the monocular framework helps improve depth prediction when such information is available. Instead of estimating the exact depth of each pixel, predicting the relative depth of pixels in the image is also useful for scene perception and understanding [154], and it can also yield good results when recovering metric depth. Considering the wide application scenarios of high-resolution depth maps, such as object detection and semantic segmentation, Miangoleh et al. [155] proposed to infer high-resolution depth maps from images based on pre-trained depth models.

Since the supervisory signal of unsupervised methods is mainly the view reconstruction loss [28], which relies heavily on the static-scene assumption, these methods fail to predict depth for moving objects. To deal with this challenge, Godard et al. [156] designed an "Auto-Masking" scheme that eliminates pixels that keep the same position and the same RGB value between adjacent frames in the sequence. However, this scheme can only eliminate the influence of objects moving with the same relative translation as the camera, while other dynamic objects still have a negative influence on the unsupervised training process. Therefore, with the help of semantic segmentation, Klingner et al. [157] divided dynamic and static objects by the correspondence of class labels between frames, computed by projection, and then eliminated the effects of these dynamic regions on the view reconstruction loss.
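As a concrete reference for the losses discussed above, the PyTorch-style sketch below combines a weighted SSIM+L1 view-reconstruction error, an edge-aware smoothness term, and the auto-masking idea of [156]. It is a simplified, self-contained illustration with illustrative constants, not the exact formulation of any cited method.

```python
import torch
import torch.nn.functional as F

def photometric_error(pred, target, alpha=0.85):
    """Per-pixel view-reconstruction error: weighted SSIM + L1, the usual unsupervised signal."""
    l1 = (pred - target).abs().mean(1, keepdim=True)
    mu_p, mu_t = F.avg_pool2d(pred, 3, 1, 1), F.avg_pool2d(target, 3, 1, 1)
    var_p = F.avg_pool2d(pred ** 2, 3, 1, 1) - mu_p ** 2
    var_t = F.avg_pool2d(target ** 2, 3, 1, 1) - mu_t ** 2
    cov = F.avg_pool2d(pred * target, 3, 1, 1) - mu_p * mu_t
    ssim = ((2 * mu_p * mu_t + 1e-4) * (2 * cov + 9e-4)) / \
           ((mu_p ** 2 + mu_t ** 2 + 1e-4) * (var_p + var_t + 9e-4))
    dssim = ((1 - ssim) / 2).clamp(0, 1).mean(1, keepdim=True)
    return alpha * dssim + (1 - alpha) * l1

def edge_aware_smoothness(disp, img):
    """Penalize disparity gradients except across image edges (the common smoothness loss)."""
    disp = disp / (disp.mean((2, 3), keepdim=True) + 1e-7)
    dx = (disp[..., :, :-1] - disp[..., :, 1:]).abs()
    dy = (disp[..., :-1, :] - disp[..., 1:, :]).abs()
    wx = torch.exp(-(img[..., :, :-1] - img[..., :, 1:]).abs().mean(1, keepdim=True))
    wy = torch.exp(-(img[..., :-1, :] - img[..., 1:, :]).abs().mean(1, keepdim=True))
    return (dx * wx).mean() + (dy * wy).mean()

def automasked_reconstruction_loss(warped_src, target, src):
    """Auto-masking as in [156]: ignore pixels whose unwarped source already matches the target
    (static camera, or objects moving with the camera)."""
    reproj = photometric_error(warped_src, target)
    identity = photometric_error(src, target) + 1e-5        # tiny offset breaks ties
    mask = (reproj < identity).float()
    return (mask * reproj).sum() / mask.sum().clamp(min=1.0)
```

During training, the warped image is produced by differentiably reprojecting the source view with the predicted depth and relative pose, and the total objective is typically the masked reconstruction term plus a small multiple (e.g., 1e-3) of the smoothness term, summed over image scales.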
There are also studies that improve the accuracy of monocular depth estimation through new network designs, such as novel depth networks [158] or attention mechanisms [95], [159]. Introducing traditional geometry is also effective: Wang et al. [110] obtained better pose estimates by using direct methods during training, where the direct method further optimizes the output of the pose network, thereby yielding more accurate pose and depth estimation.
TABLE II. A summary of deep learning-based monocular depth and ego-motion estimation.
Depth estimation based on novel cameras, such as event cameras [160], fisheye cameras [152], and panoramic cameras [161], is attracting increasing attention because of their advantages, such as low latency and a wide field of view. Inspired by the high performance of HRNet [162], Zhou et al. [133] introduced HRNet into the unsupervised monocular depth estimation task and obtained satisfactory results.

Monocular depth estimation in special scenarios, such as adverse weather conditions and night-time scenes, is gradually receiving more attention. Because of the complex luminosity changes and photometric inconsistency at night, previous unsupervised frameworks driven by view-reconstruction consistency [28], [122], [123] cannot be applied directly to night-time scenes [129]. Recent studies have tried to address this problem by using warped feature consistency [129] or cross-domain feature adaptation [127], achieving good accuracy in night-time depth estimation. Spencer et al. [129] designed DeFeat-Net to simultaneously learn cross-domain dense feature representations of frames; a robust feature-reconstruction consistency, instead of view-reconstruction consistency, is used as the main supervisory signal for training, allowing the framework to adapt to such special scenarios. Based on an auto-encoder depth network pre-trained on daytime data, Vankadari et al. [127] used an additional night-time encoder to encode night images. A PatchGAN-based adversarial discriminator was designed to constrain the consistency between the features of daytime and night-time images, which are encoded by the two encoders respectively; hence, the pre-trained decoder can directly recover a depth map of a night-time image from features produced by the night-time encoder. Zhao et al. [163] proposed a CycleGAN-based domain adaptation framework to obtain an end-to-end night-time depth model from a pre-trained daytime model, and it achieved better results at night and even on rainy nights.
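The day/night adaptation schemes above share a simple adversarial pattern: a discriminator tries to tell daytime features from night-time features, while the night-time encoder is trained to fool it, so that a frozen daytime decoder can be reused at night. The sketch below is a hedged, generic illustration of that pattern, not the exact losses of [127] or [163]; `disc` stands for any patch- or feature-level discriminator.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def discriminator_loss(disc, day_feat, night_feat):
    """Train the discriminator: day features labelled 1, night features labelled 0."""
    d_day = disc(day_feat.detach())
    d_night = disc(night_feat.detach())
    return bce(d_day, torch.ones_like(d_day)) + bce(d_night, torch.zeros_like(d_night))

def night_encoder_loss(disc, night_feat):
    """Train the night-time encoder: make its features indistinguishable from day features,
    so a frozen daytime depth decoder can consume them directly."""
    d_night = disc(night_feat)
    return bce(d_night, torch.ones_like(d_night))
```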
Instead of using adaptation methods, Wang et al. [164] leveraged a Mapping-Consistent Image Enhancement module to deal with low visibility and a Statistics-Based Mask (SBM) to tackle textureless regions, so that their model can be trained directly on night-time image sequences.

2) Learning-based monocular ego-motion perception: Visual odometry (VO) is the process of estimating the ego-motion of an agent (e.g., a vehicle, human, or robot) using the input of one or multiple attached cameras [34]. Geometry-based monocular VO methods handle localization and tracking by minimizing the photometric error [59] or the reprojection error [52] on sequential images. The difference between traditional VO and vSLAM is that a VO system lacks loop-closure detection and global optimization [34]. With the development of deep learning systems, using features extracted by deep neural networks to regress the ego-motion in an end-to-end manner has become a popular approach in recent years [25]. Compared with traditional VO methods, pose networks do not require complex parameter tuning, such as the settings of keyframes and features [25]. Moreover, pose networks can learn scale information from the ground truth during training, so these methods avoid the monocular scale ambiguity that is pervasive in traditional monocular VO methods [52], [59]. Konda et al. [134] first estimated motion information through deep learning by formulating pose prediction as a classification problem. Kendall et al. [25] first demonstrated the ability of convolutional neural networks (CNNs) for 6-DOF pose regression: a deep CNN framework called PoseNet was designed to regress the monocular camera pose and could operate in different scenes in real time. In [135], Costante et al. also used a deep CNN to learn high-level feature representations; the major difference from [25] is that dense optical flow was computed and used to estimate the ego-motion instead of feeding RGB images into the CNN directly. Considering the dynamics and relations between adjacent pose transformations, Wang et al. [136] and Xue et al. [137] used recurrent neural networks (RNNs) for camera localization. Xue et al. [138] further extended this work by incorporating two helpful modules, named "Memory" and "Refining", into the VO task, which outperformed previous deep learning-based VO methods [137].

As learning systems constantly evolve, introducing new learning architectures to existing tasks has been a good way to improve the ability of pose networks in high-level feature extraction and pose regression. Xue et al. [141] proposed to construct a view graph that exploits the information of the whole given sequence for absolute camera pose estimation, applying a graph neural network to model the full graph. Li et al. [126] introduced online meta-learning into a previous learning framework, so that their method can continuously adapt to unseen environments in a self-supervised manner. Considering the error-accumulation problem commonly suffered by previous learning-based methods, Zou et al. [165] aggregated long-term temporal information by using a Conv-LSTM (convolutional long short-term memory) to model long-term temporal dependencies; meanwhile, long-range constraints based on long-range image snippets are used to improve temporal consistency over long sequences, much like the local optimization (bundle adjustment) widely used in traditional VO methods. Chi et al. [115] studied the performance difference between feature-level collaboration and loss-level joint optimization for multi-task learning (depth, pose, and optical flow); feature-level collaboration showed a much greater performance improvement for all three tasks, so they designed a single network that integrates all three, in which the pose component regresses the pose from both the images and the estimated disparity and optical flow maps. Inspired by bundle adjustment, Wei [142] proposed a deep learning framework that iteratively improves both depth and pose based on a cost volume explicitly built to measure photometric and geometric consistency. Zhuang et al. [143] presented an uncertainty-based probabilistic framework that integrates pose predictions from deep neural networks with solutions from geometric feature-based solvers (the five-point method and bundle adjustment). Instead of estimating poses from images, Zhao et al. [166] recovered the relative pose by directly solving the fundamental matrix from dense optical-flow correspondences predicted by an optical flow network, and the results demonstrated the effectiveness of the framework for pose estimation. Jiao et al. [116] obtained the pose between frames by minimizing the reprojection error, given that the optical flow and depth are predicted by deep neural networks.

Traditional methods have shown that combining visual information with inertial information helps improve visual localization accuracy [53], [167], [168]. However, these visual-inertial odometry (VIO) methods require accurate calibration between sensors, time-stamp synchronization between inertial and visual data, and effective fusion of inertial and visual information [53], [167], [168]. Researchers believe that inertial information is also helpful in learning-based methods. Therefore, Clark et al. [139] proposed the first end-to-end deep learning-based VIO framework, which needs neither time-stamp alignment nor manual calibration between the different sensors. They used a CNN architecture to extract visual features and a long short-term memory (LSTM) network to extract inertial features, fusing them with a core LSTM processing module for pose regression. For a better integration of the visual and inertial features extracted by deep neural networks, Chen et al. [140] presented a selective sensor-fusion framework based on an attention mechanism, which autonomously selects the most useful features extracted from images or the inertial measurement unit (IMU) by the deep neural network. Therefore, even when image quality is poor, their algorithm can obtain accurate poses with the help of inertial data.
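The learned VIO pattern described above ([139], [140]) — a CNN for stacked image pairs, a recurrent encoder for the IMU window between frames, and a fusion recurrence that regresses relative pose — can be sketched as follows. Layer sizes, the 6-parameter pose output, and all module names are illustrative assumptions, not the architecture of any cited paper.

```python
import torch
import torch.nn as nn

class TinyVIO(nn.Module):
    """Minimal sketch of a learned visual-inertial odometry network."""
    def __init__(self):
        super().__init__()
        self.visual = nn.Sequential(
            nn.Conv2d(6, 16, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 128))
        self.inertial = nn.LSTM(input_size=6, hidden_size=64, batch_first=True)
        self.fusion = nn.LSTM(input_size=128 + 64, hidden_size=128, batch_first=True)
        self.pose = nn.Linear(128, 6)   # 3 translation + 3 rotation parameters per step

    def forward(self, image_pairs, imu, state=None):
        # image_pairs: (B, T, 6, H, W) consecutive frames stacked channel-wise
        # imu: (B, T, S, 6) windows of accelerometer + gyroscope samples between frames
        B, T = image_pairs.shape[:2]
        vis = self.visual(image_pairs.flatten(0, 1)).view(B, T, -1)
        imu_feat, _ = self.inertial(imu.flatten(0, 1))          # (B*T, S, 64)
        imu_feat = imu_feat[:, -1].view(B, T, -1)               # last hidden state per window
        fused, state = self.fusion(torch.cat([vis, imu_feat], dim=-1), state)
        return self.pose(fused), state                          # per-step relative pose
```

An attention-based variant, as in [140], would additionally weight or gate the visual and inertial feature vectors before fusion.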
We briefly summarize deep learning-based monocular depth and ego-motion estimation methods according to their publication year, training data, training mode, and tasks, as shown in Table II. From the table, we find that increasing attention has been paid to unsupervised methods in recent years, because they do not require expensive ground truth [28]. Besides, considering the close relationship between projection and inter-frame optical flow, researchers often extend the unsupervised pose and depth estimation framework with optical flow estimation [121], [122].
Recently, scene flow (optical flow in 3D space) estimation [114], [116] has been receiving more attention; it is trained together with the depth and pose networks in an unsupervised manner. Since optical flow, scene flow, depth, and pose are tightly coupled, the training strategy has an impact on the performance of each task [116]. Therefore, multi-task frameworks have become popular in recent years, and the geometric relationships between these tasks (flow, segmentation, masks) have been exploited to improve network performance. Besides, data from multiple sensors (such as cameras and IMUs) have also been fed into the networks to provide additional information [109], [140], thus facilitating training.

C. Deep learning with vSLAM

Methods combining vSLAM with deep learning have also been extensively studied and have led to notable improvements over traditional vSLAM methods, such as tackling the scale ambiguity of monocular vSLAM [169], [170], improving robust tracking and accurate mapping [171], [172], strengthening the adaptability of vSLAM to different environments [30], [84], [173], and extending the semantic perception of the environment [174], [175].

1) Learning-based monocular depth estimation and vSLAM: Depth information plays an important role in traditional vSLAM methods, and sensor-based and triangulation-based methods are the two basic ways to obtain the depth of features. With the development of deep learning in monocular depth estimation, researchers are trying to use learning-based methods as an alternative to the traditional depth calculation in vSLAM. The combination of deep learning-based depth estimation with traditional vSLAM has proved effective for obtaining the depth of features and overcoming the monocular scale ambiguity, thereby improving mapping and replacing RGB-D sensors [169], [176]. Depth prediction was first introduced into dense monocular vSLAM by Laina et al. [176]. Since the mapping process reduces the dependence on feature extraction and matching, this method has the potential to reconstruct low-texture scenes; moreover, this work showed that a depth estimation network can replace depth sensors (such as RGB-D) and can be used for dense reconstruction. After that, a real-time dense vSLAM framework was proposed in [169]. It used LSD-SLAM [17] as the baseline and fused depth estimation with semantic information. Unlike the work of Laina et al. [176], where the estimated depth was used in vSLAM directly, Tateno et al. [169] treated the predicted depth map as the initial guess for LSD-SLAM and further refined the predicted depth values through the local and global optimization of vSLAM. This method not only achieved higher pose accuracy than LSD-SLAM but also overcame the scale inconsistency of dense monocular reconstruction. Similarly, Yang et al. [171] proposed a novel semi-supervised disparity estimation network and incorporated it into direct sparse odometry (DSO) [88], achieving better accuracy than monocular DSO and attaining performance comparable to previous stereo DSO methods. Recently, Loo et al. [177] presented the CNN-SVO pipeline, which combines SVO [49] with a depth prediction network to improve the mapping and tracking of SVO. Czarnowski et al. [178] proposed a real-time probabilistic dense vSLAM system that integrates learned depth priors with classical vSLAM in a probabilistic factor-graph formulation, obtaining better accuracy than [169] in both trajectory and depth estimation. Combining depth estimation with vSLAM has thus proven to effectively improve the performance of traditional monocular vSLAM. Moreover, vSLAM can also be used to improve the accuracy of depth networks. For example, Tiwari et al. [179] proposed a self-improving framework: on the one hand, the predicted depth was used to perform RGB-D feature-based vSLAM; on the other hand, the pose computed by the RGB-D feature-based vSLAM, instead of that predicted by the pose network, was used to train the depth network, leading to more accurate depth estimation. The above works show how to integrate vSLAM with depth prediction via a deep neural network, and this is a promising direction for addressing the inherent limitations of traditional vSLAM, especially with respect to estimating the absolute scale and obtaining dense depth.
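A common way to couple a depth network with vSLAM, as in the systems above, is to treat the network prediction as a dense prior and refine it with the sparser but geometrically grounded depths produced by the SLAM optimization. The sketch below shows one hedged variant of that idea — fusing the two as independent Gaussian measurements in inverse-depth space; the variances and the NaN convention are assumptions for illustration, not the scheme of any cited system.

```python
import numpy as np

def fuse_inverse_depth(d_net, var_net, d_slam, var_slam):
    """Fuse a dense network depth map with (sparser) SLAM depths per pixel; NaN in d_slam
    marks pixels that the SLAM system did not triangulate."""
    inv_net = 1.0 / d_net
    inv_slam = 1.0 / np.where(np.isnan(d_slam), 1.0, d_slam)
    w_net = 1.0 / var_net
    w_slam = np.where(np.isnan(d_slam), 0.0, 1.0 / var_slam)   # untriangulated pixels get zero weight
    fused_inv = (w_net * inv_net + w_slam * inv_slam) / (w_net + w_slam)
    fused_var = 1.0 / (w_net + w_slam)
    return 1.0 / fused_inv, fused_var
```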
2) Learning-based pose estimation and vSLAM: Although pose networks have achieved real-time performance and satisfactory accuracy, existing learning-based pose estimation methods do not include a mapping thread [25], [136], which is important for perceiving the environmental structure. Besides, traditional direct methods rely heavily on the initial pose guess during tracking, which can result in unstable initialization and inaccurate tracking [84], [88]. Therefore, combining learning-based pose estimation with traditional vSLAM is a good way to overcome these deficiencies [172], [180]. Zhao et al. [172] designed a self-supervised pose prediction network and incorporated it into DSO [88]. They treated the output of the pose network as the initial pose guess of the direct VO, replacing the constant-motion model used in DSO; the initial pose was then improved by DSO's nonlinear optimization. This method achieved more robust initialization and tracking than traditional DSO when tested on the KITTI odometry sequences [181]. Yang et al. [180] also focused on this field and proposed a novel framework for monocular VO that exploits deep networks on three levels — deep depth, pose, and uncertainty estimation — which not only improves the robustness of DSO's initialization and tracking in challenging scenarios with photometric changes but also helps recover the scale of monocular VO. Different from the above frameworks, Wagstaff et al. [182] proposed to use a deep neural network to correct the pose estimated by traditional VO frameworks: a self-supervised deep pose-correction network is designed to estimate a pose correction rather than the full inter-frame pose. Teed et al. [183] proposed a new deep learning-based SLAM system, called DROID-SLAM, with strong performance and generalization, in which a GRU-based update operator is used for depth and pose updates.
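The pattern shared by [172] and [180] — use the network only to seed a classical optimizer — can be written in a few lines. The sketch below is a generic illustration under the assumption that `residual_fn` returns geometric or photometric residuals (for instance, the ones sketched in Section II-A) for a 6-vector pose parameterization; it is not the optimization used in DSO itself.

```python
import numpy as np
from scipy.optimize import least_squares

def refine_pose(pose_net_guess, residual_fn, *residual_args):
    """Refine a pose-network prediction with non-linear least squares: the learned guess
    replaces the constant-motion model, and geometry still has the final word."""
    result = least_squares(residual_fn, x0=np.asarray(pose_net_guess, dtype=float),
                           args=residual_args, method="trf", loss="huber")
    return result.x
```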
3) Learning-based image enhancement and vSLAM: Current monocular vSLAM methods have achieved good robustness in specific scenarios, such as sunny outdoor scenes
were deleted. Similarly, Cui et al. [193] combined the results of semantic segmentation from SegNet with ORB-SLAM2, proposing a new method called Semantic Optical Flow (SOF) to improve the detection of dynamic features and reasonably remove them during tracking. Unlike [173], [193], [195], [200], [202], which directly detect and delete dynamic features, recent studies have tried to further estimate and exploit the dynamic objects in the scene [203], [204]. Huang et al. [203] proposed a stereo VO framework that not only estimates the motion of the camera but also clusters the surrounding objects; a sliding-window optimization is used to solve for the motions of the camera and the surrounding dynamic objects. Yang et al. [204] dug deeper into the relationship between the motion of the camera and that of the surrounding objects, and found that the two parts can improve each other: since both dynamic and static objects provide long-range geometric and scale constraints, they help improve the camera pose estimation and constrain the monocular drift.

Scale recovery and visual localization: Scale ambiguity has always been a big challenge for monocular vSLAM, and it brings great uncertainty to accurate trajectory prediction and mapping [87]. Because objects in the real world have inherent properties, such as the height of a car, these properties can be used by monocular vSLAM to recover the absolute scale of a scene. Semantic information can therefore be utilized to build a bridge between objects and their properties, and it has shown its effectiveness in monocular vSLAM for scale recovery and for assisting localization. Semantic information introduces the size of objects in the environment into the vSLAM framework to handle the monocular scale ambiguity. Frost et al. [205] represented objects in the environment as spheres and recovered the scale from detected objects with a known radius. Similarly, in [170], Sucar et al. recovered the scale by setting a prior height for a given object class (cars): a detector is used to find the object and compute its height, and the scale is solved from the ratio of the computed height to the prior height. For localization, Stenborg et al. [206] proposed a novel method that localizes the camera based on semantically segmented images, which differs from traditional feature-based localization. To obtain more accurate localization, Bowman et al. [207] first integrated geometric, semantic, and IMU information into a single optimization framework and then associated scale information with semantic information. Lianos et al. [208] utilized the semantic information of the scene to establish mid-term constraints in the tracking process, thereby reducing the monocular drift in VO.
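The object-prior idea used by [170], [205] reduces, in its simplest form, to a ratio between a known physical dimension and the same dimension measured in the up-to-scale reconstruction. The following sketch, with a pinhole height measurement and hypothetical inputs, illustrates that computation; real systems estimate the scale robustly from many detections.

```python
def recover_scale(prior_height_m, bbox_top_px, bbox_bottom_px, depth_slam_units, fy):
    """Estimate the global metric scale of a monocular reconstruction from one detection of an
    object with known real height (e.g., a car). Multiply the map/trajectory by the result."""
    pixel_height = abs(bbox_bottom_px - bbox_top_px)
    height_slam_units = pixel_height * depth_slam_units / fy   # object height in arbitrary SLAM units
    return prior_height_m / height_slam_units
```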
High-level semantic perception: Autonomous systems need to be able to perform high-level tasks, and the point-cloud maps built by traditional vSLAM cannot fully meet the requirements of such tasks; a multi-level understanding of the surroundings is therefore essential. For instance, autonomous vehicles should understand which areas are drivable and which contain obstacles. However, the environments modeled by traditional vSLAM methods are represented as point clouds, which only contain the location of each point and cannot provide any high-level information about 3D objects. Although the current metric representations used by vSLAM support basic tasks such as localization and path planning, they are insufficient for advanced tasks such as human-robot interaction, 3D object detection, and tracking. Therefore, high-level and expressive representations will play a key role in the perception of autonomous systems. To obtain high-level perception, an object-level environment representation [209] was proposed in 2011 by modeling objects in advance and matching them in a global point-cloud map. Salas-Moreno et al. [210] extended the work in [209]: they created an object database to store 3D models generated by KinectFusion [44] and computed a global descriptor for every object model for quick matching based on [211], and they also demonstrated that object-level mapping is useful for accurate relocalization and loop detection. Contrary to building the models in advance, Sunderhauf et al. [212] proposed an online modeling method for generating point-cloud models of objects, along with a novel vSLAM framework that combines object detection with data association to obtain semantic maps. However, traditional geometry-based high-level environment perception requires modeling and matching objects in the environment in advance, which complicates the whole pipeline, i.e., only some objects can be modeled and recognized by these methods.

Compared to object-level maps, pixel-level semantic maps based on learning systems are more precise because they provide the semantic information of each point in the map. To improve the accuracy of segmentation and semantic mapping, conditional random fields (CRFs) have been widely used. A voxel-CRF model was presented in [213] to associate semantic information with 3D geometric structure, constructing a dense voxel-based map with semantic labels. For consistent 3D semantic reconstruction, Hermans et al. [214] proposed a novel 2D-3D label transfer method based on CRFs and Bayesian updates. Considering the intrinsic relationship between geometry and semantics, Kundu et al. [215] exploited these constraints and jointly optimized semantic segmentation with 3D reconstruction based on CRFs. Gan et al. [32] focused on the continuity of maps and valid queries at different resolutions, and exploited sparse Bayesian inference for accurate multi-class classification and dense probabilistic semantic mapping. With the help of semantic maps, autonomous systems can obtain a high-level understanding of their surroundings and can easily answer questions such as "which and where is the desk".

With the development of deep neural networks, several detection and segmentation methods based on deep learning have been proposed; methods for object detection and image segmentation are reviewed in [4] and [29]. Leveraging deep learning-based image segmentation for semantic mapping is also a hot topic. In [216], Li et al. combined LSD-SLAM [17] with CNN-based image segmentation to reconstruct a semi-dense semantic map. Cheng et al. [174] integrated a CRF-RNN-based segmentation algorithm with ORB-SLAM [52] and built a dense semantic point-cloud map using RGB-D data. Deep learning-based semantic segmentation has also been combined with dense SLAM frameworks to construct dense semantic maps.
McCormac et al. [175] incorporated CNN-based semantic predictions into a state-of-the-art dense vSLAM method, ElasticFusion [55]: they considered the multi-view segmentation results of the same 3D point and fused the semantic information in a probabilistic manner.
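The probabilistic fusion step mentioned above can be sketched as a recursive Bayesian update per surfel or voxel: every new per-frame softmax over classes multiplies the running class distribution. The snippet below is a minimal, generic version of that idea (kept in log-space for stability), not the exact scheme of [175].

```python
import numpy as np

def fuse_labels(running_log_probs, frame_softmax):
    """One Bayesian label-fusion step for a single 3D element: add log-likelihoods, renormalize."""
    log_p = running_log_probs + np.log(frame_softmax + 1e-12)
    log_p -= log_p.max()                       # numerical stability
    p = np.exp(log_p)
    return np.log(p / p.sum())
```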
When and where can we integrate learning methods to aid traditional frameworks such as vSLAM? There are two main ways of using deep learning to improve traditional frameworks: one is to enhance the quality of the inputs through learning systems, as in image enhancement for vSLAM; the other is to embed learning systems into the traditional framework, as in learned pose or depth estimation for vSLAM. For example, considering that traditional vSLAM cannot adapt well to challenging low-light environments, learning methods are used to enhance the stability of feature tracking by improving the quality of the input images [30]. Since dynamic objects affect feature matching, which in turn affects the pose and depth solutions of vSLAM, learning systems are used to detect dynamic objects and help eliminate dynamic features [195]. The basic idea, therefore, is to analyze the limitations and shortcomings of traditional vSLAM and then introduce learning systems to improve the traditional framework. We should also note that introducing learning systems brings some problems to the overall framework, such as increased computation and the dataset dependence of the learning components, and these problems remain to be addressed in the future.

III. AUTONOMOUS VISUAL NAVIGATION

After perceiving the surroundings and its own state, an autonomous robot plans appropriate trajectories according to its mission, its state, and the environmental information. A survey of geometry-based motion and control planning for autonomous vehicles is provided in [9]; therefore, in this section we mainly focus on autonomous visual navigation based on reinforcement learning, as shown in Fig. 4. We first present visual navigation methods and introduce the three main deep reinforcement learning approaches; then we review deep reinforcement learning-based visual navigation scenarios, methods, and environments.

Navigation can be defined as the process of accurately determining one's location, and planning and following a route from one place to another. With the help of advanced sensors and navigation algorithms, vision has been introduced into navigation [217], [218]. Compared with other navigation methods, such as magnetic navigation [219], inertial navigation [220], laser navigation [221], and GPS navigation [222], visual navigation has a relatively low cost and general simulation platforms; it has therefore become a mainstream research approach. Traditional visual navigation for mobile robots is generally based on three main paradigms: map-based navigation, map-building-based navigation, and mapless navigation [223].

Map-based navigation requires a global map of the current environment to make navigation decisions. For example, in [224], the robot used a generic map to accomplish symbolic navigation: it was guided not to locations with specific coordinates but by symbolic commands, which are general descriptions of the types of entities in the environment. In map-building-based navigation, robots use different sensors to perceive the environment and update the map. For example, in [225], the robot accomplished long-distance navigation with the help of a topological map: the global environment was built as a topological map and described graphically during navigation, while an appearance-based system and a visual servoing strategy qualitatively estimated the position of the robot and kept it on a specific trajectory using omnidirectional cameras. In mapless navigation, robots have no environment information and navigate using only the perceived information, without maps. Saeedi et al. [226] presented a general-purpose 3-D trajectory-tracking system that can be applied to unknown indoor and outdoor environments without mapping the scene, odometry, or sensors other than vision.

Reinforcement learning-based visual navigation: Since reinforcement learning is suitable for continuous motion-planning tasks in complex environments, reinforcement learning-based navigation has been preliminarily studied in recent years. Compared to traditional control methods, using reinforcement learning to address navigation does not require extensive theoretical modeling, and the resulting models tend to solve the problem end-to-end. By defining better state-space representations in complex and effectively infinite environments, reinforcement learning algorithms can be simplified and navigation efficiency can be improved. Jaradat et al. [227] used Q-learning to handle the problem of a mobile robot navigating in an unknown dynamic environment; owing to the infinite number of states in a dynamic environment, the authors limited the number of states based on a new definition of the state space, which improved the navigation speed. Similarly, Shi et al. [228] utilized Q-learning to predict partially missing QR codes in order to support image-based visual servoing; since a QR code has a large number of feature points, the authors took its rotation and translation between the current image and the desired image as the state space to reduce the computational complexity of reinforcement learning.
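The tabular Q-learning used in the navigation works above boils down to a single update rule applied while the robot interacts with its (discretized) state space. The sketch below shows that rule together with an epsilon-greedy action choice; the grid-world state/action encoding is an assumption for illustration.

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, eps, rng):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.95):
    """One Q-learning backup: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])

# Usage sketch: Q = np.zeros((n_states, n_actions)); rng = np.random.default_rng(0);
# inside the interaction loop, call epsilon_greedy(...) and then q_update(...).
```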
During training, adding auxiliary tasks, such as value function [229], reward prediction [230], map reconstruction [231], and edge segmentation [232], can improve reinforcement learning efficiency. Jaderberg et al. [230] proposed an unsupervised reinforcement and auxiliary learning algorithm that predicts and controls features of the sensorimotor stream by treating them as pseudo-rewards for reinforcement learning; during training, the agent additionally performs tasks such as pixel control, reward prediction, and value-function replay. In [231], the agent used only visual information (monocular camera images) for a navigation search task (finding an apple in a maze). The study considered two auxiliary tasks: in the first, a low-dimensional depth map is reconstructed at each time step, which is beneficial for obstacle avoidance and short-term path planning; the second involves loop-closure detection, wherein the agent learns to detect whether the current location has already been visited within the current trajectory. The experiments in these studies proved that co-training can significantly improve the learning speed and performance of the model.
[Fig. 4 — overview of reinforcement learning-based visual navigation: navigation paradigms (map-based, map-building-based, mapless); deep reinforcement learning methods (value-based, policy-based, actor-critic); open challenges (multi-modal, multi-task, domain shift, memory-inference, high-fidelity simulation).]
Recently, multi-modal reinforcement learning has become a hot and cutting-edge topic; it combines multi-modal information, such as language and video, with vision as inputs to the reinforcement learning model. To deal with navigation, vision-and-language navigation (VLN) [233] has been widely studied in recent years. VLN is a task that requires an embodied agent to execute natural-language instructions in a 3D environment; it demands a deep understanding of linguistic semantics, visual perception, and, most importantly, the alignment of the two. Most existing methods are based on a sequence-to-sequence architecture [234]-[236]: instructions are encoded as word sequences, navigation trajectories are decoded as a series of actions, and the models are enhanced by attention mechanisms and beam search. Connecting cross-modality training data is therefore key to improving training efficiency. Wang et al. [237] summarized VLN tasks and studied how to solve the three key challenges of VLN, namely cross-modal grounding, ill-posed feedback, and generalization. Chaplot et al. [238] proposed a dual-attention unit to disentangle the knowledge of words in the textual representations and of visual concepts in the visual representations, and to align them with each other; the fixed alignment allows the learned knowledge to be transferred across tasks. In response to the first and second challenges, the authors of [237] proposed the reinforced cross-modal matching (RCM) method, which uses reinforcement learning to connect local and global scenarios; in response to the third challenge, self-supervised imitation learning (SIL) was proposed, which helps the agent obtain better policies by imitating its own best past performance.

However, reinforcement learning-based navigation is limited to small action and sample spaces and generally operates in discrete settings, whereas more complex, realistic tasks tend to have large state spaces and continuous action spaces.

Deep reinforcement learning-based visual navigation: This paradigm has achieved promising results recently by combining the perceptual ability of deep learning with the sequential decision-making ability of reinforcement learning. Compared to reinforcement learning-based navigation, deep reinforcement learning equips robots with the ability to learn from high-dimensional data [239] to ensure precise perception and positioning, so that they can accomplish more complex tasks, for example navigating to different targets in a scene without retraining [240].

Deep reinforcement learning algorithms can be divided into two types: value-based and policy-based. Value-based algorithms learn the value function, or an approximation of it, and then derive a policy from the values. Deep Q-Network (DQN) was the first deep value-based algorithm. Tai et al. [241] first built an exploration policy for robots based on DQN in order to explore a corridor environment using only the depth information from an RGB-D sensor. Many extensions of DQN have been proposed to improve stability and efficiency during training. Dueling DQN [242] learns the state value directly through an advantage function, which makes it learn faster than DQN when some actions do not affect the environment. Double DQN [243], on the other hand, trains two Q-networks at the same time and chooses the smaller Q-value to reduce the overestimation error, which gives it more stable performance; combining dueling DQN with double DQN is therefore a good choice. Zeng et al. [244] utilized dueling double DQN with multi-step learning to handle a coverage-aware UAV navigation problem: the signal measured on the UAV was used to directly train the action-value function of the navigation policy, largely maintaining the relative stability of the target and improving the learning efficiency.
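The two DQN refinements just described are small, composable changes, sketched below in PyTorch: a dueling head that decomposes Q(s, a) into a state value and action advantages, and the double-DQN target in which the online network selects the next action while the target network evaluates it. Shapes and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling architecture: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    def __init__(self, feat_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(feat_dim, 1)
        self.advantage = nn.Linear(feat_dim, n_actions)

    def forward(self, features):
        v, a = self.value(features), self.advantage(features)
        return v + a - a.mean(dim=1, keepdim=True)

@torch.no_grad()
def double_dqn_target(online_q, target_q, reward, next_obs, done, gamma=0.99):
    """Double-DQN target: the online net picks the argmax action, the target net evaluates it,
    which reduces the overestimation bias discussed above."""
    next_action = online_q(next_obs).argmax(dim=1, keepdim=True)
    next_value = target_q(next_obs).gather(1, next_action).squeeze(1)
    return reward + gamma * (1.0 - done) * next_value
```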
The original DQN can only be applied to tasks with a discrete action space. To extend deep reinforcement learning to continuous control, many policy-based algorithms have been developed; policy-based algorithms optimize the policy directly instead of deriving it from a learned value function. For example, deep deterministic policy gradients (DDPG) [245] and the normalized advantage function (NAF) [246] are widely used; compared with NAF, DDPG needs fewer training parameters. Liu et al. [247] navigated a group of agents to provide long-term communication coverage, using a single agent to output control decisions for all agents via DDPG. However, in practice the DDPG algorithm requires researchers to spend a lot of time iterating and manually tuning rewards. To address this problem, one way is to use extensions of DDPG that improve sampling efficiency [248], [249]. Tai et al. [248] presented a model that uses asynchronous multi-threaded DDPG to collect data, which helped improve sampling efficiency; the resulting mapless motion planner could be trained end-to-end without any human-designed features or prior demonstrations.
Similarly, Zhang et al. [249] proposed asynchronous episodic DDPG, which improves learning efficiency with less training time in computationally complex environments; episodic control and a novel type of noise were introduced into the asynchronous framework to improve sample efficiency while increasing data throughput. Another solution is to introduce AutoRL, an evolutionary automation layer around reinforcement learning that optimizes the reward and the network hyperparameters while learning navigation policies. Chiang et al. [250] introduced AutoRL to train a group of agents with DDPG simultaneously over several generations, each agent having a slightly different reward function and hyperparameters, in order to optimize the true goal of reaching the destination.

The actor-critic (AC) algorithm [251] combines the two types of deep reinforcement learning algorithms mentioned above: the actor network chooses an action in a continuous action space, while the critic network performs single-step updates, which improves learning efficiency; in other words, it learns both the value function and the policy. The Asynchronous Advantage Actor-Critic (A3C) algorithm [252], an improvement of AC with multi-threading, is an on-policy method that uses newly collected samples for each gradient step; it interacts with the environment in multiple threads simultaneously, which helps avoid over-fitting to the training data. When robots autonomously explore unknown cluttered environments, A3C can equip them with cross-target generalization. To gain such generalization ability, Zhu et al. [240] took both the target image and the scene image as inputs to the deep reinforcement learning network; the agent then follows the output actions to navigate to the target. During training, new observations are evaluated through the A3C network so that the agent does not need to be retrained for a new target. Moreover, Duron et al. [253] added a semantic network to the visual network proposed in [240] to learn context from the objects present in the scene; the A3C network takes the features from the joint embedding layer as inputs and outputs the next action and the Q-value of the current state. Besides, off-policy algorithms such as soft actor-critic (SAC) [254] aim to reuse past experience, providing both sample-efficient learning and stability. Jesus et al. [255] applied SAC to learn continuous-action policies while maximizing the policy entropy in the mobile-robot exploration problem.
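The actor-critic idea above — train the actor with an advantage-weighted log-likelihood and the critic by value regression — is captured by a single loss function in the synchronous setting; A3C distributes the same computation across threads. The sketch below assumes a discrete action space and precomputed (bootstrapped) returns, and is illustrative rather than a reproduction of [251], [252].

```python
import torch
import torch.nn.functional as F

def actor_critic_loss(logits, values, actions, returns, value_coef=0.5, entropy_coef=0.01):
    """One advantage actor-critic update for a batch of transitions.
    logits: (B, n_actions) policy logits, values: (B,) critic estimates,
    actions: (B,) taken actions, returns: (B,) bootstrapped returns."""
    dist = torch.distributions.Categorical(logits=logits)
    advantages = returns - values.detach()          # critic gradient flows only through value_loss
    policy_loss = -(dist.log_prob(actions) * advantages).mean()
    value_loss = F.mse_loss(values, returns)
    entropy_bonus = dist.entropy().mean()
    return policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus
```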
With the development in deep reinforcement learning al- synthetic scenes from four categories, including kitchen, living
gorithms, the problem of vanishing gradient arises. That is, room, bedroom and bathroom, greatly exceeded some relative
as the number of hidden layers in neural networks increases, baselines.
the classification accuracy in the training process decreases. After training in simulation, it is difficult to ensure that the
The LSTM architectures [256] is a good way to tackle this agent achieves similar performances between the virtual scene
problem. When the input data is time-varying, LSTM can and the real scene because of the domain shift. One possible
capture the long-term dependencies of sequential data. Mnih solution is to add the vSLAM map in the navigation process,
To improve the performance of deep reinforcement learning networks, the training data must also be considered carefully in experiments. Sufficient and varied training data are the basis of convincing results during training, yet in the real world such data are often unobtainable or missing. To handle this problem, simulation frameworks can be used to train agents. For example, in [240], the first simulation framework, AI2-THOR (The House Of inteRactions), was developed to provide an environment with high-quality 3D scenes as well as a physics engine. A robot in this simulation environment can therefore collect a large number of training samples efficiently, which improves data utilization. Specifically, Wu et al. [260] analyzed the cross-target and cross-scene generalization ability of target-driven navigation models on AI2-THOR; in their evaluation, conducted in 120 synthetic scenes from four categories (kitchen, living room, bedroom, and bathroom), the models greatly exceeded several relevant baselines.
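Collecting experience in such a simulator typically amounts to stepping a controller with discrete actions and storing the returned frames. The sketch below assumes the ai2thor Python package and its Controller interface (argument names and metadata fields differ slightly across AI2-THOR versions); the random action choice and the stored tuples are placeholders for illustration, not the training pipeline of [240] or [260].

```python
# Minimal sketch of experience collection in an AI2-THOR scene. Assumes the `ai2thor`
# Python package; Controller arguments and metadata fields vary across versions, so
# treat the exact calls as an approximation rather than a fixed API.
import random
from ai2thor.controller import Controller

ACTIONS = ["MoveAhead", "RotateLeft", "RotateRight"]

controller = Controller(scene="FloorPlan1", gridSize=0.25)
trajectory = []
for _ in range(100):
    action = random.choice(ACTIONS)          # stand-in for a learned navigation policy
    event = controller.step(action=action)   # advance the simulator by one step
    trajectory.append((
        event.frame,                          # RGB observation (numpy array)
        action,
        event.metadata["lastActionSuccess"],  # whether the action succeeded
    ))
controller.stop()
```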
After training in simulation, it is difficult to ensure that the agent achieves similar performance in the virtual scene and in the real scene because of the domain shift. One possible solution is to add a vSLAM map to the navigation process, which helps to narrow the performance gap between simulation and the real environment. On the basis of [250], Francis et al. [261] introduced the vSLAM map into robot navigation in order to reconstruct a motion probability map. Since the vSLAM map is itself noisy, it can compensate for the difference in performance between robots in the virtual and the real environment caused by their different levels of noise. From another perspective, constructing an exploration framework also bridges the gap between simulation and the real environment. Li et al. [232] constructed a framework consisting of mapping, decision, and planning modules, where each module is independent and can be realized by a variety of methods. Compared with traditional end-to-end deep reinforcement learning methods, which take raw sensor data as input and directly output a control policy, the proposed framework-based deep reinforcement learning algorithm learned faster and achieved better generalization performance on different maps.
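The benefit of such a modular design is that the mapping, decision, and planning components sit behind narrow interfaces, so each can be a classical or a learned method and can be replaced without retraining the whole pipeline end to end. The sketch below illustrates only this decoupling with hypothetical interfaces; it is not the implementation of [232].

```python
# Minimal sketch of a modular exploration step (perceive -> decide -> plan). The
# Protocol interfaces and the function below are hypothetical illustrations of the
# modular idea, not the actual framework of [232].
from typing import Any, List, Protocol, Tuple

class Mapper(Protocol):
    def update(self, observation: Any) -> Any: ...                    # returns the current map

class DecisionMaker(Protocol):
    def next_goal(self, local_map: Any) -> Tuple[float, float]: ...   # e.g., a frontier point

class Planner(Protocol):
    def plan(self, local_map: Any, goal: Tuple[float, float]) -> List[str]: ...

def exploration_step(observation: Any, mapper: Mapper,
                     decider: DecisionMaker, planner: Planner) -> List[str]:
    """One exploration step: each stage can be classical or learned."""
    local_map = mapper.update(observation)
    goal = decider.next_goal(local_map)   # e.g., a deep RL decision module
    return planner.plan(local_map, goal)  # e.g., a classical planner such as A*
```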
IV. DISCUSSION

A. Deep learning-based visual perception

The constructed map is an intuitive representation of the perceived scene and the basis for intelligent robots to autonomously perform advanced tasks. Mapping has undergone a development process from 2D to 3D, from sparse to dense, and from topological to semantic, among others. Furthermore, although several methods have been proposed to improve localization accuracy, many challenges remain to be solved. Therefore, we summarize the challenges and promising directions of perception as follows.

• Accurate perception: Although learning-based perception algorithms have made great progress, their accuracy, especially that of unsupervised learning methods, still has much room for improvement. Mining more effective training constraints from the aspects of geometry, cross-task relationships, and interpretability; utilizing novel learning frameworks, such as meta-learning, curriculum learning, and lifelong learning, to make full use of the data; and developing more efficient neural network frameworks for feature extraction and inference are all promising directions.

• Robust perception: Robustness is one of the most important indicators for the real-world application of perception algorithms. Although current learning systems achieve good accuracy on benchmark datasets, the networks are affected by sensor noise, lighting, and scene changes when deployed in real environments. Therefore, robust environmental perception, ego-motion perception, and navigation with learning systems under different conditions in the same scene (such as different seasons, weather, lighting conditions, and source sensors, indoor and outdoor, day and night) remain problems to be handled.

• Real-time perception: Real-time perception is important for autonomous systems in practical applications. Current high-accuracy networks are based on complex structures with a huge number of parameters and FLOPs, so the training and application of deep neural networks place a high demand on the computing power of the system, which limits practical applications. Using novel lightweight learning architectures, such as lightweight networks and knowledge distillation, to improve the real-time performance of perception networks will be another trend.

• Geometry assist in perception: Utilizing geometric priors built by a learning framework or a knowledge graph in the perception of autonomous systems is helpful and a promising direction with broad development prospects. For example, semantic labels predicted by deep learning can be correlated with a knowledge graph of objects to obtain prior geometric information, such as object sizes; thereby, detailed scale, structure, and 3D information can be obtained.

• Representation of the environment based on deep learning: Representing the environment with deep learning is another challenge and a promising direction. Although previous works such as [169], [171], [177] leveraged deep learning in mapping, the maps of these methods are still built traditionally. With the development of NeRF algorithms [262], [263], it becomes possible to represent the scene with neural networks; most recent work has tried to construct SLAM systems based on NeRF [264], and this is quite an interesting and promising direction.

• Multi-sensor data fusion based on deep learning: Fusing information from multiple sensors (IMU, LiDAR, event-based camera, or infrared camera) or multiple agents is an effective way to deal with poor-quality input images affected by motion blur and to recover scale information. However, expressing the additional sensor information explicitly in the training constraints is a significant challenge. For example, current methods leverage IMU data together with images for pose estimation in a supervised manner [139], [140], but the IMU information is not represented in the loss function; thus, whether the IMU data play an important role in pose estimation, and what role they play, is unknown and not yet explainable.

• Integration of deep learning and traditional frameworks: Although a lot of relevant research has been summarized above, much work remains to be done in this direction. The basic idea is to improve traditional frameworks, with the help of deep learning, by analyzing their limitations and shortcomings. For example, considering that current direct methods rely heavily on the photometric consistency assumption, deep learning can be used to perform a photometric correction or to transform images into photometric-consistent feature maps, as sketched after this list.
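As an illustration of the last point, the photometric residual of a direct method can be evaluated on learned features instead of raw intensities, so that the consistency assumption holds in a representation that is less sensitive to lighting. The sketch below shows only this substitution; the toy feature network, the externally supplied warping grid, and the loss form are assumptions for illustration, not a specific method from this survey.

```python
# Minimal sketch of a feature-metric photometric loss: images are mapped to learned
# feature maps and the direct-method residual is computed between warped source
# features and target features. The feature network and the warping grid are
# illustrative assumptions, not a method surveyed above.
import torch.nn as nn
import torch.nn.functional as F

feat_net = nn.Sequential(                      # toy feature extractor
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1),
)

def feature_metric_loss(target_img, source_img, warp_grid):
    """Direct-method residual on learned features instead of raw pixel intensities.

    warp_grid: (B, H, W, 2) sampling grid induced by the estimated depth and pose,
    produced elsewhere in the self-supervised pipeline.
    """
    f_target = feat_net(target_img)
    f_source = feat_net(source_img)
    f_warped = F.grid_sample(f_source, warp_grid, align_corners=True)
    return (f_target - f_warped).abs().mean()
```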
B. Reinforcement learning based visual navigation

There is still a long way to go before reinforcement learning can be applied to autonomous systems; therefore, many challenges remain to be addressed.

• Sparse rewards: Rewards have a great impact on the learning results during training, but the problem of sparse rewards in reinforcement learning has not been well solved. When the training tasks are complicated, the probability of reaching the target (and thus obtaining positive rewards) through random exploration becomes very low. Therefore, it is difficult for reinforcement learning algorithms to converge by relying only on the positive rewards.
[3] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, “Deep direct reinforce- [26] J. Wang, Y. Hong, J. Wang, J. Xu, Y. Tang, Q.-L. Han, and K. Jürgen,
ment learning for financial signal representation and trading,” IEEE “Cooperative and competitive multi-agent systems: From optimization
Transactions on Neural Networks and Learning Systems, vol. 28, no. 3, to games,” IEEE/CAA Journal of Automatica Sinica, 2022.
pp. 653–664, 2016. [27] B. Kiumarsi, K. G. Vamvoudakis, H. Modares, and F. L. Lewis,
[4] Z.-Q. Zhao, P. Zheng, S.-t. Xu, and X. Wu, “Object detection with “Optimal and autonomous control using reinforcement learning: A
deep learning: A review,” IEEE Transactions on Neural Networks and survey,” IEEE Transactions on Neural Networks and Learning Systems,
Learning Systems, vol. 30, no. 11, pp. 3212–3232, 2019. vol. 29, no. 6, pp. 2042–2062, 2017.
[5] F. Qian, W. Zhong, and W. Du, “Fundamental theories and key tech- [28] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised
nologies for smart and optimal manufacturing in the process industry,” learning of depth and ego-motion from video,” in Proceedings of the
Engineering, vol. 3, no. 2, pp. 154–160, 2017. IEEE Conference on Computer Vision and Pattern Recognition, 2017,
[6] M. Pfeiffer, M. Schaeuble, J. Nieto, R. Siegwart, and C. Cadena, pp. 1851–1858.
“From perception to decision: A data-driven approach to end-to-end [29] S. Ghosh, N. Das, I. Das, and U. Maulik, “Understanding deep learning
motion planning for autonomous ground robots,” in Proc. 2017 IEEE techniques for image segmentation,” ACM Computing Surveys (CSUR),
International Conference on Robotics and Automation (ICRA), 2017, vol. 52, no. 4, p. 73, 2019.
pp. 1527–1533. [30] E. Jung, N. Yang, and D. Cremers, “Multi-frame GAN: Image enhance-
[7] C. Badue, R. Guidolini, R. V. Carneiro, P. Azevedo, V. B. Cardoso, ment for stereo visual odometry in low light,” Proceedings of Machine
A. Forechi, L. Jesus, R. Berriel, T. Paixão, F. Mutz et al., “Self-driving Learning Research (PMLR), vol. 100, pp. 651–660, 2020.
cars: A survey,” arXiv preprint arXiv:1901.04407, 2019. [31] M. G. Jadidi, L. Gan, S. A. Parkison, J. Li, and R. M. Eustice,
[8] S. Grigorescu, B. Trasnea, T. Cocias, and G. Macesanu, “A survey “Gaussian processes semantic map representation,” arXiv preprint
of deep learning techniques for autonomous driving,” Journal of Field arXiv:1707.01532, 2017.
Robotics, vol. 37, no. 3, pp. 362–386, 2020. [32] L. Gan, M. G. Jadidi, S. A. Parkison, and R. M. Eustice, “Sparse
[9] B. Paden, M. Čáp, S. Z. Yong, D. Yershov, and E. Frazzoli, “A survey of bayesian inference for dense semantic mapping,” arXiv preprint
motion planning and control techniques for self-driving urban vehicles,” arXiv:1709.07973, 2017.
IEEE Transactions on Intelligent Vehicles, vol. 1, no. 1, pp. 33–55, [33] J. Bruce, N. Sünderhauf, P. Mirowski, R. Hadsell, and M. Milford,
2016. “One-shot reinforcement learning for robot navigation with interactive
[10] G. Loianno and V. Kumar, “Cooperative transportation using small replay,” arXiv preprint arXiv:1711.10137, 2017.
quadrotors using monocular vision and inertial sensing,” IEEE Robotics [34] F. Fraundorfer and D. Scaramuzza, “Visual odometry: Part ii: Matching,
and Automation Letters, vol. 3, no. 2, pp. 680–687, 2017. robustness, optimization, and applications,” IEEE Robotics & Automa-
[11] F. Ingrand and M. Ghallab, “Deliberation for autonomous robots: A tion Magazine, vol. 19, no. 2, pp. 78–90, 2012.
survey,” Artificial Intelligence, vol. 247, pp. 10–44, 2017. [35] A. J. Davison, “Real-time simultaneous localisation and mapping with
[12] A. Elfes, “Using occupancy grids for mobile robot perception and a single camera.” in Proc. International Conference on Computer Vision
navigation,” Computer, vol. 22, no. 6, pp. 46–57, 1989. (ICCV), vol. 3, 2003, pp. 1403–1410.
[13] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, [36] A. J. Davison, Y. G. Cid, and N. Kita, “Real-time 3D SLAM with wide-
I. Reid, and J. J. Leonard, “Past, present, and future of simultaneous angle vision,” IFAC Proceedings Volumes, vol. 37, no. 8, pp. 868–873,
localization and mapping: Toward the robust-perception age,” IEEE 2004.
Transactions on Robotics, vol. 32, no. 6, pp. 1309–1332, 2016. [37] W. Jeong and K. M. Lee, “CV-SLAM: a new ceiling vision-based
[14] E. Frazzoli, M. A. Dahleh, and E. Feron, “Maneuver-based motion SLAM technique,” in Proc. 2005 IEEE/RSJ International Conference
planning for nonlinear systems with symmetries,” IEEE Transactions on Intelligent Robots and Systems, 2005, pp. 3195–3200.
on Robotics, vol. 21, no. 6, pp. 1077–1091, 2005. [38] P. Smith, I. D. Reid, and A. J. Davison, “Real-time monocular
[15] M. Sualeh and G.-W. Kim, “Simultaneous Localization and Mapping in SLAM with straight lines,” Proceedings of the British Machine Vision
the Epoch of Semantics: A Survey,” International Journal of Control, Conference, 2006.
Automation and Systems, vol. 17, no. 3, pp. 729–742, 2019. [39] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, “MonoSLAM:
[16] R. Mur-Artal and J. D. Tardós, “ORB-SLAM2: An open-source SLAM Real-time single camera SLAM,” IEEE Transactions on Pattern Anal-
system for monocular, stereo, and RGB-D cameras,” IEEE Transactions ysis and Machine Intelligence, no. 6, pp. 1052–1067, 2007.
on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017. [40] G. Klein and D. Murray, “Parallel tracking and mapping for small
[17] J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: Large-scale direct AR workspaces,” in Proceedings of the 2007 6th IEEE and ACM
monocular SLAM,” in Proc. European Conference on Computer Vision. International Symposium on Mixed and Augmented Reality, 2007, pp.
Springer, 2014, pp. 834–849. 1–10.
[18] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “DTAM: [41] G. Silveira, E. Malis, and P. Rives, “An efficient direct approach to
Dense tracking and mapping in real-time,” in Proc. 2011 International visual SLAM,” IEEE Transactions on Robotics, vol. 24, no. 5, pp.
Conference on Computer Vision, 2011, pp. 2320–2327. 969–979, 2008.
[19] Y. Xiang and D. Fox, “DA-RNN: Semantic mapping with data as- [42] D. Migliore, R. Rigamonti, D. Marzorati, M. Matteucci, and D. G.
sociated recurrent neural networks,” arXiv preprint arXiv:1703.03098, Sorrenti, “Use a single camera for simultaneous localization and
2017. mapping with mobile object tracking in dynamic environments,” in
[20] S. Pouyanfar, S. Sadiq, Y. Yan, H. Tian, Y. Tao, M. P. Reyes, M.- Proc. ICRA Workshop on Safe Navigation in Open and Dynamic
L. Shyu, S.-C. Chen, and S. Iyengar, “A survey on deep learning: Environments: Application to Autonomous Vehicles, 2009, pp. 12–17.
Algorithms, techniques, and applications,” ACM Computing Surveys [43] R. A. Newcombe and A. J. Davison, “Live dense reconstruction with
(CSUR), vol. 51, no. 5, p. 92, 2019. a single moving camera,” in Proc. 2010 IEEE Computer Society
[21] B. Frénay and M. Verleysen, “Classification in the presence of label Conference on Computer Vision and Pattern Recognition, 2010, pp.
noise: a survey,” IEEE Transactions on Neural Networks and Learning 1498–1505.
Systems, vol. 25, no. 5, pp. 845–869, 2013. [44] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J.
[22] P. Druzhkov and V. Kustikova, “A survey of deep learning methods and Davison, P. Kohli, J. Shotton, S. Hodges, and A. W. Fitzgibbon,
software tools for image classification and object detection,” Pattern “Kinectfusion: Real-time dense surface mapping and tracking.” in Proc.
Recognition and Image Analysis, vol. 26, no. 1, pp. 9–15, 2016. 10th IEEE International Symposium on Mixed and Augmented Reality
[23] A. R. Sharma and P. Kaushik, “Literature survey of statistical, deep (ISMAR), vol. 11, no. 2011, 2011, pp. 127–136.
and reinforcement learning in natural language processing,” in Proc. [45] M. Kaess, M. Fallon, H. Johannsson, and J. Leonard, “Kintinuous:
2017 International Conference on Computing, Communication and Spatially extended kinectfusion,” CSAIL Tech. Rep., 2012.
Automation (ICCCA), 2017, pp. 350–354. [46] D. Weikersdorfer, R. Hoffmann, and J. Conradt, “Simultaneous local-
[24] M. Lippi, M. A. Montemurro, M. Degli Esposti, and G. Cristadoro, ization and mapping for event-based vision systems,” in International
“Natural language statistical features of lstm-generated texts,” IEEE Conference on Computer Vision Systems. Springer, 2013, pp. 133–142.
Transactions on Neural Networks and Learning Systems, vol. 30, [47] F. Endres, J. Hess, J. Sturm, D. Cremers, and W. Burgard, “3-D
no. 11, pp. 3326–3337, 2019. mapping with an RGB-D camera,” IEEE Transactions on Robotics,
[25] A. Kendall, M. Grimes, and R. Cipolla, “PoseNet: A convolutional vol. 30, no. 1, pp. 177–187, 2013.
network for real-time 6-dof camera relocalization,” in Proceedings of [48] M. Li and A. I. Mourikis, “High-precision, consistent EKF-based
the IEEE International Conference on Computer Vision, 2015, pp. visual-inertial odometry,” The International Journal of Robotics Re-
2938–2946. search, vol. 32, no. 6, pp. 690–711, 2013.
[49] C. Forster, M. Pizzoli, and D. Scaramuzza, “SVO: Fast semi-direct [71] H. Huang, H. Ye, Y. Sun, and M. Liu, “Monocular visual odometry
monocular visual odometry,” in Proc. 2014 IEEE international confer- using learned repeatability and description,” in 2020 IEEE International
ence on robotics and automation (ICRA), 2014, pp. 15–22. Conference on Robotics and Automation (ICRA). IEEE, 2020, pp.
[50] D. Weikersdorfer, D. B. Adrian, D. Cremers, and J. Conradt, “Event- 8913–8919.
based 3D SLAM with a depth-augmented dynamic vision sensor,” in [72] M. Ferrera, A. Eudes, J. Moras, M. Sanfourche, and G. Le Besnerais,
Proc. 2014 IEEE International Conference on Robotics and Automation “Ov2slam: A fully online and versatile visual slam for real-time
(ICRA), 2014, pp. 359–364. applications,” IEEE Robotics and Automation Letters, vol. 6, no. 2,
[51] J. Engel, J. Stückler, and D. Cremers, “Large-scale direct SLAM with pp. 1399–1406, 2021.
stereo cameras,” in Proc. 2015 IEEE/RSJ International Conference on [73] C. Campos, R. Elvira, J. J. G. Rodrı́guez, J. M. Montiel, and J. D.
Intelligent Robots and Systems (IROS), 2015, pp. 1935–1942. Tardós, “Orb-slam3: An accurate open-source library for visual, visual–
[52] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “ORB-SLAM: a inertial, and multimap slam,” IEEE Transactions on Robotics, 2021.
versatile and accurate monocular SLAM system,” IEEE Transactions [74] M. Montemerlo, S. Thrun, D. Koller, B. Wegbreit et al., “FastSLAM:
on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015. A factored solution to the simultaneous localization and mapping
[53] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, problem,” in Proceedings of the AAAI 18th National Conference on
“Keyframe-based visual–inertial odometry using nonlinear optimiza- Artificial Intelligence, pp. 593–598, 2002.
tion,” The International Journal of Robotics Research, vol. 34, no. 3, [75] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit, “FastSLAM 2.0:
pp. 314–334, 2015. An improved particle filtering algorithm for simultaneous localization
[54] M. Bloesch, S. Omari, M. Hutter, and R. Siegwart, “Robust visual and mapping that provably converges,” in Proceedings of the 18th
inertial odometry using a direct EKF-based approach,” in Proc. 2015 International Joint Conference on Artificial intelligence, 2003, pp.
IEEE/RSJ International Conference on Intelligent Robots and Systems 1151–1156.
(IROS), 2015, pp. 298–304. [76] D. Scaramuzza and F. Fraundorfer, “Visual odometry [tutorial],” IEEE
[55] T. Whelan, R. F. Salas-Moreno, B. Glocker, A. J. Davison, and Robotics and Automation Magazine, vol. 18, no. 4, pp. 80–92, 2011.
S. Leutenegger, “ElasticFusion: Real-time dense SLAM and light [77] T. Whelan, H. Johannsson, M. Kaess, J. J. Leonard, and J. McDonald,
source estimation,” Proc. The International Journal of Robotics Re- “Robust real-time visual odometry for dense RGB-D mapping,” Proc.
search, vol. 35, no. 14, pp. 1697–1716, 2016. 2013 IEEE International Conference on Robotics and Automation, pp.
[56] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, “On-manifold 1–8, 2013.
preintegration for real-time visual–inertial odometry,” IEEE Transac- [78] Y. Zhou, G. Gallego, H. Rebecq, L. Kneip, H. Li, and D. Scaramuzza,
tions on Robotics, vol. 33, no. 1, pp. 1–21, 2016. “Semi-dense 3D reconstruction with a stereo event camera,” in Pro-
[57] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, ceedings of the European Conference on Computer Vision (ECCV),
“SVO: Semidirect visual odometry for monocular and multicamera 2018, pp. 235–251.
systems,” IEEE Transactions on Robotics, vol. 33, no. 2, pp. 249–265, [79] P. Geneva, J. Maley, and G. Huang, “An efficient schmidt-EKF for
2016. 3D visual-inertial SLAM,” in Proceedings of the IEEE Conference on
[58] H. Rebecq, T. Horstschäfer, G. Gallego, and D. Scaramuzza, “EVO: Computer Vision and Pattern Recognition, 2019, pp. 12 105–12 115.
A geometric approach to event-based 6-DOF parallel tracking and [80] D. G. Lowe, “Distinctive image features from scale-invariant key-
mapping in real time,” IEEE Robotics and Automation Letters, vol. 2, points,” International Journal of Computer Vision, vol. 60, no. 2, pp.
no. 2, pp. 593–600, 2016. 91–110, 2004.
[59] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE [81] H. Bay, T. Tuytelaars, and L. Van Gool, “SURF: Speeded up robust fea-
Transactions on Pattern Analysis and Machine Intelligence, vol. 40, tures,” in Proc. European Conference on Computer Vision. Springer,
no. 3, pp. 611–625, 2017. 2006, pp. 404–417.
[60] R. Wang, M. Schworer, and D. Cremers, “Stereo dso: Large-scale direct [82] E. Rublee, V. Rabaud, K. Konolige, and G. R. Bradski, “ORB:
sparse visual odometry with stereo cameras,” in Proceedings of the An efficient alternative to SIFT or SURF.” in Proc. International
IEEE International Conference on Computer Vision, 2017, pp. 3903– Conference on Computer Vision (ICCV), vol. 11, no. 1. Citeseer,
3911. 2011, p. 2.
[61] L. Von Stumberg, V. Usenko, and D. Cremers, “Direct sparse visual- [83] X. Gao, R. Wang, N. Demmel, and D. Cremers, “Ldso: Direct sparse
inertial odometry using dynamic marginalization,” in Proc. 2018 IEEE odometry with loop closure,” in Proceedings of the 2018 IEEE/RSJ
International Conference on Robotics and Automation (ICRA), 2018, International Conference on Intelligent Robots and Systems (IROS),
pp. 2510–2517. 2018, pp. 2198–2204.
[62] A. Dai, M. Nießner, M. Zollhöfer, S. Izadi, and C. Theobalt, “Bundlefu- [84] L. von Stumberg, P. Wenzel, Q. Khan, and D. Cremers, “Gn-net: The
sion: Real-time globally consistent 3D reconstruction using on-the-fly gauss-newton loss for multi-weather relocalization,” IEEE Robotics and
surface reintegration,” ACM Transactions on Graphics (ToG), vol. 36, Automation Letters, vol. 5, no. 2, pp. 890–897, 2020.
no. 3, p. 24, 2017. [85] G. Pascoe, W. Maddern, M. Tanner, P. Piniés, and P. Newman,
[63] R. Mur-Artal and J. D. Tardós, “Visual-inertial monocular SLAM with “NID-SLAM: Robust monocular SLAM using normalised information
map reuse,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. distance,” in Proceedings of the IEEE Conference on Computer Vision
796–803, 2017. and Pattern Recognition, 2017, pp. 1435–1444.
[64] D. Schlegel, M. Colosi, and G. Grisetti, “ProSLAM: Graph SLAM [86] K. Yousif, A. Bab-Hadiashar, and R. Hoseinnezhad, “An overview to
from a programmer’s perspective,” in Proc. 2018 IEEE International visual odometry and visual SLAM: Applications to mobile robotics,”
Conference on Robotics and Automation (ICRA), 2018, pp. 1–9. Intelligent Industrial Systems, vol. 1, no. 4, pp. 289–311, 2015.
[65] K. Sun, K. Mohta, B. Pfrommer, M. Watterson, S. Liu, Y. Mulgaonkar, [87] T. Taketomi, H. Uchiyama, and S. Ikeda, “Visual SLAM algorithms:
C. J. Taylor, and V. Kumar, “Robust stereo visual inertial odometry for a survey from 2010 to 2016,” IPSJ Transactions on Computer Vision
fast autonomous flight,” IEEE Robotics and Automation Letters, vol. 3, and Applications, vol. 9, no. 1, p. 16, 2017.
no. 2, pp. 965–972, 2018. [88] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a
[66] H. Liu, M. Chen, G. Zhang, H. Bao, and Y. Bao, “ICE-BA: Incremental, single image using a multi-scale deep network,” in Advances in Neural
consistent and efficient bundle adjustment for visual-inertial SLAM,” in Information Processing Systems, 2014, pp. 2366–2374.
Proceedings of the IEEE Conference on Computer Vision and Pattern [89] B. Li, C. Shen, Y. Dai, A. Van Den Hengel, and M. He, “Depth and
Recognition, 2018, pp. 1974–1982. surface normal estimation from monocular images using regression on
[67] T. Qin, P. Li, and S. Shen, “VINS-mono: A robust and versatile monoc- deep features and hierarchical CRFS,” in Proceedings of the IEEE
ular visual-inertial state estimator,” IEEE Transactions on Robotics, Conference on Computer Vision and Pattern Recognition, 2015, pp.
vol. 34, no. 4, pp. 1004–1020, 2018. 1119–1127.
[68] S. H. Lee and J. Civera, “Loosely-coupled semi-direct monocular [90] F. Liu, C. Shen, G. Lin, and I. Reid, “Learning depth from single
SLAM,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. monocular images using deep convolutional neural fields,” IEEE Trans-
399–406, 2018. actions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10,
[69] T. Schops, T. Sattler, and M. Pollefeys, “BAD SLAM: Bundle Adjusted pp. 2024–2039, 2015.
Direct RGB-D SLAM,” in Proceedings of the IEEE Conference on [91] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy,
Computer Vision and Pattern Recognition, 2019, pp. 134–144. and T. Brox, “A large dataset to train convolutional networks for
[70] F. Schenk and F. Fraundorfer, “RESLAM: A real-time robust edge- disparity, optical flow, and scene flow estimation,” in Proceedings of
based SLAM system,” in Proc. 2019 International Conference on the IEEE Conference on Computer Vision and Pattern Recognition,
Robotics and Automation (ICRA), 2019, pp. 154–160. 2016, pp. 4040–4048.
[92] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, the IEEE Conference on Computer Vision and Pattern Recognition,
A. Bachrach, and A. Bry, “End-to-end learning of geometry and context 2018, pp. 340–349.
for deep stereo regression,” in Proceedings of the IEEE International [112] R. Li, S. Wang, Z. Long, and D. Gu, “UndeepVO: Monocular visual
Conference on Computer Vision, 2017, pp. 66–75. odometry through unsupervised deep learning,” in Proceedings of the
[93] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep ordi- 2018 IEEE International Conference on Robotics and Automation
nal regression network for monocular depth estimation,” in Proceedings (ICRA), 2018, pp. 7286–7291.
of the IEEE Conference on Computer Vision and Pattern Recognition, [113] Y. Wang, P. Wang, Z. Yang, C. Luo, Y. Yang, and W. Xu, “Unos: Uni-
2018, pp. 2002–2011. fied unsupervised optical-flow and stereo-depth estimation by watching
[94] J. M. Facil, B. Ummenhofer, H. Zhou, L. Montesano, T. Brox, videos,” in Proceedings of the IEEE Conference on Computer Vision
and J. Civera, “Cam-convs: Camera-aware multi-scale convolutions and Pattern Recognition, 2019, pp. 8071–8081.
for single-view depth,” in Proceedings of the IEEE Conference on [114] J. Hur and S. Roth, “Self-supervised monocular scene flow estimation,”
Computer Vision and Pattern Recognition, 2019, pp. 11 826–11 835. in Proceedings of the IEEE/CVF Conference on Computer Vision and
[95] L. Huynh, P. Nguyen-Ha, J. Matas, E. Rahtu, and J. Heikkila, “Guiding Pattern Recognition, 2020, pp. 7396–7405.
monocular depth estimation using depth-attention volume,” in Proceed- [115] C. Chi, Q. Wang, T. Hao, P. Guo, and X. Yang, “Feature-level
ings of the European Conference on Computer Vision (ECCV), 2020, collaboration: Joint unsupervised learning of optical flow, stereo depth
pp. 581–597. and camera motion,” in Proceedings of the IEEE/CVF Conference on
[96] M. Ramamonjisoa, Y. Du, and V. Lepetit, “Predicting sharp and Computer Vision and Pattern Recognition (CVPR), June 2021, pp.
accurate occlusion boundaries in monocular depth estimation using 2463–2473.
displacement fields,” in Proceedings of the IEEE/CVF Conference on [116] Y. Jiao, T. D. Tran, and G. Shi, “Effiscene: Efficient per-pixel rigidity
Computer Vision and Pattern Recognition, 2020, pp. 14 648–14 657. inference for unsupervised joint learning of optical flow, depth, camera
[97] T. Chen, S. An, Y. Zhang, C. Ma, H. Wang, X. Guo, and W. Zheng, pose and motion segmentation,” in Proceedings of the IEEE/CVF
“Improving monocular depth estimation by leveraging structural aware- Conference on Computer Vision and Pattern Recognition (CVPR), June
ness and complementary datasets,” in Proceedings of the European 2021, pp. 5538–5547.
Conference on Computer Vision (ECCV), 2020, pp. 90–108. [117] H. Jung, E. Park, and S. Yoo, “Fine-grained semantics-aware represen-
[98] R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V. Koltun, tation enhancement for self-supervised monocular depth estimation,” in
“Towards robust monocular depth estimation: Mixing datasets for zero- Proceedings of the IEEE/CVF International Conference on Computer
shot cross-dataset transfer,” IEEE Transactions on Pattern Analysis & Vision, 2021, pp. 12 642–12 652.
Machine Intelligence, 2020. [118] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and
[99] R. Garg, V. K. BG, G. Carneiro, and I. Reid, “Unsupervised cnn K. Fragkiadaki, “SfM-Net: Learning of structure and motion from
for single view depth estimation: Geometry to the rescue,” in Proc. video,” arXiv preprint arXiv:1704.07804, 2017.
European Conference on Computer Vision. Springer, 2016, pp. 740– [119] Z. Yang, P. Wang, W. Xu, L. Zhao, and R. Nevatia, “Unsupervised
756. learning of geometry with edge-aware depth-normal consistency,” arXiv
[100] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocu- preprint arXiv:1711.03665, 2017.
lar depth estimation with left-right consistency,” in Proceedings of the [120] R. Mahjourian, M. Wicke, and A. Angelova, “Unsupervised learning
IEEE Conference on Computer Vision and Pattern Recognition, 2017, of depth and ego-motion from monocular video using 3D geometric
pp. 270–279. constraints,” in Proceedings of the IEEE Conference on Computer
[101] Y. Kuznietsov, J. Stuckler, and B. Leibe, “Semi-supervised deep Vision and Pattern Recognition, 2018, pp. 5667–5675.
learning for monocular depth map prediction,” in Proceedings of the [121] Y. Zou, Z. Luo, and J.-B. Huang, “DF-Net: Unsupervised joint learning
IEEE Conference on Computer Vision and Pattern Recognition, 2017, of depth and flow using cross-task consistency,” in Proceedings of the
pp. 6647–6655. European Conference on Computer Vision (ECCV), 2018, pp. 36–53.
[102] M. Poggi, F. Tosi, and S. Mattoccia, “Learning monocular depth [122] Z. Yin and J. Shi, “Geonet: Unsupervised learning of dense depth,
estimation with unsupervised trinocular assumptions,” in Proc. 2018 optical flow and camera pose,” in Proceedings of the IEEE Conference
International Conference on 3D Vision (3DV), 2018, pp. 324–333. on Computer Vision and Pattern Recognition, 2018, pp. 1983–1992.
[103] P. Z. Ramirez, M. Poggi, F. Tosi, S. Mattoccia, and L. Di Stefano, [123] A. Ranjan, V. Jampani, L. Balles, K. Kim, D. Sun, J. Wulff, and
“Geometry meets semantics for semi-supervised monocular depth M. J. Black, “Competitive collaboration: Joint unsupervised learning
estimation,” in Proc. Asian Conference on Computer Vision. Springer, of depth, camera motion, optical flow and motion segmentation,” in
2018, pp. 298–313. Proceedings of the IEEE Conference on Computer Vision and Pattern
[104] F. Aleotti, F. Tosi, M. Poggi, and S. Mattoccia, “Generative adversarial Recognition, 2019, pp. 12 240–12 249.
networks for unsupervised monocular depth prediction,” in Proceedings [124] G. Wang, H. Wang, Y. Liu, and W. Chen, “Unsupervised learning of
of the European Conference on Computer Vision (ECCV), 2018, pp. monocular depth and ego-motion using multiple masks,” in Proc. 2019
0–0. International Conference on Robotics and Automation (ICRA), 2019,
[105] A. Pilzer, D. Xu, M. Puscas, E. Ricci, and N. Sebe, “Unsupervised pp. 4724–4730.
adversarial depth estimation using cycled generative networks,” in Proc. [125] S. Li, F. Xue, X. Wang, Z. Yan, and H. Zha, “Sequential adversarial
2018 International Conference on 3D Vision (3DV), 2018, pp. 587–595. learning for self-supervised deep visual odometry,” in Proceedings of
[106] A. Pilzer, S. Lathuiliere, N. Sebe, and E. Ricci, “Refine and distill: the IEEE International Conference on Computer Vision, 2019, pp.
Exploiting cycle-inconsistency and knowledge distillation for unsu- 2851–2860.
pervised monocular depth estimation,” in Proceedings of the IEEE [126] S. Li, X. Wang, Y. Cao, F. Xue, Z. Yan, and H. Zha, “Self-supervised
Conference on Computer Vision and Pattern Recognition, 2019, pp. deep visual odometry with online adaptation,” in Proceedings of the
9768–9777. IEEE/CVF Conference on Computer Vision and Pattern Recognition,
[107] F. Tosi, F. Aleotti, M. Poggi, and S. Mattoccia, “Learning monocular 2020, pp. 6339–6348.
depth estimation infusing traditional stereo knowledge,” in Proceedings [127] M. B. Vankadari, S. Garg, A. Majumdar, S. Kumar, and A. Behera,
of the IEEE Conference on Computer Vision and Pattern Recognition, “Unsupervised monocular depth estimation for night-time images using
2019, pp. 9799–9809. adversarial domain feature adaptation,” in Lecture Notes in Computer
[108] P.-Y. Chen, A. H. Liu, Y.-C. Liu, and Y.-C. F. Wang, “Towards Sciences (LNCS)-European Conference on Computer Vision, 2020, pp.
scene understanding: Unsupervised monocular depth estimation with 443–459.
semantic-aware representation,” in Proceedings of the IEEE Conference [128] C. Zhao, G. G. Yen, Q. Sun, C. Zhang, and Y. Tang, “Masked GAN
on Computer Vision and Pattern Recognition, 2019, pp. 2624–2632. for unsupervised depth and pose prediction with scale consistency,”
[109] X. Fei, A. Wong, and S. Soatto, “Geo-supervised visual depth pre- IEEE Transactions on Neural Networks and Learning Systems, vol. 32,
diction,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. no. 12, pp. 5392–5403, 2020.
1661–1668, 2019. [129] J. Spencer, R. Bowden, and S. Hadfield, “Defeat-net: General monoc-
[110] C. Wang, J. Miguel Buenaposada, R. Zhu, and S. Lucey, “Learning ular depth via simultaneous unsupervised representation learning,” in
depth from monocular videos using direct methods,” in Proceedings Proceedings of the IEEE/CVF Conference on Computer Vision and
of the IEEE Conference on Computer Vision and Pattern Recognition, Pattern Recognition, 2020, pp. 14 402–14 413.
2018, pp. 2022–2030. [130] C. Shu, K. Yu, Z. Duan, and K. Yang, “Feature-metric loss for self-
[111] H. Zhan, R. Garg, C. Saroj Weerasekera, K. Li, H. Agarwal, and supervised learning of depth and egomotion,” in Proceedings of the
I. Reid, “Unsupervised learning of monocular depth estimation and European Conference on Computer Vision (ECCV), 2020, pp. 572–
visual odometry with deep feature reconstruction,” in Proceedings of 588.
[131] J. Watson, O. Mac Aodha, V. Prisacariu, G. Brostow, and M. Firman, tonomous driving,” in Proceedings of the IEEE/CVF Winter Conference
“The temporal opportunist: Self-supervised multi-frame monocular on Applications of Computer Vision, 2021, pp. 61–71.
depth,” in Proceedings of the IEEE/CVF Conference on Computer [153] W. Yin, Y. Liu, C. Shen, and Y. Yan, “Enforcing geometric constraints
Vision and Pattern Recognition (CVPR), June 2021, pp. 1164–1174. of virtual normal for depth prediction,” in Proceedings of the IEEE
[132] X. Lyu, L. Liu, M. Wang, X. Kong, L. Liu, Y. Liu, X. Chen, and International Conference on Computer Vision, 2019, pp. 5684–5693.
Y. Yuan, “HR-Depth: High resolution self-supervised monocular depth [154] J. Lienen, E. Hullermeier, R. Ewerth, and N. Nommensen, “Monocular
estimation,” in Proceedings of the Thirty-Fifth AAAI Conference on depth estimation via listwise ranking using the plackett-luce model,”
Artificial Intelligence, 2021. in Proceedings of the IEEE/CVF Conference on Computer Vision and
[133] H. Zhou, D. Greenwood, and S. Taylor, “Self-supervised monocular Pattern Recognition (CVPR), June 2021, pp. 14 595–14 604.
depth estimation with internal feature fusion,” in British Machine Vision [155] S. M. H. Miangoleh, S. Dille, L. Mai, S. Paris, and Y. Aksoy, “Boosting
Conference (BMVC), 2021. monocular depth estimation models to high-resolution via content-
[134] K. R. Konda and R. Memisevic, “Learning visual odometry with a adaptive multi-resolution merging,” in Proceedings of the IEEE/CVF
convolutional network.” in Proc. 10th International Conference on Conference on Computer Vision and Pattern Recognition (CVPR), June
Computer Vision Theory and Applications, 2015, pp. 486–490. 2021, pp. 9685–9694.
[135] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia, “Exploring [156] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, “Digging
representation learning with cnns for frame-to-frame ego-motion es- into self-supervised monocular depth estimation,” in Proceedings of the
timation,” IEEE Robotics and Automation Letters, vol. 1, no. 1, pp. IEEE International Conference on Computer Vision, 2019, pp. 3828–
18–25, 2015. 3838.
[136] S. Wang, R. Clark, H. Wen, and N. Trigoni, “DeepVO: Towards [157] M. Klingner, J.-A. Termöhlen, J. Mikolajczyk, and T. Fingscheidt,
end-to-end visual odometry with deep recurrent convolutional neural “Self-supervised monocular depth estimation: Solving the dynamic
networks,” in Proc. 2017 IEEE International Conference on Robotics object problem by semantic guidance,” in Proceedings of the European
and Automation (ICRA), 2017, pp. 2043–2050. Conference on Computer Vision (ECCV), 2020, pp. 582–600.
[137] F. Xue, Q. Wang, X. Wang, W. Dong, J. Wang, and H. Zha, “Guided [158] V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon, “3D
feature selection for deep visual odometry,” in Proceedings of Asian packing for self-supervised monocular depth estimation,” in Proceed-
Conference on Computer Vision. Springer, 2018, pp. 293–308. ings of the IEEE/CVF Conference on Computer Vision and Pattern
[138] F. Xue, X. Wang, S. Li, Q. Wang, J. Wang, and H. Zha, “Beyond track- Recognition, 2020, pp. 2485–2494.
ing: Selecting memory and refining poses for deep visual odometry,” in [159] A. Johnston and G. Carneiro, “Self-supervised monocular trained
Proceedings of the IEEE Conference on Computer Vision and Pattern depth estimation using self-attention and discrete disparity volume,”
Recognition, 2019, pp. 8575–8583. in Proceedings of the IEEE/CVF Conference on Computer Vision and
[139] R. Clark, S. Wang, H. Wen, A. Markham, and N. Trigoni, “VINet: Pattern Recognition, 2020, pp. 4756–4765.
Visual-inertial odometry as a sequence-to-sequence learning problem,”
[160] A. Z. Zhu, D. Thakur, T. Özaslan, B. Pfrommer, V. Kumar, and
in Proc. Thirty-First AAAI Conference on Artificial Intelligence, 2017,
K. Daniilidis, “The multivehicle stereo event camera dataset: An event
pp. 3995–4001.
camera dataset for 3d perception,” IEEE Robotics and Automation
[140] C. Chen, S. Rosa, Y. Miao, C. X. Lu, W. Wu, A. Markham, and Letters, vol. 3, no. 3, pp. 2032–2039, 2018.
N. Trigoni, “Selective sensor fusion for neural visual-inertial odom-
[161] H. Jiang, Z. Sheng, S. Zhu, Z. Dong, and R. Huang, “Unifuse: Uni-
etry,” in Proceedings of the IEEE Conference on Computer Vision and
directional fusion for 360 panorama depth estimation,” IEEE Robotics
Pattern Recognition, 2019, pp. 10 542–10 551.
and Automation Letters, vol. 6, no. 2, pp. 1519–1526, 2021.
[141] F. Xue, X. Wu, S. Cai, and J. Wang, “Learning multi-view camera
relocalization with graph neural networks,” in Proceedings of the [162] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution repre-
IEEE/CVF Conference on Computer Vision and Pattern Recognition, sentation learning for human pose estimation,” in Proceedings of the
2020, pp. 11 375–11 384. IEEE/CVF Conference on Computer Vision and Pattern Recognition,
2019, pp. 5693–5703.
[142] X. Wei, Y. Zhang, Z. Li, Y. Fu, and X. Xue, “Deepsfm: Structure
from motion via deep bundle adjustment,” in European conference on [163] C. Zhao, Y. Tang, and Q. Sun, “Unsupervised monocular
computer vision. Springer, 2020, pp. 230–247. depth estimation in highly complex environments,” arXiv preprint
[143] B. Zhuang and M. Chandraker, “Fusing the old with the new: Learning arXiv:2107.13137, 2021.
relative camera pose with geometry-guided uncertainty,” in Proceed- [164] K. Wang, Z. Zhang, Z. Yan, X. Li, B. Xu, J. Li, and J. Yang, “Regu-
ings of the IEEE/CVF Conference on Computer Vision and Pattern larizing nighttime weirdness: Efficient self-supervised monocular depth
Recognition (CVPR), June 2021, pp. 32–42. estimation in the dark,” in Proceedings of the IEEE/CVF International
[144] M. Poggi, S. Kim, F. Tosi, S. Kim, F. Aleotti, D. Min, K. Sohn, and Conference on Computer Vision, 2021, pp. 16 055–16 064.
S. Mattoccia, “On the confidence of stereo matching in a deep-learning [165] Y. Zou, P. Ji, Q.-H. Tran, J.-B. Huang, and M. Chandraker, “Learning
era: a quantitative evaluation,” IEEE Transactions on Pattern Analysis monocular visual odometry via self-supervised long-term modeling,” in
and Machine Intelligence, 2021. Proceedings of the European Conference on Computer Vision (ECCV),
[145] R. Hartley and A. Zisserman, Multiple View Geometry in Computer 2020, pp. 710–727.
Vision. Cambridge University Press, 2003. [166] W. Zhao, S. Liu, Y. Shu, and Y.-J. Liu, “Towards better generalization:
[146] Y. Furukawa, C. Hernández et al., “Multi-view stereo: A tutorial,” Joint depth-pose learning without posenet,” in Proceedings of the
Foundations and Trends® in Computer Graphics and Vision, vol. 9, IEEE/CVF Conference on Computer Vision and Pattern Recognition,
no. 1-2, pp. 1–148, 2015. 2020, pp. 9151–9161.
[147] D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense [167] E. S. Jones and S. Soatto, “Visual-inertial navigation, mapping and
two-frame stereo correspondence algorithms,” International Journal of localization: A scalable real-time causal approach,” The International
Computer Vision, vol. 47, no. 1-3, pp. 7–42, 2002. Journal of Robotics Research, vol. 30, no. 4, pp. 407–430, 2011.
[148] H. Hirschmuller, “Accurate and efficient stereo processing by semi- [168] P. Li, T. Qin, B. Hu, F. Zhu, and S. Shen, “Monocular visual-
global matching and mutual information,” in 2005 IEEE Computer inertial state estimation for mobile augmented reality,” in 2017 IEEE
Society Conference on Computer Vision and Pattern Recognition International Symposium on Mixed and Augmented Reality (ISMAR),
(CVPR’05), vol. 2. IEEE, 2005, pp. 807–814. 2017, pp. 11–21.
[149] K. Karsch, C. Liu, and S. Kang, “Depth extraction from video using [169] K. Tateno, F. Tombari, I. Laina, and N. Navab, “CNN-SLAM: Real-
non-parametric sampling-supplemental material,” in Proc. European time dense monocular SLAM with learned depth prediction,” in Pro-
conference on Computer Vision, 2012, pp. 775–788. ceedings of the IEEE Conference on Computer Vision and Pattern
[150] L. Ladicky, J. Shi, and M. Pollefeys, “Pulling things out of perspective,” Recognition, 2017, pp. 6243–6252.
in Proceedings of the IEEE Conference on Computer Vision and Pattern [170] E. Sucar and J.-B. Hayet, “Bayesian scale estimation for monocular
Recognition, 2014, pp. 89–96. SLAM based on generic object detection for correcting scale drift,” in
[151] C. Zhao, Q. Sun, C. Zhang, Y. Tang, and F. Qian, “Monocular depth Proc. 2018 IEEE International Conference on Robotics and Automation
estimation based on deep learning: An overview,” Science China (ICRA), 2018, pp. 1–7.
Technological Sciences, vol. 63, no. 9, pp. 1612–1627, 2020. [171] N. Yang, R. Wang, J. Stuckler, and D. Cremers, “Deep virtual stereo
[152] V. R. Kumar, M. Klingner, S. Yogamani, S. Milz, T. Fingscheidt, odometry: Leveraging deep depth prediction for monocular direct
and P. Mader, “Syndistnet: Self-supervised monocular fisheye camera sparse odometry,” in Proceedings of the European Conference on
distance estimation synergized with semantic segmentation for au- Computer Vision (ECCV), 2018, pp. 817–833.
[172] C. Zhao, Y. Tang, Q. Sun, and A. V. Vasilakos, “Deep direct visual [194] J. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M.-M. Cheng, and
odometry,” IEEE Transactions on Intelligent Transportation Systems, I. Reid, “Unsupervised scale-consistent depth and ego-motion learning
pp. 1–10, 2021. from monocular video,” in Advances in Neural Information Processing
[173] Z. Wang, Q. Zhang, J. Li, S. Zhang, and J. Liu, “A computationally ef- Systems, 2019, pp. 35–45.
ficient semantic SLAM solution for dynamic scenes,” Remote Sensing, [195] L. Xiao, J. Wang, X. Qiu, Z. Rong, and X. Zou, “Dynamic-SLAM:
vol. 11, no. 11, p. 1363, 2019. Semantic monocular visual localization and mapping based on deep
[174] J. Cheng, Y. Sun, and M. Q.-H. Meng, “A dense semantic mapping learning in dynamic environment,” Robotics and Autonomous Systems,
system based on CRF-RNN network,” in Proc. 2017 18th International vol. 117, pp. 1–16, 2019.
Conference on Advanced Robotics (ICAR), 2017, pp. 589–594. [196] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look
[175] J. McCormac, A. Handa, A. Davison, and S. Leutenegger, “Seman- once: Unified, real-time object detection,” in Proceedings of the IEEE
ticfusion: Dense 3D semantic mapping with convolutional neural Conference on Computer Vision and Pattern Recognition, 2016, pp.
networks,” in Proc. 2017 IEEE International Conference on Robotics 779–788.
and automation (ICRA), 2017, pp. 4628–4635. [197] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and
[176] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, A. C. Berg, “SSD: Single shot multibox detector,” in Proc. European
“Deeper depth prediction with fully convolutional residual networks,” Conference on Computer Vision. Springer, 2016, pp. 21–37.
in Proc. 2016 Fourth International Conference on 3D Vision (3DV), [198] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in
2016, pp. 239–248. Proceedings of the IEEE International Conference on Computer Vision,
[177] S. Y. Loo, A. J. Amiri, S. Mashohor, S. H. Tang, and H. Zhang, “CNN- 2017, pp. 2961–2969.
SVO: Improving the mapping in semi-direct visual odometry using [199] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep
single-image depth prediction,” in Proc. 2019 International Conference convolutional encoder-decoder architecture for image segmentation,”
on Robotics and Automation (ICRA), 2019, pp. 5218–5223. IEEE Transactions on Pattern Analysis and Machine Intelligence,
[178] J. Czarnowski, T. Laidlow, R. Clark, and A. J. Davison, “DeepFactors: vol. 39, no. 12, pp. 2481–2495, 2017.
Real-time probabilistic dense monocular SLAM,” IEEE Robotics and [200] F. Zhong, S. Wang, Z. Zhang, and Y. Wang, “Detect-SLAM: Making
Automation Letters, vol. 5, no. 2, pp. 721–728, 2020. object detection and SLAM mutually beneficial,” in Proc. 2018 IEEE
[179] L. Tiwari, P. Ji, Q.-H. Tran, B. Zhuang, S. Anand, and M. Chandraker, Winter Conference on Applications of Computer Vision (WACV), 2018,
“Pseudo RGB-D for self-Improving monocular SLAM and depth pp. 1001–1010.
prediction,” in Proceedings of the European Conference on Computer [201] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,”
Vision (ECCV), 2020, pp. 437–455. arXiv preprint arXiv:1804.02767, 2018.
[180] N. Yang, L. V. Stumberg, R. Wang, and D. Cremers, “D3VO: Deep [202] C. Yu, Z. Liu, X.-J. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei, “DS-
depth, deep pose and deep uncertainty for monocular visual odometry,” SLAM: A semantic visual SLAM towards dynamic environments,” in
in Proceedings of the IEEE/CVF Conference on Computer Vision and Proc. 2018 IEEE/RSJ International Conference on Intelligent Robots
Pattern Recognition, 2020, pp. 1281–1292. and Systems (IROS), 2018, pp. 1168–1174.
[181] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: [203] J. Huang, S. Yang, T.-J. Mu, and S.-M. Hu, “ClusterVO: Clustering
The kitti dataset,” The International Journal of Robotics Research, moving instances and estimating visual odometry for self and surround-
vol. 32, no. 11, pp. 1231–1237, 2013. ings,” in Proceedings of the IEEE/CVF Conference on Computer Vision
[182] B. Wagstaff, V. Peretroukhin, and J. Kelly, “Self-supervised deep pose and Pattern Recognition, 2020, pp. 2168–2177.
corrections for robust visual odometry,” in 2020 IEEE International [204] S. Yang and S. Scherer, “CubeSLAM: Monocular 3-D object SLAM,”
Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. IEEE Transactions on Robotics, vol. 35, no. 4, pp. 925–938, 2019.
2331–2337. [205] D. Frost, V. Prisacariu, and D. Murray, “Recovering stable scale
[183] Z. Teed and J. Deng, “Droid-slam: Deep visual slam for monocular, in monocular SLAM using object-supplemented bundle adjustment,”
stereo, and rgb-d cameras,” Advances in Neural Information Processing IEEE Transactions on Robotics, vol. 34, no. 3, pp. 736–747, 2018.
Systems, vol. 34, 2021. [206] E. Stenborg, C. Toft, and L. Hammarstrand, “Long-term visual local-
[184] P. Liu, M. Geppert, L. Heng, T. Sattler, A. Geiger, and M. Pollefeys, ization using semantically segmented images,” in Proc. 2018 IEEE
“Towards robust visual odometry with a multi-camera system,” in 2018 International Conference on Robotics and Automation (ICRA), 2018,
IEEE/RSJ International Conference on Intelligent Robots and Systems pp. 6484–6490.
(IROS). IEEE, 2018, pp. 1154–1161. [207] S. L. Bowman, N. Atanasov, K. Daniilidis, and G. J. Pappas, “Prob-
[185] H. Alismail, M. Kaess, B. Browning, and S. Lucey, “Direct visual abilistic data association for semantic SLAM,” in Proc. 2017 IEEE
odometry in low light using binary descriptors,” IEEE Robotics and International Conference on Robotics and Automation (ICRA), 2017,
Automation Letters, vol. 2, no. 2, pp. 444–451, 2016. pp. 1722–1729.
[186] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image [208] K.-N. Lianos, J. L. Schonberger, M. Pollefeys, and T. Sattler, “Vso:
translation using cycle-consistent adversarial networks,” in Proceedings Visual semantic odometry,” in Proceedings of the European Conference
of the IEEE International Conference on Computer Vision, 2017, pp. on Computer Vision (ECCV), 2018, pp. 234–250.
2223–2232. [209] J. Civera, D. Gálvez-López, L. Riazuelo, J. D. Tardós, and J. Montiel,
[187] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image “Towards semantic SLAM using a monocular camera,” in Proc. 2011
translation with conditional adversarial networks,” in Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems,
the IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 1277–1284.
2017, pp. 1125–1134. [210] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J.
[188] D. Chen, J. Liao, L. Yuan, N. Yu, and G. Hua, “Coherent online video Davison, “SLAM++: Simultaneous localisation and mapping at the
style transfer,” in Proceedings of the IEEE International Conference level of objects,” in Proceedings of the IEEE Conference on Computer
on Computer Vision, 2017, pp. 1105–1114. Vision and Pattern Recognition, 2013, pp. 1352–1359.
[189] C. Gao, D. Gu, F. Zhang, and Y. Yu, “ReCoNet: Real-time Coherent [211] B. Drost, M. Ulrich, N. Navab, and S. Ilic, “Model globally, match
Video Style Transfer Network,” in Proc. Asian Conference on Com- locally: Efficient and robust 3D object recognition,” in Proc. 2010
puter Vision. Springer, 2018, pp. 637–653. IEEE Computer Society Conference on Computer Vision and Pattern
[190] X. Guo, Y. Li, and H. Ling, “LIME: Low-light image enhancement via Recognition, 2010, pp. 998–1005.
illumination map estimation,” IEEE Transactions on Image Processing, [212] N. Sünderhauf, T. T. Pham, Y. Latif, M. Milford, and I. Reid, “Mean-
vol. 26, no. 2, pp. 982–993, 2016. ingful maps with object-oriented semantic mapping,” in Proc. 2017
[191] A. Anoosheh, T. Sattler, R. Timofte, M. Pollefeys, and L. Van Gool, IEEE/RSJ International Conference on Intelligent Robots and Systems
“Night-to-day image translation for retrieval-based localization,” in (IROS), 2017, pp. 5079–5085.
Proc. 2019 International Conference on Robotics and Automation [213] B.-S. Kim, P. Kohli, and S. Savarese, “3D scene understanding by
(ICRA), 2019, pp. 5958–5964. voxel-CRF,” in Proceedings of the IEEE International Conference on
[192] R. Gomez-Ojeda, Z. Zhang, J. Gonzalez-Jimenez, and D. Scaramuzza, Computer Vision, 2013, pp. 1425–1432.
“Learning-based image enhancement for visual odometry in challeng- [214] A. Hermans, G. Floros, and B. Leibe, “Dense 3D semantic mapping of
ing HDR environments,” in Proc. 2018 IEEE International Conference indoor scenes from RGB-D images,” in Proc. 2014 IEEE International
on Robotics and Automation (ICRA), 2018, pp. 805–811. Conference on Robotics and Automation (ICRA), 2014, pp. 2631–2638.
[193] L. Cui and C. Ma, “SOF-SLAM: A semantic visual SLAM for dynamic [215] A. Kundu, Y. Li, F. Dellaert, F. Li, and J. M. Rehg, “Joint semantic
environments,” IEEE Access, vol. 7, pp. 166 528–166 539, 2019. segmentation and 3D reconstruction from monocular video,” in Proc.
European Conference on Computer Vision. Springer, 2014, pp. 703– self-supervised imitation learning for vision-language navigation,” in
718. Proceedings of the IEEE Conference on Computer Vision and Pattern
[216] X. Li and R. Belaroussi, “Semi-dense 3D semantic mapping from Recognition, 2019, pp. 6629–6638.
monocular SLAM,” arXiv preprint arXiv:1611.04144, 2016. [238] D. S. Chaplot, L. Lee, R. Salakhutdinov, D. Parikh, and D. Batra,
[217] J. Ma, J. Wu, J. Zhao, J. Jiang, H. Zhou, and Q. Z. Sheng, “Non- “Embodied multimodal multitask learning,” Proceedings of the 29th
rigid point set registration with robust transformation learning under International Joint Conference on Artificial Intelligence (IJCAI), pp.
manifold regularization,” IEEE Transactions on Neural Networks and 2442–2448, 2020.
Learning Systems, vol. 30, no. 12, pp. 3584–3597, 2018. [239] H. Lin, S. Garg, J. Hu, G. Kaddoum, M. Peng, and M. S. Hos-
[218] Y. Hu, B. Subagdja, A.-H. Tan, and Q. Yin, “Vision-based topological sain, “Blockchain and deep reinforcement learning empowered spatial
mapping and navigation with self-organizing neural networks,” IEEE crowdsourcing in software-defined internet of vehicles,” IEEE Trans-
Transactions on Neural Networks and Learning Systems, 2021. actions on Intelligent Transportation Systems, 2020.
[219] B. Gozick, K. P. Subbu, R. Dantu, and T. Maeshiro, “Magnetic maps [240] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, F.-F. Li, and
for indoor navigation,” IEEE Transactions on Instrumentation and A. Farhadi, “Target-driven visual navigation in indoor scenes using
Measurement, vol. 60, no. 12, pp. 3883–3891, 2011. deep reinforcement learning,” in Proc. 2017 IEEE International Con-
[220] B. Barshan and H. F. Durrant-Whyte, “Inertial navigation systems ference on Robotics and Automation (ICRA), 2017, pp. 3357–3364.
for mobile robots,” IEEE Transactions on Robotics and Automation, [241] L. Tai and M. Liu, “A robot exploration strategy based on Q-learning
vol. 11, no. 3, pp. 328–342, 1995. network,” in Proc. 2016 IEEE International Conference on Real-time
[221] R. Stahn, G. Heiserich, and A. Stopp, “Laser scanner-based navigation Computing and Robotics (RCAR), 2016, pp. 57–62.
for commercial vehicles,” in Proc. 2007 IEEE Intelligent Vehicles [242] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas,
Symposium, 2007, pp. 969–974. “Dueling network architectures for deep reinforcement learning,” in
[222] I. Skog and P. Handel, “In-car positioning and navigation technolo- International conference on machine learning. PMLR, 2016, pp.
gies—A survey,” IEEE Transactions on Intelligent Transportation 1995–2003.
Systems, vol. 10, no. 1, pp. 4–21, 2009. [243] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning
[223] F. Bonin-Font, A. Ortiz, and G. Oliver, “Visual navigation for mobile with double q-learning,” in Proceedings of the AAAI Conference on
robots: A survey,” Journal of Intelligent and Robotic Systems, vol. 53, Artificial Intelligence, vol. 30, no. 1, 2016.
no. 3, pp. 263–296, 2008. [244] Y. Zeng, X. Xu, S. Jin, and R. Zhang, “Simultaneous navigation and
[224] D. Kim and R. Nevatia, “Symbolic navigation with a generic map,” radio mapping for cellular-connected uav with deep reinforcement
Autonomous Robots, vol. 6, no. 1, pp. 69–88, 1999. learning,” IEEE Transactions on Wireless Communications, 2021.
[225] J. Gaspar, N. Winters, and J. Santos-Victor, “Vision-based navigation [245] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,
and environmental representations with an omnidirectional camera,” D. Silver, and D. Wierstra, “Continuous control with deep reinforce-
IEEE Transactions on Robotics and Automation, vol. 16, no. 6, pp. D. Silver, and D. Wierstra, “Continuous control with deep reinforce-
890–898, 2000. [246] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, “Continuous deep
[226] P. Saeedi, P. D. Lawrence, and D. G. Lowe, “Vision-based 3-d trajectory Q-learning with model-based acceleration,” in Proc. International
tracking for unknown environments,” IEEE Transactions on Robotics, Conference on Machine Learning, 2016, pp. 2829–2838.
vol. 22, no. 1, pp. 119–136, 2006. [247] C. H. Liu, Z. Chen, J. Tang, J. Xu, and C. Piao, “Energy-efficient
[227] M. A. K. Jaradat, M. Al-Rousan, and L. Quadan, “Reinforcement UAV control for effective and fair communication coverage: A deep
based mobile robot navigation in dynamic environment,” Robotics and reinforcement learning approach,” IEEE Journal on Selected Areas in
Computer-Integrated Manufacturing, vol. 27, no. 1, pp. 135–149, 2011. Communications, vol. 36, no. 9, pp. 2059–2070, 2018.
[228] H. Shi, G. Sun, Y. Wang, and K.-S. Hwang, “Adaptive image-based [248] L. Tai, G. Paolo, and M. Liu, “Virtual-to-real deep reinforcement
visual servoing with temporary loss of the visual signal,” IEEE Trans- learning: Continuous control of mobile robots for mapless navigation,”
actions on Industrial Informatics, vol. 15, no. 4, pp. 1956–1965, 2018. in Proc. 2017 IEEE/RSJ International Conference on Intelligent Robots
[229] M. Bellemare, W. Dabney, R. Dadashi, A. A. Taiga, P. S. Castro, and Systems (IROS), 2017, pp. 31–36.
N. Le Roux, D. Schuurmans, T. Lattimore, and C. Lyle, “A geometric [249] Z. Zhang, J. Chen, Z. Chen, and W. Li, “Asynchronous episodic deep
perspective on optimal representations for reinforcement learning,” in deterministic policy gradient: Toward continuous control in computa-
Advances in Neural Information Processing Systems, 2019, pp. 4360– tionally complex environments,” IEEE Transactions on Cybernetics,
4371. 2019.
[230] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Sil- [250] H.-T. L. Chiang, A. Faust, M. Fiser, and A. Francis, “Learning naviga-
ver, and K. Kavukcuoglu, “Reinforcement learning with unsupervised tion behaviors end-to-end with autorl,” IEEE Robotics and Automation
auxiliary tasks,” arXiv preprint arXiv:1611.05397, 2016. Letters, vol. 4, no. 2, pp. 2007–2014, 2019.
[231] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, [251] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Advances
M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu et al., “Learning to in Neural Information Processing Systems, 2000, pp. 1008–1014.
navigate in complex environments,” arXiv preprint arXiv:1611.03673, [252] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley,
2016. D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep
[232] H. Li, Q. Zhang, and D. Zhao, “Deep reinforcement learning-based reinforcement learning,” in Proc. International Conference on Machine
automatic exploration for navigation in unknown environment,” IEEE Learning, 2016, pp. 1928–1937.
Transactions on Neural Networks and Learning Systems, vol. 31, no. 6, [253] R. Druon, Y. Yoshiyasu, A. Kanezaki, and A. Watt, “Visual object
pp. 2064–2076, 2020. search by learning spatial context,” IEEE Robotics and Automation
[233] P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, Letters, vol. 5, no. 2, pp. 1279–1286, 2020.
I. Reid, S. Gould, and A. van den Hengel, “Vision-and-language nav- [254] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-
igation: Interpreting visually-grounded navigation instructions in real policy maximum entropy deep reinforcement learning with a stochastic
environments,” in Proceedings of the IEEE Conference on Computer actor,” in International Conference on Machine Learning. PMLR,
Vision and Pattern Recognition, 2018, pp. 3674–3683. 2018, pp. 1861–1870.
[234] L. Ke, X. Li, Y. Bisk, A. Holtzman, Z. Gan, J. Liu, J. Gao, Y. Choi, [255] J. C. de Jesus, V. A. Kich, A. H. Kolling, R. B. Grando, M. A. d.
and S. Srinivasa, “Tactical rewind: Self-correction via backtracking in S. L. Cuadros, and D. F. T. Gamarra, “Soft actor-critic for navigation
vision-and-language navigation,” in Proceedings of the IEEE Confer- of mobile robots,” Journal of Intelligent & Robotic Systems, vol. 102,
ence on Computer Vision and Pattern Recognition, 2019, pp. 6741– no. 2, pp. 1–11, 2021.
6749. [256] T.-Y. Lee, J. van Baar, K. Wittenburg, and A. Sullivan, “Analysis of the
[235] F. Zhu, Y. Zhu, X. Chang, and X. Liang, “Vision-language navigation contribution and temporal dependency of lstm layers for reinforcement
with self-supervised auxiliary reasoning tasks,” in Proceedings of the learning tasks,” in Proceedings of the IEEE Conference on Computer
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vision and Pattern Recognition Workshops, 2019, pp. 99–102.
2020, pp. 10 012–10 022. [257] P. Mirowski, M. Grimes, M. Malinowski, K. M. Hermann, K. An-
[236] Q. Sun, Y. Zhuang, Z. Chen, Y. Fu, and X. Xue, “Depth-guided adain derson, D. Teplyashin, K. Simonyan, A. Zisserman, R. Hadsell et al.,
and shift attention network for vision-and-language navigation,” in “Learning to navigate in cities without a map,” in Advances in Neural
2021 IEEE International Conference on Multimedia and Expo (ICME). Information Processing Systems, 2018, pp. 2419–2430.
IEEE, 2021, pp. 1–6. [258] X. Chen, A. Ghadirzadeh, J. Folkesson, M. Björkman, and P. Jensfelt,
[237] X. Wang, Q. Huang, A. Celikyilmaz, J. Gao, D. Shen, Y.-F. Wang, “Deep reinforcement learning to acquire navigation skills for wheel-
W. Y. Wang, and L. Zhang, “Reinforced cross-modal matching and legged robots in complex environments,” in Proc. 2018 IEEE/RSJ