
Monocular Depth Estimation with U-Net

This document proposes a simpler fully convolutional neural network architecture called U-Net for estimating depth maps from single RGB images. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. A coarse model is first trained to produce a low-resolution depth map, which is then refined by a fine model to produce the final high-resolution depth map. The scale invariant smooth L1 loss function is used for training to handle ambiguity caused by scale in depth. Evaluation on standard datasets shows this simpler U-Net architecture achieves comparable results to more complex state-of-the-art models while using significantly fewer parameters.

Uploaded by

Diksha Meghwal
Depth prediction using a single image

Diksha Meghwal Imran Rob Fergus


dm4511@nyu.edu ii398@nyu.edu fergus@nyu.edu

Courant Institute of Mathematics


New York University

Abstract

This paper addresses the problem of estimating the depth map of a scene given a single RGB image. We propose a simpler fully convolutional architecture, encompassing residual learning, to model the ambiguous mapping between monocular images and depth maps. In order to improve the output resolution, we present a novel way to efficiently learn feature map up-sampling within the network. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. For optimization, we use the scale invariant loss that is particularly suited for the task at hand and handles the ambiguity caused by the scale of the depth in the image. Our model is composed of a single architecture that is trained end-to-end and does not rely on post-processing techniques, such as CRFs or other additional refinement steps. The model contains significantly fewer parameters than the current SOTA.

1 Introduction

Scene depth inference from a single image is currently an important issue in machine learning [1], [2], [3], [4], [5]. The underlying rationale of this problem is the possibility of human depth perception from single images. The task here is to assign a depth value to every single pixel in the image, which can be considered as a dense regression problem. Depth information can benefit many challenging computer vision problems, such as semantic segmentation [6], [7], pose estimation [8], and object detection [9].

During the past decade, significant effort has been made in the research community to improve the performance of monocular depth learning, and significant accuracy has been achieved thanks to the rapid development and advances of deep neural networks. However, most networks tend to be heavy and contain a large number of parameters to be trained. This requires a huge amount of image data to train. We adopt the approach used by the U-Net architecture to create a simpler architecture that consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. U-Net architectures have already provided good results on the problem of segmentation. We implement the coarse and fine model proposed by Eigen [10] to compare against the results achieved by U-Net on the same image set.

2 Related Work

Depth estimation from image data has originally relied on stereo vision [11] [12], using image pairs of the same scene to reconstruct 3D shapes. Such approaches relied on motion (Structure-from-Motion [13]) or different shooting conditions (Shape-from-Shading [14], Shape-from-Defocus [15]). Despite the ambiguities that arise in the lack of such information, and inspired by the analogy to human depth perception from monocular cues, depth map prediction from a single RGB image has also been investigated. Below, we focus on the related work for single RGB input, similar to our method.

Classic methods on monocular depth estimation have mainly relied on hand-crafted features and used probabilistic graphical models to tackle the problem [16] [17] [18] [19], usually making strong assumptions about scene geometry. One of the first works, by Saxena et al. [20], uses an MRF to infer depth from local and global features extracted from the image, while superpixels [21] are introduced in the MRF formulation in order to enforce neighboring constraints. Their work has been later extended to 3D scene reconstruction [22]. Inspired by this work, Liu et al. [23] combine the task of semantic segmentation with depth estimation, where predicted labels are used as
additional constraints to facilitate the optimization task. Ladicky et al. [24] instead jointly predict labels and depths in a classification approach.

More recently, remarkable advances in the field of deep learning drove research towards the use of CNNs for depth estimation. Since the task is closely related to semantic labeling, most works have built upon the most successful architectures of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [25], often initializing their networks with AlexNet [26] or the deeper VGG [27]. Eigen et al. [10] have been the first to use CNNs for regressing dense depth maps from a single image in a two-scale architecture, where the first stage based on AlexNet produces a coarse output and the second stage refines the original prediction. Their work is later extended to additionally predict normals and labels with a deeper and more discriminative model based on VGG and a three-scale architecture for further refinement [3].

3 U-Net Model

The original network architecture of U-Net is illustrated in Figure 1. It consists of a contracting path (left side) and an expansive path (right side). The contracting path follows the typical architecture of a convolutional network. It consists of the repeated application of two 3x3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU) and a 2x2 max pooling operation with stride 2 for downsampling. At each downsampling step we double the number of feature channels. Every step in the expansive path consists of an upsampling of the feature map followed by a 2x2 convolution (up-convolution) that halves the number of feature channels, a concatenation with the corresponding feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU. At the final layer a 1x1 convolution is used to map each 64-component feature vector to the desired number of classes.

We have slightly modified the original architecture of U-Net to further reduce the number of parameters to train. The size of the input image is reduced, and we dropped one layer in both the contracting and the expanding path of the architecture. In total the network has 18 convolutional layers as opposed to the 23 present in the original architecture. The original architecture also crops the image while downsampling, so we end up losing boundary values. We prevent this loss of boundary by applying appropriate padding. Also, since the original architecture is designed for a classification problem, we have modified the last layer to produce an output which is the same size as the target depth image.

Figure 1: Architecture of the proposed network. U-net architecture (example for 8x8 pixels in the lowest resolution). Each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y-size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations.

4 Fine and Coarse Model

4.1 Coarse network

The coarse-scale network contains five feature extraction layers of convolution and max-pooling, followed by two fully connected layers. The input, feature map and output sizes are also given in Fig. 2. The final output is at 1/4-resolution compared to the input and corresponds to a center crop containing most of the input (as we describe later, we lose a small border area due to the first layer of the fine-scale network and image transformations). Since the top layers of the coarse model are fully connected layers, they encompass the entire image, giving a global feature map of the image. The middle and lower layers focus on small parts of the image to enable localization.

4.2 Fine network

The task of the fine-scale model is to make local refinements in the depth using the combined input of the image as well as the depth map produced by
the coarse-scale network. The fine-scale network consists of convolution layers that have a small viewing area of 45 x 45 pixels that focus on the local nuances of the image. It consists of a single max pool operation that is applied with a convolution layer on the input image. The size of this output matches the size of the depth map produced by the coarse network, and this combined input is then passed to subsequent convolution layers that maintain the same size using appropriate padding. This way the model achieves local refinement by learning local features like wall edges and corners. All hidden convolution layers are followed by rectified linear activations. The final layer predicts the depth map and hence is a fully connected layer.

Figure 2: Coarse-fine model with details of each convolution layer.

5 Loss function

The standard loss functions for optimization in regression problems are the L1 and L2 loss functions, minimizing the absolute distance or squared Euclidean norm between predictions y' and ground truth y:

$$L_1 = \sum |y' - y| \qquad (1)$$

$$L_2 = \sum (y' - y)^2 \qquad (2)$$

Although this produces good results in our test cases, we found that there were still ambiguities in the results obtained.

We then tried the smooth L1 loss function, which is a combination of L1-loss and L2-loss. It behaves as L1-loss when the absolute value of the argument is high, and it behaves like L2-loss when the absolute value of the argument is close to zero. The equation is:

$$\text{smooth}\,L_1(y', y) = \frac{1}{n} \sum_i z_i \qquad (3)$$

where z_i is as defined below:

$$z_i = \begin{cases} 0.5\,|y'_i - y_i|^2, & |y'_i - y_i| < 1 \\ |y'_i - y_i| - 0.5, & \text{otherwise} \end{cases} \qquad (4)$$

It is also popularly known as the Huber loss. However, the best results were observed from using the scale invariant error as proposed by Eigen et al. [10]. For a predicted depth map y' and ground truth y, each with n pixels indexed by i, the method is defined as below:

$$L(y', y) = \frac{1}{n} \sum_i d_i^2 - \frac{\lambda}{n^2} \left( \sum_i d_i \right)^2 \qquad (5)$$

where d_i is the difference between the prediction and the ground truth in log space, defined as below:

$$d_i = \log y'_i - \log y_i \qquad (6)$$

where λ = 0.5. This method mitigates the ambiguity caused by the scale of the image since all the operations are happening at log scale. The first part of the equation is very similar to the L2 loss. However, since the operation is in log scale, each pair of pixels in the prediction must differ in depth by an amount similar to that of the corresponding pair in the ground truth. The second term of the equation expands to a sum of products d_i d_j over all pixel pairs; it lowers the loss when two predictions are off from the ground truth in the same direction and raises it when they are off in opposite directions. In every gradient descent step, we compute this scale invariant loss by taking the cumulative loss for an entire batch and then dividing by the batch size to get the final loss value per pixel. The value of λ is chosen as an average between 0, which reduces the equation to a simple L2, and 1, which is the actual scale invariant loss, and seems to provide the best results on the observed dataset.

In addition to the scale-invariant error, we also measure the performance of our method according to several error metrics that have been proposed in prior works, as described in Section 6.5.

6 Experimental setup

In this section, we first describe the training corpus, explain our training and inference setups and give implementation details about our model.
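
For reference in the experiments that follow, the smooth L1 loss of Eq. (3)-(4) and the scale invariant loss of Eq. (5)-(6) from Section 5 can be sketched in plain Python. This is a minimal illustration, not the PyTorch code used for the experiments: the function names, the flat-list inputs of positive linear depths, and the toy values are all assumptions made for the example.

```python
import math

def smooth_l1(pred, target):
    """Eq. (3)-(4): mean of z_i over all pixels (hypothetical helper)."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        # Quadratic near zero, linear for large errors.
        total += 0.5 * diff ** 2 if diff < 1 else diff - 0.5
    return total / len(pred)

def scale_invariant_loss(pred, target, lam=0.5):
    """Eq. (5)-(6) with d_i = log(pred_i) - log(target_i).

    Assumes strictly positive depths given in linear (not log) space.
    """
    d = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(d)
    return sum(x * x for x in d) / n - lam * sum(d) ** 2 / n ** 2

# Toy example: the prediction is wrong only by a global factor of 2.
target = [1.0, 2.0, 4.0]
pred = [2.0 * t for t in target]

full = scale_invariant_loss(pred, target, lam=1.0)   # approximately 0
plain = scale_invariant_loss(pred, target, lam=0.0)  # log-space L2, (log 2)^2
half = scale_invariant_loss(pred, target, lam=0.5)   # halfway between the two
print(full, plain, half)
```

With λ = 1 a pure global scale error costs nothing, with λ = 0 it is penalized as an ordinary squared error in log space, and λ = 0.5 sits halfway between the two, which is the trade-off described in Section 5.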
6.1 Dataset

We train our model on the labeled version of NYU Depth v2 [28], which comprises 1449 densely labeled pairs of aligned RGB and depth images. The raw distributions contain many additional images collected from the same scenes as in the more commonly used small distributions, but with no preprocessing; in particular, points for which there is no depth value are left unfilled. However, given the scope of this project we restrict ourselves to the labeled data set where the data has been preprocessed to fill in missing depth labels. We split the database into a group of 1024 images for training, 224 for validation and 201 images for evaluation.

Number of pairs (Train)        1024
Number of pairs (Validation)    224
Number of pairs (Test)          201

Table 1: Distribution of data in train, test and validation sets in the entire NYU depth labeled dataset.

6.2 Image Preprocessing

Since we are using the NYU depth dataset, we extract the data from the MATLAB file provided on their official website [28]. It consists of both the RGB images and their corresponding depth maps as numpy arrays. For RGB images, we subtract the per-channel mean of the value of a pixel for a considerable subset of the data. This basic normalization for images is done for training, validation as well as test to make the network more robust and less susceptible to differing background and lighting conditions. Also, we resize the RGB images according to the network input requirement. For the coarse and fine model the RGB input image is transformed to a size of 304 x 228 while the depth target is resized to 74 x 55, as the model produces an output 1/4 of the input image. For U-Net, since the input and output sizes remain the same as we concatenate across down-sampling and up-sampling paths, we transform the input to a 64 x 64 image to make the computations less heavy. Finally we normalize the RGB image by calculating the mean per pixel per channel for 500 odd images and subtracting it from each RGB image and adding the standard deviation for a batch.

For depth we use bilinear interpolation to resize the depth array. Also the depth is transformed into log space to handle the ambiguity caused by scale.

6.3 Training and inference

The input images and their corresponding depth maps are used to train the network in PyTorch. For the coarse and fine model, training is done in two steps. The first step involves tuning the coarse model with respect to the ground truth. This network takes as input an image of size 304 x 228 and produces a fuzzy depth map for the whole image. Once the coarse network is trained, we freeze that model and train the fine model using the output of this trained coarse network and the input RGB image. We use the standard SGD optimizer with momentum 0.9 and our scale invariant loss function, computed in log space, for gradient computation for both the networks.

We use a separate learning rate for each layer.

Coarse layer   lr       Fine layer   lr
conv1          0.001    conv1        0.001
conv2          0.001    conv2        0.01
conv3          0.001    conv3        0.001
conv4          0.001
conv5          0.001
fc1            0.1
fc2            0.1

Table 2: Learning rate for each convolutional layer in the coarse and fine architecture

For the U-Net architecture we train with an input of size 64 x 64, use the SGD optimizer with momentum of 0.9, learning rate 0.01 and the scale invariant loss function to train the network with a batch size of 32. We run the model for approximately 200 epochs, and it converges in 20 minutes to provide substantial results.

The loss values are defined for each pixel for which we are predicting a depth value. The weights of the network are randomly initialized. Ideally the initial weights should be adapted such that each feature map in the network has approximately unit variance. For a network with our architecture (alternating convolution and ReLU layers) this can be achieved by drawing the initial weights from a Gaussian distribution with
a standard deviation of $\sqrt{2/N}$, where N denotes the number of incoming nodes of one neuron [5].

6.4 Experiments

We implemented our model using standard PyTorch libraries. We used the Prince clusters provided by the NYU Computer Science department and ran sbatch jobs to run our models. We plugged in TensorBoard to monitor the progress of the model and get a good feel of the gradient descent. To visualize the depth map and to get a good sense of the results we generate plots of the output images along with input and target while evaluating.

Model Architecture. Since the original architecture of the U-Net model was too large for us to train, we started with a smaller network with 3 convolutional layers consisting of just convolutions followed by ReLU (no pooling) for downsampling and 3 ConvTranspose2d layers for upsampling, with no concatenation of feature maps at the same level. This model had just 94011 parameters and was first made to overfit a small training set of 1 image. We then slowly added more layers, one at a time, adding padding to the convolutions to ensure no loss of boundary, and added max pooling for downsampling. The model now had 433771 parameters and seemed to improve a little with the increase in the number of parameters in the network, but showed the most improvement with the concatenation of feature maps across the downsample-upsample bridge. We also tried to change the activation function to tanh but didn't observe any improvement.

Optimizers. We experimented with several optimizers like Adagrad, Adamax, Adadelta and Adam (with and without amsgrad) and found SGD to perform equal to or slightly better than Adam. We started with a standard learning rate of 10^-3 and observed that the rate of convergence for the model was slow. It took 3 hours and 3000 epochs for the model to converge. We increased the learning rate by a factor of 10 and observed that the model now converged in just 200 epochs in 20 minutes. To further improve the results we tried a decaying learning rate with a step size of 30 epochs, but we didn't see any improvement in the results.

Loss Functions. We trained our model using the scale invariant loss function, but we observed that our loss values were lower than the ones observed in the benchmark models. To validate that our model was not erratic, we tried several other loss functions like L1 (absolute relative difference between pixels in a batch of images), L2 and smooth L1 loss (Huber loss), changing the lambda value in the scale invariant loss function and observing the results on the absolute as well as logarithmic scale. The original scale invariant function seemed to perform the best among all despite its unexplainably small values.

Figure 4: Output of our implementation of the coarse and fine model on the NYU labeled dataset

6.5 Evaluation metrics

             Coarse   Fine    UNet    Coarse [10]   Fine [10]
Scale Inv.   0.094    0.095   0.077   0.221         0.219
delta1       0.502    0.498   0.573   0.618         0.611
delta2       0.816    0.812   0.860   0.891         0.887
delta3       0.948    0.947   0.959   0.969         0.971
rmse(lin)    0.889    0.898   0.806   0.871         0.907
rmse(log)    0.116    0.118   0.096   0.283         0.285
abs rel      0.276    0.267   0.263   0.228         0.215
abs sqr rel  0.307    0.297   0.272   0.223         0.212

Table 3: Error Table. delta1 is the fraction of pixels for which the output/input threshold is < 1.25, delta2 with threshold < 1.25^2, delta3 with threshold < 1.25^3

We use several metrics apart from the scale invariant function to evaluate the depth predictions of our model against the ground truth while validating. These include the threshold metrics, which evaluate the maximum of the ratio between the predicted output and actual output. We also use RMSE linear which is nothing but the standard
root mean squared error. Since our dataset is in log space, we exponentiate the values to get the linear value. Similarly, we add other loss functions like relative difference and squared relative difference for both log space and absolute. We plot these values during the validation stage and observed that after 200 epochs the network stabilizes and doesn't progress much. The values observed are evaluated on the evaluation dataset of the labeled NYU Depth dataset, which had 201 images.

(a) Plot of loss functions as described in the legend    (b) Plot of accuracies as described in the legend

Figure 3: Error and accuracies on the u-net model

Table 4: Notable Results in area of Depth Map Prediction

            Wang[29]  Eigen[3]  ResNet[4]  AlexNet[4]  VGG[4]
delta1      0.605     0.769     0.811      0.586       0.626
delta2      0.890     0.950     0.953      0.869       0.894
delta3      0.970     0.988     0.988      0.967       0.974
rmse(lin)   0.745     0.641     0.573      0.845       0.746
rmse(log)   0.262     0.214     0.195      0.283       0.285
abs rel     0.220     0.158     0.127      0.209       0.194

7 Results and Analysis

The fine and coarse model that we implemented didn't perform comparably to the benchmark results, as shown in Tables 3 and 4. This may be because of the lack of larger images in the training dataset, as we restricted ourselves to the labeled dataset given the constraint of resources.

The initial encoder-decoder model that we implemented, which consisted of just 94000 parameters, performed terribly and produced no meaningful output. This was because the model didn't have sufficient parameters to be trained for the task at hand. As we increased the number of layers and hence the number of parameters in the model, we observed improvements in the model. A major improvement is observed in the model when we combine the input in the expanding part of the model with the corresponding output in the contracting path, as shown in Figure 5.

Figure 5: Output of the network before and after concatenating the feature map from the corresponding contracting path

Table 3 shows the summary of results across the two models, using various evaluation metrics. We see that our U-Net based model falls short of all the benchmark values by a small margin. However, as observed in the plots in Fig. 5, our results for U-Net are better across all the given metrics as compared to our own implementation of the coarse-fine model. This leads us to believe that if we were able to fix the anomaly in our loss function, our model would
have given much better results that could have been comparable to the benchmark values. The scale invariant loss function does handle scale based ambiguity.

8 Conclusion

We propose an end-to-end trainable model for estimating the depth map from a single RGB image using the NYU Depth labeled dataset. We show how to train this model, using the coarse-fine model for comparison. We see that our model with U-Net based architecture converged in a mere 200 epochs and trains quickly, as there are no heavy fully connected layers used in the model. Also, we use no post-processing of images like CRFs or other additional refinement steps. The output generated by our model is comparable to the fine-coarse model implemented by us and can do even better with increased dataset size.

For future work, we would like to investigate our findings further and try to explore other loss functions with better gradient optimization strategies to better account for the loss in the output images. We would also like to investigate the effects of using the original U-Net architecture, which expects input of size 572 x 572, and see the impact of these increased parameters on the model's performance. We would also like to make our model more robust by applying it to the non-labeled dataset as well.

9 Contributions

− Diksha Meghwal - Worked on implementing variations of the U-Net model, built a log parser to plot the graphs for gradient descent, and worked on setting up the framework for running the program in parallel on GPUs
− Imran - Worked on implementation of the fine and coarse model, extracting and transforming images from the NYU Depth Dataset, and developing the framework to calculate numerous loss values for comparison

References

[1] Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3d: Learning 3d scene structure from a single still image. IEEE transactions on pattern analysis and machine intelligence, 31(5):824–840, 2009.

[2] Kevin Karsch, Ce Liu, and Sing Bing Kang. Depth transfer: Depth extraction from video using non-parametric sampling. IEEE transactions on pattern analysis and machine intelligence, 36(11):2144–2158, 2014.

[3] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.

[4] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 239–248. IEEE, 2016.

[5] Dan Xu, Elisa Ricci, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe. Multi-scale continuous crfs as sequential deep networks for monocular depth estimation. In Proceedings of CVPR, volume 1, 2017.

[6] Caner Hazirbas, Lingni Ma, Csaba Domokos, and Daniel Cremers. Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture. In Asian Conference on Computer Vision, pages 213–228. Springer, 2016.

[7] Yuanzhouhan Cao, Chunhua Shen, and Heng Tao Shen. Exploiting depth from single monocular images for object detection and semantic segmentation. IEEE Transactions on Image Processing, 26(2):836–846, 2017.

[8] Jamie Shotton, Ross Girshick, Andrew Fitzgibbon, Toby Sharp, Mat Cook, Mark Finocchio, Richard Moore, Pushmeet Kohli, Antonio Criminisi, Alex Kipman, et al. Efficient human pose estimation from single depth images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2821–2840, 2013.

[9] Shuran Song and Jianxiong Xiao. Deep sliding shapes for amodal 3d object detection in rgb-d images.

[10] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in neural information processing systems, pages 2366–2374, 2014.

[11] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In European Conference on Computer Vision, pages 746–760. Springer, 2012.

[12] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.

[13] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561, 2015.
[14] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1520–1528, 2015.

[15] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.

[16] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 447–456, 2015.

[17] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.

[18] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian D Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In Cvpr, volume 1, page 5, 2017.

[19] Yun Liu, Ming-Ming Cheng, Xiaowei Hu, Kai Wang, and Xiang Bai. Richer convolutional features for edge detection. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5872–5881. IEEE, 2017.

[20] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In Proceedings of the IEEE international conference on computer vision, pages 1395–1403, 2015.

[21] Derek Hoiem, Alexei A Efros, and Martial Hebert. Automatic photo pop-up. In ACM transactions on graphics (TOG), volume 24, pages 577–584. ACM, 2005.

[22] Alexander G Schwing and Raquel Urtasun. Efficient exact inference for 3d indoor scene understanding. In European Conference on Computer Vision, pages 299–313. Springer, 2012.

[23] Varsha Hedau, Derek Hoiem, and David Forsyth. Thinking inside the box: Using appearance models and context based on room geometry. In European Conference on Computer Vision, pages 224–237. Springer, 2010.

[24] Ashutosh Saxena, Sung H Chung, and Andrew Y Ng. Learning depth from single monocular images. In Advances in neural information processing systems, pages 1161–1168, 2006.

[25] Ashutosh Saxena, Sung H Chung, and Andrew Y Ng. 3-d depth reconstruction from a single still image. International journal of computer vision, 76(1):53–69, 2008.

[26] Guanghui Wang, Hung-Tat Tsui, and QM Jonathan Wu. What can we learn about the scene structure from three orthogonal vanishing points in images. Pattern Recognition Letters, 30(3):192–202, 2009.

[27] Miaomiao Liu, Mathieu Salzmann, and Xuming He. Discrete-continuous depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 716–723, 2014.

[28] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.

[29] Peng Wang, Xiaohui Shen, Zhe Lin, Scott Cohen, Brian Price, and Alan L Yuille. Towards unified depth and semantic prediction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2800–2809, 2015.
