Another type of method exploits features from intermediate layers for generating high-resolution predictions, e.g., the FCN method in [36] and Hypercolumns in [22]. The intuition behind these works is that features from middle layers are expected to describe mid-level representations of object parts, while retaining spatial information. This information is thought to be complementary to the features from early convolution layers, which encode low-level spatial visual information like edges, corners and circles, and also complementary to the high-level features from deeper layers, which encode high-level semantic information, including object- or category-level evidence, but lack strong spatial information.

We argue that features from all levels are helpful for semantic segmentation. High-level semantic features help the category recognition of image regions, while low-level visual features help to generate sharp, detailed boundaries for high-resolution prediction. How to effectively exploit middle layer features remains an open question and deserves more attention. To this end we propose a novel network architecture which effectively exploits multi-level features for generating high-resolution predictions. Our main contributions are as follows:

1. We propose a multi-path refinement network (RefineNet) which exploits features at multiple levels of abstraction for high-resolution semantic segmentation. RefineNet refines low-resolution (coarse) semantic features with fine-grained low-level features in a recursive manner to generate high-resolution semantic feature maps. Our model is flexible in that it can be cascaded and modified in various ways.

2. Our cascaded RefineNets can be effectively trained end-to-end, which is crucial for best prediction performance. More specifically, all components in RefineNet employ residual connections [24] with identity mappings [25], such that gradients can be directly propagated through short-range and long-range residual connections, allowing for both effective and efficient end-to-end training.

3. We propose a new network component we call "chained residual pooling" which is able to capture background context from a large image region. It does so by efficiently pooling features with multiple window sizes and fusing them together with residual connections and learnable weights.

4. The proposed RefineNet achieves new state-of-the-art performance on 7 public datasets, including PASCAL VOC 2012, PASCAL-Context, NYUDv2, SUN-RGBD, Cityscapes, ADE20K, and the object parsing Person-Part dataset. In particular, we achieve an intersection-over-union score of 83.4 on the PASCAL VOC 2012 dataset, outperforming the currently best approach DeepLab by a large margin.

To facilitate future research, we release both source code and trained models for our RefineNet. (Footnote 1: Our source code will be available at https://github.com/)

1.1. Related Work

CNNs have become the most successful methods for semantic segmentation in recent years. The early methods in [18, 23] are region-proposal-based methods which classify region proposals to generate segmentation results. Recently, fully convolutional network (FCNN) based methods [36, 5, 10] have shown effective feature generation and end-to-end training, and have thus become the most popular choice for semantic segmentation. FCNNs have also been widely applied to other dense-prediction tasks, e.g., depth estimation [15, 13, 33], image restoration [14] and image super-resolution [12]. The proposed method here is also based on a fully convolutional network.

FCNN based methods usually have the limitation of low-resolution prediction. A number of techniques have been proposed to address this limitation and generate high-resolution predictions. The atrous convolution based approach DeepLab-CRF [5] directly outputs a middle-resolution score map and then applies the dense CRF method [27] to refine boundaries by leveraging color contrast information. CRF-RNN [47] extends this approach by implementing recurrent layers for end-to-end learning of the dense CRF and the FCNN. Deconvolution methods [38, 2] learn deconvolution layers to up-sample the low-resolution predictions. The depth estimation method [34] employs super-pixel pooling to output high-resolution predictions.

There are several existing methods which exploit middle layer features for segmentation. The FCN method in [36] adds prediction layers to middle layers to generate prediction scores at multiple resolutions, and averages the multi-resolution scores to generate the final prediction mask. Their system is trained in a stage-wise manner rather than end-to-end. The Hypercolumn method [22] merges features from middle layers and learns dense classification layers; it likewise employs stage-wise training instead of end-to-end training. SegNet [2] and U-Net [40] apply skip-connections in the deconvolution architecture to exploit the features from middle layers.

Although there are a few existing works, how to effectively exploit middle layer features remains an open question. We propose a novel network architecture, RefineNet, to address this question. The network architecture of RefineNet is clearly different from existing methods: RefineNet consists of a number of specially designed components which are able to refine the coarse high-level semantic features.
Figure 2. Comparison of fully convolutional approaches for dense classification. Standard multi-layer CNNs, such as ResNet (a), suffer from downscaling of the feature maps, thereby losing fine structures along the way. Dilated convolutions (b) remedy this shortcoming by introducing atrous filters, but are computationally expensive to train and quickly reach memory limits even on modern GPUs. Our proposed architecture, which we call RefineNet (c), exploits various levels of detail at different stages of convolutions and fuses them to obtain a high-resolution prediction without the need to maintain large intermediate feature maps. The details of the RefineNet block are outlined in Sec. 3 and illustrated in Fig. 3.
[Figure 3. The components of one RefineNet block: (a) the block layout with multi-path input, two RCUs per input path, multi-resolution fusion, chained residual pooling and output convolutions; (b) the residual convolution unit (ReLU and 3x3 convolutions with a sum shortcut); (c) multi-resolution fusion (3x3 convolution, up-sampling, summation); (d) chained residual pooling (ReLU, 5x5 pooling, 3x3 convolution).]
according to the resolutions of the feature maps, and employ a 4-cascaded architecture with 4 RefineNet units, each of which directly connects to the output of one ResNet block as well as to the preceding RefineNet block in the cascade. Note, however, that such a design is not unique. In fact, our flexible architecture allows for a simple exploration of different variants. For example, a RefineNet block can accept input from multiple ResNet blocks. We will analyse a 2-cascaded version, a single-block approach as well as a 2-scale 7-path architecture later in Sec. 4.3.
We denote RefineNet-m as the RefineNet block that connects to the output of block-m in ResNet. In practice, each ResNet output is passed through one convolutional layer to adapt the dimensionality. Although all RefineNets share the same internal architecture, their parameters are not tied, allowing for a more flexible adaptation for individual levels of detail. Following the illustration in Fig. 2(c) bottom up, we start from the last block in ResNet, and connect the output of ResNet block-4 to RefineNet-4. Here, there is only one input for RefineNet-4, and RefineNet-4 serves as an extra set of convolutions which adapt the pre-trained ResNet weights to the task at hand, in our case, semantic segmentation. In the next stage, the output of RefineNet-4 and the ResNet block-3 are fed to RefineNet-3 as 2-path inputs. The goal of RefineNet-3 is to use the high-resolution features from ResNet block-3 to refine the low-resolution feature map output by RefineNet-4 in the previous stage. Similarly, RefineNet-2 and RefineNet-1 repeat this stage-wise refinement by fusing high-level information from the later layers and high-resolution but low-level features from the earlier ones. As the last step, the final high-resolution feature maps are fed to a dense soft-max layer to make the final prediction in the form of a dense score map. This score map is then up-sampled to match the original image using bilinear interpolation.

The entire network can be efficiently trained end-to-end. It is important to note that we introduce long-range residual connections between the blocks in ResNet and the RefineNet modules. During the forward pass, these long-range residual connections convey the low-level features that encode visual details for refining the coarse high-level feature maps. In the training step, the long-range residual connections allow direct gradient propagation to early convolution layers, which helps effective end-to-end training.
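To make the cascade concrete, the following is a minimal PyTorch-style sketch of the 4-cascaded forward pass. Our system itself is built on MatConvNet; the names CascadedRefineNet, resnet_stages and refine_blocks are illustrative placeholders, not our released code:

```python
import torch.nn as nn
import torch.nn.functional as F

class CascadedRefineNet(nn.Module):
    """Sketch of the 4-cascaded RefineNet forward pass (placeholder modules)."""
    def __init__(self, resnet_stages, refine_blocks, num_classes, channels=256):
        super().__init__()
        self.stages = nn.ModuleList(resnet_stages)   # ResNet block-1 .. block-4
        self.refine = nn.ModuleList(refine_blocks)   # RefineNet-4 .. RefineNet-1
        self.classifier = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, image):
        feats, x = [], image
        for stage in self.stages:        # collect the 1/4 .. 1/32 feature maps
            x = stage(x)
            feats.append(x)
        f1, f2, f3, f4 = feats
        r4 = self.refine[0](f4)          # RefineNet-4: single input
        r3 = self.refine[1](f3, r4)      # refine the coarse map with 1/16 features
        r2 = self.refine[2](f2, r3)
        r1 = self.refine[3](f1, r2)      # final high-resolution (1/4) feature map
        score = self.classifier(r1)      # dense score map (softmax in the loss)
        return F.interpolate(score, size=image.shape[-2:],
                             mode='bilinear', align_corners=False)
```

The long-range residual connections described above are simply the f1-f3 inputs that skip from the ResNet blocks directly into the corresponding RefineNet blocks.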
3.2. RefineNet
The architecture of one RefineNet block is illustrated in Fig. 3(a). In the multi-path overview shown in Fig. 2(c), RefineNet-4 has one input path, while all other RefineNet blocks have two inputs. Note, however, that our architecture is generic and each RefineNet block can be easily modified to accept an arbitrary number of feature maps with arbitrary resolutions and depths.
Residual convolution unit. The first part of each RefineNet block consists of an adaptive convolution set that mainly fine-tunes the pretrained ResNet weights for our task. To that end, each input path is passed sequentially through two residual convolution units (RCUs), each of which is a simplified version of the convolution unit in the original ResNet [24] with the batch-normalization layers removed (cf. Fig. 3(b)). The filter number for each input path is set to 512 for RefineNet-4 and 256 for the remaining ones in our experiments.
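As an illustration, a minimal PyTorch-style sketch of one RCU could look as follows; the ReLU-conv-ReLU-conv ordering is one plausible reading of Fig. 3(b), and the code is not our MatConvNet implementation:

```python
import torch.nn as nn

class ResidualConvUnit(nn.Module):
    """Residual convolution unit: a ResNet conv unit with the
    batch-normalization layers removed (cf. Fig. 3(b)); sketch only."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False))

    def forward(self, x):
        # Identity shortcut: the non-linearities stay on the branch (see Sec. 3.3).
        return x + self.body(x)
```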
Multi-resolution fusion. All path inputs are then fused into a high-resolution feature map by the multi-resolution fusion block, depicted in Fig. 3(c). This block first applies convolutions for input adaptation, which generate feature maps of the same feature dimension (the smallest one among the inputs), and then up-samples all (smaller) feature maps to the largest resolution of the inputs. Finally, all feature maps are fused by summation. The input adaptation in this block also helps to re-scale the feature values appropriately along different paths, which is important for the subsequent sum-fusion. If there is only one input path (e.g., the case of RefineNet-4 in Fig. 2(c)), the input path will directly go through this block without changes.
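A hedged sketch of this fusion block (again PyTorch-style with illustrative names, not our MatConvNet code):

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiResolutionFusion(nn.Module):
    """Input-adaptation convolutions, up-sampling of the smaller maps to the
    largest input resolution, then summation (cf. Fig. 3(c)); sketch only."""
    def __init__(self, out_channels, *in_channels):
        super().__init__()
        self.adapt = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=3, padding=1, bias=False)
            for c in in_channels)

    def forward(self, *paths):
        if len(paths) == 1:          # e.g. RefineNet-4: pass through unchanged
            return paths[0]
        adapted = [conv(p) for conv, p in zip(self.adapt, paths)]
        # Up-sample everything to the largest spatial resolution among the inputs.
        target = max((a.shape[-2:] for a in adapted), key=lambda s: s[0] * s[1])
        up = [a if a.shape[-2:] == target else
              F.interpolate(a, size=target, mode='bilinear', align_corners=False)
              for a in adapted]
        return sum(up)               # sum-fusion of all paths
```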
Chained residual pooling. The output feature map then goes through the chained residual pooling block, schematically depicted in Fig. 3(d). The proposed chained residual pooling aims to capture background context from a large image region. It is able to efficiently pool features with multiple window sizes and fuse them together using learnable weights. In particular, this component is built as a chain of multiple pooling blocks, each consisting of one max-pooling layer and one convolution layer. One pooling block takes the output of the previous pooling block as input. Therefore, the current pooling block is able to re-use the result from the previous pooling operation and thus access the features from a large region without using a large pooling window. If not further specified, we use two pooling blocks, each with stride 1, in our experiments.

The output feature maps of all pooling blocks are fused together with the input feature map through summation of residual connections. Note that our choice to employ residual connections also persists in this building block, which once again facilitates gradient propagation during training. In one pooling block, each pooling operation is followed by convolutions which serve as a weighting layer for the summation fusion. It is expected that this convolution layer will learn to accommodate the importance of the pooling block during the training process.
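The following PyTorch-style sketch shows this chain with the default two pooling blocks (illustrative, not our MatConvNet code; the 5x5 window follows Fig. 3(d)):

```python
import torch.nn as nn

class ChainedResidualPooling(nn.Module):
    """A ReLU followed by a chain of {5x5 max-pool (stride 1) -> 3x3 conv}
    blocks; every block output is summed back onto the input. Sketch only."""
    def __init__(self, channels, num_blocks=2):
        super().__init__()
        self.relu = nn.ReLU()
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.MaxPool2d(kernel_size=5, stride=1, padding=2),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False))
            for _ in range(num_blocks))

    def forward(self, x):
        x = self.relu(x)           # the single ReLU discussed in Sec. 3.3
        out = path = x
        for block in self.blocks:
            path = block(path)     # re-use the previous block's pooled result
            out = out + path       # residual sum; the conv acts as a learnable weight
        return out
```

Because each block pools the already-pooled output of its predecessor, the chain covers increasingly large windows while every individual pooling layer stays small.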
Output convolutions. The final step of each RefineNet block is another residual convolution unit (RCU). This results in a sequence of three RCUs between each block. To reflect this behavior in the last RefineNet-1 block, we place two additional RCUs before the final softmax prediction step. The goal here is to employ non-linearity operations on the multi-path fused feature maps to generate features for further processing or for final prediction. The feature dimension remains the same after going through this block.

3.3. Identity Mappings in RefineNet

Note that all convolutional components of the RefineNet have been carefully constructed inspired by the idea behind residual connections and follow the rule of identity mapping [25]. This enables effective backward propagation of the gradient through RefineNet and facilitates end-to-end learning of cascaded multi-path refinement networks.

Employing residual connections with identity mappings allows the gradient to be directly propagated from one block
to any other blocks, as was recently shown by [25]. This concept encourages maintaining a clean information path for shortcut connections, so that these connections are not "blocked" by any non-linear layers or components. Instead, non-linear operations are placed on branches of the main information path. We follow this guideline for developing the individual components in RefineNet, including all convolution units. It is this particular strategy that allows the multi-cascaded RefineNet to be trained effectively. Note that we include one non-linear activation layer (ReLU) in the chained residual pooling block. We observed that this ReLU is important for the effectiveness of subsequent pooling operations and it also makes the model less sensitive to changes in the learning rate. We observed that one single ReLU in each RefineNet block does not noticeably reduce the effectiveness of gradient flow.

We have both short-range and long-range residual connections in RefineNet. Short-range residual connections refer to local shortcut connections in one RCU or the residual pooling component, while long-range residual connections refer to the connections between the RefineNet modules and the ResNet blocks. With long-range residual connections, the gradient can be directly propagated to early convolution layers in ResNet, which enables end-to-end training of all network components.

The fusion block fuses the information of multiple shortcut paths, which can be considered as performing summation fusion of multiple residual connections with necessary dimension or resolution adaptation. In this aspect, the role of the multi-resolution fusion block here is analogous to the role of the "summation" fusion in a conventional residual convolution unit in ResNet. There are certain layers in RefineNet, and in particular within the fusion block, that perform linear feature transformation operations, like linear feature dimension reduction or bilinear up-sampling. These layers are placed on the shortcut paths, which is similar to the case in ResNet [24]. As in ResNet, when a shortcut connection crosses two blocks, it will include a convolution layer in the shortcut path for linear feature dimension adaptation, which ensures that the feature dimension matches the subsequent summation in the next block. Since only linear transformations are employed in these layers, gradients can still be propagated through these layers effectively.
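As a concrete illustration of this rule (a sketch under our naming assumptions, not a component lifted from the paper), a shortcut that crosses a change of feature dimension carries only a linear 1x1 convolution, while the ReLUs remain on the branch:

```python
import torch.nn as nn

class LinearShortcutUnit(nn.Module):
    """Illustrative residual unit: only a linear 1x1 convolution sits on the
    shortcut path, so gradients pass through without hitting non-linearities."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.shortcut = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.branch = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1, bias=False))

    def forward(self, x):
        return self.shortcut(x) + self.branch(x)
```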
4. Experiments

To show the effectiveness of our approach, we carry out comprehensive experiments on seven public datasets, which include six popular datasets for semantic segmentation of indoor and outdoor scenes (NYUDv2, PASCAL VOC 2012, SUN-RGBD, PASCAL-Context, Cityscapes, ADE20K MIT) and one dataset for object parsing called Person-Part. The segmentation quality is measured by the intersection-over-union (IoU) score [16], the pixel accuracy and the mean accuracy [36] over all classes. As commonly done in the literature, we apply simple data augmentation during training. Specifically, we perform random scaling (ranging from 0.7 to 1.3), random cropping and horizontal flipping of the images. If not further specified, we apply test-time multi-scale evaluation, which is a common practice in segmentation methods [10, 6]. For multi-scale evaluation, we average the predictions on the same image across different scales for the final prediction. We also present an ablation experiment to inspect the impact of various components and an alternative 2-cascaded version of our model. Our system is built on MatConvNet [44].
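For reference, the evaluation metrics and the multi-scale averaging can be computed as in the sketch below (NumPy/PyTorch-style; the helper names and the scale set are illustrative, not the exact configuration of our experiments):

```python
import numpy as np
import torch.nn.functional as F

def confusion_matrix(pred, gt, num_classes):
    """Pixel-level confusion matrix for one image (pred, gt: integer numpy arrays)."""
    valid = gt < num_classes                      # skip void/ignore labels
    idx = num_classes * gt[valid].astype(int) + pred[valid].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def segmentation_scores(conf):
    """Mean IoU, pixel accuracy and mean accuracy from an accumulated confusion matrix."""
    tp = np.diag(conf).astype(float)
    iou = tp / np.maximum(conf.sum(0) + conf.sum(1) - tp, 1)  # per-class IoU
    pixel_acc = tp.sum() / conf.sum()
    mean_acc = (tp / np.maximum(conf.sum(1), 1)).mean()
    return iou.mean(), pixel_acc, mean_acc

def multi_scale_predict(model, image, scales=(0.6, 0.8, 1.0, 1.2)):
    """Average per-pixel class probabilities of the same image over several scales."""
    probs = 0
    for s in scales:
        size = [int(round(d * s)) for d in image.shape[-2:]]
        x = F.interpolate(image, size=size, mode='bilinear', align_corners=False)
        p = model(x).softmax(dim=1)               # class probabilities at this scale
        probs = probs + F.interpolate(p, size=image.shape[-2:],
                                      mode='bilinear', align_corners=False)
    return (probs / len(scales)).argmax(dim=1)    # final label map
```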
4.1. Object Parsing

We first present our results on the task of object parsing, which consists of recognizing and segmenting object parts. We carry out experiments on the Person-Part dataset [8, 7], which provides pixel-level labels for six person parts including Head, Torso, Upper/Lower Arms and Upper/Lower Legs. The rest of each image is considered background. There are 1717 training images and 1818 test images. We use four pooling blocks in our chained residual pooling for this dataset.

We compare our results to a number of state-of-the-art methods, listed in Table 1. The results clearly demonstrate the improvement over previous works. In particular, we significantly outperform the recent DeepLab-v2 approach [6], which is based on dilated convolutions for high-resolution segmentation, using the same ResNet as initialization. In Table 2, we present an ablation experiment to quantify the influence of the following components: network depth, chained residual pooling and multi-scale evaluation (Msc Eva), as described earlier. This experiment shows that each of these three factors can improve the overall performance. Qualitative examples of our object parsing on this dataset are shown in Fig. 4.

Table 1. Object parsing results on the Person-Part dataset. Our method achieves the best performance (bold).
method                    IoU
Attention [7]             56.4
HAZN [45]                 57.5
LG-LSTM [29]              58.0
Graph-LSTM [28]           60.2
DeepLab [5]               62.8
DeepLab-v2 (Res101) [6]   64.9
RefineNet-Res101 (ours)   68.6

Table 2. Ablation experiments on NYUDv2 and Person-Part (IoU).
Initialization  Chained pool.  Msc Eva  NYUDv2  Person-Parts
ResNet-50       no             no       40.4    64.1
ResNet-50       yes            no       42.5    65.7
ResNet-50       yes            yes      43.8    67.1
ResNet-101      yes            no       43.6    67.6
ResNet-101      yes            yes      44.7    68.6
ResNet-152      yes            yes      46.5    68.8
Table 3. Segmentation results on NYUDv2 (40 classes).
method training data pixel acc. mean acc. IoU
Gupta et al. [20] RGB-D 60.3 - 28.6
FCN-32s [36] RGB 60.0 42.2 29.2
FCN-HHA [36] RGB-D 65.4 46.1 34.0
Context [30] RGB 70.0 53.6 40.6
RefineNet-Res152 RGB 73.6 58.9 46.5
Table 4. Segmentation results on the PASCAL VOC 2012 dataset (per-class IoU and mean IoU).
Method  aero bike bird boat bottle bus car cat chair cow table dog horse mbike person potted sheep sofa train tv  mean
FCN-8s [36] 76.8 34.2 68.9 49.4 60.3 75.3 74.7 77.6 21.4 62.5 46.8 71.8 63.9 76.5 73.9 45.2 72.4 37.4 70.9 55.1 62.2
DeconvNet [38] 89.9 39.3 79.7 63.9 68.2 87.4 81.2 86.1 28.5 77.0 62.0 79.0 80.3 83.6 80.2 58.8 83.4 54.3 80.7 65.0 72.5
CRF-RNN [47] 90.4 55.3 88.7 68.4 69.8 88.3 82.4 85.1 32.6 78.5 64.4 79.6 81.9 86.4 81.8 58.6 82.4 53.5 77.4 70.1 74.7
BoxSup [10] 89.8 38.0 89.2 68.9 68.0 89.6 83.0 87.7 34.4 83.6 67.1 81.5 83.7 85.2 83.5 58.6 84.9 55.8 81.2 70.7 75.2
DPN [35] 89.0 61.6 87.7 66.8 74.7 91.2 84.3 87.6 36.5 86.3 66.1 84.4 87.8 85.6 85.4 63.6 87.3 61.3 79.4 66.4 77.5
Context [30] 94.1 40.7 84.1 67.8 75.9 93.4 84.3 88.4 42.5 86.4 64.7 85.4 89.0 85.8 86.0 67.5 90.2 63.8 80.9 73.0 78.0
DeepLab [5] 89.1 38.3 88.1 63.3 69.7 87.1 83.1 85.0 29.3 76.5 56.5 79.8 77.9 85.8 82.4 57.4 84.3 54.9 80.5 64.1 72.7
DeepLab2-Res101 [6] 92.6 60.4 91.6 63.4 76.3 95.0 88.4 92.6 32.7 88.5 67.6 89.6 92.1 87.0 87.4 63.3 88.3 60.0 86.8 74.5 79.7
CSupelec-Res101 [4] 92.9 61.2 91.0 66.3 77.7 95.3 88.9 92.4 33.8 88.4 69.1 89.8 92.9 87.7 87.5 62.6 89.9 59.2 87.1 74.2 80.2
RefineNet-Res101 94.9 60.2 92.8 77.5 81.5 95.0 87.4 93.3 39.6 89.3 73.0 92.7 92.4 85.4 88.3 69.7 92.2 65.3 84.2 78.7 82.4
RefineNet-Res152 94.7 64.3 94.9 74.9 82.9 95.1 88.5 94.7 45.5 91.4 76.3 90.6 91.8 88.1 88.0 69.9 92.3 65.9 88.7 76.8 83.4
Figure 7. Illustration of 3 variants of our network architecture: (a) single RefineNet, (b) 2-cascaded RefineNet and (c) 4-cascaded RefineNet
with 2-scale ResNet. Note that our proposed RefineNet block can seamlessly handle different numbers of inputs of arbitrary resolutions
and dimensions without any modification.
... an ARC Laureate Fellowship (FL130100102).

References

[1] A. Arnab, S. Jayasumana, S. Zheng, and P. H. Torr. Higher order conditional random fields in deep neural networks. In ECCV, 2016.
[2] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. CoRR, 2015.
[3] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, 2012.
[4] S. Chandra and I. Kokkinos. Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs. In ECCV, 2016.
[5] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
[6] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. CoRR, abs/1606.00915, 2016.
[7] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. arXiv preprint arXiv:1511.03339, 2015.
[8] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, 2014.
[9] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[10] J. Dai, K. He, and J. Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV, 2015.
[11] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In CVPR, 2015.
[12] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In ECCV, 2014.
[13] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, 2015.
[14] D. Eigen, D. Krishnan, and R. Fergus. Restoring an image taken through a window covered with dirt or rain. In ICCV, 2013.
[15] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014.
[16] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 2010.
[17] G. Ghiasi and C. C. Fowlkes. Laplacian pyramid reconstruction and refinement for semantic segmentation. In ECCV, 2016.
[18] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[19] S. Gupta, P. Arbelaez, and J. Malik. Perceptual organization and recognition of indoor scenes from RGB-D images. In CVPR, 2013.
[20] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation. In ECCV, 2014.
[21] B. Hariharan, P. Arbelaez, L. D. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
[22] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
[23] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.
[24] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[25] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027, 2016.
[26] A. Kendall, V. Badrinarayanan, and R. Cipolla. Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. CoRR, abs/1511.02680, 2015.
[27] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.
[28] X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan. Semantic object parsing with graph LSTM. arXiv preprint arXiv:1603.07063, 2016.
[29] X. Liang, X. Shen, D. Xiang, J. Feng, L. Lin, and S. Yan. Semantic object parsing with local-global long short-term memory. arXiv preprint arXiv:1511.04510, 2015.
[30] G. Lin, C. Shen, A. van den Hengel, and I. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In CVPR, 2016.
[31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[32] C. Liu, J. Yuen, and A. Torralba. SIFT Flow: Dense correspondence across scenes and its applications. IEEE T. Pattern Analysis & Machine Intelligence, 2011.
[33] F. Liu, C. Shen, and G. Lin. Deep convolutional neural fields for depth estimation from a single image. In CVPR, 2015.
[34] F. Liu, C. Shen, G. Lin, and I. D. Reid. Learning depth from single monocular images using deep convolutional neural fields. CoRR, abs/1502.07411, 2015.
[35] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In ICCV, 2015.
[36] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[37] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, et al. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014.
[38] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015.
[39] X. Ren, L. Bo, and D. Fox. RGB-(D) scene labeling: Features and algorithms. In CVPR, 2012.
[40] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241, 2015.
[41] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
[42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[43] S. Song, S. P. Lichtenberg, and J. Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In CVPR, 2015.
[44] A. Vedaldi and K. Lenc. MatConvNet – convolutional neural networks for MATLAB, 2014.
[45] F. Xia, P. Wang, L.-C. Chen, and A. L. Yuille. Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. arXiv preprint arXiv:1511.06881, 2015.
[46] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, 2015.
[47] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.
[48] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Semantic understanding of scenes through the ADE20K dataset. CoRR, abs/1608.05442, 2016.