
Hindawi

Journal of Healthcare Engineering


Volume 2022, Article ID 4189781, 16 pages
https://doi.org/10.1155/2022/4189781

Review Article
U-Net-Based Medical Image Segmentation

Xiao-Xia Yin,1,2 Le Sun,3 Yuhan Fu,1 Ruiliang Lu,4 and Yanchun Zhang1

1Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou 510006, China
2College of Engineering and Science, Victoria University, Melbourne, VIC 8001, Australia
3Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing University of Information Science and Technology, Nanjing, China
4Department of Radiology, The First People's Hospital of Foshan, Foshan 528000, China

Correspondence should be addressed to Xiao-Xia Yin; xiaoxia.yin@gzhu.edu.cn

Received 26 January 2022; Revised 2 March 2022; Accepted 23 March 2022; Published 15 April 2022

Academic Editor: Hangjun Che

Copyright © 2022 Xiao-Xia Yin et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Deep learning has been extensively applied to segmentation in medical imaging. U-Net, proposed in 2015, offers accurate segmentation of small targets and a scalable network architecture. With the rising performance requirements of medical image segmentation in recent years, U-Net has been cited academically more than 2500 times, and many scholars have continued to develop the U-Net architecture. This paper summarizes medical image segmentation technologies based on U-Net structural variants with respect to their structure, innovation, efficiency, and so on; reviews and categorizes the related methodology; and introduces the loss functions, evaluation parameters, and modules commonly applied to segmentation in medical imaging, providing a good reference for future research.

1. Introduction

Interpretation of medical images such as CT and MRI requires extensive training and skill because organs and lesions must be segmented layer by layer. Manual segmentation imposes a heavy workload on doctors and can introduce bias when it involves their subjective opinions. Analyzing complicated images often requires several doctors to reach a joint diagnosis, which is time consuming. Furthermore, automatic segmentation is a challenging task that remains unsolved for most medical applications owing to the wide variety of image modalities, encoding parameters, and organic variability.

According to [1], medical imaging use increased rapidly from 2000 to 2016. As illustrated in Figure 1(a), a retrospective cohort study of medical imaging patterns between 2000 and 2016 covered 16 million to 21 million patients enrolled annually in 7 US integrated and mixed-model insurance health care systems, together with individuals receiving care in Ontario, Canada. Relative imaging rates by modality, such as computed tomography (CT), magnetic resonance imaging (MRI), and ultrasound used by adults (18-64 years) annually in the US and Ontario, are illustrated in Figures 1(b)-1(d), respectively. The imaging rates (per 1000 people) of CT, MRI, and ultrasound continued to increase among adults, but at a lower pace in more recent years. Whether the observed imaging utilization was appropriate or was associated with improved patient outcomes is unknown.

Nowadays, the application of deep learning to medical imaging has attracted extensive attention, and automatically recognizing and segmenting lesions in medical images has become a concern of many researchers. Ronneberger et al. [2] proposed U-Net at the MICCAI conference in 2015 to tackle this problem, a breakthrough for deep learning in medical image segmentation. U-Net is a fully convolutional network (FCN) applied to biomedical image segmentation, composed of an encoder, a bottleneck module, and a decoder. The widely used U-Net meets the requirements of medical image segmentation thanks to its U-shaped structure that combines context information, its fast training speed, and the small amount of data it needs. The structure of U-Net is shown in Figure 2.

[Figure 1: four line charts, panels (a)-(d), each plotting rate per 1000 person-years against examination year (2000-2016), with separate curves for adults aged 18-64 y and ≥65 y.]

Figure 1: Illustration of relative rates of imaging for the United States compared with Ontario from 2000 to 2016. CT indicates computed tomography; MRI, magnetic resonance imaging. All US data are shown as solid curves; Ontario data are shown as dashed curves [1]. (a) All examinations. (b) CT. (c) MRI. (d) Ultrasound.

Containing many slices, biomedical images often occupy a blocky volume space. A 2D image processing algorithm is often used to analyze a 3D image [3-7], but when the slices are sorted and trained one by one, computational expense rises and efficiency drops, so volume images are difficult to handle in many cases. A 3D U-Net model derived from the 2D U-Net was designed to address these problems. To further target architectures of different forms and dimensions, Oktay et al. [8] proposed a new attention gate (AG) model for medical image analysis. A model trained with AG implicitly learns to suppress irrelevant regions in an input image and highlight salient features useful for a specific task. This removes the need for the explicit external tissue/organ localization units of cascaded convolutional neural networks (CNNs) [8, 11]. AG can be combined with standard CNN structures such as U-Net, increasing the sensitivity and precision of the model. To obtain higher-level features while retaining spatial information for 2D segmentation, Gu et al. [12] proposed the context encoder network (CE-Net) in 2019, using pretrained Res-Net blocks as fixed feature extractors. It is mainly composed of three parts: a feature encoder, a context extractor, and a feature decoder. The context extractor consists of a newly introduced dense atrous convolution (DAC) block and a residual multikernel pooling (RMP) block. CE-Net is widely applied to segmentation in 2D medical imaging [11] and outperforms the original U-Net.

To further advance segmentation, UNet++, a novel and more powerful neural network structure for image segmentation, was proposed by Zhou et al. [13]. It is a deeply supervised encoder-decoder network whose subnetworks are connected by a series of nested, dense skip paths that narrow the semantic gap between the encoder and decoder feature maps. Later, to improve accuracy further, especially for organs of different sizes, UNet 3+ was designed by Huang et al. [14]. It utilizes full-scale skip connections and deep supervision, combining low-level details with high-level semantics from feature maps at different scales and learning hierarchical representations from the full-scale aggregated feature maps. UNet 3+ also increases computational efficiency by reducing network parameters.

The nnU-Net ("no-new-Net") framework was developed by Isensee et al. [15] as a robust, self-adaptive framework derived from U-Net. It makes only slight alterations to the 2D and 3D U-Net, letting 2D, 3D, and cascaded variants work together as a network pool. nnU-Net not only adapts its architecture automatically to the given image geometry but also thoroughly defines all the other steps, including image preprocessing, training, testing, and potential postprocessing.

[Figure 2: U-Net architecture diagram; input tile 572 × 572, channel counts 64→128→256→512→1024 along the contracting path and back up to the 388 × 388 output segmentation map; legend: conv 3 × 3 + ReLU, copy and crop, max pool 2 × 2, up-conv 2 × 2, conv 1 × 1.]
Figure 2: Illustration of the U-Net convolutional network structure. The left side of the U shape is the encoding stage, also called the contraction path, with each layer consisting of two 3 × 3 convolutions with ReLU activation followed by a 2 × 2 maximum pooling layer. The right side of the U shape, also called the expansion path, consists of the decoding stage, where upsampling is realized via 2 × 2 deconvolutions that halve the number of input channels [2].

U2-Net, a simple yet powerful deep network architecture developed by Qin et al. [16], consists of a two-level nested U-shaped structure applied to salient object detection (SOD). It has the following advantages: (1) thanks to the mixed receptive fields of various sizes in the proposed residual U-shaped block (RSU), it can capture a large amount of contextual information at various scales; (2) the pooling operations used in the RSU block increase the depth of the entire structure without substantially pushing up the computational cost.

TransUNet, designed by Chen et al. [17], encodes tokenized image patches and extracts global context from the input sequence of CNN feature maps; its decoder upsamples the encoded features and combines them with high-resolution CNN feature maps for precise localization. It uses transformers as a powerful encoding structure for segmentation. Owing to the inherent locality of convolution operations, U-Net usually shows limitations in explicitly modeling long-range dependencies. The transformer, designed for sequence-to-sequence prediction, has become an alternative architecture with an innate global self-attention mechanism, although the localization capability of the transformer alone may be limited by insufficient low-level detail.

Since U-Net was proposed, its encoder-decoder skip-connection structure has inspired a large number of segmentation methods in medical imaging. Deep learning techniques such as the attention mechanism, dense modules, feature enhancement, and improved evaluation functions have been introduced into basic U-Net structures for medical image segmentation and have become widely adopted. These U-Net-related variants are designed to optimize results by improving the accuracy and computational efficiency of medical image segmentation through changed network structures, added modules, and so on. However, most of the existing literature on U-Net focuses on introducing isolated new ideas and rarely gives a comprehensive review of the variations of the U-Net structure for deep-learning-based medical image segmentation. This paper discusses some of these ideas in more depth.

To sum up, the basic motivation behind this work is not to elaborate new ideas for U-Net-related deep learning networks but to use such networks effectively for the segmentation of multidimensional biomedical data. The presented methods generalize to any dimension and can be used effectively on other types of multidimensional data as well.

This paper is organized as follows. Section 2 addresses the current challenges faced by medical image segmentation. Section 3 reviews these variations of U-Net-related deep learning networks. Section 4 collects experimental results from the literature for the different U-Net networks, along with the validation parameters used to optimize the network structure of the associated deep learning models; the future development of U-Net-based variant networks is also analyzed and discussed there. Finally, Section 5 concludes this paper.

2. Existing Challenges

This section presents the current challenges faced by medical image segmentation, which make it necessary to improve and innovate U-Net-based deep learning approaches.

First, medical image processing requires extremely high accuracy for disease diagnosis [18-23]. Segmentation in medical imaging refers to pixel-level or voxel-level segmentation, and the boundaries between multiple cells and organs are generally difficult to distinguish in an image [3]. Moreover, the data obtained from the images are usually preprocessed, the relevant network is built, and it is then repeatedly rerun with adjusted parameters even after a certain level of accuracy has been reached with the chosen deep learning model [24].

Second, medical images are acquired from various medical equipment, and the standards for images, annotations, and the performance of CT/MRI machines are not uniform. Hence, trained deep learning models are only suitable for specific scenarios. Meanwhile, a deep network with weak generalization may easily capture wrong features from the analyzed medical images. Furthermore, a significant imbalance always exists between the numbers of negative and positive samples, which can strongly affect segmentation. U-Net, however, offers an approach that performs well at reducing overfitting [25].

Third, interpretable deep learning models for analyzing medical images are highly desirable, but there is a lack of confidence in their predicted results [26, 27], and U-Net is a CNN with poor interpretability. Segmentation in medical imaging should reflect the patient's physiological condition and support accurate disease diagnosis, and segmentation that lacks interpretability and confidence is hard for professional doctors to trust and adopt clinically. Although disease diagnosis relies mainly on images, it is combined with other sources of evidence, which adds further complexity. Achieving interpretability and confidence in medical image segmentation by perceiving and adjusting these trade-offs remains a challenge.

3. Methodology

Various medical image segmentation methods based on U-Net have been developed very quickly for performance optimization. U-Net has been improved in the areas of application range, feature enhancement, training speed, training accuracy, feature fusion, small-sample training sets, and generalization. Various strategies are applied in the design of different network structures to address different segmentation problems.

This section focuses on variations of U-Net-based networks, starting with a description of the U-Net framework, followed by a comprehensive analysis of the U-Net variants through (1) intermodality and (2) intramodality categorization, to establish better insight into the associated challenges and solutions. The main related work is summarized in terms of the improved performance indicators and the main structural characteristics.

3.1. Traditional U-Net. The traditional U-Net is a two-dimensional network architecture whose structure is shown in Figure 2. U-Net modifies and extends the fully convolutional network (FCN) so that it works with very few training images and produces more accurate segmentations. The major idea is to supplement the usual contracting network, in which sequential convolutional layers and pooling operations act as a downsampling operator, with an upsampling operator, so that these layers raise the resolution of the output. For localization, the high-resolution features of the contracting path are combined with the upsampled output; sequential convolutional layers can then learn fine features and yield a more accurate segmentation.

An important modification in the U-Net architecture lies in the upsampling section, where a large number of feature channels allow the network to propagate contextual information to higher-resolution layers. Consequently, the expansion path is roughly symmetrical to the contraction path, forming a U-shaped structure. The network uses only the valid part of every convolution: the segmentation map contains only pixels for which the complete context is available in the input image. This approach allows seamless segmentation of arbitrarily large images via the crucial overlapping-tile strategy, without which the resolution would be limited by GPU memory [1].

A traditional CNN is usually connected to several fully connected layers after convolution, and the feature map produced by the convolutional layers is mapped into a fixed-length feature vector for image-level classification. An improved FCN structure, however, classifies the image at the pixel level, thereby enabling segmentation at the semantic level [28].

U-Net is well suited to segmentation given the large size of medical images. Such large images cannot be fed into the network whole; for segmentation they must be cut into small patches. Because of U-Net's network structure, the overlapping-tile strategy works well for this patch-wise processing, so the network can accept images of any size as input [29].
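To make the building pattern concrete, the following is a minimal PyTorch sketch (ours, not the authors' code) of the U-Net stages described above: two 3 × 3 convolutions with ReLU per stage, 2 × 2 max pooling on the way down, and a 2 × 2 up-convolution plus skip concatenation on the way up. The class names, channel counts, and use of padded (rather than valid) convolutions are illustrative assumptions for brevity.

```python
# Minimal U-Net-style block sketch; channels/depth are illustrative.
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two successive 3x3 conv + ReLU layers, as in each U-Net stage.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=1, n_classes=2):
        super().__init__()
        self.enc1 = double_conv(in_ch, 64)
        self.enc2 = double_conv(64, 128)
        self.pool = nn.MaxPool2d(2)                          # 2x2 max pooling
        self.up = nn.ConvTranspose2d(128, 64, 2, stride=2)   # 2x2 up-convolution
        self.dec1 = double_conv(128, 64)                     # 128 = 64 (skip) + 64 (up)
        self.head = nn.Conv2d(64, n_classes, kernel_size=1)  # 1x1 conv -> class scores

    def forward(self, x):
        s1 = self.enc1(x)                          # high-resolution features
        b = self.enc2(self.pool(s1))               # bottleneck features
        u = self.up(b)                             # upsample back to s1's resolution
        d = self.dec1(torch.cat([s1, u], dim=1))   # skip connection: copy and concatenate
        return self.head(d)

seg = TinyUNet()(torch.randn(1, 1, 64, 64))  # -> shape (1, 2, 64, 64)
```

Note that the original U-Net uses unpadded convolutions together with the overlapping-tile strategy; the padded version above simply keeps the sketch short.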

3.2. 3D U-Net. Biomedical imaging produces sets of three-dimensional images composed of slices at different locations, so biomedical image analysis involves dealing with a large amount of volume data. Annotating such data with segmentation labels is difficult because only two-dimensional slices can be displayed on a computer, and low efficiency and loss of context are common when 3D images are processed with traditional 2D models. To solve this, Çiçek et al. [30] put forward a 3D U-Net with a contracting encoder part that analyzes the entire image and a successive expanding decoder part that produces a full-resolution segmentation, building on the earlier U-Net structure. The structure of 3D U-Net is similar to that of 2D U-Net in many respects, except that all operations in the 3D network are replaced with their 3D counterparts: 3D convolution, 3D pooling, and 3D upsampling. Batch normalization (BN) [31] is used to prevent network bottlenecks.

Just like the standard U-Net, the network has an encoding path and a decoding path, each with four resolution steps. Each layer of the encoding path contains two 3 × 3 × 3 convolutions, each followed by a rectified linear unit (ReLU), and then a 2 × 2 × 2 maximum pooling layer with a stride of 2. Each layer in the synthesis path consists of a 2 × 2 × 2 up-convolution with a stride of 2 in each dimension, followed by two 3 × 3 × 3 convolutions, each with a ReLU behind it. Skip connections from the equal-resolution feature maps of the encoding path provide the necessary high-resolution features to the decoding path. In the last layer, a 1 × 1 × 1 convolution reduces the number of output channels to the number of labels, which is 3. The structure has 19,069,955 parameters in total.

In addition to rotation, scaling, and gray-value augmentation, smooth dense deformation fields are applied to the data and ground-truth labels before training. For this, random vectors are sampled from a normal distribution with a standard deviation of 4 on a grid with 32-voxel spacing in each direction, followed by B-spline interpolation. A softmax with weighted cross-entropy loss compares the network output with the ground-truth labels; the weight of the frequently occurring background is reduced and the weight of the inner tubules is increased to balance the influence of small blood vessels and background voxels on the loss.

This end-to-end learning strategy can segment 3D targets from sparse annotations in both semiautomatic and fully automatic settings. The structure and data augmentation of this network allow it to learn from a small number of labeled samples and to generalize well. Applying appropriate rigid transformations and minor elastic deformations generates plausible images, justifies the preprocessing method, and lets the network structure scale to 3D data sets of any size.
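A sketch of one 3D encoder stage as just described (two 3 × 3 × 3 convolutions, each followed by batch normalization and ReLU, then 2 × 2 × 2 max pooling with stride 2) could look as follows; the class name and channel counts are illustrative assumptions, not the exact configuration of [30].

```python
# One 3D U-Net-style encoder stage; channels are illustrative.
import torch
import torch.nn as nn

class Encoder3DStage(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool3d(kernel_size=2, stride=2)  # 2x2x2 pooling, stride 2

    def forward(self, x):
        f = self.block(x)       # full-resolution features, kept for the skip connection
        return f, self.pool(f)  # pooled tensor goes to the next, coarser stage

stage = Encoder3DStage(1, 32)
skip, down = stage(torch.randn(1, 1, 32, 64, 64))  # down: (1, 32, 16, 32, 32)
```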
3.3. Attention U-Net. Attention can be viewed as a way of allocating computational resources so as to interpret a signal informatively. Since its introduction, the attention mechanism has become more and more popular in deep learning. This section summarizes a method of applying the attention mechanism to the U-Net network. Given that lesions are small and vary widely in shape, an attention module is generally added in image segmentation either before the encoder and decoder features are stitched together or at the bottleneck of U-Net, to reduce false-positive predictions.

The Attention U-Net put forward by Oktay et al. [8] in 2018 adds an integrated attention gate (AG) before U-Net splices the corresponding encoder and decoder features, readjusting the output features of the encoder. This module generates a gating signal that eliminates the response of irrelevant and noisy ambiguity in the skip connection and emphasizes the salient features transmitted through it. Figure 3 displays the structure of the attention model.

A model trained with AG stresses the salient features useful for a specific task while indirectly learning to suppress irrelevant areas of the input image. Explicit exterior tissue/organ localization modules, as used in cascaded CNNs, therefore become unnecessary. Because AG is compatible with standard CNN architectures such as U-Net, it improves the prediction precision and sensitivity of the model without extra computational cost. To evaluate the Attention U-Net structure, two large abdominal CT data sets were used for multiclass image segmentation. The results show that AG significantly enhances U-Net's prediction performance across different data sets and training scales while maintaining computational efficiency.

The structure of Attention U-Net, shown in Figure 3, is a U-Net-based structure with two stages: encoding and decoding. The coarse-grained maps of the left (encoding) part capture contextual information and highlight the category and position of foreground objects. Subsequently, feature maps extracted at numerous scales are fused via skip connections to merge coarse-grained and fine-grained dense predictions. In the method put forward in the paper, an AG is added to each skip-connection layer to propagate the attention coefficients. An AG has two inputs: x, taken from the feature map of the shallower network on the left, and g, taken from the feature map of the deeper network; the AG's output is then fused with the upsampled feature map on the right.

This method makes external object localization models unnecessary. It is a convenient tool not only for natural image analysis and machine translation but also for image classification and regression. Studies showed that the algorithm is very useful for identifying and localizing tissues/organs and that good accuracy can be achieved with modest computing resources, especially for small organs such as the pancreas [32].
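A simplified additive attention gate in the spirit of [8] can be sketched as below: the skip feature x and the gating signal g are projected with 1 × 1 convolutions, added, passed through ReLU and a sigmoid, and the resulting per-pixel coefficients rescale x. For brevity, g is assumed to be already resampled to x's spatial size (in the paper the gating signal comes from a coarser scale); all names and channel sizes are illustrative.

```python
# Simplified additive attention gate sketch; sizes are assumptions.
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    def __init__(self, x_ch, g_ch, inter_ch):
        super().__init__()
        self.theta_x = nn.Conv2d(x_ch, inter_ch, kernel_size=1)  # project skip feature
        self.phi_g = nn.Conv2d(g_ch, inter_ch, kernel_size=1)    # project gating signal
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)         # 1-channel attention map
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x, g):
        # Additive attention: one coefficient in [0, 1] per spatial position.
        alpha = self.sigmoid(self.psi(self.relu(self.theta_x(x) + self.phi_g(g))))
        return x * alpha  # suppress irrelevant regions of the skip feature

gate = AttentionGate(x_ch=64, g_ch=128, inter_ch=32)
out = gate(torch.randn(1, 64, 56, 56), torch.randn(1, 128, 56, 56))  # -> (1, 64, 56, 56)
```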

3.4. CE-Net. Fusing features of different scales is a crucial approach to optimizing segmentation performance. Having passed through fewer convolutions, low-level features carry weaker semantics and more noise despite their higher resolution and richer positional information; high-level features contain denser semantic information but considerably lower resolution and poorer detail perception. Efficiently combining the advantages of the two is of great significance for improving the segmentation model.

[Figure 3: Attention U-Net schematic — an encoder path of (Conv 3 × 3 × 3 + ReLU) × 2 blocks with max pooling (by 2), features from F1 × H1 × W1 × D1 down to F4 × H4 × W4 × D4; a decoder path with upsampling (by 2) and concatenation; attention gates on the skip connections driven by gating signals (queries); final segmentation map of size Nc × H1 × W1 × D1.]

Figure 3: The U-Net model structure with the proposed AGs added. The input image is progressively filtered and downsampled at each scale in the network's encoding part (for example, H4 = H1/8); Nc indicates the number of classes. The attention gates (AGs) filter the features propagated through the skip connections, selecting them with contextual (gating) information extracted at a coarser scale [8].

Feature fusion includes the fusion of the network's contextual features and, in a broader sense, the fusion of features from different modalities. Gu et al. [10] designed a new network called CE-Net, which adopts the new dense atrous convolution (DAC) and residual multikernel pooling (RMP) modules to provide fused information, such as fused contextual features from the encoder, and thus obtain higher-level information with less feature loss [33] — for example, retaining spatial information for 2D segmentation and classification in medical imaging [34].

The overall framework of CE-Net is shown in Figure 4. The DAC block can identify broader and deeper semantic features by injecting four cascaded branches of multiscale dense atrous convolution; residual connections are used to prevent vanishing gradients. In addition, the RMP block is a residual multikernel pooling module based on spatial pyramid pooling, which encodes the multiscale context features extracted by the DAC module using pooling operations of various sizes, without extra learnable weights. In summary, the DAC block extracts rich feature representations through multiscale dense atrous convolution, and the RMP block then extracts further context information through multiscale pooling. The joint use of the newly proposed DAC and RMP blocks with the backbone encoder-decoder structure is unprecedented in CE-Net's context encoder network and enhances segmentation by collecting more abstract features while maintaining more spatial information.

3.4.1. Feature Encoder Module. In the U-Net structure, each encoder block includes two convolutional layers and a maximum pooling layer. In the CE-Net structure, a pretrained ResNet-34 is instead used in the feature encoding module: the first four feature extraction blocks are retained, without the average pooling and fully connected layers. Res-Net adds a shortcut mechanism that avoids vanishing gradients and improves convergence efficiency, as shown in Figure 4(b). Using a pretrained Res-Net in this way is a basic method for improving U-Net segmentation performance.

3.4.2. Context Extraction Module. The context extraction module, composed of the DAC and RMP blocks, extracts contextual semantic information and produces more advanced feature maps.

(1) Atrous (Hole) Convolution. For semantic segmentation and object detection, deep convolutional layers have displayed superiority in extracting image feature representations, but the pooling layers can cause a loss of semantic information. This is addressed by applying dense atrous (hole) convolution [35] to dense image segmentation. Atrous convolution has a dilation rate parameter that spreads out the kernel while the kernel size stays the same as in ordinary convolution. The number of parameters in the network therefore remains unchanged, but the atrous convolution has a larger receptive field, that is, the region of the image covered by the convolution kernel. The size of the receptive field depends on the stride, the number of convolutional layers, and the padding parameters.
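The property just described — same kernel, same parameter count, larger receptive field — can be checked directly; for a single k × k convolution with dilation d, the effective kernel extent is k + (k − 1)(d − 1). The snippet below is a small illustrative check, not part of CE-Net's code.

```python
# Dilated (atrous) convolution: same parameters, larger receptive field.
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)
for d in (1, 2, 4):
    conv = nn.Conv2d(16, 16, kernel_size=3, dilation=d, padding=d)  # padding=d keeps size
    n_params = sum(p.numel() for p in conv.parameters())
    rf = 3 + 2 * (d - 1)  # effective extent of one 3x3 kernel with dilation d
    print(d, tuple(conv(x).shape), n_params, rf)
    # -> output shape and parameter count stay the same for every d; rf grows: 3, 5, 9
```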

[Figure 4: (a) CE-Net pipeline — a pretrained feature encoder (ResNet-34 residual blocks, with an initial 7 × 7 stride-2 convolution and max pooling) downsampling from 448² × 3 to 14² × 512; a context extractor with DAC and RMP blocks producing 14² × 516 features; and a feature decoder with skip connections upsampling back to full resolution. (b) A residual block (two 3 × 3 convolutions with a shortcut). (c) A decoder block with 1 × 1 convolution, 3 × 3 deconvolution (stride 2), and 1 × 1 convolution.]
Figure 4: CE-Net network structure diagram. (a) The original U-Net encoder block is replaced by ResNet-34 residual blocks, shown in (b), pretrained on ImageNet. The bottleneck module contains a dense atrous convolution (DAC) block and an RMP block. Finally, the features are extracted and gathered in the decoder module, where a decoder block (c), consisting of 1 × 1 convolution and 3 × 3 deconvolution operations, enlarges the feature size and replaces the plain upsampling operation [11].

(2) DAC. Inspired by Inception [36, 37], Res-Net [38], and atrous convolution, the dense atrous convolution (DAC) block [11] is used to encode high-level semantic feature maps. The DAC has four cascaded branches whose receptive fields are 3, 7, 9, and 19, respectively, as the number of atrous convolutions gradually increases. Like the Inception structure, DAC uses different receptive fields. In each atrous branch, a 1 × 1 convolution is applied followed by a ReLU, and Res-Net-style shortcut links directly add the original features. In general, a convolution with a large receptive field extracts and produces more abstract features for large targets, and vice versa. By combining atrous convolutions with different dilation rates, the DAC block can extract features from targets of various sizes.

(3) RMP. One of the challenges in medical image segmentation is the large variation in target size [39, 40]; for instance, an advanced tumor is usually much bigger than an early one [41]. The RMP [11] was proposed to solve this problem: targets of various sizes can be detected by applying several effective fields of view. The proposed RMP uses four receptive fields of different sizes to encode global context information. To reduce the dimensionality of the weights and the computational cost, a 1 × 1 convolution is used after each pooling branch. Afterwards, the low-dimensional feature maps are upsampled via bilinear interpolation to the same size as the original feature map, allowing features of various scales to be extracted.
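A sketch of such a residual multikernel pooling block, following the description above, is given below. The specific pooling kernel sizes are illustrative assumptions; each branch is compressed to one channel by a 1 × 1 convolution and restored to the input resolution by bilinear interpolation before concatenation with the input (for a 512-channel input and four branches this yields the 516 channels visible in Figure 4).

```python
# RMP-style block sketch: multiscale pooling + 1x1 reduction + upsample + concat.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMPBlock(nn.Module):
    def __init__(self, channels, pool_sizes=(2, 3, 5, 6)):  # pool sizes assumed
        super().__init__()
        self.pools = nn.ModuleList([nn.MaxPool2d(s, stride=s) for s in pool_sizes])
        self.reduces = nn.ModuleList([nn.Conv2d(channels, 1, kernel_size=1)
                                      for _ in pool_sizes])

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [x]  # residual path keeps the original features
        for pool, reduce in zip(self.pools, self.reduces):
            y = reduce(pool(x))  # context at one field of view, compressed to 1 channel
            feats.append(F.interpolate(y, size=(h, w), mode="bilinear",
                                       align_corners=False))
        return torch.cat(feats, dim=1)  # channels + one extra channel per branch

out = RMPBlock(512)(torch.randn(1, 512, 14, 14))  # -> (1, 516, 14, 14)
```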

3.4.3. Feature Decoder Module. The feature decoder module recovers the high-level semantic features extracted by the context extractor and feature encoder modules. Continuous pooling and convolution operations often lose information, which can be remedied by shortcut connections from the encoder to the decoder. In U-shaped networks, the two basic decoder operations are simple upsampling and deconvolution. Upsampling enlarges the image through linear interpolation, whereas deconvolution (also known as transposed convolution) uses convolution to expand the image, with adaptive mappings that recover more comprehensive information. Transposed convolution is therefore implemented to achieve higher resolution in the decoder. Based on the shortcut connections and the decoder blocks, the feature decoder module produces a mask of the same size as the original input.

Unlike U-Net, CE-Net applies pretrained Res-Net blocks in the feature encoder. Integrating the DAC module, the RMP module, and Res-Net into the U-Net architecture allows it to retain more spatial information. It was suggested that this approach can optimize medical image segmentation for various tasks: optic disc segmentation [42], retinal blood vessel detection [11], lung segmentation [11], cell contour segmentation [35], and retinal OCT layer segmentation [43]. The approach could be used extensively in other 2D medical image segmentation tasks.

3.5. UNet++. Variants of encoder-decoder architectures such as U-Net and FCN are among the most advanced image segmentation models [44]. These segmentation networks share a common feature: skip connections that link the deep, semantic, coarse-grained feature maps from the decoder subnetwork with the shallow, low-level, fine-grained feature maps from the encoder subnetwork. Segmenting lesions or abnormalities in medical images demands more pinpoint precision than segmenting regular images, since edge segmentation faults in medical imaging may cause serious consequences in the clinic. A variety of methods to improve feature fusion have therefore been proposed. In particular, Zhou et al. [13, 45] improved the skip connection and proposed UNet++, with deeply supervised, nested, dense skip-connection paths.

In U-Net, the decoder directly receives the feature map of the encoder. UNet++, by contrast, uses dense convolutional blocks, and the number of convolutional layers depends on the depth of the U-shaped structure. In essence, the dense convolution blocks bridge the semantic gap between the encoder and decoder feature maps; the assumption is that when the received encoder feature map and the corresponding decoder feature map are semantically similar, the optimizer can more easily solve its optimization problem. The effective integration of U-Nets of different depths also alleviates the problem of unknown network depth: these U-Nets can partially share an encoder while learning simultaneously through deep supervision, which allows the model to be pruned and improved. This redesigned skip connection aggregates semantic features of different scales on the decoder subnetwork, thereby automatically generating a highly flexible feature fusion scheme.
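One nested node of such a dense skip path can be sketched as follows (a simplified illustration, not the authors' implementation): the feature map at node (i, j) is computed from the concatenation of all earlier feature maps at the same resolution level plus the upsampled feature map from the level below. Node names and channel counts are assumptions.

```python
# One UNet++-style nested node: dense same-level inputs + upsampled deeper input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NestedNode(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, same_level_feats, below_feat):
        up = F.interpolate(below_feat, scale_factor=2, mode="bilinear",
                           align_corners=False)  # bring deeper feature to this level
        return self.conv(torch.cat(same_level_feats + [up], dim=1))

# Node X(0,2): inputs X(0,0) and X(0,1) at 64 channels each, plus X(1,1)
# at 128 channels from one level deeper (all sizes illustrative).
node = NestedNode(in_ch=64 + 64 + 128, out_ch=64)
x00, x01 = torch.randn(1, 64, 64, 64), torch.randn(1, 64, 64, 64)
x11 = torch.randn(1, 128, 32, 32)
x02 = node([x00, x01], x11)  # -> (1, 64, 64, 64)
```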
3.6. UNet 3+. UNet++, an improvement on U-Net, was designed around a structure with nested and dense skip connections, but it does not express enough information across multiple scales, and its network parameters are numerous and complex. UNet 3+ (UNet+++) is an innovative network structure proposed by Huang et al. [46] that uses full-scale skip connections and deep supervision. The full-scale skip connections combine high-level and low-level semantics from feature maps at various scales, while deep supervision learns hierarchical representations from the feature maps aggregated at multiple scales. The method uses a newly proposed hybrid loss function to refine the results and is particularly suitable for organs of different sizes. It not only improves accuracy and computational efficiency but also reduces the network parameters by using fewer channels than U-Net and UNet++. The network structure of UNet 3+ is shown in Figure 5.

To learn hierarchical representations from the full-scale aggregated feature maps, UNet 3+ further adopts full-scale deep supervision. Unlike UNet++, each decoder stage in UNet 3+ has a side output that is supervised by the standard ground truth. To achieve this deep supervision, the last layer of each decoder stage is fed into a plain 3 × 3 convolutional layer, followed by bilinear upsampling and a sigmoid function to enlarge it to full resolution.

To further strengthen the organ boundaries, a multiscale structural similarity index (MS-SSIM) loss function is proposed to give more weight to fuzzy boundaries. With it, UNet 3+ focuses on fuzzy boundaries: the greater the difference in regional distribution, the larger the MS-SSIM value becomes [47].

False positives are inevitable in the segmentation of most nonorgan images: background noise most likely stays at a shallower level, causing oversegmentation. UNet 3+ solves this problem by adding a classification-guidance module (CGM) designed to predict whether the input image contains organs, enabling more accurate segmentation. Using the richest semantic information, the classification result further guides each segmentation side output in two steps. With the help of the argmax function, the two-dimensional tensor is converted into a single {0, 1} output representing the presence or absence of organs; this single classification output is then multiplied with the side segmentation output. Given the simplicity of the binary classification task, the module easily achieves accurate classification by optimizing a binary cross-entropy loss function [48], thereby correcting the oversegmentation of nonorgan images.

In summary, UNet 3+ maximizes the use of full-scale feature maps and achieves precise segmentation with an efficient network structure, fewer parameters, and deep supervision. It has been extensively validated, for example, on representative but demanding volumetric segmentation tasks in medical imaging: (i) liver segmentation from 3D CT scans and (ii) segmentation of the whole heart and great vessels from 3D MR images [49]. The CGM and the hybrid loss function are further applied to obtain higher accuracy in location-aware and boundary-aware segmented images.
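A single full-scale deep-supervision side output, as described above, can be sketched like this (channels, scale, and names are illustrative assumptions): a plain 3 × 3 convolution, bilinear upsampling to full resolution, and a sigmoid, with each such output supervised by the ground truth.

```python
# One UNet 3+-style deep-supervision side output head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SideOutput(nn.Module):
    def __init__(self, in_ch, n_classes, scale):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, n_classes, kernel_size=3, padding=1)  # plain 3x3
        self.scale = scale

    def forward(self, feat):
        y = self.conv(feat)
        y = F.interpolate(y, scale_factor=self.scale, mode="bilinear",
                          align_corners=False)  # enlarge to full resolution
        return torch.sigmoid(y)                 # per-pixel probabilities

# Decoder stage at 1/4 resolution (sizes illustrative):
side = SideOutput(in_ch=128, n_classes=1, scale=4)
pred = side(torch.randn(1, 128, 56, 56))  # -> (1, 1, 224, 224)
```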

[Figure 5: three side-by-side schematics — (a) UNet with plain skip connections, (b) UNet++ with nested and dense skip connections, and (c) UNet 3+ with full-scale skip connections — with a legend for down-sampling, up-sampling, conventional skip connections, full-scale inter/intra skip connections, and supervision by ground truth.]

Figure 5: A graphic overview of UNet, UNet++, and UNet 3+. By optimizing the skip connections and using full-scale deep supervision, UNet 3+ integrates multiscale features and produces more accurate, location-aware segmentation maps with clarified boundaries, despite having fewer parameters [14, 46].

3.7. nnU-Net. Since U-Net was first proposed, it has been adapted for different tasks with different network structures, preprocessing, training, and inference. These design choices depend on each other and strongly affect the final result. Isensee et al. [15, 50] proposed nnU-Net ("no-new-Net"), a robust, self-adaptive framework based on 2D and 3D U-Net. It involves a set of three relatively simple U-Net models; only slight modifications are made to the original U-Net, and no extension plug-ins — residual connections, dense connections, or attention mechanisms — are used. nnU-Net gives unexpectedly accurate results in applications such as brain tumor segmentation [51]. Since medical images are often three-dimensional, nnU-Net is designed around a basic U-Net architecture pool composed of a 2D U-Net, a 3D U-Net, and a U-Net cascade. The 2D and 3D U-Nets generate full-resolution results; in the cascade, the first stage produces a low-resolution result and the second stage refines it.

Now that 3D U-Net is widely used, why is 2D still useful? The authors show that when the data are anisotropic, traditional 3D segmentation methods perform poorly. A 3D network also occupies a great deal of GPU memory. Smaller image patches can be used for training, but for images of larger organs, such as the liver, this patch-based method hinders training: the receptive field is too limited for the network to collect enough contextual information to recognize the target objects. A cascade model is used to overcome these shortcomings of 3D U-Net on data sets with large image sizes. First, a first-level 3D U-Net is trained on downsampled images, and its results are upsampled back to the original voxel spacing. The upsampled result is fed to the second-level 3D U-Net as an additional (one-hot encoded) input channel, which is trained on full-resolution images with the patch-based strategy.

The structure of nnU-Net sets aside most of the new network structures of recent years. The view is that network architectures are already advanced enough: the more complex the network, the greater the risk of overfitting, so more attention should be paid to other factors such as preprocessing, training, inference strategies, and postprocessing.

3.8. U2-Net. Salient object detection (SOD) [52] is designed to segment the most visually attractive objects in an image. It is widely applied to eye-tracking data [53], image segmentation, and other fields. Recent years have seen progress in deep CNNs, especially the emergence of FCNs in image segmentation, which substantially enhanced the performance of salient object detection. Most SOD network designs share a common pattern: they focus on applying deep features extracted by existing backbone networks, e.g., AlexNet [54, 55], VGG [56], Res-Net [57], ResNeXt [39, 58], and DenseNet [59]. But these backbones were proposed for image classification, so they extract features that represent semantics rather than the local details and global contrast information that are crucial for saliency detection. They must also be pretrained on data-inefficient ImageNet data, which is a drawback especially when the target data follow a distribution different from ImageNet's.

U2-Net [16, 60] is an uncomplicated yet powerful deep network for salient object detection. It does not use a backbone pretrained for image classification and can be trained from scratch. It captures more contextual information because it uses the ReSidual U-block (RSU) structure [60, 61], which combines receptive fields of different scales. Meanwhile, pooling operations inside the RSU blocks increase the depth of the whole architecture without significantly increasing the computational cost.

RSU structure: for SOD and other segmentation tasks, both local and global context information are of great significance. In modern CNN designs such as VGG, Res-Net, and DenseNet, small 1 × 1 or 3 × 3 convolution filters are the most commonly used feature extraction components. Despite their high computational efficiency and small storage size, such filters have receptive fields too small to capture global information, so the shallow output feature maps contain only local features. To obtain richer global information from shallow, high-resolution feature maps [62, 63], the most direct method is to enlarge the receptive field. There have been attempts to do so by using atrous convolution to extract local and nonlocal features, but performing multiple dilated convolutions on the input feature map at the original resolution (especially in the initial stages) requires large amounts of computation and memory. Inspired by U-Net, a new RSU was therefore proposed to obtain multiscale features within a stage. The RSU is mainly composed of three parts, as follows.

(1) An input convolutional layer converts the input feature map x (H × W × Cin) into an intermediate map F1(x) with Cout channels, extracting local features.

(2) A U-Net-like sub-network takes the intermediate feature map F1(x) as input and learns to extract and encode multiscale context information U(F1(x)), where U refers to the U-Net structure. The greater the depth L, the deeper the RSU, the more pooling operations, the larger the receptive field, and the richer the local and global features.

(3) The local and multiscale features are merged through the summation F1(x) + U(F1(x)).

Hence the residual U-block (RSU) was proposed, together with a scheme for stacking and connecting these blocks. This yields a method completely different from previous cascaded stacking: Un-Net, where the exponential notation denotes a nested U-shaped structure rather than a cascaded stack. In theory, the index n can be set to any positive integer to realize single-level or multilevel nested U-shaped structures; for practical applications, n is set to 2, forming the two-level U2-Net. Its top level is a large U-shaped structure of 11 stages, each filled with a well-configured RSU. The nested U structure can therefore extract multiscale features within each stage and aggregate multilevel features across stages with high efficiency. Unlike SOD models built on existing backbones, U2-Net is constructed from the proposed RSU blocks, which allows training from scratch and lets different model sizes be configured according to the constraints of the target environment.
models which are built on present backbones, U2-Net is summarized the corresponding advantages of each by
constructed on the proposed RSU block that allows training comparing the parameters. The segmentation evaluation
from scratch and different model sizes to be configured parameters play a crucial part in the evaluation of image
according to the constraints of the target environment. segmentation performance. This section mainly lists several
commonly used evaluation parameters in image segmen-
tation neural networks and illustrates the characteristics of
3.9. TransUNet. Due to the inherent locality of convolution each network in various experiments.
operations, U-Net is usually limited in explicitly modeling True positive (TP), true negative (TN), false positive
remote dependencies. Recently, the transformer designed (FP), and false negative (FN) are mainly used to count two
for sequence-to-sequence prediction has emerged as an types of classification problems. There is no doubt that
alternative architecture with a global self-attention mecha- multiple categories could also be counted separately. The
nism. However, its positioning capabilities are limited by its samples are divided into positive and negative samples.
insufficient underlying details. TransUNet with the advan-
tages of transformer [64] and U-Net was proposed by Chen
et al. [17] as a powerful alternative to medical image seg- 4.2. Performance Comparison. The related methods pro-
mentation. This is because the transformer treats the input as posed in this paper use almost different data sets including
a one-dimensional sequence and only focuses on modeling retinal blood vessels, liver, kidney, gastric cancer, and cell
the global context of all stages, which results in low-reso- sections. The data sets used by various methods are not the
lution features and a lack of detailed positioning informa- same; hence, it is difficult to compare different methods
tion. Direct upsampling to full resolution cannot effectively horizontally. This paper listed the data sets to provide an
recover this information, which results in rough segmen- index of data set names. The performance comparison is
tation results. In addition, the U-Net architecture provides a listed in Table 1.
way to achieve precise positioning by extracting low-level
features and linking them to high-resolution CNN feature 4.3. Future Development. Medical image segmentation is a
maps, which could adequately complement for fine spatial popular and developing research field. As an implementa-
details. An overview of the framework is shown in Figure 6. tion standard of medical segmentation, the U-Net network
The transformer could be used as a powerful encoder for structure has been in use and improved for many years.
medical image segmentation and combined with U-Net to Although the work and improvements of U-Net in recent
enhance finer details and restore local spatial information. years have begun to solve the challenges presented in Section
TransUNet has achieved excellent performance in multi- 2, there are still some unsolved problems. In this part, some
organ segmentation and heart segmentation. In the design of promising research discussing those problems will be out-
TransUNet, the issue is how to encode the feature repre- lined (accuracy issues, interpretability, and network training
sentation directly from the decomposed image patch using issues) and other challenges that may still exist will be
the transformer. introduced.
In order to complete the purpose of segmentation, that
is, to classify the image at the pixel level, the most direct
method is to upsample the encoded feature map to predict 4.3.1. Higher Generalization Ability. The model is not only
the full resolution of the dense output. To restore the spatial required to have a good fit (training error) to the training
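Written in terms of these counts, the measures reported in Table 1 (accuracy, Dice coefficient, and IoU/Jaccard) for a binary mask are straightforward; the following small snippet is an illustrative computation, not tied to any of the reviewed implementations.

```python
# Evaluation parameters from TP/TN/FP/FN counts on binary masks.
import torch

def confusion_counts(pred, target):
    pred, target = pred.bool(), target.bool()
    tp = (pred & target).sum().item()    # predicted positive, actually positive
    tn = (~pred & ~target).sum().item()  # predicted negative, actually negative
    fp = (pred & ~target).sum().item()   # predicted positive, actually negative
    fn = (~pred & target).sum().item()   # predicted negative, actually positive
    return tp, tn, fp, fn

def metrics(pred, target, eps=1e-7):
    tp, tn, fp, fn = confusion_counts(pred, target)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn + eps),
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
        "iou": tp / (tp + fp + fn + eps),
    }

print(metrics(torch.randint(0, 2, (1, 224, 224)),
              torch.randint(0, 2, (1, 224, 224))))
```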
4.2. Performance Comparison. The methods reviewed in this paper use largely different data sets, including retinal blood vessels, liver, kidney, gastric cancer, and cell sections. Because the data sets differ across methods, it is difficult to compare the methods horizontally; this paper lists the data sets to provide an index of data set names. The performance comparison is given in Table 1.

4.3. Future Development. Medical image segmentation is a popular and developing research field. As an implementation standard for medical segmentation, the U-Net network structure has been in use and under improvement for many years. Although recent work on U-Net has begun to address the challenges presented in Section 2, some problems remain unsolved. This part outlines some promising research on those problems (accuracy, interpretability, and network training issues) and introduces other challenges that may still exist.

[Figure 6: (a) a transformer layer operating on the embedded patch sequence x1p, ..., xNp, with layer norm, multihead self-attention (MSA), and MLP blocks; (b) the full TransUNet pipeline — a CNN producing 1/2-, 1/4-, and 1/8-scale hidden features (16 to 512 channels), a linear projection feeding n = 12 transformer layers, a reshape of the (n_patch, D) hidden feature to (D, H/16, W/16), and upsampling with feature concatenation, 3 × 3 conv + ReLU blocks, and a segmentation head.]
Figure 6: Overview of TransUNet's framework: (a) the structure of the transformer layer and (b) the entire TransUNet structure. After the network's U-Net encoding stage, a stack of 12 transformer layers processes the corresponding image-patch sequence; the channel count and spatial dimensions of the image are then brought back to the standard size by reshaping and resizing [17].

Table 1: Performance contrast of the networks listed in this article. Different methods use different data sets for evaluation, which makes it hard to compare the approaches horizontally.

| U-Net type | Medical image data set | Evaluation parameter | Value |
| --- | --- | --- | --- |
| U-Net [1] | DRIVE [1] | Accuracy | 0.955 ± 0.003 [1] |
| | Amazon data set | IoU | 0.9530 [64] |
| 3D U-Net [29] | Xenopus kidney embryos | IoU | 0.732 [29] |
| Attention U-Net [7] | Gastric cancer [7] | Dice coefficient | 0.767 ± 0.132 [7] |
| | Amazon data set [64] | IoU | 0.9581 [64] |
| CE-Net [10] | DRIVE [10] | Accuracy | 0.975 ± 0.003 [10] |
| | Lung segmentation CT | IoU | 0.9495 [65] |
| U-Net++ [12] | Cell nuclei [12] | Jaccard/IoU | 0.9263 [12] |
| | Lung segmentation CT [65] | IoU | 0.9521 [65] |
| UNET 3+ [13] | ISBI LiTS 2017 | Dice coefficient | 0.9552 |
| nnU-Net [14] | BRATS challenge | Dice coefficient | 0.8987 ± 0.157 |
| U2-Net [15] | Vienna reading [15] | Dice coefficient | 0.8943 ± 0.04 [15] |
| | CVC-ClinicDB | IoU | 0.8611 [66] |
| TransUNet [16] | MICCAI 2015 | Dice coefficient | 0.7748 |
| | CVC-ClinicDB | IoU | 0.89 [66] |

4.3.1. Higher Generalization Ability. A model is required not only to fit the training data set well (low training error) but also to fit unknown data (the prediction set) well, that is, to generalize. For tasks like medical image segmentation, small-sample data are especially prone to overfitting or underfitting. Frequently used methods such as early stopping, regularization, feedback, input fuzzification, and dropout have improved the generalization of neural networks to varying degrees. In general, however, the essence of a neural network is instance learning: the network builds its understanding of most instances from limited samples. Recently, it has been suggested to seek innovation and abandon the long-used input-vector fuzzification processing method.

4.3.2. Improved Interpretability. Regarding interpretability, or Explainable Artificial Intelligence (XAI), what always concerns researchers engaged in machine learning is that many current deep neural networks cannot be fully understood as decision-making models from a human perspective: we do not know when an error will occur in medical images or what causes it. Medical images reflect people's health, so interpretability is crucial. Currently, sensitivity analysis or gradient-based analysis methods are often used for interpretability analysis, and there are many attempts to implement interpretability after training, such as surrogate models, knowledge distillation, and hidden-layer visualization.

Figure 7: U-Net-based extension structure summary diagram. The diagram maps each line of improvement to its representative network:
- Extend U-Net to 3D images → 3D U-Net
- Add new modules: an attention module at the skip connections → Attention U-Net; a transformer layer in the encoder → TransUNet; DAC and RMP modules between encoder and decoder → CE-Net
- Improve the fusion of contextual features (improvements at the skip connections): concatenation of a series of nested, dense convolutional blocks → UNet++; full-scale skip connections and deep supervision → UNet 3+
- Improve the encoder structure: RSU blocks instead of the general encoding structure → U2-Net
- Change how the network is used without changing the structure: a pool of 2D- and 3D-based networks → nnU-Net
- Combine the advantages of transformer and U-Net: a transformer structure added to the encoder → TransUNet

Table 2: Summary of the changes in network structures and adjusted parameters. For a K × K (× K) convolution kernel with Ci input channels and Co output channels, the number of parameters is K × K (× K) × Ci × Co; totals are given below for a few U-Net variants.

| Model structure | Dimension | Improved structure | Highlights | #Params | Kernel size |
| --- | --- | --- | --- | --- | --- |
| U-Net | 2D | Fully connected layer (relative to CNN) | Fully connected layer changed to upsampling layer | 30M [67] | 3 × 3; 2 × 2; 1 × 1 |
| 3D U-Net | 3D | Encoder, decoder | 2D convolution operations replaced with 3D | 19M [68] | 1 × 1 × 1; 2 × 2 × 2; 3 × 3 × 3 |
| Attention U-Net | 2D | Skip connection | Attention module added to the skip connection | 123M [65] | 1 × 1 |
| CE-Net | 2D | Bottleneck between encoder and decoder | DAC and RMP structure | 110M [65] | 3 × 3; 1 × 1 |
| UNET++ | 2D | Skip connection | Dense blocks and in-depth supervision | 35M [65] | 3 × 3; 1 × 1 |
| UNET 3+ | 2D | Skip connection | Full-scale skip connection and deep supervision | 26.97M [69] | 3 × 3; 3 × 3 × 3 |
| nnU-Net | 2D/3D | Network organization | Multiple ordinary U-Nets form a network pool | — | 4 × 4 × 4 |
| U2-Net | 2D | Encoder and decoder | RSU as the decoding and encoding unit | 176M [70] | 3 × 3 |
| Trans-U-Net | 2D | Encoder | Transformer module added after the encoder | 2.93M [66, 71] | 1 × 1 |

4.3.3. Resolution and Processing of Data Imbalance. Data imbalance often occurs in medical image segmentation because of inconsistencies across machine models, but in fact many common imbalance problems can be avoided. Nowadays, the common remedies include expanding the data, using different evaluation indicators, resampling the data set, trying artificial data samples, and using different algorithms. A recent ICML paper suggested that increasing the amount of data can increase the training-set error under a known distribution and perturb the original training set's allocation, thereby improving the classifier's performance; that paper implicitly used mathematical methods to augment the data without changing the size of the data set. We believe, however, that perturbing the original distribution is beneficial for dealing with imbalances.

4.3.4. A New Exploration of Transformer and Attention Mechanism. This paper has introduced attention and transformer methods that afford an innovative combination of these two mechanisms with U-Net. So far, some research has explored the feasibility of using the transformer structure, which relies solely on the self-attention mechanism, as an encoder for medical image segmentation without any pretraining. In the future, with continuous breakthroughs in attention and transformer methods, more novel models will be proposed to solve different problems in medical segmentation.
Journal of Healthcare Engineering 13

encoder for medical image segmentation without any pre- Acknowledgments


training. In the future, more novel models will be proposed to
solve different problems in medical segmentation with con- This work was funded by Science and Technology Projects in
tinuous breakthroughs in attention and transformer methods. Guangzhou, China (grant no. 202102010472). This work is
funded by National Natural Science Foundation of China
(NSFC) (grant no. 62176071).
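As a sketch of this idea (a minimal example under our own assumptions, omitting positional embeddings and the decoder; it is not the TransUNet implementation), a self-attention-only encoder can be built by embedding image patches as tokens and stacking standard transformer layers:

```python
import torch
import torch.nn as nn

class PatchTransformerEncoder(nn.Module):
    """ViT-style encoder: 16x16 patches -> tokens -> self-attention layers.
    Positional embeddings are omitted here for brevity; real models add them."""
    def __init__(self, cin=1, dim=256, patch=16, depth=4, heads=8):
        super().__init__()
        self.embed = nn.Conv2d(cin, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        tok = self.embed(x)                       # (B, dim, H/16, W/16)
        b, d, h, w = tok.shape
        tok = tok.flatten(2).transpose(1, 2)      # (B, N, dim) token sequence
        tok = self.encoder(tok)                   # global self-attention
        return tok.transpose(1, 2).reshape(b, d, h, w)  # back to a feature map

feats = PatchTransformerEncoder()(torch.randn(1, 1, 224, 224))  # (1, 256, 14, 14)
```

The resulting feature map can then be upsampled by a U-Net-style decoder with skip connections, which is the overall arrangement TransUNet-style models adopt.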
4.3.5. Multimodal Learning and Application.
Single-modal representation learning expresses information as numerical vectors that can be processed by a computer or further abstracted into higher-level feature vectors, while multimodal representation learning aims to eliminate inter-modality gaps by taking advantage of the complementarity between multiple modalities. In medical imaging, multimodal data acquired with different imaging mechanisms can provide information at multiple levels. Multimodal image segmentation fuses the information carried by the different modalities for joint representation and collaborative learning. Research on multimodal learning has become more popular in recent years, and its application to medical images will grow more sophisticated in the future.
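One simple realization of such fusion, shown here purely as an illustrative sketch (the modality names and tensor shapes are assumptions), is early fusion: co-registered modalities are stacked as input channels so that the first convolution can learn cross-modality features:

```python
import torch
import torch.nn as nn

# Hypothetical co-registered slices of one patient: T1, T1c, T2, FLAIR.
t1, t1c, t2, flair = (torch.randn(1, 1, 128, 128) for _ in range(4))

# Early fusion: concatenate modalities along the channel axis, then let the
# first convolution learn cross-modality features.
x = torch.cat([t1, t1c, t2, flair], dim=1)        # (1, 4, 128, 128)
stem = nn.Conv2d(in_channels=4, out_channels=64, kernel_size=3, padding=1)
features = stem(x)                                 # shared multimodal features

print(features.shape)                              # torch.Size([1, 64, 128, 128])
```

Later-stage (feature-level or decision-level) fusion instead keeps a separate encoder per modality and merges their feature maps deeper in the network.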
5. Discussion and Conclusion

This paper introduces several classic networks with improved U-Net structures to deal with different problems that are encountered in medical image segmentation and reviews them systematically. A summary of the technical context based on the U-Net extended structures introduced above is shown in Figure 7, and Table 2 summarizes the network dimension, improved structure, and structural parameters of each variant, along with its kernel sizes.

U-Net variants can deliver high-precision segmentation of lesions despite the differentiation of organ structures and the diversification of lesion shapes. With the development and improvement of the attention mechanism, dense modules, transformer modules, residual structures, graph cuts, and other components, modules built on U-Net have recently been used to achieve precise segmentation of different lesions. Based on the various U-Net extended structures, this paper classifies and analyzes several classic medical image segmentation methods based on the U-Net structure.

It is concluded that the U-Net-based architecture is indeed ground-breaking and valuable in medical image analysis. However, although U-Net-based deep learning has become a dominant method in a variety of complex tasks such as medical image segmentation and classification, it is not all-powerful. It is essential to be familiar with the key concepts and advantages of U-Net variants, as well as their limitations, in order to leverage them in radiology research with the goal of improving radiologist performance and, eventually, patient care. Despite the many challenges remaining in deep learning-based image analysis, U-Net is expected to be one of the major paths forward [72-80].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was funded by the Science and Technology Projects in Guangzhou, China (grant no. 202102010472), and by the National Natural Science Foundation of China (NSFC) (grant no. 62176071).

References

[1] R. Smith-Bindman, M. L. Kwan, E. C. Marlow et al., "Trends in use of medical imaging in US health care systems and in Ontario, Canada, 2000-2016," JAMA, vol. 322, no. 9, pp. 843-856, 2019.
[2] O. Ronneberger, P. Fischer, and T. Brox, "U-net: convolutional networks for biomedical image segmentation," in Proceedings of the Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015, N. Navab, J. Hornegger, W. Wells, and A. Frangi, Eds., October 2015.
[3] X. X. Yin, B. W.-H. Ng, Q. Yang, A. Pitman, K. Ramamohanarao, and D. Abbott, "Anatomical landmark localization in breast dynamic contrast-enhanced MR imaging," Medical & Biological Engineering & Computing, vol. 50, no. 1, pp. 91-101, 2012.
[4] X.-X. Yin, S. Hadjiloucas, J.-H. Chen, Y. Zhang, J.-L. Wu, and M.-Y. Su, "Correction: tensor based multichannel reconstruction for breast tumours identification from DCE-MRIs," PLoS One, vol. 12, no. 4, p. e0176133, 2017.
[5] P. Radiuk, "Applying 3D U-net architecture to the task of multi-organ segmentation in computed tomography," Applied Computer Systems, vol. 25, no. 1, pp. 43-50, 2020.
[6] Q. Tong, M. Ning, W. Si, X. Liao, and J. Qin, "3D deeply-supervised U-net based whole heart segmentation," in Statistical Atlases and Computational Models of the Heart. ACDC and MMWHS Challenges. STACOM 2017, M. Pop, Ed., vol. 10663, Springer, Cham, Switzerland, 2018.
[7] C. Wang, T. MacGillivray, G. Macnaught, G. Yang, and D. Newby, "A two-stage U-net model for 3D multi-class segmentation on full-resolution cardiac data," in Statistical Atlases and Computational Models of the Heart. Atrial Segmentation and LV Quantification Challenges. STACOM 2018, M. Pop, Ed., vol. 11395, Springer, Cham, Switzerland, 2019.
[8] O. Oktay, J. Schlemper, L. Folgoc et al., "Attention U-Net: learning where to look for the pancreas," in Proceedings of the 1st Conference on Medical Imaging with Deep Learning, Amsterdam, The Netherlands, July 2018.
[9] R. Yamashita, M. Nishio, R. K. G. Do, and K. Togashi, "Convolutional neural networks: an overview and application in radiology," Insights into Imaging, vol. 9, no. 4, pp. 611-629, 2018.
[10] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440, Boston, MA, USA, June 2015.
[11] P. Jaccard, "The distribution of the flora in the alpine zone," New Phytologist, vol. 11, no. 2, pp. 37-50, February 1912.
[12] Z. Gu, J. Cheng, H. Fu et al., "CE-net: context encoder network for 2D medical image segmentation," IEEE Transactions on Medical Imaging, vol. 38, no. 10, pp. 2281-2292, 2019.
[13] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, "UNet++: a nested U-net architecture for medical image segmentation," Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, vol. 11045, pp. 3-11, 2018.

[14] H. Huang, L. Lin, R. Tong et al., "UNet 3+: a full-scale connected UNet for medical image segmentation," in Proceedings of the ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1055-1059, Barcelona, Spain, May 2020.
[15] F. Isensee, P. F. Jaeger, S. A. A. Kohl, J. Petersen, and K. H. Maier-Hein, "nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation," Nature Methods, vol. 18, no. 2, pp. 203-211, 2021.
[16] X. Qin, Z. Zhang, C. Huang, M. Dehghan, O. R. Zaiane, and M. Jagersand, "U2-Net: going deeper with nested U-structure for salient object detection," Pattern Recognition, vol. 106, p. 107404, 2020.
[17] J. Chen, Y. Lu, Q. Yu et al., "TransUNet: transformers make strong encoders for medical image segmentation," 2021, https://arxiv.org/abs/2102.04306.
[18] X. X. Yin, S. Hadjiloucas, and Y. Zhang, Pattern Classification of Medical Images: Computer Aided Diagnosis, Springer-Verlag, Heidelberg, Germany, 2017.
[19] S. Irshad, X. Yin, and Y. Zhang, "A new approach for retinal vessel differentiation using binary particle swarm optimization," Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, vol. 9, no. 5, pp. 510-522, 2021.
[20] X. Yin, S. Irshad, and Y. Zhang, "Classifiers fusion for improved vessel recognition with application in quantification of generalized arteriolar narrowing," Journal of Innovative Optical Health Sciences, vol. 13, no. 1, p. 1950021, 2020.
[21] X. X. Yin, L. Yin, and S. Hadjiloucas, "Pattern classification approaches for breast cancer identification via MRI: state-of-the-art and vision for the future," Applied Sciences, vol. 10, no. 20, p. 7201, 2020.
[22] D. Pandey, X. Yin, H. Wang, and Y. Zhang, "Accurate vessel segmentation using maximum entropy incorporating line detection and phase-preserving denoising," Computer Vision and Image Understanding, vol. 155, pp. 162-172, 2017.
[23] X. X. Yin, S. Hadjiloucas, Y. Zhang et al., "Pattern identification of biomedical images with time series: contrasting THz pulse imaging with DCE-MRIs," Artificial Intelligence in Medicine, vol. 67, pp. 1-23, 2016.
[24] T. J. Sejnowski, "The unreasonable effectiveness of deep learning in artificial intelligence," Proceedings of the National Academy of Sciences, vol. 117, no. 48, pp. 30033-30038, 2020.
[25] P. J. R. Prasad, O. J. Elle, F. Lindseth, F. Albregtsen, and R. P. Kumar, "Modifying U-Net for small data set: a simplified U-Net version for liver parenchyma segmentation," in Proceedings of SPIE 11597, Medical Imaging 2021: Computer-Aided Diagnosis, February 2021.
[26] D. Chen, S. Liu, P. Kingsbury et al., "Deep learning and alternative learning strategies for retrospective real-world clinical data," npj Digital Medicine, vol. 2, no. 1, p. 43, 2019.
[27] M. Reyes, R. Meier, S. Pereira et al., "On the interpretability of artificial intelligence in radiology: challenges and opportunities," Radiology: Artificial Intelligence, vol. 2, no. 3, p. e190043, 2020.
[28] S. Zheng, X. Lin, W. Zhang et al., "MDCC-Net: multiscale double-channel convolution U-Net framework for colorectal tumor segmentation," Computers in Biology and Medicine, vol. 130, p. 104183, 2021.
[29] X. Liu, L. Song, S. Liu, and Y. Zhang, "A review of deep-learning-based medical image segmentation methods," Sustainability, vol. 13, no. 3, p. 1224, 2021.
[30] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, "3D U-net: learning dense volumetric segmentation from sparse annotation," in Proceedings of the Medical Image Computing and Computer-Assisted Intervention - MICCAI 2016, S. Ourselin, L. Joskowicz, M. Sabuncu, G. Unal, and W. Wells, Eds., October 2016.
[31] S. Ioffe and C. Szegedy, "Batch normalization: accelerating deep network training by reducing internal covariate shift," in Proceedings of the 32nd International Conference on Machine Learning (ICML'15), vol. 37, pp. 448-456, Lille, France, July 2015.
[32] J. Schlemper, O. Oktay, M. Schaap et al., "Attention gated networks: learning to leverage salient regions in medical images," Medical Image Analysis, vol. 53, pp. 197-207, 2019.
[33] H. Ma, Y. Zou, and P. X. Liu, "MHSU-Net: a more versatile neural network for medical image segmentation," Computer Methods and Programs in Biomedicine, vol. 208, p. 106230, 2021.
[34] B. Jin, P. Liu, P. Wang, L. Shi, and J. Zhao, "Optic disc segmentation using attention-based U-net and the improved cross-entropy convolutional neural network," Entropy, vol. 22, no. 8, p. 844, 2020.
[35] C. Han, Y. Duan, X. Tao, and J. Lu, "Dense convolutional networks for semantic segmentation," IEEE Access, vol. 7, pp. 43369-43382, 2019.
[36] R. F. Mansour and N. O. Aljehane, "An optimal segmentation with deep learning based inception network model for intracranial hemorrhage diagnosis," Neural Computing and Applications, 2021.
[37] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, February 2017.
[38] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, Las Vegas, NV, USA, July 2016.
[39] M. H. Hesamian, W. Jia, X. He, and P. Kennedy, "Deep learning techniques for medical image segmentation: achievements and challenges," Journal of Digital Imaging, vol. 32, no. 4, pp. 582-596, 2019.
[40] T. Zhou, S. Ruan, and S. Canu, "A review: deep learning for medical image segmentation using multi-modality fusion," Array, vol. 3-4, p. 100004, 2019.
[41] T. J. Anchordoquy, Y. Barenholz, D. Boraschi et al., "Mechanisms and barriers in cancer nanomedicine: addressing challenges, looking for solutions," ACS Nano, vol. 11, no. 1, pp. 12-18, 2017.
[42] J. Jin, H. Zhu, J. Zhang et al., "Multiple U-Net-based automatic segmentations and radiomics feature stability on ultrasound images for patients with ovarian cancer," Frontiers in Oncology, vol. 10, p. 614201, 2021.
[43] Y. Ma, H. Hao, J. Xie et al., "ROSE: a retinal OCT-angiography vessel segmentation data set and new model," IEEE Transactions on Medical Imaging, vol. 40, no. 3, pp. 928-939, 2020.
[44] P. Saiviroonporn, K. Rodbangyang, T. Tongdee et al., "Cardiothoracic ratio measurement using artificial intelligence: observer and method validation studies," BMC Medical Imaging, vol. 21, pp. 1-11, 2021.
[45] Z. Zhou, J. Shin, L. Zhang, S. Gurudu, M. Gotway, and J. Liang, "Fine-tuning convolutional neural networks for biomedical image analysis: actively and incrementally," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7340-7351, Honolulu, HI, USA, July 2017.

[46] H. Huang, L. Lin, R. Tong et al., "UNet 3+: a full-scale connected UNet for medical image segmentation," 2020, https://arxiv.org/abs/2004.08790.
[47] G. Mattyus, W. Luo, and R. Urtasun, "DeepRoadMapper: extracting road topology from aerial images," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 3438-3446, Venice, Italy, October 2017.
[48] P.-T. de Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein, "A tutorial on the cross-entropy method," Annals of Operations Research, vol. 134, no. 1, pp. 19-67, 2005.
[49] Q. Dou, L. Yu, H. Chen et al., "3D deeply supervised network for automated segmentation of volumetric medical images," Medical Image Analysis, vol. 41, pp. 40-54, 2017.
[50] F. Isensee, R. Sparks, and S. Ourselin, "Batchgenerators — a Python framework for data augmentation," 2020, https://zenodo.org/record/3632567#.YkGUnOdBzIU.
[51] Y. Zhang, S. Liu, C. Li, and J. Wang, "Rethinking the dice loss for deep learning lesion segmentation in medical images," Journal of Shanghai Jiaotong University, vol. 26, no. 1, pp. 93-102, 2021.
[52] A. Borji, D. N. Sihite, and L. Itti, "Salient object detection: a benchmark," in Proceedings of Computer Vision - ECCV 2012, A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, Eds., October 2012.
[53] F. Xiao, L. Peng, L. Fu, and X. Gao, "Salient object detection based on eye tracking data," Signal Processing, vol. 144, pp. 392-397, 2018.
[54] O. Russakovsky, J. Deng, H. Su et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211-252, 2015.
[55] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 25, 2012.
[56] S. Liu and W. Deng, "Very deep convolutional neural network based image classification using small training sample size," in Proceedings of the 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pp. 730-734, Kuala Lumpur, Malaysia, November 2015.
[57] Q. Chen, H. Yue, X. Pang et al., "Mr-ResNeXt: a multi-resolution network architecture for detection of obstructive sleep apnea," in Neural Computing for Advanced Applications. NCAA 2020. Communications in Computer and Information Science, H. Zhang, Z. Zhang, Z. Wu, and T. Hao, Eds., vol. 1265, Springer, Singapore, 2020.
[58] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5987-5995, Honolulu, HI, USA, July 2017.
[59] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261-2269, Honolulu, HI, USA, July 2017.
[60] J. I. Orlando, P. Seebock, H. Bogunovic et al., "U2-Net: a Bayesian U-net model with epistemic uncertainty feedback for photoreceptor layer segmentation in pathological OCT scans," in Proceedings of the 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 1441-1445, Venice, Italy, April 2019.
[61] D. Li, D. A. Dharmawan, B. P. Ng, and S. Rahardja, "Residual U-net for retinal vessel segmentation," in Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), pp. 1425-1429, Taipei, Taiwan, September 2019.
[62] A. J. Kent and A. Hopfstock, "Topographic mapping: past, present and future," The Cartographic Journal, vol. 55, no. 4, pp. 305-308, 2018.
[63] A. Kent, "Topographic maps: methodological approaches for analyzing cartographic style," Journal of Map & Geography Libraries, vol. 5, no. 2, pp. 131-156, 2009.
[64] D. John and C. Zhang, "An attention-based U-Net for detecting deforestation within satellite sensor imagery," International Journal of Applied Earth Observation and Geoinformation, vol. 107, p. 102685, 2022.
[65] R. Su, D. Zhang, J. Liu, and C. Cheng, "MSU-net: multi-scale U-net for 2D medical image segmentation," Frontiers in Genetics, vol. 12, p. 639930, 2021.
[66] A.-J. Lin, B. Chen, J. Xu, Z. Zhang, and G. Lu, "DS-TransUNet: dual swin transformer U-Net for medical image segmentation," 2021, https://arxiv.org/abs/2106.06716.
[67] N. Beheshti and L. Johnsson, "Squeeze U-net: a memory and energy efficient image segmentation network," in Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1495-1504, Seattle, WA, USA, June 2020.
[68] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, "3D U-net: learning dense volumetric segmentation from sparse annotation," in Proceedings of the Medical Image Computing and Computer-Assisted Intervention - MICCAI 2016, pp. 424-432, Springer International Publishing, Athens, Greece, October 2016.
[69] H. Huang, L. Lin, R. Tong et al., "UNet 3+: a full-scale connected UNet for medical image segmentation," in Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020), pp. 1055-1059, Barcelona, Spain, May 2020.
[70] C. Wang, C. Li, J. Liu et al., "U2-ONet: a two-level nested octave U-structure network with a multi-scale attention mechanism for moving object segmentation," Remote Sensing, vol. 13, no. 1, 2021.
[71] Y. Yang and S. Mehrkanoon, "AA-TransUNet: attention augmented TransUNet for nowcasting tasks," 2022, https://arxiv.org/abs/2202.04996.
[72] X. Jiang, Y. Wang, Y. Wang, W. Liu, and S. Li, "CapsNet, CNN, FCN: comparative performance evaluation for image classification," International Journal of Machine Learning and Computing, vol. 9, no. 6, pp. 840-848, 2019.
[73] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in Proceedings of the Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, vol. 2, pp. 1398-1402, 2003.
[74] T. Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, "Focal loss for dense object detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 2, pp. 318-327, 2020.
[75] F. Isensee, P. F. Jäger, P. M. Full, P. Vollmuth, and K. H. Maier-Hein, "nnU-net for brain tumor segmentation," in Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. BrainLes 2020, A. Crimi and S. Bakas, Eds., vol. 12659, Springer, Cham, Switzerland, 2021.
[76] A. Dosovitskiy, L. Beyer, A. Kolesnikov et al., "An image is worth 16x16 words: transformers for image recognition at scale," 2021, https://arxiv.org/abs/2010.11929.

[77] P. Michel, O. Levy, and G. Neubig, "Are sixteen heads really better than one?" 2019, https://arxiv.org/abs/1905.10650.
[78] J. B. Cordonnier, A. Loukas, and M. Jaggi, "Multi-head attention: collaborate instead of concatenate," 2020, https://arxiv.org/abs/2006.16362.
[79] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, NY, USA, 2009.
[80] L. Liu, J. Cheng, Q. Quan, F.-X. Wu, Y.-P. Wang, and J. Wang, "A survey on U-shaped networks in medical image segmentations," Neurocomputing, vol. 409, pp. 244-258, 2020.
