Computerized Medical Imaging and Graphics
Available online 24 February 2021
0895-6111/© 2021 Elsevier Ltd. All rights reserved.
https://doi.org/10.1016/j.compmedimag.2021.101885
Received 25 May 2020; Received in revised form 22 January 2021; Accepted 24 January 2021
MS-UNet: A multi-scale UNet with feature recalibration approach for
automatic liver and tumor segmentation in CT images
Devidas T. Kushnure a,b,*, Sanjay N. Talbar a
a Department of Electronics and Telecommunication Engineering, Shri Guru Gobind Singhji Institute of Engineering and Technology, Nanded, Maharashtra, India
b Department of Electronics and Telecommunication Engineering, Vidya Pratishthan's Kamalnayan Bajaj Institute of Engineering and Technology, Baramati, Maharashtra, India
* Corresponding author: research scholar at the Department of Electronics and Telecommunication Engineering, Shri Guru Gobind Singhji Institute of Engineering and Technology, Nanded, Maharashtra, India.
E-mail address: devidas.kushnure@vpkbiet.org (D.T. Kushnure)
ARTICLE INFO
Keywords:
Deep learning
Convolutional neural network
Liver and tumor segmentation
Multi-scale feature
Feature recalibration
CT images
ABSTRACT

Automatic liver and tumor segmentation play a significant role in the clinical interpretation and treatment planning of hepatic diseases. Segmenting the liver and tumor manually from hundreds of computed tomography (CT) images is tedious and labor-intensive; thus, segmentation becomes expert dependent. In this paper, we propose a multi-scale approach that improves the receptive field of the Convolutional Neural Network (CNN) by representing multi-scale features that capture global and local information at a more granular level. We also recalibrate the channel-wise responses of the aggregated multi-scale features, which enhances the high-level feature description ability of the network. The experimental results demonstrate the efficacy of the proposed model on the publicly available 3Dircadb dataset. The proposed approach achieved a dice similarity score of 97.13 % for liver and 84.15 % for tumor. A statistical significance analysis using a statistical test with a p-value demonstrated that the proposed model is statistically significant at a significance level of 0.05 (p-value < 0.05). The multi-scale approach improves the segmentation performance of the network and reduces the computational complexity and the network parameters. The experimental results show that the proposed method outperforms state-of-the-art methods.
1. Introduction
According to the status report on the Global Burden of Cancer worldwide (GLOBOCAN), 2018 estimates show that the liver cancer incidence and mortality rates are rapidly increasing across the world. Liver cancer is the sixth most common cancer and the second leading cause of cancer deaths worldwide (Bray et al., 2018). In the human body, the liver is one of the largest and most essential organs, involved in detoxification, filtering blood from the digestive tract, and supplying it to the body parts (Bilic et al., 2019). Thus, the liver is often the first site affected by the spread of metastatic tumors from a primary site such as the colorectum, breast, pancreas, ovary, and lung. The growth of liver tumors due to metastasis is secondary liver cancer. Liver cancer that originates in the liver cells (hepatocytes), such as Hepatocellular Carcinoma (HCC), is primary liver cancer. HCC comprises a hereditarily and molecularly exceptionally heterogeneous group of malignant growths that usually emerge in a chronically damaged liver. HCC affects the hepatocytes or liver cells, causing changes in the structure and shape of the affected liver cells that determine the progress of the cancer. These perceptible variations in shape and tissue structure allow for the non-invasive identification of HCC in imaging (Christ et al., 2017).
Radio imaging modalities such as ultrasound, computed tomography (CT), and magnetic resonance imaging (MRI) are utilized to detect anomalies in the upper and lower abdomen. Radio imaging is a non-invasive, painless, and precise technique to identify internal injuries, helping clinical specialists diagnose the complication and plan the treatment for saving the patient's life. Medical imaging techniques have become popular for diagnosing and treating disease and monitoring its progress (Bilic et al., 2019). Due to the ease and short time required to capture the exact inner structure of the human body, the CT scan has become the medical expert's choice to diagnose liver-related complications and anomalies (Luo et al., 2014).
Clinically, liver and tumor segmentation from CT images is an important task in hepatic disease diagnosis and treatment planning. Liver volume assessment is a directive before
hepatectomy, and it assists doctors and surgeons in planning liver resection, liver transplantation, portal vein embolization, associating liver partition and portal vein ligation for staged hepatectomy (ALPPS) (Gotra et al., 2017), and post-treatment assessment. It is also essential for applications such as computer-aided diagnosis (CAD) and deciding on interventional radiological treatment. Liver tumor volume extraction with high accuracy is beneficial for planning Selective Internal Radiation Therapy (SIRT, radioembolization) to diminish the risk of an excessive or insufficient radiation dose relative to the patient's liver volume (Moghbel et al., 2018). Therapy planning for the liver and for primary and metastatic tumors using percutaneous ablation is a minimally invasive surgical procedure guided through image navigation (Spinczyk et al., 2019). Liver segmentation is a significant stage in detecting hepatic complications early with radio imaging. CT is the medical expert's preferred imaging modality for hepatic diseases because of its robustness, wide availability, fast acquisition process, and higher spatial resolution. In clinical routine, medical experts delineate the liver and tumor manually from CT images; manual segmentation is considered the gold standard in medical practice and research. However, manual outlining of the liver and tumors is tedious and time-consuming, which could delay the diagnosis process. The segmentation depends on the expert's knowledge and experience, which may cause an erroneous segmentation outcome. For these reasons, it is essential to provide a computer-based framework that automatically segments the liver and tumor with accuracy acceptable for clinical significance and offers a second opinion that helps the physician conclude with more accuracy in less time. Many researchers and the scientific community focus on developing frameworks for automatic liver and tumor segmentation with modern image processing and computer vision algorithms.
In the last three decades, much scientific research on automatic and interactive segmentation strategies has been proposed in the literature. Even so, automatic liver and liver tumor segmentation from CT volumes remains challenging: the liver is a soft organ, and its shape is exceptionally dependent on the surrounding organs inside the abdomen. Apart from that, liver pathology is inconsistent and may modify the liver's signal intensity, density, and shape; there is little intensity difference between the liver and tumor regions, uniformity in intensities between the liver and its surrounding organs, and feeble boundaries between the liver and surrounding organs such as the stomach and heart, as illustrated in Fig. 1. Usually, liver CT images are obtained using an injection protocol that enhances the liver in the CT images for medical interpretation. However, the injection phase decides the enhancement variation, and noise in the CT images increases with the enhancement, adding noise to the liver region, which is already noisy without any enhancement (Moghbel et al., 2018). Because of these challenges, liver and tumor segmentation is a demanding task that has attracted much research attention in recent years.
2. Related literature
In the literature, several interactive and automatic methods for liver and tumor segmentation in CT volumes have been proposed. In 2007, Grand Challenge benchmarks on liver and liver tumor segmentation were conducted in conjunction with the MICCAI conference. Most of the methods presented at the challenge were based on statistical shape models for automatic segmentation (Heimann et al., 2009). Furthermore, methods (Luo et al., 2014) based on the liver's gray intensities, structure, and texture features were proposed for automatic liver segmentation. The gray-level-based methods utilized the liver's gray intensities for segmentation, using intensity-based algorithms like region growing, active contour, graph cut, thresholding, and clustering. The structure-based methods utilized the liver's repetitive geometry to create a probabilistic model to reconstruct the liver shape; these methods used the statistical shape model, statistical pose model, and probabilistic atlas. The texture-based methods utilized texture features to segment the liver, with machine learning and pattern recognition algorithms classifying the liver region based on the texture feature description. Afterward, a computer-aided diagnosis (CAD) system was envisaged as a second pair of eyes for expert radiologists, provided the CAD system works with accuracy. Several methods utilized machine learning algorithms like the probabilistic neural network (PNN), support vector machine (SVM), region growing, alternative fuzzy C-means (AFCM), and Hidden Markov Model (HMM) to design CAD systems for liver and tumor segmentation (Moghbel et al., 2018).
Over the past few years, deep learning algorithms based on the Convolutional Neural Network (CNN) have become popular for visual recognition because of their powerful nonlinear feature extraction capabilities, using many different filters at different layers of the network, and their capability to process large amounts of data (Yamashita et al., 2018), in applications like image classification, object detection, and action recognition. CNN-based networks like AlexNet, VGGNet, GoogleNet, and ResNet proved their capability for visual recognition tasks in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Ueda et al., 2019). After CNN's success in efficient classification tasks, researchers exploited the same backbone architectures for the semantic segmentation task. Fully Convolutional Network (FCN) (Shelhamer et al., 2017) based architectures employed existing well-known classifier models for semantic segmentation by replacing the dense classifier layers. The FCN encoder-decoder architecture was considered the most successful for segmentation; the decoder network was utilized to upsample the segmented map to the input image size. For semantic segmentation of images, SegNet (Badrinarayanan et al., 2017) was proposed, which has an encoder-decoder network followed by a pixel-wise classification layer. Its encoder utilized a network topology identical to the 13 convolutional layers of VGGNet.

Fig. 1. Sample images from the 3Dircadb dataset denoting the complications of liver and tumor segmentation in abdomen CT scans: (a) low-intensity difference between nearby organs (liver, stomach, and heart) and tumor; (b) ambiguous boundary between the liver and heart; and (c) ambiguous boundary between the liver and stomach.
In medical image processing, the semantic segmentation task is utilized to segment the anatomical structure of organs and to segment tumors. Automatic segmentation of the region of interest from medical images using CNN-based architectures has proved effective, with the UNet architecture proposed for biomedical image processing. An encoder-decoder design has become the choice for medical image segmentation: it has an encoding part, which condenses the information of the input image into a group of high-level features, and a decoding part, where the high-level features are utilized to rebuild a pixel-wise segmentation in single or multiple upsampling steps (Ronneberger et al., 2015). After this paper, algorithms proposed based on the FCN utilized UNet-derived architectures for medical image segmentation. The Liver Tumor Segmentation Benchmark (LiTS) challenge was organized in conjunction with ISBI 2017 and MICCAI 2017. The methods presented were based on the CNN deep learning approach, and the majority were UNet-derived architectures. Almost all of the techniques utilized specific preprocessing of the input data, like HU-value windowing, normalization, and standardization. Additionally, most of the techniques applied a connected-lesion-components post-processing method to the segmented map to discard the portions of lesions outside the liver region (Bilic et al., 2019).
Furthermore, most available liver and liver tumor segmentation networks are based on the FCN with the UNet encoder (contraction) and decoder (expansion) structure. In UNet, all layers are convolutional, to achieve pixel-level prediction in a forward step. UNet has encoding and decoding paths built using convolution, pooling, and upsampling layers. To improve the segmentation capability of UNet, the encoding path features are concatenated with the decoding path at the respective stage using skip connections. To enhance the segmentation output further, a few proposed methods (Budak et al., 2020; Gruber et al., 2019) utilized two UNet architectures jointly for liver and tumor segmentation. A complex CNN architecture for liver and kidney segmentation used an ImageNet-trained ResNet-34 as the feature encoder to reduce the convergence time and the overfitting problem.
Further, to improve the prediction capabilities of FCN-based architectures, residual connections (Drozdzal et al., 2016) (Zhang et al., 2019) were employed from the forward path in the intermediate feature maps, and post-processing was applied to refine the segmentation performance. The deep CNN model (Han, 2017) based on ResNet used long-range UNet and short-range ResNet residual connections with post-processing using 3D connected-component labeling of all voxels labeled as lesion. Further post-processing methods were proposed to refine the segmentation performance of the algorithms. FCN-based encoder-decoder models (Zhang et al., 2017) (Christ et al., 2017) (Chlebus et al., 2017) demonstrated the effect of the level set, graph cut, CRF, and random forest algorithms utilized for post-processing to refine segmentation results. The proposed super-pixel-based CNN (Qin et al., 2019) divided the CT image into super-pixels by aggregating adjacent pixels with the same intensity, classified them into three classes (interior liver, liver boundary, and non-liver background), and utilized the CNN to predict the liver boundary.
Recently, many extensions of UNet, modifying its core structure, have been proposed to segment the liver and tumors. A novel hybrid densely connected UNet (Li et al., 2018) was proposed to explore intra-slice and inter-slice features by introducing a hybrid feature fusion layer using 2D and 3D DenseUNet. The 2D DenseUNet extracts the intra-slice features, the 3D DenseUNet extracts the inter-slice volumetric context, and these features are fused to portray the 3D interpretation using the feature fusion layer. The 3D residual attention-aware RA-UNet (Noh et al., 2015) was proposed using the residual learning approach to express multi-scale attention information and combine low-level features with high-level features. A modified UNet architecture (Seo et al., 2020) was proposed to exploit object-dependent feature extraction using a modified skip connection with an additional convolutional layer and residual path. The modified skip connection extracts high-level global features of small objects and high-level features of the high-resolution edge information of large objects. Recently, the UNet++ architecture (Zhou et al., 2020) was proposed by redesigning the skip connection to exploit multi-scale features using a feature fusion scheme from each encoding layer to the decoding layer.
In semantic segmentation, the CNN extracts the critical features of the image and effectively decides the coarse boundary of the target. However, it is observed that, at the end of the encoder, the size of the feature maps is remarkably reduced, which obstructs the accuracy of the CNN. The consecutive downsampling by pooling operations reduces the input image resolution to a small feature map. This results in a loss of spatial information about the object, which is essential in analyzing medical images for accurate segmentation of target objects. Various methods, such as deconvolution (Noh et al., 2015) and skip connections (Ronneberger et al., 2015), have been proposed that utilize the concept of transposed convolution for upsampling and skip connections for connecting upper convolution layers with the deep layers, so that the network can maximize the utilization of the high-level features to preserve the spatial information. However, these methods cannot recover the spatial information lost in the pooling and convolution operations.
Moreover, CNN models need to process features at different scales to extract the meaningful contextual information of the object and achieve successful semantic segmentation. Multi-scale feature characterization was achieved by the FCN with variable pooling layers, combining the features from previous layers with deeper layers to maintain the global and local information of the object and achieve effective semantic segmentation (Long et al., 2015). The pyramid scene parsing network (PSPNet) (Zhao et al., 2017) utilized global context information accumulated from region-based features employing pyramid pooling. In pyramid pooling, global and local information is characterized effectively by transforming input features through multiple pooling operations and aggregating all the features to achieve effective semantic segmentation. Later, the DeepLab system (Chen et al., 2018) (Chen et al., 2017) was proposed to preserve spatial resolution by utilizing the atrous convolution module. Atrous convolution is employed in series or in parallel to expand the receptive field of the CNN, and atrous spatial pyramid pooling with multiple atrous rates is utilized to gain a multi-scale context depiction. Channel-UNet (Chen et al., 2019) was proposed to optimize the mapping of information between pixels in convolution layers with spatial-channel convolution and adopted an iterative learning mechanism that expands the receptive field of the convolution layers.
In this paper, we propose a CNN with an encoder-decoder UNet-based multi-scale feature representation and recalibration architecture for liver and liver tumor segmentation. We utilize the bottleneck Res2Net module's ability to represent multi-scale features and improve the receptive field of the convolutional neural network (CNN). Further, we recalibrate the multi-scale features channel-wise with a squeeze-and-excitation (SE) network. We performed the experimentation on the publicly available 3Dircadb dataset. The results illustrate that the multi-scale UNet outperforms the state-of-the-art methods for liver and tumor segmentation.
The following contributions are incorporated in the paper:
• We proposed the MS-UNet with a feature recalibration approach by exploiting the core idea of the UNet encoder-decoder architecture. The architectural difference is that our network utilizes the Res2Net module for multi-scale feature representation and the SE network for feature recalibration in the encoder and decoder stages. The combination of the Res2Net module followed by the SE network enhances the feature representation capability and learning potential of the network. The computational complexity and the parameters of the proposed network are reduced because of the Res2Net module.
• We employed the multi-scale Res2Net module in the network to enhance the receptive field of the CNN to cover the entire region of interest from the input features and characterize the global and local information of the input at a more granular level by extracting multi-scale features. Therefore, the input feature is represented by multiple features at different scales, aggregated hierarchically, signifying detailed information of the input features.
• The aggregated multi-scale features extract the more granular information of the input, which limits the learning capacity of the network. To improve the network's learning ability and focus on the more prominent features of the object, we utilized the SE network. The SE network recalibrates the channel-wise feature responses by modeling interdependencies between the channels with squeeze and excitation operations. Feature recalibration enhances the network's sensitivity to informative features of the object, so that the network's ability to learn prominent features improves. Also, the feature extraction capability of successive network layers increases, which boosts the segmentation performance.
• We experimentally verified the MS-UNet performance for liver and tumor segmentation in terms of multi-scale feature extraction ability by varying the scaling factor. The model performance has been evaluated using different statistical measures while varying the scaling factor. We trained the network from scratch, performed the experimentation on the publicly available 3Dircadb dataset manually annotated by medical experts, and demonstrated the network's segmentation performance. The proposed model is statistically significant at the significance level 0.05 (p-value < 0.05), as verified using a statistical test for hypothesis testing.
The remainder of the paper is organized as follows: Section 3 explains the proposed methodology, Section 4 presents the experimental setup and result analysis, and Section 5 concludes the paper.
3. Proposed work

3.1. Proposed methodology
We propose a deep convolutional neural network with a multi-scale feature extraction and recalibration architecture for automatic liver and tumor segmentation. Fig. 2 shows the proposed MS-UNet encoder-decoder architecture. We embed the Res2Net bottleneck module with the SE network in place of the two 3 × 3 convolution operations in the UNet encoder-decoder stages. The bottleneck Res2Net module has the same architecture as the bottleneck ResNet, except that the single 3 × 3 convolution operation is replaced by smaller 3 × 3 convolutions in hierarchical order to achieve multi-scale feature extraction and an improved receptive field. The Res2Net bottleneck architecture was designed to improve the layer-wise multi-scale feature representation at a more granular level and the receptive field of the CNN. We employed the Res2Net bottleneck module instead of the 3 × 3 convolution in UNet to leverage its multi-scale feature extraction ability and improved receptive field to enhance the segmentation performance. The Res2Net module can extract the input features at a more granular level with its multi-scale characterization ability. It improves the receptive field of the CNN by splitting the input features into small blocks that are processed through multiple convolution blocks with different scale features, enhancing the features at multiple scales. These multi-scale features are captured by multiple convolution layers with local receptive fields. They empower the network to extract informative features by aggregating both spatial and channel-wise features.

Fig. 2. Proposed MS-UNet encoder and decoder architecture (yellow blocks: Res2Net module and SE network; gray block: SE network). (For interpretation of the references to colour in the figure, the reader is referred to the web version of this article.)
In semantic segmentation, spatial information plays a significant role in locating the region of interest in the image. To empower the network's potential to characterize the global and local information at a granular level and uplift the network's learning ability at each stage, we employed feature refinement through the SE network (SENet). The SENet recalibrates the features in two steps. First, it globalizes the fused multi-scale features channel-wise into a one-dimensional vector, called the squeeze operation. Second, it recalibrates the features by passing them through two dense layers that describe the weights for the input channels, called the excitation operation. The channel weights then scale the input multi-scale features and improve the feature representation potential of the network. The network gains a perception of the coarse-grained context in the shallow layers and a localization of fine-grained attributes in the deeper layers, which boosts segmentation performance.
Table 1 details the proposed MS-UNet, with the number of stages in the network along with the layer-wise activation function, convolution filter size, and the number and shape of features. All the convolutional operations used in the Res2Net module follow the order Conv2D, batch normalization, and ReLU.
The entire training and testing pipeline is illustrated in Fig. 3: the input CT dataset is preprocessed first, then the MS-UNet is trained for liver and tumor segmentation, the model is tested on the test data, and the network performance is evaluated using statistical measures.
3.2. Multi-scale features
In the CNN, the encoder architecture extracts high-level information at each stage by downsampling the input using a pooling operation. However, pooling causes contextual information loss. The skip connection provides low-resolution information to the respective stages of the decoder to recover the contextual information. However, this method cannot retrieve the loss due to the pooling layer and results in a coarse pixel map. The multi-scaling approach enables the CNN to extract different features at different scales. It enhances the receptive field layer-wise at a more granular level, which leads to the refinement of the network's feature characterization potential.
The layer-wise feature representation ability of CNNs is improved at a more granular level by improving the receptive field using the bottleneck Res2Net module (Gao et al., 2019). The detailed architecture of the Res2Net module is shown in Fig. 2. For multi-scale feature representation, the 3 × 3 convolution filters of $n$ channels are replaced with a bunch of smaller filter groups in the Res2Net module, each with $w$ channels, such that $n = s \times w$, without additional computational burden. The small filter groups are connected in a hierarchical residual fashion, which increases the representation of output features at different scales. Lastly, the feature maps from all the subsets are concatenated and passed through a 1 × 1 filter to fuse the complete information. The input features are evenly split into $s$ subsets after the 1 × 1 convolution, such that every subset has the same spatial size and $1/s$ of the channels of the input features. The feature subsets are denoted by $f_i$, where $i \in \{1, 2, 3, \ldots, s\}$. Each $f_i$, excluding the first subset $f_1$, has a corresponding 3 × 3 convolution, denoted by $O_i(\cdot)$, whose output is the multi-scale feature denoted by $M_i$. The output $M_i$ can be written as (Eq. 1):
written as (Eq 1),
M i=
⎧
⎨
⎩
f i , i = 1;
O i(f i ) , i = 2;
O i(f i+M i− 1 ) , 2 < i ≤ s.
(1)
Eq. (1) depicts the multi-scale features, where the 3 × 3 convolutional operator $O_i(\cdot)$ can receive feature information from all the feature splits $\{f_j,\, j \le i\}$. The fusion of all the features results in the output of Res2Net with multiple dissimilar features with different combinations of receptive fields. In the Res2Net module, the features are split and concatenated. The splits work in a multi-scale manner, which benefits the extraction of both the global and local feature information of the input features. The features at different scales are concatenated to better fuse the information. The process of splitting and concatenation enables the convolutions to transform features more efficiently. The features from multiple 3 × 3 filters are combined, resulting in many similar feature combinations due to the aggregation effect.
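As an illustration of Eq. (1), the following is a minimal sketch of the hierarchical split-and-convolve computation, written in the Keras functional API the paper builds on; the function name and hyperparameters are ours, not the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def res2net_multiscale(x, filters, scale=4):
    """Sketch of the Res2Net multi-scale computation of Eq. (1)."""
    # 1x1 convolution, then split evenly into s subsets of w = filters/scale channels
    x = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    subsets = tf.split(x, num_or_size_splits=scale, axis=-1)

    outputs = [subsets[0]]            # M_1 = f_1 (no convolution on the first split)
    prev = None
    for i in range(1, scale):
        f_i = subsets[i]
        if prev is not None:          # M_i = O_i(f_i + M_{i-1}) for 2 < i <= s
            f_i = layers.Add()([f_i, prev])
        prev = layers.Conv2D(filters // scale, 3, padding="same",
                             activation="relu")(f_i)  # O_i(.)
        outputs.append(prev)

    # Concatenate all scales and fuse through a 1x1 convolution
    y = layers.Concatenate()(outputs)
    return layers.Conv2D(filters, 1, padding="same")(y)
```

Because each 3 × 3 convolution sees only `filters/scale` channels, the hierarchical connections enlarge the receptive field without the cost of a full-width 3 × 3 convolution, which matches the complexity reduction reported later in the paper.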
Table 1
Details of the operations performed and the settings of the layers in each encoding and decoding stage of the proposed network.

| Stage | Encoder path | Output features and size | Decoder path | Output features and size |
|---|---|---|---|---|
| 1 | Input | 256 × 256 × 1 | Conv2D [output layer] [1 × 1, Sigmoid] | 256 × 256 × 1 |
|   | Conv2D [3 × 3, BatchNorm, ReLU] | 256 × 256 × 64 | Conv2D [3 × 3, BatchNorm, ReLU] | 256 × 256 × 64 |
|   | Res2Net + SE: Res2Net-[64, scaling = 4, ReLU], SE-[64, r = 8, ReLU and Sigmoid] | 256 × 256 × 64 | Res2Net + SE: Res2Net-[64, scaling = 4, ReLU], SE-[64, r = 8, ReLU and Sigmoid] | 256 × 256 × 64 |
| 2 | Max Pooling [2 × 2] | 128 × 128 × 64 | Upsampling (deconvolution) [2 × 2, strides = 2 × 2] | 256 × 256 × 128 |
|   | Res2Net + SE: Res2Net-[128, scaling = 4, ReLU], SE-[128, r = 8, ReLU and Sigmoid] | 128 × 128 × 128 | Res2Net + SE: Res2Net-[128, scaling = 4, ReLU], SE-[128, r = 8, ReLU and Sigmoid] | 128 × 128 × 128 |
| 3 | Max Pooling [2 × 2] | 64 × 64 × 128 | Upsampling (deconvolution) [2 × 2, strides = 2 × 2] | 128 × 128 × 256 |
|   | Res2Net + SE: Res2Net-[256, scaling = 4, ReLU], SE-[256, r = 8, ReLU and Sigmoid] | 64 × 64 × 256 | Res2Net + SE: Res2Net-[256, scaling = 4, ReLU], SE-[256, r = 8, ReLU and Sigmoid] | 64 × 64 × 256 |
| 4 | Max Pooling [2 × 2] | 32 × 32 × 256 | Upsampling (deconvolution) [2 × 2, strides = 2 × 2] | 64 × 64 × 512 |
|   | Res2Net + SE: Res2Net-[512, scaling = 4, ReLU], SE-[512, r = 8, ReLU and Sigmoid] | 32 × 32 × 512 | Res2Net + SE: Res2Net-[512, scaling = 4, ReLU], SE-[512, r = 8, ReLU and Sigmoid] | 32 × 32 × 512 |
| 5 | Max Pooling [2 × 2] | 16 × 16 × 512 | Upsampling (deconvolution) [2 × 2, strides = 2 × 2] | 32 × 32 × 1024 |
|   | Res2Net + SE: Res2Net-[1024, scaling = 4, ReLU], SE-[1024, r = 8, ReLU and Sigmoid] | 16 × 16 × 1024 |  |  |
3.3. Multi-scale feature recalibration
Along with the multi-scale feature characterization ability of the Res2Net module, the SE network (Hu et al., 2018), as indicated in Fig. 2, is added before the residual connection. The SE network's primary purpose is to model the channel-wise feature responses by expressing the interdependencies between the channels and to recalibrate the fused features after concatenation.
In the squeeze operation, global average pooling is applied to the input features of size $W \times H \times C$ received from the 1 × 1 convolution of the Res2Net block, converting all the channels into a one-dimensional vector with dimension equal to the number of channels $C$. Let the input features be $M = [M_1, M_2, M_3, \ldots, M_C]$, where $M_C \in \mathbb{R}^{H \times W}$ is one channel of the input features with size $H \times W$. The global pooling produces a one-dimensional vector $Z$ of size $\mathbb{R}^C$. For the $C$-th channel, the element of the vector is given by (Eq. 2):

$$
Z_C = F_{sqe}(M_C) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} M_C(i, j)
\tag{2}
$$
$Z$ is the transformation of the input features $M$; it is the aggregation of the transformed features and can be interpreted as a cluster of local descriptors whose statistics are meaningful for the entire image.
In the second operation, the aggregated information is utilized to grab channel-wise dependencies. To isolate the channels and improve the generalization capability of the network, a simple gating mechanism is employed using two fully connected layers with ReLU and sigmoid activation. The first fully connected layer transforms $Z$ with the ReLU activation $\delta$, and then the sigmoid activation function $\sigma$ is applied, expressed as (Eq. 3):

$$
E = F_{Ex}(Z, W) = \sigma(g(Z, W)) = \sigma(W_2\,\delta(W_1 Z))
\tag{3}
$$

where $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$, and $r$ is the dimensionality reduction factor, which decides the computational-cost-controlling capacity of the SENet. We employed a dimensionality reduction factor $r = 8$ in our experimentation, as it exhibits the best segmentation performance for medical images (Rundo et al., 2019).
The excitation operation returns the same channel dimension as the input features $M$. The final output of the SE block is the scaling of the excitation output channel weights with the input features, which puts more emphasis on essential features and less on negligible ones. The scaling is expressed as (Eq. 4):

$$
\tilde{M}_C = F_{scale}(M_C, E_C) = E_C \cdot M_C
\tag{4}
$$

where $\tilde{M} = [\tilde{M}_1, \tilde{M}_2, \tilde{M}_3, \ldots, \tilde{M}_C]$ and the scaling operation $F_{scale}(M_C, E_C)$ is a channel-wise multiplication between $E_C \in [0, 1]$ and $M_C \in \mathbb{R}^{H \times W}$. The multi-scale feature and recalibration approach characterizes the high-level features in a better way. At each encoding stage, the network maintains the contextual information of the input object. These features are concatenated with the corresponding decoding stages through skip connections to reconstruct the object shape in the segmentation map.
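A minimal Keras sketch of the squeeze-and-excitation recalibration in Eqs. (2)-(4) follows; the function name is illustrative, not the authors' exact code.

```python
from tensorflow.keras import layers

def se_block(m, channels, r=8):
    """Recalibrate channel responses: squeeze (Eq. 2), excite (Eq. 3), scale (Eq. 4)."""
    z = layers.GlobalAveragePooling2D()(m)                 # Eq. (2): Z, one value per channel
    e = layers.Dense(channels // r, activation="relu")(z)  # W_1 followed by delta (ReLU)
    e = layers.Dense(channels, activation="sigmoid")(e)    # W_2 followed by sigma (sigmoid)
    e = layers.Reshape((1, 1, channels))(e)                # broadcastable channel weights E
    return layers.Multiply()([m, e])                       # Eq. (4): E_C * M_C per channel
```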
4. Experimental setup and result analysis

4.1. Data preparation
The experimentation was performed using the publicly available 3D Image Reconstruction for Comparison of Algorithm Database (3Dircadb) for training and testing the network. It consists of 20 venous-phase enhanced CT volumes from various European hospitals acquired with different CT scanners. The CT scans are of 20 patients (10 women and 10 men), with hepatic tumors in 15 cases. The database has been manually annotated by medical experts for liver and tumor. The input size is 512 × 512, and the in-plane resolution ranges from 0.86 × 0.86 mm² to 0.56 × 0.56 mm². The CT slices are available in DICOM format. The number of slices ranges from 74 to 260, and the slice thickness varies from 4 mm to 1 mm. The database provides significant variations in the shape and size of the liver and tumors. The tumors are located in different Couinaud segments of the liver (3Dircadb, n.d.).
4.2. Data preprocessing
In CT scan volumes, the relative densities of internal body organs are measured using Hounsfield Units (HU). In general, the HU values range from -1000 to 1000. The tumor grows in the liver parenchyma, which is the region of interest for segmentation. When focusing on the liver region, the adjacent organs and irrelevant tissues in the abdomen may trouble the segmentation performance. The radiodensities in CT volumes for soft liver tissues vary from 40 HU to 50 HU (Jin et al., 2018). By removing irrelevant organs and unnecessary details from the CT images, the liver and tumor regions become clean for segmentation.

Fig. 3. Illustration of the liver and tumor segmentation pipeline for MS-UNet.
We preprocess the entire CT data in a slice-by-slice fashion. First, we downsampled the 512 × 512 CT images to 256 × 256 to reduce the computational burden. Secondly, we applied a global windowing step to the CT slices one by one. For windowing the HU values, we used the window (-100, 400) HU, so that most of the irrelevant organs were removed from the CT slices, making the liver and tumor region clean for segmentation. Afterward, the dataset was normalized to the same scale between [0, 1], which simplifies the network's learning function and improves the convergence speed of the network by supplying more easily proportionate images, with an enhanced liver and liver tumor region, as input for segmentation. The HU-value windowing and image enhancement results, along with the histogram plots, are illustrated in Fig. 4. These preprocessing operations offer a clean liver and tumor region for segmentation.
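The preprocessing chain described above can be sketched with NumPy and OpenCV as follows, assuming the slices are already converted from DICOM to HU values; the function name and the use of OpenCV's histogram equalization are illustrative assumptions, not the authors' exact code.

```python
import numpy as np
import cv2

def preprocess_slice(hu_slice):
    """Downsample, HU-window, normalize, and histogram-equalize one CT slice."""
    # Resize 512x512 -> 256x256 to reduce the computational burden
    img = cv2.resize(hu_slice.astype(np.float32), (256, 256))
    # Global HU windowing: clip to (-100, 400) to suppress irrelevant organs
    img = np.clip(img, -100.0, 400.0)
    # Map the (-100, 400) HU window linearly onto [0, 1]
    img = (img + 100.0) / 500.0
    # Histogram equalization (8-bit) to enhance the liver/tumor region
    img8 = cv2.equalizeHist((img * 255).astype(np.uint8))
    return img8.astype(np.float32) / 255.0
```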
4.3. Training strategy for the proposed network
The proposed network was trained using the Adam optimizer with an initial learning rate of 1 × 10⁻⁵, reduced on a plateau with a patience of 5 epochs down to a minimum learning rate of 1 × 10⁻¹⁰, and a mini-batch size of 8 was employed for training the network. To avoid overfitting, we regularize the network weights using a weight decay factor of 1 × 10⁻⁴. Table 2 indicates the detailed training configuration of the network.
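For concreteness, a minimal Keras sketch of this training configuration is given below, assuming `model`, `dice_loss`, `dice_coefficient` (see Section 4.3.1), and the training arrays already exist; the learning-rate reduction factor and the epoch count are illustrative assumptions, since the paper specifies only the initial/minimum rates, the patience, and the batch size.

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Compile with Adam at the stated initial learning rate and the dice loss/metric
model.compile(optimizer=Adam(learning_rate=1e-5),
              loss=dice_loss, metrics=[dice_coefficient])

# Reduce the learning rate on a plateau (patience 5) down to the stated minimum
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.1,  # factor assumed
                              patience=5, min_lr=1e-10)

model.fit(train_images, train_masks, batch_size=8, epochs=100,  # epochs assumed
          validation_data=(val_images, val_masks), callbacks=[reduce_lr])
```

The 1 × 10⁻⁴ weight decay would correspond in Keras to an L2 `kernel_regularizer` attached to the convolutional layers rather than an optimizer argument.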
4.3.1. Loss function
The model was compiled using the dice coefficient as a metric and the dice loss as the loss function, which is the complement of the dice coefficient (Jin et al., 2018). The dice loss is expressed as (Eq. 5):

$$
L_{Dsc} = 1 - \frac{2\sum_{i=1}^{N} p_i \times g_i}{\sum_{i=1}^{N} p_i^2 + \sum_{i=1}^{N} g_i^2}
\tag{5}
$$
where $p_i$ and $g_i$ are the binary predicted segmentation voxels and ground truth voxels, respectively, and $N$ is the number of voxels. The loss directly measures the similarity of the two samples, and accordingly the network weights are optimized by minimizing the loss.
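A minimal Keras implementation of Eq. (5) might look as follows; the smoothing constant is an assumption added for numerical stability and is not specified in the paper.

```python
from tensorflow.keras import backend as K

def dice_coefficient(y_true, y_pred, smooth=1.0):
    """Dice similarity between ground truth and prediction (Eq. 5 numerator/denominator)."""
    p = K.flatten(y_pred)
    g = K.flatten(y_true)
    num = 2.0 * K.sum(p * g) + smooth
    den = K.sum(K.square(p)) + K.sum(K.square(g)) + smooth
    return num / den

def dice_loss(y_true, y_pred):
    # Eq. (5): L_Dsc = 1 - DSC
    return 1.0 - dice_coefficient(y_true, y_pred)
```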
4.3.2. Data augmentation
Deep neural networks are data-hungry; they need enormous amounts of data to train without over-fitting and to generalize well on test data. The lack of vast, publicly available labeled medical image databases limits the application of deep learning models in medicine. However, despite the limited publicly available databases, it is possible to train a deep learning model by employing the data augmentation technique, which augments the database by applying standard geometric transformations like translation, rotation, and scaling.

We train our network by employing data augmentation. We augment the training images by applying image transformations to the dataset: rotation, scaling, shifting, flipping, and elastic deformation. Data augmentation is beneficial for reducing the risk of overfitting during the training process and refines the generalization potential of the model on test data (Yamashita et al., 2018).
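As a sketch, Keras' `ImageDataGenerator` covers most of these geometric transformations (elastic deformation is not built in and would require a custom preprocessing function); the parameter values below are illustrative assumptions, not the paper's settings.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,        # rotation
    zoom_range=0.1,           # scaling
    width_shift_range=0.1,    # shifting
    height_shift_range=0.1,
    horizontal_flip=True,     # flipping
)
# The same transformations (with a shared random seed) must be applied to the
# images and the masks so that the segmentation labels stay aligned.
```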
4.3.3. Implementation platform details
The proposed network has been implemented in the Keras high-level neural network application programming interface (API) (Chollet, 2015) with TensorFlow (Abadi et al., 2016) as a backend. The workstation utilized for training and testing the model has an Intel(R) Xeon(R) CPU E5-1620 0 at 3.60 GHz, 16 GB RAM, and an NVIDIA GeForce TITAN X GPU with 12 GB memory, running the Windows 10 operating system.
4.4. Performance metrics
To evaluate the segmentation performance between the ground truth and the segmentation map of the proposed network, we utilize performance metrics based on volumetric size similarity and surface distance measures (Heimann et al., 2009) (Jiang et al., 2018). In the definitions, the ground truth is denoted by $A$, and the segmented result is denoted by $B$. The Dice Similarity Coefficient (DSC) score is expressed as follows (Eq. 6):

$$
DSC(A, B) = \frac{2|A \cap B|}{|A| + |B|}
\tag{6}
$$

The Volumetric Overlap Error (VOE) is expressed (Eqs. 8 and 9) using the Jaccard Coefficient (JC), or Intersection over Union (IoU) (Eq. 7):

$$
JC(A, B) = \frac{|A \cap B|}{|A \cup B|}
\tag{7}
$$

$$
VOE(A, B) = 1 - JC(A, B)
\tag{8}
$$

$$
VOE(A, B) = \left(1 - \frac{|A \cap B|}{|A \cup B|}\right) \times 100\,\%
\tag{9}
$$

The Relative Absolute Volume Difference (RAVD) is denoted as follows (Eq. 10):

$$
RAVD = \mathrm{Abs}\left(\frac{|B| - |A|}{|A|}\right)
\tag{10}
$$

The surface distance measures are the Average Symmetric Surface Distance (ASSD) and the Maximum Symmetric Surface Distance (MSSD). Let $S(A)$ denote the set of surface voxels of $A$. The shortest distance of an arbitrary voxel $v$ to $S(A)$ is expressed as follows (Eq. 11):

$$
d(v, S(A)) = \min_{s_A \in S(A)} \lVert v - s_A \rVert
\tag{11}
$$

where $\lVert \cdot \rVert$ denotes the Euclidean distance. The ASSD is denoted as (Eq. 12):

$$
ASSD(A, B) = \frac{1}{|S(A)| + |S(B)|}\left(\sum_{s_A \in S(A)} d(s_A, S(B)) + \sum_{s_B \in S(B)} d(s_B, S(A))\right)
\tag{12}
$$

The MSSD is denoted as (Eq. 13):

$$
MSSD(A, B) = \max\left\{\max_{s_A \in S(A)} d(s_A, S(B)),\ \max_{s_B \in S(B)} d(s_B, S(A))\right\}
\tag{13}
$$

The DSC, IoU, VOE, and RAVD are measured in percentage, and the surface distance measures (ASSD and MSSD) are measured in millimeters (mm). For DSC and IoU, 100 % is the best segmentation and 0 % is the worst; for VOE and RAVD, 0 % is the best segmentation and 100 % is the worst. For ASSD and MSSD, 0 mm is the best segmentation, and there is no upper bound; the larger the value, the worse the segmentation.
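The volumetric overlap metrics of Eqs. (6)-(10) can be computed for binary masks as in the following NumPy sketch; the surface-distance measures (Eqs. 11-13) require surface-voxel extraction and are omitted here. The function name is illustrative.

```python
import numpy as np

def overlap_metrics(a, b):
    """a: ground-truth mask A, b: predicted mask B (boolean NumPy arrays)."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    dsc  = 2.0 * inter / (a.sum() + b.sum())            # Eq. (6)
    jc   = inter / union                                # Eq. (7)
    voe  = (1.0 - jc) * 100.0                           # Eqs. (8)-(9), in %
    ravd = abs(int(b.sum()) - int(a.sum())) / a.sum() * 100.0  # Eq. (10), in %
    return dsc * 100.0, jc * 100.0, voe, ravd
```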
4.5. Experimental results and analysis
The proposed network has multi-scale feature extraction and feature recalibration ability, which leads to better segmentation performance. To verify the multi-scale feature representation potential of the proposed network, we evaluated it for scaling factors s = 2, s = 4, and s = 8 and measured the performance using the measures defined in the previous section; Table 3 presents the performance of the proposed network for the various scaling factors. The proposed network achieved a dice similarity coefficient of 97.13 % for liver and 84.15 % for tumor segmentation with scaling factor s = 4. The scaling factor improves the CNN's capability to extract features at a multi-scale level with an improved receptive field.

Fig. 4. Preprocessing effect on the input image along with the histogram: the first column shows the sample input CT slices with their histograms, the second column shows the HU windowing with its histogram, and the third column shows the histogram-equalized images and the equalized histogram.
The segmentation performance improvement obtained by the multi-scale feature representation and recalibration approach for liver and tumor is shown in Fig. 5. The results depict that the multi-scale features capture more detailed liver and tumor information and segment complicated liver parenchyma and tumors with less segmentation error. The boundary-marked images show the error between the ground truth and the segmented results. The network captures fuzzy liver boundaries with less segmentation error, and the low intensity variation between the liver and tumor is also captured with less segmentation error. However, the network shows negative results where the tumor size is sufficiently small, as shown in Fig. 6. The network performs well on liver pixel segmentation, but small tumors are not segmented as accurately as large tumors. The tumor segmentation error increases as the number of small tumors in the 2D CT images increases.

The proposed network can segment the liver anatomical structure from abdomen CT images with less segmentation error. The results show that the network can segment the liver even where the boundaries between the liver and nearby organs are feeble. However, the tumor pixels are not accurately classified by the network as the tumor size decreases: the segmentation error increases, and the network produces false-negative results for the segmentation of small-sized tumors. The multi-scale approach offers a reasonable segmentation quality for complex liver anatomical structures and large tumors. Still, the tumor segmentation performance for small tumors needs to be upgraded to an adequate level. It is observed that the network performs well on the liver and large tumors, but it produces false-negative results as the size of the tumor decreases.
We also analyzed the effect of the multi-scale features on the network complexity in terms of the number of parameters, layers, floating-point operations per second (FLOPS), and prediction time per image, as shown in Table 4. As the scaling factor increases, the total number of parameters, FLOPS, and prediction time decreases, indicating that the network complexity reduces with the scaling factor. However, the total number of layers increases with the scaling factor, which improves the feature extraction capability and learning ability of the network. The analysis shows a tradeoff between the scaling factor and the segmentation performance of the network.
4.6. Comparison with other methods

We verified the performance of the proposed method against state-of-the-art methods for liver and tumor segmentation. We performed the experimentation on the publicly available 3Dircadb dataset, which provides considerable variation and complexity of the liver and tumors. Segmentation results on the 3Dircadb database for the various methods are shown in Tables 5 and 6, respectively. The dice similarity score is one of the significant performance measures preferred to evaluate the segmentation performance of algorithms for medical images. The proposed method offers a dice similarity score of 97.13 % for liver and 84.15 % for liver tumor, far better than the baseline UNet architecture (Ronneberger et al., 2015). Our method is superior to the ResNet-based method (Han, 2017) proposed for liver tumors, with an improved dice score of 3.33 % for liver and 22.15 % for tumor segmentation. We also compared the results with the recently proposed modified UNet (mU-Net) architecture (Seo et al., 2020), with an improved dice score for liver and of 13.08 % for tumors.
The complexity of our proposed model is compared with that of the other UNet-based methods in Table 7. The Res2Net block reduces the number of parameters and increases the number of layers in the network, resulting in better learning and greatly reducing the computational burden compared to the ResNet and mU-Net models. The prediction time is also reduced due to the lower complexity compared to the ResNet and mU-Net models. Compared with the baseline UNet model, our model has a small increase in the number of parameters and FLOPS, but the segmentation performance improved significantly.
We also performed a statistical significance analysis of the proposed model. We calculated the p-value to demonstrate the statistical significance of the proposed model against the other models. The p-value is the statistical way to test the hypothesis of a method and is determined using statistical tests on the data. The segmentation performance of the different models was compared with the proposed model through statistical significance analysis by calculating the p-value between the two models. The proposed model was verified for statistical significance at the significance level α = 0.05, i.e., a confidence level of 0.95. The statistical significance of the model was decided by the p-value using the non-parametric Wilcoxon signed-rank test (Demsar, 2006), which is used for hypothesis testing. The test was performed on the predicted results obtained from the different models. The pairs of dice scores of each test sample from the different models were utilized to perform the Wilcoxon signed-rank test (Zabihollahy et al., 2019). We test the model by setting the null hypothesis (H0): the two models have no statistically significant difference in performance, and the alternative hypothesis (Ha): the proposed model has statistically significantly better performance than the other model. We performed the test on different groups of samples using the dice score of each sample, calculated against the ground truth provided in the dataset, to compare the models' performance. The significance level set for the test is α = 0.05. If the p-value is smaller than the significance level (α), then the null hypothesis is rejected in favor of the alternative hypothesis, meaning that the proposed model has statistically significantly better performance than the other model.
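The test described above can be reproduced with SciPy as sketched below; the per-sample dice scores are placeholder values, not results from the paper.

```python
from scipy.stats import wilcoxon

# Placeholder per-test-sample dice scores for the two models being compared
dice_proposed = [0.971, 0.962, 0.958, 0.975, 0.949, 0.968]
dice_baseline = [0.955, 0.940, 0.951, 0.960, 0.938, 0.942]

# Paired non-parametric Wilcoxon signed-rank test on the dice-score pairs
stat, p_value = wilcoxon(dice_proposed, dice_baseline)
if p_value < 0.05:  # significance level alpha = 0.05
    print("Reject H0: the performance difference is statistically significant")
```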
Table 2
Proposed network training configuration.

| Parameter | Value |
|---|---|
| Initial learning rate | 1 × 10⁻⁵ |
| Minimum learning rate | 1 × 10⁻¹⁰ |
| Weight regularization factor (L2 regularization) | 1 × 10⁻⁴ |
| Scaling factor of Res2Net block | 4 |
| Convolutional 2D operations and activation function utilized in Res2Net | Conv2D + BatchNorm + ReLU |
| SE Net block dimensionality reduction factor | 8 |
| Loss function | Dice loss |
Table 3
Experimental results on the 3Dircadb database for liver and tumor segmentation.

| Scaling factor (s) | Target | DSC (%) | IoU (%) | VOE (%) | RAVD (%) | ASSD (mm) | MSSD (mm) |
|---|---|---|---|---|---|---|---|
| 2 | Liver | 95.87 | 91.23 | 18.87 | 1.72 | 10.83 | 20.31 |
| 2 | Tumor | 68.85 | 55.34 | 44.66 | 2.24 | 3.23 | 15.01 |
| 4 | Liver | 97.13 | 94.42 | 5.57 | 0.41 | 4.08 | 10.21 |
| 4 | Tumor | 84.15 | 72.64 | 27.36 | 0.22 | 1.64 | 7.04 |
| 8 | Liver | 96.71 | 95.13 | 6.37 | 0.04 | 4.52 | 12.15 |
| 8 | Tumor | 78.19 | 64.19 | 35.81 | 0.36 | 1.92 | 7.84 |
Fig. 5. Sample segmentation results (scaling factor s = 4). Row 1: input images; Row 2: liver and tumor ground truth (GT) images (red: liver; orange: tumor); Row 3: overlay of GT and input images (red: liver; purple: tumor); Row 4: segmented liver and tumor (dark green: liver; faint green: tumor); Row 5: overlay of segmented output and input (dark green: liver; faint green: tumor); Row 6: boundary-marked images with GT and segmented result; Row 7: magnified liver region (red: liver GT; blue: tumor GT; green: segmented liver region; yellow: segmented tumor region). (For interpretation of the references to colour in the figure, the reader is referred to the web version of this article.)