Computerized Medical Imaging and Graphics
Available online 24 February 2021
0895-6111/© 2021 Elsevier Ltd. All rights reserved.
https://doi.org/10.1016/j.compmedimag.2021.101885
Received 25 May 2020; Received in revised form 22 January 2021; Accepted 24 January 2021
MS-UNet: A multi-scale UNet with feature recalibration approach for
automatic liver and tumor segmentation in CT images
Devidas T. Kushnure a,b,*, Sanjay N. Talbar a
a Department of Electronics and Telecommunication Engineering, Shri Guru Gobind Singhji Institute of Engineering and Technology, Nanded, Maharashtra, India
b Department of Electronics and Telecommunication Engineering, Vidya Pratishthan's Kamalnayan Bajaj Institute of Engineering and Technology, Baramati, Maharashtra, India
* Corresponding author: research scholar at the Department of Electronics and Telecommunication Engineering, Shri Guru Gobind Singhji Institute of Engineering and Technology, Nanded, Maharashtra, India.
E-mail address: devidas.kushnure@vpkbiet.org (D.T. Kushnure)
ARTICLE INFO
Keywords:
Deep learning
Convolutional neural network
Liver and tumor segmentation
Multi-scale feature
Feature recalibration
CT images
ABSTRACT

Automatic liver and tumor segmentation play a significant role in the clinical interpretation and treatment planning of hepatic diseases. Segmenting the liver and tumor manually from hundreds of computed tomography (CT) images is tedious and labor-intensive; thus, segmentation becomes expert dependent. In this paper, we propose a multi-scale approach that improves the receptive field of the Convolutional Neural Network (CNN) by representing multi-scale features that capture global and local information at a more granular level. We also recalibrate the channel-wise responses of the aggregated multi-scale features, which enhances the high-level feature description ability of the network. The experimental results demonstrate the efficacy of the proposed model on the publicly available 3Dircadb dataset. The proposed approach achieved a dice similarity score of 97.13 % for liver and 84.15 % for tumor. A statistical significance analysis using a statistical test with a p-value demonstrated that the proposed model is statistically significant at a significance level of 0.05 (p-value < 0.05). The multi-scale approach improves the segmentation performance of the network and reduces the computational complexity and the network parameters. The experimental results show that the proposed method outperforms state-of-the-art methods.
1. Introduction
According to the status report on the Global Burden of Cancer worldwide (GLOBOCAN), 2018 estimates show that the liver cancer incidence and mortality rates are rapidly increasing across the world. Liver cancer is the sixth most common cancer and the second leading cause of cancer deaths worldwide (Bray et al., 2018). In the human body, the liver is one of the largest and most essential organs, involved in detoxification, filtering blood from the digestive tract, and supplying it to the body parts (Bilic et al., 2019). Thus, the liver is often the first site affected by the spread of metastatic tumors from a primary site such as the colorectum, breast, pancreas, ovary, and lung. The growth of liver tumors due to metastasis is secondary liver cancer. Liver cancer that originates in the liver cells (hepatocytes), such as Hepatocellular Carcinoma (HCC), is primary liver cancer. HCC comprises a hereditarily and molecularly exceptionally heterogeneous group of malignant growths that usually emerge in a chronically damaged liver. HCC affects the hepatocytes or liver cells, causing changes in the structure and shape of the affected liver cells that determine the progress of the cancer. These perceptible variations in shape and tissue structure allow for the non-invasive identification of HCC in imaging (Christ et al., 2017).
Radio imaging modalities such as ultrasound, computed tomography (CT), and magnetic resonance imaging (MRI) are utilized to detect anomalies in the upper and lower abdomen. Radio imaging is a non-invasive, painless, and precise technique to identify internal injuries, helping clinical specialists diagnose the complication and plan the treatment for saving the patient's life. Medical imaging techniques have become popular for diagnosing and treating disease and monitoring its progress (Bilic et al., 2019). Due to the ease and short time required to capture the exact inner structure of the human body, the CT scan has become the medical expert's choice to diagnose liver-related complications and anomalies (Luo et al., 2014).
Clinically, liver and tumor segmentation from CT images is an important task in hepatic disease diagnosis and treatment planning. Liver volume assessment is a directive before
hepatectomy, and it assists doctors and surgeons in planning liver resection, liver transplantation, portal vein embolization, associating liver partition and portal vein ligation for staged hepatectomy (ALPPS) (Gotra et al., 2017), and post-treatment assessment. It is also essential for applications such as computer-aided diagnosis (CAD) and deciding on interventional radiological treatment. Liver tumor volume extraction with high accuracy is beneficial for planning Selective Internal Radiation Therapy (SIRT, radioembolization) to diminish the risk of an excessive or insufficient radiation dose relative to the patient's liver volume (Moghbel et al., 2018). Therapy planning for the liver and for primary and metastatic tumors using percutaneous ablation is a minimally invasive surgical procedure guided through image navigation (Spinczyk et al., 2019). Liver segmentation is a significant stage in detecting hepatic complications early with radio imaging. CT is the medical expert's preferred imaging modality for hepatic diseases because of its robustness, wide availability, fast acquisition process, and higher spatial resolution. In clinical routine, medical experts delineate the liver and tumor manually from CT images; manual segmentation is considered the gold standard in medical practice and research. However, manual outlining of the liver and tumors is tedious and time-consuming, which could delay the diagnosis process. The segmentation depends on the expert's knowledge and experience, which may cause an erroneous segmentation outcome. For these reasons, it is essential to provide a computer-based framework that automatically segments the liver and tumor with accuracy acceptable for clinical significance and offers a second opinion that helps the physician conclude with more accuracy in less time. Many researchers and the scientific community focus on developing frameworks for automatic liver and tumor segmentation with modern image processing and computer vision algorithms.
In the last three decades, much scientific research on automatic and interactive segmentation strategies has been proposed in the literature. Even so, automatic liver and liver tumor segmentation from CT volumes remains challenging: the liver is a soft organ, and its shape is exceptionally dependent on the surrounding organs inside the abdomen. Apart from that, liver pathology is inconsistent and may modify the liver's signal intensity, density, and shape; there is little intensity difference between the liver and tumor regions, uniformity in intensities between the liver and its surrounding organs, and feeble boundaries between the liver and surrounding organs such as the stomach and heart, as illustrated in Fig. 1. Usually, liver CT images are obtained using an injection protocol that enhances the liver in the CT images for medical interpretation. However, the injection phase decides the enhancement variation, and noise in the CT images increases with the enhancement, adding noise to the liver region, which is already noisy without any enhancement (Moghbel et al., 2018). Because of these challenges, liver and tumor segmentation is a demanding task that has attracted much research attention in recent years.
2. Related literature
In the literature, several interactive and automatic methods for liver and tumor segmentation in CT volumes have been proposed. In 2007, Grand Challenge benchmarks on liver and liver tumor segmentation were conducted in conjunction with the MICCAI conference. Most of the methods presented at the challenge were based on statistical shape models for automatic segmentation (Heimann et al., 2009). Furthermore, methods (Luo et al., 2014) based on the liver's gray intensities, structure, and texture features were proposed for automatic liver segmentation. The gray-level-based methods utilized the liver's gray intensities for segmentation, using intensity-based algorithms like region growing, active contour, graph cut, thresholding, and clustering. The structure-based methods utilized the liver's repetitive geometry to create a probabilistic model to reconstruct the liver shape; these methods used the statistical shape model, statistical pose model, and probabilistic atlas. The texture-based methods utilized texture features to segment the liver, with machine learning and pattern recognition algorithms classifying the liver region based on the texture feature description. Afterward, a computer-aided diagnosis (CAD) system was envisaged as a second pair of eyes for expert radiologists, provided the CAD system works with accuracy. Several methods utilized machine learning algorithms like the probabilistic neural network (PNN), support vector machine (SVM), region growing, alternative fuzzy C-means (AFCM), and Hidden Markov Model (HMM) to design CAD systems for liver and tumor segmentation (Moghbel et al., 2018).
Over the past few years, deep learning algorithms based on the Convolutional Neural Network (CNN) have become popular for visual recognition because of their powerful nonlinear feature extraction capabilities, using many different filters at different layers of the network, and their capability to process large amounts of data (Yamashita et al., 2018), in applications like image classification, object detection, and action recognition. CNN-based networks like AlexNet, VGGNet, GoogleNet, and ResNet proved their capability for visual recognition tasks in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Ueda et al., 2019). After CNN's success in efficient classification tasks, researchers exploited the same backbone architectures for the semantic segmentation task. Fully Convolutional Network (FCN) (Shelhamer et al., 2017) based architectures employed existing well-known classifier models for semantic segmentation by replacing the dense classifier layers. The FCN encoder-decoder architecture was considered the most successful for segmentation; the decoder network was utilized to upsample the segmented map to the input image size. For semantic segmentation of images, SegNet (Badrinarayanan et al., 2017) was proposed, which has an encoder-decoder network followed by a pixel-wise classification layer. Its encoder utilized a network topology identical to the 13 convolutional layers of VGGNet.

Fig. 1. Sample images from the 3Dircadb dataset denoting the complications of liver and tumor segmentation in abdomen CT scans: (a) low-intensity difference between nearby organs (liver, stomach, and heart) and tumor; (b) ambiguous boundary between the liver and heart; and (c) ambiguous boundary between the liver and stomach.
In medical image processing, the semantic segmentation task is utilized to segment the anatomical structure of organs and to segment tumors. Automatic segmentation of the region of interest from medical images using CNN-based architectures has proved effective, with the UNet architecture proposed for biomedical image processing. An encoder-decoder design has become the choice for medical image segmentation: it has an encoding part, which condenses the information of the input image into a group of high-level features, and a decoding part, where the high-level features are utilized to rebuild a pixel-wise segmentation in single or multiple upsampling steps (Ronneberger et al., 2015). After this paper, algorithms proposed based on the FCN utilized UNet-derived architectures for medical image segmentation. The Liver Tumor Segmentation Benchmark (LiTS) challenge was organized in conjunction with ISBI 2017 and MICCAI 2017. The methods presented were based on the CNN deep learning approach, and the majority were UNet-derived architectures. Almost all of the techniques utilized specific preprocessing of the input data, like HU-value windowing, normalization, and standardization. Additionally, most of the techniques applied a connected-lesion-components post-processing method to the segmented map to discard the portions of lesions outside the liver region (Bilic et al., 2019).
Furthermore, most available liver and liver tumor segmentation networks are based on the FCN with the UNet encoder (contraction) and decoder (expansion) structure. In UNet, all layers are convolutional, to achieve pixel-level prediction in a forward step. UNet has encoding and decoding paths built using convolution, pooling, and upsampling layers. To improve the segmentation capability of UNet, the encoding path features are concatenated with the decoding path at the respective stage using skip connections. To enhance the segmentation output further, a few proposed methods (Budak et al., 2020; Gruber et al., 2019) utilized two UNet architectures jointly for liver and tumor segmentation. A complex CNN architecture for liver and kidney segmentation used an ImageNet-trained ResNet-34 as the feature encoder to reduce the convergence time and the overfitting problem.
Further, to improve the prediction capabilities of FCN-based architectures, residual connections (Drozdzal et al., 2016) (Zhang et al., 2019) were employed from the forward path in the intermediate feature maps, and post-processing was applied to refine the segmentation performance. The deep CNN model (Han, 2017) based on ResNet used long-range UNet and short-range ResNet residual connections with post-processing using 3D connected-component labeling of all voxels labeled as lesion. Further post-processing methods were proposed to refine the segmentation performance of the algorithms. FCN-based encoder-decoder models (Zhang et al., 2017) (Christ et al., 2017) (Chlebus et al., 2017) demonstrated the effect of the level set, graph cut, CRF, and random forest algorithms utilized for post-processing to refine segmentation results. The proposed super-pixel-based CNN (Qin et al., 2019) divided the CT image into super-pixels by aggregating adjacent pixels with the same intensity, classified them into three classes (interior liver, liver boundary, and non-liver background), and utilized the CNN to predict the liver boundary.
Recently, many extensions of UNet, modifying its core structure, have been proposed to segment the liver and tumors. A novel hybrid densely connected UNet (Li et al., 2018) was proposed to explore intra-slice and inter-slice features by introducing a hybrid feature fusion layer using 2D and 3D DenseUNet. The 2D DenseUNet extracts the intra-slice features, the 3D DenseUNet extracts the inter-slice volumetric context, and these features are fused to portray the 3D interpretation using the feature fusion layer. The 3D residual attention-aware RA-UNet (Noh et al., 2015) was proposed using the residual learning approach to express multi-scale attention information and combine low-level features with high-level features. A modified UNet architecture (Seo et al., 2020) was proposed to exploit object-dependent feature extraction using a modified skip connection with an additional convolutional layer and residual path. The modified skip connection extracts high-level global features of small objects and high-level features of the high-resolution edge information of large objects. Recently, the UNet++ architecture (Zhou et al., 2020) was proposed by redesigning the skip connection to exploit multi-scale features using a feature fusion scheme from each encoding layer to the decoding layer.
In semantic segmentation, the CNN extracts the critical features of the image and effectively decides the coarse boundary of the target. However, it is observed that, at the end of the encoder, the size of the feature maps is remarkably reduced, which obstructs the accuracy of the CNN. The consecutive downsampling by pooling operations reduces the input image resolution to a small feature map. This results in a loss of spatial information about the object, which is essential in analyzing medical images for accurate segmentation of target objects. Various methods, such as deconvolution (Noh et al., 2015) and skip connections (Ronneberger et al., 2015), have been proposed that utilize the concept of transposed convolution for upsampling and skip connections for connecting upper convolution layers with the deep layers, so that the network can maximize the utilization of the high-level features to preserve the spatial information. However, these methods cannot recover the spatial information lost in the pooling and convolution operations.
Moreover, CNN models need to process features at different scales to extract the meaningful contextual information of the object and achieve successful semantic segmentation. Multi-scale feature characterization was achieved by the FCN with variable pooling layers, combining the features from previous layers with deeper layers to maintain the global and local information of the object and achieve effective semantic segmentation (Long et al., 2015). The pyramid scene parsing network (PSPNet) (Zhao et al., 2017) utilized global context information accumulated from region-based features employing pyramid pooling. In pyramid pooling, global and local information is characterized effectively by transforming input features through multiple pooling operations and aggregating all the features to achieve effective semantic segmentation. Later, the DeepLab system (Chen et al., 2018) (Chen et al., 2017) was proposed to preserve spatial resolution by utilizing the atrous convolution module. Atrous convolution is employed in series or in parallel to expand the receptive field of the CNN, and atrous spatial pyramid pooling with multiple atrous rates is utilized to gain a multi-scale context depiction. Channel-UNet (Chen et al., 2019) was proposed to optimize the mapping of information between pixels in convolution layers with spatial-channel convolution and adopted an iterative learning mechanism that expands the receptive field of the convolution layers.
In this paper, we propose a CNN with an encoder-decoder UNet-based multi-scale feature representation and recalibration architecture for liver and liver tumor segmentation. We utilize the bottleneck Res2Net module's ability to represent multi-scale features and improve the receptive field of the convolutional neural network (CNN). Further, we recalibrate the multi-scale features channel-wise with a squeeze-and-excitation (SE) network. We performed the experimentation on the publicly available 3Dircadb dataset. The results illustrate that the multi-scale UNet outperforms the state-of-the-art methods for liver and tumor segmentation.
The following contributions are incorporated in the paper:
• We proposed the MS-UNet with a feature recalibration approach by exploiting the core idea of the UNet encoder-decoder architecture. The architectural difference is that our network utilizes the Res2Net module for multi-scale feature representation and the SE network for feature recalibration in the encoder and decoder stages. The combination of the Res2Net module followed by the SE network enhances the feature representation capability and learning potential of the network. The computational complexity and the parameters of the proposed network are reduced because of the Res2Net module.
• We employed the multi-scale Res2Net module in the network to enhance the receptive field of the CNN to cover the entire region of interest from the input features and characterize the global and local information of the input at a more granular level by extracting multi-scale features. Therefore, the input feature is represented by multiple features at different scales, aggregated hierarchically, signifying detailed information of the input features.
• The aggregated multi-scale features extract the more granular information of the input, which limits the learning capacity of the network. To improve the network's learning ability and focus on the more prominent features of the object, we utilized the SE network. The SE network recalibrates the channel-wise feature responses by modeling interdependencies between the channels with squeeze and excitation operations. Feature recalibration enhances the network's sensitivity to informative features of the object, so that the network's ability to learn prominent features improves. Also, the feature extraction capability of successive network layers increases, which boosts the segmentation performance.
• We experimentally verified the MS-UNet performance for liver and tumor segmentation in terms of multi-scale feature extraction ability by varying the scaling factor. The model performance has been evaluated using different statistical measures while varying the scaling factor. We trained the network from scratch, performed the experimentation on the publicly available 3Dircadb dataset manually annotated by medical experts, and demonstrated the network's segmentation performance. The proposed model is statistically significant at the significance level 0.05 (p-value < 0.05), as verified using a statistical test for hypothesis testing.
The remainder of the paper is organized as follows: Section 3 explains the proposed methodology, Section 4 presents the experimental setup and result analysis, and Section 5 concludes the paper.
3. Proposed work

3.1. Proposed methodology
We propose a deep convolutional neural network with a multi-scale feature extraction and recalibration architecture for automatic liver and tumor segmentation. Fig. 2 shows the proposed MS-UNet encoder-decoder architecture. We embed the Res2Net bottleneck module with the SE network in place of the two 3 × 3 convolution operations in the UNet encoder-decoder stages. The bottleneck Res2Net module has the same architecture as the bottleneck ResNet, except that the single 3 × 3 convolution operation is replaced by smaller 3 × 3 convolutions in hierarchical order to achieve multi-scale feature extraction and an improved receptive field. The Res2Net bottleneck architecture was designed to improve the layer-wise multi-scale feature representation at a more granular level and the receptive field of the CNN. We employed the Res2Net bottleneck module instead of the 3 × 3 convolution in UNet to leverage its multi-scale feature extraction ability and improved receptive field to enhance the segmentation performance. The Res2Net module can extract the input features at a more granular level with its multi-scale characterization ability. It improves the receptive field of the CNN by splitting the input features into small blocks that are processed through multiple convolution blocks with different scale features, enhancing the features at multiple scales. These multi-scale features are captured by multiple convolution layers with local receptive fields. They empower the network to extract informative features by aggregating both spatial and channel-wise features.

Fig. 2. Proposed MS-UNet encoder and decoder architecture (yellow blocks: Res2Net module and SE network; gray block: SE network). (For interpretation of the references to colour in the figure, the reader is referred to the web version of this article.)
In semantic segmentation, spatial information plays a significant role in locating the region of interest in the image. To empower the network's potential to characterize the global and local information at a granular level and uplift the network's learning ability at each stage, we employed feature refinement through the SE network (SENet). The SENet recalibrates the features in two steps. First, it globalizes the fused multi-scale features channel-wise into a one-dimensional vector, called the squeeze operation. Second, it recalibrates the features by passing them through two dense layers that describe the weights for the input channels, called the excitation operation. The channel weights then scale the input multi-scale features and improve the feature representation potential of the network. The network gains a perception of the coarse-grained context in the shallow layers and a localization of fine-grained attributes in the deeper layers, which boosts segmentation performance.
Table 1 details the proposed MS-UNet, with the number of stages in the network along with the layer-wise activation function, convolution filter size, and the number and shape of features. All the convolutional operations used in the Res2Net module follow the order Conv2D, batch normalization, and ReLU.
The entire training and testing pipeline is illustrated in Fig. 3: the input CT dataset is preprocessed first, then the MS-UNet is trained for liver and tumor segmentation, the model is tested on the test data, and the network performance is evaluated using statistical measures.
3.2. Multi-scale features
In the CNN, the encoder architecture extracts high-level information at each stage by downsampling the input using a pooling operation. However, pooling causes contextual information loss. The skip connection provides low-resolution information to the respective stages of the decoder to recover the contextual information. However, this method cannot retrieve the loss due to the pooling layer and results in a coarse pixel map. The multi-scaling approach enables the CNN to extract different features at different scales. It enhances the receptive field layer-wise at a more granular level, which leads to the refinement of the network's feature characterization potential.
The layer-wise feature representation ability of CNNs is improved at a more granular level by improving the receptive field using the bottleneck Res2Net module (Gao et al., 2019). The detailed architecture of the Res2Net module is shown in Fig. 2. For multi-scale feature representation, the 3 × 3 convolution filters of $n$ channels are replaced with a bunch of smaller filter groups in the Res2Net module, each with $w$ channels, such that $n = s \times w$, without additional computational burden. The small filter groups are connected in a hierarchical residual fashion, which increases the representation of output features at different scales. Lastly, the feature maps from all the subsets are concatenated and passed through a 1 × 1 filter to fuse the complete information. The input features are evenly split into $s$ subsets after the 1 × 1 convolution, such that every subset has the same spatial size and $1/s$ of the channels of the input features. The feature subsets are denoted by $f_i$, where $i \in \{1, 2, 3, \ldots, s\}$. Each $f_i$, excluding the first subset $f_1$, has a corresponding 3 × 3 convolution, denoted by $O_i(\cdot)$, whose output is the multi-scale feature denoted by $M_i$. The output $M_i$ can be written as (Eq. 1):
written as (Eq 1),
M i=
⎧
⎨
⎩
f i , i = 1;
O i(f i ) , i = 2;
O i(f i+M i− 1 ) , 2 < i ≤ s.
(1)
Eq. (1) depicts the multi-scale features, where the 3 × 3 convolutional operator $O_i(\cdot)$ can receive feature information from all the feature splits $\{f_j,\, j \le i\}$. The fusion of all the features results in the output of Res2Net with multiple dissimilar features with different combinations of receptive fields. In the Res2Net module, the features are split and concatenated. The splits work in a multi-scale manner, which benefits the extraction of both the global and local feature information of the input features. The features at different scales are concatenated to better fuse the information. The process of splitting and concatenation enables the convolutions to transform features more efficiently. The features from multiple 3 × 3 filters are combined, resulting in many similar feature combinations due to the aggregation effect.
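As an illustration of Eq. (1), the following is a minimal sketch of the hierarchical split-and-convolve computation, written in the Keras functional API the paper builds on; the function name and hyperparameters are ours, not the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def res2net_multiscale(x, filters, scale=4):
    """Sketch of the Res2Net multi-scale computation of Eq. (1)."""
    # 1x1 convolution, then split evenly into s subsets of w = filters/scale channels
    x = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    subsets = tf.split(x, num_or_size_splits=scale, axis=-1)

    outputs = [subsets[0]]            # M_1 = f_1 (no convolution on the first split)
    prev = None
    for i in range(1, scale):
        f_i = subsets[i]
        if prev is not None:          # M_i = O_i(f_i + M_{i-1}) for 2 < i <= s
            f_i = layers.Add()([f_i, prev])
        prev = layers.Conv2D(filters // scale, 3, padding="same",
                             activation="relu")(f_i)  # O_i(.)
        outputs.append(prev)

    # Concatenate all scales and fuse through a 1x1 convolution
    y = layers.Concatenate()(outputs)
    return layers.Conv2D(filters, 1, padding="same")(y)
```

Because each 3 × 3 convolution sees only `filters/scale` channels, the hierarchical connections enlarge the receptive field without the cost of a full-width 3 × 3 convolution, which matches the complexity reduction reported later in the paper.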
Table 1
Details of the operations performed and the settings of the layers in each encoding and decoding stage of the proposed network.

| Stage | Encoder path | Output features and size | Decoder path | Output features and size |
|---|---|---|---|---|
| 1 | Input | 256 × 256 × 1 | Conv2D [output layer] [1 × 1, Sigmoid] | 256 × 256 × 1 |
|   | Conv2D [3 × 3, BatchNorm, ReLU] | 256 × 256 × 64 | Conv2D [3 × 3, BatchNorm, ReLU] | 256 × 256 × 64 |
|   | Res2Net + SE: Res2Net-[64, scaling = 4, ReLU], SE-[64, r = 8, ReLU and Sigmoid] | 256 × 256 × 64 | Res2Net + SE: Res2Net-[64, scaling = 4, ReLU], SE-[64, r = 8, ReLU and Sigmoid] | 256 × 256 × 64 |
| 2 | Max Pooling [2 × 2] | 128 × 128 × 64 | Upsampling (deconvolution) [2 × 2, strides = 2 × 2] | 256 × 256 × 128 |
|   | Res2Net + SE: Res2Net-[128, scaling = 4, ReLU], SE-[128, r = 8, ReLU and Sigmoid] | 128 × 128 × 128 | Res2Net + SE: Res2Net-[128, scaling = 4, ReLU], SE-[128, r = 8, ReLU and Sigmoid] | 128 × 128 × 128 |
| 3 | Max Pooling [2 × 2] | 64 × 64 × 128 | Upsampling (deconvolution) [2 × 2, strides = 2 × 2] | 128 × 128 × 256 |
|   | Res2Net + SE: Res2Net-[256, scaling = 4, ReLU], SE-[256, r = 8, ReLU and Sigmoid] | 64 × 64 × 256 | Res2Net + SE: Res2Net-[256, scaling = 4, ReLU], SE-[256, r = 8, ReLU and Sigmoid] | 64 × 64 × 256 |
| 4 | Max Pooling [2 × 2] | 32 × 32 × 256 | Upsampling (deconvolution) [2 × 2, strides = 2 × 2] | 64 × 64 × 512 |
|   | Res2Net + SE: Res2Net-[512, scaling = 4, ReLU], SE-[512, r = 8, ReLU and Sigmoid] | 32 × 32 × 512 | Res2Net + SE: Res2Net-[512, scaling = 4, ReLU], SE-[512, r = 8, ReLU and Sigmoid] | 32 × 32 × 512 |
| 5 | Max Pooling [2 × 2] | 16 × 16 × 512 | Upsampling (deconvolution) [2 × 2, strides = 2 × 2] | 32 × 32 × 1024 |
|   | Res2Net + SE: Res2Net-[1024, scaling = 4, ReLU], SE-[1024, r = 8, ReLU and Sigmoid] | 16 × 16 × 1024 |  |  |
3.3. Multi-scale feature recalibration
Along with the multi-scale feature characterization ability of the Res2Net module, the SE network (Hu et al., 2018), as indicated in Fig. 2, is added before the residual connection. The SE network's primary purpose is to model the channel-wise feature responses by expressing the interdependencies between the channels and to recalibrate the fused features after concatenation.
In the squeeze operation, global average pooling is applied to the input features of size $W \times H \times C$ received from the 1 × 1 convolution of the Res2Net block, converting all the channels into a one-dimensional vector with dimension equal to the number of channels $C$. Let the input features be $M = [M_1, M_2, M_3, \ldots, M_C]$, where $M_C \in \mathbb{R}^{H \times W}$ is one channel of the input features with size $H \times W$. The global pooling produces a one-dimensional vector $Z$ of size $\mathbb{R}^C$. For the $C$-th channel, the element of the vector is given by (Eq. 2):

$$
Z_C = F_{sqe}(M_C) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} M_C(i, j)
\tag{2}
$$
$Z$ is the transformation of the input features $M$; it is the aggregation of the transformed features and can be interpreted as a cluster of local descriptors whose statistics are meaningful for the entire image.
In the second operation, the aggregated information is utilized to grab channel-wise dependencies. To isolate the channels and improve the generalization capability of the network, a simple gating mechanism is employed using two fully connected layers with ReLU and sigmoid activation. The first fully connected layer transforms $Z$ with the ReLU activation $\delta$, and then the sigmoid activation function $\sigma$ is applied, expressed as (Eq. 3):

$$
E = F_{Ex}(Z, W) = \sigma(g(Z, W)) = \sigma(W_2\,\delta(W_1 Z))
\tag{3}
$$

where $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$, and $r$ is the dimensionality reduction factor, which decides the computational-cost-controlling capacity of the SENet. We employed a dimensionality reduction factor $r = 8$ in our experimentation, as it exhibits the best segmentation performance for medical images (Rundo et al., 2019).
The excitation operation returns the same channel dimension as the input features $M$. The final output of the SE block is the scaling of the excitation output channel weights with the input features, which puts more emphasis on essential features and less on negligible ones. The scaling is expressed as (Eq. 4):

$$
\tilde{M}_C = F_{scale}(M_C, E_C) = E_C \cdot M_C
\tag{4}
$$

where $\tilde{M} = [\tilde{M}_1, \tilde{M}_2, \tilde{M}_3, \ldots, \tilde{M}_C]$ and the scaling operation $F_{scale}(M_C, E_C)$ is a channel-wise multiplication between $E_C \in [0, 1]$ and $M_C \in \mathbb{R}^{H \times W}$. The multi-scale feature and recalibration approach characterizes the high-level features in a better way. At each encoding stage, the network maintains the contextual information of the input object. These features are concatenated with the corresponding decoding stages through skip connections to reconstruct the object shape in the segmentation map.
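A minimal Keras sketch of the squeeze-and-excitation recalibration in Eqs. (2)-(4) follows; the function name is illustrative, not the authors' exact code.

```python
from tensorflow.keras import layers

def se_block(m, channels, r=8):
    """Recalibrate channel responses: squeeze (Eq. 2), excite (Eq. 3), scale (Eq. 4)."""
    z = layers.GlobalAveragePooling2D()(m)                 # Eq. (2): Z, one value per channel
    e = layers.Dense(channels // r, activation="relu")(z)  # W_1 followed by delta (ReLU)
    e = layers.Dense(channels, activation="sigmoid")(e)    # W_2 followed by sigma (sigmoid)
    e = layers.Reshape((1, 1, channels))(e)                # broadcastable channel weights E
    return layers.Multiply()([m, e])                       # Eq. (4): E_C * M_C per channel
```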
4. Experimental setup and result analysis

4.1. Data preparation
The experimentation was performed using the publicly available 3D Image Reconstruction for Comparison of Algorithm Database (3Dircadb) for training and testing the network. It consists of 20 venous-phase enhanced CT volumes from various European hospitals acquired with different CT scanners. The CT scans are of 20 patients (10 women and 10 men), with hepatic tumors in 15 cases. The database has been manually annotated by medical experts for liver and tumor. The input size is 512 × 512, and the in-plane resolution ranges from 0.86 × 0.86 mm² to 0.56 × 0.56 mm². The CT slices are available in DICOM format. The number of slices ranges from 74 to 260, and the slice thickness varies from 4 mm to 1 mm. The database provides significant variations in the shape and size of the liver and tumors. The tumors are located in different Couinaud segments of the liver (3Dircadb, n.d.).
4.2. Data preprocessing
In CT scan volumes, the relative densities of internal body organs are measured using Hounsfield Units (HU). In general, the HU values range from -1000 to 1000. The tumor grows in the liver parenchyma, which is the region of interest for segmentation. When focusing on the liver region, the adjacent organs and irrelevant tissues in the abdomen may trouble the segmentation performance. The radiodensities in CT volumes for soft liver tissues vary from 40 HU to 50 HU (Jin et al., 2018). By removing irrelevant organs and unnecessary details from the CT images, the liver and tumor regions become clean for segmentation.

Fig. 3. Illustration of the liver and tumor segmentation pipeline for MS-UNet.
We preprocess the entire CT data in a slice-by-slice fashion. First, we downsampled the 512 × 512 CT images to 256 × 256 to reduce the computational burden. Secondly, we applied a global windowing step to the CT slices one by one. For windowing the HU values, we used the window (-100, 400) HU, so that most of the irrelevant organs were removed from the CT slices, making the liver and tumor region clean for segmentation. Afterward, the dataset was normalized to the same scale between [0, 1], which simplifies the network's learning function and improves the convergence speed of the network by supplying more easily proportionate images, with an enhanced liver and liver tumor region, as input for segmentation. The HU-value windowing and image enhancement results, along with the histogram plots, are illustrated in Fig. 4. These preprocessing operations offer a clean liver and tumor region for segmentation.
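The preprocessing chain described above can be sketched with NumPy and OpenCV as follows, assuming the slices are already converted from DICOM to HU values; the function name and the use of OpenCV's histogram equalization are illustrative assumptions, not the authors' exact code.

```python
import numpy as np
import cv2

def preprocess_slice(hu_slice):
    """Downsample, HU-window, normalize, and histogram-equalize one CT slice."""
    # Resize 512x512 -> 256x256 to reduce the computational burden
    img = cv2.resize(hu_slice.astype(np.float32), (256, 256))
    # Global HU windowing: clip to (-100, 400) to suppress irrelevant organs
    img = np.clip(img, -100.0, 400.0)
    # Map the (-100, 400) HU window linearly onto [0, 1]
    img = (img + 100.0) / 500.0
    # Histogram equalization (8-bit) to enhance the liver/tumor region
    img8 = cv2.equalizeHist((img * 255).astype(np.uint8))
    return img8.astype(np.float32) / 255.0
```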
4.3. Training strategy for the proposed network
The proposed network was trained using the Adam optimizer with an initial learning rate of 1 × 10⁻⁵, reduced on a plateau with a patience of 5 epochs down to a minimum learning rate of 1 × 10⁻¹⁰, and a mini-batch size of 8 was employed for training the network. To avoid overfitting, we regularize the network weights using a weight decay factor of 1 × 10⁻⁴. Table 2 indicates the detailed training configuration of the network.
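For concreteness, a minimal Keras sketch of this training configuration is given below, assuming `model`, `dice_loss`, `dice_coefficient` (see Section 4.3.1), and the training arrays already exist; the learning-rate reduction factor and the epoch count are illustrative assumptions, since the paper specifies only the initial/minimum rates, the patience, and the batch size.

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Compile with Adam at the stated initial learning rate and the dice loss/metric
model.compile(optimizer=Adam(learning_rate=1e-5),
              loss=dice_loss, metrics=[dice_coefficient])

# Reduce the learning rate on a plateau (patience 5) down to the stated minimum
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.1,  # factor assumed
                              patience=5, min_lr=1e-10)

model.fit(train_images, train_masks, batch_size=8, epochs=100,  # epochs assumed
          validation_data=(val_images, val_masks), callbacks=[reduce_lr])
```

The 1 × 10⁻⁴ weight decay would correspond in Keras to an L2 `kernel_regularizer` attached to the convolutional layers rather than an optimizer argument.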
4.3.1. Loss function
The model was compiled using the dice coefficient as a metric and the dice loss as the loss function, which is the complement of the dice coefficient (Jin et al., 2018). The dice loss is expressed as (Eq. 5):

$$
L_{Dsc} = 1 - \frac{2\sum_{i=1}^{N} p_i \times g_i}{\sum_{i=1}^{N} p_i^2 + \sum_{i=1}^{N} g_i^2}
\tag{5}
$$
where $p_i$ and $g_i$ are the binary predicted segmentation voxels and ground truth voxels, respectively, and $N$ is the number of voxels. The loss directly measures the similarity of the two samples, and accordingly the network weights are optimized by minimizing the loss.
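A minimal Keras implementation of Eq. (5) might look as follows; the smoothing constant is an assumption added for numerical stability and is not specified in the paper.

```python
from tensorflow.keras import backend as K

def dice_coefficient(y_true, y_pred, smooth=1.0):
    """Dice similarity between ground truth and prediction (Eq. 5 numerator/denominator)."""
    p = K.flatten(y_pred)
    g = K.flatten(y_true)
    num = 2.0 * K.sum(p * g) + smooth
    den = K.sum(K.square(p)) + K.sum(K.square(g)) + smooth
    return num / den

def dice_loss(y_true, y_pred):
    # Eq. (5): L_Dsc = 1 - DSC
    return 1.0 - dice_coefficient(y_true, y_pred)
```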
4.3.2. Data augmentation
Deep neural networks are data-hungry; they need enormous amounts of data to train without over-fitting and to generalize well on test data. The lack of vast, publicly available labeled medical image databases limits the application of deep learning models in medicine. However, despite the limited publicly available databases, it is possible to train a deep learning model by employing the data augmentation technique, which augments the database by applying standard geometric transformations like translation, rotation, and scaling.

We train our network by employing data augmentation. We augment the training images by applying image transformations to the dataset: rotation, scaling, shifting, flipping, and elastic deformation. Data augmentation is beneficial for reducing the risk of overfitting during the training process and refines the generalization potential of the model on test data (Yamashita et al., 2018).
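As a sketch, Keras' `ImageDataGenerator` covers most of these geometric transformations (elastic deformation is not built in and would require a custom preprocessing function); the parameter values below are illustrative assumptions, not the paper's settings.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,        # rotation
    zoom_range=0.1,           # scaling
    width_shift_range=0.1,    # shifting
    height_shift_range=0.1,
    horizontal_flip=True,     # flipping
)
# The same transformations (with a shared random seed) must be applied to the
# images and the masks so that the segmentation labels stay aligned.
```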
4.3.3. Implementation platform details
The proposed network has been implemented in the Keras high-level neural network application programming interface (API) (Chollet, 2015) with TensorFlow (Abadi et al., 2016) as a backend. The workstation utilized for training and testing the model has an Intel(R) Xeon(R) CPU E5-1620 0 at 3.60 GHz, 16 GB RAM, and an NVIDIA GeForce TITAN X GPU with 12 GB memory, running the Windows 10 operating system.
4.4. Performance metrics
To evaluate the segmentation performance between the ground truth and the segmentation map of the proposed network, we utilize performance metrics based on volumetric size similarity and surface distance measures (Heimann et al., 2009) (Jiang et al., 2018). In the definitions, the ground truth is denoted by $A$, and the segmented result is denoted by $B$. The Dice Similarity Coefficient (DSC) score is expressed as follows (Eq. 6):

$$
DSC(A, B) = \frac{2|A \cap B|}{|A| + |B|}
\tag{6}
$$

The Volumetric Overlap Error (VOE) is expressed (Eqs. 8 and 9) using the Jaccard Coefficient (JC), or Intersection over Union (IoU) (Eq. 7):

$$
JC(A, B) = \frac{|A \cap B|}{|A \cup B|}
\tag{7}
$$

$$
VOE(A, B) = 1 - JC(A, B)
\tag{8}
$$

$$
VOE(A, B) = \left(1 - \frac{|A \cap B|}{|A \cup B|}\right) \times 100\,\%
\tag{9}
$$

The Relative Absolute Volume Difference (RAVD) is denoted as follows (Eq. 10):

$$
RAVD = \mathrm{Abs}\left(\frac{|B| - |A|}{|A|}\right)
\tag{10}
$$

The surface distance measures are the Average Symmetric Surface Distance (ASSD) and the Maximum Symmetric Surface Distance (MSSD). Let $S(A)$ denote the set of surface voxels of $A$. The shortest distance of an arbitrary voxel $v$ to $S(A)$ is expressed as follows (Eq. 11):

$$
d(v, S(A)) = \min_{s_A \in S(A)} \lVert v - s_A \rVert
\tag{11}
$$

where $\lVert \cdot \rVert$ denotes the Euclidean distance. The ASSD is denoted as (Eq. 12):

$$
ASSD(A, B) = \frac{1}{|S(A)| + |S(B)|}\left(\sum_{s_A \in S(A)} d(s_A, S(B)) + \sum_{s_B \in S(B)} d(s_B, S(A))\right)
\tag{12}
$$

The MSSD is denoted as (Eq. 13):

$$
MSSD(A, B) = \max\left\{\max_{s_A \in S(A)} d(s_A, S(B)),\ \max_{s_B \in S(B)} d(s_B, S(A))\right\}
\tag{13}
$$

The DSC, IoU, VOE, and RAVD are measured in percentage, and the surface distance measures (ASSD and MSSD) are measured in millimeters (mm). For DSC and IoU, 100 % is the best segmentation and 0 % is the worst; for VOE and RAVD, 0 % is the best segmentation and 100 % is the worst. For ASSD and MSSD, 0 mm is the best segmentation, and there is no upper bound; the larger the value, the worse the segmentation.
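The volumetric overlap metrics of Eqs. (6)-(10) can be computed for binary masks as in the following NumPy sketch; the surface-distance measures (Eqs. 11-13) require surface-voxel extraction and are omitted here. The function name is illustrative.

```python
import numpy as np

def overlap_metrics(a, b):
    """a: ground-truth mask A, b: predicted mask B (boolean NumPy arrays)."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    dsc  = 2.0 * inter / (a.sum() + b.sum())            # Eq. (6)
    jc   = inter / union                                # Eq. (7)
    voe  = (1.0 - jc) * 100.0                           # Eqs. (8)-(9), in %
    ravd = abs(int(b.sum()) - int(a.sum())) / a.sum() * 100.0  # Eq. (10), in %
    return dsc * 100.0, jc * 100.0, voe, ravd
```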
4.5. Experimental results and analysis
The proposed network has multi-scale feature extraction and feature recalibration ability, which leads to better segmentation performance. To verify the multi-scale feature representation potential of the proposed network, we evaluated it for scaling factors s = 2, s = 4, and s = 8 and measured the performance using the measures defined in the previous section; Table 3 presents the performance of the proposed network for the various scaling factors. The proposed network achieved a dice similarity coefficient of 97.13 % for liver and 84.15 % for tumor segmentation with scaling factor s = 4. The scaling factor improves the CNN's capability to extract features at a multi-scale level with an improved receptive field.

Fig. 4. Preprocessing effect on the input image along with the histogram: the first column shows the sample input CT slices with their histograms, the second column shows the HU windowing with its histogram, and the third column shows the histogram-equalized images and the equalized histogram.
The segmentation performance improvement obtained by the multi-scale feature representation and recalibration approach for liver and tumor is shown in Fig. 5. The results depict that the multi-scale features capture more detailed liver and tumor information and segment complicated liver parenchyma and tumors with less segmentation error. The boundary-marked images show the error between the ground truth and the segmented results. The network captures fuzzy liver boundaries with less segmentation error, and the low intensity variation between the liver and tumor is also captured with less segmentation error. However, the network shows negative results where the tumor size is sufficiently small, as shown in Fig. 6. The network performs well on liver pixel segmentation, but small tumors are not segmented as accurately as large tumors. The tumor segmentation error increases as the number of small tumors in the 2D CT images increases.

The proposed network can segment the liver anatomical structure from abdomen CT images with less segmentation error. The results show that the network can segment the liver even where the boundaries between the liver and nearby organs are feeble. However, the tumor pixels are not accurately classified by the network as the tumor size decreases: the segmentation error increases, and the network produces false-negative results for the segmentation of small-sized tumors. The multi-scale approach offers a reasonable segmentation quality for complex liver anatomical structures and large tumors. Still, the tumor segmentation performance for small tumors needs to be upgraded to an adequate level. It is observed that the network performs well on the liver and large tumors, but it produces false-negative results as the size of the tumor decreases.
We also analyzed the effect of the multi-scale features on the network complexity in terms of the number of parameters, layers, floating-point operations per second (FLOPS), and prediction time per image, as shown in Table 4. As the scaling factor increases, the total number of parameters, FLOPS, and prediction time decreases, indicating that the network complexity reduces with the scaling factor. However, the total number of layers increases with the scaling factor, which improves the feature extraction capability and learning ability of the network. The analysis shows a tradeoff between the scaling factor and the segmentation performance of the network.
4.6. Comparison with other methods

We verified the performance of the proposed method against state-of-the-art methods for liver and tumor segmentation. We performed the experimentation on the publicly available 3Dircadb dataset, which provides considerable variation and complexity of the liver and tumors. Segmentation results on the 3Dircadb database for the various methods are shown in Tables 5 and 6, respectively. The dice similarity score is one of the significant performance measures preferred to evaluate the segmentation performance of algorithms for medical images. The proposed method offers a dice similarity score of 97.13 % for liver and 84.15 % for liver tumor, far better than the baseline UNet architecture (Ronneberger et al., 2015). Our method is superior to the ResNet-based method (Han, 2017) proposed for liver tumors, with an improved dice score of 3.33 % for liver and 22.15 % for tumor segmentation. We also compared the results with the recently proposed modified UNet (mU-Net) architecture (Seo et al., 2020), with an improved dice score for liver and of 13.08 % for tumors.
The complexity of our proposed model is compared with that of the other UNet-based methods in Table 7. The Res2Net block reduces the number of parameters and increases the number of layers in the network, resulting in better learning and greatly reducing the computational burden compared to the ResNet and mU-Net models. The prediction time is also reduced due to the lower complexity compared to the ResNet and mU-Net models. Compared with the baseline UNet model, our model has a small increase in the number of parameters and FLOPS, but the segmentation performance improved significantly.
We also performed a statistical significance analysis of the proposed model. We calculated the p-value to demonstrate the statistical significance of the proposed model against the other models. The p-value is the statistical way to test the hypothesis of a method and is determined using statistical tests on the data. The segmentation performance of the different models was compared with the proposed model through statistical significance analysis by calculating the p-value between the two models. The proposed model was verified for statistical significance at the significance level α = 0.05, i.e., a confidence level of 0.95. The statistical significance of the model was decided by the p-value using the non-parametric Wilcoxon signed-rank test (Demsar, 2006), which is used for hypothesis testing. The test was performed on the predicted results obtained from the different models. The pairs of dice scores of each test sample from the different models were utilized to perform the Wilcoxon signed-rank test (Zabihollahy et al., 2019). We test the model by setting the null hypothesis (H0): the two models have no statistically significant difference in performance, and the alternative hypothesis (Ha): the proposed model has statistically significantly better performance than the other model. We performed the test on different groups of samples using the dice score of each sample, calculated against the ground truth provided in the dataset, to compare the models' performance. The significance level set for the test is α = 0.05. If the p-value is smaller than the significance level (α), then the null hypothesis is rejected in favor of the alternative hypothesis, meaning that the proposed model has statistically significantly better performance than the other model.
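The test described above can be reproduced with SciPy as sketched below; the per-sample dice scores are placeholder values, not results from the paper.

```python
from scipy.stats import wilcoxon

# Placeholder per-test-sample dice scores for the two models being compared
dice_proposed = [0.971, 0.962, 0.958, 0.975, 0.949, 0.968]
dice_baseline = [0.955, 0.940, 0.951, 0.960, 0.938, 0.942]

# Paired non-parametric Wilcoxon signed-rank test on the dice-score pairs
stat, p_value = wilcoxon(dice_proposed, dice_baseline)
if p_value < 0.05:  # significance level alpha = 0.05
    print("Reject H0: the performance difference is statistically significant")
```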
Table 2
Proposed network training configuration.

| Parameter | Value |
|---|---|
| Initial learning rate | 1 × 10⁻⁵ |
| Minimum learning rate | 1 × 10⁻¹⁰ |
| Weight regularization factor (L2 regularization) | 1 × 10⁻⁴ |
| Scaling factor of Res2Net block | 4 |
| Convolutional 2D operations and activation function utilized in Res2Net | Conv2D + BatchNorm + ReLU |
| SE Net block dimensionality reduction factor | 8 |
| Loss function | Dice loss |
Table 3
Experimental results on the 3Dircadb database for liver and tumor segmentation.

| Scaling factor (s) | Target | DSC (%) | IoU (%) | VOE (%) | RAVD (%) | ASSD (mm) | MSSD (mm) |
|---|---|---|---|---|---|---|---|
| 2 | Liver | 95.87 | 91.23 | 18.87 | 1.72 | 10.83 | 20.31 |
| 2 | Tumor | 68.85 | 55.34 | 44.66 | 2.24 | 3.23 | 15.01 |
| 4 | Liver | 97.13 | 94.42 | 5.57 | 0.41 | 4.08 | 10.21 |
| 4 | Tumor | 84.15 | 72.64 | 27.36 | 0.22 | 1.64 | 7.04 |
| 8 | Liver | 96.71 | 95.13 | 6.37 | 0.04 | 4.52 | 12.15 |
| 8 | Tumor | 78.19 | 64.19 | 35.81 | 0.36 | 1.92 | 7.84 |
Fig. 5. Sample segmentation results (scaling factor s = 4). Row 1: input images; Row 2: liver and tumor ground truth (GT) images (red: liver; orange: tumor); Row 3: overlay of GT and input images (red: liver; purple: tumor); Row 4: segmented liver and tumor (dark green: liver; faint green: tumor); Row 5: overlay of segmented output and input (dark green: liver; faint green: tumor); Row 6: boundary-marked images with GT and segmented result; Row 7: magnified liver region (red: liver GT; blue: tumor GT; green: segmented liver region; yellow: segmented tumor region). (For interpretation of the references to colour in the figure, the reader is referred to the web version of this article.)