Master's thesis: Research on the application of a deep learning approach in multiclass segmentation for medical images


HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

Control Engineering and Automation

Supervisor: Assoc Prof PhD Van-Truong Pham

Advisor's signature

School: School of Electrical and Electronic Engineering

HA NOI, 9/2023


ACKNOWLEDGEMENTS

I would like to take a moment to express my gratitude to the individuals who have been instrumental in shaping the trajectory of my academic journey and the completion of this report. Assoc. Prof. Van-Truong Pham, my mentor and guide, has been a constant source of inspiration. His dedication to my growth as a researcher and his insightful guidance have been invaluable. His willingness to share his expertise and time has been instrumental in helping me navigate through challenges and overcome obstacles.

I also want to extend my heartfelt appreciation to Assoc. Prof. Thi-Thao Tran, whose contributions have left a lasting impact on my work. Though not my primary advisor, her constructive feedback, suggestions, and discussions have added depth and perspective to my research. Her commitment to fostering a culture of learning and exploration has been immensely beneficial.

In the broader context, I am deeply thankful for the unwavering support of my family. Their belief in my capabilities and their encouragement during both the highs and lows have been my driving force. Their sacrifices and unwavering faith have kept me motivated, and I am indebted to them for their constant presence in my life.

Lastly, I would like to acknowledge the academic community, my peers, and fellow researchers who have provided valuable insights and diverse perspectives that have enriched my work. The exchange of ideas and collaborative discussions have shaped my understanding and contributed to the depth of this report.

As I reflect upon this journey, I am filled with gratitude for the people who have contributed to this report's fruition. Their collective efforts have not only facilitated the completion of this project but have also nurtured my growth as a learner and researcher.


ABSTRACT

In recent times, the prominence of deep learning-based techniques for medical image segmentation has surged. These approaches primarily revolve around innovating architectural designs and refining loss functions. Conventional loss functions in this context often rely on global measures, such as Cross-Entropy and Dice Loss, or overall image intensity, yet they may fall short in addressing complexities like occlusion and intensity variations. In response, this study introduces an original loss function, melding both local and global image features, reformulated within the Mumford-Shah framework. This novel approach is extended to the domain of multiclass segmentation. The proposed deep convolutional neural network leverages this new loss function to facilitate end-to-end training while concurrently achieving multi-class segmentation. Furthermore, motivated by the PiDiNet architecture, I propose a new Attention-PiDi-UNet architecture. This augmentation empowers the model to fuse contextual information across dense layers, efficiently capture semantic insights, and avert overfitting, resulting in precise segmentation outcomes. The proposed approach is rigorously assessed across four distinct biomedical segmentation datasets encompassing various imaging modalities, spanning 2D to 3D dimensions, including dermoscopy, cardiac magnetic resonance, and brain magnetic resonance. Evaluation results on datasets like Lesion Boundary Segmentation, the dermoscopic dataset, automated cardiac diagnosis, and 6-month infant brain MRI Segmentation corroborate the algorithm's superior performance compared to existing state-of-the-art methods. This robustly underscores the potency of our multiclass segmentation approach for diverse biomedical images.

Student's signature


LIST OF FIGURES

Figure 2.1 Threshold logic unit: an artificial neuron applies a step function after calculating the weighted sum of its inputs [39]
Figure 2.2 Perceptron architecture of two input neurons, one bias neuron, ...
Figure 2.3 Multilayer Perceptron architecture has two inputs, four neurons in one hidden layer and three neurons in output layer
Figure 2.4 Hidden layers in Deep Neural Network
Figure 2.5 Logistic activation function saturation [39]
Figure 2.6 ReLU activation
Figure 2.8 Mish activation
Figure 2.9 With dropout regularization, a random set of all the neurons is "dropped out" in each training iteration in one or more layers, with the exception of the output layer [39]
Figure 2.10 As the visual signal progresses through the brain, neurons respond to more complex patterns in larger receptive fields [39]
Figure 2.11 Square local receptive fields in CNN layers [39]
Figure 2.12 Relations between layers and zero padding [39]
Figure 2.13 Reducing the dimensionality of the input feature map using a stride with ...
Figure 2.14 Two different filters are being applied to get other two feature maps
Figure 2.15 Three color channels images and convolutional layers with many feature maps [39]
Figure 2.16 Max pooling layer with 2 x 2 pooling kernel, no padding and step ...
Figure 2.17 Invariance to small translations [39]
Figure 2.18 Example of semantic segmentation [39]

Figure 4.1 The representative segmentation results of my method on different skin lesion sizes from my testing set in the ISIC-2018 dataset
Figure 4.2 Representative results in PH2 dataset
Figure 4.3 The representative result of the right ventricle (yellow), myocardium (green), and left ventricle (blue) of three examples using my method on the ACDC 2017 challenge
Figure 4.4 The representative result on various slices of testing sample IDs 11, 16, and 17, respectively. The T1-weighted, the T2-weighted, and my segmentation result are indicated from left to right, respectively
Figure 4.5 The learning curves by the proposed method when training images from four databases in terms of average DSC of classes. (a) The ISIC-2018 dataset (b) The PH2 dataset (c) The ACDC dataset (d) The iSeg-2017 challenge


LIST OF TABLES

Table 4.1 Comparison with other popular approaches on the ISIC 2018 dataset. Results, except for the last four methods, have been taken from prior work
Table 4.2 Comparison with other popular approaches on the PH2 dataset. Results have been taken from [81] except for the last four methods
Table 4.3 Comparison with other popular approaches on the ACDC dataset. DSCs on RV, Myo, LV and the average DSC have been calculated. Results, except for the last four methods, have been taken from prior work
Table 4.4 The DSC, MHD, ASD, and average metrics of segmented classes in the validation dataset for teams among the top 8 in [3] of the iSeg-2017 challenge and my proposed approach (MHD: mm, ASD: mm)
Table 4.5 Comparison with other loss functions in DSC on the three datasets


CNN Convolutional Neural Networks
FCN Fully Convolutional Networks
CDCM Compact Dilation Convolution-based Module
CSAM Compact Spatial Attention Module
MSE Mean Squared Error
DSC Dice Similarity Coefficient
IoU Intersection-over-Union
ASD Average Surface Distance
MHD Modified Hausdorff Distance
iSeg 6-month Infant Brain MRI Segmentation
ACDC Automated Cardiac Diagnosis Challenge
RV Right Ventricle
LV Left Ventricle

$\mu_B$ The mean of the input vector, evaluated over the entire mini-batch $B$
On Inertial moment around the yaw axis
$\sigma_B$ The standard deviation of the input vector
$m_B$ The number of cases in the mini-batch
$\hat{x}^{(i)}$ The normalized input for case $i$
$\otimes$ Element-wise multiplication
$\beta$ The output shift (offset) parameter vector for the layer
$\varepsilon$ Small number which prevents division by zero (commonly $10^{-7}$)
$z^{(i)}$ The output of the Batch Normalization operation
$E_i$ Output of the $i$-th encoder block
$D_i$ Output of the $i$-th decoder block
$H_i$ Height of the $i$-th output feature map
$W_i$ Width of the $i$-th output feature map
$\alpha_i$ Additional weight in class $i$
$\Omega$ Spatial domain of the image
$N$ Number of segmentation classes
$\theta$ Trainable parameters of the CNN
$P_u(\theta)$ Softmax output for the $u$-th pixel value of class $n$
$T$ One-hot vector of the ground truth


CHAPTER 1 INTRODUCTION

1.1 Motivation for Participating in Medical Image Segmentation Challenges

Medical image segmentation challenges provide a unique opportunity for researchers and practitioners to address critical problems in healthcare through the development of advanced computational techniques. By participating in these challenges, participants aim to contribute to the improvement of diagnosis, treatment planning, and patient care. In this section, I discuss the motivations behind participating in four specific medical image segmentation challenges: the Lesion Boundary Segmentation challenge on the ISIC-2018 dataset, the dermoscopic PH2 database, the 2017 MICCAI sub-challenge on automatic cardiac diagnosis benchmark, and the 6-month infant brain MRI Segmentation (iSeg) benchmark.

Skin cancer is a prevalent and potentially deadly condition that demands early and accurate detection. The ISIC-2018 challenge [1], [2] and the PH2 challenge [3] focus on segmenting skin lesions, aiming to improve the accuracy and efficiency of diagnosis. The motivation to participate in these challenges arises from the urgent need to develop automated segmentation methods that can assist dermatologists in identifying and diagnosing skin cancer. Successful segmentation of lesion boundaries can enable more accurate diagnosis and early intervention, ultimately enhancing patient outcomes. Participating in these challenges provides an opportunity to contribute to dermatological research, develop advanced segmentation techniques, and potentially revolutionize skin cancer diagnosis.

Cardiovascular diseases are a leading cause of mortality globally. The MICCAI sub-challenge on automatic cardiac diagnosis [4] addresses the need for accurate cardiac segmentation to aid in diagnosing heart conditions. The motivation to participate in this challenge stems from the potential to advance cardiac imaging and diagnosis through automated segmentation methods. Precise segmentation of cardiac structures can assist cardiologists in assessing heart function, identifying anomalies, and guiding treatment decisions. By participating in this challenge, researchers can collaborate with experts in cardiology, contribute to cutting-edge medical research, and develop solutions that have a direct impact on patient care.

Segmentation of infant brain MRI scans is crucial for studying early brain development and identifying abnormalities. The iSeg benchmark challenge [5] focuses on accurate segmentation of infant brain structures, aiding in early detection of neurological disorders. The motivation to participate in this challenge lies in the potential to contribute to pediatric neuroimaging and improve the understanding of infant brain development. Accurate segmentation of brain structures can assist clinicians and researchers in diagnosing conditions and monitoring developmental milestones. Participation in the iSeg benchmark offers the chance to advance pediatric imaging, collaborate with experts in the field, and create tools that facilitate early intervention and improved patient outcomes.

In conclusion, participating in medical image segmentation challenges provides a unique avenue to address critical healthcare challenges. The motivations behind participating in these challenges range from improving diagnosis accuracy and treatment planning to collaborating with experts and contributing to cutting-edge medical research. These challenges offer a platform for researchers to develop and showcase innovative solutions that have the potential to revolutionize healthcare practices and enhance patient care.

1.2 Advancements in Medical Image Segmentation and Innovative Approaches

Image segmentation is a pivotal and challenging topic in the field of computer vision [6]. Its objective is to partition an image in a way that accurately locates, identifies, and quantifies objects. This process holds crucial importance in medical imaging, supporting additional clinical analysis, diagnosis, therapy planning, and disease progression measurement. Within the domain of medical image segmentation, several primary obstacles exist. These include a scarcity of well-labeled benchmarks for training, a deficiency of annotated images [7], a lack of consistent segmentation techniques, poor image resolution, and significant variability in image quality across patients [8]. Precise calculation of segmentation accuracy and uncertainty is vital for gauging performance in other applications [9]. Consequently, this underscores the imperative for advanced methodologies, such as Artificial Intelligence (AI)-based approaches, to enable automated, generalizable, and efficient medical image segmentation.

In the context of developing AI systems, the attributes of generalization and robustness bear critical significance, particularly in clinical trials [10]. Consequently, the development of a resilient architecture suited for diverse biomedical applications becomes paramount. Recently, convolutional neural networks (CNNs) have emerged as advanced tools for automating the segmentation of medical images [11]-[13]. This includes various modalities such as X-rays, CT scans, and MRIs, with promising outcomes compared to conventional segmentation methods [14], [15]. Among different CNN versions, encoder-decoder networks like Fully Convolutional Networks (FCN) [16] and their advancements such as U-Net [17] have gained substantial traction as semantic segmentation techniques for 2D images. A deep fully convolutional neural network designed for semantic pixel-wise segmentation that requires fewer trainable parameters yet yields high-quality segmentation maps was introduced by [18]. Addressing dense prediction challenges, a novel convolutional network module was proposed by [19]. This module utilized dilated convolutions to systematically aggregate multi-scale contextual features, resulting in a significant performance enhancement for advanced automated segmentation systems. Moreover, [20] introduced DeepLab as a segmentation method. DeepLabV3 [21], without DenseCRF fine-tuning, demonstrated considerable improvements over earlier DeepLab iterations, utilizing a synthetic approach with fewer convolutional layers than FCN and U-Net architectures, along with skip connections between the encoder and decoder paths. An efficient scene parsing network for comprehending complex receptive fields was proposed by [22]. This approach utilized global pyramidal characteristics to facilitate the acquisition of additional contextual information.

Throughout the training process, CNN model parameters are typically refined using gradient descent techniques, as outlined by [23], wherein errors are quantified by a loss function that contrasts predicted labels against ground truth labels. For classification endeavors, prevalent loss functions encompass cross-entropy (CE) and the L2 norm, often referred to as the mean squared error (MSE), as frequently cited in the works [24], [25]. Conversely, problems centered on segmentation have commonly engaged the Dice Coefficient (DC) and cross-entropy (CE) [17], [26]. Despite the recent strides made in CNN deployment for biomedical image segmentation, prevalent loss functions frequently revolve around pixel-wise similarity evaluation. Notably, the DC and CE are tailored towards specific region feature extraction. While this framework often yields impressive classification and segmentation outcomes, low loss function values do not always signify meaningful segmentation. Instances arise where noisy images produce several indistinct contours, signaling erroneous predictions, and the indistinctness of object boundaries stems from the difficulty in classifying pixels near the contour. An additional challenge arises from susceptibility to local minima due to aberrations within the training database, high dimensionality, and the non-convex attributes of loss functions, as illuminated by [27].
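As a small illustration of the two pixel-wise losses named above, the following sketch computes a binary cross-entropy and a soft Dice loss on a flattened toy mask. It is a minimal example under stated assumptions, not the loss proposed in this thesis: the function names, the `smooth` term, and the toy values are all illustrative.

```python
import math

def cross_entropy(probs, targets, eps=1e-12):
    """Mean pixel-wise binary cross-entropy between predicted foreground
    probabilities and binary ground-truth labels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)

def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + smooth) / (sum(P) + sum(T) + smooth).
    The smoothing constant is an assumption added for numerical stability."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denom + smooth)

# A flattened 2 x 3 toy "image": ground truth and two candidate predictions.
truth = [1, 1, 0, 0, 0, 0]
good = [0.9, 0.8, 0.1, 0.1, 0.2, 0.1]
poor = [0.4, 0.3, 0.6, 0.5, 0.4, 0.6]

for name, pred in [("good", good), ("poor", poor)]:
    print(name, round(cross_entropy(pred, truth), 4), round(dice_loss(pred, truth), 4))
```

Both losses score each pixel (or the region overlap) independently of where the errors lie, which is why neither of them explicitly rewards a well-formed contour.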

Among frequent deep-CNN approaches, the fully convolutional network (FCN) [28] and U-Net [17] are designed so that deconvolutional operations replace fully connected layers to strengthen temporal coherence; in addition, skip connections are used to pass spatial information on to deeper layers. Depthwise separable convolution is defined as a depthwise convolution followed by a pointwise convolution, which helps prevent the model from overfitting by reducing the number of connections in the model.
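To make the reduction in connections concrete, the sketch below counts the weights of a standard convolution and of a depthwise convolution followed by a pointwise convolution. The channel counts and kernel size are illustrative assumptions, and bias terms are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias terms ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution (one filter per input
    channel) followed by a 1 x 1 pointwise convolution (bias ignored)."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

c_in, c_out, k = 64, 128, 3  # illustrative sizes, not taken from the thesis
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
print(std, sep, round(std / sep, 2))
```

For these sizes the factorized layer needs roughly an eighth of the weights, which is the sense in which it reduces the number of connections.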

Dilated convolution [30] expands the window size without increasing the number of weights by inserting zero values into convolution kernels, while maintaining the computation cost. Adaptive Dilated Convolution [31] generates and fuses multi-scale features of similar spatial sizes by setting various dilation rates for different channels. Applying dilated convolution, the Compact Dilation Convolution-based Module (CDCM) is adopted in my proposed model to extract more useful features.
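The effect of dilation can be sketched in one dimension: the same three weights cover a wider window when their taps are spaced further apart. This is a minimal pure-Python illustration with a hypothetical helper name, not the CDCM module itself.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (no padding) 1D cross-correlation with a dilated kernel.

    Kernel taps are spaced `dilation` samples apart, so the receptive
    field is (len(kernel) - 1) * dilation + 1 while the number of
    weights stays len(kernel).
    """
    span = (len(kernel) - 1) * dilation + 1
    out = []
    for start in range(len(signal) - span + 1):
        acc = 0.0
        for j, w in enumerate(kernel):
            acc += w * signal[start + j * dilation]
        out.append(acc)
    return out

signal = [1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, 0, -1]  # three weights in both cases

dense = dilated_conv1d(signal, kernel, dilation=1)   # receptive field of 3
dilated = dilated_conv1d(signal, kernel, dilation=2)  # receptive field of 5
print(dense)
print(dilated)
```

With `dilation=2` each output compares samples four positions apart instead of two, although the kernel still holds only three weights.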

Region-based Tversky loss [33] and Focal Tversky loss [34] control the information flow implicitly through pixel-level affinity and tackle class-imbalanced problems; however, their contour optimization processes are not good enough. There has been an ongoing concern about exploiting active contour models as loss functions in deep-learning solutions for better contour optimization. The region-based active contour Chan-Vese model [35] has been successful for training images with two regions, each having a different mean of pixel intensity. Inheriting the advantage of the Mumford-Shah functional and the AC loss, with some adjustments, yields the LMS loss [36]. To meet the requirements for boundary optimization and to address the class-imbalanced problem, I propose a new Focal Active Contour loss function.

This study yields several noteworthy contributions:

• Innovative Loss Function: I introduce a novel loss function tailored for the training process of deep-learning models. By incorporating elements of active contour methodology into the loss functions, I aim to tackle a persistent challenge encountered in medical imaging and computer vision: the problem of intensity inhomogeneity within image data. This amalgamation of techniques offers a promising avenue to address this issue effectively. It not only helps deep-learning models achieve more accurate and robust segmentation results but also paves the way for more precise and reliable image analysis across various applications, ultimately advancing the capabilities of AI-driven solutions in the field.

• End-to-End CNN Model Development: Inspired by PiDiNet, I propose a new architecture by modifying this network from an FCN shape into a U-Net shape, using CDCM modules (without a subsequent CSAM) combined with an Attention module and a Depthwise-and-Pointwise module.

• Thorough Evaluation and Comparison: A comprehensive evaluation of both my proposed model and the introduced loss function is conducted across 2D and 3D datasets. These evaluations are benchmarked against existing state-of-the-art methods. Notably, my approach consistently demonstrates promising outcomes when compared to baseline algorithms. This observation is substantiated across diverse datasets including the Lesion Boundary Segmentation ISIC-2018 dataset, the dermoscopic PH2 dataset, the 2017 MICCAI sub-challenge on automatic cardiac diagnosis benchmark, and the 6-month infant brain MRI Segmentation (iSeg) benchmark.


CHAPTER 2 THEORETICAL BASIS

2.1 Artificial Neural Networks

Deep learning is a highly significant machine learning technique. It teaches a computer to filter inputs through layers in order to predict and categorize data. Observations may take the form of images, text, or sound. The way the human brain filters information is the inspiration behind deep learning, whose aim is to imitate how the human brain works. There are about 100 billion neurons in the human brain, and a single neuron interacts with approximately 100,000 of its peers. That is what I am attempting to build, although in a computational manner. The neuron (or node) receives one or more signals (input values) that pass through it, and it transmits an output signal. This information is broken down into numbers and bits of binary data that a computer can understand.

What about synapses? Each connection between neurons is assigned a weight, and these weights are central to Artificial Neural Networks (ANNs) because they are the way ANNs learn. By changing the weights, the ANN determines to what degree signals are passed along, and the weights are adjusted while the network is trained.

Several decades ago, McCulloch and Pitts proposed a very simple model of the biological neuron [37], later called an artificial neuron, which has one or more binary (on/off) inputs and one binary output. When more than a certain number of its inputs are active, the artificial neuron activates its output. They demonstrated in their paper that even with such a simplistic model, a network of artificial neurons can be built to compute any logical proposition.
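The thresholding behaviour described above can be sketched in a few lines. The helper names and the "at least `threshold` active inputs" convention are assumptions made for illustration, not notation from [37].

```python
def mcp_neuron(inputs, threshold):
    """Artificial neuron with binary inputs and a binary output.
    Fires (returns 1) when at least `threshold` inputs are active."""
    return 1 if sum(inputs) >= threshold else 0

def logical_and(a, b):
    # Both inputs must be active for the neuron to fire.
    return mcp_neuron([a, b], threshold=2)

def logical_or(a, b):
    # A single active input is enough for the neuron to fire.
    return mcp_neuron([a, b], threshold=1)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, logical_and(a, b), logical_or(a, b))
```

Changing only the threshold turns the same unit from an AND gate into an OR gate, which hints at how networks of such units can express logical propositions.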

The Perceptron, one of the most basic ANN architectures, was created by Frank Rosenblatt [38]. It is based on a marginally different artificial neuron (Figure 2.1) called a threshold logic unit (TLU), or sometimes a linear threshold unit (LTU). The inputs and outputs are now numbers (rather than binary on/off values), and each input connection has a weight assigned to it. The TLU calculates a weighted sum of its inputs ($z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n = \mathbf{x}^T \mathbf{w}$), then applies a step function to that sum and returns the result: $h_{\mathbf{w}}(\mathbf{x}) = \mathrm{step}(z)$.
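The weighted sum followed by a step function can be sketched as follows. The Heaviside convention `step(0) = 1`, the explicit bias argument, and the example weights are assumptions chosen for illustration.

```python
def step(z):
    """Heaviside step function (assumed convention: 1 if z >= 0, else 0)."""
    return 1 if z >= 0 else 0

def tlu(x, w, b=0.0):
    """Threshold logic unit: weighted sum of the inputs, then a step."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return step(z)

w = [0.5, -1.0]  # illustrative weights
b = 0.2          # illustrative bias term

print(tlu([1.0, 0.5], w, b))  # z = 0.5 - 0.5 + 0.2
print(tlu([0.0, 1.0], w, b))  # z = -1.0 + 0.2
```

The sign of the weighted sum alone decides the output, so a single TLU separates its input space with a linear boundary.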

A Perceptron comprises a layer of Threshold Logic Units (TLUs), each connected to all the inputs. Such a layer is called a fully connected layer, or a dense layer, when each neuron within the layer is connected to every neuron in the preceding layer. The Perceptron's inputs are fed to input neurons, which serve as pass-through units that directly output the input they receive. Together, these input neurons constitute the input layer. An additional bias term is commonly included ($x_0 = 1$), typically introduced through a special neuron known as a bias neuron, which always outputs 1. Figure 2.2 illustrates a Perceptron with two inputs and three outputs. In this case, the Perceptron functions as a multi-output classifier, simultaneously categorizing instances into three distinct binary classes.

Figure 2.1 Threshold logic unit: an artificial neuron applies a step function after calculating the weighted sum of its inputs [39]

[Figure 2.2: Perceptron diagram with an input layer, a bias neuron (always outputs 1), and an output layer]

Perceptrons are trained using rules that take into account the network's error when making predictions. The Perceptron learning rule reinforces the connections that help reduce the error. In greater detail, the Perceptron is exposed to one training instance at a time and makes a prediction for each instance. If an output neuron generates an incorrect prediction, the connection weights from the inputs that would have contributed to the correct prediction are reinforced. This rule is represented by Equation (2.1):

$$w_{i,j}^{(\text{next step})} = w_{i,j} + \eta \, (y_j - \hat{y}_j) \, x_i \qquad (2.1)$$

In this equation:

• $w_{i,j}$ is the weight linking the $i$-th input neuron and the $j$-th output neuron.

• $x_i$ is the $i$-th input value of the current training sample.

• $y_j$ is the target output of the $j$-th output neuron for the current training sample.

• $\hat{y}_j$ is the output of the $j$-th output neuron for the current training instance.

• $\eta$ denotes the learning rate during training (typically adjusted as needed).
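The update rule described above can be sketched as a small training loop. This is a minimal illustration under stated assumptions: the step convention `step(0) = 1`, the zero initial weights, the learning rate of 1.0, and the toy task (two output neurons learning logical AND and OR) are all choices made for the example.

```python
def step(z):
    return 1 if z >= 0 else 0

def predict(x, weights):
    """weights[j] holds the weights of output neuron j; weights[j][0]
    multiplies the bias input x_0 = 1."""
    xb = [1.0] + list(x)
    return [step(sum(w * xi for w, xi in zip(wj, xb))) for wj in weights]

def train_perceptron(samples, n_inputs, n_outputs, eta=1.0, epochs=20):
    weights = [[0.0] * (n_inputs + 1) for _ in range(n_outputs)]
    for _ in range(epochs):
        for x, y in samples:
            xb = [1.0] + list(x)
            y_hat = predict(x, weights)
            for j in range(n_outputs):
                for i in range(n_inputs + 1):
                    # w_ij <- w_ij + eta * (y_j - y_hat_j) * x_i
                    weights[j][i] += eta * (y[j] - y_hat[j]) * xb[i]
    return weights

# Targets per sample: (AND, OR) of the two binary inputs.
samples = [((0, 0), (0, 0)), ((0, 1), (0, 1)), ((1, 0), (0, 1)), ((1, 1), (1, 1))]
weights = train_perceptron(samples, n_inputs=2, n_outputs=2)

for x, y in samples:
    print(x, predict(x, weights), list(y))
```

Weights change only when a prediction is wrong, and both toy targets are linearly separable, so the loop settles on weights that classify all four samples correctly.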

Given that the decision boundaries of individual output neurons remain linear, Perceptrons inherently struggle to capture intricate patterns. However, stacking multiple Perceptrons mitigates these limitations. The resulting structure is known as a Multilayer Perceptron (MLP), as illustrated in Figure 2.3. The architecture comprises an input layer (of pass-through neurons), one or more hidden layers of TLUs, and finally an output layer of TLUs. Each layer except the output layer includes a bias neuron, and each layer is fully connected to the next, creating a comprehensive neural network. A deep neural network (DNN) is described as an ANN with a large number of hidden layers.

Figure 2.3 Multilayer Perceptron architecture has two inputs, four neurons in one hidden layer and three neurons in output layer [39]

2.2 Deep neural network

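Deep Learning revolves around the exploration of deep neural networks (DNNs), which frequently consist of intricate sequences of computations. Representing the output of the $l$-th hidden layer as $h^{(l)}(\mathbf{x})$, the computation for a neural network with $L$ hidden layers is depicted as:

$$f(\mathbf{x}) = f\left[a^{(L+1)}\left(h^{(L)}\left(a^{(L)}\left(\cdots\left(h^{(1)}\left(a^{(1)}(\mathbf{x})\right)\right)\right)\right)\right)\right] \qquad (2.2)$$

Each pre-activation function $a^{(l)}(\mathbf{x})$ entails a linear operation governed by the weight matrix $\mathbf{W}^{(l)}$ and bias $\mathbf{b}^{(l)}$:

$$a^{(l)}(\mathbf{x}) = \mathbf{W}^{(l)} \mathbf{x} + \mathbf{b}^{(l)} \qquad (2.3)$$

Figure 2.4 Hidden layers in Deep Neural Network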
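The nested composition of linear pre-activations and layer outputs described above can be sketched as a loop over layers. This is an illustrative sketch only: the sigmoid used for every layer (including the output), the layer sizes, and the weight values are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, h):
    """Linear pre-activation W h + b for one layer (W is a list of rows)."""
    return [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(W, b)]

def forward(x, layers):
    """Apply each (W, b) pair in turn: the output of one layer is the
    input of the next, which unrolls the nested composition."""
    h = list(x)
    for W, b in layers:
        h = [sigmoid(a) for a in affine(W, b, h)]
    return h

layers = [
    ([[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]], [0.0, 0.1, -0.1]),  # 2 inputs -> 3 hidden
    ([[1.0, -1.0, 0.5]], [0.2]),                                  # 3 hidden -> 1 output
]
output = forward([1.0, 2.0], layers)
print(output)
```

Adding more `(W, b)` pairs to `layers` deepens the network without changing the loop, which mirrors how the composition grows with the number of hidden layers.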

In 1986, David Rumelhart introduced a groundbreaking approach that revolutionized the field. This approach implemented the backpropagation training algorithm, which remains a cornerstone of neural network training. In essence, it combines Gradient Descent [43] with an efficient means of automatically calculating gradients. The backpropagation algorithm computes the gradient of the network's error with respect to each model parameter in just two passes through the network: one forward and one backward. It thereby determines how the connection weights and bias terms should be adjusted to reduce the error. It then performs a regular Gradient Descent step using these computed gradients, and the process is repeated, iteratively moving towards a solution.

Key aspects of the backpropagation algorithm include:

• Mini-Batch Processing and Epochs: The algorithm operates on one mini-batch at a time (typically comprising a power-of-two number of instances for computational efficiency), cycling through the entire training dataset multiple times. Each complete cycle is termed an epoch. This iterative process aids in the gradual reduction of the loss.

• Forward Pass: Each mini-batch is passed from the input layer to the first hidden layer. The algorithm then computes the outputs of all neurons within this layer for each sample in the mini-batch. This result is propagated forward to the subsequent layer, repeating this process layer by layer until the output layer is reached. This forward pass is akin to making predictions, with the distinction that intermediate results are retained for use during the backward pass.

• Error Calculation: Subsequent to the forward pass, the algorithm calculates the network's output error.

• Output Contribution Evaluation: The algorithm assesses the contribution of each output connection to the error. Leveraging the chain rule, this process is executed analytically, ensuring efficiency and precision.

• Backward Error Propagation: By employing the chain rule, the algorithm quantifies how much of each error contribution stems from each connection in the layer directly below. This backward process continues until the input layer is reached. As previously highlighted, this backward propagation effectively measures the error gradient across all the connection weights in the network.

• Gradient Descent Phase: The final step involves adjusting all the network's connection weights using the computed error gradients during a Gradient Descent phase.
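The steps listed above can be sketched for a tiny network with two inputs, two hidden sigmoid neurons, and one sigmoid output. This is a minimal illustration under stated assumptions: the squared-error loss, the single training sample (no mini-batches), the initial weights, and the learning rate are all chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass; the hidden activations are returned so that the
    backward pass can reuse them."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y_hat = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y_hat

def loss(y_hat, y):
    return 0.5 * (y_hat - y) ** 2

def backward(x, y, W2, h, y_hat):
    """Chain rule applied from the output back towards the input layer."""
    delta_out = (y_hat - y) * y_hat * (1.0 - y_hat)  # dL/dz at the output
    grad_W2 = [delta_out * hi for hi in h]
    grad_b2 = delta_out
    # dL/dz at each hidden neuron, propagated through W2.
    delta_hid = [delta_out * w * hi * (1.0 - hi) for w, hi in zip(W2, h)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hid]
    grad_b1 = list(delta_hid)
    return grad_W1, grad_b1, grad_W2, grad_b2

def gradient_step(x, y, W1, b1, W2, b2, eta=0.5):
    h, y_hat = forward(x, W1, b1, W2, b2)
    gW1, gb1, gW2, gb2 = backward(x, y, W2, h, y_hat)
    W1 = [[w - eta * g for w, g in zip(rw, rg)] for rw, rg in zip(W1, gW1)]
    b1 = [b - eta * g for b, g in zip(b1, gb1)]
    W2 = [w - eta * g for w, g in zip(W2, gW2)]
    b2 = b2 - eta * gb2
    return W1, b1, W2, b2

W1, b1 = [[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1]
W2, b2 = [0.5, -0.3], 0.05
x, y = [1.0, 0.5], 1.0

before = loss(forward(x, W1, b1, W2, b2)[1], y)
for _ in range(100):
    W1, b1, W2, b2 = gradient_step(x, y, W1, b1, W2, b2)
after = loss(forward(x, W1, b1, W2, b2)[1], y)
print(before, after)
```

A common sanity check for such code is to compare the analytic gradients with finite differences of the loss; the two should agree to several decimal places.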

The backpropagation algorithm's significance warrants reiteration: it starts with a prediction (forward pass), calculates the error for each training step, retraces through each layer to compute error contributions from connections (reverse pass), and subsequently adjusts connection weights to minimize error (Gradient Descent step). To facilitate the proper functioning of this algorithm, a pivotal enhancement was made to the MLP's architecture: the replacement of the step function with the logistic (sigmoid) function [44], denoted as $\sigma(z) = \frac{1}{1 + e^{-z}}$. The logistic function is characterized by a continuous nonzero derivative across its domain, enabling Gradient Descent to make progress at each step. In contrast, the step function features flat segments, leading to the absence of gradients for computation.

However, a challenge arises: as the algorithm progresses down to lower layers, gradients diminish due to the cumulative effect of multiplications by values less than 1. Consequently, the Gradient Descent updates leave the lower layers' connection weights virtually unchanged, preventing convergence to a good solution, a predicament known as the vanishing gradients problem. Conversely, gradients can surge in magnitude, causing layers to receive excessively large weight updates and ultimately leading to divergence, an issue termed the exploding gradients problem. The combination of the logistic activation function and the weight initialization procedure was examined in [45]. This study demonstrated that each layer's output variance exceeds its input variance significantly. As the network advances, variance escalates with each layer, culminating in activation saturation in the upper layers. Notably, saturation is exacerbated by the fact that the logistic function has a mean of 0.5 rather than 0.
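The shrinking of gradients under repeated multiplication can be illustrated with the logistic function itself, whose derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ never exceeds 0.25. The sketch below is illustrative only: it looks at the derivative factors alone and ignores the weights, which also enter the product in a real network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the logistic function: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0 and approaches zero as |z| grows (saturation).
for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(z, round(sigmoid(z), 5), round(sigmoid_grad(z), 5))

# Best-case factor contributed by the sigmoid derivatives alone when a
# gradient is propagated back through n layers (0.25 per layer at most).
for n in (1, 5, 10):
    print(n, 0.25 ** n)
```

Even in the best case the derivative factors alone shrink the gradient by a factor of four per layer, and saturated units shrink it far more.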

With respect to the logistic activation function (depicted in Figure 2.5), it is evident that the function saturates at 0 or 1 as inputs become increasingly large (negative or positive), with derivatives that approach zero. Consequently, there is minimal gradient available for backpropagation, and any existing gradient becomes diluted as it propagates back through the network's upper layers. Therefore, Glorot and Bengio [45] suggested a way to dramatically reduce the unstable gradient issue, namely Glorot and He Initialization.

Figure 2.5 Logistic activation function saturation [39]

2.2.1 Glorot and He Initialization

The proper propagation of signals in both forward and backward passes is crucial in neural networks. During prediction (forward pass) and gradient computation (backward pass), signals must flow properly in both directions. The authors emphasize that for correct signal flow, the output variance of a layer should match its input variance. Furthermore, the gradients need to have equal variance before and after they flow through a layer in the reverse direction. Both conditions cannot be guaranteed unless the layer has an equal number of inputs and neurons (referred to as the $fan_{in}$ and $fan_{out}$ of the layer).

However, Glorot and Bengio introduced a practical compromise that has proven effective: initializing the connection weights of each layer with random values as defined by equations (2.4) and (2.5), which use a normal distribution and a uniform distribution, respectively, with $\mathrm{fan}_{\mathrm{avg}} = (\mathrm{fan}_{\mathrm{in}} + \mathrm{fan}_{\mathrm{out}})/2$. This initialization strategy is referred to as Xavier initialization or Glorot initialization [45]. The significance of this technique has been recognized for over a decade. Applying Glorot initialization significantly accelerates training and is one of the influential strategies that have contributed to the success of Deep Learning.

Normal distribution with mean 0 and variance $\sigma^2 = \dfrac{1}{\mathrm{fan}_{\mathrm{avg}}}$   (2.4)

Uniform distribution between $-r$ and $+r$, with $r = \sqrt{\dfrac{3}{\mathrm{fan}_{\mathrm{avg}}}}$   (2.5)

Similar techniques for different activation functions have been presented in certain papers [46]. These approaches share a common framework and differ in the scale of the variance $\sigma^2$. In the case of the uniform distribution, the value of $r$ is computed as $r = \sqrt{3\sigma^2}$. In particular, the initialization technique tailored for the Rectified Linear Unit (ReLU) activation function, which is discussed in the subsequent subsection, is sometimes referred to as He initialization.
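The initialization schemes of this subsection can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the Glorot scale is taken as $\sigma^2 = 1/\mathrm{fan}_{\mathrm{avg}}$, and the He scale is assumed to be the commonly used $\sigma^2 = 2/\mathrm{fan}_{\mathrm{in}}$, which the text above does not state explicitly.

```python
import math
import random

def glorot_uniform(fan_in, fan_out, rng):
    """Glorot (Xavier) uniform init: weights drawn from [-r, r], r = sqrt(3 / fan_avg)."""
    fan_avg = (fan_in + fan_out) / 2
    r = math.sqrt(3.0 / fan_avg)
    return [[rng.uniform(-r, r) for _ in range(fan_out)] for _ in range(fan_in)]

def he_normal(fan_in, fan_out, rng):
    """He normal init (assumed scale): mean 0, variance 2 / fan_in."""
    sigma = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, sigma) for _ in range(fan_out)] for _ in range(fan_in)]

def variance(matrix):
    """Empirical variance of all entries of a weight matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return sum((w - mean) ** 2 for w in values) / len(values)

rng = random.Random(0)                   # fixed seed so the run is reproducible
w_glorot = glorot_uniform(300, 100, rng)
w_he = he_normal(300, 100, rng)

# The empirical variances should be close to the target scales.
print("Glorot:", variance(w_glorot), "target", 1 / 200)
print("He:    ", variance(w_he), "target", 2 / 300)
```

For the uniform case the target variance follows from $r = \sqrt{3\sigma^2}$, since a uniform distribution on $[-r, r]$ has variance $r^2/3$.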

2.2.2 Non-Saturating Activation Functions

The backpropagation algorithm works not only with the logistic function but also with various other activation functions. Several common options are presented below.

(a) ReLU activation

To address the vanishing gradient problem [47] associated with sigmoid activation, the Rectified Linear Unit (ReLU) was introduced. The ReLU activation function is illustrated in Figure 2.6. Unlike the sigmoid function, ReLU does not saturate for positive inputs: its derivative is 0 for $x < 0$ and 1 otherwise, which largely avoids the vanishing gradient issue. Additionally, ReLU promotes model sparsity, since a gradient of 0 essentially indicates that a neuron is inactive. Moreover, ReLU is cheaper to compute than functions like sigmoid and tanh, because it only requires taking the maximum $\max(0, x)$. Consequently, ReLU has become the standard activation function in today's deep learning landscape.

(b) Swish activation

Nonetheless, the search for improved activation functions continued. In October 2017, Google Brain introduced the Swish activation function [49], aiming to improve on the existing options. The Swish activation function is characterized by the simple equation $f(x) = \dfrac{x}{1 + e^{-x}}$, as depicted in Figure 2.7. Swish stands out as a smooth function, unlike ReLU, which changes direction abruptly at $x = 0$. Swish bends smoothly from 0 towards negative values and then back upwards. Importantly, Swish is non-monotonic, which sets it apart from functions like ReLU, whose output never decreases as the input increases. This characteristic is highlighted in the authors' paper, where they underscore that Swish's non-monotonicity distinguishes it from most other activation functions.

Figure 2.7 Swish activation

The Swish activation function offers several advantages due to its characteristics:

• Bounded below and sparse activation: Similar to ReLU, Swish benefits from sparsity. Very negative inputs are mapped to values close to zero, which contributes to a sparse representation.

• Unbounded above: Swish does not saturate to a maximum value for very large inputs (e.g., 1 for all neurons). This property, which it shares with ReLU, distinguishes Swish from saturating activation functions such as sigmoid and tanh.

• Smooth curve and smooth landscape: The smoothness of the Swish curve extends to its derivative, leading to a smoother landscape for optimization. This smoothness aids in efficiently navigating the model towards minimal loss during optimization.

• Utilization of negative values: Unlike ReLU, where negative inputs are set to zero, Swish retains small negative outputs, particularly for inputs close to zero. This property is beneficial for capturing subtle patterns in the data, making Swish more flexible in handling different types of information.

In essence, Swish's bounded-below, smooth, and flexible behavior makes it a compelling alternative to ReLU, offering improvements in terms of capturing complex patterns and optimizing model performance.

(c) Mish activation

Mish activation draws inspiration from Swish activation. The Mish activation function is defined as $f(x) = x \tanh(\ln(1 + e^{x}))$. Its graphical representation is depicted in Figure 2.8. While Mish shares many of the advantages of Swish, the authors of [50] suggest that the error surface could be smoother with Mish. However, the primary drawback of the Mish activation function is its significantly higher computational cost.
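The three activation functions of this subsection can be compared numerically with a short sketch. It uses only the formulas given above, $\max(0, x)$, $x/(1 + e^{-x})$ and $x\tanh(\ln(1 + e^{x}))$; the helper names and the numerically stable forms of the sigmoid and softplus are choices made for this illustration.

```python
import math

def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def swish(x):
    """Swish: x / (1 + e^-x), i.e. x * sigmoid(x)."""
    return x * sigmoid(x)

def softplus(x):
    """ln(1 + e^x), computed in an overflow-safe form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    """Mish: x * tanh(ln(1 + e^x))."""
    return x * math.tanh(softplus(x))

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:5.1f}  relu={relu(x):7.4f}  swish={swish(x):7.4f}  mish={mish(x):7.4f}")
```

The printed values illustrate the non-monotonic behavior discussed above: Swish is lower at $x = -1$ than at $x = -5$, whereas ReLU is exactly zero at both points.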

2.2.3 Batch Normalization

Figure 2.8 Mish activation [50]

Although using He initialization together with ReLU (or one of its variants) can significantly reduce the likelihood of vanishing/exploding gradient problems at the start of training, it doesn't guarantee that these issues won't arise during the course of training. In the paper [51], a technique called Batch Normalization (BN) is introduced to address these problems. This method involves inserting a new operation into the model, just before or after the activation function of each hidden layer.

The operation zero-centers and normalizes each input. Subsequently, the results are scaled and shifted using two learnable parameter vectors per layer: one for scaling and another for shifting. This allows the model to learn the most appropriate scale and mean for each of the layer's inputs. To zero-center and normalize the inputs, their mean and standard deviation must be known; these are estimated over the current mini-batch. The entire procedure is summarized below:

$$\mu_B = \frac{1}{m_B}\sum_{i=1}^{m_B} x^{(i)}$$

$$\sigma_B^2 = \frac{1}{m_B}\sum_{i=1}^{m_B} \left(x^{(i)} - \mu_B\right)^2$$

$$\hat{x}^{(i)} = \frac{x^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}$$

$$z^{(i)} = \gamma \otimes \hat{x}^{(i)} + \beta$$

In this algorithm:

• $\mu_B$ is the vector of input means, evaluated over the entire mini-batch $B$.

• $\sigma_B$ is the vector of input standard deviations, also evaluated over the entire mini-batch.

• $m_B$ is the number of instances in the mini-batch.

• $\hat{x}^{(i)}$ is the vector of zero-centered and normalized inputs for instance $i$.

• $\gamma$ is the output scale parameter vector for the layer.

• $\otimes$ expresses element-wise multiplication.

• $\beta$ is the output shift (offset) parameter vector for the layer. The corresponding shift parameter offsets each input.

• $\varepsilon$ is a small positive number that prevents division by zero. This is named a smoothing term.

• $z^{(i)}$ is the output of the BN operation: a rescaled and shifted version of the inputs.
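As a minimal sketch of the training-time computation described by the symbols above, the following function normalizes a mini-batch feature by feature and then applies the scale and shift vectors. The function name, the list-of-lists data layout, and the default `eps=1e-5` are assumptions made for this illustration, not values taken from the thesis.

```python
import math

def batch_norm_train(batch, gamma, beta, eps=1e-5):
    """Batch Normalization forward pass on one mini-batch.

    batch: list of m_B instances, each a list of n input features.
    gamma, beta: per-feature scale and shift vectors (length n).
    eps: smoothing term (assumed value) that prevents division by zero.
    Returns the outputs z and the mini-batch mean and variance vectors.
    """
    m_b = len(batch)
    n = len(batch[0])
    # Per-feature mean and variance over the mini-batch.
    mu = [sum(x[j] for x in batch) / m_b for j in range(n)]
    var = [sum((x[j] - mu[j]) ** 2 for x in batch) / m_b for j in range(n)]
    outputs = []
    for x in batch:
        # Zero-center and normalize, then rescale and shift.
        x_hat = [(x[j] - mu[j]) / math.sqrt(var[j] + eps) for j in range(n)]
        outputs.append([gamma[j] * x_hat[j] + beta[j] for j in range(n)])
    return outputs, mu, var

batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
z, mu, var = batch_norm_train(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
print("mini-batch mean:", mu)
print("normalized outputs:", z)
```

With `gamma` set to ones and `beta` to zeros, each output feature has approximately zero mean and unit variance, regardless of the very different scales of the two input features.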

In the training phase, Batch Normalization (BN) standardizes its inputs by centering and normalizing them, followed by rescaling and shifting. During the testing phase, BN employs two additional vectors, $\mu$ (the mean of the inputs) and $\sigma$ (the standard deviation of the inputs). These are estimated during training using an exponential moving average [52] of the mini-batch statistics and are used for making predictions on new instances. Note that while $\mu$ and $\sigma$ are computed during training, they are used only after training, where they replace the batch input means and standard deviations in the BN algorithm during inference.
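The inference-time behavior can be sketched as follows: an exponential moving average tracks the mini-batch statistics during training, and the stored values are then used to normalize a single new instance. The momentum of 0.9, the `eps` value, and the function names are assumptions of this illustration.

```python
import math

def update_moving_average(moving, batch_stat, momentum=0.9):
    """Exponential moving average of a per-feature statistic (momentum is assumed)."""
    return [momentum * m + (1.0 - momentum) * b for m, b in zip(moving, batch_stat)]

def batch_norm_inference(x, moving_mean, moving_std, gamma, beta, eps=1e-5):
    """Normalize one instance with stored statistics instead of mini-batch statistics."""
    return [g * (xi - mu) / math.sqrt(sd * sd + eps) + b
            for xi, mu, sd, g, b in zip(x, moving_mean, moving_std, gamma, beta)]

# During training: fold each mini-batch's statistics into the running estimates.
moving_mean, moving_std = [0.0], [1.0]
for _ in range(200):
    moving_mean = update_moving_average(moving_mean, [2.0])   # batch mean (toy value)
    moving_std = update_moving_average(moving_std, [0.5])     # batch std (toy value)

# After training: a single new instance is normalized with the stored estimates.
z = batch_norm_inference([3.0], moving_mean, moving_std, gamma=[1.0], beta=[0.0])
print(moving_mean, moving_std, z)
```

Because the toy mini-batch statistics are constant here, the running estimates converge to them, and the new instance (one unit above the mean, with a standard deviation of 0.5) is mapped to a value close to 2.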

With Batch Normalization, the issue of vanishing gradients is mitigated to the point where saturating functions like the logistic function and even the tanh function can be used effectively. The sensitivity of the networks to weight initialization is also notably reduced. Researchers have been able to employ significantly higher learning rates, leading to a substantial acceleration of the learning process. Furthermore, Batch Normalization acts as a regularizer, reducing the need for other regularization techniques to prevent overfitting.

2.2.4 Dropout

Deep neural networks, with their thousands or even millions of parameters, have a very large parameter space. This grants them great flexibility to fit a diverse array of complex datasets. However, this flexibility also heightens the risk of overfitting the training data, necessitating regularization techniques. One such technique that has gained significant traction in deep neural network regularization is Dropout [53]. Dropout has proven to be remarkably effective, often leading to a 1-2 percent improvement in accuracy for modern neural networks. While this might not sound like a dramatic enhancement, consider that a 2% increase corresponds to a reduction in error rate of nearly 40% for a model that already has 95% accuracy (from a 5% to around a 3% error rate). The concept behind dropout is relatively straightforward: during each training iteration, every neuron (excluding output neurons) has a certain probability, denoted as $p$, of being temporarily removed. This means that the neuron's contribution is entirely disregarded during that iteration, but it may be active again in subsequent iterations (Figure 2.9). The parameter $p$ is termed the dropout rate, and it generally lies within the range of 10% to 50%. Neurons are no longer dropped during the testing or inference phase.

Figure 2.9 With dropout regularization, a random subset of all the neurons in one or more layers, with the exception of the output layer, is "dropped out" at each training iteration [39]
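A minimal sketch of the mechanism: each activation is zeroed with probability $p$ during training and passed through unchanged at inference. The rescaling of the surviving activations by $1/(1-p)$ (often called inverted dropout) is not described in the text above; it is included here as an assumption, because it is a common way to keep the expected activation the same in both phases. The function name is hypothetical.

```python
import random

def dropout(activations, p, rng, training=True):
    """Apply dropout with rate p to a list of activations.

    During training each activation is dropped (set to 0) with probability p;
    survivors are divided by (1 - p), an assumed 'inverted dropout' convention.
    At inference the activations are returned unchanged.
    """
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

rng = random.Random(42)
layer_output = [1.0] * 10000

train_out = dropout(layer_output, p=0.5, rng=rng, training=True)
test_out = dropout(layer_output, p=0.5, rng=rng, training=False)

dropped_fraction = train_out.count(0.0) / len(train_out)
print("fraction dropped during training:", dropped_fraction)
print("mean activation (train):", sum(train_out) / len(train_out))
print("mean activation (test): ", sum(test_out) / len(test_out))
```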

2.3 Convolutional Neural Network

Since the 1980s, researchers have used convolutional neural networks (CNNs) for image recognition. This development was spurred by studies of the visual cortex of the brain [54-56]. Over the years, CNNs have advanced significantly and have reached a point where they can achieve performance beyond human capabilities in some complex visual tasks. These advancements have been driven by the growth in computing power and the abundance of available training data. As a result, CNNs play a pivotal role in applications like image analysis services, self-driving vehicles, automated video classification systems, and more. Importantly, CNNs are not confined to visual perception; they have also demonstrated remarkable performance in other domains, including speech recognition and natural language processing.

2.3.1 The Architecture of the Visual Cortex

In the work presented in [56], the authors demonstrated that numerous neurons in the visual cortex have a small local receptive field. This property implies that these neurons respond only to visual stimuli within a limited region of the visual field, as illustrated in Figure 2.10, where dashed circles denote the local receptive fields of five neurons. The receptive fields of different neurons can overlap, and together they cover the entire visual field. Moreover, the researchers observed that certain neurons respond only to images of horizontal lines, while others respond only to lines with other orientations. Additionally, they identified neurons with larger receptive fields that react to more intricate patterns formed by combining lower-level patterns. These findings led to the hypothesis that higher-level neurons are based on the outputs of neighboring lower-level neurons (as illustrated in Figure 2.10, where each neuron is connected to only a subset of neurons from the previous layer). This architecture enables the detection of a wide array of complex patterns across different regions of the visual field.

Figure 2.10 As the visual signal progresses through the brain, neurons respond to more complex patterns in larger receptive fields [39]

The culmination of these insights was the introduction of the neocognitron in 1980 [57], which ultimately paved the way for the development of convolutional neural networks. A notable milestone in this progression was the LeNet-5 architecture, which was widely employed by financial institutions for classifying handwritten digits. LeNet-5 integrated several well-established building blocks, such as sigmoid activation functions and fully connected layers. However, it also introduced two novel components: convolutional layers and pooling layers.

2.3.2 Convolutional Layers

A fundamental characteristic of a CNN is that neurons in the first convolutional layer are connected only to pixels within their respective receptive fields, rather than to every pixel in the input image (as depicted in Figure 2.11). In turn, each neuron in a subsequent convolutional layer is connected only to neurons in a small local region of the previous layer. This architectural arrangement enables the network to focus on low-level features in the first hidden layers and then combine them into higher-level features in subsequent layers. This hierarchical structure mirrors the organization of visual information in real-world images, which contributes to the CNN's remarkable performance in image recognition tasks. A neuron situated at row $i$ and column $j$ of a given layer is connected to the outputs of the neurons in the previous layer located in rows $i$ to $i + f_h - 1$ and columns $j$ to $j + f_w - 1$, where $f_h$ and $f_w$ are the height and width of the receptive field (Figure 2.12). To ensure that a layer maintains the same height and width as the preceding layer, it is common to add zeros around the input. This technique is referred to as zero padding. Spacing out the receptive fields, as depicted in Figure 2.13, makes it possible to connect a large input layer to a much smaller subsequent layer. This leads to a significant reduction in the computational complexity of the model. The shift from one receptive field to the next is referred to as the stride. In the presented illustration, a 5 × 7 input layer is linked to a 3 × 4 layer using 3 × 3 receptive fields and a stride of 2, with zero padding applied. In this example the stride is the same in both directions, but it does not have to be.
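The layer sizes in this example can be checked with a small helper. The sketch below assumes the widely used "same" and "valid" padding conventions (as in TensorFlow/Keras), where "same" pads with zeros so that the output size is the input size divided by the stride, rounded up; the function name is hypothetical.

```python
import math

def conv_output_size(n_in, kernel, stride, padding):
    """Output length along one dimension of a convolutional layer.

    padding="same":  zero padding is added as needed, output = ceil(n_in / stride).
    padding="valid": no padding, output = floor((n_in - kernel) / stride) + 1.
    """
    if padding == "same":
        return math.ceil(n_in / stride)
    if padding == "valid":
        return (n_in - kernel) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")

# The 5 x 7 input from the text, with 3 x 3 receptive fields and a stride of 2.
height = conv_output_size(5, kernel=3, stride=2, padding="same")
width = conv_output_size(7, kernel=3, stride=2, padding="same")
print("with zero padding:   ", height, "x", width)

# Without zero padding the same layer would be smaller.
print("without zero padding:",
      conv_output_size(5, 3, 2, "valid"), "x", conv_output_size(7, 3, 2, "valid"))
```

Under these conventions the zero-padded case reproduces the 3 × 4 layer of the illustration.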

Figure 2.12 Relations between layers and zero padding [39]

Figure 2.13 Reducing the dimensionality of the input feature map using a stride of 2 [39]

When strides are used, a neuron located at row $i$ and column $j$ within the higher layer is connected to the outputs of neurons in the previous layer situated in rows $i \times s_h$ to $i \times s_h + f_h - 1$ and columns $j \times s_w$ to $j \times s_w + f_w - 1$, where $s_h$ and $s_w$ represent the vertical and horizontal strides, respectively, and $f_h \times f_w$ is the size of the receptive field. This mechanism of stride and receptive fields allows CNNs to efficiently capture features across different scales and positions in the input data.

The weights of a neuron can be thought of as a small image the size of the receptive field. Figure 2.14 illustrates two possible sets of weights, known as filters. The first filter is depicted as a black rectangle with a vertical white line running through its center (this filter corresponds to a 7 × 7 receptive field, where all values are 0 except for the central vertical column, which is filled with 1s). Neurons with these weights essentially focus solely on the central vertical line in their receptive field, disregarding other input values. The second filter is presented as a black area with a white horizontal line in the middle. Similarly, neurons with these weights emphasize the central horizontal line in their receptive field, filtering out the remaining information.

Consider a scenario where all neurons in a layer utilize the same vertical line filter (along with the same bias term), and the network is provided with the bottom image in Figure 2.14 (the input image). In this case, the layer's output will resemble the top-left image. The vertical white lines are accentuated, while the rest of the image becomes blurred. Similarly, if all neurons employ the same horizontal line filter, the result would be the upper-right image; here, the horizontal white lines are emphasized, and the rest becomes less distinct. Consequently, when a layer of neurons shares the same filter, it generates a feature map that highlights the regions of the input image where the filter responds most strongly.

Figure 2.14 Applying two different filters to obtain two feature maps [39]

It's important to note that filters are not manually designed; rather, the convolutional layer learns the most relevant filters automatically during training. As learning progresses through subsequent layers, these filters are combined into more complex and sophisticated patterns, allowing the network to identify intricate features and patterns in the data.

Up to this point, I have simplified the depiction of each convolutional layer's output as a 2D feature map. However, in reality, each convolutional layer consists of multiple filters and outputs one feature map per filter, so it is more accurately represented in 3D (as seen in Figure 2.15). Each filter in the convolutional layer employs different parameters and creates a distinct feature map. The receptive field of a neuron remains consistent with the description provided earlier, but it spans all the feature maps of the preceding layer. In essence, a convolutional layer applies multiple trainable filters to its inputs simultaneously, enabling it to identify various features anywhere in its inputs.

Moreover, input images often have multiple sublayers, each corresponding to a color channel, typically three (red, green, and blue). Grayscale images have just one channel, whereas certain images possess additional channels, such as satellite photos that capture diverse light frequencies, including infrared.

To elaborate, consider a specific convolutional layer denoted as $l$. The neuron located in the $i$-th row and $j$-th column of a feature map $k$ is connected to the outputs of neurons from the preceding layer $l - 1$, across all feature maps of that layer. These connections involve neurons positioned in rows ranging from $i \times s_h$ to $i \times s_h + f_h - 1$, and columns ranging from $j \times s_w$ to $j \times s_w + f_w - 1$. It's important to note that all neurons located in the same $i$-th row and $j$-th column but in different feature maps are connected to the outputs of exactly the same neurons in the previous layer.

Figure 2.15 Images with three color channels and convolutional layers with multiple feature maps [39]

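Equation (2.6) encapsulates the previously described concepts in a single mathematical expression, detailing the computation of a neuron's output within a convolutional layer. While it may look intricate because of the many indices involved, it simply computes the weighted sum of all the inputs plus the bias term.

$$z_{i,j,k} = b_k + \sum_{u=0}^{f_h-1}\sum_{v=0}^{f_w-1}\sum_{k'=0}^{f_{n'}-1} x_{i',j',k'} \cdot w_{u,v,k',k} \quad \text{with} \quad i' = i \times s_h + u, \; j' = j \times s_w + v \qquad (2.6)$$

In this equation:

• $z_{i,j,k}$ is the output of the neuron located in row $i$, column $j$ in feature map $k$ of the convolutional layer (layer $l$).

• $s_h$ and $s_w$ are the vertical and horizontal strides, $f_h$ and $f_w$ are the height and width of the receptive field, and the number of feature maps in the preceding layer (layer $l - 1$) is defined as $f_{n'}$.

• $x_{i',j',k'}$ is the output of the neuron located in layer $l - 1$, row $i \times s_h + u$, column $j \times s_w + v$, feature map $k'$.

• $b_k$ is the bias term for feature map $k$ (in layer $l$). It acts like a knob that adjusts the overall intensity of the feature map $k$.

• $w_{u,v,k',k}$ is the weight between every neuron in feature map $k$ of the layer $l$ and its input positioned at row $u$, column $v$ (relative to the neuron's receptive field) and feature map $k'$.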
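The computation of a convolutional layer's outputs can be written as a direct, unoptimized loop over the indices named in the text: rows $i \times s_h + u$, columns $j \times s_w + v$, and feature maps $k'$ of the previous layer. The sketch below assumes no zero padding, a nested-list data layout, and a hypothetical function name; it is meant to make the indexing concrete, not to be efficient.

```python
def conv2d(x, w, b, s_h=1, s_w=1):
    """Naive convolutional layer (no padding).

    x: input of shape [height][width][f_n_prev]  (feature maps of layer l-1)
    w: weights of shape [f_h][f_w][f_n_prev][f_n] (f_n feature maps in layer l)
    b: one bias term per output feature map
    Returns z of shape [out_h][out_w][f_n].
    """
    f_h, f_w = len(w), len(w[0])
    f_n_prev, f_n = len(w[0][0]), len(w[0][0][0])
    out_h = (len(x) - f_h) // s_h + 1
    out_w = (len(x[0]) - f_w) // s_w + 1
    # Start every output at its feature map's bias term.
    z = [[[b[k] for k in range(f_n)] for _ in range(out_w)] for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            for k in range(f_n):
                for u in range(f_h):
                    for v in range(f_w):
                        for kp in range(f_n_prev):
                            z[i][j][k] += x[i * s_h + u][j * s_w + v][kp] * w[u][v][kp][k]
    return z

# A 5 x 5 single-channel image containing one vertical white line (column 2).
image = [[[1.0 if col == 2 else 0.0] for col in range(5)] for _ in range(5)]
# A 3 x 3 vertical-line filter: zeros except for the central column of ones.
vertical = [[[[1.0 if v == 1 else 0.0]] for v in range(3)] for _ in range(3)]

feature_map = conv2d(image, vertical, b=[0.0])
print([[cell[0] for cell in row] for row in feature_map])
```

The resulting feature map responds strongly only where the receptive field is centered on the line, which mirrors the vertical-line filter discussion earlier in this subsection.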

2.3.3 Pooling Layers

The purpose of the pooling layer is to downsample the input image, effectively reducing computational load, memory usage, and the number of parameters, which helps mitigate the risk of overfitting. Similar to convolutional layers, each neuron in the pooling layer is connected to the outputs of a limited number of neurons in the previous layer, located within a small rectangular receptive field. As before, the size of the receptive field, the stride, and whether zero padding is used need to be specified. Unlike convolutional layers, pooling neurons have no weights; instead, they aggregate the inputs using functions like max or mean.

Currently, the most common type of pooling layer is the max pooling layer, illustrated in Figure 2.16. In this case, a pooling kernel of size 2 × 2 with a stride of 2 and no padding is employed. In max pooling, only the maximum input value within each receptive field is passed on to the next layer, while the other inputs are discarded. In the example from Figure 2.16, the input values in the lower left receptive field are 1, 3, 5, and 2; hence, only the maximum value, 5, is propagated to the subsequent layer. Due to the stride of 2, the output image's width and height are halved compared to the input image.

Figure 2.16 Max pooling layer (2 × 2 pooling kernel, stride of 2, no padding) [39]

A max pooling layer not only reduces computations, memory usage, and the number of parameters, but it also introduces a degree of invariance to minor translations, as depicted in Figure 2.17. This can be observed by looking at the three images (A, B, C) in the figure, which undergo max pooling with a 2 × 2 kernel, a stride of 2, and no zero padding. Images B and C are identical to image A but shifted to the right by one and two pixels, respectively. The outputs of the max pooling layer for images A and B are the same, which is what translation invariance means. The output for image C is different: it is shifted by one pixel to the right, although 75% of it is still unchanged. It's possible to achieve a certain level of translation invariance on a larger scale by incorporating max pooling layers at intervals within a CNN. Additionally, max pooling provides a limited amount of rotational and scale invariance. In scenarios where predictions don't rely heavily on these variations, such as in classification tasks, such invariance (even if limited) can be advantageous.

Figure 2.17 Invariance to small translations [39]
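Max pooling and its tolerance to small shifts can be illustrated with a short sketch. The function name and the toy images are assumptions of this example; the receptive-field values 1, 3, 5, 2 are the ones quoted in the text.

```python
def max_pool2d(image, size=2, stride=2):
    """Max pooling without padding on a 2D image given as a list of rows."""
    out_h = (len(image) - size) // stride + 1
    out_w = (len(image[0]) - size) // stride + 1
    return [[max(image[i * stride + u][j * stride + v]
                 for u in range(size) for v in range(size))
             for j in range(out_w)]
            for i in range(out_h)]

# A single 2 x 2 receptive field holding 1, 3, 5, 2: only the maximum survives.
print(max_pool2d([[1, 3], [5, 2]]))

# Three toy images: B and C are A shifted right by one and two pixels.
image_a = [[9, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]
image_b = [[0, 9, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]
image_c = [[0, 0, 9, 0, 0, 0], [0, 0, 0, 0, 0, 0]]

pooled_a = max_pool2d(image_a)
pooled_b = max_pool2d(image_b)
pooled_c = max_pool2d(image_c)
print(pooled_a, pooled_b, pooled_c)
```

In this toy case the one-pixel shift leaves the pooled output unchanged, while the two-pixel shift moves the response by one position in the pooled output.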

However, max pooling does come with certain drawbacks. It is destructive: even with a small 2 × 2 kernel and a stride of 2, the output is halved in both dimensions, so its area shrinks to 25% of the input and 75% of the input values are dropped. Moreover, invariance isn't desirable in all applications. Take semantic segmentation, for instance: if the input image is shifted by a pixel to the right, the output image should also shift by one pixel to maintain consistency. Similarly, in cases like pixel-wise image classification, where the goal is to assign each pixel to a specific class, equivariance rather than invariance is the goal: a small change in the inputs should lead to a corresponding small change in the outputs.

2.3.4 Transposed Convolutional Layers in semantic segmentation

In semantic segmentation, each pixel is assigned a category based on the type of object it belongs to, as illustrated in Figure 2.18. Notably, objects of the same class are not distinguished from one another. For instance, all cars are grouped together as a single large pixel region on the right side of the segmented image. The primary challenge in this task arises from the fact that, as images pass through conventional CNNs, their spatial resolution gradually diminishes due to layers with strides larger than 1. As a result, a typical CNN might recognize that a person is located somewhere on the left side of the image, but it would lack precise localization.

To restore the resolution, a Transposed Convolutional Layer [59] is a preferred choice. This layer can be thought of as expanding the image by adding empty rows and columns (filled with zeros) and then applying a regular convolutional layer (as shown in Figure 2.19). Some refer to it as a convolutional layer with fractional strides. The Transposed Convolutional Layer can be configured to perform linear interpolation, but its advantage lies in being trainable, so it can learn to do better during training. Unlike in pooling or convolutional layers, the stride in a transposed convolutional layer determines how much the input is expanded, so a larger stride produces a larger output; this is what makes the layer suitable for increasing the resolution of feature maps.
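The two views of a transposed convolution mentioned above can be compared on a one-dimensional toy example. In this sketch, `transposed_conv1d` spreads each input value over the output through the kernel, while `stretch_then_convolve` inserts zeros between the inputs and then applies a regular convolution; both function names, the 1D setting, and the fixed kernel are assumptions of this illustration (in a real layer the kernel would be learned).

```python
def transposed_conv1d(x, w, stride=2):
    """Transposed convolution: each input value adds a scaled copy of the kernel."""
    out = [0.0] * ((len(x) - 1) * stride + len(w))
    for i, x_i in enumerate(x):
        for u, w_u in enumerate(w):
            out[i * stride + u] += x_i * w_u
    return out

def stretch_then_convolve(x, w, stride=2):
    """Equivalent view: insert zeros between inputs, pad, then convolve normally."""
    stretched = []
    for i, x_i in enumerate(x):
        stretched.append(x_i)
        if i < len(x) - 1:
            stretched.extend([0.0] * (stride - 1))   # empty (zero) positions
    k = len(w)
    padded = [0.0] * (k - 1) + stretched + [0.0] * (k - 1)
    flipped = w[::-1]                                  # regular convolution flips the kernel
    return [sum(padded[n + u] * flipped[u] for u in range(k))
            for n in range(len(padded) - k + 1)]

signal = [1.0, 2.0, 3.0]
kernel = [0.5, 1.0, 0.5]      # with stride 2 this kernel interpolates linearly

upsampled = transposed_conv1d(signal, kernel, stride=2)
print(upsampled)
print(stretch_then_convolve(signal, kernel, stride=2))
```

With this particular kernel the interior of the output contains the original samples with their midpoints in between, which is the linear-interpolation configuration referred to in the text.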


2.3.5 Skip Layer

Utilizing Transposed Convolutional Layers is a viable approach to increasing the size of feature maps, but it may still lack precision. To address this, skip connections from lower layers are introduced: for example, the output image is upsampled by a factor of 2 (rather than 32), and the output of a lower layer that has this doubled resolution is added to it. Subsequently, the result is upsampled by a factor of 16, achieving a total upsampling factor of 32 (as depicted in Figure 2.20). This helps in recovering some of the spatial resolution lost in earlier pooling layers. To retrieve even finer details from an even lower layer, the architecture is enhanced by a second skip connection. In summary, the output of the initial CNN is upsampled by a factor of 2, followed by the addition of a lower layer output (at the corresponding scale), then upsampled again by a factor of 2. This is followed by adding the output of an even lower layer and a final upsampling by a factor of 8. Additionally, this technique can even be applied to scale up beyond the size of the original image, a process known as super-resolution.
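The sequence of upsampling and addition steps can be traced on tiny feature maps. This sketch substitutes nearest-neighbor upsampling for the trainable transposed convolutional layers, purely to keep the example short; the function names, the 1 × 1 starting map, and the numeric values are all hypothetical.

```python
def upsample(feature_map, factor):
    """Nearest-neighbor upsampling of a 2D feature map (stand-in for a trainable layer)."""
    out = []
    for row in feature_map:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

def add(a, b):
    """Element-wise addition of two feature maps of identical size (the skip connection)."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

coarse = [[1.0]]                              # CNN output at 1/32 of the input size
skip_16 = [[0.1, 0.2], [0.3, 0.4]]            # lower layer at 1/16 of the input size
skip_8 = [[0.0] * 4 for _ in range(4)]        # even lower layer at 1/8 of the input size
skip_8[0][0] = 0.5

x = add(upsample(coarse, 2), skip_16)         # upsample x2, add first skip connection
x = add(upsample(x, 2), skip_8)               # upsample x2, add second skip connection
output = upsample(x, 8)                       # final upsampling x8: total factor 32

print(len(output), "x", len(output[0]))
```

The final map is 32 times larger than the coarse one in each dimension, and the finer values contributed by the two lower layers remain visible in it.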

Figure 2.20 Spatial resolution from lower layers is recovered by skip layers [39]

2.4 U-Net based architectures

The U-Net architecture [17] has played a pivotal role in shaping the landscape of deep learning-based image segmentation. In automated medical image segmentation, significant efforts have been directed towards refining and advancing the U-Net framework. Notably, attention-based methodologies have attracted considerable interest due to their efficacy in segmenting intricate features in biomedical images across diverse imaging modalities.

One such adaptation is the Residual Attention U-Net [60], which uses the soft attention mechanism to bolster the network's ability to discern a comprehensive spectrum of COVID-19 effects within chest CT scans. For the purpose of lung segmentation in chest X-rays, the XLSor approach [61] employs the criss-cross attention block to aggregate long-range contextual information. The MultiResUNet was introduced by [62]. This model was subjected to comparative evaluations against the original U-Net across various medical image segmentation datasets. The findings revealed that the MultiResUNet consistently outperformed the standard U-Net in terms of segmentation results.

Another notable innovation, Attention U-Net [63], seeks to enhance the feature learning capability of the U-Net by integrating an attention gate. This attention mechanism suppresses responses to irrelevant information and accentuates critical information, thereby enhancing prediction accuracy and model sensitivity.

However, while the application of attention mechanisms is promising, their direct use may compromise the extraction of underlying feature representations. This concern is particularly relevant when the assessment of the region of interest is flawed, leading to suboptimal network performance.

2.4.1 Attention mechanism

Attention mechanisms have been used successfully in many computer vision tasks. SENet [64], with its core Squeeze-and-Excitation (SE) module, aggregates information globally, captures channel-wise relationships, and enhances the output representation. GSoP-Net [65] has been proposed to improve the Squeeze stage by utilizing a global second-order pooling block to model high-order statistics while accumulating features globally.

2.4.2 Compact Dilation Convolution-based Module (CDCM)

In PiDiNet [32], CDCMs have been designed to refine feature maps, beginning from the end of each stage. Each CDCM takes an input of n × C channels to enrich multi-scale edge information and produces an output feature map of M (M < C) channels, which relieves the computation overhead. Furthermore, each CDCM is followed by a Compact Spatial Attention Module (CSAM) [32] to curtail the background noise.

Figure 2.21 CDCM Module [32]

Title	Research on application of deep learning approach in multiclass segmentation for medical images
Author	Trinh Minh Nhat
Supervisors	Assoc. Prof. PhD. Van-Truong Pham, Assoc. Prof. Thi-Thao Tran
University	Hanoi University of Science and Technology
Major	Control Engineering and Automation
Type	Thesis
Year of publication	2023
City	Ha Noi

Format
Number of pages	72
File size	4.79 MB