HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Control Engineering and Automation
Supervisor: Assoc Prof PhD Van-Truong Pham
Advisor Signature
School: School of Electrical and Electronic Engineering
HA NOI, 9/2023
ACKNOWLEDGEMENTS
I would like to take a moment to express my gratitude to the individuals who have been instrumental in shaping the trajectory of my academic journey and the completion of this report. Assoc. Prof. Van-Truong Pham, my mentor and guide, has been a constant source of inspiration. His dedication to my growth as a researcher and his insightful guidance have been invaluable. His willingness to share his expertise and time has been instrumental in helping me navigate through challenges and overcome obstacles.
I also want to extend my heartfelt appreciation to Assoc. Prof. Thi-Thao Tran, whose contributions have left a lasting impact on my work. Although she was not my primary advisor, her constructive feedback, suggestions, and discussions have added depth and perspective to my research. Her commitment to fostering a culture of learning and exploration has been immensely beneficial.
In the broader context, I am deeply thankful for the unwavering support of my family. Their belief in my capabilities and their encouragement during both the highs and lows have been my driving force. Their sacrifices and unwavering faith have kept me motivated, and I am indebted to them for their constant presence in my life.
Lastly, I would like to acknowledge the academic community, my peers, and fellow researchers who have provided valuable insights and diverse perspectives that have enriched my work. The exchange of ideas and collaborative discussions have shaped my understanding and contributed to the depth of this report.
As I reflect upon this journey, I am filled with gratitude for the people who have contributed to this report's fruition. Their collective efforts have not only facilitated the completion of this project but have also nurtured my growth as a learner and researcher.
ABSTRACT
In recent times, the prominence of deep learning-based techniques for medical image segmentation has surged. These approaches primarily revolve around innovating architectural designs and refining loss functions. Conventional loss functions in this context often rely on global measures, such as Cross-Entropy and Dice Loss, or overall image intensity, yet they may fall short in addressing complexities like occlusion and intensity variations. In response, this study introduces an original loss function, melding both local and global image features, reformulated within the Mumford-Shah framework. This novel approach is extended to the domain of multiclass segmentation. The proposed deep convolutional neural network leverages this new loss function to facilitate end-to-end training while concurrently achieving multi-class segmentation. Furthermore, motivated by the PiDiNet architecture, I propose a new Attention-PiDi-UNet architecture. This augmentation empowers the model to fuse contextual information across dense layers, efficiently capture semantic insights, and avert overfitting, resulting in precise segmentation outcomes. The proposed approach is rigorously assessed across four distinct biomedical segmentation datasets encompassing various imaging modalities, spanning 2D to 3D dimensions, including dermoscopy, cardiac magnetic resonance, and brain magnetic resonance. Evaluation results on datasets like Lesion Boundary Segmentation, the dermoscopic dataset, automated cardiac diagnosis, and 6-month infant brain MRI Segmentation corroborate the algorithm's superior performance compared to existing state-of-the-art methods. This robustly underscores the potency of my multiclass segmentation approach for diverse biomedical images.
Student's signature
LIST OF FIGURES
Figure 2.1 Threshold logic unit: an artificial neuron applies a step function after calculating the weighted sum of its inputs [39]
Figure 2.2 Perceptron architecture with two input neurons, one bias neuron, and three output neurons
Figure 2.3 Multilayer Perceptron architecture with two inputs, four neurons in one hidden layer, and three neurons in the output layer [39]
Figure 2.4 Hidden layers in Deep Neural Network
Figure 2.5 Logistic activation function saturation [39]
Figure 2.6 ReLU activation
Figure 2.7 Swish activation
Figure 2.8 Mish activation
Figure 2.9 With dropout regularization, a random set of all the neurons is "dropped out" in each training iteration in one or more layers, with the exception of the output layer [39]
Figure 2.10 As the visual signal progresses through the brain, neurons respond to more complex patterns in larger receptive fields [39]
Figure 2.11 Square local receptive fields in CNN layers [39]
Figure 2.12 Relations between layers and zero padding [39]
Figure 2.13 Reducing the dimensionality of the input feature map using a stride with [...]
Figure 2.14 Two different filters are applied to get two other feature maps
Figure 2.15 Three color channels images and convolutional layers with many feature maps [39]
Figure 2.16 Max pooling layer with 2 x 2 pooling kernel, no padding and step [...]
Figure 2.17 Invariance to small translations [39]
Figure 2.18 Example of semantic segmentation [39]
Figure 4.1 The representative segmentation results of my method on different skin lesion sizes from my testing set in the ISIC-2018 dataset
Figure 4.2 Representative results in the PH2 dataset
Figure 4.3 The representative result of the right ventricle (yellow), myocardium (green), and left ventricle (blue) of three examples using my method on the ACDC 2017 challenge
Figure 4.4 The representative result on various slices of testing sample IDs TT, T6, and T7, respectively. The T1-weighted image, the T2-weighted image, and my segmentation result are shown from left to right, respectively
Figure 4.5 The learning curves of the proposed method when training images from four databases in terms of average DSC of classes: (a) the ISIC-2018 dataset, (b) the PH2 dataset, (c) the ACDC dataset, (d) the iSeg-2017 challenge
LIST OF TABLES
Table 4.1 Comparison with other popular approaches on the ISIC 2018 dataset. Results have been taken from [...] except for the last four methods
Table 4.2 Comparison with other popular approaches on the PH2 dataset. Results have been taken from [81] except for the last four methods
Table 4.3 Comparison with other popular approaches on the ACDC dataset. DSCs on RV, Myo, LV and the average DSC have been calculated. Results [...] except for the last four methods
Table 4.4 The DSC, MHD, ASD, and average metrics of segmented classes in the validation dataset of F out of top 8 teams in [3] of the iSeg-2017 challenge and my proposed approach (MHD: mm, ASD: mm)
Table 4.5 Comparison with other loss functions in DSC on the three datasets
CNN Convolutional Neural Networks
FCN Fully Convolutional Networks
CDCM Compact Dilation Convolution-based Module
CSAM Compact Spatial Attention Module
MSE Mean Squared Error
DSC Dice Similarity Coefficient
IoU Intersection-over-Union
ASD Average Surface Distance
MHD Modified Hausdorff Distance
iSeg 6-month Infant Brain MRI Segmentation
ACDC Automated Cardiac Diagnosis Challenge
RV Right Ventricle
LV Left Ventricle
$\mu_B$ The mean of the input vector, evaluated over the entire mini-batch $B$
On Inertial moment around the yaw axis
$\sigma_B$ The standard deviation of the input vector
$m_B$ The number of instances in the mini-batch
$\hat{x}^{(i)}$ The normalized input for instance $i$
$\otimes$ Element-wise multiplication
$\beta$ The output shift (offset) parameter vector for the layer
$\epsilon$ Small number which prevents division by zero (commonly $10^{-7}$)
$z^{(i)}$ The output of the Batch Normalization operation
$E_i$ Output of the $i$-th encoder block
$D_i$ Output of the $i$-th decoder block
$H_i$ Height of the $i$-th output feature map
$W_i$ Width of the $i$-th output feature map
$\alpha_i$ Additional weight for class $i$
$\Omega$ Spatial domain of the image
$N$ Number of segmentation classes
$\theta$ Trainable parameters of the CNN
$P_{n,v}(\theta)$ Softmax output for the $v$-th pixel value of class $n$
$T$ One-hot vector of the ground truth
CHAPTER 1 INTRODUCTION
1.1 Motivation for Participating in Medical Image Segmentation Challenges
Medical image segmentation challenges provide a unique opportunity for researchers and practitioners to address critical problems in healthcare through the development of advanced computational techniques. By participating in these challenges, participants aim to contribute to the improvement of diagnosis, treatment planning, and patient care. In this section, I discuss the motivations behind participating in four specific medical image segmentation challenges: the Lesion Boundary Segmentation challenge on the ISIC-2018 dataset, the dermoscopic PH2 database, the 2017 MICCAI sub-challenge on automatic cardiac diagnosis benchmark, and the 6-month infant brain MRI Segmentation (iSeg) benchmark.
Skin cancer is a prevalent and potentially deadly condition that demands early and accurate detection. The ISIC-2018 challenge [1][2] and the PH2 challenge [3] focus on segmenting skin lesions, aiming to improve the accuracy and efficiency of diagnosis. The motivation to participate in these challenges arises from the urgent need to develop automated segmentation methods that can assist dermatologists in identifying and diagnosing skin cancer. Successful segmentation of lesion boundaries can enable more accurate diagnosis and early intervention, ultimately enhancing patient outcomes. Participating in these challenges provides an opportunity to contribute to dermatological research, develop advanced segmentation techniques, and potentially revolutionize skin cancer diagnosis.
Cardiovascular diseases are a leading cause of mortality globally. The MICCAI sub-challenge on automatic cardiac diagnosis [4] addresses the need for accurate cardiac segmentation to aid in diagnosing heart conditions. The motivation to participate in this challenge stems from the potential to advance cardiac imaging and diagnosis through automated segmentation methods. Precise segmentation of cardiac structures can assist cardiologists in assessing heart function, identifying anomalies, and guiding treatment decisions. By participating in this challenge, researchers can collaborate with experts in cardiology, contribute to cutting-edge medical research, and develop solutions that have a direct impact on patient care.
Segmentation of infant brain MRI scans is crucial for studying early brain development and identifying abnormalities. The iSeg benchmark challenge [5] focuses on accurate segmentation of infant brain structures, aiding in early detection of neurological disorders. The motivation to participate in this challenge lies in the potential to contribute to pediatric neuroimaging and improve the understanding of infant brain development. Accurate segmentation of brain structures can assist clinicians and researchers in diagnosing conditions and monitoring developmental milestones. Participation in the iSeg benchmark offers the chance to advance pediatric imaging, collaborate with experts in the field, and create tools that facilitate early intervention and improved patient outcomes.
In conclusion, participating in medical image segmentation challenges provides a unique avenue to address critical healthcare challenges. The motivations behind participating in these challenges range from improving diagnosis accuracy and treatment planning to collaborating with experts and contributing to cutting-edge medical research. These challenges offer a platform for researchers to develop and showcase innovative solutions that have the potential to revolutionize healthcare practices and enhance patient care.
1.2 Advancements in Medical Image Segmentation and Innovative Approaches
Image segmentation is a pivotal and challenging topic in the field of computer vision [6]. Its objective is to partition an image in a way that accurately locates, identifies, and quantifies objects. This process holds crucial importance in medical imaging, supporting additional clinical analysis, diagnosis, therapy planning, and disease progression measurement. Within the domain of medical image segmentation, several primary obstacles exist. These include a scarcity of well-labeled benchmarks for training, a deficiency of annotated images [7], a lack of consistent segmentation techniques, poor image resolution, and significant variability in image quality across patients [8]. Precise calculation of segmentation accuracy and uncertainty is vital for gauging performance in other applications [9]. Consequently, this underscores the imperative for advanced methodologies, such as Artificial Intelligence (AI)-based approaches, to enable automated, generalizable, and efficient medical image segmentation.
In the context of developing AI systems, the attributes of generalization and robustness bear critical significance, particularly in clinical trials [10]. Consequently, the development of a resilient architecture suited for diverse biomedical applications becomes paramount. Recently, convolutional neural networks (CNNs) have emerged as advanced tools for automating the segmentation of medical images [11][13]. This includes various modalities such as X-rays, CT scans, and MRIs, with promising outcomes compared to conventional segmentation methods [14][15]. Among different CNN versions, encoder-decoder networks like Fully Convolutional Networks (FCN) [16] and their advancements such as U-Net [17] have gained substantial traction as semantic segmentation techniques for 2D images. A deep fully convolutional neural network designed for semantic pixel-wise segmentation that requires fewer trainable parameters yet yields high-quality segmentation maps was introduced by [8]. Addressing dense prediction challenges, a novel convolutional network module was proposed by [19]. This module utilized dilated convolutions to systematically aggregate multi-scale contextual features, resulting in a significant performance enhancement for advanced automated segmentation systems. Moreover, [20] introduced DeepLab as a segmentation method. DeepLabV3 [21], without DenseCRF fine-tuning, demonstrated considerable improvements over earlier DeepLab iterations, utilizing a synthetic approach with fewer convolutional layers than FCN and U-Net architectures, along with skip connections between the encoder and decoder paths. An efficient scene parsing network for comprehending complex receptive fields was proposed by [22]. This approach utilized global pyramidal characteristics to facilitate the acquisition of additional contextual information.
Throughout the training process, CNN model parameters are typically refined using gradient descent techniques, as outlined by [23], wherein errors are quantified by a loss function that contrasts predicted labels against ground truth labels. For classification endeavors, prevalent loss functions encompass cross-entropy (CE) and the L2 norm, often referred to as the mean squared error (MSE), as frequently cited in the works [24][25]. Conversely, problems centered on segmentation have commonly engaged the Dice Coefficient (DC) and cross-entropy (CE) [17][26]. Despite the recent strides made in CNN deployment for biomedical image segmentation, prevalent loss functions frequently revolve around pixel-wise similarity evaluation. Notably, the DC and CE are tailored towards specific region feature extraction. While this framework often yields impressive classification and segmentation outcomes, low loss function values do not always signify meaningful segmentation. Instances arise where noisy images produce several indistinct contours, signaling erroneous predictions, and the indistinctness of object boundaries stems from the difficulty in classifying pixels near the contour. An additional challenge arises from susceptibility to local minima due to aberrations within the training database, high dimensionality, and the non-convex attributes of loss functions, as illuminated by [27].
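The two segmentation losses named in this paragraph can be sketched as follows. This is a minimal illustration, assuming a binary mask flattened to a list and per-pixel foreground probabilities; the function names and the smoothing constant `eps` are assumptions of this example, not values taken from this report.

```python
import math

def dice_coefficient(pred, target, eps=1e-7):
    """Soft Dice coefficient between predicted probabilities and a binary mask."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)

def dice_loss(pred, target, eps=1e-7):
    """Dice loss: 0 for a perfect overlap, close to 1 for no overlap."""
    return 1.0 - dice_coefficient(pred, target, eps)

def binary_cross_entropy(pred, target, eps=1e-7):
    """Pixel-wise cross-entropy averaged over all pixels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)

# Illustrative 4-pixel example (hypothetical values).
target = [0, 0, 1, 1]
good = [0.1, 0.2, 0.9, 0.8]   # mostly agrees with the mask
bad = [0.9, 0.8, 0.1, 0.2]    # mostly disagrees with the mask

print(dice_loss(good, target), dice_loss(bad, target))
print(binary_cross_entropy(good, target), binary_cross_entropy(bad, target))
```

Both quantities are computed purely from pixel-wise agreement, which is why neither of them directly penalizes a poorly localized contour.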
Among common deep-CNN approaches, the fully convolutional network (FCN) [28] and U-Net [17] replace fully connected layers with deconvolutional operations to strengthen temporal coherence; skip connections are also used to carry spatial information into deeper layers. Depthwise separable convolution is defined as a depthwise convolution followed by a pointwise convolution, which helps prevent the model from overfitting by reducing the number of connections in the model.
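To make the reduction in connections concrete, the following sketch counts the weights of a standard convolution and of a depthwise convolution followed by a pointwise convolution. The layer sizes are illustrative assumptions, not the sizes used in the proposed model, and bias terms are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights of a depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise

# Hypothetical layer: 3 x 3 kernel, 64 input channels, 128 output channels.
k, c_in, c_out = 3, 64, 128
standard = standard_conv_params(k, c_in, c_out)
separable = depthwise_separable_params(k, c_in, c_out)
print(standard, separable, standard / separable)
```

For these sizes the separable form needs 8,768 weights instead of 73,728, roughly an eightfold reduction.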
Dilated convolution [30] expands the window size without increasing the number of weights by inserting zero values into the convolution kernels, while maintaining the computation cost. Adaptive Dilated Convolution [31] generates and fuses multi-scale features of similar spatial sizes by setting various dilation rates for different channels. Applying dilated convolution, the Compact Dilation Convolution-based Module (CDCM) is adopted in my proposed model to extract more useful features.
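A one-dimensional sketch shows how dilation widens the window while the number of weights stays fixed. The function below is a hypothetical helper written for this illustration (no padding, stride 1); it skips input positions rather than storing explicit zeros in the kernel, which is equivalent.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (no padding) 1D cross-correlation with a dilated kernel."""
    span = (len(kernel) - 1) * dilation + 1  # input window covered by one output
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(w * signal[start + j * dilation]
                       for j, w in enumerate(kernel)))
    return out

signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]  # three weights in both cases

print(dilated_conv1d(signal, kernel, dilation=1))  # window of 3 samples
print(dilated_conv1d(signal, kernel, dilation=2))  # window of 5 samples
```

With `dilation=2` the same three weights cover five input samples, so the receptive field grows without adding parameters.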
Region-based Tversky loss [33] and Focal Tversky loss [34] control the information flow implicitly through pixel-level affinity and tackle class-imbalanced problems; however, their contour optimization processes are not good enough. There has been ongoing interest in exploiting active contour models as loss functions in deep-learning solutions for better contour optimization. The region-based active contour Chan-Vese model [35] has been successful for training images with two regions, each having a different mean pixel intensity. The LMS loss [36] is obtained by inheriting the advantages of the Mumford-Shah functional and the AC loss with some adjustments. To meet the requirements for boundary optimization and to address the class-imbalanced problem, I propose a new Focal Active Contour loss function.
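For orientation, one common form of the Tversky and Focal Tversky losses mentioned above is sketched below. This is not the Focal Active Contour loss proposed in this report. The weighting convention (`alpha` on false negatives, `beta` on false positives), the exponent form, and all default values are assumptions of this example; published formulations differ in these details.

```python
def tversky_index(pred, target, alpha=0.7, beta=0.3, eps=1e-7):
    """Tversky index; alpha weights false negatives, beta false positives."""
    tp = sum(p * t for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    return (tp + eps) / (tp + alpha * fn + beta * fp + eps)

def focal_tversky_loss(pred, target, alpha=0.7, beta=0.3, gamma=0.75, eps=1e-7):
    """Focal variant: (1 - TI) raised to an exponent gamma (assumed form)."""
    return (1.0 - tversky_index(pred, target, alpha, beta, eps)) ** gamma

# Illustrative 4-pixel example (hypothetical values).
target = [0, 0, 1, 1]
good = [0.1, 0.2, 0.9, 0.8]
bad = [0.9, 0.8, 0.1, 0.2]

print(focal_tversky_loss(good, target), focal_tversky_loss(bad, target))
```

With `alpha = beta = 0.5` the Tversky index reduces to the Dice coefficient, which is why it is often described as a generalization of Dice that can weight missed foreground pixels more heavily.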
This study yields several noteworthy contributions:
+ Innovative Loss Function: I introduce a novel loss function tailored for the training process of deep-learning models. By incorporating elements of active contour methodology into the loss functions, I aim to tackle a persistent challenge encountered in medical imaging and computer vision: the problem of intensity inhomogeneity within image data. This amalgamation of techniques offers a promising avenue to address this issue effectively. It not only helps deep-learning models achieve more accurate and robust segmentation results but also paves the way for more precise and reliable image analysis across various applications, ultimately advancing the capabilities of AI-driven solutions in the field.
+ End-to-End CNN Model Development: Inspired by PiDiNet, I propose a new architecture by modifying this network from an FCN shape into a U-Net shape, using CDCM modules (without the subsequent CSAM) combined with an Attention module and a Depthwise-and-Pointwise module.
+ Thorough Evaluation and Comparison: A comprehensive evaluation of both my proposed model and the introduced loss function is conducted across 2D and 3D datasets. These evaluations are benchmarked against existing state-of-the-art methods. Notably, my approach consistently demonstrates promising outcomes when compared to baseline algorithms. This observation is substantiated across diverse datasets including the Lesion Boundary Segmentation ISIC-2018 dataset, the dermoscopic PH2 dataset, the 2017 MICCAI sub-challenge on automatic cardiac diagnosis benchmark, and the 6-month infant brain MRI Segmentation (iSeg) benchmark.
CHAPTER 2 THEORETICAL BASIS
2.1 Artificial Neural Networks
Deep learning is an important machine learning technique. It teaches a computer to filter inputs through layers in order to predict and categorize data. Observations may take the form of images, text, or sound. Deep learning is inspired by the way the human brain filters information, and its aim is to imitate how the brain works. There are about 100 billion neurons in the human brain, and a single neuron interacts with approximately 100,000 of its peers. That is what I am attempting to build, albeit in a computational form. A neuron (or node) receives one or more signals (input values) and transmits an output signal. This information is broken down into numbers and bits of binary data that a computer can understand.
What about synapses? Each synapse is assigned a weight, and these weights are central to Artificial Neural Networks (ANNs): they are how ANNs learn. By adjusting the weights during training, the ANN determines to what degree each signal is passed along.
Several decades ago, McCulloch and Pitts proposed a very simple model of the biological neuron [37], later called an artificial neuron, which has one or more binary (on/off) inputs and one binary output. The artificial neuron activates its output when more than a certain number of its inputs are active. They demonstrated in their paper that even with such a simplistic model, a network of artificial neurons can be built to compute any logical proposition.
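As a small illustration of how such a neuron computes logical propositions, the sketch below implements AND and OR with a single threshold neuron each. The function names and the "at least `threshold` active inputs" convention are assumptions chosen for this example.

```python
def mp_neuron(inputs, threshold):
    """Fires (returns 1) when at least `threshold` binary inputs are active."""
    return 1 if sum(inputs) >= threshold else 0

def logical_and(a, b):
    # Both inputs must be active for the neuron to fire.
    return mp_neuron([a, b], threshold=2)

def logical_or(a, b):
    # A single active input is enough for the neuron to fire.
    return mp_neuron([a, b], threshold=1)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, logical_and(a, b), logical_or(a, b))
```

Only the threshold changes between the two gates; more complex propositions are obtained by feeding the outputs of such neurons into further neurons.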
The Perceptron, one of the most basic ANN architectures, was created by Frank Rosenblatt [38]. It is based on a marginally different artificial neuron (Figure 2.1) called a threshold logic unit (TLU), or sometimes a linear threshold unit (LTU). The inputs and outputs are now numbers (rather than binary on/off values), and each input connection has a weight assigned to it. The TLU calculates a weighted sum of its inputs ($z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = x^T w$), then applies a step function to that sum and returns the result: $h_w(x) = \mathrm{step}(z)$.
A Perceptron comprises a layer of Threshold Logic Units (TLUs), each connected to all the inputs. A layer is called a fully connected layer, or a dense layer, when each of its neurons is connected to every neuron in the preceding layer. The Perceptron's inputs are fed to input neurons, which serve as pass-through units that directly output the input they receive. Together, these input neurons constitute the input layer. An additional bias term ($x_0 = 1$) is commonly added, typically through a special neuron known as a bias neuron, which always outputs 1. Figure 2.2 illustrates this setup with a Perceptron that has two inputs and three outputs. In this case, the Perceptron functions as a multi-output classifier, concurrently categorizing instances into three distinct binary classes.

Figure 2.1 Threshold logic unit: an artificial neuron applies a step function after calculating the weighted sum of its inputs [39]

Perceptrons are trained using rules that take the network's prediction error into account. The Perceptron learning rule reinforces the connections that help reduce the error. In greater detail, the Perceptron is fed one training instance at a time, and it makes a prediction for each instance. For every output neuron that produces an incorrect prediction, the connection weights from the inputs that would have contributed to the correct prediction are reinforced. This rule is given by Equation 2.1.
$$w_{i,j}^{(\text{next step})} = w_{i,j} + \eta\,(y_j - \hat{y}_j)\,x_i \qquad (2.1)$$
In this equation:
+ $w_{i,j}$ is the weight linking the $i$-th input neuron and the $j$-th output neuron.
+ $x_i$ is the $i$-th input value of the current training sample.
+ $y_j$ is the target output of the $j$-th output neuron for the current training sample.
+ $\hat{y}_j$ is the output of the $j$-th output neuron for the current training instance.
+ $\eta$ denotes the learning rate during training (typically adjusted as needed).
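The Perceptron learning rule can be sketched in a few lines. This is a minimal illustration, assuming a step function that fires at $z \ge 0$, zero-initialized weights, a bias input $x_0 = 1$ stored at index 0, and a toy dataset with two binary targets (logical AND and OR); all names and values are assumptions of this example.

```python
def step(z):
    return 1 if z >= 0 else 0

def predict(weights, x):
    """weights[i][j] links input i (index 0 is the bias input x0 = 1) to output j."""
    xb = [1.0] + list(x)
    n_out = len(weights[0])
    return [step(sum(weights[i][j] * xb[i] for i in range(len(xb))))
            for j in range(n_out)]

def train_perceptron(samples, targets, eta=1.0, n_epochs=50):
    n_in = len(samples[0]) + 1          # +1 for the bias input
    n_out = len(targets[0])
    weights = [[0.0] * n_out for _ in range(n_in)]
    for _ in range(n_epochs):
        for x, y in zip(samples, targets):   # one training instance at a time
            xb = [1.0] + list(x)
            y_hat = predict(weights, x)
            for i in range(n_in):
                for j in range(n_out):
                    # w_ij <- w_ij + eta * (y_j - y_hat_j) * x_i
                    weights[i][j] += eta * (y[j] - y_hat[j]) * xb[i]
    return weights

samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [(0, 0), (0, 1), (0, 1), (1, 1)]   # columns: AND, OR
weights = train_perceptron(samples, targets)
print([predict(weights, x) for x in samples])
```

Weights change only when a prediction is wrong, so once every instance is classified correctly the updates stop. Both targets here are linearly separable, which is the case a single layer of TLUs can handle.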
Given that the decision boundary of each output neuron is linear, Perceptrons inherently struggle to capture intricate patterns. However, stacking multiple Perceptrons mitigates these limitations. This composite structure is known as a Multilayer Perceptron (MLP), as illustrated in Figure 2.3. The architecture comprises an input layer (of pass-through neurons), one or more hidden layers of TLUs, and a final output layer of TLUs. Every layer except the output layer includes a bias neuron, and each layer is fully connected to the next. A deep neural network (DNN) is an ANN with a large number of hidden layers.

Figure 2.3 Multilayer Perceptron architecture with two inputs, four neurons in one hidden layer, and three neurons in the output layer [39]
2.2 Deep neural network
Deep Learning revolves around the study of deep neural networks (DNNs), which frequently consist of long sequences of computations. Denoting the output of the $l$-th hidden layer by $A^{(l)}(z)$, the computation of a neural network with $L$ hidden layers is:
$$f(x) = f\left[z^{(L+1)}\left(A^{(L)}\left(z^{(L)}\left(\cdots A^{(1)}\left(z^{(1)}(x)\right)\right)\right)\right)\right] \qquad (2.2)$$
Each pre-activation function $z^{(l)}(a)$ is a linear operation governed by the weight matrix $W^{(l)}$ and bias $b^{(l)}$:
$$z^{(l)}(a) = W^{(l)} a + b^{(l)}$$
Figure 2.4 Hidden layers in Deep Neural Network
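The nested composition of pre-activations and activations can be sketched as a simple loop over layers. This is a minimal illustration, assuming a sigmoid activation in every layer (including the output) and hand-picked weights; none of these values come from the proposed model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pre_activation(W, b, a):
    """Linear step z = W a + b for one layer; W is given as a list of rows."""
    return [sum(w * x for w, x in zip(row, a)) + bias
            for row, bias in zip(W, b)]

def forward(x, layers):
    """Alternate pre-activation and activation, layer by layer."""
    a = list(x)
    for W, b in layers:
        z = pre_activation(W, b, a)
        a = [sigmoid(v) for v in z]
    return a

# Hypothetical network: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),
    ([[1.0, -1.0]], [0.0]),
]
output = forward([1.0, 1.0], layers)
print(output)
```

Each pass through the loop corresponds to one level of nesting in the composed function.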
In 1986, David Rumelhart and his colleagues introduced a groundbreaking approach that revolutionized the field: the backpropagation training algorithm, which remains a cornerstone of neural network training. In essence, it is Gradient Descent [43] combined with an efficient technique for computing gradients automatically. The backpropagation algorithm computes the gradient of the network's error with respect to every model parameter in just two passes through the network, one forward and one backward. It thereby determines how each connection weight and bias term should be adjusted to reduce the error. It then performs a regular Gradient Descent step using these gradients, and the whole process is repeated until the network converges to a solution.
Key aspects of the backpropagation algorithm include:
+ Mini-Batch Processing and Epochs: The algorithm operates on one mini-batch at a time (typically containing a power-of-two number of instances for computational efficiency) and cycles through the entire training dataset multiple times. Each complete cycle is termed an epoch. This iterative process gradually reduces the loss.
+ Forward Pass: Each mini-batch is passed from the input layer to the first hidden layer. The algorithm then computes the output of every neuron in this layer for each sample in the mini-batch. The result is propagated forward to the next layer, and the process repeats layer by layer until the output layer is reached. This forward pass is the same as making predictions, except that intermediate results are retained for use during the backward pass.
+ Error Calculation: After the forward pass, the algorithm measures the network's output error.
+ Output Contribution Evaluation: The algorithm computes how much each output connection contributed to the error. This is done analytically using the chain rule, which makes the step both efficient and precise.
+ Backward Error Propagation: Again using the chain rule, the algorithm measures how much of these error contributions came from each connection in the layer below. This backward process continues until the input layer is reached. In this way, the backward pass measures the error gradient across all the connection weights in the network.
+ Gradient Descent Phase: Finally, the algorithm adjusts all the connection weights in the network using the computed error gradients in a Gradient Descent step.
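The steps listed above can be traced on the smallest possible network: one input, one hidden unit, and one output unit. This is a minimal sketch, assuming sigmoid activations, a squared-error loss, full-batch updates, and arbitrary starting values; it is not the training procedure used for the proposed model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Forward pass through a 1-1-1 network, keeping the intermediate value h."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    y_hat = sigmoid(w2 * h + b2)
    return h, y_hat

def loss(params, x, y):
    _, y_hat = forward(params, x)
    return 0.5 * (y_hat - y) ** 2

def backward(params, x, y):
    """Gradients of the loss w.r.t. (w1, b1, w2, b2), obtained with the chain rule."""
    w1, b1, w2, b2 = params
    h, y_hat = forward(params, x)
    d_z2 = (y_hat - y) * y_hat * (1.0 - y_hat)  # error at the output pre-activation
    d_w2, d_b2 = d_z2 * h, d_z2
    d_h = d_z2 * w2                              # error propagated to the hidden unit
    d_z1 = d_h * h * (1.0 - h)
    d_w1, d_b1 = d_z1 * x, d_z1
    return [d_w1, d_b1, d_w2, d_b2]

def mean_loss(params, batch):
    return sum(loss(params, x, y) for x, y in batch) / len(batch)

def train(params, batch, eta=0.5, n_epochs=500):
    params = list(params)
    for _ in range(n_epochs):                    # one epoch = one pass over the data
        grads = [0.0] * len(params)
        for x, y in batch:                       # average gradients over the mini-batch
            for k, g in enumerate(backward(params, x, y)):
                grads[k] += g / len(batch)
        params = [p - eta * g for p, g in zip(params, grads)]  # Gradient Descent step
    return params

batch = [(0.0, 0.0), (1.0, 1.0)]     # hypothetical training data
initial = [0.5, 0.0, 0.5, 0.0]       # arbitrary starting parameters
trained = train(initial, batch)
print(mean_loss(initial, batch), mean_loss(trained, batch))
```

The value `h` computed in the forward pass is reused in the backward pass, which is the reason intermediate results are retained.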
To summarize the backpropagation algorithm: for each training step, it first makes a prediction (forward pass) and measures the error, then goes back through each layer to measure the error contribution from each connection (reverse pass), and finally adjusts the connection weights to reduce the error (Gradient Descent step). For this algorithm to work properly, a pivotal change was made to the MLP's architecture: the step function was replaced with the logistic (sigmoid) function [44], $\sigma(z) = \frac{1}{1 + e^{-z}}$. The logistic function has a well-defined nonzero derivative everywhere, enabling Gradient Descent to make progress at each step. In contrast, the step function contains only flat segments, so there is no gradient to work with.
However, a challenge arises: as the algorithm progresses down to the lower layers, gradients often shrink because of repeated multiplication by values less than 1. Consequently, the Gradient Descent updates leave the connection weights of the lower layers virtually unchanged, and training never converges to a good solution. This is known as the vanishing gradients problem. Conversely, gradients can grow in magnitude until layers receive excessively large weight updates and the algorithm diverges, an issue termed the exploding gradients problem. The combination of the logistic activation function and the weight initialization procedure in common use was analyzed in [45]. This study demonstrated that, with this combination, each layer's output variance significantly exceeds its input variance. Going forward through the network, the variance keeps increasing after each layer until the activation function saturates in the upper layers. Notably, saturation is exacerbated by the logistic function's mean of 0.5 rather than 0.
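The shrinking effect of repeated multiplication can be checked numerically for the logistic function, whose derivative never exceeds 0.25. The sketch below is an illustration only: it ignores the weight factors that also enter the chain rule, and `gradient_scale_bound` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0 and collapses as |z| grows.
for z in (0.0, 2.0, 5.0, 10.0):
    print(z, sigmoid_derivative(z))

def gradient_scale_bound(n_layers):
    """Upper bound on the product of n sigmoid derivatives (weights ignored)."""
    return 0.25 ** n_layers

print(gradient_scale_bound(10))
```

Under these assumptions, ten stacked sigmoid layers already scale the gradient by less than one millionth, which illustrates why the lower layers barely move.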
With respect to the logistic activation function (depicted in Figure 2.5), the function saturates at 0 or 1 as inputs become large (negative or positive), with derivatives that approach zero. Consequently, there is minimal gradient available for backpropagation, and what little gradient exists is diluted as it traverses the network's upper layers during backpropagation. Glorot and Bengio [45] therefore suggested a way to dramatically reduce the unstable gradient issue, described next under Glorot and He initialization.

Figure 2.5 Logistic activation function saturation [39]
2.2.1 Glorot and He Initialization
The proper propagation of signals in both the forward and backward passes is crucial in neural networks: signals must flow properly in the forward direction when making predictions and in the reverse direction when computing gradients. The authors of [45] argue that for the signal to flow properly, the variance of a layer's outputs should match the variance of its inputs. Furthermore, the gradients need to have equal variance before and after flowing through a layer in the reverse direction. Both conditions cannot be guaranteed unless the layer has an equal number of inputs and neurons (referred to as the $fan_{in}$ and $fan_{out}$ of the layer).
However, Glorot and Bengio introduced a practical approach that has proven effective: initializing the connection weights of each layer with random values de-
fined by equations 24) and (23), which involve normal distribution and uniform
distribution with the parameters outlined Notably, fanayg = (fanin + fanow) /2-
This initialization strategy is referred to as Xavier initialization or Glorot initial- ization in (5) The significance of this technique has been recognized for over a decade Applying Glorot initialization significantly accelerates training and is one
of the influential strategies that have contributed to the success of Deep Learning
(2.4)
fan
Similar techniques for different activation functions have been presented in other papers [46]. These approaches share a common framework and differ mainly in the scale of the variance $\sigma^2$ (for example, He initialization uses $\sigma^2 = 2/\mathrm{fan}_{in}$). In the case of the uniform distribution, the value of $r$ is computed as $r = \sqrt{3\sigma^2}$. In particular, the initialization technique tailored for the Rectified Linear Unit (ReLU) activation function, which is discussed in the next subsection, is sometimes referred to as He initialization.
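The initialization rules above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names and the layer sizes (300 inputs, 100 neurons) are hypothetical, the Glorot variance $1/\mathrm{fan}_{avg}$ and the He variance $2/\mathrm{fan}_{in}$ are the commonly cited values, and the uniform bound follows $r = \sqrt{3\sigma^2}$.

```python
import numpy as np

def fan_avg(fan_in, fan_out):
    # Average of the number of inputs and the number of neurons of the layer
    return (fan_in + fan_out) / 2.0

def glorot_normal(fan_in, fan_out, rng):
    # Normal distribution with mean 0 and variance 1 / fan_avg
    sigma2 = 1.0 / fan_avg(fan_in, fan_out)
    return rng.normal(0.0, np.sqrt(sigma2), size=(fan_in, fan_out))

def glorot_uniform(fan_in, fan_out, rng):
    # Uniform distribution on [-r, r] with r = sqrt(3 * sigma^2)
    sigma2 = 1.0 / fan_avg(fan_in, fan_out)
    r = np.sqrt(3.0 * sigma2)
    return rng.uniform(-r, r, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    # He initialization (assumed variance 2 / fan_in), intended for ReLU layers
    sigma2 = 2.0 / fan_in
    return rng.normal(0.0, np.sqrt(sigma2), size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W_normal = glorot_normal(300, 100, rng)
W_uniform = glorot_uniform(300, 100, rng)
W_he = he_normal(300, 100, rng)
print(W_normal.var(), W_uniform.var(), W_he.var())
```

The normal and uniform Glorot variants produce weights with the same variance; only the shape of the distribution differs.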
2.2.2 Non-Saturating Activation Functions
The backpropagation algorithm works not only with the logistic function but also with various other activation functions. Several common options are presented below.
(a) ReLU activation
To address the vanishing gradient problem [47] associated with sigmoid activation, the Rectified Linear Unit (ReLU) was introduced. The ReLU activation function is illustrated in Figure 2.6. Unlike the sigmoid function, ReLU does not saturate for positive inputs: its derivative is 0 for x < 0 and 1 otherwise, which largely removes the vanishing gradient issue. Additionally, ReLU promotes model sparsity, since a gradient of 0 essentially indicates that a neuron has become inactive. Moreover, ReLU is cheaper to compute than functions such as sigmoid and tanh, because it only requires taking max(0, x). Consequently, ReLU has become the standard activation function in today's deep learning landscape.
(b) Swish activation
Nonetheless, the search for improved activation functions continued. In October 2017, Google Brain introduced the Swish activation function [49], aiming to improve on existing options. The Swish activation function is characterized by the simple equation $f(x) = \dfrac{x}{1 + e^{-x}}$, as depicted in Figure 2.7. Swish stands out as a smooth function, unlike ReLU, which changes direction abruptly at x = 0. Swish bends smoothly from 0 towards slightly negative values and then back upwards. Importantly, Swish is non-monotonic, which sets it apart from functions like ReLU that never decrease as the input grows. This characteristic is highlighted in the authors' paper, where they underscore that Swish's non-monotonicity distinguishes it from most other activation functions.
Figure 2.7 Swish activation
The Swish activation function has several characteristics that make it attractive compared with ReLU:
• Bounded below and sparse activation: similar to ReLU, Swish benefits from sparsity. Large negative inputs are mapped to values close to zero, which contributes to a sparse representation.
• Unbounded above: for very large inputs, Swish does not saturate to a maximum value (e.g., 1 for all neurons). This distinguishes Swish from saturating activation functions such as sigmoid and tanh; ReLU shares this property.
• Smooth curve and smooth landscape: the smoothness of the Swish curve extends to its derivative, leading to a smoother landscape for optimization. This smoothness helps the optimizer move the model efficiently towards minimal loss.
• Utilization of negative values: unlike ReLU, where negative values are set to zero, Swish retains negative values, particularly for inputs close to zero. This property is beneficial for capturing subtle patterns in the data, making Swish more flexible in handling different types of information.
In essence, the smooth, flexible, bounded-below behavior of Swish makes it a compelling alternative to ReLU, offering improvements in capturing complex patterns and optimizing model performance.
(c) Mish activation
Mish activation draws inspiration from Swish activation. The Mish activation function is defined as $f(x) = x \tanh(\ln(1 + e^{x}))$, and its graph is depicted in Figure 2.8. While Mish shares many of the advantages of Swish, the authors of [50] suggest that the error surface could be smoother with Mish. However, the primary drawback of the Mish activation function is its significantly higher computational cost.
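The three activation functions discussed in this subsection can be written directly from their formulas. The following NumPy sketch is only illustrative (the function names are chosen here, not taken from any cited work); the softplus term $\ln(1 + e^{x})$ of Mish is computed with `np.logaddexp` for numerical stability.

```python
import numpy as np

def relu(x):
    # max(0, x), applied element-wise
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative as stated in the text: 0 for x < 0 and 1 otherwise
    return (x >= 0).astype(float)

def swish(x):
    # f(x) = x / (1 + exp(-x))
    return x / (1.0 + np.exp(-x))

def mish(x):
    # f(x) = x * tanh(ln(1 + exp(x))); logaddexp(0, x) equals ln(1 + exp(x))
    return x * np.tanh(np.logaddexp(0.0, x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(relu(x))
print(swish(x))
print(mish(x))
```

Evaluating `swish` at -5 and -1 shows the non-monotonic dip: the output at -1 is lower than the output at -5, although both are negative.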
Figure 2.8 Mish activation [50]
2.2.3 Batch Normalization
Although He initialization combined with ReLU (or one of its variants) can significantly reduce the likelihood of vanishing/exploding gradient problems at the start of training, it does not guarantee that these issues will not arise later during training. In the paper [51], a technique called Batch Normalization (BN) is introduced to address these problems. This method involves inserting a new operation into the model just before or after the activation function of each hidden layer.
The operation zero-centers and normalizes each input. Subsequently, the results are scaled and shifted using two learnable parameter vectors per layer: one for scaling and another for shifting. This approach allows the model to learn the most appropriate scale and mean for each of the layer's inputs. To zero-center and normalize the inputs, the mean and standard deviation of each input must be known. They are estimated over the current mini-batch. The entire procedure is succinctly summarized below:
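1. $\mu_B = \dfrac{1}{m_B} \sum_{i=1}^{m_B} \mathbf{x}^{(i)}$
2. $\sigma_B^2 = \dfrac{1}{m_B} \sum_{i=1}^{m_B} \left(\mathbf{x}^{(i)} - \mu_B\right)^2$
3. $\hat{\mathbf{x}}^{(i)} = \dfrac{\mathbf{x}^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$
4. $\mathbf{z}^{(i)} = \gamma \otimes \hat{\mathbf{x}}^{(i)} + \beta$
In this algorithm:
• $\mu_B$ is the vector of input means, evaluated over the entire mini-batch B.
• $\sigma_B$ is the vector of input standard deviations, also evaluated over the entire mini-batch.
• $m_B$ is the number of instances in the mini-batch.
• $\hat{\mathbf{x}}^{(i)}$ is the vector of zero-centered and normalized inputs for instance i.
• $\gamma$ is the output scale parameter vector for the layer.
• $\otimes$ denotes element-wise multiplication.
• $\beta$ is the output shift (offset) parameter vector for the layer. Each input is offset by its corresponding shift parameter.
• $\epsilon$ is a small number that prevents division by zero (commonly $10^{-5}$). It is called a smoothing term.
• $\mathbf{z}^{(i)}$ is the output of the BN operation: a rescaled and shifted version of the inputs.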
In the training phase, Batch Normalization (BN) standardizes its inputs by zero-centering and normalizing them, followed by rescaling and shifting. During the testing phase, a mini-batch may be unavailable or too small to give reliable statistics, so BN employs two additional parameter vectors: μ (the final input mean vector) and σ (the final input standard deviation vector). These are estimated during training using an exponential moving average [52] and are then used for making predictions on new instances during testing. It is important to note that while μ and σ are computed during training, they are used only after training (to replace the batch input means and standard deviations in the BN algorithm during inference).
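The training-time and inference-time behavior of BN can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function names, the `momentum` value of 0.99, and the `eps` value are hypothetical choices, and the moving statistics are stored as a mean and a variance (the standard deviation is the square root of the latter).

```python
import numpy as np

def batch_norm_train(X, gamma, beta, eps=1e-5):
    # X has shape (m_B, n_features); statistics are taken over the mini-batch
    mu_B = X.mean(axis=0)
    var_B = X.var(axis=0)
    X_hat = (X - mu_B) / np.sqrt(var_B + eps)   # zero-center and normalize
    Z = gamma * X_hat + beta                    # rescale and shift
    return Z, mu_B, var_B

def update_moving(moving, batch_value, momentum=0.99):
    # Exponential moving average of a batch statistic (momentum is an assumption)
    return momentum * moving + (1.0 - momentum) * batch_value

def batch_norm_infer(X, gamma, beta, moving_mu, moving_var, eps=1e-5):
    # At inference time the stored statistics replace the batch statistics
    return gamma * (X - moving_mu) / np.sqrt(moving_var + eps) + beta

rng = np.random.default_rng(0)
X = rng.normal(5.0, 3.0, size=(32, 4))
gamma, beta = np.ones(4), np.zeros(4)

Z, mu_B, var_B = batch_norm_train(X, gamma, beta)
moving_mu = update_moving(np.zeros(4), mu_B)
moving_var = update_moving(np.ones(4), var_B)
Z_test = batch_norm_infer(X[:1], gamma, beta, moving_mu, moving_var)
print(Z.mean(axis=0), Z.std(axis=0))
```

Note that inference works on a single instance (`X[:1]`), which would not be possible if batch statistics were still required.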
With Batch Normalization, the issue of vanishing gradients is mitigated to a point where saturating functions like the logistic function and even the tanh function can be used effectively. The sensitivity of the networks to weight initialization is also notably reduced. Researchers have been able to employ significantly higher learning rates, leading to a substantial acceleration of the learning process. Furthermore, Batch Normalization acts as a regularizer, reducing the need for other regularization techniques to prevent overfitting.
2.2.4 Dropout
Deep neural networks, with their thousands or even millions of parameters, have a vast parameter space. This grants them great flexibility to fit a diverse array of complex datasets. However, this flexibility also heightens the risk of overfitting the training data, necessitating regularization techniques. One such technique that has gained significant traction in deep neural network regularization is Dropout [53]. Dropout has proven to be remarkably effective, often leading to a 1-2 percent improvement in accuracy for modern neural networks. While this might not sound like a dramatic enhancement, consider that for a model that already achieves 95% accuracy, a 2% increase corresponds to a reduction in error rate of about 40% (from 5% to roughly 3%). The concept behind dropout is relatively straightforward: during each training iteration, every neuron (excluding output neurons) has a certain probability p of being temporarily removed. This means that the neuron's contribution is entirely disregarded during that iteration, but it may contribute in subsequent iterations (Figure 2.9). The parameter p is termed the dropout rate, and it generally lies within the range of 10% to 50%. Neurons are no longer dropped during the testing or inference phase.
Figure 2.9 With dropout regularization, a random subset of all the neurons in one or more layers (with the exception of the output layer) is "dropped out" at each training iteration [39]
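A minimal sketch of the dropout mechanism is given below. It assumes the common "inverted dropout" variant, in which the surviving activations are divided by (1 − p) during training so that no rescaling is needed at inference; this scaling is an assumption added here and is not described in the text above. The function name and the example sizes are hypothetical.

```python
import numpy as np

def dropout(activations, p, rng, training=True):
    # During inference (or with p = 0) the activations pass through unchanged
    if not training or p == 0.0:
        return activations
    # Each neuron is kept with probability 1 - p and dropped with probability p
    keep_mask = rng.random(activations.shape) >= p
    # Inverted dropout: rescale kept activations to preserve the expected value
    return activations * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
a_train = dropout(a, p=0.5, rng=rng, training=True)
a_test = dropout(a, p=0.5, rng=rng, training=False)
print((a_train == 0).mean())  # fraction of dropped activations, close to p
```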
2.3 Convolutional Neural Network
Since the 1980s, researchers have applied convolutional neural networks (CNNs) to image recognition. This development was spurred by investigations into the workings of the brain's visual cortex [54–56]. Over the years, CNNs have undergone significant advancements and have reached a point where they can achieve performance beyond human capabilities in various complex visual tasks. These advancements have been driven by the growth in computing capabilities and the abundance of available training data. As a result, CNNs play a pivotal role in applications like image analysis services, self-driving vehicles, automated video classification systems, and more. Importantly, CNNs are not confined to visual perception; they have also performed remarkably well in other domains, including speech recognition and natural language processing.
2.3.1 The Architecture of the Visual Cortex
In their work presented in [56], the authors demonstrated that numerous neurons in the visual cortex have a small local receptive field. This property implies that these neurons respond only to visual stimuli within a limited region of the visual field, as illustrated in Figure 2.10, where dashed circles denote the local receptive fields of five neurons. The receptive fields of different neurons can overlap, and together they cover the entire visual field. Moreover, the researchers observed that certain neurons responded only to images of horizontal lines, while others responded to lines of other orientations. Additionally, they identified neurons with larger receptive fields that reacted to more intricate patterns formed by combining lower-level patterns. These findings led to the hypothesis that higher-level neurons build on the outputs of neighboring lower-level neurons (in Figure 2.10, each neuron is connected to only a subset of neurons from the previous layer). This architecture enables the detection of a wide array of complex patterns across different regions of the visual field.
Figure 2.10 As the visual signal progresses through the brain, neurons respond to more complex patterns in larger receptive fields [39]
The culmination of these insights was the introduction of the neocognitron in 1980 [57], which ultimately paved the way for the development of convolutional neural networks. A notable milestone in this progression was the LeNet-5 architecture. LeNet-5, widely employed by financial institutions for classifying handwritten digits, integrated several well-established building blocks, such as sigmoid activation functions and fully connected layers. However, it also introduced two novel components: convolutional layers and pooling layers.
2.3.2 Convolutional Layers
A fundamental characteristic of a CNN is that neurons in the first convolutional layer are not connected to every pixel in the input image, but only to pixels within their receptive fields (as depicted in Figure 2.11). In turn, each neuron in the subsequent convolutional layers is connected only to neurons in a small local region of the previous layer. This arrangement enables the network to focus on small low-level features in the initial hidden layers and then combine them into larger, higher-level features in subsequent layers. This hierarchical structure mirrors the organization of visual information in real-world images, which contributes to the CNN's remarkable performance in image recognition tasks.
A neuron situated at row i and column j of a given layer is connected to the outputs of the neurons in the previous layer located in rows $i$ to $i + f_h - 1$ and columns $j$ to $j + f_w - 1$, where $f_h$ and $f_w$ are the height and width of the receptive field (Figure 2.12). To ensure that a layer maintains the same height and width as the preceding layer, it is common to add zeros around the inputs. This technique is referred to as zero padding.
Spacing out the receptive fields, as depicted in Figure 2.13, makes it possible to connect a large input layer to a much smaller subsequent layer, which significantly reduces the computational complexity of the model. The shift from one receptive field to the next is referred to as the stride. In the illustration, a 5 × 7 input layer is linked to a 3 × 4 layer using 3 × 3 receptive fields and a stride of 2, with zero padding applied. In this example the stride is the same in both directions, but it does not have to be.
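The layer sizes in this example can be checked with a short calculation. The sketch below assumes the widely used "same" padding convention, in which the output size is the input size divided by the stride, rounded up; the helper names are hypothetical. It also lists the receptive field of one neuron using the index rule rows $i \times s_h$ to $i \times s_h + f_h - 1$ and columns $j \times s_w$ to $j \times s_w + f_w - 1$, expressed in the coordinates of the zero-padded input.

```python
import math

def same_padding_output_size(n_in, stride):
    # Output size with zero padding, assuming the "same" convention: ceil(n / s)
    return math.ceil(n_in / stride)

def same_padding_total(n_in, stride, f):
    # Total number of zero rows (or columns) needed so every field fits
    n_out = same_padding_output_size(n_in, stride)
    return max((n_out - 1) * stride + f - n_in, 0)

def receptive_field(i, j, s_h, s_w, f_h, f_w):
    # Inclusive (first, last) row and column indices in the padded input
    rows = (i * s_h, i * s_h + f_h - 1)
    cols = (j * s_w, j * s_w + f_w - 1)
    return rows, cols

out_h = same_padding_output_size(5, 2)   # 5 input rows, stride 2
out_w = same_padding_output_size(7, 2)   # 7 input columns, stride 2
pad_h = same_padding_total(5, 2, 3)
pad_w = same_padding_total(7, 2, 3)
rows, cols = receptive_field(2, 3, 2, 2, 3, 3)  # bottom-right output neuron
print(out_h, out_w, pad_h, pad_w, rows, cols)
```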
With a stride, a neuron located at row i and column j in the upper layer is connected to the outputs of the neurons in the previous layer located in rows $i \times s_h$ to $i \times s_h + f_h - 1$ and columns $j \times s_w$ to $j \times s_w + f_w - 1$, where $f_h$ and $f_w$ are the height and width of the receptive field and $s_h$ and $s_w$ are the vertical and horizontal strides, respectively. This mechanism of strides and receptive fields allows CNNs to efficiently capture features across different scales and positions in the input data.
Figure 2.12 Relations between layers and zero padding [39]
Figure 2.13 Reducing the dimensionality of the input feature map using a stride of 2 [39]
The weights of a neuron can be thought of as a small image the size of the receptive field. Figure 2.14 illustrates two possible sets of weights, known as filters. The first filter is depicted as a black square with a vertical white line through its center (a 7 × 7 matrix in which all values are 0 except for the central column, which is filled with 1s). Neurons with these weights essentially focus solely on the central vertical line in their receptive field, disregarding other input values. The second filter is presented as a black square with a white horizontal line in the middle. Similarly, neurons with these weights emphasize the central horizontal line in their receptive field, filtering out the remaining information.
Consider a scenario where all neurons in a layer use the same vertical line filter (along with the same bias term), and the network is given the bottom image in Figure 2.14 as input. In this case, the layer's output will resemble the top-left image: the vertical white lines are accentuated, while the rest of the image becomes blurred. Similarly, if all neurons employ the same horizontal line filter, the result is the upper-right image; here, the horizontal white lines are emphasized, and the rest becomes less distinct. Consequently, a layer of neurons sharing the same filter generates a feature map that highlights the regions of the input image that most activate the filter.
Figure 2.14 Applying two different filters to obtain two feature maps [39]
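The effect of the two line filters can be reproduced on a toy image. The sketch below is only an illustration: the 15 × 15 test image with a single white vertical line is hypothetical, the bias is omitted, no padding is used, and the sliding-window helper is a plain (unoptimized) cross-correlation written for clarity.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    # Slide the kernel over the image (stride 1, no padding) and sum the products
    f_h, f_w = kernel.shape
    out_h = image.shape[0] - f_h + 1
    out_w = image.shape[1] - f_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out

# 7 x 7 filters: all zeros except a central column (or row) of ones
vertical = np.zeros((7, 7))
vertical[:, 3] = 1.0
horizontal = np.zeros((7, 7))
horizontal[3, :] = 1.0

# Toy input: black image with one white vertical line in column 7
image = np.zeros((15, 15))
image[:, 7] = 1.0

fmap_vertical = correlate2d_valid(image, vertical)
fmap_horizontal = correlate2d_valid(image, horizontal)
print(fmap_vertical.max(), fmap_horizontal.max())
```

The vertical filter responds strongly where its central column lines up with the white line, while the horizontal filter responds only weakly, which is the behavior described for the two feature maps.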
It is important to note that filters are not designed manually; rather, the convolutional layer learns the most relevant filters automatically during training. The layers above then learn to combine these filters into more complex patterns, allowing the network to identify intricate features in the data.
Up to this point, I have simplified the depiction of each convolutional layer's output as a 2D feature map. In reality, each convolutional layer consists of multiple filters and outputs one feature map per filter, so it is more accurately represented in 3D (as seen in Figure 2.15). Each filter in the convolutional layer employs its own parameters and creates a distinct feature map. The receptive field of a neuron is as described earlier, but it extends across all the feature maps of the preceding layer. In essence, a convolutional layer applies multiple trainable filters to its inputs simultaneously, enabling it to detect multiple features anywhere in its inputs.
Moreover, input images often have multiple sublayers, one per color channel; color images typically have three (red, green, and blue). Grayscale images have just one channel, whereas certain images have additional channels, such as satellite photos that capture diverse light frequencies, including infrared.
To elaborate, consider a neuron located in row i and column j of feature map k in a given convolutional layer l. It is connected to the outputs of the neurons in the preceding layer l − 1 located in rows $i \times s_h$ to $i \times s_h + f_h - 1$ and columns $j \times s_w$ to $j \times s_w + f_w - 1$, across all feature maps of layer l − 1. Note that all neurons located in the same row i and column j but in different feature maps are connected to the outputs of exactly the same neurons in the previous layer.
Figure 2.15 Convolutional layers with multiple feature maps and images with three color channels [39]
Equation (2.6) condenses the preceding description into a single mathematical expression for the output of a neuron in a convolutional layer. Although the many indices make it look intricate, it simply computes the weighted sum of all the inputs plus the bias term.
$$z_{i,j,k} = b_k + \sum_{u=0}^{f_h-1} \sum_{v=0}^{f_w-1} \sum_{k'=0}^{f_{n'}-1} x_{i',j',k'} \cdot w_{u,v,k',k} \quad \text{with } i' = i \times s_h + u,\; j' = j \times s_w + v \qquad (2.6)$$
In this equation:
• $z_{i,j,k}$ is the output of the neuron located in row i, column j of feature map k of the convolutional layer (layer l).
• $s_h$ and $s_w$ are the vertical and horizontal strides, $f_h$ and $f_w$ are the height and width of the receptive field, and $f_{n'}$ is the number of feature maps in the preceding layer (layer l − 1).
• $x_{i',j',k'}$ is the output of the neuron located in layer l − 1, row $i' = i \times s_h + u$, column $j' = j \times s_w + v$, feature map k'.
• $b_k$ is the bias term for feature map k (in layer l). It acts like a knob that adjusts the overall intensity of feature map k.
• $w_{u,v,k',k}$ is the connection weight between any neuron in feature map k of layer l and its input located at row u, column v (relative to the neuron's receptive field) and feature map k'.
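The weighted sum described by these terms can be spelled out as explicit loops. The sketch below is a deliberately naive NumPy version written for readability rather than speed; it assumes no zero padding, and the array layouts (`x` as height × width × feature maps, `w` as $f_h \times f_w \times f_{n'} \times f_n$) as well as the random test data are choices made here for illustration.

```python
import numpy as np

def conv_layer_output(x, w, b, s_h=1, s_w=1):
    # x: (height, width, f_n_prev) outputs of layer l-1
    # w: (f_h, f_w, f_n_prev, f_n) connection weights, b: (f_n,) bias terms
    f_h, f_w, f_n_prev, f_n = w.shape
    out_h = (x.shape[0] - f_h) // s_h + 1
    out_w = (x.shape[1] - f_w) // s_w + 1
    z = np.zeros((out_h, out_w, f_n))
    for i in range(out_h):
        for j in range(out_w):
            for k in range(f_n):
                total = b[k]
                for u in range(f_h):
                    for v in range(f_w):
                        for kp in range(f_n_prev):
                            # input at row i*s_h + u, column j*s_w + v, map k'
                            total += x[i * s_h + u, j * s_w + v, kp] * w[u, v, kp, k]
                z[i, j, k] = total
    return z

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 7, 3))       # previous layer: 6 x 7 with 3 feature maps
w = rng.normal(size=(3, 3, 3, 2))    # 3 x 3 receptive fields, 2 output feature maps
b = rng.normal(size=2)
z = conv_layer_output(x, w, b, s_h=2, s_w=2)
print(z.shape)
```

In practice the inner loops are replaced by vectorized or hardware-accelerated routines, but the result is the same sum.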
2.3.3 Pooling Layers
The purpose of the pooling layer is to downsample (i.e., shrink) the input image, reducing the computational load, memory usage, and number of parameters, which in turn helps mitigate the risk of overfitting. As in convolutional layers, each neuron in a pooling layer is connected to the outputs of a limited number of neurons in the previous layer, located within a small rectangular receptive field. As before, the size of the receptive field, the stride, and whether zero padding is used need to be specified. Unlike neurons in convolutional layers, pooling neurons have no weights; they simply aggregate their inputs using a function such as max or mean.
Currently, the most common type of pooling layer is the max pooling layer, illustrated in Figure 2.16. In this case, a pooling kernel of size 2 × 2 with a stride of 2 and no padding is employed. In max pooling, only the maximum input value within each receptive field is passed on to the next layer, while the other inputs are discarded. In the example from Figure 2.16, the input values in the lower left receptive field are 1, 3, 5, and 2; hence, only the maximum value, 5, is propagated to the subsequent layer. Due to the stride of 2, the output image's width and height are half those of the input image.
Figure 2.16 Max pooling layer (2 × 2 pooling kernel, stride 2, no padding) [39]
A max pooling layer not only reduces computations, memory usage, and the number of parameters, but also introduces a degree of invariance to small translations, as depicted in Figure 2.17. This can be observed in the three images (A, B, C), which go through a max pooling layer with a 2 × 2 kernel, a stride of 2, and no zero padding. Images B and C are identical to image A but shifted to the right by one and two pixels, respectively. The outputs of the max pooling layer for images A and B are identical, illustrating translation invariance. The output for image C differs: it is shifted by one pixel to the right, but 75% invariance is still maintained. By inserting a max pooling layer every few layers in a CNN, some translation invariance can be achieved at a larger scale. Additionally, max pooling provides a limited amount of rotational and scale invariance. Such invariance (even if limited) can be advantageous when the prediction should not depend on these details, as in classification tasks.
Figure 2.17 Invariance to small translations [39]
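Max pooling and its tolerance to small shifts can be illustrated with a few lines of NumPy. The sketch below is illustrative only: the 4 × 4 input reuses the values 1, 3, 5, 2 in its lower left receptive field, but the other values are made up, and the small images A, B, and C are hypothetical stand-ins for the ones in the figure.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    # Keep only the maximum of each size x size receptive field (no padding)
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

x = np.array([[4, 0, 1, 2],
              [0, 3, 0, 1],
              [1, 3, 7, 0],
              [5, 2, 1, 6]])
pooled = max_pool2d(x)          # lower left field holds 1, 3, 5, 2
print(pooled)

# Translation invariance: B is A shifted right by one pixel, C by two pixels
A = np.zeros((2, 8), dtype=int)
A[0, 0] = 9
A[1, 4] = 5
B = np.roll(A, 1, axis=1)
C = np.roll(A, 2, axis=1)
pooled_A, pooled_B, pooled_C = max_pool2d(A), max_pool2d(B), max_pool2d(C)
print(pooled_A, pooled_B, pooled_C)
```

In this toy case the one-pixel shift leaves the pooled output unchanged, while the two-pixel shift moves the pooled output by one position.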
However, max pooling does come with certain drawbacks. It causes a substantial loss of resolution: even with a small 2 × 2 kernel and a stride of 2, the output is halved in both dimensions, so its area shrinks to 25% of the input and 75% of the input values are dropped. Moreover, invariance is not desirable in all applications. Take semantic segmentation, for instance: if the input image is shifted by one pixel to the right, the output should also shift by one pixel to the right. In such pixel-wise classification tasks, where the goal is to assign each pixel to a specific class, equivariance rather than invariance is required: a small change in the inputs should lead to a corresponding small change in the outputs.
2.3.4 Transposed Convolutional Layers in semantic segmentation
In semantic segmentation, each pixel is assigned a category based on the type of object it belongs to, as illustrated in Figure 2.18. Notably, objects of the same class are not distinguished from one another. For instance, all cars are grouped together as a single large pixel region on the right side of the segmented image. The primary challenge in this task arises from the fact that, as images pass through conventional CNNs, their spatial resolution gradually diminishes due to layers with strides larger than 1. As a result, a typical CNN might recognize that a person is located somewhere on the left side of the image, but it would lack precise localization.
To restore the resolution, the Transposed Convolutional Layer [59] is a preferred choice. This layer can be thought of as first stretching the image by inserting empty rows and columns (filled with zeros) and then applying a regular convolutional layer (as shown in Figure 2.19). Some refer to it as a convolutional layer with fractional strides. The Transposed Convolutional Layer can be configured to perform linear interpolation, but its advantage lies in being trainable, so it often learns to do better during training. Unlike in pooling or convolutional layers, the stride in a transposed convolutional layer determines how much the input is stretched, which is how the layer increases the resolution of feature maps.
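The "insert zeros, then convolve" view of a transposed convolution can be checked numerically. The sketch below is an illustration for a single channel with no output cropping; both function names and the test data are hypothetical. The first function implements the direct definition, in which each input value stamps a scaled copy of the kernel onto the output at a position set by the stride. The second stretches the input with zeros, pads it, and applies a regular sliding-window operation with the flipped kernel.

```python
import numpy as np

def transposed_conv2d_scatter(x, kernel, stride):
    # Direct definition: every input pixel adds x[i, j] * kernel to the output
    n_h, n_w = x.shape
    k_h, k_w = kernel.shape
    out = np.zeros(((n_h - 1) * stride + k_h, (n_w - 1) * stride + k_w))
    for i in range(n_h):
        for j in range(n_w):
            out[i * stride:i * stride + k_h, j * stride:j * stride + k_w] += x[i, j] * kernel
    return out

def transposed_conv2d_zero_insert(x, kernel, stride):
    # Equivalent view: stretch the input with zeros, pad, then slide a kernel
    n_h, n_w = x.shape
    k_h, k_w = kernel.shape
    stretched = np.zeros(((n_h - 1) * stride + 1, (n_w - 1) * stride + 1))
    stretched[::stride, ::stride] = x            # empty rows/columns in between
    padded = np.pad(stretched, ((k_h - 1, k_h - 1), (k_w - 1, k_w - 1)))
    flipped = kernel[::-1, ::-1]                 # flip so both views agree
    out_h = padded.shape[0] - k_h + 1
    out_w = padded.shape[1] - k_w + 1
    out = np.zeros((out_h, out_w))
    for p in range(out_h):
        for q in range(out_w):
            out[p, q] = np.sum(padded[p:p + k_h, q:q + k_w] * flipped)
    return out

rng = np.random.default_rng(0)
x = np.arange(6, dtype=float).reshape(2, 3)
kernel = rng.normal(size=(3, 3))
up_a = transposed_conv2d_scatter(x, kernel, stride=2)
up_b = transposed_conv2d_zero_insert(x, kernel, stride=2)
print(up_a.shape, np.allclose(up_a, up_b))
```

A larger stride spreads the input values further apart and therefore yields a larger output, which is the sense in which the stride controls the amount of stretching.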
2.3.5 Skip Layer
Transposed Convolutional Layers are a viable way to increase the size of feature maps, but the result may still lack precision. To address this, skip connections from lower layers are introduced: the output image is upsampled by a factor of 2 (rather than 32), and the output of a lower layer that has this doubled resolution is added to it. The result is then upsampled by a factor of 16, giving a total upsampling factor of 32 (as depicted in Figure 2.20). This helps recover some of the spatial resolution lost in earlier pooling layers. To retrieve even finer details from even lower levels, a second skip connection is added. In summary, the output of the initial CNN is upsampled by a factor of 2, a lower layer output (at the corresponding scale) is added, the result is upsampled by a factor of 2 again, another lower layer output is added, and a final upsampling by a factor of 8 is applied. This technique can even be used to scale up beyond the size of the original image, a process known as super-resolution.
Figure 2.20 Skip layers recover spatial resolution from lower layers [39]
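The bookkeeping of resolutions in a two-skip scheme can be sketched as follows. This is a shape-only illustration under explicit assumptions: the upsampling factors are taken to be 2, 2, and 8 (so that their product is the overall factor of 32), nearest-neighbor repetition stands in for the trainable transposed convolutional layers, channels are omitted, and the 64 × 64 input size and random feature maps are hypothetical.

```python
import numpy as np

def upsample_nearest(x, factor):
    # Stand-in for a trainable upsampling layer: repeat each value factor times
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

rng = np.random.default_rng(0)
input_size = 64
coarse = rng.normal(size=(input_size // 32, input_size // 32))   # CNN output, 1/32 scale
skip_16 = rng.normal(size=(input_size // 16, input_size // 16))  # lower layer, 1/16 scale
skip_8 = rng.normal(size=(input_size // 8, input_size // 8))     # even lower layer, 1/8 scale

y = upsample_nearest(coarse, 2) + skip_16   # x2, then first skip connection
y = upsample_nearest(y, 2) + skip_8         # x2, then second skip connection
y = upsample_nearest(y, 8)                  # final x8, back to the input resolution
print(y.shape)
```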
2.4 U-Net based architectures
The U-Net architecture [17] has played a pivotal role in shaping the landscape of deep learning-based image segmentation. In automated medical image segmentation, significant efforts have been directed towards refining and advancing the U-Net framework. Notably, attention-based methods have attracted considerable interest due to their efficacy in segmenting intricate features in biomedical images across diverse imaging modalities.
One such adaptation is the Residual Attention U-Net [60], which uses the soft attention mechanism to bolster the network's ability to discern a comprehensive spectrum of COVID-19 effects within chest CT scans. For lung segmentation in chest X-rays, the XLSor approach [61] employs the criss-cross attention block to aggregate long-range contextual information. The MultiResUNet was introduced by [62]. This model was subjected to comparative evaluations against the original U-Net across various medical image segmentation datasets. The findings revealed that the MultiResUNet consistently outperformed the standard U-Net in terms of segmentation results.
Another notable innovation, Attention U-Net [63], seeks to enhance the feature learning capability of the U-Net by integrating an attention gate. This attention mechanism suppresses responses to irrelevant information and accentuates critical information, thereby enhancing prediction accuracy and model sensitivity.
However, while the application of attention mechanisms is promising, their direct use may compromise the extraction of underlying feature representations. This concern is particularly relevant when the region of interest is assessed incorrectly, leading to suboptimal network performance.
2.4.1 Attention mechanism
The attention mechanism has been used successfully in many computer vision tasks. SENet [64], with its core Squeeze-and-Excitation (SE) module, aggregates information globally, captures channel-wise relationships, and enhances the output representation. GSoP-Net [65] has been proposed to improve the Squeeze stage by utilizing a global second-order pooling block to model high-order statistics while accumulating features globally.
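The Squeeze-and-Excitation idea can be sketched for a single feature tensor. The following NumPy code is a simplified illustration rather than the implementation of [64]: the weights are random instead of trained, and the reduction ratio `r`, the tensor sizes, and the function names are hypothetical. The squeeze step pools each channel globally, the excitation step passes the pooled vector through two small fully connected layers (ReLU, then sigmoid), and the resulting per-channel weights rescale the feature maps.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(feature_maps, w1, b1, w2, b2):
    # feature_maps has shape (height, width, channels)
    squeezed = feature_maps.mean(axis=(0, 1))        # squeeze: global average pooling
    hidden = np.maximum(0.0, squeezed @ w1 + b1)     # excitation: bottleneck with ReLU
    scale = sigmoid(hidden @ w2 + b2)                # one weight in (0, 1) per channel
    return feature_maps * scale, scale               # rescale each channel

rng = np.random.default_rng(0)
C, r = 8, 4                                          # channels and reduction ratio (assumed)
fm = rng.normal(size=(5, 5, C))
w1, b1 = rng.normal(size=(C, C // r)), np.zeros(C // r)
w2, b2 = rng.normal(size=(C // r, C)), np.zeros(C)

out, scale = se_block(fm, w1, b1, w2, b2)
print(out.shape, scale)
```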
2.4.2 Compact Dilation Convolution-based Module (CDCM)
In PiDiNet [32], CDCMs have been designed to refine feature maps, beginning from the end of each stage. The input of n × C channels is exploited to supplement multi-scale edge information, producing an output feature map of M (M < C) channels, which relieves the computation overhead. Furthermore, each CDCM is followed by a Compact Spatial Attention Module (CSAM) [32] to suppress the background noise.
[Figure panel label: Conv (dilation = 11, padding = 11)]
Figure 2.21 CDCM Module [32]