HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Master’s Thesis
in Data Science and Artificial Intelligence
Applying Deep Learning Techniques for the Localization and Classification
of Digestive Tract Lesions
PHAN NGOC LAN
Lan.PN202634M@sis.hust.edu.vn
Supervisor: Dr Dinh Viet Sang
Department: Computer Science
Ha Noi, 10/2021
Declaration of Authorship and Topic Sentences
• Propose a novel neural network architecture to address the problem;
• Introduce a new annotated image dataset for the proposed problem;
• Perform evaluations of the model on the new dataset, with comparisons
to existing segmentation models.
Ha Noi, October 2021 Supervisor
Dr Dinh Viet Sang
I would like to thank my supervisor, Dr Dinh Viet Sang, for his continued support and guidance throughout the course of my Master's studies. He has been a great teacher and mentor for me since my undergraduate years, and I am proud to have completed this thesis under his supervision.
I would also like to thank Dr Dao Viet Hang and the team of doctors and physicians at the Institute of Gastroenterology and Hepatology. Their tireless efforts have resulted in the NeoPolyp dataset presented in this thesis, and this work would not have been possible without their contributions.
The work in this thesis is also supported by the VINIF research project "Development of a Real-time AI-assisted System to Detect Colon Polyps and Identify Lesions at High Risk of Malignancy During Endoscopy", code VINIF.2020.DA17. I would like to thank Vingroup and the Vingroup Innovation Foundation, who have funded the project, along with the many students, faculty members and research staff who have helped me in my research.
I want to thank my family, my fiancée, and my friends, who have given me their unconditional love and support to finish my Master's studies.
Finally, I would like to again thank Vingroup and the Vingroup Innovation Foundation, who have supported my studies through their Domestic Master/Ph.D Scholarship program.

Parts of this work were published in the paper "NeoUNet: Towards accurate polyp segmentation and neoplasm detection" by Phan Ngoc Lan, Nguyen Sy An, Dao Viet Hang, Dao Van Long, Tran Quang Trung, Nguyen Thi Thuy and Dinh Viet Sang in the Proceedings of the 16th International Symposium on Visual Computing, 2021.
Phan Ngoc Lan was funded by Vingroup Joint Stock Company and supported by the Domestic Master/Ph.D Scholarship Programme of Vingroup Innovation Foundation (VINIF), Vingroup Big Data Institute (VINBIGDATA), code VINIF.2020.ThS.BK.02.
Medical image segmentation is a highly challenging task in computer vision with many important applications. While the advent of deep learning techniques has created important breakthroughs in this field, there is still much room for improvement. In this thesis, we focus on segmentation for digestive tract lesions, particularly colon polyps and esophageal lesions. We identify a shortcoming in previous formulations of polyp segmentation, in which neoplasm classification is often ignored. To address this issue, we propose a new problem formulation called Polyp Segmentation and Neoplasm Detection (PSND). In addition, this thesis proposes a deep neural network called NeoUNet to solve lesion segmentation and the PSND problem. The proposed model is built upon U-Net, with a novel hybrid loss function that takes advantage of incomplete labels. To validate NeoUNet, two medical image datasets are collected with the help of experts. Our experiments show the effectiveness of NeoUNet over existing state-of-the-art models for image segmentation.

Keywords: Convolutional Neural Network, Medical Image Processing, Image Segmentation, U-Net, Colonoscopy
Author
Phan Ngoc Lan
Contents
1.1 Problem overview 1
1.2 Thesis contributions 3
1.3 Thesis structure 3
2 Theoretical Basis 4
2.1 Machine learning 4
2.2 Artificial neural networks 5
2.3 Convolutional neural networks 11
2.4 Attention mechanisms 16
2.5 Convolutional neural networks for semantic segmentation 18
2.6 Polyp segmentation and neoplasm classification 23
2.7 Problem formulation 23
3 Proposed Methods 26
3.1 NeoUNet 26
3.1.1 Motivation 26
3.1.2 Architecture overview 26
3.1.3 Encoder backbone 27
3.1.4 Attention mechanism 29
3.1.5 Decoder module 30
3.1.6 Loss function 31
3.2 Implementation details 33
4 Experiments 41
4.1 Dataset 41
4.1.1 NeoPolyp 41
4.1.2 Esophageal lesions 42
4.2 Experiment settings 43
4.3 Evaluation metrics 45
4.4 Results and discussion 46
4.4.1 Evaluating the HarDNet68 backbone 46
4.4.2 Comparison with baseline models 47
4.4.3 Evaluating the effect of undefined polyps 50
List of Figures
1.1 Example images of colon polyps and esophageal lesions. Images on the right denote pixels with lesions in white. 2
2.1 A 4-layer neural network1 6
2.2 Simple visualization of gradient descent2 8
2.3 Example of a computational graph. Computation nodes store their derived gradients w.r.t. their inputs. 9
2.4 Speed comparison on several deep learning tasks between Xeon CPUs and NVIDIA Tesla GPUs. 10
2.5 An example convolution layer5 12
2.6 An example of max-pooling6 12
2.7 LeNet-5 architecture [29] 13
2.8 Architecture of VGG-167 14
2.9 Example of a skip connection [16] 14
2.10 Architecture of Inception V1 (GoogLeNet) [49] 15
2.11 Example of dropout [14] 16
2.12 Architecture of EfficientNet-B0 [3] 17
2.13 Attention mechanism proposed in [5] 18
2.14 Transformer architecture [55]. The network processes items in the sequence one-by-one, passing the output to the decoder for the next item. 19
2.15 Architecture of the Fully Convolutional Network [34] 20
2.16 Overall U-Net architecture [43] 21
2.17 Overall PraNet architecture [13] 22
2.18 Overall HarDNet-MSEG architecture [19] 22
2.19 Classification targets for the polyp segmentation problem and the polyp segmentation and neoplasm detection problem 24
2.20 Expected outputs for polyp segmentation and PSND. Black regions denote background pixels. White regions denote polyp regions. Green and red regions denote non-neoplastic and neoplastic polyp regions, respectively. 24
2.21 Example of an image with an undefined polyp. Pixels annotated in yellow denote the undefined polyp area. 25
3.1 Overview of NeoUNet’s architecture 27
3.2 Structure of an example Harmonic Dense Block. The value on each layer denotes the number of output channels. 28
3.3 HarDNet68 architecture. HDB layers may not be to scale with actual depths. 29
3.4 Diagram of the additive attention gate module [38] 30
4.1 Pixel-wise distribution of polyp class labels in the NeoPolyp dataset. Percentages are calculated on polyp pixels only (not including background pixels). 42
4.2 Learning rate over each step for the cosine annealing with warmup schedule 44
4.3 Examples of how Dice and IoU scores are calculated. Blue areas denote sets of pixels that are used for calculation. Orange-lined rectangles denote prediction mask pixels, and green-lined rectangles denote ground-truth mask pixels. 46
4.4 Qualitative results on the NeoPolyp test set 48
4.5 NeoUNet outputs for test images with undefined labels 49
4.6 Sample images and ground-truth labels from the NeoPolyp dataset. Yellow pixels denote the undefined labels. 52
4.7 Sample images and ground-truth labels from the esophageal lesion dataset 53
List of Tables
4.1 Performance metrics on the NeoPolyp test set for NeoUNet-ResNet101, NeoUNet-DenseNet121, and NeoUNet-HarDNet68 46
4.2 Performance metrics on the NeoPolyp-Clean test set for U-Net, PraNet, HarDNet-MSEG, AttentionUNet and NeoUNet 47
4.3 Performance metrics on the esophageal lesion test set for U-Net, PraNet, AttentionUNet, HarDNet-MSEG, and NeoUNet 50
4.4 Performance metrics for NeoUNet when training on NeoPolyp and NeoPolyp-Clean, measured on the NeoPolyp test set 51
Despite their difficulty, medical image segmentation problems are an area of very active research due to their high potential for application. Successful applications can save countless hours of labor for doctors, physicians, and operators, which can translate to lower medical costs and more lives saved.

This thesis focuses on segmenting lesions in the digestive tract. Two specific types of lesions are considered in this work. The first is colorectal polyps. Polyps are a type of lesion that can naturally develop inside the digestive tract. As some polyps can develop into more serious conditions such as colorectal cancer, their detection and treatment have been a concern for gastrointestinal doctors. The second type is esophageal lesions, which form in the esophagus due to several factors such as diet.
Usually, these lesions are detected through either colonoscopy [36] or upper GI endoscopy [37]. In both procedures, an endoscope is inserted into the patient's digestive tract. A doctor can control the endoscope's movement and examine the digestive tract using the built-in camera. Lesions are detected manually in this manner, which requires the doctor to be highly focused and thorough to not miss any that may turn out to be dangerous.
(a) Colon polyp

Figure 1.1: Example images of colon polyps and esophageal lesions. Images on the right denote pixels with lesions in white.
Several works have researched automatic polyp segmentation with positive results [4, 13, 19]. However, these works only determined whether an area is part of a polyp or not, yet most polyps are not created equal. Two polyp types that are of interest to doctors include neoplastic polyps (or adenomas) and non-neoplastic polyps. Neoplastic polyps are precursor lesions to cancer, requiring various follow-up procedures such as polypectomies, endoscopic mucosal resection, endoscopic submucosal dissection, biopsy, marking, surgery or chemo-radiotherapy. In contrast, non-neoplastic polyps are mostly benign and can be removed or left without follow-up. During endoscopies, doctors must evaluate each polyp to estimate its neoplasm status. Suspected neoplastic polyps are sampled and further analyzed. Evaluating polyps during live endoscopies can be time-consuming and just as error-prone as detection, especially under tight time constraints.
1.2 Thesis contributions
The inherent challenges of lesion segmentation and neoplasm detection motivate our work in this thesis. Specifically, the contributions of this thesis include:

• Extending the polyp segmentation problem with a multi-class target (neoplastic and non-neoplastic polyps) as a new problem called Polyp Segmentation and Neoplasm Detection (PSND);

• Proposing a convolutional neural network model called NeoUNet, which is designed to effectively solve PSND and lesion segmentation in general;

• Presenting two datasets, including an esophageal lesion dataset and a polyp dataset called NeoPolyp that contains neoplasm information;

• Evaluating NeoUNet on the presented datasets, with comparisons to existing segmentation models.
1.3 Thesis structure
The rest of the thesis is organized as follows. Chapter 2 describes the thesis' theoretical foundation and outlines related works. Chapter 3 describes the PSND problem formulation and the NeoUNet model in detail. Our experiments are described and reported in Chapter 4. Finally, Chapter 5 concludes the thesis and outlines future work.
Chapter 2
Theoretical Basis
2.1 Machine learning
Machine learning (ML) is a sub-field of artificial intelligence, which seeks to provide knowledge to computers through data, observations and interaction with the world [6]. Machine learning algorithms are unique in that they typically include two distinct phases: training and inference. The training phase extracts insights and properties from the dataset to form a learned model, while the inference phase uses this model to produce results on new data.

As machine learning is approximate by nature, it is typically applied to problems that are NP-hard or otherwise infeasible to solve exactly.
Machine learning is also tied to statistics and optimization. Due to their data-driven nature, machine learning models are essentially statistical models on top of their training data. This also means that despite numerous advances in learning algorithms, data will always play a crucial role in successful machine learning applications. In addition, understanding statistical properties in the data is also vital in designing proper machine learning solutions. On the other hand, the training of models can be approached as an optimization problem, in which we seek to minimize certain desired metrics. While this may not always be the case, optimization still plays an important role in machine learning. A key difference of machine learning compared to statistics and optimization is that ML's goal is to generalize instead of describing seen data.
There are several different ways machines can learn. Supervised learning algorithms learn from a set of inputs and outputs that are assumed to be correct. Supervised learning models can directly (or oftentimes with little effort) solve the target problem without complex inference. Naive Bayes, SVMs and multilayer perceptrons are examples of supervised learning algorithms.

Semi-supervised algorithms are similar to supervised algorithms but are designed to handle missing features or tiny datasets. They often require some level of assumption about the missing data to operate correctly.

Unsupervised algorithms are applied to only input data, without any set outputs. These models seek to build relationships between different data points (e.g., clustering, hierarchy, ...). Their outputs often require further inference and processing to eventually solve the target problem. Popular unsupervised methods include k-means and neural autoencoders.

Reinforcement learning algorithms are slightly different, as they do not have a static idea of training data. These algorithms learn within an "environment", in which they play the role of an intelligent agent. Feedback from the environment guides the agent's learning process. Recent research on reinforcement learning has tackled problems such as self-driving agents and game playing.
2.2 Artificial neural networks
Artificial neural networks (commonly referred to as neural networks) are a type of machine learning model inspired by the way biological neural systems process information. Specifically, they mimic the way biological neurons form connections between one another. Similar to their biological counterparts, artificial neurons are computationally simple but rely on their large numbers and cross-connections to model complex dependencies.
Frank Rosenblatt proposed the first neural network in 1958, which he called the Perceptron [44]. The Perceptron consists of a single layer of neurons represented by the weight vector w, a bias parameter b, and an activation function g(x). Given the input vector x, a perceptron produces the following output:
f(x; w, b) = g(w^T x + b)    (2.1)
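As a concrete illustration (not part of the original text), Equation 2.1 can be sketched with NumPy; the sigmoid activation and the example weights below are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b, g=sigmoid):
    # f(x; w, b) = g(w^T x + b), as in Equation 2.1
    return g(np.dot(w, x) + b)

# w^T x + b = 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0, so the output is sigmoid(0) = 0.5
out = perceptron(np.array([1.0, 2.0]), np.array([0.5, -0.25]), b=0.0)
```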
Figure 2.1: A 4-layer neural network 1

The Perceptron eventually evolved into multilayer perceptrons (MLPs), or neural networks. An MLP consists of multiple layers of neurons, including an input layer, an output layer and a number of hidden layers. Figure 2.1 illustrates an example of a 4-layer neural network.

The input layer is a vector representation of input data. This layer contains raw numerical values that directly model the input. The output layer is a vector representation of the problem's output. For binary classification problems, this may be a single value denoting the prediction likelihood. Hidden layers sit between the input and output layers. They make up the neural network's abstract representation space. Hidden layers may have arbitrary sizes and can stack to form "deeper" networks. As it is not typically possible to understand or infer insights from these layers, their information is "hidden" to human observers.
MLPs are "densely" connected, in which a neuron in layer i is connected to every neuron in layer i + 1. A connection between two neurons denotes that the receiving neuron takes the sending neuron's output value as its input.

Neural networks implement two primary procedures: the forward pass and backpropagation. The forward pass generates a network's output by iterating through its layers (see Algorithm 1). The output for layer j is defined in [15] as:

h^(j) = g^(j)(W^(j)^T · h^(j−1) + b^(j))    (2.2)
1 https://technology.condenast.com/story/a-neural-network-primer
where g^(j) is the j-th layer's activation function, W^(j) is the weight matrix, b^(j) is the bias weight, and h^(0) = x. The size of W^(j) corresponds to the number of neurons at layers j and j − 1.
MLPs require non-linear activations to model complex relations. The most common activations are the tanh and sigmoid functions. For the output layer, the activation function is usually chosen such that it constrains the output space to the desired range. For example, binary classification problems often activate the output with sigmoid, as sigmoid(x) ∈ (0, 1).
Algorithm 1: Neural network forward pass
Input: Vector of input features x; network layers L = ((W^(1), b^(1), g^(1)), ..., (W^(m), b^(m), g^(m)))
Output: Predictions h^(m); outputs at each layer Z = (z^(1), ..., z^(m))
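A minimal NumPy sketch of the loop in Algorithm 1 (an illustration, not the thesis's implementation; the layer shapes and ReLU activation are arbitrary choices):

```python
import numpy as np

def forward(x, layers):
    """Algorithm 1: iterate h^(j) = g^(j)(W^(j)^T h^(j-1) + b^(j)) over all layers."""
    h = x
    z = []                                   # outputs at each layer
    for W, b, g in layers:
        h = g(W.T @ h + b)
        z.append(h)
    return h, z

relu = lambda v: np.maximum(v, 0)
layers = [
    (np.eye(3), np.zeros(3), relu),          # 3 -> 3 identity layer
    (np.ones((3, 1)), np.zeros(1), relu),    # 3 -> 1 summing layer
]
# relu([1, -2, 3]) = [1, 0, 3], then the summing layer gives 1 + 0 + 3 = 4
y, z = forward(np.array([1.0, -2.0, 3.0]), layers)
```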
Algorithm 2: Neural network backpropagation
Input: Vector of input features x; ground-truth labels y; network layers L = ((W^(1), b^(1), g^(1)), ..., (W^(m), b^(m), g^(m)))
Output: Gradients at each layer ∆ = (δ^(1), ..., δ^(m))
Gradient descent uses this property to continually update the neural network. The update vector is multiplied by a learning rate, whose values are subtracted from the weights of each neuron. Learning rates help throttle the update process to ensure that we eventually reach a local minimum. Algorithm 3 describes this process in detail.

Algorithm 3: Gradient Descent
Input: Vector of input features x; ground-truth labels y; network layers L = ((W^(1), b^(1), g^(1)), ..., (W^(m), b^(m), g^(m))); learning rate γ
Output: The updated network
Figure 2.2: Simple visualization of gradient descent2
The gradient descent algorithm can "train" a neural network by continually applying backpropagation and updating over a training dataset. Stochastic Gradient Descent (SGD) uses constant-sized batches for each iteration instead of the entire dataset, allowing training on massive datasets that do not fit into memory. There are also several variants of SGD that adapt the learning rate to find better minima or reach convergence faster. Such strategies include Adam [26] and Adadelta [59], among others.
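As a sketch (not the thesis's training code), one-dimensional gradient descent on the toy objective f(w) = w², whose gradient is 2w, looks like:

```python
def gradient_descent(w, grad, lr=0.1, steps=100):
    """Algorithm 3 in miniature: repeatedly subtract the scaled gradient."""
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Minimise f(w) = w^2 starting from w = 4.0; the minimum is at w = 0.
# Each step multiplies w by (1 - 0.1 * 2) = 0.8, so w shrinks toward 0.
w_final = gradient_descent(4.0, lambda w: 2.0 * w)
```

Too large a learning rate would make the updates overshoot and diverge, which is the throttling role of γ described above.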
2 https://towardsdatascience.com/quick-guide-to-gradient-descent-and-its-variants-97a7afb33add

While neural networks can model highly complex functions by adding neurons and layers, their size in practice is constrained by two problems: overfitting and gradient vanishing. Overfitting is a common problem for highly expressive machine learning models, where the model performs extremely well on the training set but poorly on unseen data. In other words, these models do not generalize. Large MLPs are highly susceptible to overfitting, as connections can be made at a very large scale. At a certain level, an overfitting MLP can essentially "remember" the training dataset through its neurons, thus not achieving its goal of learning generalized features.

Gradient vanishing is an issue that arises in relatively deep MLPs. As gradients are computed with the derivative chain rule, their values slowly diminish at layers far away from the output. This means that most learning happens at the final layers of the network, while early layers contribute very little. These issues imposed limits on how neural networks were designed, before the introduction of a set of new models and techniques for deeper and larger neural networks, collectively referred to as Deep Learning.
Fast implementations of the forward pass and backpropagation algorithms are also crucial to developing neural networks. While the logical steps for backpropagation are relatively trivial, larger and more complex operations can quickly make implementations cumbersome. This complexity is addressed with computational graphs, which allow gradients and derivatives to be tracked during the forward pass itself (see Figure 2.3).
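A toy scalar computational graph can illustrate the idea (this is a minimal sketch of reverse-mode differentiation, not any framework's actual implementation):

```python
class Node:
    """A computational-graph node that stores its local gradients w.r.t. its inputs."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (input node, local gradient)
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        # d(uv)/du = v and d(uv)/dv = u
        return Node(self.value * other.value,
                    [(self, other.value), (other, self.value)])

    def backward(self, upstream=1.0):
        # Chain rule: accumulate upstream gradient times each local gradient
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)

x, y = Node(2.0), Node(3.0)
z = x * y + x        # z = xy + x, so dz/dx = y + 1 = 4 and dz/dy = x = 2
z.backward()
```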
Figure 2.3: Example of a computational graph. Computation nodes store their derived gradients w.r.t. their inputs. 3

Another challenge for neural networks is execution speed. Even smaller MLPs can take significant amounts of time to train due to their large number
3 http://datahacker.rs/004-computational-graph-and-autograd-with-pytorch/
of connections and the stochastic nature of SGD. For larger datasets, training can take many hours or even days on powerful CPUs. Raina et al. [40] were the first to propose the use of graphics processing units (GPUs) for training and running neural networks. As most neural network operations rely on matrix computations, GPUs proved to be highly adept at the task, showing up to 40 times improvement over running on CPUs. Other works have also explored multi-node execution, either with parameter servers or peer-to-peer protocols. Hardware alternatives to GPUs are also available, albeit with less adoption, such as Tensor Processing Units (TPUs) or FPGAs.
Figure 2.4: Speed comparison on several deep learning tasks between Xeon CPUs and NVIDIA Tesla GPUs 4
Deep learning frameworks combine computational graphs, GPU support and helpful abstractions to form a complete ecosystem for developing neural models. Theano [7] was the first such framework, evolving from a tool used mostly for convex optimization. Google's TensorFlow [1] adapted much of the ideology of Theano, most significantly the idea of "graph-declaration-as-code", to create a powerful framework that supports single-node and multi-node execution. Both frameworks require the computational graph to be explicitly defined and run separately (similar to high-performance computing libraries like Spark), with an optimization phase to improve performance.
Despite its early popularity, TensorFlow was notoriously difficult for beginners, and even experienced users had a hard time debugging complex models written in TensorFlow. PyTorch [39] emerged as a younger framework seeking to alleviate such issues. PyTorch executes computations "eagerly" and builds the computational graph on the fly. It also featured friendlier, higher-level abstractions than TensorFlow at the time. These advantages slowly shifted many researchers to using PyTorch to implement their ideas quickly, while industry applications relied on TensorFlow to maximize performance. However, both TensorFlow and PyTorch are seeking to cover both use cases, as TensorFlow 2.0 introduces an "eager execution" mode and PyTorch 1.0 introduces a static graph mode.

4 https://www.nextplatform.com/2018/09/12/nvidia-takes-on-the-inference-hordes-with-turing-gpus/
2.3 Convolutional neural networks
The idea of applying machine learning models to images is not particularly new. However, early attempts had to overcome the problem of input size, namely that image inputs are often inconveniently large. A "small" image of size 200 × 200 already has 40,000 features (!). Thus, these early models relied on feature extraction methods such as bag-of-visual-words, SIFT or HOG to condense images into more compact forms. While this approach can yield positive results, it relies on assumptions made by feature extractors that may not be robust to diverse inputs.
Convolutional neural networks also rely on a core assumption, which states that an image can be understood with high accuracy by examining smaller sliding windows. This assumption is carried out in a special layer type called the convolution layer.
Convolution is a common operation in image processing, especially for blurring, sharpening, or detecting edges. A convolution uses a fixed "kernel", a small 2-D matrix containing weights, and slides the kernel across both image dimensions. For each image segment, the kernel is multiplied with pixel values and aggregated to form an output matrix. In essence, the value of item (i, j) in the output matrix is an aggregation of all kernel weights and the pixel values at location (i, j). It encodes the local state of the pixel, i.e., reflects what types of pixels are surrounding it.
The convolution layer (or CONV layer) consists of k convolution kernels acting as weights. Given an input image represented as a tensor of shape h × w × c, each kernel produces an output matrix of shape h′ × w′. Kernel outputs are stacked to form the final output for the CONV layer, with a shape of k × h′ × w′ (see Figure 2.5). Note that since convolution layers produce outputs of similar shapes to their inputs, they can be easily stacked.
Figure 2.5: An example convolution layer 5
A basic convolution layer can be configured with several parameters. The kernel size often varies between 3 × 3 and 7 × 7, which affects the network's receptive field. The stride can also vary between 1 and 3 pixels. Kernel size and stride also affect the final size h′ × w′ of the output matrix. Some combinations, such as 3 × 3 kernels with stride 1, cause the input to shrink after going through convolution. As this shrinking is often undesirable, padding is added to the input to preserve size. Finally, the number of kernels k can be set arbitrarily, with each kernel implying a different local feature being learned.
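These shape rules and the sliding-window operation can be sketched as follows (an illustrative naive implementation using the standard output-size formula h′ = ⌊(h + 2·pad − k)/stride⌋ + 1; real frameworks use far faster algorithms):

```python
import numpy as np

def conv_out_size(n, k, stride=1, pad=0):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * pad - k) // stride + 1

def conv2d(img, kernel):
    """Naive single-channel 'valid' convolution (cross-correlation, as in CNNs)."""
    kh, kw = kernel.shape
    oh = conv_out_size(img.shape[0], kh)
    ow = conv_out_size(img.shape[1], kw)
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Multiply the kernel with the window at (i, j) and aggregate
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# A 3x3 all-ones kernel over a 5x5 all-ones image: each output entry sums 9 ones
out = conv2d(np.ones((5, 5)), np.ones((3, 3)))
```

Note that a 3 × 3 kernel with stride 1 and padding 1 preserves the input size, which is why such padding is common.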
Convolution layers address many key problems with neural networks for images. They allow a small number of weights to view and distill features from the entire image, essentially creating learnable feature extractors.
Figure 2.6: An example of max-pooling6
One of the first successful CNNs was LeNet [29], introduced by LeCun et al. in 1998. Aside from the introduction of convolutional layers, several key concepts were laid out by the seminal paper. These include the use of
5 http://cs231n.github.io/understanding-cnn/
6 http://cs231n.github.io/understanding-cnn/
pooling layers (see Figure 2.6), which impose a hard filter on a feature map and reduce its size. Pooling layers act as bottlenecks that select abstract, high-level features to feed to subsequent layers of the network. The ReLU function (ReLU(x) = max(x, 0)) was also proposed as the activation in place of sigmoid or tanh. An advantage of ReLU is that it lessens the impact of gradient vanishing compared to other non-linear activations.
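A sketch of ReLU followed by 2 × 2 max-pooling (an illustrative NumPy version with arbitrary input values, not LeNet's actual code):

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(x, 0), applied element-wise."""
    return np.maximum(x, 0)

def max_pool2x2(x):
    """2x2 max-pooling with stride 2: keep the largest value in each window."""
    h, w = x.shape
    trimmed = x[:h - h % 2, :w - w % 2]          # drop odd remainder rows/cols
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1., -2., 3., 0.],
              [4., 5., -6., 7.],
              [0., -1., 2., 2.],
              [3., 1., 0., -4.]])
pooled = max_pool2x2(relu(x))   # negatives are zeroed, then each 2x2 block is maxed
```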
For classification tasks, two or three fully-connected layers are appended after the convolution layers to produce the final output vector. In a sense, we can consider the convolution layers to be encoders that compress the image into a compact representation for the MLP classifier.
Figure 2.7: LeNet-5 architecture [ 29 ]
Following the success of LeNet, AlexNet [27] and VGG [47] were some of the earliest improvements to CNN design. Notably, VGG was one of the first truly "deep" neural networks with more than a handful of layers. Despite a rather simple architecture, with linear stacks of convolution and pooling layers, VGG-16 was quite robust for its time, achieving a top-5 accuracy of 92.7% on the ImageNet dataset. Figure 2.8 illustrates the VGG-16 architecture in detail.
A major hurdle for models such as VGG-16 when going deeper is gradient vanishing. As the number of layers grows, feedback signals from the loss function simply cannot be retained through the linear layer stack. ResNet [16] addressed this problem with skip connections. Instead of stacking layers linearly, ResNet includes layers that take as input the sum of the previous layer and the k-th previous layer (see Figure 2.9). These skip connections serve to combat gradient vanishing deep into the network, while also smoothing out the loss landscape. ResNet-50 is the most common variation of ResNet, achieving a top-5 accuracy of 94.8% on ImageNet. The architecture has also proven to be
7 https://neurohive.io/en/popular-networks/vgg16/

Figure 2.8: Architecture of VGG-16 7
quite robust to many different problem types, including malware classification [41], food recognition [35], and even speech and natural language domains [9]. For many years, it was also the de facto model for industry applications and provided the encoder backbone for many network types.
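The skip-connection idea can be sketched in one line (an illustrative example, with an arbitrary linear map standing in for the residual branch F):

```python
import numpy as np

def residual_block(x, f):
    """A ResNet-style skip connection: output = x + F(x).

    Even when F's gradient is tiny, the identity path lets gradients
    flow back to earlier layers unattenuated.
    """
    return x + f(x)

# Hypothetical residual branch: a simple scaling, F(x) = 0.1 * x
out = residual_block(np.array([1.0, 2.0]), lambda v: 0.1 * v)
```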
While ResNet pioneered the use of skip connections, many later works proposed different improvements to the concept. ResNeXt [58] utilizes a multi-branch design similar to that of GoogLeNet [49] to pair with skip connections. Meanwhile, Huang et al. [20] propose DenseNet, a massively densified version of ResNet in which skip connections are made to every prior layer in the block. While this approach yielded improvements in accuracy, it is also highly expensive computationally. Several works have attempted to "sparsify" DenseNet, including LogDenseNet [17], SparseNet [33] and Harmonic DenseNet (HarDNet) [10].
Figure 2.9: Example of a skip connection [16]

Aside from skip connections, a key intuition in designing CNNs is the use of branches. The most notable example of this mindset is the Inception family of
architectures. The first Inception network (Inception V1 or GoogLeNet) [49] uses multiple convolution layers with different filter sizes (3 × 3, 5 × 5 and 7 × 7), along with a max-pooling layer. Outputs from these layers are then concatenated (see Figure 2.10). The goal of these multiple kernel sizes is to let the model learn which kernel is most optimal for the given image. Large kernels are suited to global information, while small kernels help represent finer details. Additionally, growing the network "wider" instead of deeper helps avoid gradient vanishing. Later versions of Inception [50] further optimize the architecture by removing representational bottlenecks (where feature map dimensions are reduced too drastically) and applying improvements to the training phase.
Figure 2.10: Architecture of Inception V1 (GoogLeNet) [ 49 ]
Many previous works have also proposed different methods for regularizing CNNs and neural networks in general. Regularization essentially imposes constraints on a model in hopes of preventing it from overfitting. L1 and L2 regularization are commonly used in linear models, and simply prevent weights from becoming too large. Unfortunately, this approach does not work very well for CNNs, as it hinders most of the network's expressiveness. Srivastava et al. [48] proposed dropout in 2014, which proved effective for both CNNs and feed-forward MLPs. The idea of dropout is to simply turn off a random subset of neurons in each iteration of SGD (see Figure 2.11). Neurons that are turned off simply emit 0 as their output and therefore do not contribute to the loss or learn during the iteration. While quite simple, dropout is surprisingly effective. By taking out random neurons during training, we enforce a level of redundancy and robustness in the model. Each subset of neurons in the network needs to hold enough useful information to still produce correct predictions even as other neurons are turned off. Meanwhile, batch normalization (or BatchNorm) [21] takes a slightly different approach
for normalization. The authors identified high levels of numerical instability in many neural network implementations, which causes overfitting and slow convergence. Batch normalization applies a normalization typically only used for input data (i.e., subtracting the mean and dividing by the deviation) to the output feature maps of convolution layers. Unlike training images, however, output feature maps for the same image constantly change throughout training, and it would be infeasible to recalculate the mean and deviation constantly. Instead, batch normalization layers keep track of running mean and deviation values and update them during training.
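The core normalization step can be sketched as follows (an illustration only; the learnable scale/shift parameters and the running statistics used at inference time are omitted for brevity):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch: (x - mean) / sqrt(var + eps).

    x has shape (batch, features); eps guards against division by zero.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

# Two samples, two features on very different scales; both become ~[-1, 1]
x = np.array([[1.0, 100.0],
              [3.0, 300.0]])
out = batch_norm(x)
```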
Figure 2.11: Example of dropout [ 14 ]
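The dropout scheme illustrated in Figure 2.11 can be sketched as follows (a hypothetical NumPy illustration using the common "inverted" variant, which rescales surviving units by 1/(1−p) so the expected activation is unchanged; the vector size and drop probability are arbitrary):

```python
import numpy as np

def dropout(x, p, rng, training=True):
    """Zero each unit with probability p during training; identity at inference."""
    if not training:
        return x
    mask = rng.random(x.shape) >= p       # True for units that survive
    return x * mask / (1.0 - p)           # rescale so E[output] == x

rng = np.random.default_rng(0)
out = dropout(np.ones(10000), p=0.5, rng=rng)   # survivors become 2.0, the rest 0.0
```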
A recent development in network architectures has been the use of neural architecture search techniques, which produce highly optimal network designs without manual tuning. A well-known example is EfficientNet [52] (see Figure 2.12), a family of architectures providing a range of trade-offs between accuracy and latency/size. EfficientNet uses a baseline architecture consisting of Mobile Inverted Bottleneck blocks and searches for different scaling configurations regarding width, depth, and resolution. EfficientNet V2 [53] further improves model size and training speed.

2.4 Attention mechanisms
Attention is a powerful mechanism applied in numerous neural networks, most notably in natural language processing and computer vision. In essence, the goal of adding attention is to help neural networks focus on (i.e., pay attention to) important parts of the input. For image processing, this focus manifests as a mask over the input tensor, in which high-value segments carry higher attention values. When processing sequences such as text or audio, attention works as a type of soft memory unit that signals relevant items at each timestep.

Figure 2.12: Architecture of EfficientNet-B0 [3]
The first work to introduce attention is that of Bahdanau et al. [5] in 2014, when it was applied in the domain of machine translation. This version of attention produces a probability map over the entire input sequence using the softmax function, then applies the map to the decoder of a sequence-to-sequence recurrent network (see Figure 2.13).
Later works further apply attention in different problem domains and network architectures, many with positive results. Wang et al. [56] proposed Residual Attention Network, which stacks multiple attention modules alongside skip connections to process images. The seminal work by Vaswani et al. [55] introduced Transformers, which rely heavily on multi-head self-attention modules (see Figure 2.14). The self-attention mechanism used in Transformers essentially allows the network to see and focus on relevant parts of its own output at previous timesteps.
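The scaled dot-product attention at the core of these modules can be sketched in a few lines of NumPy. This is a simplified, single-head version with illustrative shapes; real Transformers add learned projections and multiple heads.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V — a simplified single-head sketch of the
    attention operation used in Transformers."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # pairwise relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax -> probability map
    return weights @ V, weights                      # weighted sum of values

# Illustrative toy sequence: 3 items with 4-dimensional features
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
out, attn = scaled_dot_product_attention(X, X, X)    # self-attention: Q = K = V
assert np.allclose(attn.sum(axis=-1), 1.0)           # each row is a probability map
```

Each row of `attn` is the softmax probability map described above, telling the model how strongly each item should attend to every other item.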
Figure 2.13: Attention mechanism proposed in [5]
2.5 Convolutional neural networks for semantic segmentation

Long et al. [34] proposed Fully Convolutional Networks (FCNs), adapting well-known CNN classification architectures to semantic segmentation. As the name suggests, FCNs do not include the FC layers. In fact, FCN simply replaces the FC layers in the original architectures (including AlexNet, VGG-16 and GoogLeNet) with convolution layers to act as the “encoder”.
While the modification seems straightforward, the heatmaps produced by the encoder are far too coarse to produce accurate predictions. This is because CNNs progressively reduce the feature map size to learn more abstract and global features. While this works fine for classification problems (since the desired output is highly abstract), it greatly hinders segmentation tasks. The final feature map in VGG-16, for example, is only 8 × 8 for a 256 × 256 input image, a 32-fold downsampling. To address this problem, FCN uses a series of connections between the outputs of each pooling layer, forming a directed acyclic graph (DAG). This DAG acts as the network’s decoder (see Figure 2.15). These connections help supply more fine-grained information to the encoder’s output.

Figure 2.14: Transformer architecture [55]. The network processes items in the sequence one-by-one, passing the output to the decoder for the next item.
(a) Overall architecture
(b) Decoder DAG
Figure 2.15: Architecture of the Fully Convolutional Network [34]

While FCNs achieved state-of-the-art results in semantic segmentation, their design was still heavily constrained by the linear stacking of convolution layers. Ronneberger et al. [43] proposed a more elegant model to address these issues, called U-Net (see Figure 2.16). U-Net features a symmetrical, U-shaped architecture consisting of an encoder and decoder. The encoder is a standard convolutional network similar to FCN, producing progressively smaller feature maps with a higher level of abstraction. However, instead of combining feature maps on different levels directly, U-Net uses another convolutional network as the decoder. The decoder upsamples the feature map with each block, combining the upsampled features with corresponding features from the encoder (a form of skip connection). The final decoder layer restores the feature map to the input’s original size and produces the network’s prediction. Besides its nice symmetric properties, this design addresses key challenges faced by FCNs. It allows coarse abstract features and fine-grained features to be combined through each decoder layer. The skip connections between encoder and decoder blocks also reduce gradient vanishing. U-Net achieved state-of-the-art performance on the ISBI 2012 EM segmentation dataset and was quickly adapted for many other segmentation problems.
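A single U-Net decoder step can be sketched in NumPy. This is a simplified illustration with a nearest-neighbor 2× upsample and made-up shapes; a real decoder block would follow the concatenation with convolution layers.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def decoder_step(decoder_feat, encoder_feat):
    """One U-Net decoder step: upsample the previous decoder output, then
    concatenate the matching encoder feature map along the channel axis
    (the skip connection)."""
    up = upsample2x(decoder_feat)
    assert up.shape[1:] == encoder_feat.shape[1:], "spatial sizes must match"
    return np.concatenate([up, encoder_feat], axis=0)

# Illustrative shapes: a 64-channel 8x8 decoder map meets a 32-channel
# 16x16 encoder map from the corresponding encoder level.
dec = np.zeros((64, 8, 8))
enc = np.zeros((32, 16, 16))
out = decoder_step(dec, enc)
assert out.shape == (96, 16, 16)
```

The concatenation is how coarse abstract features (from the decoder path) and fine-grained features (from the encoder skip connection) are combined at every level.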
Following the introduction of U-Net, many have proposed improvements to the architecture. UNet++ [60] uses nested skip connections between the encoder and decoder. Nested skip connections allow features from higher-level encoder blocks to combine with lower encoder blocks before reaching the decoder. These connections also contain more skip connections themselves. All these skip connections create a highly connected architecture with free-flowing information between blocks. ResUNet++ [25] replaces the encoder in UNet++ with ResNet backbones. Meanwhile, DoubleUNet [23] stacks two U-Nets sequentially to create a larger, more powerful network. The authors used VGG-16 as the backbone and added squeeze-and-excitation units [18]
Figure 2.16: Overall U-Net architecture [43]
to better model channel-wise dependencies. DoubleUNet also incorporates the Atrous Spatial Pyramid Pooling (ASPP) [11] module, which expands the network’s receptive field with multiple sampling rates. However, DoubleUNet only has a single connection between the two U-Nets, severely limiting its information flow. Tang et al. [54] proposed Coupled U-Net to improve on this limitation by adding skip connections between the two networks. Oktay et al. [38] added attention gates to the skip connections in U-Net, which help highlight salient features and improve convergence.
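The attention-gate idea can be sketched as follows. This is a heavily simplified NumPy version with plain matrices in place of the 1×1 convolutions used by Oktay et al.; all shapes and weights here are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x, g, Wx, Wg, psi):
    """Simplified additive attention gate: a gating signal `g` from a coarser
    layer scores each position of the skip feature `x`, and the skip
    connection is scaled by the resulting (0, 1) attention mask."""
    scores = np.maximum(Wx @ x + Wg @ g, 0.0)  # ReLU of additive features
    alpha = sigmoid(psi @ scores)              # per-position attention in (0, 1)
    return x * alpha                           # suppress irrelevant regions

# Illustrative sizes: 4-channel features over 6 spatial positions
rng = np.random.default_rng(1)
x = rng.normal(size=(4, 6))                    # skip-connection features
g = rng.normal(size=(4, 6))                    # gating signal from a deeper block
Wx, Wg = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
psi = rng.normal(size=(1, 4))                  # projection to a scalar score
out = attention_gate(x, g, Wx, Wg, psi)
assert out.shape == x.shape
```

Because `alpha` stays between 0 and 1, the gate can only attenuate skip features, letting the decoder concentrate on regions the coarser features deem relevant.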
PraNet (Parallel Reverse Attention Network) [13] is a CNN architecture designed specifically for medical image segmentation. The overall architecture is shown in Figure 2.17. PraNet follows the encoder-decoder pattern laid out by U-Net with significant modifications. PraNet’s encoder is a standard CNN backbone, in this case Res2Net. PraNet’s decoder is called the Parallel Partial Decoder [57], which aggregates high-level features from the encoder. In addition, PraNet uses an attention mechanism called Reverse Attention on the skip connections. The model achieved state-of-the-art performance on several polyp segmentation datasets.
Figure 2.17: Overall PraNet architecture [13]

HarDNet-MSEG [19] builds upon the design of PraNet with a focus on inference speed. In fact, the HarDNet-MSEG code repository is forked from PraNet. HarDNet-MSEG greatly reduces computational complexity by using the lightweight HarDNet68 backbone instead of Res2Net, and removing the reverse attention modules (see Figure 2.18). The result is a very fast and lean neural network, with inference speed up to 86 FPS on an NVIDIA RTX 2080 GPU. HarDNet-MSEG also achieves state-of-the-art performance on the Kvasir dataset.
Figure 2.18: Overall HarDNet-MSEG architecture [19]
2.6 Polyp segmentation and neoplasm classification
Polyp segmentation has been a popular benchmark for many medical image segmentation methods. Earlier works used hand-crafted features [22, 46] including color, shape, texture, etc., to separate polyps from the surrounding mucosa. In recent years, more generic neural networks such as UNet++ [60], PraNet [13], DoubleUNet [23] and many other models have used polyp segmentation as their benchmark task. This is partly due to the many public datasets available for the problem. Notable datasets include the Kvasir-SEG dataset [24], the CVC-ClinicDB dataset [8] and the CVC-ColonDB dataset [51].

On the contrary, neoplasm classification for polyps has seen significantly less research. Major challenges for this research direction include the lack of public datasets and the general difficulty of labeling data. Ribeiro et al. [42] were among the few authors who tackled polyp neoplasm classification. However, their work only considered a full-image classification problem without any segmentation data. They extracted and labeled a dataset of 100 polyp images from endoscopy videos, each containing exactly one polyp. Several CNN models were tested on this dataset, including VGG, AlexNet, and GoogLeNet. While
it is not impossible to couple this approach with a segmentation module, it can become very inefficient in practice, especially for systems with real-time constraints or embedded hardware.
2.7 Problem formulation
This thesis describes the Polyp Segmentation and Neoplasm Detection problem, hereafter denoted as PSND. PSND extends the polyp segmentation problem by separating polyps into two sub-classes: neoplastic and non-neoplastic (see Figure 2.19).
Formally, we define the PSND problem as follows. Given an input 2-dimensional image, generate an equal-sized matrix whose values represent the label for the corresponding pixel, which must be one of three values:

• 0 if the pixel is a background (non-polyp) pixel;
• 1 if the pixel is part of a non-neoplastic polyp;

• 2 if the pixel is part of a neoplastic polyp.

Figure 2.19: Classification targets for the polyp segmentation problem and the polyp segmentation and neoplasm detection problem
(a) Input image (b) Polyp segmentation (c) PSND
Figure 2.20: Expected outputs for polyp segmentation and PSND. Black regions denote background pixels. White regions denote polyp regions. Green and red regions denote non-neoplastic and neoplastic polyp regions, respectively.
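The three-valued labeling defined above can be illustrated with a tiny NumPy mask. The values below are made up for illustration; real PSND masks are full-resolution endoscopic image annotations.

```python
import numpy as np

# Hypothetical 4x4 PSND ground-truth mask:
# 0 = background, 1 = non-neoplastic polyp pixel, 2 = neoplastic polyp pixel
mask = np.array([
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 2, 2],
    [0, 0, 2, 2],
])

# A plain polyp-segmentation target merges classes 1 and 2 into one "polyp" class
binary = (mask > 0).astype(np.uint8)

assert mask.shape == binary.shape          # equal-sized matrices
assert set(np.unique(mask)) <= {0, 1, 2}   # only the three permitted labels
assert binary.sum() == (mask > 0).sum()    # PSND refines, not changes, the region
```

Collapsing `mask` to `binary` recovers the classic polyp segmentation target, which shows why PSND is a strict extension of that problem.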
Figure 2.20 demonstrates an example of PSND’s output. The problem presents several unique challenges. A polyp’s surface pattern can vary greatly for both neoplastic and non-neoplastic classes, sometimes separating into several sections within the same polyp. In one lesion, a neoplastic area may only take up a portion of the polyp’s surface, requiring very fine-grained spatial understanding to tell apart. This is a differentiating factor between PSND and generic multi-class segmentation. While datasets such as PASCAL-VOC have many more classes, they are primarily well-defined and highly distinguishable (e.g., car, bird, person, etc.). Moreover, from a medical perspective, misclassification could be very costly. False positives could result in over-indication of endoscopic interventions or surgery, while false negatives could result in delaying suitable treatment.
Figure 2.21: Example of an image with an undefined polyp. Pixels annotated in yellow denote the undefined polyp area.
Another challenge for PSND arises during actual data labeling. Due to the highly ambiguous nature of neoplasms, some polyps remain uncertain even for human annotators. These polyps are represented as a fourth class in the data: polyps with undefined neoplasticity (see Figure 2.21). However, our formulation does not include this class, because it is often undesirable for automatic systems to return “undefined” results. Additionally, such labeling can create confusion for learning models and degrade accuracy. Therefore, we introduce special methods to handle and make use of undefined polyps in the following sections.
For segmenting esophageal lesions, the problem formulation in this thesis is similar to classic semantic segmentation problems. Given an input image, generate an equal-sized matrix whose values represent the label for the corresponding pixel: 0 if the pixel is a background pixel and 1 if the pixel is a foreground (lesion) pixel.
While the formulation is similar in nature, the specific segmentation target (esophageal lesions) has not been addressed in any previous work. Specific properties of the data may have adverse effects on existing models.
Our goal when designing NeoUNet is to create a strong baseline for PSND and esophageal lesion segmentation. This means that we try to balance different trade-offs for NeoUNet, and avoid specializing for specific use cases.
3.1.2 Architecture overview
We base NeoUNet on the tried-and-true U-Net architecture [43], along with some characteristics from the Attention U-Net model introduced by Abraham et al. [2]. The network comprises an encoder-decoder structure, with symmetrical shortcut connections between encoder and decoder blocks. The encoder is a HarDNet68 network [10], from which 5 feature maps are extracted corresponding to different abstraction levels. The decoder consists of a series of convolution blocks, each of which takes a concatenated tensor of the previous decoder block’s output and the output from a corresponding encoder block. The encoder block outputs also go through attention gate modules (except for the first and last blocks). Each decoder block then passes its output to an output block to produce the final prediction mask. Figure 3.1 describes NeoUNet’s general architecture.

Figure 3.1: Overview of NeoUNet’s architecture
NeoUNet produces 4 output masks instead of 1 to facilitate deep supervision, a training technique in which the loss function is calculated for multiple outputs at different levels. Deep supervision typically improves a model’s robustness, as the decoder at each level must learn strong and concrete features to produce accurate feature maps. During inference, we only use the prediction mask from the final decoder layer.
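The deep-supervision objective can be sketched as a weighted sum of per-level losses. The uniform weights and the mean-squared-error placeholder below are illustrative assumptions, not NeoUNet's actual training objective.

```python
import numpy as np

def mse(pred, target):
    """Placeholder per-output loss; a real model would use a segmentation loss."""
    return float(np.mean((pred - target) ** 2))

def deep_supervision_loss(outputs, target, weights=None):
    """Sum the loss over every decoder level's prediction mask.
    `outputs` holds one prediction per level, all resized to the target's shape."""
    weights = weights or [1.0] * len(outputs)
    return sum(w * mse(o, target) for w, o in zip(weights, outputs))

# Illustrative: 4 decoder outputs supervised against one ground-truth mask,
# with deeper levels (later in the list) predicting more accurately.
target = np.ones((8, 8))
outputs = [np.full((8, 8), v) for v in (0.5, 0.75, 0.9, 1.0)]
loss = deep_supervision_loss(outputs, target)
assert loss > mse(outputs[-1], target)  # every level contributes to the loss
```

Because each level's prediction enters the loss, gradients flow directly into every decoder block, which is what pushes intermediate levels to learn strong features on their own.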
On the surface level, NeoUNet’s primary difference compared to existing segmentation networks (such as PraNet or Attention U-Net) is that the model outputs a 2-channel segmentation mask, as opposed to a 1-channel binary mask. However, NeoUNet is also designed to cope with specific difficulties of the PSND problem, including high inter-class similarity and missing data.
3.1.3 Encoder backbone
HarDNet (Harmonic DenseNet) [10] is a convolutional neural network architecture inspired by DenseNet [20]. DenseNet’s core innovation is the facilitation of feature reuse through dense skip connections. Each layer in a Dense Block receives the concatenated feature map of every preceding layer, giving