HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Master’s Thesis
in Data Science and Artificial Intelligence
Applying Deep Learning Techniques for the Localization and Classification
of Digestive Tract Lesions
PHAN NGOC LAN
Lan.PN202634M@sis.hust.edu.vn
Supervisor: Dr Dinh Viet Sang
Department: Computer Science
Ha Noi, 10/2021
Declaration of Authorship and Topic Sentences
• Propose a novel neural network architecture to address the problem;
• Introduce a new annotated image dataset for the proposed problem;
• Perform evaluations of the model on the new dataset, with comparisons
to existing segmentation models.
Ha Noi, October 2021 Supervisor
Dr Dinh Viet Sang
I would like to thank my supervisor, Dr Dinh Viet Sang, for his continued support and guidance throughout the course of my Master's studies. He has been a great teacher and mentor for me since my undergraduate years, and I am proud to have completed this thesis under his supervision.
I would also like to thank Dr Dao Viet Hang and the team of doctors and physicians at the Institute of Gastroenterology and Hepatology. Their tireless efforts have resulted in the NeoPolyp dataset presented in this thesis, and this work would not have been possible without their contributions.
The work in this thesis is also supported by the VINIF research project "Development of a Real-time AI-assisted System to Detect Colon Polyps and Identify Lesions at High Risk of Malignancy During Endoscopy", code VINIF.2020.DA17. I would like to thank Vingroup and the Vingroup Innovation Foundation, who have funded the project, along with the many students, faculty members and research staff who have helped me in my research.
I want to thank my family, my fiancée, and my friends, who have given me their unconditional love and support to finish my Master's studies.
Finally, I would like to again thank Vingroup and the Vingroup Innovation Foundation, who have supported my studies through their Domestic Master/Ph.D Scholarship program.

Parts of this work were published in the paper "NeoUNet: Towards accurate polyp segmentation and neoplasm detection" by Phan Ngoc Lan, Nguyen Sy An, Dao Viet Hang, Dao Van Long, Tran Quang Trung, Nguyen Thi Thuy and Dinh Viet Sang in the Proceedings of the 16th International Symposium on Visual Computing, 2021.
Phan Ngoc Lan was funded by Vingroup Joint Stock Company and supported by the Domestic Master/Ph.D Scholarship Programme of Vingroup Innovation Foundation (VINIF), Vingroup Big Data Institute (VINBIGDATA), code VINIF.2020.ThS.BK.02.
Medical image segmentation is a highly challenging task in computer vision with many important applications. While the advent of deep learning techniques has created important breakthroughs in this field, there is still much room for improvement. In this thesis, we focus on segmentation for digestive tract lesions, particularly colon polyps and esophageal lesions. We identify a shortcoming in previous formulations of polyp segmentation, in which neoplasm classification is often ignored. To address this issue, we propose a new problem formulation called Polyp Segmentation and Neoplasm Detection (PSND). In addition, this thesis proposes a deep neural network called NeoUNet to solve lesion segmentation and the PSND problem. The proposed model is built upon U-Net, with a novel hybrid loss function that takes advantage of incomplete labels. To validate NeoUNet, two medical image datasets are collected with the help of experts. Our experiments show the effectiveness of NeoUNet over existing state-of-the-art models for image segmentation.

Keywords: Convolutional Neural Network, Medical Image Processing, Image Segmentation, U-Net, Colonoscopy
Author
Phan Ngoc Lan
Contents
1.1 Problem overview 1
1.2 Thesis contributions 3
1.3 Thesis structure 3
2 Theoretical Basis 4
2.1 Machine learning 4
2.2 Artificial neural networks 5
2.3 Convolutional neural networks 11
2.4 Attention mechanisms 16
2.5 Convolutional neural networks for semantic segmentation 18
2.6 Polyp segmentation and neoplasm classification 23
2.7 Problem formulation 23
3 Proposed Methods 26
3.1 NeoUNet 26
3.1.1 Motivation 26
3.1.2 Architecture overview 26
3.1.3 Encoder backbone 27
3.1.4 Attention mechanism 29
3.1.5 Decoder module 30
3.1.6 Loss function 31
3.2 Implementation details 33
4 Experiments 41
4.1 Dataset 41
4.1.1 NeoPolyp 41
4.1.2 Esophageal lesions 42
4.2 Experiment settings 43
4.3 Evaluation metrics 45
4.4 Results and discussion 46
4.4.1 Evaluating the HarDNet68 backbone 46
4.4.2 Comparison with baseline models 47
4.4.3 Evaluating the effect of undefined polyps 50
List of Figures
1.1 Example images of colon polyps and esophageal lesions. Images on the right denote pixels with lesions in white. 2
2.1 A 4-layer neural network1 6
2.2 Simple visualization of gradient descent2 8
2.3 Example of a computational graph. Computation nodes store their derived gradients w.r.t. their inputs. 9
2.4 Speed comparison on several deep learning tasks between Xeon CPUs and NVIDIA Tesla GPUs. 10
2.5 An example convolution layer5 12
2.6 An example of max-pooling6 12
2.7 LeNet-5 architecture [29] 13
2.8 Architecture of VGG-167 14
2.9 Example of a skip connection [16] 14
2.10 Architecture of Inception V1 (GoogLeNet) [49] 15
2.11 Example of dropout [14] 16
2.12 Architecture of EfficientNet-B0 [3] 17
2.13 Attention mechanism proposed in [5] 18
2.14 Transformer architecture [55]. The network processes items in the sequence one-by-one, passing the output to the decoder for the next item. 19
2.15 Architecture of the Fully Convolutional Network [34] 20
2.16 Overall U-Net architecture [43] 21
2.17 Overall PraNet architecture [13] 22
2.18 Overall HarDNet-MSEG architecture [19] 22
2.19 Classification targets for the polyp segmentation problem and the polyp segmentation and neoplasm detection problem 24
2.20 Expected outputs for polyp segmentation and PSND. Black regions denote background pixels. White regions denote polyp regions. Green and red regions denote non-neoplastic and neoplastic polyp regions, respectively. 24
2.21 Example of an image with an undefined polyp. Pixels annotated in yellow denote the undefined polyp area. 25
3.1 Overview of NeoUNet’s architecture 27
3.2 Structure of an example Harmonic Dense Block. The value on each layer denotes the number of output channels. 28
3.3 HarDNet68 architecture. HDB layers may not be to scale with actual depths. 29
3.4 Diagram of the additive attention gate module [38] 30
4.1 Pixel-wise distribution of polyp class labels in the NeoPolyp dataset. Percentages are calculated on polyp pixels only (not including background pixels). 42
4.2 Learning rate over each step for the cosine annealing with warmup schedule 44
4.3 Examples of how Dice and IoU scores are calculated. Blue areas denote sets of pixels that are used for calculation. Orange-lined rectangles denote prediction mask pixels, and green-lined rectangles denote ground-truth mask pixels. 46
4.4 Qualitative results on the NeoPolyp test set 48
4.5 NeoUNet outputs for test images with undefined labels 49
4.6 Sample images and ground-truth labels from the NeoPolyp dataset. Yellow pixels denote the undefined labels. 52
4.7 Sample images and ground-truth labels from the esophageal lesion dataset 53
List of Tables
4.1 Performance metrics on the NeoPolyp test set for NeoUNet-ResNet101, NeoUNet-DenseNet121, and NeoUNet-HarDNet68 46
4.2 Performance metrics on the NeoPolyp-Clean test set for U-Net, PraNet, HarDNet-MSEG, AttentionUNet and NeoUNet 47
4.3 Performance metrics on the esophageal lesion test set for U-Net, PraNet, AttentionUNet, HarDNet-MSEG, and NeoUNet 50
4.4 Performance metrics for NeoUNet when training on NeoPolyp and NeoPolyp-Clean, measured on the NeoPolyp test set 51
Despite their difficulty, medical image segmentation problems are an area of very active research due to their high potential for application. Successful applications can save countless hours of labor for doctors, physicians, and operators, which can translate to lower medical costs and more lives saved.

This thesis focuses on segmenting lesions in the digestive tract. Two specific types of lesions are considered in this work. The first is colorectal polyps. Polyps are a type of lesion that can naturally develop inside the digestive tract. As some polyps can develop into more serious conditions such as colorectal cancer, their detection and treatment have been a concern for gastrointestinal doctors. The second type is esophageal lesions, which form in the esophagus due to several factors such as diet.
Usually, these lesions are detected through either colonoscopy [36] or upper GI endoscopy [37]. In both procedures, an endoscope is inserted into the patient's digestive tract. A doctor can control the endoscope's movement and examine the digestive tract using the built-in camera. Lesions are detected manually in this manner, which requires the doctor to be highly focused and thorough to not miss any that may turn out to be dangerous.
(a) Colon polyp

Figure 1.1: Example images of colon polyps and esophageal lesions. Images on the right denote pixels with lesions in white.
Several works have researched automatic polyp segmentation with positive results [4, 13, 19]. However, these works only determined whether an area is part of a polyp or not, yet most polyps are not created equal. Two polyp types that are of interest to doctors include neoplastic polyps (or adenomas) and non-neoplastic polyps. Neoplastic polyps are precursor lesions to cancer, requiring various follow-up procedures such as polypectomies, endoscopic mucosal resection, endoscopic submucosal dissection, biopsy, marking, surgery or chemo-radiotherapy. In contrast, non-neoplastic polyps are mostly benign and can be removed or left without follow-up. During endoscopies, doctors must evaluate each polyp to estimate its neoplasm status. Suspected neoplastic polyps are sampled and further analyzed. Evaluating polyps during live endoscopies can be time-consuming and just as error-prone as detection, especially under tight time constraints.
1.2 Thesis contributions
The inherent challenges of lesion segmentation and neoplasm detection motivate our work in this thesis. Specifically, the contributions of this thesis include:

• Extending the polyp segmentation problem with a multi-class target (neoplastic and non-neoplastic polyps) as a new problem called Polyp Segmentation and Neoplasm Detection (PSND);

• Proposing a convolutional neural network model called NeoUNet, which is designed to effectively solve PSND and lesion segmentation in general;

• Presenting two datasets, including an esophageal lesion dataset and a polyp dataset called NeoPolyp that contains neoplasm information;

• Evaluating NeoUNet on the presented datasets, with comparisons to existing segmentation models.
1.3 Thesis structure
The rest of the thesis is organized as follows. Chapter 2 describes the thesis' theoretical foundation and outlines related works. Chapter 3 describes the PSND problem formulation and the NeoUNet model in detail. Our experiments are described and reported in Chapter 4. Finally, Chapter 5 concludes the thesis and outlines future work.
Chapter 2
Theoretical Basis
2.1 Machine learning
Machine learning (ML) is a sub-field of artificial intelligence, which seeks to provide knowledge to computers through data, observations and interaction with the world [6]. Machine learning algorithms are unique in that they typically include two distinct phases: training and inference. The training phase extracts insights and properties from the dataset to form a learned model, while the inference phase uses this model to produce results on new data.

As machine learning is approximate by nature, it is typically applied to problems that are NP-hard or otherwise infeasible to solve exactly.
Machine learning is also tied to statistics and optimization. Due to their data-driven nature, machine learning models are essentially statistical models on top of their training data. This also means that despite numerous advances in learning algorithms, data will always play a crucial role in successful machine learning applications. In addition, understanding statistical properties in the data is also vital in designing proper machine learning solutions. On the other hand, the training of models can be approached as an optimization problem, in which we seek to minimize certain desired metrics. While this may not always be the case, optimization still plays an important role in machine learning. A key difference of machine learning compared to statistics and optimization is that ML's goal is to generalize instead of describing seen data.
There are several different ways machines can learn. Supervised learning algorithms learn from a set of inputs and outputs that are assumed to be correct. Supervised learning models can directly (or oftentimes with little effort) solve the target problem without complex inference. Naive Bayes, SVMs and multilayer perceptrons are examples of supervised learning algorithms.

Semi-supervised algorithms are similar to supervised algorithms but are designed to handle missing features or tiny datasets. They often require some level of assumption about the missing data to operate correctly.

Unsupervised algorithms are applied to only input data, without any set outputs. These models seek to build relationships between different data points (e.g., clustering, hierarchy, ...). Their outputs often require further inference and processing to eventually solve the target problem. Popular unsupervised methods include k-means and neural autoencoders.

Reinforcement learning algorithms are slightly different, as they do not have a static idea of training data. These algorithms learn within an "environment", in which they play the role of an intelligent agent. Feedback from the environment guides the agent's learning process. Recent research on reinforcement learning has tackled problems such as self-driving agents and game playing.
2.2 Artificial neural networks
Artificial neural networks (commonly referred to as neural networks) are a type of machine learning model inspired by the way biological neural systems process information. Specifically, they mimic the way biological neurons form connections between one another. Similar to their biological counterparts, artificial neurons are computationally simple but rely on their large numbers and cross-connections to model complex dependencies.
Frank Rosenblatt proposed the first neural network in 1958, which he called the Perceptron [44]. The Perceptron consists of a single layer of neurons represented by the weight vector w, a bias parameter b, and an activation function g(x). Given the input vector x, a perceptron produces the following output:
f(x; w, b) = g(w^T x + b)    (2.1)
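As a concrete illustration (not part of the original text), Equation 2.1 can be sketched with NumPy; the sigmoid activation and the example weights below are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b, g=sigmoid):
    # f(x; w, b) = g(w^T x + b), as in Equation 2.1
    return g(np.dot(w, x) + b)

# w^T x + b = 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0, so the output is sigmoid(0) = 0.5
out = perceptron(np.array([1.0, 2.0]), np.array([0.5, -0.25]), b=0.0)
```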
Figure 2.1: A 4-layer neural network 1

The Perceptron eventually evolved into multilayer perceptrons (MLPs), or neural networks. An MLP consists of multiple layers of neurons, including an input layer, an output layer and a number of hidden layers. Figure 2.1 illustrates an example of a 4-layer neural network.

The input layer is a vector representation of input data. This layer contains raw numerical values that directly model the input. The output layer is a vector representation of the problem's output. For binary classification problems, this may be a single value denoting the prediction likelihood. Hidden layers sit between the input and output layers. They make up the neural network's abstract representation space. Hidden layers may have arbitrary sizes and can stack to form "deeper" networks. As it is not typically possible to understand or infer insights from these layers, their information is "hidden" to human observers.
MLPs are "densely" connected, in which a neuron in layer i is connected to every neuron in layer i + 1. A connection between two neurons denotes that the receiving neuron takes the sending neuron's output value as its input.

Neural networks implement two primary procedures: the forward pass and backpropagation. The forward pass generates a network's output by iterating through its layers (see Algorithm 1). The output for layer j is defined in [15] as:

h^(j) = g^(j)(W^(j)^T · h^(j−1) + b^(j))    (2.2)
1 https://technology.condenast.com/story/a-neural-network-primer
where g^(j) is the j-th layer's activation function, W^(j) is the weight matrix, b^(j) is the bias weight, and h^(0) = x. The size of W^(j) corresponds to the number of neurons at layers j and j − 1.
MLPs require non-linear activations to model complex relations. The most common activations are the tanh and sigmoid functions. For the output layer, the activation function is usually chosen such that it constrains the output space to the desired range. For example, binary classification problems often activate the output with sigmoid, as sigmoid(x) ∈ (0, 1).
Algorithm 1: Neural network forward pass
Input: Vector of input features x; network layers L = ((W^(1), b^(1), g^(1)), ..., (W^(m), b^(m), g^(m)))
Output: Predictions h^(m); outputs at each layer Z = (z^(1), ..., z^(m))
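A minimal NumPy sketch of the loop in Algorithm 1 (an illustration, not the thesis's implementation; the layer shapes and ReLU activation are arbitrary choices):

```python
import numpy as np

def forward(x, layers):
    """Algorithm 1: iterate h^(j) = g^(j)(W^(j)^T h^(j-1) + b^(j)) over all layers."""
    h = x
    z = []                                   # outputs at each layer
    for W, b, g in layers:
        h = g(W.T @ h + b)
        z.append(h)
    return h, z

relu = lambda v: np.maximum(v, 0)
layers = [
    (np.eye(3), np.zeros(3), relu),          # 3 -> 3 identity layer
    (np.ones((3, 1)), np.zeros(1), relu),    # 3 -> 1 summing layer
]
# relu([1, -2, 3]) = [1, 0, 3], then the summing layer gives 1 + 0 + 3 = 4
y, z = forward(np.array([1.0, -2.0, 3.0]), layers)
```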
Algorithm 2: Neural network backpropagation
Input: Vector of input features x; ground-truth labels y; network layers L = ((W^(1), b^(1), g^(1)), ..., (W^(m), b^(m), g^(m)))
Output: Gradients at each layer ∆ = (δ^(1), ..., δ^(m))
Gradient descent uses this property to continually update the neural network. The update vector is multiplied by a learning rate, whose values are subtracted from the weights of each neuron. Learning rates help throttle the update process to ensure that we eventually reach a local minimum. Algorithm 3 describes this process in detail.

Algorithm 3: Gradient Descent
Input: Vector of input features x; ground-truth labels y; network layers L = ((W^(1), b^(1), g^(1)), ..., (W^(m), b^(m), g^(m))); learning rate γ
Output: The updated network
Figure 2.2: Simple visualization of gradient descent2
The gradient descent algorithm can "train" a neural network by continually applying backpropagation and updating over a training dataset. Stochastic Gradient Descent (SGD) uses constant-sized batches for each iteration instead of the entire dataset, allowing training on massive datasets that do not fit into memory. There are also several variants of SGD that adapt the learning rate to find better minima or reach convergence faster. Such strategies include Adam [26] and Adadelta [59], among others.
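As a sketch (not the thesis's training code), one-dimensional gradient descent on the toy objective f(w) = w², whose gradient is 2w, looks like:

```python
def gradient_descent(w, grad, lr=0.1, steps=100):
    """Algorithm 3 in miniature: repeatedly subtract the scaled gradient."""
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Minimise f(w) = w^2 starting from w = 4.0; the minimum is at w = 0.
# Each step multiplies w by (1 - 0.1 * 2) = 0.8, so w shrinks toward 0.
w_final = gradient_descent(4.0, lambda w: 2.0 * w)
```

Too large a learning rate would make the updates overshoot and diverge, which is the throttling role of γ described above.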
2 https://towardsdatascience.com/quick-guide-to-gradient-descent-and-its-variants-97a7afb33add

While neural networks can model highly complex functions by adding neurons and layers, their size in practice is constrained by two problems: overfitting and gradient vanishing. Overfitting is a common problem for highly expressive machine learning models, where the model performs extremely well on the training set but poorly on unseen data. In other words, these models do not generalize. Large MLPs are highly susceptible to overfitting, as connections can be made at a very large scale. At a certain level, an overfitting MLP can essentially "remember" the training dataset through its neurons, thus not achieving its goal of learning generalized features.

Gradient vanishing is an issue that arises in relatively deep MLPs. As gradients are computed with the derivative chain rule, their values slowly diminish at layers far away from the output. This means that most learning happens at the final layers of the network, while early layers contribute very little. These issues imposed limits on how neural networks were designed, before the introduction of a set of new models and techniques for deeper and larger neural networks, collectively referred to as Deep Learning.
Fast implementations of the forward pass and backpropagation algorithms are also crucial to developing neural networks. While the logical steps for backpropagation are relatively trivial, larger and more complex operations can quickly make implementations cumbersome. This complexity is addressed with computational graphs, which allow gradients and derivatives to be tracked during the forward pass itself (see Figure 2.3).
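A toy scalar computational graph can illustrate the idea (this is a minimal sketch of reverse-mode differentiation, not any framework's actual implementation):

```python
class Node:
    """A computational-graph node that stores its local gradients w.r.t. its inputs."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (input node, local gradient)
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        # d(uv)/du = v and d(uv)/dv = u
        return Node(self.value * other.value,
                    [(self, other.value), (other, self.value)])

    def backward(self, upstream=1.0):
        # Chain rule: accumulate upstream gradient times each local gradient
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)

x, y = Node(2.0), Node(3.0)
z = x * y + x        # z = xy + x, so dz/dx = y + 1 = 4 and dz/dy = x = 2
z.backward()
```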
Figure 2.3: Example of a computational graph. Computation nodes store their derived gradients w.r.t. their inputs. 3

Another challenge for neural networks is execution speed. Even smaller MLPs can take significant amounts of time to train due to their large number
3 http://datahacker.rs/004-computational-graph-and-autograd-with-pytorch/
of connections and the stochastic nature of SGD. For larger datasets, training can take many hours or even days on powerful CPUs. Raina et al. [40] were the first to propose the use of graphics processing units (GPUs) for training and running neural networks. As most neural network operations rely on matrix computations, GPUs proved to be highly adept at the task, showing up to 40 times improvement over running on CPUs. Other works have also explored multi-node execution, either with parameter servers or peer-to-peer protocols. Hardware alternatives to GPUs are also available, albeit with less adoption, such as Tensor Processing Units (TPUs) or FPGAs.
Figure 2.4: Speed comparison on several deep learning tasks between Xeon CPUs and NVIDIA Tesla GPUs 4
Deep learning frameworks combine computational graphs, GPU support and helpful abstractions to form a complete ecosystem for developing neural models. Theano [7] was the first such framework, evolving from a tool used mostly for convex optimization. Google's TensorFlow [1] adapted much of the ideology of Theano, most significantly the idea of "graph-declaration-as-code", to create a powerful framework that supports single-node and multi-node execution. Both frameworks require the computational graph to be explicitly defined and run separately (similar to high-performance computing libraries like Spark), with an optimization phase to improve performance.
Despite its early popularity, TensorFlow was notoriously difficult for beginners, and even experienced users had a hard time debugging complex models written in TensorFlow. PyTorch [39] emerged as a younger framework seeking to alleviate such issues. PyTorch executes computations "eagerly" and builds the computational graph on the fly. It also featured friendlier, higher-level abstractions than TensorFlow at the time. These advantages slowly shifted many researchers to using PyTorch to implement their ideas quickly, while industry applications relied on TensorFlow to maximize performance. However, both TensorFlow and PyTorch are seeking to cover both use cases, as TensorFlow 2.0 introduces an "eager execution" mode and PyTorch 1.0 introduces a static graph mode.

4 https://www.nextplatform.com/2018/09/12/nvidia-takes-on-the-inference-hordes-with-turing-gpus/
2.3 Convolutional neural networks
The idea of applying machine learning models to images is not particularly new. However, early attempts had to overcome the problem of input size, namely that image inputs are often inconveniently large. A "small" image of size 200 × 200 already has 40,000 features (!). Thus, these early models relied on feature extraction methods such as bag-of-visual-words, SIFT or HOG to condense images into more compact forms. While this approach can yield positive results, it relies on assumptions made by feature extractors that may not be robust to diverse inputs.
Convolutional neural networks also rely on a core assumption, which states that an image can be understood with high accuracy by examining smaller sliding windows. This assumption is carried out in a special layer type called the convolution layer.
Convolution is a common operation in image processing, especially for blurring, sharpening, or detecting edges. A convolution uses a fixed "kernel", a small 2-D matrix containing weights, and slides the kernel across both image dimensions. For each image segment, the kernel is multiplied with pixel values and aggregated to form an output matrix. In essence, the value of item (i, j) in the output matrix is an aggregation of all kernel weights and the pixel values at location (i, j). It encodes the local state of the pixel, i.e., reflects what types of pixels are surrounding it.
The convolution layer (or CONV layer) consists of k convolution kernels acting as weights. Given an input image represented as a tensor of shape h × w × c, each kernel produces an output matrix of shape h′ × w′. Kernel outputs are stacked to form the final output for the CONV layer, with a shape of k × h′ × w′ (see Figure 2.5). Note that since convolution layers produce outputs of similar shapes to their inputs, they can be easily stacked.
Figure 2.5: An example convolution layer 5
A basic convolution layer can be configured with several parameters. The kernel size often varies between 3 × 3 and 7 × 7, which affects the network's receptive field. The stride can also vary between 1 and 3 pixels. Kernel size and stride also affect the final size h′ × w′ of the output matrix. Some combinations, such as 3 × 3 kernels with stride 1, cause the input to shrink after going through convolution. As this shrinking is often undesirable, padding is added to the input to preserve size. Finally, the number of kernels k can be set arbitrarily, with each kernel implying a different local feature being learned.
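These shape rules and the sliding-window operation can be sketched as follows (an illustrative naive implementation using the standard output-size formula h′ = ⌊(h + 2·pad − k)/stride⌋ + 1; real frameworks use far faster algorithms):

```python
import numpy as np

def conv_out_size(n, k, stride=1, pad=0):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * pad - k) // stride + 1

def conv2d(img, kernel):
    """Naive single-channel 'valid' convolution (cross-correlation, as in CNNs)."""
    kh, kw = kernel.shape
    oh = conv_out_size(img.shape[0], kh)
    ow = conv_out_size(img.shape[1], kw)
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Multiply the kernel with the window at (i, j) and aggregate
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# A 3x3 all-ones kernel over a 5x5 all-ones image: each output entry sums 9 ones
out = conv2d(np.ones((5, 5)), np.ones((3, 3)))
```

Note that a 3 × 3 kernel with stride 1 and padding 1 preserves the input size, which is why such padding is common.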
Convolution layers address many key problems with neural networks for images. They allow a small number of weights to view and distill features from the entire image, essentially creating learnable feature extractors.
Figure 2.6: An example of max-pooling6
One of the first successful CNNs was LeNet [29], introduced by LeCun et al. in 1998. Aside from the introduction of convolutional layers, several key concepts were laid out by the seminal paper. These include the use of
5 http://cs231n.github.io/understanding-cnn/
6 http://cs231n.github.io/understanding-cnn/
pooling layers (see Figure 2.6), which impose a hard filter on a feature map and reduce its size. Pooling layers act as bottlenecks that select abstract, high-level features to feed to subsequent layers of the network. The ReLU function (ReLU(x) = max(x, 0)) was also proposed as the activation in place of sigmoid or tanh. An advantage of ReLU is that it lessens the impact of gradient vanishing compared to other non-linear activations.
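A sketch of ReLU followed by 2 × 2 max-pooling (an illustrative NumPy version with arbitrary input values, not LeNet's actual code):

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(x, 0), applied element-wise."""
    return np.maximum(x, 0)

def max_pool2x2(x):
    """2x2 max-pooling with stride 2: keep the largest value in each window."""
    h, w = x.shape
    trimmed = x[:h - h % 2, :w - w % 2]          # drop odd remainder rows/cols
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1., -2., 3., 0.],
              [4., 5., -6., 7.],
              [0., -1., 2., 2.],
              [3., 1., 0., -4.]])
pooled = max_pool2x2(relu(x))   # negatives are zeroed, then each 2x2 block is maxed
```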
For classification tasks, two or three fully-connected layers are appended after the convolution layers to produce the final output vector. In a sense, we can consider the convolution layers to be encoders that compress the image into a compact representation for the MLP classifier.
Figure 2.7: LeNet-5 architecture [ 29 ]
Following the success of LeNet, AlexNet [27] and VGG [47] were some of the earliest improvements to CNN design. Notably, VGG was one of the first truly "deep" neural networks with more than a handful of layers. Despite a rather simple architecture, with linear stacks of convolution and pooling layers, VGG-16 was quite robust for its time, achieving a top-5 accuracy of 92.7% on the ImageNet dataset. Figure 2.8 illustrates the VGG-16 architecture in detail.
A major hurdle for models such as VGG-16 when going deeper is gradient vanishing. As the number of layers grows, feedback signals from the loss function simply cannot be retained through the linear layer stack. ResNet [16] addressed this problem with skip connections. Instead of stacking layers linearly, ResNet includes layers that take as input the sum of the previous layer and the k-th previous layer (see Figure 2.9). These skip connections serve to combat gradient vanishing deep into the network, while also smoothing out the loss landscape. ResNet-50 is the most common variation of ResNet, achieving a top-5 accuracy of 94.8% on ImageNet. The architecture has also proven to be
7 https://neurohive.io/en/popular-networks/vgg16/

Figure 2.8: Architecture of VGG-16 7
quite robust to many different problem types, including malware classification [41], food recognition [35], and even speech and natural language domains [9]. For many years, it was also the de facto model for industry applications and provided the encoder backbone for many network types.
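The skip-connection idea can be sketched in one line (an illustrative example, with an arbitrary linear map standing in for the residual branch F):

```python
import numpy as np

def residual_block(x, f):
    """A ResNet-style skip connection: output = x + F(x).

    Even when F's gradient is tiny, the identity path lets gradients
    flow back to earlier layers unattenuated.
    """
    return x + f(x)

# Hypothetical residual branch: a simple scaling, F(x) = 0.1 * x
out = residual_block(np.array([1.0, 2.0]), lambda v: 0.1 * v)
```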
While ResNet pioneered the use of skip connections, many later works proposed different improvements to the concept. ResNeXt [58] utilizes a multi-branch design similar to that of GoogLeNet [49] to pair with skip connections. Meanwhile, Huang et al. [20] propose DenseNet, a massively densified version of ResNet in which skip connections are made to every prior layer in the block. While this approach yielded improvements in accuracy, it is also highly expensive computationally. Several works have attempted to "sparsify" DenseNet, including LogDenseNet [17], SparseNet [33] and Harmonic DenseNet (HarDNet) [10].
Figure 2.9: Example of a skip connection [16]

Aside from skip connections, a key intuition in designing CNNs is the use of branches. The most notable example of this mindset is the Inception family of
architectures. The first Inception network (Inception V1 or GoogLeNet) [49] uses multiple convolution layers with different filter sizes (3 × 3, 5 × 5 and 7 × 7), along with a max-pooling layer. Outputs from these layers are then concatenated (see Figure 2.10). The goal of these multiple kernel sizes is to let the model learn which kernel is most optimal for the given image. Large kernels are suited to global information, while small kernels help represent finer details. Additionally, growing the network "wider" instead of deeper helps avoid gradient vanishing. Later versions of Inception [50] further optimize the architecture by removing representational bottlenecks (where feature map dimensions are reduced too drastically) and applying improvements to the training phase.
Figure 2.10: Architecture of Inception V1 (GoogLeNet) [ 49 ]
Many previous works have also proposed different methods for regularizing CNNs and neural networks in general. Regularization essentially imposes constraints on a model in hopes of preventing it from overfitting. L1 and L2 regularization are commonly used in linear models, and simply prevent weights from becoming too large. Unfortunately, this approach does not work very well for CNNs, as it hinders most of the network's expressiveness. Srivastava et al. [48] proposed dropout in 2014, which proved effective for both CNNs and feed-forward MLPs. The idea of dropout is to simply turn off a random subset of neurons in each iteration of SGD (see Figure 2.11). Neurons that are turned off simply emit 0 as their output and therefore do not contribute to the loss or learn during the iteration. While quite simple, dropout is surprisingly effective. By taking out random neurons during training, we enforce a level of redundancy and robustness in the model. Each subset of neurons in the network needs to hold enough useful information to still produce correct predictions even as other neurons are turned off. Meanwhile, batch normalization (or BatchNorm) [21] takes a slightly different approach
for normalization. The authors identified high levels of numerical instability in many neural network implementations, which causes overfitting and slow convergence. Batch normalization applies a normalization typically only used for input data (i.e., subtracting the mean and dividing by the deviation) to the output feature maps of convolution layers. Unlike training images, however, output feature maps for the same image constantly change throughout training, and it would be infeasible to recalculate the mean and deviation constantly. Instead, batch normalization layers keep track of running mean and deviation values and update them during training.
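The core normalization step can be sketched as follows (an illustration only; the learnable scale/shift parameters and the running statistics used at inference time are omitted for brevity):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch: (x - mean) / sqrt(var + eps).

    x has shape (batch, features); eps guards against division by zero.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

# Two samples, two features on very different scales; both become ~[-1, 1]
x = np.array([[1.0, 100.0],
              [3.0, 300.0]])
out = batch_norm(x)
```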
Figure 2.11: Example of dropout [ 14 ]
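The dropout scheme illustrated in Figure 2.11 can be sketched as follows (a hypothetical NumPy illustration using the common "inverted" variant, which rescales surviving units by 1/(1−p) so the expected activation is unchanged; the vector size and drop probability are arbitrary):

```python
import numpy as np

def dropout(x, p, rng, training=True):
    """Zero each unit with probability p during training; identity at inference."""
    if not training:
        return x
    mask = rng.random(x.shape) >= p       # True for units that survive
    return x * mask / (1.0 - p)           # rescale so E[output] == x

rng = np.random.default_rng(0)
out = dropout(np.ones(10000), p=0.5, rng=rng)   # survivors become 2.0, the rest 0.0
```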
A recent development in network architectures has been the use of neural architecture search techniques, which produce highly optimal network designs without manual tuning. A well-known example is EfficientNet [52] (see Figure 2.12), a family of architectures providing a range of trade-offs between accuracy and latency/size. EfficientNet uses a baseline architecture consisting of Mobile Inverted Bottleneck blocks and searches for different scaling configurations regarding width, depth, and resolution. EfficientNet V2 [53] further improves model size and training speed.

2.4 Attention mechanisms
Attention is a powerful mechanism applied in numerous neural networks, most notably in natural language processing and computer vision. In essence, the goal of adding attention is to help neural networks focus on (i.e., pay attention to) important parts of the input. For image processing, this focus manifests as a mask over the input tensor, in which high-value segments carry higher attention values. When processing sequences such as text or audio, attention works as a type of soft memory unit that signals relevant items at each timestep.

Figure 2.12: Architecture of EfficientNet-B0 [3]
The first work to introduce attention is that of Bahdanau et al. [5] in 2014, when it was applied in the domain of machine translation. This version of attention produces a probability map over the entire input sequence using the softmax function, then applies the map to the decoder of a sequence-to-sequence recurrent network (see Figure 2.13).
Later works further apply attention in different problem domains and network architectures, many with positive results. Wang et al. [56] proposed Residual Attention Network, which stacks multiple attention modules alongside skip connections to process images. The seminal work by Vaswani et al. [55] introduced Transformers, which rely heavily on multi-head self-attention modules (see Figure 2.14). The self-attention mechanism used in Transformers essentially allows the network to see and focus on relevant parts of its own output at previous timesteps.
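The scaled dot-product attention at the core of these modules can be sketched in a few lines of NumPy. This is a simplified, single-head version with illustrative shapes; real Transformers add learned projections and multiple heads.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V — a simplified single-head sketch of the
    attention operation used in Transformers."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # pairwise relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax -> probability map
    return weights @ V, weights                      # weighted sum of values

# Illustrative toy sequence: 3 items with 4-dimensional features
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
out, attn = scaled_dot_product_attention(X, X, X)    # self-attention: Q = K = V
assert np.allclose(attn.sum(axis=-1), 1.0)           # each row is a probability map
```

Each row of `attn` is the softmax probability map described above, telling the model how strongly each item should attend to every other item.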
Figure 2.13: Attention mechanism proposed in [5]
2.5 Convolutional neural networks for semantic segmentation

Long et al. [34] proposed Fully Convolutional Networks (FCNs), adapting well-known CNN classification architectures to semantic segmentation. As the name suggests, FCNs do not include the FC layers. In fact, FCN simply replaces the FC layers in the original architectures (including AlexNet, VGG-16 and GoogLeNet) with convolution layers to act as the “encoder”.
While the modification seems straightforward, the heatmaps produced by the encoder are far too coarse to produce accurate predictions. This is because CNNs progressively reduce the feature map size to learn more abstract and global features. While this works fine for classification problems (since the desired output is highly abstract), it greatly hinders segmentation tasks. The final feature map in VGG-16, for example, is only 8 × 8 for a 256 × 256 input image, a 32-fold downsampling. To address this problem, FCN uses a series of connections between the outputs of each pooling layer, forming a directed acyclic graph (DAG). This DAG acts as the network’s decoder (see Figure 2.15). These connections help supply more fine-grained information to the encoder’s output.

Figure 2.14: Transformer architecture [55]. The network processes items in the sequence one-by-one, passing the output to the decoder for the next item.
(a) Overall architecture
(b) Decoder DAG
Figure 2.15: Architecture of the Fully Convolutional Network [34]

While FCNs achieved state-of-the-art results in semantic segmentation, their design was still heavily constrained by the linear stacking of convolution layers. Ronneberger et al. [43] proposed a more elegant model to address these issues, called U-Net (see Figure 2.16). U-Net features a symmetrical, U-shaped architecture consisting of an encoder and decoder. The encoder is a standard convolutional network similar to FCN, producing progressively smaller feature maps with a higher level of abstraction. However, instead of combining feature maps on different levels directly, U-Net uses another convolutional network as the decoder. The decoder upsamples the feature map with each block, combining the upsampled features with corresponding features from the encoder (a form of skip connection). The final decoder layer restores the feature map to the input’s original size and produces the network’s prediction. Besides its nice symmetric properties, this design addresses key challenges faced by FCNs. It allows coarse abstract features and fine-grained features to be combined through each decoder layer. The skip connections between encoder and decoder blocks also reduce gradient vanishing. U-Net achieved state-of-the-art performance on the ISBI 2012 EM segmentation dataset and was quickly adapted for many other segmentation problems.
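A single U-Net decoder step can be sketched in NumPy. This is a simplified illustration with a nearest-neighbor 2× upsample and made-up shapes; a real decoder block would follow the concatenation with convolution layers.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def decoder_step(decoder_feat, encoder_feat):
    """One U-Net decoder step: upsample the previous decoder output, then
    concatenate the matching encoder feature map along the channel axis
    (the skip connection)."""
    up = upsample2x(decoder_feat)
    assert up.shape[1:] == encoder_feat.shape[1:], "spatial sizes must match"
    return np.concatenate([up, encoder_feat], axis=0)

# Illustrative shapes: a 64-channel 8x8 decoder map meets a 32-channel
# 16x16 encoder map from the corresponding encoder level.
dec = np.zeros((64, 8, 8))
enc = np.zeros((32, 16, 16))
out = decoder_step(dec, enc)
assert out.shape == (96, 16, 16)
```

The concatenation is how coarse abstract features (from the decoder path) and fine-grained features (from the encoder skip connection) are combined at every level.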
Following the introduction of U-Net, many have proposed improvements to the architecture. UNet++ [60] uses nested skip connections between the encoder and decoder. Nested skip connections allow features from higher-level encoder blocks to combine with lower encoder blocks before reaching the decoder. These connections also contain more skip connections themselves. All these skip connections create a highly connected architecture with free-flowing information between blocks. ResUNet++ [25] replaces the encoder in UNet++ with ResNet backbones. Meanwhile, DoubleUNet [23] stacks two U-Nets sequentially to create a larger, more powerful network. The authors used VGG-16 as the backbone and added squeeze-and-excitation units [18]
Figure 2.16: Overall U-Net architecture [43]
to better model channel-wise dependencies. DoubleUNet also incorporates the Atrous Spatial Pyramid Pooling (ASPP) [11] module, which expands the network’s receptive field with multiple sampling rates. However, DoubleUNet only has a single connection between the two U-Nets, severely limiting its information flow. Tang et al. [54] proposed Coupled U-Net to improve on this limitation by adding skip connections between the two networks. Oktay et al. [38] added attention gates to the skip connections in U-Net, which help highlight salient features and improve convergence.
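The attention-gate idea can be sketched as follows. This is a heavily simplified NumPy version with plain matrices in place of the 1×1 convolutions used by Oktay et al.; all shapes and weights here are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x, g, Wx, Wg, psi):
    """Simplified additive attention gate: a gating signal `g` from a coarser
    layer scores each position of the skip feature `x`, and the skip
    connection is scaled by the resulting (0, 1) attention mask."""
    scores = np.maximum(Wx @ x + Wg @ g, 0.0)  # ReLU of additive features
    alpha = sigmoid(psi @ scores)              # per-position attention in (0, 1)
    return x * alpha                           # suppress irrelevant regions

# Illustrative sizes: 4-channel features over 6 spatial positions
rng = np.random.default_rng(1)
x = rng.normal(size=(4, 6))                    # skip-connection features
g = rng.normal(size=(4, 6))                    # gating signal from a deeper block
Wx, Wg = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
psi = rng.normal(size=(1, 4))                  # projection to a scalar score
out = attention_gate(x, g, Wx, Wg, psi)
assert out.shape == x.shape
```

Because `alpha` stays between 0 and 1, the gate can only attenuate skip features, letting the decoder concentrate on regions the coarser features deem relevant.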
PraNet (Parallel Reverse Attention Network) [13] is a CNN architecture designed specifically for medical image segmentation. The overall architecture is shown in Figure 2.17. PraNet follows the encoder-decoder pattern laid out by U-Net with significant modifications. PraNet’s encoder is a standard CNN backbone, in this case Res2Net. PraNet’s decoder is called the Parallel Partial Decoder [57], which aggregates high-level features from the encoder. In addition, PraNet uses an attention mechanism called Reverse Attention on the skip connections. The model achieved state-of-the-art performance on several polyp segmentation datasets.
Figure 2.17: Overall PraNet architecture [13]

HarDNet-MSEG [19] builds upon the design of PraNet with a focus on inference speed. In fact, the HarDNet-MSEG code repository is forked from PraNet. HarDNet-MSEG greatly reduces computational complexity by using the lightweight HarDNet68 backbone instead of Res2Net, and removing the reverse attention modules (see Figure 2.18). The result is a very fast and lean neural network, with inference speed up to 86 FPS on an NVIDIA RTX 2080 GPU. HarDNet-MSEG also achieves state-of-the-art performance on the Kvasir dataset.
Figure 2.18: Overall HarDNet-MSEG architecture [19]
2.6 Polyp segmentation and neoplasm classification
Polyp segmentation has been a popular benchmark for many medical image segmentation methods. Earlier works used hand-crafted features [22, 46] including color, shape, texture, etc., to separate polyps from the surrounding mucosa. In recent years, more generic neural networks such as UNet++ [60], PraNet [13], DoubleUNet [23] and many other models have used polyp segmentation as their benchmark task. This is partly due to the many public datasets available for the problem. Notable datasets include the Kvasir-SEG dataset [24], the CVC-ClinicDB dataset [8] and the CVC-ColonDB dataset [51].

On the contrary, neoplasm classification for polyps has seen significantly less research. Major challenges for this research direction include the lack of public datasets and the general difficulty of labeling data. Ribeiro et al. [42] were among the few authors who tackled polyp neoplasm classification. However, their work only considered a full-image classification problem without any segmentation data. They extracted and labeled a dataset of 100 polyp images from endoscopy videos, each containing exactly one polyp. Several CNN models were tested on this dataset, including VGG, AlexNet, and GoogLeNet. While
it is not impossible to couple this approach with a segmentation module, it can become very inefficient in practice, especially for systems with real-time constraints or embedded hardware.
2.7 Problem formulation
This thesis describes the Polyp Segmentation and Neoplasm Detection problem, hereafter denoted as PSND. PSND extends the polyp segmentation problem by separating polyps into two sub-classes: neoplastic and non-neoplastic (see Figure 2.19).
Formally, we define the PSND problem as follows. Given an input 2-dimensional image, generate an equal-sized matrix whose values represent the label for the corresponding pixel, which must be one of three values:

• 0 if the pixel is a background (non-polyp) pixel;
• 1 if the pixel is part of a non-neoplastic polyp;

• 2 if the pixel is part of a neoplastic polyp.

Figure 2.19: Classification targets for the polyp segmentation problem and the polyp segmentation and neoplasm detection problem
(a) Input image (b) Polyp segmentation (c) PSND
Figure 2.20: Expected outputs for polyp segmentation and PSND. Black regions denote background pixels. White regions denote polyp regions. Green and red regions denote non-neoplastic and neoplastic polyp regions, respectively.
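The three-valued labeling defined above can be illustrated with a tiny NumPy mask. The values below are made up for illustration; real PSND masks are full-resolution endoscopic image annotations.

```python
import numpy as np

# Hypothetical 4x4 PSND ground-truth mask:
# 0 = background, 1 = non-neoplastic polyp pixel, 2 = neoplastic polyp pixel
mask = np.array([
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 2, 2],
    [0, 0, 2, 2],
])

# A plain polyp-segmentation target merges classes 1 and 2 into one "polyp" class
binary = (mask > 0).astype(np.uint8)

assert mask.shape == binary.shape          # equal-sized matrices
assert set(np.unique(mask)) <= {0, 1, 2}   # only the three permitted labels
assert binary.sum() == (mask > 0).sum()    # PSND refines, not changes, the region
```

Collapsing `mask` to `binary` recovers the classic polyp segmentation target, which shows why PSND is a strict extension of that problem.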
Figure 2.20 demonstrates an example of PSND’s output. The problem presents several unique challenges. A polyp’s surface pattern can vary greatly for both neoplastic and non-neoplastic classes, sometimes separating into several sections within the same polyp. In one lesion, a neoplastic area may only take up a portion of the polyp’s surface, requiring very fine-grained spatial understanding to tell apart. This is a differentiating factor between PSND and generic multi-class segmentation. While datasets such as PASCAL-VOC have many more classes, they are primarily well-defined and highly distinguishable (e.g., car, bird, person, etc.). Moreover, from a medical perspective, misclassification could be very costly. False positives could result in over-indication of endoscopic interventions or surgery, while false negatives could result in delaying suitable treatment.
Figure 2.21: Example of an image with an undefined polyp. Pixels annotated in yellow denote the undefined polyp area.
Another challenge for PSND arises during actual data labeling. Due to the highly ambiguous nature of neoplasms, some polyps remain uncertain even for human annotators. These polyps are represented as a fourth class in the data: polyps with undefined neoplasticity (see Figure 2.21). However, our formulation does not include this class, because it is often undesirable for automatic systems to return “undefined” results. Additionally, such labeling can create confusion for learning models and degrade accuracy. Therefore, we introduce special methods to handle and make use of undefined polyps in the following sections.
For segmenting esophageal lesions, the problem formulation in this thesis is similar to classic semantic segmentation problems. Given an input image, generate an equal-sized matrix whose values represent the label for the corresponding pixel: 0 if the pixel is a background pixel and 1 if the pixel is a foreground (lesion) pixel.
While the formulation is similar in nature, the specific segmentation target (esophageal lesions) has not been addressed in any previous work. Specific properties of the data may have adverse effects on existing models.
Our goal when designing NeoUNet is to create a strong baseline for PSND and esophageal lesion segmentation. This means that we try to balance different trade-offs for NeoUNet, and avoid specializing for specific use cases.
3.1.2 Architecture overview
We base NeoUNet on the tried-and-true U-Net architecture [43], along with some characteristics from the Attention U-Net model introduced by Abraham et al. [2]. The network comprises an encoder-decoder structure, with symmetrical shortcut connections between encoder and decoder blocks. The encoder is a HarDNet68 network [10], from which 5 feature maps are extracted corresponding to different abstraction levels. The decoder consists of a series of convolution blocks, each of which takes a concatenated tensor of the previous decoder block’s output and the output from a corresponding encoder block. The encoder block outputs also go through attention gate modules (except for the first and last blocks). Each decoder block then passes its output to an output block to produce the final prediction mask. Figure 3.1 describes NeoUNet’s general architecture.

Figure 3.1: Overview of NeoUNet’s architecture
NeoUNet produces 4 output masks instead of 1 to facilitate deep supervision, a training technique in which the loss function is calculated for multiple outputs at different levels. Deep supervision typically improves a model’s robustness, as the decoder at each level must learn strong and concrete features to produce accurate feature maps. During inference, we only use the prediction mask from the final decoder layer.
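The deep-supervision objective can be sketched as a weighted sum of per-level losses. The uniform weights and the mean-squared-error placeholder below are illustrative assumptions, not NeoUNet's actual training objective.

```python
import numpy as np

def mse(pred, target):
    """Placeholder per-output loss; a real model would use a segmentation loss."""
    return float(np.mean((pred - target) ** 2))

def deep_supervision_loss(outputs, target, weights=None):
    """Sum the loss over every decoder level's prediction mask.
    `outputs` holds one prediction per level, all resized to the target's shape."""
    weights = weights or [1.0] * len(outputs)
    return sum(w * mse(o, target) for w, o in zip(weights, outputs))

# Illustrative: 4 decoder outputs supervised against one ground-truth mask,
# with deeper levels (later in the list) predicting more accurately.
target = np.ones((8, 8))
outputs = [np.full((8, 8), v) for v in (0.5, 0.75, 0.9, 1.0)]
loss = deep_supervision_loss(outputs, target)
assert loss > mse(outputs[-1], target)  # every level contributes to the loss
```

Because each level's prediction enters the loss, gradients flow directly into every decoder block, which is what pushes intermediate levels to learn strong features on their own.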
On the surface level, NeoUNet’s primary difference compared to existing segmentation networks (such as PraNet or Attention U-Net) is that the model outputs a 2-channel segmentation mask, as opposed to a 1-channel binary mask. However, NeoUNet is also designed to cope with specific difficulties of the PSND problem, including high inter-class similarity and missing data.
3.1.3 Encoder backbone
HarDNet (Harmonic DenseNet) [10] is a convolutional neural network architecture inspired by DenseNet [20]. DenseNet’s core innovation is the facilitation of feature reuse through dense skip connections. Each layer in a Dense Block receives the concatenated feature map of every preceding layer, giving