Hanoi University of Science and Technology
School of Information and Communication Technology
Master Thesis in Data Science
Unified Deep Neural Networks for Anatomical Site Classification and Lesion Segmentation for Upper Gastrointestinal Endoscopy
Contents
Abstract
List of Figures
List of Tables
List of Acronyms
1 Introduction
1.1 General introduction
1.2 Objectives
1.3 Main contributions
1.4 Outline of the thesis
2 Artificial Intelligence and Machine Learning
2.1 Basic concepts
2.2 Types of learning
2.2.1 Supervised learning
2.2.2 Unsupervised learning
2.2.3 Reinforcement learning
2.3 Techniques
2.3.1 Deep Learning
2.3.1.1 Deep Learning and Neural Networks
2.3.1.2 Perceptron
2.3.1.3 Feed forward
2.3.1.4 Recurrent Neural Network
2.3.1.5 Deep Convolutional Network
2.3.1.6 Training a Neural Network
2.3.2 Convolutional Neural Network
2.3.2.1 Image kernel
2.3.2.2 The convolution operation
2.3.2.3 Motivation
2.3.2.4 Activation function
2.3.2.5 Pooling
2.3.3 Fully convolutional network
2.3.4 Some common convolutional network architectures
2.3.4.1 VGG
2.3.4.2 ResNet
2.3.4.3 DenseNet
2.3.4.4 UNet
2.3.5 Vision Transformer
2.3.5.1 The Transformer
2.3.5.2 Transformers for Vision
2.3.6 Multi-task learning
2.3.7 Transfer learning
2.3.8 Avoid overfitting
3 Methodology
3.1 EndoUNet
3.1.1 Overall architecture
3.1.2 Encoder
3.1.3 Segmentation decoder
3.1.4 Classifiers
3.2 SFMNet
3.2.1 Overall architecture
3.2.2 Encoder
3.2.3 Compact generalized non-local module
3.2.4 Squeeze and excitation module
3.2.5 Feature-aligned pyramid network
3.2.6 Classifiers
3.3 Metrics and loss functions
3.4 Multi-task training
4 Experiments
4.1 Datasets
4.2 Data preprocessing and data augmentation
4.3 Implementation details
4.4 Experimental results
5 Conclusion
References
Abstract

Image processing is a subfield of computer vision concerned with understanding and extracting information from digital images. It has applications in many fields, including face recognition, optical character recognition, automated manufacturing inspection, medical diagnostics, and tasks connected to autonomous vehicles, such as pedestrian detection. In recent years, deep neural networks have become one of the most popular image processing approaches thanks to a number of significant advances.

The use of machine learning in biomedical applications can be structured into three main orientations: (1) as a computer-aided diagnosis tool to help physicians reach an efficient and early diagnosis, with better harmonization and fewer contradictory diagnoses; (2) to enhance patients' medical care with better-personalized therapies; and (3) to improve human wellbeing, for example by analyzing the spread of disease and social behaviors in relation to environmental factors [1]. In this work, I propose models for the first orientation that are capable of handling multiple simultaneous tasks pertaining to the upper gastrointestinal (GI) tract. Evaluated on a dataset of 11,469 endoscopic images, the models produced relatively positive results.
List of Figures
2.1 Reinforcement learning components
2.2 Relationship between AI, ML, and DL
2.3 Neural Network
2.4 Illustration of a deep learning model [2]
2.5 Perceptron
2.6 Architecture of a CNN [3]
2.7 Example of convolution operation [4]
2.8 Sparse connectivity, viewed from below [2]
2.9 Sparse connectivity, viewed from above [2]
2.10 Common activation functions [5]
2.11 Max pooling
2.12 Average pooling
2.13 Architecture of an FCN [6]
2.14 Architecture of VGG16 [7]
2.15 A residual block [8]
2.16 DenseNet architecture vs ResNet architecture [9]
2.17 UNet architecture [10]
2.18 Attention in Neural Machine Translation
2.19 The Transformer - model architecture [11]
2.20 Vision Transformer architecture [12]
2.21 Common form of multi-task learning [2]
2.22 The traditional supervised learning setup
2.23 Transfer learning
3.1 Architecture of EndoUNet
3.2 VGG19-based shared block
3.3 ResNet50-based shared block
3.4 DenseNet121-based shared block
3.5 EndoUNet decoder configuration
3.6 SFMNet architecture
3.7 Grouped compact generalized non-local (CGNL) module [13]
3.8 A Squeeze-and-Excitation block [14]
3.9 Overview comparison between FPN and FaPN [15]
3.10 Feature alignment module [15]
3.11 Feature selection module [15]
4.1 Demonstration of the upper GI tract
4.2 Some samples in the anatomical dataset
4.3 Some samples in the lesion dataset
4.4 Some samples in the HP dataset
4.5 Image augmentation
4.6 Learning rate in the training phase
4.7 EndoUNet - Confusion matrix on the anatomical site classification task on one fold
4.8 SFMNet - Confusion matrix on the anatomical site classification task on one fold
4.9 Confusion matrices on the lesion classification task on one fold
4.10 Some examples of the lesion segmentation task
List of Tables
3.1 Detailed settings of MiT-B2 and MiT-B3
4.1 Number of images in each anatomical site and lighting mode
4.2 Accuracy comparison on the three classification tasks
4.3 Dice score comparison on the segmentation task
4.4 Number of parameters and speed of models
List of Acronyms

DNN Deep Neural Network
CNN Convolutional Neural Network
RNN Recurrent Neural Network
MTL Multi-task Learning
RL Reinforcement Learning
1 Introduction

1.1 General introduction

Esophagogastroduodenoscopy (EGD) is a diagnostic procedure that visualizes the upper part of the GI tract down to the duodenum. It is an exploration method that accurately detects lesions of the GI tract that are difficult to identify with other tools (biomarkers or imaging). However, a substantial lesion miss rate, defined as a negative finding on endoscopy in patients diagnosed with lesions within three years, has been reported in the literature. In a study published by Menon et al. [19], this rate was 11.3% and similar for both esophageal and gastric cancers. In another paper, by Shimodate et al. from a Japanese institution [20], the authors concluded that the miss rate of gastric superficial neoplasms (GSN) was 75.2%. There are many reasons for this situation, such as the heterogeneous quality of endoscopy systems, endoscopists' differing levels of experience in technical performance and lesion evaluation, and lack of patient tolerance of the procedure. Therefore, computer-aided diagnosis is desirable to help improve the reliability of this procedure.
Deep learning (DL) has gained remarkable success in solving various computer vision tasks. In recent years, several DL-based methods have been proposed to deal with EGD-related tasks, such as informative frame screening, anatomical site classification, gastric lesion detection, and diagnosis. However, previous works often solve these tasks separately. Consequently, a computer-aided system that solves all the tasks simultaneously using separate task-specific DL models would require considerable memory and computational resources, which makes it difficult to deploy such systems on low-cost devices. On the other hand, collecting and annotating patients' data for medical imaging analysis is challenging in practice, and the lack of data can significantly reduce the models' performance.

In order to address these issues, this thesis proposes two models to simultaneously solve four EGD-related tasks: anatomical site classification, lesion classification, HP classification, and lesion segmentation. Each model includes a shared encoder for learning a common feature representation, followed by four output branches, one per task. As a result, the models benefit greatly from multi-task training, since they can learn a powerful joint representation from an extensive dataset that combines multiple data sources collected separately for each task. Experiments show that the proposed models yield promising results in all tasks and achieve competitive performance compared to single-task models.
1.2 Objectives
This work aims to build unified models to tackle multiple tasks relating to the upper gastrointestinal tract. The tasks include anatomical site classification, lesion classification, HP classification, and lesion segmentation.
1.3 Main contributions
The main contributions of this study are as follows:
• Introduce two unified deep learning-based models to simultaneously solve four tasks related to the upper GI tract: a CNN-based baseline model and a Transformer-based model.

• Evaluate the proposed methods on a Vietnamese endoscopy dataset.
1.4 Outline of the thesis
The rest of this thesis is organized as follows:
Chapter 2 presents an overview of the concepts of Artificial Intelligence, Machine Learning, Deep Learning, and related techniques.

Chapter 3 proposes two models to simultaneously solve tasks related to the upper gastrointestinal tract.

Chapter 4 presents the content of the experiments and the results obtained.

Chapter 5 concludes the thesis.
2 Artificial Intelligence and Machine Learning

2.1 Basic concepts

The ability to simulate human intelligence distinguishes AI from logic programming in computer languages. In particular, AI enables computers to gain human intelligence, such as thinking and reasoning to solve problems, communicating via understanding language and speech, and learning and adapting.

Artificial Intelligence is, in its simplest form, a field that combines computer science and robust datasets to enable problem-solving. It also includes the subfields of machine learning and deep learning, which are commonly associated with artificial intelligence.
AI can be categorized in different ways. This thesis divides AI into two categories based on its strength: weak AI and strong AI.
Weak AI, also known as Narrow AI, is a type of AI that has been trained to perform a particular task. Weak AI imitates human perception and aids humanity by automating time-consuming tasks and analyzing data in ways that humans cannot always perform. This sort of artificial intelligence is more accurately described as "narrow", as it lacks general intelligence and instead possesses intelligence tailored to a certain field or task. For instance, an AI that is great at navigation is typically incapable of playing chess, and vice versa. Weak AI helps transform massive amounts of data into useful information by identifying patterns and generating predictions. Most of the AIs that we see today are weak AIs, with typical examples such as virtual assistants (Apple's Siri or Amazon's Alexa), Facebook's news feed, spam filtering in email applications (Gmail, Outlook), and autonomous vehicles (Tesla, VinGroup).
Despite its strengths, weak AI has the potential to cause damage in the event of a system failure. For instance, spam filtering systems may misidentify essential emails and place them in the spam folder, or self-driving systems may cause traffic accidents owing to miscalculations.
Strong AI consists of both Artificial General Intelligence (AGI) and Artificial Super Intelligence (ASI). Artificial General Intelligence is a speculative kind of AI in which a machine would have intelligence equivalent to that of a person and be capable of self-awareness, problem-solving, learning, and future planning. Similarly, Artificial Super Intelligence (also known as superintelligence) is a theoretical kind of AI in which a machine would have intelligence and capacities superior to those of the human brain.

Strong AI is currently only a concept, with no practical examples. Nonetheless, academics continue to conduct research and search for development avenues for this form of AI.
Machine Learning (ML) is a branch of AI and computer science that focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy [22].
2.2 Types of learning
Given that the focus of the field of Machine Learning is "learning", Artificial Intelligence/Machine Learning employs three broad categories of learning to acquire knowledge: Supervised Learning, Unsupervised Learning, and Reinforcement Learning.

2.2.1 Supervised learning
In supervised learning, the computer is given labeled examples, so that for each input example there is a matching output value. This strategy is intended to assist model learning by comparing the output value produced by the model with the true output value to identify errors, then progressively modifying the model to reduce them. Supervised learning employs the learned patterns to predict output values for never-before-seen data (not in the training data). For classification and regression problems, supervised learning has proved itself to be accurate and fast.

• Classification is the process of discovering a function that divides a dataset into classes according to certain parameters. A computer program is trained on the training dataset and, based on this training, classifies new data into various classes. Classification has many use cases, such as spam filtering, customer behavior prediction, and document classification.

• Regression is a method for identifying relationships between dependent and independent variables. It aids in forecasting continuous variables, such as market trends and home prices.

Supervised learning works by modeling the linkages and dependencies between the target prediction output and the input features, such that it is feasible to predict the output values for new data based on the associations learned from the training datasets.
2.2.2 Unsupervised learning
In contrast, the input data are not labeled in unsupervised learning. Unsupervised learning describes a class of problems that involve using a model to describe or extract relationships in data. Machines can learn to recognize complex processes and patterns without human supervision. This method is especially beneficial when specialists do not know what to search for in the data and the data itself does not offer targets. In practice, the amount of unlabeled data is significantly greater than the amount of labeled data; hence, unsupervised learning algorithms play a crucial role in machine learning.
Among the many use cases of unsupervised learning, two main problems are often encountered: clustering, which involves finding groups in data, and density estimation, which involves summarizing the data distribution. A minimal sketch of both follows the list below.

• Clustering: k-means clustering is a well-known technique of this type, where k is the number of clusters to discover in the data. It shares the same goal as classification; however, in this case, no labels are provided, and the system must understand the data itself and cluster it.

• Density estimation: an example of a density estimation algorithm is Kernel Density Estimation, which uses small groups of closely related data samples to estimate the distribution for new points in the problem space.
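To make both ideas concrete, here is a minimal scikit-learn sketch on synthetic two-cluster data; the cluster count, bandwidth, and data are illustrative choices, not prescribed values.

```python
# Clustering and density estimation on the same synthetic dataset.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),   # points around (0, 0)
               rng.normal(5, 1, (50, 2))])  # points around (5, 5)

labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)  # k-means clustering
kde = KernelDensity(bandwidth=0.5).fit(X)                # kernel density estimation
print(labels[:5], kde.score_samples(X[:2]))              # cluster ids, log-densities
```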
Due to its complexity and implementation difficulty, this sort of machine learning is not as popular as supervised learning, although it enables the solution of problems humans would ordinarily avoid.
2.2.3 Reinforcement learning
Reinforcement learning (RL) describes a class of problems where an agent operates in an environment and must learn to act using feedback. According to [23], reinforcement learning is learning what to do — how to map situations to actions — so as to maximize a numerical reward signal. The learner is not told which actions to take but instead must discover which actions yield the most reward by trying them. Reinforcement learning has five essential components: the agent, the environment, states, actions, and rewards. The RL algorithm (called the agent) periodically improves by exploring the environment and going through the different possible states. To maximize performance, the ideal behavior is determined automatically by the agent, and feedback (the reward) is what allows the agent to improve its behavior.
Figure 2.1: Reinforcement learning components
The idea can be translated into the following steps of an RL agent (sketched in code after the list):

1 The agent observes an input state.

2 An action is determined by a decision-making function (policy).

3 The action is performed.

4 The agent receives a scalar reward or reinforcement from the environment.

5 Information about the reward given for that state/action pair is recorded.
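The loop below is a schematic sketch of these five steps, not a specific algorithm; the Gym-style `reset`/`step` interface and the `policy` and `update` callables are assumed placeholders.

```python
# One episode of the generic agent-environment interaction loop.
def run_episode(env, policy, update):
    state = env.reset()                              # 1. observe the input state
    done = False
    while not done:
        action = policy(state)                       # 2. the policy picks an action
        next_state, reward, done = env.step(action)  # 3-4. act and receive a reward
        update(state, action, reward, next_state)    # 5. record/learn from the pair
        state = next_state
```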
In RL, there are two types of tasks: episodic and continuous.

• Episodic task: a task that has a terminal state. This creates an episode: a list of states, actions, rewards, and new states. Video games are a typical example of this type of task.

• Continuous task: in contrast to an episodic task, this one has no terminal state and never ends. In this case, the agent has to learn how to choose the best actions while simultaneously interacting with the environment. For example, a personal assistant robot does not have a terminal state.
Two of the most used algorithms in RL are Monte Carlo and Temporal Difference (TD) learning. The Monte Carlo method involves learning from experience: it learns through sequences of states, actions, and rewards. Suppose our agent is in state s1, takes action a1, gets a reward of r1, and is moved to state s2; this whole sequence is an experience. TD learning is an unsupervised method for predicting the expected value of a variable across a sequence of states. TD employs a mathematical trick to substitute complicated reasoning about the future with a simple learning procedure that yields the same outcomes: instead of computing the whole future reward, TD forecasts the combination of the immediate reward and its own prediction of the future reward at the next moment.
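Concretely, this idea is captured by the standard tabular TD(0) update from [23], where α is the learning rate and γ the discount factor:

V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

The bracketed quantity, the TD error, is exactly the mix described above: the immediate reward plus the discounted prediction at the next state, compared against the current estimate.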
2.3 Techniques
2.3.1 Deep Learning
Since AI has been around for a long time, it has a vast array of applications and is divided into numerous subfields. Deep Learning (DL) is a subset of ML, which is itself a branch of AI.

Figure 2.2 below is a visual representation of the relationship between AI, ML, and DL.
Figure 2.2: Relationship between AI, ML, and DL
2.3.1.1 Deep Learning and Neural Networks
In recent years, Machine Learning has achieved considerable success in AI research, enabling computers to outperform or come close to matching human performance in various domains, including facial recognition, speech recognition, and language processing.

Machine Learning is the process of teaching a computer how to accomplish a task instead of programming it step by step. Upon completion of training, a Machine Learning system should be able to make precise predictions when presented with data.
Deep Learning is a subset of Machine Learning that differs in some important respects from traditional Machine Learning, allowing computers to solve a wide range of complex problems that were previously unsolvable. As an example of a simple Machine Learning task, we can predict how ice cream sales will change based on the outdoor temperature. Making predictions using only a few data features in this way is relatively simple and can be done using a Machine Learning technique called linear regression.
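As a quick illustration of that point, the ice-cream example fits in a few lines of scikit-learn; the temperature and sales figures below are invented for demonstration.

```python
# Fitting a one-feature linear regression: sales as a function of temperature.
import numpy as np
from sklearn.linear_model import LinearRegression

temps = np.array([[18.0], [22.0], [26.0], [30.0], [34.0]])  # outdoor temperature (°C)
sales = np.array([120, 180, 250, 330, 400])                 # ice creams sold (made up)

model = LinearRegression().fit(temps, sales)
print(model.predict(np.array([[28.0]])))  # predicted sales on a 28 °C day
```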
However, numerous real-world problems do not fit into such simplistic frameworks. Recognizing handwritten digits is an illustration of one of these difficult real-world problems. To tackle it, computers must be able to handle the wide diversity of data presentation formats: each digit from 0 to 9 can be written in an unlimited number of ways, and the size and shape of each handwritten digit can vary dramatically depending on the writer and the context.

Allowing the computer to learn from previous experiences and comprehend the data by interacting with it via a system consisting of many layers of concepts is an effective method for resolving these issues. This strategy enables computers to tackle complex problems by building them up from smaller ones. If this hierarchy is represented by a graph, the graph will be deep, with many layers [2]. That is the idea behind neural networks.
A neural network is a model made up of many neurons. Each neuron is an information-processing unit capable of receiving input, processing it, and giving appropriate output. Figure 2.3 is a visual representation of a neural network.

Figure 2.3: Neural Network
All neural networks have an input layer, into which data is supplied before passing through several layers and producing a final prediction at the output layer. In a neural network, there are numerous hidden layers between the input and output layers; thus, the term Deep in Deep Learning and Deep Neural Networks refers to the large number of hidden layers – typically greater than three – at the core of these neural networks.

Neural networks enable computers to learn multi-step programs, wherein each network layer is analogous to the computer's memory after running another set of instructions in parallel.
Figure 2.4 illustrates the process of a deep learning model recognizing an image of a person. For a computer, an image is a set of pixels, and mapping a collection of pixels to an object's identity is an extremely complex process; attempting to learn or evaluate this mapping directly therefore appears overwhelming. Instead, deep learning overcomes this challenge by decomposing the intended complex mapping into a series of nested simple mappings, each described by a distinct layer of the model. The layers of the network represent features of increasing abstraction, from low-level features (edges, corners, contours) to higher-level ones.
Figure 2.4: Illustration of a deep learning model [2]
2.3.1.2 Perceptron
The perceptron is one of the most fundamental terms in ML: it is a building block of a neural network. Invented by Frank Rosenblatt in the late 1950s, the perceptron is a linear ML algorithm used for supervised learning of binary classifiers.

The perceptron consists of three parts:
• Input nodes (or one input layer): the fundamental component of the perceptron, which accepts the initial data for subsequent processing. Each input node carries a real-valued number.

• Weights and bias: a weight shows the strength of a particular node, and the bias is a value that shifts the activation function curve up or down.

• Activation function: this component maps the input to the required range of values, such as (0, 1) or (-1, 1).
Figure 2.5: Perceptron
The perceptron works on these simple steps:
1 All the inputs x are multiplied by their weights w.

2 All the multiplied values are added together; the result is called the weighted sum.

3 The weighted sum is applied to the activation function.

The perceptron is also one of the most straightforward representations of a neural network neuron.
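The three steps translate directly into code; the sketch below uses a step function as the activation, and the weights, bias, and inputs are arbitrary illustrative numbers.

```python
# A minimal perceptron forward pass with a step activation.
import numpy as np

def perceptron(x, w, b):
    weighted_sum = np.dot(x, w) + b      # steps 1-2: multiply inputs by weights, sum
    return 1 if weighted_sum > 0 else 0  # step 3: apply the activation function

print(perceptron(np.array([1.0, 0.5]), np.array([0.6, -0.4]), b=0.1))  # -> 1
```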
2.3.1.3 Feed forward
First appearing in the 1950s, the feedforward neural network was the first and simplest type of artificial neural network. In this network, information is processed in only one direction - forward - from the input nodes, through the hidden nodes, to the output nodes. A feedforward network comprises three types of layers:

• Input layer: contains the neurons responsible for receiving input; the data is then transmitted to the subsequent layer. The total number of neurons in the input layer equals the number of variables in the dataset.

• Hidden layer: lying between the input and output layers, this layer contains a large number of neurons that transform the inputs and communicate with the output layer.

• Output layer: the final layer, whose construction depends on the model's design. The output layer produces the predicted characteristic, since the desired result is known.
2.3.1.4 Recurrent Neural Network
The recurrent neural network (RNN) introduced another type of node called a recurrent node. In an RNN, the connections between nodes can create a cycle, so the output from some nodes can affect subsequent input to the same nodes. RNNs are employed when the output is influenced by the context and order of the input fed to the model. Some typical use cases of RNNs are text autocompletion, speech recognition, and handwriting recognition.
2.3.1.5 Deep Convolutional Network
Deep Convolutional Networks (DCNs, or Convolutional Neural Networks - CNNs) are among the most popular neural networks nowadays. The original concepts of the DCN appeared in studies of the visual cortex dating back to 1980 [24]. Research into the visual cortex shows that some neurons respond to only a small area of the image, while others respond to larger areas; these larger areas are combinations of the smaller areas feeding into them. Also, some neurons respond to horizontal lines while others respond to vertical lines. These observations lead to the idea that neurons in a higher layer synthesize features from lower layers, with each neuron looking only at part of the layer below it rather than all of its information. Research on the visual cortex inspired Yann LeCun in 1998 to introduce the LeNet CNN architecture [25], whose core components are two blocks: convolutional and pooling.

The architecture of the CNN is described in detail in section 2.3.2.
2.3.1.6 Training a Neural Network
The process of training a neural network can be summarized in the following steps:

1 Model initialization: initialization is the first stage of the learning process (the initial hypothesis). The training of a neural network can begin from any starting point, so it is usual practice to randomize the initialization, since an iterative learning process can produce a near-ideal model regardless of where it begins.

2 Forward propagation: after initializing the model, its performance must be evaluated. The input is transmitted through the network layers to calculate the model's output. This step is known as forward propagation because the calculation flow proceeds forward from the input to the output through the neural network.

3 Calculate the loss function: once we have the output of the neural network, we compare it with the desired output by computing the value of the loss function. The loss evaluates the ability of the neural network to generate outputs as close to the desired values as possible.

4 Backpropagation: this is the essence of neural network training. Backpropagation tunes the parameters of the network based on the loss obtained in the previous step. The error is propagated from the output layer back to the layers before it, and adjusting the weights appropriately reduces the error rate, making the model reliable. The optimization function helps find the weights that will — hopefully — yield a smaller loss in the next iteration. Gradient descent [26] is the technique commonly used in this step.

5 Iterate until convergence: as the weights are updated by a small delta step at a time, learning is gradual. Steps 2 to 4 are repeated until a stopping condition is reached, such as a fixed number of training iterations or the loss falling below a threshold.
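The whole cycle fits in a short PyTorch sketch; the two-layer model, learning rate, and synthetic data below are illustrative stand-ins rather than a recommended setup.

```python
# Steps 1-5 condensed: initialize, forward, loss, backpropagate, iterate.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))  # 1. initialization
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # gradient descent [26]
loss_fn = nn.CrossEntropyLoss()

X = torch.randn(64, 4)              # synthetic inputs
y = torch.randint(0, 3, (64,))      # synthetic class labels

for step in range(100):             # 5. iterate until the stopping condition
    outputs = model(X)              # 2. forward propagation
    loss = loss_fn(outputs, y)      # 3. compute the loss
    optimizer.zero_grad()
    loss.backward()                 # 4. backpropagation
    optimizer.step()                #    update the weights
```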
2.3.2 Convolutional Neural Network
Goodfellow et al. defined the Convolutional Neural Network (CNN, or ConvNet) as "a specialized kind of neural network for processing data that has a known grid-like topology. Examples include time-series data, which can be thought of as a 1-D grid taking samples at regular time intervals, and image data, which can be thought of as a 2-D grid of pixels" [2]. This model employs a mathematical operation called convolution, hence the name Convolutional Neural Network.
Why choose a CNN over a feedforward neural network? While an ordinary neural network can handle relatively simple data, it produces poor results with complicated data such as images, where pixels are interdependent. Through the use of filters, a CNN is able to capture the spatial dependence between image pixels. In addition, this architecture provides better performance on image data by reducing the number of parameters involved and reusing the weights, without losing important features of the image.

Figure 2.6 illustrates the architecture of a Convolutional Neural Network.
Figure 2.6: Architecture of a CNN [3]
2.3.2.1 Image kernel
In computer vision, image kernels are useful for image processing techniques. Different image effects, such as outlining, sharpening, blurring, and embossing, are achieved by applying the convolution operation between the image and different kernels. In ML, kernels can be utilized for feature extraction, the process of extracting the most important parts of the input (an image in this case).

In a technical sense, an image kernel is merely a matrix that provides the spatial weighting of different image pixels.
2.3.2.2 The convolution operation
Mathematically, convolution is a linear operation that produces a new function from two existing functions.

The general expression of a convolution is

g(x, y) = w ∗ f(x, y) = \sum_{u=-m/2}^{m/2} \sum_{v=-n/2}^{n/2} w(u, v) f(x − u, y − v)   (2.1)

where

• g(x, y) is the filtered image
• f(x, y) is the original image
• w is the filter
• m × n is the shape of the filter
An indispensable component of convolution is the kernel matrix (filter). The anchor point of the kernel determines the region of the image to be convolved; typically, the anchor point is the kernel's center. The value of each element of the kernel acts as a weighting factor for the value of the corresponding pixel in the region covered by the kernel.

The kernel matrix is shifted across every pixel of the image, beginning in the upper left corner and working its way to the bottom right, positioning the anchor point at the pixel under consideration. At each displacement, the resulting pixel is calculated using the convolution formula above.
Figure 2.7 illustrates how the filter is applied to an image. The source image is an 8 × 8 matrix and the convolution filter (kernel) is a 3 × 3 matrix. For each position, the kernel values are multiplied by the corresponding pixel values and the products are summed to produce the current entry of the resulting matrix.
Figure 2.7: Example of convolution operation [4]
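The sliding-window computation of Figure 2.7 can be written directly in NumPy. One detail to note: the sketch below computes the cross-correlation form that deep learning libraries typically implement, whereas Eq. (2.1) additionally flips the kernel; the image and kernel values are illustrative.

```python
# Direct "valid" 2D convolution (cross-correlation form) of an 8x8 image.
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1  # output size
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(64, dtype=float).reshape(8, 8)                    # 8x8 source
kernel = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)  # edge filter
print(conv2d(image, kernel).shape)  # (6, 6)
```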
2.3.2.3 Motivation
Applying convolution in DL is motivated by the fact that it exploits three ideas that boost the efficiency of a neural network: sparse interactions (or sparse weights), parameter sharing, and equivariant representations. Additionally, convolution allows for working with inputs of variable size.

In traditional neural networks, fully connected layers connect all input units to all output units, meaning that every input unit interacts with every output unit. Convolutional networks, however, typically have sparse interactions (also referred to as sparse connectivity or sparse weights). This is achieved by making the kernel smaller than the input.

When processing a picture, for instance, the input may have thousands or millions of pixels, yet we may detect small, important features such as edges with kernels that occupy just tens or hundreds of pixels. This implies that fewer parameters must be stored, which decreases the model's memory requirements and computational complexity, and increases its statistical efficiency.
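For a concrete sense of scale: mapping a 1000 × 1000-pixel input to an equally sized output with a fully connected layer would require 10^6 × 10^6 = 10^12 weights, whereas a 3 × 3 convolution kernel slid over the same image has only 9 weights per input-output channel pair.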
Figures 2.8 and 2.9 demonstrate sparse connectivity: in Figure 2.8, only three outputs are affected by the input x instead of five; similarly, in Figure 2.9, only three inputs affect the output s.
Figure 2.8: Sparse connectivity, viewed from below [2]
Figure 2.9: Sparse connectivity, viewed from above [2]
Parameter sharing means using the same parameter for more than one function in a model, unlike traditional neural networks where each element of the weight matrix is multiplied by only one input. In CNNs, each member of the kernel is used at every position of the input. Thanks to the parameter sharing employed by the convolution operation, rather than learning a distinct set of parameters for each location, we learn only one set. Parameter sharing thus reduces the storage requirements of the model.
Equivariance: a function is said to be equivariant if the output changes in the same manner as the input. Mathematically, a function f(x) is equivariant to a function g if f(g(x)) = g(f(x)). In the case of convolution, let g be any function that translates the input; then the convolution function is equivariant to g.

For instance, if we had a function g that shifts each pixel of the image I by one pixel to the right, i.e., I′(x, y) = I(x − 1, y), then applying the transformation g to the image followed by convolution yields the same result as applying convolution to I and then the translation g to the output. This means that when processing images, if the input is moved one pixel to the right, its representation will also shift one pixel to the right [2].
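This equivariance can be checked numerically. The sketch below uses circular ("wrap") boundaries so that the shift is exact at the image edges; with other padding modes the equality holds everywhere except near the border.

```python
# Verify that convolution commutes with translation (circular boundaries).
import numpy as np
from scipy.ndimage import convolve

rng = np.random.default_rng(0)
image = rng.random((16, 16))
kernel = rng.random((3, 3))

conv_then_shift = np.roll(convolve(image, kernel, mode="wrap"), 1, axis=1)
shift_then_conv = convolve(np.roll(image, 1, axis=1), kernel, mode="wrap")
print(np.allclose(conv_then_shift, shift_then_conv))  # True
```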
This property is a result of the particular form of parameter sharing. Since the same weights are applied at all locations of the image, if an object appears in any of them, it will be detected regardless of where it is in the image. This trait is highly helpful in applications like image classification and object detection, where the object may appear more than once or be moving.
2.3.2.4 Activation function
In neural networks, the activation function is a node placed at the end of a layer or between layers. It helps determine whether or not a neuron will fire. The activation function applies a nonlinear transformation to the input signal, and the transformed output is then forwarded to the subsequent layer as input.

So why do we need nonlinear activation functions? They are used to escape linearity: without activation functions, a neural network is just a linear regression model (data would pass through the layers of the model via linear functions only), which is not enough for more complex tasks where we need to represent complicated functions.

Figure 2.10 shows some common activation functions.
Figure 2.10: Common activation functions [5]
The formula of the Sigmoid function is f(s) = 1 / (1 + exp(−s)). If the input is large, the function gives an output close to 1; if the input is very negative, the function gives an output close to 0. This function was used a lot in the past because of its convenient derivative. In recent years, it has rarely been used since it has one basic drawback:

• Sigmoid saturates and kills gradients: when the input has a large absolute value (very negative or very positive), the gradient of this function is very close to 0. This means that the coefficients corresponding to the unit under consideration will almost never be updated.

The ReLU function is a widely used activation function in neural networks today because of its simplicity. The mathematical formula of ReLU is f(s) = max(0, s). Some of its advantages are:

• The ReLU has proved itself in accelerating the training of neural networks [27]. This acceleration is attributed to the fact that the ReLU is calculated almost instantaneously, and its gradient is also calculated extremely fast: the gradient is 1 if the input is greater than 0, and 0 if the input is less than 0.

• Although the ReLU function has no derivative at s = 0, in practice it is common to define ReLU′(0) = 0, noting that the probability that the input of a unit is exactly 0 is very small.
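Both functions are one-liners in NumPy; the sample inputs are arbitrary and chosen to show the saturation behavior at the extremes.

```python
# The Sigmoid and ReLU activations evaluated on a few sample inputs.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def relu(s):
    return np.maximum(0.0, s)

s = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])
print(sigmoid(s))  # approaches 0 and 1 at the extremes (saturation)
print(relu(s))     # zero for negative inputs, identity for positive ones
```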
2.3.2.5 Pooling
According to [2], a typical convolutional network layer comprises three stages. In the first stage, the layer performs several parallel convolutions to produce a set of linear activations. In the second stage, each linear activation is passed through a nonlinear activation function, such as the Sigmoid or ReLU function; this stage is also known as the detector stage. Finally, in the third stage, a pooling function is used to further modify the layer's output.

The first and second stages were introduced in sections 2.3.2.2 and 2.3.2.4; in this section, we discuss the pooling operation.

The pooling layer is responsible for reducing the spatial dimension of the convolved feature (the feature map). This dimensionality reduction decreases the processing power required to handle the data. In addition, pooling is useful for extracting dominant features that are rotationally and positionally invariant, thereby supporting the model training process. The pooling layer summarizes the features present in a region of the feature map generated by the convolution layer, so subsequent operations are performed on summarized features rather than precisely positioned ones. This makes the model more robust against variations in the position of image features.
Similar to convolution, the pooling operation involves sliding a two-dimensional filter over each channel of the feature map and summarizing the features lying within the region covered by the filter.
There are numerous sorts of pooling operations, depending on the mechanism utilized. Two common pooling operations are max pooling and average pooling; a small sketch of both follows the figures below.

• Max pooling selects the maximum element from the region of the feature map covered by the filter. Thus, the output of the max pooling layer is a feature map containing the most prominent features of the previous feature map. Figure 2.11 shows how max pooling works.

• Average pooling computes the average of the elements present in the region of the feature map covered by the filter. Thus, while max pooling gives the most prominent feature in a particular patch of the feature map, average pooling gives the average of the features present in the patch. Average pooling is illustrated in Figure 2.12.
Figure 2.11: Max pooling
Figure 2.12: Average pooling
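Both operations reduce to a reshape and a reduction in NumPy; the sketch assumes non-overlapping 2 × 2 windows and an input whose sides are divisible by the window size.

```python
# Max and average pooling over non-overlapping 2x2 windows.
import numpy as np

def pool2d(x, size=2, op=np.max):
    h, w = x.shape[0] // size, x.shape[1] // size
    windows = x[:h * size, :w * size].reshape(h, size, w, size)
    return op(windows, axis=(1, 3))  # reduce each window to one value

x = np.array([[1, 3, 2, 4],
              [5, 7, 6, 8],
              [9, 2, 1, 0],
              [3, 4, 5, 6]], dtype=float)
print(pool2d(x, op=np.max))   # max pooling, cf. Figure 2.11
print(pool2d(x, op=np.mean))  # average pooling, cf. Figure 2.12
```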
2.3.3 Fully convolutional network
Fully convolutional networks (FCNs) are neural networks that execute only convolutional operations. Equivalently, an FCN is a normal CNN in which the fully connected layer is substituted by another convolutional layer.

A typical CNN is not an FCN, since it has fully connected layers that are parameter-dense, i.e., they have a large number of parameters. Converting a CNN to an FCN is predicated on the fact that fully connected layers can also be considered convolutions that cover the entire input region [6].

Here, each neuron in a layer connects only to a few local neurons in the previous layer, and the weights are shared between neurons. This connection structure is typically employed when the data can be understood as spatial and the features to be retrieved are spatially local and equally likely to appear at any input location. The most common use case for convolutional layers is image datasets.
An FCN usually consists of two parts to obtain the output:

• Down-sampling path: used to extract and interpret the semantic/contextual information. In this path, the width and height of the feature maps gradually decrease while their depth increases.

• Up-sampling path: used to recover the precise spatial information.

FCNs additionally use skip connections to retrieve the fine-grained spatial information that was lost during the downsampling process.
Figure 2.13 shows the architecture of an FCN. The model supports both the feedforward pass and back-propagation. The FCN uses a 1x1 convolution as the last layer to classify pixels, so that the final output has the same spatial shape as the input.
Figure 2.13: Architecture of an FCN [6]
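The per-pixel classification idea is easy to see in PyTorch: a 1x1 convolution maps each spatial position's feature vector to class scores while preserving the spatial layout. The feature-map size and the 21-class output here are illustrative, not tied to any particular FCN.

```python
# A 1x1 convolution acting as a per-pixel classifier.
import torch
import torch.nn as nn

features = torch.randn(1, 256, 32, 32)           # (batch, channels, height, width)
classifier = nn.Conv2d(256, 21, kernel_size=1)   # 21 illustrative classes
scores = classifier(features)
print(scores.shape)  # torch.Size([1, 21, 32, 32]) - same spatial grid as the input
```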
2.3.4 Some common convolutional network architectures
2.3.4.1 VGG
VGG stands for Visual Geometry Group. Published in 2014 by Simonyan et al. [28], VGG is a standard deep convolutional neural network architecture with multiple layers that has achieved remarkable results: VGG16 achieves almost 92.7% top-5 test accuracy on ImageNet, while VGG19 achieves 92.0% top-5 and 74.5% top-1 accuracy on ImageNet (16 and 19 stand for the number of layers in the network). The VGG architecture serves as the foundation for innovative object recognition models. Designed as a deep neural network, VGG performs well on numerous tasks and datasets beyond ImageNet, and it remains one of the most prominent image recognition architectures.
2.3.4.2 ResNet

However, there are two major drawbacks to VGG:

• Vanishing gradient: as the number of network layers increases, the value of the product of derivatives drops until the partial derivative of the loss function approaches zero, at which point the gradient effectively disappears. This problem can be partially mitigated by using Batch Normalization.

• Degradation: according to the observations in [8], as network depth increases, the accuracy gets saturated and sometimes even drops.

To address these problems, a simple and efficient concept called the Residual Block (Figure 2.15) was proposed. This block ensures that the learning result of a new layer will be at least as good as the result of the previous layer.
Figure 2.15: A residual block [8]
With some improvements, ResNet has achieved remarkable results: 93.95% top-5 accuracy and 78.25% top-1 accuracy on ImageNet.
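Figure 2.15 translates into a few lines of PyTorch; the layer sizes below are illustrative, and the key point is only the `+ x` identity shortcut in `forward`.

```python
# A minimal residual block: output = ReLU(F(x) + x).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # the identity shortcut

print(ResidualBlock(64)(torch.randn(1, 64, 8, 8)).shape)  # shape is preserved
```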
2.3.4.3 DenseNet
DenseNet is a densely connected convolutional network. It is quite similar to ResNet, but with a fundamental distinction: DenseNet concatenates the outputs of preceding layers with the output of the current layer, whereas ResNet uses an additive approach that sums the previous layer's output (the identity) with that of the current layer. Figure 2.16 shows the difference between the DenseNet and ResNet architectures. DenseNet was designed to improve the accuracy of very deep neural networks afflicted by the vanishing gradient problem, which occurs when the distance between the input and output layers is so great that information is lost before reaching its destination.

Figure 2.16: DenseNet architecture vs ResNet architecture [9]
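The contrast between the two combination rules is one line each in PyTorch; the tensors stand in for a layer input x and its transformed output F(x).

```python
# ResNet adds the shortcut; DenseNet concatenates along the channel axis.
import torch

x = torch.randn(1, 64, 32, 32)    # layer input
fx = torch.randn(1, 64, 32, 32)   # stand-in for the layer's output F(x)

resnet_style = x + fx                        # additive: still 64 channels
densenet_style = torch.cat([x, fx], dim=1)   # concatenative: now 128 channels
print(resnet_style.shape, densenet_style.shape)
```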
2.3.4.4 UNet
The UNet was introduced in [10] for biomedical image segmentation. It is a refined design derived from the fully convolutional network that is utilized for fast and accurate image segmentation. It was given this name because its construction resembles the letter U. The architecture comprises a contracting path to capture context and a symmetric expanding path that enables exact localization; this structure is similar to an encoder-decoder architecture. The network can be trained end-to-end from very few images and outperformed the prior best method, a sliding-window convolutional network.
Figure 2.17: UNet architecture [10]
In Figure 2.17, the architecture of the UNet is presented, where each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box.