HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Master’s Thesis
in Computer Science
An Improved UNets Architecture
and Its Applications
Vietnamese title: "Cải tiến kiến trúc U-Nets và các ứng dụng" (Improving the U-Nets Architecture and Its Applications)
TRAN QUANG CHUNG
Chung.TQCB190214@sis.hust.edu.vn
Supervisor: Dr Dinh Viet Sang
Department: Computer Science
Ha Noi, 4/2021
Declaration of Authorship and Topic Sentences
• Proposing a new architecture to improve the existing one
• Applying the architecture to several tasks
• Evaluating on many standard benchmark datasets
4 Declaration of Authorship
I hereby declare that my thesis, titled "An Improved UNets Architecture and Its Applications", is the work of myself and my supervisor, Dr. Dinh Viet Sang. All papers, sources, tables, etc. used in this thesis are thoroughly cited.
5 Supervisor Confirmation
Ha Noi, April 2021
Supervisor
Dr. Dinh Viet Sang
Throughout the writing of this thesis, I have received a great deal of support from my teachers, my friends, and my colleagues.
Firstly, I would like to express my deep and sincere gratitude to all of the teachers of the School of Information and Communication Technology - Hanoi University of Science and Technology, who equipped me with a large amount of important knowledge. Second, I would like to thank my supervisor, Dr. Dinh Viet Sang, whose expertise was invaluable in formulating the research questions and methodology for a newcomer like me. He was the first teacher to show me how to start research work and write a good scientific paper. His insightful feedback pushed me to a higher level and taught me how to solve problems.
I would also like to thank the VAIS (Vietnam Artificial Intelligence Solutions) company for supporting me a lot. VAIS provided me with hardware resources such as GPUs, servers, and hard disk drives to finish my thesis.
Finally, I am grateful to my parents for their love and wise counsel. I also express my thanks to my friends, who always support me in difficult situations.
In recent years, deep learning has developed rapidly and has been applied to support many real-life problems. Deep learning methods surpass traditional ones in most challenges, such as image segmentation, object detection, and face recognition. However, they still face some challenges, and there is room for improvement. In this thesis, we focus on two domains: face reconstruction and polyp segmentation. Face reconstruction is an important module for improving the performance of pose-invariant face recognition systems. The accuracy of a recognition system suffers from variations in pose, illumination, and expression. Hence, we propose two variants of a newly developed generative model (ResCUNet, Attention ResCUNet) that can transform a profile face into a frontal face. The synthetic faces are natural, photorealistic, coherent, and identity-preserving. As a result, our proposal improves the performance of the face recognition system. For the polyp segmentation task, the challenges are small polyps, illumination, and limited data. Thus, we propose the Attention ResCUNeSt model to address these challenges. On both tasks, we conducted many experiments, and our proposals surpass many previous studies on standard benchmark datasets.
Keywords: Face Reconstruction, ResNet, Image Segmentation, Convolutional Neural Network, UNet, Attention, CUNet, UVGAN, UV Map
Author
Tran Quang Chung
1 Introduction
1.1 Introduce some tasks in Computer Vision
1.1.1 Face recognition
1.1.2 Image segmentation
1.2 Introduce the problem and Motivation
1.2.1 Face Reconstruction
1.2.2 Polyp Segmentation
1.3 Contribution of the Master Thesis
1.4 Outline of the Master Thesis
2 Theoretical Basis
2.1 Convolution Neural Networks
2.1.1 Layers
2.1.2 Spatial Convolution
2.1.3 Spatial Pooling
2.1.4 Backpropagation algorithm
2.1.5 Gradient descent
2.1.6 Dropout
2.1.7 Transfer Learning
3 Literature Review
3.1 Convolutional Neural Networks
3.1.1 LeNet
3.1.2 VGG
3.1.3 ResNet
3.2 Face Reconstruction
3.2.1 3D Morphable Model
3.2.2 3DDFA
3.2.3 UV-GAN
3.2.4 VGG
3.3 Polyp Segmentation
3.3.1 U-NET
3.3.2 ResUNet++
3.3.3 Attention UNet
3.3.4 Pranet
4 Proposed Method
4.1 Face Reconstruction
4.2 Polyp Segmentation
4.2.1 Overall Architecture
4.2.2 Backbone: ResNet family
4.2.3 Coupled U-Nets
4.2.4 Loss function
5 Experiments and Results
5.1 The dataset
5.1.1 Multi-PIE
5.1.2 Dataset Verification
5.1.3 Polyp Segmentation
5.2 Face Reconstruction
5.2.1 Image Reconstruction
5.2.2 Pose Invariance Face Recognition
5.2.3 Attention map visualization
5.2.4 Failed Cases
5.3 Polyp Segmentation
5.3.1 Data augmentation
5.3.2 Evaluation metrics
5.3.3 Ablation study
5.3.4 Comparison to existing methods
List of Tables
5.1 Evaluation of different methods on Multi-PIE dataset
5.2 Verification results on different poses on the Multi-PIE dataset
5.3 Verification accuracy (%) comparison on the LFW and CPLFW datasets
5.4 Verification accuracy (%) comparison on the CFP dataset
5.5 Performance metrics for model variants trained using Scenario 1, i.e., training on CVC-Colon and ETIS-Larib, testing on CVC-Clinic
5.6 Performance metrics for Mask-RCNN and Attention ResCUNeSt using Scenario 2, i.e., using CVC-Colon for training, CVC-Clinic for testing
5.7 Performance metrics for Mask-RCNN, Double UNet and Attention ResCUNeSt using Scenario 3, i.e., using CVC-ClinicDB for training, ETIS-Larib for testing
5.8 mDice and mIoU scores for models trained using Scenario 4 on the Kvasir-SEG and CVC-ClinicDB test sets
5.9 Performance metrics for UNet, MultiResUNet and Attention ResCUNeSt-101 using Scenario 5, i.e., 5-fold cross-validation on the CVC-Clinic dataset
5.10 Performance metrics for UNet, ResUNet++, PraNet and Attention ResCUNeSt-101 using Scenario 6, i.e., 5-fold cross-validation on the Kvasir-SEG dataset
List of Figures
1.1 Facial Recognition System
1.2 Four levels of image segmentation
2.1 A simple neural architecture has three layers: input, hidden, and output. Each neuron is connected by a directed arrow.
2.2 a) a drawing of a brain neuron, b) its mathematical function
2.3 a) sigmoid activation function, b) tanh activation function
2.4 Convolution Operation
2.5 Max-Pooling Operation
2.6 The backpropagation algorithm
2.7 Gradient Descent Algorithm
2.8 Dropout is used as a regularization technique
3.1 The first deep learning architecture - LeNet
3.2 VGG 16 architecture used for the ILSVRC-2012 and ILSVRC-2013 competitions
3.3 A residual block
3.4 UV-GAN framework consists of one generator (U-Net) and global and local discriminators
3.5 The UNET architecture has a contracting path and an expanding path
3.6 The ResUnet++ architecture
3.7 The Attention Gate (AG) receives two inputs: input features (x^l) and gating signals (g). First, the input features (x^l) are up-sampled and added to the gating signals, then passed through two activation functions (ReLU, sigmoid) to produce attention coefficient maps (α). Finally, the input features are scaled by the attention coefficients to give x̂^l.
3.8 Attention UNet architecture
3.9 The Pranet architecture includes three reverse attention modules attached to the last three high-level features
4.2 ResCUNet with coupled U-Nets enhanced by residual connections within each U-Net
4.3 Attention ResCUNet, an advanced version of the previous network. The generator of the proposed Attention ResCUNet consists of coupled U-Nets. Skip connections within each U-Net are enhanced with attention gates before concatenation. The contextual information from the first U-Net decoder is weighted and fused with attentive low-level feature maps of the second U-Net encoder before concatenation with the high-level coarse feature maps of the second U-Net decoder. An auxiliary loss is used to improve gradient flow during the training phase.
4.4 Discriminators and identity preserving module of the proposed Attention ResCUNet-GAN. The global discriminator is responsible for the global structure of entire UV maps. The local discriminator focuses on the local facial details. The identity preserving module keeps the identity information unchanged during the modification of the generator.
4.5 Overview of the proposed Attention ResCUNeSt. Attention gates within each UNet are used to suppress irrelevant information in the encoder's feature maps. Skip connections across two UNets are also utilized to boost information flow and promote feature reuse.
4.6 Split attention in the k-th cardinal group with R splits
5.1 Camera labels and approximate positions inside the gathering room. There were 13 cameras placed at head height, separated in 15° intervals. Two added cameras (08_1 and 19_1) were positioned above the subject, simulating a typical surveillance camera.
5.2 Montage of all 15 cameras in the dataset, exhibited with frontal flash illumination. 13 of the 15 cameras were placed at head height, with two extra cameras mounted higher up to capture views typically encountered in surveillance.
5.3 The creation of ground-truth complete UV maps. Three facial images with yaw angles of 0°, −30°, 30° are fed to the 3DDFA model to create three incomplete UV maps, which are then merged by Poisson blending to generate the ground-truth complete UV map.
5.4 Some samples of positive pairs from the CFP dataset
5.5 Results with frontal input images. Incomplete UV maps are generated using 3DDFA. The next columns are ground truth UV maps, results of UV-GAN, results of normal ResCUNet-GAN, intermediate results of Attention ResCUNet-GAN (after the first U-Net), and final results of Attention ResCUNet-GAN (after the second U-Net), respectively. The rightmost block shows some synthetic images generated based on the final results of Attention ResCUNet-GAN.
5.6 Results with profile input images. Incomplete UV maps are generated using 3DDFA. The next columns are ground truth UV maps, results of UV-GAN, results of normal ResCUNet-GAN, intermediate results of Attention ResCUNet-GAN (after the first U-Net), and final results of Attention ResCUNet-GAN (after the second U-Net), respectively. The rightmost block shows some synthetic images generated based on the final results of Attention ResCUNet-GAN.
5.7 Results with in-the-wild input images. Incomplete UV maps are generated using 3DDFA. The ground truth UV maps are unavailable. The next columns are the results of UV-GAN, results of normal ResCUNet-GAN, intermediate results of Attention ResCUNet-GAN (after the first U-Net), and final results of Attention ResCUNet-GAN (after the second U-Net), respectively. The right block shows some synthetic images generated based on the final results of Attention ResCUNet-GAN.
5.8 Synthetic images for frontal input images. The left block corresponds to the result of UV-GAN. The right block corresponds to the final result of Attention ResCUNet-GAN (after the second U-Net).
5.9 Synthetic images for profile input images. The left block corresponds to the result of UV-GAN. The right block corresponds to the final result of Attention ResCUNet-GAN (after the second U-Net).
5.10 Synthetic images for in-the-wild input images. The left block corresponds to the result of UV-GAN. The right block corresponds to the final result of Attention ResCUNet-GAN (after the second U-Net).
5.11 Attention map visualization. The first column contains UV maps generated by the 3DDFA network, the second column contains generated UV maps overlaid by attention masks, and the last column illustrates attention coefficients only.
5.12 Some failed cases when the input facial images are "abnormal" with respect to the training data. The top row shows the input images, the second row contains incomplete UV maps, and the third row displays the completed UV maps generated by our Attention ResCUNet-GAN.
5.13 Qualitative result comparison using Colon for training and Clinic for testing. From left to right: input image, ground truth, visualization of ResNet101-MaskR-CNN's output in overlay mode, binary output of ResNet101-MaskR-CNN, visualization of ResNet50-MaskR-CNN's output in overlay mode, binary output of ResNet50-MaskR-CNN, binary output of Attention ResCUNeSt-101, and attention map in the last attention gate, denoted by S9 in Fig. 4.5. The red color in the attention map indicates the region on which the model focuses.
5.14 The results of Attention ResCUNeSt-101 on the CVC-Clinic dataset. From left to right: input image, ground truth, output of the first UNet, output of the second UNet, and attention map in the last attention gate S9. The red areas in the attention map indicate a high probability that polyps appear.
5.15 ROC curves and PR curves for Attention ResCUNeSt-101, PraNet, ResUNet++ and UNet in Scenario 6, i.e., 5-fold cross-validation on the Kvasir-SEG dataset. All curves are averaged over five folds.
5.16 Qualitative result comparison of different models trained in Scenario 6, i.e., 5-fold cross-validation on the Kvasir-SEG dataset
5.17 Some failed cases of our model on the Kvasir-SEG dataset
Chapter 1
Introduction
It is undeniable that artificial intelligence has brought a great deal of convenience and benefit to human life. It is also a tool that frees up labor and time. In the last decade, deep learning (DL) has surpassed many traditional methods thanks to computational resources (GPUs) and public datasets. Another reason for its fast evolution is the big data era. Most techniques in computer vision, natural language processing, and speech processing have been replaced by neural network algorithms. Many applications that we once imagined only in movies are now real. Some outstanding examples are Google Translate, self-driving cars, face recognition, and healthcare programs. In more detail, Google Translate is an intelligent system for anyone who must translate a document from a source language to a target language. Autonomous cars make journeys safer thanks to the many sensors around them. In addition, there are many intelligent programs that we use every day.
1.1 Introduce some tasks in Computer Vision
Nowadays, every field (medicine, traffic, manufacturing, astronomy, etc.) must store its data (images, text, speech, etc.). This data is very valuable if we can extract information from it, but doing so is still a challenge for many scientists and engineers. Recently, deep learning has made it much easier to exploit the information in big image data, and it surpasses almost all traditional image processing methods. More specifically, computer vision tasks such as face recognition and image segmentation have many real-world applications.
Figure 1.1: Facial Recognition System 1
1 http://www.softscients.web.id/2016/09/face-detection-in-matlab-and-opencv.html
1. Classification: An entire image is classified into a designated group such as sheep, dog, or person (see Fig. 1.2-a).
2. Object Detection: The location of each object is specified and a bounding box is drawn around it (see Fig. 1.2-b).
3. Semantic Segmentation: This process assigns similar objects to one group. These groups are "semantically interpretable" and each represents a real-world class. For instance, in the figure, the sheep are drawn in blue and the dog in red (see Fig. 1.2-c).
4. Instance Segmentation: This technique is an upgraded version of semantic segmentation. Here, a group of sheep is separated into individual objects, and we can act on each object separately. This is the key difference from semantic segmentation (see Fig. 1.2-d).
Figure 1.2: Four levels of image segmentation 2
1.2 Introduce the problem and Motivation
In this thesis, we focus on two domains: face reconstruction and polyp segmentation.
1.2.1 Face Reconstruction
Face recognition is popular in many real-world applications, so this domain has gained a lot of attention. Unlike other popular biometrics, face recognition can be applied to uncooperative subjects in a non-intrusive manner. While (near-)frontal face recognition has gradually matured, face recognition in the wild is still challenging due to various unconstrained factors. In fact, the performance of a face recognition system heavily depends on the pose of the input face. Recent studies show that face verification performs well when both faces share the same view, such as frontal-frontal or profile-profile. However, performance degrades dramatically when verifying faces across different views, such as frontal-profile [53]. If we rotate the profile face to a frontal face, the performance will increase. This is the reason why we use this technique (face reconstruction) and propose new architectures (ResCUNet, Attention ResCUNet) to improve the accuracy of face recognition.
2 https://www.programmersought.com/article/55813687476/
1.2.2 Polyp Segmentation
Next, we want to bring deep learning techniques to medicine. More specifically, we use deep learning to segment polyp regions. According to statistics, colorectal cancer (CRC) is a leading cause of cancer deaths worldwide for both men and women, and the number of patients increases quickly every year [4]. Colonic polyps, which arise from glandular tissue in the colonic mucosa, are commonly found in the colon and stomach. Most of these adenomas are benign, but some become malignant over time and affect other organs, for example the liver and the lungs. Eventually, the disease leads to death unless diagnosed and treated early [16]. Nowadays, colonoscopy is the gold-standard procedure for colon screening. However, several factors can hinder polyp detection, and missed detections are dangerous to patients. When endoscopists explore the intestinal wall to detect polyps, they can miss small or flat polyps, particularly those smaller than 10 mm [34, 55]. Colonoscopy depends on highly skilled and experienced endoscopists with competent eye-hand coordination. Recent studies have shown that 22%-28% of polyps are missed during colonoscopy [34]. Consequently, missed polyps can reduce the survival rate to 10%, and it is undeniable that segmenting and detecting cancer in its early stages increases the chance of a cure [42]. Other factors include low image quality, outdated tools, and lapses in the clinicians' concentration [33, 1]. In the past, several studies have leveraged computer systems and computer vision to reduce the miss rate and improve detection capability. Most existing works on automatic polyp segmentation and detection can be divided into two big groups: 1) methods that use handcrafted features; 2) methods that use end-to-end learning, more specifically deep learning. In this thesis, we propose a novel architecture (Attention ResCUNeSt), and our proposal surpasses many previous studies on this task.
1.3 Contribution of the Master Thesis
The main contributions of this thesis are:
• We propose three novel architectures (ResCUNet, Attention ResCUNet, Attention ResCUNeSt) for two tasks: face reconstruction and polyp segmentation.
• Evaluation on various datasets: for each task (face reconstruction and polyp segmentation), we evaluate our proposal on many popular datasets to demonstrate its performance.
1.4 Outline of the Master Thesis
The rest of this thesis is structured as follows:
1. Introduction: This chapter describes the problems and our contributions.
2. Theoretical basis: This chapter describes the theory of computer vision and deep learning.
3. Literature Review: This chapter describes related and previous works. It is an essential foundation for proposing a novel architecture.
4. Proposed method: This chapter describes in detail the three novel architectures for two tasks: face reconstruction and polyp segmentation.
5. Experiments and Results: This chapter describes the datasets used in this thesis and presents the experiments and results.
6. Conclusion: This chapter gives the conclusion and future work.
Chapter 2
Theoretical Basis
2.1 Convolution Neural Networks
2.1.1 Layers
2.1.1.1 Linear or Fully Connected
Figure 2.1: A simple neural architecture has three layers: input, hidden, and output. Each neuron is connected by a directed arrow. 1
1 https://cs231n.github.io/neural-networks-1/
The linear layer performs its computation in a way loosely modeled on the human brain. The human nervous system has approximately 86 billion neurons, linked by roughly 10^14 to 10^15 synapses. Overall, the input signal at the dendrites passes through the neuron and yields the output at the axon terminals. We can visualize the model as in Figure 2.2: each node describes an artificial neuron, and each arrow draws a connection between two neurons. We can model the nervous system by a simple linear function that mimics the action of a human neuron.
Mathematically, we can consider a linear layer as a function that applies a linear transformation to an input vector I and outputs a vector O. For more detail, see equations 2.1 and 2.2 below.
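A linear layer of this kind can be sketched in a few lines. The sketch assumes the common affine form O = W I + b; the weight matrix `weights` and bias vector `bias` are names introduced here for illustration and are not taken from the thesis.

```python
def linear(inputs, weights, bias):
    """Apply an affine map O = W I + b to a list of input values.

    `weights` holds one row per output neuron and `bias` one value per
    output neuron (both names are assumptions made for this sketch).
    """
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, bias)
    ]


# Two inputs feeding three output neurons.
outputs = linear([1.0, 2.0], [[1, 0], [0, 1], [1, 1]], [0.0, 0.0, 1.0])
```

Each output neuron sums its weighted inputs, which mirrors the arrows converging on a node in the figure.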
Figure 2.2: a) a drawing of a brain neuron, b) its mathematical function 2
2 https://cs231n.github.io/neural-networks-1/
Sigmoid. The sigmoid non-linearity has the following mathematical form:
$\sigma(x) = \frac{1}{1 + e^{-x}}$
A characteristic of this function is its "S" shape. It takes a real value and squashes it into the range 0 to 1. However, when the neuron's activation saturates near 0 or 1, the gradient there is almost zero. As a result, the back-propagation algorithm fails to update the parameters.
Hyperbolic Tangent. The tanh non-linearity has the following mathematical form:
$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
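The saturation behaviour of these activations can be checked numerically. The following is a minimal sketch using the standard definitions of the sigmoid and tanh functions and their derivatives; the function names are chosen for this illustration only.

```python
import math


def sigmoid(x):
    """Standard logistic function, squashing x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


# Near zero the gradients are at their largest; far from zero they vanish.
for x in (0.0, 2.0, 10.0):
    print(x, sigmoid_grad(x), tanh_grad(x))
```

The gradient of the sigmoid peaks at 0.25 and that of tanh at 1, and both shrink toward zero as the input magnitude grows, which is the saturation problem described above.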
2.1.2 Spatial Convolution
Convolution is a simple mathematical operation that is fundamental to many image processing operators. It provides a way of combining two arrays of numbers, generally of different sizes but of the same dimensionality, to produce a third array of the same dimensionality. In image processing, the first array is an image (grayscale or RGB); the second array is much smaller, usually 3x3, 5x5, or 7x7, and is called the kernel, convolution matrix, or mask. This technique is used for blurring, sharpening, edge detection, etc.
3 https://www.programmersought.com/article/1060528072/
The convolution operation has two hyperparameters: padding and stride. When we apply convolution multiple times, each output is smaller than its input, so much information can be lost in the process. To resolve this problem, we can pad the input image before convolution by adding values at its border. There are several padding strategies, such as mirror padding and zero padding; zero padding is the most popular. Stride is the number of pixels the kernel jumps between consecutive positions as it slides over the image.
The formula of a convolution is: G = H ∗ F
where H is a 2D image and F is a 2D filter (kernel).
Convolution can be visualized as in Figure 2.4.
Figure 2.4: Convolution Operation 4
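To make the roles of padding and stride concrete, here is a naive sketch of G = H ∗ F on nested lists. It assumes a square kernel and zero padding; the function name and arguments are illustrative. The kernel is flipped, as the mathematical definition of convolution requires, whereas most deep learning libraries skip the flip and compute cross-correlation.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Naive 2D convolution of a list-of-lists image with a square kernel."""
    k = len(kernel)
    # Flip the kernel along both axes (true convolution, not cross-correlation).
    flipped = [row[::-1] for row in kernel[::-1]]
    h, w = len(image), len(image[0])
    # Zero-pad the border so the output does not shrink as much.
    padded = [[0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for i in range(h):
        for j in range(w):
            padded[i + padding][j + padding] = image[i][j]
    # Output size along each axis: floor((n + 2p - k) / s) + 1.
    out_h = (h + 2 * padding - k) // stride + 1
    out_w = (w + 2 * padding - k) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            for a in range(k):
                for b in range(k):
                    acc += padded[i * stride + a][j * stride + b] * flipped[a][b]
            row.append(acc)
        out.append(row)
    return out


ones = [[1] * 4 for _ in range(4)]
box = [[1] * 3 for _ in range(3)]
no_pad = conv2d(ones, box)                 # output shrinks to 2x2
same = conv2d(ones, box, padding=1)        # output keeps the 4x4 size
strided = conv2d(ones, box, stride=2, padding=1)  # output is 2x2
```

Without padding the 4x4 input shrinks to 2x2, with a one-pixel zero border it stays 4x4, and a stride of 2 halves the output resolution.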
2.1.3 Spatial Pooling
The pooling function is an operation that reduces the spatial size of the feature maps and, with it, the number of parameters and computations in the following layers. This helps prevent overfitting and speeds up computation. There are many pooling functions, such as max-pooling, average-pooling, and min-pooling; max-pooling is used most frequently. One of the most important reasons for using pooling is to make the features invariant to small translations: if the input is shifted slightly, max-pooling still preserves most of the crucial values. Therefore, the pooling function emphasizes meaningful features and ignores irrelevant ones.
4 https://indoml.com/2018/03/07/student-notes-convolutional
Figure 2.5: Max-Pooling Operation 5
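Max-pooling itself is short to write down. This is a minimal sketch on nested lists, assuming a square window with no padding; the function name and defaults are chosen for illustration.

```python
def max_pool2d(x, size=2, stride=2):
    """Max-pooling over a list-of-lists feature map with a square window."""
    out_h = (len(x) - size) // stride + 1
    out_w = (len(x[0]) - size) // stride + 1
    return [
        [
            max(
                x[i * stride + a][j * stride + b]
                for a in range(size)
                for b in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = max_pool2d(feature_map)  # 4x4 input becomes 2x2
```

Each 2x2 window is replaced by its largest value, so the map shrinks by a factor of two per axis while the strongest responses are kept.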
2.1.4 Backpropagation algorithm
The backpropagation algorithm is a popular technique for training artificial neural networks, especially deep neural networks. The algorithm emerged in the 1960s and 1970s, but it only came into wide use after 1986, when it was formally applied as the learning procedure for training neural networks. Backpropagation computes the gradient of the loss function with respect to each weight of the network; a gradient descent optimization algorithm then uses these gradients to adjust the weight matrices of the model. Figure 2.6 shows how the algorithm works.
Figure 2.6: The backpropagation algorithm 6
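As a small worked illustration of backpropagation, the sketch below applies the chain rule by hand to a toy network with one tanh hidden unit and a squared-error loss. The network, the variable names, and the loss are assumptions made for this example and are not the models used in the thesis.

```python
import math


def forward(w1, w2, x):
    """Tiny network: hidden h = tanh(w1 * x), output y_hat = w2 * h."""
    h = math.tanh(w1 * x)
    return w2 * h, h


def loss_and_grads(w1, w2, x, y):
    """Squared-error loss and its gradients, obtained with the chain rule."""
    y_hat, h = forward(w1, w2, x)
    loss = 0.5 * (y_hat - y) ** 2
    d_yhat = y_hat - y                # dL/dy_hat
    d_w2 = d_yhat * h                 # dL/dw2 = dL/dy_hat * dy_hat/dw2
    d_h = d_yhat * w2                 # gradient flowing back into the hidden unit
    d_w1 = d_h * (1.0 - h * h) * x    # tanh'(z) = 1 - tanh(z)^2
    return loss, d_w1, d_w2


loss, grad_w1, grad_w2 = loss_and_grads(0.5, -1.5, 2.0, 1.0)
```

The gradients are computed from the output back toward the input, reusing each intermediate derivative, which is what makes the procedure efficient for deep networks.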
2.1.5 Gradient descent
Gradient descent is one of the most important algorithms for training deep neural networks. It is an optimization algorithm that minimizes the loss function by iteratively moving in the direction of steepest descent. The purpose of this iterative optimization is to find a global or local minimum. In machine learning, each model defines a loss function, and gradient descent updates the parameters of the network to minimize it. The algorithm can be described in two steps:
1. Use the first-order derivative to compute the gradient, which determines the direction.
2. Move in the direction opposite to the gradient, i.e., down the slope, to approach the minimum.
Figure 2.7: Gradient Descent Algorithm 7
Figure 2.7 gives an overview of the algorithm.
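The two steps can be sketched on a one-dimensional example. The objective f(θ) = (θ − 3)², the learning rate, and the number of steps are assumptions chosen purely for illustration.

```python
def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Repeatedly step against the gradient, starting from theta0."""
    theta = theta0
    for _ in range(steps):
        # Step 1: compute the gradient. Step 2: move in the opposite direction.
        theta = theta - lr * grad(theta)
    return theta


# f(theta) = (theta - 3)^2 has gradient 2 * (theta - 3) and its minimum at 3.
theta_min = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=0.0)
```

With this learning rate the distance to the minimum shrinks by a constant factor at every step; a much larger learning rate would overshoot and could diverge.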
2.1.5.1 Stochastic Gradient Descent
In this algorithm, one training sample is passed through the neural network at a time. The loss function measures the deviation between the model's output and the sample's label, and the weights of all layers are updated with the computed gradient after every training sample. For example, if the training dataset has ten samples, the loss is computed ten times and the weights are updated ten times, once for each sample. The following equation describes the update used by both stochastic gradient descent and gradient descent. Keep in mind that stochastic gradient descent applies it N times for N training samples in the training dataset.
7 https://blog.clairvoyantsoft.com/the-ascent-of-gradient-descent-23356390836f
$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \quad (2.7)$
In equation 2.7, θ denotes the weights (parameters) of the model, J(θ) is the loss function, and α is the learning rate, which controls the step size of each update.
Advantages of Stochastic Gradient Descent
• It requires little memory, because the model processes a single training sample at a time.
• Each update is computationally cheap, as only one training sample is processed at a time.
• Due to frequent updates, the model can quickly approach a local minimum.
Disadvantages of Stochastic Gradient Descent
• Due to frequent updates, the steps toward the minimum are very noisy, which can sometimes send the model in other directions.
• The training time per epoch increases significantly, because the loss and the update must be computed for every sample.
• It loses the advantage of vectorized operations, as it deals with only a single example at a time.
2.1.5.2 Batch Gradient Descent
The batch gradient descent algorithm works on the same idea as stochastic gradient descent. The big difference is that the model processes all training samples before the loss function computes the deviation and the weights are updated. More specifically, if the training dataset contains 100 samples, the model receives all samples at once and the parameters of the neural network are updated once. Equation 2.7 is applied only once per epoch.
Advantages of Batch Gradient Descent
• Each epoch is fast, because the loss is computed over all samples at once and the weights are updated only once.
• The gradient is more stable than in stochastic gradient descent.
• Convergence toward a local or global minimum is smoother because the updates are less noisy and oscillate less: the model averages the error over all samples before updating its parameters.
Disadvantages of Batch Gradient Descent
• It depends heavily on computing resources, because the model processes the entire dataset to compute each update.
• It requires a large amount of memory to hold the entire training set at once.
2.1.5.3 Mini-Batch Gradient Descent
This strategy mixes stochastic and batch gradient descent. The training dataset is divided into multiple groups called batches, and the number of training samples in a batch is called the batch size. One batch at a time is passed through the model, and the loss function computes the error for that batch. The model then updates its parameters using the average error. For example, if the training set has 100 samples, we can divide it into 5 batches of 20 samples each, so equation 2.7 is applied 5 times (once per batch).
Advantages of Mini Batch Gradient Descent
This technique combines the advantages of the two methods above, so it is the most commonly used in practice.
• Requires little memory
• Computationally efficient
• Benefits from vectorization
• Helps avoid getting stuck in local minima
• Stable gradient descent
• Converges quickly
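The three variants differ only in how many samples contribute to each update, which a single routine can show. The sketch below assumes a one-parameter model y = θx with a squared-error loss; the data, learning rate, and function name are illustrative. A batch size of 1 corresponds to stochastic gradient descent and a batch size equal to the dataset size to batch gradient descent.

```python
def train_epoch(data, theta, lr, batch_size):
    """One epoch of (mini-)batch gradient descent for the model y = theta * x.

    Returns the updated parameter and the number of updates performed.
    """
    updates = 0
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Average gradient of 0.5 * (theta * x - y)^2 over the batch.
        grad = sum((theta * x - y) * x for x, y in batch) / len(batch)
        theta -= lr * grad
        updates += 1
    return theta, updates


# 100 samples drawn from y = 2x.
data = [(i / 100.0, 2.0 * i / 100.0) for i in range(1, 101)]

_, sgd_updates = train_epoch(data, 0.0, lr=0.5, batch_size=1)      # stochastic
_, batch_updates = train_epoch(data, 0.0, lr=0.5, batch_size=100)  # full batch
_, mini_updates = train_epoch(data, 0.0, lr=0.5, batch_size=20)    # mini-batch

theta = 0.0
for _ in range(50):
    theta, _ = train_epoch(data, theta, lr=0.5, batch_size=20)
```

With 100 samples this gives 100, 1, and 5 updates per epoch respectively, matching the examples in the text. In practice the data are usually shuffled before each epoch, which this sketch omits to stay deterministic.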
Figure 2.8: Dropout is used as a regularization technique 8
2.1.6 Dropout
In machine learning, "dropout" refers to the process of randomly dropping out certain nodes in a layer during training. In Figure 2.8, the left side represents a normal neural network in which all nodes are kept. On the right, some nodes are ignored, shown in red. More specifically, their weights and biases are not considered during that training step. Each node in the model is kept with probability p or ignored with probability 1 − p. Dropout is used as a regularization technique to prevent overfitting.
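A minimal sketch of dropout applied to a list of activations follows. It assumes the "inverted dropout" convention, in which kept activations are divided by the keep probability so that their expected value is unchanged; this scaling is a common practice and an assumption here, not something stated in the thesis.

```python
import random


def dropout(activations, keep_prob, training=True, rng=random):
    """Keep each activation with probability keep_prob, otherwise zero it.

    Kept values are scaled by 1 / keep_prob (inverted dropout, assumed here)
    so that no rescaling is needed at inference time.
    """
    if not training:
        return list(activations)
    return [
        a / keep_prob if rng.random() < keep_prob else 0.0
        for a in activations
    ]


random.seed(0)
dropped = dropout([1.0] * 10000, keep_prob=0.8)
```

At inference time (`training=False`) the activations pass through unchanged, since all nodes are used.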
In addition, there are many other regularization approaches that are popular in machine learning:
• Early stopping: stop training automatically when a specific performance measure (e.g., validation loss, accuracy) stops improving.
• Weight decay: this technique prevents a few weights from taking large values, which can make the model move in the wrong direction. It incentivizes the network to keep its weights small by adding a penalty to the loss function.
• Noise: allow some random fluctuations in the data through augmentation (which makes the network robust to a larger distribution of inputs and hence improves generalization).
• Model combination: average the outputs of separately trained neural networks (requires a lot of computational power, data, and time).
8 https://laptrinhx.com/a-simple-introduction-to-dropout-regularization
Trang 262.1.7 Tranfer Learning
Transfer learning is an interesting idea for training deep neural networks: knowledge obtained from one task is reused to solve similar ones. This idea comes from the fact that humans can leverage existing knowledge to deal with related situations, and so solve new problems faster and better. For example, if you know how to ride a bike, you can more easily learn how to ride a motorbike. In deep learning, suppose you want to build a model to classify images, and there are only 1000 images in your dataset. You also want the model to be very deep so that it can learn complex features. Consequently, the model will overfit after just a few training steps. A solution to this problem is to utilize a pre-trained model such as VGG, ResNet, MobileNet, etc. Before applying transfer learning, we need to answer two main questions:
Chapter 3
Literature Review
3.1 Convolutional Neural Networks
Famous architectures in deep learning include LeNet [31], VGG [49], ResNet [18], etc. These models broke with traditional methods and achieved state-of-the-art results in image processing. The starting point of the deep learning era is the LeNet architecture, proposed by Yann LeCun for recognizing handwritten digits. Recently, deep learning has developed rapidly thanks to computational resources such as GPUs and TPUs. The VGG and ResNet models are among the most efficient architectures for the classification task. Besides, these networks are also effective when used as backbones for extracting meaningful features. Here I will introduce these famous architectures.
3.1.1 LeNet
Figure 3.1: The first deep learning architecture - LeNet 1
This is one of the first successful deep learning architectures for image classification. The model was developed by Yann LeCun et al. in the 1990s. They combined a convolutional neural network with training by the back-propagation algorithm to recognize handwritten digits. Afterward, this architecture was used to identify handwritten zip code numbers provided by the US Postal Service. Overall, this network was the starting architecture of the deep learning era. Recently, many state-of-the-art models have appeared that inherit this idea. Figure 3.1 describes the network in detail.
1 https://medium.com/@pechyonkin/key-deep-learning-architectures-lenet-5-6fc3c59e6f4
LeNet-5 features can be summarized as:
• A sequence of 3 kinds of modules: convolution layer, pooling layer, fully-connected layer.
• Inputs are grayscale images whose values are normalized to a mean of 0 and a standard deviation of 1 to accelerate the training phase.
• Two activation functions are used: hyperbolic tangent and sigmoid.
• Average pooling layers are used in the architecture.
• Fully connected layers act as the final classifier.
• Mean squared error is used as the loss function.
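To make the sequence of convolution and pooling modules concrete, the sketch below traces the feature-map size through the convolutional part of LeNet-5. The layer names, kernel sizes, channel counts, and the 32x32 input are assumptions taken from the original LeNet-5 paper, not from the summary above; the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling window along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kind, kernel, stride, output channels), as assumed from LeNet-5.
LENET5 = [
    ("C1", "conv", 5, 1, 6),
    ("S2", "pool", 2, 2, 6),
    ("C3", "conv", 5, 1, 16),
    ("S4", "pool", 2, 2, 16),
    ("C5", "conv", 5, 1, 120),
]

def trace_shapes(input_size, layers):
    """Return (name, channels, height, width) after each layer."""
    size = input_size
    shapes = []
    for name, _kind, kernel, stride, channels in layers:
        size = conv_out(size, kernel, stride)
        shapes.append((name, channels, size, size))
    return shapes

shapes = trace_shapes(32, LENET5)
for row in shapes:
    print(row)
```

After C5 the 120 features are fed to the fully connected layers (84 units and then 10 outputs in the original paper), which act as the final classifier.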
2 https://towardsdatascience.com/step-by-step-vgg16
1000 classes with over 14 million images. It improves on AlexNet by replacing the oversized kernel filters (11x11 and 5x5 in the first and second layers, respectively) with smaller 3x3 kernel filters. The architecture is described as follows:
• The input to VGG is a 224x224 RGB image. The authors cropped the center of each image to keep the input size fixed.
• Convolutional layers: the kernel size is 3x3 and the stride is 1 pixel.
• The network has five convolutional blocks. The first two blocks stack 2 convolution layers each, and the final three blocks stack 3 convolution layers each.
• Fully connected layers: the authors attach three linear layers at the end of the network. The first two fully connected layers are 4096-d, and the final one is 1000-d, representing the 1000 classes in the challenge.
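The description above can be checked with a short calculation of the final feature-map size and the parameter count. This is a sketch under assumptions: the channel widths (64 to 512), the padding of 1, and the 2x2 max pooling after each block are taken from the VGG16 configuration and are not stated in the list above.

```python
# (number of 3x3 convolutions, output channels) per block. The counts follow
# the list above; the channel widths are assumed from VGG16.
BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
FC_SIZES = [4096, 4096, 1000]

def vgg16_summary(input_size=224, in_channels=3):
    """Return (final spatial size, total number of parameters)."""
    size, channels, params = input_size, in_channels, 0
    for n_convs, width in BLOCKS:
        for _ in range(n_convs):
            # 3x3 kernel, stride 1, padding 1 (assumed): size is unchanged.
            params += 3 * 3 * channels * width + width
            channels = width
        size //= 2  # 2x2 max pooling (assumed) halves height and width
    features = size * size * channels
    for width in FC_SIZES:
        params += features * width + width
        features = width
    return size, params

final_size, total_params = vgg16_summary()
print(final_size, total_params)
```

Under these assumptions the first fully connected layer receives a 7x7x512 feature map, and most of the roughly 138 million parameters sit in the fully connected layers rather than in the convolutions.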
3.1.3 ResNet
Figure 3.3: A residual block 3
ResNet [18] was proposed by Kaiming He et al. and took the deep learning world by storm when it appeared as the first neural network that could train hundreds of layers without losing accuracy to the vanishing gradient problem. Thanks to a special mechanism, the network can stack many layers without hurting performance. Normally, neural networks are trained by the backpropagation algorithm, which minimizes the loss function to find a local optimum. However, when many layers are stacked, training becomes challenging: the gradients back-propagated from the final layer to the first layers vanish, causing accuracy to saturate.
The solution to the problem above is "identity shortcut connections". Each block adds its input, passed through an identity mapping, to the output of its layers. These shortcuts not only reuse the previous features but also support faster learning. In the paper, the authors conducted many experiments to demonstrate that deeper models are more effective than their shallower counterparts. As a result, the ResNet architecture quickly became one of the most popular models in image processing.
Figure 3.3 shows a residual block, which makes deeper models trainable with many layers.
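The identity shortcut can be written as y = F(x) + x, where F stands for the stacked layers of the block. The sketch below is a toy illustration on plain lists; the functions `zero_layers` and `small_update` are hypothetical stand-ins for trained layers.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, where residual_fn plays the role of the layers F."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("F(x) and x must have the same shape to be added")
    return [f + xi for f, xi in zip(fx, x)]

def zero_layers(x):
    """Stand-in for layers that have learned nothing (F(x) = 0)."""
    return [0.0 for _ in x]

def small_update(x):
    """Stand-in for layers that learn a small correction to the input."""
    return [0.1 * v for v in x]

x = [1.0, -2.0, 0.5]
deep = x
for _ in range(100):           # stack 100 blocks whose layers output zero
    deep = residual_block(deep, zero_layers)
print(deep)
print(residual_block(x, small_update))
```

Even when the stacked layers contribute nothing, the block passes its input through unchanged, so adding more blocks does not by itself degrade the signal.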
3 https://towardsdatascience.com/residual-blocks-building-blocks-of-resnet-fd90ca15d6ec
3.2 Face Reconstruction
3.2.1 3D Morphable Model
Blanz and Vetter [5] introduced the 3D Morphable Model (3DMM) to recover a 3D face from a 2D image. Assume that a 3D face scan with N vertices is represented as a 3N × 1 vector $S = [x_1, y_1, z_1, \dots, x_N, y_N, z_N]^T \in \mathbb{R}^{3N}$, where $[x_i, y_i, z_i]^T$ are the object-centered Cartesian coordinates of the i-th vertex. Given a dataset of such 3D face scans, one would like to represent them with a smaller set of variables. The authors in [5] propose to use a two-stage principal component analysis (PCA) to estimate the shape identity parameters along with the expression parameters of the 3D faces. Suppose that, after the first stage, they keep the first $n_s$ principal components and $s_1, s_2, \dots, s_{n_s}$ are the corresponding orthonormal basis vectors; then a 3D face S can be represented as follows:
\[ S = \bar{S} + \sum_{i=1}^{n_s} \alpha_i s_i \]
where $\bar{S} \in \mathbb{R}^{3N}$ is the mean shape vector across the dataset of 3D face scans and $\alpha = [\alpha_1, \dots, \alpha_{n_s}]$ are the shape parameters.
In the second stage, a new PCA model is trained on the difference between expression scans and neutral scans. After this stage, the final shape representation is
\[ S = \bar{S} + \sum_{i=1}^{n_s} \alpha_i s_i + \sum_{j=1}^{n_e} \beta_j e_j \]
where $e_1, \dots, e_{n_e}$ are the expression basis vectors and $\beta = [\beta_1, \dots, \beta_{n_e}]$ are the expression parameters. The 3D face is then projected onto the image plane:
\[ V = f \cdot Pr \cdot R \cdot S + t_{2d} \]
where f is the scale factor, $Pr = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$ is the orthographic projection matrix, R is the rotation matrix, and $t_{2d}$ is the principal point that is set to the image center.
Suppose that the set of all the model parameters is denoted by $p = [f, R, t_{2d}, \alpha, \beta]$.
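A minimal sketch of how the parameters p are used, assuming the usual linear 3DMM form (mean shape plus a weighted sum of basis vectors) and a scaled orthographic projection of each vertex, f · Pr · R · v + t2d. The function names and the two-vertex toy data are hypothetical; a real model has tens of thousands of vertices.

```python
def synthesize_shape(mean_shape, basis, coeffs):
    """Return mean_shape + sum_i coeffs[i] * basis[i] for flat 3N vectors."""
    shape = list(mean_shape)
    for c, b in zip(coeffs, basis):
        for k, v in enumerate(b):
            shape[k] += c * v
    return shape

def project(shape, scale, rotation, t2d):
    """Project every vertex v as scale * Pr * R * v + t2d."""
    points = []
    for k in range(0, len(shape), 3):
        v = shape[k:k + 3]
        rotated = [sum(rotation[r][c] * v[c] for c in range(3)) for r in range(3)]
        # Pr keeps the first two rows, i.e. it drops the depth coordinate.
        points.append((scale * rotated[0] + t2d[0], scale * rotated[1] + t2d[1]))
    return points

# Toy data: two vertices and a single basis vector that moves both along y.
mean_shape = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
basis = [[0.0, 1.0, 0.0, 0.0, 1.0, 0.0]]
alpha = [2.0]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

shape = synthesize_shape(mean_shape, basis, alpha)
points = project(shape, scale=10.0, rotation=identity, t2d=(50.0, 50.0))
print(shape)
print(points)
```

Expression parameters would be applied in the same way, by calling `synthesize_shape` again with the expression basis and β on the resulting shape.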
3.2.2 3DDFA
The 3DDFA method combines cascaded regression with a Convolutional Neural Network (CNN). The cascaded CNN can be formulated as:
\[ p^{k+1} = p^k + Net^k(Feat(I, p^k)) \]
where $p^k$ is the model parameters at the k-th iteration, which is updated by applying a CNN-based regressor $Net^k$ on the shape indexed feature $Feat$ that depends on the input image I and the current parameters $p^k$.
The purpose of the CNN regressors is to predict the parameter update $\Delta p$ that shifts the initial parameter $p^0$ as close as possible to the ground truth $p^g$. In terms of the objective function, [69] proposes to use the Optimized Weighted Parameter Distance Cost (OWPDC):
\[ E_{owpdc} = (\Delta p + p^0 - p^g)^T \, diag(w^*) \, (\Delta p + p^0 - p^g) \tag{3.6} \]
where $w^*$ is the optimized parameter importance vector.
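The cost (3.6) is a weighted squared distance between the updated parameters and the ground truth, and the cascade adds one predicted update per stage. The sketch below illustrates both on toy numbers. The regressor `halve_the_gap` is a purely hypothetical stand-in for a trained CNN (it even uses the ground truth, which a real regressor cannot), and the weights are made up rather than optimized.

```python
def owpdc_cost(delta_p, p0, pg, weights):
    """(dp + p0 - pg)^T diag(w) (dp + p0 - pg) for plain lists."""
    residual = [d + a - g for d, a, g in zip(delta_p, p0, pg)]
    return sum(w * r * r for w, r in zip(weights, residual))

def cascaded_update(p, regressors, image):
    """Apply p <- p + Net_k(image, p) once for every stage k."""
    for net in regressors:
        delta = net(image, p)
        p = [pi + di for pi, di in zip(p, delta)]
    return p

p0 = [1.0, 0.0]        # initial parameters (toy values)
pg = [2.0, 4.0]        # ground-truth parameters (toy values)
w = [1.0, 0.5]         # made-up importance weights

def halve_the_gap(image, p):
    """Hypothetical regressor: moves halfway toward the ground truth."""
    return [0.5 * (g - pi) for g, pi in zip(pg, p)]

cost = owpdc_cost([1.0, 2.0], p0, pg, w)
final_p = cascaded_update(p0, [halve_the_gap] * 3, image=None)
print(cost, final_p)
```

A parameter with a larger weight contributes more to the cost, so errors in that parameter are penalized more strongly during training.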
3.2.3 UV-GAN
The UV-GAN [10] model is the first architecture that comprises two phases to generate a synthetic facial image. In the first phase, the authors use the traditional 3DMM model to reconstruct a 3D face shape from a 2D facial image. However, the 3DMM model cannot recover the invisible part of the image, which leads to missing textures for non-frontal faces. Hence, the authors propose a new model called UV-GAN to fill in the invisible facial regions. The framework (Figure 3.4) has one UV generation network and two discriminators.
3.2.4 VGG
3.2.4.1 Generation Network
The generation network used by the authors is the U-Net model. The model works as an auto-encoder or generative model that transforms one image into another. More specifically, the authors follow the pix2pix architecture, which was designed for the image-to-image translation task. The model has shortcut connections from the encoder to the decoder, which enrich the features without increasing the number of parameters of the model. To handle the missing facial parts, the authors fill them with random noise and concatenate the result with its mirror image to produce the input for the model. As a result, the model can learn the relationship between visible and invisible parts at random locations. Besides, to preserve the identity information, the authors adopt a pixel-wise L1 loss as the reconstruction loss.
Figure 3.4: The UV-GAN framework consists of one generator (U-Net) and a global and a local discriminator.
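The input construction and the reconstruction loss described above can be sketched on a tiny single-channel map. This is an illustration under assumptions: the mask convention (1 = visible, 0 = missing), the uniform noise, mirroring the noise-filled map rather than the raw one, and all function names are hypothetical choices, not details from [10].

```python
import random

def fill_missing_with_noise(uv_map, visible_mask, rng):
    """Keep visible pixels and replace missing ones with uniform noise."""
    return [[px if vis else rng.random() for px, vis in zip(row, mrow)]
            for row, mrow in zip(uv_map, visible_mask)]

def mirror(uv_map):
    """Flip the map horizontally."""
    return [row[::-1] for row in uv_map]

def generator_input(uv_map, visible_mask, rng):
    """Stack the noise-filled map and its mirror image as two channels."""
    filled = fill_missing_with_noise(uv_map, visible_mask, rng)
    return [filled, mirror(filled)]

def l1_loss(pred, target):
    """Mean absolute pixel-wise difference."""
    diffs = [abs(p - t) for prow, trow in zip(pred, target)
             for p, t in zip(prow, trow)]
    return sum(diffs) / len(diffs)

uv_map = [[0.1, 0.2, 0.0, 0.0],
          [0.3, 0.4, 0.0, 0.0]]
mask = [[1, 1, 0, 0],
        [1, 1, 0, 0]]          # right half is self-occluded
channels = generator_input(uv_map, mask, random.Random(0))
loss = l1_loss([[1.0, 2.0]], [[0.0, 4.0]])
print(channels)
print(loss)
```

Because a face is roughly symmetric, the mirrored channel places visible texture where the other channel only contains noise, which is the cue the generator can exploit.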
3.2.4.2 Discrimination Network
The generator described in the previous section can fill in the missing regions with small reconstruction errors. But it does not guarantee that the facial output is visually realistic and informative. Hence, the authors use a discriminative network to improve the quality of the synthetic image and to produce more photorealistic results. The task of the discriminator is to distinguish between real and fake UV maps. The discriminator has two networks: a global and a local discriminator. The global discriminator determines the faithfulness of the entire UV map. The local discriminator focuses on the facial center, such as the nose, eyes, mouth, and forehead, because the inner face contains more information for identifying people. Therefore, the benefit of using both the global and local discriminators is that the model focuses not only on the surrounding context but also on the central face.
3.3 Polyp Segmentation
3.3.1 U-NET
In 2015, Olaf Ronneberger et al. [43] proposed a novel framework for the image segmentation task and won the ISBI cell tracking challenge. The network has two paths: a contracting path (encoder path) and an expanding path (decoder path). The contracting path captures the context information. By contrast, the expanding path enables precise localization. Hence, this network performs end-to-end segmentation.
Figure 3.5: The U-Net architecture has a contracting path and an expanding path.
Figure 3.5 describes the architecture in detail. In the encoder path, each block applies two consecutive 3x3 convolutions (unpadded), each followed by a ReLU activation function. In addition, at each downsampling step, the authors double the number of channels. In the decoder path, the spatial size of the feature map is first doubled by upsampling, followed by a 2x2 convolution. Then the authors concatenate it with the corresponding feature map from the encoder path and apply 3x3 convolutions, each followed by a ReLU function.
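Because the 3x3 convolutions are unpadded, every block trims the feature map, so the output is smaller than the input. The sketch below traces the spatial size through the network. It assumes four downsampling steps with 2x2 max pooling and two 3x3 convolutions per block, as in the original U-Net paper; the function name is hypothetical.

```python
def unet_output_size(input_size, depth=4):
    """Spatial size of the U-Net output for a square input."""
    size = input_size
    for _ in range(depth):
        size -= 4              # two unpadded 3x3 convolutions trim 2 pixels each
        if size % 2:
            raise ValueError("feature map size must be even before pooling")
        size //= 2             # 2x2 max pooling (assumed) halves the size
    size -= 4                  # the two convolutions at the bottom of the "U"
    for _ in range(depth):
        size *= 2              # upsampling doubles the feature map
        size -= 4              # two unpadded 3x3 convolutions in the decoder
    return size

out_size = unet_output_size(572)
print(out_size)
```

The encoder features are larger than the matching decoder features for the same reason, so they have to be cropped before the concatenation.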
3.3.2 ResUNet++
The ResUNet++ [28] model is inspired by the ResUNet model, an architecture that uses the power of deep residual learning. The proposed model uses state-of-the-art modules such as residual blocks, the squeeze and excitation block [20], ASPP [7], and the attention block (see Figure 3.6).
First, the residual block is similar to that of the ResNet architecture: it adds the input to the output through a shortcut connection. Hence, the authors can build a deeper model that mitigates the vanishing gradient problem. The authors use four encoder blocks. Each block has two successive 3x3 convolution operations and one identity mapping. In addition, each convolution block consists of a batch normalization layer, a ReLU activation layer, and a convolution layer. The output of each of the first three blocks is passed through a Squeeze and Excitation (SE) layer [20].
The SE module ensures that the network increases its sensitivity to the important features and suppresses the unimportant ones. It achieves this goal in two steps. The first step is to squeeze the features (global information embedding) by applying global average pooling to each channel. The second step is excitation, with the purpose of capturing the channel-wise dependencies. Afterward, the final output of the encoder path is passed through the Atrous Spatial Pyramid Pooling (ASPP) layer.
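The squeeze and excitation steps can be sketched on a tiny feature map. This is an illustration, not the ResUNet++ code: the two small dense layers with ReLU and sigmoid follow the common SE design, and the weights below are made-up values rather than trained ones.

```python
import math

def squeeze(feature_map):
    """Global average pooling: one scalar per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]

def dense(vec, weights, bias):
    """Fully connected layer; weights has one row per output unit."""
    return [sum(w * v for w, v in zip(wrow, vec)) + b
            for wrow, b in zip(weights, bias)]

def excite(z, w1, b1, w2, b2):
    """Dense -> ReLU -> dense -> sigmoid, giving one gate per channel."""
    hidden = [max(0.0, h) for h in dense(z, w1, b1)]
    return [1.0 / (1.0 + math.exp(-s)) for s in dense(hidden, w2, b2)]

def se_block(feature_map, w1, b1, w2, b2):
    """Rescale every channel by its gate."""
    gates = excite(squeeze(feature_map), w1, b1, w2, b2)
    return [[[g * v for v in row] for row in ch]
            for ch, g in zip(feature_map, gates)]

# Two channels of size 2x2 and made-up weights (2 -> 1 -> 2 units).
fmap = [[[1.0, 3.0], [5.0, 7.0]],
        [[0.0, 0.0], [0.0, 4.0]]]
w1, b1 = [[0.5, 0.5]], [0.0]
w2, b2 = [[1.0], [-1.0]], [0.0, 0.0]

pooled = squeeze(fmap)
out = se_block(fmap, w1, b1, w2, b2)
print(pooled)
print(out)
```

Each gate lies between 0 and 1, so a channel judged important is kept almost unchanged while a less useful one is scaled down.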
Figure 3.6: The ResUnet++ architecture
The ASPP network acts as a bridge between the encoder and the decoder. It can capture contextual information at various scales: the input feature map is processed by several parallel atrous convolutions with different rates, and their outputs are then fused. The authors integrate this module because it is powerful and has been used successfully in many segmentation tasks.
To connect the features from the encoder to the decoder, the authors use the attention mechanism, which is very popular in Natural Language Processing (NLP). The mechanism determines which regions the network focuses on. The important advantage of the attention mechanism is that it enhances the quality of the features and thereby improves the results.
3.3.3 Attention UNet
3.3.3.1 Attention Gate
In a standard CNN architecture, the feature maps are downsampled gradually in order to capture semantic contextual information. In this way, the features at the coarse spatial level represent location and the relationships between low-level and high-level information. However, it remains difficult to detect and segment small objects and to reduce missed predictions. Hence, Oktay et al. proposed the attention gate (AG) [38] to improve the accuracy of the model. This module can be integrated easily into a standard CNN model. The goal of AGs is to suppress feature responses in irrelevant locations and to focus only on relevant regions. Figure 3.7 describes the AG in detail.
Figure 3.7: The Attention Gate (AG) receives two inputs: input features ($x^l$) and gating signals (g). First, the input features and the gating signals are brought to the same resolution and added; the sum is then passed through two activation functions (ReLU, sigmoid) to produce the attention coefficient map (α). Finally, the input features are scaled by the attention coefficients to give $\hat{x}^l$. 4
\[ q_{att}^l = \gamma^T \big( \sigma_1 ( W_x^T x_i^l + W_g^T g_i + b_g ) \big) + b_\gamma \tag{3.7} \]
\[ \alpha_i^l = \sigma_2 \big( q_{att}^l (x_i^l, g_i; \Theta_{att}) \big) \tag{3.8} \]
where $\sigma_1$ is the ReLU function and $\sigma_2$ is the sigmoid function. AGs contain a set of parameters $\Theta_{att}$: $W_x \in \mathbb{R}^{F_l \times F_{int}}$, $W_g \in \mathbb{R}^{F_g \times F_{int}}$, $\gamma \in \mathbb{R}^{F_{int} \times 1}$, and bias terms $b_\gamma \in \mathbb{R}$, $b_g \in \mathbb{R}^{F_{int}}$.
The attention coefficients lie in the range from 0 to 1, $\alpha_i \in [0, 1]$, so the model prunes feature responses and preserves only the relevant activations. The module is characterized by the set of learnable parameters $W_x$, $W_g$, $\gamma$. The linear transformations are computed using 1x1x1 convolutions on the feature responses. Then, the output passes through the sigmoid function to squeeze the values into the given range. Next, the output of the AG is the element-wise multiplication of the input feature maps and the attention coefficients, $\hat{x}^l_{i,c} = x^l_{i,c} \cdot \alpha^l_i$. Finally, the authors emphasize that the AG module can be trained with the standard back-propagation algorithm, unlike hard attention.
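Equations (3.7) and (3.8) can be evaluated by hand for a single spatial location i. The sketch below does this with plain lists, assuming that the first activation is ReLU and the second is sigmoid; all weights are made-up numbers chosen so that the result is easy to check, and the function names are hypothetical.

```python
import math

def attention_coefficient(x, g, w_x, w_g, b_g, gamma, b_gamma):
    """alpha = sigmoid(gamma^T relu(W_x^T x + W_g^T g + b_g) + b_gamma)."""
    hidden = []
    for k in range(len(gamma)):
        s = sum(w_x[c][k] * x[c] for c in range(len(x)))
        s += sum(w_g[c][k] * g[c] for c in range(len(g)))
        hidden.append(max(0.0, s + b_g[k]))
    q = sum(gk * hk for gk, hk in zip(gamma, hidden)) + b_gamma
    return 1.0 / (1.0 + math.exp(-q))

def attention_gate(x, g, w_x, w_g, b_g, gamma, b_gamma):
    """Scale every channel of x by the attention coefficient of this location."""
    alpha = attention_coefficient(x, g, w_x, w_g, b_g, gamma, b_gamma)
    return [alpha * xc for xc in x], alpha

x = [1.0, 2.0]                       # F_l = 2 input feature channels
g = [0.5]                            # F_g = 1 gating channel
w_x = [[1.0, 0.0], [0.0, 1.0]]       # shape F_l x F_int
w_g = [[1.0, -1.0]]                  # shape F_g x F_int
b_g = [0.0, 0.0]
gamma = [1.0, 1.0]                   # shape F_int
b_gamma = -3.0

gated, alpha = attention_gate(x, g, w_x, w_g, b_g, gamma, b_gamma)
print(alpha, gated)
```

Every operation here is differentiable, so the gate can be trained by back-propagation together with the rest of the network.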
3.3.3.2 Attention UNet Architecture
The model (Figure 3.8) follows the standard UNet and integrates AGs into its architecture. Thus, the model has two paths: the encoder path and the decoder path. The AGs act like a bridge that connects the feature maps of the encoder and the decoder. First, the feature map of the decoder is upsampled to double its spatial size while keeping the number of channels. Next, it is combined with the feature map of the encoder (at the same resolution) via AGs to determine the relevant information as well as suppress the irrelevant information. As a consequence, the new model can improve its accuracy and increase its sensitivity.
4 https://arxiv.org/pdf/1804.03999.pdf
Figure 3.8: Attention UNet architecture
3.3.4 PraNet
Figure 3.9: The PraNet architecture includes three reverse attention modules attached to the last three high-level features.
3.3.4.1 Parallel Partial Decoder
Usually, models with an encoder and a decoder (U-Net, ResUNet, etc.) aggregate features from all levels to produce the final output. However, Wu et al. pointed out that high-level features are more meaningful than low-level ones. In addition, high-level features require less computational resources due to their small resolution. For this reason, the authors aggregate only the last three features to yield the global map. The block that merges these features is called a parallel partial decoder. More specifically, when Res2Net [15] is used to extract features, there are five levels of features $f_i$, i = 1, ..., 5. They are split into two categories: low-level $f_i$, i = 1, 2, and high-level $f_i$, i = 3, 4, 5. Next, the partial decoder pd(·), a state-of-the-art decoder component, is computed as $PD = pd(f_3, f_4, f_5)$.
3.3.4.2 Reverse Attention Module
To supervise the high-level features, reverse attention modules are used to refine the relatively rough segmentation. Thus, the authors integrate a trainable reverse attention module at each of the three high-level features. More specifically, the reverse attention weights lie in the range from 0 to 1 and are multiplied element-wise with the high-level features. As a result, this technique not only makes the model focus on the polyp location but also improves the accuracy. In addition, it suppresses irrelevant information to reduce the mistakes of the network. The formula is described as below:
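A minimal sketch of the element-wise weighting, assuming (as in the PraNet paper) that the reverse attention weight is one minus the sigmoid of the coarser prediction map, which is taken here to be already upsampled to the resolution of the features. The function name and the toy values are hypothetical.

```python
import math

def reverse_attention(features, coarse_logits):
    """Weight every channel of `features` by 1 - sigmoid(coarse_logits).

    features: list of C channels, each an H x W list of lists.
    coarse_logits: H x W prediction of the deeper layer, before the sigmoid.
    """
    weights = [[1.0 - 1.0 / (1.0 + math.exp(-s)) for s in row]
               for row in coarse_logits]
    refined = [[[w * v for w, v in zip(wrow, frow)]
                for wrow, frow in zip(weights, channel)]
               for channel in features]
    return refined, weights

features = [[[2.0, 2.0]]]        # one channel, 1 x 2 feature map
coarse_logits = [[0.0, 10.0]]    # uncertain pixel, confident "polyp" pixel
refined, weights = reverse_attention(features, coarse_logits)
print(weights)
print(refined)
```

Pixels that the coarse map already labels confidently as polyp receive a weight close to 0, so the module concentrates on the regions that remain uncertain, such as the boundaries.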
We have just analyzed some relevant studies on face reconstruction and polyp segmentation. They provide the baseline models and the theoretical foundation for proposing a new model. These models still face some challenges. For example, a pose-invariant face recognition system must identify face images under different poses, illumination, and expressions. UV-GAN is a new idea that transforms a 2D facial image into a 3D face mesh and an incomplete UV map. Then the authors use a generative model to recover the self-occluded regions. However, they only use a simple U-Net model as the generative model. This is the reason why we propose novel architectures (ResCUNet, Attention ResCUNet) to replace the previous model. For the polyp segmentation task, the challenges are small polyps, illumination, and small datasets. Thus, we propose the Attention ResCUNeSt model to address these challenges. Afterward, we conducted many experiments to validate our proposal. As a result, our proposal surpasses the previous works on the standard benchmarks.
Chapter 4
Proposed Method
In this thesis, we propose three novel architectures (ResCUNet, Attention ResCUNet, Attention ResCUNeSt) to solve two tasks (face reconstruction and polyp segmentation). We apply many modern components: the Attention Gate module, powerful backbones (ResNet, ResNeSt), an auxiliary loss, multiple connections between UNets, and a doubling of the original UNet model. Then, the proposal is evaluated on standard benchmarks to compare with previous studies. Our research on the face reconstruction task was published at MAPR-2020 and in an SCI Q1 journal.
Recently, generative adversarial networks (GANs) [8] have proved to be powerful at mimicking data distributions. GANs have been successfully applied to many computer vision tasks such as image inpainting [39, 59, 61], style transfer [36, 68], image synthesis [29, 30], super-resolution [32] and so on. These successful applications have motivated researchers to apply GANs to pose-invariant feature disentanglement [53, 54], face completion [56] and face frontalization [53, 21, 60, 65, 12]. TP-GAN [21] uses a two-pathway GAN that simultaneously learns global structures and local information for photorealistic frontal view synthesis. Zhao et al. [66] propose a unified deep architecture containing a face frontalization module and a discriminative learning module, which can be jointly learned in an end-to-end fashion.
Figure 4.1: A pipeline process of face synthesis. 3DDFA is used to obtain a 3D mesh and an incomplete UV map. Then a new generative model is applied to recover the self-occluded regions. The completed UV map is attached to the fitted 3D mesh to generate faces of arbitrary poses.
In [10], Deng et al. propose an adversarial UV map completion framework called UV-GAN to solve pose-invariant face recognition without the need for extensive pose coverage in the training dataset. The authors in [10] first fit a 3DMM [6] to a 2D profile face and get an incomplete UV map, which is then completed by a straightforward pix2pix [23, 24]. The generator architecture in pix2pix follows the general shape of U-Net [43], adding skip connections between the encoder and decoder subnetworks in order to enhance the transfer of low-level information between input and output. One weakness of the original UV-GAN is the plain architecture of the generator, which has been shown to be worse than residual networks [18]. Another weakness is that one U-Net block seems insufficient to mix low-level information in the encoder well with high-level semantic features in the decoder. In [58], Deng et al. use UV-GAN with an architecture similar to that in [10] to extract side information as well as subspaces and combine UV-GAN with robust PCA for the face recognition task. He et al. [19] introduce a framework for heterogeneous face synthesis from the near-infrared (NIR) to the visible domain. The framework consists of two adversarial generators that estimate a UV map and a facial texture map from an input NIR face and then generate a corresponding frontal visible face. Nevertheless, both generators in this framework