An Improved UNets Architecture and Its Applications



HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

Master's Thesis in Computer Science

An Improved UNets Architecture and Its Applications

Cải tiến kiến trúc U-Nets và các ứng dụng

TRAN QUANG CHUNG

Chung.TQCB190214@sis.hust.edu.vn

Supervisor: Dr Dinh Viet Sang

Department: Computer Science

Ha Noi, 4/2021


Declaration of Authorship and Topic Sentences

• Proposing a new architecture to improve the existing one

• Applying the architecture for some tasks

• Evaluating on many standard benchmark datasets

4 Declaration of Authorship

I hereby declare that my thesis titled "An Improved UNets Architecture and Its Applications" is the work of myself and my supervisor, Dr Dinh Viet Sang. All papers, sources, tables, etc. used in this thesis are thoroughly cited.

5 Supervisor Confirmation

Ha Noi, April 2021

Supervisor

Dr Dinh Viet Sang


Throughout the writing of this thesis, I have received a great deal of support from my teachers, my friends, and my colleagues.

First, I would like to express my deep and sincere gratitude to all of the teachers of the School of Information and Communication Technology, Hanoi University of Science and Technology, who equipped me with a great deal of important knowledge. Second, I would like to thank my supervisor, Dr Dinh Viet Sang, whose expertise was invaluable in formulating the research questions and methodology for a newcomer like me. He was the first teacher to show me how to begin research work and how to write a good scientific paper. His insightful feedback pushed me to a higher level and taught me how to solve a problem.

I would also like to thank the VAIS (Vietnam Artificial Intelligence Solutions) company for supporting me a lot. VAIS provided the hardware resources, such as GPUs, servers, and hard disk drives, that I needed to finish my thesis.

Finally, I am grateful to my parents for their love and wise counsel. I also thank my friends, who always support me in difficult situations.


In recent years, deep learning has developed rapidly and has been applied to support many real-world problems. Deep learning methods surpass traditional ones in most challenges, such as image segmentation, image detection, and face recognition. However, they still face some difficulties, and there is room for improvement. In this thesis, we focus on two domains: face reconstruction and polyp segmentation. Face reconstruction is an important module for improving the performance of a pose-invariant face recognition system, whose accuracy suffers from variations in pose, illumination, and expression. Hence, we propose two variants of a newly developed generative model (ResCUNet and Attention ResCUNet) that transform a profile face into a frontal face. The synthetic frontal faces are natural, photorealistic, coherent, and identity-preserved. As a result, our proposal improves the performance of the face recognition system. For the polyp segmentation task, the challenges are small polyps, illumination, and limited data. Thus, we propose the Attention ResCUNeSt model to address these challenges.

We conducted many experiments on these tasks, and our proposals surpass many previous studies on standard benchmark datasets.

Keywords: Face Reconstruction, ResNet, Image Segmentation, Convolutional Neural Network, UNet, Attention, CUNet, UVGAN, UV Map

Author

Tran Quang Chung


1.1 Introduce some tasks in Computer Vision
1.1.1 Face recognition
1.1.2 Image segmentation
1.2 Introduce the problem and Motivation
1.2.1 Face Reconstruction
1.2.2 Polyp Segmentation
1.3 Contribution of the Master Thesis
1.4 Outline of the Master Thesis

2 Theoretical Basis
2.1 Convolution Neural Networks
2.1.1 Layers
2.1.2 Spatial Convolution
2.1.3 Spatial Pooling
2.1.4 Backpropagation algorithm
2.1.5 Gradient descent
2.1.6 Dropout
2.1.7 Transfer Learning

3 Literature Review
3.1 Convolutional Neural Networks
3.1.1 LeNet
3.1.2 VGG
3.1.3 ResNet
3.2 Face Reconstruction
3.2.1 3D Morphable Model
3.2.2 3DDFA
3.2.3 UV-GAN
3.2.4 VGG
3.3 Polyp Segmentation
3.3.1 U-NET
3.3.2 ResUNet++
3.3.3 Attention UNet
3.3.4 Pranet

4 Proposed Method
4.1 Face Reconstruction
4.2.1 Overall Architecture
4.2.2 Backbone: ResNet family
4.2.3 Coupled U-Nets
4.2.4 Loss function

5 Experiments and Results
5.1 The dataset
5.1.1 Multi-PIE
5.1.2 Dataset Verification
5.1.3 Polyp Segmentation
5.2 Face Reconstruction
5.2.1 Image Reconstruction
5.2.2 Pose Invariance Face Recognition
5.2.3 Attention map visualization
5.2.4 Failed Cases
5.3.1 Data augmentation
5.3.2 Evaluation metrics
5.3.3 Ablation study
5.3.4 Comparison to existing methods


List of Tables

5.1 Evaluation of different methods on Multi-PIE dataset

5.2 Verification results on different poses on the Multi-PIE dataset

5.3 Verification accuracy (%) comparison on the LFW and CPLFW datasets

5.4 Verification accuracy (%) comparison on the CFP dataset

5.5 Performance metrics for model variants trained using Scenario 1, i.e., training on CVC-Colon and ETIS-Larib, testing on CVC-Clinic

5.6 Performance metrics for Mask-RCNN and Attention ResCUNeSt using Scenario 2, i.e., using CVC-Colon for training, CVC-Clinic for testing

5.7 Performance metrics for Mask-RCNN, Double UNet and Attention ResCUNeSt using Scenario 3, i.e., using CVC-ClinicDB for training, ETIS-Larib for testing

5.8 mDice and mIoU scores for models trained using Scenario 4 on the Kvasir-SEG and CVC-ClinicDB test sets

5.9 Performance metrics for UNet, MultiResUNet and Attention ResCUNeSt-101 using Scenario 5, i.e., 5-fold cross-validation on the CVC-Clinic dataset

5.10 Performance metrics for UNet, ResUNet++, PraNet and Attention ResCUNeSt-101 using Scenario 6, i.e., 5-fold cross-validation on the Kvasir-SEG dataset


List of Figures

1.1 Facial Recognition System

1.2 Four levels of image segmentation

2.1 A simple neural architecture with three layers: input, hidden, and output. Neurons are connected by directed arrows.

2.2 a) A drawing of a brain neuron, b) its mathematical function

2.3 a) Sigmoid activation function, b) tanh activation function

2.4 Convolution Operation

2.5 Max-Pooling Operation

2.6 The backpropagation algorithm

2.7 Gradient Descent Algorithm

2.8 Dropout is used as a regularization technique

3.1 The first deep learning architecture - LeNet

3.2 VGG 16 architecture used for the ILSVRC-2012 and ILSVRC-2013 competitions

3.3 A residual block

3.4 The UV-GAN framework consists of one generator (U-Net) and global and local discriminators

3.5 The UNET architecture has a contracting path and an expanding path

3.6 The ResUNet++ architecture

3.7 The Attention Gate (AG) receives two inputs: input features (x^l) and gating signals (g). First, the input features are up-sampled and added to the gating signals, then passed through two activation functions (ReLU, Sigmoid) to produce attention coefficient maps (α). Finally, the input features are scaled by the attention coefficients (x̂^l).

3.8 Attention UNet architecture

3.9 The PraNet architecture includes three reverse attention modules attached to the last three high-level features

4.2 ResCUNet with coupled U-Nets enhanced by residual connections within each U-Net

4.3 Attention ResCUNet, an advanced version of the previous network. The generator of the proposed Attention ResCUNet consists of coupled U-Nets. Skip connections within each U-Net are enhanced with attention gates before concatenation. The contextual information from the first U-Net decoder is fused, with weighting, with the attentive low-level feature maps of the second U-Net encoder before concatenation with the high-level coarse feature maps of the second U-Net decoder. An auxiliary loss is used to improve gradient flow during the training phase.

4.4 Discriminators and identity preserving module of the proposed Attention ResCUNet-GAN. The global discriminator is responsible for the global structure of entire UV maps. The local discriminator focuses on the local facial details. The identity preserving module keeps the identity information unchanged during the modification of the generator.

4.5 Overview of the proposed Attention ResCUNeSt. Attention gates within each UNet are used to suppress irrelevant information in the encoder's feature maps. Skip connections across the two UNets are also utilized to boost information flow and promote feature reuse.

4.6 Split attention in the k-th cardinal group with R splits

5.1 Camera labels and approximate positions inside the gathering room. There were 13 cameras placed at head height, separated at 15° intervals. Two additional cameras (08_1 and 19_1) were positioned above the subject, simulating a typical surveillance camera.

5.2 Montage of all 15 cameras in the dataset, exhibited with frontal flash illumination. 13 of the 15 cameras were placed at head height, with two extra cameras mounted higher up to capture views typically encountered in surveillance.

5.3 The creation of ground-truth complete UV maps. Three facial images with yaw angles of 0°, −30°, 30° are fed to the 3DDFA model to create three incomplete UV maps, which are then merged by Poisson blending to generate the ground-truth complete UV map.

5.4 Some samples of positive pairs from the CFP dataset

5.5 Results with frontal input images. Incomplete UV maps are generated using 3DDFA. The next columns are ground truth UV maps, results of UV-GAN, results of normal ResCUNet-GAN, intermediate results of Attention ResCUNet-GAN (after the first U-Net), and final results of Attention ResCUNet-GAN (after the second U-Net), respectively. The rightmost block shows some synthetic images generated from the final results of Attention ResCUNet-GAN.

5.6 Results with profile input images. Incomplete UV maps are generated using 3DDFA. The next columns are ground truth UV maps, results of UV-GAN, results of normal ResCUNet-GAN, intermediate results of Attention ResCUNet-GAN (after the first U-Net), and final results of Attention ResCUNet-GAN (after the second U-Net), respectively. The rightmost block shows some synthetic images generated from the final results of Attention ResCUNet-GAN.

5.7 Results with in-the-wild input images. Incomplete UV maps are generated using 3DDFA. The ground truth UV maps are unavailable. The next columns are the results of UV-GAN, results of normal ResCUNet-GAN, intermediate results of Attention ResCUNet-GAN (after the first U-Net), and final results of Attention ResCUNet-GAN (after the second U-Net), respectively. The right block shows some synthetic images generated from the final results of Attention ResCUNet-GAN.

5.8 Synthetic images for frontal input images. The left block corresponds to the result of UV-GAN. The right block corresponds to the final result of Attention ResCUNet-GAN (after the second U-Net).

5.9 Synthetic images for profile input images. The left block corresponds to the result of UV-GAN. The right block corresponds to the final result of Attention ResCUNet-GAN (after the second U-Net).

5.10 Synthetic images for in-the-wild input images. The left block corresponds to the result of UV-GAN. The right block corresponds to the final result of Attention ResCUNet-GAN (after the second U-Net).

5.11 Attention map visualization. The first column contains UV maps generated by the 3DDFA network, the second column contains generated UV maps overlaid with attention masks, and the last column illustrates the attention coefficients only.

5.12 Some failed cases when the input facial images are "abnormal" with respect to the training data. The top row shows the input images, the second row contains the incomplete UV maps, and the third row displays the completed UV maps generated by our Attention ResCUNet-GAN.

5.13 Qualitative result comparison using Colon for training and Clinic for testing. From left to right: input image, ground truth, visualization of ResNet101-MaskR-CNN's output in overlay mode, binary output of ResNet101-MaskR-CNN, visualization of ResNet50-MaskR-CNN's output in overlay mode, binary output of ResNet50-MaskR-CNN, binary output of Attention ResCUNeSt-101, and attention map in the last attention gate, denoted by S9 in Fig. 4.5. The red color in the attention map indicates the region on which the model focuses.

5.14 The results of Attention ResCUNeSt-101 on the CVC-Clinic dataset. From left to right: input image, ground truth, output of the first UNet, output of the second UNet, and attention map in the last attention gate S9. The red areas in the attention map indicate a high probability that polyps appear.

5.15 ROC curves and PR curves for Attention ResCUNeSt-101, PraNet, ResUNet++ and UNet in Scenario 6, i.e., 5-fold cross-validation on the Kvasir-SEG dataset. All curves are averaged over five folds.

5.16 Qualitative result comparison of different models trained in Scenario 6, i.e., 5-fold cross-validation on the Kvasir-SEG dataset

5.17 Some failed cases of our model on the Kvasir-SEG dataset


Chapter 1

Introduction

It is undeniable that artificial intelligence has brought a great deal of convenience and benefit to human life. It is also a tool that frees up labor and time. In the last decade, Deep Learning (DL) has surpassed many traditional methods thanks to computational resources (GPUs) and public datasets. Another reason deep learning has evolved so fast is the big data era. Almost all techniques in computer vision, natural language processing, and speech processing have been replaced by neural network algorithms. Many applications that we once could imagine only in movies have become real. Some outstanding examples are Google Translate, self-driving cars, face recognition, and healthcare programs. In more detail, Google Translate is an intelligent system for anyone who must translate a document from a source language into a target language. The autonomous car makes journeys safer thanks to the many sensors around it. In addition, there are many other intelligent programs that we use every day.

1.1 Introduce some tasks in Computer Vision

Nowadays, every field (medicine, traffic, manufacturing, astronomy, etc.) must store its data (images, text, speech, etc.). This data is very valuable if we can extract information from it. However, doing so is still a challenge for many scientists and engineers. Recently, by using deep learning for image processing, we can easily exploit the information in big data, and deep learning surpasses almost all traditional methods in this field. More specifically, computer vision tasks such as face recognition and image segmentation have many real-world applications.


Figure 1.1: Facial Recognition System 1

1. Classification: An entire image is classified into a designated group such as sheep, dog, or person (see Fig. 1.2-a).

1 http://www.softscients.web.id/2016/09/face-detection-in-matlab-and-opencv html


2. Object Detection: The location of each object is specified by drawing a bounding box around it (see Fig. 1.2-b).

3. Semantic Segmentation: This process classifies similar objects into one group. These groups are "semantically interpretable" and represent real-world classes. For instance, in the figure, the sheep are drawn in blue and the dog in red (see Fig. 1.2-c).

4. Instance Segmentation: This technique is an upgraded version of Semantic Segmentation. Here, a group of sheep is separated into individual objects. We can act on each object individually, which is the key difference from Semantic Segmentation (see Fig. 1.2-d).

Figure 1.2: Four levels of image segmentation 2

1.2 Introduce the problem and Motivation

In this thesis, we focus on two domains: Face Reconstruction and Polyp Segmentation.

1.2.1 Face Reconstruction

Face recognition is widely used in many real-world applications and has therefore gained a lot of attention. In contrast to other popular biometrics, face recognition can be applied to uncooperative subjects in a non-intrusive manner.

2 https://www.programmersought.com/article/55813687476/


While (near-)frontal face recognition has gradually matured, face recognition in the wild is still challenging due to various unconstrained factors. In fact, the performance of a face recognition system depends heavily on the pose of the input face. Recent studies show that face verification performs well when both faces share the same view, such as frontal-frontal or profile-profile. However, performance degrades dramatically when verifying faces in different views, such as frontal-profile [53]. If we rotate the profile face to a frontal view, performance increases. This is why we use face reconstruction and propose new architectures (ResCUNet and Attention ResCUNet) to improve the accuracy of face recognition.

1.2.2 Polyp Segmentation

Next, we want to bring deep learning techniques to medicine. More specifically, we use deep learning to segment polyp regions. According to statistics, colorectal cancer (CRC) is one of the main causes of cancer deaths in the world for both men and women, and the number of patients increases quickly every year [4]. Colonic polyps, which arise from glandular tissue in the colonic mucosa, are commonly found in the colon and stomach. Most of these adenomas are benign. Some of these tumors become malignant over time and affect other organs, for example the liver and the lungs. Eventually, the disease leads to death unless it is diagnosed and treated early [16]. Nowadays, colonoscopy is the standard procedure for colon screening. However, several factors can hinder polyp detection, and a missed detection is very dangerous for patients. When endoscopists explore the intestinal wall to detect polyps, they can overlook small or flat polyps, those smaller than 10 mm [34, 55]. Colonoscopy depends on highly skilled and experienced endoscopists who must perform eye-hand coordination competently. Recent studies have shown that 22%-28% of polyps are missed during colonoscopy [34]. Consequently, missed polyps can reduce the survival rate to 10%, and it is undeniable that segmenting and detecting cancer in the early stages of this dangerous disease increases the chance of a cure [42]. Other factors include low image quality, outdated tools, and the clinicians' concentration [33, 1]. In the past, researchers leveraged computer systems and computer vision to reduce the miss rate and improve detection capability. Most existing works on automatic polyp segmentation and detection can be divided into two big groups: 1) methods that use handcrafted features; 2) methods that use end-to-end learning, more specifically deep learning. In this thesis, we propose a novel architecture (Attention ResCUNeSt), and our proposal surpasses many previous studies on this task.


1.3 Contribution of the Master Thesis

The main contributions of this thesis are:

• We propose three novel architectures (ResCUNet, Attention ResCUNet, and Attention ResCUNeSt) for two tasks: face reconstruction and polyp segmentation.

• We evaluate on various datasets: for each task (face reconstruction and polyp segmentation), we evaluate our proposal on many popular datasets to obtain the best performance.

1.4 Outline of the Master Thesis

The rest of this thesis is structured as follows:

1. Introduction: This chapter describes the problems and our contributions.

2. Theoretical basis: This chapter describes the theory of computer vision and deep learning.

3. Literature Review: This chapter describes related and previous works. It is an essential foundation for proposing a novel architecture.

4. Proposed method: This chapter describes in detail the three novel architectures for two tasks: face reconstruction and polyp segmentation.

5. Experiments and Results: This chapter describes the datasets used in this thesis. The experiments and the results are presented here.

6. Conclusion: This chapter gives the conclusion and future work.


Chapter 2

Theoretical Basis

2.1 Convolution Neural Networks

2.1.1 Layers

2.1.1.1 Linear or Fully Connected

Figure 2.1: A simple neural architecture with three layers: input, hidden, and output. Neurons are connected by directed arrows 1

The linear layer performs its computation in a way loosely inspired by the human brain. The human nervous system has approximately 86 billion neurons, linked by roughly 10^14-10^15 synapses. Overall, the input signal arrives at the dendrites, passes through the neuron, and yields an output at the axon terminals. We can visualize the model as in Figure 2.2, where each node represents an artificial neuron and each arrow represents a connection between two neurons. We can model the nervous system with a simple linear function that mimics the action of a human neuron.

1 https://cs231n.github.io/neural-networks-1/

Mathematically, a linear layer can be viewed as a function that applies a linear transformation to an input vector I and outputs a vector O. For more detail, see equations 2.1 and 2.2 below.
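As a minimal sketch of the linear layer described above, the following plain-Python function computes O = W I + b for one input vector. The weight matrix W and the bias vector b are assumed names for the parameters of the transformation; they are illustrative and are not taken from the thesis.

```python
def linear_layer(weights, bias, inputs):
    """Compute O = W I + b for one input vector.

    weights: list of rows, one row per output neuron (assumed name W)
    bias:    list with one offset per output neuron (assumed name b)
    inputs:  the input vector I
    """
    outputs = []
    for row, offset in zip(weights, bias):
        # Each output neuron is a weighted sum of all inputs plus its bias.
        outputs.append(sum(w * x for w, x in zip(row, inputs)) + offset)
    return outputs


# Two output neurons fed by a three-dimensional input (hypothetical values).
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
b = [0.0, 1.0]
I = [2.0, 4.0, 6.0]
O = linear_layer(W, b, I)
```

Each row of W plays the role of the incoming connections of one neuron, which matches the picture of arrows converging on a node.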

Figure 2.2: a) A drawing of a brain neuron, b) its mathematical function 2

Sigmoid. The sigmoid non-linearity has the following mathematical form:

2 https://cs231n.github.io/neural-networks-1/


A characteristic of this function is its "S" shape. It takes a real value and squashes it into the range 0-1. However, when the neuron's activation saturates at either end, 0 or 1, the gradient there is almost zero. As a result, the back-propagation algorithm fails to update the parameters.
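The saturation effect can be checked numerically. The sketch below uses the standard logistic sigmoid, 1 / (1 + exp(-x)), and its derivative; the function names are chosen for this example only.

```python
import math


def sigmoid(x):
    """Standard logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The gradient peaks at x = 0 and vanishes as the output nears 0 or 1.
for x in (-10.0, 0.0, 10.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.5f}  grad={sigmoid_grad(x):.5f}")
```

Because the derivative never exceeds 0.25 and shrinks rapidly away from zero, gradients multiplied through several saturated sigmoid units become very small.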

Hyperbolic Tangent. The tanh non-linearity has the following mathematical form:
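For reference, the standard definition of the hyperbolic tangent and its relation to the sigmoid σ can be written as follows. This is the textbook form, given here as an assumption about the notation the thesis uses.

```latex
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\,\sigma(2x) - 1,
\qquad \sigma(x) = \frac{1}{1 + e^{-x}},
\qquad \tanh(x) \in (-1, 1)
```

Unlike the sigmoid, the output of tanh is centered at zero, but it saturates in the same way for inputs of large magnitude.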

2.1.2 Spatial Convolution

Convolution is a simple mathematical operation that is extremely important to many image processing operators. Convolution provides a way of multiplying two arrays of numbers, generally of different sizes but of the same dimensionality, to produce a third array of numbers of the same dimensionality. In image processing, the first array is an image (grayscale or RGB), and the second array, which is smaller and usually of size 3x3, 5x5, or 7x7, is called the kernel, convolution matrix, or mask. This technique is used for blurring, sharpening, finding edges, etc.

3 https://www.programmersought.com/article/1060528072/

Besides, the convolution operation has two hyperparameters: padding and stride. When we apply convolution multiple times, each output is smaller than its input, so we can lose much information in the process. To resolve this problem, we can pad the input image before convolution by adding values at its border. There are many padding strategies, such as mirror padding and zero padding; zero padding is the most popular. Stride is the number of pixels the kernel jumps each time it moves across the image.

The formula of a convolution is: G = H∗F

where H is a 2D image and F is a 2D filter (kernel).

Convolution can be visualized as in Figure 2.4.

Figure 2.4: Convolution Operation 4
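The effect of padding and stride can be sketched in plain Python. The function below is a naive, unoptimized illustration of G = H * F with zero padding; the function name and argument names are assumptions made for this example.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Naive 2D convolution G = H * F with zero padding.

    image:  2D list H (rows of numbers)
    kernel: 2D list F, smaller than the image
    """
    k_h, k_w = len(kernel), len(kernel[0])
    # Zero-pad the border so the output does not shrink as quickly.
    width = len(image[0]) + 2 * padding
    padded = [[0.0] * width for _ in range(padding)]
    for row in image:
        padded.append([0.0] * padding + list(row) + [0.0] * padding)
    padded += [[0.0] * width for _ in range(padding)]
    # True convolution flips the kernel in both directions before sliding.
    flipped = [row[::-1] for row in kernel[::-1]]
    out_h = (len(padded) - k_h) // stride + 1
    out_w = (width - k_w) // stride + 1
    output = []
    for i in range(out_h):
        out_row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            out_row.append(sum(
                flipped[a][b] * padded[top + a][left + b]
                for a in range(k_h) for b in range(k_w)
            ))
        output.append(out_row)
    return output


H = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
identity = [[0, 0, 0],
            [0, 1, 0],
            [0, 0, 0]]
same = conv2d(H, identity, stride=1, padding=1)     # 4x4, equal to H
smaller = conv2d(H, identity, stride=1, padding=0)  # 2x2 centre of H
strided = conv2d(H, identity, stride=2, padding=1)  # 2x2, every other pixel
```

For an input of size n, kernel size k, padding p, and stride s, the output size along each axis is floor((n + 2p - k) / s) + 1, which is why a 3x3 kernel with padding 1 and stride 1 preserves the image size.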

2.1.3 Spatial Pooling

The pooling function is an operation that reduces the size of the feature array and, with it, the number of parameters in the subsequent layers of the model. This helps prevent overfitting and speeds up computation. There are many pooling functions, such as max-pooling, average-pooling, and min-pooling; max-pooling is used most frequently. One of the most important reasons for using pooling is to make the features invariant to small translations: if the input is shifted locally, max-pooling still preserves most of the crucial values. Therefore, we can conclude that the pooling function emphasizes the meaningful features and ignores the irrelevant features of subjects.

4 https://indoml.com/2018/03/07/student-notes-convolutional

Figure 2.5: Max-Pooling Operation 5
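The translation invariance mentioned above can be illustrated with a small sketch. The helper below is a plain-Python max-pooling over non-overlapping 2x2 windows; its name and default arguments are assumptions for this example.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Max-pooling over size x size windows of a 2D list."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + a][j * stride + b]
                for a in range(size) for b in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


features = [[1, 3, 2, 0],
            [4, 9, 1, 1],
            [0, 2, 7, 5],
            [1, 0, 6, 8]]
# The strong response 9 is moved one pixel to the left, inside the same window.
shifted = [[1, 3, 2, 0],
           [9, 4, 1, 1],
           [0, 2, 7, 5],
           [1, 0, 6, 8]]
pooled = max_pool2d(features)
pooled_shifted = max_pool2d(shifted)
```

Both inputs pool to the same 2x2 map, because the maximum of each window does not depend on where inside the window it occurs.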

2.1.4 Backpropagation algorithm

The backpropagation algorithm is a popular technique for training artificial neural networks, especially deep neural networks. The algorithm originated in the 1960s and 1970s, but it became widely used only in 1986, when it was formally applied as a learning procedure for training neural networks. Backpropagation computes the gradient of the loss function with respect to each weight in the model's weight matrices. A gradient descent optimization algorithm then uses these gradients to adjust the weights. Figure 2.6 shows how the algorithm works.

Figure 2.6: The backpropagation algorithm 6
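To make the chain rule behind backpropagation concrete, here is a small sketch for a hypothetical network with one sigmoid hidden unit and one linear output, trained with squared error. The network, its parameter names, and the numerical check are all assumptions chosen for illustration.

```python
import math


def forward(params, x, target):
    """Forward pass; returns the loss and the cached intermediate values."""
    w1, b1, w2, b2 = params
    z = w1 * x + b1                      # hidden pre-activation
    h = 1.0 / (1.0 + math.exp(-z))       # sigmoid hidden activation
    y = w2 * h + b2                      # linear output
    loss = 0.5 * (y - target) ** 2       # squared error
    return loss, (z, h, y)


def backward(params, x, target):
    """Backward pass: apply the chain rule from the loss back to each weight."""
    w1, b1, w2, b2 = params
    _, (_, h, y) = forward(params, x, target)
    d_y = y - target                     # dL/dy
    d_w2 = d_y * h                       # dL/dw2
    d_b2 = d_y                           # dL/db2
    d_h = d_y * w2                       # error propagated to the hidden unit
    d_z = d_h * h * (1.0 - h)            # through the sigmoid derivative
    d_w1 = d_z * x                       # dL/dw1
    d_b1 = d_z                           # dL/db1
    return [d_w1, d_b1, d_w2, d_b2]


def numerical_gradient(params, x, target, eps=1e-6):
    """Central finite differences, used only to check the analytic result."""
    grads = []
    for i in range(len(params)):
        plus = list(params)
        plus[i] += eps
        minus = list(params)
        minus[i] -= eps
        diff = forward(plus, x, target)[0] - forward(minus, x, target)[0]
        grads.append(diff / (2 * eps))
    return grads


params = [0.5, -0.3, 0.8, 0.1]
analytic = backward(params, x=1.5, target=1.0)
numeric = numerical_gradient(params, x=1.5, target=1.0)
```

The analytic and finite-difference gradients should agree to several decimal places, which is a common sanity check when implementing backpropagation by hand.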


2.1.5 Gradient descent

Gradient descent is one of the most important algorithms for training deep neural networks. It is an optimization algorithm that minimizes a function by iteratively moving in the direction of steepest descent. The purpose of this iterative optimization is to find a global or local minimum. More specifically, gradient descent can be used to update the parameters of a neural network. In machine learning, each model defines a loss function and then optimizes the network's parameters to reach the minimum of that function. The algorithm can be described in two steps:

1. Use the first-order derivative to compute the gradient, which determines the direction.

2. Move in the direction opposite to the gradient, i.e., against the increase of the slope, to approach the minimum.

Figure 2.7: Gradient Descent Algorithm 7

Figure 2.7 gives an overview of the algorithm.
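The two steps can be sketched on a one-dimensional problem. The quadratic loss used here, J(theta) = (theta - 3)^2, is a hypothetical example chosen because its minimum is known; the function and argument names are likewise assumptions.

```python
def gradient_descent(grad, theta, learning_rate=0.1, steps=100):
    """Repeat the two steps: compute the gradient, then step against it."""
    for _ in range(steps):
        g = grad(theta)                      # step 1: direction of steepest ascent
        theta = theta - learning_rate * g    # step 2: move the opposite way
    return theta


# Hypothetical loss J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_min = gradient_descent(lambda t: 2.0 * (t - 3.0), theta=0.0)
```

With a learning rate of 0.1 the distance to the minimum shrinks by a constant factor at every step, so the result ends up very close to 3; a learning rate that is too large would make the iterates overshoot instead.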

2.1.5.1 Stochastic Gradient Descent

In this algorithm, one training sample is passed through the neural network at a time, and the parameters of each layer are updated with the computed gradient. For each sample, the loss function measures the deviation between the model's output and the corresponding label, and the weights of all layers are updated immediately afterward. For example, if the training dataset has ten samples, the loss is computed ten times and the weights are updated ten times per epoch, once for each example. The following equation describes the update rule of stochastic gradient descent and of gradient descent. Keep in mind that stochastic gradient descent iterates N times per epoch for N training samples in the training dataset.

$$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \qquad (2.7)$$

In equation 2.7, θ denotes the weights (parameters) of the model, J(θ) is the loss function, and α is the learning rate, which controls the size of each update step.

7 https://blog.clairvoyantsoft.com/the-ascent-of-gradient-descent-23356390836f

Advantages of Stochastic Gradient Descent

• First, it needs little memory, because the model processes a single training sample at a time.

• Second, each update is computationally cheap, as only one training sample is processed at a time.

• Third, thanks to the frequent updates, the model can quickly converge toward a local minimum.

Disadvantages of Stochastic Gradient Descent

• Due to the frequent updates, the steps toward the minimum are very noisy. This can sometimes send the model in other directions.

• The total training time increases significantly, because the loss must be computed separately for each sample in every epoch.

• It loses the advantage of vectorized operations, as it deals with only a single example at a time.

2.1.5.2 Batch Gradient Descent

The batch gradient descent algorithm follows the same idea as stochastic gradient descent. The big difference is that all training samples are passed through the model before the loss function computes the deviation and the weights are updated. More specifically, if the training dataset contains 100 samples, the model receives all of them at once and the parameters of the neural network are updated once. Equation 2.7 is therefore applied only once per epoch.

Advantages of Batch Gradient Descent


• Each epoch is fast, because the loss is computed for all samples at once and the weights are updated only once.

• The gradient is more stable than in stochastic gradient descent.

• It is easier to approach a local/global minimum because the updates are less noisy and oscillate less. The model averages the error over all samples and then updates its parameters.

Disadvantages of Batch Gradient Descent

• The algorithm is demanding on computing resources, because the model receives the entire dataset and computes all outputs at once.

• Training requires a large amount of memory, and additional memory might be needed for large datasets.

2.1.5.3 Mini-Batch Gradient Descent

This strategy mixes stochastic and batch gradient descent. The training dataset is divided into multiple groups called batches, and the number of training samples in a batch is called the batch size. A single batch is passed through the model at a time, and the loss function computes the error for that batch. Afterward, the model updates its parameters using the average error over the batch. For example, if the training set has 100 samples, we can divide it into 5 batches of 20 samples each, so equation 2.7 is applied 5 times per epoch (once per batch).

Advantages of Mini Batch Gradient Descent

This technique combines the advantages of the two methods above, so it is the most commonly used in practice.

• Requires little memory

• Is computationally efficient

• Benefits from vectorization

• Helps avoid getting stuck in local minima

• Produces stable gradient updates

• Converges quickly
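The three variants differ only in the batch size, which the following sketch makes explicit. It fits a hypothetical one-parameter model y = w * x with squared error; the dataset, function name, and learning rates are assumptions for illustration.

```python
import random


def train_one_epoch(data, w, batch_size, learning_rate=0.01):
    """One epoch of gradient descent on the model y = w * x (squared error).

    batch_size = 1          -> stochastic gradient descent
    batch_size = len(data)  -> batch gradient descent
    anything in between     -> mini-batch gradient descent
    Returns the updated weight and the number of parameter updates made.
    """
    updates = 0
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Average gradient of 0.5 * (w * x - y)^2 over the batch.
        grad = sum((w * x - y) * x for x, y in batch) / len(batch)
        w -= learning_rate * grad
        updates += 1
    return w, updates


rng = random.Random(0)
# Hypothetical toy dataset: 100 samples of y = 2x with x in [0, 1).
data = [(x, 2.0 * x) for x in (rng.random() for _ in range(100))]

_, sgd_updates = train_one_epoch(data, w=0.0, batch_size=1)       # 100 updates
_, batch_updates = train_one_epoch(data, w=0.0, batch_size=100)   # 1 update
_, mini_updates = train_one_epoch(data, w=0.0, batch_size=20)     # 5 updates

w = 0.0
for epoch in range(200):
    w, n_updates = train_one_epoch(data, w, batch_size=20, learning_rate=0.5)
```

The update counts match the examples in the text: one update per sample for stochastic gradient descent, one per epoch for batch gradient descent, and one per batch for the mini-batch variant.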


Figure 2.8: Dropout is used as a regularization technique 8

2.1.6 Dropout

In machine learning, "dropout" refers to randomly dropping out certain nodes in a layer during training. In Figure 2.8, the left side shows a normal neural network in which all nodes are kept. On the right, some nodes are dropped, shown in red, together with their connections. More specifically, the weights and biases of the dropped nodes are not considered during that training step. Each node is kept with probability p or dropped with probability 1-p. Dropout acts as a regularizer, so it helps prevent overfitting.
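A minimal sketch of dropout on a list of activations is given below. It uses the common "inverted dropout" variant, in which kept activations are rescaled by 1/p during training; this rescaling is an assumption of the example, since the text does not specify how the scaling is handled.

```python
import random


def dropout(activations, keep_prob, training=True, rng=random):
    """Inverted dropout: keep each node with probability keep_prob (p in the text).

    Kept activations are scaled by 1 / keep_prob so that their expected value
    is unchanged; at inference time the layer is the identity.
    """
    if not training:
        return list(activations)
    return [
        a / keep_prob if rng.random() < keep_prob else 0.0
        for a in activations
    ]


rng = random.Random(42)
layer_output = [1.0] * 10000
dropped = dropout(layer_output, keep_prob=0.8, rng=rng)
kept_fraction = sum(1 for a in dropped if a != 0.0) / len(dropped)
```

With p = 0.8, roughly 80% of the nodes survive each training step, and a different random subset is dropped every time the function is called.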

In addition, many other regularization approaches are popular in machine learning:

• Early stopping: stop training automatically when a specific performance measure (e.g., validation loss or accuracy) stops improving.

• Weight decay: this technique discourages a few weights from taking very large values, which could make the model jump in the wrong direction. It incentivizes the network to keep its weights small by adding a penalty to the loss function.

• Noise: allow some random fluctuations in the data through augmentation (which makes the network robust to a larger distribution of inputs and hence improves generalization).

• Model combination: average the outputs of separately trained neural networks (requires a lot of computational power, data, and time).

8 https://laptrinhx.com/a-simple-introduction-to-dropout-regularization
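Of the approaches listed above, early stopping is the simplest to sketch. The function below scans a recorded validation-loss curve and reports when training would stop; the `patience` parameter and the sample losses are hypothetical and serve only to illustrate the rule.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop.

    val_losses: validation loss recorded after each epoch
    patience:   number of epochs without improvement to tolerate (assumed name)
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


# Hypothetical validation curve: improves, then starts to overfit.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
stop_epoch = early_stopping_epoch(losses, patience=3)
```

In practice the model weights from the best epoch (index 3 in this curve) would be restored after stopping, rather than the weights from the final epoch.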


2.1.7 Transfer Learning

Transfer learning is an interesting idea for training deep neural networks: knowledge obtained from one task is used to solve similar ones. The idea comes from the fact that humans can leverage existing knowledge to deal with related situations, solving new problems faster and with better solutions. For example, if you know how to ride a bike, you can learn to ride a motorbike more easily. In deep learning, suppose you want to build a model to classify images and there are only 1000 images in your dataset. You also want your model to be very deep so that it can learn complex features. Consequently, your model will overfit after just a few steps. The solution to this problem is to use a pre-trained model such as VGG, ResNet, or MobileNet. Before applying transfer learning, we need to answer two main questions:


Chapter 3

Literature Review

3.1 Convolutional Neural Networks

Well-known architectures in deep learning include LeNet [31], VGG [49], and ResNet [18]. These models broke with traditional methods and became state-of-the-art in image processing. The starting point of the deep learning era is the LeNet architecture, proposed by Yann LeCun for recognizing handwritten digits. More recently, deep learning has developed rapidly thanks to computational resources such as GPUs and TPUs. VGG and ResNet are among the most efficient architectures for the classification task, and they also work well as backbones for extracting meaningful features. This section introduces these architectures.

3.1.1 LeNet

Figure 3.1: The first deep learning architecture - LeNet 1

This is one of the first successful deep learning architectures for image classification. The model was developed by Yann LeCun et al. in the 1990s. They used a convolutional neural network trained with the back-propagation algorithm to recognize handwritten digits. Afterward, this architecture was used to identify handwritten zip code numbers provided by the US Postal Service. Overall, this network was the starting architecture of the deep learning era, and many recent state-of-the-art models inherit its ideas. Figure 3.1 describes the network in detail.

1 https://medium.com/@pechyonkin/key-deep-learning-architectures-lenet-5-6fc3c59e6f4

LeNet-5 features can be summarized as:

• A sequence of three kinds of modules: convolution layers, pooling layers, and fully connected layers.

• Inputs are grayscale images whose values are normalized to a mean of 0 and a standard deviation of 1 to accelerate training.

• Two activation functions are used: hyperbolic tangent and sigmoid.

• Average pooling layers are used in the architecture.

• Fully connected layers act as the final classifier.

• Mean squared error is used as the loss function.
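To make the sequence of modules concrete, the spatial sizes of the feature maps can be traced with a short calculation. The layer sizes used here (32x32 input, 5x5 convolutions, 2x2 pooling, 16 feature maps before the classifier) follow the commonly cited LeNet-5 configuration and are assumptions, since the list above does not state them; `conv_out` is a hypothetical helper.

```python
def conv_out(size, kernel, stride=1):
    """Spatial size after an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1

size = 32                      # 32x32 grayscale input (assumed)
size = conv_out(size, 5)       # first 5x5 convolution
size = conv_out(size, 2, 2)    # first 2x2 average pooling
size = conv_out(size, 5)       # second 5x5 convolution
size = conv_out(size, 2, 2)    # second 2x2 average pooling
flattened = size * size * 16   # 16 feature maps feed the fully connected layers
```

Each unpadded 5x5 convolution removes four pixels per dimension and each pooling step halves the resolution, which is why the classifier at the end only has to process a small feature vector.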

2 https://towardsdatascience.com/step-by-step-vgg16

1000 classes with over 14 million images. It improves on AlexNet by replacing the oversized kernel filters (11x11 and 5x5 in the first and second layers, respectively) with smaller 3x3 kernel filters in the first two layers. The architecture is described as follows:

• The input to VGG is a 224x224 RGB image. The authors cropped the center of each image to keep the input size fixed.

• Convolutional layers: the kernel size is 3x3 and the stride is 1 pixel.

• The network has five convolutional blocks. The first two blocks contain two consecutive convolution layers each, and the final three blocks contain three each.

• Fully connected layers: three linear layers are attached at the end of the network. The first two have 4096 dimensions, and the final one has 1000 dimensions, one for each of the 1000 classes in the challenge.
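The benefit of small kernels can be checked with simple arithmetic. The sketch below compares a stack of 3x3 convolutions with a single larger kernel that covers the same receptive field; the helper names and the channel count of 64 are hypothetical, and biases are ignored.

```python
def conv_weights(kernel, channels):
    """Weights of one kernel x kernel convolution with `channels` inputs and outputs."""
    return kernel * kernel * channels * channels

def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)

channels = 64  # hypothetical channel count
two_3x3 = 2 * conv_weights(3, channels)    # two stacked 3x3 layers
one_5x5 = conv_weights(5, channels)        # one 5x5 layer, same receptive field
three_3x3 = 3 * conv_weights(3, channels)  # three stacked 3x3 layers
one_7x7 = conv_weights(7, channels)        # one 7x7 layer, same receptive field
```

Two stacked 3x3 layers see a 5x5 region and three see a 7x7 region, in both cases with fewer weights than the single large kernel and with an extra non-linearity between the layers.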

3.1.3 ResNet

Figure 3.3: A residual block 3

ResNet [18] was proposed by Kaiming He et al. and took the deep learning world by storm as the first neural network that could be trained with hundreds of layers without losing accuracy to the vanishing gradient problem. Thanks to a special mechanism, the network can stack many layers without hurting performance. Normally, neural networks are trained with the backpropagation algorithm, which minimizes the loss function to find a local optimum. However, when many layers are stacked, training becomes difficult: the gradients propagated back from the final layer to the first layer vanish, causing the accuracy to saturate.

The solution to this problem is the "identity shortcut connection". Each block passes its input through an identity mapping and adds it to the output of the block's layers. These shortcuts not only reuse earlier features but also speed up learning. In the paper, the authors conducted many experiments to demonstrate that deeper models are more effective than their shallower counterparts. As a result, the ResNet architecture quickly became one of the most popular models in image processing.

Figure 3.3 shows a "residual block", which lets deep models train effectively with many layers.
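A residual block can be sketched as y = ReLU(F(x) + x). The NumPy example below is a simplification under assumptions: dense layers stand in for the convolutions of the real block, batch normalization is omitted, and the names and sizes are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = ReLU(F(x) + x), with F made of two linear layers and a ReLU."""
    f = relu(x @ w1) @ w2   # residual mapping F(x)
    return relu(f + x)      # the identity shortcut adds the input back

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))
w1 = rng.normal(size=(8, 8)) * 0.1
w2 = rng.normal(size=(8, 8)) * 0.1
y = residual_block(x, w1, w2)
```

When the weights of F are zero, the block reduces to ReLU(x), so extra blocks can behave like an identity mapping instead of degrading the signal. The shortcut also gives the gradient a direct path back to earlier layers.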

3 https://towardsdatascience.com/residual-blocks-building-blocks-of-resnet-fd90ca15d6ec


3.2 Face Reconstruction

3.2.1 3D Morphable Model

Blanz and Vetter [5] introduce the 3D Morphable Model (3DMM) to recover the 3D face from a 2D image. Assume that a 3D face scan with N vertices is represented as a 3N × 1 vector $S = [x_1, y_1, z_1, \dots, x_N, y_N, z_N]^T \in \mathbb{R}^{3N}$, where $[x_i, y_i, z_i]^T$ are the object-centered Cartesian coordinates of the i-th vertex. Given a dataset of such 3D face scans, one would like to represent them with a smaller set of variables. The authors in [5] propose a two-stage principal component analysis (PCA) to estimate the shape identity parameters along with the expression parameters of the 3D faces. Suppose that, after the first stage, the first $n_s$ principal components are kept and $s_1, s_2, \dots, s_{n_s}$ are the corresponding orthonormal basis vectors. Then a 3D face S can be represented as follows:

$$S = \bar{S} + \sum_{i=1}^{n_s} \alpha_i s_i$$

where $\bar{S} \in \mathbb{R}^{3N}$ is the mean shape vector across the dataset of 3D face scans and $\alpha = [\alpha_1, \dots, \alpha_{n_s}]$ are the shape parameters.

In the second stage, a new PCA model is trained on the difference between expression scans and neutral scans. After this stage, the final shape representation also includes the expression parameters β. The 3D face is then projected onto the image plane, where f is the scale factor, $Pr = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$ is the orthographic projection matrix and $t_{2d}$ is the principal point, which is set to the image center.

The set of all model parameters is denoted by $p = [f, R, t_{2d}, \alpha, \beta]$.
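The roles of the parameters in p can be illustrated with a toy NumPy example. It is a sketch under assumptions: the expression basis `exp_basis`, the interpretation of R as a 3x3 rotation matrix, the composition of the projection as f · Pr · R · S + t2d, and all sizes and values are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_s, n_e = 5, 3, 2                         # toy sizes (hypothetical)
mean_shape = rng.normal(size=3 * N)           # mean shape vector
shape_basis = rng.normal(size=(3 * N, n_s))   # columns s_1 .. s_ns
exp_basis = rng.normal(size=(3 * N, n_e))     # assumed expression basis
alpha = rng.normal(size=n_s)                  # shape parameters
beta = rng.normal(size=n_e)                   # expression parameters

# Shape: mean plus weighted identity and expression components.
S = mean_shape + shape_basis @ alpha + exp_basis @ beta

# Projection: scale f, orthographic matrix Pr, rotation R, offset t2d.
f = 1.5
Pr = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
R = np.eye(3)                                 # identity rotation for the toy case
t2d = np.array([[10.0], [20.0]])
vertices = S.reshape(N, 3).T                  # 3 x N, one column per vertex
projected = f * Pr @ R @ vertices + t2d       # 2 x N image coordinates
```

Because Pr simply drops the z coordinate, each projected vertex is the scaled and shifted (x, y) part of the rotated 3D vertex.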

3.2.2 3DDFA

The 3DDFA method combines cascaded regression with a convolutional neural network (CNN). The cascaded CNN can be formulated as:

$$p^{k+1} = p^k + Net^k(Feat(I, p^k))$$

where $p^k$ denotes the model parameters at the k-th iteration, which are updated by applying a CNN-based regressor $Net^k$ to the shape-indexed feature $Feat$, which depends on the input image I and the current parameters $p^k$.

The purpose of the CNN regressors is to predict the parameter update $\Delta p$ that shifts the initial parameters $p^0$ as close as possible to the ground truth $p^g$. As the objective function, [69] proposes the Optimized Weighted Parameter Distance Cost (OWPDC):

$$E_{owpdc} = (\Delta p + p^0 - p^g)^T \, \mathrm{diag}(w^*) \, (\Delta p + p^0 - p^g) \tag{3.6}$$

where $w^*$ is the optimized parameter importance vector.
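The cascaded update and the OWPDC cost can be written compactly in NumPy. This is a minimal sketch: the regressor `halfway` is a hypothetical stand-in that uses the ground truth directly (a trained CNN would not have access to it), and the weights passed as `w` are arbitrary example values rather than the optimized w*.

```python
import numpy as np

def owpdc(delta_p, p0, pg, w):
    """Weighted squared distance between the updated and the true parameters."""
    err = delta_p + p0 - pg
    return float(err @ np.diag(w) @ err)

def cascade(p0, regressors, image):
    """Add the update predicted by each stage to the current parameters."""
    p = p0
    for net in regressors:
        p = p + net(image, p)
    return p

p0 = np.zeros(2)
pg = np.array([2.0, 4.0])
# Hypothetical stand-in for a trained regressor: moves halfway to the target.
halfway = lambda image, p: 0.5 * (pg - p)
p_final = cascade(p0, [halfway, halfway], image=None)
cost = owpdc(p_final - p0, p0, pg, w=np.array([1.0, 2.0]))
```

Each stage reduces the remaining error, and the weights in `w` decide how strongly an error in each parameter contributes to the cost.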

3.2.3 UV-GAN

The UV-GAN [10] model is the first architecture that uses two phases to generate a synthetic facial image. In the first phase, the authors use the traditional 3DMM to reconstruct a 3D face shape from a 2D facial image. However, the 3DMM cannot recover the invisible parts of the image, so textures are missing for non-frontal faces. Hence, the authors propose a new model called UV-GAN to fill in the invisible facial regions. The framework (Figure 3.4) has one UV generation network and two discriminators.

3.2.4 VGG

3.2.4.1 Generation Network

Figure 3.4: The UV-GAN framework consists of one generator (U-Net) and global and local discriminators.

The generation network used by the authors is the U-Net model. It works as an auto-encoder, or generative model, that transforms one image into another. More specifically, the authors follow the architecture designed for the image-to-image translation task, namely the pix2pix model. The model has shortcut connections from the encoder to the decoder, which enrich the features without increasing the number of parameters. To handle the missing facial parts, the authors fill them with random noise and concatenate the result with its mirror image to produce the input of the model. As a result, the model can learn the relationship between visible and invisible parts at arbitrary locations. Besides, to preserve identity information, the authors adopt a pixel-wise L1 loss as the reconstruction loss.
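The input construction and the reconstruction loss described above can be sketched as follows. The function names, the uniform noise, the channel-wise concatenation, and the toy image size are assumptions for illustration; the original work may differ in these details.

```python
import numpy as np

def build_generator_input(uv_map, visible_mask, rng):
    """Fill invisible texels with noise and stack the horizontal mirror."""
    noise = rng.random(uv_map.shape)
    filled = np.where(visible_mask[..., None], uv_map, noise)
    mirrored = filled[:, ::-1, :]                    # flip left-right
    return np.concatenate([filled, mirrored], axis=-1)

def l1_loss(prediction, target):
    """Pixel-wise L1 reconstruction loss."""
    return float(np.mean(np.abs(prediction - target)))

rng = np.random.default_rng(0)
uv_map = rng.random((4, 6, 3))                       # toy H x W x RGB UV map
visible_mask = np.zeros((4, 6), dtype=bool)
visible_mask[:, :3] = True                           # left half is visible
generator_input = build_generator_input(uv_map, visible_mask, rng)
```

Because a face is roughly symmetric, the mirrored copy places visible texture next to the noisy region, which gives the generator a hint about what the occluded half should look like.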

3.2.4.2 Discrimination Network

The generator described in the previous section can fill in the missing regions with small reconstruction errors, but it does not guarantee that the output face is visually realistic and informative. Hence, the authors use a discriminative network to improve the quality of the synthetic image and produce more photorealistic results. The task of the discriminator is to distinguish between real and fake UV maps. It consists of two networks: a global and a local discriminator. The global discriminator determines the faithfulness of the entire UV map, while the local discriminator focuses on the facial center, such as the nose, eyes, mouth, and forehead, because the inner face contains more information for identifying a person. The benefit of using both discriminators is that the model focuses not only on the surrounding context but also on the central face.

3.3 Polyp Segmentation

3.3.1 U-NET

In 2015, Olaf Ronneberger et al. [43] proposed a novel framework for the image segmentation task and won the ISBI cell tracking challenge. The network has two paths: a contracting path (encoder) and an expanding path (decoder). The contracting path captures context information, whereas the expanding path enables precise localization. The network is therefore trained end to end for segmentation.


Figure 3.5: The UNET architecture has a contracting path and expanding path.

Figure 3.5 describes the architecture in detail. In the encoder path, each block applies two consecutive 3x3 convolutions (unpadded), each followed by a ReLU activation function. In addition, at each downsampling step, the number of channels is doubled. In the decoder path, the spatial size of the feature map is first doubled by upsampling, followed by a 2x2 convolution. The result is then concatenated with the corresponding feature map from the encoder path and passed through 3x3 convolutions, each followed by a ReLU function.
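Since the convolutions are unpadded, the output of the network is smaller than its input. The following sketch traces the spatial size through the two paths. The depth of four downsampling steps, the 2x2 pooling, the 572-pixel input, and the starting width of 64 channels follow the original U-Net paper and are assumptions here, as is the function name.

```python
def unet_output_size(input_size, depth=4):
    """Spatial size after an unpadded U-Net with `depth` down/up steps."""
    size = input_size
    for _ in range(depth):
        size -= 4          # two unpadded 3x3 convolutions lose 2 pixels each
        size //= 2         # 2x2 pooling halves the resolution
    size -= 4              # two convolutions at the bottleneck
    for _ in range(depth):
        size *= 2          # upsampling doubles the resolution
        size -= 4          # two unpadded 3x3 convolutions after concatenation
    return size

output_size = unet_output_size(572)
channels = [64 * 2 ** level for level in range(5)]  # doubled at each downsampling step
```

The shrinking output is the reason why the encoder feature maps have to be cropped before they are concatenated with the decoder feature maps in the unpadded variant.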

3.3.2 ResUNet++

The ResUNet++ [28] model is inspired by the ResUNet model, an architecture that exploits the power of deep residual learning. The proposed model uses state-of-the-art modules such as residual blocks, the squeeze-and-excitation block [20], ASPP [7], and the attention block (see Figure 3.6).

First, the residual block is similar to that of the ResNet architecture: it adds the input to the output through a shortcut connection. Hence, the authors can build a deeper model that mitigates the vanishing gradient problem. The authors use four encoder blocks. Each block has two successive 3x3 convolution operations and one identity mapping. In addition, each block consists of a batch normalization layer, a ReLU layer, and a convolution layer. The output of each of the first three blocks is passed through a squeeze-and-excitation (SE) layer [20].

The SE module increases the network's sensitivity to important features and suppresses unimportant ones. It achieves this in two steps. The first step squeezes the features (global information embedding) by applying global average pooling to each channel. The second step is excitation, which captures the channel-wise dependencies. Afterward, the final output of the encoder path is passed through the Atrous Spatial Pyramid Pooling (ASPP) layer.
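The two SE steps described above can be sketched in NumPy. The two-layer bottleneck with a reduction ratio r, the ReLU and sigmoid non-linearities, and all names and sizes are assumptions based on the usual squeeze-and-excitation design rather than details given in this text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excite(features, w1, w2):
    """Reweight the channels of a (C, H, W) feature map."""
    squeezed = features.mean(axis=(1, 2))       # squeeze: global average pooling
    hidden = np.maximum(w1 @ squeezed, 0.0)     # excitation: bottleneck + ReLU
    scale = sigmoid(w2 @ hidden)                # one weight per channel
    return features * scale[:, None, None]

rng = np.random.default_rng(0)
C, r = 8, 4                                     # channels, reduction ratio (hypothetical)
features = rng.normal(size=(C, 5, 5))
w1 = rng.normal(size=(C // r, C))
w2 = rng.normal(size=(C, C // r))
out = squeeze_excite(features, w1, w2)
```

Every channel is multiplied by a single learned weight between 0 and 1, so informative channels pass almost unchanged while less useful ones are attenuated.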

Figure 3.6: The ResUnet++ architecture

The ASPP network acts as a bridge between the encoder and the decoder. It captures contextual information at various scales: the input feature map is processed by several parallel atrous convolutions with different rates, and their outputs are fused. The authors integrate this module because it has been used successfully in many segmentation tasks.

To connect the features from the encoder to the decoder, the authors use the attention mechanism, which is popular in Natural Language Processing (NLP). The mechanism determines which regions the network focuses on. Its main advantage is that it enhances the quality of the features and thereby improves the results.

3.3.3 Attention UNet

3.3.3.1 Attention Gate

In a standard CNN architecture, the feature maps are gradually downsampled in order to capture semantic contextual information. In this way, the features at the coarse spatial level represent locations and the relationships between low-level and high-level information. However, it remains difficult to detect and segment small objects and to reduce missed predictions. Hence, the authors of [38] proposed the attention gate (AG) to improve the accuracy of the model. This module can be integrated easily into a standard CNN model. The goal of AGs is to suppress feature responses in irrelevant locations and to focus only on relevant regions. Figure 3.7 describes the AG in detail.


Figure 3.7: The Attention Gate (AG) receives two inputs: the input features ($x^l$) and the gating signal (g). First, the input features are up-sampled and added to the gating signal, and the sum is passed through two activation functions (ReLU, sigmoid) to produce the attention coefficient map (α). Finally, the input features are scaled by the attention coefficients to give $\hat{x}^l$. 4

$$q^l_{att} = \gamma^T \left( \sigma_1 \left( W_x^T x^l_i + W_g^T g_i + b_g \right) \right) + b_\gamma \tag{3.7}$$

$$\alpha^l_i = \sigma_2 \left( q^l_{att}(x^l_i, g_i, \Theta_{att}) \right) \tag{3.8}$$

where $\sigma_1$ is the ReLU function and $\sigma_2$ is the sigmoid function.

AGs contain a set of parameters $\Theta_{att}$: $W_x \in \mathbb{R}^{F_l \times F_{int}}$, $W_g \in \mathbb{R}^{F_g \times F_{int}}$, $\gamma \in \mathbb{R}^{F_{int} \times 1}$, and the bias terms $b_\gamma \in \mathbb{R}$, $b_g \in \mathbb{R}^{F_{int}}$.

The attention coefficients lie in the range from 0 to 1, $\alpha_i \in [0, 1]$, so the model prunes feature responses and preserves only the relevant activations. The module is characterized by the learnable parameters $W_x$, $W_g$, and $\gamma$. The linear transformations are computed with 1x1x1 convolutions applied to the feature responses. The result is passed through the sigmoid function to squeeze the values into the given range. The output of the AG is the element-wise multiplication of the input feature maps and the attention coefficients, $\hat{x}^l_{i,c} = x^l_{i,c} \cdot \alpha^l_i$. Finally, the authors emphasize that, unlike hard attention, the AG module can be trained with the standard back-propagation algorithm.
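Equations (3.7) and (3.8) translate almost directly into NumPy when each spatial position is treated as one row. The sketch assumes that the gating signal has already been resampled to the same n positions as the input features, and it applies the 1x1x1 convolutions as matrix products; the function name and the toy sizes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_gate(x, g, W_x, W_g, b_g, gamma, b_gamma):
    """x: (n, F_l) features, g: (n, F_g) gating signal at the same n positions."""
    hidden = np.maximum(x @ W_x + g @ W_g + b_g, 0.0)  # sigma_1 = ReLU
    q_att = hidden @ gamma + b_gamma                   # (n, 1), Eq. (3.7)
    alpha = sigmoid(q_att)                             # (n, 1), Eq. (3.8)
    return x * alpha, alpha                            # same weight for every channel

rng = np.random.default_rng(0)
n, F_l, F_g, F_int = 6, 4, 3, 2                        # toy sizes (hypothetical)
x = rng.normal(size=(n, F_l))
g = rng.normal(size=(n, F_g))
x_hat, alpha = attention_gate(
    x, g,
    W_x=rng.normal(size=(F_l, F_int)),
    W_g=rng.normal(size=(F_g, F_int)),
    b_g=np.zeros(F_int),
    gamma=rng.normal(size=(F_int, 1)),
    b_gamma=0.0,
)
```

Every operation in the gate is differentiable, which is why the coefficients can be learned with ordinary back-propagation.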

3.3.3.2 Attention UNet Architecture

The model (Figure 3.8) follows the standard UNet and integrates AGs into its architecture. Thus, the model has two paths: the encoder path and the decoder path. The AGs act as a bridge that connects the feature maps of the encoder and the decoder. First, the spatial size of the decoder feature map is doubled while the number of channels is kept. Next, it is combined with the encoder feature map of the same resolution via AGs to retain the relevant information and suppress the irrelevant information. As a consequence, the new model improves its accuracy and sensitivity.

4 https://arxiv.org/pdf/1804.03999.pdf

Figure 3.8: Attention UNet architecture

3.3.4 PraNet

Figure 3.9: The PraNet architecture includes three reverse attention modules attached to the last three high-level features.

3.3.4.1 Parallel Partial Decoder

Usually, encoder-decoder models (U-Net, ResUNet, etc.) aggregate features from all levels to produce the final output. However, Wu et al. pointed out that high-level features are more meaningful than low-level ones. In addition, high-level features require fewer computational resources because of their small resolution. For this reason, the authors aggregate only the last three feature levels to yield the global map. The block that merges these features is called a parallel partial decoder. More specifically, when Res2Net [15] is used to extract features, there are five levels of features $f_i$, $i = 1, \dots, 5$. They are split into two categories: low-level features $f_i$, $i = 1, 2$, and high-level features $f_i$, $i = 3, 4, 5$. The partial decoder $pd(\cdot)$, a state-of-the-art decoder component, is then computed as $PD = pd(f_3, f_4, f_5)$.

3.3.4.2 Reverse Attention Module

To supervise the high-level features, a reverse attention module is used to refine the relatively rough segmentation. The authors integrate a trainable reverse attention module at each of the three high-level feature levels. More specifically, the reverse attention weights lie in the range from 0 to 1 and are multiplied element-wise with the high-level features. As a result, this technique not only makes the model focus on the polyp location but also improves the accuracy. In addition, it suppresses irrelevant information, which reduces the mistakes of the network. The formula is described below:
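A minimal sketch of reverse attention follows, assuming the formulation of the PraNet paper, in which the weight is one minus the sigmoid of the prediction from the deeper stage, upsampled to the current resolution, and the output is the element-wise product of the weight and the high-level feature. The function name, the already upsampled input, and the toy values are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reverse_attention(features, coarse_logits):
    """features: (C, H, W); coarse_logits: (H, W) prediction of the deeper stage."""
    weight = 1.0 - sigmoid(coarse_logits)    # high where the coarse map is not confident
    return features * weight[None, :, :]     # same weight for every channel

features = np.ones((2, 3, 3))
coarse_logits = np.zeros((3, 3))
coarse_logits[1, 1] = 10.0                   # one pixel confidently marked as polyp
refined = reverse_attention(features, coarse_logits)
```

Regions that the coarse prediction already labels confidently as polyp are erased, so the remaining response directs the network toward the boundaries and the areas it has missed.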


We have analyzed relevant studies on face reconstruction and polyp segmentation. They serve as the baseline models and the theoretical foundation for our new model. These models still face some challenges. For example, a pose-invariant face recognition system must identify face images under different poses, illumination, and expressions. UV-GAN is a new idea that transforms a 2D facial image into a 3D face mesh and an incomplete UV map, and then uses a generative model to recover the self-occluded regions. However, its authors use only a simple U-Net as the generative model. This is why we propose novel architectures (ResCUNet, Attention ResCUNet) to replace it. For the polyp segmentation task, the challenges are small polyps, illumination, and small datasets. Thus, we propose the Attention ResCUNeSt model to address them. We conducted many experiments to validate our proposal, and it surpasses previous works on the standard benchmarks.


Chapter 4

Proposed Method

In this thesis, we propose three novel architectures (CUNet, Attention ResCUNet, Attention ResCUNeSt) to solve two tasks (face reconstruction and polyp segmentation). We apply several modern components: the Attention Gate module, powerful backbones (ResNet, ResNeSt), an auxiliary loss, multiple connections between UNets, and a doubling of the original UNet model. The proposal is then evaluated on standard benchmarks and compared with previous studies. Our research on the face reconstruction task was published at MAPR 2020 and in an SCI Q1 journal.

Recently, generative adversarial networks (GANs) [8] have proved to be powerful at mimicking data distributions. GANs have been successfully applied to many computer vision tasks such as image inpainting [39, 59, 61], style transfer [36, 68], image synthesis [29, 30], super-resolution [32], and so on. These successful applications have motivated researchers to apply GANs to pose-invariant feature disentanglement [53, 54], face completion [56], and face frontalization [53, 21, 60, 65, 12]. TP-GAN [21] uses a two-pathway GAN that simultaneously learns global structures and local information for photorealistic frontal view synthesis. Zhao et al. [66] propose a unified deep architecture containing a face frontalization module and a discriminative learning module, which can be jointly learned in an end-to-end fashion.

Figure 4.1: The face synthesis pipeline. 3DDFA is used to obtain a 3D mesh and an incomplete UV map. A new generative model is then applied to recover the self-occluded regions. The completed UV map is attached to the fitted 3D mesh to generate faces of arbitrary poses.

In [10], Deng et al. propose an adversarial UV map completion framework called UV-GAN to solve pose-invariant face recognition without the need for extensive pose coverage in the training dataset. The authors in [10] first fit a 3DMM [6] to a 2D profile face to obtain an incomplete UV map, which is then completed by a straightforward pix2pix model [23, 24]. The generator architecture in pix2pix follows the general shape of U-Net [43], adding skip connections between the encoder and decoder subnetworks in order to enhance the transfer of low-level information between input and output. One weakness of the original UV-GAN is the plain architecture of the generator, which has been shown to perform worse than residual networks [18]. Another weakness is that a single U-Net block seems insufficient to mix low-level information in the encoder well with high-level semantic features in the decoder. In [58], Deng et al. use UV-GAN with an architecture similar to that in [10] to extract side information as well as subspaces, and combine UV-GAN with robust PCA for the face recognition task. He et al. [19] introduce a framework for heterogeneous face synthesis from the near-infrared (NIR) to the visible domain. The framework consists of two adversarial generators that estimate a UV map and a facial texture map from an input NIR face and then generate a corresponding frontal visible face. Nevertheless, both generators in this framework
