HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
GRADUATION THESIS
EFFECTIVELY APPLY VIETNAMESE FOR VISUAL QUESTION ANSWERING SYSTEM
(Old title: Development of a VQA system)
Major: Computer Science
Council: Software Engineering
Instructor: Dr. Quan Thanh Tho
Reviewer: Mr. Le Dinh Thuan
Ho Chi Minh City, July 2021
UNIVERSITY OF TECHNOLOGY - GRADUATION THESIS TASK ASSIGNMENT
FACULTY: Computer Science & Engineering  DEPARTMENT: KHMT
(Note: students must attach this sheet to the first page of the report.)
FULL NAME: Trần Hoàng Nguyên  Student ID: 1712396
FULL NAME: Nguyễn Bảo Phúc  Student ID: 1712674  MAJOR: KHMT  CLASS: MT17KH03
Assigned tasks:
• Build and train a machine learning model on the processed data.
• Analyze and design a complete Visual Question Answering system.
• Build the website: design the interface, develop the front-end and back-end, and deploy the system.
Date: ____
THESIS DEFENSE EVALUATION SHEET
(For the supervisor/reviewer)
1. Student: Trần Hoàng Nguyên
MSSV (Student ID): 1712396  Major: KHMT (Computer Science)
Student: Nguyễn Bảo Phúc
MSSV (Student ID): 1712674  Major: KHMT (Computer Science)
2. Topic: Development of a Visual Question Answering system
3. Supervisor/reviewer: ThS (MSc) Lê Đình Thuận
4. Overview of the report:
- Number of hand-drawn figures:    Number of computer-drawn figures:
6. Main strengths of the thesis:
- The topic builds an intelligent system that answers questions about image content. The students built the VQA system and trained the model successfully. The topic also extends the model training to …
- The thesis is presented completely and clearly. The students arranged the demo carefully to highlight the core work of the topic.
7. Main shortcomings of the thesis:
8. Recommendation: Approved for defense □  Revise before defense □  Not approved for defense □
9. Three questions the students must answer before the Council:
10. Overall assessment (in words: excellent, good, average): Score: 10/10
Signature (full name)
ThS Lê Đình Thuận
Date: ____
THESIS DEFENSE EVALUATION SHEET
(For the supervisor/reviewer)
1. Student: Trần Hoàng Nguyên
MSSV (Student ID): 1712396  Major: KHMT (Computer Science)
Student: Nguyễn Bảo Phúc
MSSV (Student ID): 1712674  Major: KHMT (Computer Science)
2. Topic: Development of a Visual Question Answering system
3. Supervisor/reviewer: PGS.TS (Assoc. Prof. Dr.) Quản Thành Thơ
4. Overview of the report:
- Number of hand-drawn figures:    Number of computer-drawn figures:
6. Main strengths of the thesis:
- The students completed a VQA system as required. They mastered and clearly understood the theoretical content and successfully developed the model into an application. They also translated the training dataset into Vietnamese to support answering questions in Vietnamese.
- Part of the thesis work has been written up as a scientific paper and submitted to the FAIR conference.
- The thesis work is also being extended in a research collaboration with another professor's research group in Taiwan.
- The thesis is written in relatively standard, clear English.
7. Main shortcomings of the thesis:
8. Recommendation: Approved for defense □  Revise before defense □  Not approved for defense □
9. Three questions the students must answer before the Council:
10. Overall assessment (in words: excellent, good, average): Score: 9.8/10
Signature (full name)
PGS.TS Quản Thành Thơ
We declare that the graduation thesis "Effectively apply Vietnamese for Visual Question Answering system" is the original report of our research. We have completed our graduation thesis honestly and guarantee the truthfulness of the work presented in it. We are solely responsible for the precision and reliability of the above information.
Ho Chi Minh City, August 9th, 2021
First and foremost, we would like to thank Dr. Quan Thanh Tho, Associate Professor in the Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), for his support throughout our research work. It has been our great fortune to work on and finish our thesis under his supervision. He is the most knowledgeable and insightful person we have ever met. He helped us throughout the project with his wise knowledge of and enthusiasm for deep learning. From him, we have learned how to do deep learning research in a critical way and had the chance to widen our knowledge. He also let us join his research group, URA. This opportunity not only allowed us to get useful suggestions about our thesis from everybody in the group, but also let us learn new things and new skills day by day, for example by joining seminars held by group members. From that, we could develop more interesting ideas for our thesis.
Our sincere thanks also go to Mr. Le Dinh Thuan, Master of Engineering in the Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, for being our reviewer. His feedback, suggestions, and advice were essential and influential for the completion of our thesis. We are thankful to have had such a good reviewer. Last but not least, we would like to thank all the teachers at Ho Chi Minh City University of Technology, especially in the Faculty of Computer Science and Engineering, where it has been our pleasure and honor to study for the last four years. We also thank our beloved friends and family, who always support us with constant love and encouragement.
AUTHORS
In recent years, deep learning has emerged as a promising technology with the hope that it can be designed to tackle practical problems that had been considered inconceivable for previous approaches. Specifically, the blind and visually impaired are often afraid of being a burden to their family and friends when they need visual guidance. However, there is still a lack of modern systems that can act as a virtual friend to help them interact with the surrounding environment. Therefore, we research and develop a novel deep learning application that can capture the complex relationships between surrounding objects and deliver assistance to the blind and visually impaired. In this dissertation, we propose a novel visual question answering model in Vietnamese and a development of practical systems that utilize our model to address the aforementioned problems.
List of figures x
1.1 Motivation 2
1.2 Topic’s scientific and practical importance 3
1.3 Thesis objectives and scope 4
1.4 Our contribution 5
1.5 Thesis structure 6
Chapter 2 THEORETICAL OVERVIEW 7
2.1 Deep learning neural network 8
2.1.1 Perceptron 8
2.1.2 Multi layer perceptron 9
2.1.3 Activation functions 10
2.1.4 Loss functions 12
2.1.5 Backpropagation and optimization 14
2.2 Computer vision theoretical background 16
2.2.1 Convolutional Network 16
2.2.2 Pooling 19
2.2.3 CNNs variants 19
2.2.4 Regional-based Convolutional Neural Networks 22
2.3 Natural language processing theoretical background 27
2.3.1 Word Embedding 27
2.3.2 Recurrent Neural Network (RNN) 33
2.3.3 LSTM - Long Short Term Memory 34
2.3.4 GRU - Gated Recurrent Network 36
2.3.5 Attention mechanism 37
2.3.6 Bidirectional Encoder Representations from Transformers 39
2.4 Visual and Language tasks related to VQA 43
2.4.1 Image Captioning 43
2.4.2 Visual Commonsense Reasoning 43
2.4.3 Other Visual and Language tasks 44
Chapter 3 RELATED WORK 45
3.1 Overall 46
3.2 Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering 47
3.3 Pythia v0.1: the Winning Entry to the VQA Challenge 2018 49
3.4 Deep Modular Co-Attention Networks for Visual Question Answering 49
3.5 ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data 51
Chapter 4 METHODOLOGY 53
4.1 Feature extraction and co-attention method 54
4.1.1 Visual feature 54
4.1.2 Textual feature 54
4.2 Co-attention layer 55
4.3 Our proposal model 58
Chapter 5 VIETNAMESE VQA DATASET 61
5.1 VQA-v2 dataset 62
5.2 Visual Genome dataset 63
5.3 Challenge 63
5.4 Automatic data generation 64
5.5 Data refinement 65
5.6 Statistical Analysis 65
5.7 Data sample 66
Chapter 6 EXPERIMENTAL ANALYSIS 68
6.1 Experimental setup 69
6.1.1 Computing Resources 69
6.1.2 Dataset 69
6.1.3 Evaluation metric 69
6.1.4 Implementation details 70
6.1.5 Training strategy 70
6.2 Experimental results 71
Chapter 7 APPLICATION 75
7.1 Technology terminology 76
7.1.1 Flask 76
7.1.2 ReactJS 77
7.1.3 React Native 77
7.1.4 Docker 78
7.1.5 C4 Model: Describing Software Architecture 78
7.2 System functionality 80
7.2.1 Web application system 80
7.2.2 Mobile application system 80
7.3 System diagram 81
7.3.1 Overview 82
7.3.2 Use case diagram 83
7.3.3 Activity diagram 85
7.4 System architecture 87
7.4.1 System components 87
7.4.2 Our result 90
Chapter 8 CONCLUSION 94
8.1 Summary 95
8.2 Limitation and broader future works 95
8.2.1 Improve existing Vietnamese VQA models 95
8.2.2 Give Vietnamese VQA a new direction 96
1.1 Overview of the Visual Question Answering task 3
2.1 Perceptron 9
2.2 Symbols and calculating process in Multilayer Perceptron 10
2.3 Multi layer perceptron with two hidden layers 11
2.4 Convolution operation between 2-D input image and 2-D kernel 17
2.5 The receptive field of the units in the deeper layers of a convolutional network is larger than the receptive field of the units in the shallow layers 18
2.6 Receptive field of one output unit in CNNs 19
2.7 Apply 2 × 2 pooling layer to 6 × 6 input 20
2.8 AlexNet’s architecture 20
2.9 VGG16 (left) and VGG19 (right) architecture 21
2.10 Residual function (left) and ResNet-18 architecture 22
2.11 The architecture of R-CNN 23
2.12 The architecture of Fast R-CNN 24
2.13 The architecture of Faster R-CNN 26
2.14 Word2Vec overview 29
2.15 CBOW and Skip-gram Architecture 31
2.16 Recurrent network architecture 33
2.17 LSTM architecture 35
2.18 GRU architecture 37
2.19 Self-attention 38
2.20 Multihead-attention 39
2.21 BERT for Masked LM 41
2.22 BERT for Next Sentence Prediction 42
3.1 Typically, attention models operate on CNN features corresponding to a uniform grid of equally-sized image regions (left). Bottom-up approach enables attention to be calculated at the level of objects and other salient image regions (right) 48
3.2 Bottom-up attention in VQA task 48
3.3 The overall flowchart of the deep Modular Co-attention Networks. They also provided two different strategies for deep co-attention learning, namely stacking and encoder-decoder 50
3.4 Architecture of the ImageBERT model 52
4.1 Our proposed question processor for Vietnamese VQA task 55
4.2 Architecture of multi-head attention module 57
4.3 Architecture of self-attention unit (left) and guided-attention unit (right) 58
4.4 Architecture of our proposed model 60
5.1 Sample image in VQA-v2 dataset 62
5.2 The list of answers that have the most occurrence 64
5.3 The list of answers that have the least occurrence 64
5.4 Sample image in VQA-v2 dataset 66
6.1 Our learning rate is controlled by Adam optimizer and warmup scheduler 71
6.2 Accuracy and co-attention depth relationship. All of these experiments used the test set and the small specs 71
6.3 Our loss values on the train and validation sets, from epoch 1 to 18 72
6.4 Our loss values on the train and validation sets, from epoch 2 to 18 72
6.5 The overall accuracies 72
6.6 The accuracies of Yes/No questions 73
6.7 The accuracies of number questions 73
6.8 The accuracies of other questions 73
7.1 Components of a C4 model 79
7.2 Use case diagram of Vietnamese VQA Web System 83
7.3 Use case diagram of Vietnamese VQA Mobile System 84
7.4 Activity diagram of Vietnamese VQA Web System 85
7.5 Activity diagram of Vietnamese VQA Mobile System 86
7.6 Component level description of our whole system 87
7.7 Homepage of our web application 90
7.8 Introduction for Vietnamese VQA - Give a VQA example 90
7.9 Introduction for Vietnamese VQA - What is VQA? 91
7.10 Choose an image and enter a question in Vietnamese; both are used as input for VQA. In this case, the question we enter is "Ở đây có thứ gì?" 91
7.11 Top-5 answers generated from our Vietnamese VQA model. In this case, for the above question, the top 5 answers sound quite good. The top-1 answer is "Sách" 92
7.12 Overview of our VQA web system on mobile. Within the image, the question and the answer generated from VQA are very clear and helpful 92
7.13 User can upload a favorite image and then ask a question. The VQA system will respond after a few seconds 93
7.14 Our application on a mobile device. User can ask a question about visual information (left) or daily information like datetime, weather, position (right) 93
B.1 Our application of VQA for predicting the potential of natural disaster. For example, given the question "what is the overall condition of the given image?", VQA can generate an answer based on the visual content of the image. The answer here is "Non-flooded" 112
B.2 Our proposed wildfires surveillance system pipeline 113
5.1 Summary for one sample in VQA-v2 dataset 63
5.2 Statistical description of our Vietnamese dataset 65
6.1 Summary of our model with the large specs and BERT as our language processor. The experiments are conducted on our validation set 72
6.2 The results of our models with 4 variants. All of the results are conducted on our Vietnamese VQA test set 73
This chapter gives an outline of the thesis topic, including its research aims, research scope, and scientific and practical value.
1.1 Motivation
Vision impairment severely impacts quality of life among both adult and young populations. Young children with early-onset vision impairment can experience limited cognitive development, which leads to lifelong consequences. Adults with vision impairment have lower productivity and higher rates of anxiety and depression. For older visually impaired people, it can result in social isolation and a low level of navigability. The number of visually impaired people was estimated to be about 285 million, of whom 39 million are blind and 246 million have low vision.1
In recent years, Artificial Intelligence (AI) has not just made our lives easier by automating tedious and dangerous tasks; it has unlocked myriad possibilities for people with disabilities, promising them unique ways of experiencing the world. More and more AI-powered applications in the assistive technology industry have been put into practice and shown their benefits. In this regard, one of the most promising AI tasks that can help the visually impaired in their daily life is Visual Question Answering [1].
Visual Question Answering (VQA) is a research field of multimodal learning in artificial intelligence. Multimodal learning is the task of proposing a deep neural network that can model features over multiple modalities, i.e., multiple data sources (text, audio, images, numbers, ...), to solve problems and achieve high accuracy, which makes it a more interesting and challenging assignment. A typical VQA input consists of two objects: an image and a text question. The task of VQA is formulated as follows: given a question and an image, the model must predict the correct answer. This answer generally needs to be chosen from a defined set of possible choices. For example, in fig. 1.1, given the image below and the question "What is the moustache made of?", the model must answer "Banana".
This task is challenging on many levels. First of all, the model needs to understand the text of the question and the visual signals from the image. Secondly, it should correctly correlate the text with the visual signals. On top of understanding text and visual signals, the model also needs to use common-sense reasoning and knowledge-base reasoning, and identify the context of the image. This means that a VQA system needs to be capable of processing images, such as detecting objects and recognizing entities and activities. At the same time, this system must be able to handle text processing as a natural human language. The real challenge in VQA is the combination of techniques from both computer vision and natural language processing to produce a meaningful and accurate answer that provides relevant information and is beneficial for humans. In this graduation thesis, we aim to build a VQA system that can understand Vietnamese questions and give a meaningful, accurate answer written in Vietnamese.
1 The number is obtained from WHO, 2010.
Figure 1.1: Overview of the Visual Question Answering task.
Since the VQA task appeared, there have been many proposals for VQA models that are more and more complex, capable of answering previously unseen questions, and obtaining higher and higher accuracy. Nowadays, a number of recent works have proposed attention models for VQA. The co-attention mechanism is applied to VQA models more and more popularly, and it gets better results in VQA challenges. We use it to build our Vietnamese VQA model so that we can make the accuracy of the model as high as possible.
There are many potential applications for VQA. Probably the most direct application is to help blind and visually impaired users. A VQA system could provide information about an image on the internet or any social media.
Another obvious application is to integrate VQA into image retrieval systems. This could have a huge impact on social media or e-commerce. VQA can also be used for educational or recreational purposes.
Furthermore, as we know, natural disasters, including wildfires, flooding, and ice jams, cause great damage in our lives and seriously disrupt the functioning of a community or society. A natural disaster's impact usually includes loss of life, injury, disease, and other negative effects. Therefore, we need an approach to disaster risk reduction to save lives and minimize disability and disease. Disaster surveillance allows us to identify risk factors, track disease trends, determine action items, and target interventions. We can apply VQA to develop a tool that tells whether or not there is ice on a water body, or whether there is smoke and fire in a forest. From that, we can gain an early trigger for a potential disaster.
In this dissertation, we try our best to obtain results corresponding to the following objectives:
• Release a novel VQA dataset written in Vietnamese and provide an in-depth analysis.
• Design, develop, and apply our models to one mobile app and one web app, which can help the visually impaired in their daily life.
• Elaborate on our solutions and examine the limitations of our models.
1.4 Our contribution
Thesis's contribution
Our first contribution is a new Vietnamese VQA dataset that we built by taking advantage of the previous VQA-v2 dataset. The dataset consists of images from MS-COCO1 and about a million question-answer pairs. By thoroughly examining our dataset, we find that it can significantly accelerate performance and play a key role in addressing the Vietnamese VQA task.
Moreover, we propose an effective pipeline for developing a Vietnamese VQA model. By using some modern techniques, our model can overcome the problems in a complex language like Vietnamese and efficiently capture the relationships between the textual and visual presentations. We also quantitatively and qualitatively evaluate our models to show how effectively they deal with the Vietnamese VQA task.
Furthermore, we design and develop two applications, on mobile and web, which use our models as their core component, in the hope that they can make the daily life of visually impaired people easier.
Paper’s contribution
As a part of our work, we submitted one paper to the FAIR 2021 Conference. The paper summarizes our thesis work, from constructing a novel Vietnamese VQA dataset to proposing a pipeline that can effectively apply our model to address the Vietnamese VQA task.
We hope that our paper will add a small contribution to the knowledge of the artificial intelligence, computer vision, and natural language processing communities, and also assist in advancing the field of Vietnamese Visual Question Answering.
1 https://cocodataset.org/
1.5 Thesis structure
This dissertation consists of 8 chapters, with the introductory chapter serving as chapter 1 and the conclusion as chapter 8. In chapters 2-7, we present the key materials that we use throughout our thesis. A brief overview of the contents of each chapter is presented below.
Chapter 2 presents a theoretical background for the thesis. It is the foundation of knowledge that is really necessary for us to gain a deep understanding of VQA.
Chapter 3 reviews relevant related work and raises some problems of traditional approaches for the VQA task.
Chapter 4 describes the feature extraction method, the co-attention method, and our proposed architecture of the Vietnamese VQA model.
Chapter 5 describes the process of building our Vietnamese dataset and provides an in-depth analysis of the dataset.
Chapter 6 gives an analysis of our conducted experiments and results.
Chapter 7 includes the description of our AI application that utilizes the VQA model.
Chapter 8 is the end of our thesis. We discuss and summarize the achievements and drawbacks of our VQA system and present future plans for our thesis.
THEORETICAL OVERVIEW
This chapter presents a theoretical background for the thesis. It forms the basis of later understanding that helps throughout this work.
2.1 Deep learning neural network
In its most general form, a deep learning problem is considered an estimation problem: the goal of a deep learning algorithm is to approximate some function f* that formulates the training data distribution.
A neural network is a functional unit of deep learning. Deep learning uses neural networks to mimic human brain activity to solve complex data-driven problems. A neural network functions when some input data is fed to it. This data is then processed via layers of perceptrons to produce the desired output.
A perceptron is a single-neuron model that was a precursor to larger neural networks, with an input layer and an output layer; it is also known as a linear separator. Each layer includes one or more units called nodes. Except for the input layer, each node in a layer is connected to all nodes in the previous layer, and each connection has a specified weight. Sometimes, a layer has a special node called a bias, which has connections to all nodes in the next layer, and all of these connections get weight one. A perceptron produces a single output from several real-valued inputs by forming a linear combination using the input weights and sometimes passing the output through a non-linear activation function. The purpose of non-linear activation functions will be introduced in the next section. Since the output of a perceptron is binary, we can use it for binary classification, i.e., an input belongs to only one of two classes.
Here, we give an example presenting the process by which a perceptron generates its result:
Figure 2.1: Perceptron
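The compute-and-threshold process illustrated in Figure 2.1 can also be sketched in code. The following is a minimal, hypothetical illustration: the AND dataset, learning rate, and epoch count are our own assumptions, not values from the thesis.

```python
# Hypothetical minimal sketch of a perceptron for binary classification.

def perceptron_predict(x, w, b):
    """Linear combination of inputs and weights, thresholded to a binary output."""
    s = sum(xi * wi for xi, wi in zip(x, w)) + b
    return 1 if s > 0 else 0

def perceptron_train(samples, labels, lr=0.1, epochs=20):
    """Classic perceptron learning rule: nudge weights on each misclassification."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            err = y - perceptron_predict(x, w, b)
            if err != 0:
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
    return w, b

# A linearly separable toy problem: logical AND.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
w, b = perceptron_train(X, y)
print([perceptron_predict(x, w, b) for x in X])  # → [0, 0, 0, 1]
```

Because AND is linearly separable, the learning rule converges to weights that classify all four inputs correctly, matching the perceptron's limitation discussed next.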
While the perceptron is limited to solving problems that are linearly separable, the multi-layer perceptron is more general. A multi-layer perceptron is simply composed of more than one perceptron. It is the result of organizing multiple perceptrons into layers and using intermediate (hidden) layers to allow for the solution of non-linearly separable problems. This neural network is the foundation of many modern complex neural networks that have been popularly used in computer vision, such as for image classification and object detection, and furthermore in natural language processing. There are three types of layers in a multi-layer perceptron:
• The input layer brings the initial data into the system for further processing by subsequent layers of artificial neurons.
• A hidden layer is a layer between the input and output layers, where artificial neurons take in a set of weighted inputs and produce an output through an activation function.
• The output layer is the last layer of the network and produces the final outputs of the program.
A multi-layer perceptron is composed of an input layer to receive the signal, an output layer that makes a decision or prediction about the input, and, in between those two, an arbitrary number of hidden layers that are the true computational engine of the multi-layer perceptron. Similar to the perceptron, the value of each node in the hidden layers and the output layer is calculated in two steps: computing a linear sum, then passing the result through a nonlinear activation function. The details are given below:
Let b^k ∈ R^{l_k × 1} denote the biases of the units in the k-th layer, with b^k_i the bias of the i-th unit of that layer. For the i-th unit of the k-th layer, the calculation proceeds as follows:
• Calculate the linear sum: z^k_i = ∑_{j=1}^{l_{k−1}} a^{k−1}_j · w^k_{ij} + b^k_i
• Apply the activation function: a^k_i = f(z^k_i)
Here z^k ∈ R^{l_k × 1} is the output of the k-th layer after calculating the linear sum, and a^k ∈ R^{l_k × 1} is the output of the k-th layer after applying the activation function.
Figure 2.2: Symbols and calculating process in Multilayer Perceptron
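The two-step computation (linear sum, then activation) can be sketched as a small forward pass. The layer sizes, weights, and the choice of sigmoid as the activation function are illustrative assumptions, not values from the thesis.

```python
# Hypothetical sketch of a multi-layer perceptron forward pass.
import math

def forward(x, layers):
    """layers: list of (W, b) pairs; W[i][j] is the weight from unit j
    of the previous layer to unit i of the current layer."""
    a = x
    for W, b in layers:
        # Step 1: linear sum z_i = sum_j w_ij * a_j + b_i
        z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i
             for row, b_i in zip(W, b)]
        # Step 2: non-linear activation (sigmoid here, for illustration)
        a = [1.0 / (1.0 + math.exp(-z_i)) for z_i in z]
    return a

# One hidden layer with two units, then one output unit (illustrative weights).
layers = [
    ([[0.5, -0.5], [0.3, 0.8]], [0.1, -0.1]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                    # output layer
]
print(forward([1.0, 2.0], layers))
```

Each layer feeds its activated outputs forward as the next layer's inputs, exactly the chain described above.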
The weighted inputs are summed and passed through an activation function, whose output can be the final result or the input of the following layer. Activation functions are non-linear functions that allow the network to combine the inputs in more complex ways and in turn provide a richer capability in the functions the network can model. Activation functions are the decision-making units of neural networks.
Figure 2.3: Multi-layer perceptron with two hidden layers
ReLU
ReLU is also the most widely used activation function for deep networks, since it has been proven that its usage results in improved performance of deep learning models. This activation function is defined as just the positive part of the input: f(x) = max(0, x).
Softmax
The softmax activation function is commonly employed in the network's final output layer, which is usually designed for multi-class classification. The softmax function produces a probability distribution over the target classes, and the final output is projected as the class with the highest probability value in this distribution. Mathematically, it converts any distribution of scores over n classes into a probability distribution in the range (0, 1) over the same n classes:
softmax(z)_i = exp(z_i) / ∑_{j=1}^{n} exp(z_j)
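Both functions can be written directly from their definitions. The following is a sketch; the max-subtraction inside softmax is a standard numerical-stability trick that we add here, not something stated in the text.

```python
# Sketches of the two activation functions discussed above.
import math

def relu(z):
    """ReLU keeps only the positive part of each input."""
    return [max(0.0, zi) for zi in z]

def softmax(z):
    """Turns arbitrary scores into a probability distribution over n classes."""
    m = max(z)                                # shift for numerical stability
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

print(relu([-2.0, 0.0, 3.5]))   # → [0.0, 0.0, 3.5]
probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))         # the probabilities sum to 1
```

Note that softmax preserves the ordering of the scores, so the largest input always receives the highest probability.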
Typically, the loss function can be written as an average loss of the predicted output f_w(x), when the input is x and the ground-truth output is y, over the training set:
L_D(f_w) = (1/|D|) ∑_{(x,y)∈D} L(f_w(x), y)
where L is the per-sample loss function and D is the training set distribution. We should minimize this objective function across the data-generating distribution in the hope that doing so will improve our deep learning performance.
Mean Squared Error
The Mean Squared Error (MSE), or Mean Squared Deviation (MSD), of an estimator measures the average of the squares of the errors. It is also the most widely used loss function for regression problems. In its simplest form, the MSE loss function can be represented by the following formula:
MSE = (1/N) ∑_{i=1}^{N} (y_i − ŷ_i)²
Cross-entropy
Cross-entropy loss is frequently used in machine learning and optimization problems. Measuring the relative entropy between two probability distributions p and q over the same set of events X, we can define it as follows:
H(p, q) = −∑_{x∈X} p(x) · log q(x)
Binary cross-entropy
An alternative to cross-entropy loss for binary classification problems is the binary cross-entropy loss function, also known as log loss. Intuitively, log loss is the negative average of the log of the corrected predicted probabilities for each instance, and can be written as the following formula:
BCE = −(1/N) ∑_{i=1}^{N} [y_i · log(p(y_i)) + (1 − y_i) · log(1 − p(y_i))] (2.8)
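The three losses above translate directly into code. This is a minimal sketch with variable names of our own choosing:

```python
# Direct implementations of MSE, cross-entropy, and binary cross-entropy.
import math

def mse(y_true, y_pred):
    """Average of squared errors over N samples."""
    n = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n

def cross_entropy(p, q):
    """H(p, q) between two distributions over the same events (terms with p=0 vanish)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def binary_cross_entropy(y_true, p_pred):
    """Negative average log of the corrected predicted probabilities."""
    n = len(y_true)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred)) / n

print(mse([1.0, 2.0], [1.5, 2.0]))              # → 0.125
print(cross_entropy([1.0, 0.0], [0.9, 0.1]))    # -log(0.9) ≈ 0.105
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
```

In practice the predicted probabilities are usually clipped away from 0 and 1 before taking logarithms, to avoid infinities.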
Backpropagation is a popular method for training artificial neural networks, especially deep neural networks. Backpropagation is needed to calculate the gradient used to adapt the weights of the weight matrices: the weights of the neurons of the neural network are adjusted by calculating the gradient of the loss function.
Given a neural network and a loss determined for a certain sample, we need a way to update the network's parameters so that the loss value decreases. In general, optimization procedures such as stochastic gradient descent are used. For algorithms based on gradient methods, the gradient of the loss function with respect to each weight in each of the network's layers must be determined. The backpropagation method does this by calculating the first-order derivative of the loss function with respect to the weights at the output using the chain rule. This gradient is then propagated backward through the network, down to the first layer, where the partial derivative of the loss with respect to each weight is determined for each layer. In essence, we then know how much each weight contributed to the loss, whether positively or negatively.
Backpropagation requires the differentiability of the activation functions utilized throughout the network. These partial derivatives are then used by gradient-based optimization algorithms to update each weight. The update for a weight based on gradient descent is a generalization of the perceptron weight update via the chain rule, making it applicable to multilayer neural networks. The gradient descent update for a weight w_j is defined in the equation below:
w_j ← w_j − η · ∂L/∂w_j
Batch gradient descent is computationally intensive, since it generates gradients based on all input patterns in the training set. Stochastic gradient descent (SGD) is a method for calculating gradients based on one input sample at a time, which is more cost-effective and aids model convergence for big datasets.
Weight updates based on SGD, on the other hand, are frequently noisy. In practice, a combination of batch gradient descent and SGD is utilized, with gradients computed for a subset of input patterns, a mini-batch.
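The mini-batch idea can be illustrated on a toy one-parameter least-squares problem. The data, learning rate, batch size, and epoch count below are made-up values for the sketch, not settings from the thesis.

```python
# Mini-batch SGD on min_w (w*x - y)^2: gradients are averaged over
# small batches rather than over the whole training set.
import random

def sgd(data, lr=0.05, epochs=100, batch_size=2, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # new random batch split each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # gradient of (w*x - y)^2 is 2*(w*x - y)*x, averaged per batch
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

# y = 3x exactly, so w should converge close to 3.
data = [(x, 3.0 * x) for x in [1.0, 2.0, -1.0, 0.5]]
print(sgd(data))  # converges close to 3.0
```

Because each update sees only a subset of the data, the path to the optimum is noisier than full-batch gradient descent but each step is much cheaper.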
Adam optimizer
In this study, an extension of the SGD algorithm called Adaptive Moment Estimation [3] (Adam) is used. It is an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. It uses an exponentially decaying average of the previous gradients (first moment) and of the squares of previous gradients (second moment) to compute an adaptive learning rate for each parameter. This means that, in addition to a global learning rate η, Adam requires additional hyper-parameters β1 and β2 that determine the decay rates of the first and second moments. Mathematically, with
• g_t: gradient at time t along ω_t
• ν_t: exponential average of gradients along ω_t
• s_t: exponential average of squares of gradients along ω_t
the Adam update can be written as:
ν_t = β1 · ν_{t−1} + (1 − β1) · g_t
s_t = β2 · s_{t−1} + (1 − β2) · g_t²
ω_{t+1} = ω_t − η · ν_t / (√s_t + ε)
where ε is a small constant for numerical stability.
Adam is a state-of-the-art method that is quite popular in deep learning, as it is able to achieve good results quickly.
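A single-parameter sketch of Adam following the description above. The hyper-parameter defaults (β1 = 0.9, β2 = 0.999, ε = 1e-8) and the bias-correction steps follow the standard formulation of the algorithm; they are assumptions for this sketch, not values quoted from the thesis.

```python
# Adam on a single parameter: moment estimates plus bias correction.
import math

def adam_minimize(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=200):
    w, v, s = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        v = beta1 * v + (1 - beta1) * g       # first moment (mean of gradients)
        s = beta2 * s + (1 - beta2) * g * g   # second moment (mean of squares)
        v_hat = v / (1 - beta1 ** t)          # bias correction for the warm-up phase
        s_hat = s / (1 - beta2 ** t)
        w -= lr * v_hat / (math.sqrt(s_hat) + eps)
    return w

# Minimize f(w) = (w - 5)^2, whose gradient is 2*(w - 5).
print(adam_minimize(lambda w: 2 * (w - 5), w0=0.0))  # approaches 5.0
```

The division by √ŝ_t normalizes the step size per parameter, which is why Adam often needs less learning-rate tuning than plain SGD.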
Warmup scheduler
For any parameterized optimization algorithm like Adam, we can apply a warmup factor ω ∈ [0, 1] and replace the initial learning rate α with α_t = α · ω_t in the update rule. By reasonably choosing the parameter ω, this schedule can help the model achieve more stability in the training stage.
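The warmup factor ω_t can be sketched with a linear ramp, one common choice; the linear shape and the step counts below are our assumptions, since the text does not fix a particular schedule.

```python
# Linear learning-rate warmup: the base rate is scaled by a factor
# that grows from near 0 to 1 over the first warmup_steps steps.

def warmup_lr(base_lr, step, warmup_steps):
    """alpha_t = alpha * omega_t, with omega_t ramping linearly up to 1."""
    omega = min(1.0, step / warmup_steps)
    return base_lr * omega

for step in (1, 250, 500, 1000):
    print(step, warmup_lr(1e-3, step, warmup_steps=500))
```

After the ramp, ω_t stays at 1, so the schedule only affects the first few hundred updates, when the moment estimates of Adam are still unreliable.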
In its most general form, the convolution operation between two functions is typically denoted with an asterisk. For a 2-D input image I and a 2-D kernel K, we can specify the discrete convolution operation as
H = I ⊗ K ⇔ H[x, y] = ∑_{i=1}^{w_K} ∑_{j=1}^{h_K} K[i, j] · I[x + w_K − i, y + h_K − j] (2.11)
where:
• I ∈ R^{w_I × h_I} is an input image of size w_I × h_I.
• K ∈ R^{w_K × h_K} is a kernel of size w_K × h_K.
• H ∈ R^{w_H × h_H} is the output of the convolution operation, also called a feature map.
Figure 2.4:Convolution operation between 2-D input image and 2-D kernel
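Equation (2.11) can be implemented directly with nested loops. The sketch below uses 0-based indices and a "valid" output size (no padding); the toy image and kernel are our own illustration.

```python
# Direct, loop-based 2-D convolution with the kernel flipped,
# as in the convolution (not correlation) formula.

def conv2d(image, kernel):
    w_i, h_i = len(image), len(image[0])
    w_k, h_k = len(kernel), len(kernel[0])
    w_h, h_h = w_i - w_k + 1, h_i - h_k + 1   # "valid" feature-map size
    out = [[0] * h_h for _ in range(w_h)]
    for x in range(w_h):
        for y in range(h_h):
            # 0-based version of H[x, y] = sum_ij K[i, j] * I[x + wK - i, y + hK - j]
            out[x][y] = sum(kernel[i][j] * image[x + w_k - 1 - i][y + h_k - 1 - j]
                            for i in range(w_k) for j in range(h_k))
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
print(conv2d(image, kernel))  # → [[4, 4], [4, 4]]
```

This kernel takes the difference along the main diagonal; because the toy image increases by a constant step in that direction, every output value is the same.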
Using this property, we can derive a formula for the convolution operation on multi-dimensional inputs, also known as multi-channel images:
H = I ⊗ K ⇔ H[z] = ∑_{c=1}^{c_I} I[c] ⊗ K[z, c]
where:
• I ∈ R^{c_I × w_I × h_I} is an input image with c_I channels, each of size w_I × h_I.
• K ∈ R^{c_H × c_I × w_K × h_K} is a kernel with c_H output channels, each of size w_K × h_K.
• I[c] ⊗ K[z, c] is a 2-D convolution on channel c of the input image I with the corresponding slice of the kernel K.
• H ∈ R^{c_H × w_H × h_H} is the output feature map of the operation.
Convolutional neural networks have been tremendously successful and are still one of themost prominent algorithms for deep learning with grid-like data, especially for time-seriesdata, which can be considered as a 1-D grid of data in each interval, and image data, whichcan be considered as a 2-D (or 3-D) grid of pixels
In the image-processing world, convolutional neural networks stand out with excellent accuracy and performance. Traditional image-processing algorithms require manual feature extraction and are not robust across varied environments, while traditional machine learning methods require a huge number of parameters and large computational resources, because they use matrix multiplication to describe the interaction between each input unit and each output unit; this means each output unit is derived from every input unit. Convolutional neural networks improve on both of these aspects and have obtained impressive results in classical computer vision tasks, e.g., object detection [4] and image classification [5]. Convolutional neural networks rely on three ideas to improve on previous techniques: sparse interactions, parameter sharing, and equivariant representation.
• Sparse interactions: This property is not strictly a theoretical benefit, but it can be very useful for improving computational efficiency when dealing with large inputs. By applying a kernel much smaller than the input image, the network can describe complicated interactions between small but meaningful features in the image, such as edges or keypoints, which usually span only tens or hundreds of pixels.
• Parameter sharing: This property refers to using the same parameter more than once when producing the output. In traditional neural networks such as the MLP, each input unit is used exactly once when computing a given output unit, so every input unit contributes equally to a unit in the output layer. Convolutional neural networks instead have a notion called the receptive field, defined as the size of the region in the input that produces a given feature. Each output unit is calculated from all of the input units within its receptive field; units near the center of the receptive field are reused more often than the others and therefore have a larger contribution to the output unit.
Figure 2.5: The receptive field of the units in the deeper layers of a convolutional network is larger than the receptive field of the units in the shallow layers.
• Equivariant representation: A function f(x) is equivariant to a function g(x) if f(g(x)) = g(f(x)), i.e., applying g before f gives the same result as applying it after. In the case of the convolution operation on image inputs, if we translate an object in the input, its representation moves by the same amount in the output.

Figure 2.6: Receptive field of one output unit in CNNs.
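Translation equivariance can be checked numerically: convolving a shifted image equals shifting the convolved image. A toy check with a single bright pixel (the helper below is a plain cross-correlation written only for this demonstration):

```python
import numpy as np

def conv_valid(I, K):
    """Plain valid cross-correlation, used only to illustrate equivariance."""
    wK, hK = K.shape
    out = np.zeros((I.shape[0] - wK + 1, I.shape[1] - hK + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            out[x, y] = np.sum(K * I[x:x + wK, y:y + hK])
    return out

I = np.zeros((6, 6)); I[1, 1] = 1.0          # a single bright pixel
K = np.array([[1., 2.], [3., 4.]])
shifted = np.roll(I, (2, 2), axis=(0, 1))    # translate the input by (2, 2)
a = conv_valid(shifted, K)                   # convolve the shifted image
b = np.roll(conv_valid(I, K), (2, 2), axis=(0, 1))  # shift the convolved image
print(np.allclose(a, b))                     # → True
```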
Pooling, or the down-sampling layer, performs nonlinear down-sampling and cuts down the number of parameters for a simpler output. For example, the max-pooling operation outputs the maximum within a rectangular neighborhood. Because pooling summarizes a whole neighborhood for each unit of the input and returns a spatially reduced version of the input, a pooling layer is normally placed after the output of a convolutional layer to reduce computational cost while preserving enough features of the input. One key insight is that overusing pooling layers can lead to underfitting.
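Max pooling can be sketched with a reshape trick in NumPy; a minimal non-overlapping version, assuming stride equals the pooling size:

```python
import numpy as np

def max_pool2d(X, size=2):
    """Non-overlapping max pooling: each size x size block of X is
    replaced by its maximum, shrinking each spatial dimension by size."""
    h = X.shape[0] // size * size       # drop any ragged border
    w = X.shape[1] // size * size
    X = X[:h, :w]
    return X.reshape(h // size, size, w // size, size).max(axis=(1, 3))

X = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(X))                    # the maximum of each 2x2 block
```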
AlexNet
AlexNet [6] is considered a breakthrough and one of the most influential works in computer vision. Developed in 2012, it significantly improved on the best performance in the ImageNet [5] competition by utilizing graphics processing units (GPUs) during training. The network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers.
Figure 2.7: Applying a 2 × 2 pooling layer to a 6 × 6 input.
Figure 2.9: VGG16 (left) and VGG19 (right) architectures.
After being introduced in 2015, ResNet [8] gained tremendous success in computer vision. Its authors proposed a novel method to ease the degradation problem (as network depth increases, accuracy gets saturated and then degrades rapidly when training deep neural networks) by introducing residual functions. ResNet and its variants, such as ResNet-18, ResNet-34, ResNet-50, and ResNet-101, have marked a rapid and significant growth of residual functions in deep learning.
Figure 2.10: Residual function (left) and ResNet-18 architecture (right).
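The residual idea can be sketched with a minimal fully-connected block. The weight shapes here are illustrative, not ResNet's actual convolutional layers:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """Minimal fully-connected residual block: the skip connection adds
    the input x to the learned residual F(x) = W2 @ relu(W1 @ x)."""
    return relu(x + W2 @ relu(W1 @ x))

# with zero weights the residual vanishes and the block is the identity,
# which is what makes very deep residual networks easier to optimize
x = np.array([1.0, 2.0, 3.0])
out = residual_block(x, np.zeros((3, 3)), np.zeros((3, 3)))
print(out)   # → [1. 2. 3.]
```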
R-CNN
R-CNN stands for Region-based Convolutional Neural Networks. Since the object detection task was introduced, many methods have been proposed to address it, from classical image-processing methods such as [9] to complex ensemble systems, but object detection performance had stagnated.
The object detection problem cannot be solved with a traditional convolutional neural network (CNN) model. The big problem is that we cannot create an efficient output layer, because we do not know ahead of time how many objects are in the image. R-CNN (region-based convolutional neural network) was created as a result.

R-CNN was first introduced in 2014 in "Rich feature hierarchies for accurate object detection and semantic segmentation" [10], written by Ross Girshick and colleagues at UC Berkeley.
R-CNN consists of 3 components:
• Region Proposal: Region proposals are simply smaller portions of the original image that we believe may contain the objects we are looking for. To generate region proposals, the selective search method [2] is utilized; R-CNN uses it to generate about 2K region proposals, i.e., bounding boxes for image classification. This may appear to be a large number, but it pales in comparison to the brute-force sliding-window approach.

• Feature Extractor: AlexNet, for example, is used as a feature extractor. AlexNet has previously been trained for the classification task; after training, the last softmax layer is removed, so the fully-connected 4096-dimensional layer becomes the final layer. Output features are therefore 4096-dimensional.
• Classifier: After creating feature vectors from the image proposals, we need to classify them, i.e., figure out what kind of object each feature vector belongs to. R-CNN utilizes an SVM classifier for this: each object class has its own SVM, and we apply all of them. This means we get n outputs for each feature vector, where n is the number of different object classes we want to detect. Each output is a confidence score that represents how certain we are that this feature vector belongs to this class.
Figure 2.11: The architecture of R-CNN.
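The per-class scoring step can be sketched with hypothetical linear scorers standing in for R-CNN's per-class SVMs; the random weights and the name `classify_proposals` are illustrative stand-ins, not R-CNN's actual code:

```python
import numpy as np

# Hypothetical per-class linear scorers standing in for R-CNN's per-class
# SVMs: one weight vector per object class over 4096-d AlexNet features.
rng = np.random.default_rng(0)
n_classes, feat_dim = 3, 4096
W = rng.standard_normal((n_classes, feat_dim))

def classify_proposals(features, W):
    """features: (num_proposals, feat_dim) -> one confidence score per
    class for every proposal, shape (num_proposals, n_classes)."""
    return features @ W.T

feats = rng.standard_normal((5, feat_dim))   # 5 region-proposal features
scores = classify_proposals(feats, W)
print(scores.shape)   # → (5, 3)
```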
However, R-CNN has two major drawbacks:

• It was time-consuming: for each region proposal, we had to compute a feature map (one CNN forward pass).

• We had to save every feature map of each region proposal, which required a large amount of memory (RAM).
After R-CNN was introduced, Ross Girshick went on to introduce Fast R-CNN [11]. Fast R-CNN is a single system that we can train end-to-end: the three components of the R-CNN system (CNN, SVM, and bounding-box regressor) are merged into a single architecture, which is the architecture of Fast R-CNN.
Figure 2.12: The architecture of Fast R-CNN.
This is how it works:
• Process the whole image with the CNN. The result is a feature map of the image.

• For each region proposal, extract the corresponding part from the feature map; this is what we refer to as the region proposal feature map. With the help of a pooling layer, we scale the region proposal feature map to a fixed size. This pooling layer is called the Region of Interest (RoI) pooling layer.

• Flatten this fixed-size region proposal feature map. This is now a feature vector that always has the same size.

• This feature vector is used as the input for the final part: two fully-connected output layers. The first is the softmax classification layer, which determines which object class we have detected; the second is the bounding-box regressor, which outputs the bounding-box coordinates for each object class.
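The RoI pooling step can be sketched as follows; a simplified version that max-pools an integer-aligned region into a fixed grid (the real layer also handles fractional coordinates induced by the feature-map stride):

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=2):
    """Simplified RoI pooling: crop the (x0, y0, x1, y1) region from the
    feature map, split it into an out_size x out_size grid, and take the
    max of each cell, so every RoI yields the same fixed-size output."""
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    ys = np.linspace(0, region.shape[0], out_size + 1).astype(int)
    xs = np.linspace(0, region.shape[1], out_size + 1).astype(int)
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

fm = np.arange(64, dtype=float).reshape(8, 8)
print(roi_pool(fm, (0, 0, 4, 6)))   # a 6x4 region pooled to a fixed 2x2
```

Whatever the proposal's size, the output is always out_size × out_size, which is what lets the downstream fully-connected layers accept proposals of arbitrary shape.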
At inference time, Fast R-CNN is 213 times faster than R-CNN and achieves a higher mAP.
Faster R-CNN
In 2016, Shaoqing Ren and his colleagues at Microsoft Research proposed Faster R-CNN [12]. Faster R-CNN has two networks: a region proposal network (RPN) for generating region proposals, and a network that detects objects using these proposals. The primary difference from Fast R-CNN and R-CNN is that those earlier models generate region proposals via selective search; because the RPN shares most of its computation with the object detection network, the cost of generating region proposals is substantially lower with an RPN than with selective search. The RPN ranks region boxes (also known as anchors) and indicates the ones most likely to contain objects. Anchors play an important role in Faster R-CNN: an anchor is a box, and in the default configuration of Faster R-CNN there are 9 anchors at each position of an image. The output of the RPN is a collection of boxes/proposals that will be processed by a classifier and a regressor to evaluate the presence of objects: the RPN predicts whether an anchor belongs to the background or the foreground, then refines the anchor accordingly.

This is how it works:
• Take an input image and pass it to the ConvNet, which returns feature maps for the image.

• Apply the Region Proposal Network (RPN) to these feature maps to get object proposals.

• Apply an RoI pooling layer to bring all the proposals down to the same size.

• Finally, pass these proposals to a fully-connected layer to classify and predict the bounding boxes for the image.
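The 9 anchors per position can be sketched as follows; the 3 scales × 3 aspect ratios follow the paper's default configuration, while the function name is illustrative:

```python
import numpy as np

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """The 9 anchors (3 scales x 3 aspect ratios) centred at one position,
    each returned as (x0, y0, x1, y1) with area scale^2."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)          # width/height ratio r, area s^2
            h = s / np.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)

A = make_anchors(100, 100)
print(A.shape)   # → (9, 4)
```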
Figure 2.13: The architecture of Faster R-CNN.
To date, Faster R-CNN has been the most effective architecture based on R-CNN (Region-based Convolutional Neural Networks), with improved accuracy and inference up to 250 times faster than the first version, R-CNN.