Information Technology Graduation Thesis: Artificial Intelligence (graded 9 points)

Information Technology graduation thesis from Da Nang University of Science and Technology (Bách khoa Đà Nẵng), 2016, in the field of artificial intelligence (machine learning). Mathematics is one of the most important fields of human knowledge; it is widely studied, developed, and applied in real life, and it helps solve many everyday problems. All of us have studied math for a long time. Many people love the subject, but many others have difficulty solving math problems. With the vigorous development of science and technology, Artificial Intelligence (AI) in particular has produced many outstanding achievements and can take over many human tasks. This thesis introduces an application that helps solve math problems by applying several machine learning algorithms.

Instructor: Le Thi My Hanh, Ph.D Date

Suggestions/Comments:

………

Topic title: MathOCR: Solving math problems using machine learning

Student name: Bui Dang Quang Dung

Student ID: 103160193 Class: 16TCLC3

Mathematics is one of the most important fields of human knowledge; it is widely studied, developed, and applied in real life, and it helps solve many everyday problems. All of us have studied math for a long time. Many people love the subject, but many others have difficulty solving math problems. With the vigorous development of science and technology, Artificial Intelligence (AI) in particular has produced many outstanding achievements and can take over many human tasks. This thesis introduces an application that helps solve math problems by applying several machine learning algorithms.

UNIVERSITY OF SCIENCE AND TECHNOLOGY

FACULTY OF INFORMATION TECHNOLOGY

Independence - Freedom - Happiness

GRADUATION PROJECT REQUIREMENTS

Student Name: BUI DANG QUANG DUNG Student ID: 103160193

Class: 16TCLC3 Faculty: Information Technology Major: Information Technology

1. Topic title: MathOCR: Solving math problems using machine learning

2. Project topic: ☐ has signed intellectual property agreement for the final result

3. Initial figures and data:

Data is collected from many sources.

4. Content of the explanations and calculations:

The content contains five parts:

• Machine Learning, Computer Vision, Natural Language Processing, and their applications
• Models and details in MathOCR
• Introduction to the MathOCR application
• MathOCR experiments:
  ◦ Data preprocessing process
  ◦ Data generation process for the models in MathOCR
  ◦ Experimental results of the models
• Conclusion

5. Drawings, charts

6. Instructor name: Le Thi My Hanh, PhD, Faculty of Information Technology, University of Danang - University of Science and Technology

PREFACE

I would like to express my sincere thanks to Le Thi My Hanh, PhD, for giving me many ideas, solutions, and much knowledge throughout this project.

I would also like to thank the teachers and students of the Faculty of Information Technology, University of Danang - University of Science and Technology, for helping me over the past four years of study and for passing on the knowledge and valuable experience I needed to carry out this project.

Finally, I would like to express my special thanks to my family, who supported and motivated me, both financially and spiritually, during this project.

Although I tried my best, mistakes and omissions are unavoidable. I hope to receive valuable comments and recommendations from the teachers to improve my thesis.

Da Nang, December 11th 2020

Student

Bui Dang Quang Dung

ASSURANCE

I understand the University's policy on anti-plagiarism and guarantee that:

1. The contents of this thesis project were produced by myself under the guidance of Le Thi My Hanh, PhD.

2. All references used in this thesis are cited clearly and faithfully, with the author's name, the project's name, and the time and place of publication.

3. The contents of this project are my own work and have not been copied from other sources or previously submitted for award or assessment.

Student Performed

Bui Dang Quang Dung

TABLE OF CONTENTS

SUMMARY
GRADUATION PROJECT REQUIREMENTS
PREFACE
ASSURANCE
LIST OF PICTURES
LIST OF TABLES
LIST OF ACRONYMS

INTRODUCTION
Reason for doing this thesis
Scope and Objective
Overview

CHAPTER 1: MACHINE LEARNING, COMPUTER VISION, NATURAL LANGUAGE AND THEIR APPLICATION
1.1 Introduction
1.2 Machine Learning
1.2.1 What is Machine Learning
1.2.2 Supervised Learning
1.2.3 Unsupervised Learning
1.2.4 Reinforcement Learning
1.3 Computer Vision
1.3.1 What is Computer Vision
1.3.2 Computer Vision tasks
1.3.3 Applications of Computer Vision
1.4 Natural Language Processing
1.4.1 What is Natural Language Processing
1.4.2 Natural Language Processing tasks and their application

CHAPTER 2: MODELS, DETAILS IN MATHOCR
2.1 Introduction
2.2 The Vietnamese recognition model with the Transformer
2.2.1 Introduction
2.2.2 Backbone
2.2.3 Encoder
2.2.4 Decoder
2.2.5 Multi-Head Attention
2.2.6 Position-wise Feed-Forward Networks
2.2.7 Positional Encoding
2.3 The Image to Latex model
2.3.2 Model Architecture
2.3.3 Encoder
2.3.3.1 Convolution
2.3.3.2 Positional Encoding
2.3.4 Decoder
2.3.4.1 Token Embedding
2.3.4.2 LSTM network
2.4 The YOLOv4 Model
2.4.2 YOLOv4 Architecture
2.4.2.1 Backbone
2.4.2.2 Neck
2.4.2.3 Head
2.4.2.4 Bag Of Freebies
2.4.2.5 Bag Of Specials
2.5 Metric Evaluation
2.5.1 BLEU
2.5.2 mAP
2.5.2.1 Precision and Recall
2.5.2.2 IoU
2.5.2.3 AP

CHAPTER 3: INTRODUCE MATHOCR APPLICATION
3.1 Introduction
3.2 Front-end
3.3 Server
3.4 Features Specification
3.3.1 Document Scanner
3.3.2 Math Formula Recognition
3.3.3 Vietnamese Text Recognition
3.3.4 Solving Math equations

CHAPTER 4: MATHOCR EXPERIMENTS
4.1 Introduction
4.2 Data Preprocessing Process
4.3 VietnameseOCR Model Experiments
4.3.1 Data Source
4.3.2 Training parameters
4.3.3 Experimental Results
4.4 Im2LaTex Model
4.4.3 Experimental Results
4.5 YOLOv4 Model
4.5.3 Experiment result

CHAPTER 5: CONCLUSION
5.1 Achieved results
5.2 Limitations
5.3 Development

REFERENCES

LIST OF PICTURES

Figure 1.1 Supervised Learning
Figure 1.2 Unsupervised Learning
Figure 1.3 Reinforcement Learning
Figure 1.4 Subfields of Computer Vision
Figure 2.1 The Transformer Architecture
Figure 2.2 (left) Scaled Dot-Product Attention, (right) Multi-Head Attention consists of several attention layers running in parallel
Figure 2.3 Object detector
Figure 2.4 DenseNet CSP
Figure 2.5 Modified PAN
Figure 2.6 Modified SAM
Figure 2.7 IoU
Figure 2.8 The result calculated Precision and Recall
Figure 2.9 The results after being smoothed
Figure 2.10 Calculate max of precision at each level
Figure 2.11 Normalize by VOC format
Figure 3.1 The React Native Framework
Figure 3.2 Python, Flask and Pytorch
Figure 3.3 Document Scanner Screen
Figure 3.4 Math Formula Recognition Screen
Figure 3.5 Vietnamese Text Recognition Screen
Figure 3.6 Solving Math Equations Screen
Figure 4.1 Data preprocessing process
Figure 4.2 Data normalization
Figure 4.3 Requests and Beautiful Soup libraries
Figure 4.4 Process of generating data from existing text
Figure 4.5 Process of generating data from PDF files
Figure 4.6 Results of Vietnamese text recognition
Figure 4.7 Good prediction results
Figure 4.7 Unexpected prediction results
Figure 4.8 The statistics bounding box of objects
Figure 4.9 Experiment result
Figure 4.10 Good prediction results
Figure 4.11 The predicted results are average
Figure 4.12 Unexpected prediction results

LIST OF TABLES

Table 2.1 The VGG19 model Architecture
Table 2.2 The CNN model configurations
Table 2.3 DarkNet-53 Architecture
Table 2.4 The prediction apple result

LIST OF ACRONYMS

BLEU Bilingual Evaluation Understudy

INTRODUCTION

Reason for doing this thesis

Mathematics is one of the essential fields and the foundation of all other sciences. It is widely studied, developed, and applied in real life, and it helps solve many everyday problems. All of us have been taught math for a long time. Learning math helps us solve problems and develops other skills, such as analytical thinking, reasoning, and problem solving, which help us become smarter and think more quickly. Math has therefore become a favourite subject for many people. At the same time, many people have lost their motivation for the subject; they feel intimidated and are unable to absorb useful knowledge from it. One of the main reasons is that difficult math problems are not easy to solve, so this discouragement is understandable.

With the vigorous development of science and technology, Artificial Intelligence has recently made many outstanding achievements, helping people save time and work more productively. Applying these AI achievements to human problems is therefore of great significance.

For these reasons, this thesis introduces an application that helps solve some math problems. It can serve as a reference for those who are disoriented in their approach to mathematics. It also recognizes Vietnamese script, making it easy to save the recognized content and embed it in different file formats.

Scope and Objective

• This thesis applies OCR and object detection technologies.
• The main recognition targets are Vietnamese text and math formulas.
• The problems to be solved are math equations.

Overview

This thesis is divided into five parts:

• Machine Learning, Computer Vision, Natural Language Processing, and their applications
• Description of the machine learning models used in the thesis
• Introduction to the MathOCR application
• Experimental results of the ML models
• Conclusion

CHAPTER 1: MACHINE LEARNING, COMPUTER VISION, NATURAL LANGUAGE AND THEIR APPLICATION

1.1 Introduction

This thesis's primary focus is OCR, which is an application of Machine Learning, Computer Vision, and Natural Language Processing. This chapter gives a general view of these research fields and their applications.

1.2 Machine Learning

1.2.1 What is Machine Learning

In recent years, Machine Learning has produced many achievements. It has become one of the most important fields and contributes to human development in the Fourth Industrial Revolution.

Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a model based on sample data, known as "training data", to make predictions or decisions without being explicitly programmed to do so [1].

There are three types of Machine Learning techniques: supervised learning, unsupervised learning, and reinforcement learning.

1.2.2 Supervised Learning

Supervised learning can be separated into two types of problems when data mining: classification and regression [2].

• Classification uses an algorithm to accurately assign test data to specific categories. It recognizes particular entities within the dataset and attempts to conclude how those entities should be labelled or defined. Common classification algorithms are linear classifiers, support vector machines (SVM), decision trees, k-nearest neighbours, and random forests.

• Regression is used to understand the relationship between dependent and independent variables. It is commonly used to make projections, such as the sales revenue of a given business. Linear regression, logistic regression, and polynomial regression are popular regression algorithms.

Figure 1.1 Supervised Learning
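The two supervised problem types above can be illustrated with a minimal sketch. It uses only the Python standard library; the toy data, the function names `knn_predict` and `fit_line`, and the choice of k-nearest neighbours and least-squares linear regression as representatives are assumptions made for illustration, not the models used later in this thesis.

```python
from math import dist


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest labelled points."""
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Classification: made-up labelled 2-D points forming two groups
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((4.0, 4.2), "B"), ((4.1, 3.9), "B"), ((3.8, 4.0), "B")]
label = knn_predict(train, (1.1, 1.0))

# Regression: made-up points that lie exactly on y = 2x + 1
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

In both cases the training data carries the expected answer (a label or a target value), which is what distinguishes supervised learning from the other two types.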

1.2.3 Unsupervised Learning

Unsupervised learning (unsupervised machine learning) uses machine learning algorithms to analyze and cluster unlabeled datasets. These algorithms discover hidden patterns or data groupings without the need for human intervention.

Figure 1.2 Unsupervised Learning

Unsupervised learning models are used for three main tasks: clustering, association, and dimensionality reduction [3].

• Clustering is a data mining technique that groups unlabeled data based on their similarities or differences. Clustering algorithms process raw, unclassified data objects into groups represented by structures or patterns in the information. They can be categorized into a few types, specifically exclusive, overlapping, hierarchical, and probabilistic.

• An association rule is a rule-based method for finding relationships between variables in a dataset. These methods are frequently used for market basket analysis, allowing companies to better understand the relationships between products.

• Dimensionality reduction: While more data generally yields more accurate results, it can also hurt the performance of machine learning algorithms (e.g., overfitting) and make datasets hard to visualize. Dimensionality reduction is a technique used when the number of features, or dimensions, in a dataset is too high. It reduces the number of data inputs to a manageable size while preserving the dataset's integrity as much as possible. It is commonly used in the data preprocessing stage.
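As an illustration of the clustering task described above, the following is a minimal k-means sketch in plain Python. k-means is an example of an exclusive clustering algorithm; the toy points and the simple seeding strategy (taking the first k points as initial centroids) are assumptions made to keep the example deterministic.

```python
from math import dist


def kmeans(points, k, iterations=10):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    centroids = list(points[:k])  # assumption: seed with the first k points
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters


# Made-up unlabeled points forming two visible groups
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (0.8, 1.2), (5.2, 4.8), (4.8, 5.2)]
centroids, clusters = kmeans(points, k=2)
```

No labels are given to the algorithm; the two groups emerge only from the distances between points.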

1.2.4 Reinforcement Learning

Reinforcement learning is an area of Machine Learning concerned with taking suitable actions to maximize reward in a particular situation. It is employed by various software and machines to find the best possible behaviour or path to take in a specific situation.

Reinforcement learning differs from supervised learning in that, in supervised learning, the training data comes with an answer key, so the model is trained with the correct answers. In reinforcement learning there is no answer key; the reinforcement agent decides what to do to perform the given task. In the absence of a training dataset, it is bound to learn from its experience.

There are two types of reinforcement:

• Positive reinforcement means giving something to the subject when it performs the desired action, so that it associates the action with the reward and does it more often. The reward is a reinforcing stimulus.

• Negative reinforcement is the strengthening of a behaviour because a negative condition is stopped or avoided.

Figure 1.3 Reinforcement Learning

1.3 Computer Vision

1.3.1 What is Computer Vision

Computer vision is the field of study of how computers see and understand digital images and videos. It spans all tasks performed by biological vision systems, including "seeing" or sensing a visual stimulus, understanding what is being seen, and extracting complex information into a form that can be used in other processes. This interdisciplinary field simulates and automates these elements of human vision systems using sensors, computers, and machine learning algorithms. Computer vision is the theory underlying the ability of artificial intelligence systems to see and understand their surrounding environment.

1.3.2 Computer Vision tasks

Computer Vision is widely researched and applied. Its tasks include:

• Object Detection is the ability to correctly detect or identify objects in a given image along with their spatial positions, in the form of rectangular boxes (known as bounding boxes) that enclose each object. An example is an image in which laptops, glasses, notebooks, coffee, and an iPhone are each detected inside their bounding boxes.

• Image Classification is another popular computer vision task. It means identifying which class an object belongs to. For example, an image may contain objects belonging to various classes, such as trees, huts, and giraffes.

• Image Captioning means looking at an image and describing what is happening in it, in the form of annotations or labels attached to the picture.

• Image Segmentation means identifying parts of the image and understanding which object they belong to. Segmentation lays the basis for performing object detection and classification.

Figure 1.4 Subfields of Computer Vision

1.3.3 Applications of Computer Vision

Computer vision has many applications because its theory applies to any area where a computer needs to see its surroundings in some form. Examples include [4]:

• Autonomous Vehicles: Self-driving cars need to gather information about their surroundings to decide how to behave.

• Facial Recognition: Businesses and personal electronics use facial recognition technology to "see" who is trying to access something. It has become a powerful security tool.

• Image Search and Object Recognition: Many applications use computer vision theory to identify objects within images, search through catalogues of images, and extract information from images.

• Robotics: Most robotic machines, often in manufacturing, need to see their surroundings to perform the task at hand. In manufacturing, machines may be used to inspect assembly tolerances by "looking at" them.

1.4 Natural Language Processing

1.4.1 What is Natural Language Processing

Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret, and manipulate human language. NLP draws on many disciplines, including computer science and computational linguistics, in its pursuit to fill the gap between human communication and computer understanding.

1.4.2 Natural Language Processing tasks and their application

Many NLP tasks have been developed and researched:

• Text Classification Tasks
  ◦ Representation: bag of words (does not preserve word order)
  ◦ Goal: predict tags, categories, sentiment
  ◦ Application: filtering spam emails, classifying documents based on dominant content

• Word Sequence Tasks
  ◦ Representation: sequences (preserves word order)
  ◦ Goal: language modeling (predict next/previous words), text generation
  ◦ Application: translation, chatbots, sequence tagging (predict POS tags for each word in a sequence), named entity recognition

• Text Meaning Tasks
  ◦ Representation: word vectors, the mapping of words to n-dimensional numeric vectors (also known as embeddings)
  ◦ Goal: how do we represent meaning?
  ◦ Application: finding similar words (similar vectors), sentence embeddings (as opposed to word embeddings), topic modeling, search, question answering

Figure 1.5 Named entity recognition (NER)

• Sequence to Sequence Tasks
  ◦ Many tasks in NLP can be framed as such
  ◦ Examples are machine translation, summarization, simplification, Q&A systems
  ◦ Such systems are characterized by encoders and decoders, which work together to find a hidden representation of text and to use that hidden representation

• Dialog Systems
  ◦ There are two main categories of dialog systems, categorized by their scope of use
  ◦ Goal-oriented dialog systems focus on being useful in a particular, restricted domain; more precision, less generalizable
  ◦ Conversational dialog systems are concerned with being helpful or entertaining in a much more general context; less precision, more generalization
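The remark above that a bag of words does not preserve word order can be checked with a minimal sketch. The tokenizer here (lowercasing and splitting on whitespace) and the function name `bag_of_words` are simplifying assumptions for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens; word order is discarded."""
    return Counter(text.lower().split())


a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")
same_bag = a == b  # the two sentences differ in meaning but share one bag
```

Because both sentences map to the same counts, a bag-of-words representation suits text classification but not tasks that depend on word order, which is why the word sequence tasks in the list use sequence representations instead.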

CHAPTER 2: MODELS, DETAILS IN MATHOCR

2.1 Introduction

In this chapter, I introduce in detail the machine learning models used in the project:

• The Vietnamese text recognition model: VietnameseOCR
• The mathematical formula recognition model: Im2LaTeX
• The Vietnamese text and math formula detection model: YOLOv4

The chapter focuses on the architecture of the models, their arithmetic operations, and the two metrics used to evaluate model quality, BLEU and mAP.

2.2 The Vietnamese recognition model with the Transformer

2.2.1 Introduction

The Transformer was introduced in the paper "Attention Is All You Need" [5]. The paper describes the Transformer as a sequence-to-sequence (Seq2Seq) architecture. A Seq2Seq model is a neural network that transforms a given sequence of elements, such as the words in a sentence, into another sequence.

Another technique that helps improve the accuracy of Seq2Seq models is attention. The attention mechanism looks at an input sequence and decides at each step which other parts of the sequence are important. This sounds abstract, but humans have a similar mechanism. For example, our eyes have a field of view of about 120 degrees both vertically and horizontally, yet we pay attention to only a small part of the image to extract information. This mechanism lets us make decisions without spending much energy while still getting reliable results.

The Transformer transforms one sequence into another with the help of two parts, the Encoder and the Decoder. The Encoder learns a vector representation of a sentence, in the hope that this vector carries the complete information of the sentence. The Decoder converts that vector into the output sequence.

2.2.2 Backbone

The input to the Vietnamese text recognition model is an image containing Vietnamese text. The model needs to extract the image's features and convert the image's information into a vector. Many models pre-trained on the well-known ImageNet dataset are available, including ResNet, VGG, and Xception.

For OCR tasks, models with a vanilla architecture work well and are widely used in practice, notably the VGG family (VGG16, VGG19, ...) [6]. In the Vietnamese text recognition model, the backbone is VGG19, with parameters customized to suit the task. The fine-tuned VGG19 architecture is shown below:

Table 2.1 The VGG19 model Architecture

2.2.3 Encoder

Figure 2.1 The Transformer Architecture [5]

The Encoder consists of N layers. Each layer consists of two sub-layers: Multi-Head Attention and a feed-forward network. A residual connection is applied around each sub-layer, followed by layer normalization, so the output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, where $\mathrm{Sublayer}(x)$ is the function implemented by the sub-layer itself. The output is a vector of size 512.

2.2.4 Decoder

In general, the Decoder has a similar architecture to the Encoder, with N layers. In addition to the two sub-layers found in each Encoder layer, a third sub-layer is added, which performs Multi-Head Attention over the output of the encoder stack. The self-attention sub-layer in the decoder stack is masked to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position $i$ can depend only on the known outputs at positions less than $i$.

2.2.5 Multi-Head Attention

Multi-Head Attention is built from self-attention: so that the model can attend to many different patterns, the authors run several self-attention layers in parallel. Each of these attention layers is called Scaled Dot-Product Attention.

Each word is mapped into the embedding space to form a vector. Three vectors, Q (query), K (key), and V (value), are then created by multiplying the word's vector by the corresponding weight matrices.

• Value vector: a vector representing the content and meaning of the word

Figure 2.2 (left) Scaled Dot-Product Attention, (right) Multi-Head Attention consists of several attention layers running in parallel [5]

Assume the input consists of queries and keys of dimension $d_k$ and values of dimension $d_v$. The query matrix is multiplied by the transposed key matrix to compare each query with each key and learn their correlation, and each product is divided by $\sqrt{d_k}$. The scores are then normalized to the range [0, 1] with the softmax function. The closer a weight is to one, the more similar the query is to the key, and vice versa. Finally, the weights are multiplied by the value matrix:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

Instead of performing a single attention function with $d_{model}$-dimensional keys, values, and queries, the queries, keys, and values are linearly projected $h$ times to $d_k$, $d_k$, and $d_v$ dimensions, respectively.
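The Scaled Dot-Product Attention formula can be traced step by step with a small sketch. It is written in plain Python on lists of rows rather than with PyTorch tensors, and the toy matrices are made up, so it illustrates the arithmetic only and is not the implementation used in the model.

```python
from math import exp, sqrt


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for matrices given as lists of rows."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Compare the query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output


# One query that matches the first key better than the second
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
context = attention(Q, K, V)

# A query that matches both keys equally averages the values
uniform = attention([[0.0, 0.0]], K, V)
```

The first result leans toward the first value row because its key is more similar to the query, while the second result is the plain average of the two value rows.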

Multi-Head Attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O$$

where $\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$ and the projections are parameter matrices:

$$W_i^Q \in \mathbb{R}^{d_{model} \times d_k}, \quad W_i^K \in \mathbb{R}^{d_{model} \times d_k}, \quad W_i^V \in \mathbb{R}^{d_{model} \times d_v}$$

2.2.6 Position-wise Feed-Forward Networks

In addition to attention sub-layers, each of the layers in the Encoder and Decoder contains a fully connected feed-forward network applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between:

$$\mathrm{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2$$
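A minimal sketch of this formula for a single position is given below. The weights are small made-up values (2 to 3 to 2 dimensions), chosen so that the ReLU visibly zeroes one hidden unit; a trained model learns these weights and uses far larger dimensions.

```python
def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: max(0, x W1 + b1) W2 + b2 for one position."""
    # First linear transformation followed by ReLU
    hidden = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    # Second linear transformation
    return [sum(hj * W2[j][k] for j, hj in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]


# Hypothetical toy weights
W1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.1, -0.1]
out = ffn([1.0, -1.0], W1, b1, W2, b2)
```

The same function, with the same weights, would be applied independently to every position of the sequence, which is what "position-wise" means here.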

2.2.7 Positional Encoding

Positional Encoding represents a word's position in the sentence alongside its value (being at the beginning is not the same as being in the middle or at the end). The positional encodings have the same dimension $d_{model}$ as the embeddings, so that the two can be summed:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$.
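The sinusoidal encoding can be computed directly from its definition. The sketch below builds the full table in plain Python; the function name `positional_encoding` and the small sizes are illustrative assumptions.

```python
from math import cos, sin


def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for dim in range(d_model):
            i = dim // 2  # even index 2i and odd index 2i+1 share the same frequency
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(sin(angle) if dim % 2 == 0 else cos(angle))
        table.append(row)
    return table


pe = positional_encoding(num_positions=4, d_model=8)
```

Each row has the same length as an embedding vector, so it can be added element-wise to the embedding of the token at that position.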

2.3 The Image to Latex model

LaTeX has become the most popular tool for typesetting mathematical expressions in math tests and research papers. Extracting math formulas from digital documents and translating them into markup languages is especially useful for a wide range of information retrieval tasks. Recognizing math formulas in different document types, such as PDF, images, and MS Word, is difficult because formulas often contain unusual symbols and complex layout structures. This model translates images of math formulas into their LaTeX markup sequences.

2.3.2 Model Architecture

In this section, I introduce the Im2LaTeX model in detail. The model is built on an Encoder-Decoder structure, and the Decoder uses adaptive attention, as shown in Figure 1.

2.3.3 Encoder

2.3.3.1 Convolution

The Encoder uses a vanilla CNN architecture to extract features from input images. The architecture is based on VGG Net [6], adapted specifically for OCR applications. The CNN consists of convolution, pooling, and activation layers. Table 2.2 shows the structure of the CNN model.

Table 2.2 The CNN model configurations: channel: the number of output channel of features, k: kernel size, p: padding size, s:

Assume the original image has size $H \times W$. The Encoder maps the image into a new feature space of size $H' \times W' \times D$, where $D$ is the number of channels of the last convolution layer. Thus, the image is represented by $k = H' \times W'$ regions, each described by a $D$-dimensional feature vector.

2.3.3.2 Positional Encoding

Positional Encoding, the technique introduced in the paper "Attention Is All You Need" [5], is applied here to the image representation. In a math formula, the spatial relationships among symbols run in different directions (left-right, top-bottom, ...). These positional relationships carry critical math semantics, so special effort is needed to preserve spatial locality.

The idea of positional encoding is shown below:

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE(pos, 2i + 1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

where $pos$ is the position of word $w$.

2.3.4 Decoder

After receiving the feature maps from the Encoder, the Decoder decodes them into LaTeX code. RNNs and their variants, such as LSTMs and GRUs, are well suited to sequence tasks because they maintain a history of previous predictions and can traverse a sequence of arbitrary length from start to end.

2.3.4.1 Token Embedding

A computer cannot work with math symbols directly; it only operates on numbers, so the symbols must be represented numerically. There are many ways to do this. Here the word embedding approach from NLP is used, in which each token is projected into a high-dimensional vector:

$$w_t = \mathrm{embedding}(y_t)$$

where $y_t$ is the $t$-th token of a math formula and $w_t$ is the corresponding embedding vector.
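The embedding step amounts to a table lookup, which the following sketch shows. The vocabulary of LaTeX tokens, the embedding dimension, and the random initial values are all hypothetical; in a trained model the table rows are learned parameters.

```python
import random

# Hypothetical toy vocabulary of LaTeX tokens
vocab = ["<pad>", "<start>", "<end>", "x", "^", "2", "+", "\\frac", "{", "}"]
token_to_id = {tok: idx for idx, tok in enumerate(vocab)}

embedding_dim = 4  # real models use far more dimensions
rng = random.Random(0)
# One row per token; in a trained model these values are learned
embedding_table = [[rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
                   for _ in vocab]


def embed(tokens):
    """Map a token sequence y_1..y_T to embedding vectors w_1..w_T."""
    return [embedding_table[token_to_id[tok]] for tok in tokens]


vectors = embed(["x", "^", "2", "+", "x"])
```

The same token always maps to the same vector, so both occurrences of `x` in the example share one embedding.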

2.3.4.2 LSTM network

In this part, the LSTM network is applied because it is very efficient and handles the vanishing gradient problem [7]. To be more specific:

$$h_t = \mathrm{LSTM}(x_t, h_{t-1}, m_{t-1})$$

where $h_t$ is the hidden state of the network at time $t$, $x_t$ is the input vector, and $m_{t-1}$ is the previous memory cell. With the hidden state $h_t$ and the image features $V$, a context vector $c_t$ is computed as:

$$c_t = g(V, h_t)$$

where $g$ is the attention function. To be more specific, $V$ and $h_t$ are both fed into a single-layer neural network followed by a softmax function, which gives the attention distribution over the $k$ regions of the image. Here $W_v$ and $W_g$ are trainable weights and $\alpha$ is the attention weight over the features in $V$.

Unlike the traditional spatial attention model, my Decoder implements an adaptive attention model, which is able to determine whether it needs to attend to the image to predict the next word. It does so through a newly introduced variable $s_t$ named the "visual sentinel", computed as follows:

$$g_t = \sigma(W_x x_t + W_h h_{t-1})$$

$$s_t = g_t \odot \tanh(m_t)$$

where $g_t$ is the gate applied to the memory cell $m_t$, $W_x$ and $W_h$ are trainable weights, $x_t$ is the input at time $t$, $\sigma$ is the logistic sigmoid function, and $\odot$ indicates the element-wise product. The new context vector $\hat{c}_t$ is then computed as:

$$\hat{c}_t = \beta_t s_t + (1 - \beta_t) c_t$$

where $\beta_t$ is the new sentinel gate at time $t$, in the range [0, 1]. To compute $\beta_t$, a new element is added to $z_t$, which indicates how much "attention" the network is placing on the sentinel (as opposed to the image features), to obtain a new attention distribution $\hat{\alpha}_t$:

$$\hat{\alpha}_t = \mathrm{softmax}\left(\left[z_t;\ w_h^T \tanh(W_s s_t + W_g h_t)\right]\right)$$

where $[\cdot;\cdot]$ denotes concatenation. Thus, $\hat{\alpha}_t$ has size $k + 1$, and $\beta_t = \hat{\alpha}_t[k + 1]$. With $\hat{c}_t$, the model can adaptively attend to either the image or the visual sentinel when generating the next word, while the sentinel vector is updated at each time step.
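The adaptive attention step can be sketched in plain Python as follows. Two points are assumptions rather than statements from the text above: the region scores are taken to have the form $z_{t,i} = w_h^T \tanh(W_v v_i + W_g h_t)$, mirroring the sentinel score, and the toy dimensions and identity weight matrices are made up so that the result is easy to check by hand. The function name `adaptive_context` is hypothetical.

```python
from math import exp, tanh


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def adaptive_context(V, h_t, s_t, W_v, W_g, W_s, w_h):
    """Return (c_hat, beta) for k region features V, hidden state h_t, sentinel s_t."""
    g_h = matvec(W_g, h_t)
    # Score of each image region (assumed form): w_h^T tanh(W_v v_i + W_g h_t)
    z = [dot(w_h, [tanh(a + b) for a, b in zip(matvec(W_v, v), g_h)]) for v in V]
    alpha = softmax(z)
    # Spatial context vector c_t: attention-weighted sum of region features
    c_t = [sum(a * v[j] for a, v in zip(alpha, V)) for j in range(len(V[0]))]
    # Extra score for the sentinel, then softmax over k + 1 entries
    z_s = dot(w_h, [tanh(a + b) for a, b in zip(matvec(W_s, s_t), g_h)])
    alpha_hat = softmax(z + [z_s])
    beta = alpha_hat[-1]  # last entry is the sentinel gate
    c_hat = [beta * s + (1 - beta) * c for s, c in zip(s_t, c_t)]
    return c_hat, beta


identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical weights
c_hat, beta = adaptive_context(
    V=[[1.0, 0.0], [0.0, 1.0]], h_t=[0.0, 0.0], s_t=[1.0, 0.0],
    W_v=identity, W_g=identity, W_s=identity, w_h=[1.0, 1.0])
```

In this toy setting the two regions and the sentinel receive equal scores, so the sentinel gate is one third and the adapted context is a blend of the sentinel and the average region feature. A larger $\beta_t$ would mean the decoder relies more on its own memory than on the image when predicting the next token.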

2.4 The YOLOv4 Model

Object detection is one of the most popular tasks in Computer Vision, and many models achieve excellent performance. One problem I need to solve in this thesis is identifying the elements of a math problem in an image: determining which regions contain math formulas and which contain Vietnamese text. In this section, I introduce the YOLOv4 model, which is used to solve this problem.

2.4.2 YOLOv4 Architecture

Usually, an object detection model is composed of several parts [8]:

• Input: Image, Patches, Image Pyramid
• Backbone: VGG16, ResNet-50, SpineNet, EfficientNet-B0/B7, CSPResNeXt50, CSPDarknet53
• Neck:
  ◦ Additional blocks: SPP, ASPP, RFB, SAM
  ◦ Path-aggregation blocks: FPN, PAN, NAS-FPN, Fully-connected FPN, BiFPN, ASFF, SFAM
• Heads:
  ◦ Faster R-CNN, R-FCN, Mask R-CNN
  ◦ RepPoints
