MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
BUI HAI PHONG
ENHANCING PERFORMANCE OF MATHEMATICAL
EXPRESSION DETECTION IN SCIENTIFIC
DOCUMENT IMAGES
DOCTORAL DISSERTATION IN
COMPUTER SCIENCE
Hanoi, 2021
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

BUI HAI PHONG

ENHANCING PERFORMANCE OF MATHEMATICAL
EXPRESSION DETECTION IN SCIENTIFIC
DOCUMENT IMAGES

Major: Computer Science
Code: 9480101
DECLARATION OF AUTHORSHIP
I, Bui Hai Phong, declare that the thesis titled "Enhancing performance of mathematical expression detection in scientific document images" has been entirely composed by myself. I assert the following points:
This work was done wholly or mainly while in candidature for a Ph.D. research degree at Hanoi University of Science and Technology.
The work has not been submitted for any other degree or qualification at Hanoi University of Science and Technology or any other institution.
Appropriate acknowledgement has been given within this thesis where reference has been made to the published work of others.
The thesis submitted is my own, except where work done in collaboration has been included. The collaborative contributions have been clearly indicated.
Hanoi, September, 2021
PhD Student
SUPERVISORS
1. Assoc. Prof. Hoang Manh Thang
2. Assoc. Prof. Le Thi Lan
ACKNOWLEDGEMENT

I decided to pursue a PhD in Computer Science at MICA International Research Institute, Hanoi University of Science and Technology (HUST) in 2017. It has been one of the best decisions I could have made. HUST is a really special place where I have accumulated immense knowledge. I would like to thank the Executive Board and all members of MICA Research Institute, HUST for their kind support during the PhD course.
I wish to express my deepest gratitude to my supervisors, Assoc. Prof. Hoang Manh Thang and Assoc. Prof. Le Thi Lan, for their continuous instruction, advice and support during the PhD course. The thesis could not have been fulfilled without the specific direction of my supervisors.
I wish to thank all members of the Computer Vision Department, MICA Research Institute, HUST for their frequent support during the PhD course.
I wish to thank the Executive Board and all members of the School of Graduate Education, the School of Electronics and Telecommunications and the School of Information and Communication Technology, HUST for their specific comments and suggestions on the thesis.
I wish to thank all members of the Faculty of Information Technology, Hanoi Architectural University for their support in the professional work during the completion of the PhD.
I wish to thank Professor Akiko Aizawa and the members of the Aizawa Laboratory, National Institute of Informatics, Tokyo, Japan, where I obtained much scientific experience during my PhD internship.
I wish to thank the anonymous reviewers for their valuable comments during the completion of the PhD.
I gratefully acknowledge the funding from the SAHEP HUST project number T2020-SAHEP-008 and the Domestic Master/PhD Scholarship Programme of Vingroup Innovation Foundation 2019-2021.
I wish to express my sincere gratitude to my family and friends for their continuous support and encouragement in the completion of the PhD.
Hanoi, 2021
Ph.D. Student
ABSTRACT

Mathematical expressions (MEs) play an important role in scientific documents, and a huge number of scientific documents have been produced over the years. Therefore, the demand for document digitization for research and study purposes has continuously increased. The detection and recognition of MEs in documents are considered essential steps for document digitization. The detection of expressions aims to locate the position of expressions within documents, while the recognition of MEs aims at converting expressions from image format to strings. In documents, mathematical expressions are classified into two categories: isolated (displayed) and inline (embedded) expressions. An isolated expression is displayed on a separate line, whereas an inline expression is mixed with other components (texts). Mathematical expressions may consist of mathematical operators (e.g. +, -, ×, ÷), functions (log, sin, cos) and variables (i, j, r). Large expressions may span multiple text lines, while small expressions may consist of a single character. The accuracy of the detection of isolated expressions has been gradually improved; however, the detection of inline expressions remains a challenging task. In practice, the detection and recognition of MEs in document images are closely related: accurate detection enables accurate recognition, whereas incorrect detection causes errors in the recognition of MEs.
This thesis presents three main contributions to the detection and recognition of MEs in scientific document images:
(1) First, a hybrid two-stage method has been proposed for the effective detection of MEs. At the first stage, the layout analysis of entire document images is introduced to improve the accuracy of text line and word segmentation. At the second stage, both isolated and inline MEs in document images are detected. Both handcrafted and deep learning features are extensively investigated and combined to improve the detection accuracy. In the handcrafted feature extraction approach, the Fast Fourier Transform (FFT) is applied to text line images for the detection of isolated MEs, and the Gaussian parameters of the projection profile are used as features for the detection of inline MEs. After the feature extraction, various machine learning classifiers have been fine-tuned for the detection. In the deep learning approach, CNNs (AlexNet and ResNet) have been optimized for the detection of MEs. A fusion of handcrafted and deep learning features based on the prediction scores has been applied. The merit of the method is that it can operate directly on the ME images without employing character recognition.
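The FFT-based feature extraction mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the function name and the choice of four summary statistics are hypothetical, and the thesis combines such features with projection-profile and deep learning features before classification.

```python
import numpy as np

def fft_line_features(line_img):
    """Compute simple FFT magnitude/phase statistics for a binarized
    text-line image (2-D array, text pixels = 1, background = 0)."""
    # Centre the zero-frequency component for a symmetric spectrum
    spectrum = np.fft.fftshift(np.fft.fft2(line_img))
    magnitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    # A compact feature vector; a classifier (e.g. SVM or Random Forest)
    # is trained on such vectors to separate isolated MEs from text lines.
    return np.array([magnitude.mean(), magnitude.std(),
                     phase.mean(), phase.std()])
```

An isolated ME line (tall, sparse symbols with sub/superscripts) and a dense ordinary text line tend to produce different spectral statistics, which is what the classifier exploits.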
(2) Second, an end-to-end framework for mathematical expression detection in scientific document images is proposed without using any Optical Character Recognition (OCR) or document analysis techniques, as conventional methods do. The distance transform is first applied to the input document images in order to take advantage of the distinctive spatial-layout features of MEs. Then, the transformed images are fed into a Faster Region-based Convolutional Neural Network (Faster R-CNN) that has been optimized to improve the detection accuracy. Specifically, optimization and generation strategies for the anchor boxes of the Region Proposal Network have been proposed to improve the detection accuracy for expressions of various sizes. The proposed methods for the detection of MEs have been tested on two public datasets (Marmot and GTDB). The obtained accuracies for isolated and inline expressions are 92.09% and 85.90% on the Marmot dataset, and 91.04% and 85.15% on the GTDB dataset, respectively. The performance comparison with conventional methods shows the effectiveness of the proposed method.
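As a rough sketch of this pre-processing step, the distance transform of a binarized page can be computed and rescaled before being fed to the detector. This is a hedged illustration assuming SciPy's `distance_transform_edt` and a simple 8-bit normalization; the exact scaling and RGB conversion used in the thesis may differ.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def distance_transform_page(binary_page):
    """binary_page: 2-D array with ink pixels = 0 and background = 1.
    Each background pixel receives its Euclidean distance to the
    nearest ink pixel; the map is then rescaled to the 8-bit range so
    it can be treated as a grayscale image by a CNN detector."""
    dist = distance_transform_edt(binary_page)
    if dist.max() > 0:
        dist = dist / dist.max() * 255.0
    return dist.astype(np.uint8)
```

The resulting map encodes the spatial layout around symbols (e.g. the wide margins surrounding an isolated expression), which is the cue the Faster R-CNN is given in place of raw binary pixels.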
(3) Finally, the detection and recognition of MEs have been integrated into a system. The MEs in document images are detected and recognized, and the recognition results are represented in LaTeX. The application aims to support end users in detecting and recognizing MEs in document images conveniently.
Hanoi, 2021
Ph.D. Student
CONTENTS

DECLARATION OF AUTHORSHIP i
ACKNOWLEDGEMENT ii
ABSTRACT iii
CONTENTS viii
ABBREVIATIONS viii
LIST OF TABLES xi
LIST OF FIGURES xviii
INTRODUCTION 1
0.1 Motivation 1
0.2 Hypotheses 1
0.3 Objectives of the thesis 2
0.4 Introduction of the ME detection and recognition 2
0.4.1 Introduction of MEs 2
0.4.2 Introduction of ME detection 4
0.4.3 Introduction of ME recognition 6
0.5 Contributions of this thesis 7
0.6 Structure of this thesis 8
CHAPTER 1 LITERATURE REVIEW 10
1.1 Document analysis 11
1.2 ME detection methods in document images 13
1.2.1 Rule-based detection 13
1.2.2 Handcrafted feature extraction methods for the ME detection 14
1.2.3 Deep neural network for ME detection 15
1.2.3.1 Deep neural networks 15
1.2.3.2 Deep neural network models for ME detection 21
1.3 ME recognition 22
1.3.1 Traditional approaches for ME recognition 22
1.3.2 Neural network approaches for ME recognition 25
1.4 Datasets and evaluation metrics 27
1.4.1 Datasets 27
1.4.2 Evaluation metrics 30
1.5 Existing systems for ME recognition 33
1.6 Summary of the chapter 35
CHAPTER 2 THE DETECTION OF MEs USING THE LATE FUSION OF HANDCRAFTED AND DEEP LEARNING FEATURES 37
2.1 Overview of the proposed method 37
2.2 Page segmentation 38
2.3 Handcrafted feature extraction for ME detection 42
2.3.1 Handcrafted feature extraction for isolated ME detection 43
2.3.2 Handcrafted feature extraction for inline ME detection 47
2.4 Deep learning method for ME detection 53
2.5 Late fusion of handcrafted and deep learning features for ME detection 55
2.6 Post-processing for ME detection 58
2.7 Experimental results 60
2.7.1 Performance evaluation of the detection of MEs using different machine learning algorithms 61
2.7.2 Performance evaluation of the detection of MEs using the fusion of handcrafted and deep learning features with different operations 63
2.7.3 Performance evaluation of the detection of isolated and inline MEs on different public datasets 63
2.7.4 Evaluation of the impact of image resolution on the ME detection 66
2.7.5 Evaluation of the impact of the post-processing 67
2.7.6 Visualization of extracted features of images using the handcrafted and deep learning feature approaches 68
2.7.7 Error analysis and discussion 73
2.7.8 Measurement of execution time 75
2.8 Summary of the chapter 76
CHAPTER 3 THE DETECTION OF MEs USING THE COMBINATION OF THE DISTANCE TRANSFORM AND FASTER R-CNN 78
3.1 Overview of the proposed method for ME detection using the DT and the Faster R-CNN 78
3.2 The detection of MEs using the DT and the Faster R-CNN 79
3.2.1 Distance transform of document image 79
3.2.2 ME detection using a Faster R-CNN 82
3.2.2.1 Region proposal network 83
3.2.2.2 Fully connected detection network 86
3.2.2.3 Loss function of the training Faster R-CNN 87
3.3 Experimental results 89
3.3.1 Loss function of the training process of Faster R-CNN 89
3.3.2 Evaluation of the impact of the DT and anchor box generation on the performance of the ME detection 91
3.3.3 Comparison of Faster R-CNN models in ME detection 94
3.3.4 Comparison of the proposed and state-of-the-art methods used in ME detection 95
3.3.5 Performance comparison of the proposed method on cross datasets 97
3.3.6 Illustration of feature extraction of the Resnet-50 98
3.3.7 Error analysis and discussion 102
3.3.8 Measurement of execution time 103
3.4 Summary of the chapter 104
CHAPTER 4 THE DETECTION AND RECOGNITION OF MEs IN DOCUMENT IMAGES 105
4.1 Overview of the proposed system for the detection and recognition of MEs 105
4.2 ME recognition using the WAP network 106
4.2.1 Watcher module of the WAP network 107
4.2.2 Parser module of the WAP network 108
4.2.3 Training the WAP network 112
4.3 Experimental results 113
4.3.1 Performance evaluation of the detection and recognition of MEs 114
4.3.2 Error analysis and discussion 119
4.3.3 Measurement of execution time 120
4.4 Summary of the chapter 120
CONCLUSIONS 121
PUBLICATIONS 123
Bibliography 125
ABBREVIATIONS

No. Abbreviation Meaning
3 ExpRate Expression Recognition Rate
5 Faster R-CNN Faster Region with Convolutional Neural Network
12 Mask R-CNN Mask Region with Convolutional Neural Network
22 t-SNE t-Distributed Stochastic Neighbor Embedding
LIST OF TABLES
1.1 Results of document analysis of participating methods in the 2019 competition 13
1.2 Summary of significant handcrafted features for isolated ME detection 15
1.3 Summary of significant handcrafted features for inline ME detection 15
1.4 Milestones in the development of DNNs 16
1.5 Parameters of Alexnet 16
1.6 Parameters of Resnet18 17
1.7 Statistics of the Marmot and GTDB datasets 27
2.1 Features of VPP of variable and word images in Figure 2.11 51
2.2 Comparison of VPP features between italic and non-italic styles of character "a" of Arial font 51
2.3 Alexnet architecture and layer parameters 55
2.4 ResNet-18 architecture and layer parameters 56
2.5 Performance comparison on isolated expression detection on the Marmot dataset using different machine learning algorithms (highest scores are in bold) 60
2.6 Performance comparison on inline expression detection on the Marmot dataset using different machine learning algorithms (highest scores are in bold) 60
2.7 Performance comparison on isolated expression detection on the Marmot dataset using different fusion techniques (highest scores are in bold) 61
2.8 Performance comparison on inline expression detection on the Marmot dataset using different fusion techniques (highest scores are in bold) 61
2.9 Performance comparison between the proposed and existing methods of isolated expression detection on the Marmot dataset (highest scores are in bold) 64
2.10 Performance comparison between the proposed and existing methods of inline expression detection on the Marmot dataset (highest scores are in bold) 64
2.11 Performance comparison between the proposed and existing methods of isolated expression detection on the GTDB dataset (highest scores are in bold) 65
2.12 Performance comparison between the proposed and existing methods of inline expression detection on the GTDB dataset (highest scores are in bold) 65
2.13 Performance comparison of the proposed and the state-of-the-art methods on the GTDB dataset 66
2.14 The average time of the detection of expressions in a document page in the Marmot dataset by different methods (bold value indicates the smallest detection time) 75
3.1 The ID and size of optimal anchor boxes for isolated ME detection 85
3.2 The ID and size of optimal anchor boxes for inline ME detection 86
3.3 Faster R-CNN architecture and layer parameters using ResNet-50 backbone 87
3.4 Performance comparison of the proposed method for isolated expression detection on the Marmot dataset applying various parameters in the RGB conversion (the highest performance is in bold) 90
3.5 Performance comparison of the proposed method for inline expression detection on the Marmot dataset applying various parameters in the RGB conversion (the highest performance is in bold) 91
3.6 Performance evaluation of the proposed method for isolated expression detection on the Marmot dataset (highest performance is in bold) 92
3.7 Performance evaluation of the proposed method for inline expression detection on the Marmot dataset (highest performance is in bold) 93
3.8 Performance evaluation of the proposed method for isolated expression detection on the GTDB dataset (highest performance is in bold) 93
3.9 Performance evaluation of the proposed method for inline expression detection on the GTDB dataset (highest performance is in bold) 93
3.10 Performance comparison between the proposed and existing methods of isolated expression detection on the Marmot dataset (highest scores of the proposed method are in bold) 93
3.11 Performance comparison between the proposed and existing methods of inline expression detection on the Marmot dataset (highest scores of the proposed method are in bold) 94
3.12 Performance comparison between the proposed and existing methods of isolated expression detection on the GTDB dataset (highest scores of the proposed method are in bold) 94
3.13 Performance comparison between the proposed and existing methods of inline expression detection on the GTDB dataset (highest scores of the proposed method are in bold) 94
3.14 Performance evaluation of isolated ME detection on the Marmot dataset using the Faster R-CNN models trained on different datasets (highest performance is in bold) 98
3.15 Performance evaluation of inline ME detection on the Marmot dataset using the Faster R-CNN models trained on different datasets (highest performance is in bold) 98
3.16 Performance comparison of the proposed and the state-of-the-art methods on the GTDB dataset 98
3.17 The average time (in seconds) of the detection of expressions in a document page in the Marmot dataset by different methods (bold value indicates the smallest detection time) 103
4.1 Performance evaluation of ME recognition using the WAP model on the Marmot dataset 114
LIST OF FIGURES
1 Examples of printed historic documents [2] 3
2 Examples of MEs in handwritten (a), image (b) and pdf (c) format 4
3 Example of the detection (a) and a detected ME in a document image (b). Isolated and inline MEs are denoted in red and blue, respectively. The extracted ME is recognized and represented using LaTeX (c) 5
4 Examples of the isolated and inline expressions in a sample document page that are marked in red and blue bounding boxes, respectively 6
1.1 Structure of the literature review of the detection and recognition of MEs in document images 10
1.2 Examples of incorrect text line segmentation of large MEs. Large MEs are split into multiple text lines 12
1.3 Architecture of the Alexnet 17
1.4 Residual block of Resnet 18
1.5 Popular activation functions of CNNs [54] 18
1.6 Architecture of the R-CNN 19
1.7 Architecture of the Fast R-CNN 20
1.8 Architecture of the Faster R-CNN 22
1.9 Examples of the construction of a graph for an input ME. Nodes are symbols; graph edges represent relationships between symbols [74] 24
1.10 Example of the symbol segmentation for printed ME recognition [77] 25
1.11 Example of a parse tree for printed ME recognition [77] 26
1.12 The occurrence of isolated (a) and inline (b) MEs in document pages in the Marmot dataset. The x-axis represents the number of expressions and the y-axis represents the Page ID in the dataset. The structure of the XML file storing ground-truth information of expressions (c) 29
1.13 Examples of isolated MEs and ground truth in the Marmot dataset. Position information and the content of characters of the isolated MEs are annotated 30
1.14 Examples of inline MEs and ground truth in the Marmot dataset. Position information and the content of characters of the inline MEs are annotated 31
1.15 Examples of MEs and ground truth in the GTDB dataset. Each character is annotated as a textual or mathematical symbol. Position information of each character is also annotated 32
1.16 Examples of download links for some PDF documents in the GTDB dataset 33
1.17 Overview of a Mathematical search system [86] 35
1.18 Example of a text to speech application for visually impaired people [90] 35
2.1 Overall description of the proposed system for the ME detection 38
2.2 Example of the text line segmentation in a sample document image. The input sample page (a), the horizontal projection profile of the page image (b) and the text line segmentation of the page (c). The x-axis represents the sum of black pixels of each row in the page image and the y-axis represents the rows of the image 39
2.3 The word segmentation of the text line image (a) based on the estimation of the vertical projection profile (b) and the results (c). The x-axis represents the columns of the text line image and the y-axis represents the sum of black pixels of each column 40
2.4 The flowchart of the isolated and inline expression detection by using handcrafted feature extraction 41
2.5 An example of a text line (a) and an isolated expression (b) 44
2.6 An example of image reconstruction: original image (a) and reconstruction by magnitude (b) and phase (c) 44
2.7 Example of Power Spectrum of an image of ordinary text and displayed ME 44
2.8 Diagram of classification of normal text lines and isolated MEs 46
2.9 Distribution of FFT phase and magnitude of isolated and text line images 47
2.10 Example of the detection of isolated MEs in a sample page 47
2.11 Examples of peaks and valleys of VPP of variable and word images 48
2.12 Image of character "a" in italic (a) and non-italic (c) styles in Arial font. Peaks and valleys of VPP of the character in italic (b) and non-italic (d) styles 49
2.13 Bounding boxes of characters of variable ai (a), peaks and valleys of VPP (b) and HPP (c) of the variable 50
2.14 Flowchart of the classification of inline ME and word images 52
2.15 Example of inline ME detection in a sample page using the handcrafted feature extraction 52
2.16 The flowchart of the isolated and inline expression classification by using the transfer learning of CNNs 53
2.17 The loss function of the training process of the AlexNet 54
2.18 Flowchart of early fusion 56
2.19 Flowchart of late fusion 56
2.20 The flowchart of the late fusion of handcrafted and deep learning features in the classification of isolated MEs 57
2.21 The flowchart of the late fusion of handcrafted and deep learning features in the classification of inline MEs 57
2.22 Example of the post-processing of a mathematical expression that is split into two text lines. The detected and ground-truth expressions are marked in blue and red, respectively: (a) before and (b) after the post-processing 60
2.23 The relationship between the classification error (y-axis) and the number of trees of RF (x-axis) 61
2.24 Percentage of detected isolated and inline expressions in the partial detection category. The partial detection category is divided into five equal sub-ranges based on IoU values 62
2.25 Performance evaluation of the detection of isolated (a) and inline (b) expressions in document images at various resolutions. The testing images in the Marmot dataset are rendered at 500, 300 and 150 dpi. The performance of the detection is denoted by blue, orange and gray for document images at 500, 300 and 150 dpi, respectively 67
2.26 Examples of the detection of inline expressions in a sample page at 500 (a) and 150 (b) dpi. The inline expressions detected by the proposed system and ground-truth expressions are marked in black and red, respectively 68
2.27 Performance comparison of the proposed method before (in blue) and after (in orange) the post-processing in the detection of isolated (a) and inline (b) expressions 69
2.28 The feature distribution of isolated ME and text line images. The extracted features of isolated expressions and normal text lines in the Marmot dataset using the ResNet-18 are illustrated in red and blue, respectively. The extracted features of text lines and isolated expressions are visualized using the t-SNE dimensional reduction with the Mahalanobis (a) and Cosine (b) distance metrics 70
2.29 Illustration of extracted features of inline ME (red) and word (black) images in the Marmot dataset. The t-SNE is used for feature visualization with the Mahalanobis (a) and Cosine (b) distance metrics 71
2.30 Feature distribution of each English character in italic and non-italic styles in (a) lower-case and (b) upper-case in Arial font 72
2.31 Feature distribution of each English character in italic and non-italic styles in (a) lower-case and (b) upper-case in Times New Roman font 73
2.32 Examples of the isolated and inline expression detection in one-column (a) and two-column (b) pages in the Marmot dataset. The detected isolated, inline and ground-truth expressions are marked in blue, black and red, respectively 74
2.33 Examples of the expression detection in a sample page in the GTDB dataset. The detected and ground-truth expressions are marked in blue and red, respectively 74
2.34 Examples of the false (a) and missed (b) detection of inline expressions. The inline expressions detected by the proposed system and ground-truth expressions are marked in black and red, respectively 75
3.1 Flowchart of the proposed method for the ME detection using the DT and the Faster R-CNN. The detected isolated and inline MEs are denoted in blue and black, respectively 78
3.2 Faster R-CNN based on ResNet-50 in this thesis consists of an RPN and fully connected detection sub-networks. The detected MEs are marked in blue 79
3.3 Distance transform of input image using various distance metrics. Binary input images are converted to grayscale images using the DT with Euclidean, City Block and Chessboard metrics 81
3.4 (a) Input document image and (b) image after the distance transform with the Euclidean metric. The colour bar shows the colour scale of the image vertically 82
3.5 Page image after RGB conversion using the (a) Euclidean and (b) City Block metrics. The width and height of the image are shown on the x- and y-axes. The colour bar shows the colour scale of the image vertically 83
3.6 The visualization of width and height of isolated and inline MEs in the Marmot training dataset. The x-axis and y-axis represent the width and height of MEs 84
3.7 Estimation of the number of anchor boxes used in the Region Proposal Network based on the overlap ratio between the anchor boxes and MEs in the ground truth. The x-axis and y-axis present the overlap ratio and the number of anchor boxes, respectively 85
3.8 Architecture of the RPN for ME detection. The sets of 12 and 15 anchor boxes are optimized to detect inline and isolated MEs, respectively 86
3.9 Examples of the ME detection by applying various anchor boxes. Small, medium and large anchor boxes are in black; ground truth is in red; the optimal anchor box is in blue 86
3.10 Example of training data for Faster R-CNN. Training data consists of document images' names 88
3.11 Training progress of the Faster R-CNN for expression detection on the Marmot dataset. The x-axis and y-axis represent the iterations and loss values of the training, respectively 89
3.12 Performance comparison of the proposed method applying the Distance Transform with Euclidean, City Block and Chessboard metrics for expression detection in page images on the Marmot dataset, in blue, orange and gray, respectively 90
3.13 Performance comparison of the proposed method applying the Distance Transform with Euclidean, City Block and Chessboard metrics for expression detection in page images on the GTDB dataset, in blue, orange and gray, respectively 91
3.14 Performance comparison of Faster R-CNN models for the detection of isolated (a) and inline (b) MEs on the Marmot dataset. The detection results (correct, partial, missed and false categories) of the original Faster R-CNN model for isolated and inline MEs are denoted in blue; the detection results of the proposed Faster R-CNN model are denoted in orange 92
3.15 The extracted features of isolated (a), inline expressions (b) and background regions in the Marmot dataset using the ResNet-50, illustrated in red and blue, respectively. The feature distribution is visualized using the t-SNE dimensional reduction with the Mahalanobis distance metric 96
3.16 Examples of the isolated expression detection in a one-column page of the Marmot dataset. The detected and ground-truth expressions are marked in blue and red, respectively 97
3.17 Examples of the isolated expression detection in a two-column page of the Marmot dataset. The detected and ground-truth expressions are marked in blue and red, respectively 97
3.18 Examples of the detection of isolated and inline expressions in a two-column page of the Marmot dataset. The detected isolated, inline and ground-truth expressions are marked in blue, black and red, respectively 97
3.19 Examples of the ME detection in a page in the GTDB dataset. The detected and ground-truth MEs are marked in blue and red, respectively 98
3.20 Examples of the expression detection in a two-column page. The isolated and inline expressions detected by the proposed method and ground-truth expressions are marked in blue mask, black and red, respectively 99
3.21 Examples of the expression detection in a one-column page. The isolated and inline expressions detected by the proposed method and ground-truth expressions are marked in blue mask, black and red, respectively 100
3.22 Examples of a page image (a); page segmentation (b); the multistage detection of expressions (c); the end-to-end detection of expressions (d). The inline expressions detected by the proposed system and ground-truth expressions are marked in black and red, respectively 101
3.23 Examples of the expression detection with (a) and without (b) the DT. The isolated and inline expressions detected by the proposed method and ground-truth expressions are marked in blue, black and red, respectively 102
3.24 Examples of the missed (a) and false (b) detection of inline expressions. The inline expressions detected by the proposed system and ground-truth expressions are marked in black and red, respectively 103
4.1 A traditional framework for ME recognition. Segmentation, recognition and layout analysis of symbols are performed to obtain the final expression content represented in LaTeX 106
4.2 Architecture of a sequence to sequence model (a) and an image to markup model (b) 107
4.3 Flowchart of the detection and recognition of MEs in a document image. The detected MEs are recognized and represented in the LaTeX format 108
4.4 Architecture of the WAP network for Printed Mathematical Expression Recognition 108
4.5 Architecture of the FCN for the Watcher module. The size (height, width) of the input image is denoted as [HxW]; the output feature vector is [HxW] 109
4.6 Architecture of the GRU for the Parser module 110
4.7 Examples of mathematical symbol labels for training the WAP network 113
4.8 The loss function of the training WAP network 114
4.9 Comparison of different systems on ME recognition 115
4.10 Graphical user interface of the application for ME detection 116
4.11 Examples of correct ME recognition. The input ME image (a) has been recognized and the results have been represented by LaTeX sequences (b) 117
4.12 Examples of incorrect ME recognition. Input ME image (a), the ground truth (b), the recognition result in LaTeX (c). Recognition results lack characters (in blue) compared to ground truth sequences 117
4.13 Examples of incorrect ME recognition. Input ME image (a), the ground truth (b), the recognition result in LaTeX (c). Recognition results missed fraction characters compared to ground truth sequences 117
4.14 Examples of incorrect ME recognition. Input ME image (a), the ground truth (b), the recognition result in LaTeX (c). Recognition results lack characters (in blue) compared to ground truth sequences 118
4.15 Examples of correct detection (a) and recognition of inline (b) and isolated (c) MEs in a sample page 118
4.16 Examples of errors in the detection and recognition of MEs in a sample page. Recognition results are in black; ground truth information is in red. Errors in the detection cause errors in the recognition of MEs 118
4.17 Percentage of error types in ME recognition 119
4.18 Comparison of the recognition of an ME using various systems. The input ME image (a) is recognized by various systems. Wrong recognition results are in red 119
INTRODUCTION

In this chapter, the introduction of mathematical expression (ME) detection and recognition is first presented. The scope and goals of the thesis are also provided. Then, the main contributions of the thesis are summarized. Finally, the structure of the thesis is described so that it can be easily followed.
0.1 Motivation
Up to now, a huge number of scientific documents have been produced. Scientific documents have provided valuable information for the research community. The documents need to be digitized to allow users to retrieve information efficiently. Therefore, the digitization of documents has attracted more and more attention from researchers over the years. Documents have mainly been published in two formats: raster (image) and vector (PDF). The PDF format has been developed since the 1990s [1]. Recently, most documents have been published in the PDF format; however, a large number of documents are still available in raster format. Figure 1 shows examples of printed historic documents [2]. It is obvious that PDF processing techniques cannot be applied to such raster document images; it is necessary to apply image processing for the digitization of the document images. The key steps of document digitization are document analysis, optical character recognition and content searching [3]. The digitization of rich text documents has been considered a solved problem. However, the digitization of scientific documents that contain rich MEs is a non-trivial task [4, 5]. In particular, the detection and recognition of MEs in document images have been considered challenging tasks. Scientific documents usually consist of heterogeneous components: tables, figures, texts and MEs. In scientific documents, MEs may be mixed with various components, and the sizes and styles of MEs may vary frequently. Therefore, improving the accuracy of the detection and recognition of MEs is an important step in the digitization of scientific documents. Inspired by the above ideas, the thesis mainly aims to improve the accuracy of the detection and recognition of MEs in scientific document images.
0.2 Hypotheses
In reality, there are various kinds of documents, and techniques for the detection and recognition of MEs highly depend on the nature of the documents. The hypotheses on the document images that are considered in the thesis are as follows:
(1) In fact, MEs have been commonly used in different fields such as Mathematics, Physics and Engineering. More than 500 mathematical symbols have been used in various fields [6]. The type styles of MEs frequently vary in each scientific field. The thesis focuses on the detection and recognition of MEs in scientific document images. The thesis aims to detect MEs in the body of documents; the detection of MEs contained in other document components such as tables and figures is investigated in other problems (table or figure detection). Moreover, the size of an ME should not exceed the size of the whole document.
(2) Scientific documents can be generated in various ways: camera-captured images, scanned format, or PDF conversion. Moreover, the detection accuracy highly depends on the quality of the documents. In reality, document images can be skewed or of low resolution. Like conventional methods in document analysis, the thesis focuses on the detection and recognition of MEs in scientific document images that are scanned at high resolution and are not skewed. For low-resolution and skewed documents, some existing methods are applied to improve the quality of the detection.

0.3 Objectives of the thesis
The thesis mainly aims to solve the following tasks:
• Firstly, the thesis extensively analyzes a wide range of existing approaches for ME detection in scientific document images. Then, the thesis investigates and proposes novel methods to improve the detection accuracy of MEs.
• After enhancing the detection accuracy of MEs, the thesis investigates and proposes a framework to improve the accuracy of the recognition of MEs in scientific document images.

0.4 Introduction of the ME detection and recognition
0.4.1 Introduction of MEs
Mathematical notations have been used and are well known in human life. Compared to natural languages, MEs always represent specific and exact knowledge. Formally, MEs can be defined as follows:

Definition 1 [7]: A mathematical expression is a mathematical phrase that combines numbers and/or variables using mathematical operations. In an expression, variables and/or numbers may appear alone or combined with operators.

Definition 2 [8]: In mathematics, an expression or mathematical expression is a finite combination of symbols that is well-formed according to rules that depend on the context. Mathematical symbols can designate numbers (constants), variables, operations, functions, brackets, punctuation, and grouping to help determine order of operations, and other aspects of logical syntax.
Figure 1: Examples of printed historic documents [2].
Compared to standard texts, the main characteristics of MEs can be described as follows [9]:
(1) Standard texts are normally represented in a linear layout, while MEs are represented in a non-linear spatial layout. MEs normally consist of superscript, subscript, fraction or grid symbols.
(2) MEs consist of rich notations. 26 symbols are often used for the representation of standard texts; however, hundreds of characters are used for the representation of MEs.
(3) The order of symbols plays an important role in the meaning of MEs. For instance, the meaning of the expression "a + (b ÷ c)" is far different from that of "a ÷ (b + c)".
Based on their nature, MEs can exist in raster (image), vector (PDF) or handwritten format. Figure 2 illustrates MEs in handwritten (a), image (b) and PDF (c) formats.
Printed MEs are normally represented in images. Handwritten MEs can be further divided into online and offline categories. Online handwritten MEs are produced in electronic editors, while offline MEs are captured in images. Offline handwritten MEs lack the temporal information available for online handwritten MEs. In general, the detection and recognition of printed MEs have obtained higher accuracy than those of handwritten MEs.
For printed MEs in images, MEs are typically represented by connected components. A connected component in an image can be defined as a set of adjacent pixels that are connected through 4-pixel or 8-pixel connectivity. For online handwritten MEs in electronic editors, MEs are typically represented by strokes. A stroke is the sequence of points that a user touches and lifts from a surface (e.g. an electronic editor). For instance,
Figure 2: Examples of MEs in handwritten (a), image (b) and PDF (c) format.
in Figure 2(a), the symbol "3" can be drawn with one stroke, while the symbol "+" may contain two strokes. In Figure 2(b), the symbol "D" contains one connected component, while the symbol "=" is composed of two connected components.
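The connected-component notion used above can be sketched with a small breadth-first labelling routine; this is an illustrative implementation, not the one used in the thesis:

```python
from collections import deque

# Label connected components of a binary image (1 = ink pixel)
# using BFS with 4- or 8-pixel connectivity.
def label_components(img, connectivity=8):
    h, w = len(img), len(img[0])
    offsets4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    offsets8 = offsets4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    offsets = offsets8 if connectivity == 8 else offsets4
    labels = [[0] * w for _ in range(h)]
    current = 0
    for y in range(h):
        for x in range(w):
            if img[y][x] == 1 and labels[y][x] == 0:
                current += 1            # a new component starts here
                queue = deque([(y, x)])
                labels[y][x] = current
                while queue:
                    cy, cx = queue.popleft()
                    for dy, dx in offsets:
                        ny, nx = cy + dy, cx + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and img[ny][nx] == 1 and labels[ny][nx] == 0:
                            labels[ny][nx] = current
                            queue.append((ny, nx))
    return current, labels

# An "=" sign: two horizontal bars, disjoint under any connectivity.
equals = [
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
]
n, _ = label_components(equals, connectivity=8)
print(n)  # 2 components, matching the "=" example above
```

Note that a diagonal pair of pixels forms one component under 8-connectivity but two components under 4-connectivity, which is why the choice of connectivity matters for symbol extraction.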
0.4.2 Introduction of ME detection
The key steps of the document digitization are the detection and recognition of expressions. Figure 3 demonstrates the detection and recognition of expressions in documents. The detection of expressions aims to locate the expressions in document images. In scientific documents, MEs are classified into two categories: isolated (displayed) and inline (embedded) expressions. Isolated expressions are displayed in separate lines, while inline expressions are mixed with other components in document pages, e.g. texts and figures. The detection of MEs in documents is considered as a first step in the document digitization. The detection of MEs typically consists of three main steps: page segmentation, detection of MEs and post-processing. Figure 4 illustrates some examples of isolated and inline expressions marked in red and blue, respectively.
Formally, the ME detection task can be defined as follows [10]:
• Let I be an input document image.
• Let G be the ground truth of bounding boxes of MEs in the image. Each bounding box g ∈ G is represented by the coordinates of the top-left corner and the width and height of the box (x, y, w, h).
• Let D be a set of bounding boxes of MEs that are detected in the image. Each bounding box d ∈ D is represented by the coordinates of the top-left corner and the width and height of the box (x, y, w, h).
• The goal is to find the injective (or one-to-one) function f : D → G that maps each detected mathematical expression to a ground truth one with the highest overlap ratio between the two MEs.
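The overlap-based mapping above can be sketched as follows; the IoU threshold of 0.5 is an illustrative choice, not a value prescribed by the definition:

```python
# Compute the overlap (intersection over union) between boxes in
# (x, y, w, h) form, and greedily build a one-to-one mapping from
# detections D to ground-truth boxes G by descending overlap.
def iou(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def match(detections, ground_truth, threshold=0.5):
    pairs = sorted(
        ((iou(d, g), i, j)
         for i, d in enumerate(detections)
         for j, g in enumerate(ground_truth)),
        reverse=True)
    mapping, used_d, used_g = {}, set(), set()
    for score, i, j in pairs:
        if score >= threshold and i not in used_d and j not in used_g:
            mapping[i] = j          # detection i matched to ground truth j
            used_d.add(i)
            used_g.add(j)
    return mapping

dets = [(10, 10, 50, 20), (200, 40, 30, 10)]
gts = [(12, 10, 50, 20)]
print(match(dets, gts))  # {0: 0}: only the first detection overlaps enough
```

The greedy pass keeps the mapping injective, as the definition requires: once a ground-truth box is claimed, later lower-overlap detections cannot reuse it.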
Figure 3: Example of the detection (a) and a detected ME in a document image (b). Isolated and inline MEs are denoted in red and blue, respectively. The extracted ME is recognized and represented using LaTeX (c).
In the literature, the accuracy of the detection of isolated expressions has been gradually improved. However, the detection of inline expressions still suffers from low accuracy [11]. There are many challenges in the detection of inline expressions, including the variety of mathematical symbols and the complex layout of mathematical structures. In practice, inline expressions may consist of subscripts and superscripts associated with mathematical symbols or variables. As shown in Figure 3, inline expressions consist of mathematical operators (e.g. Σ, ∫, β, +, -, ×, ÷), functions (e.g. cos, sin, log) and variables (e.g. i, j). The accuracy of the detection of inline expressions can also be affected by punctuation marks and noise.
Based on their nature, the input documents of the detection can be divided into three categories: stroke (e.g. handwritten), vector (e.g. PDF) and raster (e.g. image) formats.
Figure 4: Examples of the isolated and inline expressions in a sample document page, marked in red and blue bounding boxes, respectively.
The applied techniques for ME detection highly depend on the category of the input document. Among these document formats, PDF is the format where the detection is considered the most precise [1, 5, 12]. The attributes and content extracted from PDF documents are more proper and there is less noise than in other formats. These factors are useful for the detection of MEs. The detection of expressions in handwritten documents is considered the most challenging task. The thesis presents approaches for the detection of MEs in document images, which has attracted the most research in recent years.
Formally, the ME recognition can be defined as follows [13]:
• Let I and S be an input image of a ME and a ground truth sequence of characters (in LaTeX). I is an image with height H and width W. S is the sequence consisting of T tokens that represents the ME in image I.
• The ME recognition task is to find f so that f(I) → S. In practice, we find the sequence prediction function f′ that approximates f.
• For the evaluation of the recognition, we apply f′(I) → S′. The performance evaluation is performed by measuring the similarity between S and S′.
The main challenges of the recognition of MEs can be listed as follows [14]:
(1) Accurate recognition of a large number of mathematical symbols is a difficult task. For instance, some characters and symbols, such as i, j, and =, are composed of multiple components. In more complicated situations, some symbols (e.g. √a) consist of other symbols.
(2) Some symbols in MEs may play different roles in different contexts. For instance, a dot (.) in an expression can be a decimal point or a multiplication operator depending on the position of the dot and its neighboring symbols.
(3) Operator symbols can be explicit or implicit. When consecutive operator symbols exist in an expression, we can apply operator precedence rules to group the symbols into units. However, when those operator symbols are not lined up, we have to use the concept of operator dominance. For instance, the expression f(x + y) can have two different interpretations: it can be considered as the variable f multiplied by the expression (x + y), or the function f applied to the value (x + y).
(4) In addition, mathematical notation has many dialects. Similar to natural languages, it is impossible to design a system that can recognize all dialects. As a result, recognition systems are developed based on a subset of mathematical notation only.
An example of ME detection and recognition is illustrated in Figure 3. The detection and recognition of MEs in document images are closely related: accurate detection enables accurate recognition, while incorrect detection may cause errors in the recognition of MEs.
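One common way to measure the similarity between the predicted token sequence and the ground truth, as required by the evaluation step of the recognition definition, is a normalized edit distance over LaTeX tokens (an illustrative sketch; the token lists are invented):

```python
# Levenshtein distance between two token sequences, computed with a
# rolling one-row dynamic-programming table.
def edit_distance(s, t):
    m, n = len(s), len(t)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (s[i - 1] != t[j - 1]))      # substitution
            prev = cur
    return dp[n]

# Normalized similarity in [0, 1]: 1.0 means a perfect prediction.
def similarity(s, t):
    if not s and not t:
        return 1.0
    return 1.0 - edit_distance(s, t) / max(len(s), len(t))

truth = ["\\frac", "{", "a", "}", "{", "b", "}"]
pred  = ["\\frac", "{", "a", "}", "{", "c", "}"]
print(round(similarity(truth, pred), 3))  # one substitution out of 7 tokens
```

Operating on whole LaTeX tokens rather than raw characters avoids penalizing long commands such as `\frac` more heavily than single-character symbols.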
0.5 Contributions of this thesis
The main scientific contributions of the thesis are threefold:
(1) First, a hybrid method of two stages has been proposed for the effective detection of MEs. At the first stage, layout analysis of entire document images is introduced to improve the accuracy of text line and word segmentation. At the second stage, both isolated and inline expressions in document images are detected. Both handcrafted and deep learning features are extensively investigated and combined to improve the detection accuracy. In the handcrafted feature extraction approach, the Fast Fourier Transform (FFT) is applied to text line images for the detection of isolated MEs, and the Gaussian parameters of the projection profile are used as features for the detection of inline MEs. After the feature extraction, various machine learning classifiers have been fine-tuned for the detection. In the deep learning approach, CNNs (Alexnet and ResNet) have been optimized for the detection of MEs. A fusion of handcrafted and deep learning features based on the prediction scores has been applied. The merit of the method is that it can operate directly on the ME images without the employment of character recognition.
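To make the handcrafted branch concrete, the sketch below (an assumption-laden toy, not the thesis' exact feature design) computes a vertical projection profile of a binarized text-line image and keeps the magnitudes of the first DFT coefficients as a fixed-size feature vector:

```python
import cmath

# Vertical projection profile: number of ink pixels in each column.
def projection_profile(img):
    return [sum(col) for col in zip(*img)]

# Magnitudes of the first few Fourier coefficients of the profile,
# normalized by its length, giving a fixed-size descriptor.
def fft_features(profile, n_coeffs=4):
    n = len(profile)
    feats = []
    for k in range(n_coeffs):
        coeff = sum(p * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, p in enumerate(profile))
        feats.append(abs(coeff) / n)
    return feats

line = [
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
]
feats = fft_features(projection_profile(line))
print(len(feats))  # 4 fixed-size features regardless of line width
```

A fixed-length descriptor of this kind can be fed directly to classifiers such as SVM or kNN, regardless of how wide the original text line is.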
(2) Second, an end-to-end framework for mathematical expression detection in scientific document images is proposed without using any Optical Character Recognition (OCR) or document analysis techniques as in conventional methods. The distance transform is firstly applied to input document images in order to take advantage of the distinguishing spatial layout features of MEs. Then, the transformed images are fed into the Faster Region-based Convolutional Neural Network (Faster R-CNN), which has been optimized to improve the accuracy of the detection. Specifically, optimization and generation strategies for the anchor boxes of the Region Proposal Network have been proposed to improve the detection accuracy for expressions of various sizes.
(3) Finally, the detection and recognition of MEs have been integrated into a system. The MEs in document images are detected and recognized, and the recognition results are represented in LaTeX format [15]. The application aims to support end users in using the detection and recognition of MEs in document images conveniently.
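As an illustration of the distance transform step, the toy sketch below computes a city-block (4-neighbour) distance transform by multi-source BFS; the actual transform used in the framework may differ in metric and implementation:

```python
from collections import deque

# For every pixel, compute the city-block distance to the nearest
# ink pixel (value 1) via multi-source BFS from all ink pixels.
def distance_transform(img):
    h, w = len(img), len(img[0])
    INF = h * w
    dist = [[0 if img[y][x] else INF for x in range(w)] for y in range(h)]
    queue = deque((y, x) for y in range(h) for x in range(w) if img[y][x])
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] > dist[y][x] + 1:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    return dist

img = [
    [1, 0, 0],
    [0, 0, 0],
]
print(distance_transform(img))  # [[0, 1, 2], [1, 2, 3]]
```

The transformed image encodes how far each background pixel lies from ink, which makes the wide white margins around isolated expressions stand out as a spatial feature.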
0.6 Structure of this thesis
The structure of the rest of the thesis is described as follows:
The chapter Introduction firstly presents the basic information and definitions of ME detection and recognition. Then, the scope of the thesis is presented. The main contributions of the thesis are also summarized in the chapter.
In chapter 1, significant works related to the detection and recognition of MEs are reviewed. Related works are analyzed to point out the existing solutions for the detection and recognition of MEs. Then, the limitations of related works are emphasized. Based on the current limitations, the contributions of the thesis are proposed.
Chapter 2 presents the ME detection using the fusion technique of handcrafted and deep learning features. First, a page segmentation technique is applied to document images to obtain text lines and words. Then, a feature extraction based on the FFT is proposed for the detection of isolated expressions. A feature extraction based on the Gaussian parameters of the projection profile (PP) of images has been proposed for the detection of inline expressions. Various classifiers including Support Vector Machine (SVM), k-Nearest Neighbor (kNN) and Random Forest (RF) are fine-tuned to improve the accuracy of the detection. For the deep learning method, the detection of MEs is performed using transfer learning of CNNs, namely Alexnet and Resnet-18. Finally, a fusion strategy of handcrafted and deep learning features is proposed to improve the accuracy of the detection.
Chapter 3 presents the ME detection using the combination of the Distance Transform (DT) of images and Faster R-CNN. The framework allows to achieve high accuracy of detection in an end-to-end way. Compared with the multistep detection frameworks mentioned in chapter 2, much human effort is reduced by using the end-to-end framework.
Chapter 4 presents the system of ME detection and recognition. So far, many methods have been proposed to solve the recognition of isolated expressions. However, few systems are able to recognize MEs in document images. Therefore, the ME detection and recognition system is developed and presented in the chapter.
The chapter Conclusion gives the conclusions and future works of the thesis. Firstly, the proposed solutions for ME detection and recognition are summarized. Then, the limitations of the ME detection and recognition are analyzed. Finally, the strategies for improving the quality of ME detection and recognition in the future are pointed out.
Chapter 1 LITERATURE REVIEW
In this chapter, significant works on the detection and recognition of MEs in document images are analysed. Besides, the advantages and disadvantages of each method are summarized. By comparing various conventional methods, the current limitations of the detection and recognition of MEs are emphasized. The structure of the literature review is illustrated in Figure 1.1. In the next chapters, contributions of the thesis are proposed to overcome the current limitations that are pointed out in this chapter.
Figure 1.1: Structure of the literature review of the detection and recognition of MEs in document images.
1.1 Document analysis

Traditional approaches for ME detection in document images normally consist of two steps [5, 16]: document analysis and ME detection. The first step focuses on obtaining text lines and words of text paragraphs, whereas the second one focuses on the separation of MEs and normal texts. Document layout analysis can be defined as the task of segmenting a given document into semantically meaningful regions [17]. Page segmentation, which is a well-researched topic of document analysis, aims to specify regions in documents and classify them into physical components such as tables, figures and texts. In recent years, page segmentation has been an active research topic and has attracted more and more research [18]. Firstly, image preprocessing (noise removal and skew correction) is performed. Then, each component (e.g. text, figure, or table) is separated based on its structural layout. Traditional page segmentation techniques can be divided into four types: top-down, bottom-up, multi-scale resolution and hybrid methods [19]. Top-down methods split the page image into smaller components [20, 21]: a page is split into blocks, text lines and words. In general, top-down methods are useful in the segmentation of rectangular layouts; however, they are not very effective for documents with complex structures. Bottom-up methods analyze and merge local pixels in order to form larger components such as characters, words, text lines and paragraphs [22, 23]. Compared with top-down methods, bottom-up methods show higher performance in page segmentation; however, they have high computational complexity. The multi-scale resolution methods analyze page structure based on features at different resolution levels of the document image [24, 25]. Then, the features are used for text and non-text classification. Finally, text regions are split into text lines by using a set of rules on the number and intensity of pixels. The difficulty of these methods is the estimation of distance parameters between components in a document page. Hybrid methods combine the bottom-up and top-down techniques. They are effective for the segmentation of documents with complex structures [19, 26]. Connected components and delimiters (white space, tab stops) in a document page are extracted, filtered and analyzed. After that, various heuristic strategies are applied to reduce page segmentation errors. For the purpose of mathematical expression detection, text regions in the body of the document are the focus of analysis. A text region is segmented into text lines, which are the basic units for displayed expression detection. Segmented words from a text line are the basic units for inline expression detection.
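As a minimal illustration of the top-down idea, a text region can be split into lines with a horizontal projection profile, treating blank rows as separators (an illustrative sketch, not any particular cited method):

```python
# Split a binarized region into text lines: rows with no ink pixels
# separate consecutive lines.  Returns (top_row, bottom_row) pairs.
def split_lines(img):
    row_ink = [sum(row) for row in img]   # ink pixels per row
    lines, start = [], None
    for y, ink in enumerate(row_ink):
        if ink and start is None:
            start = y                      # a line begins at the first inked row
        elif not ink and start is not None:
            lines.append((start, y - 1))   # a blank row ends the line
            start = None
    if start is not None:
        lines.append((start, len(img) - 1))
    return lines

region = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],   # blank row separates the two text lines
    [1, 1, 1, 1],
]
print(split_lines(region))  # [(0, 1), (3, 3)]
```

Real pages need more care (skew, touching lines, noise), which is exactly where the rule-based and learning-based refinements discussed below come in.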
The text line segmentation usually obtains high accuracy [27, 28] for standard textual documents because there is not much variation in text styles. In contrast, many errors exist in the text line segmentation for scientific documents. One of the typical errors is that a large mathematical expression is split into many lines. Figure 1.2 shows examples of incorrect text line segmentation of large MEs; in the figure, large MEs are split into two lines. Therefore, additional techniques (e.g. rule-based, learning-based methods) are integrated to improve the accuracy of text line segmentation [29, 30]. The basic idea of the techniques is that all text lines are firstly split, then consecutive text lines are merged to form the entire expression if they belong to components of the ME. The text lines are merged if the vertical distance between them is smaller than a predefined threshold. Similarly, consecutive words are merged in order to form the entire expression if they belong to the expression. The words are merged if the horizontal distance between them is smaller than a predefined threshold.
Figure 1.2: Examples of incorrect text line segmentation of large MEs. Large MEs are split into multiple text lines.
In recent years, deep learning approaches have been utilized for page segmentation. The advantage of these approaches is that the page segmentation task is performed without prior knowledge of the document structure. The work in [31] has proposed a simple CNN with one layer to perform page segmentation. The input of the CNN is a handwritten gray scale document image. The page segmentation task is considered as pixel labeling. Therefore, one fully connected layer of 100 neurons follows the convolution layer. The last layer consists of a logistic regression with softmax which outputs the probability of each pixel belonging to each class. To train the CNN, document images have been split into patches of size 28x28. Much human effort has been made for training the CNN because 3000 pixels for each document image have been manually labeled.
The work [17] has employed a DNN based on Resnet-50 [32] to segment historical document pages. The overall method consists of two steps. At the first step, a fully convolutional network has been applied to get features of each pixel and output prediction scores. At the second step, the prediction scores are mapped to desired regions. The method has shown high performance in the segmentation of simple layout documents; however, it is not effective for complex layout documents.
Table 1.1: Results of document analysis of participating methods in the 2019 competition

Method                                                      Accuracy   System
Probabilistic Homogeneity [33]                              95.9 %     DSPH
Text / non-text classification; text segmentation [19]      95.1 %     MHS
Connected component analysis and morphology [34]            91.30 %    BINYAS
Segmentation using CNN, post-processing based on polygons   83.30 %    JBM
Connected Component Analysis [28]                           78 %       Tesseract
In recent years, the international document analysis competition has been held to promote research in the field. Competition results of participating methods in the 2019 document analysis competition are shown in Table 1.1. Participating methods have been tested on the Prima dataset [35]. The dataset consists of 85 complex layout images. Moreover, the ground truth, stored in XML files, has been provided for the evaluation. Among the competing methods, those using the analysis of connected components and text/non-text classification obtained high accuracy. Methods using CNNs obtained slightly lower accuracy. The reason may be that the datasets for training CNN models are not large enough.
1.2 ME detection methods in document images
ME detection has been researched for decades and various approaches have been proposed. The approaches can be divided into three categories: rule-based, handcrafted feature extraction and DNN methods.
1.2.1 Rule-based detection
Early research on ME detection was performed using different rules. The rules are normally motivated by the different layout and morphology of MEs compared with text. Many heuristic rules and predefined thresholds have been proposed for the detection. In general, research in the early period of ME detection was tested on small private datasets. The methods can detect MEs in some specific cases, but many errors exist in the detection of MEs in complex layout documents.
The work in [36] has tried to separate the connected components of document images into two categories: math and text. By checking the OCR results, each connected component is assigned to math or text. However, the work only detects mathematical notations; it cannot detect entire expressions in document images.
The research in [37] analyses all text lines and words from left to right to obtain primitive tokens. After that, each token is determined to belong to an inline expression or not by checking predefined expression forms.
The research in [38] has proposed a model for automatic mathematical text reading. Input document images are segmented into ME and textual components. In the segmentation phase, images are segmented top-down into text blocks, text lines and words. After the segmentation, rules on the sizes of MEs are applied to detect isolated MEs. For inline ME detection, a set of special characters is defined. The limitation of the work is that the performance evaluation has been carried out on a very small document dataset; the private dataset consists of thirteen pages. In the work, several rules are designed to detect MEs that consist of Greek characters.
The research in [39] has applied a segmentation technique for ME detection. Input documents are segmented by analyzing the connected components in images. Information of connected components, including position and size, is extracted. Then, the extracted connected components are grouped to form text lines. Obtained text lines are determined as isolated MEs by checking some morphology conditions. Two conditions are checked: the first one is the aspect ratio of a text line; the second one is the left and right margins of the text line. For inline ME detection, the topographical information of connected components is analysed. Moreover, several rules on the relationships between mathematical symbols are proposed. A small dataset is prepared to validate the detection algorithms. Documents are scanned by a Scanjet scanner at 300 dpi. The error rate of the detection is high and too many parameters have to be set for the detection.
The method reported in [40] employs the results of two commercial optical character recognition (OCR) systems to extract inline formulas. First, existing OCR systems are applied to obtain the content of document images. Then, sentences containing inline expressions are determined by computing word n-grams. For each sentence, several handcrafted features of a word are extracted to determine whether the word is a part of inline expressions. If some consecutive words in a sentence are determined as inline expressions, these words can be grouped to obtain an inline expression. It is obvious that the above features highly depend on the results of existing OCR systems.
1.2.2 Handcrafted feature extraction methods for ME detection
The handcrafted feature extraction methods have designed a set of features for ME detection. Table 1.2 summarizes features that have been designed for isolated ME detection [41, 42, 16], while Table 1.3 summarizes features that have been designed for inline ME detection. After the feature extraction, various machine learning classifiers such as k-Nearest Neighbor (k-NN) and Support Vector Machine (SVM) have been fine-tuned to detect MEs.
Table 1.2: Summary of significant handcrafted features for isolated ME detection

Feature of text line                     Description
Density [16]                             The density of black pixels in a text line
Height/Width of text line [38]           Ratio of the height of a text line to the document
Left and right indent [16, 39]           The distance between the text line and the document borders
Variation of centers of characters [16]  Variation of the centers of characters in the text line
Below and above space [42]               Space between the text line and adjacent text lines
Table 1.3: Summary of significant handcrafted features for inline ME detection

Feature of word                          Description
Special symbol [40]                      Whether the word contains special symbols or not
Density [16]                             The density of black pixels in a word
Height/Width of word [16]                Ratio of the height of a word to the whole document
Variation of centers of characters [16]  Variation of the centers of characters in a word
Space between characters [42]            Inner space between characters in a word
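Two of the word-level features in Table 1.3 can be sketched as follows (illustrative code, not from the cited works); character boxes are assumed to be (x, y, w, h) tuples:

```python
from statistics import pstdev

# Black-pixel density of a binarized word image.
def density(word_img):
    total = sum(sum(row) for row in word_img)
    return total / (len(word_img) * len(word_img[0]))

# Variation (population std. dev.) of the vertical centers of the
# character boxes in a word; sub/superscripts raise this value.
def center_variation(char_boxes):
    centers = [y + h / 2 for _, y, _, h in char_boxes]
    return pstdev(centers)

word = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
print(density(word))  # 0.5

# Two baseline characters plus one raised (superscript-like) box.
boxes = [(0, 10, 5, 10), (6, 10, 5, 10), (12, 6, 4, 6)]
print(round(center_variation(boxes), 2))  # superscript raises the variation
```

Intuitively, plain text has near-zero center variation, while an inline expression such as x² produces a nonzero value, which is what makes this feature discriminative.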
The research in [43] has analysed connected components for ME detection. Firstly, a large number of geometric features of connected components are extracted. After the extraction, quadratic classifiers [44] are applied to detect MEs. The method can only detect isolated MEs; inline MEs are not handled in the work. The method in [42] integrated mathematical expression detection into the OCRopus [27] system. After obtaining text lines using a Column Projection (CP) algorithm supported by OCRopus, the method extracts five layout features of text lines to detect isolated expressions. The Support Vector Machine (SVM) classifier is applied with the features for the detection.
The method in [16] attempts to detect both isolated and inline expressions from document images. The method firstly applies the low-cost text line segmentation technique [45] for heterogeneous document images. Then, layout features of text lines and words are investigated to detect isolated and inline expressions. The features of words are effective in the detection of special symbols but not accurate in the detection of inline expressions. The SVM is applied with the features to detect inline expressions. The precision and recall of the detection of inline MEs are reported at 80% and 48%, respectively. The method has been tested on a private dataset that consists of mathematical documents from Harvard books. In the dataset, 100 and 96 pages are selected for training and testing, respectively.

1.2.3 Deep neural network for ME detection
1.2.3.1 Deep neural networks
Recently, DNNs have shown outstanding performance in object detection [46]. Thus, several DNNs have been applied in the ME detection task. Deep learning uses representation learning, also known as feature learning, to map input features (analogous to predictor variables in traditional statistics) to an output [47]. This mapping process occurs inside multiple connected layers which each contain multiple neurons. Each neuron is a mathematical processing unit which, combined with all other neurons, is designed to learn the relationship between the input features and the output. The first step in developing a DNN is to determine the type of problem that needs to be solved. Examples of problem types include clustering, regression, classification, prediction, optimization, usage of sensors and motor controls in robotics, and vision. If, for example, the purpose is to predict mortality, one is dealing with a classification problem, while aiming to predict a future event represents a prediction problem. Particularly, DNNs have dramatically improved many complex artificial intelligence tasks: object detection and recognition, speech recognition and language translation. Each of these types of problems requires a DNN designed specifically to solve it. Table 1.4 shows milestones in the development of DNNs [48].
Table 1.4: Milestones in the development of DNNs

Year Contribution
1980 The Neocognitron neural network is introduced; it is the inspiration for the CNN [49]
1986 RNN is introduced [50]
1990 LeNet is introduced that has shown the ability of deep neural networks in practice [51]
1997 LSTM is introduced [52]
2012 Alexnet is introduced that performed the classification task with high accuracy [53]
2016 Resnet is introduced that is deeper than previous CNNs [32]
Table 1.5: Parameters of Alexnet (layer, kernel size, input/output sizes, number of parameters)
Table 1.6: Parameters of Resnet-18 (layer, kernel size, input/output sizes, number of parameters)
Figure 1.4: Residual block of Resnet
The activation function maps the output of each neuron to a range such as [0; 1] or [-1; 1]. Popular activation functions used in CNNs are: tanh, Rectified Linear Unit (ReLU), Sigmoid, Leaky ReLU, Maxout and Exponential Linear Unit (ELU). Equations and graphs of these functions are shown in Figure 1.5.
Figure 1.5: Popular activation functions of CNNs [54]
Pooling layer is added after convolutional layers to reduce the spatial size of therepresentation to reduce the amount of parameters and computation in the network.Pooling layer operates on each feature map independently Two common functions
Trang 39used in the pooling operation are:
• Average Pooling: calculates the average value of each patch of the feature map.
• Maximum Pooling (or Max Pooling): calculates the maximum value of each patch of the feature map.
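The two pooling operations above can be sketched in a few lines of NumPy, assuming a single 2-D feature map and non-overlapping windows (stride equal to the window size):

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling with a size x size window.

    Each window of the 2-D feature map is reduced to its maximum
    (max pooling) or its mean (average pooling).
    """
    h, w = feature_map.shape
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.empty((h // size, w // size))
    for i in range(h // size):
        for j in range(w // size):
            patch = feature_map[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = reduce_fn(patch)
    return out

fm = np.array([[1., 2., 5., 6.],
               [3., 4., 7., 8.],
               [0., 0., 1., 1.],
               [0., 4., 1., 1.]])
pool2d(fm, 2, "max")  # → [[4., 8.], [4., 1.]]
pool2d(fm, 2, "avg")  # → [[2.5, 6.5], [1., 1.]]
```

Either way, a 4x4 map shrinks to 2x2, which is the parameter and computation saving the text describes.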
The fully connected layer is an essential component of CNNs that has proven very successful in recognizing and classifying images. The layer takes the output of the convolution/pooling layers and predicts the label of the image.
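A minimal sketch of that final step: the pooled feature maps are flattened into a vector, passed through a learned linear map, and turned into class probabilities by a softmax. The names `fully_connected_predict`, `W`, and `b` (and the random stand-in weights) are illustrative, not from the source:

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def fully_connected_predict(pooled, W, b):
    """Flatten pooled feature maps and predict class probabilities."""
    x = pooled.ravel()           # flatten conv/pool output to a vector
    return softmax(W @ x + b)    # linear layer followed by softmax

rng = np.random.default_rng(0)
pooled = rng.standard_normal((4, 4))        # stand-in pooling output
W = rng.standard_normal((3, 16))            # 3 classes, 16 inputs
b = np.zeros(3)
probs = fully_connected_predict(pooled, W, b)
probs.sum()  # ≈ 1.0 (a probability distribution over 3 classes)
```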
Several CNN architectures have been successful in object detection, such as Alexnet [53], VGG [55], GoogLeNet [56], and Resnet [32]. In this thesis, two CNNs, namely Alexnet [53] and Resnet [32], have been applied and optimised for ME detection, as the two models are popular and powerful in object detection. The architectures of Alexnet and Resnet are shown in Figures 1.4 and 1.5, respectively. Parameters of Alexnet and Resnet-18 are shown in Tables 1.5 and 1.6, respectively.
Figure 1.6: Architecture of the R-CNN
Based on the CNN architectures, various end-to-end object detection models have recently been proposed. The Regions with CNN features (R-CNN) model [57] detects objects in three main steps. Firstly, the framework generates candidate regions (called region proposals) for objects; around 2,000 regions per image are generated by employing a region proposal algorithm [58]. Then, 4,096 features are extracted for each region, with Alexnet [53] used as the backbone of the framework. Finally, an SVM classifier is applied for the classification of objects. Figure 1.6 shows the architecture of R-CNN. The huge number of region proposals and the high computational cost are drawbacks of the framework.
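The three R-CNN stages can be summarised as a pipeline sketch. Everything here is a hypothetical stand-in: `propose_regions` for the region proposal algorithm, `cnn_backbone` for the Alexnet feature extractor, and `svm_classifiers` for the per-class SVMs:

```python
import numpy as np

def rcnn_detect(image, propose_regions, cnn_backbone, svm_classifiers):
    """Sketch of the three R-CNN stages (not the authors' code)."""
    detections = []
    for box in propose_regions(image):       # step 1: ~2,000 proposals
        x0, y0, x1, y1 = box
        crop = image[y0:y1, x0:x1]           # crop the proposed region
        features = cnn_backbone(crop)        # step 2: CNN features
        scores = {cls: svm(features)         # step 3: per-class SVM
                  for cls, svm in svm_classifiers.items()}
        best = max(scores, key=scores.get)
        detections.append((box, best, scores[best]))
    return detections

# Toy run with trivial stand-ins for each stage:
img = np.ones((8, 8))
dets = rcnn_detect(
    img,
    propose_regions=lambda im: [(0, 0, 4, 4)],
    cnn_backbone=lambda crop: np.array([crop.sum()]),
    svm_classifiers={"text": lambda f: f[0], "math": lambda f: -f[0]},
)
```

The sketch also makes the drawback visible: the backbone runs once per proposal, so a real image costs roughly 2,000 forward passes.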
Figure 1.7: Architecture of the Fast R-CNN
Fast R-CNN [59] has been proposed to reduce the computational cost and training time of R-CNN. Firstly, the framework generates region proposals by employing a region proposal algorithm, as in R-CNN. Then, the feature maps of each region are extracted using CNNs. The features of each region and the features of the whole image are fed into a Region of Interest (RoI) pooling layer. The fixed-size features obtained by the RoI pooling layer are fed into a fully connected layer. Finally, the features are fed into two output layers: a Softmax layer and a Regression layer. The Softmax layer performs the classification of objects, and the Regression layer refines the object position information. Compared with R-CNN, the training time of Fast R-CNN is much reduced. VGG16 is used as the backbone of the framework. Figure 1.7 shows the architecture
of the Fast R-CNN.
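The key ingredient is the RoI pooling layer: a region of arbitrary size is divided into a fixed grid and each cell is max-pooled, so every region yields a feature block of the same size for the fully connected layer. A minimal NumPy sketch (grid cells are assumed non-empty for simplicity):

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=2):
    """Sketch of RoI max pooling.

    The region roi = (x0, y0, x1, y1) of the 2-D feature map is
    divided into an out_size x out_size grid; each cell is reduced
    to its maximum, giving a fixed-size output for any region size.
    """
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    ys = np.linspace(0, h, out_size + 1).astype(int)  # row bin edges
    xs = np.linspace(0, w, out_size + 1).astype(int)  # column bin edges
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = region[ys[i]:ys[i+1], xs[j]:xs[j+1]].max()
    return out

fm = np.arange(36.0).reshape(6, 6)
roi_pool(fm, (0, 0, 6, 6))  # 6x6 region → fixed 2x2 output
roi_pool(fm, (1, 1, 6, 4))  # 5x3 region → also a 2x2 output
```

Because every region is reduced to the same 2x2 block, the fully connected layers downstream can have a fixed input size regardless of the proposal's shape.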
Faster R-CNN [60] consists of two sub-networks: the Region Proposal Network (RPN) and the Detection