MINISTRY OF EDUCATION AND TRAINING
UNIVERSITY OF SCIENCE AND TECHNOLOGY
BUI HAI PHONG
ENHANCING PERFORMANCE OF MATHEMATICAL
EXPRESSION DETECTION IN SCIENTIFIC
This study was completed at:
Hanoi University of Science and Technology
Supervisors:
1. Assoc. Prof. Hoang Manh Thang
2. Assoc. Prof. Le Thi Lan
Reviewer 1:
Reviewer 2:
Reviewer 3:
This dissertation will be defended before the approval committee
at Hanoi University of Science and Technology:
Time , date month year 2021
This dissertation can be found at:
1. Ta Quang Buu Library - Hanoi University of Science and Technology
2. Vietnam National Library
be mixed with various components and sizes, styles of MEs may frequently vary. Therefore, improving the accuracy of the detection and recognition of MEs is an important step in the digitization of scientific documents. Inspired by the above ideas, the thesis mainly aims to improve the accuracy of detection and recognition of MEs in scientific document images.
Introduction of ME detection and recognition in document images
In mathematics, an expression or mathematical expression is a finite combination of symbols that is well-formed according to rules that depend on the context [5]. In scientific documents, MEs are classified into two categories, i.e. isolated (displayed) and inline (embedded) expressions. Isolated expressions are displayed on separate lines, whereas inline expressions are mixed with other components in document pages, e.g. text and figures.
The detection of expressions aims to locate MEs in document images. Meanwhile, the recognition of MEs aims at converting expressions from image format to a string (a representation in LaTeX). An example of ME detection and recognition is illustrated in Figure 1. The detection and recognition of MEs in document images are closely related: accurate detection is a precondition for accurate recognition, whereas incorrect detection may cause errors in the recognition of MEs.
The hypotheses of the thesis are as follows: (1) The thesis focuses on the detection and recognition of MEs in scientific document images that have been written in a formal way. The thesis aims to detect MEs in the body of documents; the detection of MEs contained in other document components such as tables and figures is investigated in other problems (table or figure detection). Moreover, the size of an ME should not exceed the size of the whole document. (2) Scientific documents can be generated in various ways: camera-captured images, handwritten documents, scanned format or PDF conversion. Moreover, the detection accuracy highly depends on the quality of the documents. Like conventional methods in document analysis, the thesis focuses on the detection of MEs in document images that are scanned at high resolution and are non-skew. (3) The detection of MEs is represented by bounding boxes. The detected MEs are then recognized and represented in LaTeX format [4].
Figure 1 Example of the detection (a) and a detected ME in a document image (b). Isolated and inline MEs are denoted in red and blue, respectively. The extracted ME is recognized and represented using LaTeX (c).
The main challenges of the recognition of MEs can be described as follows: (1) Accurate recognition of a large number of mathematical symbols is a difficult task. (2) Some symbols in MEs may play different roles in different contexts. (3) Operator symbols can be explicit or implicit. When consecutive operator symbols exist in an expression, operator precedence rules can be applied to group the symbols into units. (4) In addition, mathematical notation has many dialects. As with natural languages, it is impossible to design a system that can recognize all dialects. As a result, our systems are developed based on a subset of the mathematical notation only.
Contributions
The main scientific contributions of the thesis are threefold: (1) First, a two-stage hybrid method has been proposed for the effective detection of MEs. Both handcrafted and deep learning features are extensively investigated and combined to improve the detection accuracy. The merit of the method is that it can operate directly on the ME images without employing character recognition. (2) Second, an end-to-end framework for mathematical expression detection in scientific document images is proposed without using any Optical Character Recognition (OCR) or document analysis techniques as in conventional methods. The distance transform is first applied to input document images in order to take advantage of the distinctive spatial layout features of MEs. Then, the transformed images are fed into the Faster Region-based Convolutional Neural Network (Faster R-CNN), which has been optimized to improve the accuracy of the detection. (3) Finally, the detection and recognition of MEs have been integrated into one system. The MEs in document images are detected and recognized, and the recognition results are represented in LaTeX.
Thesis structure
The Introduction first presents the basic information and definitions of ME detection and recognition. Then, the scope of the thesis is presented. The main contributions of the thesis are also summarized in that chapter. In Chapter 1, significant works related to the detection and recognition of MEs are reviewed. Based on the current limitations, the contributions of the thesis are proposed. Chapter 2 presents ME detection using the fusion of handcrafted and deep learning features. Chapter 3 presents ME detection using the combination of the Distance Transform (DT) of images and Faster R-CNN. The framework achieves high detection accuracy in an end-to-end way. Chapter 4 presents the system for ME detection and recognition. The Conclusion summarizes the thesis and outlines future work.
Table 1.1 Summary of significant handcrafted features for isolated ME detection
Feature of text line                        Description
Density [14]                                The density of black pixels in a text line
Height/Width of text line [19]              Ratio of the height of a text line to the document
Left and right indent [14, 20]              Left and right indent of text lines
Variation of centers of characters [14]     Variation of centers of characters in the text line
Below and above space [23]                  Space between the text line and adjacent text lines
1.2 ME detection methods in document images
Various approaches for ME detection have been proposed. The approaches can be divided into three categories: rule-based, handcrafted feature extraction and deep neural network (DNN) methods.
1.2.1 Rule-based detection
Early research in ME detection relied on different rules. The rules are normally derived from the differences in layout and morphology between MEs and text. Many heuristic rules and predefined thresholds have been proposed for the detection. In general, research in the early period of ME detection was tested on small private datasets. The methods can detect MEs in some specific cases, but many errors occur in the detection of MEs in documents with complex layouts.
1.2.2 Handcrafted feature extraction methods for the ME detection
The handcrafted feature extraction methods rely on a designed set of features for ME detection. Table 1.1 summarizes features that have been designed for isolated ME detection [9, 23, 14]. Meanwhile, Table 1.2 summarizes features that have been designed for inline ME detection. After the feature extraction, various machine learning classifiers such as k-Nearest Neighbor (k-NN) and Support Vector Machine (SVM) have been fine-tuned to detect MEs.
Table 1.2 Summary of significant handcrafted features for inline ME detection
Feature of word                             Description
Special symbol [15]                         Whether the word contains special symbols
Density [14]                                The density of black pixels in a word
Height/Width of word [14]                   Ratio of the height of a word to the whole document
Variation of centers of characters [14]     Variation of centers of characters in a word
Space between characters [23]               Inner space between characters in a word
1.2.3 Deep neural network for ME detection
In recent years, DNNs have shown outstanding performance in the recognition and detection of mathematical expressions. The work in [21] takes advantage of CNNs for the detection of isolated and inline expressions in document images. A CNN architecture based on U-Net is used for detecting mathematical expressions, and post-processing is performed to obtain accurate expressions. For the CNN, training on diverse datasets can improve the detection accuracy. Moreover, the accuracy of the detection depends on the size of the image blocks used in training. The precision and recall achieved by the method are 95.2% and 91%, respectively. The limitation of the method is that, although mathematical symbols are detected with high accuracy, the layout analysis of symbols needed to construct complete expressions has not been solved. The work in [22] applied the SSD-512 and YOLO v3 neural networks for ME detection.
1.3 ME recognition
1.3.1 Traditional approaches for ME recognition
Expression recognition has been researched since the 1960s. In the literature, various approaches have been proposed for ME recognition based on three steps: symbol segmentation, symbol classification (recognition) and structure analysis. The survey [1] presents various traditional mathematical expression recognition approaches organized along these three steps. Techniques for the segmentation of symbols are based on the analysis of connected components or of the projection profile of images. The existing segmentation techniques have difficulties with complex (e.g. fraction, sum function) or touching (e.g. exponential function) symbols. The recognition of symbols has been developed using handcrafted feature extraction and various classifiers.
In summary, traditional approaches for the recognition of MEs have extensively investigated three stages: symbol segmentation, symbol recognition and structure analysis. The main drawbacks of such methods are as follows: (1) The accuracy of the recognition of MEs is still low. Errors in segmentation, recognition and structural analysis may cause errors in the final recognition. (2) Much human effort is required for each stage. In particular, the computation in the structural analysis of symbols is complex. (3) Recognition algorithms have been designed for specific ME datasets. The algorithms are difficult to evaluate across datasets.
1.3.2 Neural network approaches for ME recognition
The work in [24] applied the combination of a CNN and an RNN to recognize isolated MEs in an end-to-end way. The CNN was trained on isolated images captured by cameras. Recently, a combined CNN and RNN model [25] has been designed for handwritten expression recognition. So far, much research has investigated the recognition of segmented MEs, whereas the recognition of MEs that are embedded in document images still needs to be investigated. The work in [3] proposed a neural network based on scale paired adversarial learning for ME recognition. In recent years, several DNNs based on the encoder-decoder architecture have shown outstanding performance in the ME recognition task [25, 17, 6]. These DNNs have been designed to address the challenges of recognizing complicated two-dimensional structures
datasets have been used for performance evaluation of ME detection. The Marmot [13] and GTDB [21] datasets have been recently published.
Table 1.3 Statistics of the Marmot and GTDB datasets
                                    GTDB                  Marmot
                                    Training   Testing    Training   Testing
Number of pages                     569        236        330        70
Number of isolated expressions      4218       2488       1322       253
Number of inline expressions        22178      9397       6951       956
Number of text fonts                30                    18
Average number of MEs per page      47.55                 23.70
The Marmot dataset consists of 400 non-skew scientific document pages with 1575 isolated and 7907 inline expressions. The resolution of each page image is around 500 dpi. The GTDB dataset has recently been used for performance evaluation in [22]. The dataset consists of diverse font and mathematical symbol styles. The training and testing datasets are described in Table 1.3.
1.4.2 Evaluation metrics
To evaluate the performance of ME detection, the Precision (P), Recall (R) and F1 score have been used. Precision is the proportion of true positives among all detections reported as positive; Recall is the proportion of true positives among all ground-truth expressions; and the F1 score is the harmonic mean of precision and recall. The Precision and Recall metrics are widely used; however, in order to obtain an in-depth analysis of ME detection, the Intersection over Union (IoU) metric has also been applied in the thesis.
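The metrics above can be illustrated with a minimal Python sketch. It is not the evaluation code of the thesis: the box format (x1, y1, x2, y2), the greedy one-to-one matching, the 0.5 IoU threshold and the function names are all assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(detections, ground_truth, iou_threshold=0.5):
    """Greedy matching (an assumption): each detection is matched to the
    unmatched ground-truth box with the highest IoU, and counts as a true
    positive when that IoU reaches the threshold."""
    matched = set()
    true_positives = 0
    for det in detections:
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue
            score = iou(det, gt)
            if score > best_iou:
                best_iou, best_idx = score, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            true_positives += 1
    precision = true_positives / len(detections) if detections else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1
```

Raising the hypothetical iou_threshold makes the evaluation stricter: a detection that only partially covers an expression then stops counting as a true positive.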
The Word error rate (WER) and Expression recognition rate (ExpRate) evaluation metrics [25] are used to evaluate the accuracy of ME recognition. ExpRate is the proportion of MEs whose recognized LaTeX strings match the ground truth. Meanwhile, WER measures the number of edit operations (deletion, substitution or insertion) needed to turn the recognized strings into the correct strings.
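A minimal Python sketch of these two recognition metrics follows. It assumes that LaTeX strings are tokenized on whitespace and that WER is normalized by the total number of reference tokens; both choices, and the function names, are illustrative assumptions rather than the exact protocol of [25].

```python
def edit_distance(ref_tokens, hyp_tokens):
    """Levenshtein distance: minimum number of deletions, substitutions
    and insertions that turn hyp_tokens into ref_tokens."""
    prev = list(range(len(hyp_tokens) + 1))
    for i, ref_tok in enumerate(ref_tokens, 1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp_tokens, 1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Total edit operations divided by total reference tokens."""
    edits = sum(edit_distance(r.split(), h.split())
                for r, h in zip(references, hypotheses))
    tokens = sum(len(r.split()) for r in references)
    return edits / tokens if tokens else 0.0


def expression_rate(references, hypotheses):
    """Fraction of expressions whose token sequence matches exactly."""
    exact = sum(r.split() == h.split()
                for r, h in zip(references, hypotheses))
    return exact / len(references) if references else 0.0
```

For example, a recognized string that differs from the ground truth in a single token contributes one edit operation to WER but counts as a complete miss for ExpRate.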
CHAPTER 2
Detection of MEs using the late fusion of handcrafted and
deep learning features
2.1 Introduction
Scientific document images usually consist of heterogeneous components (e.g., figures, tables, text and MEs). Conventional approaches have applied page segmentation and handcrafted feature extraction techniques to ME detection in document images. Conventional methods have faced many difficulties in the detection of inline MEs. Therefore, in this chapter, a two-stage hybrid method is proposed for the effective detection of mathematical expressions. First, the layout analysis of entire document images is introduced to improve the accuracy of text line and word segmentation. Then, both isolated and inline expressions in document images are detected. Both handcrafted and deep learning features are extensively investigated and combined to improve the detection accuracy. The proposed system for ME detection is illustrated in Figure 2.1. The proposed system takes a binary document image as input and outputs an image with the position information of the detected MEs. Like other document analysis and expression detection methods, the input of the proposed method is a non-skew document image. The algorithm can handle camera-captured and scanned document images. After the pre-processing, the document is analyzed to obtain text lines for isolated expression detection. Text lines that are not isolated expressions are segmented into words for inline expression detection. After the segmentation, the late fusion of handcrafted and deep learning features is applied in the isolated and inline expression detection modules. Finally, post-processing is performed in order to obtain the accurate position information of MEs in document images.
Figure 2.1 Overall description of the proposed system for mathematical expression detection
technique is useful for the analysis of scanned documents.
2.3 Handcrafted feature extraction for ME detection
Figure 2.2 The flowchart of the isolated and inline expression detection by using handcrafted feature extraction
The flowchart of the isolated and inline expression classification is described in Figure 2.2. In the handcrafted feature extraction approach, feature extraction and classifiers are applied to improve the accuracy of the classification of both isolated and inline expressions.
2.3.1 Handcrafted feature extraction for isolated ME detection
For isolated expression detection, a text line image is represented in the frequency domain by using the Fourier transformation. Given an image a of size M × N and its Discrete Fourier Transform (DFT) A(Ω, ψ), the DFT [7] is defined as follows:
A(\Omega, \psi) = \sum_{m=1}^{M} \sum_{n=1}^{N} a(m, n)\, e^{-j(\Omega m + \psi n)} \quad (2.1)
The Fast Fourier Transform (FFT) [11] is used to transform input document images to the frequency domain efficiently. For the detection of isolated MEs, the FFT phase and magnitude are used as the features. After that, SVM, kNN, Decision tree and RF are optimized as the classifiers. These are popular machine learning models for binary classification.
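As a rough illustration of the magnitude and phase features, the sketch below evaluates the double sum of Eq. (2.1) directly at the sampled frequencies Ω = 2πk/M and ψ = 2πl/N. It is a naive stand-in for the FFT used in the thesis and is only practical for tiny images. The zero-based pixel indices (which leave the magnitude unchanged and only shift the phase relative to one-based indexing), the flattening of the spectrum into one vector and the function names are assumptions.

```python
import cmath
import math


def dft2(image):
    """Naive 2-D DFT of a small image given as a list of rows."""
    rows, cols = len(image), len(image[0])
    spectrum = []
    for k in range(rows):
        omega = 2 * math.pi * k / rows
        out_row = []
        for l in range(cols):
            psi = 2 * math.pi * l / cols
            total = 0j
            for m in range(rows):
                for n in range(cols):
                    total += image[m][n] * cmath.exp(-1j * (omega * m + psi * n))
            out_row.append(total)
        spectrum.append(out_row)
    return spectrum


def fft_features(image):
    """Concatenate the magnitudes and phases of all DFT coefficients."""
    spectrum = dft2(image)
    magnitude = [abs(v) for row in spectrum for v in row]
    phase = [cmath.phase(v) for row in spectrum for v in row]
    return magnitude + phase
```

In practice an FFT routine from a numerical library would replace dft2, since it computes the same coefficients far more efficiently; the resulting feature vector can then be passed to any of the classifiers listed above.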
2.3.2 Handcrafted feature extraction for inline ME detection
To determine whether a word extracted from a text line is an inline ME (variable, operator, function) or a textual word, a binary classification method is proposed. A key step in the classification is to extract the dominant features of the observed words. The important feature that is used to discriminate inline MEs from textual words is the italic font style in the word images. In scientific documents, inline MEs are typically typeset in italic font.
For feature extraction, the vertical projection profile (VPP) and horizontal projection profile (HPP) of each variable or textual word image are first computed. Then, the peaks and valleys (troughs) of the VPP and HPP are determined. Mathematically, peaks and valleys are local maxima and minima, respectively. The feature extraction method is based on the Gaussian distribution of the peaks and valleys. The feature vector is formed from features of the projection profiles of variable and textual word images. For each image, these features are described as follows:
(1) The number of peaks in the VPP and HPP
(2) The mean (average) of values of peaks in the VPP and HPP
(3) The standard deviation of values of peaks in the VPP and HPP
(4) The number of valleys in the VPP and HPP
(5) The mean (average) of values of valleys in the VPP and HPP
(6) The standard deviation of values of valleys in the VPP and HPP
For an image of size m × n, the complexity of the feature extraction is O(m) and O(n) for VPP and HPP, respectively, because the feature extraction is performed by finding and analyzing the peaks and valleys in the VPP and HPP of the image.
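The six projection-profile features listed above can be sketched in Python as follows. This is an illustrative reading rather than the thesis implementation: the binary image format (1 for ink pixels), the strict definition of local extrema (plateaus are ignored), the use of the population standard deviation and the ordering of the output vector are all assumptions.

```python
from statistics import mean, pstdev


def projection_profiles(image):
    """Return (VPP, HPP) of a binary image given as a list of rows."""
    hpp = [sum(row) for row in image]         # ink count per row
    vpp = [sum(col) for col in zip(*image)]   # ink count per column
    return vpp, hpp


def local_extrema(profile):
    """Values of strict local maxima (peaks) and minima (valleys)."""
    peaks, valleys = [], []
    for i in range(1, len(profile) - 1):
        left, mid, right = profile[i - 1], profile[i], profile[i + 1]
        if mid > left and mid > right:
            peaks.append(mid)
        elif mid < left and mid < right:
            valleys.append(mid)
    return peaks, valleys


def summarize(values):
    """Count, mean and standard deviation; zeros when nothing was found."""
    if not values:
        return [0, 0.0, 0.0]
    return [len(values), mean(values), pstdev(values)]


def profile_features(image):
    """12 features: (count, mean, std) of peaks then valleys, VPP then HPP."""
    features = []
    for profile in projection_profiles(image):
        peaks, valleys = local_extrema(profile)
        features += summarize(peaks) + summarize(valleys)
    return features
```

Once the profiles are available, each one is scanned a single time, so the extrema analysis is linear in the profile length.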
After the feature extraction process, different machine learning algorithms are used to discriminate variables from textual words. In this section, different classification models are applied: SVM, kNN, Decision tree and RF. For the classifiers, tuned parameters play an important role in achieving high performance. Therefore, different parameters of each classifier are considered in order to determine the optimal values for the classification.
2.4 Deep learning method for ME detection
To improve the accuracy of the detection of both isolated and inline expressions, transfer learning with AlexNet and ResNet-18 [18], two popular neural networks, is employed. Compared with AlexNet, the architecture of ResNet-18 is deeper, and ResNet-18 normally shows better results in the classification task.
Figure 2.3 illustrates the flowchart of the transfer learning of CNNs for the isolated and inline expression detection module. The dominant features are automatically extracted by the network without any domain-specific knowledge. Then, the classification is performed by the softmax layer of the network.
Figure 2.3 The flowchart of the isolated and inline expression classification by using the transfer learning of CNNs
2.5 Fusion of handcrafted and deep learning features for ME
Figure 2.4 The flowchart of the late fusion of handcrafted and deep learning features in the classification of isolated and inline MEs.
2.6 Post-processing for ME detection
In the detection of MEs, it is not rare for large isolated expressions to be split into several text lines. Existing strategies have relied on the results of character recognition to determine the conditions for merging successive text lines into one expression. Figure 2.5 demonstrates an example of the post-processing.