Thesis summary: Enhancing performance of mathematical expression detection in scientific document images



MINISTRY OF EDUCATION AND TRAINING

UNIVERSITY OF SCIENCE AND TECHNOLOGY

BUI HAI PHONG

ENHANCING PERFORMANCE OF MATHEMATICAL

EXPRESSION DETECTION IN SCIENTIFIC


This study is completed at:

Hanoi University of Science and Technology

Supervisors:

1 Assoc Prof Hoang Manh Thang

2 Assoc Prof Le Thi Lan

Reviewer 1:

Reviewer 2:

Reviewer 3:

This dissertation will be defended before the approval committee

at Hanoi University of Science and Technology:

Time , date month year 2021

This dissertation can be found at:

1 Ta Quang Buu Library - Hanoi University of Science and Technology

2 Vietnam National Library


be mixed with various components, and the sizes and styles of MEs may vary frequently. Therefore, improving the accuracy of the detection and recognition of MEs is an important step in the digitization of scientific documents. Inspired by the above ideas, the thesis mainly aims to improve the accuracy of detection and recognition of MEs in scientific document images.

Introduction of ME detection and recognition in document images

In mathematics, an expression or mathematical expression (ME) is a finite combination of symbols that is well-formed according to rules that depend on the context [5]. In scientific documents, MEs are classified into two categories: isolated (displayed) and inline (embedded) expressions. Isolated expressions are displayed on separate lines, whereas inline expressions are mixed with other components in document pages, e.g., texts and figures.

The detection of expressions aims to locate MEs in document images, while the recognition of MEs aims to convert expressions from image format to strings (represented in LaTeX). An example of ME detection and recognition is illustrated in Figure 1. The detection and recognition of MEs in document images are closely related: accurate detection is a prerequisite for accurate recognition, whereas incorrect detection may cause errors in the recognition of MEs.

The hypotheses of the thesis are as follows: (1) The thesis focuses on the detection and recognition of MEs in scientific document images that have been written in a formal way. The thesis aims to detect MEs in the body of documents; the detection of MEs contained in other document components such as tables and figures is investigated in other problems (table or figure detection). Moreover, the size of MEs should not exceed the size of the whole document. (2) Scientific documents can be generated in various ways: camera-captured images, handwritten documents, scanned format or PDF conversion. Moreover, the detection accuracy highly depends on the quality of the documents. Like conventional methods in document analysis, the thesis focuses on the detection of MEs in document images that are scanned at high resolution and are non-skew. (3) The detection of MEs is represented by bounding boxes. The detected MEs are then recognized and represented in LaTeX format [4].

Figure 1 Example of the detection (a) and a detected ME in a document image (b). Isolated and inline MEs are denoted in red and blue, respectively. The extracted ME is recognized and represented using LaTeX (c).

The main challenges of the recognition of MEs can be described as follows: (1) Accurate recognition of a large number of mathematical symbols is a difficult task. (2) Some symbols in MEs may play different roles in different contexts. (3) Operator symbols can be explicit or implicit. When consecutive operator symbols exist in an expression, operator precedence rules can be applied to group the symbols into units. (4) In addition, mathematical notation has many dialects. As with natural languages, it is impossible to design a system that can recognize all dialects. As a result, our systems are developed based on a subset of the mathematical notation only.

Contributions

The main scientific contributions of the thesis are threefold: (1) First, a two-stage hybrid method has been proposed for the effective detection of MEs. Both handcrafted and deep learning features are extensively investigated and combined to improve the detection accuracy. The merit of the method is that it can operate directly on the ME images without employing character recognition. (2) Second, an end-to-end framework for mathematical expression detection in scientific document images is proposed without using any Optical Character Recognition (OCR) or document analysis techniques as in conventional methods. The distance transform is first applied to input document images in order to take advantage of the distinctive features of the spatial layout of MEs. Then, the transformed images are fed into the Faster Region-based Convolutional Neural Network (Faster R-CNN), which has been optimized to improve the accuracy of the detection. (3) Finally, the detection and recognition of MEs have been integrated into a system. The MEs in document images are detected and recognized, and the recognition results are represented in LaTeX.
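As a rough illustration of the distance transform mentioned in contribution (2), the sketch below computes, for every pixel of a binary page, the distance to the nearest ink pixel. The summary does not state which distance metric or implementation the thesis uses, so the city-block metric, the two-pass scheme and the function name `distance_transform` are assumptions made only for this example.

```python
def distance_transform(binary):
    """City-block distance from every pixel to the nearest ink pixel (value 1).

    `binary` is a list of rows; 1 marks ink, 0 marks background.
    The metric is an assumption: the summary does not specify it.
    """
    rows, cols = len(binary), len(binary[0])
    inf = rows + cols  # larger than any possible city-block distance
    dist = [[0 if binary[r][c] else inf for c in range(cols)] for r in range(rows)]

    # Forward pass: propagate distances from the top and from the left.
    for r in range(rows):
        for c in range(cols):
            if r > 0:
                dist[r][c] = min(dist[r][c], dist[r - 1][c] + 1)
            if c > 0:
                dist[r][c] = min(dist[r][c], dist[r][c - 1] + 1)

    # Backward pass: propagate distances from the bottom and from the right.
    for r in range(rows - 1, -1, -1):
        for c in range(cols - 1, -1, -1):
            if r < rows - 1:
                dist[r][c] = min(dist[r][c], dist[r + 1][c] + 1)
            if c < cols - 1:
                dist[r][c] = min(dist[r][c], dist[r][c + 1] + 1)
    return dist


page = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
transformed = distance_transform(page)
```

Densely printed text yields small distance values, while the wider spacing around expressions yields larger ones, which is one plausible reason a transformed image exposes layout differences to a detector.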

Thesis structure

The Introduction chapter first presents the basic information and definition of ME detection and recognition. Then, the scope of the thesis is presented. The main contributions of the thesis are also summarized in the chapter. In Chapter 1, significant related works on the detection and recognition of MEs are reviewed. Based on the current limitations, the contributions of the thesis are proposed. Chapter 2 presents the ME detection using the fusion technique of handcrafted and deep learning features. Chapter 3 presents the ME detection using the combination of the Distance Transform (DT) of images and Faster R-CNN. The framework achieves high detection accuracy in an end-to-end way. Chapter 4 presents the system of ME detection and recognition. The Conclusion chapter gives the conclusion and future works of the thesis.


Table 1.1 Summary of significant handcrafted features for isolated ME detection

Feature of text line | Description
Density [14] | The density of black pixels in a text line
Height/Width of text line [19] | Ratio of height of a text line to the document
Left and right indent [14, 20] | Left and right indent of text lines
Variation of centers of characters [14] | Variation of centers of characters in the text line
Below and above space [23] | Space between the text line and adjacent text lines

1.2 ME detection methods in document images

Various approaches for the ME detection have been proposed. The approaches can be divided into three categories: rule-based, handcrafted feature extraction and DNN methods.

1.2.1 Rule-based detection

Early research in ME detection was performed using different rules. The rules are normally derived from the differences in layout and morphology between MEs and text. Many heuristic rules and predefined thresholds have been proposed for the detection. In general, research in the early period of ME detection was tested on small private datasets. The methods can detect MEs in some specific cases, but many errors occur in the detection of MEs in documents with complex layouts.

1.2.2 Handcrafted feature extraction methods for the ME detection

The handcrafted feature extraction methods have designed a set of features for ME detection. Table 1.1 summarizes features that have been designed for isolated ME detection [9, 23, 14]. Meanwhile, Table 1.2 summarizes features that have been designed for inline ME detection. After the feature extraction, various machine learning classifiers such as k-Nearest Neighbor (k-NN) and Support Vector Machine (SVM) have been fine-tuned to detect MEs.

Table 1.2 Summary of significant handcrafted features for inline ME detection

Feature of word | Description
Special symbol [15] | Whether the word contains special symbols or not
Density [14] | The density of black pixels in a word
Height/Width of word [14] | Ratio of height of a word to the whole document
Variation of centers of characters [14] | Variation of centers of characters in a word
Space between characters [23] | Inner space between characters in a word

1.2.3 Deep neural network for ME detection

In recent years, DNNs have shown outstanding performance in the recognition and detection of mathematical expressions. The work in [21] takes advantage of CNNs in the detection of isolated and inline expressions in document images. A CNN architecture based on the U-Net is used for detecting mathematical expressions. Post-processing is then performed to obtain accurate expressions. For the CNN, training on diverse datasets can improve the detection accuracy. Moreover, the accuracy of the detection depends on the size of the image blocks used in the training of the CNN. The achieved precision and recall of the detection of MEs of the method are 95.2% and 91%, respectively. The limitation of the method is that mathematical symbols are detected with high accuracy, but the layout analysis of symbols has not been solved to construct complete expressions. The works in [22] have applied the SSD-512 and YOLO v3 neural networks for ME detection.

1.3 ME recognition

1.3.1 Traditional approaches for ME recognition

Expression recognition has been researched since the 1960s. In the literature, various approaches have been proposed for ME recognition based on three steps: symbol segmentation, symbol recognition (classification) and structure analysis; the survey [1] presents these traditional approaches. Techniques for segmentation of symbols have been performed based on the analysis of connected components or projection profiles of images. The existing segmentation techniques have difficulties when handling complex (e.g., fraction, sum function) or touching (e.g., exponential function) symbols. The recognition of symbols has been developed using handcrafted feature extraction and various classifiers.

In summary, traditional approaches for the recognition of MEs have extensively investigated three stages: symbol segmentation, symbol recognition and structure analysis. The main drawbacks of such methods are as follows: (1) The accuracy of the recognition of MEs is still low. Errors in segmentation, recognition and structural analysis may cause errors in the final recognition. (2) Much human effort is required for each stage. In particular, the computation in structural analysis of symbols is complex. (3) Recognition algorithms have been designed for specific ME datasets. The algorithms are difficult to evaluate across datasets.

1.3.2 Neural network approaches for ME recognition

The work in [24] applied the combination of a CNN and an RNN to recognize isolated MEs in an end-to-end way. The CNN has been trained on isolated images captured by cameras. Recently, a combined CNN and RNN model [25] has been designed for handwritten expression recognition. So far, many studies have investigated the recognition of segmented MEs. The recognition of MEs that are embedded in document images still needs to be investigated. The work in [3] has proposed a neural network based on scale paired adversarial learning for ME recognition. In recent years, several DNNs based on the encoder-decoder architecture have shown outstanding performance in the ME recognition task [25, 17, 6]. These DNNs have been designed to solve the challenges of recognition of complicated two-dimensional structures


datasets have been used for performance evaluation of ME detection. The Marmot [13] and GTDB [21] datasets have been recently published.

Table 1.3 Statistics of the Marmot and GTDB datasets

Datasets | GTDB (Training) | GTDB (Testing) | Marmot (Training) | Marmot (Testing)
Number of pages | 569 | 236 | 330 | 70
Number of isolated expressions | 4218 | 2488 | 1322 | 253
Number of inline expressions | 22178 | 9397 | 6951 | 956
Number of text fonts | 30 (GTDB, overall) | 18 (Marmot, overall)
Average number of MEs per page | 47.55 (GTDB, overall) | 23.70 (Marmot, overall)

The Marmot dataset consists of 400 non-skew scientific document pages with 1575 isolated and 7907 inline expressions. The resolution of each page image is around 500 dpi. The GTDB dataset has recently been used for performance evaluation in [22]. The dataset consists of diverse font and mathematical symbol styles. The training and testing datasets are described in Table 1.3.

1.4.2 Evaluation metrics

To evaluate the performance of the ME detection, the Precision (P), Recall (R) and F1 score metrics have been applied. Precision is the proportion of true positives among all positive predictions; Recall is the proportion of true positives among all ground-truth positives; and the F1 score is the harmonic mean of precision and recall. The Precision and Recall metrics are widely used; however, in order to obtain an in-depth analysis of ME detection, the Intersection over Union (IoU) metric has also been applied in the thesis.
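The detection metrics can be sketched as follows. This is a minimal illustration and not the evaluation protocol of the thesis: the box format `(x1, y1, x2, y2)`, the greedy one-to-one matching and the IoU threshold of 0.5 are all assumptions, since the summary does not state them.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def detection_scores(predicted, ground_truth, iou_threshold=0.5):
    """Precision, recall and F1; a prediction is a true positive when it
    overlaps a not-yet-matched ground-truth box with IoU >= iou_threshold
    (the threshold is an assumed value)."""
    unmatched = list(ground_truth)
    true_positives = 0
    for box in predicted:
        best = max(unmatched, key=lambda gt: iou(box, gt), default=None)
        if best is not None and iou(box, best) >= iou_threshold:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
predicted = [(0, 0, 10, 8), (50, 50, 60, 60), (21, 20, 30, 30)]
precision, recall, f1 = detection_scores(predicted, ground_truth)
```

In this toy case two of the three predictions match a ground-truth box, so precision is 2/3 while recall is 1.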

The Word error rate (WER) and Expression recognition rate (ExpRate) evaluation metrics [25] are used to evaluate the accuracy of the ME recognition. ExpRate measures the proportion of recognized MEs whose LaTeX strings match the ground truth. Meanwhile, WER measures the number of edit actions (deletion, substitution or insertion) that must be performed to obtain the correct strings.
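A small sketch of the two recognition metrics is given below. It assumes that LaTeX strings are tokenized by whitespace and that WER is the edit count divided by the number of reference tokens, which is the usual definition; the summary itself only mentions counting the edit actions, so both choices are assumptions.

```python
def edit_distance(reference, hypothesis):
    """Minimum number of deletions, insertions and substitutions needed to
    turn the hypothesis token list into the reference token list."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            cost = 0 if ref_token == hyp_token else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def word_error_rate(references, hypotheses):
    """Total edits divided by total reference tokens (assumed normalization)."""
    edits = sum(edit_distance(r.split(), h.split())
                for r, h in zip(references, hypotheses))
    total_tokens = sum(len(r.split()) for r in references)
    return edits / total_tokens


def expression_rate(references, hypotheses):
    """Fraction of expressions whose token sequence matches exactly."""
    exact = sum(r.split() == h.split() for r, h in zip(references, hypotheses))
    return exact / len(references)


references = [r"x ^ 2 + y", r"\frac { a } { b }"]
hypotheses = [r"x ^ 2 + y", r"\frac { a } { c }"]
wer = word_error_rate(references, hypotheses)
exp_rate = expression_rate(references, hypotheses)
```

Here one token out of twelve is wrong, while only one of the two expressions is fully correct, which shows why the two metrics are reported together.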

CHAPTER 2

Detection of MEs using the late fusion of handcrafted and deep learning features

2.1 Introduction

Scientific document images usually consist of heterogeneous components (e.g., figures, tables, text and MEs). Conventional approaches have attempted page segmentation and handcrafted feature extraction techniques for the ME detection in document images. Conventional methods have faced many difficulties in the detection of inline MEs. Therefore, in this chapter, a two-stage hybrid method is proposed for the effective detection of mathematical expressions. First, the layout analysis of entire document images is introduced to improve the accuracy of text line and word segmentation. Then, both isolated and inline expressions in document images are detected. Both handcrafted and deep learning features are extensively investigated and combined to improve the detection accuracy. The proposed system for the ME detection is illustrated in Figure 2.1. The proposed system takes a binary document image as input and outputs an image with position information of detected MEs. Like other document analysis and expression detection methods, the input of the proposed method is a non-skew document image. The algorithm can handle camera-captured and scanned document images. After the pre-processing, the document is analyzed to obtain text lines for isolated expression detection. Text lines that are not classified as isolated expressions are segmented into words for inline expression detection. After the segmentation, the late fusion of handcrafted and deep learning features is applied in the isolated and inline expression detection modules. Finally, post-processing is performed in order to obtain the accurate position information of MEs in document images.

Figure 2.1 Overall description of the proposed system for mathematical expression detection


technique is useful for the analysis of scanned documents.

2.3 Handcrafted feature extraction for ME detection

Figure 2.2 The flowchart of the isolated and inline expression detection by using handcrafted feature extraction

The flowchart of the isolated and inline expression classification is described in Figure 2.2. In the handcrafted feature extraction approach, powerful feature extraction and classifiers are applied to improve the accuracy of the classification of both isolated and inline expressions.

2.3.1 Handcrafted feature extraction for isolated ME detection

For isolated expression detection, a text line image is represented in the frequency domain by using the Fourier transform. Given an image a of size M × N and its Discrete Fourier Transform (DFT) A(Ω, ψ), the DFT [7] is defined as follows:

\[ A(\Omega, \psi) = \sum_{m=1}^{M} \sum_{n=1}^{N} a(m, n)\, e^{-j(\Omega m + \psi n)} \tag{2.1} \]

The FFT [11] is used to transform input document images to the frequency domain efficiently. For the detection of isolated MEs, the FFT phase and magnitude are used as the features. After that, SVM, kNN, Decision tree and Random Forest (RF) are optimized as the classifiers. These are popular machine learning models for solving binary classification.
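To make the magnitude and phase features concrete, the sketch below evaluates the DFT of Eq. (2.1) directly on a tiny image and flattens the result into a feature vector. It is an illustration only: sampling the frequencies on the grid Ω = 2πk/M, ψ = 2πl/N, the concatenation order and the function names are assumptions, and a real system would use an FFT routine rather than these nested loops.

```python
import cmath
import math


def dft2(image):
    """Direct evaluation of Eq. (2.1) with 1-based pixel indices m and n."""
    M, N = len(image), len(image[0])
    spectrum = []
    for k_row in range(M):
        omega = 2 * math.pi * k_row / M  # assumed frequency grid
        row = []
        for k_col in range(N):
            psi = 2 * math.pi * k_col / N
            total = 0j
            for m in range(1, M + 1):
                for n in range(1, N + 1):
                    total += image[m - 1][n - 1] * cmath.exp(-1j * (omega * m + psi * n))
            row.append(total)
        spectrum.append(row)
    return spectrum


def fourier_features(image):
    """Concatenate all magnitudes followed by all phases (assumed layout)."""
    spectrum = dft2(image)
    magnitude = [abs(value) for row in spectrum for value in row]
    phase = [cmath.phase(value) for row in spectrum for value in row]
    return magnitude + phase


text_line = [
    [1, 0],
    [0, 1],
]
features = fourier_features(text_line)
```

The resulting vector has 2·M·N entries and could be passed to any of the classifiers listed above; note that the phase of near-zero coefficients is numerically unstable, so a practical pipeline may need to mask or normalize it.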

2.3.2 Handcrafted feature extraction for inline ME detection

To determine whether a word extracted from a text line is an inline ME (variable, operator, function) or a textual word, a binary classification method is proposed. A key step in the classification is to extract the dominant features of observed words. The important feature used to discriminate inline MEs from textual words is the italic font style in images. In scientific documents, inline MEs are typically typeset in italic font.

For feature extraction, the vertical projection profile (VPP) and horizontal projection profile (HPP) of each variable or textual word image are first computed. Then, the peaks and valleys (troughs) of the VPP and HPP are determined. Mathematically, peaks and valleys are local maxima and minima, respectively. The feature extraction method is based on the Gaussian distribution of the peaks and valleys. The feature vector is formed from features of the projection profiles of variable and textual word images. For each image, these features are described as follows:

(1) The number of peaks in the VPP and HPP

(2) The mean (average) of values of peaks in the VPP and HPP

(3) The standard deviation of values of peaks in the VPP and HPP

(4) The number of valleys in the VPP and HPP

(5) The mean (average) of values of valleys in the VPP and HPP

(6) The standard deviation of values of valleys in the VPP and HPP

For an image of size m × n, the complexity of the feature extraction is O(m) and O(n) for VPP and HPP, respectively, because the feature extraction is performed by finding and analyzing the peaks and valleys in the VPP and HPP of the image.
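The six projection-profile features listed above can be sketched as follows for a binary word image. This is a simplified illustration under stated assumptions: the VPP is taken as the ink count per column and the HPP as the ink count per row, peaks and valleys are strict local extrema (plateaus are ignored), and the function names are hypothetical.

```python
import statistics


def projection_profiles(image):
    """Return (VPP, HPP): ink counts per column and per row (assumed convention)."""
    vpp = [sum(column) for column in zip(*image)]
    hpp = [sum(row) for row in image]
    return vpp, hpp


def extrema(profile):
    """Values of strict local maxima (peaks) and minima (valleys)."""
    peaks, valleys = [], []
    for i in range(1, len(profile) - 1):
        left, mid, right = profile[i - 1], profile[i], profile[i + 1]
        if mid > left and mid > right:
            peaks.append(mid)
        elif mid < left and mid < right:
            valleys.append(mid)
    return peaks, valleys


def summarize(values):
    """Count, mean and standard deviation, with zeros for empty input."""
    mean = statistics.fmean(values) if values else 0.0
    std = statistics.pstdev(values) if len(values) > 1 else 0.0
    return [len(values), mean, std]


def profile_features(image):
    """12 values: (peak count, mean, std, valley count, mean, std) for VPP, then HPP."""
    features = []
    for profile in projection_profiles(image):
        peaks, valleys = extrema(profile)
        features += summarize(peaks) + summarize(valleys)
    return features


word = [
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [1, 1, 0, 1, 1],
]
features = profile_features(word)
```

Once the profiles are available, the peak and valley scan is linear in the profile length, which is consistent with the complexity stated above.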

After the feature extraction process, in order to discriminate variables from textual words, different machine learning algorithms are used. In this section, different classification models are applied: SVM, kNN, Decision tree and RF. For the classifiers, tuned parameters play an important role in achieving high performance. Therefore, different parameters of each classifier are considered in order to determine the optimal values for the classification.

2.4 Deep learning method for ME detection

To improve the accuracy of the detection of both isolated and inline expressions, transfer learning with AlexNet and ResNet-18 [18], which are popular neural networks, is employed. Compared with AlexNet, the architecture of ResNet-18 consists of deeper layers, and ResNet-18 normally shows better results in the classification task.

Figure 2.3 illustrates the flowchart of the transfer learning of CNNs for the isolated and inline expression detection module. The dominant features are automatically extracted by the network without any domain-specific knowledge. Then, the classification is performed by the softmax layer of the network.

Figure 2.3 The flowchart of the isolated and inline expression classification by using the transfer learning of CNNs

2.5 Fusion of handcrafted and deep learning features for ME detection


Figure 2.4 The flowchart of the late fusion of handcrafted and deep learning features in the classification of isolated and inline MEs.
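Late fusion in general combines the decision scores of separately trained classifiers rather than their raw features. The sketch below shows one common rule, a weighted average of class probabilities. The summary does not state which fusion rule the thesis uses, so the averaging rule, the `weight` parameter and the function name are assumptions for illustration only.

```python
def late_fusion(p_handcrafted, p_deep, weight=0.5):
    """Weighted average of two class-probability vectors.

    `weight` is the (assumed) trust placed in the handcrafted-feature
    classifier; 1 - weight goes to the deep learning classifier.
    Returns the index of the winning class and the fused scores.
    """
    fused = [weight * a + (1 - weight) * b
             for a, b in zip(p_handcrafted, p_deep)]
    label = max(range(len(fused)), key=fused.__getitem__)
    return label, fused


# Class 0 = ME, class 1 = text (hypothetical ordering and scores).
label, fused = late_fusion([0.4, 0.6], [0.9, 0.1])
```

In this toy case the handcrafted classifier slightly favors "text" while the CNN strongly favors "ME", and the fused decision follows the more confident classifier.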

2.6 Post-processing for ME detection

In the detection of MEs, it is not rare that large isolated expressions are split into several text lines. The merging strategies rely on the results of character recognition to determine the conditions for merging successive text lines into one expression. Figure 2.5 demonstrates an example of the post-processing.
